Deploying PostgreSQL for the California Civic Data Coalition's Django Project

25 November 2014

First, I'd like to introduce the California Civic Data Coalition. They are self described as a loosely coupled team from the Los Angeles Times Data Desk, The Center for Investigative Reporting and Stanford's Computational Journalism Lab.

I am not a part of the Los Angeles Times nor Stanford's computation Journalism Lab. However I am passionate about open data and leveraging my skills to help others. So I decided to start contributing. I started off with a rather small change then decided to sink my teeth into PostgreSQL support.

As a final thought I would encourage anyone and everyone to contribute to open data projects. I learned a lot setting up this feature and there is a lot of room for further contributions.

The aim of this document is to be a reference for others that may be challenged by similar feature development. I'll be covering some of the minute differences between MySQL and PostgreSQL and how I managed those differences. The project is written in Python and Django and started with only MySQL support.

Typically Django has a pretty strong background for supporting multiple databases so I figured that supporting one or another would be trivial, especially coming from MySQL. I was wrong.

What you'll find below is a fairly detailed log of the changes that I made and why I wrote the code that I wrote. Feel free to take a look at the code.


We load our data using some fairly sophisticated Django management commands. Using MySQL's INFILE support we bulk insert each file (while rearranging, cleaning, and renaming columns on the fly). Now PostgreSQL doesn't support INFILE loading but they've got a derivative: COPY.

Here's an example of a MySQL LOAD DATA INFO command.

LOAD DATA LOCAL INFILE '<file name>' INTO TABLE <table>
(@var1)
SET Date=str_to_date(SUBSTR(@var1,3,10),'%m/%d/%Y'), 
Time=SUBSTR(@var1,14,8), 
WindVelocity=SUBSTR(@var1,26,5),
WindDirection=SUBSTR(@var1,33,3),
WindCompass=SUBSTR(@var1,38,3),
WindNorth=SUBSTR(@var1,43,6),
WindEast=SUBSTR(@var1,51,6),
WindSamples=SUBSTR(@var1,61,4);

It's important to note all the @var fanciness. That's what makes this so easy(ish).

The problem is that COPY in Postgres, from my interpretation, is meant for unloading to a text file then loading it back in without really performing any modifications. It doesn't have any of the dexterity shown above. Here's an example of a COPY command.

COPY country FROM '/usr1/proj/bray/sql/country_data';

Obviously, we're not just unloading data and loading it back in. We are transforming dates and substrings, not to mention we are moving data from a completely different system into Postgres. That's where things get tricky. I can't format data on the fly and renaming columns isn't going to work with the support COPY has. I don't have nearly as many tools as MySQL does.

First Attempt: Clean before Insert


So I figured I could write some little python script to clean the the data before inputting it. The problem is some of the files are huge (> 1gb) so I could not do it with anything resembling speed. This attempt failed miserably, it was just too slow.

Second Attempt: A Temporary Table


Here's the process from a 10,000 foot view.

I create an empty table and insert all CSV data into that table. I then, through INSERT INTO and SELECT statements move that data from the temporary table into its true table. In this process I end up having to deal with a lot of unique cases not limited to integer fields being "Y/N", dates formatted in strange and inconsistent ways, and null constraints that are violated by the insert. These are all detailed in the methods like the one below.

def _make_XXXXXXX_case(self, _col):
    ....

I construct the insert and select statements on the fly as there are issues with column ordering, then attempt to insert - returning the errors with detailed reasoning.

Here's how I got there.

I had to use some special case statements in postgres to be able to coerce these types when taking them from the temporary table to our new table.

The second catch was a bit more nuanced. Basically for certain tables, we add certain fields (in Django). I'm not quite sure why these fields have been added but they have. Unfortunately we can't just ignore them because Null is not set the True in the model.

    intr_adr1 = models.CharField(
        max_length=55L, db_column='INTR_ADR1', blank=True
    )

So we have to manage that explicit case as well.

The third catch was boolean fields. In the remote database, these fields are stored as "Y"/"N". Now this shouldn't cause a problem however we insert them as integers in our database. This isn't really covered in the documentation so I was just getting errors where my inserts would fail.

Snooping around in my models showed me this code.

    accept_limit_yn = models.IntegerField(
        db_column='ACCEPT_LIMIT_YN', blank=True, null=True
    )

That's clearly meant to be a yes/no(read:boolean) field. So I had to modify my case statements to handle that. This is an example of accidental complexity - using the correct field type in our model could have allowed for a more graceful management of this information but instead we're adding another case statement to this in the database.

        ,CASE
            WHEN "ACCEPT_LIMIT_YN" = ''
                THEN NULL
            WHEN "ACCEPT_LIMIT_YN" = 'Y'
                THEN 1
            WHEN "ACCEPT_LIMIT_YN" = 'N'
                THEN 0
            WHEN "ACCEPT_LIMIT_YN" IS NOT NULL
                THEN "ACCEPT_LIMIT_YN"::int
        END AS "ACCEPT_LIMIT_YN"

The fourth issue was an interesting one, here's the select code we were using to transfer into the true table.

SELECT ""

        ,CASE
            WHEN "FILING_ID" = ''
                THEN NULL
            WHEN "FILING_ID" = 'Y'
                THEN 1
            WHEN "FILING_ID" = 'N'
                THEN 0
            WHEN "FILING_ID" IS NOT NULL
                THEN "FILING_ID"::int
        END AS "FILING_ID"

        ,CASE
            WHEN "FILING_TYPE" = ''
                THEN NULL
            WHEN "FILING_TYPE" = 'Y'
                THEN 1
            WHEN "FILING_TYPE" = 'N'
                THEN 0
            WHEN "FILING_TYPE" IS NOT NULL
                THEN "FILING_TYPE"::int
        END AS "FILING_TYPE"
FROM temporary_table;

Now SELECT "" obviously isn't going to work. What was happening is that the columns in that model were getting modified with the case statements. We have to add a dummy column to the temporary table and the new table, then remove it after the successful or unsuccessful insert.

When I got to this stage, I started getting some strange errors. MySQL using LOAD DATA seems to automatically truncate fields if they are too long. This made things complicated for me because PostgreSQL does no such thing with COPY. I had to find which fields were causing the issues and resolve them.

When I would grep certain files I'd see that there were plenty of fields causing issues. This is a great example (more can be seen in the Github issue.

The python code:

class FilernameCd(CalAccessBaseModel):
    DATE_FIELDS = ('EFFECT_DT',)
    xref_filer_id = models.CharField(
        max_length=7L,
        db_column='XREF_FILER_ID',
        db_index=True,
        help_text="The external filer id saved in the forms tables"
    )

The relevant file:

$ head -1 filername_cd.csv
"XREF_FILER_ID","FILER_ID","FILER_TYPE","STATUS","EFFECT_DT","NAML","NAMF","NAMT","NAMS","ADR1","ADR2","CITY","ST","ZIP4","PHON","FAX","EMAIL"
$ grep -E "R98\*-\d+" filername_cd.csv | wc -l
     609

The grep data:

"XREF_FILER_ID",....
"R98*-185420615",....

Finally the same data in the MySQL database:

mysql> SELECT XREF_FILER_ID FROM FILERNAME_CD where XREF_FILER_ID like 'R98*-18%';
+---------------+
| XREF_FILER_ID |
+---------------+
| R98*-18       |
| R98*-18       |
| R98*-18       |
....
+---------------+
29 rows in set (0.00 sec)

We can see there the data is getting truncated. It's unfortunate there was no warning that it was but I think that's still a response to how PostgreSQL and MySQL respond to data inserts.

So I'm running some of my final tests, end-to-end and I stumble upon another issue: incorrect data (or data definitions).

Here's where the issue was occurring:

 Loading CvrRegistrationCd
dataerror error,  invalid input syntax for integer: "01/0"
class CvrRegistrationCd(CalAccessBaseModel):
...
    ls_beg_yr = models.IntegerField(
        null=True, db_column='LS_BEG_YR', blank=True
    )
    ls_end_yr = models.IntegerField(db_column='LS_END_YR')
$ head -1 cvr_registration_cd.csv
"FILING_ID"..."RPT_DATE","LS_BEG_YR","LS_END_YR"....
$ grep "01/0" cvr_registration_cd.csv
"938431"..."6/7/2003 12:00:00 AM","01/0","12/3"...

It's one line in the file where the files should be integers but are instead "01/0","12/3". That had to be handled explicitly too.

Conclusions


All this logic took a serious amount of time to be able to complete in an appropriate way. Handling these edge cases is an important thing to do and when you're doing it in a system that will be used by a lot of others you have to be confident in what you're doing.

I hope others come and contribute to this project, it's been really rewarding to work on.