First, I'd like to introduce the California Civic Data Coalition. They describe themselves as a loosely coupled team from the Los Angeles Times Data Desk, the Center for Investigative Reporting and Stanford's Computational Journalism Lab.
I am not part of the Los Angeles Times or Stanford's Computational Journalism Lab. However, I am passionate about open data and about leveraging my skills to help others, so I decided to start contributing. I started off with a rather small change, then decided to sink my teeth into PostgreSQL support.
As a final thought before diving in: I would encourage anyone and everyone to contribute to open data projects. I learned a lot setting up this feature, and there is a lot of room for further contributions.
The aim of this document is to be a reference for others who may be challenged by similar feature development. I'll be covering some of the minute differences between MySQL and PostgreSQL and how I managed them. The project is written in Python and Django and started with only MySQL support.
Django has a pretty strong track record of supporting multiple databases, so I figured that supporting one or another would be trivial, especially coming from MySQL. I was wrong.
What you'll find below is a fairly detailed log of the changes that I made and why I wrote the code that I wrote. Feel free to take a look at the code.
We load our data using some fairly sophisticated Django management commands. Using MySQL's LOAD DATA INFILE support, we bulk insert each file (while rearranging, cleaning, and renaming columns on the fly). PostgreSQL doesn't support INFILE loading, but it has an analogue: COPY.
Here's an example of a MySQL LOAD DATA INFILE command.
LOAD DATA LOCAL INFILE '<file name>' INTO TABLE <table>
(@var1)
SET Date=str_to_date(SUBSTR(@var1,3,10),'%m/%d/%Y'),
Time=SUBSTR(@var1,14,8),
WindVelocity=SUBSTR(@var1,26,5),
WindDirection=SUBSTR(@var1,33,3),
WindCompass=SUBSTR(@var1,38,3),
WindNorth=SUBSTR(@var1,43,6),
WindEast=SUBSTR(@var1,51,6),
WindSamples=SUBSTR(@var1,61,4);
It's important to note all the @var fanciness; that's what makes this so easy(ish).
The problem is that COPY in Postgres, as I interpret it, is meant for dumping a table out to a text file and then loading it back in without really performing any modifications. It doesn't have any of the dexterity shown above. Here's an example of a COPY command.
COPY country FROM '/usr1/proj/bray/sql/country_data';
Obviously, we're not just unloading data and loading it back in. We are transforming dates and substrings, not to mention moving data from a completely different system into Postgres. That's where things get tricky: I can't format data on the fly, and renaming columns isn't going to work with the support COPY has. I don't have nearly as many tools as MySQL gives me.
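For comparison, here's roughly what COPY does let you control when loading a CSV. The table and column names below are placeholders rather than the project's real schema: you can pick a target column list and describe the file format, but every value still lands in its column untouched.

-- COPY can map CSV columns onto table columns and handle quoting,
-- headers, and NULL markers, but it cannot transform values on the way in.
COPY some_table (filing_id, filer_id, effect_dt)
FROM '/path/to/some_file.csv'
WITH (FORMAT csv, HEADER true, NULL '');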
First Attempt: Clean before Insert
So I figured I could write a little Python script to clean the data before loading it. The problem is that some of the files are huge (> 1 GB), so I couldn't do it with anything resembling speed. This attempt failed miserably; it was just too slow.
Second Attempt: A Temporary Table
Here's the process from a 10,000-foot view.
I create an empty table and insert all of the CSV data into it. Then, through INSERT INTO and SELECT statements, I move that data from the temporary table into its true table. In the process I end up having to deal with a lot of unique cases, including integer fields that arrive as "Y"/"N", dates formatted in strange and inconsistent ways, and null constraints that are violated by the insert. These are all detailed in methods like the one below.
def _make_XXXXXXX_case(self, _col):
    ....
I construct the INSERT and SELECT statements on the fly, since there are issues with column ordering, then attempt the insert, returning detailed errors if it fails.
Here's how I got there.
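At a high level, each load ends up looking something like this in SQL. The table names here are placeholders, and the real column lists and CASE statements are generated per model; the COPY here reads from a server-side path purely for the sake of the sketch.

-- 1. Stage the raw CSV in a temporary table of plain text columns.
CREATE TEMPORARY TABLE temporary_table (
    "FILING_ID" text,
    "ACCEPT_LIMIT_YN" text
);

-- 2. Bulk load the file. No transformation happens at this step.
COPY temporary_table FROM '/path/to/data.csv' WITH (FORMAT csv, HEADER true);

-- 3. Move the rows into the real table, coercing each column with a CASE.
INSERT INTO "REAL_TABLE" ("FILING_ID", "ACCEPT_LIMIT_YN")
SELECT
    CASE WHEN "FILING_ID" = '' THEN NULL ELSE "FILING_ID"::int END,
    CASE WHEN "ACCEPT_LIMIT_YN" = 'Y' THEN 1
         WHEN "ACCEPT_LIMIT_YN" = 'N' THEN 0
         ELSE NULL
    END
FROM temporary_table;

-- 4. Throw the staging table away.
DROP TABLE temporary_table;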
The first catch: I had to use some special CASE statements in Postgres to coerce column types (dates, for example) when moving rows from the temporary table into the new table.
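Here's the rough shape of the date handling. The column name comes from one of the models shown later, and the format string is my assumption based on how the source files appear to write dates (e.g. 6/7/2003 12:00:00 AM); the real formats vary from file to file.

-- Treat empty strings as NULL, otherwise parse the source's date string.
,CASE
    WHEN "EFFECT_DT" = ''
    THEN NULL
    ELSE to_timestamp("EFFECT_DT", 'MM/DD/YYYY HH12:MI:SS AM')
END AS "EFFECT_DT"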
The second catch was a bit more nuanced. For certain tables, we add certain fields on the Django side. I'm not quite sure why these fields have been added, but they have. Unfortunately we can't just ignore them, because null is not set to True in the model.
intr_adr1 = models.CharField(
    max_length=55L, db_column='INTR_ADR1', blank=True
)
So we have to manage that explicit case as well.
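The fix is to supply a value for those columns explicitly during the transfer. A minimal sketch, with a placeholder destination table name, assuming an empty string is an acceptable stand-in (which is what blank=True without null=True implies):

-- "INTR_ADR1" exists only in the Django model, not in the source CSV,
-- and the column is NOT NULL, so select a literal empty string for it.
INSERT INTO "REAL_TABLE" ("FILING_ID", "INTR_ADR1")
SELECT
    CASE WHEN "FILING_ID" = '' THEN NULL ELSE "FILING_ID"::int END,
    '' AS "INTR_ADR1"
FROM temporary_table;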
The third catch was boolean fields. In the remote database, these fields are stored as "Y"/"N". That shouldn't cause a problem, except that we insert them as integers in our database. This isn't really covered in the documentation, so I was just getting errors where my inserts would fail.
Snooping around in my models showed me this code.
accept_limit_yn = models.IntegerField(
    db_column='ACCEPT_LIMIT_YN', blank=True, null=True
)
That's clearly meant to be a yes/no (read: boolean) field, so I had to modify my case statements to handle it. This is an example of accidental complexity: using the correct field type in the model could have allowed for more graceful handling of this information, but instead we're adding another case statement in the database.
,CASE
    WHEN "ACCEPT_LIMIT_YN" = ''
    THEN NULL
    WHEN "ACCEPT_LIMIT_YN" = 'Y'
    THEN 1
    WHEN "ACCEPT_LIMIT_YN" = 'N'
    THEN 0
    WHEN "ACCEPT_LIMIT_YN" IS NOT NULL
    THEN "ACCEPT_LIMIT_YN"::int
END AS "ACCEPT_LIMIT_YN"
The fourth issue was an interesting one. Here's the SELECT code we were using to transfer into the true table.
SELECT ""
,CASE
    WHEN "FILING_ID" = ''
    THEN NULL
    WHEN "FILING_ID" = 'Y'
    THEN 1
    WHEN "FILING_ID" = 'N'
    THEN 0
    WHEN "FILING_ID" IS NOT NULL
    THEN "FILING_ID"::int
END AS "FILING_ID"
,CASE
    WHEN "FILING_TYPE" = ''
    THEN NULL
    WHEN "FILING_TYPE" = 'Y'
    THEN 1
    WHEN "FILING_TYPE" = 'N'
    THEN 0
    WHEN "FILING_TYPE" IS NOT NULL
    THEN "FILING_TYPE"::int
END AS "FILING_TYPE"
FROM temporary_table;
Now SELECT "" obviously isn't going to work. What was happening was that every column in that model was being wrapped in one of the generated CASE statements, so the constructed column list started out empty. We have to add a dummy column to the temporary table and the new table, then remove it after the successful or unsuccessful insert.
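Here's roughly what that workaround looks like; the DUMMY column name is arbitrary, and "REAL_TABLE" is again a placeholder.

-- Add a throwaway leading column to both tables so the generated
-- column and SELECT lists always start with something valid.
ALTER TABLE temporary_table ADD COLUMN "DUMMY" text;
ALTER TABLE "REAL_TABLE" ADD COLUMN "DUMMY" text;

-- ... run the INSERT INTO ... SELECT "DUMMY", CASE ... transfer here ...

-- Then drop it again, whether or not the insert succeeded.
ALTER TABLE temporary_table DROP COLUMN "DUMMY";
ALTER TABLE "REAL_TABLE" DROP COLUMN "DUMMY";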
When I got to this stage, I started getting some strange errors. MySQL's LOAD DATA seems to automatically truncate fields that are too long. That complicated things for me, because PostgreSQL does no such thing with COPY. I had to find which fields were causing the issues and resolve them.
When I would grep certain files I'd see that there were plenty of fields causing issues. Here's a great example (more can be seen in the GitHub issue).
The Python code:
class FilernameCd(CalAccessBaseModel):
    DATE_FIELDS = ('EFFECT_DT',)
    xref_filer_id = models.CharField(
        max_length=7L,
        db_column='XREF_FILER_ID',
        db_index=True,
        help_text="The external filer id saved in the forms tables"
    )
The relevant file:
$ head -1 filername_cd.csv
"XREF_FILER_ID","FILER_ID","FILER_TYPE","STATUS","EFFECT_DT","NAML","NAMF","NAMT","NAMS","ADR1","ADR2","CITY","ST","ZIP4","PHON","FAX","EMAIL"
$ grep -E "R98\*-\d+" filername_cd.csv | wc -l
609
The grep data:
"XREF_FILER_ID",....
"R98*-185420615",....
Finally the same data in the MySQL database:
mysql> SELECT XREF_FILER_ID FROM FILERNAME_CD where XREF_FILER_ID like 'R98*-18%';
+---------------+
| XREF_FILER_ID |
+---------------+
| R98*-18       |
| R98*-18       |
| R98*-18       |
....
+---------------+
29 rows in set (0.00 sec)
We can see the data is getting truncated. It's unfortunate there was no warning when it happened, but that comes down to how differently MySQL and PostgreSQL respond to data that doesn't fit its column.
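One way to deal with a field like this (not necessarily how the project ultimately resolved it) is to emulate MySQL's behaviour and truncate explicitly during the transfer, using the max_length from the model:

-- Emulate MySQL's silent truncation: keep only the first 7 characters,
-- matching max_length=7 on xref_filer_id.
,CASE
    WHEN "XREF_FILER_ID" = ''
    THEN NULL
    ELSE LEFT("XREF_FILER_ID", 7)
END AS "XREF_FILER_ID"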
So I'm running some of my final end-to-end tests, and I stumble upon another issue: incorrect data (or incorrect data definitions).
Here's where the issue was occurring:
Loading CvrRegistrationCd
dataerror error, invalid input syntax for integer: "01/0"
class CvrRegistrationCd(CalAccessBaseModel):
    ...
    ls_beg_yr = models.IntegerField(
        null=True, db_column='LS_BEG_YR', blank=True
    )
    ls_end_yr = models.IntegerField(db_column='LS_END_YR')
$ head -1 cvr_registration_cd.csv
"FILING_ID"..."RPT_DATE","LS_BEG_YR","LS_END_YR"....
$ grep "01/0" cvr_registration_cd.csv
"938431"..."6/7/2003 12:00:00 AM","01/0","12/3"...
It's one line in the file where the fields should be integers but are instead "01/0" and "12/3". That had to be handled explicitly too.
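One way to handle values like that (the exact policy is a judgment call) is to cast only when the value actually looks like an integer:

-- Only cast values that are purely digits; anything else, like "01/0",
-- becomes NULL instead of killing the insert.
,CASE
    WHEN "LS_BEG_YR" ~ '^[0-9]+$'
    THEN "LS_BEG_YR"::int
    ELSE NULL
END AS "LS_BEG_YR"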
Conclusions
All this logic took a serious amount of time to complete in an appropriate way. Handling these edge cases is important, and when you're building a system that will be used by a lot of other people, you have to be confident in what you're doing.
I hope others come and contribute to this project; it's been really rewarding to work on.