The Task: Using MySQL, load a flat file of over 450 columns and over 145,000 rows of data into a staging table.
The Conditions: The data may not be altered prior to the staging load, or in the staging load. Only when the data is loaded from staging to the final tables may it be altered. Don’t ask why.
The End Platform: MySQL on a Linux OS base.
Of course, one of the most important considerations was that although I am faster and more knowledgeable with Windows systems, the end result must run on Linux. Thankfully MySQL runs on both platforms, and its command-line client works the same (with the exception of file-path syntax and directory access permissions, but those are solvable issues), so I could test and design on Windows, then port to Linux with a minimum of alteration required.
The data file given to me was, in the words of one of the project leads, 'a horrendous example of a file that is NOT 3NF (3rd Normal Form).' This was an accurate assessment. Initially I saw a very large file that used a poor choice of field delimiter. Using CSVed to easily strip out the first row (the header names), I was able to construct a table creation script, keeping the column names as close to the originals as I could, only replacing spaces with underscores and removing potential reserved-word or special-character conflicts to reduce problems. Notepad and WordPad crashed just attempting to load the main file, and even Word itself was very argumentative. So CSVed was a good thing to have, even though it too had some minor problems with a file of this size, mainly display issues.

Cross your fingers, and try an initial load based on first guesses at the fields. The result was so bad it surpassed MySQL's hardcoded counters for errors and warnings at load. A little digging turned up some major culprits: a novel here and there in the notes fields, date fields that are really strings, and number fields that are not long enough. Now, it's perfectly possible in MySQL to alter a column's type and size through the command line. However, XAMPP (via phpMyAdmin) provides a way to do it graphically and change multiple columns at once, which makes the job much faster. It was a tool of immense value for this task. I could also use it to purge the table of data, to help keep the memory footprint lower while testing and working through the problems.
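To give a feel for the shape of this, here is a minimal sketch of the staging table and first-attempt load. The real table has over 450 columns, and the column names, file path, and delimiter shown here are all invented stand-ins:

```sql
-- Sketch only: the real staging table has 450+ columns and a different
-- delimiter. All names below are hypothetical.
CREATE TABLE staging_import (
    record_id     VARCHAR(50),
    customer_name VARCHAR(100),
    phone_number  VARCHAR(35),   -- sized generously, as discussed later
    notes         TEXT           -- "a novel here and there" overflows VARCHAR
    -- ... hundreds more columns ...
);

LOAD DATA LOCAL INFILE '/path/to/flatfile.txt'
INTO TABLE staging_import
FIELDS TERMINATED BY '|'         -- stand-in for the file's poor delimiter choice
LINES TERMINATED BY '\n'
IGNORE 1 LINES;                  -- skip the header row
```

`IGNORE 1 LINES` is one way to skip the header in place of stripping it from the file first; either approach works.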
A few column setting changes later, I got the errors lower, but not by much. So, using Excel (2007 was the only option, since previous versions cannot process more than 65,536 rows well, if at all) and a file of notes, I looked at every single column and adjusted sizes. Given the column count and contents, I had to be concerned with the maximum size per record imposed by MySQL as well as the maximum field size. The latter I could work around by using a different datatype, akin to BLOB or CLOB in Oracle; the former, though, was something to keep track of. After going through it all, I still had an immense number of errors. So I tried something simple: I added a dummy column to the end of the table, and in a single step the error count dropped to a mere 6,000 errors and warnings. Progress is good.
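The adjustments above amount to statements like the following (column names are invented). Note that MySQL's `TEXT` family, unlike `VARCHAR`/`CHAR`, stores its contents off-row, so switching oversized fields to `TEXT` also relieves pressure on the per-record size limit. One plausible explanation for why the dummy column helped, though the source doesn't say, is a trailing delimiter at the end of each line producing one extra empty field:

```sql
-- Sketch of the kinds of column adjustments described above.
ALTER TABLE staging_import
    MODIFY notes      MEDIUMTEXT,    -- free text too large for VARCHAR
    MODIFY order_date VARCHAR(30),   -- "date" fields that are really strings
    MODIFY amount     VARCHAR(25);   -- numeric fields too short for the data

-- The dummy column added to the end of the table:
ALTER TABLE staging_import
    ADD COLUMN dummy VARCHAR(5);
```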
The problem remained of where all these errors were and what was causing them. The solution was arduous, but the safest: manually extract and test the records from the master file, separating the good from the bad. In small situations this isn't a big deal, but we were talking over 145,000 records. Testing them 300 at a time still took a while. The relatively small batch size was due to the fact that the MySQL warning display defaults to showing only the first 64 errors or warnings, so I had to reduce the number of records per attempt to be able to see, and act on, everything needed to find the truly bad records.

In the end I was able to isolate roughly 220 bad records and set them aside. Reassembling the test files, I loaded the 'good' records and got zero load warnings. Good. Turning to the bad records, I was able to isolate a few dozen that were simply caused by incorrect field lengths; after all, phone numbers are supposed to be 35 characters long, aren't they? Fixing those got me down to a mere 195 bad records, less than 0.2 percent of the total. Acceptable in most circles. The remaining bad records were attributable to more columns than needed, fewer columns than needed, or data shifting (past a certain point, all the data sits one column to the left or right of where it should be). I sent the bad records back to the boss, and tested the good data on the Linux test box with success. Now to await the next task while streamlining this one a bit more.
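As one possible streamlining for next time: the 64-warning display default mentioned above is MySQL's `max_error_count` system variable, which caps how many messages the server retains for `SHOW WARNINGS`. Raising it for the session would allow much larger test batches:

```sql
-- max_error_count defaults to 64; 65535 is its maximum value.
SET SESSION max_error_count = 65535;

-- After a test load, inspect the retained messages:
SHOW WARNINGS;

-- Or page through them a chunk at a time:
SHOW WARNINGS LIMIT 0, 50;
```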