Staging Data with MySQL, CSVed, and XAMPP
Introduction
The Task: Take and load, using MySQL, a flatfile of over 450 colums and over 145,000 rows of data and put it into a staging table.
The Conditions: The data may not be altered prior to the staging load, or in the staging load. Only when the data is loaded from staging to the final tables may it be altered. Don’t ask why.
The End Platform: MySQL on a Linux OS base.
The Tools used: CSVed, Notepad, Winzip, MySQL, XAMPP (myphpAdmin), Excel 2007, Lots of time.
Of course, one of the most important things was the understanding that though I am faster and more knowledgable with Windows systems, the end result must exist on Linux. Thankfully MySQL runs on both platforms, and it’s command line engine works the same (with the exception of file location syntax and directory access permissions, but these are solvable issues) so I can test and design on Windows, and then port to Linux with a minimum of alteration required.
The Process
The data file given to me was, in the words of one of the project leads ‘a horrendous example of a file that is NOT 3NF (3rd Normal Form)’ This was an accurate assesment. Intially I saw a very large file, that used a poor choice of field delimiter. Using CSVed to easily strip out the first row (header names) I was able to construct a table creation script, keeping it as close ot the original names as I could, only replacing spaces with underscores, removing potential reserved or special character conflicts to reduce potential problems. Notepad and Wordpad crash even attempting to load the main file, and even Word itself was very argumentative. So CSVed was a good thing to have, even though it too had some minor problems, mainly display issues, with the file given it’s size. Cross your fingers, and try an intial load based on first guesses of the fields. And the result is so bad, it surpassed the MySQL hardcoded counters for errors and warnings at load. A little digging and we find some major culprits. A novel here and there in the notes fields, date fields that are strings, and number fields that are not long enough. Now it’s perfectly possible in MySQL to alter a column’s type and size through the command line. However, XAMPP (via phpMyAdmin) provides a way to do it graphically, and affect multiple rows at once easily, so it makes the job much faster. This was a tool of immense value to this task. I could also use it to purge the table of data to help keep the memory footprint lower during the testing and working out of the problems.
A few column setting changes later and I got the errors lower, but not by much. So, using Excel (2007 is the only option since previous versions cannot process more than 65535 rows very well, if at all) and a file of notes, I looked at every single column and adjusted sizes. Given the column count and contents, I had to be concerned with the max size limit per record imposed by MySQL as well as max field size. The later I could work around by use of a different datatype, akin to BLOB or CLOB in Oracle, the former though was something to keep track of. After going through it all, I still had an immense number of errors. So I try something simple, I added a dummy column to the end of the table, and in a single step the error count dropped to a mere 6000 errors and warnings. Progress is good.
Solution Grind
The problem still remained where all these errors were and what was causing them. The solution was arduous, but the safest. Manually extract and test the records from the master file and separate the good from the bad. In small situations this isn’t a big deal, but we were talking over 145,000 records. Testing them 300 at a time still took a while. The relative small number was due to the fact that the MySQL warning display defaults to only showing the first 64 errors or warnings, so I had to reduce the number of attempted records to be able to see and affect the changes needed to find truly bad records. In the end I was able to isolate the roughly 220 bad records and set them aside. Reassembling the test files back together I tested the ‘good’ records, and got zero load warnings. Good. Beginning with the bad records, I was able to isolate a few dozen that were simply caused by incorrect field lengths, since after all, phone numbers are supposed to be 35 characters long, aren’t they? Fixing those, I got down to a mere 195 bad records, less than .2 percent of the total. Acceptable in most circles. The remaining bad records were attributed to either more columns than needed, fewer columns than needed, or data shifting (after a point all data is one column to the left or right of where it should be). Sent the bad records back to the boss, and tested the results of the good data on the Linux test box, with success. Now to just await the next task while streamlining this one a bit more.
Related Information:
- MySQL and SQL Server - Oracle CONNECT BY PRIOR for Recursive, Hierarchical Data
Recursive queries are something I have used many, many times over the years to build dynamic, n-tiered hierarchies. Oracle’s CONNECT...
- Data Validation and Cleansing - Technology, a New “Mouse Trap”
You lie in wait, looking for your prey, it approaches, and looks around. You spring the trap and find the...
- Oracle ADF Faces 10g: How to pass data between pages
Oracle ADF Faces 10g: How to pass data between pages Do you have the need to pass data between pages,...
- Copy Table Columns and Data
As you may know, in Oracle the syntax from creating a table that is the same as another is: 1...
- Oracle Database vs Sun MySQL
Well, this is a topic we have reviewed numerous times over the years for projects and customers. The fact is,...
- APEX 3.2 Tutorial Preparation and OEHR Sample Data - Video Walkthrough
This tutorial video and article is meant to help you setup your APEX installation so you can proceed through the...
- Oracle System Administration - Linux Commands (Red Hat)
Following is a sortable table of the most common Linux commands we use on a day-to-day basis while implementing and...
- Amazon EC2 Server Pricing Matrix (Linux and Windows)
I was trying to figure out Amazon EC2 pricing recently, and found myself pulling out a calculator for what they...
