Staging Data with MySQL, CSVed, and XAMPP

By M&S Consulting (Kevin) on Jun 6th, 2008
Filed Under Technical Tips // Tags:

Introduction

The Task: Take and load, using MySQL, a flatfile of over 450 colums and over 145,000 rows of data and put it into a staging table.

The Conditions: The data may not be altered prior to the staging load, or in the staging load. Only when the data is loaded from staging to the final tables may it be altered. Don’t ask why.

The End Platform: MySQL on a Linux OS base.

The Tools used: CSVed, Notepad, Winzip, MySQL, XAMPP (myphpAdmin), Excel 2007, Lots of time.

Of course, one of the most important things was the understanding that though I am faster and more knowledgable with Windows systems, the end result must exist on Linux. Thankfully MySQL runs on both platforms, and it’s command line engine works the same (with the exception of file location syntax and directory access permissions, but these are solvable issues) so I can test and design on Windows, and then port to Linux with a minimum of alteration required.

The Process

The data file given to me was, in the words of one of the project leads ‘a horrendous example of a file that is NOT 3NF (3rd Normal Form)’ This was an accurate assesment. Intially I saw a very large file, that used a poor choice of field delimiter. Using CSVed to easily strip out the first row (header names) I was able to construct a table creation script, keeping it as close ot the original names as I could, only replacing spaces with underscores, removing potential reserved or special character conflicts to reduce potential problems. Notepad and Wordpad crash even attempting to load the main file, and even Word itself was very argumentative. So CSVed was a good thing to have, even though it too had some minor problems, mainly display issues, with the file given it’s size. Cross your fingers, and try an intial load based on first guesses of the fields. And the result is so bad, it surpassed the MySQL hardcoded counters for errors and warnings at load. A little digging and we find some major culprits. A novel here and there in the notes fields, date fields that are strings, and number fields that are not long enough. Now it’s perfectly possible in MySQL to alter a column’s type and size through the command line. However, XAMPP (via phpMyAdmin) provides a way to do it graphically, and affect multiple rows at once easily, so it makes the job much faster. This was a tool of immense value to this task. I could also use it to purge the table of data to help keep the memory footprint lower during the testing and working out of the problems.

A few column setting changes later and I got the errors lower, but not by much. So, using Excel (2007 is the only option since previous versions cannot process more than 65535 rows very well, if at all) and a file of notes, I looked at every single column and adjusted sizes. Given the column count and contents, I had to be concerned with the max size limit per record imposed by MySQL as well as max field size. The later I could work around by use of a different datatype, akin to BLOB or CLOB in Oracle, the former though was something to keep track of. After going through it all, I still had an immense number of errors. So I try something simple, I added a dummy column to the end of the table, and in a single step the error count dropped to a mere 6000 errors and warnings. Progress is good.

Solution Grind

The problem still remained where all these errors were and what was causing them. The solution was arduous, but the safest. Manually extract and test the records from the master file and separate the good from the bad. In small situations this isn’t a big deal, but we were talking over 145,000 records. Testing them 300 at a time still took a while. The relative small number was due to the fact that the MySQL warning display defaults to only showing the first 64 errors or warnings, so I had to reduce the number of attempted records to be able to see and affect the changes needed to find truly bad records. In the end I was able to isolate the roughly 220 bad records and set them aside. Reassembling the test files back together I tested the ‘good’ records, and got zero load warnings. Good. Beginning with the bad records, I was able to isolate a few dozen that were simply caused by incorrect field lengths, since after all, phone numbers are supposed to be 35 characters long, aren’t they? Fixing those, I got down to a mere 195 bad records, less than .2 percent of the total. Acceptable in most circles. The remaining bad records were attributed to either more columns than needed, fewer columns than needed, or data shifting (after a point all data is one column to the left or right of where it should be). Sent the bad records back to the boss, and tested the results of the good data on the Linux test box, with success. Now to just await the next task while streamlining this one a bit more.

Bookmark and Share

Related Information:

  1. MySQL and SQL Server - Oracle CONNECT BY PRIOR for Recursive, Hierarchical Data
    Recursive queries are something I have used many, many times over the years to build dynamic, n-tiered hierarchies. Oracle’s CONNECT...
  2. Data Validation and Cleansing - Technology, a New “Mouse Trap”
    You lie in wait, looking for your prey, it approaches, and looks around. You spring the trap and find the...
  3. Oracle ADF Faces 10g: How to pass data between pages
    Oracle ADF Faces 10g: How to pass data between pages Do you have the need to pass data between pages,...
  4. Copy Table Columns and Data
    As you may know, in Oracle the syntax from creating a table that is the same as another is: 1...
  5. Oracle Database vs Sun MySQL
    Well, this is a topic we have reviewed numerous times over the years for projects and customers. The fact is,...
  6. APEX 3.2 Tutorial Preparation and OEHR Sample Data - Video Walkthrough
    This tutorial video and article is meant to help you setup your APEX installation so you can proceed through the...
  7. Oracle System Administration - Linux Commands (Red Hat)
    Following is a sortable table of the most common Linux commands we use on a day-to-day basis while implementing and...
  8. Amazon EC2 Server Pricing Matrix (Linux and Windows)
    I was trying to figure out Amazon EC2 pricing recently, and found myself pulling out a calculator for what they...

Leave a Reply


Archives

Recent Comments

  • Kavitha Muniraj said: Hi, Could you please send a script which checks the status of forms and reports of oracle...
  • Siddharth said: Hey thanks a lot !! Have been lookin all over for this :) My PC may not be able t handle FlashBuilder...
  • Cyrex said: Hey man why I cant install Apex 4? is there anyone can help me.,?
  • sharanabasava said: ALTER TABLE [my_table_name] RENAME COLUMN [current_column_name] TO [new_column_name]; Sir,I...
  • Huub Vankan said: Thanks for this easy solution. Indeed strange that is does not work in the (x86) folder….

Calendar

July 2010
M T W T F S S
« Jun    
 1234
567891011
12131415161718
19202122232425
262728293031  

Featured Testimonial

"M&S has taken 'partnership' to a new level. They are not only a deeply skilled service provider, the M&S team is actually a mission critical component to our operations, taking on new initiatives as well as maintaining custom, open source, and COTS solutions."

CEO, MedPivot

Free SQL Service