Data Validation and Cleansing - Technology, a New “Mouse Trap”

By M&S Consulting (Kevin) on Jan 19th, 2009
Filed Under Case Studies, Industry Trends // Tags: , , , ,

You lie in wait, looking for your prey, it approaches, and looks around. You spring the trap and
find the spoils of war…the dreaded %20 (a normally unseen character code for a blank space).

Ok, maybe not all that exciting, but it brings up an important problem in data conversion and in
application development projects. Namely the importance of input data trapping and input
validation.

Background

I ran into this issue in my last project. A flat file based system with over 10 years of data
needed to be moved into a database that allowed for relational behaviors. Early on, this meant
identifying and and mapping the data we needed from the original extracts, outputted to .csv
files, to the staging and final tables where they would end up. After many weeks of work, a few things became obvious. The original system did little to no data trapping or validation on
input. We had data fields that were being used for notes, email fields that didn’t have e-mail
addresses, and all manners of odd stuff. I can’t blame the original users since chances are the
program didn’t enforce correct entry, allowing all this stuff to be put where it didn’t belong.

Core Issue

White Space

One of the larger problems we encountered was that as the use of the internet increased, the
original users were copy - pasting information from websites into the original system. Good in
some aspects (faster, fewer transcribing mistakes) but bad in one major aspect. When they copy -
pasted, they also picked up the special characters that a web browser interpets internally, but
is seen as ‘white space’ by the end user. We had well over 5,000 records that were broken,
showing up as almost 18,000 records due to unseen carrige returns and ‘breaks’ in the data
fields that were invisible to the user, but our data loaders and such saw them quite well, and
were more than happy to process them, even though it corrupted the data. And with the data
records each containing some 400 + fields per record, this constituted several man hours to find
and correct.

Special Characters

Another issue we ran into were special characters. This is something too many people don’t think
about. After all, there isn’t much difference between a forward and a backward slash right? Or
it’s fine to use a double quote, or a semi colon to delimit data right? Well with the slashes,
they are commonly used as escape characters by many languages, so they are not a good choice, as
well as being used to differentiate the directory structures between Windows and *nix based
systems. Other characters that people tend to not consider are the common ones, commas, quotes,
asterisks, pound sign, etc etc. The issue with these is often they are used in comments and
notes, or when citing formal sources. But many of these issues are not on the minds of the data
entry people.

Results

By trapping and validating, and by limiting as well, the input side of our projects and
applications we not only help safeguard against malicious intent (purposely putting SQL commands
or malformed information into a field to try and break the backend), but also help to better
ensure data integrity and stability. By ensuring that only numbers are entered for a date of
birth, or that a zip code is no longer than 10 characters (5 plus 4 in the US with the dash) we
can help ensure more consistent results.

In the end, we:

  1. Cleansed data during migration to go from over 30% failure rates to under 1%
  2. We implemented a new web 2.0 solution for capturing data from users and performing validation up-front

The customer has a solution that reflects accurate historical data and effectively self-manages data entry from users going forward.

Conclusion

The goal should be to produce a solid, stable end product that will politely tell the user when something they
entered is invalid, and ask for it to be checked for correctness. Worst case, it should merely
toss out any invalid data, and only keep the correct data, provided that all critical data is
present. This of course is based on how well the design was for the overall project. So watch your prey
carefully, and you should be able to anticipate most of the common traps you might need; after
all, success can also be measured by the number of times you never get a service call from the
client.

Bookmark and Share

Related Information:

  1. Staging Data with MySQL, CSVed, and XAMPP
    The Task: Take and load, using MySQL, a flatfile of over 450 colums and over 145,000 rows of data and...
  2. Oracle ADF Faces 10g: How to pass data between pages
    Oracle ADF Faces 10g: How to pass data between pages Do you have the need to pass data between pages,...
  3. Oracle WebCenter 10gR3 Review - Framework, Web 2.0 Services, and SOA
    Ashok Aggarwal, a Partner at M&S Consulting has recently written a complete review of Oracle WebCenter Suite. This has been...
  4. Enterprise Wiki for a Large Oracle Initiative

    The Challenge

    M&S was engaged by ABC, a publically traded company, for a multi-national Oracle initiative. Upon joining, M&S recognized...
  5. APEX 3.2 Tutorial Preparation and OEHR Sample Data - Video Walkthrough
    This tutorial video and article is meant to help you setup your APEX installation so you can proceed through the...
  6. Google Apps Reporting API - Not quite this easy with Exchange
    I have been somewhat enamored by Google Apps and its potential. Many enterprises have been happy with Exchange for mail,...
  7. MySQL and SQL Server - Oracle CONNECT BY PRIOR for Recursive, Hierarchical Data
    Recursive queries are something I have used many, many times over the years to build dynamic, n-tiered hierarchies. Oracle’s CONNECT...

Leave a Reply


Archives

Recent Comments

  • Kavitha Muniraj said: Hi, Could you please send a script which checks the status of forms and reports of oracle...
  • Siddharth said: Hey thanks a lot !! Have been lookin all over for this :) My PC may not be able t handle FlashBuilder...
  • Cyrex said: Hey man why I cant install Apex 4? is there anyone can help me.,?
  • sharanabasava said: ALTER TABLE [my_table_name] RENAME COLUMN [current_column_name] TO [new_column_name]; Sir,I...
  • Huub Vankan said: Thanks for this easy solution. Indeed strange that is does not work in the (x86) folder….

Calendar

July 2010
M T W T F S S
« Jun    
 1234
567891011
12131415161718
19202122232425
262728293031  

Featured Testimonial

"M&S has taken 'partnership' to a new level. They are not only a deeply skilled service provider, the M&S team is actually a mission critical component to our operations, taking on new initiatives as well as maintaining custom, open source, and COTS solutions."

CEO, MedPivot

Free SQL Service