Data Validation and Cleansing – Technology, a New “Mouse Trap”

You lie in wait, looking for your prey, it approaches, and looks around. You spring the trap and
find the spoils of war…the dreaded %20 (a normally unseen character code for a blank space).

Ok, maybe not all that exciting, but it brings up an important problem in data conversion and in
application development projects. Namely the importance of input data trapping and input
validation.

Background

I ran into this issue in my last project. A flat file based system with over 10 years of data
needed to be moved into a database that allowed for relational behaviors. Early on, this meant
identifying and and mapping the data we needed from the original extracts, outputted to .csv
files, to the staging and final tables where they would end up. After many weeks of work, a few things became obvious. The original system did little to no data trapping or validation on
input. We had data fields that were being used for notes, email fields that didn’t have e-mail
addresses, and all manners of odd stuff. I can’t blame the original users since chances are the
program didn’t enforce correct entry, allowing all this stuff to be put where it didn’t belong.

Core Issue

White Space

One of the larger problems we encountered was that as the use of the internet increased, the
original users were copy – pasting information from websites into the original system. Good in
some aspects (faster, fewer transcribing mistakes) but bad in one major aspect. When they copy –
pasted, they also picked up the special characters that a web browser interpets internally, but
is seen as ‘white space’ by the end user. We had well over 5,000 records that were broken,
showing up as almost 18,000 records due to unseen carrige returns and ‘breaks’ in the data
fields that were invisible to the user, but our data loaders and such saw them quite well, and
were more than happy to process them, even though it corrupted the data. And with the data
records each containing some 400 + fields per record, this constituted several man hours to find
and correct.

Special Characters

Another issue we ran into were special characters. This is something too many people don’t think
about. After all, there isn’t much difference between a forward and a backward slash right? Or
it’s fine to use a double quote, or a semi colon to delimit data right? Well with the slashes,
they are commonly used as escape characters by many languages, so they are not a good choice, as
well as being used to differentiate the directory structures between Windows and *nix based
systems. Other characters that people tend to not consider are the common ones, commas, quotes,
asterisks, pound sign, etc etc. The issue with these is often they are used in comments and
notes, or when citing formal sources. But many of these issues are not on the minds of the data
entry people.

Results

By trapping and validating, and by limiting as well, the input side of our projects and
applications we not only help safeguard against malicious intent (purposely putting SQL commands
or malformed information into a field to try and break the backend), but also help to better
ensure data integrity and stability. By ensuring that only numbers are entered for a date of
birth, or that a zip code is no longer than 10 characters (5 plus 4 in the US with the dash) we
can help ensure more consistent results.

In the end, we:

  1. Cleansed data during migration to go from over 30% failure rates to under 1%
  2. We implemented a new web 2.0 solution for capturing data from users and performing validation up-front

The customer has a solution that reflects accurate historical data and effectively self-manages data entry from users going forward.

Conclusion

The goal should be to produce a solid, stable end product that will politely tell the user when something they
entered is invalid, and ask for it to be checked for correctness. Worst case, it should merely
toss out any invalid data, and only keep the correct data, provided that all critical data is
present. This of course is based on how well the design was for the overall project. So watch your prey
carefully, and you should be able to anticipate most of the common traps you might need; after
all, success can also be measured by the number of times you never get a service call from the
client.

Leave a Reply

Your email address will not be published. Required fields are marked *