sql - how to validate millions of records? -
How do you validate this scenario?
Scenario 1:
The source is a flat file containing millions of records. The data from the source file is loaded into a target table in the database.
Now the question: how do you validate that the data was loaded into the target table correctly?
Note: we can't use Excel to validate, because the file has millions of records in it.
There are lots of ways one can validate data. Much of it depends on three things:
How much time do you have for validation?
What are your processing capabilities?
Is the data on a QA or production SQL Server?
If you're in QA and have lots of processing power, you can run some basic checks (a few of these are sketched in SQL after the list):
- Were there warnings or errors during the data load?
- Count the total number of rows in the database vs. the raw file.
- Count the total number of null records in the database.
- Check the total number of columns vs. the raw file.
- Check the length of variables. Are they as expected?
- Are character columns unexpectedly truncated?
- Are numeric columns carrying the correct number of significant digits?
- Are dates reasonable? For example, if you expected dates from 2004, are there any from 1970?
- How many duplicates are there?
- Check whether the data in the columns makes sense. A few questions you can ask: Are rows "shifted"? Are numeric variables in numeric columns? Is the key column actually a key? Do the column names make sense? The check for null records should help detect these things.
- Can you manually calculate any columns and compare your calculation to the one in the file?
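A minimal sketch of a few of these checks, assuming the flat file has also been bulk-loaded into a staging table and that the table and column names (dbo.staging_orders, dbo.orders, order_id, order_date, customer_name, amount) are stand-ins for your own schema:

```sql
-- Row count: target table vs. staging copy of the raw file
SELECT
    (SELECT COUNT(*) FROM dbo.staging_orders) AS staging_rows,
    (SELECT COUNT(*) FROM dbo.orders)         AS target_rows;

-- Null counts for the columns you care about
SELECT
    SUM(CASE WHEN customer_name IS NULL THEN 1 ELSE 0 END) AS null_customer_name,
    SUM(CASE WHEN order_date    IS NULL THEN 1 ELSE 0 END) AS null_order_date,
    SUM(CASE WHEN amount        IS NULL THEN 1 ELSE 0 END) AS null_amount
FROM dbo.orders;

-- Duplicates on the supposed key
SELECT order_id, COUNT(*) AS copies
FROM dbo.orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Date sanity: anything outside the expected range (here, calendar year 2004)?
SELECT COUNT(*) AS out_of_range_dates
FROM dbo.orders
WHERE order_date < '2004-01-01' OR order_date >= '2005-01-01';

-- Truncation check: longest value actually stored in a character column
SELECT MAX(LEN(customer_name)) AS max_customer_name_len
FROM dbo.orders;
```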
If you're low on processing power, or you're on a production server and don't want to risk degrading performance for other users, you can run many of the above checks on a simple random sample. Take, say, 100,000 rows at a time; or stratify the sample if needed.
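On SQL Server, one lightweight way to pull such a sample is TABLESAMPLE, which grabs whole pages and so returns roughly (not exactly) the requested number of rows. A sketch, again using the hypothetical dbo.orders table:

```sql
-- Pull an approximate 100,000-row sample into a temp table
SELECT *
INTO #orders_sample
FROM dbo.orders TABLESAMPLE (100000 ROWS);

-- Then run the earlier checks against the sample, e.g. duplicates on the key
SELECT order_id, COUNT(*) AS copies
FROM #orders_sample
GROUP BY order_id
HAVING COUNT(*) > 1;
```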
These are just a few checks you can do. The more comparisons and sanity checks you run, the better off you are.
Most importantly, communicate these findings, and anything that seems strange, to the file owner. They should be able to give additional insight into whether your data load is correct, or whether they gave you the right file in the first place.
You're loading the data and providing as many reasonable checks as possible. If they're satisfied with the outcome, and you're satisfied with the outcome, you should consider the data valid.