Automated sanity checks
I am reading An Introduction to Stata Programming, by Christopher Baum.
He suggests, in Chapter 5.2, a nice do-file method to validate your data: you use pairs of list and assert. For example, suppose you know that a variable v should have no missing values. If it indeed has none, then assert !missing(v) should run without error. If it does have some, you want to know where they are: list if missing(v). Putting the list line before the assert line in a do-file means that, when there are problems with your data, Stata first shows you where they are and only then exits with an error:
sysuse auto, clear
list if missing(make)
assert !missing(make)
list if missing(rep78)
assert !missing(rep78)
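When the same check applies to several variables, the pair can be wrapped in a loop. The following is only a sketch, not something taken from the book: it assumes you would rather see every failing variable than stop at the first one, so it swaps the bare assert for capture noisily assert and keeps a tally in a local macro named failed (a name made up for this illustration).

```stata
sysuse auto, clear

* Count how many variables fail the no-missing check
local failed 0
foreach v of varlist make rep78 {
    * Show the offending observations, if any
    list if missing(`v')
    * capture noisily prints the assert message but lets the do-file continue
    capture noisily assert !missing(`v')
    if _rc {
        local ++failed
    }
}
display "`failed' variable(s) failed the no-missing check"
```

Unlike the plain pairs, this version does not halt the do-file on a failed assertion, so you have to read the final tally yourself.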
This is neat because you can always extend this do-file with a new pair of list/assert lines as needed. But, as the author mentions, sometimes a summarize is plenty helpful too, especially if you remember that it accepts a list of variables as an argument. You could do this, for example:
sysuse auto, clear
sum make rep78
Wait. That didn't work too well, because summarize tells you nothing about make: it reports zero observations for a string variable, as if every value were missing. You would have known that make is a string variable if you had run describe before sum.
So, I guess, before you do any kind of data validation, describe is a good first step. You might also like codebook; I don't. I find it too wordy. But it does do the job of giving you information about the whole data set.
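As a minimal sketch of that first step, here are both commands on the auto data. The compact option of codebook is my own suggestion rather than something the book passage relies on; as far as I know it trims the report to one line per variable, which may help if the default output feels too long.

```stata
sysuse auto, clear

* Storage type, display format and label for every variable
describe

* One line per variable: observations, unique values, mean, min, max, label
codebook, compact
```

In the describe output, a storage type beginning with str marks a string variable, which is how you can tell that make is one.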
One alternative to wordy output, when you have a specific question regarding more than one variable, is to use little custom programs for data checks. One such example is countIfMissing, shown in my previous post. Another, inspired by Christopher Baum's use of summarize, might be sumIfNumeric:
capture program drop sumIfNumeric
program sumIfNumeric
    * Expand _all into the full list of variable names
    unab fullset: _all
    * Collect the numeric variables in numset
    local numset
    foreach varble in `fullset' {
        capture confirm numeric variable `varble'
        if _rc==0 {
            local numset `numset' `varble'
        }
    }
    * Summarize only if at least one numeric variable was found
    local check: list sizeof numset
    if `check'>0 {
        sum `numset'
    }
    else {
        di "No numeric variables found in this dataset."
    }
end
This might be useful when your data set comes with a bunch of variables, some numeric, some string. Though describe will tell them apart easily enough, you may not care to list them explicitly. The usage is straightforward:
sysuse auto, clear
sumIfNumeric
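For comparison, here is a shorter route to the same output. It is a sketch that assumes the built-in ds command with its has(type numeric) option is available in your version of Stata. The local macro numvars is a name chosen for this illustration.

```stata
sysuse auto, clear

* ds leaves the names of the matching variables in r(varlist)
quietly ds, has(type numeric)
local numvars `r(varlist)'

* Guard against an empty list: summarize with no varlist covers all variables
if "`numvars'" != "" {
    summarize `numvars'
}
else {
    display "No numeric variables found in this dataset."
}
```

The guard matters for the same reason as in sumIfNumeric: with an empty variable list, summarize falls back to every variable in the data set, strings included.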