Automated sanity checks

I am reading An Introduction to Stata Programming, by Christopher Baum.

He suggests, in Chapter 5.2, a nice do-file method to validate your data: you use pairs of list and assert. For example, suppose you know that a variable v should have no missing values. If it indeed has none, then assert !missing(v) runs without error. If it does have some, you want to know where they are: list if missing(v). Put the list line before the assert line in a do-file: if there are problems with your data, Stata will exit with an error, which alerts you to them, but not before it shows you where they are:

sysuse auto, clear
list if missing(make)
assert !missing(make)
list if missing(rep78)
assert !missing(rep78)
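
When several variables must be complete, the same list/assert pair can go in a loop. This is only a sketch of that idea, not code from the book; the choice of variables is the one used above:

```stata
sysuse auto, clear
* For each variable that should be complete: show the offenders, then assert.
foreach v of varlist make rep78 {
    * prints nothing when no value of `v' is missing
    list if missing(`v')
    * stops the do-file with an error if any value of `v' is missing
    assert !missing(`v')
}
```

With the auto data, the loop gets past make and stops at rep78, which has missing values.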


This is neat because you can always edit this do-file with a new pair of list/assert lines as needed. But, as the author mentions, sometimes a summarize is plenty helpful too, especially if you remember that it accepts a list of variables as an argument. You could do this for example:

sysuse auto, clear
sum make rep78


Wait. That didn't work too well, because summarize tells you nothing about make: it is a string variable, and summarize reports zero observations for string variables, as if every value were missing. You would have known that make is a string if you had run describe before sum.

So, I guess, before you do any kind of data validation, describe is a good first step; you might also like codebook; I don't. I find it too wordy. But it does do the job of giving you information about the whole data set.
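
A quick look at types before validating might go something like this. It is a sketch that assumes you only care about a couple of variables; the compact option of codebook is one way to cut down its output:

```stata
sysuse auto, clear
* storage type tells strings (str#) apart from numerics (byte, int, float, ...)
describe make rep78
* one line per variable instead of the full codebook report
codebook make rep78, compact
* missing() also works on strings: it is true for the empty string ""
count if missing(make)
```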

One alternative to wordy output, when you have a specific question regarding more than one variable, is to use little custom programs for data checks. One such example is countIfMissing, shown in my previous post. Another, inspired by Christopher Baum's use of summarize, might be sumIfNumeric:

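capture prog drop sumIfNumeric
program sumIfNumeric

* expand _all into the full list of variable names
unab fullset: _all
* collect the numeric variables here
local numset
foreach varble in `fullset' {
   * confirm returns a nonzero _rc for string variables; capture suppresses the error
   capture confirm numeric variable `varble'
   if _rc==0 {
      local numset `numset' `varble'
   }
}
* summarize only if at least one numeric variable was found
local check: list sizeof numset
if `check'>0 {
   sum `numset'
}
else {
   di "No numeric variables found in this dataset."
}

end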


This might be useful when your data set comes with a bunch of variables -- some numeric, some string. Though describe will tell them apart easily enough, you may not care to list them explicitly. The usage is straightforward:

sysuse auto, clear
sumIfNumeric
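
If you would rather check only some of the variables, a variant could take an optional varlist. The sketch below is one way to do that; the name sumIfNumericIn is hypothetical, and it relies on the fact that syntax [varlist] falls back to all variables when none are given:

```stata
capture program drop sumIfNumericIn
program sumIfNumericIn
    * optional varlist; defaults to every variable in the dataset
    syntax [varlist]
    local numset
    foreach v of local varlist {
        * keep the variable only if it is numeric
        capture confirm numeric variable `v'
        if _rc == 0 {
            local numset `numset' `v'
        }
    }
    if "`numset'" != "" {
        summarize `numset'
    }
    else {
        display "No numeric variables found among: `varlist'"
    }
end

sysuse auto, clear
* make is a string and is skipped; rep78 and price are summarized
sumIfNumericIn make rep78 price
* no arguments: behaves like sumIfNumeric
sumIfNumericIn
```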

6 Responses to “Automated sanity checks”

  1. Michael P. Manti writes:

    Are you familiar with -ds3- (-ssc install ds3-)? It provides an alternative to constructions such as -capture confirm numeric-. I'm also a fan of the -vartyp- command, which allows you to tag variables as discrete, ordinal, continuous, dates, etc., and then use -ds3- to create varlists consisting of either the native Stata datatypes or the extended vartyps. The combination of the two simplifies writing utilities such as sumIfNumeric.

  2. Gabi Huiber writes:

    I didn't know of ds3 and it is excellent. I just installed it. Thank you.

  3. Nick Cox writes:

    -ds3- has now been superseded by -findname-, also from SSC. -findname- was posted at the end of March and publicised on Statalist.

  4. Martin Weiss writes:

    So your "sumIfNumeric" is pretty much equivalent to:

    qui ds, has(type numeric)
    sum `r(varlist)'

    or the equivalent use of Nick's recent -ssc d findname-?

  5. Michael P. Manti writes:

    Didn't know about -findname-. Thanks for the tip.

  6. Gabi Huiber writes:

    Nick, Martin: nice seeing you here. And many thanks for findname, nmissing, and npresent. I just used them all a minute ago for the first time.
