Dynamic arithmetic expressions
Monday, 27 April 2009
A couple of days ago I showed how you can use logical operators inside arithmetic expressions. The point of that post was that if you had three variables, called say _var1, _var2 and _var3, then you could build another variable, call it _var4, that counted row-wise their non-zero instances, like so:
gen _var4=(_var1!=0)+(_var2!=0)+(_var3!=0)
This was actually spun off a larger project. I had several data sets that could each have all or any subset of a list of measurements saved as _var1 through _var17. I had to count the non-zero instances row-wise with a formula that worked across all of the data sets, regardless of which of these _var measurements each of them included. I also had to allow for the possibility that some data sets would have no such variables at all, and that some data sets added later would bring in additional measurements, labeled _var18 and onward. Clearly, a static formula like the one above won't do. Here's an alternative:
// 1. build a mock-up file to try this thing out
clear
set obs 4
gen id=_n
gen _var1=0
gen _var2=0
replace _var1=1.2 in 1
replace _var1=1.1 in 2
replace _var2=1.2 in 2
replace _var2=.5 in 3
// 2. build the formula
// 2.1. generate a list of the _var variables
// you have in this particular file, if any.
unab allvars: _all // this collects all the variable names
foreach varname in `allvars' {
if regexm("`varname'","_var") {
local myvars `myvars' `varname' // this collects those that start with "_var"
}
}
local myvars_ct: list sizeof myvars // this counts them
// check it out:
di "`myvars'"
di `myvars_ct'
// 2.2. build the actual formula
if `myvars_ct'>0 { // if you have any _var variables, then proceed
forvalues i=1/`myvars_ct' {
local this_var: word `i' of `myvars'
local vars_sum "`vars_sum'(`this_var'!=0)+"
}
// check it out:
di "`vars_sum'" // ok, got an extra + sign.
// 2.3 clean up the formula
local vars_sum=substr("`vars_sum'",1,strlen("`vars_sum'")-1)
// looks good now:
di "`vars_sum'"
// 3. now put the formula to use
gen vars_sum=`vars_sum'
}
else {
di "No _var found in this file."
}
This will build your formula for any combo of the _var measurements. One code fits all and it's future proof too. That's always nice.
No. 1 — April 29th, 2009 at 9:47 am
This will do it with less code:
reshape long _var, i(id) j(varnum)
gen _varnonzero = (_var!=0) // are you missing "& (_var!=.)" ?
bysort id: egen count = total( _varnonzero)
drop _varnonzero
reshape wide _var, i(id) j(varnum)
rename count vars_sum
No. 2 — April 29th, 2009 at 10:12 am
Reshape is one of my all-time favorite Stata commands. Nice having you here, Keith.
No. 3 — May 5th, 2009 at 3:10 pm
I enjoy the blog and this is definitely a helpful post.
To get all the variables that start with "_var", couldn't you just write:
unab vars : _vars*
I use that all the time when working with numerically named variables with a common stub.
No. 4 — May 5th, 2009 at 3:39 pm
Hi Nick. You're right. Your note sent me digging for a good reason why I built the _vars list the way I did, and I couldn't find one. The names of the actual variables of interest in the data set are different from simply _var, of course, but they do share a stub, enough for what you're proposing to work fine.
As it turns out,
unabis quite flexible in its pattern recognition capabilities. For example,unab vars: *var*would pick out all the variable names that contained the string "var" anywhere. So, sorry about the code bloat.Thank you for visiting. I'm glad you like the place.
No. 5 — May 5th, 2009 at 5:06 pm
Gabi,
That unab vars : *var* is another good trick. Anytime STATA syntax accepts a "varlist" you should be able to use the different wildcard or shorthand operators listed under the varlist helpfile.
Another quick thing that I ran into when implementing this code was that my vars_sum macro was getting cut off by running it through the substr to get rid of that last plus sign. As you probably know, STATA's max string length is 244 characters. So if you have long variable names or many variables to check for non-zero values, your formula might hit the max character length for strings. Since the maximum characters allowed in a macro is much larger, I kept my formula in the macro domain like so:
unab vars : _var1-_var39;
local vars_count : word count `vars';
local z = 1;
foreach x of local vars {;
if `z' == `vars_count' {;
local vars_sum "`vars_sum' (`x' != 0 & `x' < .)";
continue;
};
else {;
local vars_sum "`vars_sum' (`x' != 0 & `x' < .) +";
};
local z = `z' + 1;
};
gen _var_count = `vars_sum';
My code is delimited with a ';', and I needed to count all non-zero non-missing values, hence the addition of the `x'<. statement to your formula.
No. 6 — May 5th, 2009 at 11:37 pm
I figured that the equal sign in the definition of local var_sum was going to bite if the full expression turned out to be longer than 244 characters. And of course you can run through that many characters pretty quickly if your logical operators are compounded -- as in your case, where you want both non-zero and non-missing instances. Nice hack, that if-else workaround.
No. 7 — May 7th, 2009 at 10:18 pm
To add to the above comments, using the egen function -rownonmiss()- in place of your initial command:
gen _var4=(_var1!=0)+(_var2!=0)+(_var3!=0)
may be better in some cases where you have missing values or string values in your observations.
EX:
your command above still counts missing values, if you wanted to run the same command but not count missing values you could run:
egen _var4 = rownonmiss(_var*) if _var1!=0 | _var2!=0 | _var3!=0
Another approach is that the rononmiss() option can count the number of non-zero, non-missing observations in a row, even if there are string values present by using the "strok" option, e.g.,
egen _var4 = rownonmiss(_var*) if _var1!=0 | _var2!=0 | _var3!=0, strok
No. 8 — May 8th, 2009 at 10:13 am
I had no idea that rownonmiss() even existed. Thank you.