Doesn’t look like a number, but it is
Thursday, 23 April 2009
I came across a data set with some variables in it that recorded actual measurements, so their values could be zero or higher. Then a need arose to quickly look up how many of these variables would have recorded anything -- in effect, separating the yes/no part from the how much -- and adding the result. I essentially needed to treat continuous variables as dummies for this one purpose and then move on. Turns out there's a very easy way to do that. Consider a data set that looked something like this (sorry for the ugliness of tables in WordPress posts; there's probably a plugin for that somewhere):
| _var1 | _var2 | _var3 |
| 1.5 | 0 | 2.7 |
| 0 | 1.2 | 0 |
| 0 | 1 | 2 |
Now suppose you wanted to generate a fourth variable that was equal to the sum of the non-zero instances of the first three. You would do this:
gen _var4=(_var1!=0)+(_var2!=0)+(_var3!=0)
Trivial? Yes. Obvious? I dunno. I'm sure you use logical operators all the time inside compounded logical expressions. I thought it was kind of neat and post-worthy that you could also use them inside arithmetic expressions.
No. 1 — April 27th, 2009 at 12:36 pm
Okay, I think I get it, but I'm worried that I don't get it. Why would you leave out the zeroes in the sum variable? They don't add or subtract anything. Would creating a variable that sums non-missing values be another useful example? Or maybe I'm not getting it. :)
No. 2 — April 27th, 2009 at 1:24 pm
You do get it. There are a couple of things I left unsaid and that introduced some ambiguity. The gist of it is that var_4 is conceptually different. var_1 through var_3 are actual measurements. For example, if this were a file of subscribers to a certain newspaper, var_1 could be the amount of their bill. var_4, on the other hand, is a count of non-zero instances of these measurements. You could add the zero measurements, and you're right that it wouldn't change the result, but the formula is clearer if it only consists of the logical conditions that describe what you're trying to count and nothing else.
I spun this off a bigger job. I had several files and they had different ranges of these var_'s, and gaps were also possible -- for example, a file would have var_1 through var_7, another only var_2 and var_3, etc. I had to count the non-zero instances of whatever var_'s there were and write code that would work across all these files. That exercise might be worth a post of its own.