Dummy variables
Monday, 19 January 2009
There are two straightforward ways to turn a string variable into the corresponding dummies (also known as indicator variables) in Stata. One uses the gen() option of the tab command:
tab stringvar, gen(dummy)
Another makes use of the fact that you seldom need dummies for their own sake. Usually you want them in some sort of regression model. The xi: prefix, which works with most estimation commands, turns string variables into dummies automatically, as in
xi: regress y x i.stringvar
Both are described in detail in the Stata FAQ on dummy variables, and they both work well when your string variable translates into dummies directly. That, however, is not always the case. Think of a data set where you have a string variable named "color" which is equal to "red" for the first observation, "blue" for the second and "yellow, blue" for the third. You would want the dummy "color_red" to be equal to 1 in the first observation; the dummy "color_blue" to be 1 in the second and the third; and you'd want a separate dummy, "color_yellow", to be equal to 1 in the third observation.
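To make the problem concrete, here is a minimal sketch on a toy version of that data set. The three observations and the stub name color_ are made up for illustration. It shows what tab, gen() does with a comma-delimited value:

```stata
clear
input str20 color
"red"
"blue"
"yellow, blue"
end

* tab, gen() creates one dummy per distinct string, in sort order:
* color_1 is "blue", color_2 is "red", color_3 is "yellow, blue"
tab color, gen(color_)

* the list gets a dummy of its own, and the "blue" dummy stays at 0
* for the third observation
list color color_*
```

The third observation ends up with its own dummy instead of switching on both the blue and the yellow one, which is the case the rest of this post deals with.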
I just ran across such a data set today. It had characteristics for a few hundred lottery games. The color described the ticket colors. There were a few other string variables that also could have observations that were comma-delimited lists. Moreover, the comma-delimited lists could include values that did not show up as unique values in other observations (like "yellow" in the example above).
So I thought I'd write a program that could deal with all of that without the need of any visual inspection or case-by-case manual labor on my part. I wanted it to be applicable to any string variable in this situation. My suggestion is below:
// ##### getDummies -- turns string to dummies. Takes one argument:
// ##### `1' -- string, the name of the variable of interest.
capture prog drop getDummies
prog def getDummies
  local stringvar `1'
  quietly count
  local fullset=r(N)
  quietly count if !regexm(`stringvar',",")
  local uniques=r(N) // cases where `stringvar' is not a list
  if `fullset'!=`uniques' {
    quietly {
      tab `stringvar' if regexm(`stringvar',",")
      levelsof `stringvar' if !regexm(`stringvar',","), local(tags)
      preserve
      tempfile `stringvar'_lists
      keep `stringvar'
      keep if regexm(`stringvar',",")
      duplicates drop
      split `stringvar', p(",")
      save "``stringvar'_lists'", replace
      restore
      describe `stringvar'* using "``stringvar'_lists'", varlist // note(1)
      local `stringvar'_stubs=r(varlist)
      split `stringvar', p(",")
      local stubs: list sizeof `stringvar'_stubs
      forvalues i=2/`stubs' {
        local stub: word `i' of ``stringvar'_stubs'
        replace `stub'=trim(`stub') // note(2)
        levelsof `stub' if `stub'!="", local(extras)
        local tags: list tags | extras
      }
      local tags: list sort tags
    }
  }
  else {
    di "for each value of `stringvar' there corresponds one dummy variable"
    capture drop __*
    quietly levelsof `stringvar' if !regexm(`stringvar',","), local(tags)
    local `stringvar'_stubs `stringvar'
  }
  capture drop __*
  local stubnum: list sizeof `stringvar'_stubs
  local tagnum: list sizeof tags
  quietly {
    forvalues i=1/`tagnum' {
      local thistag: word `i' of `tags'
      local thistag: list clean thistag
      gen byte _`stringvar'_`i'=0
      forvalues j=1/`stubnum' {
        capture drop __*
        local thisstub: word `j' of ``stringvar'_stubs'
        replace _`stringvar'_`i'=1 if `thisstub'=="`thistag'"
      }
    }
  }
  drop `stringvar'*
  // this section is for listing stuff on screen and in the log
  local `stringvar'_stubs: list `stringvar'_stubs-stringvar
  local stubs: list sizeof `stringvar'_stubs
  di ""
  di "total number of games: `fullset'"
  di "number of games where `stringvar' is not a list: `uniques'"
  di "unique values of `stringvar': `tagnum'"
  if `stubs'>0 {
    di "where `stringvar' is a list, it is this long at most: `stubs'"
  }
  di ""
  forvalues i=1/`tagnum' {
    local thistag: word `i' of `tags'
    local thistag: list clean thistag
    di "_`stringvar'_`i' is for `stringvar' == `thistag'"
  }
end
That's it. This program collects all the possible values that your stringvar can take, whether inside comma-delimited lists or by themselves, and produces accurate dummies that are equal to one every time such a value is encountered, whether by itself or in a list, and regardless of its position in the list. With your data set in memory, you simply call
getDummies color
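If you want something to check the output against, here is a hand-rolled sketch for the three-observation color example. It is not getDummies: it hard-codes the three colors and assumes that list items are separated by exactly a comma and one blank, which is the kind of manual work the program is meant to avoid. The names padded and color_red, color_blue, color_yellow are made up for illustration.

```stata
clear
input str20 color
"red"
"blue"
"yellow, blue"
end

* wrap every value in delimiters so each item reads as ", item,"
gen padded = ", " + color + ","

* one dummy per color, switched on wherever the item shows up
foreach c in red blue yellow {
  gen byte color_`c' = strpos(padded, ", `c',") > 0
}

list color color_red color_blue color_yellow
```

The blue dummy is switched on in the second and third observations and the yellow dummy only in the third, which is the pattern getDummies should reproduce under its own variable names.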
Now, I don't post programs unless they contain something I just learned in the process of writing them. Today's such thing is in the line marked with the comment "note(1)". It turns out (see help describe) that there are two kinds of describe: one for data in memory, another for data in a file, invoked as describe using. The latter comes with a different set of options. One of them is varlist, which stores the names of the variables in r(varlist). I chose to preserve/restore and create the tiny temporary file "``stringvar'_lists'" so I could apply describe using to it, and get a variable list saved in the local ``stringvar'_stubs'. I'm using it later on.
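A minimal sketch of describe using on its own, run from a do-file. It borrows the auto data set that ships with Stata; the macro names cars and names are made up for illustration.

```stata
sysuse auto, clear
tempfile cars
save "`cars'"
clear

* nothing is in memory now; describe using reads the names from the file,
* and the varlist option leaves them behind in r(varlist)
describe using "`cars'", varlist
local names `r(varlist)'
display "`names'"
```

Copying the result with local names `r(varlist)' rather than local names=r(varlist) avoids evaluating it as a string expression, which in older Stata releases could truncate a long variable list.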
This may look like a lot of work, and it is, but it's all up-front. You do it once, and if it works now it works forever. The marginal cost of creating dummies out of any number of such variables is zero from here on out.
Update (February 4, 2009): the first version of getDummies had a bug. The line marked with the comment "note(2)" was missing. As a result, the program produced more dummies than it should have. To use my color example, without this line getDummies will produce two separate dummies for the color "blue": one for the case where color was equal to "blue" strictly, and another for the case where color contained the string " blue". Notice the leading blank space. You want to trim() it.
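The bug is easy to reproduce outside the program. A minimal sketch, assuming a two-observation toy data set:

```stata
clear
input str20 color
"blue"
"yellow, blue"
end

* split cuts at the comma and keeps whatever follows it, blank included
split color, p(",")

* the second piece of the list is " blue", not "blue"
assert color2 == " blue" in 2

* after trim() it matches the stand-alone value
replace color2 = trim(color2)
assert color2 == "blue" in 2
```

Without the trim() step, levelsof sees "blue" and " blue" as two distinct values, hence the extra dummy.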
No. 1 — January 20th, 2009 at 3:48 am
As far as your color example is concerned, the following lines will do the job:
clear
input str5 color
red
blue
green
end
cl
levelsof color, local(colors)
foreach l of local colors{
gen color_`l' = (color == "`l'")
}
list
No. 2 — January 20th, 2009 at 10:33 am
Stataplayer's quick code is a nice substitute for the "tab, gen()" method described at the top of my post. In fact, it's the first method described in the Stata FAQ page on dummy variables that I am referencing in my post (I have no idea how I managed to not notice it the first time, because it's clearly marked "Answer 1 or 3"). Like the other two, though, it wouldn't work if the third observation in this data set were something like "green, maroon" instead of "green". In other words, it won't work in the cases that my program was designed for -- e.g. when the three observations in his example might map to more than three dummies (red, blue, green, maroon).
No. 3 — January 22nd, 2009 at 11:41 am
/*
Well, this works nice, but it's not as general as you make it sound. There is still some work to be done, and unfortunately I don't think it can be generalized to avoid tabulating the data because it is case specific.
Try to see what you get:
*/
clear
set obs 12
gen var1 = ""
replace var1 = "aa" if _n == 1 | _n == 2 | _n == 3
replace var1 = "aa, bb" if _n == 4
replace var1 = "aa, cc" if _n == 5
replace var1 = "bb" if _n == 6
replace var1 = "aa / bb" if _n == 7
replace var1 = "aa/cc" if _n == 8
replace var1 = "aa/bb/cc" if _n == 9
replace var1 = "aa|cc" if _n == 10
replace var1 = "aa.cc" if _n == 11
replace var1 = "aa&cc" if _n == 12
* notice 11 aa, 4 bb, and 6 cc
* define here your getDummies program
*...
getDummies var1
/* Notice that the result is not what you expected. These are examples of real data that I worked with (except of course for the generic aa/bb/cc). You need to know the separators of the different values in your data. Now you allow for only space and comma. One solution would be to increase the allowed list to make the program more general, but the problem then might be that you could make it "too general", i.e., you would wrongly split legitimate entities that contain the assumed separators. For example, "." could be an entity separator or it could be part of a string (like in "Company A, Ltd.") */
No. 4 — January 22nd, 2009 at 12:01 pm
You're right, though your concerns can be addressed, at least partially, with regular expressions. You may, for example, allow the separator "," as long as it's not in front of "Ltd.$ | ltd.$". Stata's regular expression functions can handle all of that. But you're right that you'd have to do some prior exploration in order to see what you're up against, and there are no guarantees that you'll ever cover all the possibilities. Then again, real-world problems aren't perfect problems either. Most of them don't require perfect solutions.
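A rough sketch of that idea, assuming the only problem case is a name ending in ", Ltd." or ", ltd.". The variable names name, ltd_comma and is_list are made up for illustration, and a value that mixes both kinds of comma (a list whose last item ends in ", Ltd.") would still be misclassified.

```stata
clear
input str30 name
"aa, bb"
"Company A, Ltd."
"cc"
end

* a comma followed by optional blanks and Ltd. at the very end of the
* string belongs to the name; it is not a separator
gen byte ltd_comma = regexm(name, ", *[Ll]td\.$")

* treat the value as a list only if it has a comma and that comma is
* not the Ltd. one
gen byte is_list = regexm(name, ",") & !ltd_comma

list
```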
No. 5 — February 4th, 2009 at 12:03 pm
I tried to run the getDummies program (using the "webuse auto.dta" data and the string variable 'make') however it returned this error:
getDummies make
Unknown function ()
r(133);
I've run the program from the do-file editor as well as from a saved .ado file and got the same results both times. Any ideas on what I am doing wrong?
Also, how is the program different from the 'dummies' or 'mrdum' packages??
Thanks.
No. 6 — February 4th, 2009 at 12:25 pm
It works fine for me, I just checked. I have Stata/MP 10.1. I'm not sure what you're running, but you could do this:
set trace on
getDummies make
set trace off
This will tell you where exactly the thing is breaking down.
I just looked at dummies and mrdum. I wasn't aware of either before your comment. From the help files, it looks like getDummies and mrdum are doing the same thing (of course, in your case mrdum might have the distinct advantage that it works).
No. 7 — February 5th, 2009 at 12:18 am
Thanks for the quick reply. Below is what I get using trace on. I'm not sure why it says that regexm() is an unknown function. I checked and it is installed.
. getDummies make
-------------------------------------------------------------------------- begin getDummies ---
- local stringvar `1'
= local stringvar make
- quietly count
- local fullset=r(N)
- quietly count if !regexm(`stringvar',",")
= quietly count if !regexm(make,",")
Unknown function ()
---------------------------------------------------------------------------- end getDummies ---
r(133);
No. 8 — February 5th, 2009 at 1:17 am
Strange; regexm() is one of the Stata regular expression functions. In this particular case, it's counting the values of make where the comma character is absent. For the auto.dta dataset, this count should return the number of observations, because there are no cases where a car has more than one make, separated by commas. You can replicate it at the command line:
sysuse auto
count
count if !regexm(make,",")
I'm stumped.
No. 9 — February 5th, 2009 at 6:09 pm
I am running Stata 10.1/SE on the Mac...it's not MP, but I can't think why this would make a difference.
It seems that the first problem was that I copied your program from the website and I did not think about the spacing getting messed up. When I copied and pasted your program into my text editor (AlphaX) it only put 2 spaces in front of the lines that were supposed to be tabbed over the line after a "{".
Once I corrected this it made it farther into the program. It hit an error here:
- levelsof `stringvar' if !regexm(`stringvar',","), local(tags)
= levelsof make if !regexm(make,","), local(tags)
-------------------------------------------------------------------------- begin levelsof ---
- version 9
- syntax varname [if] [in] [, Separate(str) MISSing Local(str) Clean ]
option not allowed
---------------------------------------------------------------------------- end levelsof ---
local `stringvar'_stubs `stringvar'
}
I'm not sure what the problem is with this line. I tried running it in the command line alone and it gave me a slightly different error:
. levelsof `stringvar' if !regexm(`stringvar',","), local(tags)
---------------------------------------------------------------------------- begin levelsof ---
- version 9
- syntax varname [if] [in] [, Separate(str) MISSing Local(str) Clean ]
varlist required
------------------------------------------------------------------------------ end levelsof ---
r(100);
Feel free to ignore my question, I can probably just use one of the other dummy packages out there, but I am starting to learn to program some ado files and I am just interested in figuring out what's happening here so that I can avoid these issues in the future. Thanks for any help & I enjoy your 'A Stata Mind' webpage--keep the good entries coming!
~DW
No. 10 — February 5th, 2009 at 6:26 pm
Regarding the first bit of trouble:
There's a reason for the two spaces: that's how many I used in the original version. Indentation helps with keeping track of what goes where, and how you do it doesn't matter as long as it does that job. Stata ignores both tabs and spaces. But some people prefer two or three spaces to one tab because that's unambiguous. You see, there's no standard translation of a tab. Some editors make it five, some eight spaces. I'm taking some CS classes at NC State and they discourage using tabs for indentation for this reason.
But if my lines run over and your text editor puts an end of line where I didn't mean to have one, that's a problem. Thanks for bringing it up. Now it's clear what went wrong. I'm going to have to tinker with the code font size so it renders properly across all browsers. Right now you can see my code as I intended it in Google Chrome, but not in Opera.
Regarding the r(100) error when trying to run levelsof in the command line:
The "varlist required" message is one of the rare examples of error messages that are actually clear. In this case, you used a local -- `stringvar' -- that wasn't previously defined as local to the command line (it was only local to your do-file). Stata turns those guys into a blank. To see what I'm talking about, try this:
display "`hello'"
No. 11 — February 6th, 2009 at 10:09 pm
[...] wider page in order to accommodate longer lines of code. I needed it because some code lines in my Dummy variables post ran over. If you cut and pasted the code errant end-of-line characters could have crept in. [...]