Dummy variables

There are two straightforward ways to turn string variables into corresponding dummies -- also known as indicator variables -- using Stata. One is the gen() option of the tab command:

tab stringvar, gen(dummy)

Another makes use of the fact that you seldom need dummies for their own sake. Usually you want them used in some sort of regression model. The xi: extension to various estimation commands turns string variables into dummies automatically, as in

xi: regress y x i.stringvar

Both are described in detail in the Stata FAQ page on dummy variables, and they both work well when your string variable translates into dummies directly. That, however, is not always the case. Think of a data set where you have a string variable named "color" which is equal to "red" for the first observation, "blue" for the second and "yellow, blue" for the third. You would want the dummy "color_red" to be equal to 1 in the first observation; the dummy "color_blue" to be 1 in the second and the third; and you'd want a separate dummy, "color_yellow", to be equal to 1 in the third observation.
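
To see the problem concretely, here is a minimal sketch that builds the three-observation example and runs the tab approach on it. The toy data set and the stub name color_ are made up for illustration:

```stata
clear
input str12 color
"red"
"blue"
"yellow, blue"
end

// tab treats each distinct string as its own category
tab color, gen(color_)
describe color_*
list
```

Because tab makes one dummy per distinct string, "yellow, blue" gets a dummy of its own instead of switching on separate dummies for yellow and for blue.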

I just ran across such a data set today. It had characteristics for a few hundred lottery games. The color described the ticket colors. There were a few other string variables that also could have observations that were comma-delimited lists. Moreover, the comma-delimited lists could include values that did not show up as unique values in other observations (like "yellow" in the example above).

So I thought I'd write a program that could deal with all of that without the need of any visual inspection or case-by-case manual labor on my part. I wanted it to be applicable to any string variable in this situation. My suggestion is below:

// ##### getDummies -- turns string to dummies. Takes one argument:
// ##### `1' -- string, the name of the variable of interest.
capture prog drop getDummies
prog def getDummies

local stringvar `1'
quietly count
local fullset=r(N)
quietly count if !regexm(`stringvar',",")
local uniques=r(N) // cases where `stringvar' is not a list

if `fullset'!=`uniques' {
  quietly {
    tab `stringvar' if regexm(`stringvar',",")
    levelsof `stringvar' if !regexm(`stringvar',","), local(tags)
    preserve
    tempfile `stringvar'_lists
    keep `stringvar'
    keep if regexm(`stringvar',",")
    duplicates drop
    split `stringvar', p(",")
    save "``stringvar'_lists'", replace
    restore
    describe `stringvar'* using "``stringvar'_lists'", varlist // note(1)
    local `stringvar'_stubs=r(varlist)
    split `stringvar', p(",")
    local stubs: list sizeof `stringvar'_stubs
    forvalues i=2/`stubs' {
      local stub: word `i' of ``stringvar'_stubs'
      replace `stub'=trim(`stub') // note (2)
      levelsof `stub' if `stub'!="", local(extras)
      local tags: list tags | extras
    }
    local tags: list sort tags
  }
}
else {
  di "for each value of `stringvar' there corresponds one dummy variable"
  capture drop __*
  quietly levelsof `stringvar' if !regexm(`stringvar',","), local(tags)
  local `stringvar'_stubs `stringvar'
}
capture drop __*

local stubnum: list sizeof `stringvar'_stubs
local tagnum: list sizeof tags

quietly {
  forvalues i=1/`tagnum' {
    local thistag: word `i' of `tags'
    local thistag: list clean thistag
    gen byte _`stringvar'_`i'=0
    forvalues j=1/`stubnum' {
      capture drop __*
      local thisstub: word `j' of ``stringvar'_stubs'
      replace _`stringvar'_`i'=1 if `thisstub'=="`thistag'"
    }
  }
}
drop `stringvar'* // caution: this also drops any other variable whose name starts the same way

// this section is for listing stuff on screen and in the log
local `stringvar'_stubs: list `stringvar'_stubs-stringvar
local stubs: list sizeof `stringvar'_stubs
di ""
di "total number of observations: `fullset'"
di "number of observations where `stringvar' is not a list: `uniques'"
di "unique values of `stringvar': `tagnum'"
if `stubs'>0 {
  di "where `stringvar' is a list, it is this long at most: `stubs'"
}
di ""
forvalues i=1/`tagnum' {
  local thistag: word `i' of `tags'
  local thistag: list clean thistag
  di "_`stringvar'_`i' is for `stringvar' == `thistag'"
}

end

That's it. This program collects all the possible values that your stringvar can take, whether inside comma-delimited lists or by themselves, and produces accurate dummies that are equal to one every time such a value is encountered, whether by itself or in a list, and regardless of its position in the list. With your data set in memory, you simply call

getDummies color

Now, I don't post programs unless they contain something I just learned in the process of writing them. Today's such thing is on line 24 of the program, next to the comment "note(1)". It turns out -- if you call help describe -- that there are two kinds of describe: one for data in memory, another for data in a file (describe using). The latter comes with a different set of options. One of them is varlist. It stores the names of the variables in r(varlist). I chose to preserve/restore and create the tiny temporary file "``stringvar'_lists'" so I could apply describe using to it, and get a variable list saved in the local ``stringvar'_stubs'. I use it later on.
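
A minimal sketch of the describe using trick on its own, assuming the auto data set that ships with Stata. The macro name autocopy is made up for the example, and because the sketch relies on local macros it should be run in one go from a do-file rather than line by line:

```stata
sysuse auto, clear
tempfile autocopy
save "`autocopy'"

// describe using reads the file on disk, not the data in memory
describe using "`autocopy'", varlist
display "`r(varlist)'"
```

The last line prints the variable names that describe using left behind in r(varlist).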

This may look like a lot of work, and it is, but it's all up-front. You do it once, and if it works now it works forever. The marginal cost of creating dummies out of any number of such variables is zero from here on out.

Update (February 4, 2009): the first version of getDummies had a bug. The line marked with the comment "note(2)" was missing. As a result, the program produced more dummies than it should have. To use my color example, without this line getDummies will produce two separate dummies for the color "blue": one for the case where color was equal to "blue" strictly, and another for the case where color contained the string " blue". Notice the leading blank space. You want to trim() it.
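
The leading blank is easy to reproduce on a toy data set (made up here for illustration). Splitting on the comma alone leaves the blank that followed it attached to the next piece:

```stata
clear
input str12 color
"blue"
"yellow, blue"
end

split color, p(",")
levelsof color2                   // the value still carries a leading blank
replace color2 = trim(color2)
levelsof color2                   // now it matches plain "blue"
```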

11 Responses to “Dummy variables”

  1. stataplayer writes:

    As far as your color example is concerned, the following lines will do the job:

    clear
    input str5 color
    red
    blue
    green
    end
    cl

    levelsof color, local(colors)
    foreach l of local colors {
    gen color_`l' = (color == "`l'")
    }

    list

  2. Gabi Huiber writes:

    Stataplayer's quick code is a nice substitute for the "tab, gen()" method described at the top of my post. In fact, it's the first method described in the Stata FAQ page on dummy variables that I am referencing in my post (I have no idea how I managed to not notice it the first time, because it's clearly marked "Answer 1 or 3"). Like the other two, though, it wouldn't work if the third observation in this data set were something like "green, maroon" instead of "green". In other words, it won't work in the cases that my program was designed for -- e.g. when the three observations in his example might map to more than three dummies (red, blue, green, maroon).

  3. Steli writes:

    /*
    Well, this works nice, but it's not as general as you make it sound. There is still some work to be done, and unfortunately I don't think it can be generalized to avoid tabulating the data because it is case specific.
    Try to see what you get:
    */
    clear
    set obs 12
    gen var1 = ""

    replace var1 = "aa" if _n == 1 | _n == 2 | _n == 3
    replace var1 = "aa, bb" if _n == 4
    replace var1 = "aa, cc" if _n == 5
    replace var1 = "bb" if _n == 6
    replace var1 = "aa / bb" if _n == 7
    replace var1 = "aa/cc" if _n == 8
    replace var1 = "aa/bb/cc" if _n == 9
    replace var1 = "aa|cc" if _n == 10
    replace var1 = "aa.cc" if _n == 11
    replace var1 = "aa&cc" if _n == 12

    * notice 11 aa, 4 bb, and 6 cc

    * define here your getDummies program
    *...
    getDummies var1

    /* Notice that the result is not what you expected. These are examples of real data that I worked with (except of course for the generic aa/bb/cc). You need to know the separators of the different values in your data. Now you allow for only space and comma. One solution would be to increase the allowed list to make the program more general, but the problem then might be that you could make it "too general", i.e., you would wrongly split legitimate entities that contain the assumed separators. For example, "." could be an entity separator or it could be part of a string (like in "Company A, Ltd.") */

  4. Gabi Huiber writes:

    You're right, though your concerns can be addressed, at least partially, with regular expressions. You may, for example, allow the separator "," as long as it's not in front of "Ltd.$ | ltd.$". Stata's regular expression functions can handle all of that. But you're right that you'd have to do some prior exploration in order to see what you're up against, and there are no guarantees that you'll ever cover all the possibilities. Then again, real-world problems aren't perfect problems either. Most of them don't require perfect solutions.

  5. D Watson writes:

    I tried to run the getDummies program (using the "webuse auto.dta" data and the string variable 'make') however it returned this error:

    getDummies make
    Unknown function ()
    r(133);

    I've run the program from the dofile editor as well as from a saved .ado file and got the same results both times. Any ideas on what I am doing wrong?

    Also, how is the program different from the 'dummies' or 'mrdum' packages??

    Thanks.

  6. Gabi Huiber writes:

    It works fine for me, I just checked. I have Stata/MP 10.1. I'm not sure what you're running, but you could do this:

    set trace on
    getDummies make
    set trace off

    This will tell you where exactly the thing is breaking down.

    I just looked at dummies and mrdum. I wasn't aware of either before your comment. From the help files, it looks like getDummies and mrdum are doing the same thing (of course, in your case mrdum might have the distinct advantage that it works).

  7. D Watson writes:

    Thanks for the quick reply. Below is what I get using trace on. I'm not sure why it says that regexm() is an unknown function. I checked and it is installed.

    . getDummies make
    -------------------------------------------------------------------------- begin getDummies ---
    - local stringvar `1'
    = local stringvar make
    - quietly count
    - local fullset=r(N)
    - quietly count if !regexm(`stringvar',",") 
    = quietly count if !regexm(make,",") 
    Unknown function ()
    ---------------------------------------------------------------------------- end getDummies ---
    r(133);

  8. Gabi Huiber writes:

    Strange; regexm() is one of the Stata regular expression functions. In this particular case, it's counting the values of make where the comma character is absent. For the auto.dta dataset, this count should return the number of observations, because there are no cases where a car has more than one make, separated by commas. You can replicate it at the command line:
    sysuse auto
    count
    count if !regexm(make,",")

    I'm stumped.

  9. D Watson writes:

    I am running Stata 10.1/SE on the Mac...it's not MP, but I can't think why this would make a difference.

    It seems that the first problem was that I copied your program from the website and I did not think about the spacing getting messed up. When I copied and pasted your program into my text editor (AlphaX) it only put 2 spaces in front of the lines that were supposed to be tabbed over the line after a "{".

    Once I corrected this it made it farther into the program. It hit an error here:

    - levelsof `stringvar' if !regexm(`stringvar',","), local(tags) 
    = levelsof make if !regexm(make,","), local(tags) 
    -------------------------------------------------------------------------- begin levelsof ---
    - version 9
    - syntax varname [if] [in] [, Separate(str) MISSing Local(str) Clean ]
    option   not allowed
    ---------------------------------------------------------------------------- end levelsof ---
    local `stringvar'_stubs `stringvar'
    }

    I'm not sure what the problem is with this line. I tried running it in the command line alone and it gave me a slightly different error:

    . levelsof `stringvar' if !regexm(`stringvar',","), local(tags)
    ---------------------------------------------------------------------------- begin levelsof ---
    - version 9
    - syntax varname [if] [in] [, Separate(str) MISSing Local(str) Clean ]
    varlist required
    ------------------------------------------------------------------------------ end levelsof ---
    r(100);

    Feel free to ignore my question, I can probably just use one of the other dummy packages out there, but I am starting to learn to program some ado files and I am just interested in figuring out what's happening here so that I can avoid these issues in the future. Thanks for any help & I enjoy your 'A Stata Mind' webpage--keep the good entries coming!

    ~DW

  10. Gabi Huiber writes:

    Regarding the first bit of trouble:

    There's a reason for the two spaces: that's how many I used in the original version. Indentation helps with keeping track of what goes where, and how you do it doesn't matter as long as it does that job. Stata ignores both tabs and spaces. But some people prefer two or three spaces to one tab because that's unambiguous. You see, there's no standard translation of a tab. Some editors make it five, some eight spaces. I'm taking some CS classes at NC State and they discourage using tabs for indentation for this reason.

    But if my lines run over and your text editor puts an end of line where I didn't mean to have one, that's a problem. Thanks for bringing it up. Now it's clear what went wrong. I'm going to have to tinker with the code font size so it renders properly across all browsers. Right now you can see my code as I intended it in Google Chrome, but not in Opera.

    Regarding the r(100) error when trying to run levelsof in the command line:

    The "varlist required" message is one of the rare examples of error messages that are actually clear. In this case, you used a local -- `stringvar' -- that wasn't previously defined as local to the command line (it was only local to the program). Stata turns those guys into a blank. To see what I'm talking about, try this: display "`hello'"

  11. Human-readable code | A Stata Mind writes:

    [...] wider page in order to accommodate longer lines of code. I needed it because some code lines in my Dummy variables post ran over. If you cut and pasted the code errant end-of-line characters could have crept in. [...]
