How many zeroes in that Poisson?

I have a data set, and some of the variables there are counts of a given event.

Four count outcomes, the easiest thing to do is a Poisson regression, but before you do that, it's worth asking if what you see there really is close enough to a Poisson process.

You could check whether the variance is more or less equal to the mean, but with real-life data you can bet that there will be a difference between the two, and you'll be left scratching your head as to whether it's too big for Poisson, or just about right to pass the smell test.

Another thing you can do is check whether the count variable shows the right number of zeroes. In a Poisson distribution, the marginal probability of a zero outcome is exp(-mean). If the proportion of zeroes that you see is a lot higher than this value, and it usually is when you're looking at counts of rare events, then you will have to consider a zero-inflated poisson or a finite-mixture model, as discussed with wonderful clarity in chapter 17 of Microeconometrics using Stata by Cameron and Trivedi.

That brings me to the immediate cause for this post. I thought I'd code up a quick program to check those zeroes for a given data set and count variable, and I did this:


capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den r(N)
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num r(N)
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end

Then I ran the thing, and it kept turning up a share of 1.0, that is 100% zeroes observed, no matter the data set or the variable of interest y. You know why? Because local den r(N) will evaluate to r(N), and that will be filled in by the last command that returns such a thing before `den' is invoked. That command is sum `y'. The same thing happens to `num'. So I took the ratio of the same number. The returned values from the calls to count that I had made right before defining both local num and local den were quietly obliterated. Isn't that a sneaky bug? The correct code is below:


capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den `r(N)'
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num `r(N)'
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end

Now `den' stores the actual value returned by the count before it was defined as local den `r(N)', as intended. Mind your apostrophes, is all I'm saying.

Erratum: actually, that's not all that I should have said. As Nick Cox observes in the comments below, you should mind your equal signs too.

9 Responses to “How many zeroes in that Poisson?”

  1. Nick Cox writes:

    Gabi:

    You don't explain what -UseData- is, but I will take your "never mind" on trust.

    That aside:

    Your first -count- in your program counts observations, regardless. You could get that directly with _N or c(N).

    Your second -count- counts missings or zeros (as you just recoded missings to zeros).

    So, your fraction looks like (missings or zeros) / (missings or non-missings).

    Don't you want zeros / non-missings?

    The denominator for your calculation should be count of non-missing values of your variable.

    You can get that directly with

    count if y < .
    local den = r(N)
    count if y == 0
    di r(N) / `den'

    Another way:

    tab y, matcell(freq)
    di freq[1,1] / r(N)

    Nick

  2. Gabi Huiber writes:

    I like the spareness of the matcell() way you propose. As to c(N) or _N, I never remember to use those things. I vaguely recall having seen somewhere that there's some correspondence between the locals that we adorn with apostrophes like `this', and that preceding _underscore, but I'm not sure of the specifics. And you're right, normally one shouldn't conflate zeroes with missings. I'm taking some liberties here because I can. I should have qualified this properly.

  3. Nick Cox writes:

    The -matcell()- trick depends on zero being the lowest observed value.

    c(N) is something you can forget about without loss. Learning about _n and _N (both for entire datasets and under -by:-) helps in many, many problems.

    In reply to your middle sentence on `this' and underscores:

    Most explanations don't emphasise it because it does not help much, and because it might confuse, but locals are in a limited sense temporary globals. Compare

    . local Durham "original Durham is in UK, not in NH, SC or Canada"

    . mac li

    _Durham: original Durham is in UK, not in NH, SC or Canada

    . di "$_Durham"
    original Durham is in UK, not in NH, SC or Canada

    So, the local is a global!

    There are faint traces of this here and there, but it's best to regard locals and globals as utterly distinct.

    However, _N is really sui generis.

  4. Maarten Buis writes:

    The checking if the variance equals the mean or whether the observed and expected 0 counts are equal to one another assumes that the effects of all the explanatory variables equal zero. If that is not the case, than the dependent variable can easily be the result of a Poisson process and still fail these tests. Consider the example below. Y was created with a Poisson process, but still fails both tests.

    I like to use -hangroot- to check the observed distribution of the dependent variable against the marginal distribution assumed by the model. (I would since I wrote it) By adding simulated y variables assuming that the model is true, one can get an idea of how much deviation from the theoretical distribution can occur just by chance. Examples for several other count models can be found at this presentation I gave at last years Nordic Stata Users' meeting .

    *------------------ begin example ---------------------
    // create data with a Poisson process
    // and an independent/explanatory/x variable
    set seed 12345
    drop _all
    set obs 10000
    gen x = _n <= 5000
    gen y = rpoisson(1+5*x)

    // Mean does not equal variance
    sum y
    di r(Var)

    // expected and observed zero count don't match
    local f _col(60) %5.2fc as result
    local zpois exp(-`r(mean)')
    qui count if y == 0
    local zobs `r(N)'/_N
    di as txt "Share of y=0 in a Poisson process: " `f' `zpois'
    di as txt "Share of y=0 observed: " `f' `zobs'

    // estimate the model
    poisson y x

    // create random draws assuming the model is true
    predict lambda, n
    forvalues i = 1/20 {
    gen sim`i' = rpoisson(lambda)
    }

    // check the distribution of y against the marginal
    // distribution assumed by the model
    // (hangroot is user written, it first needs to be
    // installed by typing -ssc install hangroot-)
    hangroot , sims(sim*) jitter(5) name(hanging, replace)
    hangroot , sims(sim*) jitter(2) susp notheor name(susp, replace)
    *------------------ end example ---------------------

  5. Gabi Huiber writes:

    Hi Maarten, thanks for stopping by. Is this code snippet brand new? My installed version of hangroot doesn't want to take the option sims() and it's also complaining that I need a varlist for this distribution. I did adoupdate, update to no avail.

    That aside, I know I shouldn't expect my particular set of realizations of the dependent variable to look like a set of perfectly random draws from its underlying distribution. The way I understand it, this is because there's no reason to believe that my set of realizations of the covariates themselves are random draws from their own underlying distributions, nor does it matter if they are as far as the model's fit is concerned. Your code gives a perfect example: x is s forced to take on only two values.

    Still, show me somebody who hasn't drawn at least once a histogram of the dependent variable. Why do we bother with any such examination, other than to do routine data quality checks? It's a genuine question. I myself do it as a kind of idle play with data ahead of fitting anything to them. Do you ever try to fit your observed instances of the dependent variable to some known distribution? Why?

    Anyway, to be truthful, 'how many zeroes in that Poisson' was just an excuse to bring up the `r(N)' vs. r(N) difference. That's how I ran across it: while poking around at a count outcome. But I would still like to give your code example a go.

  6. Nick Cox writes:

    On the last point: Your example doesn't just turn on the difference between r(N) and `r(N)'. It turns on the fact that you defined your macro by copying, not evaluating. Had you written

    local den = r(N)

    you would not have been bitten. But as you said

    local den r(N)

    then -den- was just the text "r(N)" and Stata paid no attention to what it meant. That bit later when you did evaluate it, but r(N) was then different.

    So, there are two points at stake: r(N) or `r(N)' and copying or evaluating.

  7. Maarten Buis writes:

    I used the latest version available from SSC:

    . which hangroot
    c:\ado\plus\h\hangroot.ado
    *! version 1.5.0 MLB 12Jul2011

    Your error message suggests you are using an older version of -hangroot-. Maybe -ssc install hangroot, replace- will work in getting the latest version.

    I don't think that there is a problem with looking at the distribution of the dependent variable. I actually think it is good practice. The problem is the theoretical distribution with which it is compared. If you are using a regression type model, than you do not expect that the dependent variable follows a normal, Poisson, gamma, beta, ... distribution, but a mixture of these distributions. This mixture can look very different from the archetypical normal, Poisson, gamma, beta, ... distribution.

    Too often we see at the statalist a question of the form "my dependent variable isn't normally distributed, what should I do?". I think that the problem is that many (econometric) text books state that regression assumes a normally distributed dependent variable, but many students forget/ignore that the mean of that normal distribution isn't constant over individuals, and as a consequence that the distribution of the dependent variable may look nothing like the bell shape that we are used to. In case of linear regression that is easily solved by looking at the residuals instead. No such easy solution exists for other models, including Poisson. There is a whole cottage industry on inventing normalizing transformation for residuals from GLMs. With -hangroot- and -margdistfit- I have taken the opposite approach, and compared the distribution of the dependent variable with the mixture distribution implied by the model.

  8. Nick Cox writes:

    Expanding on Maarten's comment:

    Although there are fields in which predictors are readily to hand -- most medical and social science problems seem awash with possible predictors -- there are fields in which it is common to work with response variables, but there's often nothing else except perhaps time or place of observation. With sea levels, temperatures, river discharges, etc. in environmental science it is common to have just response data and those extra coordinates, so the exact form of the marginal distribution is still a central question. (Of course, trend, seasonality and dependence structures in time and/or space are often of concern too.)

  9. Gabi Huiber writes:

    Holy cow. I re-installed hangroot, ran your code snipped, and I just saw the nicest graphical explanation that a Poisson regression can indeed model a process generated by a Poisson + something else mixture probability mass function. Thank you. I remember I saw this rootogram thing before. I must have caught one of your announcements on the Statalist and installed hangroot then, but I had no idea what a useful tool it was.

Leave a Reply