<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: How many zeroes in that Poisson?</title>
	<atom:link href="http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/feed/" rel="self" type="application/rss+xml" />
	<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/</link>
	<description>computing for fun and profit</description>
	<lastBuildDate>Mon, 25 Feb 2013 21:41:37 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
	<item>
		<title>By: Gabi Huiber</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/comment-page-1/#comment-62576</link>
		<dc:creator>Gabi Huiber</dc:creator>
		<pubDate>Fri, 03 Feb 2012 03:42:14 +0000</pubDate>
		<guid isPermaLink="false">http://enoriver.net/?p=1919#comment-62576</guid>
		<description><![CDATA[Holy cow. I re-installed hangroot, ran your code snippet, and I just saw the nicest graphical explanation that a Poisson regression can indeed model a process generated by a Poisson + something else mixture probability mass function. Thank you. I remember I saw this rootogram thing before. I must have caught one of your announcements on the Statalist and installed hangroot then, but I had no idea what a useful tool it was.]]></description>
		<content:encoded><![CDATA[<p>Holy cow. I re-installed hangroot, ran your code snippet, and I just saw the nicest graphical explanation that a Poisson regression can indeed model a process generated by a Poisson + something else mixture probability mass function. Thank you. I remember I saw this rootogram thing before. I must have caught one of your announcements on the Statalist and installed hangroot then, but I had no idea what a useful tool it was.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nick Cox</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/comment-page-1/#comment-62466</link>
		<dc:creator>Nick Cox</dc:creator>
		<pubDate>Thu, 02 Feb 2012 12:33:53 +0000</pubDate>
		<guid isPermaLink="false">http://enoriver.net/?p=1919#comment-62466</guid>
		<description><![CDATA[Expanding on Maarten&#039;s comment: 

Although there are fields in which predictors are readily to hand -- most medical and social science problems seem awash with possible predictors -- there are fields in which it is common to work with response variables, but there&#039;s often nothing else except perhaps time or place of observation. With sea levels, temperatures, river discharges, etc. in environmental science it is common to have just response data and those extra coordinates, so the exact form of the marginal distribution is still a central question. (Of course, trend, seasonality and dependence structures in time and/or space are often of concern too.)]]></description>
		<content:encoded><![CDATA[<p>Expanding on Maarten's comment: </p>
<p>Although there are fields in which predictors are readily to hand -- most medical and social science problems seem awash with possible predictors -- there are fields in which it is common to work with response variables, but there's often nothing else except perhaps time or place of observation. With sea levels, temperatures, river discharges, etc. in environmental science it is common to have just response data and those extra coordinates, so the exact form of the marginal distribution is still a central question. (Of course, trend, seasonality and dependence structures in time and/or space are often of concern too.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Maarten Buis</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/comment-page-1/#comment-62301</link>
		<dc:creator>Maarten Buis</dc:creator>
		<pubDate>Wed, 01 Feb 2012 09:00:21 +0000</pubDate>
		<guid isPermaLink="false">http://enoriver.net/?p=1919#comment-62301</guid>
		<description><![CDATA[I used the latest version available from SSC:

. which hangroot
c:\ado\plus\h\hangroot.ado
*! version 1.5.0 MLB 12Jul2011

Your error message suggests you are using an older version of -hangroot-. Maybe -ssc install hangroot, replace- will work in getting the latest version.

I don&#039;t think that there is a problem with looking at the distribution of the dependent variable. I actually think it is good practice. The problem is the theoretical distribution with which it is compared. If you are using a regression-type model, then you do not expect the dependent variable to follow a single normal, Poisson, gamma, beta, ... distribution, but a mixture of these distributions. This mixture can look very different from the archetypical normal, Poisson, gamma, beta, ... distribution.

Too often we see on Statalist a question of the form &quot;my dependent variable isn&#039;t normally distributed, what should I do?&quot;. I think that the problem is that many (econometric) textbooks state that regression assumes a normally distributed dependent variable, but many students forget/ignore that the mean of that normal distribution isn&#039;t constant over individuals, and as a consequence that the distribution of the dependent variable may look nothing like the bell shape that we are used to. In the case of linear regression that is easily solved by looking at the residuals instead. No such easy solution exists for other models, including Poisson. There is a whole cottage industry on inventing normalizing transformations for residuals from GLMs. With -hangroot- and -margdistfit- I have taken the opposite approach, and compared the distribution of the dependent variable with the mixture distribution implied by the model.]]></description>
		<content:encoded><![CDATA[<p>I used the latest version available from SSC:</p>
<p>. which hangroot<br />
c:\ado\plus\h\hangroot.ado<br />
*! version 1.5.0 MLB 12Jul2011</p>
<p>Your error message suggests you are using an older version of -hangroot-. Maybe -ssc install hangroot, replace- will work in getting the latest version.</p>
<p>I don't think that there is a problem with looking at the distribution of the dependent variable. I actually think it is good practice. The problem is the theoretical distribution with which it is compared. If you are using a regression-type model, then you do not expect the dependent variable to follow a single normal, Poisson, gamma, beta, ... distribution, but a mixture of these distributions. This mixture can look very different from the archetypical normal, Poisson, gamma, beta, ... distribution.</p>
<p>Too often we see on Statalist a question of the form "my dependent variable isn't normally distributed, what should I do?". I think that the problem is that many (econometric) textbooks state that regression assumes a normally distributed dependent variable, but many students forget/ignore that the mean of that normal distribution isn't constant over individuals, and as a consequence that the distribution of the dependent variable may look nothing like the bell shape that we are used to. In the case of linear regression that is easily solved by looking at the residuals instead. No such easy solution exists for other models, including Poisson. There is a whole cottage industry on inventing normalizing transformations for residuals from GLMs. With -hangroot- and -margdistfit- I have taken the opposite approach, and compared the distribution of the dependent variable with the mixture distribution implied by the model.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nick Cox</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/comment-page-1/#comment-62226</link>
		<dc:creator>Nick Cox</dc:creator>
		<pubDate>Tue, 31 Jan 2012 19:21:17 +0000</pubDate>
		<guid isPermaLink="false">http://enoriver.net/?p=1919#comment-62226</guid>
		<description><![CDATA[On the last point: Your example doesn&#039;t just turn on the difference between r(N) and `r(N)&#039;. It turns on the fact that you defined your macro by copying, not evaluating. Had you written 

local den = r(N) 

you would not have been bitten. But as you said 

local den r(N) 

then -den- was just the text &quot;r(N)&quot; and Stata paid no attention to what it meant. That bit later when you did evaluate it, but r(N) was then different. 

So, there are two points at stake: r(N) or `r(N)&#039; and copying or evaluating.]]></description>
		<content:encoded><![CDATA[<p>On the last point: Your example doesn't just turn on the difference between r(N) and `r(N)'. It turns on the fact that you defined your macro by copying, not evaluating. Had you written </p>
<p>local den = r(N) </p>
<p>you would not have been bitten. But as you said </p>
<p>local den r(N) </p>
<p>then -den- was just the text "r(N)" and Stata paid no attention to what it meant. That bit later when you did evaluate it, but r(N) was then different. </p>
<p>So, there are two points at stake: r(N) or `r(N)' and copying or evaluating.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gabi Huiber</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/comment-page-1/#comment-62200</link>
		<dc:creator>Gabi Huiber</dc:creator>
		<pubDate>Tue, 31 Jan 2012 15:06:09 +0000</pubDate>
		<guid isPermaLink="false">http://enoriver.net/?p=1919#comment-62200</guid>
		<description><![CDATA[Hi Maarten, thanks for stopping by. Is this code snippet brand new? My installed version of hangroot doesn&#039;t want to take the option sims() and it&#039;s also complaining that I need a varlist for this distribution. I did adoupdate, update to no avail.

That aside, I know I shouldn&#039;t expect my particular set of realizations of the dependent variable to look like a set of perfectly random draws from its underlying distribution. The way I understand it, this is because there&#039;s no reason to believe that my set of realizations of the covariates themselves are random draws from their own underlying distributions, nor does it matter if they are as far as the model&#039;s fit is concerned. Your code gives a perfect example: x is forced to take on only two values.

Still, show me somebody who hasn&#039;t drawn at least once a histogram of the dependent variable. Why do we bother with any such examination, other than to do routine data quality checks? It&#039;s a genuine question. I myself do it as a kind of idle play with data ahead of fitting anything to them. Do you ever try to fit your observed instances of the dependent variable to some known distribution? Why?

Anyway, to be truthful, &#039;how many zeroes in that Poisson&#039; was just an excuse to bring up the `r(N)&#039; vs. r(N) difference. That&#039;s how I ran across it: while poking around at a count outcome. But I would still like to give your code example a go.]]></description>
		<content:encoded><![CDATA[<p>Hi Maarten, thanks for stopping by. Is this code snippet brand new? My installed version of hangroot doesn't want to take the option sims() and it's also complaining that I need a varlist for this distribution. I did adoupdate, update to no avail.</p>
<p>That aside, I know I shouldn't expect my particular set of realizations of the dependent variable to look like a set of perfectly random draws from its underlying distribution. The way I understand it, this is because there's no reason to believe that my set of realizations of the covariates themselves are random draws from their own underlying distributions, nor does it matter if they are as far as the model's fit is concerned. Your code gives a perfect example: x is forced to take on only two values.</p>
<p>Still, show me somebody who hasn't drawn at least once a histogram of the dependent variable. Why do we bother with any such examination, other than to do routine data quality checks? It's a genuine question. I myself do it as a kind of idle play with data ahead of fitting anything to them. Do you ever try to fit your observed instances of the dependent variable to some known distribution? Why?</p>
<p>Anyway, to be truthful, 'how many zeroes in that Poisson' was just an excuse to bring up the `r(N)' vs. r(N) difference. That's how I ran across it: while poking around at a count outcome. But I would still like to give your code example a go.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Maarten Buis</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/comment-page-1/#comment-62167</link>
		<dc:creator>Maarten Buis</dc:creator>
		<pubDate>Tue, 31 Jan 2012 10:05:11 +0000</pubDate>
		<guid isPermaLink="false">http://enoriver.net/?p=1919#comment-62167</guid>
		<description><![CDATA[Checking whether the variance equals the mean, or whether the observed and expected zero counts match, assumes that the effects of all the explanatory variables equal zero. If that is not the case, then the dependent variable can easily be the result of a Poisson process and still fail these tests. Consider the example below. Y was created with a Poisson process, but still fails both tests. 

I like to use -hangroot- to check the observed distribution of the dependent variable against the marginal distribution assumed by the model. (I would, since I wrote it.) By adding simulated y variables assuming that the model is true, one can get an idea of how much deviation from the theoretical distribution can occur just by chance. Examples for several other count models can be found in this presentation I gave at last year&#039;s Nordic Stata Users&#039; meeting.

*------------------ begin example ---------------------
// create data with a Poisson process 
// and an independent/explanatory/x variable
set seed 12345
drop _all
set obs 10000
gen x = _n &lt;= 5000
gen y = rpoisson(1+5*x)

// Mean does not equal variance
sum y 
di r(Var)

// expected and observed zero count don&#039;t match
local f _col(60) %5.2fc as result
local zpois exp(-`r(mean)&#039;)
qui count if y == 0
local zobs `r(N)&#039;/_N
di as txt &quot;Share of y=0 in a Poisson process: &quot; `f&#039; `zpois&#039;
di as txt &quot;Share of y=0 observed: &quot; `f&#039; `zobs&#039; 

// estimate the model
poisson y x

// create random draws assuming the model is true
predict lambda, n
forvalues i = 1/20 {
	gen sim`i&#039; = rpoisson(lambda)
}

// check the distribution of y against the marginal 
// distribution assumed by the model
// (hangroot is user written, it first needs to be 
// installed by typing -ssc install hangroot-)
hangroot , sims(sim*) jitter(5) name(hanging, replace)
hangroot , sims(sim*) jitter(2) susp notheor name(susp, replace)
*------------------ end example ---------------------]]></description>
		<content:encoded><![CDATA[<p>Checking whether the variance equals the mean, or whether the observed and expected zero counts match, assumes that the effects of all the explanatory variables equal zero. If that is not the case, then the dependent variable can easily be the result of a Poisson process and still fail these tests. Consider the example below. Y was created with a Poisson process, but still fails both tests. </p>
<p>I like to use -hangroot- to check the observed distribution of the dependent variable against the marginal distribution assumed by the model. (I would, since I wrote it.) By adding simulated y variables assuming that the model is true, one can get an idea of how much deviation from the theoretical distribution can occur just by chance. Examples for several other count models can be found in this presentation I gave at last year's Nordic Stata Users' meeting.</p>
<p>*------------------ begin example ---------------------<br />
// create data with a Poisson process<br />
// and an independent/explanatory/x variable<br />
set seed 12345<br />
drop _all<br />
set obs 10000<br />
gen x = _n &lt;= 5000<br />
gen y = rpoisson(1+5*x)</p>
<p>// Mean does not equal variance<br />
sum y<br />
di r(Var)</p>
<p>// expected and observed zero count don&#039;t match<br />
local f _col(60) %5.2fc as result<br />
local zpois exp(-`r(mean)&#039;)<br />
qui count if y == 0<br />
local zobs `r(N)&#039;/_N<br />
di as txt &quot;Share of y=0 in a Poisson process: &quot; `f&#039; `zpois&#039;<br />
di as txt &quot;Share of y=0 observed: &quot; `f&#039; `zobs&#039; </p>
<p>// estimate the model<br />
poisson y x</p>
<p>// create random draws assuming the model is true<br />
predict lambda, n<br />
forvalues i = 1/20 {<br />
	gen sim`i&#039; = rpoisson(lambda)<br />
}</p>
<p>// check the distribution of y against the marginal<br />
// distribution assumed by the model<br />
// (hangroot is user written, it first needs to be<br />
// installed by typing -ssc install hangroot-)<br />
hangroot , sims(sim*) jitter(5) name(hanging, replace)<br />
hangroot , sims(sim*) jitter(2) susp notheor name(susp, replace)<br />
*------------------ end example ---------------------</p>
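<p>The two failed checks in the example above can also be worked out in closed form, without simulation. The following is a minimal Python sketch and not part of the original Stata example; it assumes the same setup (half the observations with rate 1, half with rate 6), and the variable names are illustrative only.</p>

```python
import math

# Assumed setup, mirroring the simulated data above: half of the
# observations have rate 1 (x = 0) and half have rate 6 (x = 1).
rates = [1.0, 6.0]
weights = [0.5, 0.5]

# Marginal mean of y: the weighted average of the two rates.
mean_y = sum(w * lam for w, lam in zip(weights, rates))

# Law of total variance: the average within-group variance (equal to the
# rate for a Poisson) plus the variance of the group means.
var_y = mean_y + sum(w * (lam - mean_y) ** 2 for w, lam in zip(weights, rates))

# Share of zeros that a single Poisson with the marginal mean would predict.
zero_single = math.exp(-mean_y)

# Share of zeros implied by the two-group mixture.
zero_mixture = sum(w * math.exp(-lam) for w, lam in zip(weights, rates))

print(f"mean = {mean_y:.2f}, variance = {var_y:.2f}")
print(f"P(y=0), single Poisson: {zero_single:.4f}")
print(f"P(y=0), mixture:        {zero_mixture:.4f}")
```

<p>Under these assumptions the marginal variance exceeds the marginal mean and the mixture implies far more zeros than a single Poisson would, even though y is Poisson within each value of x.</p>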
]]></content:encoded>
	</item>
	<item>
		<title>By: Nick Cox</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/comment-page-1/#comment-62071</link>
		<dc:creator>Nick Cox</dc:creator>
		<pubDate>Mon, 30 Jan 2012 15:10:22 +0000</pubDate>
		<guid isPermaLink="false">http://enoriver.net/?p=1919#comment-62071</guid>
		<description><![CDATA[The -matcell()- trick depends on zero being the lowest observed value. 

c(N) is something you can forget about without loss. Learning about _n and _N (both for entire datasets and under -by:-) helps in many, many problems. 

In reply to your middle sentence on `this&#039; and underscores: 

Most explanations don&#039;t emphasise it because it does not help much, and because it might confuse, but locals are in a limited sense temporary globals. Compare 

. local Durham &quot;original Durham is in UK, not in NH, SC or Canada&quot;

. mac li
 
_Durham:        original Durham is in UK, not in NH, SC or Canada

. di &quot;$_Durham&quot;
original Durham is in UK, not in NH, SC or Canada

So, the local is a global! 

There are faint traces of this here and there, but it&#039;s best to regard locals and globals as utterly distinct. 

However, _N is really sui generis.]]></description>
		<content:encoded><![CDATA[<p>The -matcell()- trick depends on zero being the lowest observed value. </p>
<p>c(N) is something you can forget about without loss. Learning about _n and _N (both for entire datasets and under -by:-) helps in many, many problems. </p>
<p>In reply to your middle sentence on `this' and underscores: </p>
<p>Most explanations don't emphasise it because it does not help much, and because it might confuse, but locals are in a limited sense temporary globals. Compare </p>
<p>. local Durham "original Durham is in UK, not in NH, SC or Canada"</p>
<p>. mac li</p>
<p>_Durham:        original Durham is in UK, not in NH, SC or Canada</p>
<p>. di "$_Durham"<br />
original Durham is in UK, not in NH, SC or Canada</p>
<p>So, the local is a global! </p>
<p>There are faint traces of this here and there, but it's best to regard locals and globals as utterly distinct. </p>
<p>However, _N is really sui generis.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gabi Huiber</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/comment-page-1/#comment-62064</link>
		<dc:creator>Gabi Huiber</dc:creator>
		<pubDate>Mon, 30 Jan 2012 14:21:35 +0000</pubDate>
		<guid isPermaLink="false">http://enoriver.net/?p=1919#comment-62064</guid>
		<description><![CDATA[I like the spareness of the matcell() way you propose. As to c(N) or _N, I never remember to use those things. I vaguely recall having seen somewhere that there&#039;s some correspondence between the locals that we adorn with apostrophes like `this&#039;, and that preceding _underscore, but I&#039;m not sure of the specifics. And you&#039;re right, normally one shouldn&#039;t conflate zeroes with missings. I&#039;m taking some liberties here because I can. I should have qualified this properly.]]></description>
		<content:encoded><![CDATA[<p>I like the spareness of the matcell() way you propose. As to c(N) or _N, I never remember to use those things. I vaguely recall having seen somewhere that there's some correspondence between the locals that we adorn with apostrophes like `this', and that preceding _underscore, but I'm not sure of the specifics. And you're right, normally one shouldn't conflate zeroes with missings. I'm taking some liberties here because I can. I should have qualified this properly.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nick Cox</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/comment-page-1/#comment-62006</link>
		<dc:creator>Nick Cox</dc:creator>
		<pubDate>Mon, 30 Jan 2012 02:10:15 +0000</pubDate>
		<guid isPermaLink="false">http://enoriver.net/?p=1919#comment-62006</guid>
		<description><![CDATA[Gabi: 

You don&#039;t explain what -UseData- is, but I will take your &quot;never mind&quot; on trust. 

That aside: 

Your first -count- in your program counts observations, regardless. You could get that directly with _N or c(N). 

Your second -count- counts missings or zeros (as you just recoded missings to zeros). 

So, your fraction looks like (missings or zeros) / (missings or non-missings). 

Don&#039;t you want zeros / non-missings? 

The denominator for your calculation should be the count of non-missing values of your variable.

You can get that directly with 

count if y &lt; . 
local den = r(N) 
count if y == 0 
di r(N) / `den&#039; 

Another way: 

tab y, matcell(freq) 
di freq[1,1] / r(N) 

Nick]]></description>
		<content:encoded><![CDATA[<p>Gabi: </p>
<p>You don't explain what -UseData- is, but I will take your "never mind" on trust. </p>
<p>That aside: </p>
<p>Your first -count- in your program counts observations, regardless. You could get that directly with _N or c(N). </p>
<p>Your second -count- counts missings or zeros (as you just recoded missings to zeros). </p>
<p>So, your fraction looks like (missings or zeros) / (missings or non-missings). </p>
<p>Don't you want zeros / non-missings? </p>
<p>The denominator for your calculation should be the count of non-missing values of your variable.</p>
<p>You can get that directly with </p>
<p>count if y &lt; .<br />
local den = r(N)<br />
count if y == 0<br />
di r(N) / `den&#039; </p>
<p>Another way: </p>
<p>tab y, matcell(freq)<br />
di freq[1,1] / r(N) </p>
<p>Nick</p>
]]></content:encoded>
	</item>
</channel>
</rss>
