<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Stata Things</title>
	<atom:link href="http://enoriver.net/index.php/feed/" rel="self" type="application/rss+xml" />
	<link>http://enoriver.net</link>
	<description>computing for fun and profit</description>
	<lastBuildDate>Mon, 07 May 2012 13:43:02 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Stata for stocks</title>
		<link>http://enoriver.net/index.php/2012/03/20/stata-for-stocks/</link>
		<comments>http://enoriver.net/index.php/2012/03/20/stata-for-stocks/#comments</comments>
		<pubDate>Tue, 20 Mar 2012 17:00:21 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=2137</guid>
		<description><![CDATA[The people at StataCorp are on Facebook, and the other day they linked to this blog post by Paul Clist about checking on a stock you might own through clever use of the stockquote Stata command. Last year I bought some Netflix stock when it fell to $77 after the Qwikster fail. I agreed with [...]]]></description>
			<content:encoded><![CDATA[<p>The people at StataCorp are on Facebook, and the other day they linked to <a href="http://aidwriting.wordpress.com/2012/03/16/checking-a-simple-stock-portfolio-with-stata-easily/">this</a> blog post by Paul Clist about checking on a stock you might own through clever use of the <a href="http://ideas.repec.org/c/boc/bocode/s456990.html"><code>stockquote</code></a> Stata command.</p>
<p>Last year I bought some Netflix stock when it fell to $77 after the Qwikster fail. I agreed with the general public that Qwikster was a stupid idea, but I still thought that the hit their stock took was a bit of an overreaction. The streaming business was still good. Maybe not $250 per share good, once the content suppliers caught wise and raised their prices, but my family was still happy with it as a TV substitute. That's about the full extent of thought I'm ever going to put into picking any stock, so don't be too surprised that I don't make big bets. This one was just shy of $500 -- whatever round number of shares plus the broker's commission fit there.</p>
<p>Still, it was just never clear to me how good this choice of spending $500 was relative to the Nasdaq Composite. Of course, I could have looked it up on <a href="http://bigcharts.marketwatch.com/">bigcharts.com</a>, but why not have a picture with the real dollars I have at stake on the y axis, and my true time line on the x axis? It wasn't too hard to expand Paul's code to a set of programs that can take any of the four stocks in my toy portfolio and put it against some appropriate stock market index, to show how it's been doing in one quick <code>tsline</code> graph. Here's Netflix as of last Friday:</p>
<p><a href="http://enoriver.net/blog/wp-content/uploads/2012/03/NFLX.png"><img src="http://enoriver.net/blog/wp-content/uploads/2012/03/NFLX-300x218.png" alt="" title="NFLX" width="300" height="218" class="alignnone size-medium wp-image-2141" /></a></p>
<p>My code takes the starting dollar amount from my brokerage statement, and it augments both the holdings and the index baseline with any subsequent purchases or sales (there aren't any in this case). This way I can simply <code>collapse (sum)</code> both the "index" (transformed into a price-per-unit times the units held of each stock following Paul's formula) and the valuation of each stock into daily totals, and plot the performance of the whole portfolio relative to my stock index of choice, with actual dollars on the y axis. </p>
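<p>For the curious, here is a minimal sketch of that last step, not my actual programs. The data are made up on the spot, and the names <code>date</code>, <code>ticker</code>, <code>value</code> and <code>baseline</code> are placeholders: <code>value</code> is what a holding is worth on a given day, and <code>baseline</code> is what the same dollars would be worth had they tracked the index.</p>
<pre><code>
// toy portfolio: two holdings over ten days, with made-up dollar values
clear
set seed 2012
set obs 10
gen date = td(01mar2012) + _n - 1
format date %td
expand 2
bysort date: gen str4 ticker = cond(_n==1, "NFLX", "XYZ")
gen value    = 500 + 20*runiform()   // value of the holding
gen baseline = 500 + 10*runiform()   // same dollars, had they tracked the index

// one row per day: total portfolio and total index baseline
collapse (sum) value baseline, by(date)

// both series on one graph, actual dollars on the y axis
tsset date
tsline value baseline, ytitle("Dollars")
</code></pre>
<p>Collapsing by date is what lets one graph serve any number of holdings: each stock contributes its own row per day, and the sums take care of the rest.</p>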
<p>I like it. It works fine for me. I already use Stata to do the household's budget, I used it to compare the true cost of ownership of water heaters when I was in the market for one, and I used it to track wet and dirty diapers when my kid was a few days old. So, thank you, Paul, for helping me find yet another civilian use for this fine piece of software.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/03/20/stata-for-stocks/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Turn a date into Stata format quickly</title>
		<link>http://enoriver.net/index.php/2012/03/18/turn-a-date-into-stata-format-quickly/</link>
		<comments>http://enoriver.net/index.php/2012/03/18/turn-a-date-into-stata-format-quickly/#comments</comments>
		<pubDate>Sun, 18 Mar 2012 15:02:19 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=2124</guid>
		<description><![CDATA[There's a little program that's shown up more than once now in my housekeeping do-files, so it may be useful enough for a blog post, but it doesn't quite warrant a spot in c(sysdir_personal) as a stand-alone ado-file. Here: // turn this date to Stata format // if it's not that way already capture prog [...]]]></description>
			<content:encoded><![CDATA[<p>There's a little program that's shown up more than once now in my housekeeping do-files,  so it may be useful enough for a blog post, but it doesn't quite warrant a spot in <code>c(sysdir_personal)</code> as a stand-alone ado-file. Here:</p>
<pre><code>
// turn this date to Stata format
// if it's not that way already
capture prog drop setStataDate
program setStataDate

args v fmt // fmt can be MDY or YMD

tempname x
capture confirm string variable `v'
if _rc==0 {
   local l`v' : variable label `v'
   gen `x'=date(`v',"`fmt'")
   format `x' %td
   drop `v'
   rename `x' `v'
   label variable `v' "`l`v''"
   // order `v'
}

end
</code></pre>
<p>I use it with data sets derived from merging other data sets. It's useful if in the original data sets there are string dates in mixed formats -- maybe YYYY-MM-DD in the "master", and MM/DD/YYYY in the "using" -- or if these string dates have labels I want to keep. So, you see why it's not clear that this is worth an ado-file. I don't want to type all the code between the curlies more than once, but usually I don't have to. </p>
<p>I do want to be able to call this program by name, as in <code>setStataDate somedate MDY</code> from within another program, then forget about it, safe in the knowledge that it won't make any difference if <em>somedate</em> is already in Stata format. That's the job of the if-condition you see there, and this is all this little program does.</p>
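<p>A quick check of that claim, on a two-row toy data set. The dates and the label are made up, and the program above must already be defined in the session:</p>
<pre><code>
clear
input str10 somedate
"03/18/2012"
"03/20/2012"
end
label variable somedate "Date of visit"

setStataDate somedate MDY   // string in, numeric %td out, label kept
setStataDate somedate MDY   // already numeric, so nothing happens
describe somedate
list
</code></pre>
<p>The second call falls through because <code>confirm string variable</code> now fails, so the code between the curlies is skipped.</p>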
<p>As to the reason for using the temporary name x, see the comment thread. Temporary names for variables generated only to be renamed are the safe option. The alternative is to use a one-letter convenience name, like x, and I did that first. But what if you want to use this program with a data set that includes a variable named x already?</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/03/18/turn-a-date-into-stata-format-quickly/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Human rights stats, one last thing</title>
		<link>http://enoriver.net/index.php/2012/03/05/human-right-stats-one-last-thing/</link>
		<comments>http://enoriver.net/index.php/2012/03/05/human-right-stats-one-last-thing/#comments</comments>
		<pubDate>Mon, 05 Mar 2012 15:55:24 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=2084</guid>
		<description><![CDATA[The R code in my previous post could also produce the picture below. The implication is this: A small sample is still bad news. It is biased toward underestimating the population. There's nothing you can do about that. The larger the sample, the better. How large a sample do you need? You might get lucky [...]]]></description>
			<content:encoded><![CDATA[<p>The R code in my <a href="http://enoriver.net/index.php/2012/03/04/human-rights-stats-part-2/" title="Human rights stats, part 2">previous post</a> could also produce the picture below. The implication is this:</p>
<p>A small sample is still bad news. It is biased toward underestimating the population. There's nothing you can do about that. The larger the sample, the better. How large a sample do you need? You might get lucky with as little as 1,000, for the reason that I mentioned in my first installment on the topic: small samples only need a small overlap to guess pretty well. That's why the green curve now peaks at the true population mark. But you'd have to be lucky, as my previous picture demonstrates by counter-example. And even a correct guess will be surrounded by a lot of uncertainty if you have a small sample: the green curve is still the flattest of the three that guess correctly. Finally, the gain from increasing the catch limit from 10,000 to 20,000 is not trivial after all: the purple curve is quite a bit peakier than the blue one.</p>
<p><a href="http://enoriver.net/blog/wp-content/uploads/2012/03/mse_sim2.png"><img src="http://enoriver.net/blog/wp-content/uploads/2012/03/mse_sim2.png" alt="" title="mse_sim2" width="550" height="431" class="alignnone size-full wp-image-2085" /></a></p>
<p>What this simulation shows is that MSE relies on having representative samples of the true population. There's no way out of that requirement. You also want to run your code more than once. Though you will be able to easily dismiss sample sizes that are clearly too small, there may be a range of sample sizes that can provide false comfort. I could have easily seen this picture first and concluded that 1,000 isn't great, but it still hits the mark, so maybe it's good enough. That would have been wrong. On the other hand, now I'm pretty sure that 10,000 is still alright, though not as good as it looked before.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/03/05/human-right-stats-one-last-thing/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Human rights stats, part 2</title>
		<link>http://enoriver.net/index.php/2012/03/04/human-rights-stats-part-2/</link>
		<comments>http://enoriver.net/index.php/2012/03/04/human-rights-stats-part-2/#comments</comments>
		<pubDate>Sun, 04 Mar 2012 15:44:26 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=2039</guid>
		<description><![CDATA[My previous post promised some simulations. To refresh your memory, I am trying to see how reliably Multiple Systems Estimation, as described here, can guess the true number of fish in a pond. The density plot below tells the story. The true number of fish is 150,000. Each catch limit lets you make a best [...]]]></description>
			<content:encoded><![CDATA[<p>My <a title="Human rights stats, part 1" href="http://enoriver.net/index.php/2012/03/03/human-rights-stats-part-1/">previous post</a> promised some simulations. To refresh your memory, I am trying to see how reliably Multiple Systems Estimation, as described <a href="http://www.foreignpolicy.com/articles/2012/02/27/the_body_counter?page=full">here</a>, can guess the true number of fish in a pond.</p>
<p>The density plot below tells the story. The true number of fish is 150,000. Each catch limit lets you make a best guess, which is the x-coordinate of the peak of its associated bell curve. The shape of each bell curve measures the uncertainty surrounding the guess: the flatter the bell, the more uncertain the guess. Perfect foresight would be a spike at the 150,000 x-mark. Curves that peak away from that mark make biased guesses.</p>
<p>It is obvious that larger daily catch limits allow you to guess better. Very low catch limits set you up for severe downward bias. The red curve, corresponding to a daily catch of up to 500 fish, cannot help underestimating the true population size, for reasons discussed in the previous post. Then there seems to be a range of catch limits that improve on the bias, but increase the uncertainty horribly -- that's the green curve. So, you need higher limits, but you don't have to go crazy. The gain in precision after some point is not worth it: though a limit of 20,000 is very accurate, a limit of half that is not much worse.</p>
<p><a href="http://enoriver.net/blog/wp-content/uploads/2012/03/mse_sim.png"><img class="alignnone size-full wp-image-2050" title="Simulated Multiple Systems Estimation" src="http://enoriver.net/blog/wp-content/uploads/2012/03/mse_sim.png" alt="" width="550" height="431" /></a></p>
<p>I was inspired to run this exercise by <a href="http://j.mp/G2001">a class I'm taking</a>. The work would have been slower without help from <a href="http://www.statmethods.net/">here</a> (I recommend the <a href="http://www.manning.com/kabacoff/">book</a>, I bought it) and <a href="http://www.stackoverflow.com">here</a>. The picture would not have looked this good without <a href="http://had.co.nz/ggplot2/">ggplot2</a> (you should get <a href="http://tinyurl.com/ggplot2-book">that book</a> too). All errors are my own. Here's the code:</p>
<pre><code>
# http://www.foreignpolicy.com/articles/2012/02/27/the_body_counter?page=full
# simulate MSE = Multiple Systems Estimation

# SOME HOUSEKEEPING FIRST

# pretty picture comes from here
library("ggplot2")

# true population size
population <- 150000 

# number of guesses you can take to estimate
# maximum-likelihood mean, standard deviation
# of your final guess
n <- 10

# this many draws will help you simulate
# the uncertainty of your guess
sims <- 10^4

# define a function that will render big numbers with comma
# separators for thousands. the original version is here:
# https://stat.ethz.ch/pipermail/r-help/2010-November/259488.html
commaUS <- function(x) {
   sprintf("%s", formatC(x, format="fg", big.mark = ","))
}

# THE PIECES OF THE ACTUAL SIMULATION

# make one guess, with daily catches of up to x:
# -- if x=100, the daily catch is between 0 and 100
# -- if x=1000, the daily catch is between 0 and 1000.
mymse <- function(x) {
   day1 <- round(runif(1)*x)       # count of fish caught on day 1
   day2 <- round(runif(1)*x)       # count of fish caught on day 2
   pond <- c(1:population)
   day1 <- sample(pond,day1)               # list of fish caught on day 1
   day2 <- sample(pond,day2)               # list of fish caught on day 2
   overlap <- length(intersect(day1,day2)) # count of fish caught twice
   guess <- length(day1)*length(day2)
   if(overlap>0) {
      guess <- guess/overlap
   }
   return(round(guess))
}

# assume that the fish population guesses for a given pond
# are distributed N(mu,sigma). this is the log likelihood
# function of a given set of y guesses:
myll <- function(par, y) {
   mu <- par[1]
   sigma <- par[2]
   l <- length(y)*log(sigma^2)
   a <- 1/sigma^2
   ll <- sum((y-mu)^2)
   return(-(l+a*ll)/2)
}

# collect n guesses of catches limited to
# a given maxcatch, then estimate mu, sigma
# and make sims draws from N(mu, sigma)
mysim <- function(n,maxcatch) {
   # 1. make n guesses of fish in the pond
   a <- apply(as.matrix(1:n),1,function(x) mymse(maxcatch))
   # 2. get the maximum likelihood estimates of mu, sigma
   # that might have produced the n guesses:
   opt <- optim(par=c(1,1), fn = myll, control = list(fnscale = -1),
          y=a, method = "BFGS", hessian = TRUE)
   # 3. model the uncertainty around these estimates
   return(rnorm(sims,mean=opt$par[1],sd=opt$par[2]))
}   

# THE FINAL PICTURE

# now make some comparisons of the effect of
# maxcatch with a given number of guesses n
# on the population estimate and its precision
maxes  <- c(500,1000,10000,20000)
simspic <- matrix(0,sims,length(maxes))
for(i in 1:length(maxes)) {
   simspic[,i] <- mysim(n,maxes[i])
}

# Set up a stacked 2-column matrix where the second
# column will be used for grouping these guesses
x.1 <- cbind(simspic[,1],maxes[1])
x.2 <- cbind(simspic[,2],maxes[2])
x.3 <- cbind(simspic[,3],maxes[3])
x.4 <- cbind(simspic[,4],maxes[4])

# now plot the guesses with the
# catch limits set in maxes
d <- as.data.frame(rbind(x.1,x.2,x.3,x.4))
p <- qplot(V1, colour=factor(V2), data=d, geom="density", xlab="Fish in the pond")
t <- paste("Precision of the population estimate with",n,"guesses",sep=" ")
p <- p + scale_colour_discrete(name = "Catch limit") + ggtitle(t)
p
</code></pre>
<p>If you made it here, you might also want to see <a href="http://enoriver.net/index.php/2012/03/05/human-right-stats-one-last-thing/">one last thing</a> on the matter.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/03/04/human-rights-stats-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Human rights stats, part 1</title>
		<link>http://enoriver.net/index.php/2012/03/03/human-rights-stats-part-1/</link>
		<comments>http://enoriver.net/index.php/2012/03/03/human-rights-stats-part-1/#comments</comments>
		<pubDate>Sat, 03 Mar 2012 15:39:39 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1978</guid>
		<description><![CDATA[I follow @simplystats on Twitter, and on March 1 they had a post that linked to an article in Foreign Policy about a guy who has the coolest job in applied stats. He works here. The original piece described a quick algorithm that you can use to estimate the number of human rights violations using [...]]]></description>
			<content:encoded><![CDATA[<p>I follow <a href="http://simplystatistics.tumblr.com/" title="Simply Statistics">@simplystats</a> on Twitter, and on March 1 they had a post that linked to an article in Foreign Policy about a guy who has <a href="http://www.foreignpolicy.com/articles/2012/02/27/the_body_counter?page=full">the coolest job in applied stats</a>. He works <a href="http://www.benetech.org/">here</a>.</p>
<p>The original piece described a quick algorithm that you can use to estimate the number of human rights violations using a technique first devised for counting fish in a pond. The gist of it is this: catch and release fish over two days. Tag the fish caught on the first day. Count each day's catch and the number of fish caught twice. That is the overlap. To estimate the number of fish in the pond, multiply the two days' catches and divide the total by the overlap.</p>
<p>I had a data set of insurance claims in Stata's memory at the time of my reading, with observations uniquely identified by a variable named claim_id. </p>
<p>I decided to use it as the model of a pond with as many fish in it as observations in my data set, so I wrote a little fishing program. It takes one argument: some round upper bound of the number of fish I might catch in a day. I'll call it <em>n</em>. It can be 100, or it can be 1,000. Here:</p>
<pre><code>
// try MSE
capture prog drop guessObservations
program guessObservations

args n // upper bound of a day's catch.

qui {
   local day1fishcount=int(runiform()*`n')
   local day2fishcount=int(runiform()*`n')

   forvalues i=1/2 {
      preserve
      tempfile day`i'fishlist
      sample `day`i'fishcount', count
      keep claim_id
      save "`day`i'fishlist'", replace
      restore
   }

   preserve
   drop _all
   use "`day1fishlist'"
   merge 1:1 claim_id using "`day2fishlist'"
   count if _merge==3
   local overlap=r(N)
   restore

   local totalfish=`day1fishcount'*`day2fishcount'
   if `overlap'>0 {
      local totalfish=`totalfish'/`overlap'
   }
   count
   local truect=r(N)
}

local fmt _col(30) %10.0fc
di ""
di "Fish caught on day 1:" `fmt' `day1fishcount'
di "Fish caught on day 2:" `fmt' `day2fishcount'
di "Overlap:"              `fmt' `overlap'
di "Estimate:"             `fmt' `totalfish'
di "True count:"           `fmt' `truect'

end
</code></pre>
<p>My data set has some 150,000 observations. Choosing a small <em>n</em>, say <code>guessObservations 100</code>, sets me up for an overlap of zero, and even then the two catches multiplied together won't come close to the true size of the population. This is a technique for counting hungry fish in a small pond, not in an ocean. The size of the daily catch should be representative of the total, so you can have some decent overlap.</p>
<p>Setting <em>n</em>=1,000 keeps it small enough relative to the total population that it's still possible to have zero overlap, but <em>n</em> is now large enough to overshoot wildly in that case. If I catch 900 fish each day with zero overlap, I will guess that there are 810,000 fish there. However, an overlap as small as 5 will get me pretty close to the true population.</p>
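<p>The arithmetic behind those two numbers is easy to check at the Stata prompt. This only restates the estimate, catch times catch divided by overlap, with the 900-fish example from the paragraph above:</p>
<pre><code>
display %10.0fc 900*900      // zero overlap, so no division: 810,000
display %10.0fc 900*900/5    // overlap of 5: 162,000
</code></pre>
<p>The second figure is the one to compare with a pond of some 150,000 fish.</p>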
<p>Setting <em>n</em>=10,000 performs much better. I may still have a day when the fish won't bite, and get this:</p>
<pre><code>
. guessObservations 10000

Fish caught on day 1:                49
Fish caught on day 2:             4,182
Overlap:                              3
Estimate:                        68,306
True count:                     157,638
</code></pre>
<p>But with any luck, I will probably get this:</p>
<pre><code>
. guessObservations 10000

Fish caught on day 1:             9,662
Fish caught on day 2:             3,220
Overlap:                            220
Estimate:                       141,417
True count:                     157,638
</code></pre>
<p>The larger <em>n</em>, the larger the overlap, and the better the precision. That makes sense: in the limit, the true number times itself divided by itself will yield the true number. </p>
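<p>If you don't have a claims file lying around, a made-up pond will do. This is only a sketch: it assumes that <code>guessObservations</code> is already defined as above, and the pond size, the seed and the three catch limits are arbitrary choices of mine.</p>
<pre><code>
// a toy pond of 150,000 fish; claim_id is the identifier the program expects
clear
set seed 2012
set obs 150000
gen long claim_id = _n

// same pond, increasingly generous catch limits
foreach n in 1000 10000 50000 {
   guessObservations `n'
}
</code></pre>
<p>Each call prints its own little table, so the estimates can be compared side by side with the true count.</p>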
<p>But does <em>n</em> have to be very large relative to the size of the population? And does my guess -- or the uncertainty surrounding it -- depend on what probability distribution function I assume for the daily catch? Next time I'll be doing some simulations.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/03/03/human-rights-stats-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Stata 12 with MacVim, updated</title>
		<link>http://enoriver.net/index.php/2012/02/27/stata-12-with-macvim-updated/</link>
		<comments>http://enoriver.net/index.php/2012/02/27/stata-12-with-macvim-updated/#comments</comments>
		<pubDate>Mon, 27 Feb 2012 16:35:00 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[Vim]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1971</guid>
		<description><![CDATA[A while back I showed how to get Stata 12 to work with MacVim. This is to let you know about a bug fix. I posted the details on the Statalist just now. If you're reading this blog and you're not also a Statalist subscriber, you may want to change that.]]></description>
			<content:encoded><![CDATA[<p>A while back I <a href="http://enoriver.net/index.php/2011/09/14/stata-12-with-macvim/" title="Stata 12 with MacVim">showed</a> how to get <a href="http://www.stata.com">Stata 12</a> to work with <a href="http://code.google.com/p/macvim/">MacVim</a>. This is to let you know about a bug fix. I posted the details on the <a href="http://www.stata.com/statalist/">Statalist</a> just now. If you're reading this blog and you're not also a Statalist subscriber, you may want to change that.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/02/27/stata-12-with-macvim-updated/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fighting the R graphics</title>
		<link>http://enoriver.net/index.php/2012/02/13/fighting-the-r-graphics/</link>
		<comments>http://enoriver.net/index.php/2012/02/13/fighting-the-r-graphics/#comments</comments>
		<pubDate>Tue, 14 Feb 2012 01:29:01 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[plot()]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1950</guid>
		<description><![CDATA[If you've ever seen the Error in plot.new(): figure margins too large message before, this is the best overview of the problem that I could find anywhere. There can be a lot of knobs to turn when it comes to graphics, no matter what statistical programming environment you use. In R, typing par() at the [...]]]></description>
			<content:encoded><![CDATA[<p>If you've ever seen the <code>Error in plot.new(): figure margins too large</code> message before, <a href="http://research.stowers-institute.org/efg/R/Graphics/Basics/mar-oma/index.htm">this</a> is the best overview of the problem that I could find anywhere. </p>
<p>There can be a lot of knobs to turn when it comes to graphics, no matter what statistical programming environment you use. In R, typing <code>par()</code> at the prompt will list them all.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/02/13/fighting-the-r-graphics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A quick tip for using Stata in interactive mode</title>
		<link>http://enoriver.net/index.php/2012/02/08/a-quick-tip-for-using-stata-in-interactive-mode/</link>
		<comments>http://enoriver.net/index.php/2012/02/08/a-quick-tip-for-using-stata-in-interactive-mode/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 18:09:58 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[cmdlog]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1939</guid>
		<description><![CDATA[You don't always want to start a do-file in the editor for every small thing, though I usually do, and then trash it if I don't need it. So, my default stance is that I want to preserve work for later. Yours may be the opposite. If so, one option is to type in the [...]]]></description>
			<content:encoded><![CDATA[<p>You don't always want to start a do-file in the editor for every small thing, though I usually do, and then trash it if I don't need it. So, my default stance is that I want to preserve work for later. </p>
<p>Yours may be the opposite. If so, one option is to type in the Command window. If you decide that you do want that work preserved for later after all, you can always save the content of the Review window as a .do file.</p>
<p>Another option is to have this in your <a href="http://www.stata.com/support/faqs/lang/profiledo.html">profile.do</a> file:</p>
<pre><code>
// log today's interactive commands
cmdlog using "~/data/cmdlogs/cmdlog `c(current_date)'.smcl", append
</code></pre>
<p>This saves a running log with everything you typed at the command line on a given day, in the folder data/cmdlogs. This will save the commands, but not the output (that's the difference between calling <code>cmdlog</code> as opposed to <code>log</code>).</p>
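<p>If the command log gets in the way during a session, these are the standard ways to check on it or switch it off. Nothing here is specific to my setup:</p>
<pre><code>
cmdlog          // report whether a command log is open, and where it goes
cmdlog off      // pause logging without closing the file
cmdlog on       // resume
cmdlog close    // close the file for the rest of the session
</code></pre>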
<p>More on this topic <a href="http://www.michaelnormanmitchell.com/stow/always-starting-a-log.html#comment7811938">here</a>. That may well be where I got the idea to put this in my own profile.do, but if anybody thinks otherwise, I'll be glad to amend this post with the correct credit.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/02/08/a-quick-tip-for-using-stata-in-interactive-mode/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How many zeroes in that Poisson?</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/</link>
		<comments>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/#comments</comments>
		<pubDate>Fri, 27 Jan 2012 04:11:06 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[count]]></category>
		<category><![CDATA[Poisson]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1919</guid>
		<description><![CDATA[I have a data set, and some of the variables there are counts of a given event. For count outcomes, the easiest thing to do is a Poisson regression, but before you do that, it's worth asking if what you see there really is close enough to a Poisson process. You could check whether the [...]]]></description>
			<content:encoded><![CDATA[<p>I have a data set, and some of the variables there are counts of a given event.</p>
<p>For count outcomes, the easiest thing to do is a Poisson regression, but before you do that, it's worth asking if what you see there really is close enough to a Poisson process. </p>
<p>You could check whether the variance is more or less equal to the mean, but with real-life data you can bet that there will be a difference between the two, and you'll be left scratching your head as to whether it's too big for Poisson, or just about right to pass the smell test.</p>
<p>Another thing you can do is check whether the count variable shows the right number of zeroes. In a Poisson distribution, the marginal probability of a zero outcome is exp(-mean). If the proportion of zeroes that you see is a lot higher than this value, and it usually is when you're looking at counts of rare events, then you will have to consider a zero-inflated Poisson or a finite-mixture model, as discussed with wonderful clarity in chapter 17 of <a href="http://www.stata-press.com/books/musr.html" title="Microeconometrics using Stata" target="_blank">Microeconometrics using Stata</a> by Cameron and Trivedi.</p>
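<p>To see what the comparison looks like when the data really are Poisson, here is a sketch on simulated counts. The seed, the sample size and the mean of 0.5 are arbitrary, and <code>y</code> is just a placeholder name:</p>
<pre><code>
clear
set seed 12345
set obs 10000
gen y = rpoisson(0.5)    // simulated counts with a true mean of 0.5

quietly summarize y, meanonly
display "Share of zeroes expected under Poisson: " %5.3f exp(-r(mean))

quietly count if y==0
display "Share of zeroes observed:               " %5.3f r(N)/_N
</code></pre>
<p>With honest Poisson data the two shares should land close to each other, near exp(-0.5), or about 0.61. Real counts of rare events tend to show a much bigger observed share.</p>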
<p>That brings me to the immediate cause for this post. I thought I'd code up a quick program to check those zeroes for a given data set and count variable, and I did this:</p>
<pre><code>
capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den r(N)
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num r(N)
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end
</code></pre>
<p>Then I ran the thing, and it kept turning up a share of 1.0, that is 100% zeroes observed, no matter the data set or the variable of interest y. You know why? Because <code>local den r(N)</code> stores the text r(N), not the number, and that text is only evaluated when <code>`den'</code> is used, with whatever the last command to return an r(N) left behind. That command is <code>sum `y'</code>. The same thing happens to <code>`num'</code>. So I took the ratio of a number to itself. The returned values from the calls to <code>count</code> that I had made right before defining both <code>local num</code> and <code>local den</code> were quietly obliterated. Isn't that a sneaky bug? The correct code is below:</p>
<pre><code>
capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den `r(N)'
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num `r(N)'
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end
</code></pre>
<p>Now <code>`den'</code> stores the actual value returned by the <code>count</code> that ran just before <code>local den `r(N)'</code>, as intended. Mind your apostrophes, is all I'm saying.</p>
<p>Erratum: actually, that's not all that I should have said. As Nick Cox observes in the comments below, you should mind your equal signs too.</p>
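<p>To see what the equal sign changes, here is a small sketch on <code>auto.dta</code>, which ships with Stata. The example is mine, so don't blame it on the comment thread:</p>
<pre><code>
sysuse auto, clear
quietly count
local a r(N)       // copies the text r(N); nothing is evaluated yet
local b `r(N)'     // copies the current value of r(N) as text
local c = r(N)     // evaluates r(N) right now and stores the result
quietly count if foreign
display `a'        // r(N) is only looked up here, after the second count
display `b'        // still the number of cars in the data set
display `c'        // same
</code></pre>
<p>There are 74 cars in that data set, 22 of them foreign, so the three lines should print 22, 74 and 74. With the equal sign the macro holds the result of an expression, evaluated on the spot. Without it, the macro holds whatever text is left after the apostrophes have been expanded.</p>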
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Mapping Durham</title>
		<link>http://enoriver.net/index.php/2011/11/19/mapping-durham/</link>
		<comments>http://enoriver.net/index.php/2011/11/19/mapping-durham/#comments</comments>
		<pubDate>Sat, 19 Nov 2011 16:23:37 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[Durham playgrounds]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1908</guid>
		<description><![CDATA[Today, Kirstin wanted to make a grocery trip to the Whole Foods at Bull City Market, then take Kate to the nearest playground. That seems to be Oval Drive Park, but it won't be obvious from querying the Durham Park Locator. No worries. The Durham Park Locator gives you a pretty nice table with all [...]]]></description>
			<content:encoded><![CDATA[<p>Today, Kirstin wanted to make a grocery trip to the Whole Foods at Bull City Market, then take Kate to the nearest playground. That seems to be Oval Drive Park, but it won't be obvious from querying the <a href="http://www.ci.durham.nc.us/gis_apps/parkapp/mainmap.cfm">Durham Park Locator</a>.</p>
<p>No worries. The Durham Park Locator gives you a pretty nice table with all 55 playgrounds as of today. You load it into Stata, hit it with <code>geocode</code> and <code>writekml</code>, and you get <a href="http://maps.google.com/maps/ms?ie=UTF&#038;msa=0&#038;msid=205190672353608758732.0004b22c1ad29f72ce797">this Google map</a>. Easy.</p>
<p>On this occasion I also discovered that my <code>writekml</code>, as submitted, had a tiny bug. I submitted the fix a few minutes ago. It should be up by your next <code>update ado</code>.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2011/11/19/mapping-durham/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

