Stata for stocks

The people at StataCorp are on Facebook, and the other day they linked to this blog post by Paul Clist about checking on a stock you might own through clever use of the stockquote Stata command.

Last year I bought some Netflix stock when it fell to $77 after the Qwikster fail. I agreed with the general public that it was a stupid idea, but I still thought that the hit their stock took was a bit of an overreaction. The streaming business was still good. Maybe not $250 per share good, once the content suppliers would catch wise and raise their prices, but my family was still happy with it as a TV substitute. That's about the full extent of thought I'm going to ever put into picking any stock, so don't be too surprised that I don't make big bets. This one was just shy of $500 -- whatever round number of shares plus the broker's commission fit there.

Still, it was just never clear to me how good this choice of spending $500 was relative to the Nasdaq Composite. Of course, I could have looked it up on bigcharts.com, but why not have a picture with the real dollars I have at stake on the y axis, and my true time line on the x axis? It wasn't too hard to expand Paul's code to a set of programs that can take any of the four stocks in my toy portfolio and put it against some appropriate stock market index, to show how it's been doing in one quick tsline graph. Here's Netflix as of last Friday:

My code takes the starting dollar amount from my brokerage statement, and it augments both the holdings and the index baseline with any subsequent purchases or sales (there aren't any in this case). This way I can simply collapse (sum) both the "index" (transformed into a price-per-unit times the units held of each stock following Paul's formula) and the valuation of each stock into daily totals, and plot the performance of the whole portfolio relative to my stock index of choice, with actual dollars on the y axis.

I like it. It works fine for me. I already use Stata to do the household's budget, I used it to compare true costs to own of water heaters when I was in the market for one, and I used it to track wet and dirty diapers when my kid was a few days old. So, thank you, Paul, for helping me find yet another civilian use for this fine piece of software.

Turn a date into Stata format quickly

There's a little program that's shown up more than once now in my housekeeping do-files, so it may be useful enough for a blog post, but it doesn't quite warrant a spot in c(sysdir_personal) as a stand-alone ado-file. Here:


// turn this date to Stata format
// if it's not that way already
capture prog drop setStataDate
program setStataDate

args v fmt // fmt can be MDY or YMD

tempname x
capture confirm string variable `v'
if _rc==0 {
   local l`v' : variable label `v'
   gen `x'=date(`v',"`fmt'")
   format `x' %td
   drop `v'
   rename `x' `v'
   label variable `v' "`l`v''"
   // order `v'
}

end

I use it with data sets derived from merging other data sets. It's useful if in the original data sets there are string dates in mixed formats -- maybe YYYY-MM-DD in the "master", and MM/DD/YYYY in the "using" -- or if these string dates have labels I want to keep. So, you see why it's not clear that this is worth an ado-file. I don't want to type all the code between the curlies more than once, but usually I don't have to.

I do want to be able to call this program by name, as in setStataDate somedate MDY from within another program, then forget about it, safe in the knowledge that it won't make any difference if somedate is already in Stata format. That's the job of the if-condition you see there, and this is all this little program does.

As to the reason for using the temporary name x, see the comment thread. Temporary names for variables generated only to be renamed are the safe option. The alternative is to use a one-letter convenience name, like x, and I did that first. But what if you want to use this program with a data set that includes a variable named x already?

Human right stats, one last thing

The R code in my previous post could also produce the picture below. The implication is this:

A small sample is still bad news. It is biased toward underestimating the population. There's nothing you can do about that. The larger the sample, the better. How large a sample do you need? You might get lucky with as little as 1,000, for the reason that I mentioned in my first installment on the topic: small samples only need a small overlap to guess pretty well. That's why the green curve now peaks at the true population mark. But you'd have to be lucky, as my previous picture demonstrates by counter-example. And even a correct guess will be surrounded by a lot of uncertainty if you have a small sample: the green curve is still the flattest of the three that guess correctly. Finally, the gain from increasing the catch limit from 10,000 to 20,000 is not trivial after all: the purple curve is quite a bit peakier than the blue one.

What this simulation shows is that MSE relies on having representative samples of the true population. There's no way out of that requirement. You also want to run your code more than once. Though you will be able to dismiss easily sample sizes that are clearly too small, there may be a range of sample sizes that can provide false comfort. I could have easily seen this picture first and concluded that 1,000 isn't great, but it still hits the mark, so maybe it's good enough. That would have been wrong. On the other hand, now I'm pretty sure that 10,000 is still alright, though not as good as it looked before.

Human rights stats, part 2

My previous post promised some simulations. To refresh your memory, I am trying to see how reliably Multiple Systems Estimation, as described here, can guess the true number of fish in a pond.

The density plot below tells the story. The true number of fish is 150,000. Each catch limit lets you make a best guess, which is the x-coordinate of the peak of its associated bell curve. The shape of each bell curve measures the uncertainty surrounding the guess: the flatter the bell, the more uncertain the guess. Perfect foresight would be a spike at the 150,000 x-mark. Curves that peak away from that mark make biased guesses.

It is obvious that larger daily catch limits allow you to guess better. Very low catch limits set you up for severe downward bias. The red curve, corresponding to a daily catch of up to 500 fish, cannot help underestimating the true population size, for reasons discussed in the previous post. Then there seems to be a range of catch limits that improve on the bias, but increase the uncertainty horribly -- that's the green curve. So, you need higher limits, but you don't have to go crazy. The gain in precision after some point is not worth it: though a limit of 20,000 is very accurate, a limit of half that is not much worse.

I was inspired to run this exercise by a class I'm taking. The work would have been slower without help from here (I recommend the book, I bought it) and here. The picture would not have looked this good without ggplot2 (you should get that book too). All errors are my own. Here's the code:


# http://www.foreignpolicy.com/articles/2012/02/27/the_body_counter?page=full
# simulate MSE = Multiple Systems Estimation

# SOME HOUSEKEPING FIRST

# pretty picture comes from here
library("ggplot2")

# true population size
population <- 150000 

# number of guesses you can take to estimate
# maximum-likelihood mean, standard deviation
# of your final guess
n <- 10

# this many draws will help you simulate
# the uncertainty of your guess
sims <- 10^4

# define a function that will render big numbers with comma
# separators for thousands. the original version is here:
# https://stat.ethz.ch/pipermail/r-help/2010-November/259488.html
commaUS <- function(x) {
   sprintf("%s", formatC(x, format="fg", big.mark = ","))
}

# THE PIECES OF THE ACTUAL SIMULATION

# make one guess, with daily catches of up to x:
# -- if x=100, the daily catch is between 0 and 99
# -- if x=1000, the daily catch is between 0 and 999.
mymse <- function(x) {
   day1 <- round(runif(1)*x)       # count of fish caught on day 1
   day2 <- round(runif(1)*x)       # count of fish caught on day 2
   pond <- c(1:population)
   day1 <- sample(pond,day1)               # list of fish caught on day 1
   day2 <- sample(pond,day2)               # list of fish caught on day 2
   overlap <- length(intersect(day1,day2)) # count of fish caught twice
   guess <- length(day1)*length(day2)
   if(overlap>0) {
      guess <- guess/overlap
   }
   return(round(guess))
}

# assume that the fish population guesses for a given pond
# are distributed N(mu,sigma). this is the log likelihood
# function of a given set of y guesses:
myll <- function(par, y) {
   mu <- par[1]
   sigma <- par[2]
   l <- length(y)*log(sigma^2)
   a <- 1/sigma^2
   ll <- sum((y-mu)^2)
   return(-(l+a*ll)/2)
}

# collect n guesses of catches limited to
# a given maxcatch, then estimate mu, sigma
# and make sims draws from N(mu, sigma)
mysim <- function(n,maxcatch) {
   # 1. make n guesses of fish in the pond
   a <- apply(as.matrix(1:n),1,function(x) mymse(maxcatch))
   # 2. get the maximum likelihood estimates of mu, sigma
   # that might have produced the n guesses:
   opt <- optim(par=c(1,1), fn = myll, control = list(fnscale = -1),
          y=a, method = "BFGS", hessian = TRUE)
   # 3. model the uncertainty around these estimates
   return(rnorm(sims,mean=opt$par[1],sd=opt$par[2]))
}   

# THE FINAL PICTURE

# now make some comparisons of the effect of
# maxcatch with a given number of guesses n
# on the population estimate and its precision
maxes  <- c(500,1000,10000,20000)
simspic <- matrix(0,sims,length(maxes))
for(i in 1:length(maxes)) {
   simspic[,i] <- mysim(n,maxes[i])
}

# Set up a stacked 2-column matrix where the second
# column will be used for grouping these guesses
x.1 <- cbind(simspic[,1],maxes[1])
x.2 <- cbind(simspic[,2],maxes[2])
x.3 <- cbind(simspic[,3],maxes[3])
x.4 <- cbind(simspic[,4],maxes[4])

# now plot the guesses with the
# catch limits set in maxes
d <- as.data.frame(rbind(x.1,x.2,x.3,x.4))
p <- qplot(V1, colour=factor(V2), data=d, geom="density", xlab="Fish in the pond")
t <- paste("Precision of the population estimate with",n,"guesses",sep=" ")
p <- p + scale_colour_discrete(name = "Catch limit") + opts(title=t)
p

If you made it here, you might also want to see one last thing on the matter.

Human rights stats, part 1

I follow @simplystats on Twitter, and on March 1 they had a post that linked to an article in Foreign Policy about a guy who has the coolest job in applied stats. He works here.

The original piece described a quick algorithm that you can use to estimate the number of human rights violations using a technique first devised for counting fish in a pond. The gist of it is this: catch and release fish over two days. Tag the fish caught on the first day. Count each day's catch and the number of fish caught twice. That is the overlap. To estimate the number of fish in the pond, multiply the two days' catches and divide the total by the overlap.

I had a data set of insurance claims in Stata's memory at the time of my reading, with observations uniquely identified by a variable named claim_id.

I decided to use it as the model of a pond with as many fish in it as observations in my data set, so I wrote a little fishing program. It takes one argument: some round upper bound of the number of fish I might catch in a day. I'll call it n. It can be 100, or it can be 1,000. Here:


// try MSE
capture prog drop guessObservations
program guessObservations

args n // upper bound of a day's catch.

qui {
   local day1fishcount=int(runiform()*`n')
   local day2fishcount=int(runiform()*`n')

   forvalues i=1/2 {
      preserve
      tempfile day`i'fishlist
      sample `day`i'fishcount', count
      keep claim_id
      save "`day`i'fishlist'", replace
      restore
   }

   preserve
   drop _all
   use "`day1fishlist'"
   merge 1:1 claim_id using "`day2fishlist'"
   count if _merge==3
   local overlap=r(N)
   restore

   local totalfish=`day1fishcount'*`day2fishcount'
   if `overlap'>0 {
      local totalfish=`totalfish'/`overlap'
   }
   count
   local truect=r(N)
}

local fmt _col(30) %10.0fc
di ""
di "Fish caught on day 1:" `fmt' `day1fishcount'
di "Fish caught on day 2:" `fmt' `day2fishcount'
di "Overlap:"              `fmt' `overlap'
di "Estimate:"             `fmt' `totalfish'
di "True count:"           `fmt' `truect'

end

My data set has some 150,000 observations. Choosing a small n, say guessObservations 100, sets me up for an overlap of zero, but even so the two catches multiplied together won't even come close to the true size of the population. This is a technique for counting hungry fish in a small pond, not in an ocean. The size of the daily catch should be representative of the total, so you can have some decent overlap.

Setting n=1,000 keeps it small enough relative to the total population that it's still possible to have zero overlap, but n is now large enough to overshoot wildly in that case. If I catch 900 fish each day with zero overlap, I will guess that there are 810,000 fish there. However, an overlap as small as 5 will get me pretty close to the true population.

Setting n=10,000 performs much better. I may still have a day when the fish won't bite, and get this:


. guessObservations 10000

Fish caught on day 1:                49
Fish caught on day 2:             4,182
Overlap:                              3
Estimate:                        68,306
True count:                     157,638

But with any luck, I will probably get this:


. guessObservations 10000

Fish caught on day 1:             9,662
Fish caught on day 2:             3,220
Overlap:                            220
Estimate:                       141,417
True count:                     157,638

The larger n, the larger the overlap, and the better the precision. That makes sense: in the limit, the true number times itself divided by itself will yield the true number.

But does n have to be very large relative to the size of the population? And does my guess -- or the uncertainty surrounding it -- depend on what probability distribution function I assume for the daily catch? Next time I'll be doing some simulations.

Stata 12 with MacVim, updated

A while back I showed how to get Stata 12 to work with MacVim. This is to let you know about a bug fix. I posted the details on the Statalist just now. If you're reading this blog and you're not also a Statalist subscriber, you may want to change that.

Fighting the R graphics

If you've ever seen the Error in plot.new(): figure margins too large message before, this is the best overview of the problem that I could find anywhere.

There can be a lot of knobs to turn when it comes to graphics, no matter what statistical programming environment you use. In R, typing par() at the prompt will list them all.

A quick tip for using Stata in interactive mode

You don't always want to start a do-file in the editor for every small thing, though I usually do, and then trash it if I don't need it. So, my default stance is that I want to preserve work for later.

Yours may be the opposite. If so, one option is to type in the Command window. If you decide that you do want that work preserved for later after all, you can always save the content of the Review window as a .do file.

Another option is to have this in your profile.do file:


// log today's interactive commands
cmdlog using "~/data/cmdlogs/cmdlog `c(current_date)'.smcl", append

This saves a running log with everything you typed at the command line on a given day, in the folder data/cmdlogs. This will save the commands, but not the output (that's the difference between calling cmdlog as opposed to log).

More on this topic here. That may well be where I got the idea to put this in my own profile.do, but if anybody thinks otherwise, I'll be glad to append this post with the correct credit.

How many zeroes in that Poisson?

I have a data set, and some of the variables there are counts of a given event.

Four count outcomes, the easiest thing to do is a Poisson regression, but before you do that, it's worth asking if what you see there really is close enough to a Poisson process.

You could check whether the variance is more or less equal to the mean, but with real-life data you can bet that there will be a difference between the two, and you'll be left scratching your head as to whether it's too big for Poisson, or just about right to pass the smell test.

Another thing you can do is check whether the count variable shows the right number of zeroes. In a Poisson distribution, the marginal probability of a zero outcome is exp(-mean). If the proportion of zeroes that you see is a lot higher than this value, and it usually is when you're looking at counts of rare events, then you will have to consider a zero-inflated poisson or a finite-mixture model, as discussed with wonderful clarity in chapter 17 of Microeconometrics using Stata by Cameron and Trivedi.

That brings me to the immediate cause for this post. I thought I'd code up a quick program to check those zeroes for a given data set and count variable, and I did this:


capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den r(N)
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num r(N)
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end

Then I ran the thing, and it kept turning up a share of 1.0, that is 100% zeroes observed, no matter the data set or the variable of interest y. You know why? Because local den r(N) will evaluate to r(N), and that will be filled in by the last command that returns such a thing before `den' is invoked. That command is sum `y'. The same thing happens to `num'. So I took the ratio of the same number. The returned values from the calls to count that I had made right before defining both local num and local den were quietly obliterated. Isn't that a sneaky bug? The correct code is below:


capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den `r(N)'
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num `r(N)'
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end

Now `den' stores the actual value returned by the count before it was defined as local den `r(N)', as intended. Mind your apostrophes, is all I'm saying.

Erratum: actually, that's not all that I should have said. As Nick Cox observes in the comments below, you should mind your equal signs too.

Mapping Durham

Today, Kirstin wanted to make a grocery trip to the Whole Foods at Bull City Market, then take Kate to the nearest playground. That seems to be Oval Drive Park, but it won't be obvious from querying the Durham Park LocatorĀ .

No worries. The Durham Park Locator gives you a pretty nice table with all 55 playgrounds as of today. You load it into Stata, hit it with geocode and writekml, and you get this Google map. Easy.

On this occasion I also discovered that my writekml, as submitted, had a tiny bug. I submitted the fix a few minutes ago. It should be up by your next update ado.