<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Stata Things &#187; Stata</title>
	<atom:link href="http://enoriver.net/index.php/category/stata/feed/" rel="self" type="application/rss+xml" />
	<link>http://enoriver.net</link>
	<description>computing for fun and profit</description>
	<lastBuildDate>Tue, 31 Jan 2012 20:03:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>How many zeroes in that Poisson?</title>
		<link>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/</link>
		<comments>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/#comments</comments>
		<pubDate>Fri, 27 Jan 2012 04:11:06 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[count]]></category>
		<category><![CDATA[Poisson]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1919</guid>
		<description><![CDATA[I have a data set, and some of the variables there are counts of a given event. For count outcomes, the easiest thing to do is a Poisson regression, but before you do that, it's worth asking if what you see there really is close enough to a Poisson process. You could check whether the [...]]]></description>
			<content:encoded><![CDATA[<p>I have a data set, and some of the variables there are counts of a given event.</p>
<p>For count outcomes, the easiest thing to do is a Poisson regression, but before you do that, it's worth asking whether what you see really is close enough to a Poisson process.</p>
<p>You could check whether the variance is more or less equal to the mean, but with real-life data you can bet that there will be a difference between the two, and you'll be left scratching your head as to whether it's too big for Poisson, or just about right to pass the smell test.</p>
<p>Another thing you can do is check whether the count variable shows the right number of zeroes. In a Poisson distribution, the probability of a zero outcome is exp(-mean). If the proportion of zeroes that you see is a lot higher than this value, and it usually is when you're looking at counts of rare events, then you will have to consider a zero-inflated Poisson or a finite-mixture model, as discussed with wonderful clarity in chapter 17 of <a href="http://www.stata-press.com/books/musr.html" title="Microeconometrics using Stata" target="_blank">Microeconometrics using Stata</a> by Cameron and Trivedi.</p>
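<p>A minimal sketch of that check on simulated data, where the answer is known in advance. The rate of 0.7, the seed and the sample size are arbitrary choices of mine, and <code>rpoisson()</code> needs Stata 10.1 or later:</p>
<pre><code>
clear
set seed 2012
set obs 10000

// draw counts from a true Poisson process with mean 0.7
gen y = rpoisson(0.7)

// share of zeroes implied by the sample mean
quietly summarize y
local zpois = exp(-r(mean))

// share of zeroes actually observed
quietly count if y==0
local zobs = r(N)/_N

di "Share of zeroes implied by Poisson: " %5.3f `zpois'
di "Share of zeroes observed:           " %5.3f `zobs'
</code></pre>
<p>With data that really are Poisson the two shares should land close together; a large gap in real data is the warning sign.</p>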
<p>That brings me to the immediate cause for this post. I thought I'd code up a quick program to check those zeroes for a given data set and count variable, and I did this:</p>
<pre><code>
capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den r(N)
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num r(N)
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end
</code></pre>
<p>Then I ran the thing, and it kept turning up a share of 1.00, that is 100% zeroes observed, no matter the data set or the variable of interest y. You know why? Because <code>local den r(N)</code> stores the literal text r(N), not the number that <code>count</code> had just returned. That text is evaluated only when <code>`den'</code> is finally put to use, in the last <code>di</code> line, and by then r(N) holds whatever the most recent command that returns such a thing left there. That command is <code>sum `y'</code>. The same thing happens to <code>`num'</code>. So I was dividing a number by itself. The returned values from the calls to <code>count</code> that I had made right before defining both <code>local num</code> and <code>local den</code> were quietly obliterated. Isn't that a sneaky bug? The correct code is below:</p>
<pre><code>
capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den `r(N)'
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num `r(N)'
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end
</code></pre>
<p>Now <code>`den'</code> stores the actual number returned by the <code>count</code> that ran just before <code>local den `r(N)'</code>, as intended. Mind your apostrophes, is all I'm saying.</p>
<p>Erratum: actually, that's not all that I should have said. As Nick Cox observes in the comments below, you should mind your equal signs too.</p>
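<p>Here is a small sketch of all three ways to write that definition, using the auto data that ships with Stata. The macro names are made up for the illustration, and the equal-sign variant is my reading of the point about equal signs, not a quote from the comments. Run it from a do-file, since <code>//</code> comments do not work at the command line:</p>
<pre><code>
sysuse auto, clear

quietly count
local lazy  r(N)     // no quotes: stores the text r(N), evaluated whenever it is used
local eager `r(N)'   // quotes: r(N) is replaced by its current value, then stored
local equal = r(N)   // equal sign: the expression is evaluated now, the result stored

// a later r-class command overwrites r(N)
quietly count if foreign

di "lazy:  " `lazy'   // shows the second count
di "eager: " `eager'  // shows the first count
di "equal: " `equal'  // shows the first count
</code></pre>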
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2012/01/26/how-many-zeroes-in-that-poisson/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Mapping Durham</title>
		<link>http://enoriver.net/index.php/2011/11/19/mapping-durham/</link>
		<comments>http://enoriver.net/index.php/2011/11/19/mapping-durham/#comments</comments>
		<pubDate>Sat, 19 Nov 2011 16:23:37 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[Durham playgrounds]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1908</guid>
		<description><![CDATA[Today, Kirstin wanted to make a grocery trip to the Whole Foods at Bull City Market, then take Kate to the nearest playground. That seems to be Oval Drive Park, but it won't be obvious from querying the Durham Park Locator . No worries. The Durham Park Locator gives you a pretty nice table with all [...]]]></description>
			<content:encoded><![CDATA[<p>Today, Kirstin wanted to make a grocery trip to the Whole Foods at Bull City Market, then take Kate to the nearest playground. That seems to be Oval Drive Park, but it won't be obvious from querying the <a href="http://www.ci.durham.nc.us/gis_apps/parkapp/mainmap.cfm">Durham Park Locator </a>.</p>
<p>No worries. The Durham Park Locator gives you a pretty nice table with all 55 playgrounds as of today. You load it into Stata, hit it with <code>geocode</code> and <code>writekml</code>, and you get <a href="http://maps.google.com/maps/ms?ie=UTF&#038;msa=0&#038;msid=205190672353608758732.0004b22c1ad29f72ce797">this Google map</a>. Easy.</p>
<p>On this occasion I also discovered that my <code>writekml</code>, as submitted, had a tiny bug. I submitted the fix a few minutes ago. It should be up by your next <code>adoupdate</code>.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2011/11/19/mapping-durham/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>From Stata to Google Maps</title>
		<link>http://enoriver.net/index.php/2011/10/07/from-stata-to-google-maps/</link>
		<comments>http://enoriver.net/index.php/2011/10/07/from-stata-to-google-maps/#comments</comments>
		<pubDate>Fri, 07 Oct 2011 15:02:22 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[Google Maps API]]></category>
		<category><![CDATA[writekml]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1812</guid>
		<description><![CDATA[At the Stata command line, type "findit geocode". You will turn up a command that matches physical addresses with latitude and longitude coordinates using the Google Maps API. Then if you type "findit writekml" you will turn up my first contribution to the SSC: a command that writes a KML file using latitude and longitude [...]]]></description>
			<content:encoded><![CDATA[<p>At the Stata command line, type "findit geocode". You will turn up a command that matches physical addresses with latitude and longitude coordinates using the Google Maps API.</p>
<p>Then if you type "findit writekml" you will turn up my first contribution to the SSC: a command that writes a KML file using latitude and longitude coordinates in a Stata data set. Enjoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2011/10/07/from-stata-to-google-maps/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Stata 12 with MacVim</title>
		<link>http://enoriver.net/index.php/2011/09/14/stata-12-with-macvim/</link>
		<comments>http://enoriver.net/index.php/2011/09/14/stata-12-with-macvim/#comments</comments>
		<pubDate>Wed, 14 Sep 2011 14:30:46 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[Vim]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1770</guid>
		<description><![CDATA[I used to run Stata 10 with Vim on Windows. Now I run Stata 12 with MacVim. In Windows, there is a nice way to integrate Stata and Vim based on the work of Friedrich Huebler and Dimitriy Masterov. A fairly straightforward combination of bash scripts, Vim functions and AppleScript calls can achieve the same [...]]]></description>
			<content:encoded><![CDATA[<p>I used to run Stata 10 with <a href="http://www.vim.org/">Vim</a> on Windows. Now I run Stata 12 with <a href="http://code.google.com/p/macvim/">MacVim</a>. </p>
<p>In Windows, there is a nice way to integrate Stata and Vim based on the work of <a href="http://s281191135.onlinehome.us/2008/20080427-stata.html">Friedrich Huebler</a> and <a href="http://stata.com/statalist/archive/2006-06/msg00905.html">Dimitriy Masterov</a>. </p>
<p>A fairly <a href="http://www.stata.com/statalist/archive/2011-09/msg00556.html">straightforward combination</a> of bash scripts, Vim functions and AppleScript calls can achieve the same behavior in MacVim. I got it to work last night. Thank you, <a href="http://health.uchicago.edu/People/Schumm-Phil">Phil</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2011/09/14/stata-12-with-macvim/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Factors in Stata and R</title>
		<link>http://enoriver.net/index.php/2011/08/22/factors-in-stata-and-r/</link>
		<comments>http://enoriver.net/index.php/2011/08/22/factors-in-stata-and-r/#comments</comments>
		<pubDate>Mon, 22 Aug 2011 19:16:59 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Stata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1772</guid>
		<description><![CDATA[The quick version of this post goes like this: -- # in Stata is : in R -- ## in Stata is * in R. The long version is that both Stata and R handle very nicely factor variables in regression models. If you want a full-factorial interaction between a factor variable x1 and a [...]]]></description>
			<content:encoded><![CDATA[<p>The quick version of this post goes like this:<br />
-- # in Stata is : in R<br />
-- ## in Stata is * in R.</p>
<p>The long version is that both Stata and R handle factor variables in regression models very nicely. If you want a full-factorial interaction between a factor variable x1 and a continuous variable x2, the Stata way is to say</p>
<pre><code>
regress y i.x1##c.x2
</code></pre>
<p>whereas the R way is to say</p>
<pre><code>
lm(y~factor(x1)*x2)
</code></pre>
<p>Now, if you just want to interact x1 with the slope of x2, the Stata way becomes</p>
<pre><code>
regress y i.x1#c.x2
</code></pre>
<p>whereas the R way becomes</p>
<pre><code>
lm(y~factor(x1):x2)
</code></pre>
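<p>One way to convince yourself of what the double hash does is to spell it out. A sketch with the auto data that ships with Stata (my choice of example; factor-variable notation needs Stata 11 or later): the two commands below should produce the same fit.</p>
<pre><code>
sysuse auto, clear

// full factorial: main effects plus interaction
regress mpg i.foreign##c.weight

// the same model written out term by term
regress mpg i.foreign c.weight i.foreign#c.weight
</code></pre>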
<p>That's all.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2011/08/22/factors-in-stata-and-r/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Delayed macro substitution</title>
		<link>http://enoriver.net/index.php/2011/02/10/delayed-macro-substitution/</link>
		<comments>http://enoriver.net/index.php/2011/02/10/delayed-macro-substitution/#comments</comments>
		<pubDate>Thu, 10 Feb 2011 19:12:21 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1480</guid>
		<description><![CDATA[In Stata, you have local and global macros that can encapsulate all sorts of specific text: names of variables, constant values, even entire chunks of Stata code. Stata will interpret macros as soon as you invoke them, so that if you define local thedata sysuse auto local themodel regress mpg foreign weight you can simply [...]]]></description>
			<content:encoded><![CDATA[<p>In Stata, you have local and global macros that can encapsulate all sorts of specific text: names of variables, constant values, even entire chunks of Stata code. Stata will interpret macros as soon as you invoke them, so that if you define</p>
<pre><code>
local thedata sysuse auto
local themodel regress mpg foreign weight
</code></pre>
<p>you can simply call</p>
<pre><code>
`thedata'
`themodel'
</code></pre>
<p>You can also chain or nest local macros. It makes for a little extra work and produces code that's a little harder to read, but using macros is the best way to ensure code consistency, so it's a good thing to get used to them.</p>
<p>One neat feature of macros is that you can delay their substitution with backslashes. This allows you to nest macros in a very specific way. Let's expand on the example above. You could have defined the macro `model' as a nested one:</p>
<pre><code>
local rhs foreign weight
local model regress mpg `rhs'
</code></pre>
<p>From your standpoint, the local `rhs' is nested inside the local `model'. Stata does not care, because as soon as you have it read "`rhs'", it substitutes "foreign weight". From its standpoint, this `model' is identical to what the previous definition would have generated: the string "regress mpg foreign weight". But now suppose that you need a little flexibility in what should go in the right-hand side of this regression equation: suppose that headroom might also matter to fuel economy, presumably because gains in headroom come at a cost in aerodynamics. You could do this:</p>
<pre><code>
local rhs1 foreign weight
local rhs2 foreign weight headroom
</code></pre>
<p>Then you could do</p>
<pre><code>
local model regress mpg `rhs1'
`model'
local model regress mpg `rhs2'
`model'
</code></pre>
<p>You have to define the local `model' twice because Stata substitutes the values of `rhs1' and `rhs2' as soon as you invoke them. There's a way around that. You could nest `rhs' into the definition of `model' with delayed, as opposed to immediate, substitution by putting a backslash in front of it, and only change its content when needed:</p>
<pre><code>
local model regress mpg \`rhs'
local rhs foreign weight
`model'
local rhs foreign weight headroom
`model'
</code></pre>
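<p>If you want to see what the backslash actually left inside the macro, <code>macro list</code> will show its stored text without expanding it. A small sketch, reusing the names from the example above and meant to be run from a do-file in one go:</p>
<pre><code>
local model regress mpg \`rhs'
macro list _model        // the stored text still contains `rhs', unexpanded

local rhs foreign weight
di "`model'"             // by now `rhs' gets filled in
</code></pre>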
<p>Delayed substitution is elegant. It lets you nest macros using their names as placeholders, and have Stata fill them in only when they are needed. Here's one final working demo that shows you how you can use delayed macro substitution in program definitions:</p>
<pre><code>
capture prog drop myDemo
program myDemo

syntax anything

local displaythis "Argument \`i' is \`addthis'"

local argct: list sizeof anything
forvalues i=1/`argct' {
   local addthis: word `i' of `anything'
   di "`displaythis'"
}

end

myDemo three blind mice
</code></pre>
<p>For more information, see <a href="http://www.stata.com/support/faqs/lang/backslash.html">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2011/02/10/delayed-macro-substitution/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Simulation</title>
		<link>http://enoriver.net/index.php/2011/01/03/simulation/</link>
		<comments>http://enoriver.net/index.php/2011/01/03/simulation/#comments</comments>
		<pubDate>Mon, 03 Jan 2011 18:04:35 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1421</guid>
		<description><![CDATA[This paper by Victor S. Y. Lo came up in a conversation about how to best model treatment response in a business context, where the population of interest is the customers, and the treatment is some kind of marketing action -- you try and up-sell, cross-sell or simply keep them from buying from somewhere else. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.sigkdd.org/explorations/issues/4-2-2002-12/lo.pdf">This paper</a> by <a href="http://www.bentley.edu/arts-sciences-center/visiting-research-fellow.cfm">Victor S. Y. Lo</a> came up in a conversation about how to best model treatment response in a business context, where the population of interest is the customers, and the treatment is some kind of marketing action -- you try and up-sell, cross-sell or simply keep them from buying from somewhere else.</p>
<p>Because this is stuff I should know about now, I replicated the author's simulation in Stata. That was a success and the code is enclosed, in case any Stata user ever googles "victor lo database marketing" and ends up here. But the exercise got me thinking again about programming and long-term code maintenance.</p>
<p>I have a bit of adult education in two widely-used programming languages: C++ and Perl. In my meager experience, there's one big difference between the two. C++ forces you to write code in a very rigid way. Perl does not. C++ is like a regular game for older kids, played by rules. Perl is like free-flowing toddler play. Perl is easier on you because it's less capable than C++: not being able to do some things means that you don't get to worry about them. With this disclaimer out of the way, what matters for this story is that Perl instructions need not be encapsulated into subroutines. You write them as you go, you check that they do what you want, then you can clean up after yourself later, or not. </p>
<p>This means that quick and dirty things are easier to do in Perl than in C++, but usually quick and dirty things end up needing to be useful in the long run, so you will end up having to go slow and clean soon enough. With Perl that is optional. Optional will bite you in the ass, but clean isn't an unadulterated blessing, either: clean somehow always manages to be more obscure than dirty. I wish it weren't, and there's probably a right way to do it that would prevent this inconvenience, but I'm too old and too trapped in middle-class comfort to afford an apprenticeship, so I'll just have to muddle through.</p>
<p>Stata is like Perl. Because of this, I start out writing do-files in freestyle. This lets me explore at my leisure. Then, as things grow clear enough in my head, I encapsulate my code into reusable programs -- not so much to avoid cutting and pasting code, which is a worthy goal, but to enforce consistency and check my thinking.</p>
<p>This leaves me with code that looks good and tidy at the end of a project, but that tidiness makes it awful hard to follow when the specifics of the problem I tried to solve are no longer fresh in my mind. So: I am posting this simulation below, in its finished form, with plenty of comments, hoping that it will forever make sense to me, but I make no guarantees. If you stumble across it a month from now or later, odds are you're on your own.</p>
<pre><code>
clear
set type double
set more off
pause on

version 10

set obs 100000

// GLOBAL VARIABLES

// my random seed value and cutoffs:
global seed     12345671  // seed for random number generator
global traincut .49875    // at this seed, this cutoff splits data 50/50 exactly into training and holdout
global treatcut .8003884  // at this seed, this cutoff splits data 80/20 exactly into treatment and control

// my file path for graphs:
global savehere c:/data/lo simulation/

// Victor Lo's variables:
global varz age wealth asset homevalue

// means
global m_age        45
global m_wealth     935
global m_asset      680.5
global m_homevalue  654.5

// standard deviations
global sd_age       13
global sd_wealth    150
global sd_asset     150
global sd_homevalue 70

// parameters of the true response functions. a means intercept. t means treatment
// baseline (control)
global a        -7.5
global b_age      .02
global b_wealth   .005
global b_asset    .00155

// treatment dummies (add to baseline)
global ta        -.5
global tb_age     .02
global tb_wealth 0
global tb_asset   .0001

// PROGRAM DEFINITIONS SECTION

// generate data according to Lo's DGP:
capture prog drop generateData
program generateData

foreach k in seed traincut treatcut varz {
	local `k' ${`k'}
}
// means and standard deviations
foreach k in m sd {
	foreach var in `varz' {
		local `k'_`var' ${`k'_`var'}
	}
}
// variances
foreach k in `varz' {
	local var_`k'=`sd_`k''^2
}
// matrices of interest for 3 vars that make it into the final model
foreach k in m sd {
	matrix `k'z=(``k'_age',``k'_wealth',``k'_asset')
}
// Lo's own covariance matrix:
matrix covz=(`var_age',507,152\507,`var_wealth',6750\152,6750,`var_asset')

// simulate data with given seed value
drop _all
set seed `seed'
drawnorm age wealth asset, n(100000) means(mz) cov(covz)

// set up treatment and control group and their respective true response functions
gen treatment=runiform()

// find cutoff for treatment group exactly equal to random 80000 observations
local tcut `treatcut'
count if treatment<=`tcut'
gen byte treated=(treatment<=`tcut')
drop treatment
tab treated

// get true response for treatment and control group
foreach k in a b_age b_wealth b_asset {
	local `k'  ${`k'}
	local t`k' ${t`k'}
}
local intercept     (`a'+`ta'*treated)
foreach k in age wealth asset {
	local `k'_effect    (`b_`k''+`tb_`k''*treated)*`k'
}
gen logit_response=`intercept'+`age_effect'+`wealth_effect'+`asset_effect'

// split the data randomly 50/50 into training and holdout subsets
gen training=runiform()
local tcut `traincut'
count if training<=`tcut'
gen byte trained=(training<=`tcut')
drop training
tab trained treated

// now jitter the response so you can run a regression
gen error=rnormal()
gen logit_jittered=logit_response+error

// now identify observations, to check overlap of top decile btw current
// and proposed methodology.
gen id=_n
order id

end

// predict treated response in holdout set, then rank and decile it:
capture prog drop getDeciles
program getDeciles

args whichone // "current" or "proposed"

keep if !trained            // keep only holdout group

local current_treatment  _b[_cons]+_b[age]*age+_b[wealth]*wealth+_b[asset]*asset // estimated by current regression model
local current_control    ${a}+${b_age}*age+${b_wealth}*wealth+${b_asset}*asset   // known from simulation (not estimated by current model)

local proposed_treatment (_b[_cons]+_b[_Itreated_1])+(_b[age]+_b[_ItreXage_1])*age+_b[wealth]*wealth+_b[asset]*asset // estimated by proposed regression model
local proposed_control   _b[_cons]+_b[age]*age+_b[wealth]*wealth+_b[asset]*asset                                     // ditto (so, better than current)

local current_decile  treatment_response // In current methodology Lo ranks the holdout sample by decile of the estimated treatment response rate,
                                         // because that's what's being modeled. That's the only thing you can try and maximize. But that won't maximize
                                         // true lift, as shown in Figure 4.
local proposed_decile observed_lift      // In proposed methodology Lo ranks the holdout sample by decile of the estimated lift, because he models
                                         // the response of both the treatment and the control group, which allows him to maximize true lift. 

// Now fill in individual response probability for either case -- what it would have been if:
// -- individual had been picked in target group
// -- individual had stayed in the control group
// Then based on that, fill in the lift you think you can credit to the treatment (observed lift).
gen logit_treatment=``whichone'_treatment'
gen logit_control=``whichone'_control'
foreach k in treatment control {
	gen `k'_response=exp(logit_`k')/(1+exp(logit_`k'))
	drop logit_`k'
}
gen observed_lift=treatment_response-control_response

// get true response to treatment and true response in control group (simulated).
// then based on that, fill in actual (simulated) lift that treatment could have
// given (you don't know this when you have real data, because you don't know
// the true response functions in the two groups).
foreach k in a b_age b_wealth b_asset {
	local `k'  ${`k'}
	local t`k' ${t`k'}
}
local treatment_intercept     `a'+`ta'
local control_intercept     `a'
foreach k in age wealth asset {
	local treatment`k'_effect  (`b_`k''+`tb_`k'')*`k'
	local control`k'_effect    `b_`k''*`k'
}
foreach k in treatment control {
	gen logit_`k'=``k'_intercept'+``k'age_effect'+``k'wealth_effect'+``k'asset_effect'
	gen `k'_actual=exp(logit_`k')/(1+exp(logit_`k'))
	drop logit_`k'
}
gen actual_lift=treatment_actual-control_actual
drop treatment_actual control_actual

label var actual_lift        "Actual Lift"
label var observed_lift      "Observed Lift"
label var treatment_response "Treatment Response"
label var control_response   "Control Response"

// now get deciles -- based on current model (treatment response) or proposed (observed lift)
local response ``whichone'_decile'
quietly {
	pctile pct = `response', nq(10)
	levelsof pct, local(deciles)
	drop pct
	gen decile=.
	forvalues i=1/9 {
		local decile: word `i' of `deciles'
		replace decile=10-`i'+1 if `response'<=`decile' &#038; missing(decile)
	}
	replace decile=1 if missing(decile)
}
tab decile, missing
tabstat treatment_response control_response actual_lift observed_lift, stats(mean n) by(decile)

end

// Draw graph of response rate -- figure 3 or 5.
capture prog drop drawResponse
program drawResponse

args whichone // "current" or "proposed"

local stitle subtitle((`whichone' methodology))

local current_fignum   3
local proposed_fignum  5
local responses        treatment_response control_response
local fignum           ``whichone'_fignum'

preserve
collapse (mean) `responses', by(decile)
local title   Fig `fignum'. Response rate by decile
local titles  title(`title') `stitle'
tempfile fig
graph bar `responses', over(decile) ytitle(response rate) `titles' saving("`fig'", replace)
graph export "${savehere}Lo Figure `fignum'.png", as(png) replace
restore

end

// Draw graph of lift -- figure 4 or 6.
capture prog drop drawLift
program drawLift

args whichone // "current" or "proposed"

local stitle subtitle((`whichone' methodology))

local current_fignum   4
local proposed_fignum  6
local responses        actual_lift observed_lift

local fignum           ``whichone'_fignum'
local control          ``whichone'_control'

preserve
collapse (mean) `responses', by(decile)
local ytitle ytitle(lift (treatment minus control))
local title  Fig `fignum'. Lift by decile
local titles `ytitle' title(`title') `stitle'
tempfile fig
graph bar `responses', over(decile) `titles' saving("`fig'", replace)
graph export "${savehere}Lo Figure `fignum'.png", as(png) replace
restore

end

// estimate model -- current or proposed methodology
capture prog drop estimateModel
program estimateModel

args whichone // "current" or "proposed"

// estimate the response in the training set.
// for current methodology, include only treated subset in e(sample). this
// will mimic the true model, so you can't run a regression on logit_response
// directly. instead, you must use logit_jittered, so you have an error term.
local current_condition if trained &#038; treated
local current_model     regress logit_jittered age wealth asset `current_condition'

// for proposed methodology, no need to jitter the response to run the
// regression, because though e(sample) includes both the treated and
// control subsets and uses a dummy, the regression equation does not
// mimic true model. only age is interacted with treatment dummy, not also asset.
// but, no harm in using logit_jittered, either.
local proposed_condition if trained
local proposed_model     xi: regress logit_response i.treated*age wealth asset `proposed_condition'

// Generate the data for the three variables that
// make it into the model: age, wealth and asset.
quietly generateData

// estimate the response in the training set.
``whichone'_model'

// use model to predict treated response in holdout set:
getDeciles `whichone'

end

// EXECUTION SECTION

// for current methodology, estimate model based only on treatment subset
// of the training data set
// for proposed methodology, you will need both treatment and control
// subsets of the training data set

// CURRENT METHODOLOGY (DRAW FIGURES 3 AND 4)

estimateModel current

// OK, this will draw Fig 3.
drawResponse current

// Now draw Fig 4.
drawLift current

// Keep top decile according to current methodology
tempfile topdecile_current
keep if decile==1
keep id observed_lift
rename observed_lift lift
sort id
save "`topdecile_current'", replace

// PROPOSED METHODOLOGY (DRAW FIGURES 5 AND 6)

estimateModel proposed

// OK, this will draw Fig 5.
drawResponse proposed

// Now draw Fig 6.
drawLift proposed

// Keep top decile according to proposed methodology
tempfile topdecile_proposed
keep if decile==1
keep id observed_lift
rename observed_lift true_lift
sort id
save "`topdecile_proposed'", replace

// Now see how much the two overlap.
merge id using "`topdecile_current'"
tab _merge
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2011/01/03/simulation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dogs, bedbugs and Reverend Bayes</title>
		<link>http://enoriver.net/index.php/2010/11/12/dogs-bedbugs-and-reverend-bayes/</link>
		<comments>http://enoriver.net/index.php/2010/11/12/dogs-bedbugs-and-reverend-bayes/#comments</comments>
		<pubDate>Fri, 12 Nov 2010 20:52:01 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[Bayes' Theorem]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1400</guid>
		<description><![CDATA[This morning I finished watching this, then I went and got my New York Times fix, where I found this. The two are connected because Hilary Mason's talk gave me a nice little reminder of the counterintuitive Bayes' Theorem, and the idea to use it to see if the Times has any basis for casting [...]]]></description>
			<content:encoded><![CDATA[<p>This morning I finished watching <a href="http://www.infoq.com/presentations/Machine-Learning">this</a>, then I went and got my New York Times fix, where I found <a href="http://nyti.ms/aGJYdi">this</a>. The two are connected because Hilary Mason's talk gave me a nice little reminder of the counterintuitive <a href="http://en.wikipedia.org/wiki/Bayes'_theorem">Bayes' Theorem</a>, and the idea to use it to see if the Times has any basis for casting doubt on the good name of the city's fine corps of bedbug-sniffing dogs.</p>
<p>The answer is that it depends. If the infestation rate in New York City is 5% rather than the 10% that <a href="http://bit.ly/ckMGzx">The Daily News</a> claims, and if the dogs' accuracy in the field is 95% rather than the hoped-for 98%, then really, if you live in one of the typical New York apartments and a dog says you have bedbugs, you might as well flip a coin. As hard as that may be to believe, that's how the numbers work out. When the actual infestation rate is low, a 95% accuracy rate is pretty limiting.</p>
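<p>Before the full program, the coin-flip figure can be checked by hand in a few lines. This is a sketch that assumes, as the program below does, that the dog is equally accurate on infested and clean apartments; the local names are mine:</p>
<pre><code>
local p_inf .05   // P(A): infestation rate
local p_acc .95   // P(B|A): dog is right when there are bedbugs

// P(B): the dog alerts, rightly or wrongly
local p_alert = `p_acc'*`p_inf' + (1-`p_acc')*(1-`p_inf')

// Bayes' theorem: P(A|B) = P(B|A)*P(A)/P(B)
di "P(bedbugs | dog alerts) = " %4.2f `p_acc'*`p_inf'/`p_alert'
</code></pre>
<p>True positives and false positives each come to 4.75% of all apartments, so the ratio is one half.</p>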
<p>Here's my Stata code:</p>
<pre><code>
capture prog drop findBedbugs
program findBedbugs

version 10

args bir_prb bsd_prb

// legend of the variable names in this program
local bir "reported NYC bedbug infestation rate"
local bsd "bedbug sniffing dog"
local nhu "NYC housing units in 2009"
local prb "probability"
local num "number"

// Bayes' theorem: P(A|B)=P(B|A)*P(A)/P(B)
// P(A) is the true infestation rate -- bir_prb.
// P(B|A) is the dog's accuracy rate -- bsd_prb. The same rate is assumed
// for correctly clearing a unit that is not infested.
// P(B) is the detected infestation rate (both true and false positives).
// P(A|B) is the probability that a positive is a true positive.

// data sources
// for bir_prb=.1 (10% infestation rate):
// http://bit.ly/ckMGzx
// for nhu_num=8,017,263 (over 8 million housing units) in NYC:
// http://quickfacts.census.gov/qfd/states/36000.html
// for bsd_prb=.98 (the dogs' accuracy rate is 98%):
// http://bit.ly/apbldH 

// collected values
local nhu_num  8017263	 

// some formatting
local bsdr: di %2.0f 100*`bsd_prb' "%"
local bsdw: di %2.0f 100*(1-`bsd_prb') "%"
local ir:   di %2.0f 100*`bir_prb' "%"

// some labels
local DS "Total possible cases where the dog"
local DR "`DS' is right (`bsdr' of infested units)"
local DW "`DS' is wrong (`bsdw' of units not infested)"
local TP "P(you really have bedbugs when this dog says so)"

// some calculations
local DR_num=`bsd_prb'*`bir_prb'*`nhu_num'         // true positives
local DW_num=(1-`bsd_prb')*(1-`bir_prb')*`nhu_num' // false positives
local I_num=`nhu_num'*`bir_prb'                    // number of infestations
local npi=`DR_num'/(`DR_num'+`DW_num')             // net P(infestation)

// show this on screen
local col 75
di ""
di "The infestation rate is `ir' and the dog is `bsdr' accurate."
di "Total number of housing units in NYC: " _col(`col') %12.0fc `nhu_num'
di "Infested units: " _col(`col') %12.0fc `I_num'
di "`DR': "           _col(`col') %12.0fc `DR_num'
di "`DW': "           _col(`col') %12.0fc `DW_num'
di ""
di "`TP': "           %4.2fc `npi'*100 "%"

end
</code></pre>
<p>And here's what you get when you run this, first with a 10% infestation rate and 98% dog accuracy, and then with a 5% infestation rate and 95% dog accuracy:</p>
<pre><code>
. findBedbugs .1 .98

The infestation rate is 10% and the dog is 98% accurate.
Total number of housing units in NYC:                                        8,017,263
Infested units:                                                                801,726
Total possible cases where the dog is right (98% of infested units):           785,692
Total possible cases where the dog is wrong ( 2% of units not infested):       144,311

P(you really have bedbugs when this dog says so): 84.48%

. findBedbugs .05 .95

The infestation rate is  5% and the dog is 95% accurate.
Total number of housing units in NYC:                                        8,017,263
Infested units:                                                                400,863
Total possible cases where the dog is right (95% of infested units):           380,820
Total possible cases where the dog is wrong ( 5% of units not infested):       380,820

P(you really have bedbugs when this dog says so): 50.00%
</code></pre>
<p>The point is that this low probability of having bedbugs when the dog says so is a function of both how bad the problem is, and how accurate the dog is. The 10% infestation rate expectation comes from a Daily News poll. The 95% accuracy rate comes from a clinical trial. It's hard to say which is more trustworthy. It's possible that dogs perform worse in the field than they do in the lab. It's also possible that the poll suffers from non-response bias.</p>
<p>Plus, if you live in a less-infested neighborhood (and you can tell from <a href="http://samizdat.cc/bdbgs/">here</a>), the chance that you're wasting your money by hiring a sniffer dog increases greatly. That's not the dog's fault.</p>
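<p>To see how fast that chance moves with the neighborhood, here is a minimal sketch that holds the dog's accuracy fixed and sweeps the infestation rate. The program name <code>sweepBedbugs</code>, its argument <code>acc</code> and the list of rates are my own picks for illustration, not numbers from any of the sources above. As before, the one accuracy rate is assumed to apply both to flagging infested units and to clearing units that are not infested.</p>
<pre><code>
capture prog drop sweepBedbugs
program sweepBedbugs

version 10

// acc is a hypothetical argument: the dog's accuracy rate, e.g. .95
args acc

di ""
di "Dog accuracy: " %4.2f `acc'
di "infestation rate" _col(22) "P(bedbugs | dog says so)"

// the rates below are illustrative, not measured
foreach rate in .01 .02 .05 .10 .20 {
	local tp=`acc'*`rate'            // share of units: true positives
	local fp=(1-`acc')*(1-`rate')    // share of units: false positives
	local ppv=`tp'/(`tp'+`fp')       // Bayes' theorem
	di %5.2f `rate' _col(22) %6.2f 100*`ppv' "%"
}

end

sweepBedbugs .95
</code></pre>
<p>With 95% accuracy, the 5% row is the coin flip from above, and at a 1% infestation rate the dog is right only about 16% of the time when it says you have bedbugs.</p>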
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2010/11/12/dogs-bedbugs-and-reverend-bayes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Good to know</title>
		<link>http://enoriver.net/index.php/2010/07/15/good-to-know/</link>
		<comments>http://enoriver.net/index.php/2010/07/15/good-to-know/#comments</comments>
		<pubDate>Thu, 15 Jul 2010 18:46:38 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1311</guid>
		<description><![CDATA[The user-written commands you download to your ado/plus directory are updated once in a while on that RePEc server they come from. So, after you findit a command and then net install it, your imported command might need to be refreshed occasionally. That is what adoupdate, update does. I was reminded of this when I tried to [...]]]></description>
			<content:encoded><![CDATA[<p>The user-written commands you download to your ado/plus directory are updated once in a while on that RePEc server they come from. So, after you <code>findit</code> a command and then <code>net install</code> it, your imported command might need to be refreshed occasionally. That is what <code>adoupdate, update</code> does.</p>
<p>I was reminded of this when I tried to run <code>freduse</code> today. Actually, the problem that reminded me of it -- a "not found" error message thrown when the command invoked the Mata function <code>_fredifinparse()</code> -- didn't go away, but <code>adoupdate, update</code> is still a good thing to do. What did fix <code>_fredifinparse()</code> is described <a href="http://vhaguiar.wordpress.com/2010/02/07/stata-tip-importing-stock-info-from-yahoo-finance-and-fed-macroeconomic-data/">here</a>.</p>
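<p>For reference, a short sketch of the two forms of the command as they would appear in a do-file. Run without options, <code>adoupdate</code> only reports; nothing is changed until you add <code>update</code>.</p>
<pre><code>
adoupdate            // report which installed packages have updates available
adoupdate, update    // download and install those updates
</code></pre>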
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2010/07/15/good-to-know/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The limits of encapsulation</title>
		<link>http://enoriver.net/index.php/2010/07/09/the-limits-of-encapsulation/</link>
		<comments>http://enoriver.net/index.php/2010/07/09/the-limits-of-encapsulation/#comments</comments>
		<pubDate>Fri, 09 Jul 2010 15:10:46 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=1304</guid>
		<description><![CDATA[I just read this. I liked it. It put some bit of anguish I've been having into clearer words than I could. My Stata code between 2000 and 2006 consisted exclusively of do-files that put to work either standard Stata commands or user-written commands from the SSC. There was not a single program definition anywhere [...]]]></description>
			<content:encoded><![CDATA[<p>I just read <a href="http://www.johndcook.com/blog/2010/06/30/where-the-unix-philosophy-breaks-down/">this</a>. I liked it. It put some bit of anguish I've been having into clearer words than I could.</p>
<p>My Stata code between 2000 and 2006 consisted exclusively of do-files that put to work either standard Stata commands or user-written commands from the <a href="http://www.stata.com/help.cgi?ssc">SSC</a>. There was not a single program definition anywhere and things worked all right. These do-files were pretty elaborate and their functionality overlapped a fair bit, but that was never much of an inconvenience.</p>
<p>Then in early 2007, during my brief tenure at RTI Health Solutions, that way of working showed its limitations when I tried to program in plain Stata matrices something that was normally being done in GAUSS. It had to do with the design of factorial experiments and my project ended in an instructive kind of failure, because it got me started on using Stata programs. I still like those things. I can define them once and then nest and have them call each other every which way to my heart's content. They take arguments, return values, and generally they make you feel like a real programmer.</p>
<p>Then in 2008 I had my introduction to C++, and the instructions were clear: break down a problem into small morsels; use as many functions as you need; if a function definition fills up a screen, it's way too big, so break it down further; encapsulation is a good thing. Then came header files, namespaces, classes, templates, the works. It was an extreme kind of validation of the way I had started to do business, and my enthusiasm for modular code only grew from there.</p>
<p>Then, about a year ago, I started running into problems. Component programs can be debugged individually, sure, and you only need to fix them once, in one place, which is great. In fact, if they're small and simple enough, you don't even need to do that; they just work. But with complicated projects you're going to have so many interlocked small and simple programs that it will just be too hard to keep tabs on which programs call which, where, and why. It's also pretty expensive to write them so that they are not just able to talk with one other program, but truly universal within the context of the given problem.</p>
<p>So I'm not sure anymore that I would recommend my way of writing Stata code to everybody. It still has its uses, but I can see a growing number of circumstances where it's simply not worth the trouble.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2010/07/09/the-limits-of-encapsulation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

