Define local macros in one place, use them everywhere
Thursday, 18 February 2010
In The Stata Journal Vol. 9, No. 3, 2009 there's a Stata tip (# 77) on re-using macros in multiple do-files, by Jeph Herrin. His solution is to define any local macros in a separate do-file, say locals.do. You can call that do-file with the include command at the top of any do-file that might refer to some of those macros. This is useful, as the author explains, when you want to keep all your macro definitions in one easily editable place, and make them available to whatever projects can use them without having to duplicate the definitions. This is similar to how you would #include header files in C or C++.
Unfortunately, include does not work well with programs. You can invoke it inside a program but include itself cannot be compiled, so the do-file it refers to will have to be interpreted all over again every time you call that program: see help include for the original reference. This negates the speed advantage that you expect when you package your Stata work into programs as opposed to simple do-files. It does not negate all the other advantages -- such as better code portability and the possibility of unit testing -- but still, speed matters.
My own solution is to define locals inside an r-class program. Then other programs that need some of these locals call that program, and recover only the locals they need. Below is one example, worked first with include and then with my proposed alternative:
// locals.do starts here
local file_path "c:/data/my_path/"
local file_name "file_name.dta"
local my_file "`file_path'`file_name'"
// and ends here.
OK, now let's put these locals to work:
// work.do starts here:
include locals.do
use "`my_file'"
Alright. This is not bad: it does the job, and your locals are neatly separated from the rest of the work. If you use do-files to split a big job into smaller pieces, this is all you need. If, on the other hand, you want to chop the job into still smaller pieces and encapsulate those pieces into programs, then this is inadequate, as explained above.
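To make the limitation concrete, here is a minimal sketch of include called from inside a program. The program name loadMyFile is made up for illustration, and locals.do is assumed to sit in the current working directory:

```stata
capture program drop loadMyFile
program loadMyFile
    // locals.do is read and interpreted again on every call to loadMyFile
    include locals.do
    use "`my_file'", clear
end

loadMyFile
```

The locals defined in locals.do are visible inside loadMyFile only, which is what you want, but the cost of interpreting the file is paid on each call.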
The workaround might look like a bit more overhead, but then that's the usual cost of writing programs instead of plain do-files. Some of us, sometimes, find that overhead acceptable:
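capture program drop defineMyLocals
program defineMyLocals, rclass
    // the same definitions as in locals.do
    local file_path "c:/data/my_path/"
    local file_name "file_name.dta"
    local my_file "`file_path'`file_name'"
    // names of the locals to hand back; add each new local to this list
    local things "file_path file_name my_file"
    foreach thing in `things' {
        // the doubled quotes expand to the contents of the local whose name is stored in thing
        return local `thing' "``thing''"
    }
end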
And now let's recover the local `my_file':
defineMyLocals
local my_file `r(my_file)'
use "`my_file'"
That's it. Your locals.do consists of the definition of the program defineMyLocals. The overhead is obvious. First, adding new locals inside this definition requires the extra step of adding their names to the things list, so that defineMyLocals can return them. Second, using defineMyLocals is also slightly more complicated than a simple include, because you need to recover the locals you need explicitly, via `r()'.
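If a program needs several of these locals, the explicit recovery step can be wrapped in a loop. This is a sketch rather than part of the original recipe; it assumes defineMyLocals has been defined as above and that no other r-class command runs between the call and the loop:

```stata
defineMyLocals
foreach thing in file_path file_name my_file {
    // the inner macro is expanded first, giving r(file_path), r(file_name), r(my_file) in turn
    local `thing' "`r(`thing')'"
}
display "`my_file'"
```

The list in the foreach line names only the locals the caller actually needs, so it can be shorter than the things list inside defineMyLocals.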
That said, now your locals are defined inside a program, and you can move them around faster. If you're like me and you favor programs over do-files and this isn't your solution, then I'd be curious to see your preferred alternative.
No. 1 — February 26th, 2010 at 8:55 pm
Maybe I don't understand your use case. If you run any other program that stores its results to return space, you've got to call -defineMyLocals- again because your locals have been clobbered. Doesn't this negate the performance advantage?
I suppose I structure my work differently. I usually write my programs as -ado- files that are passed information via -syntax-; I rarely write programs midstream in a -do- file. When necessary, I will pass information via judicious use of -global- a la the -ml- suite of programs. But, then, when I encapsulate code via -program-, it typically performs a well-defined task of general utility such as a statistical procedure or a data manipulation.
Also, I use -include- in my -do- files rather liberally, and I haven't noticed a significant performance hit. Have you profiled this? How much speed are you gaining?
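The clobbering described in this comment is easy to reproduce. A minimal sketch, assuming defineMyLocals is defined as in the post and using the auto dataset that ships with Stata:

```stata
sysuse auto, clear
defineMyLocals
display "`r(my_file)'"    // the path is still stored in r()
summarize price           // summarize is r-class, so it replaces the contents of r()
display "`r(my_file)'"    // r(my_file) no longer exists; this prints an empty line
```

Copying the values into locals right after the call, as the post does, avoids the problem as long as nothing r-class runs in between.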
No. 2 — February 27th, 2010 at 2:31 pm
No, I hadn't profiled it, but I did now -- meaning, I wrote a do-file that ran both solutions 500, then 1000 times and checked run time. You're right that there's no noticeable performance gain in what I propose. I'm surprised. Thanks for bringing it up.
About my use case: I seldom write ado-files. Instead, I write do-files, but my usual do-file is structured as a collection of separate programs that call each other as needed. It doesn't always start that way, but that's what it looks like in its final form. This way I can avoid repeating code and I can debug pieces of it individually, without worrying about stuff that already works. I explained the general structure here a while back. When I make do-files reference each other, it's always the case that a parent do-file runs a child do-file and the child has no explicit program calls (it's missing Section 4, Program execution). The child, in other words, just makes some additional program definitions available to the parent.
If anybody's curious about my profiler, I put it all on snipt and I could embed it in a future post.
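For anyone who wants to repeat the comparison in the meantime, a rough sketch with Stata's built-in timer commands could look like the following. It is not the profiler mentioned above: the program name viaInclude and the repetition count are made up for illustration, locals.do is assumed to be in the working directory, and defineMyLocals is assumed to be defined as in the post.

```stata
capture program drop viaInclude
program viaInclude
    // the include-based variant, wrapped in a program
    include locals.do
end

timer clear

// timer 1: include inside a program
timer on 1
forvalues i = 1/1000 {
    viaInclude
}
timer off 1

// timer 2: r-class program plus explicit recovery of one local
timer on 2
forvalues i = 1/1000 {
    defineMyLocals
    local my_file "`r(my_file)'"
}
timer off 2

// show the elapsed time, in seconds, for each timer
timer list
```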
No. 3 — February 27th, 2010 at 3:16 pm
Thanks for the pointer to your previous post. Like you, I outline my do-files, and I do have master do-files that call other do-files in sequence. But I suspect that you take encapsulation farther than I do. If I recall correctly, you have clients to whom you provide regularly updated analyses, right? Whereas I often perform an analysis only once, but I want and need it to be reproducible by myself and others.
I adapted my workflow from J. Scott Long's The Workflow of Data Analysis Using Stata. (I'm still trying to get my colleagues to drink the Kool-Aid.) Have you read it? I think you would find it interesting.
I'm curious about your profiler! Please do embed it in a future post. Thanks!
No. 4 — May 22nd, 2012 at 11:38 pm
[...] in one place as local macros and returns them. I first started doing this back in 2010, as detailed here. At the time it seemed like a slick thing to do. I just assumed, wrongly, that this would be faster [...]