Do-file rules, revisited

Back in 2009 I wrote this post, detailing what at the time I thought would be a good way to write do-files. Some of the ideas there have stood the test of time. Others haven't. The changes are driven by Stata's evolution, by new things I've learned and by changes in my own work. This is a quick review.

First, as of Stata 12, you don't need to set memory anymore. Second, clear should now be replaced by clear all. In addition, J. Scott Long recommends that you type macro drop _all right after it. I know this because I'm reading his Workflow book right now. I know I'm three years late (two if you count from when one of this blog's readers first recommended it to me). I'm still finding useful stuff there. Next, as Jess suggested in the comments thread to the original post, I now use set varabbrev off.
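Put together, the top of a do-file following these rules might look like the sketch below. The version line is my own assumption, not something the rules above require; drop it or change the number to match your Stata.

```stata
* Sketch of a do-file header under the revised rules.
* "version 12" is an assumption added for illustration.
version 12

clear all            // replaces the old plain "clear"
macro drop _all      // J. Scott Long's suggestion: drop leftover global macros
set varabbrev off    // variable names must now be typed in full
```

Note that there is no set memory line anymore.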

Finally, my do-files no longer have a Globals section. Instead, there's now a program that defines all macros in one place as local macros and returns them. I first started doing this back in 2010, as detailed here. At the time it seemed like a slick thing to do. I just assumed, wrongly, that this would be faster at execution time than the original solution that gave me the idea (using a separate do-file, called with include).

The staying power of having a program for defining locals came not from its execution speed, but from its versatility. You can define all the locals you want inside a program you call, say, setLocals. If you need more locals as the requirements for your code grow, you just pile them on inside this program, and remember to also return them.

Then, whenever any specific local macro is needed, you call setLocals and only recover the `r()' value that you need then. Locals can be substituted for the obvious things -- like operating system-specific file paths or hard-coded numbers -- and also for names of programs you define somewhere else. This will also spare you the inconvenience of reading a do-file where `this' local shows up all of a sudden and if it's not obvious what it holds, you must work your way up to see where it was defined: if all locals are defined in setLocals, you will always know where to look.
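Here is a minimal sketch of the idea. The macro names and values (dataDir, minObs, "data/raw", 30) are made up for illustration; only setLocals is a name used in this post.

```stata
* Define every local in one r-class program and return each of them.
capture program drop setLocals
program define setLocals, rclass
    version 12
    * Hypothetical values: a file path and a hard-coded number
    local dataDir "data/raw"
    local minObs  30
    * Anything not returned here is invisible to the caller
    return local dataDir "`dataDir'"
    return local minObs  `minObs'
end

* Elsewhere: call setLocals and recover only the value needed right now.
setLocals
local dataDir "`r(dataDir)'"
display "Reading data from `dataDir'"
```

Copying `r(dataDir)' into a local right after the call is deliberate: the next r-class command you run will overwrite the contents of r().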

This is probably a terrible way to use memory, but the convenience of having all my locals defined in one place and any arbitrary subset of them available with the same simple call to setLocals is well worth it. You can extend this model in all sorts of ways, with some care. You can, for example, give setLocals an optional argument (using something as flexible as syntax [anything]) so its behavior is changed according to whether the argument is present. For example, a call to setLocals on its own will return a default set of local macros that apply everywhere; a call to setLocals andAlsoTheseOtherLocals will return the default macros plus a set defined inside a second program, andAlsoTheseOtherLocals, to be used in some specific context.
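One way to sketch the optional-argument version is below. It assumes that the argument is the name of one or more r-class programs, and it uses return add to merge their r() results into what setLocals returns. The macro names and values (dataDir, minObs, reportDir) are hypothetical.

```stata
* A second program holding context-specific locals (hypothetical content)
capture program drop andAlsoTheseOtherLocals
program define andAlsoTheseOtherLocals, rclass
    version 12
    return local reportDir "reports"
end

capture program drop setLocals
program define setLocals, rclass
    version 12
    syntax [anything]
    * Default locals, returned on every call
    return local dataDir "data/raw"
    return local minObs  30
    * If program names were passed, run each one and fold its r() results in
    foreach prog of local anything {
        `prog'
        return add
    }
end

* Defaults only
setLocals
display "`r(dataDir)'"

* Defaults plus the extra set
setLocals andAlsoTheseOtherLocals
display "`r(dataDir)' and `r(reportDir)'"
```

The care mentioned above applies here: return add does not replace a result that setLocals has already set, so if the second program reuses a default's name, the default wins.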

4 Responses to “Do-file rules, revisited”

  1. Nick Cox writes:

    Just a comment on "if all locals are defined in setLocals, you will always know where to look".

    If all locals are defined earlier in any of my programs, as they are, I know exactly where to look too, and it's in the same file.

    Am I misunderstanding something radically? This to me sounds like a quirky way to use locals. I don't see the benefit.

  2. Gabi Huiber writes:

    The benefit is nil in academic research, or in one-off projects. In those cases, you're right to do what you do already: one do-file, all the locals defined at the top. But in industrial applications that have to run multiple times while their requirements evolve, your code must be more modular than that. Suppose you try to run a linear regression with a matrix M1 consisting of dependent variable y1 and a set of regressors X1 collected over time for a given country. Now suppose you get data for another n matrices for the same country, and need to repeat the analysis. Next you add another c countries, but not all of them are of interest at every run of this process, and within each run, not all of their matrices are of interest: sometimes you're curious about country a, set M1, and country b, set M3. At this point, to make sure that you don't have inconsistent code, you need modularity. You start defining programs that you can unit-test and run selectively. If there are a lot of programs, you group them inside separate do-files in ways that make sense -- this do-file for data import, this other for transformations, this other for actual analysis, and another for reporting. If you're in this situation, having one setLocals program to hold all these little knobs you need to turn becomes pretty useful. It looked quirky to me too at first, but I just couldn't think of a more practical solution.

  3. Nick Cox writes:

    There is plenty of call for consistency in academic research too! But it now sounds as if you are re-inventing globals.

  4. Gabi Huiber writes:

    I guess I am reinventing a kind of ephemeral globals. Is this what they mean by garbage collection?
