Do-file rules — one suggestion

I've gone through several iterations with my idea of best do-file practices, and I'm sure I'll go through some more before I retire. But right now, here's where I stand:

My do-files start with a handful of header commands that I found useful at various times. They might look something like this:

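clear                // start with no data in memory
set more off         // do not pause output at the more prompt
set type double      // new variables default to double precision
set mem 100m         // memory request; ignored by Stata 12 and later
pause on             // let pause commands in the code take effect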

This stuff varies a bit occasionally (I may set matsize or specify the version) but it's a bit like a cover sheet, in that it's almost always the same thing. I could have equally well put all of this in profile.do.

Next come the main sections of the do-file:

   Overview
   Globals
   Program definitions
   Program execution

The Overview section is all comments. It's got a structure of its own. It always consists of three parts. The first is just a list of the names of the programs defined in this do-file. The second describes what each program does, in one paragraph per program. The third describes which programs are called explicitly (because some programs on the list can be components of others) and in what order. The first part, the quick list, is useful for when you're fishing for code you might want to recycle. If you give your programs descriptive names, a quick look at the top of any do-file is usually enough to tell you if you're going to find useful stuff there.
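As a rough sketch, an Overview block with those three parts might look like this. The program names are made up for illustration, and the whole thing is one comment, so Stata skips it.

```stata
/*
OVERVIEW

Programs defined in this do-file:
    load_raw
    clean_prices
    summarize_prices

What each program does:
    load_raw          Reads the raw data into memory.

    clean_prices      Calls load_raw, then drops observations
                      with a missing price.

    summarize_prices  Displays summary statistics for price.

Programs called explicitly, in order:
    1. clean_prices      (uses load_raw as a component)
    2. summarize_prices
*/
```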

The Globals section defines the macros such as file paths that will be used by more than one program. Since local macros are local to programs, anything that you mean to be shared across multiple programs must be a global. Having them all bundled here has another advantage. If you send your do-file to be run on another computer, all your file path changes are made in one place, once.
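A minimal Globals section could look like the sketch below. The folder names and the show_paths program are hypothetical placeholders; the point is only that the root path is typed once and everything else is built from it.

```stata
* Globals: shared by more than one program, defined once.
* All paths are hypothetical; edit the first line on a new computer.
global root    "C:/projects/example"
global rawdata "$root/data/raw"
global output  "$root/output"

* Any program defined later can refer to the shared paths.
capture program drop show_paths
program define show_paths
    display "Raw data folder: $rawdata"
    display "Output folder:   $output"
end

show_paths
```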

The Program definitions section does just what you might expect. Programs here can be stand-alone things that you call explicitly in the last section, or can be components, called implicitly. Defining such components is useful when you need to use the same code more than once. If that code is broken, you only need to fix it in one place.
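Here is one way a shared component might look, using the auto dataset that ships with Stata. All three program names are invented for this example. The component keep_valid_prices is never called directly; the other two programs call it.

```stata
* Component: called implicitly by the two programs below.
capture program drop keep_valid_prices
program define keep_valid_prices
    drop if missing(price)
end

* Stand-alone program: summary for domestic cars (foreign == 0).
capture program drop domestic_summary
program define domestic_summary
    sysuse auto, clear
    keep_valid_prices
    summarize price if foreign == 0
end

* Stand-alone program: summary for foreign cars (foreign == 1).
capture program drop foreign_summary
program define foreign_summary
    sysuse auto, clear
    keep_valid_prices
    summarize price if foreign == 1
end

* Explicit calls, as they would appear in the last section.
domestic_summary
foreign_summary
```

If the rule for what counts as a valid price changes, only keep_valid_prices needs editing.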

You might want two types of comments with your program definitions. One is a header before the program define line (or, if you're cautious, before the capture program drop line) that tells you at least whether your program takes any arguments, lists them if it does, and tells you a little about each of them. You'd want to know, at a minimum, which ones are string and which are numeric. The other is a set of in-line comments, throughout the program definition, as needed. I find it useful to explain any local macros I declare with a couple of words at least.
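A sketch of both comment types on one small program follows. The program top_values and its arguments are hypothetical, and the header layout is just one possibility.

```stata
* ---------------------------------------------------------------
* top_values
* Arguments:
*   varname  (string)   name of a numeric variable in memory
*   n        (numeric)  how many of the largest values to list
* ---------------------------------------------------------------
capture program drop top_values
program define top_values
    args varname n
    * last: number of rows to list, capped at the number of observations
    local last = min(`n', _N)
    * sort descending so the largest values come first
    gsort -`varname'
    list `varname' in 1/`last'
end

sysuse auto, clear
top_values price 5
```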

The Program execution section does the actual work. It has consequences on disk and on screen.
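To make "on disk and on screen" concrete, here is a small sketch. The program make_extract is made up for the example, and it writes to a temporary file so that running it leaves nothing behind.

```stata
* Program definition (hypothetical): saves a small extract of auto.dta.
capture program drop make_extract
program define make_extract
    args outfile
    sysuse auto, clear
    keep make price mpg
    save "`outfile'", replace
end

* Program execution
tempfile extract
make_extract "`extract'"    // on disk: writes the extract
use "`extract'", clear
describe                    // on screen: shows what was written
```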

I settled on this way of writing do-files after I took an online class on C++ at NC State. I treat Stata programs inside a do-file the way one would treat functions inside a .cpp file. My Program execution section is the equivalent of main(). In C++, a function must be declared before it is called, so function declarations (also known as prototypes) commonly go at the top of the source file. My do-file equivalent for those is the Overview section. Except, of course, Overview is not mandatory at all. It consists solely of comments.

Having to submit to this sort of discipline might strike you as negating the benefits of Stata's easygoing nature. After all, if you pined for structure, you'd be programming in SAS, where it's mandatory. 

Well, there are two very good reasons for structure: one is that your code must be portable across your team; another is that it must be readable two weeks later, when you will have forgotten all about it. But I still wouldn't want it imposed upon me by the design of the programming environment. Neither of those very good reasons overrides the importance of on-the-job fun. When you program, you want flow. You need to be free to write up things as they come to you. The programming environment should accommodate that. Only in the tidying-up stage, when your thinking's done and your problem's solved, should you need to worry about structure.

Programming environments that impose structure on you, as opposed to letting you volunteer it, result in beautiful code that takes a long time to write and robs you of most of the pleasure of solving the original problem. That loss of pleasure might well cause you to do a mediocre job of it. When help like this also costs you more in licensing fees and specialized labor, that's just adding insult to injury.

3 Responses to “Do-file rules — one suggestion”

  1. Jess writes:

    This is a great post. I don't have the background in computer programming that you do, but I have come to find structuring my .do files in the same way (pretty similar to yours) really helps keep me organized. I'm probably not as strict about it as I should be, but I almost always eventually go back and fill in the blanks anyway.

    One header command that I have recently started adding to my .do files is the -set varabbrev off- command. Of course, you may want this toggled on, but I've caught errors in my .do files by going back and adding this command. If you're working with a large dataset, or panel data where many variables have similar names, it's easy to think you have one variable in your dataset when you actually have another. If -set varabbrev- is toggled on (as is the default), a typo could easily be overlooked and the variable you're using may not be what you think.
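A small illustration of the difference, using the auto dataset that ships with Stata. The abbreviation pri is chosen only for this example; in auto.dta it matches price and nothing else.

```stata
sysuse auto, clear

* With abbreviation on (the default), "pri" silently resolves to price.
set varabbrev on
summarize pri

* With abbreviation off, the same command fails. capture traps the
* error so the do-file keeps running, and _rc holds a nonzero code.
set varabbrev off
capture summarize pri
display "return code with varabbrev off: " _rc
```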

  2. Gabi Huiber writes:

    Thank you for "set varabbrev off". It is an excellent idea. Variable name abbreviation looks convenient, but I was never wild about it. It bit me several times on collaborative projects. It just never occurred to me to look into whether turning it off would be an option. Thanks again.

  3. Consider ado-files | A Stata Mind writes:

    [...] A while ago I suggested a particular do-file architecture that seemed to work well for me at the time. The post is here. [...]
