Program vs. include smackdown
Sunday, 28 February 2010
When it comes to defining local macros in a different place from where you use them, you have two options: a do-file that you include as needed, or an r-class program that you call as needed. I discussed this in an earlier post and said that a program is the better choice, without any evidence to back up that claim. A reader called me on it, so I went and checked. Turns out he's right. Below is how I went about it:
I wrote a do-file, called work.do, that just uses a dataset, nothing more. The dataset's name is held in a local macro defined in a separate do-file, locals.do, which work.do includes. Then I wrote locals_program.do, which does the same job via an r-class program, and work_program.do, which calls that program by name and picks up the result from r(). Finally, I wrote profiler.do, which runs locals_program.do once to define the program, then runs work_program.do and work.do 100 to 1500 times each and measures how long each batch takes. According to profiler.do, the include was never slower than the program, and it was about one second faster in three of the five runs. The timings only resolve to whole seconds, so the gap is small. Below is the code:
// work.do starts here:
include locals.do
use "`my_file'"
// locals.do starts here
local file_path "C:/work/romanian papers circulation figures/data/"
local file_name "file_combined.dta"
local my_file "`file_path'`file_name'"
// this is locals_program.do
capture program drop defineMyLocals
program defineMyLocals, rclass
    local file_path "C:/work/romanian papers circulation figures/data/"
    local file_name "file_combined.dta"
    local my_file "`file_path'`file_name'"
    local things "file_path file_name my_file"
    foreach thing in `things' {
        return local `thing' ``thing''
    }
end
// this is work_program.do
defineMyLocals
local my_file `r(my_file)'
use "`my_file'"
// and finally, this is profiler.do
set more off
cd "c:/work/programming/putterin/profiler"
// SECTION 1: OVERVIEW
/*
1. What's going on:
"program" vs. "include" profiler. this do-file defines
two programs that will measure the performance difference
in defining locals separately using two alternate ways
(1) locals are defined in a do-file called with include
(2) they are defined in a program called by name
The exercise is to "use" a file. This file is called via
a handle defined as a local macro. That definition can be
either in a do-file included (1) or in a program called
by name (2).
2. Programs defined here and their dependencies:
    runProfile1
        defineMyLocals // defined in locals_program.do
    runProfile2
*/
// SECTION 2: GLOBALS
// SECTION 3: PROGRAM DEFINITIONS
// ### programs defined elsewhere and called via "run"
// ### locals_program.do defines the program named
// ### defineMyLocals, which returns a file handle
// ### as an r() local.
run locals_program.do
// ### work_program.do uses a file whose handle
// ### comes from calling defineMyLocals and
// ### retrieving an r() local.
capture program drop runProfile1
program runProfile1
    args counter
    // c(current_time) is hh:mm:ss, so elapsed time resolves to whole seconds
    local time_start=tc("`c(current_date)'" "`c(current_time)'")
    forvalues i=1/`counter' {
        run work_program.do
    }
    drop _all
    local time_end=tc("`c(current_date)'" "`c(current_time)'")
    di ""
    di "Profile 1 (program), `counter' reps"
    di "Time elapsed (ms): "`time_end'-`time_start'
end
// ### work.do includes locals.do, which
// ### defines a file handle as a local.
// ### work.do uses that file by calling
// ### that local.
capture program drop runProfile2
program runProfile2
    args counter
    // c(current_time) is hh:mm:ss, so elapsed time resolves to whole seconds
    local time_start=tc("`c(current_date)'" "`c(current_time)'")
    forvalues i=1/`counter' {
        run work.do
    }
    drop _all
    local time_end=tc("`c(current_date)'" "`c(current_time)'")
    di ""
    di "Profile 2 (include), `counter' reps"
    di "Time elapsed (ms): "`time_end'-`time_start'
end
// SECTION 4: PROGRAM CALLS
local cycles "100 200 500 1000 1500"
foreach cycle in `cycles' {
    forvalues i=1/2 {
        runProfile`i' `cycle'
    }
}
And that's it. Put all five files into the same directory, point the cd line in profiler.do at that directory, change the path and file-name locals in both locals.do and locals_program.do to match your own dataset, and see what you get on your machine. Below is my output:
Profile 1 (program), 100 reps
Time elapsed (ms): 2000
Profile 2 (include), 100 reps
Time elapsed (ms): 2000
Profile 1 (program), 200 reps
Time elapsed (ms): 4000
Profile 2 (include), 200 reps
Time elapsed (ms): 3000
Profile 1 (program), 500 reps
Time elapsed (ms): 9000
Profile 2 (include), 500 reps
Time elapsed (ms): 8000
Profile 1 (program), 1000 reps
Time elapsed (ms): 18000
Profile 2 (include), 1000 reps
Time elapsed (ms): 17000
Profile 1 (program), 1500 reps
Time elapsed (ms): 26000
Profile 2 (include), 1500 reps
Time elapsed (ms): 26000
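One caveat on these numbers: c(current_time) is reported as hh:mm:ss, so every elapsed time above is a whole number of seconds, and a one-second gap between the two profiles sits right at the limit of what this clock can resolve. Below is a minimal sketch of a finer-grained variant built on Stata's timer command. The program name runProfileTimed and the use of timer slot 1 are placeholders of mine, not part of the five files above, and the sketch has not been run against the data behind the output above, so treat it as a starting point rather than a result.
// a possible timed variant (hypothetical, not one of the five files)
capture program drop runProfileTimed
program runProfileTimed
    args dofile counter
    // reset and start timer slot 1 (slot number is arbitrary)
    timer clear 1
    timer on 1
    forvalues i=1/`counter' {
        run `dofile'
    }
    timer off 1
    // timer list leaves the elapsed seconds in r(t1)
    quietly timer list 1
    di "`dofile', `counter' reps: " %9.3f r(t1) " seconds"
    drop _all
end
// defineMyLocals must exist before work_program.do is run
run locals_program.do
runProfileTimed work_program.do 1000
runProfileTimed work.do 1000
Passing the do-file name as an argument replaces the two near-identical runProfile programs with one, and moving drop _all after the timer stops keeps the cleanup out of the measurement.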
No. 1 — March 4th, 2010 at 4:21 pm
Thanks for posting your do-files and logs.
Do you plan to switch to using include files instead of programs? Or do you think that the encapsulation benefits justify the (quite modest) performance hit?
No. 2 — March 4th, 2010 at 10:27 pm
Hard to say. Encapsulation looked like a great idea at the time. It allowed me to run things piecemeal in the debugging stage, and to run complex projects with the order of operation self-enforced. But it has a mixed record for code reuse, and things that look simple enough on the first draft do-file can turn into the most arcane puzzle after I'm done encapsulating the snot out of them. So often encapsulation looks like this bad friend who will let me get away with plugging holes in my reasoning with wads of code. On the other hand, the proof is in how well the finished product works, and how quickly I got it there. On that account, spiking my do-files with programs has done more good than harm so far, I think. You've read J. Scott Long's book; what's his view on the matter? I'm scrambling to read Christopher Baum's before somebody recalls it. I know he favors plain do-files.
No. 3 — April 5th, 2010 at 8:16 pm
To the best of my recollection--the book's at the office--Long doesn't explicitly address the tradeoff between the two approaches. He spends much more time on master do-files and include-files than programs. I suspect that's as much due to the aims of his particular book as due to his opinion on -program- vs. -include-, though.