<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Stata Things &#187; extended macro function</title>
	<atom:link href="http://enoriver.net/index.php/tag/extended-macro-function/feed/" rel="self" type="application/rss+xml" />
	<link>http://enoriver.net</link>
	<description>computing for fun and profit</description>
	<lastBuildDate>Fri, 13 Aug 2010 20:42:39 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Set theory and the extended macro function list</title>
		<link>http://enoriver.net/index.php/2009/02/19/set-theory-and-the-extended-macro-function-list/</link>
		<comments>http://enoriver.net/index.php/2009/02/19/set-theory-and-the-extended-macro-function-list/#comments</comments>
		<pubDate>Thu, 19 Feb 2009 20:37:51 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[extended macro function]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=575</guid>
		<description><![CDATA[I have a project where I need to do some things to all files whose names start with "sub" but not "subs". The former are files I receive, the latter are files I produce. When it occurred to me that their common file name root might cause problems, I rewrote my code to name the [...]]]></description>
			<content:encoded><![CDATA[<p>I have a project where I need to do some things to all files whose names start with "sub" but not "subs". The former are files I receive, the latter are files I produce. When it occurred to me that their common file name root might cause problems, I rewrote my code to name the files formerly known as "subs" to something different but (to me) still descriptive, and I picked "snap".</p>
<p>The program that presents this little problem cleans up after itself. Every week, if it's Friday, it deletes whatever "sub", "subs", "snap" or what have you files it finds that are older than <em>n</em> weeks. So, eventually, after the change above, all the "subs" files will disappear. There will only be "snap" files left in their place, and the ambiguity between "sub" and "subs" will no longer be an issue.</p>
<p>In the meantime, though, I need a piece of code that lists file names that start with "sub" but not with "subs", which is easy enough. But it needs to keep working even after the last "subs" file has left the disk. I also want it to make use of things already declared elsewhere. I basically want to be free to forget I ever wrote it long after it has outlived its usefulness but, should it turn out to be needed again for some reason, I want it to pick up where it left off.  So here's how the - (minus) operator of the extended macro function <code>list</code> can handle that:</p>
<p>Right now, in the directory in question, there are four strings that file names can start with: "rat", "sub", "subs" and "snap". The first two are file names I don't want to change. Every week there will be a new file starting with either string in that directory. The last two, like I mentioned, I have more control over. The set of file names that start with "snap" is empty now but will fill up over time, and the set of file names that start with "subs" is not empty yet, but it's headed that way. I don't care about the "rat" files for this particular purpose, but whatever is irrelevant should not have to be explicitly excluded. Here's the code:<br />
<code><br />
local from=strlower("`c(pwd)'")<br />
local whichfiles "rat sub subs snap"<br />
foreach name in `whichfiles' {<br />
    local `name'list: dir "`from'" files "`name'*"<br />
//  local `name'count: list sizeof `name'list<br />
//  di "`name' count is ``name'count'"<br />
}<br />
local sub sub                        // (1)<br />
local dropthese: list whichfiles-sub // (2)<br />
di "`dropthese'"<br />
foreach name in `dropthese' {<br />
//  di "`name'"<br />
    local sublist: list sublist-`name'list<br />
//  local subcount: list sizeof sublist<br />
//  di "now sub count is `subcount'"<br />
}</code></p>
<p>This assumes that your directory of interest is c(pwd) -- type <code>creturn list</code> in Stata if this makes no sense. Notice that you have to declare the string "sub" as a local in order for the list operator minus to work -- see lines commented (1) and (2).</p>
<p>The beauty of the minus operator is that it acts the way you would expect it to from set theory. It will leave `sublist' unchanged if:</p>
<p>a) ``name'list' is empty (as in the file names that start with "subs" after the last such file has been deleted) or</p>
<p>b) ``name'list' has no elements in common with `sublist' (as in the file names that start with "rat", right now) or</p>
<p>c) both (the case of the file names that start with "snap" right now; remember I don't yet have any such files, and an empty list trivially shares nothing with `sublist').</p>
<p>So I don't need to declare a `whichfiles' list with strictly the file names of interest now -- "sub" and "subs". The minus operator will gracefully handle everything. If I happen to have saved a more expansive `whichfiles' list in some other spot, for some other reason, then I can just use it from there. The lines I commented out were useful for testing, but will be a nuisance when this is live. I could have equally well enclosed the whole thing in <code>quietly {}</code>. Come to think of it, that would have been more elegant.</p>
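<p>Here is a minimal sketch of those three cases that you can paste into the command window. The file names are made up for illustration; they are not the ones in my directory:<br />
<code><br />
local sublist "sub1.dta sub2.dta subs1.dta" // hypothetical result of dir ... "sub*"<br />
local subslist "subs1.dta"                  // hypothetical result of dir ... "subs*"<br />
local ratlist "rat1.dta"                    // shares nothing with sublist<br />
local snaplist ""                           // empty<br />
local onlysub: list sublist - subslist<br />
di "`onlysub'"                              // sub1.dta sub2.dta<br />
local same1: list sublist - ratlist<br />
local same2: list sublist - snaplist<br />
di "`same1'"                                // sublist, unchanged<br />
di "`same2'"                                // sublist, unchanged<br />
</code></p>
<p>Only the subtraction of `subslist' removes anything. The other two leave `sublist' exactly as it was, which is why the loop above can run over every name in `whichfiles' without special cases.</p>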
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2009/02/19/set-theory-and-the-extended-macro-function-list/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Encode uses and pitfalls</title>
		<link>http://enoriver.net/index.php/2008/09/05/encode-uses-and-pitfalls/</link>
		<comments>http://enoriver.net/index.php/2008/09/05/encode-uses-and-pitfalls/#comments</comments>
		<pubDate>Sat, 06 Sep 2008 02:45:46 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[encode]]></category>
		<category><![CDATA[extended macro function]]></category>
		<category><![CDATA[regular expression]]></category>
		<category><![CDATA[type]]></category>

		<guid isPermaLink="false">http://host1.tld/?p=54</guid>
		<description><![CDATA[You won't always receive data saved in the most efficient format and there is no relief in sight. Bandwidth and hard drive sizes are growing unabated and making this only more likely, not less. It's common, for example, that data sets have string variables with a limited range of values. Suppose you have a file [...]]]></description>
			<content:encoded><![CDATA[<p>You won't always receive data saved in the most efficient format and there is no relief in sight. Bandwidth and hard drive sizes are growing unabated and making this only more likely, not less. It's common, for example, that data sets have string variables with a limited range of values.</p>
<p>Suppose you have a file of 200,000 newspaper subscribers, half of them to The New York Times, the other half to the Washington Post, and they are identified as such by a string variable named "paper" with two values -- "The New York Times" showing up 100,000 times and "Washington Post" another 100,000 times.</p>
<p>If "paper" were instead an integer variable with value labels "The New York Times" for 1 and "Washington Post" for 2, it would be a lot more practical. Labels need only be remembered once and your computer will have an easier time reading 1 and 2 than it would parsing the corresponding strings "The New York Times" and "Washington Post" in every single one of the 200,000 observations. The fix is simple:</p>
<p><code>encode paper, gen(x)<br />
drop paper<br />
rename x paper</code></p>
<p>There are some options available with <code>encode</code> (you can see them in the Stata help), but this will do the job: "paper" now has values 1 and 2 for the two papers. You can now <code>save, replace</code> and all is well: the dataset is smaller, it loads more quickly, and if you want to look at it with the data browser you will still see "Washington Post" instead of 2 on screen -- it simply shows up in blue, not red.</p>
<p>Now suppose you have fifty such data sets, with subscribers spanning some 150 different newspapers. And suppose you have a master list of newspapers that has 150 observations in it, where "paper" is in its original, string type. Now you have a problem. Encoding "paper" here will give you an integer with values between 1 and 150 and with value labels guaranteed to not match those in your data sets. In a data file this small, you may want to leave "paper" in string type.</p>
<p>There's another problem. Suppose you need to append together two data sets -- the Times/Post one in the example above and another, that combines subscribers to Baltimore Sun and Chicago Tribune. If "paper" is encoded in both sets, it has the same range of values -- 1 and 2. In the first set 1 means "The New York Times" and in the second "Baltimore Sun". Not a good thing, that.</p>
<p>You could turn "paper" back into a string in both datasets, append them, then encode it again. Now it has values ranging from 1 to 4, labeled from "Baltimore Sun" to "Washington Post" in alphabetical order. But if you do this a few times with different data sets, you won't be able to keep your values and labels straight. This is just so you're aware of the problem. There are ways around it, and in all fairness sometimes the memory savings from encoding string variables aren't worth the hassle. The growing availability of 64-bit computing and the falling cost of RAM work in your favor too.</p>
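<p>A rough sketch of that round trip, assuming the two encoded datasets are saved as "times_post.dta" and "sun_tribune.dta" (both file names are hypothetical) and that "paper" carries a value label in each, as it does after <code>encode</code>:</p>
<p><code>use "sun_tribune.dta", clear<br />
decode paper, gen(paper_str) // labels back to strings<br />
drop paper<br />
tempfile second<br />
save `second'<br />
use "times_post.dta", clear<br />
decode paper, gen(paper_str)<br />
drop paper<br />
append using `second'<br />
encode paper_str, gen(paper) // fresh codes for the combined set<br />
drop paper_str</code></p>
<p>The new codes are only valid for this combined file. Any other dataset that was encoded separately will still disagree with it, which is the bookkeeping problem described above.</p>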
<p>Of course now you have the same variable in different types across datasets. What do you do if you need to operate on it inside a do-file loop that picks it out of datasets regardless of its type? One way to avoid having to remember which type "paper" is in for any given dataset is to use the extended macro function <code>type</code>. It works like this:</p>
<p><code> local paper_type: type paper<br />
if regexm("`paper_type'","str") {<br />
* perform operations on paper as if it were of type string<br />
}<br />
else {<br />
* perform operations on paper as if it were of type numeric<br />
}<br />
</code></p>
<p>Notice the <code>regexm()</code> expression. Stata does regular expression matching. That's a matter for another post.</p>
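<p>If you would rather not use a regular expression here, the same test can be sketched with <code>substr()</code>, since Stata's string storage types all start with "str". The branch bodies below are only placeholders, and the <code>decode</code> assumes that a numeric "paper" has a value label attached:</p>
<p><code>local paper_type: type paper<br />
if substr("`paper_type'",1,3)=="str" {<br />
di "paper is stored as `paper_type'"<br />
}<br />
else {<br />
* numeric: make a string copy to work with<br />
decode paper, gen(paper_str)<br />
}<br />
</code></p>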
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2008/09/05/encode-uses-and-pitfalls/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A way to handle runaway project scopes</title>
		<link>http://enoriver.net/index.php/2008/09/03/a-way-to-handle-runaway-project-scopes/</link>
		<comments>http://enoriver.net/index.php/2008/09/03/a-way-to-handle-runaway-project-scopes/#comments</comments>
		<pubDate>Thu, 04 Sep 2008 03:33:00 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[dir]]></category>
		<category><![CDATA[extended macro function]]></category>

		<guid isPermaLink="false">http://host1.tld/?p=34</guid>
		<description><![CDATA[The extended macro function dir will make Stata run whatever files it finds in a given directory. You don't have to enumerate those files. Sometimes that is useful. Say you start out with a simple idea, like running an OLS regression. And say your dependent variable and some of the regressors are in one file, [...]]]></description>
			<content:encoded><![CDATA[<p>The extended macro function <code>dir</code> will make Stata run whatever files it finds in a given directory. You don't have to enumerate those files. Sometimes that is useful.</p>
<p>Say you start out with a simple idea, like running an OLS regression. And say your dependent variable and some of the regressors are in one file, and the rest of the regressors are in another.</p>
<p>It's a good idea to keep your do-files short and sensible and to have them follow the general logic of the project. In this case you might split the work between two separate do-files: one collects the data needed into one joint dataset, the other runs the actual regression. Appropriate file names and judicious comments make it easy to quickly see what each do-file does and how, and all is well.</p>
<p>Then complications come up. The client remembers there's a third file with some regressors of potential interest, but it's in an exotic format and it has a bunch of stuff in it you don't need; or the client wants a summary report that's elaborate enough to warrant a third do-file.</p>
<p>Clients like it when you can gracefully accommodate their changing requirements. One easy way to do this is to keep a master do-file that points Stata to the project code directory and has it run whatever do-files it finds there without explicitly enumerating them. The extended macro function <code>dir</code> can do that. If you add a third project do-file, <code>dir</code> will find it and the master do-file will run it along with the others. Combine two do-files in one and <code>dir</code> will roll with that too.</p>
<p>For this to work properly you need to help <code>dir</code> read your files in the right succession. One way to do that is to give each target do-file a suggestive name -- like "merge first and second source file.do" -- and then prepend its place in the queue. If the file in this example is first in the process, simply call it "step 1 merge first and second source file.do" and number the remaining do-files in a similar fashion. That's helpful regardless of your usage of <code>dir</code> -- just give yourself two weeks away from the project and try to remember what goes where.</p>
<p>A <code>dir</code>-enabled master do-file will look something like this:</p>
<p><code>#delimit ;<br />
global my_project_root=strlower("C:/My Stata Projects/This Project/Do files/");<br />
* list the do-files found in the project code directory;<br />
local run_these: dir "${my_project_root}" files "*.do";<br />
foreach dofile in `run_these' {;<br />
do "${my_project_root}`dofile'";<br />
/* or, if this won't work, try compound quotes:<br />
do `"${my_project_root}`dofile'"';<br />
*/<br />
};<br />
#delimit cr<br />
</code></p>
<p>Notice a few things: first, I do this <code>strlower()</code> thing, because Stata's extended macro function <code>dir</code> likes its file paths in all lower case; second, I use UNIX-style forward slashes to delimit nodes in the directory path; third, when your directory and file names include spaces, you have to make Stata use adorned strings. OK, the last two items may require some additional explanations.</p>
<p>On using forward slashes: I think this is a good habit to pick up. Windows doesn't care either way, and forward slashes eliminate the possibility of misleading Stata into thinking that a directory delimiter is an escape character. A backslash right before a macro reference, as in a path that ends in a backslash followed by `dofile', tells Stata not to expand that macro. That alone is a good reason to just never use backslashes in file paths.</p>
<p>Adornments are Stata speak for preserved quotation marks. Adorned strings obviously require compound quotes. How you compound them takes a little trial and error. That is a bit of a pain, but there's no way around it if you insist on spaces in file names.</p>
<p>If, for example, you have two do-files, one called "foo 0.do" and the other "bar 1.do", adornments help the dir function correctly establish that the local macro `run_these' is a list of two items (as in ""foo 0.do" "bar 1.do"") and not a string of four words (as in "foo 0.do bar 1.do"). Adornments and compound quotes make it possible to use spaces in file names. Spaces help legibility. It's up to you whether that benefit is worth the extra cost of more complicated syntax.</p>
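<p>You can check the difference for yourself with a quick sketch like the one below. The list is typed in by hand here, using the two file names from the example, instead of coming from <code>dir</code>:</p>
<p><code>local run_these `""foo 0.do" "bar 1.do""'<br />
local n_items: list sizeof run_these<br />
di `n_items' // 2: two quoted items<br />
local plain "foo 0.do bar 1.do"<br />
local n_words: list sizeof plain<br />
di `n_words' // 4: four separate words<br />
foreach dofile in `run_these' {<br />
di `"`dofile'"' // one whole file name per pass<br />
}<br />
</code></p>
<p>The compound quotes around the first definition are what let the inner quotation marks survive, so <code>foreach</code> sees two file names rather than four fragments.</p>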
<p>But that was a digression. The point of this post was the extended macro function <code>dir</code>. This function allows you to have Stata read whatever do-files you put in a directory, so that if your project scope changes over time your list of do-files is free to change along with it. I think that's a neat option to have.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2008/09/03/a-way-to-handle-runaway-project-scopes/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
