Set theory and the extended macro function list
Thursday, 19 February 2009
I have a project where I need to do some things to all files whose names start with "sub" but not "subs". The former are files I receive, the latter are files I produce. When it occurred to me that their common file name root might cause problems, I rewrote my code to name the files formerly known as "subs" to something different but (to me) still descriptive, and I picked "snap".
The program that presents this little problem cleans up after itself. Every week, if it's Friday, it deletes whatever "sub", "subs", "snaps" or what have you files it finds that are older than n weeks. So, eventually, after the change above, all the "subs" files will disappear. There will only be "snap" files left in their place, and the ambiguity between "sub" and "subs" will no longer be an issue.
In the meantime, though, I need a piece of code that lists file names that start with "sub" but not with "subs", which is easy enough. But it needs to work even after the last "subs" file will have left the disk. I also want it to make use of things already declared elsewhere. I basically want to be free to forget I ever wrote it long after it outlived its usefulness but, should it turn up to be needed again for some reason, I want it to pick up where it left off. So here's how the - (minus) operator of the extended macro function list can handle that:
Right now, in the directory in question, there are four strings that file names can start with: "rat", "sub", "subs" and "snap". The first two are file names I don't want to change. Every week there will be a new file starting with either string in that directory. The last two, like I mentioned, I have more control over. The set of file names that start with "snap" is empty now but will fill up over time, and the set of file names that start with "subs" is not empty yet, but it's headed that way. I don't care about the "rat" files for this particular purpose, but whatever is irrelevant should not have to be explicitly excluded. Here's the code:
local from=strlower("`c(pwd)'")
local whichfiles "rat sub subs snap"
foreach name in `whichfiles' {
local `name'list: dir "`from'" files "`name'*"
// local `name'count: list sizeof `name'list
// di "`name' count is ``name'count'"
}
local sub sub // (1)
local dropthese: list whichfiles-sub // (2)
di "`dropthese'"
foreach name in `dropthese' {
// di "`name'"
local sublist: list sublist-`name'list
// local subcount: list sizeof sublist
// di "now sub count is `subcount'"
}
This assumes that your directory of interest is c(pwd) -- type creturn list in Stata if this makes no sense. Notice that you have to declare the string "sub" as a local in order for the list operator minus to work -- see lines commented (1) and (2).
The beauty of the minus operator is that it acts the way you would expect it to from set theory. It will leave `sublist' unchanged if:
a) ``name'list' is empty (as in the file names that start with "subs" after the last such file will have been deleted) or
b) ``name'list' is not a proper subset (as in the file names that start with "rat", right now) or
c) both (the case of the file names that start with "snap" right now; remember I don't yet have any such files).
So I don't need to declare a `whichfiles' list with strictly the file names of interest now -- "sub" and "subs". The minus operator will gracefully handle everything. If I happen to have saved a more expansive `whichfiles' list in some other spot, for some other reason, then I can just use it from there. The lines I commented out were useful for testing, but will be a nuisance when this is live. I could have equally well enclosed the whole thing in quietly {}. Come to think of it, that would have been more elegant.
No. 1 — March 5th, 2009 at 12:37 pm
Thanks for the great post.
I had a similar problem where I am receiving constantly many .xls files (with protected sheets) with sequential names "file1" "file2" and need to put them all into one stata dataset. I modified your code above to read in all the csv files, clean them up, and append them to one stata dataset (I did have to manually change all of them to .csv files because I haven't yet figured out how to get stata to read the .xls or .xlsx files without stat transfer).
Thanks for the help, you've saved me a lot of time.
No. 2 — March 5th, 2009 at 10:56 pm
Glad to help. As to getting data from Excel to Stata, I am curious where the snag is. I'm assuming that you have already googled the matter, and maybe searched this blog for
odbc loadas well.No. 3 — March 6th, 2009 at 12:43 am
Yeah, I've tried several approaches. I am using a Mac (and the 64-bit version of Stata) and I think this is where I ran into the snag with the ODBC drivers. The only driver I've found is for Open Office (OpenBase). I"ve gotten this driver to load, but I've not been successful in getting the data loaded into Stata via the odbc load commands. More over, my preference is to go from excel to stata without trying to import it into Open Office first.
I saw a post on statalist a couple of days ago about the unavailability of 64 bit drivers for the Mac, but if I could find a 32bit sql driver for Mac then I'd switch back the 32-bit version of Stata.