Using Mata for string processing
My friend Dan Blanchette showed me a little Mata function yesterday that he wrote for changing the case -- lower, upper, proper -- for strings longer than 244 characters. It was fresh in my head today as I went looking for something while babysitting my daughter -- can't remember what; babysitting requires undivided attention -- and ended up here.
This post is the result of the conversation I started in the comment thread with Gabriel Rossman. I will attempt to use Mata for string processing within a suitably large text file, as opposed to just a blob of text you can call as a local.
Step 1: google "very large text file". This took me to a magical place where the 1980's are preserved in perpetuity. I went through the categories, and picked this one. At exactly 12,133 lines, it should do nicely.
Step 2: get the Mata book -- because I still run Stata 10 at home, so no pdf documentation yet.
Step 3: muck around. Eventually I came up with this thing:
mata
real scalar checkmatch(string scalar theFile, string scalar thePattern)
{
real scalar n,i,check
string matrix A
A=cat(theFile)
n=rows(A)
check=0
for(i=1; i<=n; i++) {
if(strmatch(A[i,1],thePattern)) {
check=1
return(check)
}
}
return(check)
}
end
This is a Mata function that returns 1 if a string pattern is found anywhere in a given text file, and 0 otherwise. It makes use of Mata's built-in cat() function, which reads an ASCII file of n lines into a column vector of n string elements, one for each line in the original file. I want checkmatch() to exit with 1 as soon as it first finds the string pattern it's looking for. I'm guessing that the first return(check), inside the if clause, does it, but I'm not sure.
With a text file this big, the 0 case might be the harder one to test, but if you're fishing for patterns you're unlikely to find in an English-language document no matter how big, a Hungarian word is a pretty good bet. So, this is the output:
. mata: checkmatch(`"dostech.pro"',`"*BIOS*"')
1
. mata: checkmatch(`"dostech.pro"',`"*Kolozsvar*"')
0
Now for a real illustration of Mata's string and file processing capabilities, see here.