<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Stata Things &#187; Mata</title>
	<atom:link href="http://enoriver.net/index.php/tag/mata/feed/" rel="self" type="application/rss+xml" />
	<link>http://enoriver.net</link>
	<description>computing for fun and profit</description>
	<lastBuildDate>Tue, 31 Jan 2012 20:03:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>My first useful Mata function</title>
		<link>http://enoriver.net/index.php/2009/07/28/my-first-useful-mata-function/</link>
		<comments>http://enoriver.net/index.php/2009/07/28/my-first-useful-mata-function/#comments</comments>
		<pubDate>Wed, 29 Jul 2009 02:12:08 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[cluster analysis]]></category>
		<category><![CDATA[Mata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=818</guid>
		<description><![CDATA[Or so I thought. I'm working on a cluster analysis project. There are multiple data sets, they are massive, and there are several variable subsets by which one could plausibly cluster the observations. Agglomerative hierarchical clustering is the way to go when you don't have any notion of how many clusters there should be, but [...]]]></description>
			<content:encoded><![CDATA[<p>Or so I thought.</p>
<p>I'm working on a cluster analysis project. There are multiple data sets, they are massive, and there are several variable subsets by which one could plausibly cluster the observations.</p>
<p>Agglomerative hierarchical clustering is the way to go when you don't have any notion of how many clusters there should be, but it is impractical for large data sets -- its run time and memory use grow at least as fast as O(n-squared). For big data sets you want to use some partition clustering method, such as k-means. This, in turn, has the drawback that you need to specify how many clusters you want (never mind the bigger issues of overlaps, or unequal-sized clusters for now; you're in the exploration stage).</p>
<p>So you take a suitably small, suitably stratified random sample of the data, and use hierarchical clustering on it, just to figure out how many clusters there might be in the original population. Whatever number you get, that's what you use for k-means clustering on the full data set.</p>
<p>Stata offers several options for picking the number of clusters. The default is to pick the number of clusters that maximizes the Calinski-Harabasz pseudo-<em>F</em> index. That's easy enough to look up in the table, but how do you make Stata store it?</p>
<p>To see what I'm talking about, run this first (call it code snippet 1):</p>
<pre><code>
drop _all
use http://www.stata-press.com/data/r10/physed
set varabbrev on
cluster averagelink flex speed strength, name(avglnk)
cluster stop avglnk, matrix(a)
</code></pre>
<p>You see a table; now it's clear that you want 4 clusters, because that's what corresponds to the highest pseudo-F value. OK, now you need to get Stata to find it for itself, and keep it in mind somehow.</p>
<p>Time for some Mata, I figured, after I asked the Statalist for advice and didn't trouble myself with waiting for it. So I wrote up this:</p>
<pre><code>
// Mata function for getting value in col i
// on row that corresponds to max in col j
// (that is, the i neighbor of the max in j)
capture mata mata drop maxneighbor()
mata
real scalar maxneighbor(real matrix A, real scalar i, real scalar j)
{
  real scalar k, max
  max=colmax(A)[1,j] // largest value in column j
  k=1
  while(A[k,j]&lt;max) { // walk down to the first row that holds it
     k=k+1
  }
  return(A[k,i])
}
mata mosave maxneighbor(), dir("${adoroot}") replace
end
</code></pre>
<p>Feeling all good about it, I thought I'd trumpet it on the Statalist too, only to get this response:</p>
<pre><code>
sort(A,j)[rows(A),i]
</code></pre>
<p>Really, this one-line thing does the exact same job as the slab of code I proposed above. Good thing I'm a big proponent of learning by doing; it does wonders for one's self-esteem.</p>
<p>So, for my documentation and yours, below is how both work. You enter Mata for calculations, and then you need to have Mata send the result to Stata. Since the result we're after is a scalar (the number 4 in this example), we use Mata's st_numscalar() function. The Statalist example first (after code snippet 1, shown above):</p>
<pre><code>
mata
A=st_matrix("a")
st_numscalar("maxneighbor",sort(A,2)[rows(A),1]) // here's the one-liner
end
di maxneighbor // and here's your value
</code></pre>
<p>With my own function, the code above would have been</p>
<pre><code>
mata
A=st_matrix("a")
st_numscalar("maxneighbor",maxneighbor(A,1,2))
end
di maxneighbor
</code></pre>
<p>As you can see, in the implementation phase the two functions look like they'd take about the same amount of effort to use. That said, it's still a bit silly to build obscure functions of your own concoction when standard tools for doing the same job already exist.</p>
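<p>Once the scalar is in Stata, it can feed the k-means step directly. Here is a minimal sketch, assuming code snippet 1 and either of the Mata blocks above have already run. The local <code>k</code> and the cluster name <code>km</code> are made up for illustration, and on a real project the k-means call would run on the full data set rather than on the sample:</p>
<pre><code>
local k = scalar(maxneighbor) // number of clusters picked by the pseudo-F rule
cluster kmeans flex speed strength, k(`k') name(km)
tabulate km // how many observations landed in each cluster
</code></pre>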
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2009/07/28/my-first-useful-mata-function/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Putting your Mata functions to work</title>
		<link>http://enoriver.net/index.php/2009/03/10/putting-your-mata-functions-to-work/</link>
		<comments>http://enoriver.net/index.php/2009/03/10/putting-your-mata-functions-to-work/#comments</comments>
		<pubDate>Tue, 10 Mar 2009 18:58:16 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[Mata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=622</guid>
		<description><![CDATA[Yesterday I showed you how to write Mata functions. Today we will look at how they work with Stata. In interactive mode, a Mata function like mymulti() is called simply as  mata: mymulti(st_matrix("first"),st_matrix("second")) This assumes that the two matrices are Stata matrices previously declared and currently in memory. Mata and Stata matrices are different things. [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I showed you how to write Mata functions. Today we will look at how they work with Stata.</p>
<p>In interactive mode, a Mata function like <a href="http://enoriver.net/index.php/2009/03/09/finally-dabbling-in-mata/">mymulti()</a> is called simply as </p>
<p><code>mata: mymulti(st_matrix("first"),st_matrix("second"))</code></p>
<p>This assumes that the two matrices are Stata matrices previously declared and currently in memory. Mata and Stata matrices are different things. Mata needs this translator function, st_matrix(), to read Stata matrices. The call above will output to screen the matrix resulting from multiplying the two matrices, if they're conformable, or will give you an error message if they're not.</p>
<p>What if you want to save this result into a proper Stata matrix? You will want to set up a shell matrix of the correct size, using the J() matrix function, as in</p>
<p><code>matrix third=J(`r',`c',0)</code></p>
<p>The `r' and `c' locals will have to be right -- `r' will be the number of rows of <code>first</code>, `c' the number of columns of <code>second</code>. Once that's in place, Mata will replace the zeroes with all the right figures like so:</p>
<p><code>mata: st_matrix("third",mymulti(st_matrix("first"),st_matrix("second")))</code></p>
<p>As you guessed, Mata uses the same st_matrix() translation function for two different jobs. If you call it with a single string argument, it will attempt to read the Stata matrix with that name. If you call it with two arguments separated by a comma, it will replace the Stata matrix named in the first argument (or create it, if no matrix by that name exists yet) with the contents of the matrix given as the second argument -- which can either be a currently existing matrix, or a result derived from a function like mymulti().</p>
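<p>Both jobs fit on one line. Below is a small sketch with two made-up 2 by 2 matrices; it uses Mata's built-in <code>*</code> operator in place of mymulti() so that it runs on its own:</p>
<pre><code>
matrix first=(1,2\3,4)
matrix second=(5,6\7,8)
matrix third=J(2,2,0) // shell matrix of the right size
// the one-argument calls read first and second; the two-argument call writes third
mata: st_matrix("third",st_matrix("first")*st_matrix("second"))
matrix list third
</code></pre>
<p>The listing should show 19 and 22 on the first row, and 43 and 50 on the second.</p>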
<p>Then there are all the other things you can already do with Stata's programming capabilities. You could, for example, encapsulate all the Mata business into programs that your clients can use, blissfully oblivious that there's Mata stuff under the hood, like so:<br />
<code><br />
capture prog drop matrixMultiplication<br />
prog def matrixMultiplication<br />
</code><code><br />
version 10<br />
args first second<br />
mata: mymulti(st_matrix("`first'"),st_matrix("`second'"))<br />
</code><code><br />
end<br />
</code><br />
I kept the same argument names for convenience, but you probably figured out already that they're local to the Mata function and the Stata program respectively. You could have defined the program above as<br />
<code><br />
capture prog drop matrixMultiplication<br />
prog def matrixMultiplication<br />
</code><code><br />
version 10<br />
args foo bar<br />
mata: mymulti(st_matrix("`foo'"),st_matrix("`bar'"))<br />
</code><code><br />
end<br />
</code><br />
And it would have worked just the same. Finally, if the client's matrices are called A and B, that's fine. They can be multiplied either with<br />
<code><br />
mata: mymulti(st_matrix("A"),st_matrix("B"))<br />
</code><br />
or with<br />
<code><br />
matrixMultiplication A B<br />
</code><br />
That is the beauty of encapsulation. Locals are local to whatever function or program definition uses them. You can call them any names you want outside those definitions, and they will still work. This is very helpful if you write numerous pieces of separate code that need to work with each other -- or if you have several people working on the same project. They only have to share their functions' and programs' names and syntax and need not worry about any of the guts.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2009/03/10/putting-your-mata-functions-to-work/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Finally dabbling in Mata</title>
		<link>http://enoriver.net/index.php/2009/03/09/finally-dabbling-in-mata/</link>
		<comments>http://enoriver.net/index.php/2009/03/09/finally-dabbling-in-mata/#comments</comments>
		<pubDate>Mon, 09 Mar 2009 21:40:15 +0000</pubDate>
		<dc:creator>Gabi Huiber</dc:creator>
				<category><![CDATA[Stata]]></category>
		<category><![CDATA[Mata]]></category>

		<guid isPermaLink="false">http://enoriver.net/?p=601</guid>
		<description><![CDATA[I'm taking a discrete math class at NC State and today I had a homework assignment due that had to do with matrix algebra. I didn't feel like doing it with paper and pencil, but the point of it was that I was supposed to understand how things like matrix multiplication worked, so I couldn't [...]]]></description>
			<content:encoded><![CDATA[<p>I'm taking a discrete math class at NC State and today I had a homework assignment due that had to do with matrix algebra. I didn't feel like doing it with paper and pencil, but the point of it was that I was supposed to understand how things like matrix multiplication worked, so I couldn't just whip up <br />
<code><br />
mata<br />
a=(1,2\3,4)<br />
b=(5,6\7,8)<br />
c=a*b<br />
c<br />
end</code></p>
<p>This presented an excellent opportunity to see how Mata programming worked. Turns out it works like C programming, which will mean nothing to you if you don't program in any of the usual general-purpose, high-level programming languages. And that, I think, is the reason why my first attempt at understanding what use Mata was, back in 2006, ended in abject failure.</p>
<p>Mata programmers and Stata users mean different things by the same words. I started out as a Stata user without any programming education. I was in grad school, taking econometrics and trying to get through it one day at a time. To me a variable was a member of a data set (as in <code>sysuse auto.dta</code> and then <code>describe mpg</code>). Programming without a data set was too abstract a thing to countenance. I did the right things -- used do-files, kept notes, commented the code -- and over the years I became a competent Stata user, but my Stata vocabulary grew without any idea how that fit into the larger field of computer science. It grew around my applied econometrics needs.</p>
<p>Then I saw Mata. It was talking about functions, but these weren't mathematical functions; it was also talking about variables, but there was not a word about the data set in memory; and it bragged about how fast it was and why, as in "it's compiled into bytecode". Huh?</p>
<p>Whatever. Let's pretend that Stata doesn't know how to do matrix multiplication, and we must help it out with a Mata function that would do that.  Let's call it <code>mymulti()</code>. In this post I will go over programming this function. Then in a future post I will show how you put it to work.</p>
<p>You write Mata code in a regular do-file. You let Stata know that you're about to write Mata code by typing <code>mata</code> and you tell it that you're done with that by typing <code>end</code>. Everything between <code>mata</code> and <code>end</code> is Mata code: different syntax, different rules. Here's the do-file for defining <code>mymulti()</code>:<br />
<code><br />
// Mata function for matrix multiplication<br />
capture mata mata drop mymulti()<br />
mata<br />
real matrix mymulti(real matrix A, real matrix B)<br />
{<br />
  real scalar r, c, i, j, q<br />
  real matrix C<br />
  if(cols(A)==rows(B)) {<br />
    r=rows(A)<br />
    c=cols(B)<br />
    C=J(r,c,0) // shell matrix (r by c zeroes)<br />
    for(i=1;i&lt;=r;i++) {<br />
      for(j=1;j&lt;=c;j++) {<br />
        for(q=1;q&lt;=cols(A);q++)<br />
          C[i,j]=C[i,j]+A[i,q]*B[q,j]<br />
      }<br />
    }<br />
    return(C)<br />
  }<br />
  else {<br />
    exit(error(503)) // conformability error<br />
  }<br />
}<br />
mata mosave mymulti(), dir(c:/ado/personal/m) replace<br />
end</code></p>
<p>The <code>capture mata</code> line does the same job as <code>capture program drop</code>: it flushes mymulti() out of Stata's memory so on subsequent runs of this do-file you don't get an "already exists" error message.</p>
<p>Matrix multiplication returns a value: the resultant matrix. A function that returns a value should be declared as being of the same type as the value returned (leave the declaration out and Mata assumes the catch-all type, transmorphic matrix). That is why the next line starts with "real matrix". Had mymulti() performed a multiplication of real numbers and returned a real number as a result, its type declaration would be "real scalar".</p>
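<p>For comparison, here is a sketch of what such a scalar version might look like. The name mytimes() is made up for this example:</p>
<pre><code>
capture mata mata drop mytimes()
mata
real scalar mytimes(real scalar x, real scalar y)
{
  return(x*y) // a real scalar, matching the declared return type
}
mytimes(3,4)
end
</code></pre>
<p>Calling it with 3 and 4 displays 12.</p>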
<p>The arguments of mymulti() are two matrices of real numbers. They too need to be declared as such. All the code that defines how this function will work and what it will return goes between a set of curly brackets. That's how Mata knows where a function's definition starts and ends. It is a matter of convention, not a requirement, that the open curly bracket at the start of a function definition's body gets to have its own line, while open curly brackets that help the flow of control -- whether in if-else statements or in loops -- follow in-line. It helps readability.</p>
<p>Inside this function there's some bookkeeping.  If the two matrices are conformable, a scalar named r will be assigned the number of rows of the first matrix; another, named c, will be assigned the number of columns of the second matrix. A shell matrix full of zeroes, of size r by c, will be generated and then filled in with the actual elements of the resultant matrix. The scalars r and c and the matrix C are internal to mymulti(). Programmers sometimes call them variables; other times they call them data objects. Whatever they're called, they should be declared (which is another word for introduced) before you use them. Mata only insists on this when <code>matastrict</code> is set on, but it is good practice either way. Variables of the same type can be declared on the same line, separated by commas.</p>
<p>Notice that the innermost nested loop, with the counter q, does not need curly brackets. That is a convention borrowed from the C programming language: you only need curly brackets if the loop body holds more than one statement. Also notice the Mata equivalent of Stata's <code>forvalues</code> loop syntax. This <code>for()</code> spelling with two semicolons and an increment operator (++) is also straight from the C language.</p>
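<p>To make the comparison concrete, here is a sketch of the same three-step loop written both ways. The counter name and the range are arbitrary. Note that in Stata the counter is a local macro and needs the `i' quoting, while in Mata it is an ordinary variable:</p>
<pre><code>
// Stata: forvalues with a local macro as the counter
forvalues i=1/3 {
  display `i'
}
// Mata: C-style for() with a plain variable as the counter
mata
for(i=1;i&lt;=3;i++) {
  printf("%g\n",i)
}
end
</code></pre>
<p>Both loops print 1, 2 and 3, one number per line.</p>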
<p>Finally, your function can be saved for future use in either the PERSONAL folder (type <code>sysdir</code> in Stata to see where yours is) or along some existing file path declared explicitly, as shown above. I use Stata's convention of saving my .ado and .mo files in subfolders of PERSONAL named after the first letter of any such file names.</p>
<p>There's an important difference between .ado and .mo files, and it is what makes the latter faster. An .ado file is interpreted: you can open it in Notepad++ and you will see familiar Stata syntax -- your original code as written at the time. You will see no such thing if you try that with a .mo file. Your code, unless you saved your do-file, is gone. What you have here is <a href="http://en.wikipedia.org/wiki/Bytecode">bytecode</a>. That is a good thing.</p>
<p>Traditionally, the core Stata commands were written in C and the rest were written in Stata's usual syntax, also known as ado language. That made Stata, even before including any user-generated commands, partially open-source: you could see the source of any command implemented as an .ado file as clear as daylight. That's nice, but not the reason you buy Stata. You want a fast and reliable statistics and data management package. You got it, but Mata makes it better. Stata syntax is interpreted. This makes it easy to use and write code in, but slower to run than a compiled alternative would be. Mata offers such an alternative. As more and more of Stata's commands move from the .ado language to Mata implementation, Stata will be incrementally faster.</p>
]]></content:encoded>
			<wfw:commentRss>http://enoriver.net/index.php/2009/03/09/finally-dabbling-in-mata/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

