Note to self: ssh does not like loose home directory permissions

This really is a note to myself, regarding my FreeBSD setup. I use this section to jot down things that I expect to forget, because I use them rarely. Google is alright, but note-taking won't hurt anything. So here:

A couple of days ago PuTTY started to demand password authentication, which it shouldn't, because I normally log in with a public-private key pair. That was a good excuse to refresh my keys anyway, so I generated a new pair and overwrote the authorized_keys file. I hoped, again and without any basis in reason or logic, that the problem would go away as quietly as it turned up. It did not.

Fiddling with /etc/ssh/sshd_config didn't help either. What did help was fixing the home directory permissions: running ls -l /usr/home as the superuser showed soon enough that they were drwxrwxr-x, when they should have been no looser than drwxr-xr-x. With StrictModes on, which is the default, sshd checks the ownership and modes of your home directory and key files before it accepts a public key, and a group-writable home directory fails that check. I can't remember when I changed the permissions or why, but the problem is fixed with chmod 755 /usr/home/username. That this was indeed the fault of loose permissions was confirmed by a quick examination of /var/log/auth.log. The idea to look there came from here. Some further clarifications on ssh, including a sample config file, are available here.
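
For next time, here is a minimal sketch of that check as a shell function. The name fix_home_perms is made up for this note and is not part of OpenSSH. It reads the mode string from ls -ld and removes the group and other write bits only if one of them is set, which on a drwxrwxr-x directory gives the same result as chmod 755.

```shell
#!/bin/sh
# fix_home_perms DIR  (hypothetical helper, not an OpenSSH tool)
# Report whether DIR is group- or world-writable and, if so, tighten it.
fix_home_perms() {
    dir=$1
    # First ten characters of ls -ld, e.g. drwxrwxr-x
    perms=$(ls -ld "$dir" | cut -c1-10)
    case $perms in
        # character 6 is the group write bit, character 9 the other write bit
        ?????w????|????????w?)
            echo "loose: $perms $dir"
            chmod go-w "$dir"
            ;;
        *)
            echo "ok: $perms $dir"
            ;;
    esac
}

# Usage (as the superuser, or as the owner of the directory):
#   fix_home_perms /usr/home/username
```

Using go-w instead of a fixed 755 is a design choice: it leaves a home directory that is already stricter, say 700, alone.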

The other thing I learned from /var/log/auth.log is that I am getting hit by a flood of unauthorized login attempts from the weirdest places every day. So far, so good.
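
To see where the flood comes from, something like the sketch below should do. It assumes that failed attempts are logged on lines containing "Failed password" and that the source address follows the word "from", which is how sshd usually writes them. Other setups may log differently, so treat the pattern as an assumption. The function name count_failed_logins is made up for this note.

```shell
#!/bin/sh
# count_failed_logins LOGFILE  (hypothetical helper)
# Print the number of failed password attempts per source address,
# busiest address first.
count_failed_logins() {
    grep 'Failed password' "$1" |
        # print the token that follows the first "from" on each line
        awk '{ for (i = 1; i < NF; i++) if ($i == "from") { print $(i + 1); break } }' |
        sort | uniq -c | sort -rn
}

# Usage:
#   count_failed_logins /var/log/auth.log | head
```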


Quantity and quality in CS research

The other day I ran across this ranking of computer science research institutions worldwide. It includes academic, private sector and government entities combined. What's nice about it is that it includes a measure of quantity -- publications -- and one of quality -- citations. This got me wondering about three questions:

First, does publication output exhibit diminishing citation returns in computer science? If it did, it would mean that researchers at the most prolific institutions have both the incentives and the support to write a lot, and in the process they churn out a lot of forgettable, seldom-cited stuff. If it did not, it would mean that the most prolific places are pretty good at quality control: people there write a lot, but whether because they're smart in the first place or because they have an active seminar circuit, what makes it into publications is usually good, often-cited stuff.

Second, how evenly are the goods spread across the field? Are the top PhD programs doing all the publishing and getting all the citations, or is there good work being done and getting noticed at lesser-ranked places too? In the latter case, if you're thinking about grad school it won't be a waste of your time to get your education at one of the more modest state schools.

Third, if you could come up with a quantitative answer to the previous question, would you expect it to be different between the sciences and the humanities? There's a subset of labor economics research that's been trying to figure out whether education really builds human capital (teaches you things) or it just certifies it (e.g., if you went to Harvard we know you're smart and we don't care what you did there). The evidence so far suggests that it does a bit of both. Some of us suspect that sciences do a little more building and a little less certifying, and in humanities it's the other way around. If we're right, then we should see that in the sciences the publications and citations are more evenly spread across the research institutions than they are in the humanities. To settle this question we would need a standardized measure of output inequality. The Gini coefficient, suggested below, is such a measure.

As far as I can tell from this data set, in CS the answer to the first question is that publications show increasing citation returns. If you take the ratio of citations to publications you can show pretty dramatically that the most productive schools are also cited the most, with the mean citations per publication dropping from about 5 to about 2 within the first 400 ranked institutions, and staying there for the remaining 600. However, there is a fair amount of variation about this mean. In other words, the superstars of the field are sprinkled all over the top 1,000 institutions, which suggests that the answer to the second question is that publications and citations are quite evenly spread across the ranked institutions:

OK, but what does "quite evenly spread" mean? You expect the higher-ranked places to be more productive than the lesser-ranked ones, but how much more? Absent a frame of reference -- e.g. data from a few other related fields like math, electrical engineering, etc. -- we simply don't know. But one might as well start building that frame of reference with one data point, CS, and hope that if the question is interesting enough, the exercise will be improved upon and then replicated in these other fields by someone else.

You could start with a look at the numbers: the top 50% of these 1,000 institutions produce 82% of the publications and 89% of the citations. Is that a lot? Maybe it is. But at first glance it seems that the distribution of the publication output is about as unequal as that of the citation output: the top guys do get more citations, as shown above, but the difference across the entire top half of the ranking is not very dramatic. How about splitting the data into quartiles? Or deciles?

Deciles are nice because drawing the cumulative percentages of the output of interest from one decile to the next will produce the Lorenz curve for that output. For CS publications and citations, this is what they look like:

The curvature of the bottom edge of the blue areas -- let's call them sails -- gives an idea of how unequally the output is distributed across the producers. The bigger the dip (and therefore the larger the blue sail) the larger the share of the total output credited to the top producers. The ratio of the area of the sail to that of the triangle under the 45-degree line is the Gini coefficient: a measure of output inequality. By construction, the area of the triangle is equal to 1/2, so the Gini coefficient is just twice the area of the blue sail. That sail can't be larger than 1/2, so the Gini coefficient is a number between 0 and 1, which makes it a standardized measure.

If the sails are about the same size in the two graphs, then you can say that the top producers of publications aren't any more cited than those of more humble ranking. If instead the citation sail is smaller, then you could say that more prolific institutions tend to have less-cited output. Here you can see that the citations sail is larger, which confirms that the more prolific institutions are also more cited. I figure that the respective Gini coefficients are .63 for citations and .53 for publications, by approximating the blue sails with 1000 little rectangles of width=.001 each.
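
The rectangle approximation can be sketched in a few lines of shell and awk. This is not the exact calculation behind the numbers above, only an illustration of the idea. It assumes one output value per line on standard input (say, publications per institution), and it uses one rectangle per institution, so 1,000 institutions give 1,000 rectangles of width .001. The function name gini is made up for this sketch.

```shell
#!/bin/sh
# gini  (hypothetical helper)
# Reads one non-negative number per line and prints the Gini coefficient,
# computed as twice the area between the 45-degree line and the Lorenz curve.
gini() {
    sort -n | awk '
        { x[NR] = $1; total += $1 }
        END {
            if (NR == 0 || total == 0) { print "NA"; exit }
            cum = 0; sail = 0
            for (i = 1; i <= NR; i++) {
                cum += x[i]                  # cumulative output of the bottom i producers
                # rectangle of width 1/NR, height = 45-degree line minus Lorenz curve
                sail += (i / NR - cum / total) / NR
            }
            printf "%.4f\n", 2 * sail
        }'
}

# Usage, with a hypothetical file holding one count per line:
#   gini < publications.txt
```

With four producers whose outputs are 1, 2, 3 and 4, the cumulative shares are .1, .3, .6 and 1, the sail adds up to .125, and the function prints 0.2500.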

Now, I don't know what a reasonable difference between the two Gini coefficients might be, or how big they each should be in an absolute sense. If similar publications/citations data were available in other fields, we could get an idea of the observed ranges of Gini coefficients there. With data on enough fields somebody might answer my third question.


Good to know

The user-written commands you download to your ado/plus directory are updated once in a while on the RePEc server they come from. So, after you findit a command and then net install it, your copy might need to be refreshed occasionally. That is what adoupdate, update does.

I was reminded of this when I tried to run freduse today. Actually, the problem that reminded me of it -- a "not found" error message thrown when the command invoked the Mata function _fredifinparse() -- didn't go away, but "adoupdate, update" is still a good thing to do. What did fix _fredifinparse() is described here.


The limits of encapsulation

I just read this. I liked it. It put some bit of anguish I've been having into clearer words than I could.

My Stata code between 2000 and 2006 consisted exclusively of do-files that put to work either standard Stata commands or user-written commands from the SSC. There was not a single program definition anywhere and things worked alright. These do-files were pretty elaborate and their functionality overlapped a fair bit, but that was never that much of an inconvenience.

Then in early 2007, during my brief tenure at RTI Health Solutions, that way of working showed its limitations when I tried to program in plain Stata matrices something that was normally being done in GAUSS. It had to do with the design of factorial experiments and my project ended in an instructive kind of failure, because it got me started on using Stata programs. I still like those things. I can define them once and then nest and have them call each other every which way to my heart's content. They take arguments, return values, and generally they make you feel like a real programmer.

Then in 2008 I had my introduction to C++, and the instructions were clear: break down a problem in small morsels; use as many functions as you need; if a function definition fills up a screen, it's way too big, so break it down further; encapsulation is a good thing. Then came header files, namespaces, classes, templates, the works. It was an extreme kind of validation of the way I had started to do business, and my enthusiasm for modular code only grew from there.

Then, about a year ago, I started running into problems. Component programs can be debugged individually, sure, and you only need to fix them once, in one place, which is great. In fact, if they're small and simple enough, you don't even need to do that; they just work. But with complicated projects you're going to have so many interlocked small and simple programs that it will just be too hard to keep tabs on which programs call which, where, and why. It's also pretty expensive to write them in such a way that they can talk with not just one other program, but are truly universal within the context of the given problem.

So I'm not sure anymore that I would recommend my way of writing Stata code to everybody. It still has its uses, but I can see a growing number of circumstances where it's simply not worth the trouble.


Your Linux VM can talk to your Windows PC

If you run a virtual Linux machine as a VMware appliance, and you have VMware Tools installed, you can let it write to folders accessible from the Windows host. This takes two steps.

First, in VMware Player (as of 3.0) you edit the virtual machine settings -- enable Shared Folders in the Options tab, and add a host path there -- say C:\data\share_with_vm. You can add several distinct Windows paths here.

Next, you add this line to /etc/rc.local right before the "exit 0" line:

mount -t vmhgfs .host:/ /home/user/Shares


Your Linux virtual computer will see any of the Windows folders you shared at the first step as children of the ~/Shares directory. Adding this line to /etc/rc.local makes Shares available to you as soon as you start the VM.
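
A slightly more defensive variant is sketched below, in case the mount point is missing or the share is already mounted. It assumes the mountpoint utility is available on the guest, which it is on most Linux distributions, and the function name mount_shares is made up for this note. The mount command itself is the one shown above.

```shell
#!/bin/sh
# mount_shares DIR  (hypothetical helper for /etc/rc.local)
# Create DIR if needed and mount the VMware shared folders on it,
# unless something is already mounted there.
mount_shares() {
    share_dir=$1
    mkdir -p "$share_dir" || return 1
    if mountpoint -q "$share_dir"; then
        echo "already mounted: $share_dir"
        return 0
    fi
    mount -t vmhgfs .host:/ "$share_dir"
}

# In /etc/rc.local, before the "exit 0" line:
#   mount_shares /home/user/Shares
```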

This is how I do Linux on my work computer now, after I first tried Cygwin, then had a dual-boot setup for a while.


“Edit with Vim” in Windows 7

Sooner or later I will have to make a permanent move from Windows XP to Windows 7. I have a spare hard drive for this sort of experiment, so I transferred my settings to it with the wonderful Windows Easy Transfer and watched to see what might break.

Sure enough, Vim promptly lost the ability to write any backup files (the ones with the tilde after the extension) into the temp directory as I originally intended. I'm not sure how to fix this, so I cobbled together the workaround below:

First, I figured that Vim's sudden inability to write to the temp folder must have been caused by this UAC stuff that started in Vista and annoyed a bunch of people at the time, so I went and disabled it with help from here. That did the trick, but with the side effect of neutering the "Edit with Vim" context menu entry: clicking on it would simply remind me to put gvim on the PATH, and then keep doing so even after I did add "C:\Program Files\Vim\vim72\" to the PATH. The context menu, by the way, is what you get when you right-click on a file name.

So, I had to tinker with the registry. That took three steps: first, back up the registry as explained here; next, eliminate the unresponsive "Edit with Vim" entry from the context menu as explained here; finally, add a working "Edit with Vim" entry to the context menu as explained here.

So far, it doesn't look like I wrecked anything. Vim is as good as I remember it from XP. It can run Stata do-files with my old settings out of the box. Next up: Subversion and TortoiseSVN.


Using Mata for string processing

My friend Dan Blanchette showed me a little Mata function yesterday that he wrote for changing the case -- lower, upper, proper -- for strings longer than 244 characters. It was fresh in my head today as I went looking for something while babysitting my daughter -- can't remember what; babysitting requires undivided attention -- and ended up here.

This post is the result of the conversation I started in the comment thread with Gabriel Rossman. I will attempt to use Mata for string processing within a suitably large text file, as opposed to just a blob of text you can call as a local.

Step 1: google "very large text file". This took me to a magical place where the 1980's are preserved in perpetuity. I went through the categories, and picked this one. At exactly 12,133 lines, it should do nicely.

Step 2: get the Mata book -- because I still run Stata 10 at home, so no pdf documentation yet.

Step 3: muck around. Eventually I came up with this thing:

mata
real scalar checkmatch(string scalar theFile, string scalar thePattern)
{
   real scalar n,i,check
   string matrix A
   A=cat(theFile)
   n=rows(A)
   check=0
   for(i=1; i<=n; i++) {
      if(strmatch(A[i,1],thePattern)) {
         check=1
         return(check)
      }
   }
   return(check)
}
end


This is a Mata function that returns 1 if a string pattern is found anywhere in a given text file, and 0 otherwise. It makes use of Mata's built-in cat() function, which reads an ASCII file of n lines into a column vector of n string elements, one for each line in the original file. I want checkmatch() to exit with 1 as soon as it first finds the string pattern it's looking for. The first return(check), inside the if clause, does that: return() ends the function on the spot, so the loop never looks past the first matching line.

With a text file this big, the 0 case might be the harder one to test, but if you're fishing for patterns you're unlikely to find in an English-language document no matter how big, a Hungarian word is a pretty good bet. So, this is the output:

. mata: checkmatch(`"dostech.pro"',`"*BIOS*"')
  1
. mata: checkmatch(`"dostech.pro"',`"*Kolozsvar*"')
  0


Now for a real illustration of Mata's string and file processing capabilities, see here.


Filling the gaps in your panel with winsor

I recently worked on a project where I had to model groundwater salinity as an indirect function of population growth. The idea is that more people will draw more fresh water from the aquifer; other things equal, saline water will displace it. I had to do this for sixteen counties in Southern Florida and my data -- on population, salinity, and water withdrawals, by year and by county -- had some gaps here and there.

A search on how to best fill them had to start with the Statalist archive. I quickly found this simple solution by Scott Merryman, which solved the core of the problem. But then things got complicated. First, I had several variables of interest and I had no reason to expect that they all would have a linear time trend. Second, some of my variables -- such as measured groundwater salinity -- showed pretty unbelievable outliers.

So I needed some kind of wrapper that would accommodate different variables, trends of different orders, and dummy variables to account for any differences between counties or groups of counties. And I needed a quick and easy way to deal with the outliers. The wrapper is below:

// #### higher-order predictions of anything against time.
// takes three arguments:
// 1. order, numeric= 1,2,3,4,5,etc. for linear, squared,
//    cubed, etc. trend
// 2. lhs, string, the name of the left-hand side variable:
//    salinity, usage, population, etc.
// 3. dummy, string -- name of the group identifier on the
//    right-hand side, e.g., county fips code.
capture prog drop getValueHat
program getValueHat

args order lhs dummy

levelsof `dummy', local(countem)
local checkit: list sizeof countem
// if `dummy' is not degenerated to one value, then use xi: regress
if `checkit'>1 {
   local regressors i.`dummy'*year1
   forvalues i=1/`order' {
      gen year`i'=year^`i'
      if `i'>1 {
         local regressors `regressors' i.`dummy'*year`i'
      }
   }
   xi: regress `lhs' `regressors'
}
// otherwise, just regress
else {
   local regressors year1
   forvalues i=1/`order' {
      gen year`i'=year^`i'
      if `i'>1 {
         local regressors `regressors' year`i'
      }
   }
   regress `lhs' `regressors'
}
predict `lhs'_hat
replace `lhs'_hat=max(0,`lhs'_hat)
if `checkit'>1 {
   drop _I*
}
forvalues i=1/`order' {
   drop year`i'
}

end

Now I can use the same code for different variables with different functional forms, which is nice. For example, if I want to fill gaps in salinity at the county level with a quadratic trend, this program call will get me the right salinity_hat:

getValueHat 2 salinity county

Dealing with the outliers was a far simpler matter: Nick Cox's winsor command -- findit winsor if you don't have it installed. The complete solution to my problem of gaps and outliers is:

winsor salinity, h(3) gen(x)
getValueHat 2 x county
rename x_hat salinity_hat
drop x

If, like me, you've never Winsorized before, here's the Wikipedia entry on the procedure.


Automated sanity checks

I am reading An Introduction to Stata Programming, by Christopher Baum.

He suggests, in Chapter 5.2, a nice do-file method to validate your data: you use pairs of list and assert. For example, suppose you know that a variable v should have no missing values. If it indeed does not, then assert !missing(v) should run without error. If it does, you want to know where they are: list if missing(v). Reversing the order of these two lines in a do-file will cause Stata to exit with an error, which alerts you that there are problems with your data, but not before it shows you where the problems are:

sysuse auto, clear
list if missing(make)
assert !missing(make)
list if missing(rep78)
assert !missing(rep78)


This is neat because you can always edit this do-file with a new pair of list/assert lines as needed. But, as the author mentions, sometimes a summarize is plenty helpful too, especially if you remember that it accepts a list of variables as an argument. You could do this for example:

sysuse auto, clear
sum make rep78


Wait. That didn't work too well, because summarize tells you nothing about make: it reports zero observations for a string variable, as if every value were missing. You would have known that make is a string if you had run describe before sum.

So, I guess, before you do any kind of data validation, describe is a good first step; you might also like codebook; I don't. I find it too wordy. But it does do the job of giving you information about the whole data set.

One alternative to wordy output when you have a specific question regarding more than one variable is to use little custom programs for data checks. One such example is countIfMissing, shown in my previous post. Another, inspired by Christopher Baum's use of summarize, might be sumIfNumeric:

capture prog drop sumIfNumeric
program sumIfNumeric

unab fullset: _all
local numset
foreach varble in `fullset' {
   capture confirm numeric variable `varble'
   if _rc==0 {
      local numset `numset' `varble'
   }
}
local check: list sizeof numset
if `check'>0 {
   sum `numset'
}
else {
   di "No numeric variables found in this dataset."
}

end


This might be useful when your data set comes with a bunch of variables -- some numeric, some string. Though describe will tell them apart easily enough, you may not care to list them explicitly. The usage is straightforward:

sysuse auto, clear
sumIfNumeric


Count missing observations

With one variable, that's easy enough: count if missing(variable-name). If you have several variables, you can put them in a foreach loop. But if you have to do this for arbitrary lists of variables in several files, it may be interesting to package that foreach loop inside a quick command that might handle special display instructions as well.

Here is one suggestion:

// countIfMissing: display the total count of observations, then
// any counts of missing observations for each variable in a list.
capture prog drop countIfMissing
program countIfMissing

version 11
syntax varlist

quietly count
local count=r(N)

// now make things align nicely
local sum=`count'
local tens=1
// >= rather than >, so that exact powers of ten (10, 100, ...) get their full digit count
while `sum'/10>=1 {
   local sum=`sum'/10
   local tens=`tens'+1
}
local width=`tens'+int(`tens'/3)
local varct: list sizeof varlist

di ""
di "Observations:"
di %`width'.0fc `count'
di ""
di "Missing:"
foreach varble in `varlist' {
   qui count if missing(`varble')
   local ct=r(N)
   local pct: di %4.2fc 100*`ct'/`count'
   // test the count, not the rounded percentage, so that rare missings still show
   if `ct'>0 {
      di %`width'.0fc `ct' " `varble' (`pct'%)"
   }
   else {
      local varct=`varct'-1
   }
}
if `varct'==0 {
   local offset=`width'+2
   di _column(`offset') "none of `varlist'"
}
di ""

end

For an example of usage, you can try this:

sysuse auto
local myvars "make price foreign"
countIfMissing `myvars'
countIfMissing m*       // (1)
countIfMissing _all     // (2)

As you can see in (1) and (2), the usual varlist conveniences apply here.
