## How I backed up a bunch of old pictures to Amazon Glacier

This is from a home server that runs Fedora 14, to which I have ssh access from my MacBook Pro.

1. I `git clone`'d this.

2. Then, as super-user, I called

``````
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python
``````

as instructed here, to install the `setuptools` module.

3. Then, also as super-user, I called

``````
python setup.py install
``````

4. At this point, it was time to fill out the .glacier-cmd configuration file, as shown in the README.md.

5. Bookkeeping using Amazon SimpleDB requires setting up an Amazon SimpleDB domain (= database) first. You cannot do this through the AWS Management Console.

6. So I googled, and found official directions here.

7. Unfortunately, my Chrome wouldn't render the SimpleDB Scratchpad web app properly, which caused some unnecessary confusion. The solution was simply to run Scratchpad in Safari.

8. Your computer has folders and files. Amazon Glacier has vaults and archives. One archive = one upload. This can be an individual file, but it's more practical to bundle individual files into tarballs first, so one archive = one tarball.

9. I'm in business: two large tarballs uploaded and showing up in the SimpleDB domain that keeps tabs on this particular vault, with a third on the way.
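To make step 8 concrete, the bundling is plain `tar` -- the folder and file names below are made up for illustration:

```shell
# one folder of photos becomes one compressed tarball, i.e. one future Glacier archive
# (folder and file names are hypothetical)
mkdir -p photos_1999
echo "stand-in for image data" > photos_1999/beach.jpg
tar -czf photos_1999.tar.gz photos_1999

# list the tarball's contents before uploading, so you know what the archive holds
tar -tzf photos_1999.tar.gz
```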

It looks like everything works, but I can't be sure until Amazon Glacier gets around to producing an inventory (this happens about once a day, it seems). I can then check SHA sums between what's on Glacier and what I thought I sent there. Next I will upload something small, then download it the next day.
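For what it's worth, the `.glacier-cmd` file from step 4 ended up looking roughly like this. The section and key names are from my reading of the README at the time, so treat this as a sketch and check your own copy; the credentials, domain name, and region are placeholders:

```
[aws]
access_key=YOUR_ACCESS_KEY
secret_key=YOUR_SECRET_KEY

[glacier]
region=us-east-1
bookkeeping=True
bookkeeping-domain-name=my-photo-vault
logfile=~/.glacier-cmd.log
```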

Glacier is the digital equivalent of self-storage. You put stuff there that you don't really want anymore; you think you might, but you don't. It's a problem that comes with the ease of acquiring such stuff in the first place. I don't think there's a big self-storage industry in Zambia, and I'm sure that storing old photos wasn't much of a problem back when you had to take them on film and you only had 36 frames in a roll.

I have no idea why we bother with digital self-storage. I guess simply deleting old pictures and a bunch of music we no longer listen to makes us feel like jerks. It's a total trap.

## I put up my first post on RPubs

Sure, it may be the 4chan of data analysis, but it's so nice to be able to do R Markdown right there in RStudio and just hit the Publish button.

Of course, this convenience has downsides. I know it's prudent to sit with your work a bit before sharing it -- like thinking carefully before you go skinny-dipping -- especially when you don't have the benefit of peer review.

On the other hand, it's no use waiting until nobody cares anymore. So, here goes.

## Stata 13 is coming on June 24

The yellow color scheme is out, sky-blue is in, plus expanded capabilities, as one might expect. Notable among them: `xtologit`, `xtoprobit`, and long strings -- 2 billion characters long, that is. One of these days you won't need an RDBMS anymore. Wouldn't that be nice?

See more details here.

## Keeping knitr happy after upgrading to R 3.0.0

As noted here, after upgrading to R 3.0.0 you must run

``````
update.packages(checkBuilt=TRUE)
``````

This is because a number of packages have to be rebuilt under R 3.0.0 in order to keep working.

So I did, but that was not enough for LyX to compile my pdf's from knitr like it used to only a week ago. What I had to do in addition was this:

``````
remove.packages("tikzDevice")
``````

That's right. The package `tikzDevice` can no longer be installed directly from R-Forge as a binary, as in

``````
install.packages("tikzDevice", repos="http://R-Forge.R-project.org")
``````

Also, the source files are only available as a .tar.gz archive. To install from it on a Windows machine, you must have Rtools installed first.

## A quick note on rJava

I recently had to set up a PC with kit similar to what I have on my Mac. On this PC the OS is Windows 7 64-bit but the browser is IE8 32-bit. This causes `jucheck.exe` to install (and occasionally update) 32-bit Java. That is unfortunate if you use 64-bit R, because it breaks the `rJava` package, which in turn breaks the `xlsx` package, with the practical consequence that you cannot read Excel worksheets into R.

There is a workaround. First, install 64-bit Java via Oracle's manual download. As of this writing, its Windows 7 home is `C:\Program Files\Java\jre7`. You should add this to the `%path%` environment variable. In addition, the `rJava` package depends on `jvm.dll`, and R might be looking for it in the wrong spot, so it won't hurt to add `C:\Program Files\Java\jre7\bin\server` to your `%path%` as well. There's more on this, as usual, on StackOverflow.
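For example, from an elevated command prompt, both additions can be made at once. This is a sketch that assumes the default `jre7` paths above; also note that `setx` has a documented 1024-character truncation limit, so check your `%path%` afterwards:

```
setx PATH "%PATH%;C:\Program Files\Java\jre7;C:\Program Files\Java\jre7\bin\server"
```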

As Oracle warns, your manually-installed 64-bit Java will not be automatically updated. That is a problem when security flaws hit Java, but I find being able to read Excel files into R so useful that I'm willing to just live with this risk, though I don't have a good idea how to best manage it. I'll just keep an eye on ArsTechnica for bug news. If anybody has a better way, I'm all ears.

## An R-squared for logistic regression, packaged

This morning I checked Paul Allison's Statistical Horizons blog and found a post on $R^2$ measures for logistic regression. It introduced me to Tjur's $R^2$ by way of an example, which I repackaged below:
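For context, Tjur's statistic -- he calls it the coefficient of discrimination -- is simply the difference between the average predicted probability among observed successes and among observed failures:

$$R^2_{\text{Tjur}} = \overline{\hat{\pi}}_{y=1} - \overline{\hat{\pi}}_{y=0}$$

which is why a `ttest` of the predicted probabilities by outcome group recovers it as a difference in means.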

``````
// Reference: http://www.statisticalhorizons.com/r2logistic

// program definition
capture program drop tjur2
program tjur2, rclass

    // only meaningful after a logit or logistic fit
    if !inlist(e(cmd),"logit","logistic") {
        di as err "Tjur's R-squared only works after logit or logistic."
        exit 498 // Thank you, Nick Cox.
    }
    // predicted probabilities for the estimation sample
    tempvar yhat
    predict `yhat' if e(sample)
    local y `e(depvar)'
    // Tjur's R-squared is the difference in group means of yhat
    quietly ttest `yhat', by(`y')
    local r2logistic = r(mu_2) - r(mu_1)
    di "Tjur's R-squared " _col(20) %4.3f `r2logistic'
    return local r2logistic `r2logistic'

end

// use case
use "http://www.uam.es/personal_pdi/economicas/rsmanga/docs/mroz.dta", clear
logistic inlf kidslt6 age educ huswage city exper
tjur2
``````

I'm not sure yet if it's worth saving this program as `ado/personal/t/tjur2.ado` for my future logistic regression diagnostic needs, but I haven't posted anything Stata-related in too long, so there you have it.

## Tidying up your R packages

Do you have the same R packages installed in two places? Would you like to remove the duplicates? You might find the script below useful:

``````
rm(list=ls(all=TRUE))

# define function to return duplicate packages and paths
tidyup <- function() {
  packs <- as.data.frame(installed.packages())
  paths <- levels(packs$LibPath)
  main <- subset(packs, LibPath==paths[2]) # base and recommended
  mine <- subset(packs, LibPath==paths[3]) # stuff I installed
  dups <- intersect(main$Package, mine$Package)
  return(list(paths, dups))
}

# do the work:
cleanthis <- tidyup()
removethese <- cleanthis[[2]]          # here's the list of dups
fromhere <- cleanthis[[1]][3]          # I only want them on the main path
remove.packages(removethese, fromhere) # done

# check the result:
# if length(tidyup()[[2]]) == 0, all is well: no duplicate packages left.
checkthis <- as.numeric(length(tidyup()[[2]]))
``````

Why I wrote this:

A while back I chose to separate my package library over two file paths. One would be for base and recommended packages (1), the other for everything else (2). My notes on how I did that are here, and my reasons are here.

Today, I wanted to update my Zelig. I used the wizard -- `source("http://r.iq.harvard.edu/zelig.installer.R")` -- so I would get all the add-ons in one step. The wizard works under the assumption that your library is all in one place. It installed on path (2) a few packages that Zelig and its add-ons depend on, because it didn't find them there. They were already present on path (1), though, so I ended up with duplicates. This is how I got rid of them.

## I’m taking intro to biostats and epi (PH207x) from EdX

So far, it's been great fun. It's the first MOOC I've seen that uses Stata, and I would not be surprised if this were a first among all commercial software packages. The topics covered and the quality of the instruction are excellent. I am glad to see Stata introduced to a large audience in such nice company. StataCorp made free temporary licenses available to all registered students worldwide for the duration of the course.

But enough about Stata. What really blew me away was the textbook for the biostats section. Beautifully written, it takes its time to cover properly everything an applied researcher needs to know about hypothesis testing. Most texts I've seen before hurried through this part, as if they couldn't wait to jump into regression diagnostics and the like. Maybe that's because I've only seen econometrics texts before.

Anyway, buy it if you're looking for an introductory text for applied stats of any kind, and biostats in particular. I'm not getting a penny for this plug, by the way, so feel free to try and find it for less elsewhere. And take the course when they offer it again. At the very least, it will make you an educated consumer of public health information.

## Setting up my R library folder on a Mac

My understanding is that there are three kinds of R packages: base, recommended, and everything else. You can tell which is which by inspecting the output of `installed.packages()`. That is easiest done in RStudio by sending that output to a data frame, like this

``````
packs <- as.data.frame(installed.packages())
``````

You can see that this data frame has a column named `Priority`. The output of `table(packs$Priority, exclude=NULL)` shows that I have 14 base packages, 15 recommended ones, and 70 of the other kind -- user-contributed kit that I installed over time as I bumbled my way through learning and using R.

Looking at `packs` in the top left pane of RStudio also shows that the rows are named after the packages. This means that you can collect the names of base and recommended packages easily:

``````
> rownames(subset(packs, Priority=="recommended"))
[1] "boot"       "class"      "cluster"    "codetools"  "foreign"    "KernSmooth"
[7] "lattice"    "MASS"       "Matrix"     "mgcv"       "nlme"       "nnet"
[13] "rpart"      "spatial"    "survival"
``````

Having all the R packages in the same default library, which is `/Library/Frameworks/R.framework/Versions/2.15/Resources/library` as of this writing, comes with the disadvantage that when I upgrade to the next version of R I will have to re-install the 70 packages of the third kind.

It would be nice if I could set them aside in a different library, and any future version of R will know to look for them there and update them as needed.

There are two steps to this job: one is the actual moving of package folders; the other is to show R where to look for them.

First, I created a new folder called Rlibs. Then I moved the folders around with this Bash script, which I called movePacks.sh:

``````
#!/bin/bash

# declare an array with the names of the base packages
basepacks=("base" "compiler" "datasets" "graphics" "grDevices" "grid" "methods" "parallel" "splines" "stats" "stats4" "tcltk" "tools" "utils")

# and another with the names of the recommended ones
recpacks=("boot" "class" "cluster" "codetools" "foreign" "KernSmooth" "lattice" "MASS" "Matrix" "mgcv" "nlme" "nnet" "rpart" "spatial" "survival")

# and now concatenate them
allpacks=("${basepacks[@]}" "${recpacks[@]}")

# where you're moving from
oldLib="/Library/Frameworks/R.framework/Versions/2.15/Resources/library"

# where you're moving to
newLib="/Users/ghuiber/Rlibs"

# first, move everything over
mv ${oldLib}/* "${newLib}"

# then move the base and recommended packages back to their default location
for i in "${allpacks[@]}"
do
    mv "${newLib}/${i}" "${oldLib}"
done
``````

Finally, to point R to the library folders, I created this `.Renviron` file as instructed by Christophe Lalanne in the comments to my earlier post on the topic:

``````
R_PAPERSIZE=letter
R_LIBS=/Users/ghuiber/Rlibs
EDITOR=vim
``````

The ideas for the R_PAPERSIZE and EDITOR environment variables came from here.

## Benchmarks

I went googling for some examples of quadratic programming done in Mata, and stumbled across a fairly recent Statalist discussion. The original question is here and the official response, typically prompt, is here. I tested Patrick Roland's code on my own machine (a 2011 MacBook Pro, Core i5), but with Octave instead of MATLAB, and with R in addition. Octave took about 2 seconds. My R code is

``````
system.time(chol2inv(matrix(rnorm(2000^2),2000,2000)))
``````

This took about 4 seconds to run, whether in RStudio or in command-line R 2.15.1. Mata, meanwhile, still takes about 30 seconds. I run Stata 12 MP, all up to date. I'd be curious how SAS/IML does, but I don't have it.