Amending a hasty commit

On my current project I occasionally have to report anomalies in input data to people upstream, who can look into them. I do this with html files which I knit from R Markdown. They have to include prose, code, results, and pictures. As I edit them on my way to the final product, it's practical to cache some of the code chunks, especially if they require loading large input files or reading from a database.

The cache folders can get quite big. If it so happens that I haven't edited .gitignore to skip them, my next commit will be slow and the Sync will fail, because the enterprise GitHub server has a 50 MB file size limit.

But if you've already hit Commit and then the Sync button in the GitHub app, what do you do? It turns out you can edit .gitignore so it knows better next time, maybe delete the offending files -- not that you strictly have to, but the cache folders won't be needed once the report is in its final form, and they just take up disk space -- and then do this at the command line:

git status
git add --all
git commit --amend --no-edit
git status

The first git status will show that you have deleted files and modified .gitignore. The git add --all construct, unlike git add ., stages the deletions as well as the modification. Then git commit --amend --no-edit replaces the old commit with a brand new one that includes the staged changes, as explained here. The second git status confirms that all is well, which you can see again when you switch back to the GitHub app: the Sync button tells you that you're ahead by one commit, you click it, and the push is quick, because the huge cache files are gone.
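
For completeness, here's the kind of .gitignore addition I mean -- a minimal sketch that assumes the default knitr convention of naming the cache and figure folders after the .Rmd file:

*_cache/
*_files/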

Recipe for pairing up RStudio with GitHub

On both Windows and Mac I have been happy to use RStudio for R development and the GitHub app for handling version control. The app seems to be GitHub's own preferred interface, and if you use it you don't even need Git for Windows. I'm not sure why you wouldn't just do that. The only cost is that you have to flit between RStudio and the GitHub app every time you make a commit, but how much of an interruption is that? You flit between RStudio and the browser all the time to check StackOverflow, don't you?

Regardless, suppose that you find the thought of doing your version control from inside RStudio appealing. Below are the setup steps that worked for me, pieced together from many places in the process of integrating my startUpViz repo into RStudio's Git workflow.

Step 1: Give RStudio the Git

Install Git for Windows or, if it's installed already, tell RStudio about it, following the five-step method described here. Make sure to stop at "Restart RStudio and that is all there is to it!" My instructions below supersede the remaining instructions there, not because those are wrong, but because I have a specific kind of R project in mind: a package.

Step 2: Create a brand new R project

Create a brand new R project in a brand new directory and check the box "Create a Git repository." You might as well place it inside c:/users/[yourusername]/documents/GitHub/, because this is probably where you keep all the work that ends up published on GitHub.

Step 3: Open the Git shell and configure your SSH key pair

In RStudio your brand new project comes with a Git tab in the top right corner, where you usually only pay attention to the Environment and History tabs. That Git tab has a gear icon, and the drop-down menu that opens when you click it has a "Shell..." option. That is your Git Bash shell. Open it, create a new SSH key pair, then send the public one to GitHub. The complete instructions for this are here, but there's one crucial twist: instead of ssh-agent -s as shown in the last screenshot at step 2 on that page, you must type eval `ssh-agent -s`. Only then can you ssh-add your new private key. Details are offered on StackOverflow. You only need to do this step once: subsequent RStudio projects that you version-control on GitHub from inside RStudio will use the same key pair for authentication.
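
Condensed, the shell sequence looks like this -- a sketch that assumes the default key file name id_rsa; substitute your own email address:

ssh-keygen -t rsa -C "you@example.com"   # generate the key pair; accept the defaults
eval `ssh-agent -s`                      # start the agent -- plain ssh-agent -s won't do
ssh-add ~/.ssh/id_rsa                    # hand the agent your new private key

After that, paste the contents of ~/.ssh/id_rsa.pub into the SSH keys section of your GitHub account settings.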

Step 4: Create a brand new GitHub repo at github.com

Now go to GitHub and create a bare-bones new repo with the same name as your directory at Step 2 above. Do not check the box "initialize this repo with a README file" because that part can wait, and doing so would bypass some options you'll want. When you create this new repo, you will be given the options
- Quick setup — if you've done this kind of thing before
- …or create a new repository on the command line
- …or push an existing repository from the command line
The third is the one you want. Now you go back to RStudio, but before you leave, notice that although the "clone URL" you can copy to the clipboard says https:// by default, that is not your only option. If you read the hint that "You can clone with..." and click SSH, the URL changes to something that starts with git@github.com:. That's the one you want. Save it to the clipboard now.

Step 5: Add a remote repo

Now you're back in RStudio. If at the Git BASH shell prompt you type git remote -v and hit Enter, you should see nothing, because you don't yet have a remote repo. You add one with git remote add origin git@github.com:[yourusername]/[yourrepo].git where the part after the word origin comes from what you saved to the clipboard at Step 4 above.
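
Spelled out, with the placeholders kept as placeholders:

git remote -v                  # prints nothing yet
git remote add origin git@github.com:[yourusername]/[yourrepo].git
git remote -v                  # now lists origin twice, once for fetch and once for push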

Conclusion

Once you complete the 5 steps above, you can git push -u origin master for the first push, and then commit, push, pull, etc. directly from RStudio. Either skipping the little eval tweak at Step 3 or using the https://[...] URL for the remote repo instead of the git@[...] one will cause the ssh connection to fail, and RStudio will be unable to push to the remote.
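
That first push, spelled out -- the -u flag wires your local master to origin/master, so a plain git push suffices afterwards:

git push -u origin master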

I don't know why this had to be so hard to set up, but there it is. I wrote it down because it took trial and error. Anyway, this is how you hook up RStudio for the first time to a pristine GitHub repo that has not yet had its first commit. That is the kind of repo you need for an R project that starts in a new directory, with the further options that it can be an empty project, a package, or a Shiny app.

A much easier way to go (especially once you have an SSH key pair set up) is to set up your repo at github.com with the box "initialize this repo with a README file" checked. That immediately triggers your first commit. Next, you go to RStudio and start a brand new project, but this time you pick the "Version Control" option (bottom of the dialog box) instead of either of the two above it ("New Directory" or "Existing Directory" as of this writing). You then pick Git, give it the git@[...] URL you saved to the clipboard at Step 4, and you're good to go.

This, though, will be a bare-bones project, and it's up to you to fill in the goods. None of the setup work that RStudio provides for new packages or Shiny apps will be done by default. You can see why: the typical use case for a project checked out from a version control repository is picking up where you or somebody else left off earlier. There's some work in progress you'll be making use of, not just an empty repo that happened to start with the box "initialize this repo with a README file" checked.

Either way you do it, starting afresh or checking out from an existing repo, you still have the option to revert to the GitHub app if, say, you miss the Sync button. You just need to do a push and pull from RStudio to make sure that your local copy is identical to the remote, and then let the GitHub app clone the remote onto the local one. No conflicts will be reported, and from then on you can handle the version control from either app.

Here’s to MOOCs. They’re better than textbooks

The job of textbooks is to separate brilliance, which has zero marginal cost, from individual attention, which is labor-intensive. Everybody is better off when the few brilliant teachers write books that the many dedicated ones can teach from.

MOOCs do the same job better. They are cheaper to make and distribute. They are cheaper to improve on, because student response is automatic and quite precise: all you have to do is look for the videos most rewound, or the quiz answers most missed. Improvements can be spliced in as needed, one four-minute video replacing another. MOOCs are also much better at avoiding bloat. Textbooks grow thicker and more colorful over time, driven by relentless yearly print runs. It is not clear how much of this reflects truly new content, more effective delivery, or the need to kill off the resale market. With MOOCs, there is no such uncertainty. The resale market is not a concern. Courses that are not watched will be abandoned. Lectures that are rewound a lot, or whose accompanying quizzes have low pass rates, will be re-shot and improved. And video preserves an author's flair for delivery that is lost on the printed page, no matter how colorful the latest version is or how interactive the accompanying website.

Many trees are felled to make textbooks that are returned to the publisher. While they are out, they are clutter that makes it hard to find the good ones: they're all equally thick and colorful, and pushed by equally enthusiastic reps. MOOCs, on the other hand, produce all kinds of vital statistics -- viewership, attrition rate, forum participation, topics most discussed, etc. -- as soon as they go live. They are easy to kill off if they don't catch on, and it does not take long to know whether they might. MOOCs may look like a monoculture, but what looks like diversity in textbooks is just market inefficiency.

MOOCs don't work that well on their own, for the same reason that textbooks don't: both are complements to individual attention, not substitutes for it. But MOOCs paired with a flipped classroom will do a better job than textbooks paired with a reading schedule have done so far. Thanks to them, the workers of the future will be more productive than we are.

Introducing syncR

A new and improved version of the syncPacks() function is now part of a GitHub package, which you can install through devtools::install_github('ghuiber/syncR'). If you're into that, you can help develop it further too.
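
If you don't have devtools yet, the full sequence is short. This assumes you install devtools from CRAN first and that the package loads under the repo's name:

install.packages('devtools')               # one-time, from CRAN
devtools::install_github('ghuiber/syncR')  # the package itself
library(syncR)                             # then load it as usual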

Thanks go to Hilary Parker for her thorough instructions and to Hadley Wickham for devtools and roxygen2.

Some unresolved hiccups with R 3.1.0 on Mavericks, and a workaround

If you're going to download the Mac binaries for the latest R, you will see that they come in "Snow Leopard and higher" and "Mavericks and higher" flavors. If you run Mavericks, the latter is a natural choice, though the former clearly says "and higher" too, so it's got to be a valid option as well.

As it turns out, it's the better option, at least as of this writing.

The Mavericks build crashes with a segmentation fault upon attempting to load either the caret or the data.table package, as reported here and here. A brief search through the R-SIG-Mac archives returned no useful leads for fixing the problem.

Dropping the Mavericks build and installing the Snow Leopard one gave me back both caret and data.table. This works for me.


> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RCurl_1.95-4.1   bitops_1.0-6     scales_0.2.4     ggplot2_1.0.0    reshape_0.8.5    data.table_1.9.2
[7] MASS_7.3-31     

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 digest_0.6.4     grid_3.1.0       gtable_0.1.2     htmltools_0.2.4  munsell_0.4.2   
 [7] plyr_1.8.1       proto_0.3-10     Rcpp_0.11.2      reshape2_1.4     rmarkdown_0.2.46 stringr_0.6.2   
[13] tools_3.1.0  

MacBook Pro running hot, draining battery after upgrading to SSD?

Mine did. That was an unpleasant surprise. Googling for a solution brought up untold amounts of speculation and wasted time.

What ended up working for me was resetting the System Management Controller (SMC), as documented here and especially here. You should see that Reddit comment thread, especially if you're also wondering whether you're supposed to enable TRIM.

Resetting the SMC brought down the CPU core temperatures from about 90°C to about 60°C, low enough for the fan to not kick in. My Mac is once again as quiet as it used to be.

FreeNAS works as advertised

I decided to replace the HDD in my Mac with an SSD for Christmas, but I only got as far as buying the thing and backing up the computer with Time Machine, as explained here, to a poor man's FreeNAS server that I cobbled together from a USB stick (for the OS) and the old Fedora 14 home server, whose sole 500G HDD is now one big ZFS volume, with 2G of RAM.

That's right, ZFS on one HDD with 2G of RAM. I'm not saying that this is a good setup. The official hardware recommendation for ZFS is 8G of RAM. But this is the kit I had lying around, and I just wanted to move on with the actual disk replacement; my D510MO board won't even support more than 4G of RAM (though I'm not sure why, since it was made to accommodate a 64-bit CPU). Anyway, I managed to make a first complete Time Machine backup and a few incremental ones before leaving for work on Monday, January 6.

I flew back on Thursday, January 9, and found a non-responsive Mac with a HDD so sick that an erase-and-install OS restore was in order. That's what you end up having to do when, upon entering your password at boot-up, you see the Apple logo for a while, then that "prohibited access" barred circle, with the gear animation spinning and spinning.

I have no idea how this happened. I felt very fortunate for having made that backup. I decided that the accident was a good excuse to just proceed with the SSD installation already.

The proof of this particular pudding was going to be in restoring the old system from that Time Machine backup, over the LAN, off the grossly inadequate NAS box. I am happy to report that the restore succeeded, and my Mac is back in business, now with an SSD.

What I'm saying is this: if you don't have a Time Capsule but do have some idle hardware, FreeNAS may be a good Time Machine backup solution for you too.

One thing you will want to know about is user quotas: a 500G NAS HDD will fill up quickly if you let Time Machine have its way with it. The solution is to set reasonable quotas for the people in your house who might use the FreeNAS box as their Time Machine backup destination. You can do that from the web GUI. The Advanced Mode of the Create ZFS Dataset menu under Storage (or, for an existing dataset, the Advanced Mode of Edit ZFS Options) lets you set quotas in four different ways; for specifics, google thin and thick provisioning. This seems to be advanced sysadmin stuff.

There is also a command-line recipe for setting user quotas here. You get to the FreeNAS shell from the web GUI: look at the bottom of the vertical navigation menu on the left.
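
From that shell, the recipe boils down to a zfs command or two per dataset. A sketch, with pool, dataset, and user names that are made up:

zfs set quota=200G tank/timemachine             # cap the whole dataset
zfs set userquota@alice=100G tank/timemachine   # or cap one user's share
zfs get quota tank/timemachine                  # confirm the setting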

Smaller quotas will force Time Machine to keep a shorter history. It deletes old backups as it runs out of space -- so, less room, shorter history. That is not a bad thing.

Invisible methods

R objects come with various methods that make them useful. I tend to stumble on these by googling something I want to do and finding a code example on StackOverflow. But today I learned (from @RLangTip) that there is a straightforward way to list them all: you simply call, e.g., methods(class='lm').

That's nice, but mileage varies, and I don't have a good explanation for why. Take Zelig, for example. It has this sim() function, which produces a simulation object with some methods of its own. One of these is plot.ci(), illustrated here. Unfortunately, you won't find it with the methods() call:


> library("Zelig", lib.loc="C:/Program Files/R/library")
Loading required package: boot
Loading required package: MASS
Loading required package: sandwich
ZELIG (Versions 4.2-2, built: 2013-10-22)

+----------------------------------------------------------------+
|  Please refer to http://gking.harvard.edu/zelig for full       |
|  documentation or help.zelig() for help with commands and      |
|  models support by Zelig.                                      |
|                                                                |
|  Zelig project citations:                                      |
|    Kosuke Imai, Gary King, and Olivia Lau.  (2009).            |
|    ``Zelig: Everyone's Statistical Software,''                 |
|    http://gking.harvard.edu/zelig                              |
|   and                                                          |
|    Kosuke Imai, Gary King, and Olivia Lau. (2008).             |
|    ``Toward A Common Framework for Statistical Analysis        |
|    and Development,'' Journal of Computational and             |
|    Graphical Statistics, Vol. 17, No. 4 (December)             |
|    pp. 892-913.                                                |
|                                                                |
|   To cite individual Zelig models, please use the citation     |
|   format printed with each model run and in the documentation. |
+----------------------------------------------------------------+



Attaching package: ‘Zelig’

The following object is masked from ‘package:utils’:

    cite

> methods(class='sim')
[1] plot.sim*   print.sim*   repl.sim*   simulation.matrix.sim*
[5] summary.sim           

   Non-visible functions are asterisked

See that? There's a non-visible plot() method listed, but no plot.ci() method, yet it exists and it works. I wonder why that is. Is it maybe that plot.ci() is some kind of child of plot()? If so, how do you list such children?
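
My best guess, offered without having dug through the Zelig source: methods(class='sim') only lists functions named [generic].sim, and plot.ci() is named as if it were a plot() method for class 'ci', not for class 'sim', so it gets filed under the generic instead. If that's right, you can flush it out either by generic or by name:

> methods('plot')          # plot.ci* should show up in this list, asterisked
> getAnywhere('plot.ci')   # retrieves the function itself, visible or not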

How I backed up a bunch of old pictures to Amazon Glacier

This is from a home server that runs Fedora 14, to which I have ssh access from my MacBook Pro.

1. I git clone'd this.

2. Then, as super-user, I called


wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python

as instructed here, to install the setuptools module.

3. Then, also as super-user and from inside the freshly cloned repo, I called


python setup.py install

4. At this point, it was time to fill out the .glacier-cmd configuration file, as shown in the README.md.
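
The file is INI-style. Below is a sketch with placeholder values; I'm reproducing the key names from memory of that README, so double-check against it:

[aws]
access_key=YOUR_ACCESS_KEY
secret_key=YOUR_SECRET_KEY

[glacier]
region=us-east-1
bookkeeping=True
bookkeeping-domain-name=my-glacier-domain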

5. Bookkeeping using Amazon SimpleDB requires setting up an Amazon SimpleDB domain (= database) first. You cannot do this through the AWS Management Console.

6. So I googled, and found official directions here.

7. Unfortunately, my Chrome wouldn't properly render the SimpleDB Scratchpad web app. That caused some unnecessary confusion. The solution was to just run Scratchpad in Safari.

8. Your computer has folders and files. Amazon Glacier has vaults and archives. One archive = one upload. This can be an individual file, but it's more practical to bundle individual files into tarballs first, so one archive = one tarball.
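
In glacier-cmd terms, that workflow looks something like the sketch below. The vault and file names are made up, and the subcommands are the ones I recall from the README, so verify there:

tar czf pictures-2009.tar.gz ~/pictures/2009/    # bundle first
glacier-cmd mkvault pictures                     # one-time vault creation
glacier-cmd upload pictures pictures-2009.tar.gz --description "Pictures, 2009"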

9. I'm in business: two large tarballs uploaded and showing up in the SimpleDB domain that keeps tabs on this particular vault, and one more on the way.

It looks like everything works, but I can't be sure until Amazon Glacier gets around to producing an inventory (this happens about once a day, it seems). I can then check SHA sums between what's on Glacier and what I thought I sent there. Next I will upload something small, then download it the next day.

Glacier is the digital equivalent of self-storage. You put stuff there that you don't really want anymore; you think you might, but you don't. It's a problem that comes with the ease of acquiring such stuff in the first place. I don't think there's a big self-storage industry in Zambia, and I'm sure that storing old photos wasn't much of a problem back when you had to take them on film and a roll only held 36 frames.

I have no idea why we bother with digital self-storage. I guess simply deleting old pictures and a bunch of music we no longer listen to makes us feel like jerks. It's a total trap.

I put up my first post on RPubs

Sure, it may be the 4chan of data analysis, but it's so nice to be able to do R Markdown right there in RStudio and just hit the Publish button.

Of course, this convenience has downsides. I know it's prudent to sit with your work a bit, just as you'd think carefully before going skinny-dipping, especially when you don't have the benefit of peer review.

On the other hand, it's no use to wait until nobody cares anymore. So, here goes.