Friday 28 February 2014

Intro to GitHub (for scientists)

The problem

I talked to a young woman yesterday who is a bio-engineering postdoc at Stanford. She has some code that she'd like to 'upload' to GitHub. She admitted that, well, actually, she hadn't managed to get any of her code onto her GitHub account yet, and she looked so overwhelmed and dejected that I felt bad. I know that frustration.
So here is yet another blog post to try and help with the learning curve that is git. I'm writing this for the scientist who has written some code and wants to share it on GitHub. Most scientists write code in order to accomplish a particular task, and thus are not familiar with professional programming practices, including writing documentation, unit testing, SCM (source control management) and version control. There are great benefits to learning these techniques, and their usefulness is becoming more and more apparent to academics as research relies more and more on computer programs. In fact, the table of contents for the journal Nature Methods just arrived in my inbox with a lead editorial on reproducible research. I quote:

Nature Methods strongly encourages researchers to take advantage of the opportunity that code repositories, such as GitHub, provide to improve a software tool before submission. Even if others do not examine and test the code, the act of preparing the code and necessary documentation for deposit and use by others will help avoid delays in publication.
In short, if you are coding and publishing work derived from your code, the process of uploading your code for collaboration will help bring it up to a publishable quality. And yes, it's even more important than making the figures pretty.

GitHub == collaboration

Firstly, if you just want to upload some code, you need to take a step back, a deep breath and 'think different'. There is no 'upload' button on GitHub for a good reason. It is not built for uploading code and leaving it to rot on a server, but for fostering collaboration between programmers. Thus, in order to share your code on GitHub, you first need to get it ready for collaboration. To do that, you need to set up version control. This is somewhat more complicated than finding the 'track changes' setting in Microsoft Word, but it is also far more useful.
As you try to think different, keep in mind that GitHub was built with a particular set of workflows in mind. Those workflows have to do with managing a code-base that is constantly being updated by multiple contributors. GitHub tries to ease the difficulties of this type of collaboration by bringing several things together.
  • version control -- similar to 'track changes' in a word processor, but for whole projects
  • cloud-based storage -- simultaneously backup and share your code
  • user accounts -- keep it private, share with a team or make it public as your needs change
  • social -- there's messaging, so you can talk, argue, document, discuss.... collaborate!
But version control is the main thing that sets GitHub apart from any number of social sharing platforms, and makes it so powerful for people who code. To use GitHub effectively you therefore need to understand the basics of version control with git. It's hairy and scary at first, but no worse than... well, ok, it is worse than a lot of things, but sometimes a learning curve is the price you pay for really capable tools. So, put a set of bookmarks in your browser, make a cheatsheet and keep it handy. A local cheatsheet with a good searchable title does wonders. If you're thinking 'Oh, I'll just use it once' or 'I'll remember', well, git is for professionals. Are you a professional?

Git == version control

Version control is to 'track changes' for a text document, as Superman is to Tarzan; as Microsoft Word is to TextEdit, as New York City is to Detroit. Firstly, with version control, you are tracking changes to an entire project, not a single document. Changes are tracked line by line with comments and attribution through time as the project grows, changes, and splits; as subroutines develop into full projects of their own; as new owners take control of the code base. It is flexible and thorough and reliable. You should learn to use it.
Use case: You've got some code you developed for your thesis that you want to upload to GitHub. Maybe someone else will find it useful. There are GUI front ends to git, which may help with many tasks, but git was designed to be run from the command line, and this simple use case is not too difficult to master at that level, so let's just go for it.

Download and install git

Walk through the steps in GitHub's set-up guide. Today, they suggest that you download their native app, but note that this only manages part of the workflow. The steps you have to do to get git working still involve the terminal. Be brave. The steps are:
  1. get a GitHub account
  2. download and install git
  3. link your local system to your GitHub account
    • tell git your name, email, GitHub login information
    • set up security keys so that GitHub knows you are you
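The 'tell git your name and email' part of step 3 comes down to a couple of git config commands. Here is a minimal sketch; the name and email are placeholders I made up, so swap in your own. The security keys are best set up by following GitHub's own instructions, so I've left them out.

```shell
# Tell git who you are. These values are placeholders, not real details.
git config --global user.name "Your Name"
git config --global user.email "you@example.com"

# Read the settings back to confirm they were stored.
git config --global user.name
git config --global user.email
```

The --global flag stores the settings for your user account, so every repository you create on this machine picks them up.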

Prepare your codebase

There are a couple of adjustments you probably want to make before releasing your code in the wild. GitHub recommends that all code comes with a License, a Readme and a .gitignore file.
  • License
    If your code comes with a license, it's easier for collaborators to re-use the code and build on it. Specifically, it makes clear what they are allowed to do. It may not be important to you, but your code will be easier to share if you make it clear what the rules are.
  • Readme
    You have to explain your work at some point. If you do it in a file named README, GitHub will automatically put it on the front page of the repository. This is very helpful for anyone trying to understand what you did and why. The README can be just a text file. If it is in markdown, perhaps with additional flavoring, GitHub will render it with headings and styles, which is much nicer for the reader and not difficult for the writer.
  • .gitignore
    Some files don't need to be tracked. For instance, some old Mac directories contain .DS_Store files with directory display information for the Finder app. That doesn't need to be part of the repository. So here's my .gitignore for an old MATLAB project:
      $cat .gitignore
      .DS_Store 
    
    Pretty simple. You might also add *.log or tmp/ to the .gitignore file, depending on your context. Basically, any files that are automatically generated or updated on compile should not be tracked.
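If you want to see a .gitignore doing its job before you trust it, here is a small sketch you can run in a throwaway directory. The file names (run.log, analysis.m and so on) are invented for the demonstration; the patterns are the ones mentioned above.

```shell
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q

# One pattern per line: a Finder file, all log files, and a tmp directory.
printf '%s\n' '.DS_Store' '*.log' 'tmp/' > .gitignore

# Create some files that should be ignored and one that should not.
mkdir tmp
touch .DS_Store run.log tmp/scratch.txt analysis.m

# Ask git which of these paths it will ignore (only ignored ones are printed).
git check-ignore .DS_Store run.log tmp/scratch.txt analysis.m

# See what is left over for tracking.
git status --short
```

The check-ignore command is handy whenever a file is unexpectedly missing from, or present in, your status listing.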

Make a local repository

The project you want to get onto GitHub probably consists of a directory or directory tree containing a series of text files, and possibly some image files or data files. In order for git to track changes to this project, you have to put these files into a repository. This is simple to do once you've prepared it for sharing.
Find your command line. On a Mac, you can use the Terminal app. Navigate to the directory holding your project. If you have never ever used the command line before, this might be challenging. If you want to dive in, you can certainly do so with three little commands: ls, cd, pwd. You can look at the man pages for these commands by typing, for example $man ls, or you could try a crash course in using the command line.
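Here is a short practice run with those three commands. It builds its own throwaway directory tree first, so you can't get lost in your real files; the directory names are just examples.

```shell
# Make a throwaway directory tree to practise in (the names are made up).
practice=$(mktemp -d)
mkdir -p "$practice/programming/octave/OrX"

cd "$practice"              # cd: change into a directory
pwd                         # pwd: print where you are now
ls                          # ls: list what is in this directory

cd programming/octave/OrX   # several levels can be given at once
pwd

cd ..                       # .. means "the directory above this one"
ls
```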
Once you get to your project directory, type:
$git init   
You should see a message from git, something like:
Initialized empty git repository in /Users/suz/programming/octave/OrX/.git/
This initialises an empty repository, which lives in a hidden directory named .git inside your project. You can check that it's there by typing
$ls -a
Git gives some feedback about what it has done, but I often find it useful to check with
$git status
after each command to see what has happened.
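If you'd like to try these first steps somewhere harmless, this sketch runs them in a scratch directory. The file hello.m is a made-up stand-in for your project.

```shell
# A scratch project with a single made-up file in it.
project=$(mktemp -d)
cd "$project"
echo "disp('hello')" > hello.m

git init      # create the repository
ls -a         # .git now shows up in the listing next to hello.m
git status    # hello.m is reported as untracked
```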
Now we can get the project into the repository. To do this, first type
$git add .   
This prepares git to add your files to the repository, a process known as 'staging'.
Note: The . tells git to stage all the files in the directory tree to the repository. This is very handy when we add files because they don't all have to be specified by name. On the other hand, it isn't ideal, because there are often binary files or .log or even image files that update automatically. You won't want to keep track of those changes. Fortunately, git will automatically look for the .gitignore file we already prepared to get the list of exceptions.
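To watch staging and the .gitignore working together, here is a sketch in a scratch repository. The two files are invented: hello.m should be staged, and run.log should be skipped because of the *.log pattern.

```shell
project=$(mktemp -d)
cd "$project"
git init -q

# Ignore log files, then create one file to track and one to ignore.
printf '%s\n' '*.log' > .gitignore
echo "disp('hello')" > hello.m
echo "run finished" > run.log

git status --short   # untracked files are marked with ??
git add .            # stage everything that is not ignored
git status --short   # staged files are now marked with A
```

Notice that run.log never appears in either listing. If you only want to stage one file, name it instead of using the dot, for example git add hello.m.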

Ready for some commitment? Type:
$git commit -m 'initial commit' 
You should get a full list of the files that git is committing to the repository. Check the status when it's done and you should get a reassuring message:
# On branch master
nothing to commit (working directory clean)

Success! Now your project is actually in the repository and git can track any changes to the files. The repository will also keep all the messages that you put with each commit. Always use the -m flag and take the time to add a meaningful message.
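To see why the messages matter, here is a sketch that makes two commits in a scratch repository and then reads the history back. The file, the messages and the identity are all placeholders of mine.

```shell
project=$(mktemp -d)
cd "$project"
git init -q

# Placeholder identity, stored only in this scratch repository.
git config user.name "Your Name"
git config user.email "you@example.com"

# First commit.
echo "disp('hello')" > hello.m
git add .
git commit -q -m 'initial commit'

# Change the file, then stage and commit again with a message that says why.
echo "disp('hello again')" >> hello.m
git add .
git commit -q -m 'print a second greeting'

git log --oneline    # one line per commit, newest first
git status           # nothing left to commit
```

The log is only as useful as the messages in it, which is the whole argument for not skipping them.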

Congratulations! Your code is now in a git repository, under version control. You are ready to collaborate.

Share your work


Make a repository on GitHub

  1. Log into your GitHub account
  2. Create a new repository on GitHub
    On your profile page, in the 'Repositories' tab is a list of repositories that you've contributed to. At the upper right should be a bright green 'New' button.
  3. Follow the directions in the form, adding an informative description so others know what treasure they have found.
Congratulations! You have a GitHub repository to share your code from!

Link the repositories

In git terminology, the current state of your code is the head of a branch. By default, the first branch is called 'master'. You can make other local branches, and probably should, to try out new features. Git can also keep track of remote repositories and their branches. At this point, your new GitHub repository is essentially an empty remote repository. By custom, this remote is referred to as 'origin'. To point git to it, type (on one line), with your appropriate username and project title:

$git remote add origin https://github.com/username/project.git

This command translates roughly as "Dear git; Please be so kind as to add a connection to a remote repository. I will be referring to the remote repository as 'origin' in our future correspondence. It can be found at https://...... Thank you for your kind assistance in this matter. Sincerely, yours truly, esq."
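Adding a remote only writes a note in your local repository's configuration; nothing is sent anywhere yet. That means you can check your work straight away, as in this sketch. It uses a scratch repository and the same placeholder URL as above, so substitute your own username and project.

```shell
project=$(mktemp -d)
cd "$project"
git init -q

# The URL is a placeholder; use your own username and project title.
git remote add origin https://github.com/username/project.git

git remote -v    # lists each remote with its fetch and push URLs
```

If you spot a typo in the URL, git remote set-url origin followed by the corrected address will fix it.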

Upload your code

Ok, ok, there is no upload on GitHub, but it is payoff time. Once you have a local repository linked to a remote repository, you can just push the code from one to the other.

$ git push -u origin master

Translation: "Dear git; Please push the recent changes I committed to my local branch, known as master, into the remote repository known as origin. Also, please be aware that I'd like origin's copy of this branch to be the default for my future pushes and pulls, sometimes referred to as 'upstream'. Thank you again for your kind assistance. I am forever in your debt. Sincerely, Yours truly, esq. and, etc."

Success!!! You have now successfully pushed your code to GitHub.
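If you'd like a dress rehearsal before pushing to the real thing, the sketch below does the whole dance on your own disk. It is not what GitHub does internally; I'm simply assuming a bare repository in a temporary directory can stand in for the GitHub side. The file, identity and paths are all placeholders.

```shell
work=$(mktemp -d)

# A bare repository on disk plays the part of GitHub in this rehearsal.
git init -q --bare "$work/remote.git"

# A scratch project with one commit in it.
mkdir "$work/project"
cd "$work/project"
git init -q
git config user.name "Your Name"          # placeholder identity
git config user.email "you@example.com"
echo "disp('hello')" > hello.m
git add .
git commit -q -m 'initial commit'

# Newer git installs may call the first branch something other than master,
# so ask git for the current branch name instead of assuming it.
branch=$(git rev-parse --abbrev-ref HEAD)

# Link the two repositories and push, exactly as with GitHub.
git remote add origin "$work/remote.git"
git push -q -u origin "$branch"

git status -sb    # first line shows the branch and the upstream it tracks
git --git-dir="$work/remote.git" log --oneline "$branch"
```

The last command reads the history from the stand-in remote, so seeing your commit there means the push arrived.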

Or at least I hope you were successful. If not, if you've tried to follow this post and the directions at GitHub and you still feel lost, there is more help out there. Many universities are running Software Carpentry bootcamps to help students and faculty develop more professional programming skills. The skills taught aim to improve software collaboration and impart the skills needed to carry out reproducible research. Two key tools they teach are version control with git and collaboration via GitHub.
Live long and collaborate!