A quick introduction to version control with Git and GitHub

(1) Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA
(2) Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
(3) Software Carpentry Foundation, Toronto, Ontario, Canada

Introduction to version control

Many scientists write code as part of their research. Just as experiments are logged in laboratory notebooks, it is important to document the code you use for analysis. However, a few key problems can arise when iteratively developing code that make it difficult to document and track which code version was used to create each result. First, you often need to experiment with new ideas, such as adding new features to a script or increasing the speed of a slow step, but you do not want to risk breaking the currently working code. One often utilized solution is to make a copy of the script before making new edits. However, this can quickly become a problem because it clutters your filesystem with uninformative filenames, e.g. , , , etc. It is difficult to remember the differences between the versions of the files, and more importantly which version you used to produce specific results, especially if you return to the code months later. Second, you will likely share your code with multiple lab mates or collaborators and they may have suggestions on how to improve it. If you email the code to multiple people, you will have to manually incorporate all the changes each of them sends.

Fortunately, software engineers have already developed software to manage these issues: version control. A version control system (VCS) allows you to track the iterative changes you make to your code. Thus you can experiment with new ideas but always have the option to revert to a specific past version of the code you used to generate particular results. Furthermore, you can record messages as you save each successive version so that you (or anyone else) reviewing the development history of the code can understand the rationale for the given edits. A VCS also facilitates collaboration. Using a VCS, your collaborators can make and save changes to the code, and you can automatically incorporate these changes into the main code base. The collaborative aspect is enhanced by the emergence of websites that host version-controlled code.

In this quick guide, we introduce you to one VCS, Git, and one online hosting site, GitHub, both of which are currently popular among scientists and programmers in general. More importantly, we hope to convince you that although mastering a given VCS takes time, you can already achieve great benefits by getting started using a few simple commands. Furthermore, not only does using a VCS solve many common problems when writing code, it can also improve the scientific process. By tracking your code development with a VCS and hosting it online, you are performing science that is more transparent, reproducible, and open to collaboration (Ram 2013, Wilson 2014). There is no reason this framework needs to be limited only to code; a VCS is well-suited for tracking any plain-text files: manuscripts, electronic lab notebooks, protocols, etc.

Version your code

The first step is to learn how to version your own code. In this tutorial, we will run Git from the command line of the Unix shell. Thus we expect readers are already comfortable with navigating a filesystem and running basic commands in such an environment. You can find directions for installing Git for the operating system running on your computer by following one of the links provided in Table 1. There are many graphical user interfaces (GUIs) available for running Git (Table 1), which we encourage you to explore, but learning to use Git on the command line is necessary for performing more advanced operations and using Git on a remote machine.

To follow along, first create a folder in your home directory named thesis. Next download the three files provided in Supporting Information and place them in the thesis directory. Imagine that as part of your thesis you are studying the transcription factor CTCF, and you want to identify high-confidence binding sites in kidney epithelial cells. To do this, you will use publicly available ChIP-seq data produced by the ENCODE consortium (ENCODE Project Consortium 2012). ChIP-seq is a method for finding the sites in the genome where a transcription factor is bound, and these sites are referred to as peaks (Bailey 2013). Of the three scripts, the first downloads the ENCODE CTCF ChIP-seq data from multiple types of kidney samples and calls peaks (S1 Data), the second filters peaks with a fold change cutoff and merges peaks from the different kidney samples (S2 Data), and analyze.R creates diagnostic plots on the length of the peaks and their distribution across the genome (S3 Data).

If you have just installed Git, the first thing you need to do is provide some information about yourself, since it records who makes each change to the file(s). Set your name and email by running the following lines, but replacing “First Last” and “user@domain” with your full name and email address, respectively.

$ git config --global user.name "First Last"
$ git config --global user.email "user@domain"
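To confirm what Git has recorded, you can read a setting back with git config --get. The sketch below is only an illustration: it works in a throwaway directory created with mktemp and leaves out --global, so the values are stored for that one scratch repository and your real configuration is not touched.

```shell
# Work in a scratch repository so nothing in your real setup changes.
scratch=$(mktemp -d)
cd "$scratch"
git init -q

# Without --global, the settings apply to this repository only.
git config user.name "First Last"
git config user.email "user@domain"

# Read the values back to check what Git will record with each commit.
name=$(git config --get user.name)
email=$(git config --get user.email)
echo "$name <$email>"
```

Running the same two --get commands inside your own repository shows the identity that will be attached to your commits.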

To start versioning your code with Git, navigate to your newly created directory, ~/thesis. Run the command git init to initialize the current folder as a Git repository (Fig. 1, Fig. 2A). A repository (or repo, for short) refers to the current version of the tracked files as well as all the previously saved versions (Box 1). Only files that are located within this directory (and any subdirectories) have the potential to be version controlled, i.e. Git ignores all files outside of the initialized directory. For this reason, projects under version control tend to be stored within a single directory to correspond with a single Git repository. For strategies on how to best organize your own projects, see Noble (2009).

$ cd ~/thesis
$ ls
analyze.R
$ git init
Initialized empty Git repository in ~/thesis/.git/
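One way to see what git init did is to look for the hidden .git folder it creates, which is where Git keeps the saved versions. The following is a minimal sketch run in a scratch directory (made with mktemp, not your thesis folder), so it is safe to try.

```shell
# Create an empty scratch directory and turn it into a repository.
scratch=$(mktemp -d)
cd "$scratch"
git init -q

# The repository data lives in a hidden .git subdirectory.
ls -a

# Ask Git whether the current directory is part of a repository.
inside=$(git rev-parse --is-inside-work-tree)
echo "$inside"
```

Deleting the .git folder would discard the saved history while leaving the current files in place, which is why the whole project is kept under the one initialized directory.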

Now you are ready to start versioning your code (Fig. 1). Conceptually, Git saves snapshots of the changes you make to your files whenever you instruct it to. For instance, after you edit a script in your text editor, you save the updated script to your thesis folder. If you tell Git to save a snapshot of the updated document, then you will have a permanent record of the file in that exact version even if you make subsequent edits to the file. In the Git framework, any changes you have made to a script, but have not yet recorded as a snapshot with Git, reside in the working directory only (Fig. 1). To follow what Git is doing as you record the initial version of your files, use the informative command git status.

$ git status
On branch master
Initial commit
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        analyze.R
nothing added to commit but untracked files present (use "git add" to track)

There are a few key things to notice from this output. First, the three scripts are recognized as untracked files because you have not told Git to start tracking anything yet. Second, the word “commit” is Git terminology for snapshot. As a noun it means “a version of the code”, e.g. “the figure was generated using the commit from yesterday” (Box 1). This word can also be used as a verb, in which case it means “to save”, e.g. “to commit a change.” Lastly, the output explains how you can track your files using git add. Start tracking one of the scripts.

$ git add

And check its new status.

$ git status
On branch master
Initial commit
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        analyze.R

Since this is the first time that you have told Git about this file, two key things have happened. First, the file is now being tracked, which means Git recognizes it as a file you wish to be version controlled (Box 1). Second, the changes made to the file (in this case the entire file because it is the first commit) have been added to the staging area (Fig. 1). Adding a file to the staging area will result in the changes to that file being included in the next commit, or snapshot of the code (Box 1). As an analogy, adding files to the staging area is like putting things in a box to mail off, and committing is like putting the box in the mail.
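The three states a file passes through (untracked, staged, committed) can be watched with the compact form of git status. This is a sketch in a scratch repository; notes.txt is a hypothetical file invented for the illustration, and the name and email are set only for that scratch repository.

```shell
# Set up a scratch repository with a repository-level identity.
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "First Last"
git config user.email "user@domain"

# notes.txt is a hypothetical file used only for this illustration.
echo "first draft" > notes.txt

# Untracked files are marked with ?? in the short status format.
before=$(git status --short)
echo "$before"

# Staging puts the file in the box; a newly added file is marked with A.
git add notes.txt
staged=$(git status --short)
echo "$staged"

# Committing puts the box in the mail; afterwards nothing is left to report.
git commit -q -m "Add notes."
after=$(git status --short)
echo "$after"
```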

Since this will be the first commit, or first version of the code, use git add to begin tracking the other two files and add their changes to the staging area as well. Then create the first commit using the command git commit.

$ git add analyze.R
$ git commit -m "Add initial version of thesis code."
[master (root-commit) 660213b] Add initial version of thesis code.
 3 files changed, 154 insertions(+)
 create mode 100644 analyze.R
 create mode 100644
 create mode 100644

Notice the flag -m was used to pass a message for the commit. This message describes the changes that have been made to the code and is required. If you do not pass a message at the command line, the default text editor for your system will open so you can enter the message. You have just performed the typical development cycle with Git: make some changes, add updated files to the staging area, and commit the changes as a snapshot once you are satisfied with them (Fig. 2).

Since Git records all of the commits, you can always look through the complete history of a project. To view the record of your commits, use the command git log. For each commit, it lists the unique identifier for that revision, author, date, and commit message.

$ git log
commit 660213b91af167d992885e45ab19f585f02d4661
Author: First Last <user@domain>
Date:   Fri Aug 21 14:52:05 2015 -0500
    Add initial version of thesis code.
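Once a project has more than a handful of commits, the full log becomes long. A condensed view is available with the --oneline option, which prints the abbreviated identifier and the message of each commit on a single line, newest first. The sketch below builds a scratch repository with two commits to show it; notes.txt and both commit messages are made up for the example.

```shell
# Set up a scratch repository with a repository-level identity.
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "First Last"
git config user.email "user@domain"

# Record two commits of a hypothetical file.
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes."
echo "second draft" >> notes.txt
git add notes.txt
git commit -q -m "Extend notes."

# One line per commit: abbreviated identifier, then the message.
git log --oneline

# Count the commits reachable from the current one.
count=$(git rev-list --count HEAD)
latest=$(git log -1 --format=%s)
echo "$count commits, latest: $latest"
```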

The commit identifier can be used to compare two different versions of a file, restore a file to a previous version from a past commit, and even retrieve tracked files if you accidentally delete them.
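As an illustration of the first of these uses, git diff accepts two commit identifiers followed by a file path and reports how the file changed between them. This is a sketch in a scratch repository: settings.txt and its contents are hypothetical, and the identifiers are looked up with git rev-parse rather than typed by hand.

```shell
# Set up a scratch repository with a repository-level identity.
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "First Last"
git config user.email "user@domain"

# settings.txt is a hypothetical file used only for this illustration.
echo "cutoff = 10" > settings.txt
git add settings.txt
git commit -q -m "Set cutoff to 10."
echo "cutoff = 20" > settings.txt
git add settings.txt
git commit -q -m "Raise cutoff to 20."

# HEAD names the latest commit and HEAD~1 the one before it;
# rev-parse turns each into its full identifier.
old=$(git rev-parse HEAD~1)
new=$(git rev-parse HEAD)

# Show how settings.txt changed between the two commits.
changes=$(git diff "$old" "$new" -- settings.txt)
echo "$changes"
```

In your own repository you would copy the two identifiers from the output of git log instead.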

Now you are free to make changes to the files knowing that you can always revert them to the state of this commit by referencing its identifier. As an example, edit the script that filters peaks so that the fold change cutoff is more stringent. Here is the current bottom of the file.

$ tail
# Filter based on fold-change over control sample
fc_cutoff = 10
epithelial = epithelial.filter(filter_fold_change, fc = fc_cutoff).saveas()
proximal_tube = proximal_tube.filter(filter_fold_change, fc = fc_cutoff).saveas()
kidney = kidney.filter(filter_fold_change, fc = fc_cutoff).saveas()
# Identify only those sites that are peaks in all three tissue types
combined = pybedtools.BedTool().multi_intersect(
    i = [epithelial.fn, proximal_tube.fn, kidney.fn])
union = combined.filter(lambda x: int(x[3]) == 3).saveas()
union.cut(range(3)).saveas(data + "/sites-union.bed")

Using a text editor, increase the fold change cutoff from 10 to 20.

$ tail
# Filter based on fold-change over control sample
fc_cutoff = 20
epithelial = epithelial.filter(filter_fold_change, fc = fc_cutoff).saveas()
proximal_tube = proximal_tube.filter(filter_fold_change, fc = fc_cutoff).saveas()
kidney = kidney.filter(filter_fold_change, fc = fc_cutoff).saveas()
# Identify only those sites that are peaks in all three tissue types
combined = pybedtools.BedTool().multi_intersect(
    i = [epithelial.fn, proximal_tube.fn, kidney.fn])
union = combined.filter(lambda x: int(x[3]) == 3).saveas()
union.cut(range(3)).saveas(data + "/sites-union.bed")

Because Git is tracking this file, it recognizes that the file has been changed since the last commit.

$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#       modified:
#
no changes added to commit (use "git add" and/or "git commit -a")

The report from git status indicates that the changes to the file are not staged, i.e. they are in the working directory (Fig. 1). To view the unstaged changes, run the command git diff.

$ git diff
diff --git a/ b/
index 7b8c058..76d84ce 100644
--- a/
+++ b/
@@ -28,7 +28,7 @@ def filter_fold_change(feature, fc = 1):
         return False
 # Filter based on fold-change over control sample
-fc_cutoff = 10
+fc_cutoff = 20
 epithelial = epithelial.filter(filter_fold_change, fc = fc_cutoff).saveas()
 proximal_tube = proximal_tube.filter(filter_fold_change, fc = fc_cutoff).saveas()
 kidney = kidney.filter(filter_fold_change, fc = fc_cutoff).saveas()

Any lines of text that have been added to the script are indicated with a + and any lines that have been removed with a -. Here, we altered the line of code that sets the value of fc_cutoff. git diff displays this change as the previous line being removed and a new line being added with our update incorporated. You can ignore the first five lines of output because they are directions for other software programs that can merge changes to files. If you wanted to keep this edit, you could add the file to the staging area using git add and then commit the change using git commit, as you did above. Instead, this time undo the edit by following the directions from the output of git status to “discard changes in the working directory” using the command git checkout.

$ git checkout --
$ git diff

Now git diff returns no output because git checkout undid the unstaged edit you had made to the file. This ability to undo past edits to a file is not limited to unstaged changes in the working directory. If you had committed multiple changes to the file and then decided you wanted the original version from the initial commit, you could replace the argument -- with the commit identifier of the first commit you made above (your commit identifier will be different; use git log to find it). The -- used above was simply a placeholder for the first argument: when no commit is given, git checkout restores the most recent version of the file from the staging area (if you haven’t staged any changes to this file, as is the case here, the version of the file in the staging area is identical to the version in the last commit). Instead of using the entire commit identifier, use only the first seven characters; this is simply a convention, since seven characters are usually enough to be unique.

$ git checkout 660213b
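Both forms of git checkout described above can be tried safely in a scratch repository. The sketch below assumes a hypothetical file, settings.txt, with made-up contents; in each command the file path comes last, after either -- or the commit identifier. The abbreviated identifier is obtained with git rev-parse --short instead of being copied from git log.

```shell
# Set up a scratch repository with a repository-level identity.
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "First Last"
git config user.email "user@domain"

# Commit two versions of a hypothetical file, remembering the first commit.
echo "cutoff = 10" > settings.txt
git add settings.txt
git commit -q -m "Set cutoff to 10."
first=$(git rev-parse --short HEAD)
echo "cutoff = 20" > settings.txt
git add settings.txt
git commit -q -m "Raise cutoff to 20."

# An unstaged edit can be discarded, restoring the last committed version.
echo "cutoff = 100" > settings.txt
git checkout -- settings.txt
current=$(cat settings.txt)
echo "$current"

# Naming a commit restores the file as it was in that commit.
git checkout "$first" -- settings.txt
restored=$(cat settings.txt)
echo "$restored"
```

After the second checkout, the restored version is also placed in the staging area, so git status reports it as a change ready to be committed.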

At this point, you have learned the commands needed to version your code with Git. Thus you already have the benefits of being able to make edits to files without copying them first, to create a record of your changes with accompanying messages, and to revert to previous versions of the files if needed. Now you will always be able to recreate past results that were generated with previous versions of the code (see the command git tag for a method to facilitate finding specific past versions) and see the exact changes you have made over the course of a project.
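As a brief illustration of git tag, the sketch below labels a commit with a memorable name and later uses that name in place of the commit identifier. The tag name figure-1-results and the file settings.txt are hypothetical, chosen only for this example, and everything runs in a scratch repository.

```shell
# Set up a scratch repository with a repository-level identity.
scratch=$(mktemp -d)
cd "$scratch"
git init -q
git config user.name "First Last"
git config user.email "user@domain"

# Commit the version of a hypothetical file used to produce a result.
echo "cutoff = 10" > settings.txt
git add settings.txt
git commit -q -m "Set cutoff to 10."

# Label the current commit with a name that is easier to remember.
git tag figure-1-results

# Keep developing; the tag stays attached to the earlier commit.
echo "cutoff = 20" > settings.txt
git add settings.txt
git commit -q -m "Raise cutoff to 20."

# List the tags, then print the file exactly as it was at the tagged commit.
tags=$(git tag)
tagged=$(git show figure-1-results:settings.txt)
echo "$tags: $tagged"
```

A tag can be used anywhere a commit identifier is accepted, including the git checkout and git diff commands shown earlier.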