# John D. Blischak(1), Emily R. Davenport(2), Greg Wilson(3)

(1) Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA

(2) Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA

(3) Software Carpentry Foundation, Toronto, Ontario, Canada

## Introduction to version control

Many scientists write code as part of their research. Just as experiments are logged in laboratory notebooks, it is important to document the code you use for analysis. However, a few key problems can arise when developing code iteratively that make it difficult to document and track which version of the code was used to create each result. First, you often need to experiment with new ideas, such as adding new features to a script or speeding up a slow step, but you do not want to risk breaking the code that currently works. One commonly used solution is to make a copy of the script before making new edits. However, this quickly becomes a problem because it clutters your filesystem with uninformative filenames, e.g. `analysis.sh`, `analysis_02.sh`, `analysis_03.sh`, etc. It is difficult to remember the differences between the versions of the files and, more importantly, which version you used to produce specific results, especially if you return to the code months later. Second, you will likely share your code with multiple lab mates or collaborators, and they may have suggestions on how to improve it. If you email the code to multiple people, you will have to manually incorporate all the changes each of them sends.

Fortunately, software engineers have already developed software to manage these issues: version control. A version control system (VCS) allows you to track the iterative changes you make to your code. Thus you can experiment with new ideas but always have the option to revert to the specific past version of the code that you used to generate particular results. Furthermore, you can record a message each time you save a new version, so that anyone reviewing the development history of the code (including you) can understand the rationale for each edit. A VCS also facilitates collaboration: your collaborators can make and save changes to the code, and you can automatically incorporate these changes into the main code base. This collaborative aspect is further enhanced by websites that host version-controlled code.

In this quick guide, we introduce you to one VCS, Git (git-scm.com), and one online hosting site, GitHub (github.com), both of which are currently popular among scientists and programmers in general. More importantly, we hope to convince you that although mastering a given VCS takes time, you can already achieve great benefits by getting started using a few simple commands. Furthermore, not only does using a VCS solve many common problems when writing code, it can also improve the scientific process. By tracking your code development with a VCS and hosting it online, you are performing science that is more transparent, reproducible, and open to collaboration (Ram 2013, Wilson 2014). There is no reason this framework needs to be limited only to code; a VCS is well-suited for tracking any plain-text files: manuscripts, electronic lab notebooks, protocols, etc.

The first step is to learn how to version your own code. In this tutorial, we will run Git from the command line of the Unix shell. Thus we expect readers are already comfortable with navigating a filesystem and running basic commands in such an environment. You can find directions for installing Git for the operating system running on your computer by following one of the links provided in Table 1. There are many graphical user interfaces (GUIs) available for running Git (Table 1), which we encourage you to explore, but learning to use Git on the command line is necessary for performing more advanced operations and using Git on a remote machine.

To follow along, first create a folder in your home directory named `thesis`. Next download the three files provided in Supporting Information and place them in the `thesis` directory. Imagine that as part of your thesis you are studying the transcription factor CTCF, and you want to identify high-confidence binding sites in kidney epithelial cells. To do this, you will utilize publicly available ChIP-seq data produced by the ENCODE consortium (ENCODE Project Consortium 2012). ChIP-seq is a method for finding the sites in the genome where a transcription factor is bound, and these sites are referred to as peaks (Bailey 2013). The script `process.sh` downloads the ENCODE CTCF ChIP-seq data from multiple types of kidney samples and calls peaks (S1 Data), `clean.py` filters peaks with a fold change cutoff and merges peaks from the different kidney samples (S2 Data), and `analyze.R` creates diagnostic plots on the length of the peaks and their distribution across the genome (S3 Data).

If you have just installed Git, the first thing you need to do is provide some information about yourself, since it records who makes each change to the file(s). Set your name and email by running the following lines, but replacing “First Last” and “user@domain” with your full name and email address, respectively.

    $ git config --global user.name "First Last"
    $ git config --global user.email "user@domain"
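To confirm that the settings were recorded, you can read them back with the same command. The following is a minimal sketch, not part of the original tutorial: it assumes a throwaway home directory (created with `mktemp`) so that it can be run without touching your real settings. On your own machine you would skip the first two commands and run only the `git config` lines.

```shell
# Assumption for this sketch only: use a temporary HOME so that the
# example does not modify your real ~/.gitconfig
export HOME="$(mktemp -d)"
unset XDG_CONFIG_HOME

# Record the name and email, as in the tutorial
git config --global user.name "First Last"
git config --global user.email "user@domain"

# Read each value back
git config --global user.name    # prints: First Last
git config --global user.email   # prints: user@domain

# List every setting stored in the global configuration file
git config --global --list
```

Because of the `--global` flag, these values are stored once per user account and apply to every repository you create on that machine.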

To start versioning your code with Git, navigate to your newly created directory, `~/thesis`. Run the command `git init` to initialize the current folder as a Git repository (Fig. 1, Fig. 2A). A repository (or repo, for short) refers to the current version of the tracked files as well as all the previously saved versions (Box 1). Only files that are located within this directory (and any subdirectories) have the potential to be version controlled, i.e. Git ignores all files outside of the initialized directory. For this reason, each project under version control tends to be stored within a single directory that corresponds to a single Git repository. For strategies on how to best organize your own projects, see Noble (2009).
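As a rough illustration of the scope of a repository, the following sketch initializes one in a throwaway directory rather than in `~/thesis`. The directory and file names (`project`, `inside.txt`, `outside.txt`) are hypothetical and are not part of the tutorial's files. It uses `git status --short`, a compact form of the status command introduced below, in which `??` marks an untracked file.

```shell
# Hypothetical scratch location; the tutorial itself uses ~/thesis
scratch="$(mktemp -d)"
mkdir "$scratch/project"
touch "$scratch/outside.txt" "$scratch/project/inside.txt"

cd "$scratch/project"
git init --quiet

# git init creates a hidden .git folder that stores the repository data
ls -d .git              # prints: .git

# Only files inside the initialized directory are visible to Git:
# inside.txt is reported as untracked, outside.txt is not listed at all
git status --short      # prints: ?? inside.txt
```

In the tutorial itself, the same initialization is run inside `~/thesis`.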

    $ cd ~/thesis
    $ ls
    analyze.R clean.py process.sh
    $ git init
    Initialized empty Git repository in ~/thesis/.git/

Now you are ready to start versioning your code (Fig. 1). Conceptually, Git saves snapshots of the changes you make to your files whenever you instruct it to. For instance, after you edit a script in your text editor, you save the updated script to your `thesis` folder. If you tell Git to save a snapshot of the updated document, then you will have a permanent record of the file in that exact version even if you make subsequent edits to the file. In the Git framework, any changes you have made to a script, but have not yet recorded as a snapshot with Git, reside in the working directory only (Fig. 1). To follow what Git is doing as you record the initial version of your files, use the informative command `git status`.

    $ git status
    On branch master

    Initial commit

    Untracked files:
      (use "git add <file>..." to include in what will be committed)

        analyze.R
        clean.py
        process.sh

    nothing added to commit but untracked files present (use "git add" to track)

There are a few key things to notice from this output. First, the three scripts are recognized as untracked files because you have not told Git to start tracking anything yet. Second, the word “commit” is Git terminology for snapshot. As a noun it means “a version of the code”, e.g. “the figure was generated using the commit from yesterday” (Box 1). This word can also be used as a verb, in which case it means “to save”, e.g. “to commit a change.” Lastly, the output explains how you can track your files using `git add`. Start tracking the file `process.sh`.

    $ git add process.sh

And check its new status.

    $ git status
    On branch master
    In