# John D. Blischak(1), Emily R. Davenport(2), Greg Wilson(3) (1) Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA (2) Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA (3) Software Carpentry Foundation, Toronto, Ontario, Canada

## Introduction to version control

Many scientists write code as part of their research. Just as experiments are logged in laboratory notebooks, it is important to document the code you use for analysis. However, a few key problems can arise when iteratively developing code that make it difficult to document and track which code version was used to create each result. First, you often need to experiment with new ideas, such as adding new features to a script or increasing the speed of a slow step, but you do not want to risk breaking the currently working code. One often utilized solution is to make a copy of the script before making new edits. However, this can quickly become a problem because it clutters your filesystem with uninformative filenames, e.g. analysis.sh , analysis_02.sh , analysis_03.sh , etc. It is difficult to remember the differences between the versions of the files, and more importantly which version you used to produce specific results, especially if you return to the code months later. Second, you will likely share your code with multiple lab mates or collaborators and they may have suggestions on how to improve it. If you email the code to multiple people, you will have to manually incorporate all the changes each of them sends.

Fortunately, software engineers have already developed software to manage these issues: version control. A version control system (VCS) allows you to track the iterative changes you make to your code. Thus you can experiment with new ideas but always have the option to revert to a specific past version of the code you used to generate particular results. Furthermore, you can record messages as you save each successive version so that you (or anyone else) reviewing the development history of the code is able to understand the rationale for the given edits. Also, it facilitates collaboration. Using a VCS, your collaborators can make and save changes to the code, and you can automatically incorporate these changes to the main code base. The collaborative aspect is enhanced with the emergence of websites that host version controlled code.

In this quick guide, we introduce you to one VCS, Git (git-scm.com), and one online hosting site, GitHub (github.com), both of which are currently popular among scientists and programmers in general. More importantly, we hope to convince you that although mastering a given VCS takes time, you can already achieve great benefits by getting started using a few simple commands. Furthermore, not only does using a VCS solve many common problems when writing code, it can also improve the scientific process. By tracking your code development with a VCS and hosting it online, you are performing science that is more transparent, reproducible, and open to collaboration (Ram 2013, Wilson 2014). There is no reason this framework needs to be limited only to code; a VCS is well-suited for tracking any plain-text files: manuscripts, electronic lab notebooks, protocols, etc.

To follow along, first create a folder in your home directory named thesis . Next download the three files provided in Supporting Information and place them in the thesis directory. Imagine that as part of your thesis you are studying the transcription factor CTCF, and you want to identify high-confidence binding sites in kidney epithelial cells. To do this, you will utilize publicly available ChIP-seq data produced by the ENCODE consortium (ENCODE Project Consortium 2012). ChIP-seq is a method for finding the sites in the genome where a transcription factor is bound, and these sites are referred to as peaks (Bailey 2013). process.sh downloads the ENCODE CTCF ChIP-seq data from multiple types of kidney samples and calls peaks (LABEL:S1_Data), clean.py filters peaks with a fold change cutoff and merges peaks from the different kidney samples (LABEL:S2