Swabs to Genomes: A Comprehensive Workflow

Madison I. Dunitz (1*)
Jenna M. Lang (1*)
Guillaume Jospin (1)
Aaron E. Darling (2)
Jonathan A. Eisen (1#)
David A. Coil (1)
(1) UC Davis, Genome Center
(2) ithree institute, University of Technology Sydney, Australia
(*) These authors contributed equally to this work.
(#) Corresponding author: jaeisen@ucdavis.edu


The sequencing, assembly, and basic analysis of microbial genomes, once a painstaking and expensive undertaking, have become much easier for research labs with access to standard molecular biology and computational tools. However, there is a confusing variety of options available for DNA library preparation and sequencing, and inexperience with bioinformatics can pose a significant barrier to entry for many who may be interested in microbial genomics. The objective of the present study was to design, test, troubleshoot, and publish a simple, comprehensive workflow from the collection of an environmental sample (a swab) to a published microbial genome, empowering even a lab or classroom with limited resources and bioinformatics experience to perform it.


Thanks to decreases in cost and difficulty, sequencing the genome of a microorganism is becoming a relatively common activity in many research and educational institutions. However, such microbial genome sequencing is still far from routine or simple. The objective of this work was to design, test, troubleshoot, and publish a comprehensive workflow for microbial genome sequencing, encompassing everything from culturing new organisms to depositing sequence data, enabling even a lab with limited resources and bioinformatics experience to perform it.

In late 2011, our lab began a project with the goal of having undergraduate students generate genome sequences for microorganisms isolated from the "built environment". The project focused on the built environment because it was part of the larger "microBEnet" (microbiology of the built environment network, www.microbe.net) effort. This project serves many purposes, including (1) engaging undergraduates in research on microbiology of the built environment, (2) generating "reference genomes" for microbes that are found in the built environment, and (3) providing a resource for educational activities on the microbiology of the built environment. As part of this project, undergraduate students isolated and classified microbes, sequenced and assembled their genomes, submitted the genome sequences to databases housed by the National Center for Biotechnology Information (NCBI), and published the genomes (Lo 2013)(Bendiks 2013)(Flanagan 2013)(Diep 2013)(Coil 2013)(Holland-Moritz 2013). Despite the reduced cost of genome sequencing and the availability of diverse tools that make many of the steps easier (e.g., kits for library prep, cost-effective sequencing, bioinformatics pipelines), there were still a significant number of stumbling blocks. Moreover, some portions of the project involve choosing among a wide variety of options (e.g., choice of assembly program), which can create a barrier for a lab without a bioinformatician. Each option comes with its own advantages and disadvantages in terms of complexity, expense, computing power, time, and experience required. In this workflow, we describe an approach to genome sequencing that allows a researcher to go from a swab to a published paper (Figure 1). We used this workflow to process a novel Tatumella sp. isolate and publish the genome (Dunitz 2014). The data from every step of the workflow, using this Tatumella isolate, are available on Figshare (Coil 2014).

The sequencing and de novo assembly of genomes have yielded enormous scientific insight, revolutionizing a wide range of fields, from epidemiology to ecology. Our hope is that this workflow will help make this revolution more accessible to all scientists, as well as present educational opportunities for undergraduate researchers and classes.

There are several excellent resources that focus on smaller portions of this entire workflow. Examples include the Computational Genomics Pipeline (Kislyuk 2010) and a "Beginner’s guide to comparative bacterial genome analysis" (Edwards 2013). Clarke et al. describe a similar pipeline focused on human mitochondrial genomes (Clarke 2014).


Background: bioinformatics

Command Line/Terminal Tutorial

This workflow is written assuming that the user has a computer running Mac OS X or Linux. It is also possible to carry out many of the computational parts of this workflow in a Windows environment, but getting these steps to work in Windows is outside the scope of this project.

Some parts of this workflow require the user to provide text instructions for software programs by using a command line interface. While potentially intimidating to computer novices, the use of command line interfaces is sometimes necessary (e.g., some programs do not have graphical interfaces) and is also sometimes much more efficient. To access the command line on a Mac, open the Terminal program (the default location for this program is in the "Utilities" folder under "Applications").

When this application is launched, a new window will appear. This is known as a "terminal" or a "terminal window". In the terminal window, you can interact with your computer without using a mouse. Many popular programs have a GUI (Graphical User Interface) but some programs used in this workflow will not. So, instead of double-clicking to make a program run, you will type a command in the terminal window. Throughout this tutorial, we will instruct you to type commands, but copying and pasting them (when possible) will reduce the occurrence of typos. We will walk you through how to run all of the programs required for this workflow, but you must first acquire a basic familiarity with how to interact with your computer through the terminal window. Below is a list of commands that will be required to use thi