Authorea

Jenna M. Lang edited Genome Assembly and Annotation.md about 10 years ago

Commit id: 3482c1ab804273fcd138364d9f6119bb7d8f03b1


#Genome Assembly and Annotation

##Assembly

Genome assembly typically consists of data cleaning (quality filtering and adaptor removal), error correction, contig assembly, scaffolding, and verification of scaffolds/contigs. A large array of programs can perform some or most of these steps. They include commercial and open-source options; some are very user friendly, and others are extremely difficult to install or use. Good overviews of assemblers and the assembly process include the GAGE project (REF), the GAGE-2 project (REF), and the Assemblathon Project (REF). Common assemblers for bacterial genomes include SPAdes (REF), MIRA (REF), SGA (REF), Velvet (REF), CLC (REF), and A5 (REF). For this workflow we recommend the open-source A5 assembly pipeline, which automates all of the steps described above with a single command (REF). A5 is designed to work with raw, demultiplexed Illumina data, and a recent version has been optimized for longer reads from the MiSeq (REF). Input reads can be paired or unpaired, and paired reads can be supplied as separate files (forward reads in one file, reverse reads in another) or as a single interleaved file.
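To make the difference between separate and interleaved read files concrete, the sketch below merges a forward and a reverse FASTQ stream into one interleaved stream. It is only an illustration, not part of A5: the function names are hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequence lines) with the two inputs listing read pairs in the same order.

```python
import io
from itertools import islice


def read_fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(forward, reverse, out):
    """Write forward/reverse records alternately to out; return the pair count."""
    pairs = 0
    reverse_records = read_fastq_records(reverse)
    for fwd in read_fastq_records(forward):
        rev = next(reverse_records, None)
        if rev is None:
            raise ValueError("forward file has more reads than reverse file")
        out.writelines(fwd)
        out.writelines(rev)
        pairs += 1
    if next(reverse_records, None) is not None:
        raise ValueError("reverse file has more reads than forward file")
    return pairs


# Tiny in-memory example; real data would be opened with open(path).
forward = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n")
reverse = io.StringIO("@read1/2\nTTAA\n+\nIIII\n@read2/2\nCCGG\n+\nIIII\n")
out = io.StringIO()
n_pairs = interleave(forward, reverse, out)
print(n_pairs, "pairs written")
print(out.getvalue(), end="")
```

In the interleaved output each forward read is immediately followed by its mate, which is what "interleaved" means in the paragraph above.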

A5 is a command-line program. On a Mac you will need to run it from the terminal; see Section II, "Using the Terminal", for an introduction to the terminal.

###Running A5

Once you have opened the terminal, navigate to the folder you just created, because A5 writes its output files to whatever location you are in when you call the program. In this example I created the folder on the desktop and named it a5_output, so the syntax for navigating to the folder is

    $ cd Desktop/a5_output/

For more on interpreting these numbers, proceed to Section VII, "Verification of the Assembly".

###Verification of the Assembly

There are three parts to the verification of a genome assembly. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (e.g. number of contigs and contig N50). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness", which is difficult to measure except in cases where a close reference is available. Here we use PhyloSift to assess the presence or absence of 37 highly conserved single-copy bacterial genes in the assembly as a rough proxy for completeness.

###Interpretation of A5 stats

The first two numbers shown are the number of contigs and scaffolds, respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig, but that is extremely unlikely with short-read data. At the other extreme, a bacterial assembly in 1000 contigs would be very fragmented. In our experience, bacterial assemblies of PE300bp Illumina data assembled with A5 tend to range from 10 to 200 contigs. It is also worth noting that, unless you are studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% with this method (REF).

"Genome Size" and "Longest Scaffold" are simply given in base pairs. While genome size can vary within taxa, it provides a second sanity check for the assembly. When expecting a 5 Mb genome, finding only 2 Mb in the assembly would be problematic.

"N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs, is the most common measure of assembly quality; for N50, larger is better.
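The N50 definition above can be checked by hand with a few lines of code. The following is a minimal sketch, not A5's own implementation: the function name `assembly_stats` and the toy contig lengths are made up for illustration, and the input is assumed to be a plain list of sequence lengths in base pairs.

```python
def assembly_stats(contig_lengths):
    """Return contig count, total size, longest contig and N50 (all in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = lengths[-1]
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "longest_bp": lengths[0],
        "n50_bp": n50,
    }


# Hypothetical toy assembly of five contigs.
stats = assembly_stats([100, 200, 300, 400, 500])
print(stats["n50_bp"])  # 400: the 500 and 400 bp contigs hold 900 of 1500 bp
```

The `total_bp` value plays the role of the genome-size sanity check described above: it can be compared against the size expected for the organism.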

Finally, "X_cov" shows the average coverage across the genome. For Illumina data we recommend that this number be between ~30X and 100X. With much less than 30X coverage, the quality of any given base in the assembly may come into question. Conversely, too much coverage can reduce the quality of the assembly and may require downsampling the reads.

###Verification of 16S Sequence

Follow the steps described in Section IX, "Making a Phylogenetic Tree", for obtaining and BLASTing the full-length 16S sequence.

PhyloSift