Authorea

Jenna M. Lang edited Genome Assembly and Annotation.md over 9 years ago

Commit id: 89144ae13c3f9236edb46b7020778811ad073fa6

deletions | additions

There is a plethora of programs that can perform some, or most of these steps. These programs include commercial and open-source options, some are very user friendly and some are extremely difficult to use/install. Common assemblers for bacterial genomes include SPAdes \cite{Bankevich_2012}, MIRA \cite{Chevreux_2004}, SGA \cite{Simpson_2010}, Velvet \cite{Zerbino_2008} CLC (CLC Bio), and A5 \cite{Tritt_2012}. Good sources for overviews of genome assemblers and the assembly process include the GAGE project \cite{Salzberg_2012}, the GAGE-B project \cite{Magoc_2013}, and the Assemblathon Project \cite{Earl_2011}. For In this workflow workflow, we recommend use of the open source A5 assembly pipeline which automates all of the steps described above with a single command \cite{Tritt\_2012}. A5 is designed to work with raw, demultiplexed Illumina data and a recent version, A5-miseq, version (A5-miseq) has been optimized for longer reads from the MiSeq \cite{25338718}. Input reads must be paired, and the files can be separate (forward reads in one file, reverse reads in another) or interleaved. These files should have the .fastq extension. See (http://en.wikipedia.org/wiki/FASTQ_format) for a description of the fastq format. You may need assistance from your sequencing center in locating and accessing these files. You will need one of the two following (per genome): 1) a single .fastq file that contains both forward and reverse reads, or 2) two .fastq files, one with forward reads and one with the corresponding reverse reads. These FastQ files can optionally be gzip compressed (as indicated by the .gz file name extension). Download/Install A5 from http://sourceforge.net/projects/ngopt/

After downloading and unzipping the program, change the name of the folder to a5\_pipeline and move it from your downloads folder to your Applications folder. Then, create a new folder which will contain the files generated by the pipeline on your Desktop. By the way, there's nothing special about having your file on the Desktop, it's just there to simplify our instructions. We will refer to this folder as "a5_output", but you should use a more informative name. ###Running A5-miseq Open a Terminal window and navigate to a5\_output. A5-miseq will write all of the assembly output files to the same folder from which you run the program. In this example example, the newly created folder is on the Desktop and named a5_output so the syntax for navigating to the folder in a Terminal window is cd Desktop/a5_output/

/Applications/a5_pipeline/bin/a5_pipeline.pl SequenceFile1.fastq SequenceFile2.fastq MyGenome The program may take a few hours to run. Once it is completed completed, the terminal will display Final assembly in MyGenome.final.scaffolds.fasta. The complete assembly will be located in the a5\_output folder. Among the numerous files generated by A5, two of particular importance are the "MyGenome.contigs.fasta" and "MyGenome.final.scaffolds.fasta" which contain the contigs and scaffolds, respectively.

For more on interpreting these numbers proceed to "Assembly Validation". ###Assembly Validation There are three components to genome assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5-miseq (discussed below). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. using a BLAST search (see section 7 above). The third is "completeness" "completeness," which is difficult to measure except in cases where without a close closely-related reference is available. genome. Nevertheless, we can get an idea of how complete the genome is by looking for highly conserved "housekeeping" genes that are found in almost every bacterial genome. To do this, we use a program called PhyloSift \cite{Darling_2014} to assess the presence or absence of 37 housekeeping genes in the assembly to infer completeness. completeness (see Section X). ###Interpretation of A5-miseq stats To open A5-miseq stats, import it into Excel as a tab delimited CSV file. The first two numbers, shown in columns 2 and 3, are the number of contigs and scaffolds. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig with no unresolved nucleotides but that is extremely unlikely to result from short read data. At the other extreme, we would consider a bacterial assembly in 1000 contigs to be very fragmented. In our experience, acceptable bacterial assemblies using Illumina PE300 data, assembled with A5, tend to range from 10-200 contigs. It is also worth noting that unless studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly which is typically >99% using A5-miseq \cite{25338718}. "Genome Size" and "Longest Scaffold" are simply represented in base pairs. While genome size can vary within taxa, this can be a second useful sanity check for the assembly. When expecting a 5MB genome based on other sequenced isolates from the same genus, if the assembled genome size is 2 MB or 10 MB, a red flag should be raised. "N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs is the most common measure of assembly quality… quality; larger is better. An N50 of 5,000 bp would be quite poor... poor, meaning that half of the entire assembly is in contigs smaller than 5,000 bp. On the other hand an N50 of 1,000,000 bp is considered very good for bacterial genomes sequenced with Illumina technology. The number of raw reads/raw nucleotides "Raw reads"/"Raw nt" and error-corrected reads/nucleotides "EC Reads"/"Raw nt" counts are useful for seeing what percentage of the data has been discarded. A very large difference between these numbers ("% reads passing EC"/"% nt passing EC") would indicate either poor quality sequence data or significant adapter contamination. Adapter contamination rates can be high when the insert size is too small or if there were problems during library preparation. Poor quality sequence data can result from loading the libraries at a molar concentration that was too high for the instrument, from mechanical issues preventing focus of the sequencing instrument's cameras, or from use of a compromised batch of sequencing reagents. Resolution of these issues would entail a discussion with your sequencing provider. A5-miseq reports three depth of coverage statistics which can be used to assess whether sufficient data has been collected for genome assembly. First is the "Raw cov" which is simply the total number of base pairs of sequence data, divided by the assembly size. This gives an estimate of the average number of reads covering each base in the assembly. The actual number of reads at each site can and will vary substantially from the average. The second statistic is the "Median cov" which gives the median depth of coverage among all sites in the assembly. That is, 50% of sites will have greater coverage and 50% will have less than this value. "10th percentile cov" indicates a coverage level below which only 10% of sites in the assembly fall. For Illumina data, the ideal median coverage will lie between ~20X and 100X. Much If you have much less than 20X median coverage and coverage, the quality of individual base calls may be compromised. Ideally, the 10th percentile coverage will be higher than 10, for similar reasons. A separate metric of the base call quality is also reported by A5-miseq as "bases >= Q40". Following assembly, A5-miseq realigns the reads to the assembled sequence and estimates the accuracy of the nucleotide called at each site in the assembly. These accuracies are provided as PHRED quality scores \cite{green2009phrap}, which represent log-scaled probabilities of accuracy. For example a PHRED score of 20 indicates a 99% chance of the correct base, while Q30 and Q40 indicate 99.9% and 99.99% probabilities of the correct base being called. A5-miseq reports the number of assembly bases called with at least Q40.

From the PhyloSift directory Move to the "PS\_temp" directory Within this directory, Phylosift has created a directory with the same name as the input file. Move (cd) to this new directory, and then move to "blastDir". Open the marker\_summary.txt file in the blastDir less marker_summary.txt The DNGNGWU0001-00040 markers represent 37 highly conserved bacterial genes, if one is missing it won't show up as a zero, it is necessary to manually verify the list. Most of the genes should only appear once. An occasional 2 is fine, but if all/a majority of the genes appear twice or even three times you have most likely sequenced multiple bacteria together. Additionally Additionally, check to make sure there is no 18S RNA sequence (at the top of the list) to ensure your sample has not been contaminated with a eukaryote (e.g. (_e.g_. yeast). Important Note: Markers 4, 8 and 38 are no longer included in the Phylosift analysis so do not be concerned if they are not listed. Conversely, Marker 13 is sometimes present in multiple copies and this is not a cause for concern. ##Annotation ###Options

There are a number of different pipelines available for annotation of bacterial genomes. These include Prokka \cite{Seemann_2014}, IMG \cite{Markowitz_2014}, RAST \cite{Overbeek_2014}, GLIMMER \cite{Delcher_2007}, PGAP \cite{Angiuoli_2008} and others. Each of these pipelines has advantages and disadvantages, and each will give slightly different results. Here we recommend RAST since it is web-based, easy to use, returns results within hours, and provides a convenient toolbox for analyzing the results. However, RAST annotations are very difficult to submit to NCBI so we recommend allowing NCBI to re-annotate the genome with PGAP upon submission. Also, we recommend reporting the annotation results from the PGAP annotations in the genome announcement (for consistency.)Why do we also run a RAST annotation? Because we are impatient and we like to see results right away. We do not like having to wait for the NCBI submission process to be completed before we start exploring our data. ###RAST Annotation Navigate to http://rast.nmpdr.org/ and register a new account. Once you have created an account, log in.