Authorea

Aaron Darling edited Genome Assembly and Annotation.md almost 10 years ago

Commit id: 2b260fe404aec773a1e7799480a95c4ea01db304

deletions | additions

4. scaffolding 5. verification of scaffolds/contigs There is a plethora of programs that can perform some, or most of these steps. These programs include commercial and open-source options, some are very user friendly and some are extremely difficult to use/install. Common assemblers for bacterial genomes include SPADES SPAdes \cite{Bankevich_2012}, MIRA \cite{Chevreux_2004}, SGA \cite{Simpson_2010}, Velvet \cite{Zerbino_2008} CLC (CLC Bio), and A5 \cite{Tritt_2012}. Good sources for overviews of genome assemblers and the assembly process include the GAGE project \cite{Salzberg_2012}, the GAGE-B project \cite{Magoc_2013}, and the Assemblathon Project \cite{Earl_2011}. For this workflow we recommend use of the open source A5 assembly pipeline which automates all of the steps described above with a single command \cite{Tritt\_2012}. A5 is designed to work with raw, demultiplexed Illumina data and a recent version version, A5-miseq, has been optimized for longer reads from the MiSeq (Coil et al submitted). Input reads must be paired, and the files can be separate (forward reads in one file, reverse reads in another) or interleaved. These files should have the .fastq extension. See (http://en.wikipedia.org/wiki/FASTQ_format) for a description of the fastq format. You may need assistance from your sequencing center in locating and accessing these files. You will need one of the three following (per genome): 1) a single .fastq file that contains both forward and reverse reads, or 2) two .fastq files, one with forward reads and one with the corresponding reverse reads. These FastQ files can optionally be gzip compressed (as indicated by the .gz file name extension). Download/Install A5 from http://sourceforge.net/projects/ngopt/

Among the numerous files generated by A5, two of particular importance are the "MyGenome.contigs.fasta" and "MyGenome.final.scaffolds.fasta" which contain the contigs and scaffolds, respectively. In addition, A5 generates a file containing information about the quality of the assembly called "assembly_stats.csv" "MyGenome.assembly_stats.csv" To view this file use the "less" command: less assembly_stats.csv MyGenome.assembly_stats.csv For more on interpreting these numbers proceed to "Assembly Validation". ###Assembly Validation There are three components to genome assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 A5-miseq (discussed below). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness" which is difficult to measure except in cases where a close reference is available. Nevertheless, we can get an idea of how complete the genome is by looking for highly conserved "housekeeping" genes that are found in almost every bacterial genome. To do this, we use a program called Phylosift PhyloSift \cite{Darling_2014} to assess the presence or absence of 37 housekeeping genes in the assembly to infer completeness. ###Interpretation of A5 A5-miseq stats To open A5 A5-miseq stats, import it into excel as a tab deliminated CSV file. The first two numbers numbers, shown in columns 2 and 3, are the number of contigs and scaffolds, respectively. scaffolds. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig with no unresolved nucleotides but that is extremely unlikely to result from short read data. At the other extreme, we would consider a bacterial assembly in 1000 contigs to be very fragmented. In our experience, acceptable bacterial assemblies using Ilumina PE300bp data, assembled with A5, tend to range from 10-200 contigs. It is also worth noting that unless studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly which is typically >99% using A5 A5-miseq (Coil et al, submitted). "Genome Size" and "Longest Scaffold" are simply represented as base pairs. While genome size can vary within taxa, this can be a second useful sanity check for the assembly. When expecting a 5MB genome, genome based on other sequenced isolates from the same genus, if the assembled genome size is 2MB, 2MB or 10MB, a red flag should be raised. "N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs is the most common measure of assembly quality… larger is better. An N50 of 5,000 bp would be pretty quite poor... meaning that half of the entire assembly is in contigs smaller than 5,000 bp. On the other hand an N50 of 1,000,000 bp would be great is considered very good fora bacterial genome. genomes sequenced with Illumina technology. The number of raw reads/raw nucleotides "Raw reads"/"Raw nt" and error-corrected reads/nucleotides "EC Reads"/"Raw nt" counts are useful for seeing what percentage of the data has been discarded. A very large difference between these numbers ("% reads passing EC"/"% nt passing EC") would indicate either poor quality input sequence data or significant adapter contamination. Adaptor Adapter contamination rates can be high when the insert size is too small or if there were problems during library preparation. Poor quality sequence data can result from loading the libraries at a molar concentration that was too high for the instrument, from mechanical issues preventing focus of the sequencing instrument's cameras, or from use of a compromised batch of sequencing reagents. AARON DESCRIBE THE COVERAGE STATS HERE. For Illumina A5-miseq reports three dpeth of coverage statistics which can be used to assess whether sufficient data we recommend that this has been collected for genome assembly. First is the "Raw cov" which is simply the total number be between ~30X and 100X. Much less than 30X coverage and of base pairs of sequence data, divided by the quality assembly size. This gives an estimate of any given the average number of reads covering each base in the assembly may come into question. Conversely, too much coverage assembly. The actual number of reads at each site can reduce and will vary substantially from the quality average. The second statistic is the "Median cov" which gives the median depth of coverage among all sites in the assembly and require downsampling. **Instructions or reference for downsampling?** If you assembly. That is, 50% of sites will have greater coveragesignificantly higher than 100x and wish to downsample your data we 50% will have written a script (sub\_sample\_reads) for less than this purpose. You will first need to calculate how many reads you want the script to sample. We recommend determining how many reads would be equivalent to 100x value. "10th percentile cov" indicates a coverage (divide level below which only 10% of sites in the genome size by assembly fall. For Illumina data, the average read length ideal median coverage will lie between ~20X and multiply by 100). You can download the script using the curl command. Create a new directory containing the reads you wish to downsample. In the terminal navigate the directory you just created 100X. Much less than 20X median coverage anddownload the script using quality of individual base calls may be compromised. Ideally, the following syntax 10th percentile coverage will be higher than 10, for similar reasons. curl https://raw.githubusercontent.com/gjospin/scripts/master/subsample_reads.pl > sub_sample_reads.pl To downsample the data use the following command /sub_sample_reads file1 file2 #_reads_to_keep output_file_name for example /Users/Madison/Desktop/sub_sample/sub_sample_reads.pl test_1.fq test_2.fq 250 my_reads.fastq A separate metric of the base call quality is also reported by A5-miseq as "bases >= Q40". Following assembly, A5-miseq realigns the reads to the assembled sequence and estimates the accuracy of the nucleotide called at each site in the assembly. These accuracies are provided as PHRED quality scores (cite PHRED here), which represent log-scaled probabilities of accuracy. For further directions/documentation you can view the script on github https://github.com/gjospin/scripts/blob/master/subsample_reads.pl example a PHRED score of 20 indicates a 99% chance of the correct base, while Q30 and Q40 indicate 99.9% and 99.99% probabilities of the correct base being called. A5-miseq reports the number of assembly bases called with at least Q40. ###Verification of 16S Sequence Follow the steps described in Section ??, "Making a Phylogenetic Tree" for obtaining and performing a BLAST search of the full length 16s sequence.