deletions | additions
diff --git a/Genome Assembly and Annotation.md b/Genome Assembly and Annotation.md
index ecc79d4..11d6f56 100644
--- a/Genome Assembly and Annotation.md
+++ b/Genome Assembly and Annotation.md
...
4. scaffolding
5. verification of scaffolds/contigs
There is a plethora of programs that can perform some, or most of these steps. These programs include commercial and open-source options, some are very user friendly and some are extremely difficult to use/install. Common assemblers for bacterial genomes include
SPADES SPAdes \cite{Bankevich_2012}, MIRA \cite{Chevreux_2004}, SGA \cite{Simpson_2010}, Velvet \cite{Zerbino_2008} CLC (CLC Bio), and A5 \cite{Tritt_2012}. Good sources for overviews of genome assemblers and the assembly process include the GAGE project \cite{Salzberg_2012}, the GAGE-B project \cite{Magoc_2013}, and the Assemblathon Project \cite{Earl_2011}.
For this workflow we recommend use of the open source A5 assembly pipeline which automates all of the steps described above with a single command \cite{Tritt\_2012}. A5 is designed to work with raw, demultiplexed Illumina data and a recent
version version, A5-miseq, has been optimized for longer reads from the MiSeq (Coil et al submitted). Input reads must be paired, and the files can be separate (forward reads in one file, reverse reads in another) or interleaved. These files should have the .fastq extension. See (http://en.wikipedia.org/wiki/FASTQ_format) for a description of the fastq format. You may need assistance from your sequencing center in locating and accessing these files. You will need one of the three following (per genome): 1) a single .fastq file that contains both forward and reverse reads, or 2) two .fastq files, one with forward reads and one with the corresponding reverse reads.
These FastQ files can optionally be gzip compressed (as indicated by the .gz file name extension).
Download/Install A5 from
http://sourceforge.net/projects/ngopt/
...
Among the numerous files generated by A5, two of particular importance are the "MyGenome.contigs.fasta" and "MyGenome.final.scaffolds.fasta" which contain the contigs and scaffolds, respectively.
In addition, A5 generates a file containing information about the quality of the assembly called
"assembly_stats.csv" "MyGenome.assembly_stats.csv"
To view this file use the "less" command:
less
assembly_stats.csv MyGenome.assembly_stats.csv
For more on interpreting these numbers proceed to "Assembly Validation".
###Assembly Validation
There are three components to genome assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by
A5 A5-miseq (discussed below). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness" which is difficult to measure except in cases where a close reference is available. Nevertheless, we can get an idea of how complete the genome is by looking for highly conserved "housekeeping" genes that are found in almost every bacterial genome. To do this, we use a program called
Phylosift PhyloSift \cite{Darling_2014} to assess the presence or absence of 37 housekeeping genes in the assembly to infer completeness.
###Interpretation of
A5 A5-miseq stats
To open
A5 A5-miseq stats, import it into excel as a tab deliminated CSV file.
The first two
numbers numbers, shown
in columns 2 and 3, are the number of contigs and
scaffolds, respectively. scaffolds. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig
with no unresolved nucleotides but that is extremely unlikely to result from short read data. At the other extreme, we would consider a bacterial assembly in 1000 contigs to be very fragmented. In our experience, acceptable bacterial assemblies using Ilumina PE300bp data, assembled with A5, tend to range from 10-200 contigs. It is also worth noting that unless studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly which is typically >99% using
A5 A5-miseq (Coil et al, submitted).
"Genome Size" and "Longest Scaffold" are simply represented as base pairs. While genome size can vary within taxa, this can be a second
useful sanity check for the assembly. When expecting a 5MB
genome, genome based on other sequenced isolates from the same genus, if the assembled genome size is
2MB, 2MB or 10MB, a red flag should be raised. "N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs is the most common measure of assembly quality… larger is better. An N50 of 5,000 bp would be
pretty quite poor... meaning that half of the entire assembly is in contigs smaller than 5,000 bp. On the other hand an N50 of 1,000,000 bp
would be great is considered very good for
a bacterial
genome. genomes sequenced with Illumina technology.
The number of raw reads/raw nucleotides "Raw reads"/"Raw nt" and error-corrected reads/nucleotides "EC Reads"/"Raw nt" counts are useful for seeing what percentage of the data has been discarded. A very large difference between these numbers ("% reads passing EC"/"% nt passing EC") would indicate either poor quality
input sequence data or significant adapter contamination.
Adaptor Adapter contamination
rates can be high when the insert size is too small or if there were problems during library preparation.
Poor quality sequence data can result from loading the libraries at a molar concentration that was too high for the instrument, from mechanical issues preventing focus of the sequencing instrument's cameras, or from use of a compromised batch of sequencing reagents.
AARON DESCRIBE THE COVERAGE STATS HERE. For Illumina A5-miseq reports three dpeth of coverage statistics which can be used to assess whether sufficient data
we recommend that this has been collected for genome assembly. First is the "Raw cov" which is simply the total number
be between ~30X and 100X. Much less than 30X coverage and of base pairs of sequence data, divided by the
quality assembly size. This gives an estimate of
any given the average number of reads covering each base in the
assembly may come into question. Conversely, too much coverage assembly. The actual number of reads at each site can
reduce and will vary substantially from the
quality average. The second statistic is the "Median cov" which gives the median depth of
coverage among all sites in the
assembly and require downsampling. **Instructions or reference for downsampling?**
If you assembly. That is, 50% of sites will have
greater coverage
significantly higher than 100x and
wish to downsample your data we 50% will have
written a script (sub\_sample\_reads) for less than this
purpose. You will first need to calculate how many reads you want the script to sample. We recommend determining how many reads would be equivalent to 100x value. "10th percentile cov" indicates a coverage
(divide level below which only 10% of sites in the
genome size by assembly fall. For Illumina data, the
average read length ideal median coverage will lie between ~20X and
multiply by 100). You can download the script using the curl command. Create a new directory containing the reads you wish to downsample. In the terminal navigate the directory you just created 100X. Much less than 20X median coverage and
download the
script using quality of individual base calls may be compromised. Ideally, the
following syntax 10th percentile coverage will be higher than 10, for similar reasons.
curl https://raw.githubusercontent.com/gjospin/scripts/master/subsample_reads.pl > sub_sample_reads.pl
To downsample the data use the following command
/sub_sample_reads file1 file2 #_reads_to_keep output_file_name
for example
/Users/Madison/Desktop/sub_sample/sub_sample_reads.pl test_1.fq test_2.fq 250 my_reads.fastq A separate metric of the base call quality is also reported by A5-miseq as "bases >= Q40". Following assembly, A5-miseq realigns the reads to the assembled sequence and estimates the accuracy of the nucleotide called at each site in the assembly. These accuracies are provided as PHRED quality scores (cite PHRED here), which represent log-scaled probabilities of accuracy. For
further directions/documentation you can view the script on github
https://github.com/gjospin/scripts/blob/master/subsample_reads.pl example a PHRED score of 20 indicates a 99% chance of the correct base, while Q30 and Q40 indicate 99.9% and 99.99% probabilities of the correct base being called. A5-miseq reports the number of assembly bases called with at least Q40.
###Verification of 16S Sequence
Follow the steps described in Section ??, "Making a Phylogenetic Tree" for obtaining and performing a BLAST search of the full length 16s sequence.