diff --git a/Genome Assembly and Annotation.md b/Genome Assembly and Annotation.md
index c7c7ac2..74fe103 100644
--- a/Genome Assembly and Annotation.md
+++ b/Genome Assembly and Annotation.md
...
4. scaffolding
5. verification of scaffolds/contigs
There is a plethora of programs that can perform some or most of these steps. These programs include commercial and open-source options; some are very user-friendly and some are extremely difficult to install or use. Common assemblers for bacterial genomes include SPAdes (\cite{Bankevich_2012}), MIRA (\cite{Chevreux_2004}), SGA (\cite{Simpson_2010}), Velvet (\cite{Zerbino_2008}), CLC (http://www.clcbio.com/files/whitepapers/whitepaper-denovo-assembly-4.pdf-**This was the best reference I could find on CLC but I'm not sure how to cite it/if I should use something else**), and A5 (\cite{Tritt_2012}). Good sources for overviews of genome assemblers and the assembly process include the GAGE project (\cite{Salzberg_2012}), the GAGE-2 project (REF), and the Assemblathon Project (\cite{Earl_2011}).
For this workflow we recommend the open-source A5 assembly pipeline, which automates all of the steps described above with a single command (\cite{Tritt_2012}). A5 is designed to work with raw, demultiplexed Illumina data, and a recent version has been optimized for longer reads from the MiSeq (Coil et al., submitted). Input reads can be paired or unpaired, and paired reads can be in separate files (forward reads in one file, reverse reads in another) or interleaved in a single file. These files should have the .fastq extension. See HERE for a description of the fastq format. You may need assistance from your sequencing center in locating and accessing these files. You will need one of the following three inputs per genome: 1) a single .fastq file that contains your single reads (if paired-end sequencing was not requested), 2) a single .fastq file that contains both forward and reverse reads, or 3) two .fastq files, one with forward reads and one with the corresponding reverse reads.
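If you are unsure which of the three input layouts you have, a quick check of the read names can help. The following is a minimal Python sketch, not part of A5: the function names are ours, and it assumes plain (uncompressed) four-line FASTQ records with no blank lines, where mates share a read name apart from an optional `/1` or `/2` suffix or a comment after a space. It holds all read names in memory, so it is better suited to a subsample than to a full run.

```python
from itertools import islice


def read_fastq_ids(lines):
    """Yield the read name from each four-line FASTQ record.

    Assumption: strict four-line records (header, sequence, '+', quality).
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if (len(record) < 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("malformed FASTQ record near: %r" % record[:1])
        # Keep only the name: drop '@', any comment after whitespace,
        # and a trailing /1 or /2 mate suffix.
        fields = record[0][1:].split()
        name = fields[0] if fields else ""
        if name.endswith("/1") or name.endswith("/2"):
            name = name[:-2]
        yield name


def pairing_layout(forward_lines, reverse_lines=None):
    """Return 'separate', 'interleaved', or 'single' for the given input."""
    fwd = list(read_fastq_ids(forward_lines))
    if reverse_lines is not None:
        rev = list(read_fastq_ids(reverse_lines))
        if fwd != rev:
            raise ValueError("forward and reverse files do not contain matching reads")
        return "separate"
    # Interleaved: an even number of records, each consecutive pair sharing a name.
    pairs = zip(fwd[0::2], fwd[1::2])
    if fwd and len(fwd) % 2 == 0 and all(a == b for a, b in pairs):
        return "interleaved"
    return "single"
```

To use it, open the file or files and pass the handles, for example `pairing_layout(open("reads.fastq"))` or `pairing_layout(open("reads_R1.fastq"), open("reads_R2.fastq"))`. These file names are placeholders for your own data.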
Download/Install A5
Download A5 from
...
For more on interpreting these numbers proceed to "Assembly Validation".
###Assembly Validation
There are three components to genome assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (discussed below). The second is verification that the organism sequenced is the organism of interest, done by checking the assembled 16S sequence with BLAST. The third is "completeness", which is difficult to measure except in cases where a close reference is available. Nevertheless, we can get an idea of how complete the genome is by looking for highly conserved "housekeeping" genes that are found in almost every bacterial genome. To do this, we use a program called PhyloSift (\cite{Darling_2014}) to assess the presence or absence of 37 housekeeping genes in the assembly.
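The arithmetic behind the completeness estimate is simple: the fraction of expected marker genes that were detected. The Python sketch below illustrates only that calculation. It does not parse PhyloSift output, and the marker identifiers in the usage example are placeholders rather than the real marker names.

```python
def marker_completeness(detected, expected):
    """Return (fraction of expected markers found, sorted list of missing markers).

    `detected` and `expected` are collections of marker identifiers.
    Identifiers in `detected` that are not in `expected` are ignored.
    """
    expected = set(expected)
    found = expected & set(detected)
    missing = sorted(expected - found)
    fraction = len(found) / len(expected) if expected else 0.0
    return fraction, missing
```

For example, with 37 placeholder identifiers `expected = ["m%02d" % i for i in range(1, 38)]` and all but the last one detected, `marker_completeness(expected[:36], expected)` returns a fraction of 36/37 and the missing list `["m37"]`.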
###Interpretation of A5 stats
The first two numbers shown are the number of contigs and scaffolds, respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig, but that is extremely unlikely to result from short-read data. At the other extreme, we would consider a bacterial assembly in 1000 contigs to be very fragmented. In our experience, acceptable bacterial assemblies of Illumina 300 bp paired-end (PE300) data, assembled with A5, tend to range from 10 to 200 contigs. It is also worth noting that, unless you are studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% with this method (Coil et al., submitted).
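If you want to check the contig count independently of the A5 stats, it can be read directly from the assembly FASTA file. The Python sketch below is our own illustration, not part of A5. The labels are a loose reading of the ranges given above (a single contig, up to about 200 contigs as acceptable, 1000 or more as very fragmented); the text gives no verdict for the range between 200 and 1000, so that cut-off and label are our assumptions.

```python
def contig_lengths(fasta_lines):
    """Return a list with the length of each sequence in a FASTA file."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            lengths.append(0)          # a header starts a new contig
        elif lengths:
            lengths[-1] += len(line)   # sequence lines may be wrapped
    return lengths


def describe_assembly(fasta_lines):
    """Summarize contig count and total length, with a rough label."""
    lengths = contig_lengths(fasta_lines)
    n = len(lengths)
    if n == 0:
        label = "no contigs found"
    elif n == 1:
        label = "single contig"
    elif n <= 200:
        label = "at or below the typical upper bound"
    elif n < 1000:
        label = "more fragmented than typical"   # assumed label for 201-999
    else:
        label = "very fragmented"
    return {"contigs": n, "total_bp": sum(lengths), "label": label}
```

Calling `describe_assembly(open("contigs.fasta"))` returns a small dictionary with the contig count, the total assembled length in base pairs, and the label. The file name is a placeholder for your own contigs file.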
...
##Annotation
###Options
There are a number of different pipelines available for annotation of bacterial genomes. These include Prokka (\cite{Seemann_2014}), IMG (\cite{Markowitz_2014}), RAST (\cite{Overbeek_2014}), PGAP (\cite{Angiuoli_2008}), and others.
+ Prokka
Command line based