Authorea

Alberto Pepe edited Genome Assembly and Annotation.md about 9 years ago

Commit id: 38d44d02dbf206fe72377aa6fe8df84850a4dcd4

The first step removes poor-quality sequences, as well as adapter sequences left over from sequencing. Some assemblers follow this with error correction, in which reads are compared to each other to eliminate sequencing errors. Next is contig assembly, in which overlapping reads are assembled into long contiguous stretches of sequence. Scaffolding refers to the alignment and orientation of these contigs relative to each other (where possible). The last step is verification, in which reads are mapped back to the contigs/scaffolds to reduce misassemblies.

Many programs can perform some or most of these steps. They include commercial and open-source options; some are very user friendly, and some are extremely difficult to install or use. Common assemblers for bacterial genomes include SPAdes \cite{Bankevich_2012}, MIRA \cite{Chevreux_2004}, SGA \cite{Simpson_2010}, Velvet \cite{Zerbino_2008}, CLC (CLC Bio), and A5 \cite{Tritt2012}. Good sources for overviews of genome assemblers and the assembly process include the GAGE project \cite{Salzberg_2012}, the GAGE-B project \cite{Magoc_2013}, and the Assemblathon Project \cite{Earl_2011}.

In this workflow, we recommend the open-source A5 assembly pipeline, which automates all of the steps described above with a single command \cite{Tritt2012}. A5 is designed to work with raw, demultiplexed Illumina data, and a recent version (A5-miseq) has been optimized for longer reads from the MiSeq \cite{25338718}. Input files should have the .fastq extension; see http://en.wikipedia.org/wiki/FASTQ_format for a description of the FASTQ format. For each genome, you will need one of the following: 1) a single .fastq file that contains both forward and reverse reads, or 2) two .fastq files, one with forward reads and one with the corresponding reverse reads. These .fastq files can optionally be gzip compressed (as indicated by the .gz file name extension). 
You may need assistance from your sequencing center in locating and accessing these files. Download and install A5 from [http://sourceforge.net/projects/ngopt/](http://sourceforge.net/projects/ngopt/).
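
Before starting an assembly, it can help to confirm that the input files match the layout described above. The following Python sketch counts the records in a plain or gzip-compressed .fastq file and builds (without running) a command line for A5. It assumes the four-line-per-record FASTQ layout typical of Illumina output. The script name `a5_pipeline.pl` and the argument order (read files first, then an output prefix) are assumptions that should be checked against the usage message of your installed A5 version. The helper function names are hypothetical.

```python
import gzip
import itertools
from pathlib import Path


def open_fastq(path):
    """Open a FASTQ file as text, handling optional gzip compression."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fastq_records(path):
    """Count records, assuming exactly four lines per record."""
    count = 0
    with open_fastq(path) as handle:
        while True:
            record = [line.rstrip("\n") for line in itertools.islice(handle, 4)]
            if not any(record):
                break  # end of file (or only trailing blank lines)
            if len(record) < 4:
                raise ValueError(f"{path}: truncated record after {count} reads")
            header, seq, plus, qual = record
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"{path}: malformed record {count + 1}")
            if len(seq) != len(qual):
                raise ValueError(f"{path}: sequence/quality length mismatch "
                                 f"in record {count + 1}")
            count += 1
    return count


def build_a5_command(read_files, out_prefix, script="a5_pipeline.pl"):
    """Build an argument list for A5 (assumed order: reads, then prefix).

    read_files holds either one file with both forward and reverse reads,
    or two files (forward reads, then the corresponding reverse reads).
    """
    read_files = [str(f) for f in read_files]
    if len(read_files) not in (1, 2):
        raise ValueError("expected one or two .fastq files per genome")
    if len(read_files) == 2:
        forward, reverse = (count_fastq_records(f) for f in read_files)
        if forward != reverse:
            raise ValueError("forward and reverse files differ in read count")
    return [script, *read_files, str(out_prefix)]
```

For example, `build_a5_command(["sample_R1.fastq.gz", "sample_R2.fastq.gz"], "sample")` (hypothetical file names) returns a list that can be passed to `subprocess.run` once the argument order has been confirmed for your A5 installation.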