diff --git a/Genome Assembly and Annotation.md b/Genome Assembly and Annotation.md
index 85196f0..5fb980c 100644
--- a/Genome Assembly and Annotation.md
+++ b/Genome Assembly and Annotation.md
...
4. scaffolding
5. verification of scaffolds/contigs
There is a plethora of programs that can perform some or most of these steps. These programs include commercial and open-source options; some are very user friendly and some are extremely difficult to use/install.
Good sources for overviews of genome assemblers and the assembly process include the GAGE project (\cite{Salzberg_2012}), the GAGE-2 project (REF), and the Assemblathon Project (REF).
Common assemblers for bacterial genomes include SPAdes (REF), MIRA (REF), SGA (REF), Velvet (REF), CLC (REF), and A5 (REF).
For this workflow we recommend use of the open source A5 assembly pipeline which automates all of the steps described above with a single command (REF). A5 is designed to work with raw, demultiplexed Illumina data and a recent version has been optimized for longer reads from the MiSeq (Coil et al submitted). Input reads can be paired or unpaired, and the files can be separate (forward reads in one file, reverse reads in another) or interleaved.
These files should have the .fastq extension. See HERE for a description of the fastq format. You may need assistance from your sequencing center in locating and accessing these files. You will need one of the three following (per genome): 1) a single .fastq file that contains your single reads (if paired-end sequencing was not requested), 2) a single .fastq file that contains both forward and reverse reads, or 3) two .fastq files, one with forward reads and one with the corresponding reverse reads.
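Before starting an assembly, it can be worth confirming that each input file is well-formed and, for option 3, that the forward and reverse files contain the same number of reads. The following is a minimal Python sketch, assuming uncompressed FASTQ files with the usual four lines per read; `count_fastq_reads` and `paired_files_match` are hypothetical helper names and are not part of A5.

```python
def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file with four lines per read."""
    count = 0
    with open(path) as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # skip stray blank lines between records
            sequence = handle.readline().rstrip("\n")
            plus = handle.readline()
            quality = handle.readline().rstrip("\n")
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError(f"{path}: malformed record {count + 1}")
            if len(sequence) != len(quality):
                raise ValueError(
                    f"{path}: sequence and quality lengths differ in record {count + 1}"
                )
            count += 1
    return count


def paired_files_match(forward_path, reverse_path):
    """Return True if the forward and reverse files hold the same number of reads."""
    return count_fastq_reads(forward_path) == count_fastq_reads(reverse_path)
```

For example, `paired_files_match("SequenceFile1.fastq", "SequenceFile2.fastq")` should return `True` for a matched pair of files. A mismatch usually means one of the files was truncated or filtered separately from its partner.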
Download/Install A5
Download A5 from
...
Follow these instructions:
After downloading and unzipping the program, change the name of the folder to a5\_pipeline and move it from your downloads folder to your Applications folder. Then, create a new folder on your Desktop which will contain the files generated by the pipeline. By the way, there's nothing special about having your folder on the Desktop; it's just there to simplify our instructions. We will refer to this folder as "a5_output", but you should use a more informative name.
###Running A5
Open a Terminal window and navigate to a5\_output. A5 will write all of the output files to the same folder from which you run the program. In this example the newly created folder is on the Desktop and named a5_output, so the syntax for navigating to the folder in a Terminal window is
cd Desktop/a5_output/
Now that you are in the folder where you want your genome assembly to appear, you are ready to run the program. First, type (don't hit return yet!):
/Applications/a5\_pipeline/bin/a5\_pipeline.pl
Then, drag and drop the input file(s) into the same Terminal window (or type the path to them if you know it). Finally, type a name that will be used as part of all of your output files. Your command line should look like this:
/Applications/a5\_pipeline/bin/a5\_pipeline.pl SequenceFile1.fastq SequenceFile2.fastq MyGenome
Once the program finishes running you will have a complete assembly located in the a5\_output folder.
Among the numerous files generated by A5, the two of particular importance are "MyGenome.contigs.fasta" and "MyGenome.final.scaffolds.fasta", which contain the contigs and scaffolds, respectively.
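If you need to assemble many genomes, the same command can be scripted rather than typed. The sketch below is one possible way to do this in Python, assuming the pipeline takes its arguments in the order used in this workflow (read file(s), then the output name) and names its contig and scaffold files as described here. The helper names `build_a5_command`, `expected_outputs`, and `run_a5` are hypothetical.

```python
import os
import subprocess


def build_a5_command(pipeline, reads, name):
    """Return the argument list: pipeline script, one or two read files, output name."""
    if len(reads) not in (1, 2):
        raise ValueError("expected one read file, or a forward and a reverse file")
    return [str(pipeline), *[str(r) for r in reads], name]


def expected_outputs(name):
    """File names of the contigs and scaffolds, following the naming used in this workflow."""
    return [f"{name}.contigs.fasta", f"{name}.final.scaffolds.fasta"]


def run_a5(pipeline, reads, name, output_dir):
    """Run the pipeline from output_dir, since output is written to the working folder."""
    # Resolve read paths first: relative paths would otherwise be
    # interpreted relative to output_dir once cwd is changed.
    absolute_reads = [os.path.abspath(r) for r in reads]
    command = build_a5_command(pipeline, absolute_reads, name)
    subprocess.run(command, cwd=output_dir, check=True)
```

A call such as `run_a5("/Applications/a5_pipeline/bin/a5_pipeline.pl", ["SequenceFile1.fastq", "SequenceFile2.fastq"], "MyGenome", "a5_output")` would then be equivalent to navigating to the output folder and typing the command by hand.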
In addition, A5 generates a file containing information about the quality of the assembly called "assembly_stats.csv"
...
less assembly_stats.csv
For more on interpreting these numbers proceed to "Assembly Validation".
###Assembly Validation
There are three components to genome assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (e.g., number of contigs and contig N50). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness", which is difficult to measure except in cases where a close reference is available. Here we use a program called PhyloSift to assess the presence or absence of 37 highly conserved single-copy bacterial genes in the assembly as a rough proxy for completeness.
###Interpretation of A5 stats
The first two numbers shown are the number of contigs and scaffolds, respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig, but that is extremely unlikely with short-read data. At the other extreme, a bacterial assembly in 1000 contigs would be very fragmented. In our experience, bacterial assemblies using PE300bp Illumina data assembled with A5 tend to range from 10 to 200 contigs. It is also worth noting that, unless you are studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% with this method (Coil et al, submitted).
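The contig count and N50 can also be recomputed directly from the contigs file as a sanity check. The sketch below is a minimal Python example that uses the standard definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases). It assumes a plain, uncompressed FASTA file; the function names are hypothetical, and we have not confirmed that A5 computes its reported values in exactly this way.

```python
def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file, in file order."""
    lengths = []
    current = None  # length of the sequence being read, None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

For example, `lengths = read_fasta_lengths("MyGenome.contigs.fasta")` followed by `len(lengths)` and `n50(lengths)` gives the number of contigs and the contig N50 for the assembly produced above.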