Jenna M. Lang edited Genome Assembly and Annotation.md  almost 10 years ago

Commit id: 97bfd610ce909fda1d70116f5d397f2ded93df9b


4. scaffolding
5. verification of scaffolds/contigs

There is a plethora of programs that can perform some or most of these steps. These programs include commercial and open-source options; some are very user friendly and some are extremely difficult to use/install. Good sources for overviews of genome assemblers and the assembly process include the GAGE project (\cite{Salzberg_2012}), the GAGE-2 project (REF), and the Assemblathon Project (REF). Common assemblers for bacterial genomes include SPAdes (REF), MIRA (REF), SGA (REF), Velvet (REF), CLC (REF), and A5 (REF).

For this workflow we recommend the open-source A5 assembly pipeline, which automates all of the steps described above with a single command (REF). A5 is designed to work with raw, demultiplexed Illumina data, and a recent version has been optimized for longer reads from the MiSeq (Coil et al, submitted). Input reads can be paired or unpaired, and the files can be separate (forward reads in one file, reverse reads in another) or interleaved. These files should have the .fastq extension. See HERE for a description of the fastq format. You may need assistance from your sequencing center in locating and accessing these files. You will need one of the following three (per genome): 1) a single .fastq file that contains your single reads (if paired-end sequencing was not requested), 2) a single .fastq file that contains both forward and reverse reads, or 3) two .fastq files, one with forward reads and one with the corresponding reverse reads.

###Download/Install A5

Download A5 from  

Follow these instructions: After downloading and unzipping the program, change the name of the folder to a5\_pipeline and move it from your downloads folder to your Applications folder. Then, create a new folder on your Desktop which will contain the files generated by the pipeline. By the way, there's nothing special about having this folder on the Desktop; it's just there to simplify our instructions. We will refer to this folder as "a5_output", but you should use a more informative name.

###Running A5

Open a Terminal window and navigate to a5\_output. A5 will write all of the assembly files to the same folder from which you run the program. In this example the newly created folder is on the Desktop and named a5_output, so the syntax for navigating to the folder in a Terminal window is

cd Desktop/a5_output/

Now that you are in the folder where you want your genome assembly files to appear, you are ready to run the program. First, type (don't hit return yet!):

/Applications/a5\_pipeline/bin/a5\_pipeline.pl

Then, drag and drop the input file(s) into the same Terminal window (or type the path to them if you know it). Finally, type a name that will be used as part of all of your output files.
So, your command line should look like this:

/Applications/a5\_pipeline/bin/a5\_pipeline.pl SequenceFile1.fastq SequenceFile2.fastq MyGenome

Here, /Applications/a5\_pipeline/bin/a5\_pipeline.pl is the pipeline and its location, SequenceFile1.fastq and SequenceFile2.fastq are the first and second paired-end read files, and MyGenome is the name used for the output files.

Once the program finishes running you will have a complete assembly located in the a5\_output folder. Among the numerous files generated by A5, two of particular importance are "MyGenome.contigs.fasta" and "MyGenome.final.scaffolds.fasta", which contain the contigs and scaffolds, respectively. In addition, A5 generates a file containing information about the quality of the assembly called "assembly_stats.csv", which you can view in the Terminal with:

less assembly_stats.csv

For more on interpreting these numbers proceed to "Assembly Validation".

###Assembly Validation

There are three components to genome assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (e.g., number of contigs and contig N50). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness", which is difficult to measure except in cases where a close reference is available. Here we use a program called PhyloSift to assess the presence or absence of 37 highly conserved single-copy bacterial genes in the assembly as a rough proxy for completeness.

###Interpretation of A5 stats

The first two numbers shown are the number of contigs and scaffolds, respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig, but that is extremely unlikely with short-read data. At the other extreme, a bacterial assembly in 1000 contigs would be very fragmented. In our experience, bacterial assemblies using PE300bp Illumina data assembled with A5 tend to range from 10-200 contigs. It is also worth noting that, unless you are studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% with this method (Coil et al, submitted).
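
If you would like to double-check the contig count and contig N50 yourself, the following is a minimal Python sketch that computes them from a FASTA file such as "MyGenome.contigs.fasta". It is not part of A5, and the function names (`read_contig_lengths`, `n50`, `summarize_fasta`) and the toy sequences are our own illustrative assumptions. It uses the common definition of N50 (the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of the assembly size), so the numbers may differ slightly from those A5 reports if A5 calculates them differently.

```python
import io


def read_contig_lengths(handle):
    """Yield the length of each sequence in a FASTA-formatted text stream."""
    length = None  # None until the first header line has been seen
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            # Sequences may be wrapped over several lines; add them up.
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for contig_length in ordered:
        running += contig_length
        if running >= half:
            return contig_length
    return 0


def summarize_fasta(handle):
    """Return contig count, total assembled bases, and N50 for a FASTA stream."""
    lengths = list(read_contig_lengths(handle))
    return {"contigs": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}


# Toy example (made-up sequences, not real A5 output): contig lengths 6, 4, 3, 2, 1.
TOY_FASTA = ">contig_1\nACG\nTAC\n>contig_2\nACGT\n>contig_3\nACG\n>contig_4\nAC\n>contig_5\nA\n"
print(summarize_fasta(io.StringIO(TOY_FASTA)))
# prints {'contigs': 5, 'total_bp': 16, 'n50': 4}
```

To apply this to a real assembly, open the contigs file and pass the file handle to `summarize_fasta` in place of the toy `io.StringIO` object.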