
Jenna M. Lang edited Genome Assembly and Annotation.md almost 10 years ago

Commit id: 97bfd610ce909fda1d70116f5d397f2ded93df9b


4. scaffolding
5. verification of scaffolds/contigs

There is a plethora of programs that can perform some or most of these steps. These programs include commercial and open-source options; some are very user friendly and some are extremely difficult to use or install. Common assemblers for bacterial genomes include SPADES (REF), MIRA (REF), SGA (REF), Velvet (REF), CLC (REF), and A5 (REF). Good sources for overviews of genome assemblers and the assembly process include the GAGE project (\cite{Salzberg_2012}), the GAGE-2 project (REF), and the Assemblathon Project (REF).

For this workflow we recommend the open-source A5 assembly pipeline, which automates all of the steps described above with a single command (REF). A5 is designed to work with raw, demultiplexed Illumina data, and a recent version has been optimized for longer reads from the MiSeq (Coil et al., submitted). Input reads can be paired or unpaired, and paired reads can be in separate files (forward reads in one file, reverse reads in another) or interleaved in a single file. These files should have the .fastq extension. See HERE for a description of the fastq format. You may need assistance from your sequencing center in locating and accessing these files.

You will need one of the following three (per genome): 1) a single .fastq file that contains your single reads (if paired-end sequencing was not requested), 2) a single .fastq file that contains both forward and reverse reads, or 3) two .fastq files, one with forward reads and one with the corresponding reverse reads.

###Download/Install A5

Download A5 from

Follow these instructions: After downloading and unzipping the program, change the name of the folder to a5\_pipeline and move it from your Downloads folder to your Applications folder. Then, create a new folder on your Desktop which will contain the files generated by the pipeline. By the way, there's nothing special about having this folder on the Desktop; it's just there to simplify our instructions. We will refer to this folder as "a5_output", but you should use a more informative name.

###Running A5

Open a Terminal window and navigate to the folder you just created, because A5 will write all of the output files to the same folder from which you run the program. In this example the newly created folder is on the Desktop and named a5_output, so the syntax for navigating to the folder in a Terminal window is

cd Desktop/a5_output/

Now that you are in the folder where you want your genome assembly to appear, you are ready to run the program. First, type (don't hit return yet!):

/Applications/a5\_pipeline/bin/a5\_pipeline.pl

Then, drag and drop the input file(s) into the same Terminal window (or type the path to them if you know it). Finally, type a name that will be used as part of all of your output files.
So, your command line should look like this:

/Applications/a5\_pipeline/bin/a5\_pipeline.pl SequenceFile1.fastq SequenceFile2.fastq MyGenome

Once the program finishes running you will have a complete assembly located in the a5\_output folder. Among the numerous files generated by A5, two of particular importance are "MyGenome.contigs.fasta" and "MyGenome.final.scaffolds.fasta", which contain the contigs and scaffolds, respectively. In addition, A5 generates a file containing information about the quality of the assembly, called "assembly_stats.csv".

You can view this file in the Terminal with:

less assembly_stats.csv

For more on interpreting these numbers proceed to "Assembly Validation".

###Assembly Validation

There are three components to assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (e.g., number of contigs and contig N50). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness", which is difficult to measure except in cases where a close reference is available. Here we use a program called Phylosift to assess the presence or absence of 37 highly conserved single-copy bacterial genes in the assembly as a rough proxy for completeness.

###Interpretation of A5 stats

The first two numbers shown are the number of contigs and scaffolds, respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig, but that is extremely unlikely with short-read data. At the other extreme, a bacterial assembly in 1000 contigs would be very fragmented. In our experience, bacterial assemblies using paired-end 300 bp (PE300) Illumina data assembled with A5 tend to range from 10-200 contigs. It is also worth noting that, unless you are studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% with this method (Coil et al., submitted).
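
To make the contig count and contig N50 mentioned above more concrete, here is a minimal Python sketch that computes both from a FASTA file such as "MyGenome.contigs.fasta". It is not part of A5, and the function names `contig_lengths` and `n50` are hypothetical. It assumes the common definition of N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The values A5 reports in its stats file may be computed somewhat differently.

```python
from typing import Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of each sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # stays None until the first ">" header is seen
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Sequence lines may be wrapped, so add up every line.
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

To use it on a real assembly, open the contigs file, pass the open file handle to `contig_lengths`, and then report `len(lengths)` as the number of contigs and `n50(lengths)` as the contig N50. A small number of contigs together with a large N50 indicates a less fragmented assembly.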