Authorea

David Coil edited Genome Assembly and Annotation.md almost 10 years ago

Commit id: 2329e2c4008208e519111d9df7b912edd90c16fb


Once there, the easiest way to run the program is to drag and drop the A5 pipeline into the terminal. Open the bin folder located in the downloaded folder. Drag the file labeled a5\_pipeline.pl into the terminal __add arrow to picture___ then drag in the input file(s) (the paired end read files, interleaved or not). Finally, name the output files. The final syntax will read:

a5_pipeline.pl read_1.fastq read_2.fastq mygenome

/Users/Madison/Desktop/a5\_miseq\_macOS\_20140113/bin/a5\_pipeline.pl is the pipeline and its location

/Users/Madison/Desktop/a5\_miseq\_macOS\_20140113/example/phiX\_p1.fastq is the first paired end read

/Users/Madison/Desktop/a5\_miseq\_macOS\_20140113/example/phiX\_p2.fastq is the second paired end read

example\_sequence is the name of the output file

Once the program finishes running you will have a complete assembly located in the folder you created under the name you specified. Among the numerous files generated by A5, the two of particular importance are "example\_sequence.contigs.fasta" and "example\_sequence.final.scaffolds.fasta", which contain the contigs and scaffolds respectively. In addition, A5 generates a file containing information about the quality of the assembly called "assembly_stats.csv". To view this file use the "less" command:

less assembly_stats.csv

For more on interpreting these numbers proceed to Section VII, "Verification of the Assembly".

###Verification of the Assembly

There are three portions to the verification of a genome assembly. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (e.g. number of contigs and contig N50). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness", which is difficult to measure except in cases where a close reference is available. Here we use a program called PhyloSift to assess the presence or absence of 37 highly conserved single copy bacterial genes in the assembly as a rough proxy for completeness.

###Interpretation of A5 stats

The first two numbers shown are the number of contigs and scaffolds respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig but that is extremely unlikely with short read data.
At the other extreme, a bacterial assembly in 1000 contigs would be very fragmented. In our experience bacterial assemblies using PE300bp Illumina data assembled with A5 tend to range from 10-200 contigs. It is also worth noting that, unless studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% with this method (Coil et al, submitted).

"Genome Size" and "Longest Scaffold" are simply represented as base-pairs. While genome size can vary within taxa, this can be a second sanity check for the assembly. When expecting a 5 Mb genome, finding only 2 Mb in the assembly would be problematic.

"N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs, is the most common measure of assembly quality; a larger N50 is better.
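To make the N50 definition concrete, here is a minimal Python sketch that computes the number of contigs, total size, longest contig and N50 from a list of contig lengths. It is not part of A5, and the function names `read_fasta_lengths` and `n50` are illustrative assumptions; A5 reports these values itself, so this is only a way to see how the numbers relate to each other.

```python
def read_fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the sequence being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new sequence; store the previous one if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Contig size at which contigs of that size or larger hold >= 50% of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Toy example (made-up numbers): five contigs totalling 100 bp.
toy_lengths = [40, 25, 15, 12, 8]
print(len(toy_lengths), sum(toy_lengths), max(toy_lengths), n50(toy_lengths))
# prints: 5 100 40 25
```

In the toy example the largest contig alone holds 40% of the assembly, and adding the second brings the total to 65%, so the N50 is 25. To apply the same functions to real output, pass an open file handle for the contigs file (for example "example_sequence.contigs.fasta") to `read_fasta_lengths`.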

###Verification of 16S Sequence

Follow the steps described in Section ??, "Making a Phylogenetic Tree" for obtaining and BLASTing the full-length 16S sequence.

PhyloSift

Navigate to

Each of these pipelines has advantages and disadvantages, and each will give slightly different results. Here we recommend RAST since it is web-based, easy to use, returns results within hours and provides a framework for analyzing the results. However, RAST annotations are very difficult to submit to NCBI, so we recommend allowing NCBI to annotate the genome with PGAP upon submission.

###RAST Annotation

Annotation of the genome using RAST is also an easy way to locate the full-length 16S gene, which is required for the Section ??, "Building A Phylogenetic Tree" portion of the workflow.

Navigate to http://rast.nmpdr.org/

Click "Finish the Upload"

The annotation will take some time, ranging from 2 hours to a few days, depending on server load. RAST will email you when it is complete. Once the annotation is complete, use the SEED Viewer to explore the annotation and metabolic pathways of the organism. In section ?? we describe how to use the SEED Viewer to get the full-length 16S rRNA sequence.