Authorea

Jenna M. Lang added Genome Assembly and Annotation.md about 10 years ago

Commit id: 4b4f1bb1803b7e49c905d3aadd581961a2bde626

deletions | additions

Genome Assembly and Annotation Assembly Genome assembly typically consists of data cleaning (quality filtering and adaptor removal), error correction, contig assembly, scaffolding, and verification of scaffolds/contigs. There are a large array of programs that can perform some, or most of these steps. These programs include commercial and open-source options, with some choice being very user friendly and some being extremely difficult to use/install. Good sources for overviews of assemblers and the assembly process include the GAGE project (REF), the GAGE-2 project (REF), and the Assemblathon Project (REF). Common assemblers for bacterial genomes include SPADES (REF), MIRA (REF), SGA (REF), Velvet (REF) CLC (REF), and A5 (REF). For this workflow we recommend use of the open source A5 assembly pipeline which automates all of the steps described above with a single command (REF). A5 is designed to work with raw, demultiplexed Illumina data and a recent version has been optimized for longer reads from the MiSeq (REF). Input reads can be paired or unpaired, and the files can be separate (forward reads in one file reverse reads in another) or interleaved. Download/Install A5 Download A5 from https://code.google.com/p/ngopt/downloads/list Follow the (expert) instructions located https://code.google.com/p/ngopt/wiki/A5PipelineREADME or Follow a video made by David Coil https://code.google.com/p/ngopt/ http://youtu.be/ePGUIj9Qbvc or Follow these instructions After downloading and unzipping the program, move it from your downloads folder to your desktop. Create a new folder which will contain the files generated by the pipeline. A5 is a command line based program, on a mac you will need to run it from the terminal see section II "Using the Terminal", for an introduction to the terminal. Running A5 Once you have opened the terminal navigate to the folder you just created because A5 will output the files your location when you call the program. In this example I created the folder on the desktop and named it a5_ouput so the syntax for navigating to the folder is $ cd Desktop/a5_output/ Once there the easiest way to run the program is to drag and drop the a5 pipeline into the terminal. Open the bin folder located in the downloaded folder. Drag the file labeled a5_pipeline.pl into the terminal __add arrow to picture___ then drag in the input file(s) (the paired end read files). Finally name the output files the final syntax will read $ a5_pipeline.pl read_1.fastq read_2.fastq mygenome /Users/Madison/Desktop/a5_miseq_macOS_20140113/bin/a5_pipeline.pl is the pipeline and its location /Users/Madison/Desktop/a5_miseq_macOS_20140113/example/phiX_p1.fastq is the first paired end read /Users/Madison/Desktop/a5_miseq_macOS_20140113/example/phiX_p2.fastq is the second paired end read example_sequence is the name of the output file Once the program finishes running you will have a complete assembly located in the folder you created under the name you specified. Among the numerous files generated by A5, the two of particular importance are the "example_sequence.contigs.fasta" and "example_sequence.final.scaffolds.fasta" which contain the contigs and scaffolds respectively. In addition, A5 generates a file containing information about the quality of the assembly called "???????" To view this file use the "less" command: $ less assembly_stats.csv For more on interpreting these numbers proceed to Section VII, "Verification of the Assembly". Verification of the Assembly There are three portions to the verification of a genome assembly. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (e.g. number of contigs and contig N50). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness" which is difficult to measure except in cases where a close reference is available. Here we use Phylosift to assess the presence or absence of 37 highly conserved single copy bacterial genes in the assembly as a rough proxy for completeness. Interpretation of A5 stats The first two numbers shown are the number of contigs and scaffolds respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig but that is extremely unlikely with short read data. At the other extreme a bacterial assembly in 1000 contigs would be very fragmented. In our experience bacterial assemblies using PE300bp Ilumina data assembled with A5 tend to range from 10-200 contigs. It is also worth nothing that unless studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly which is typically >99% with this method (REF). "Genome Size" and "Longest Scaffold" are simply represented as base-pairs. While genome size can vary within taxa, this can be a second sanity check for the assembly. When expecting a 5MB genome, finding only 2MB in the assembly would be problematic. "N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs is the most common measure of assembly quality… larger is better. The number of raw reads/raw nucleotides "Raw reads"/"Raw nt" and error-corrected reads/nucleotides "EC Reads"/"Raw nt" counts are useful for seeing what percentage of the data got discarded. A very large difference between these numbers (the "Pct" stats) would indicate either poor quality input data or significant adaptor contamination (with for example a very short library insert size). Finally "X_cov" shows the average coverage across the genome. For Illumina data we recommend that this number be between ~30X and 100X. Much less than 30X coverage and the quality of any given base in the assembly may come into question. Conversely, too much coverage can reduce the quality of the assembly and require downsampling. Verification of 16S Sequence Follow the steps described in Section IX, "Making a Phylogenetic Tree" for obtaining and BLASTing the full length 16s sequence. PhyloSift Navigate to http://phylosift.wordpress.com Download and unzip the latest version of phylosift In the terminal, navigate to the directory containing the unzipped phylosift Run $ ./phylosift search contig_file_name For example: $ ./phylosift search /Users/microBEnet/Desktop/Data-Genomes/Pantoea_Tatumella/tatumella/tatumella.final.scaffolds.fasta.contigs.fsa Note: The first time you run PhyloSift it has to download a marker gene database so it may take a few minutes. From the PhyloSift directory Move to the "PS_temp" directory Within this directoy, Phylosift has created a directory with the same name as the input file. Move to this new directory, then move to "blastDir". Open the marker_summary.txt file in the blastDir $ less marker_summary.txt The DNGNGWU0001-00040 markers represent 40 highly conserved bacterial genes, if one is missing it won't show up as a zero, it is necessary to manually verify the list. Most of the genes should only appear once. An occasional 2 is fine, but if all/a majority of the genes appear twice or even three times you have most likely sequenced multiple bacteria together. Additionally check to make sure there is no 18S RNA (at the top of the list) to ensure your sample has not been contaminated with a eukaryotes (e.g. yeast). Important Note: Markers 4, 8 and 38 are no longer included in the Phylosift analysis so do not be concerned if they are not listed.