Authorea

David Coil edited Data Submission.md almost 10 years ago

Commit id: 858009f3746041c9537637fc95f7540c95c4508e


#Data Submission

This section describes how to submit contigs and scaffolds (if applicable) as a Whole Genome Shotgun (WGS) submission to GenBank. We also recommend allowing NCBI to annotate the genome themselves, since submitting RAST annotations to GenBank can be prohibitively complicated. The genomes are automatically shared with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). In addition, genomes from GenBank are automatically pulled into the Integrated Microbial Genomes (IMG) database hosted at the Joint Genome Institute (JGI), and are annotated there as well.

This section also describes how to submit the raw reads. In this case we use the European Nucleotide Archive (ENA) for ease of use, but the reads will be automatically incorporated into the Sequence Read Archive (SRA) at NCBI as well.

To submit a genome, you must first create a "BioProject" at NCBI. When that is complete, a separate process is required to submit the genome sequence.

File types used in data submission:

* AGP file (.agp). This is a file required by NCBI to describe scaffolding
* FASTA file (.fasta). This is the standard filetype for sequence data, produced in this case by A5
* FSA file (.fsa). Same as a FASTA file but with a different extension
* SQN file (.sqn). The filetype for sequence data required by NCBI
* SBT file (.sbt). This is a template filetype used by NCBI

**The section below will have to be reconciled with the way the A5 instructions are currently written, but I don't want to change them if you are just going to have to go behind me and change it again...**

##FASTA2AGP

First, create the AGP file. In the terminal, navigate to the directory containing your scaffolds file. Run the fasta2agp.pl script included with A5 on the scaffold file output by the A5 assembly ("my\_scaffolds.fasta"). Syntax is:

    perl /Users/Madison/Desktop/a5_miseq_macOS_20140113/bin/fasta2agp.pl /Users/Madison/Desktop/a5_miseq_macOS_20140113/example/phiX.a5.final.scaffolds.fasta > phiX.a5.scaffolds.agp

If this runs successfully, you should see both the FSA and AGP files in your current directory.

Important Note: NCBI considers a gap of fewer than 10 nucleotides to be "missing information" in a contig, not a gap between contigs (whereas A5 has no minimum gap size). NCBI therefore requires that contigs separated by fewer than 10 nucleotides be merged. This script performs that merging, meaning that the number of contigs in the FSA file may be lower than in your input file. We therefore recommend counting the contigs in the FSA file. To count them in the terminal, use the syntax:

    grep -c ">" name_of_your_.fsa_file

Important Note: If, after running the fasta2agp.pl script and counting the contigs, you have the same number of contigs as starting scaffolds, then you should submit only the SQN file to GenBank and specify ***(say where, to whom?)*** that scaffolding did not take place (otherwise NCBI will reject the AGP file).

Now, navigate to http://www.ncbi.nlm.nih.gov. Create an account and/or log in. Then, create a BioProject at NCBI by navigating to https://submit.ncbi.nlm.nih.gov/subs/bioproject/ and clicking on "New submission." Fill in the personal information for the submitter.
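The contig-count check described in the Important Notes above could be scripted roughly as follows. This is only a sketch and is not part of A5 or any NCBI tool: the function names `count_records`, `count_short_gaps` and `scaffolding_status` are hypothetical, the header count anchors `>` to the start of the line (slightly stricter than the `grep -c ">"` shown above), and the gap count assumes that gaps in the scaffolds file are written as runs of `N`.

```shell
# Count FASTA records (header lines starting with ">") in a file.
count_records() {
    grep -c '^>' "$1"
}

# Count gaps (runs of N) shorter than a minimum length (default 10),
# i.e. the gaps that would be merged rather than split into contigs.
# Sequence lines are joined per record so a gap spanning a line break
# is counted once.
count_short_gaps() {
    awk -v min="${2:-10}" '
        function tally(s,    n) {
            n = 0
            while (match(s, /[Nn]+/)) {
                if (RLENGTH < min) n++
                s = substr(s, RSTART + RLENGTH)
            }
            return n
        }
        /^>/ { total += tally(seq); seq = ""; next }
        { seq = seq $0 }
        END { total += tally(seq); print total + 0 }
    ' "$1"
}

# Compare the number of starting scaffolds with the number of contigs
# in the FSA file and report which submission case applies.
scaffolding_status() {
    n_scaffolds=$(count_records "$1")
    n_contigs=$(count_records "$2")
    if [ "$n_contigs" -eq "$n_scaffolds" ]; then
        echo "no scaffolding: submit only the SQN file"
    else
        echo "scaffolded: $n_scaffolds scaffolds, $n_contigs contigs"
    fi
}
```

For example, `scaffolding_status my_scaffolds.fasta my_contigs.fsa` (both file names are placeholders) prints which of the two cases applies, and `count_short_gaps my_scaffolds.fasta` gives a rough idea of how many gaps fall under the 10-nucleotide threshold.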

Once the project is submitted, refresh the page and copy down the BioProject ID (it starts with "PRJNA").

##Create an SBT template

Create an SBT template file at NCBI: http://www.ncbi.nlm.nih.gov/WebSub/template.cgi

The BioProject # is the BioProject ID starting with "PRJNA" that you received in the previous step. BioSample can be left blank.
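Since the BioProject ID is copied by hand, a quick sanity check before pasting it into the template form may help. The sketch below uses a hypothetical helper, `is_bioproject_id`. The text above states only that the ID starts with "PRJNA"; that the rest of the ID consists solely of digits is an assumption made here.

```shell
# Return success if the argument looks like an NCBI BioProject ID:
# the prefix "PRJNA" followed by one or more digits.
# (The digits-only rule is an assumption, not taken from the protocol.)
is_bioproject_id() {
    case "$1" in
        PRJNA*[!0-9]*) return 1 ;;  # a non-digit appears after the prefix
        PRJNA[0-9]*)   return 0 ;;  # prefix followed by digits only
        *)             return 1 ;;  # wrong prefix, or nothing after it
    esac
}
```

It could be used as `is_bioproject_id "$my_id" || echo "check the BioProject ID"`, where `my_id` is a placeholder variable holding the copied ID.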

    chmod 755 tbl2asn

Once you have changed the permissions, create a new directory and place tbl2asn along with the SBT and FSA files in it. Run the tbl2asn program using the following syntax. You will need to fill in the organism name, strain, location, collection date, and isolation source specific to your own project.

    path_to_program/tbl2asn -p path_to_files -t template_file_name -M n -Z discrep -j "[organism=X] [strain=X] [country=X: city, state abbreviation] [collection_date=X] [isolation-source=X] [gcode=11]"

Following the -p is the path to the directory containing the FSA file; following the -t is the path to and name of the SBT template file.

Sample syntax:

    Desktop/ncbi/tbl2asn -p ~/Desktop/ncbi -t ~/Desktop/ncbi/template-1.sbt -M n -Z discrep -j "[organism=Ruthia magnifica str. UCD-CM][strain=UCD-CM] [country=USA: Davis, CA][collection_date=2002][isolation-source=Calyptogena magnifica tissue][gcode=11]"

The program will output the necessary files into the directory you created earlier.
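Because the `-j` string is long and easy to mistype, it may be convenient to assemble it from separate values. The following is a sketch under stated assumptions: `build_source_qualifiers` and `run_tbl2asn` are hypothetical helper names, and the flags are exactly those used in the syntax above, with nothing added.

```shell
# Build the source-qualifier string passed to tbl2asn with -j.
# Arguments: organism, strain, country, collection date, isolation source.
# gcode=11 is fixed, as in the syntax shown above.
build_source_qualifiers() {
    printf '[organism=%s] [strain=%s] [country=%s] [collection_date=%s] [isolation-source=%s] [gcode=11]\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Run tbl2asn with the same flags as in the protocol.
# Arguments: path to tbl2asn, directory with the FSA file,
# path to the SBT template, qualifier string.
run_tbl2asn() {
    "$1" -p "$2" -t "$3" -M n -Z discrep -j "$4"
}
```

With the values from the sample syntax, one might write `q=$(build_source_qualifiers "Ruthia magnifica str. UCD-CM" "UCD-CM" "USA: Davis, CA" "2002" "Calyptogena magnifica tissue")` and then `run_tbl2asn ~/Desktop/ncbi/tbl2asn ~/Desktop/ncbi ~/Desktop/ncbi/template-1.sbt "$q"`.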