David Coil edited Data Submission.md  almost 10 years ago

Commit id: 858009f3746041c9537637fc95f7540c95c4508e


#Data Submission

This section describes how to submit contigs and scaffolds (if applicable) as a Whole Genome Shotgun (WGS) submission to GenBank. We recommend letting NCBI annotate the genome, since submitting RAST annotations to GenBank can be prohibitively complicated. Genomes submitted to GenBank are automatically shared with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). In addition, genomes from GenBank are automatically pulled into the Integrated Microbial Genomes (IMG) database hosted at the Joint Genome Institute (JGI), and are annotated there as well. This section also describes how to submit the raw reads. In this case we use the European Nucleotide Archive (ENA) for ease of use, but the reads will be automatically incorporated into the Sequence Read Archive (SRA) at NCBI as well.

To submit a genome, you must first create a "BioProject" at NCBI ***(what is a BioProject?)***. When that is complete, a separate process is required to submit the genome sequence. Before submitting your genome, you will need to have available X number of files.

***Bulleted list of files and what they are.***

File types used in data submission:

* AGP file (.agp). This is a file required by NCBI to describe scaffolding
* FASTA file (.fasta). This is the standard filetype for sequence data, produced in this case by A5
* FSA file (.fsa). Same as a FASTA file but with a different extension
* SQN file (.sqn). The filetype for sequence data required by NCBI
* SBT file (.sbt). This is a template filetype used by NCBI

***I would actually give them names (e.g., say that you will refer to AGP file instead of .agp file) so that you can stop typing file extensions***

**The section below will have to be reconciled with the way the A5 instructions are currently written, but I don't want to change them if you are just going to have to go behind me and change it again...**

##FASTA2AGP

First, create the AGP file. In the terminal, navigate to the directory containing your scaffolds file. Run the fasta2agp.pl script included with A5 on the scaffold file output by the A5 assembly ("my\_scaffolds.fasta").

Syntax is:

    perl /Users/Madison/Desktop/a5_miseq_macOS_20140113/bin/fasta2agp.pl /Users/Madison/Desktop/a5_miseq_macOS_20140113/example/phiX.a5.final.scaffolds.fasta > phiX.a5.scaffolds.agp

If this runs successfully, you should see both the FSA and AGP files in your current directory.

Important Note: NCBI considers a gap of less than 10 nucleotides to be "missing information" in a contig, not a gap between contigs (whereas A5 has no minimum gap size). NCBI therefore requires that contigs separated by fewer than 10 nucleotides be merged. The script performs this merging, so the FSA file may contain fewer contigs than your input file. We therefore recommend counting the contigs in the FSA file. To count them in the terminal, use the syntax

    grep -c ">" name_of_your_.fsa_file

Important Note: If, after running the fasta2agp.pl script and counting the contigs, you have the same number of contigs as starting scaffolds, then you should submit only the SQN file to GenBank and specify ***(say where, to whom?)*** that scaffolding did not take place (otherwise NCBI will reject the AGP file).

Now, navigate to http://www.ncbi.nlm.nih.gov. Create an account and/or log in. Then, create a BioProject at NCBI by navigating to https://submit.ncbi.nlm.nih.gov/subs/bioproject/ and clicking on "New submission." Fill in the personal information for the submitter.
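The contig-count comparison from the FASTA2AGP step can also be scripted. The following is a minimal sketch, not part of A5: the function names `count_records` and `scaffolding_took_place` and the file names are hypothetical placeholders, and it assumes every sequence header starts with ">" at the beginning of a line.

```shell
#!/bin/sh
# Count FASTA records by counting header lines (lines beginning with ">").
count_records() {
    grep -c '^>' "$1"
}

# Succeeds (exit status 0) when the contig count in the FSA file ($2)
# differs from the scaffold count in the input FASTA file ($1).
scaffolding_took_place() {
    [ "$(count_records "$1")" -ne "$(count_records "$2")" ]
}

# Hypothetical file names; substitute your own.
scaffolds="my_scaffolds.fasta"
fsa="my_scaffolds.fsa"

# Only run the comparison when both files are present.
if [ -f "$scaffolds" ] && [ -f "$fsa" ]; then
    echo "scaffolds: $(count_records "$scaffolds"), contigs: $(count_records "$fsa")"
    if scaffolding_took_place "$scaffolds" "$fsa"; then
        echo "Counts differ: scaffolding took place, so the AGP file is needed."
    else
        echo "Counts match: no scaffolding took place, so submit the SQN file only."
    fi
fi
```

Anchoring the pattern with `^>` avoids counting a ">" that appears elsewhere in a header line.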

Once the project is submitted, refresh the page and copy down the BioProject ID (it starts with "PRJNA").

##Create an SBT template

Create an SBT template file at NCBI: http://www.ncbi.nlm.nih.gov/WebSub/template.cgi

The BioProject # is the BioProject ID starting with "PRJNA" that you received in the previous step. BioSample can be left blank.

    chmod 755 tbl2asn

Once you have changed the permissions, create a new directory and place tbl2asn, the SBT file, and the FSA file in it. Run the tbl2asn program using the following syntax. You will need to fill in the organism name, strain, location, collection date, and isolation source specific to your own project.

    path_to_program/tbl2asn -p path_to_files -t template_file_name -M n -Z discrep -j "[organism=X] [strain=X] [country=X: city, state abbreviation] [collection_date=X] [isolation-source=X] [gcode=11]"

Following the -p is the path to the directory containing the FSA file; following the -t is the path to and name of the SBT template file.

Sample syntax:

    Desktop/ncbi/tbl2asn -p ~/Desktop/ncbi -t ~/Desktop/ncbi/template-1.sbt -M n -Z discrep -j "[organism=Ruthia magnifica str. UCD-CM][strain=UCD-CM] [country=USA: Davis, CA][collection_date=2002][isolation-source=Calyptogena magnifica tissue][gcode=11]"

The program will output the necessary files into the directory you created earlier.
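Because the `-j` argument is long and easy to mistype, it may help to assemble it from its parts. This is a minimal sketch using only the flags and qualifiers shown above; the helper name `build_source_qualifiers` is hypothetical, and the script prints the command for review rather than running tbl2asn.

```shell
#!/bin/sh
# Build the source-qualifier string passed to tbl2asn with -j.
# Arguments: organism, strain, country, collection date, isolation source.
# gcode=11 is fixed here, as in the syntax above.
build_source_qualifiers() {
    printf '[organism=%s] [strain=%s] [country=%s] [collection_date=%s] [isolation-source=%s] [gcode=11]' \
        "$1" "$2" "$3" "$4" "$5"
}

# Values taken from the sample syntax; replace them with your own.
qualifiers=$(build_source_qualifiers \
    "Ruthia magnifica str. UCD-CM" \
    "UCD-CM" \
    "USA: Davis, CA" \
    "2002" \
    "Calyptogena magnifica tissue")

# Print the full command so it can be checked before it is run by hand.
echo "tbl2asn -p ~/Desktop/ncbi -t ~/Desktop/ncbi/template-1.sbt -M n -Z discrep -j \"$qualifiers\""
```

Once the printed command looks right, copy it into the terminal and prefix `tbl2asn` with the path to the program.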