Authorea

Jenna M. Lang edited Data Submission.md almost 10 years ago

Commit id: 0001f952656a8126a8296d40f083c407b6b7323c


#Data Submission

This section describes how to submit contigs and scaffolds (if applicable) as a Whole Genome Shotgun (WGS) submission to Genbank. We also recommend allowing NCBI to annotate the genome, since submitting RAST annotations to Genbank can be prohibitively complicated. Genomes submitted to Genbank are automatically shared with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). In addition, genomes from Genbank are automatically pulled into the Integrated Microbial Genomes (IMG) database hosted at the Joint Genome Institute (JGI), and are annotated there as well. This section also describes how to submit the raw reads. In this case we use the European Nucleotide Archive (ENA) for ease of use, but the reads will be automatically incorporated into the Sequence Read Archive (SRA) at NCBI as well.

Genbank submission requires a .sqn file containing the contigs and an .agp file describing the scaffolds (if applicable). A5 outputs a .fasta file of scaffolds, which can be converted to a .fsa and a .agp file through a command line script included in the A5 program package. The .fsa file, along with a .sbt template file (created on the NCBI website), can then be converted to a .sqn file via a script available through NCBI.

Before submitting your genome, you will need to create a BioProject at NCBI. To finish the submission you will also need to obtain the additional files described below.

##Create a BioProject at NCBI

First, navigate to http://www.ncbi.nlm.nih.gov. Create an account and/or log in (a Google or NIH login can be used). Then, create a BioProject by navigating to https://submit.ncbi.nlm.nih.gov/subs/bioproject/ and clicking on "New submission." Fill in the personal information for the submitter. Below, in italics, are the responses that we typically give for a genome sequencing project.

+ Project type
+ Project data type-_genome sequencing_

+ Biosample-_blank_
+ Publications-_blank_

Once the project is submitted, refresh the page and copy down the BioProject ID (it starts with "PRJNA").

##FASTA2AGP

To finish this submission you'll need the files described below.

In the terminal, navigate to the directory containing your scaffolds file.

Run the fasta2agp.pl script included with A5 on the scaffold file output by the A5 assembly, "my\_scaffolds.fasta". Syntax is:

    perl fasta2agp.pl my_scaffolds.fasta > my_scaffolds.agp

e.g.:

    perl /Users/Madison/Desktop/a5_miseq_macOS_20140113/bin/fasta2agp.pl /Users/Madison/Desktop/a5_miseq_macOS_20140113/example/phiX.a5.final.scaffolds.fasta > phiX.a5.scaffolds.agp

If this runs successfully, you should see both the .fsa and .agp files in your current directory.

Important Note: NCBI considers a gap of less than 10 nucleotides to be "missing information" in a contig, not a gap between contigs (whereas A5 has no minimum gap size). NCBI therefore requires that contigs separated by less than 10 nucleotides be merged. This script performs that merging, so the number of contigs in the .fsa file may be lower than in your input file. We recommend counting the contigs in the .fsa file. To count them in the terminal, use the syntax:

    grep -c ">" name_of_your_.fsa_file

Important Note: If, after running the fasta2agp.pl script and counting the contigs, you have the same number of contigs as starting scaffolds, then you should submit only the .sqn file to Genbank and state that scaffolding did not take place (otherwise NCBI will reject the .agp file).

##Create a .sbt template

Create a .sbt template file at NCBI.
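Returning to the contig-count check in the FASTA2AGP step: the 10-nucleotide gap rule can be illustrated with a minimal Python sketch. It is not the fasta2agp.pl implementation. It assumes that gaps in the scaffolds are written as runs of `N`, and the function names (`read_fasta`, `split_contigs`, `count_contigs`) and the example sequences are hypothetical, chosen only for illustration.

```python
import re

MIN_GAP = 10  # threshold from the note above: shorter N-runs stay inside a contig


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_contigs(seq, min_gap=MIN_GAP):
    """Split a scaffold at runs of at least min_gap N characters (assumed gap encoding)."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in gap.split(seq) if piece]


def count_contigs(lines, min_gap=MIN_GAP):
    """Return a dict mapping each scaffold name to its number of contigs."""
    return {name: len(split_contigs(seq, min_gap)) for name, seq in read_fasta(lines)}


# Hypothetical input: scaffold_1 has a 5-N run (kept inside a contig)
# and a 12-N run (treated as a gap between contigs).
example = [
    ">scaffold_1",
    "ACGT" + "N" * 5 + "ACGT" + "N" * 12 + "ACGT",
    ">scaffold_2",
    "ACGTACGT",
]
counts = count_contigs(example)
print(counts)                       # contigs per scaffold
print(sum(counts.values()), len(counts))  # total contigs vs. total scaffolds
```

If the total number of contigs equals the number of scaffolds, no gap of 10 or more nucleotides was found, which corresponds to the case in the note above where only the .sqn file is submitted.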