diff --git a/Data Submission.md b/Data Submission.md
index 2ae396d..833f0b6 100644
--- a/Data Submission.md
+++ b/Data Submission.md
...
#Data Submission
This section describes how to submit contigs and scaffolds (if applicable) as a Whole Genome Shotgun (WGS) submission to GenBank. We also recommend allowing NCBI to annotate the genome, since submitting RAST annotations to GenBank can be prohibitively complicated. Submitted genomes are automatically shared with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL). In addition, genomes from GenBank are automatically pulled into the Integrated Microbial Genomes (IMG) database hosted at the Joint Genome Institute (JGI), and are annotated there as well. This section also describes how to submit the raw reads. Here we use the European Nucleotide Archive (ENA) for ease of use, but the reads will also be automatically incorporated into the Sequence Read Archive (SRA) at NCBI.
To submit a genome, you must first create a BioProject at NCBI ***(what is a BioProject?)***. When that is complete, a separate process is required to submit the genome sequence. Before submitting your genome, you will need to have the following files available.
***Bulleted list of files and what they are. I would actually give names (e.g., say that you will refer to the AGP file instead of the .agp file) so that you can stop typing file extensions.***

File types used in data submission:
* AGP file (.agp). A file required by NCBI to describe scaffolding.
* FASTA file (.fasta). The standard file type for sequence data, produced in this case by A5.
* FSA file (.fsa). Same as a FASTA file but with a different extension.
* SQN file (.sqn). The file type for sequence data required by NCBI.
* SBT file (.sbt). A template file type used by NCBI.

The section below will have to be reconciled with the way the A5 instructions are currently written, but I don't want to change them if you are just going to have to go behind me and change it again...
##FASTA2AGP
First, create the AGP file.
In the terminal, navigate to the directory containing your scaffolds file.
Run the fasta2agp.pl script included with A5 on the scaffold file output by the A5 assembly, "my\_scaffolds.fasta".
Syntax is:
...
perl /Users/Madison/Desktop/a5_miseq_macOS_20140113/bin/fasta2agp.pl /Users/Madison/Desktop/a5_miseq_macOS_20140113/example/phiX.a5.final.scaffolds.fasta > phiX.a5.scaffolds.agp
If this runs successfully, you should see both the FSA and AGP files in your current directory.
Important Note: NCBI considers a gap of less than 10 nucleotides to be "missing information" in a contig, not a gap between contigs (whereas A5 has no minimum gap size). NCBI therefore requires that contigs separated by fewer than 10 nucleotides be merged. This script performs that merging, so the number of contigs in the FSA file may be smaller than in your input file. We therefore recommend counting the contigs in the FSA file:
To count them in the terminal, use the syntax
grep -c ">" name_of_your_.fsa_file
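The same count can be obtained in Python, which sidesteps shell quoting problems around `>`. This is a minimal sketch, not part of A5 or NCBI's tooling; the function name `count_fasta_records` is a placeholder chosen for illustration.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA/FSA file by counting header lines."""
    count = 0
    with open(path) as handle:
        for line in handle:
            # Each record starts with a header line beginning with ">".
            if line.startswith(">"):
                count += 1
    return count
```

Unlike `grep -c ">"`, which counts any line containing `>`, this only counts lines that begin with `>`. Calling `count_fasta_records` on both the original scaffolds file and the FSA file gives the two numbers to compare.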
Important Note: If, after running the fasta2agp.pl script and counting the contigs, you have the same number of contigs as starting scaffolds, then you should submit only the SQN file to GenBank and ***(say where, to whom?)*** specify that scaffolding did not take place (otherwise NCBI will reject the AGP file).
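To illustrate the 10-nucleotide rule described above, the sketch below splits a scaffold into contigs only at runs of 10 or more `N`s and reports whether any scaffold actually contains more than one contig. It illustrates the rule as stated here and is not the implementation used by fasta2agp.pl; the function names, and the assumption that gaps are written as runs of `N`, are mine.

```python
import re

MIN_GAP = 10  # per the note above: shorter N runs are not gaps between contigs


def split_scaffold(sequence, min_gap=MIN_GAP):
    """Split a scaffold into contigs at runs of at least `min_gap` Ns."""
    gap = re.compile("[Nn]{%d,}" % min_gap)
    # Drop empty pieces caused by leading or trailing gaps.
    return [piece for piece in gap.split(sequence) if piece]


def scaffolding_took_place(scaffolds, min_gap=MIN_GAP):
    """Return True if any scaffold splits into more than one contig."""
    n_contigs = sum(len(split_scaffold(s, min_gap)) for s in scaffolds)
    return n_contigs > len(scaffolds)
```

If `scaffolding_took_place` returns `False` for your scaffold sequences, the contig and scaffold counts match, which is the case where only the SQN file should be submitted.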
Now, navigate to http://www.ncbi.nlm.nih.gov. Create an account and/or log in. Then, create a BioProject at NCBI by navigating to https://submit.ncbi.nlm.nih.gov/subs/bioproject/ and clicking on "New submission." Fill in the personal information for the submitter.
...
Once the project is submitted, refresh the page and copy down the BioProject ID (it starts with "PRJNA").
##Create an SBT template
Create an SBT template file at NCBI:
http://www.ncbi.nlm.nih.gov/WebSub/template.cgi
The BioProject # is the BioProject ID starting with "PRJNA" that you received in the previous step. BioSample can be left blank.
...
chmod 755 tbl2asn
Once you have changed the permissions, create a new directory and place tbl2asn, along with the SBT and FSA files, into it.
Run the tbl2asn program using the following syntax. You will need to fill in the organism name, strain, location, collection date, and isolation source specific to your own project.
path_to_program/tbl2asn -p path_to_files -t template_file_name -M n -Z discrep -j "[organism=X] [strain=X] [country=X: city, state abbreviation] [collection_date=X] [isolation-source=X] [gcode=11]"
Following the -p is the path to the directory containing the FSA file; following the -t is the path to and name of the SBT template file.
Sample syntax:
Desktop/ncbi/tbl2asn -p ~/Desktop/ncbi -t ~/Desktop/ncbi/template-1.sbt -M n -Z discrep -j "[organism=Ruthia magnifica str. UCD-CM][strain=UCD-CM] [country=USA: Davis, CA][collection_date=2002][isolation-source=Calyptogena magnifica tissue][gcode=11]"
The program will output the necessary files into the directory you created earlier.
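Because the `-j` string is easy to mistype, the command can also be assembled programmatically. The sketch below only builds the argument list from the flags shown above (`-p`, `-t`, `-M n`, `-Z discrep`, `-j`); it does not run tbl2asn. The function name `build_tbl2asn_command` and its argument names are hypothetical, and the values are taken from the sample syntax.

```python
def build_tbl2asn_command(program, files_dir, template, source):
    """Assemble a tbl2asn argument list from a dict of source qualifiers."""
    # Render each qualifier as [key=value], separated by spaces.
    qualifiers = " ".join(
        "[%s=%s]" % (key, value) for key, value in source.items()
    )
    return [
        program,
        "-p", files_dir,  # directory containing the FSA file
        "-t", template,   # path to the SBT template file
        "-M", "n",
        "-Z", "discrep",
        "-j", qualifiers,
    ]


command = build_tbl2asn_command(
    "Desktop/ncbi/tbl2asn",
    "~/Desktop/ncbi",
    "~/Desktop/ncbi/template-1.sbt",
    {
        "organism": "Ruthia magnifica str. UCD-CM",
        "strain": "UCD-CM",
        "country": "USA: Davis, CA",
        "collection_date": "2002",
        "isolation-source": "Calyptogena magnifica tissue",
        "gcode": "11",
    },
)
print(" ".join(command))
```

To actually run the command, the list could be passed to `subprocess.run(command, check=True)`. In that case, expand `~` first with `os.path.expanduser`, since `~` is only expanded by the shell.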