Authorea

David Coil edited Data Submission.md over 9 years ago

Commit id: 623ac2a19655e2f61cd26c8de72091e05f1eaf43


##Submitting contigs only

Use this section if submitting only contigs, presumably in FASTA format.

STUFF HERE

##Submitting scaffolds

Only use this section if you are submitting scaffolds; in most cases assembly with A5 will render this step unnecessary.

Before submitting your scaffolded genome, you will need to have available the 4-5 files listed below.

File types used in data submission:

* AGP file (.agp). This is a file required by NCBI to describe scaffolding (if applicable)
* FASTA file (.fasta). This is the standard file type for sequence data, produced in this case by A5-miseq
* FSA file (.fsa). Same as a FASTA file but with a different extension
* SQN file (.sqn). The file type for sequence data required by NCBI
* SBT file (.sbt). This is a template file type used by NCBI

##FASTA2AGP

First, create the .agp file.

In the terminal, navigate to the directory containing your scaffolds file.

Run the fasta2agp.pl script included with A5 on the scaffold file output by the A5 assembly, "my\_scaffolds.fasta". Syntax is:

`perl fasta2agp.pl my_scaffolds.fasta > my_scaffolds.agp`

e.g.:

`perl /Users/Madison/Desktop/a5_miseq_macOS_20140113/bin/fasta2agp.pl /Users/Madison/Desktop/a5_miseq_macOS_20140113/example/phiX.a5.final.scaffolds.fasta > phiX.a5.scaffolds.agp`

If this runs successfully then you should see both the FSA and AGP files in your current directory.

Important Note: NCBI considers a gap of fewer than 10 nucleotides to be "missing information" in a contig, not a gap between contigs (whereas A5 has no minimum gap size). Therefore NCBI requires that contigs separated by fewer than 10 nucleotides be merged. This script performs that merging, meaning that the number of contigs in the FSA file may be lower than in your input file.
Therefore we recommend counting the contigs in the FSA file. To count them in the terminal, use the syntax:

`grep -c ">" name_of_your_.fsa_file`

Important Note: If, after running the fasta2agp.pl script and counting the contigs, you have the same number of contigs as starting scaffolds, then you should submit only the SQN file to GenBank and specify that scaffolding did not take place (otherwise NCBI will reject the AGP file).

Now, navigate to http://www.ncbi.nlm.nih.gov. Create an account and/or log in. Then create a BioProject at NCBI by navigating to https://submit.ncbi.nlm.nih.gov/subs/bioproject/ and clicking on "New submission." Fill in the personal information for the submitter. Below, in italics, are the responses that we typically give for a genome sequencing project.

Once the project is submitted, refresh the page and copy down the BioProject ID (it starts with "PRJNA").

##Create an SBT template

Create an SBT template file at NCBI: http://www.ncbi.nlm.nih.gov/WebSub/template.cgi

The BioProject # is the BioProject ID starting with "PRJNA" which you received in the previous step. BioSample can be left blank.

When you click to create the template, it will automatically download to your computer as template.sbt. We recommend immediately renaming the file to match the appropriate project.

##Tbl2asn

Download the tbl2asn program from ftp://ftp.ncbi.nih.gov/toolbox/ncbi\_tools/converters/by\_program/tbl2asn/

If you are using Safari, a window will pop up asking for login information. Just choose guest and unzip the version of the program that is compatible with your operating system. Other browsers will take you to a page listing many tbl2asn programs; download the one compatible with your operating system.

After downloading the desired command-line program, uncompress the archive and rename the resulting file to tbl2asn.

Now change the permissions of the file (in the terminal), since transfer by FTP resets the permissions. Syntax is:

`chmod 755 tbl2asn`

Once you have changed the permissions, create a new directory and place tbl2asn along with the SBT file and FSA files into it.

Run the tbl2asn program using the following syntax. You will need to fill in the organism name, strain, location, collection date, and isolation source specific to your own project.

`path_to_program/tbl2asn -p path_to_files -t template_file_name -M n -Z discrep -j "[organism=X] [strain=X] [country=X: city, state abbreviation] [collection_date=X] [isolation-source=X] [gcode=11]"`

Following the -p is the path to the directory containing the FSA file; following the -t is the path to and name of the SBT template file.

Sample syntax:

`Desktop/ncbi/tbl2asn -p ~/Desktop/ncbi -t ~/Desktop/ncbi/template-1.sbt -M n -Z discrep -j "[organism=Ruthia magnifica str. UCD-CM][strain=UCD-CM] [country=USA: Davis, CA][collection_date=2002][isolation-source=Calyptogena magnifica tissue][gcode=11]"`

The program will output the necessary files into the directory you created earlier. Ensure no errors were generated by opening the errorsummary.val file and making sure it is blank, or by listing the directory contents (`ls -lh`) to ensure it has zero bytes.

##Create a Whole Genome Shotgun (WGS) Submission

Navigate to https://submit.ncbi.nlm.nih.gov/subs/wgs/
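Before filling in the WGS submission form, the two manual checks from the earlier steps (counting the records in the FSA file and confirming that errorsummary.val is empty) can be scripted. The following is only a minimal sketch: the function names `count_records` and `errors_clean` are hypothetical helpers introduced here for illustration, and the file names in the usage comments are placeholders to replace with your own.

```shell
#!/bin/sh
# Hypothetical helpers; they are not part of A5 or tbl2asn.

# Print the number of FASTA records (lines starting with ">") in a file.
count_records() {
    grep -c '^>' "$1"
}

# Succeed only if the given file exists and has zero bytes,
# i.e. tbl2asn wrote an empty error summary.
errors_clean() {
    [ -f "$1" ] && [ ! -s "$1" ]
}

# Example usage (placeholder file names):
#   scaffolds=$(count_records my_scaffolds.fasta)
#   contigs=$(count_records name_of_your_.fsa_file)
#   if [ "$scaffolds" -eq "$contigs" ]; then
#       echo "Same count: scaffolding did not take place"
#   fi
#   if errors_clean errorsummary.val; then
#       echo "errorsummary.val is empty"
#   fi
```

Anchoring the pattern with `^>` counts only header lines, which gives the same result as `grep -c ">"` on a well-formed FASTA file.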

`cat test.fa | awk '{print (NR%2==1) ? ">contigs_" ++i : $0}' > test2.fa`

##Submitting scaffolds

Only use this section if you are submitting scaffolds; in most cases assembly with A5 will render this step unnecessary. Many of the steps are the same as for submitting contigs; only the differences are shown here.

Before submitting your scaffolded genome, you will need to have available the 4-5 files listed below.

File types used in data submission:

* AGP file (.agp). This is a file required by NCBI to describe scaffolding (if applicable)
* FASTA file (.fasta). This is the standard file type for sequence data, produced in this case by A5-miseq
* FSA file (.fsa). Same as a FASTA file but with a different extension
* SQN file (.sqn). The file type for sequence data required by NCBI
* SBT file (.sbt). This is a template file type used by NCBI

**FASTA2AGP**

First, create the .agp file.

In the terminal, navigate to the directory containing your scaffolds file.

Run the fasta2agp.pl script included with A5 on the scaffold file output by the A5 assembly, "my\_scaffolds.fasta". Syntax is:

`perl fasta2agp.pl my_scaffolds.fasta > my_scaffolds.agp`

e.g.:

`perl /Users/Madison/Desktop/a5_miseq_macOS_20140113/bin/fasta2agp.pl /Users/Madison/Desktop/a5_miseq_macOS_20140113/example/phiX.a5.final.scaffolds.fasta > phiX.a5.scaffolds.agp`

If this runs successfully then you should see both the FSA and AGP files in your current directory.

Important Note: NCBI considers a gap of fewer than 10 nucleotides to be "missing information" in a contig, not a gap between contigs (whereas A5 has no minimum gap size). Therefore NCBI requires that contigs separated by fewer than 10 nucleotides be merged. This script performs that merging, meaning that the number of contigs in the FSA file may be lower than in your input file.
Therefore we recommend counting the contigs in the FSA file. To count them in the terminal, use the syntax:

`grep -c ">" name_of_your_.fsa_file`

Important Note: If, after running the fasta2agp.pl script and counting the contigs, you have the same number of contigs as starting scaffolds, then submit only the contigs as described above.

**Create an SBT template**

Create an SBT template file at NCBI: http://www.ncbi.nlm.nih.gov/WebSub/template.cgi

The BioProject # is the BioProject ID starting with "PRJNA" which you received in the previous step. BioSample can be left blank.

When you click to create the template, it will automatically download to your computer as template.sbt. We recommend immediately renaming the file to match the appropriate project.

**Tbl2asn**

Download the tbl2asn program from ftp://ftp.ncbi.nih.gov/toolbox/ncbi\_tools/converters/by\_program/tbl2asn/

If you are using Safari, a window will pop up asking for login information. Just choose guest and unzip the version of the program that is compatible with your operating system. Other browsers will take you to a page listing many tbl2asn programs; download the one compatible with your operating system.

After downloading the desired command-line program, uncompress the archive and rename the resulting file to tbl2asn.

Now change the permissions of the file (in the terminal), since transfer by FTP resets the permissions. Syntax is:

`chmod 755 tbl2asn`

Once you have changed the permissions, create a new directory and place tbl2asn along with the SBT file and FSA files into it.

Run the tbl2asn program using the following syntax. You will need to fill in the organism name, strain, location, collection date, and isolation source specific to your own project.
`path_to_program/tbl2asn -p path_to_files -t template_file_name -M n -Z discrep -j "[organism=X] [strain=X] [country=X: city, state abbreviation] [collection_date=X] [isolation-source=X] [gcode=11]"`

Following the -p is the path to the directory containing the FSA file; following the -t is the path to and name of the SBT template file.

Sample syntax:

`Desktop/ncbi/tbl2asn -p ~/Desktop/ncbi -t ~/Desktop/ncbi/template-1.sbt -M n -Z discrep -j "[organism=Ruthia magnifica str. UCD-CM][strain=UCD-CM] [country=USA: Davis, CA][collection_date=2002][isolation-source=Calyptogena magnifica tissue][gcode=11]"`

The program will output the necessary files into the directory you created earlier. Ensure no errors were generated by opening the errorsummary.val file and making sure it is blank, or by listing the directory contents (`ls -lh`) to ensure it has zero bytes.

**Submitting Raw Reads to ENA/SRA**

We recommend using Safari or Firefox for this step; in our hands, Chrome can have issues with the Java requirements for uploading files.