Jennifer Shelton edited results.tex over 8 years ago

Commit id: 5f5ea7003fdb3d5386f7b8c4963b19fb5a810ecb

FASTA format tools were tested on the Vicugna\_pacos-2.0.1 whole genome shotgun sequence scaffolds because the 2.17 Gb \textit{Vicugna pacos} genome is large ($>$ 1 Gb) and has many scaffolds (276727) \cite{Lindblad_Toh_2011}. The large genome size and high number of individual sequences should approximate a typical large FASTA file. The FASTA file was downloaded from the NCBI FTP as NW\_005882702.1 \textit{Vicugna pacos} isolate Carlotta (AHFN-0088) Vicugna\_pacos-2.0.1 assembly scaffolds. An additional unwrapped sequence, which also lacked a terminating newline, was added to the end of the file. Each FASTA record in the file also had spaces within the text of its header. The additional simulated FASTA record:
\begin{verbatim}
>NW_000000000.0 Vicugna pacos isolate Carlotta (AHFN-0088) FAKE genomic scaffold, Vicugna_pacos-2.0.1 Scaffold-, whole genome shotgun sequence
ATACAACCATAAAGGTGCTATTCAGTCCATGGTTACAGGACATAACTACAACACACACCCACGTACACATGCGCATGCGCATGCACACACCCACGTACACGTACACGTACGCATACACACCCACGTACACGTACACGTACGCATACACACCCACGTACACGTACACGTACGCATACACACCCACGTACACGTACACGTACGCATACACACCCACGTACACGTACACGTACGCATACACACCCACGTACGCACACACGTACACGTGTAGGCACGCATTTAGCAAGTATTTAGCTTGCTTAAACAAACCCCCCCTACCCCCCACGAGCCCCACCTTATATACCAGACAGTCTTGCCAAACCCCAAAAACAAGACATAGCGCATAAGCTATAGAACCCGGACAAACCTTTGCCCACAAACCCAACTTCTTAAATAATCACATGGCCAAATCGTACCAATGTGTTACTCTAGTATATTAAAAATATACAGACAGCTATCTCCCTAGATCCGCCAAAATTTTTAAAACAGAATTCAACAACCTTTTTAATGGCACCCCCCCCCCCCATAAATGACC\end{verbatim}
This record is also available on \href{https://github.com/kstatebioinfo/Fasta-O-Matic-a-tool-to-sanity-check-and-if-needed-reformat-FASTA-files/blob/master/simulated_unwrapped.fa}{GitHub}.

\subsection{Reformatting tests}

No tool was found with all of Fasta-O-Matic's functions. Therefore, only sequence line wrapping was compared between Fasta-O-Matic and two other common reformatting tools, seqtk and seqret.
Fasta-O-Matic was run with the \verb|--qc_steps| flag set to one of \verb|wrap new_line header_whitespace| (all), \verb|wrap| (W), \verb|new_line| (NL), \verb|unique| (U), or \verb|header_whitespace| (HW). Seqtk was run with the arguments \verb|seq -l 60|. Seqret was run using only the \verb|-sequence| and \verb|-outseq| arguments. Code used in the tests or to produce the figures can be found on \href{https://github.com/kstatebioinfo/Fasta-O-Matic-a-tool-to-sanity-check-and-if-needed-reformat-FASTA-files/tree/master/figures}{GitHub}. Run time and maximum memory usage were recorded for each tool. Tests were run on a Xeon Phi server with 48x12-core Intel Xeon CPUs, 256 GB of RAM, Linux CentOS 7, and Python 2.7.
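
The three reformatting steps named above (wrap, new\_line, header\_whitespace) can be illustrated with a minimal Python sketch. This is not Fasta-O-Matic's actual implementation: the function name \verb|reformat_fasta|, the default width of 60, and the choice to collapse each run of header whitespace into a single underscore are assumptions made for illustration only.

```python
import re
from typing import Iterable, Iterator


def reformat_fasta(lines: Iterable[str], width: int = 60) -> Iterator[str]:
    """Yield reformatted FASTA lines, each terminated by a newline.

    Hypothetical helper (not part of Fasta-O-Matic):
      * wrap: sequence is re-wrapped to `width` characters per line
      * new_line: every emitted line, including the last, ends in a newline
      * header_whitespace: whitespace in headers is replaced by underscores
    """
    seq_parts = []  # sequence fragments of the record currently being read

    def flush() -> Iterator[str]:
        # Join the buffered fragments and emit them in fixed-width chunks.
        seq = "".join(seq_parts)
        seq_parts.clear()
        for start in range(0, len(seq), width):
            yield seq[start:start + width] + "\n"

    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record.
            yield from flush()
            # Assumption: runs of whitespace collapse to one underscore.
            yield re.sub(r"\s+", "_", line.strip()) + "\n"
        elif line.strip():
            seq_parts.append(line.strip())
    # Emit the final record even if the input lacked a trailing newline.
    yield from flush()


if __name__ == "__main__":
    # Tiny made-up record: spaces in the header, unwrapped, no final newline.
    record = [">seq 1 example\n", "ACGT" * 20]
    print("".join(reformat_fasta(record)), end="")
```

Because the sketch buffers one record at a time, its memory use grows with the longest sequence rather than with the whole file.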

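One way to record run time and peak memory for an external command is sketched below. This is only an assumption about how such measurements could be taken, not the procedure used for these tests; the helper name \verb|measure| is hypothetical. It relies on Python's standard \verb|resource| module, which is available on Unix-like systems only, and \verb|ru_maxrss| is reported in kilobytes on Linux.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run `cmd` and return (wall-clock seconds, peak child RSS).

    Hypothetical helper. RUSAGE_CHILDREN reports the maximum over all
    child processes waited for so far, so measure one command per
    Python process to keep the numbers attributable to that command.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; a real test might use
    # ["seqtk", "seq", "-l", "60", "input.fa"] (file name is a placeholder).
    elapsed, peak = measure([sys.executable, "-c", "print('hello')"])
    print("elapsed: %.3f s, peak RSS: %d" % (elapsed, peak))
```
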
All tools could reformat the improperly wrapped FASTA file. Fasta-O-Matic had the lowest maximum memory requirement (Figure 1, Table 1). This may be useful when working on a large genome on a local machine or a cluster head node where memory usage is restricted. However, Fasta-O-Matic took several minutes, whereas seqtk and seqret each took $<$ 13 s (Figure 2, Table 1).

Fully reformatted simulated FASTA record (a backslash marks a line break added only for display in the article; it is not part of the actual FASTA record):
\begin{verbatim}
>NW_000000000.0_Vicugna_pacos_isolate_Carlotta_(AHFN-0088)_FAKE_genomic_scaffold,_Vicugna_pacos-2.1_Scaffold-,_\
whole_genome_shotgun_sequence
ATACAACCATAAAGGTGCTATTCAGTCCATGGTTACAGGACATAACTACAACACACACCC
ACGTACACATGCGCATGCGCATGCACACACCCACGTACACGTACACGTACGCATACACAC
CCACGTACACGTACACGTACGCATACACACCCACGTACACGTACACGTACGCATACACAC