Results

Data

FASTA format tools were tested on the Vicugna_pacos-2.0.1 whole genome shotgun sequence scaffolds because the 2.17 Gb Vicugna pacos genome is large (\(>\) 1 Gb) and has many scaffolds (276727) \cite{Lindblad_Toh_2011}. The large genome size and high number of individual sequences should approximate a typical large FASTA file. The FASTA file was downloaded from the NCBI FTP site as NW_005882702.1 Vicugna pacos isolate Carlotta (AHFN-0088) Vicugna_pacos-2.0.1 assembly scaffolds. An additional simulated record with an unwrapped sequence was appended to the end of the file. This record also lacked a terminating newline. In addition, the header of every FASTA record in the file contained spaces.

The additional simulated FASTA record is available on GitHub.
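The three formatting problems present in the test file (an unwrapped sequence, a missing final newline, and whitespace in headers) can be detected with a short check. The following is a minimal sketch, not Fasta-O-Matic's own code: the function name find_format_problems is hypothetical, the problem labels are borrowed from the --qc_steps names, and the 60-column width and the definition of "properly wrapped" (every sequence line except the last is exactly 60 characters) are assumptions.

```python
def find_format_problems(text, width=60):
    """Return the set of formatting problems found in FASTA text.

    Hypothetical helper; labels mirror the --qc_steps names, but the
    logic is an illustrative assumption, not Fasta-O-Matic's code.
    """
    problems = set()

    # A well-formed file ends with a newline character.
    if text and not text.endswith("\n"):
        problems.add("new_line")

    # Group sequence lines under their header lines.
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append((line, []))
        elif line and records:
            records[-1][1].append(line)

    for header, seq_lines in records:
        # Any whitespace inside the header counts as a problem.
        if any(ch.isspace() for ch in header):
            problems.add("header_whitespace")
        # Assumed rule: all lines but the last are exactly `width` long,
        # and the last line is no longer than `width`.
        body_bad = any(len(s) != width for s in seq_lines[:-1])
        tail_bad = bool(seq_lines) and len(seq_lines[-1]) > width
        if body_bad or tail_bad:
            problems.add("wrap")

    return problems
```

Applied to a record like the simulated one (spaces in the header, a single long sequence line, no trailing newline), this check would report all three problems, whereas a record already wrapped at 60 columns with an underscore-joined header would report none.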

Reformatting tests

No tool was found with all of Fasta-O-Matic’s functions. Therefore, only sequence line wrapping was compared between Fasta-O-Matic and two other common reformatting tools, seqtk and seqret. Fasta-O-Matic was run with the --qc_steps flag set to either wrap new_line header_whitespace (all), wrap (W), new_line (NL), unique (U), or header_whitespace (HW). Seqtk was run with the arguments seq -l 60. Seqret was run using only the -sequence and -outseq arguments. Code used in the tests or to produce the figures can be found on GitHub. Run time and maximum memory usage were recorded for each tool. Tests were run on a Xeon Phi server with 48x12-core Intel Xeon CPUs, 256GB of RAM, Linux CentOS 7 and Python 2.7.
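To make the compared operations concrete, the sketch below rewraps sequences at 60 columns (the width passed to seqtk), replaces header whitespace with underscores, and ends the output with a newline. It is an illustrative assumption of what such reformatting involves, not the implementation used by Fasta-O-Matic, seqtk, or seqret; the names read_records and reformat_fasta are hypothetical, and the whole file is held in memory for simplicity, which a tool aimed at large genomes would avoid.

```python
def read_records(text):
    """Yield (header, sequence) pairs from FASTA text (hypothetical helper)."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reformat_fasta(text, width=60):
    """Return FASTA text with wrapped sequences and whitespace-free headers."""
    lines = []
    for header, seq in read_records(text):
        # Join whitespace-separated header fields with underscores.
        lines.append("_".join(header.split()))
        # Slice the sequence into fixed-width lines.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    # Every line, including the last, is terminated by a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    messy = ">NW_0 fake scaffold\n" + "ACGT" * 20 + "\nACGT"
    print(reformat_fasta(messy), end="")
```

Reading the sequence into a single string before slicing keeps the example short; a streaming version would carry over only the unfilled remainder of each input line.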

Comparison between results

All tools could reformat the improperly wrapped FASTA file. Fasta-O-Matic had the lowest maximum memory requirements (Figure 1, Table 1). This may be useful when working on a large genome on a local machine or a cluster head node where memory usage is restricted. However, Fasta-O-Matic was slower: it took several minutes, whereas seqtk and seqret took \(<\) 13 s (Figure 2, Table 1).
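Run time and peak memory figures of this kind can be collected for any command-line tool. The sketch below shows one possible way on a Unix system and is not necessarily how the measurements reported here were made; the function name measure is hypothetical, and the example file name in the comment is a placeholder.

```python
import os
import sys
import time


def measure(cmd):
    """Run cmd (a list of strings) and return
    (wall_seconds, peak_rss, wait_status) for that child process.

    Hypothetical helper; Unix only, since it relies on os.wait4.
    """
    start = time.perf_counter()
    # Start the child without waiting; it inherits stdout and stderr.
    pid = os.spawnvp(os.P_NOWAIT, cmd[0], cmd)
    # wait4 blocks until this child exits and returns its resource usage.
    _, wait_status, usage = os.wait4(pid, 0)
    elapsed = time.perf_counter() - start
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    return elapsed, usage.ru_maxrss, wait_status


if __name__ == "__main__":
    # Placeholder command; a real run might use something like
    # ["seqtk", "seq", "-l", "60", "scaffolds.fa"] with stdout redirected.
    seconds, peak, status = measure([sys.executable, "-c", "pass"])
    print(seconds, peak, status)
```

Using os.wait4 ties the memory figure to the single child being timed, rather than to a high-water mark shared by every child the measuring script has run.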

Fully reformatted simulated FASTA record (a backslash marks a line break added only for display in the article; it is not part of the actual FASTA record):

>NW_000000000.0_Vicugna_pacos_isolate_Carlotta_(AHFN-0088)_FAKE_genomic_scaffold,_Vicugna_pacos-2.1_Scaffold-,_\
whole_genome_shotgun_sequence
ATACAACCATAAAGGTGCTATTCAGTCCATGGTTACAGGACATAACTACAACACACACCC
ACGTACACATGCGCATGCGCATGCACACACCCACGTACACGTACACGTACGCATACACAC
CCACGTACACGTACACGTACGCATACACACCCACGTACACGTACACGTACGCATACACAC
CCACGTACACGTACACGTACGCATACACACCCACGTACACGTACACGTACGCATACACAC
CCACGTACGCACACACGTACACGTGTAGGCACGCATTTAGCAAGTATTTAGCTTGCTTAA
ACAAACCCCCCCTACCCCCCACGAGCCCCACCTTATATACCAGACAGTCTTGCCAAACCC
CAAAAACAAGACATAGCGCATAAGCTATAGAACCCGGACAAACCTTTGCCCACAAACCCA
ACTTCTTAAATAATCACATGGCCAAATCGTACCAATGTGTTACTCTAGTATATTAAAAAT
ATACAGACAGCTATCTCCCTAGATCCGCCAAAATTTTTAAAACAGAATTCAACAACCTTT
TTAATGGCACCCCCCCCCCCCATAAATGACC