
Jennifer Shelton edited introduction.tex over 8 years ago

Commit id: 5dcf575411a3dfe7216dee373485d992c0c5e4cd


Input:
\verb|>my header|
\verb|AAAAAAAAAAAATTTTTTCCCCGGCGCGCGCGCTATAGCGCTATANNNNNNNNNNNNNNN|
\verb|ATATATATATAT|
\verb|ATTATTATATATATATTCTCTCTGGGCTCGCGTCTCGCTATTTATATATATATATATATTGCGCTCTCGTCTCCT|

Output:
\verb|>my header|
\verb|AAAAAAAAAAAATTTTTTCCCCGGCGCGCGCGCTATAGCGCTATANNNNNNNNNNNNNNN|
\verb|ATATATATATATATTATTATATATATATTCTCTCTGGGCTCGCGTCTCGCTATTTATATA|
\verb|TATATATATATTGCGCTCTCGTCTCCT|

However, Seqret did not log the format errors it detected. In many cases, a missing newline character at the end of a file or variable line wrapping does not indicate corrupted data. However, either could be the result of a corrupted file, and the analyst should be made aware of such errors so that the issue can be briefly investigated. Seqret also creates an output file even when the output is identical to the input. Storing two identical files is an inefficient use of disk space. Rather, a tool should test for proper file format and export a reformatted file only if the input is found to have a formatting issue.

Another example of a tool that can automate FASTA reformatting is Seqtk. The free tool Seqtk can rewrap an improperly wrapped FASTA file to a user-specified length, but it was not developed to first test whether wrapping is needed or whether the input is already wrapped correctly. This tool therefore has limited functionality and would double the disk space required to store FASTA files in a workflow. The extra storage is trivial unless the FASTA files in question store whole genomes, in which case the burden can add up for a bioinformatics core.

Overall, while many tools can either detect a format issue or repair a format issue, no existing tool was found that both validates FASTA format and reformats automatically, only where required, for a user-defined list of FASTA format issues.
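The validate-then-repair behavior argued for above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of Seqret, Seqtk, or any tool described here. The function names (find_wrap_issues, rewrap, validate_and_repair) and the default wrap width of 60 are assumptions made for the example, and only two issues are checked: a missing final newline and variable line wrapping.

```python
from typing import List, Optional, Tuple


def find_wrap_issues(text: str, width: int = 60) -> List[str]:
    """Return readable format issues; an empty list means none were found."""
    issues: List[str] = []
    if text and not text.endswith("\n"):
        issues.append("missing newline at end of file")

    header: Optional[str] = None
    seq_lines: List[str] = []

    def check_record() -> None:
        # Lines seen before the first header are ignored in this sketch.
        if header is None:
            return
        body, last = seq_lines[:-1], seq_lines[-1:]
        # Every line but the last must be exactly `width` long;
        # the last line must be non-empty and at most `width` long.
        if any(len(s) != width for s in body) or any(
            not 0 < len(s) <= width for s in last
        ):
            issues.append(f"variable line wrapping in record {header!r}")

    for line in text.splitlines():
        if line.startswith(">"):
            check_record()
            header, seq_lines = line, []
        else:
            seq_lines.append(line)
    check_record()
    return issues


def rewrap(text: str, width: int = 60) -> str:
    """Rewrap every sequence to `width` characters and end with a newline."""
    out: List[str] = []
    seq = ""

    def flush() -> None:
        nonlocal seq
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
        seq = ""

    for line in text.splitlines():
        if line.startswith(">"):
            flush()
            out.append(line)
        else:
            seq += line.strip()
    flush()
    return "\n".join(out) + "\n"


def validate_and_repair(
    text: str, width: int = 60
) -> Tuple[List[str], Optional[str]]:
    """Log issues; return repaired text only when an issue was found."""
    issues = find_wrap_issues(text, width)
    return issues, (rewrap(text, width) if issues else None)
```

In this sketch a caller would write a new file only when the second return value is not None, so a correctly formatted input is never duplicated on disk, while the returned list of issues gives the analyst a log to inspect.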