
Jennifer Shelton edited introduction.tex over 8 years ago

Commit id: 4af6669c59233b1eabdbc4a714750f24dbc2904c


\section{Introduction} Sequence data can be stored as text, with each letter representing a nucleotide (DNA and RNA) or an amino acid (protein). The linear nature of these molecules makes it natural to represent them as strings, that is, finite sequences of characters. It has been argued that a graph, a network of vertices connected by edges, is a more accurate way to store genomic sequences because graphs allow the inclusion of alternate alleles and alternate possible assemblies \cite{jaffe2012fastg}. Nevertheless, all of the most common formats for storing sequences (FASTA, FASTQ, SAM/BAM) use linear strings.

Code:
\begin{verbatim}
seqret -auto -stdout -sequence emboss_seqret-I20150716-200022-0179-11804058-oy.sequence -snucleotide1 -sformat1 pearson -osformat2 fasta -feature -ofname2 emboss_seqret-I20150716-200022-0179-11804058-oy.gff
\end{verbatim}

Input:
\begin{verbatim}
>my header
AAAAAAAAAAAATTTTTTCCCCGGCGCGCGCGCTATAGCGCTATANNNNNNNNNNNNNNN
ATATATATATAT
ATTATTATATATATATTCTCTCTGGGCTCGCGTCTCGCTATTTATATATATATATATATTGCGCTCTCGTCTCCT
\end{verbatim}

Output:
\begin{verbatim}
>my header
AAAAAAAAAAAATTTTTTCCCCGGCGCGCGCGCTATAGCGCTATANNNNNNNNNNNNNNN
ATATATATATATATTATTATATATATATTCTCTCTGGGCTCGCGTCTCGCTATTTATATA
TATATATATATTGCGCTCTCGTCTCCT
\end{verbatim}

However, Seqret did not log the formatting errors it detected. In many cases a missing newline character at the end of a file or variable line wrapping does not indicate corrupted data. Either could nevertheless be the result of a corrupted file, so the analyst should be made aware of such errors and can then briefly investigate the issue. Another feature of Seqret is that an output file is created even if the output is identical to the input. Storing two identical files is an inefficient use of disk space. Rather, a tool should test for proper file format and export a reformatted file only if the input is found to have a formatting issue.

Another example of a tool that can automate FASTA reformatting is Seqtk. The free tool Seqtk can rewrap an improperly wrapped FASTA file to a user-specified line length, but it was not developed to first test whether rewrapping is needed. This tool therefore both has limited functionality and would double the disk space required to store FASTA files in a workflow. The extra storage is trivial unless the FASTA files in question hold whole genomes, in which case the burden can add up for a bioinformatics core. Overall, while many tools can either detect a format issue or repair one, no existing tool was found that both validates FASTA format and reformats automatically only where required, for a user-defined list of FASTA format issues.
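The validate-first behaviour argued for above can be illustrated with a minimal Python sketch. It is not an existing tool and not the method of Seqret or Seqtk. The function names (\verb|parse_fasta|, \verb|format_fasta|, \verb|find_issues|, \verb|validate_and_fix|) and the default width of 60 are assumptions made for illustration, and only two issues are checked: a missing final newline and irregular line wrapping.

```python
from typing import List, Optional, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs, joining wrapped sequence lines."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif line.strip():
            if header is None:
                raise ValueError("sequence data before the first header")
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(records: List[Tuple[str, str]], width: int = 60) -> str:
    """Write records with sequence lines wrapped at `width` characters."""
    lines: List[str] = []
    for header, seq in records:
        lines.append(header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def find_issues(text: str, width: int = 60) -> List[str]:
    """List formatting issues so they can be logged for the analyst."""
    issues: List[str] = []
    if text and not text.endswith("\n"):
        issues.append("missing newline at end of file")
    lines = text.splitlines()
    for n, line in enumerate(lines, start=1):
        if line.startswith(">"):
            continue
        # lines[n] is the line after line number n (1-based numbering).
        next_is_seq = n < len(lines) and not lines[n].startswith(">")
        # Only the last sequence line of a record may be shorter than width.
        if len(line) > width or (next_is_seq and len(line) != width):
            issues.append(f"line {n}: irregular wrapping ({len(line)} characters)")
    return issues


def validate_and_fix(text: str, width: int = 60) -> Tuple[List[str], Optional[str]]:
    """Return (issues, reformatted text), or (issues, None) if already valid."""
    issues = find_issues(text, width)
    if not issues:
        return issues, None  # nothing to write, so no duplicate file is created
    return issues, format_fasta(parse_fasta(text), width)
```

A caller would log the returned issues and write a new file only when the second value is not \verb|None|, so a well-formed FASTA file is never duplicated on disk.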