Authorea

Sue Brown edited introduction.tex over 8 years ago

Commit id: 4f88d6a54fd35ebc61704be64c9df93e3b80cfc3

deletions | additions

Sequence data can be stored as text with each letter representing a nucleic acid (DNA and RNA) or amino acid (protein). The linear nature of these molecules makes it natural to represent them as strings, finite sequences of characters. Although it has been argued that a graph, a network of edges connected by vertices, is a more accurate way to store genomic sequences because graphs allow the inclusion of alternate alleles and alternate possible assemblies \cite{jaffe2012fastg} all of the most common methods for storing sequences (FASTA, FASTQ, SAM/BAM) use a linear strings. Other decisions about how to represent sequence data can be more arbitrary. For example, any character that is not used as a base or an amino acid can be used to indicate the beginning of a new sequence. Additionally text can be wrapped to limit the information content in any one line of a file. The advantage of wrapping text is that some programs can then be designed to work one line at time limiting the burden of each step (e.g. the program would never have to process an entire chromosome of sequence data in a single step). The disadvantage is that code must be slightly more complex to load an entire sequence record into the working memory. \subsection{FASTA file format specifications versus recommendations}