
Jennifer Shelton edited introduction.tex over 8 years ago

Commit id: a64430103d7de322cc0334361012a1fc28bfe019


\section{Introduction} Sequence data can be stored as text, with each letter representing a nucleotide (DNA and RNA) or an amino acid (protein). The linear nature of these molecules makes it natural to represent them as strings, finite sequences of characters. It has been argued that a graph, a network of vertices connected by edges, is a more accurate way to store genomic sequences because graphs allow for the inclusion of alternate alleles and alternate possible assemblies \cite{jaffe2012fastg}. Nevertheless, all of the most common formats for storing sequences (FASTA, FASTQ, SAM/BAM) use linear strings. Other decisions about how to represent sequence data can be more arbitrary. For example, any character that is not used as a base or an amino acid can be used to indicate the beginning of a new sequence. Additionally, text can be wrapped to limit the amount of information in any one line of a file. The advantage of wrapping text is that some programs can then be designed to work one line at a time, limiting the burden of each step (e.g. the program would never have to process an entire chromosome of sequence data in a single step). The disadvantage is that code must be slightly more complex to load an entire sequence record into working memory.
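
The trade-off described above can be illustrated with a minimal Python sketch. It assumes the FASTA convention in which a line beginning with \texttt{>} marks the start of a new record; the function names \texttt{read\_records} and \texttt{wrap\_record}, the \texttt{marker} parameter, and the default line width of 60 are hypothetical choices made for illustration, not part of any format specification or existing tool. The reader works one line at a time, but it has to accumulate the wrapped lines before it can hand back a complete sequence, which is the extra complexity that wrapping introduces.

```python
import io


def read_records(handle, marker=">"):
    """Yield (header, sequence) pairs from line-wrapped, FASTA-like text.

    `marker` is the character that starts a new record; ">" is the FASTA
    convention and is assumed here for illustration.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(marker):
            # A new record begins, so emit the one collected so far.
            if header is not None:
                yield header, "".join(chunks)
            header = line[len(marker):]
            chunks = []
        elif header is not None:
            # A wrapped piece of the current sequence.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def wrap_record(header, sequence, width=60, marker=">"):
    """Return one record as text, wrapped to `width` characters per line."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return marker + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy data: two short records, the first wrapped across two lines.
    text = ">seq1\nACGT\nACGT\n>seq2\nMKV\n"
    for name, seq in read_records(io.StringIO(text)):
        print(name, seq)
```

Because \texttt{read\_records} is a generator, only one record is held in memory at a time, although a record as large as a whole chromosome is still assembled in full before it is returned.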