kstatebioinfo added FASTA file format description to Introduction  almost 9 years ago

Commit id: 78ab565abcddea7e7a92af5dee4236403a538ff1


Other decisions about how to represent sequence data can be more arbitrary. For example, any character that is not used as a base or an amino acid symbol can be used to mark the beginning of a new sequence. Additionally, text can be wrapped to limit the amount of information on any one line of a file. The advantage of wrapping text is that programs can then be designed to work one line at a time, limiting the burden of each step (e.g. the program never has to process an entire chromosome of sequence data in a single step). The disadvantage is that code must be slightly more complex to load an entire sequence record into working memory.

\subsection{FASTA file format specifications versus recommendations}

FASTA file format requirements are minimal. Each sequence is preceded by a header/description line that begins with a \verb|>|. Sequence lines can include any standard IUB/IUPAC single-character symbols for nucleic acids or amino acids, including the ambiguity codes that indicate a set of possible bases or residues. They can also include dashes to indicate alignment gaps.

It is also common practice to wrap FASTA sequence lines and to use the first `word' in a header (i.e. the character string to the left of the first space in the header) as the unique sequence id. Although these features are common, they are not required, which leads to compatibility issues with tools that treat them as required.

\subsection{Customizing FASTA files to ensure that information is properly interpreted by downstream tools}

Add text defining FASTA format here (also define (1) actual format errors, (2) formats that are not technically wrong but are non-standard and (3) formats that throw errors because an existing tool has a bug (in which case we should modify the FASTA and proceed only if the tool will then correctly import the data and export the desired output)).

Some format errors indicate an attempt to use the wrong format (e.g. the first line is not a FASTA header because it does not begin with a ``\verb|>|'' character). However, some formatting issues do not necessarily mean that the input file cannot be used (e.g. improperly wrapped or unwrapped sequence lines, a missing final newline character, or unusual newline characters like `\verb|\r|').
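
To make the distinction between fatal and recoverable formatting issues concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not a reference implementation of any existing tool. The function names \verb|parse_fasta| and \verb|format_fasta| and the default wrap width of 60 characters are hypothetical choices made for this example. The sketch rejects input whose first non-blank line does not begin with \verb|>|, but tolerates wrapped or unwrapped sequence lines, a missing final newline, and \verb|\r| line endings. It also follows the common (but not required) convention of treating the first word of the header as the sequence id.

```python
def parse_fasta(text):
    """Yield (seq_id, description, sequence) tuples from FASTA text.

    Assumption: the whole input fits in memory as one string.
    """
    # Normalize '\r\n' and bare '\r' line endings to '\n' before splitting.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")

    seq_id, description, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank lines (including the empty string left after a final
            # newline) carry no sequence data and are skipped.
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            # Convention, not a requirement: the first word is the id.
            parts = line[1:].strip().split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif seq_id is None:
            # Actual format error: sequence data before any header.
            raise ValueError("not FASTA: first non-blank line lacks '>'")
        else:
            # Wrapped lines are simply concatenated into one sequence.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)


def format_fasta(records, width=60):
    """Return FASTA text with uniform wrapping and a final newline."""
    out = []
    for seq_id, description, sequence in records:
        out.append(f">{seq_id} {description}".rstrip())
        # Slice by position so gap dashes never influence line breaks.
        for start in range(0, len(sequence), width):
            out.append(sequence[start:start + width])
    return "\n".join(out) + "\n" if out else ""
```

For example, \verb|format_fasta(parse_fasta(text), width=60)| would rewrite a file with irregular wrapping or \verb|\r| line endings into a uniformly wrapped one. When reading from disk instead of a string, Python's built-in \verb|open| in text mode already translates \verb|\r| and \verb|\r\n| to \verb|\n| by default.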