Jennifer Shelton edited introduction.tex  over 8 years ago

Commit id: 5c084afebe58c30d4cc2e98cdb6e35b3fc9a90ba


\subsection{FASTA file format specifications versus recommendations}
FASTA file format requirements are minimal \cite{FASTAformat}. Each sequence is preceded by a header/description line that begins with a \verb|>| symbol. Sequence lines can include any standard IUB/IUPAC single-character symbol for nucleic acids or amino acids, including the ambiguity codes that indicate several possible residues or bases \cite{comm1970abbreviations}. They can also include \verb|-| to indicate alignment gaps and \verb|*| to indicate stop codons. NCBI recommends wrapping FASTA sequence lines \cite{FASTAformat}. It is also common practice to use the first `word' in a header (i.e. any character string to the left of the first space in the header) as the unique sequence id. Although these features are common, they are not required, which leads to compatibility issues with tools that treat these conventions as required.
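
The conventions above can be illustrated with a short Python sketch. It is not taken from any existing tool: the function names \verb|read_fasta|, \verb|sequence_id|, \verb|duplicate_ids| and \verb|wrap_record| are hypothetical, and the default wrap width of 60 characters is an arbitrary assumption, not a value specified by the format. The sketch reads records, takes the text to the left of the first space in each header as the id, reports ids that occur more than once, and rewraps sequence lines.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    The leading '>' is stripped from each header. Raises ValueError if
    sequence data appears before any header line.
    """
    header, chunks = None, []
    # str.splitlines() treats \n, \r\n and a bare \r as line breaks.
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("first non-blank line does not begin with '>'")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequence_id(header):
    """Return the conventional id: everything left of the first space."""
    return header.split(" ", 1)[0]


def duplicate_ids(records):
    """Return ids that occur more than once, in order of first repeat."""
    seen, dupes = set(), []
    for header, _ in records:
        sid = sequence_id(header)
        if sid in seen and sid not in dupes:
            dupes.append(sid)
        seen.add(sid)
    return dupes


def wrap_record(header, sequence, width=60):
    """Format one record with sequence lines of at most `width` characters.

    width=60 is an assumed default chosen for illustration only.
    """
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"
```

A tool that relies on unique ids could call \verb|duplicate_ids(list(read_fasta(text)))| before processing, and a tool that requires wrapped input could be fed the output of \verb|wrap_record| for each record.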

Regardless of whether a FASTA file is technically improperly formatted or its format merely violates a popular convention, it is critical to quality analysis workflows that data is converted into a format that will be correctly interpreted by downstream tools. Formatting issues fall into multiple categories, including actual format errors, formats that are not technically wrong but are non-standard, and formats that throw errors because an existing tool has a bug (in which case we should modify the FASTA and proceed only if the tool will then correctly import the data and export the desired output). Some format errors are more often indicative of a major problem, like an attempt to use the wrong data format (e.g. the first line is not a FASTA header because it does not begin with a \verb|>| character). These types of errors will subsequently be referred to as fatal. Alternatively, some formatting issues occur commonly without indicating that the input FASTA file is corrupt (e.g. improperly wrapped/unwrapped sequence lines, missing final new line characters, unusual new line characters like \verb|\r|). These issues will be referred to as non-fatal. Fatal formatting issues should cause processing to stop. Non-fatal formatting issues should be automatically corrected according to the most common resolution for that type of error. While downstream processing continues, the analyst can double check the automated decision to reformat non-fatal issues. In this manner the workflow is not slowed by trivial reformatting steps, and the rarer problems (e.g. when a missing final new line was caused by incomplete file transfer) can still be caught.

\subsection{Existing tools}