Jennifer Shelton edited introduction.tex  over 8 years ago

Commit id: df8faf080309e3a020bf685c8cf6b5851424a792


Regardless of whether a FASTA file is technically improperly formatted or its format merely violates a popular convention, it is critical to quality analysis workflows that data are converted into a format that downstream tools will interpret correctly. Formatting issues fall into several categories: actual format errors, formats that are not technically wrong but are non-standard, and formats that throw errors because an existing tool has a bug (in which case we should modify the FASTA and proceed only if the tool will then correctly import the data and export the desired output).

Some format errors often indicate a major problem, such as an attempt to use the wrong data format (e.g. the first line is not a FASTA header because it does not begin with a \verb|>| character). These types of errors will subsequently be referred to as fatal. Alternatively, some formatting issues occur commonly without indicating that the FASTA file is corrupt (e.g. improperly wrapped/unwrapped sequence lines, missing final new line characters, unusual new line characters like \verb|\r|). These issues will be referred to as non-fatal. Fatal formatting issues should cause processing to stop. Non-fatal formatting issues should be automatically corrected according to the most common resolution for this type of error. While downstream processing continues, the analyst can double check the automated decision to reformat non-fatal issues. This way the workflow would not need to be slowed for trivial reformatting steps, and the rarer problems (e.g. when a missing last new line was caused by incomplete file transfer) could still be caught.

\subsection{Existing tools}

TATATATATATTGCGCTCTCGTCTCCT
\end{verbatim}

However, Seqret did not log the format errors it detected. In many cases missing new line characters at the end of a file or variable line wrapping do not indicate corrupted data. However, they could be the result of a corrupted file, and the analyst should be made aware of such errors so that the issue can be briefly investigated.

Another feature of Seqret is that an output file is created even if the output is identical to the input. Storing two identical files is an inefficient use of disk space. Rather, a tool should test for proper file format and export a reformatted file only if the input is found to have a formatting issue.

Seqtk is another example of a tool that can automate FASTA reformatting but does not first check the original format or report format issues. Restarting analysis manually after wrapping a FASTA file may only take minutes, but the issue is how long it takes the analyst to become available. Likewise, storage is trivial unless the FASTA files in question store whole genomes, in which case the burden can add up for a bioinformatics core. Efficiency and automation are crucial as bioinformatics projects become more numerous and time consuming.

Many tools can either detect a format issue or repair a format issue. No existing tool was found that both validates FASTA format and reformats automatically only where required for a user-defined list of non-fatal FASTA format issues.
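
The behaviour argued for in this section (stop on fatal issues, log and correct non-fatal issues, and produce reformatted output only when something actually changed) can be illustrated with a minimal Python sketch. This is not the implementation of any existing tool. The names \verb|check_fasta|, \verb|FatalFastaError| and \verb|NON_FATAL|, the issue labels and the default line width of 60 are all assumptions made for illustration.

```python
# Illustrative sketch only: all names and the default width are assumptions.
NON_FATAL = ("carriage_return", "final_newline", "wrapping")


class FatalFastaError(ValueError):
    """Raised when processing should stop."""


def check_fasta(text, width=60, non_fatal=NON_FATAL):
    """Return (text_to_use, issues) for the FASTA content in `text`.

    `issues` lists the non-fatal problems that were found and corrected.
    When it is empty the input is returned unchanged, so the caller has
    no reason to write a second, identical file.
    """
    # Fatal: the first line is not a FASTA header.
    if not text.startswith(">"):
        raise FatalFastaError("first line does not begin with '>'")

    issues = []
    if "\r" in text:
        issues.append("carriage_return")
    unix = text.replace("\r\n", "\n").replace("\r", "\n")
    if not unix.endswith("\n"):
        issues.append("final_newline")

    # Group lines into [header, sequence_line, ...] records.
    body = unix[:-1] if unix.endswith("\n") else unix
    records = []
    for line in body.split("\n"):
        if line.startswith(">"):
            records.append([line])
        else:
            records[-1].append(line)

    # Properly wrapped: every sequence line but the last is exactly
    # `width` long, and the last one is between 1 and `width` long.
    for _header, *seq_lines in records:
        full, last = seq_lines[:-1], seq_lines[-1:]
        if any(len(s) != width for s in full) or any(
            not 0 < len(s) <= width for s in last
        ):
            issues.append("wrapping")
            break

    # Issues outside the user-defined list are treated as fatal.
    unexpected = [i for i in issues if i not in non_fatal]
    if unexpected:
        raise FatalFastaError("not allowed to auto-correct: %s" % unexpected)

    if not issues:
        return text, issues

    # Rebuild with "\n" line endings, fixed-width wrapping, final newline.
    out = []
    for header, *seq_lines in records:
        seq = "".join(seq_lines)
        out.append(header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n", issues
```

A caller would log the returned list of issues for the analyst to review and write a new file only when that list is non-empty. Restricting \verb|non_fatal| to a subset of the labels turns any other detected issue into a fatal one, which is one possible reading of a user-defined list of non-fatal issues.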