Indel error
Motivation

While _statistical phasing_ approaches are necessary for the determination of large-scale haplotype structure, sequencing traces provide short-range phasing information that may be employed directly in primary variant detection to establish phase between proximal alleles. Present read lengths and error rates limit this _physical phasing_ approach to variants clustered within tens of bases, but as the cost of obtaining long sequencing traces decreases, physical phasing methods will enable the determination of larger haplotype structure directly, using only sequence information from a single sample.

Haplotype-based variant detection methods, in which short haplotypes are read directly from sequencing traces, offer a number of benefits over methods that operate on a single position at a time. Haplotype-based methods ensure semantic consistency among described variants by simultaneously evaluating all classes of alleles in the same context. Locally phased genotypes can be used to improve genotyping accuracy for rare variants, which can be difficult to impute due to sparse linkage information. Similarly, they can assist in the design of genotyping assays, which can fail when undescribed variation is present at the assayed locus.

Provided sequencing errors are independent, the use of longer haplotypes in variant detection can improve detection by increasing the signal-to-noise ratio of the genotype likelihood space that is used in analysis. This follows from the fact that the space of possible erroneous haplotypes expands dramatically with haplotype length, while the space of true variation remains constant: the number of true alleles at a given locus is less than or equal to the ploidy of the sample.

The direct detection of haplotypes from alignment data presents several challenges to existing variant detection methods.
As the length of a haplotype increases, so does the number of possible alleles within the haplotype, and thus methods designed to detect genetic variation over haplotypes in a unified context must be able to model multiallelism. However, most variant detection methods estimate the likelihood of polymorphism at a given locus using statistical models that assume biallelism and uniform, typically diploid, copy number. Moreover, improper modeling of copy number impedes the accurate detection of small variants on sex chromosomes, in polyploid organisms, and in locations with known copy-number variation, where called alleles, genotypes, and likelihoods should reflect local copy number and global ploidy. To enable the application of population-level inference methods to the detection of haplotypes, we generalize a previously described Bayesian statistical method to allow multiallelic loci and non-uniform copy number across the samples under consideration. We have implemented this model in FreeBayes.
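
The counting arguments in this section can be illustrated with a short sketch. It is not taken from FreeBayes; the function names `n_candidate_haplotypes` and `n_genotypes` are hypothetical. As a simplifying assumption it counts only substitution haplotypes over a four-letter alphabet (indels would enlarge the space further) and treats genotypes as unordered multisets of alleles.

```python
from math import comb


def n_candidate_haplotypes(length, alphabet_size=4):
    """Count distinct sequences of a given length (substitutions only).

    Assumption for illustration: indels are ignored, so this is a lower
    bound on the number of haplotypes an error process could produce.
    """
    return alphabet_size ** length


def n_genotypes(n_alleles, ploidy):
    """Count unordered genotypes at a locus.

    A genotype is modeled as a multiset of `ploidy` alleles drawn from
    `n_alleles` candidates, giving C(n_alleles + ploidy - 1, ploidy).
    """
    return comb(n_alleles + ploidy - 1, ploidy)


if __name__ == "__main__":
    ploidy = 2
    # The candidate (mostly erroneous) haplotype space grows exponentially
    # with length, while a single sample carries at most `ploidy` true alleles.
    for length in (1, 5, 10):
        print(length, n_candidate_haplotypes(length), "candidates vs at most",
              ploidy, "true alleles")

    # Moving from biallelic to multiallelic loci, or changing copy number,
    # changes the size of the genotype space a model must cover.
    print(n_genotypes(2, 2))  # biallelic diploid
    print(n_genotypes(4, 2))  # four alleles, diploid
    print(n_genotypes(4, 3))  # four alleles, triploid
```

Under these assumptions, a biallelic diploid model covers only three genotypes, whereas a locus with four candidate alleles has ten diploid or twenty triploid genotypes, which is why a model restricted to biallelic, uniformly diploid loci cannot represent haplotype alleles or variable copy number.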