Correcting for incorrect genotype calls

SNP genotypes that are incorrectly called as heterozygous are a major source of false positives, since reads that overlap them appear to come from only one allele. To account for this, we assume that allele-specific reads are drawn from a mixture of two beta-binomials, with weights \(H_{ik}\) and \(1-H_{ik}\), where \(H_{ik}\) is the probability that individual \(i\) is heterozygous for SNP \(k\). Reads from heterozygous individuals contain the reference allele with probability \(p_{h}\). We assume that reads from homozygous individuals still have a small probability of appearing to come from the other allele because of sequencing errors, which occur with probability \(p_{\textrm{err}}\). The probability of observing \(y_{ik}\) reads from the reference allele for individual \(i\) at SNP \(k\) then becomes:

\[\begin{aligned} \Pr_{\mathrm{BB\text{-}mix}}\left(Y = y_{ik} \left| p_{h}, n_{ik}, \Upsilon_i \right. \right) = {} & H_{ik} \Pr_{\mathrm{BB}} \left(Y = y_{ik} \left| p_{h}, n_{ik}, \Upsilon_i \right. \right) \\ & + (1 - H_{ik}) \left[ \Pr_{\mathrm{BB}} \left(Y = y_{ik} \left| p_{\textrm{err}}, n_{ik}, \Upsilon_i \right. \right) + \Pr_{\mathrm{BB}} \left(Y = y_{ik} \left| 1-p_{\textrm{err}}, n_{ik}, \Upsilon_i \right. \right) \right] \end{aligned}\]
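
The mixture probability above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work. In particular, this section does not state how the dispersion \(\Upsilon_i\) maps onto the beta-binomial shape parameters, so the sketch assumes \(\alpha = p(1/\Upsilon_i - 1)\) and \(\beta = (1-p)(1/\Upsilon_i - 1)\) with \(0 < \Upsilon_i < 1\). The function and argument names are hypothetical.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binom_pmf(y, n, p, dispersion):
    """Beta-binomial probability of y reference reads out of n.

    ASSUMPTION: the dispersion (0 < dispersion < 1) is converted to shape
    parameters as alpha = p * (1/dispersion - 1) and
    beta = (1 - p) * (1/dispersion - 1). The text does not specify this mapping.
    """
    scale = 1.0 / dispersion - 1.0
    a = p * scale
    b = (1.0 - p) * scale
    return exp(log_choose(n, y) + log_beta(y + a, n - y + b) - log_beta(a, b))


def bb_mix_pmf(y, n, p_het, het_prob, p_err, dispersion):
    """Mixture probability, following the equation term by term."""
    het = beta_binom_pmf(y, n, p_het, dispersion)
    # Homozygous term: the two components are summed, as in the equation.
    hom = (beta_binom_pmf(y, n, p_err, dispersion)
           + beta_binom_pmf(y, n, 1.0 - p_err, dispersion))
    return het_prob * het + (1.0 - het_prob) * hom
```

With these assumptions, a site where all 20 reads carry the reference allele receives a much higher probability when \(H_{ik} = 0.5\) than when \(H_{ik} = 1\), because the homozygous term absorbs the extreme allelic imbalance.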

We found that even SNPs with heterozygous probabilities of 1.0 are occasionally miscalled, so we cap heterozygous probabilities at a maximum value of 0.99. We then update this heterozygous probability using sequencing data obtained from the same individual. The sequencing data may consist of DNA sequencing reads or of reads aggregated across multiple types of experiments performed on the same individual (e.g. RNA-seq and ChIP-seq reads).

For a SNP with heterozygous probability \(H_{ik} = \min(0.99, H_{ik}^{\textrm{obs}})\), we define the updated heterozygous probability, \(\hat{H}_{ik}\), as:

\[\hat{H}_{ik} = \frac{H_{ik} \Pr_{\mathrm{Bin}} \left( D \left| p=0.5 \right. \right)} {H_{ik} \Pr_{\mathrm{Bin}} \left( D \left| p=0.5 \right. \right) + (1 - H_{ik}) \left[ \Pr_{\mathrm{Bin}} \left( D \left| p=p_{\textrm{err}} \right. \right) + \Pr_{\mathrm{Bin}} \left( D \left| p=1-p_{\textrm{err}} \right. \right) \right]}\]
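
As a rough illustration of this update, the sketch below treats the sequencing data \(D\) as a pair of read counts (reference and alternate) at the SNP, which is an assumption about how \(D\) is summarized. The default value of `p_err` and all function names are hypothetical placeholders, not values taken from this work.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def update_het_prob(h_obs, ref_count, alt_count, p_err=0.01, cap=0.99):
    """Updated heterozygous probability for one individual at one SNP.

    ASSUMPTIONS: D is summarized as (ref_count, alt_count), and
    p_err=0.01 is an illustrative value only.
    """
    h = min(cap, h_obs)  # cap the genotype-call probability at 0.99
    n = ref_count + alt_count
    het = h * binom_pmf(ref_count, n, 0.5)
    # Homozygous term: sum of the two binomials, as in the equation.
    hom = (1.0 - h) * (binom_pmf(ref_count, n, p_err)
                       + binom_pmf(ref_count, n, 1.0 - p_err))
    return het / (het + hom)


# A site called heterozygous with probability 1.0:
balanced = update_het_prob(1.0, ref_count=10, alt_count=10)  # stays near 1
one_sided = update_het_prob(1.0, ref_count=20, alt_count=0)  # drops near 0
```

Under these assumptions, balanced read counts leave the heterozygous probability close to 1, whereas 20 reads from a single allele reduce it to well below 0.01 even though the original call had probability 1.0.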