Authorea

Graham McVicker edited Correcting for incorrect genotype calls.tex almost 10 years ago

Commit id: 04c1ef008695c01749955783f226e56822fe6879

SNP genotypes that are incorrectly called as heterozygous are a major source of false positives, since reads that overlap them appear to come from only one allele. To account for this issue, we assume that allele-specific reads are drawn from a mixture of beta-binomials with mixture weights $H_{ik}$ and $1-H_{ik}$, where $H_{ik}$ is the probability that individual $i$ is heterozygous for SNP $k$. Reads from heterozygous individuals contain the reference allele with probability $p_{h}$. We assume that reads from homozygous individuals still have a small probability of coming from the other allele due to sequencing errors, which occur with probability $p_{\textrm{err}}$. The probability of observing $y_{ik}$ reads from the reference allele at SNP $k$ for individual $i$ then becomes:
\begin{eqnarray*}
& \Pr_{\mathrm{BB-mix}}\left(Y = y_{ik} \left| p_{h}, n_{ik}, \Upsilon_i \right. \right) = H_{ik} \Pr_{\mathrm{BB}} \left(Y = y_{ik} \left| p_{h}, n_{ik}, \Upsilon_i \right. \right) &\\
& + (1 - H_{ik}) \left[ \Pr_{\mathrm{BB}} \left(Y = y_{ik} \left| p_{\textrm{err}}, n_{ik}, \Upsilon_i \right. \right) + \Pr_{\mathrm{BB}} \left(Y = y_{ik} \left| 1-p_{\textrm{err}}, n_{ik}, \Upsilon_i \right. \right) \right] &
\end{eqnarray*}
We found that even SNPs with heterozygous probabilities of 1.0 are occasionally miscalled, so we cap heterozygous probabilities at a maximum value of 0.99. We then update this heterozygous probability using sequencing data obtained from the same individual. Sequencing data may consist of DNA sequencing reads or reads aggregated across multiple types of experiments performed on the same individual (e.g. RNA-seq and ChIP-seq reads).
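The mixture probability above can be evaluated numerically as in the following minimal Python sketch, which is an illustration and not the authors' implementation. The function names (`betabinom_pmf`, `bb_mix`) and the constant `MAX_HET_PROB` are hypothetical. The beta-binomial is parameterized here by a mean $p$ and a concentration `conc`, with $\alpha = p \cdot \mathrm{conc}$ and $\beta = (1-p) \cdot \mathrm{conc}$; this is an assumption, since the mapping from the dispersion $\Upsilon_i$ to beta-binomial parameters is not given in this section.

```python
from math import exp, lgamma

# Cap on the heterozygous probability H_ik, as described in the text.
MAX_HET_PROB = 0.99


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(y, n, p, conc):
    """Beta-binomial probability of y reference reads out of n.

    ASSUMPTION: alpha = p * conc and beta = (1 - p) * conc, where conc is a
    hypothetical stand-in for the per-individual dispersion Upsilon_i.
    Requires 0 < p < 1 and conc > 0.
    """
    a = p * conc
    b = (1.0 - p) * conc
    log_choose = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
    return exp(log_choose + log_beta(y + a, n - y + b) - log_beta(a, b))


def bb_mix(y, n, het_prob, p_het, p_err, conc):
    """Mixture probability of y reference reads, following the expression above.

    The heterozygous component uses p_het; the homozygous component sums the
    two beta-binomials centered on p_err and 1 - p_err, unweighted, exactly
    as the expression is written.
    """
    h = min(het_prob, MAX_HET_PROB)  # never trust a genotype call completely
    het = betabinom_pmf(y, n, p_het, conc)
    hom = betabinom_pmf(y, n, p_err, conc) + betabinom_pmf(y, n, 1.0 - p_err, conc)
    return h * het + (1.0 - h) * hom


if __name__ == "__main__":
    # 20 of 20 reads carry the reference allele: a confident heterozygous
    # call explains this far worse than an uncertain one.
    print(bb_mix(20, 20, het_prob=0.99, p_het=0.5, p_err=0.01, conc=50.0))
    print(bb_mix(20, 20, het_prob=0.50, p_het=0.5, p_err=0.01, conc=50.0))
```

With strongly one-sided read counts, the value returned for a low heterozygous probability exceeds the value for a high one, which is the behavior the mixture is meant to provide: apparent allelic imbalance at a possibly miscalled SNP is absorbed by the homozygous component and is not counted as evidence of allele-specific signal.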