ROHMM’s HMM algorithm:
ROHMM uses a 2-state HMM to infer homozygosity from genotyping
data in VCF format. ROHMM’s algorithm uses the following notions:
- Two states representing homozygous (ROH) and non-homozygous (NonROH)
regions.
- The genotype at any given position i, “G_i”.
- The genotype likelihood at any given position i, “GL_i”, as calculated
by the variant caller and reported in PL or GL format.
- The allele distribution probability of a given genotype in a given
state, “P(Genotype|State)”, derived from the non-pseudoautosomal
regions of the X chromosome.
We generated emission probabilities per site from the allele
distribution probabilities and the genotype likelihoods. The genotype
likelihoods are taken from the GL or PL fields that variant callers
populate within the VCF FORMAT tags; missing entries are assigned a
user-defined PL value.
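A minimal sketch of how a per-site emission could be computed from these ingredients, assuming that the emission for a state is the sum over the three diploid genotypes of P(Genotype|State) weighted by the normalized genotype likelihood. The function names, the P(Genotype|State) values and the default PL for missing entries are illustrative assumptions, not values taken from ROHMM.

```python
# Hypothetical P(Genotype|State) for genotypes (hom-ref, het, hom-alt).
# These numbers are placeholders, not ROHMM's estimates.
GENOTYPE_GIVEN_STATE = {
    "ROH": (0.495, 0.01, 0.495),
    "NonROH": (0.25, 0.5, 0.25),
}

# Assumed user-defined PL for sites with no PL/GL entry (uninformative).
DEFAULT_PL = (0, 0, 0)


def pl_to_likelihoods(pl):
    """Convert phred-scaled PL values to normalized genotype likelihoods."""
    raw = [10 ** (-p / 10.0) for p in pl]
    total = sum(raw)
    return [r / total for r in raw]


def gl_to_likelihoods(gl):
    """Convert log10-scaled GL values to normalized genotype likelihoods."""
    raw = [10 ** g for g in gl]
    total = sum(raw)
    return [r / total for r in raw]


def emission(pl, state):
    """Assumed emission: sum over genotypes of P(G|state) * likelihood(G)."""
    if pl is None:
        pl = DEFAULT_PL
    likelihoods = pl_to_likelihoods(pl)
    priors = GENOTYPE_GIVEN_STATE[state]
    return sum(p * l for p, l in zip(priors, likelihoods))
```

Under these assumed values, a confident homozygous call such as PL = (0, 30, 60) yields a larger emission for ROH than for NonROH, a confident heterozygous call does the opposite, and a missing entry is uninformative.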
Emission probabilities can alternatively be calculated from population
allele frequencies, as in bcftools roh; this is an optional mode of
operation in ROHMM. The Allele Frequency Model is included for the
sake of comparison.
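For illustration, an allele-frequency-based P(Genotype|State) could look like the sketch below. It assumes Hardy-Weinberg proportions in the NonROH state and no heterozygotes in the ROH state, in the spirit of the bcftools roh model; the function name is hypothetical and ROHMM's own implementation may differ, for example by allowing a small error rate for heterozygous calls inside ROH.

```python
def af_genotype_priors(alt_freq):
    """Return P(Genotype|State) for (hom-ref, het, hom-alt) from the ALT frequency.

    Assumed model: Hardy-Weinberg in NonROH, homozygous-only in ROH.
    """
    if not 0.0 <= alt_freq <= 1.0:
        raise ValueError("alt_freq must be between 0 and 1")
    f = alt_freq
    non_roh = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
    roh = (1 - f, 0.0, f)
    return {"ROH": roh, "NonROH": non_roh}
```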
Transition probabilities in ROHMM are similar to the logarithmic
decay function introduced by Marioni and colleagues (Marioni et al.,
2006). This function calculates dynamic transition probabilities between
two adjacent loci as an exponential function of the distance between
them; therefore, the longer the distance, the larger the probability of
leaving the previous state.
The standard transition probability, stdtrans, is set to a default
value of 0.1. Alternatively, ROHMM can use fixed transition
probabilities supplied by the user, but the distance decay function is
the default.
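The behaviour described above can be sketched with one possible distance-dependent form. The exact function used by ROHMM is not reproduced here, so the formula, the scale_bp parameter and the function names below are assumptions chosen only to show a switch probability that grows with distance and equals stdtrans at one scale unit. Only the default stdtrans = 0.1 comes from the text.

```python
def switch_probability(distance_bp, stdtrans=0.1, scale_bp=100_000):
    """Assumed form: probability of leaving the current state after distance_bp.

    The probability of staying decays exponentially with distance, so the
    switch probability is 0 at distance 0 and stdtrans at distance scale_bp.
    """
    if distance_bp < 0:
        raise ValueError("distance_bp must be non-negative")
    return 1.0 - (1.0 - stdtrans) ** (distance_bp / scale_bp)


def transition_matrix(distance_bp, stdtrans=0.1, scale_bp=100_000, fixed=None):
    """2x2 matrix over (ROH, NonROH); `fixed` overrides the distance decay."""
    if fixed is not None:
        p = fixed
    else:
        p = switch_probability(distance_bp, stdtrans, scale_bp)
    return [[1.0 - p, p], [p, 1.0 - p]]
```

Passing fixed=0.05, for instance, mimics the user-supplied fixed transition mode, while omitting it uses the distance-dependent default.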
The initial state probabilities of ROHMM are set to 0.5 to avoid
bias towards either state, unlike other methods described (Magi et al.,
2014; Narasimhan et al., 2016). ROHMM uses a Viterbi decoding
function to infer homozygous and non-homozygous states based on
expectation maximization and calculates the average posterior
forward-backward score of each inferred interval for quality scoring.
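The decoding and scoring steps can be illustrated with a generic two-state sketch: textbook Viterbi decoding in log space, scaled forward-backward posteriors, and a per-interval average of the posterior of the decoded state. This is not ROHMM's code; the function names and data layout are assumptions. Emissions are (ROH, NonROH) pairs per site and must be strictly positive, and transitions[t] is the 2x2 matrix for the step from site t to site t+1.

```python
import math

STATES = ("ROH", "NonROH")


def viterbi(emissions, transitions, init=(0.5, 0.5)):
    """Most likely state path (0 = ROH, 1 = NonROH), computed in log space."""
    score = [math.log(init[s]) + math.log(emissions[0][s]) for s in range(2)]
    back = []
    for t in range(1, len(emissions)):
        A = transitions[t - 1]
        new, ptr = [], []
        for s in range(2):
            cands = [score[r] + math.log(A[r][s]) for r in range(2)]
            best = 0 if cands[0] >= cands[1] else 1
            new.append(cands[best] + math.log(emissions[t][s]))
            ptr.append(best)
        score = new
        back.append(ptr)
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):  # trace back from the last site
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def _normalize(values):
    total = sum(values)
    return [v / total for v in values]


def posteriors(emissions, transitions, init=(0.5, 0.5)):
    """Per-site posterior state probabilities via scaled forward-backward."""
    n = len(emissions)
    fwd = [_normalize([init[s] * emissions[0][s] for s in range(2)])]
    for t in range(1, n):
        A = transitions[t - 1]
        fwd.append(_normalize([
            emissions[t][s] * sum(fwd[-1][r] * A[r][s] for r in range(2))
            for s in range(2)
        ]))
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        A = transitions[t]
        bwd[t] = _normalize([
            sum(A[s][r] * emissions[t + 1][r] * bwd[t + 1][r] for r in range(2))
            for s in range(2)
        ])
    return [_normalize([f[s] * b[s] for s in range(2)]) for f, b in zip(fwd, bwd)]


def intervals(path, post):
    """Collapse a path into (state, first_idx, last_idx, mean_posterior, n_sites)."""
    out, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            s = path[start]
            scores = [post[i][s] for i in range(start, t)]
            out.append((STATES[s], start, t - 1, sum(scores) / len(scores), t - start))
            start = t
    return out
```

The mean posterior of the decoded state gives each interval a quality score between 0 and 1, with values near 1 indicating that the forward-backward pass agrees strongly with the Viterbi call.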
Results are presented as a 6-column BED file that reports, for each
interval, the state, the average posterior score and the number of sites
used to infer the state.
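A sketch of how such intervals might be serialized, assuming the six columns are chromosome, start, end, state, average posterior score and number of sites, in that order. The column order, the score formatting and the function name are assumptions; only the 0-based, half-open coordinate convention is standard BED. Each interval is given here as a hypothetical tuple (state, first_site_index, last_site_index, mean_posterior, n_sites).

```python
def bed_lines(chrom, positions, segs):
    """Format intervals as tab-separated BED lines.

    positions: 1-based VCF positions of the sites used by the HMM.
    segs: tuples of (state, first_idx, last_idx, mean_posterior, n_sites).
    """
    lines = []
    for state, first, last, mean_post, n_sites in segs:
        start = positions[first] - 1  # BED start is 0-based
        end = positions[last]         # BED end is exclusive
        fields = (chrom, str(start), str(end), state, f"{mean_post:.4f}", str(n_sites))
        lines.append("\t".join(fields))
    return lines
```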