ROHMM’s HMM algorithm:
ROHMM uses a 2-state HMM to infer homozygosity from genotyping data in VCF format. ROHMM's algorithm uses the following notions:
  1. 2 states representing homozygous (ROH) and non-homozygous regions (NonROH).
  2. Genotype at any given position i, Gi
  3. Genotype likelihood at any given position i, as calculated by the variant caller and reported in PL or GL format, “GLi”.
  4. Allele distribution probability of the given genotype at a given state, derived from X chromosome non-pseudoautosomal regions, “P(Genotype|State)”.
Using the allele distribution probabilities and the genotype likelihoods from the GL or PL fields populated by variant callers within the VCF FORMAT tags, or a user-defined PL value assigned to missing entries, we generated emission probabilities per site as follows.
Emission probabilities can also be calculated from population allele frequencies, as in bcftools roh; this is an optional mode of operation for ROHMM. The Allele Frequency Model is included for the sake of comparison.
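The following Python sketch illustrates how a per-site emission probability could be assembled from the quantities listed above. It is an illustration under stated assumptions, not ROHMM's actual formula: the marginalization over genotypes, the numeric values in P_GENOTYPE_GIVEN_STATE, and the MISSING_PL default are all hypothetical. Only the PL and GL conversions follow the VCF definitions (PL is the phred-scaled likelihood, GL the log10 likelihood, ordered 0/0, 0/1, 1/1 for a biallelic site).

```python
import math

# Hypothetical P(Genotype | State) for genotypes (0/0, 0/1, 1/1).
# Placeholder values for illustration only, not ROHMM's estimates.
P_GENOTYPE_GIVEN_STATE = {
    "ROH": (0.495, 0.01, 0.495),
    "NonROH": (0.25, 0.50, 0.25),
}

# Hypothetical user-defined PL for sites with no PL/GL entry (uninformative).
MISSING_PL = (0, 0, 0)


def pl_to_likelihoods(pl):
    """Convert phred-scaled PL values to likelihoods that sum to 1."""
    raw = [10 ** (-p / 10.0) for p in pl]
    total = sum(raw)
    return [r / total for r in raw]


def gl_to_likelihoods(gl):
    """Convert log10-scaled GL values to likelihoods that sum to 1."""
    raw = [10 ** g for g in gl]
    total = sum(raw)
    return [r / total for r in raw]


def emission(likelihoods, state):
    """Assumed form: sum over genotypes of P(G | state) * GL(G)."""
    weights = P_GENOTYPE_GIVEN_STATE[state]
    return sum(w * l for w, l in zip(weights, likelihoods))


def site_emissions(pl=None):
    """Return (emission in ROH, emission in NonROH) for one site."""
    likelihoods = pl_to_likelihoods(pl if pl is not None else MISSING_PL)
    return emission(likelihoods, "ROH"), emission(likelihoods, "NonROH")
```

Under these placeholder values, a confident homozygous call such as PL = (0, 30, 60) favors the ROH state, a confident heterozygous call such as PL = (30, 0, 30) favors NonROH, and a missing entry contributes equally to both states.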
The transition probabilities of ROHMM are similar to the logarithmic decay function introduced by Marioni and colleagues (Marioni et al., 2006). This function calculates dynamic transition probabilities between two adjacent loci as an exponential function of the distance between them; therefore, the longer the distance, the larger the probability of leaving the previous state. This logarithmic distance decay function is summarized below.
The standard transition probability, stdtrans, is set to a default value of 0.1. Alternatively, ROHMM can use fixed transition probabilities supplied by the user, but the distance decay function is the default.
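As an illustration of a distance-dependent transition, the Python sketch below uses one functional form that matches the qualitative description above: an exponential function of distance whose switch probability grows with the gap between adjacent loci. The form is an assumption, not ROHMM's or Marioni et al.'s exact formula. It is the switch probability of a symmetric 2-state Markov chain after d / unit_bp steps, so it equals stdtrans at a distance of unit_bp and saturates at 0.5. The unit_bp parameter and the function names are hypothetical.

```python
STD_TRANS = 0.1  # default stdtrans stated in the text


def switch_probability(distance_bp, std_trans=STD_TRANS, unit_bp=100_000):
    """Assumed decay form: 0.5 * (1 - (1 - 2 * std_trans) ** (d / unit_bp)).

    unit_bp (hypothetical) is the distance at which the switch
    probability equals std_trans.
    """
    if not 0.0 < std_trans < 0.5:
        raise ValueError("std_trans must lie strictly between 0 and 0.5")
    if distance_bp < 0:
        raise ValueError("distance_bp must be non-negative")
    return 0.5 * (1.0 - (1.0 - 2.0 * std_trans) ** (distance_bp / unit_bp))


def transition_matrix(distance_bp, fixed=None, **kwargs):
    """2x2 matrix over (ROH, NonROH); each row sums to 1.

    If `fixed` is given, it is used as a user-supplied constant switch
    probability instead of the distance-dependent value.
    """
    p = fixed if fixed is not None else switch_probability(distance_bp, **kwargs)
    return [[1.0 - p, p], [p, 1.0 - p]]
```

With this form, two loci 1 kb apart are much less likely to differ in state than two loci 1 Mb apart, and passing fixed=0.1 reproduces a constant transition probability regardless of distance.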
The initial state probabilities of ROHMM are set to 0.5 to avoid bias towards either state, unlike other methods described (Magi et al., 2014; Narasimhan et al., 2016). ROHMM uses a Viterbi decoding function to infer homozygous and non-homozygous states based on expectation maximization, and calculates the average posterior forward-backward score of each inferred interval for quality scoring. Results are presented as a 6-column BED file indicating the state, the average posterior score and the number of sites used to infer the state.
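The decoding and scoring steps can be sketched in Python as follows. This is a minimal textbook 2-state Viterbi and scaled forward-backward implementation, not ROHMM's code. It assumes strictly positive per-site emissions given as (ROH, NonROH) pairs and a symmetric per-site switch probability; the function names are hypothetical, and intervals are reported as site indices rather than genomic coordinates.

```python
import math

STATES = ("ROH", "NonROH")


def viterbi(emissions, switch, init=(0.5, 0.5)):
    """Most likely state path; switch[t] applies between sites t-1 and t."""
    score = [math.log(init[s]) + math.log(emissions[0][s]) for s in (0, 1)]
    back = []
    for t in range(1, len(emissions)):
        stay, move = math.log(1.0 - switch[t]), math.log(switch[t])
        new, ptr = [], []
        for s in (0, 1):
            cands = [score[r] + (stay if r == s else move) for r in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new.append(cands[best] + math.log(emissions[t][s]))
        back.append(ptr)
        score = new
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]


def posteriors(emissions, switch, init=(0.5, 0.5)):
    """Per-site posterior state probabilities via scaled forward-backward."""
    n = len(emissions)
    fwd = [[init[s] * emissions[0][s] for s in (0, 1)]]
    for t in range(1, n):
        p, sw = fwd[-1], switch[t]
        fwd.append([(p[s] * (1 - sw) + p[1 - s] * sw) * emissions[t][s]
                    for s in (0, 1)])
        z = sum(fwd[t - 1])
        fwd[t - 1] = [x / z for x in fwd[t - 1]]  # rescale the previous row
        fwd[t] = [x * z / z for x in fwd[t]]      # current row rescaled next pass
    z = sum(fwd[-1])
    fwd[-1] = [x / z for x in fwd[-1]]
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        sw = switch[t + 1]
        nxt = [bwd[t + 1][s] * emissions[t + 1][s] for s in (0, 1)]
        cur = [nxt[s] * (1 - sw) + nxt[1 - s] * sw for s in (0, 1)]
        z = sum(cur)
        bwd[t] = [c / z for c in cur]
    post = []
    for f, b in zip(fwd, bwd):
        raw = [f[0] * b[0], f[1] * b[1]]
        z = sum(raw)
        post.append([r / z for r in raw])
    return post


def intervals(path, post):
    """Collapse the path into (state, first, end, avg posterior, n sites)."""
    out, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            s = path[start]
            avg = sum(post[i][s] for i in range(start, t)) / (t - start)
            out.append((STATES[s], start, t, avg, t - start))
            start = t
    return out
```

A caller would build the emissions list from the per-site genotype likelihoods and the switch list from the distances between adjacent sites, then pass the Viterbi path and the posteriors to intervals() to obtain one record per inferred region.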