Discussion
Efforts to detect homozygosity from genotyping data have resulted in
many different tools and algorithms. Sliding window and Hidden Markov
model approaches have been proposed as means to estimate homozygous
segments from various data types (Ceballos, Hazelhurst, &
Ramsay, 2018; Howrigan, Simonson, & Keller, 2011). Sliding window
approaches have been useful especially when working with dense
genotyping arrays where allele densities are usually uniform and error
rates are low compared to sequencing-based methods. GERMLINE and PLINK are two representatives of the early sliding window algorithms;
the latter is still widely used in studies utilizing
homozygosity mapping (Gusev et al., 2009; Purcell et al., 2007). However,
both tools were designed specifically for dense genotyping arrays,
and their performance on the sparse and error-prone data generated
by next generation sequencing is questionable. Earlier algorithms using
HMM approaches also exist, yet their primary target is
high-quality dense genotyping array data, and their applicability to next
generation sequencing data is limited (Leutenegger et al., 2003; Marioni
et al., 2006). Newer HMM approaches such as H3M2, Filtus, and bcftools roh mostly target sparse and error-prone next generation
sequencing data. H3M2 uses a predefined set of SNPs along with a
heterogeneous HMM to incorporate allelic distances, as in BioHMM (Marioni
et al., 2006), and Gaussian mixture probabilities of B-allele frequencies
to calculate genotypic probability under different states. Filtus uses a modified version of Leutenegger’s algorithm to detect
autozygosity in next generation sequencing data (Vigeland, Gjøtterud, &
Selmer, 2016). bcftools roh, on the other hand, uses allele
frequencies as genotypic probabilities and utilizes genome wide
recombination maps to calculate state transitions between consecutive
allele positions. Both approaches have advantages over using sliding
window algorithms when used with next generation sequencing data (Magi
et al., 2014; Narasimhan et al., 2016).
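To make the HMM idea discussed above concrete, the following is a minimal two-state (homozygous stretch vs. non-homozygous) Viterbi sketch in Python. It is illustrative only and does not reproduce the implementation of H3M2, Filtus, bcftools roh, or ROHMM. The function name `viterbi_roh`, the genotyping error rate `err`, the per-base switch rate `switch_rate`, and the use of 2p(1-p) as the heterozygote emission probability are assumptions made for this example. Physical distance stands in for a recombination map.

```python
import math


def viterbi_roh(positions, is_het, alt_freqs, err=0.01, switch_rate=1e-6):
    """Label each site as inside (True) or outside (False) a homozygous stretch.

    positions  : sorted base-pair coordinates of the sites
    is_het     : True where the sample is heterozygous
    alt_freqs  : alternate allele frequency per site (assumed known)
    err        : assumed probability of seeing a het call inside a stretch
    switch_rate: assumed per-base rate controlling state changes
    """
    n = len(positions)
    if n == 0:
        return []

    def log_emissions(i):
        # Outside a stretch, heterozygosity follows Hardy-Weinberg: 2p(1-p).
        het_prob = 2.0 * alt_freqs[i] * (1.0 - alt_freqs[i])
        het_prob = min(max(het_prob, 1e-6), 1.0 - 1e-6)
        if is_het[i]:
            return (math.log(het_prob), math.log(err))
        return (math.log(1.0 - het_prob), math.log(1.0 - err))

    # State 0 = outside a homozygous stretch, state 1 = inside.
    score = list(log_emissions(0))  # uniform prior, constant term dropped
    backpointers = []
    for i in range(1, n):
        distance = positions[i] - positions[i - 1]
        # Switching becomes more likely as the gap between sites grows.
        p_switch = 0.5 * (1.0 - math.exp(-switch_rate * distance))
        log_stay = math.log(1.0 - p_switch)
        log_switch = math.log(p_switch) if p_switch > 0 else float("-inf")
        emit = log_emissions(i)
        new_score, pointers = [], []
        for state in (0, 1):
            stay = score[state] + log_stay
            move = score[1 - state] + log_switch
            if stay >= move:
                new_score.append(stay + emit[state])
                pointers.append(state)
            else:
                new_score.append(move + emit[state])
                pointers.append(1 - state)
        score = new_score
        backpointers.append(pointers)

    # Trace back the most likely state path.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [s == 1 for s in path]


# Toy input: 30 homozygous calls followed by 30 heterozygous calls, 1 kb apart.
toy_positions = [i * 1000 for i in range(60)]
toy_is_het = [False] * 30 + [True] * 30
toy_calls = viterbi_roh(toy_positions, toy_is_het, [0.5] * 60)
```

Under these assumed parameters, an isolated heterozygous call inside a long homozygous run is absorbed as a genotyping error rather than splitting the stretch, which is the behavior that distinguishes this kind of model from a strict sliding window.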
Here we present ROHMM as a flexible HMM implementation for
homozygosity mapping using high throughput sequencing data. ROHMM’s unique approach relies on observed allele distributions
in X chromosome non-pseudoautosomal regions in male and female samples.
Different approaches have been used in other tools, namely H3M2, bcftools roh, and Filtus. ROHMM’s design
falls between the strategies of H3M2 and bcftools roh, with additional user friendliness from its
graphical user interface. H3M2’s design is not suitable for
population scale data, whereas the lack of proper allele frequencies and
recombination maps limits bcftools roh’s functionality when only a
limited number of samples is available. ROHMM, on the other hand, is
free from these limitations and can be used flexibly on
all types of data.
ROHMM’s performance on simulated data showed that it is vastly superior to sliding window algorithms. The false negative rate of
sliding window algorithms, especially under sparse genotyping data,
limits their usability. ROHMM, on the other hand, performs
stably even when data density is lowered further. During our simulated
data tests we observed a direct correlation between FROH and Fmom; however, we note that this correlation may not
be usable as a direct measure of performance on real data, unlike what
was reported by others. Obviously, our simulated data do not contain
the linkage disequilibrium present in actual population data and
treat all sites as independent, but there may be other
reasons. One possibility is that others might have pruned the 1000G data
to an extent that was not reported in detail. Second, imputation within
the 1000G data might have introduced excess heterozygosity into regions
that were not properly genotyped by high throughput sequencing. Both
possibilities need further investigation. Nevertheless, the ratio of
sites within homozygous stretches above 500Kb shows high correlation with
the Fmom calculated from 1000G data. When we compared
exome and genome inferences of homozygous stretches above 500Kb, we
observed levels of correlation similar to those reported by others, further
supporting the stable and robust performance of ROHMM.
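As an illustration of the two quantities discussed above, the sketch below computes a length-based FROH and the fraction of sites that fall inside long homozygous stretches. The 500 kb default mirrors the threshold used in the text. The function names, the half-open `(start, end)` interval convention, and the toy single-chromosome coordinates are assumptions made for this example and are not taken from ROHMM.

```python
def long_segments(segments, min_length_bp=500_000):
    """Keep only homozygous stretches of at least min_length_bp."""
    return [(start, end) for start, end in segments if end - start >= min_length_bp]


def froh(segments, genome_length_bp, min_length_bp=500_000):
    """Fraction of the genome length covered by long homozygous stretches."""
    kept = long_segments(segments, min_length_bp)
    return sum(end - start for start, end in kept) / genome_length_bp


def site_fraction_in_roh(site_positions, segments, min_length_bp=500_000):
    """Fraction of genotyped sites lying inside long homozygous stretches."""
    kept = long_segments(segments, min_length_bp)
    inside = sum(
        1
        for pos in site_positions
        if any(start <= pos < end for start, end in kept)
    )
    return inside / len(site_positions)


# Toy data: three stretches, of which the 200 kb one falls below the threshold.
toy_segments = [(0, 600_000), (1_000_000, 1_200_000), (2_000_000, 3_500_000)]
toy_sites = list(range(0, 4_000_000, 100_000))  # one site every 100 kb

toy_froh = froh(toy_segments, genome_length_bp=10_000_000)
toy_site_fraction = site_fraction_in_roh(toy_sites, toy_segments)
```

The site-based ratio depends on where genotyped sites happen to fall, so it can differ noticeably from the length-based value when site density is uneven, as is typical for exome data.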
Surprisingly, heterozygosity ratio showed a much more pronounced correlation
with ROHMM’s inferences. Previous studies showing this correlation,
albeit with lower R² values, suggest that the
homozygosity inference methods used by those studies are sub-optimal,
hence supporting ROHMM’s precision and accuracy on real
data. Additional support comes from the distribution of long homozygous
stretches inferred by ROHMM’s allele distribution model and
allele frequency model. When compared against each other, homozygous
stretches above 500Kb, and especially above 1.5Mb, show high concordance
between the two models, suggesting that the allele distribution model can be used
for population scale data. Narasimhan and colleagues reported that the
power to detect true autozygosity diminishes with a reduced number of
samples, as the emission probabilities depend on calculated allele
frequencies (Narasimhan et al., 2016). Since the allele distribution model
is not affected by population size or allele frequencies, this may
further indicate that ROHMM’s default model is even more
suitable for population or cohort data of any size. On the clinical data, ROHMM was able to detect homozygosity signals within a single
sample, and enhancements implemented within ROHMM made it possible to
fine-tune the inference even further, especially for shorter segments that are
not evident from the VCF data alone.
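The concordance between two sets of inferred stretches, such as those from the allele distribution model and the allele frequency model, can be quantified in several ways. The sketch below shows one possible measure, a base-pair Jaccard index restricted to stretches above a length threshold. This is an assumed metric chosen for illustration, not necessarily the one used in this study, and the function names and toy coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Sort half-open (start, end) intervals and merge any that overlap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bp(intervals):
    """Total number of base pairs covered by the intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


def overlap_bp(a, b):
    """Number of base pairs covered by both interval sets."""
    a, b = merge_intervals(a), merge_intervals(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        low = max(a[i][0], b[j][0])
        high = min(a[i][1], b[j][1])
        if high > low:
            total += high - low
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard_concordance(a, b, min_length_bp=0):
    """Shared base pairs divided by base pairs covered by either set."""
    a = [(s, e) for s, e in a if e - s >= min_length_bp]
    b = [(s, e) for s, e in b if e - s >= min_length_bp]
    shared = overlap_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - shared
    return shared / union if union else 1.0


# Toy stretches on one chromosome from two hypothetical models.
model_a = [(0, 2_000_000), (5_000_000, 5_600_000)]
model_b = [(500_000, 2_000_000), (5_000_000, 5_400_000)]

concordance_500kb = jaccard_concordance(model_a, model_b, 500_000)
concordance_1500kb = jaccard_concordance(model_a, model_b, 1_500_000)
```

In this toy case the index rises when the threshold is raised from 500 kb to 1.5 Mb, because the short stretches on which the two sets disagree are filtered out.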
We recommend ROHMM to any user for detecting homozygosity with
confidence. We believe that the unique qualities presented here will
make ROHMM a go-to tool for all kinds of homozygosity analyses.