Discussion:
Efforts to detect homozygosity from genotyping data have produced many different tools and algorithms. Sliding window and hidden Markov model (HMM) approaches have been proposed as means of estimating homozygous segments from various data types (Ceballos, Hazelhurst, & Ramsay, 2018; Howrigan, Simonson, & Keller, 2011). Sliding window approaches have been especially useful with dense genotyping arrays, where allele densities are usually uniform and error rates are low compared to sequencing-based methods. GERMLINE and PLINK are two representatives of the early sliding window algorithms, and the latter is still widely used in studies utilizing homozygosity mapping (Gusev et al., 2009; Purcell et al., 2007). However, both tools were designed for dense genotyping arrays, and their performance on the sparse and error-prone data generated by next generation sequencing is questionable. Earlier algorithms using HMM approaches also exist, yet their primary target is high-quality dense genotyping array data and their applicability to next generation sequencing data is limited (Leutenegger et al., 2003; Marioni et al., 2006). Newer HMM approaches such as H3M2, Filtus and bcftools roh mostly target sparse and error-prone next generation sequencing data. H3M2 uses a predefined set of SNPs along with a heterogeneous HMM that incorporates allelic distances, as in BioHMM (Marioni et al., 2006), and Gaussian mixture probabilities of B-allele frequencies to calculate genotypic probabilities under different states. Filtus uses a modified version of Leutenegger’s algorithm to detect autozygosity in next generation sequencing data (Vigeland, Gjøtterud, & Selmer, 2016). bcftools roh, on the other hand, uses allele frequencies as genotypic probabilities and utilizes genome-wide recombination maps to calculate state transitions between consecutive allele positions.
Both approaches have advantages over using sliding window algorithms when used with next generation sequencing data (Magi et al., 2014; Narasimhan et al., 2016).
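To make the general idea behind these HMM-based tools concrete, the following is a minimal sketch of a two-state (homozygous stretch vs. non-homozygous) Viterbi decoder in which the chance of changing state grows with the distance between consecutive sites. It is not the implementation of H3M2, Filtus, bcftools roh or ROHMM. The function name, the fixed emission values (`err`, `het_rate`) and the per-base switch rate (`switch_per_bp`) are illustrative assumptions; real tools derive these quantities from allele frequencies, B-allele frequencies or recombination maps.

```python
import math


def viterbi_roh(genotypes, positions, err=0.01, het_rate=0.3, switch_per_bp=1e-6):
    """Decode a two-state HMM over variant sites (illustrative sketch).

    genotypes: list of 0 (homozygous call) or 1 (heterozygous call).
    positions: base-pair coordinates of the sites, sorted ascending.
    Returns one state per site: 1 = inside a homozygous stretch, 0 = outside.
    All parameter values are hypothetical placeholders.
    """
    # Emission probabilities P(observation | state).
    emit = {
        1: {0: 1.0 - err, 1: err},            # inside: heterozygotes arise only by error
        0: {0: 1.0 - het_rate, 1: het_rate},  # outside: heterozygotes are common
    }
    states = (0, 1)
    score = {s: math.log(0.5) + math.log(emit[s][genotypes[0]]) for s in states}
    back = []
    for i in range(1, len(genotypes)):
        dist = positions[i] - positions[i - 1]
        # Switch probability increases with distance and saturates at 0.5.
        p_switch = 0.5 * (1.0 - math.exp(-2.0 * switch_per_bp * dist))
        p_switch = min(max(p_switch, 1e-12), 0.5)
        new_score, pointers = {}, {}
        for s in states:
            candidates = {
                prev: score[prev]
                + math.log(p_switch if prev != s else 1.0 - p_switch)
                for prev in states
            }
            best = max(candidates, key=candidates.get)
            pointers[s] = best
            new_score[s] = candidates[best] + math.log(emit[s][genotypes[i]])
        score = new_score
        back.append(pointers)
    # Trace back the most likely state path.
    state = max(score, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With sites spaced 10 kb apart, a long run of homozygous calls flanked by heterozygous calls is decoded as a single homozygous stretch, while isolated homozygous calls between heterozygotes are not. A sliding window method, by contrast, depends on a fixed window size and a fixed heterozygote allowance rather than on per-site probabilities.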
Here we present ROHMM as a flexible HMM implementation for homozygosity mapping using high throughput sequencing data. ROHMM’s unique approach relies on observed allele distributions in X chromosome non-pseudoautosomal regions in male and female samples. Other tools, namely H3M2, bcftools roh and Filtus, use different approaches. ROHMM’s design falls between the strategies of H3M2 and bcftools roh, with the additional user friendliness of a graphical user interface. H3M2’s design is not suitable for population scale data, whereas the lack of proper allele frequencies and recombination maps limits the functionality of bcftools roh when the number of samples is limited. ROHMM, on the other hand, is free from these limitations and can be applied flexibly regardless of sample size.
ROHMM’s performance on simulated data showed that ROHMM is vastly superior to sliding window algorithms. The false negative rate of sliding window algorithms, especially on sparse genotyping data, limits their usability. ROHMM, on the other hand, performs stably even when data density is lowered further. During our simulated data tests we observed a direct correlation between FROH and Fmom; however, we noted that, unlike what was reported by others, this correlation may not be usable as a direct measure of performance on real data. Obviously, our simulated data do not contain the linkage disequilibrium present in actual population data and assume that all sites are independent, but there may be other reasons. One possibility is that others might have pruned the 1000G data to an extent that was not reported in detail. Secondly, imputation within the 1000G data might have introduced excessive heterozygosity into regions that were not properly genotyped by high throughput sequencing. Both possibilities need further investigation. Nevertheless, the ratio of sites within homozygous stretches above 500 kb shows high correlation with the Fmom calculated from 1000G data. When we compared exome and genome inferences of homozygous stretches above 500 kb, we observed levels of correlation similar to those reported by others, further supporting the stable and robust performance of ROHMM. Surprisingly, the heterozygosity ratio showed a much more pronounced correlation with ROHMM’s inferences. Previous studies reported this correlation with lower R2 values, which suggests that the homozygosity inference methods used in those studies are sub-optimal and supports ROHMM’s precision and accuracy on real data. Additional support comes from the distribution of long homozygous stretches inferred by ROHMM’s allele distribution model and allele frequency model.
When the two models are compared against each other, homozygous stretches above 500 kb, and especially those above 1.5 Mb, show high concordance, suggesting that the allele distribution model can be used for population scale data. Narasimhan and colleagues reported that the power to detect true autozygosity diminishes as the number of samples decreases, because the emission probabilities depend on calculated allele frequencies (Narasimhan et al., 2016). Since the allele distribution model is not affected by population size or allele frequencies, this may further indicate that ROHMM’s default model is suitable for population or cohort data of any size. On the clinical data, ROHMM was able to detect homozygosity signals within a single sample, and the enhancements implemented within ROHMM made it possible to fine-tune the inference even further, especially for shorter segments that are not evident from the VCF data alone.
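The summary statistic used above, the ratio of sites within homozygous stretches above 500 kb, can be illustrated with a short sketch. This is one possible reading of that ratio, not ROHMM's own code: the function name, the inclusive `(start, end)` segment format and the single-chromosome input are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def fraction_of_sites_in_long_roh(site_positions, segments, min_length=500_000):
    """Fraction of sites that fall inside homozygous stretches of at least min_length bp.

    site_positions: sorted base-pair positions of all sites on one chromosome.
    segments: non-overlapping (start, end) stretches with inclusive coordinates.
    Both input formats are hypothetical and chosen for illustration only.
    """
    if not site_positions:
        return 0.0
    covered = 0
    for start, end in segments:
        if end - start + 1 >= min_length:
            # Count the sites with start <= position <= end.
            covered += bisect_right(site_positions, end) - bisect_left(site_positions, start)
    return covered / len(site_positions)
```

Raising `min_length` to 1_500_000 gives the corresponding ratio for the longer stretches on which the two models agree most closely.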
We recommend ROHMM to any user for detecting homozygosity with confidence. We believe that the unique qualities presented here will make ROHMM a go-to tool for all kinds of homozygosity analyses.