Performance on Real Data Sets:
To test the performance of ROHMM on real datasets, we used the 1000G integrated phase 3 data. Homozygous regions of different classes were inferred from these data, and we compared the distribution of each class and the overall homozygosity among the continents and sub-populations listed here. An initial comparison of ROHMM's performance on exome-scale and genome-scale data indicated that ROHMM detects homozygosity at a comparable level from both types of data, as was the case when we tested down-sampled data from our synthetic benchmarks (R ~ 0.96, p<2.2e-16) (Figure 4A).
To verify that the inferred homozygous regions reflect real, or close to real, homozygosity among individuals, we compared the inbreeding coefficient calculated by the method-of-moments estimator (Fmom) against FROH, defined as the ratio of sites within homozygous stretches to all sites present in each individual. This comparison has been performed by many others in studies of the effects of inbreeding on populations as well as small pedigrees (Keller et al., 2011; Narasimhan et al., 2016; Rosenberg, Pemberton, Li, & Belmont, 2013). When Fmom was compared against FROH calculated from the total homozygosity detected by ROHMM, we obtained a low level of correlation, even when we used the allele frequency model and bcftools roh itself. When FROH was calculated using only homozygous regions longer than 0.5 kilobases, the correlation between Fmom and FROH was more pronounced, especially for superpopulations with higher consanguinity (R ~ 0.9, p<2.2e-16) (Figure 4B-4C). This result parallels what others have published before (Keller et al., 2011) but contradicts what was reported by Narasimhan et al. (2016). This discrepancy between reports may need further investigation.
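The two quantities compared above can be illustrated with a minimal sketch. The function names, the input layout (sorted site coordinates and inclusive segment intervals on a single chromosome), and the specific method-of-moments form F = (O - E) / (N - E) are assumptions made for illustration; they are not taken from the ROHMM implementation, and the exact estimator used in the analysis may differ.

```python
from bisect import bisect_left, bisect_right


def froh(site_positions, segments, min_length_bp=0):
    """Fraction of an individual's sites that lie inside homozygous segments.

    site_positions: sorted site coordinates on one chromosome (assumed layout).
    segments: non-overlapping (start, end) intervals, both ends inclusive.
    min_length_bp: segments shorter than this many base pairs are ignored.
    """
    if not site_positions:
        return 0.0
    sites_in_roh = 0
    for start, end in segments:
        if end - start + 1 < min_length_bp:
            continue  # skip segments below the length threshold
        # Count sites with start <= position <= end by binary search.
        sites_in_roh += bisect_right(site_positions, end) - bisect_left(site_positions, start)
    return sites_in_roh / len(site_positions)


def f_mom(genotypes, alt_freqs):
    """Method-of-moments inbreeding estimate, F = (O - E) / (N - E).

    genotypes: alternate-allele counts per site (0, 1 or 2).
    alt_freqs: population alternate-allele frequency at the same sites.
    O is the observed number of homozygous sites and E the number expected
    under Hardy-Weinberg equilibrium. This form is an assumption here.
    """
    n_sites = len(genotypes)
    observed_hom = sum(1 for g in genotypes if g != 1)
    expected_hom = sum(1.0 - 2.0 * p * (1.0 - p) for p in alt_freqs)
    if n_sites == expected_hom:
        raise ValueError("all sites are monomorphic; F is undefined")
    return (observed_hom - expected_hom) / (n_sites - expected_hom)
```

Passing a non-zero `min_length_bp` corresponds to restricting FROH to segments above a length threshold, as was done for the second comparison described above.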
As we sought different measures for testing performance on real data, a direct comparison against the heterozygosity measure turned out to perform better. The heterozygosity measure is defined as the ratio of heterozygous sites to all homozygous non-reference sites present per individual, as defined by Wang and others (Samuels et al., 2016; Wang, Raskin, Samuels, Shyr, & Guo, 2015). This measure has been tested for its usefulness when comparing populations and individuals for disease resistance and recessive phenotype associations. According to those reports, the heterozygosity ratio is more robust than the homozygosity ratio, which was reported to be density dependent. Our measurements and those of others have also confirmed that when the number of sites is reduced, the power to detect true homozygosity is diminished (Figure 3A-3B). We therefore compared our results against the heterozygosity measure and, surprisingly, ROHMM's Allele Distribution Model showed a significant correlation between the heterozygosity measure of individual populations and FROH inferred from total sites within inferred homozygous segments. Previous reports from Samuels and colleagues indicated an inverse correlation, albeit with a lower R² value. ROHMM's inferences showed a much stronger correlation between FROH and the heterozygosity measure (R < -0.9, p<2.2e-16). For South Asian populations, where consanguinity is much higher, this correlation coefficient is almost the same even when FROH is calculated from much less dense exome data (R < -0.9, p<2.2e-16) (Figure 5A-5B). When overall homozygosity ratios are compared between populations and sub-populations, ROHMM's inferences also show the differences between subpopulations indicated by other reports (Figure 5C-5D). Additionally, the heterozygosity measure graphs by Wang et al. and the homozygosity ratios calculated from ROHMM are almost perfect mirror images of each other (compare Figure 5C-5D with Figures 2A-2E of Wang et al., 2015).
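As an illustration of the heterozygosity measure and of the inverse correlation discussed above, the following sketch computes the ratio for one individual and a Pearson correlation across individuals. The function names and the genotype encoding (alternate-allele counts 0, 1, 2) are hypothetical choices for this example, and the significance testing reported in the text is not reproduced here.

```python
from math import sqrt


def heterozygosity_ratio(genotypes):
    """Ratio of heterozygous sites to homozygous non-reference sites.

    genotypes: alternate-allele counts per site (0 = hom-ref, 1 = het,
    2 = hom-alt); this encoding is an assumption of the sketch.
    """
    het = sum(1 for g in genotypes if g == 1)
    hom_alt = sum(1 for g in genotypes if g == 2)
    if hom_alt == 0:
        raise ValueError("no homozygous non-reference sites; ratio undefined")
    return het / hom_alt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Given per-individual lists of FROH values and heterozygosity ratios, `pearson_r(froh_values, het_ratios)` would be expected to be negative if the two measures are inversely related, as reported above.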
As a final comparison, we investigated the distribution of homozygous segments captured by the ROHMM allele distribution model and allele frequency model. When we compared homozygous stretches longer than 0.5 Kb and 1.5 Mb, we noticed that the distributions of sites resemble each other regardless of the model used by ROHMM (Figure 6). This result further supports the idea that the allele distribution model is as useful as the allele frequency model when used with population-scale data.