Performance on Real Data Sets:
To test the performance of ROHMM on real datasets we used the 1000
Genomes Project (1000G) integrated phase 3 data. Homozygous regions of
different classes were inferred from this data, and we compared the
distribution of each class and overall homozygosity among the different
continents and sub-populations listed here. An initial comparison of
ROHMM's performance on exome-scale and genome-scale data indicated that
ROHMM detects homozygosity at a comparable level from both types of
data, as was the case when we tested down-sampled data from our
synthetic benchmarks (R ~ 0.96, p<2.2e-16) (Figure 4A).
To check that the inferred homozygous regions reflected real, or close
to real, homozygosity among individuals, we compared the
inbreeding coefficient calculated by the method-of-moments estimator
(Fmom) against FROH, which was defined as
the ratio of sites within homozygous stretches to all sites present in
each individual. This comparison has been performed by many others
studying the effects of inbreeding on populations as well as small
pedigrees (Keller et al., 2011; Narasimhan et al., 2016; Rosenberg,
Pemberton, Li, & Belmont, 2013). We noticed that when
Fmom was compared against the FROH calculated from the total
homozygosity detected by ROHMM, we
obtained a low level of correlation, even when we used the allele
frequency model and bcftools roh itself. When FROH was
calculated using only homozygous regions longer than 0.5 kilobases, the
correlation between Fmom and FROH was
more pronounced, especially for superpopulations with higher
consanguinity (R ~ 0.9, p<2.2e-16) (Figure
4B-4C). This result is consistent with what others have published
(Keller et al., 2011) but contradicts what was reported
by Narasimhan et al. (2016). This discrepancy
between reports may need further investigation.
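The two quantities compared above can be illustrated with a minimal
Python sketch. It is not ROHMM's implementation: the function names, the
(start, end, n_sites) segment tuples, and the choice of the common
observed-versus-expected homozygosity form of the method-of-moments
estimator, F = (O - E) / (N - E), are assumptions made for illustration.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical segment record: (start_bp, end_bp, n_sites_in_segment).
Segment = Tuple[int, int, int]


def f_roh(segments: Iterable[Segment], n_sites_total: int,
          min_length_bp: int = 0) -> float:
    """Ratio of sites inside homozygous stretches to all sites of one
    individual, counting only stretches longer than min_length_bp."""
    if n_sites_total <= 0:
        raise ValueError("n_sites_total must be positive")
    sites_in_roh = sum(n for start, end, n in segments
                       if end - start > min_length_bp)
    return sites_in_roh / n_sites_total


def f_mom(alt_counts: Sequence[int], alt_freqs: Sequence[float]) -> float:
    """Method-of-moments inbreeding coefficient, assumed form:
    (observed hom - expected hom) / (n sites - expected hom).

    alt_counts: alternate-allele count per site (0, 1 or 2).
    alt_freqs:  population alternate-allele frequency per site.
    """
    if len(alt_counts) != len(alt_freqs):
        raise ValueError("alt_counts and alt_freqs must have equal length")
    n_sites = len(alt_counts)
    observed_hom = sum(1 for g in alt_counts if g != 1)
    # Expected homozygosity under Hardy-Weinberg: 1 - 2pq per site.
    expected_hom = sum(1.0 - 2.0 * p * (1.0 - p) for p in alt_freqs)
    if n_sites == expected_hom:
        raise ValueError("estimator undefined: all sites are monomorphic")
    return (observed_hom - expected_hom) / (n_sites - expected_hom)


# Toy data, not taken from the 1000G analysis.
toy_segments = [(0, 400, 4), (1000, 3000, 20)]
froh_all = f_roh(toy_segments, n_sites_total=100)
froh_long = f_roh(toy_segments, n_sites_total=100, min_length_bp=500)
fmom_toy = f_mom([0, 2, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

With per-individual values of both quantities in hand, the correlation
reported above would be computed across individuals; the length
threshold is exposed as a parameter because the comparison depends on
which segments are counted.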
As we sought other measures for testing performance on real
data, a direct comparison against the heterozygosity measure turned out
to perform better. The heterozygosity measure is defined as the ratio of
heterozygous sites to all homozygous non-reference sites present
per individual, as defined by Wang and others (Samuels et al., 2016;
Wang, Raskin, Samuels, Shyr, & Guo, 2015). This measure has been tested
for its usefulness when comparing populations and individuals for
disease resistance and recessive phenotype associations. According to
those reports, the heterozygosity ratio is more robust than the
homozygosity ratio, which was reported to be density dependent. Our
measurements and those of others have also confirmed that when the
number of sites is reduced, the power to detect true homozygosity is
diminished (Figure 3A-3B). We decided to compare our results against the
heterozygosity
measure and, surprisingly, ROHMM's Allele Distribution Model showed
a significant correlation between the heterozygosity measure of
individual populations and FROH inferred from the total sites within
inferred homozygous segments. Previous reports from Samuels and
colleagues indicated an inverse correlation, albeit with a lower
R² value. ROHMM's inferences showed a much stronger
correlation between FROH and the heterozygosity measure (R
< -0.9, p<2.2e-16). For South Asian populations,
where consanguinity is much higher, this correlation coefficient is
almost the same even when FROH is calculated from much
less dense exome data (R < -0.9, p<2.2e-16) (Figure
5A-5B). When overall homozygosity ratios are compared between
populations and sub-populations, ROHMM's inferences also show
the differences between subpopulations indicated by other reports
(Figure 5C-5D). Additionally, the heterozygosity measure graphs by Wang
et al. and the homozygosity ratios calculated from ROHMM are almost
perfect mirror images of each other (Figure 5C-5D compared to Figures
2A-2E from Wang et al., 2015).
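As an illustration of the heterozygosity measure used in this
comparison, the following minimal Python sketch counts heterozygous and
homozygous non-reference sites for one individual. The function name and
the use of VCF-style genotype strings such as "0/1" or "1|1" as input
are assumptions made for the example, not a description of how the cited
authors or ROHMM compute the ratio.

```python
from typing import Iterable


def heterozygosity_ratio(genotypes: Iterable[str]) -> float:
    """Heterozygous sites divided by homozygous non-reference sites
    for one individual, given VCF-style diploid genotype strings."""
    n_het = 0
    n_hom_alt = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        # Skip missing or non-diploid calls.
        if len(alleles) != 2 or "." in alleles:
            continue
        first, second = alleles
        if first != second:
            n_het += 1          # e.g. 0/1, 1/2
        elif first != "0":
            n_hom_alt += 1      # e.g. 1/1, 2/2
        # Homozygous reference (0/0) contributes to neither count.
    if n_hom_alt == 0:
        raise ValueError("no homozygous non-reference sites observed")
    return n_het / n_hom_alt


# Toy genotypes, not taken from the 1000G data.
toy_ratio = heterozygosity_ratio(["0/1", "1|1", "0|0", "./.", "1/0", "1/1"])
```

Because homozygous reference sites enter neither count, the ratio uses
only variant sites, which is one way to see why it would be expected to
move in the opposite direction to FROH.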
As a final comparison, we investigated the distribution of homozygous
segments captured by ROHMM's allele distribution model and allele
frequency model. When we compared homozygous stretches longer than
0.5 kb and 1.5 Mb, we noticed that the distributions of sites resemble
each other regardless of the model used by ROHMM (Figure 6). This result
further supports the idea that the allele distribution model is as
useful as the allele frequency model when used with population-scale
data.