Known SNP data and simulation of synthetic truth sets:
1000 Genomes Project (1000G) phase 3 integrated SNP data were used to
generate the data sets for both the simulations and the real data
analysis (Auton et al., 2015). INDELs and complex multiallelic sites
were removed because of their higher error rates and sequence
complexity, as in previous studies
(Magi et al., 2014; Narasimhan et al., 2016). To test the performance of
our algorithm, we generated true homozygous stretches using allele
frequency data from the 1000G CEU population (99 individuals) inside a
Viterbi scheme. To generate a variety of homozygous stretches, the
proportion of homozygous sites was limited to a discrete value
indicating the total autozygosity of the sample [0.02 – 0.12], and
the transition probabilities were adjusted 10-fold at each
simulation step between 1/100000 and 1/2500000. Generated synthetic
calls were merged into individual VCF files for synthetic benchmarks. To
get the most out of the synthetic benchmarks, we also introduced noise
in the form of extraneous heterozygous or homozygous sites at random
positions in all VCF files: 5 to 10 percent of homozygous reference
allele sites were converted to heterozygous sites and vice versa. The
resulting VCF files included up to 10 percent more heterozygous or
homozygous sites than in their original state.
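
The simulation and noise steps described above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the Viterbi scheme is replaced here by a plain forward simulation from a two-state Markov chain (non-autozygous and autozygous). The function names, the per-site transition parameters `p_enter` and `p_exit`, and the reading of "vice versa" as converting the same fraction of heterozygous sites back to homozygous reference are all assumptions made for illustration.

```python
import random


def simulate_genotypes(alt_freqs, p_enter, p_exit, rng):
    """Simulate one diploid genotype (0, 1 or 2 alternate alleles) per site.

    Hypothetical two-state chain: inside an autozygous stretch both alleles
    are copies of one draw, so the site is always homozygous; outside it,
    the two alleles are drawn independently from the allele frequency.
    """
    genotypes, states = [], []
    autozygous = False
    for q in alt_freqs:
        # Possibly switch state before emitting the genotype at this site.
        if autozygous:
            if rng.random() < p_exit:
                autozygous = False
        elif rng.random() < p_enter:
            autozygous = True

        if autozygous:
            gt = 2 if rng.random() < q else 0
        else:
            gt = int(rng.random() < q) + int(rng.random() < q)
        genotypes.append(gt)
        states.append(autozygous)
    return genotypes, states


def add_noise(genotypes, rate, rng):
    """Flip a fraction `rate` of hom-ref sites to het, and of het sites to hom-ref.

    Assumption: the same rate is applied in both directions, and
    homozygous-alternate sites are left untouched.
    """
    noisy = list(genotypes)
    hom_ref = [i for i, g in enumerate(genotypes) if g == 0]
    het = [i for i, g in enumerate(genotypes) if g == 1]
    for i in rng.sample(hom_ref, int(rate * len(hom_ref))):
        noisy[i] = 1
    for i in rng.sample(het, int(rate * len(het))):
        noisy[i] = 0
    return noisy


# Illustrative values only; they are not the parameters used in the study.
rng = random.Random(42)
freqs = [rng.uniform(0.05, 0.5) for _ in range(10000)]
truth, in_stretch = simulate_genotypes(freqs, p_enter=1e-3, p_exit=1e-2, rng=rng)
noisy = add_noise(truth, rate=0.05, rng=rng)
```

In this sketch `in_stretch` plays the role of the synthetic truth set, while `noisy` stands in for the genotype calls that would be written to a benchmark VCF file.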