Known SNP data and simulation of synthetic truth sets:
Phase 3 integrated SNP data from the 1000 Genomes Project (1000G) were used to generate the data for both the simulations and the real data analysis (Auton et al., 2015). Indels and complex multiallelic variants were removed because of their higher error rates and sequence complexity, as in previous work (Magi et al., 2014; Narasimhan et al., 2016). To test the performance of our algorithm, we generated true homozygous stretches from the allele frequencies of the 99 CEU individuals in 1000G within a Viterbi scheme. To obtain a variety of homozygous stretches, the proportion of homozygous sites was fixed at a discrete value between 0.02 and 0.12, representing the total autozygosity of the sample, and the transition probabilities were adjusted 10-fold at each simulation step, between 1/100000 and 1/2500000. The generated synthetic calls were merged into individual VCF files for the synthetic benchmarks. To get more out of the synthetic benchmarks, we also introduced noise in the form of extraneous heterozygous or homozygous sites at random positions in all VCF files: 5 to 10 percent of homozygous reference sites were converted to heterozygous sites, and vice versa. The resulting VCF files contained up to 10 percent more heterozygous or more homozygous sites than the originals.
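The simulation and noise steps described above can be illustrated with a minimal sketch. It is not the code used in this study: the function names `simulate_genotypes` and `add_noise` are hypothetical, as are the two-state Markov chain, the per-site (rather than per-base) switching probability, the Hardy-Weinberg sampling outside homozygous stretches, and the toy allele frequencies. Real input would come from the 1000G CEU allele frequencies.

```python
import random


def simulate_genotypes(alt_freqs, autozygosity, p_switch, rng):
    """Sample genotypes (ALT allele counts 0/1/2) along a chromosome.

    Assumption: a two-state Markov chain (inside / outside a homozygous
    stretch) whose stationary fraction of "inside" sites equals
    `autozygosity`, and whose per-site probability of leaving a stretch
    is `p_switch`.
    """
    p_leave = p_switch
    # Chosen so that p_enter / (p_enter + p_leave) == autozygosity.
    p_enter = p_switch * autozygosity / (1.0 - autozygosity)

    in_stretch = rng.random() < autozygosity
    genotypes, states = [], []
    for q in alt_freqs:
        # Possibly switch state before emitting this site.
        if in_stretch:
            if rng.random() < p_leave:
                in_stretch = False
        elif rng.random() < p_enter:
            in_stretch = True

        if in_stretch:
            # Both haplotypes are identical by descent: one allele draw.
            g = 2 if rng.random() < q else 0
        else:
            # Two independent allele draws (Hardy-Weinberg proportions).
            g = int(rng.random() < q) + int(rng.random() < q)
        genotypes.append(g)
        states.append(in_stretch)
    return genotypes, states


def add_noise(genotypes, fraction, rng):
    """Convert `fraction` of hom-ref sites to het, and of het sites to hom-ref."""
    noisy = list(genotypes)
    hom_ref = [i for i, g in enumerate(genotypes) if g == 0]
    het = [i for i, g in enumerate(genotypes) if g == 1]
    for i in rng.sample(hom_ref, int(fraction * len(hom_ref))):
        noisy[i] = 1
    for i in rng.sample(het, int(fraction * len(het))):
        noisy[i] = 0
    return noisy


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy allele frequencies standing in for the CEU frequencies.
    freqs = [rng.uniform(0.05, 0.95) for _ in range(10_000)]
    truth, states = simulate_genotypes(freqs, autozygosity=0.06,
                                       p_switch=0.005, rng=rng)
    noisy = add_noise(truth, fraction=0.05, rng=rng)
    print("sites inside homozygous stretches:", sum(states))
    print("sites altered by noise:", sum(a != b for a, b in zip(truth, noisy)))
```

In this sketch the `states` list plays the role of the truth set, while the noisy genotypes correspond to what would be written to the benchmark VCF files. The switching probability is set much higher than the values quoted in the text only so that a short toy chromosome contains several stretches.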