Estimates of heterozygosity from single nucleotide polymorphism markers are context dependent and often wrong
  • Jarrod Sopniewski,
  • Renee Catullo
Jarrod Sopniewski
The University of Western Australia

Corresponding Author:[email protected]

Renee Catullo
The University of Western Australia


Heterozygosity is frequently used to describe variation in genetic diversity amongst populations and is often estimated using single nucleotide polymorphisms (SNPs). However, methods of calculating heterozygosity from SNPs have been shown to be affected by study design and filtering parameters, reducing their utility and comparability across studies. Though solutions have been proposed to account for the identified problems, we continued to see inconsistent results in our own data. Here, we aimed to further reduce inconsistency in these results, specifically by investigating how sample size and missing data thresholds influence autosomal estimates of heterozygosity (heterozygosity calculated from across the genome, i.e., from both fixed and variable sites). We also investigated how the exclusion of tri- and tetra-allelic sites, which is generally standard practice in such studies, could affect eventual estimates of heterozygosity. Across three distinct taxa (a frog, Litoria rubella; a tree, Eucalyptus microcarpa; and a grasshopper, Keyacris scurra), we found autosomal heterozygosity estimates to be affected by sample size when missing data is not allowed, and show that this is partly due to the exclusion of tri- and tetra-allelic loci. We also show that the biases introduced by these factors are not consistent between species, or even populations, with higher levels of actual heterozygosity tending to result in larger adverse effects. We propose a modified framework for calculating heterozygosity to reduce these inherent issues and highlight the need for further development in methods such that tri- and tetra-allelic sites can be included in the calculation of population genomics statistics.
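To make the quantity discussed above concrete, the following is a minimal Python sketch of autosomal heterozygosity for a single individual, taken here as the proportion of called sites (fixed and variable alike) at which the two alleles differ. It is an illustration only and not the authors' pipeline or proposed framework: the function names, the genotype representation (a pair of allele characters, or `None` for a missing call), and the toy data are all assumptions made for this example. The second function mimics the common practice of discarding sites with more than two alleles, so the effect of that filter on the estimate can be seen directly.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical representation: one genotype per site, None marks a missing call.
Genotype = Optional[Tuple[str, str]]


def autosomal_heterozygosity(genotypes: Sequence[Genotype]) -> float:
    """Proportion of called sites (fixed and variable) that are heterozygous."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called sites for this individual")
    n_het = sum(1 for a, b in called if a != b)
    return n_het / len(called)


def drop_multiallelic(samples: Sequence[Sequence[Genotype]]) -> List[List[Genotype]]:
    """Remove every site at which more than two alleles are seen across samples."""
    n_sites = len(samples[0])
    keep = []
    for i in range(n_sites):
        alleles = set()
        for sample in samples:
            if sample[i] is not None:
                alleles.update(sample[i])
        if len(alleles) <= 2:
            keep.append(i)
    return [[sample[i] for i in keep] for sample in samples]


# Toy data (invented): five sites, the last of which is tri-allelic (C, G, T).
s1 = [("A", "A"), ("A", "C"), ("G", "G"), ("T", "T"), ("C", "G")]
s2 = [("A", "A"), ("A", "A"), ("G", "G"), ("T", "T"), ("C", "T")]

before = [autosomal_heterozygosity(s) for s in (s1, s2)]
after = [autosomal_heterozygosity(s) for s in drop_multiallelic([s1, s2])]
print(before, after)
```

In this toy case, removing the tri-allelic site lowers both individuals' estimates, because the discarded site is heterozygous in both. Which sites count as multi-allelic also depends on which samples are included, which is one way sample composition can feed into the estimate.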
Submitted to Molecular Ecology Resources
24 Jan 2024 Editorial Decision: Revise Minor
21 Feb 2024 Editorial Decision: Accept