
Population genetics using low coverage RADseq data in non-model organisms: biases and solutions
  • Stefano Mona,
  • Andrea Benazzo,
  • Erwan Delrieu-Trottin,
  • Pierre Lesturgie
Stefano Mona

Corresponding Author:[email protected]

Andrea Benazzo
University of Ferrara
Erwan Delrieu-Trottin
CEFE, Univ Montpellier, CNRS, EPHE-PSL University, IRD
Pierre Lesturgie


Restriction site-associated DNA sequencing (RADseq) allows the genotyping of thousands of single nucleotide polymorphisms (SNPs) in many individuals at a reduced cost. However, achieving the desired sequencing depth is challenging in non-model organisms, where the expected number of RADseq loci is unknown. The impact of low coverage sequencing in RADseq experiments on estimated population genetic parameters has not yet been fully characterized. Here we performed an in silico RADseq experiment by extracting loci from whole genome sequences of diploid individuals simulated under various demographic scenarios. We generated fastq files from the extracted loci and evaluated the performance of three bioinformatics pipelines for discovering genetic variants, namely STACKS v.1, STACKS v.2 and ANGSD. We specifically focused on how accurately each pipeline produced datasets that recover the genetic variability and the historical demography of the simulated populations at several average depths of coverage. For low coverage datasets (<15x), STACKS v.1 and, to a lesser extent, STACKS v.2 were highly sensitive to assembly parameters, showing for all scenarios: i) a deficit in genetic diversity; ii) a site frequency spectrum (SFS) skewed toward low frequency variants. This led to a pronounced bias in the inferred demographic history, particularly for larger sample sizes, a parameter typically associated with greater confidence in the inferences. Conversely, ANGSD correctly retrieved the genetic variability for most of the simulated scenarios and assembly parameters. We confirmed our findings, based on simulated data, in an empirical RADseq dataset and provide practical guidelines for performing robust demographic inferences in low coverage experiments.
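
For readers unfamiliar with the site frequency spectrum mentioned above, the following minimal sketch shows how an unfolded SFS can be tabulated from called diploid genotypes. It is an illustration only and not the procedure used by STACKS or ANGSD: the function name `site_frequency_spectrum`, the input layout (one list per site holding each individual's derived-allele count, 0, 1 or 2) and the toy data are all assumptions made for this example.

```python
from collections import Counter


def site_frequency_spectrum(genotypes):
    """Unfolded SFS from called diploid genotypes.

    `genotypes` is a list of sites; each site is a list with one entry per
    individual giving its derived-allele count (0, 1 or 2). This layout is
    an assumption of the sketch, not a format produced by any pipeline.
    """
    if not genotypes:
        return []
    n_chromosomes = 2 * len(genotypes[0])
    # Count how many sites carry the derived allele k times in the sample.
    counts = Counter(sum(site) for site in genotypes)
    # Entry k-1 holds the number of sites with k derived copies,
    # for k = 1 .. n_chromosomes - 1 (monomorphic sites are excluded).
    return [counts.get(k, 0) for k in range(1, n_chromosomes)]


# Hypothetical toy data: 5 sites genotyped in 3 diploid individuals.
toy_genotypes = [
    [0, 1, 0],  # 1 derived copy (singleton)
    [1, 1, 0],  # 2 derived copies
    [0, 0, 1],  # 1 derived copy (singleton)
    [2, 1, 0],  # 3 derived copies
    [0, 0, 0],  # monomorphic, not counted
]

sfs = site_frequency_spectrum(toy_genotypes)
print(sfs)
```

In this representation, an SFS "skewed toward low frequency variants" has an excess of sites in its first entries (singletons, doubletons) relative to the true spectrum.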