INTRODUCTION
Population genetic datasets are a rich source of information for
wildlife managers (Hoffmann et al. 2015; Hohenlohe et al. 2021). They
provide data on genetic structure, adaptation and evolutionary
trajectories of species and populations (e.g., local adaptation,
hybridization, population dynamics, evolutionary potential; Willi et al.
2021). They can reveal biological and ecological processes that could
not otherwise be studied (e.g., mating systems, sex-specific dispersal
and gene flow; Ellegren 2014; Amos et al. 2014). In addition, they help
to identify genetic problems in small populations—notably loss of
genetic diversity, inbreeding, inbreeding
depression—and develop simple
and cost-effective management solutions towards their conservation
(e.g., genetic augmentation, genetic rescue; Frankham et al. 2017;
Harrisson et al. 2019; Kardos 2021).
With the massive amount of genomic data that can be generated, the level
of expertise in bioinformatics required for analysing genomic datasets
has increased (MacMahon et al. 2014; Holderegger et al. 2019; Hohenlohe
et al. 2020). Conservation geneticists spend a great part of their time
learning the use of new software, which reduces their availability to
engage in other important activities needed to bridge the gap between
research and conservation practice (e.g., facilitating communication
with wildlife managers, building relationships with primary industry,
informing and shaping policy; Galla et al. 2016; Taylor et al. 2017;
Britt et al. 2018). Accordingly, there is much interest in creating
easy-to-use resources to automate and streamline dataset filtering and
genomic analyses. This has included the development of packages forR, which tends to be a more welcoming environment for biologists
than does command-line software (e.g., dartR : Gruber et al. 2018,
Mijangos et al. 2022; SambaR : de Jong et al. 2021; snpR :
Hemstrom & Jones 2022; SNPfiltR: DeRaad 2022; Hogg et al. 2022).
Most population genetic analyses assume autosomal loci; thus
best-practice filtering includes removal of sex-linked loci from SNP
datasets. If sex-linked loci are not removed, estimates of population
genetic diversity such as heterozygosity, Wright’s fixation indices
including F IS, polymorphism, and allelic richness
may be biased depending on the sex ratio of the sample and the
sex-chromosome-to-autosome diversity ratio (Ellegren 2009; Allendorf et
al. 2022; Frankham et al. 2017). Assessment of population genetic
structure also benefits from the removal of sex-linked loci because they
can mask genetic structure that is due to evolutionary processes (e.g.,
gene flow, natural selection, genetic drift; Pritchard et el. 2000;
Radosavljević et al. 2015; Benestan et al. 2016). Similarly, parentage
analyses assume autosomal Mendelian inheritance and so their accuracy
can be affected by the presence of sex-linked loci because they create
apparent genetic mismatches between true parent-offspring pairs (Jones
& Wang 2010). On the other hand, focusing on sex-linked markers can
help assign sex to individuals of sexually-monomorphic species, as well
as reveal interesting patterns of sex-specific ecology and evolution
(e.g., natural selection, philopatry; Castella et al. 2001; Pavlova et
al. 2013; Arnold & Wilkinson 2015). Thus, correct identification of
sex-linked loci is important for making appropriate management
recommendations.
In animal species, the two most common chromosomal sex-determination
systems are XY and ZW. In an XY system, typical for mammals and some
insects, males are the heterogametic sex with one X and one Y
chromosome, and females are the homogametic sex with two X chromosomes.
In contrast, in the ZW system, typical for birds, and some reptiles and
insects, females are heterogametic (ZW) and males homogametic (ZZ)
(Beaukeboom & Perrin 2014). The SNP markers on sex chromosomes can be
classified into three types with different inheritance and
characteristics (Figure 1):
- Those present only on the W or Y chromosome (hereafter
‘W-linked/Y-linked’; Figure 1 in yellow). In SNP datasets, such
markers are called only in the heterogametic sex and are missing in
the homogametic sex.
- Those present only on the Z or X chromosome (hereafter
‘Z-linked/X-linked’; Figure 1 in orange). In SNP datasets, the
heterogametic sex possesses only one allele (i.e., they arehemizygous ) and individuals appear homozygous when genotyped.
The homogametic sex, which possesses two alleles, can be heterozygous
or homozygous as for an autosomal locus.
- Those present in homologous regions of both sex chromosomes, Zand W or X and Y, and similar enough to be considered
alleles of the same locus (hereafter ‘gametologs’, Figure 1 in green).
In some cases, gametologous loci have one allele that is found
exclusively on one sex chromosome while the other allele appears
exclusively on the other. As a result, all members of the
heterogametic sex appear heterozygous, and the homogametic sex
homozygous. These loci are known as ‘fixed’ gametologs and are typical
of old sex chromosomes. In other cases (i.e., in recently evolved
[neo-] sex chromosomes), the ‘Z-allele’ (or ‘X-allele’) is still
found on some versions of the W (or Y) chromosome, and thus, some
individuals of the heterogametic sex are homozygous. In these cases,
the gametologs are ‘non-fixed’.
The simplest way to distinguish sex-linked loci from autosomal ones is
to identify those found in reads that mapped to the sex chromosomes of
the reference genome. However, this is not possible when (i) a reference
genome is not available–as is the case for most wildlife species–andde novo genotyping is required, (ii) there is little conserved
synteny between the studied genome and the reference, or (iii) the W/Y
chromosome of the reference genome is fragmented into numerous unmapped
scaffolds, as is common in many genome projects (Carvalho & Clark
2013).
Some methods to identify sex-linked SNPs have been developed.
MendelChecker, for example, uses the deviation from Mendelian
inheritance to calculate the probability that a specific SNP is
sex-linked, with the disadvantage that it requires genotype
probabilities and pedigree information for analysis (Chen et al. 2014).
Other methods use a set of individuals of known sex to test whether the
allele frequencies of a given locus differ between the sexes. For
instance, RADSEX is a command-line software that uses identical raw
reads as non-polymorphic markers and uses their presence or absence in
males and females to identify those significantly associated with sex
(Feron et al. 2021). Some other studies have identified sex-linked
markers by testing for differentiation between the sexes usingF ST, but this approach can be used only for
Z-linked/X-linked and gametologous loci (Benestan et al. 2017; Drinan et
al. 2018; Trenkel et al. 2020). Function gl.report.sexlinked fromdartR package (v2; Mijangos et al. 2022) uses arbitrary
heterozygosity thresholds as default parameters to identify fixed
gametologs, and can be used to identify non-fixed gametologs and
Z-linked/X-linked loci by fine-tuning parameters (Pavlova et al. 2022).
Nevertheless, this approach has the disadvantages that there are no
clear instructions on how to tune parameters, the user has to manually
adjust thresholds on a trial-and-error basis for each genomic dataset,
and its precision declines with heterozygosity, risking either the
erroneous removal of autosomal loci with rare alleles or the failure to
remove sex-linked loci with low heterozygosity. Overall, these methods
could be improved upon by developing an intuitive statistical approach
that systematically identifies and distinguishes among types of SNPs
(autosomal, W-linked/Y-linked, Z-linked/X-linked, and gametologs) that
is automated in a ready-to-use R function with little user
intervention needed.
In the same way that it is possible to use a set of known-sex
individuals to identify sex-linked loci, the opposite is also possible:
use a set of known sex-linked loci to identify the sex of an individual.
Sex-assignment is usually done utilizing a handful of sex-linked loci of
only one type (Trenkel et al. 2020). For example, if using non-fixed
ZW-gametologs (for which heterozygous individuals are never male), an
individual is declared female if it is heterozygous for at least one
locus, yet by chance, depending on allelic frequencies and the number of
evaluated gametologs, some females may not be heterozygous for any of
the loci. Similarly, depending on genotyping error rates, some males may
appear heterozygous for some loci. To our knowledge, despite the rich
information that the three types of sex-linked loci contain to improve
sex-assignment, the comparison of their information is rarely, if ever,
done. Thus, the sexing of individuals using large SNP datasets can
benefit from a methodical procedure that uses the information from all
available sex-linked loci and that can be integrated as a standard step
in bioinformatics pipelines.
Another best-practice during filtering autosomal and sex-linked datasets
is minimizing the presence of ‘multilocus’ SNPs (also known as
multilocus contigs, multicopy loci or homeologs; Hohenlohe et al. 2011;
Willis et al. 2017; O’Leary et al. 2018). These artefactual SNPs arise
during bioinformatic processing of raw reads and are product of
erroneously fusing two physically separate loci that are very similar,
either because they are paralogs, repetitive elements or otherwise very
much alike. Because a multilocus SNP is actually two loci, multilocus
SNPs tend to present abnormally high read depths. This characteristic
allows their removal by setting a maximum read depth threshold during
filtering (usually twice the mode or mean; Willis et al. 2017). In some
cases, there are fixed or near-fixed differences between the
artificially fused loci, which makes multilocus SNPs exhibit
heterozygosity well-above the expectation of 0.5 for biallelic markers
at Hardy-Weinberg proportions. As a consequence, these SNPs can inflate
estimates of heterozygosity (O’Leary et al. 2018). A common practice to
identify these artefactual loci using heterozygosity is to set an
arbitrary maximum threshold (e.g., heterozygosity ≥ 0.6). It has been
found that using more than one approach to identify multilocus SNPs–and
removing those that are flagged by any method—constitutes the best
strategy (Willis et al. 2017).
Lastly, parentage analyses and sibship reconstruction using molecular
markers have great relevance in wildlife conservation. Resolving unknown
parent-offspring relationships gives insights into the behaviour,
ecology and evolution of plant and animal populations (e.g., extra-pair
mating, inbreeding avoidance, dispersal, natural selection, effective
population size; Flanagan & Jones 2018). Their application extends into
very practical instances such as monitoring the success of
translocations and genetic rescue, and spotting illegal trade of wild
individuals (Fitzpatrick et al. 2020; Van Rossum 2022; Mucci et al.
2020). Moreover, captive breeding programs also benefit from parentage
analyses that allow them to estimate founder relationships (typically
assumed unrelated), and validate pedigrees and correct errors (Moran et
al. 2021; Overbeek et al. 2020; Galla et al. 2021). Among the variety of
parentage analysis software in existence, one of the most popular is
COLONY, which simultaneously infers sibship and parentage, and can
handle thousands of SNPs (Jones & Wang 2010). However, handling large
amounts of genetic data in order to format it into the specific input
file for COLONY requires some degree of bioinformatics expertise
(Flanagan & Jones 2018). Often, researchers need to create different
input files because several runs are usually required to maximize
detection of true relationships. This can be a time-consuming task worth
automating.
In this study, we aim to create four R functions to assist
researchers analyzing reduced-representation genomic datasets. The study
consists of two parts. First, we describe four R functions that
we designed to automate common tasks in conservation genomic studies:
(1) identify and remove sex-linked loci (functionfilter.sex.linked ), (2) use sex-linked loci to identify the
genetic sex of individuals (function infer.sex ), (3) filter out
excessively heterozygous loci that are likely to be genotyping errors
(function filter.excess.het ), and (4) create input files for
parentage analyses in COLONY (function gl2colony ). Second, we use
the four new functions to process genetic datasets for two bird species
and show how incomplete removal of sex-linked loci affects downstream
analyses of (i) population genetic diversity, (ii) individual
heterozygosity, (iii) population genetic structure, and (iv)
parentage.