2. Methods
2.1 Study system and available genetic data
We created models using available genetic, environmental and trait data
for 31 species of birds from the Atlantic Forest of South America, a
biogeographic region with an accumulating number of phylogeographic
studies over the past decades
(Peres et al. 2020). We
summarized and retrieved all intra-specific mitochondrial DNA (mtDNA)
data made available across 20 published studies to date (Table S1 in
Supporting File 1). While not representing the entire genome, we decided
to use mtDNA loci for the purpose of this study since they 1) have been
largely used in phylogeographic studies over the past two decades
(Hickerson et al. 2010),
especially in the Atlantic Forest
(Peres et al. 2020); 2)
represent a good approximation of intraspecific genetic differentiation
due to higher mutation rates and smaller effective population size
(Avise et al. 2016). From
each available study, we listed all mtDNA loci sequenced across all
individuals, as well as their associated georeferenced locality and
GenBank accession number. When the accession number was missing from the
original manuscript, we performed a search of the GenBank database using
the species name, locus, museum voucher and specimen ID as keywords,
through the NCBI API function entrez_search implemented in the R
package rentrez(Winter 2017). For
sequences not found through this automated search, we performed a manual
search on the NCBI website. Sequences were downloaded using theentrez_fetch function of the rentrez R package.
Mitochondrial DNA loci included in this analyses are Cytochrome b
(CytB), NADH dehydrogenase 2 (ND2) and the D-loop (or control) region.
All sequence data were aligned per species and locus, using the Muscle
algorithm (Edgar 2004) in
the muscle R package.
Sequences with missing coordinates were georeferenced using Google Maps
API within the R package tidygeocoder(Kahle and Wickham 2013).
We performed each search using all provided information about the
locality in each manuscript, which in most cases included a locality
name, state and country. We then plotted all localities per species in
geographic space to visualize possible errors, and manually corrected
erroneous localities. Although this approach may not retrieve the exact
location where specimens were collected, the accuracy of the geocoding
tool is sufficient to guide our measurements of genetic differences
given the spatial scale of the data
(Singh 2017).
2.2 Response and predictor variables
To summarize intraspecific genetic differentiation, we calculated the
average number of nucleotide differences (DXY;
Nei 1987) between pairs of
localities for each locus alignment, in each species.
DXY was calculated with the R package PopGenome(Pfeifer et al. 2014).
Pairwise values of DXY were then used as the response
variable in our models. Since different loci will yield different ranges
of DXY values due to different mutation rates, the
information about the locus used to calculate this metric was included
in our models as a control variable.
Environmental predictors were employed as distances between every pair
of localities for which sequences were available. Specifically, we
measured distance as the landscape resistance between the two localities
in each pair: this estimates the effective resistance of the landscape
to organism movement by calculating the relative probability of
individuals to move from any one point in space to adjacent points,
based on characteristics of the environment
(McRae 2006). For that, all
movement probabilities across a specific region were summarized in
resistance matrices, and a resistance distance between two localities
was then calculated as the average probability of individuals to move
from one locality to another. We created resistance matrices for the
Atlantic Forest based on topographic variation and 19 bioclimatic
variables retrieved from the WorldClim database
(Fick and Hijmans 2017) at a
2.5 arc-min resolution. Additionally, we created a resistance matrix to
be a proxy of plain geographic distance by assuming equal probability of
individuals moving in all directions in space (i.e., no effect of
topography or environment variation). Resistance matrices were created
using the function transition and resistance distances were
calculated using the function costDistance , both from the R
package gdistance (van
Etten 2017).
To test the importance of ecological traits in the predictive models, we
summarized phenotypic characteristics shown to be related to dispersal
capacity (hereinafter referred to as dispersal traits) and reproductive
rates known to influence population dynamic (hereinafter referred to as
demographic traits). Dispersal traits included three morphological
measurements (body size, wing length and tarsus length), two descriptors
of foraging ecology (foraging stratum and diet) and a measure of
propensity to disperse through open areas (forest sensitivity). Foraging
stratum and diet were transformed into ordinal variables to incorporate
a gradient of ecological variation: values ranged from low to high to
indicate a gradient from understorey towards exclusively canopy birds,
and from diets consisting of one item (arthropods) to more generalized
diets (i.e., arthropods along with other items). Diet based on nectar
(observed only for Thalurania glaucopis ) was quantified as the
highest value to indicate an entirely different diet resource.
Demographic traits included annual adult survival, age at first
reproductive event, maximum longevity and generation length. Body size,
foraging stratum and diet were obtained from the Handbook of Birds of
the World (Billerman et al.
2022), whereas wing length and tarsus length were obtained from the
AVONET database (Tobias et
al. 2022). Forest sensitivity was obtained from
(Stotz et al. 1996).
Finally, demographic traits were obtained from
(Bird et al. 2020). The
values of all ecological traits of each species included in this study
are summarized in Table 1.
2.3 Model fitting and evaluation
We created models using the Random Forest technique
(Breiman 2001), a machine
learning approach particularly efficient in investigating relationships
when many predictor variables are present
(Schrider and Kern 2018).
This approach is well suited to predict values of genetic
differentiation since 1) it has been shown to perform well in regression
analyses to predict continuous data (which is the case of our response
variable; Boehmke and
Greenwell 2019, Barrow et al. 2021), and 2) it has been successfully
used in previous studies predicting genetic breaks
(Sullivan et al. 2019). We
implemented the random forest algorithm in the R package ranger(Wright and Ziegler 2015).
The set of parameters best suited for our dataset was estimated using
the tune package (Kuhn
2023), which implements cross-validations to explore different
combinations of parameters. We performed such exploration by combining
different values for the following parameters: number of trees (varying
from 1 to 2000), number of variables per tree (from 1 to 22) and minimum
node size (from 2 to 40). To choose the best parameter values for our
models, we calculated the root mean squared error for each parameter
combination.
To test the efficiency of a random forest model in predicting pairwise
DXY values, we first created a global model by pooling
together all pairwise locality data from all species. The goal of this
model was to verify if one is able to predict levels of
DXY across the entire region and highlight regions of
possible lineage turnover in space. To evaluate model uncertainty, we
created 100 replicates of this global model. In each replicate, we
retained 70% of the data points as a training dataset, and evaluated
the model by performing predictions of the 30% data points left out as
testing dataset. To avoid over-fitting, we performed what we called a
”species cross-validation”, i.e. we forced each species to be either
entirely included or entirely excluded from the training dataset,
forcing the model to always be evaluated on a set of species that was
not present in the training dataset. We estimated variable importance
using the impurity metric, which measures the change in prediction
accuracy when a predictor variable is removed from a decision tree
(Wright and Ziegler 2015).
Since predictor variables may be correlated to different extents, we
also calculated the Spearman correlation coefficient among all predictor
variables and used that information to discuss their relative
importance.
To test the transferability of the model and evaluate whether we can
extrapolate correlations detected in a group of well-studied species to
taxa with little or no available genetic data, we additionally created
species-specific models to predict values of DXY in each
species. These models were created by leaving out the genetic data of
one target species at a time and training our model on the remaining
data. This approach differs from the one described above in that
species-specific models directly evaluate how a model trained on all
available data performs when predicting intraspecific genetic breaks
within a single species instead of globally throughout the Atlantic
Forest. We created one species-specific model for each combination of
locus and species in our dataset (n = 37; Table 2).
Both approaches described above (global and species-specific models)
were implemented with four different sets of predictor variables: 1)
environmental data only ; 2) environmental data along with dispersal
trait data; 3) environmental data along with demographic trait data; and
4) all available predictor variables (i.e., environmental data along
with both types of ecological trait data). We compared the predictive
power of these different sets of predictors by correlating observed and
predicted values of DXY and reporting the
R2 square of such correlations. For both global and
species-specific models, we tested if the distributions of
R2 values across replicates (in global models) or
species (in species-specific models) differed among models using
different sets of predictors, by using a Kruskal-Wallis test.
Additionally, we calculated the difference in R2between models that included versus models that did not include
ecological traits, and used a Wilcoxon Test to evaluate which set of
traits (i.e., dispersal or demographic traits) led to a larger increase
in R2.
2.4 Visualizing the location of modeled genetic barriers in space
To visualize observed and predicted values of genetic differentiation in
space, we mapped DXY values from each pair of locality
to the geographic point located at the center of the shortest path
connecting those two sites (hereafter referred to as midpoint). Because
we expect that high values of DXY will occur whenever
two localities are on opposite sides of a barrier to gene flow, the use
of a color legend facilitates the identification of genetic breaks in
space. For global models, we calculated and mapped for each midpoint the
mean and standard deviation of the predicted values of
DXY, as well as the average difference between predicted
and observed, across the total number of replicates (n of replicates =
100). For species-specific models, we plotted the observed and predicted
value of DXY for each species. To visualize values
continuously over the landscape, we performed an inverse distance
weighted (IDW) interpolation of DXY values using the
function idw from the package gstat(Gräler et al. 2016).
Interpolated maps were created for observed values as well as values
predicted by models using each different set of predictors. To focus on
the relative levels of genetic differentiation across space, rather than
the absolute values, we re-scaled the predicted values of
DXY in our interpolation based on the maximum and
minimum values of observed DXY. Additionally, in order
to visualize genetic differentiation from different loci in the same
map, we re-scaled values across loci to the same range. These re-scaling
procedures allow us to remove the effect of different locus when
visualizing genetic differentiation in geographic space, as well as
emphasize the relative differences in model prediction across the area.