2. Methods
2.1 Study system and available genetic data
We created models using available genetic, environmental and trait data for 31 species of birds from the Atlantic Forest of South America, a biogeographic region with an accumulating number of phylogeographic studies over the past decades (Peres et al. 2020). We summarized and retrieved all intra-specific mitochondrial DNA (mtDNA) data made available across 20 published studies to date (Table S1 in Supporting File 1). While not representing the entire genome, we decided to use mtDNA loci for the purpose of this study since they 1) have been largely used in phylogeographic studies over the past two decades (Hickerson et al. 2010), especially in the Atlantic Forest (Peres et al. 2020); 2) represent a good approximation of intraspecific genetic differentiation due to higher mutation rates and smaller effective population size (Avise et al. 2016). From each available study, we listed all mtDNA loci sequenced across all individuals, as well as their associated georeferenced locality and GenBank accession number. When the accession number was missing from the original manuscript, we performed a search of the GenBank database using the species name, locus, museum voucher and specimen ID as keywords, through the NCBI API function entrez_search implemented in the R package rentrez(Winter 2017). For sequences not found through this automated search, we performed a manual search on the NCBI website. Sequences were downloaded using theentrez_fetch function of the rentrez R package. Mitochondrial DNA loci included in this analyses are Cytochrome b (CytB), NADH dehydrogenase 2 (ND2) and the D-loop (or control) region. All sequence data were aligned per species and locus, using the Muscle algorithm (Edgar 2004) in the muscle R package.
Sequences with missing coordinates were georeferenced using Google Maps API within the R package tidygeocoder(Kahle and Wickham 2013). We performed each search using all provided information about the locality in each manuscript, which in most cases included a locality name, state and country. We then plotted all localities per species in geographic space to visualize possible errors, and manually corrected erroneous localities. Although this approach may not retrieve the exact location where specimens were collected, the accuracy of the geocoding tool is sufficient to guide our measurements of genetic differences given the spatial scale of the data (Singh 2017).
2.2 Response and predictor variables
To summarize intraspecific genetic differentiation, we calculated the average number of nucleotide differences (DXY; Nei 1987) between pairs of localities for each locus alignment, in each species. DXY was calculated with the R package PopGenome(Pfeifer et al. 2014). Pairwise values of DXY were then used as the response variable in our models. Since different loci will yield different ranges of DXY values due to different mutation rates, the information about the locus used to calculate this metric was included in our models as a control variable.
Environmental predictors were employed as distances between every pair of localities for which sequences were available. Specifically, we measured distance as the landscape resistance between the two localities in each pair: this estimates the effective resistance of the landscape to organism movement by calculating the relative probability of individuals to move from any one point in space to adjacent points, based on characteristics of the environment (McRae 2006). For that, all movement probabilities across a specific region were summarized in resistance matrices, and a resistance distance between two localities was then calculated as the average probability of individuals to move from one locality to another. We created resistance matrices for the Atlantic Forest based on topographic variation and 19 bioclimatic variables retrieved from the WorldClim database (Fick and Hijmans 2017) at a 2.5 arc-min resolution. Additionally, we created a resistance matrix to be a proxy of plain geographic distance by assuming equal probability of individuals moving in all directions in space (i.e., no effect of topography or environment variation). Resistance matrices were created using the function transition and resistance distances were calculated using the function costDistance , both from the R package gdistance (van Etten 2017).
To test the importance of ecological traits in the predictive models, we summarized phenotypic characteristics shown to be related to dispersal capacity (hereinafter referred to as dispersal traits) and reproductive rates known to influence population dynamic (hereinafter referred to as demographic traits). Dispersal traits included three morphological measurements (body size, wing length and tarsus length), two descriptors of foraging ecology (foraging stratum and diet) and a measure of propensity to disperse through open areas (forest sensitivity). Foraging stratum and diet were transformed into ordinal variables to incorporate a gradient of ecological variation: values ranged from low to high to indicate a gradient from understorey towards exclusively canopy birds, and from diets consisting of one item (arthropods) to more generalized diets (i.e., arthropods along with other items). Diet based on nectar (observed only for Thalurania glaucopis ) was quantified as the highest value to indicate an entirely different diet resource. Demographic traits included annual adult survival, age at first reproductive event, maximum longevity and generation length. Body size, foraging stratum and diet were obtained from the Handbook of Birds of the World (Billerman et al. 2022), whereas wing length and tarsus length were obtained from the AVONET database (Tobias et al. 2022). Forest sensitivity was obtained from (Stotz et al. 1996). Finally, demographic traits were obtained from (Bird et al. 2020). The values of all ecological traits of each species included in this study are summarized in Table 1.
2.3 Model fitting and evaluation
We created models using the Random Forest technique (Breiman 2001), a machine learning approach particularly efficient in investigating relationships when many predictor variables are present (Schrider and Kern 2018). This approach is well suited to predict values of genetic differentiation since 1) it has been shown to perform well in regression analyses to predict continuous data (which is the case of our response variable; Boehmke and Greenwell 2019, Barrow et al. 2021), and 2) it has been successfully used in previous studies predicting genetic breaks (Sullivan et al. 2019). We implemented the random forest algorithm in the R package ranger(Wright and Ziegler 2015). The set of parameters best suited for our dataset was estimated using the tune package (Kuhn 2023), which implements cross-validations to explore different combinations of parameters. We performed such exploration by combining different values for the following parameters: number of trees (varying from 1 to 2000), number of variables per tree (from 1 to 22) and minimum node size (from 2 to 40). To choose the best parameter values for our models, we calculated the root mean squared error for each parameter combination.
To test the efficiency of a random forest model in predicting pairwise DXY values, we first created a global model by pooling together all pairwise locality data from all species. The goal of this model was to verify if one is able to predict levels of DXY across the entire region and highlight regions of possible lineage turnover in space. To evaluate model uncertainty, we created 100 replicates of this global model. In each replicate, we retained 70% of the data points as a training dataset, and evaluated the model by performing predictions of the 30% data points left out as testing dataset. To avoid over-fitting, we performed what we called a ”species cross-validation”, i.e. we forced each species to be either entirely included or entirely excluded from the training dataset, forcing the model to always be evaluated on a set of species that was not present in the training dataset. We estimated variable importance using the impurity metric, which measures the change in prediction accuracy when a predictor variable is removed from a decision tree (Wright and Ziegler 2015). Since predictor variables may be correlated to different extents, we also calculated the Spearman correlation coefficient among all predictor variables and used that information to discuss their relative importance.
To test the transferability of the model and evaluate whether we can extrapolate correlations detected in a group of well-studied species to taxa with little or no available genetic data, we additionally created species-specific models to predict values of DXY in each species. These models were created by leaving out the genetic data of one target species at a time and training our model on the remaining data. This approach differs from the one described above in that species-specific models directly evaluate how a model trained on all available data performs when predicting intraspecific genetic breaks within a single species instead of globally throughout the Atlantic Forest. We created one species-specific model for each combination of locus and species in our dataset (n = 37; Table 2).
Both approaches described above (global and species-specific models) were implemented with four different sets of predictor variables: 1) environmental data only ; 2) environmental data along with dispersal trait data; 3) environmental data along with demographic trait data; and 4) all available predictor variables (i.e., environmental data along with both types of ecological trait data). We compared the predictive power of these different sets of predictors by correlating observed and predicted values of DXY and reporting the R2 square of such correlations. For both global and species-specific models, we tested if the distributions of R2 values across replicates (in global models) or species (in species-specific models) differed among models using different sets of predictors, by using a Kruskal-Wallis test. Additionally, we calculated the difference in R2between models that included versus models that did not include ecological traits, and used a Wilcoxon Test to evaluate which set of traits (i.e., dispersal or demographic traits) led to a larger increase in R2.
2.4 Visualizing the location of modeled genetic barriers in space
To visualize observed and predicted values of genetic differentiation in space, we mapped DXY values from each pair of locality to the geographic point located at the center of the shortest path connecting those two sites (hereafter referred to as midpoint). Because we expect that high values of DXY will occur whenever two localities are on opposite sides of a barrier to gene flow, the use of a color legend facilitates the identification of genetic breaks in space. For global models, we calculated and mapped for each midpoint the mean and standard deviation of the predicted values of DXY, as well as the average difference between predicted and observed, across the total number of replicates (n of replicates = 100). For species-specific models, we plotted the observed and predicted value of DXY for each species. To visualize values continuously over the landscape, we performed an inverse distance weighted (IDW) interpolation of DXY values using the function idw from the package gstat(Gräler et al. 2016). Interpolated maps were created for observed values as well as values predicted by models using each different set of predictors. To focus on the relative levels of genetic differentiation across space, rather than the absolute values, we re-scaled the predicted values of DXY in our interpolation based on the maximum and minimum values of observed DXY. Additionally, in order to visualize genetic differentiation from different loci in the same map, we re-scaled values across loci to the same range. These re-scaling procedures allow us to remove the effect of different locus when visualizing genetic differentiation in geographic space, as well as emphasize the relative differences in model prediction across the area.