4. Discussion
By incorporating data from multiple species in a supervised machine learning framework, we were able to explore different correlates of the spatial distribution of genetic differentiation as well as our ability to predict patterns across space. Model predictive accuracy varied widely and depended strongly on the set of predictors used: models including species-specific ecological traits were consistently more accurate (Fig. 2). This result is in line with previous studies suggesting that species' ecological characteristics interact with the abiotic environment in driving observed patterns (Burney and Brumfield 2009, Pabijan et al. 2012, Paz et al. 2015, Sullivan et al. 2019, Miller et al. 2021). Importantly, even though model accuracy increases when ecological traits are included, environmental predictors are still the most important variables in our models, particularly differences in temperature seasonality (bio4) and precipitation during cold periods (bio19; Fig. 3), which are highly correlated with additional variables describing temperature ranges and extremes of lower precipitation (Table S2). This suggests that genetic differentiation increases when the geographic cells along the path connecting two localities are similar in values of temperature range and precipitation extremes. This important environmental effect is not surprising, since the biogeographic literature largely supports the abiotic environment as a major driver of spatial patterns on several scales (Davies et al. 2007, Stein et al. 2014, Voskamp et al. 2017, French et al. 2023). However, the increase in prediction accuracy when traits are added (Fig. 2) clearly shows that the abiotic environment alone cannot account for all the observed intraspecific genetic variation. This is expected to be especially true on spatial scales where environmental variation is large, as in the present study (Peres et al. 2020). In these cases, combining abiotic predictors with species-specific traits helps reduce the amount of unexplained genetic variation.
We additionally found that different categories of ecological traits have varying predictive power. Dispersal traits, especially morphological measurements, were more informative than demographic traits in our predictive models (Fig. 2 and 3D). This makes sense considering the large support in the bird literature for a close relationship of body size and wing length with dispersal ability (e.g., Dawideit et al. 2009, Claramunt et al. 2012). Demographic traits also improve model accuracy, but to a lesser extent (Fig. 2), and we believe three aspects may explain their weaker effect on genetic differentiation: the spatial scale of our study, possible correlations with other predictors, and their effect on changes in standing genetic variation rather than spatial differentiation. First, at the regional scale encompassed by our dataset, the effect of environmental differences (Manel and Holderegger 2013) and long-range dispersal dictated mainly by morphology (Claramunt et al. 2012, Sheard et al. 2020, Claramunt 2021) may prevail over the effect of demographic traits, which can be more pronounced at local scales (Castorani et al. 2017, Drake et al. 2022), especially in regions with high environmental heterogeneity. Second, we find that survival was highly correlated with body size in our dataset (Table S2), and was also among the best predictors when morphological traits were not included (Fig. 3). This suggests that some of the biological importance of survival as a correlate of genetic differentiation may be accounted for by body size, especially in models where both traits are included (Fig. 3D). We highlight that Sullivan et al. (2019) found clutch size, a demographic trait, to be an important predictor of intraspecific divergence, but this trait is also thought to be correlated with body size (Tuomi 1980, Ford and Seigel 1989, McGinley 1989, Sibly and Brown 2007, Werner and Griebeler 2011), and the models in which clutch size was shown to be important did not include body size (Sullivan et al. 2019). In such cases, where dispersal and demographic traits are correlated, the relative contribution of the two categories is hard to disentangle. Finally, demographic traits may be more important for explaining changes in effective population size, through their effect on demographic rates such as growth and recruitment rates (Saether et al. 2013, Waples 2016). Therefore, they may contribute mainly to the relative amount of genetic variation present in different populations (i.e., standing genetic variation) and contribute only indirectly to the relative differences in landscape connectivity across species.
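The collinearity issue raised above can be screened for before fitting any model. The following is a minimal sketch, not the procedure used in this study: the function names, the 0.7 cutoff and the trait values are all hypothetical and serve only to illustrate how correlated trait pairs (such as survival and body size) could be flagged.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def correlated_pairs(traits, threshold=0.7):
    """Return (trait_a, trait_b, r) for every pair with |r| >= threshold.

    The 0.7 default is an arbitrary, assumed cutoff.
    """
    flagged = []
    for a, b in combinations(sorted(traits), 2):
        r = pearson_r(traits[a], traits[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Synthetic values for six hypothetical species (NOT data from this study).
traits = {
    "body_size": [12.0, 18.5, 25.0, 31.0, 44.0, 60.0],
    "survival": [0.45, 0.50, 0.58, 0.61, 0.70, 0.74],
    "clutch_size": [3.0, 2.0, 4.0, 2.0, 3.0, 3.0],
}
flagged = correlated_pairs(traits)
for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:.2f}")
```

When a pair is flagged, the importance attributed to either trait should be read as shared between the two, which is the caution expressed above for survival, clutch size and body size.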
Mapping model uncertainty (i.e., variance and error in predicted values) further allows us to discuss the relative importance of different drivers of genetic differentiation. Aside from the predictive variance inherent to the modeling procedure (Boehmke and Greenwell 2019) and to the stochasticity of the evolutionary process (Lenormand et al. 2009), we assume additional variance stems from the proportion of genetic variation that is not explained by predictors in our model. First, we observe that variance and error are higher when predictor traits are absent (Fig. 6), further emphasizing their relevance for predicting genetic differentiation. Additionally, even in maps incorporating all of our predictors, variance and error are higher in the northern Atlantic Forest. We suggest two possible reasons for this result, the first being the absence of predictors that reflect past environmental conditions. Mitochondrial DNA variation is expected to reflect relatively recent spatial and temporal changes in populations (Avise 2009), and it has been suggested that the distribution of genetic diversity in the northern region of the Atlantic Forest is better explained by past climate dynamics (Carnaval et al. 2014). In the framework we follow here, where genetic differentiation is calculated across pairs of localities, we believe incorporating past climatic conditions to explain current environmental differences is problematic because of the uncertainty in the past distribution of the species, which is expected to have undergone significant changes in the last 100 thousand years (Hofreiter and Stewart 2009, Baker et al. 2020). The past environmental distance between two present localities is not a good proxy for the effect of historical climate because individuals in each locality do not necessarily represent the genetic diversity observed in that locality in the past.
A second and equally plausible reason for higher uncertainty in northern AF is the fact that most sampled localities are distributed in the southern AF (Fig. 1A). In fact, southern AF (south of latitude 19°S) encompasses 83% of the data points in our dataset. This means that even though we observe both low and high values of genetic differentiation across the entire region, most of the variation in our response variable is concentrated in the southern AF. This raises the question of whether the relationship between environmental and genetic data in southern AF (which dominates our training data) can be extrapolated to the northern AF. If that is a safe extrapolation, we could conclude that variance and predictive error in northern AF stem from undocumented phylogeographic structure. However, we believe that case is unlikely since spatial autocorrelation suggests these two regions will tend to have different environmental characteristics (Keitt et al. 2002, Carnaval et al. 2014). We therefore believe that uncertainty in northern AF mostly stems from the lack of representation of that environment in our models. The same rationale can be applied to the variation in ecological traits across species: most species in our dataset are distributed entirely in the southern AF (Table 1). This means ecological differences across species might not be high among northern AF data points and therefore do not contribute to increasing model accuracy. Overall, these results point to the need for uniform geographic representation of genetic variation when implementing predictive models.
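The regional comparison of uncertainty discussed above can be illustrated with a short sketch. It assumes that each pair of localities has several predictions from an ensemble (for example, individual trees or resampled fits), which this section does not specify; the record layout, function name and values are hypothetical and are not taken from our data.

```python
from statistics import mean, pvariance


def summarize_uncertainty(records):
    """Mean predictive variance and mean absolute error per region.

    Each record holds the observed differentiation for one locality pair
    and the predictions made for that pair by each ensemble member.
    """
    by_region = {}
    for rec in records:
        variance = pvariance(rec["ensemble"])  # spread among ensemble members
        error = abs(mean(rec["ensemble"]) - rec["observed"])
        by_region.setdefault(rec["region"], []).append((variance, error))
    return {
        region: {
            "mean_variance": mean(v for v, _ in values),
            "mean_abs_error": mean(e for _, e in values),
        }
        for region, values in by_region.items()
    }


# Synthetic records (NOT data from this study).
records = [
    {"region": "north", "observed": 0.40, "ensemble": [0.10, 0.30, 0.50]},
    {"region": "south", "observed": 0.20, "ensemble": [0.18, 0.20, 0.22]},
    {"region": "south", "observed": 0.60, "ensemble": [0.55, 0.60, 0.65]},
]
summary = summarize_uncertainty(records)
for region, stats in sorted(summary.items()):
    print(region, stats)
```

Summaries of this kind cannot by themselves separate the two explanations given above, because missing historical predictors and sparse sampling would both inflate variance and error in the same region.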
The high variation observed in predictive accuracy of species-specific models further emphasizes the relevance of evaluating the ability of the model to extrapolate learned relationships. In species-specific models, accuracy depends on how much of the environmental variation in the species' range is present in the set of species on which the model was trained. We observed low prediction accuracy when there is little overlap between the species' range and the ranges of species in the training dataset. That is the case, for instance, for Cacicus chrysopterus and Synallaxis cinerea, which occur in the southern extreme of the Atlantic Forest and in the Diamantina mountains, respectively, locations where few other species are sampled. Finally, we also observe low predictive accuracy in species whose range falls outside the training data and is also sparsely sampled (e.g., Phylloscartes ventralis and Poecilotriccus fumifrons) or in small-ranged species that have fewer points to be predicted (such as Synallaxis cinerea). Models based solely on environment still perform worse than those that include traits (Fig. 5B). Combined, these results suggest that the use of predictive models to infer the distribution of genetic diversity in unsampled species requires careful evaluation of how well the species is represented within the training variation, and that information on morphological traits might still be relevant to increase prediction accuracy.
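One simple way to gauge how well a focal species is represented within the training variation is to ask what share of its data points fall inside the range of predictor values seen during training. The sketch below is a deliberately crude, per-variable range check offered under stated assumptions: the function names and the values are hypothetical, and only the predictor labels bio4 and bio19 come from the text.

```python
def training_envelope(training_rows):
    """Minimum and maximum of each predictor across the training rows."""
    names = training_rows[0].keys()
    return {
        name: (
            min(row[name] for row in training_rows),
            max(row[name] for row in training_rows),
        )
        for name in names
    }


def share_within_envelope(target_rows, envelope):
    """Fraction of target rows with every predictor inside the envelope."""
    inside = sum(
        all(low <= row[name] <= high for name, (low, high) in envelope.items())
        for row in target_rows
    )
    return inside / len(target_rows)


# Synthetic environmental differences (NOT data from this study).
training = [
    {"bio4": 10.0, "bio19": 100.0},
    {"bio4": 30.0, "bio19": 250.0},
    {"bio4": 20.0, "bio19": 180.0},
]
focal_species = [
    {"bio4": 15.0, "bio19": 120.0},  # inside both training ranges
    {"bio4": 35.0, "bio19": 200.0},  # bio4 exceeds the training maximum
]
envelope = training_envelope(training)
share_inside = share_within_envelope(focal_species, envelope)
print(f"share of focal points inside training envelope: {share_inside:.2f}")
```

A low share indicates that predictions for the focal species rely on extrapolation. A range check of this kind ignores combinations of predictors, so a high share is necessary but not sufficient evidence of adequate representation.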
Our results highlight the relevance of balancing the goals of explanation and prediction in predictive biogeography: by exploring models with different sets of predictors, we show that environmental variation best explains genetic differentiation but is not enough to perform accurate predictions. Additionally, we show how mapping predictions and the related uncertainty allows for further investigation of model accuracy over space and gives directions to improve prediction. Finally, the goal of predicting is readily applicable to conservation biology. Policies aiming to create a network of preserved areas can use machine learning algorithms to predict areas of turnover and feed this information into approaches like systematic conservation planning (Margules and Pressey 2000, Nielsen et al. 2023). We suggest that, at least for birds, morphological traits should be included given their relevance for model accuracy. When the aim is to make predictions for a focal species based on all available data for a community, it is necessary to: 1) make sure the available data have good genetic sampling covering the area where the focal species occurs; 2) include dispersal traits whenever possible to give more realistic predictions. Even though demographic traits did not lead to the highest observed increase in accuracy, they may also be included, especially for species in which population connectivity is thought to be more strongly correlated with life history strategies such as strong philopatry or unique social structures (Drake et al. 2022). As our results show, the use of machine learning approaches in predictive biogeography benefits from incorporating extra predictor information, but careful evaluation is needed to assess what type of information leads to the highest increase in prediction accuracy.