4. Discussion
By incorporating data from multiple species in a supervised machine
learning framework, we were able to explore different correlates of the
spatial distribution of genetic differentiation as well as our ability
to predict patterns across space. Model predictive accuracy showed large
variation but was highly dependent on the set of predictors utilized:
models including species-specific ecological traits led to consistently
higher accuracy (Fig. 2). This result is in line with previous studies
suggesting that species ecological characteristics interact with the
abiotic environment in driving observed patterns (Burney and Brumfield
2009, Pabijan et al. 2012, Paz et al. 2015, Sullivan et
al. 2019, Miller et al. 2021). Importantly, even though model accuracy
increases when ecological traits are included, environmental predictors
are still the most important variables in our models, particularly
differences in temperature seasonality (bio4) and precipitation during
cold periods (bio19; Fig. 3), which is highly correlated with additional
variables describing temperature ranges and extremes of lower
precipitation (Table S2). This suggests that genetic differentiation
increases when the geographic cells along the path connecting two
localities are similar in values of temperature range and precipitation
extremes. This important environmental effect is not surprising since
the biogeographic literature largely supports the abiotic environment as
a major driver of spatial patterns on several scales (Davies et al.
2007, Stein et al. 2014, Voskamp et al. 2017, French et al. 2023).
However, the increase in prediction accuracy (Fig. 2) clearly shows that
the abiotic environment alone cannot account for all the observed
intraspecific genetic variation. This is expected to be especially true
on spatial scales where environmental variation is large, as in the
present study (Peres et al. 2020). In these cases, the combination of
abiotic predictors with
species-specific traits helps decrease the amount of unexplained
observed genetic variation.
We additionally found that different categories of ecological traits
have varying predictive power. Dispersal traits, especially
morphological measurements, were more informative than demographic
traits in our predictive models (Fig. 2 and 3D). This makes sense
considering the strong support in the bird literature for a close
relationship of body size and wing length with dispersal ability
(e.g., Dawideit et al. 2009, Claramunt et al. 2012). Demographic traits
also improve model
accuracy, but to a lesser extent (Fig. 2), and we believe three aspects
may explain their weaker effect on genetic differentiation: the spatial
scale of our study, possible correlations with other predictors and
their effect on changes in standing genetic variation rather than
spatial differentiation. First, at the regional scale encompassed by our
dataset, the effects of environmental differences (Manel and
Holderegger 2013) and of long-range dispersal, dictated mainly by
morphology (Claramunt et al. 2012, Sheard et al. 2020, Claramunt 2021),
may prevail over the effect of demographic traits, which can be more
pronounced at local scales (Castorani et al. 2017, Drake et al. 2022),
especially in regions with high environmental heterogeneity. Second, we
found that survival was highly correlated with
body size in our dataset (Table S2), and was also among the best
predictors when morphological traits were not included (Fig. 3). This
suggests that some of the biological importance of survival as a
correlate of genetic differentiation may be accounted for by body size,
especially in models where both traits are included (Fig. 3D). We
highlight that Sullivan et al. (2019) found clutch size, a demographic
trait, to be an important predictor of intraspecific divergence, but
this trait is also thought to be correlated with body size (Tuomi 1980,
Ford and Seigel 1989, McGinley 1989, Sibly and Brown 2007, Werner and
Griebeler 2011), and the models in which clutch size was shown to be
important did not include body size (Sullivan et al. 2019). In
such cases, where dispersal and demographic traits are correlated, the
relative contribution of the two categories is hard to disentangle.
Finally, demographic traits may be more important to explain changes in
effective population size, through their effect on demographic rates
such as growth and recruitment rates (Saether et al. 2013, Waples
2016). Therefore, they may contribute mainly to the relative amount of
genetic variation present in different populations (i.e.,
standing genetic variation) and contribute only indirectly to the
relative differences in landscape connectivity across species.
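The difficulty of separating correlated dispersal and demographic
traits can be illustrated with a simple pairwise correlation screen.
The sketch below is not the procedure used in this study: the trait
names, the values and the 0.7 cutoff are hypothetical and serve only to
show how strongly correlated predictor pairs might be flagged before
their importances are interpreted.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlated_pairs(table, threshold=0.7):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= threshold.

    The threshold is an illustrative assumption, not a value from the study.
    """
    names = sorted(table)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(table[a], table[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical per-species trait values (not data from this study).
traits = {
    "body_mass": [10.0, 20.0, 30.0, 40.0, 50.0],
    "survival": [0.42, 0.50, 0.55, 0.66, 0.70],
    "clutch_size": [3.0, 2.0, 4.0, 2.0, 3.0],
}

flagged = correlated_pairs(traits)
for a, b, r in flagged:
    print(f"{a} and {b} are strongly correlated (r = {r})")
```

When a pair is flagged in this way, the importance assigned to either
predictor alone is best read as shared between the two.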
Mapping model uncertainty (i.e., variance and error in predicted values)
further allows us to discuss the relative importance of different
drivers of genetic differentiation. Aside from the predictive variance
inherent to the modeling procedure (Boehmke and Greenwell 2019) and to
the stochasticity of the evolutionary process (Lenormand et al. 2009),
we assume additional variance stems from the proportion of genetic
variation that is not explained by predictors in our model. First, we
observe that variance and error are higher when predictor traits are
absent (Fig. 6), further emphasizing their relevance for predicting
genetic differentiation. Additionally, even in maps incorporating all of
our predictors, variance and error are higher in the northern Atlantic
Forest. We suggest two possible reasons for this result, the first being
the absence of predictors that reflect past environmental conditions.
Mitochondrial DNA genetic variation is expected to reflect relatively
recent spatial and temporal changes in populations (Avise 2009), and it
has been suggested that the distribution of genetic diversity in the
northern region of the Atlantic Forest is better explained by past
climate dynamics (Carnaval et al. 2014). In the framework we follow
here, where genetic
differentiation is calculated across pairs of localities, we believe
incorporating past climatic conditions to explain current environmental
differences is problematic because of the uncertainty in the past
distribution of the species, which is expected to have undergone
significant changes in the last 100 thousand years (Hofreiter and
Stewart 2009, Baker et al. 2020). The past environmental distance
between two
present localities is not a good proxy of the effect of historical
climate because individuals in each locality do not necessarily
represent the genetic diversity observed in that locality in the past.
A second and equally plausible reason for higher uncertainty in northern
AF is the fact that most sampled localities are distributed in the
southern AF (Fig. 1A). In fact, southern AF (south of latitude 19°S)
encompasses 83% of the data points in our dataset. This means that even
though we observe both low and high values of genetic differentiation
across the entire region, most of the variation in our response variable
is concentrated in the southern AF. This raises the question of whether
the relationship between environmental and genetic data in southern AF
(which dominate the training of our models) can be extrapolated to the
northern AF. If that is a safe extrapolation, we could conclude that
variance and predictive error in northern AF stem from undocumented
phylogeographic structure. However, we believe that case is unlikely
since spatial autocorrelation suggests these two regions will tend to
have different environmental characteristics (Keitt et al. 2002,
Carnaval et al. 2014). We therefore believe that uncertainty in
northern AF would mostly stem from lack of representation of that
environment in our models. The same rationale can be applied to the
variation in ecological traits across species: most species in our
dataset are distributed entirely in the southern AF (Table 1). This
means ecological differences across species might not be high in
northern AF data points and therefore do not contribute to increasing
model accuracy. Overall, these results point to the need for uniform
geographic representation of genetic variation when implementing
predictive models.
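The imbalance described above can be checked directly by comparing the
share of data points and the mean prediction error on each side of the
19°S boundary. The following is a minimal sketch under stated
assumptions: the point values are hypothetical, error is taken to be
the absolute difference between observed and predicted differentiation,
and the function name is ours rather than part of the analysis
pipeline.

```python
def summarize_by_region(points, boundary_lat=-19.0):
    """Summarize sampling share and mean error north and south of a latitude.

    points: list of (latitude, absolute_error) tuples; southern latitudes
    are negative, so "south" means latitude < boundary_lat.
    Returns {region: (share_of_points, mean_error_or_None)}.
    """
    groups = {"north": [], "south": []}
    for lat, err in points:
        region = "south" if lat < boundary_lat else "north"
        groups[region].append(err)

    total = len(points)
    summary = {}
    for region, errors in groups.items():
        share = len(errors) / total if total else 0.0
        mean_error = sum(errors) / len(errors) if errors else None
        summary[region] = (share, mean_error)
    return summary

# Hypothetical (latitude, absolute error) pairs, not data from this study.
points = [
    (-25.0, 0.10),
    (-23.5, 0.20),
    (-22.0, 0.10),
    (-20.5, 0.20),
    (-12.0, 0.50),
]

summary = summarize_by_region(points)
for region, (share, mean_error) in summary.items():
    print(f"{region}: share of points = {share:.2f}, mean error = {mean_error:.2f}")
```

A large gap in share between regions, combined with higher error in the
sparsely sampled one, is the pattern that would point to limited
representation rather than to undocumented structure.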
The high variation observed in predictive accuracy of species-specific
models further emphasizes the relevance of evaluating the ability of the
model to extrapolate learned relationships. In species-specific models,
accuracy is dependent on how much of the environmental variation in the
species range is present in the set of species on which the model was
trained. We observed low prediction accuracy when there was little
overlap between the species range and the ranges of species in the
training dataset. That is the case, for instance, for Cacicus
chrysopterus and Synallaxis cinerea, which occur in the southern
extreme of the Atlantic Forest and in the Diamantina mountains,
respectively, locations where few other species are sampled. Finally, we
also observe low predictive accuracy in species in which a range largely
outside the training data combines with sparse geographic sampling
(e.g., Phylloscartes ventralis and Poecilotriccus fumifrons) or in
small-ranged species that have fewer points to be predicted (such as
Synallaxis cinerea). Models based solely on environment still
perform worse than those that include traits (Fig. 5B). Combined, these
results suggest that the use of predictive models to infer the
distribution of genetic diversity in unsampled species requires careful
evaluation of how well the species is represented within the training
variation, and that
information on morphological traits might still be relevant to increase
prediction accuracy.
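One simple way to evaluate how well a focal species is represented
within the training variation is to ask what fraction of its data
points fall inside the range of each predictor seen during training.
The sketch below uses a univariate minimum-maximum envelope, which is a
deliberately crude stand-in for more formal extrapolation diagnostics
and is not the evaluation performed in this study. The predictor values
are hypothetical; only the variable names bio4 and bio19 come from the
text.

```python
def envelope(training_rows):
    """Return {predictor: (min, max)} over the training rows."""
    names = training_rows[0].keys()
    return {
        name: (
            min(row[name] for row in training_rows),
            max(row[name] for row in training_rows),
        )
        for name in names
    }

def coverage(target_rows, env):
    """Fraction of target rows with every predictor inside the envelope."""
    if not target_rows:
        return 0.0
    inside = sum(
        all(env[name][0] <= row[name] <= env[name][1] for name in env)
        for row in target_rows
    )
    return inside / len(target_rows)

# Hypothetical predictor values for species in the training set.
training = [
    {"bio4": 200.0, "bio19": 100.0},
    {"bio4": 300.0, "bio19": 300.0},
    {"bio4": 400.0, "bio19": 200.0},
]

# Hypothetical predictor values for a focal (held-out) species.
focal = [
    {"bio4": 250.0, "bio19": 150.0},  # inside the envelope
    {"bio4": 450.0, "bio19": 150.0},  # bio4 above the training maximum
    {"bio4": 300.0, "bio19": 350.0},  # bio19 above the training maximum
    {"bio4": 390.0, "bio19": 290.0},  # inside the envelope
]

env = envelope(training)
focal_coverage = coverage(focal, env)
print(f"Fraction of focal points inside the training envelope: {focal_coverage:.2f}")
```

Low coverage would indicate that predictions for the focal species rely
on extrapolation and should be interpreted with caution.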
Our results highlight the relevance of balancing the goals of
explanation and prediction in predictive biogeography: by exploring
models with different sets of predictors, we show that environmental
variation best explains genetic differentiation but is not enough to
perform accurate predictions. Additionally, we show how mapping
predictions and the related uncertainty allows for further investigation
of model accuracy over space and gives directions to improve prediction.
Finally, the goal of predicting is readily applicable to conservation
biology. Policies aiming to create a network of preserved areas can use
machine learning algorithms to predict areas of turnover and feed this
information into approaches like systematic conservation planning
(Margules and Pressey 2000, Nielsen et al. 2023). We suggest that, at
least for birds,
morphological traits should be included given their relevance for model
accuracy. When the aim is to make predictions on a focal species based
on all available data for a community, it is necessary to: 1) make sure
the available data have good genetic sampling covering the area in which
the focal species occurs; and 2) include dispersal traits whenever
possible to give
more realistic predictions. Even though demographic traits did not lead
to the highest observed increase in accuracy, they may also be included
especially in species for which population connectivity is thought to be
more correlated with life history strategies such as strong philopatry
or unique social structures (Drake et al. 2022). As our
results show, the use of machine learning approaches in predictive
biogeography gains from incorporating extra predictor information but
careful evaluation is needed to assess what type of information leads to
the highest increase in prediction accuracy.