2.2. Model description
Statistical models are general methods in the study of geography. It is usually built on some theoretical assumptions, and the data need to obey or approximately conform to a specific spatial distribution before the model can obtain credible results. However, ML algorithm is a general approximation algorithm, which generally does not require theoretical assumptions. The spatial analysis algorithm based on ML does not need a prior knowledge but a set of training data to learn the patterns of the geoscience system (Lary et al., 2016). Based on the above characteristics, we chose two statistical models and two ML algorithms to fit the present and future MAGT and ALT in this paper. The generalized linear modeling (GLM) and the generalized additive modeling (GAM) are traditional statistical methods used to simulate the thermal regimes of permafrost (Nan et al., 2002; Zhang et al., 2012a). And the two ML algorithms are the generalized boosting method (GBM) and random forest (RF). In this study, all the four models were executed based on the R software program. The detailed information and characteristics of the models are as follows:
1) Generalized linear model
The generalized linear model (GLM) is an extension of the linear model that can handle nonlinear relationships between explanatory variables and response variables (Nelder and Wedderburn, 1972):
\(g\left\{\mu\left(x\right)\right\} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \ldots + \beta_{i}x_{i}\) (1)
where \(g\left(\mu\right)\) is the link function connecting the estimated mean to the distribution of the response variable (here MAGT or ALT), \(\mu = E\left(y \mid x_{1}, x_{2}, x_{3}, \ldots, x_{i}\right)\), \(E\) is the expected value, \(\beta_{0}\) is the intercept, \(\beta_{i}\) is the regression coefficient to be estimated, and \(x_{i}\) is the predictor. For MAGT and ALT, the GLM was based on first- and second-order polynomials with an identity link function; a minimal sketch of such a fit is given below.
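As an illustration, a GLM of this form could be fitted in base R roughly as follows. This is a minimal sketch, not the authors' exact code: the data frame `obs`, its column names, and the use of second-order terms for every predictor are assumptions.

```r
# Minimal GLM sketch: identity link, first- and second-order polynomial terms.
# `obs` is a hypothetical data frame holding MAGT and the predictors.
glm_magt <- glm(
  MAGT ~ poly(TDD, 2) + poly(FDD, 2) + poly(Sol_pre, 2) + poly(Liq_pre, 2) +
         poly(PISR, 2) + poly(SOC, 2) + poly(Lon, 2) + poly(Lat, 2) + poly(Ele, 2),
  family = gaussian(link = "identity"),  # identity-link function
  data   = obs
)
summary(glm_magt)  # estimated coefficients beta_i of Eq. (1)
```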
2) Generalized additive model
The generalized additive model (GAM) is a semi-parametric extension of the GLM that uses smoothing functions to fit nonlinear response curves to the data (Hastie and Tibshirani, 1986):
\(g\left\{\mu\left(x\right)\right\} = \beta_{0} + f_{1}\left(x_{1}\right) + f_{2}\left(x_{2}\right) + \ldots + f_{i}\left(x_{i}\right)\) (2)
where \(g\left(\mu\right)\) is the link function connecting the estimated mean to the distribution of the response variable (here MAGT or ALT), \(\mu = E\left(y \mid x_{1}, x_{2}, x_{3}, \ldots, x_{i}\right)\), \(E\) is the expected value, \(\beta_{0}\) is the intercept, \(f_{i}\) is a smoothing function for each explanatory variable, and \(x_{i}\) is the predictor. To associate MAGT and ALT with the environmental predictors, the maximum degrees of freedom of each smoothing function were set to three and subsequently optimized by the model-fitting function, as in the sketch below.
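A corresponding GAM fit might look like the following sketch. The use of the mgcv package is an assumption (the text does not name the GAM implementation), the basis dimension `k = 3` mirrors the maximum smoothing setting described above, and the data frame `obs` is hypothetical as before.

```r
library(mgcv)

# Minimal GAM sketch: one smoothing function f_i per predictor (Eq. 2),
# with the basis dimension capped at 3 and optimized during fitting.
gam_magt <- gam(
  MAGT ~ s(TDD, k = 3) + s(FDD, k = 3) + s(Sol_pre, k = 3) +
         s(Liq_pre, k = 3) + s(PISR, k = 3) + s(SOC, k = 3) +
         s(Lon, k = 3) + s(Lat, k = 3) + s(Ele, k = 3),
  family = gaussian(link = "identity"),
  data   = obs
)
summary(gam_magt)  # effective degrees of freedom of each smoother
```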
3) Generalized boosting method
The generalized boosting method (GBM, based on the R package dismo) is a sequential ensemble modeling method that combines a large number of iteratively fitted regression trees into a single model, using cross-validation to estimate the optimal number of trees and thereby improve prediction accuracy (Elith et al., 2008). GBMs automatically incorporate interactions between predictors and are capable of modeling highly complex nonlinear systems (Aalto et al., 2018). The GBMs (with a Gaussian error distribution) were fitted using the gbm.step function, whose main parameters include the learning rate, tree complexity, bagging fraction, and maximum number of trees; an illustrative call is sketched below.
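A gbm.step call of the kind described could look like the following. This is a sketch only: the parameter values shown are common choices in the spirit of Elith et al. (2008), not necessarily those used in this study, and the data frame `obs` is hypothetical.

```r
library(dismo)

# Minimal boosted-regression-tree sketch via dismo::gbm.step, which uses
# internal cross-validation to estimate the optimal number of trees.
gbm_magt <- gbm.step(
  data            = obs,
  gbm.x           = c("TDD", "FDD", "Sol_pre", "Liq_pre", "PISR",
                      "SOC", "Lon", "Lat", "Ele"),  # predictor columns
  gbm.y           = "MAGT",                         # response column
  family          = "gaussian",
  tree.complexity = 5,      # interaction depth of each tree
  learning.rate   = 0.001,  # shrinkage applied to each tree's contribution
  bag.fraction    = 0.75,   # fraction of data resampled at each iteration
  max.trees       = 10000
)
```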
4) Random forest
Random forest (RF, implemented in the R package randomForest) is an ensemble ML algorithm that builds a "forest" from a large number of regression trees. The model uses bootstrap sampling to draw multiple samples from the original data, fits a decision tree to each sample, and then aggregates the predictions of the individual trees (by averaging, for regression) to obtain the final result. The model is characterized by broad applicability, effective avoidance of over-fitting, and insensitivity to missing data and multicollinearity (Breiman et al., 2001; Hutengs and Vohland, 2016). It is an effective empirical approach for nonlinear regression problems, and its value has been demonstrated by a large number of applications in the Earth system (Lary et al., 2016). A minimal fitting sketch follows.
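An RF fit of this kind could be sketched as follows, using the randomForest package named above; `obs`, `ntree = 500`, and the formula interface are illustrative assumptions rather than the authors' exact configuration.

```r
library(randomForest)

# Minimal random-forest sketch: an ensemble of regression trees, each grown
# on a bootstrap sample of the observations; predictions are averaged.
rf_magt <- randomForest(
  MAGT ~ TDD + FDD + Sol_pre + Liq_pre + PISR + SOC + Lon + Lat + Ele,
  data       = obs,
  ntree      = 500,   # number of trees in the "forest"
  importance = TRUE   # track variable importance
)
print(rf_magt)  # out-of-bag error summary
```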
To study the effects of predictors on MAGT and ALT, our models were designed using the following specifications:
MAGT \(= f_{1}\left(\text{TDD}\right) + f_{2}\left(\text{FDD}\right) + f_{3}\left(Sol\_pre\right) + f_{4}\left(Liq\_pre\right) + f_{5}\left(\text{PISR}\right) + f_{6}\left(\text{SOC}\right) + f_{7}\left(\text{Lon}\right) + f_{8}\left(\text{Lat}\right) + f_{9}\left(\text{Ele}\right)\) (3)
ALT \(= f_{1}\left(\text{TDD}\right) + f_{2}\left(\text{FDD}\right) + f_{3}\left(Sol\_pre\right) + f_{4}\left(Liq\_pre\right) + f_{5}\left(\text{PISR}\right) + f_{6}\left(\text{SOC}\right) + f_{7}\left(\text{Lon}\right) + f_{8}\left(\text{Lat}\right) + f_{9}\left(\text{Ele}\right)\) (4)
The independent variables in these equations are the same, while the corresponding \(f_{i}\left(x_{i}\right)\) in each equation differs by model. To take full account of the respective strengths and weaknesses of the four models and to reduce uncertainty, we also used an ensemble approach, which takes the average of the four models' predictions as a fifth result. The optimal model was determined by comparing the key performance metrics of the five sets of results. Model performance was assessed using a repeated cross-validation (CV) scheme: based on a total of 84 boreholes and 70 ALT observation sites, each model was fitted 10 times, each time calibrated on a random sample of 90% of the observations and validated on the remaining 10%. After each CV run, the predicted and observed values of MAGT and ALT were compared for all models in terms of the root-mean-square error (RMSE), mean difference (cf. bias), and R-squared (R²). A sketch of this procedure is given below.
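The repeated 90/10 CV and ensemble averaging could be implemented along the following lines. This is hypothetical code: `fit_models()` stands in for the four fitting calls sketched above (returning one column of test-set predictions per model), while the error metrics follow the definitions used in this study.

```r
set.seed(1)
n_runs  <- 10
metrics <- data.frame()

for (run in seq_len(n_runs)) {
  # Random 90/10 split of the observation sites.
  idx   <- sample(nrow(obs), size = round(0.9 * nrow(obs)))
  train <- obs[idx, ]
  test  <- obs[-idx, ]

  # preds: matrix with columns "GLM", "GAM", "GBM", "RF" of test-set
  # predictions; fit_models() is a hypothetical wrapper, not a real API.
  preds <- fit_models(train, test)
  preds <- cbind(preds, ENS = rowMeans(preds))  # ensemble = model average

  # Evaluate each of the five result sets against the observations.
  for (m in colnames(preds)) {
    err <- preds[, m] - test$MAGT
    metrics <- rbind(metrics, data.frame(
      run  = run, model = m,
      RMSE = sqrt(mean(err^2)),  # root-mean-square error
      bias = mean(err),          # mean difference (predicted - observed)
      R2   = cor(preds[, m], test$MAGT)^2
    ))
  }
}
# Average metrics over the 10 CV runs, per model.
aggregate(cbind(RMSE, bias, R2) ~ model, data = metrics, FUN = mean)
```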
3. Results