Machine learning models and their descriptions
Decision trees (DT) Decision trees (DT) are simple hierarchical models that can easily be interpreted as an ordinary decision-making chart with a question at each node (e.g. ‘Is feature x below value y?’). These questions are linked in sequence, forming a tree-like structure, until a classification or value is obtained (Elith et al., 2008). Decision tree induction (Quinlan, 1986) is one of the most popular machine learning approaches. From a set of examples, these models successively select the most discriminating feature to be tested at each node (see the decision-tree sketch after this table).
Ensembles (random forests – RF, boosted regression trees – BRT) Ensembles are any group of learning models combined towards a single prediction. All the ensembles reviewed here are based on decision trees, using either bagging or boosting. Random forests (RF) are a ML method consisting of an ensemble (multi-model option with a single output) of DTs (Ho, 1995; Breiman, 2001). RFs reduce the propensity of individual DTs to overfit by performing bagging on both data and features (aka feature bagging, feature randomisation or the random subspace method). Bagging (aka bootstrap aggregation) means that training data are drawn randomly from the original set, with the same point possibly included several times. Likewise, feature bagging means that a random subset of features is generated for each submodel, which reduces correlation between DTs. The end result is a very flexible and powerful method that can be used for both regression (average of DT values) and classification (majority vote of DTs) problems. In tree ensembles, boosting can be performed instead of bagging, creating what are usually called boosted regression trees (BRT). Boosting is a powerful technique and the basis for several ensemble methods, such as adaptive boosting (Freund & Schapire, 1999), gradient boosting (Friedman, 2001) and XGBoost (Chen & Guestrin, 2016). Boosting can be interpreted as a kind of tree-based gradient descent: a first tree minimises cost by extensively testing splits; a second tree is then fitted to the residuals (i.e. the remaining errors, an analogue of the cost) of the first; the residuals of the second tree are calculated, and so forth (Elith et al., 2008) (see the bagging vs. boosting sketch after this table).
Maximum entropy (MaxEnt) Maximum entropy modelling, or MaxEnt, is a ML method based on the maximum entropy principle, which states that, among the distributions that are consistent with what we know, we should choose the distribution whose entropy is highest (Jaynes, 1957). Maximum entropy modelling has been used successfully in a variety of fields, such as word classifiers in natural language processing (see NLP below) and species distribution modelling (Phillips et al., 2006) (a small numerical illustration of the principle is given after this table).
Artificial neural networks (ANN, including multi-layer perceptrons – MLP, deep neural networks – DNN, recurrent neural networks – RNN) Artificial neural networks (also called simulated neural networks, hence ANN) are a ML method inspired by neuron decision-making (McCulloch & Pitts, 1943). An ANN is formed by an input layer, a variable number of hidden layers and an output layer. Each node is a single, albeit simple, neuron, connecting with other neurons to form a ‘brain tissue’–like ANN. In practice, an ANN is a series of nodes whose outputs are given by linear models; the output value of a node is then passed on to other nodes, transformed as needed by an activation function (aka squashing function) that confers non-linearity (IBM, 2022a; Russell & Norvig, 2020). ANNs have multiple variations, of which the most ubiquitous are multi-layer perceptrons (MLP), deep neural networks (DNN) and recurrent neural networks (RNN). DNNs are those that use two or more hidden layers. RNNs have feedback connections among neurons and are especially useful when dealing with data series or data streams (IBM, 2022a; Russell & Norvig, 2020) (a minimal forward-pass sketch is given after this table).
Evolutionary algorithms (EA, including genetic algorithms – GA, genetic programming – GP, and symbolic regression – SR) Evolutionary algorithms (EA) are a class of ML that can be seen as variants of stochastic beam search explicitly motivated by the metaphor of natural selection in biology: there is a population of individuals (solutions), in which the fittest (highest-value) individuals produce offspring (successor solutions), which are combined through crossover with some randomness introduced by mutation (Barricelli, 1954, 1962). There are multiple forms of EA that vary both in the size of their populations and in what each individual represents: in a genetic algorithm (GA), each individual is a string from a finite alphabet (often a Boolean string), much as DNA is a string from the alphabet ‘ACGT’; in genetic programming (GP), an individual is a computer program (Russell & Norvig, 2020). In symbolic regression (SR) (Koza, 1992), each individual is a structure representing an equation built from different components, including variables of interest, algebraic operators (e.g. +, –, ÷, ×), analytic function types (exponential, log, power etc.), constants and other mathematical transformations (Cardoso et al., 2020) (a toy genetic-algorithm sketch is given after this table).
Support vector machines (SVM) Support vector machines work by mapping data to a high-dimensional feature space so that data points can be categorised even when the data are not otherwise linearly separable (Boser, Guyon & Vapnik, 1992). A maximum-margin separator between the categories is calculated so that it minimises generalisation loss by maximising the distance between the separator and a selection of example points (the support vectors). The data can also be transformed, through a function called a kernel, so that the separator can be drawn as a hyperplane, allowing for linear separation of the data in the kernel space. The position of a new record relative to the separator can then be used to predict the group to which it should belong (Russell & Norvig, 2020) (see the kernel SVM sketch after this table).
Natural language processing (NLP) Natural language processing (NLP), or computational linguistics, is a field of ML that is often used for processing and extracting text data. It is therefore used as a way to obtain knowledge from the literature of specific fields, in particular conservation science (for an overview, see Thessen, Cui & Mozzherin, 2012). The field partly owes its creation to linguist Noam Chomsky’s theory of Syntactic Structures, which revolutionised linguistics not only by explaining the acquisition of language by children, but also by being, unlike previous theories, formal enough that it could in principle be programmed. Currently, NLP combines this rule-based modelling of human language with statistical approaches to process human language as text or sound, and to extract meaning, intent or sentiment (IBM, 2022b; Russell & Norvig, 2020) (a minimal text-vectorisation sketch is given after this table).
Bayesian (Bayesian belief networks – BBN, naïve Bayes – NB, Markov chain Monte Carlo – MCMC) Bayesian machine learning methods are based on Bayes’ theorem and the concept of conditional probability (Pearl, 1985). They rely on updating ‘prior’ statistical models with new data, as opposed to hypothesis-driven frequentist statistics, making them very useful in automatic learning systems. Common examples of Bayesian methods include: (1) Markov chain Monte Carlo (MCMC), which can be thought of as a conditional chain of events in which each node or state specifies the values of all variables in the model, and every following state is determined by random changes; (2) Bayesian belief networks (BBN), which are directed graphs made up of nodes annotated with conditional probability relationships and their quantitative probability information; (3) naïve Bayes (NB), models that simplify problems by assuming independence between the variables given the result, even when this is not necessarily the case (hence naïve) (Russell & Norvig, 2020) (a minimal MCMC sketch is given after this table).
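The short Python sketches below illustrate the model families listed above. They are minimal, illustrative examples on synthetic or invented data, assuming scikit-learn, NumPy and SciPy are available; they are not the implementations used in the cited studies. First, a decision tree: each internal node tests one question of the form ‘is feature x below value y?’, and the fitted tree can be printed as a chart of nested questions.

```python
# Minimal decision-tree sketch (assumes scikit-learn; the data are synthetic).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic classification data: 200 samples, 4 features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Each internal node asks a question such as "is feature x below value y?".
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the tree as a readable chart of nested questions.
print(export_text(tree, feature_names=[f"feature_{i}" for i in range(4)]))
```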
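For the tree ensembles, a random forest (bagging) and boosted regression trees (boosting) can be sketched with scikit-learn's stock implementations; the dataset and hyperparameters below are arbitrary choices for illustration.

```python
# Minimal bagging vs. boosting sketch (assumes scikit-learn; data are synthetic).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random forest: bagging over samples and features; prediction = average of trees.
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)

# Boosted regression trees: each tree fits the residuals of the previous ones.
brt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=0)
brt.fit(X_train, y_train)

print("RF  R^2:", round(rf.score(X_test, y_test), 3))
print("BRT R^2:", round(brt.score(X_test, y_test), 3))
```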
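The maximum entropy principle itself can be illustrated numerically: among all distributions for a six-sided die that are consistent with a known mean of 4.5, the sketch below finds the one with the highest entropy. The die and the mean constraint are invented for illustration; this shows the principle behind MaxEnt, not the species-distribution software.

```python
# Maximum entropy principle sketch (assumes NumPy and SciPy; constraint is illustrative).
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)            # a six-sided die
target_mean = 4.5                  # the only thing we "know" about the die

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))   # negative entropy (to be minimised)

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},                  # sums to 1
    {"type": "eq", "fun": lambda p: np.sum(p * faces) - target_mean},  # known mean
]
result = minimize(neg_entropy, x0=np.full(6, 1 / 6), bounds=[(0, 1)] * 6,
                  constraints=constraints)

# Among all distributions with mean 4.5, this is the one with the highest entropy.
print(np.round(result.x, 3))
```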
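A feed-forward network can be sketched in a few lines of NumPy: each layer is a linear model followed by an activation function, and two hidden layers make it a (very small) DNN. The weights here are random, so the output is purely illustrative.

```python
# Minimal feed-forward network sketch (NumPy only; random weights, purely illustrative).
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # Activation ("squashing") function that confers non-linearity.
    return np.maximum(z, 0.0)

x = rng.normal(size=4)                          # input layer: 4 features

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # hidden layer 1 (8 neurons)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)   # hidden layer 2 (making this a DNN)
W3, b3 = rng.normal(size=(1, 8)), np.zeros(1)   # output layer

# Each layer is a linear model followed by an activation function.
h1 = relu(W1 @ x + b1)
h2 = relu(W2 @ h1 + b2)
output = 1 / (1 + np.exp(-(W3 @ h2 + b3)))      # sigmoid output, e.g. a probability

print(output)
```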
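A genetic algorithm can be sketched with a toy ‘one-max’ fitness function (individuals are Boolean strings, and strings with more ones are fitter); the population size, mutation rate and other settings are arbitrary.

```python
# Minimal genetic algorithm sketch (NumPy only; the "one-max" fitness is a toy stand-in).
import numpy as np

rng = np.random.default_rng(0)
POP, LENGTH, GENERATIONS, MUTATION = 30, 20, 50, 0.02

def fitness(individual):
    # Toy objective: individuals (Boolean strings) with more ones are fitter.
    return individual.sum()

population = rng.integers(0, 2, size=(POP, LENGTH))
for _ in range(GENERATIONS):
    scores = np.array([fitness(ind) for ind in population])
    # Selection: the fittest half produce the next generation.
    parents = population[np.argsort(scores)[-POP // 2:]]
    children = []
    for _ in range(POP):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, LENGTH)           # crossover point
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(LENGTH) < MUTATION    # random mutation
        child[flip] = 1 - child[flip]
        children.append(child)
    population = np.array(children)

print("best fitness:", max(fitness(ind) for ind in population))
```

Replacing the Boolean string with a computer program or an expression tree, and adapting crossover and mutation accordingly, turns the same loop into genetic programming or symbolic regression.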
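A kernel SVM can be sketched on synthetic concentric-circle data, which no straight line can separate in the original space; the RBF kernel and its settings are arbitrary illustrative choices.

```python
# Minimal SVM sketch (assumes scikit-learn; data are synthetic, not linearly separable).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: no straight line separates the two classes in the original space.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel implicitly maps the data to a space where a separating
# hyperplane (maximum-margin separator) exists.
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)

print("test accuracy:", round(svm.score(X_test, y_test), 3))
```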
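As a minimal statistical NLP step, the sketch below turns two invented sentences into bag-of-words count vectors, a common starting point for extracting information from text; real NLP pipelines add much more (tokenisation rules, parsing, embeddings and so on).

```python
# Minimal text-processing sketch (assumes a recent scikit-learn; sentences are invented).
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The lynx population declined after habitat loss.",
    "Habitat restoration increased lynx sightings.",
]

# Bag-of-words representation: each document becomes a vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(counts.toarray())
```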
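Finally, a Metropolis MCMC sampler can be sketched in a few lines: starting from a flat prior over a coin’s bias, each state proposes a random change and is accepted or rejected against the posterior, so the chain’s samples approximate the updated belief. The coin-flip data are invented.

```python
# Minimal Metropolis MCMC sketch (NumPy only; the coin-flip data are invented).
import numpy as np

rng = np.random.default_rng(0)
heads, flips = 14, 20                      # observed data: 14 heads in 20 flips

def log_posterior(theta):
    # Flat prior on theta in (0, 1) plus binomial log-likelihood.
    if not 0 < theta < 1:
        return -np.inf
    return heads * np.log(theta) + (flips - heads) * np.log(1 - theta)

samples, theta = [], 0.5
for _ in range(20000):
    proposal = theta + rng.normal(scale=0.1)   # random change to the current state
    # Accept or reject the proposed state based on the posterior ratio.
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)

# Posterior mean of the coin's bias; the analytic answer is (14 + 1) / (20 + 2) = 0.68.
print(round(np.mean(samples[2000:]), 3))
```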