
What is Machine Learning?

Machine learning is a sub-field of computer science based on the study of pattern recognition, using computer software to identify mathematical rules. Such mathematical rules, in turn, make it possible to predict future outcomes from existing data. Machine learning involves defining an algorithm, training the algorithm on data, and collecting an expected outcome, all by means of a computer. Thus, powerful computers and large datasets, developments of the last couple of decades, are the characteristic features of machine learning; beyond them, machine learning is not essentially different from the statistical methods used in criminal justice over the last century.
A statistical method of estimation may be an expert's forecast of criminal behavior in a neighborhood, a linear projection of crime on a set of variables, an ordinary least squares regression procedure, or a score computed on the basis of a clinical test.\footnote{By expert's forecast I mean risk assessment tools based on psychological science, as well as the heuristics used by criminal justice officials who rely on their experience and knowledge of the field to predict or decide on specific cases. Another type of expert prediction is that of researchers who forecast outcomes based on academic experience. \cite{ridgeway2013pitfalls} counts over-reliance on experts among the most common pitfalls of prediction over recent decades, describing examples from the criminal justice sector.} This basic definition, however, has been reshaped by two technological developments: the availability of large amounts of data related to crime and the surge in computational capability to perform complex and iterative mathematical operations.
In the social sciences, including criminology, research involving statistical analysis has shown a growing reliance on specialized software first developed in the 1960s. Among the most widely known programs are SPSS (Statistical Package for the Social Sciences), first released in 1968 and now developed by the International Business Machines (IBM) Corporation; Stata, created by economist William Gould, founder of Stata Corporation, and released in 1985; and, more recently, R, designed by Ross Ihaka and Robert Gentleman and released in 1993. These three programs are widely used in the social sciences, alongside software more popular in other scientific fields, such as the Statistical Analysis System (SAS), released in 1976; Wolfram Mathematica, released in 1988; and Python, released in 1991. However, SPSS, Stata, and R are arguably the most widely used programs for statistical analysis in criminal justice. SPSS and Stata are paid programs, while R is open-source software, accessible free of charge and crowd-sourced, as any individual can contribute a library or package containing algorithms to perform computations and implement statistical models. A relevant caveat to examining trends in the use of these three programs is that only R fully supports the implementation of machine learning algorithms and, unlike Stata and SPSS, offers steadily improving packages to efficiently conduct complex and recursive computations on big data.\footnote{It is only in the past few years that Stata researchers have paid growing attention to implementations of popular machine learning techniques such as random forests and support vector machines.}
Both developments, namely data availability and computational power, allow for the use of elaborate mathematical formulas, as opposed to simple linear ones, and for a more frequent use of wide and big data to answer policy questions (billions of observations of multiple variables in different formats), as opposed to small data (a manageable number of observations of a few defined variables). The intervention of computer software is the distinctive characteristic of machine learning relative to other terms such as statistical learning and actuarial methods, which I nevertheless use interchangeably in this paper. When necessary, however, I draw important distinctions between cases, as criminal justice scholars are better off acknowledging that the complexity of algorithms, and the potential to train them with real-world data, is rapidly advancing toward statistical models that require computational capabilities superior to current standard practice. In this sense, the term machine learning is closer to deep learning, virtual reality, and artificial intelligence than it is to ordinary least squares regression analysis.
Many critiques of statistical prediction methods are vague about the specific methods researchers employ.
The terms actuarial approaches, statistical learning, and prediction are used liberally when scholars and practitioners blame the black box as the cause of bias, particularly racial bias, in discussions of the potential dangers of using machine learning to support criminal justice decisions.
This type of argumentation, which targets the messenger rather than the message, fits the definition of a logical fallacy in which the undesirable results of the implementation of an algorithm are mistaken for the algorithm itself. In this sense, I argue that statistical methods are sometimes a red herring for unexpected results of policy design and implementation.\footnote{For example, Bernard Harcourt, in his article The Shaping of Chance: Actuarial Models and Criminal Profiling at the Turn of the Twenty-First Century, critiques actuarial models in criminal justice yet nowhere in the text defines the term. It is a nice touch to quote Jean-Luc Godard as a sign of the philosophical tradition supporting his critique; however, ``being slaves of probabilities'' might not be a bad thing when it can outperform human judgment in guaranteeing societal outcomes.}
It is true that the higher complexity and the limited, often null, causal-inference properties of predictive algorithms are problematic. But, as empirical studies have shown, the purpose, structure, validation, and implementation of statistical learning techniques in policy are in no way homogeneous or easily placed in a single category. Additionally, the variety of machine learning methods is broad, some algorithms are relatively easier to understand than others, and, in some cases, explaining the basic two-dimensional underlying model can make the typology of methods manageable and understandable, so that it becomes possible to associate different criminal justice policy questions with specific algorithmic settings. To illustrate what I mean, in the following paragraphs I provide a brief overview of the machine learning methods most widely used in criminology research.
As mentioned, the main distinction between machine learning and traditional statistical learning is the mediation of computers to obtain information from data.
A defining characteristic of machine learning is, therefore, a technological interface to guide the process of learning. According to this definition, the line dividing statistical learning from machine learning is thin, and although there is no single definition, in practice ``mediation of computers'' assumes computational power high enough to scale to and study big data. Machine learning is broadly divided into supervised and unsupervised learning, where the former requires data on an outcome variable (numerical or categorical) and on a set of features, or input variables, associated with that outcome. The latter does not require information about the outcome to guide the process of learning \citep{HTF2001}.
A distinctive trait of the areas where machine learning is applied is the availability of large databases and of what is called wide data; wide data refers to a large number of variables or traits, which in database terms corresponds to the number of columns. In the presence of multidimensional data, techniques to identify associations or select variables are valuable, and statistical learning achieves this goal. A typical statistical learning problem is classification into binary categories, such as having a disease versus not having it, success versus failure, or high versus low risk.
A classification of statistical analysis techniques is provided by \cite{varian2014big}, who identifies four categories (prediction, summarization, estimation, and hypothesis testing) and considers that machine learning is mostly related to prediction. In this paper, I focus on prediction techniques and only briefly mention machine learning methods for classification and visualization, as I consider them practical tools for the exploratory data analysis usually conducted prior to prediction.
Supervised and unsupervised learning, as defined previously, build learning models to either predict or classify (supervised) or to represent associations among variables (unsupervised). Among the most important and widely used families of techniques in statistics, social research, engineering, finance, and artificial intelligence are the following: 1) clustering and principal components analysis, 2) linear methods for variable selection, 3) tree methods, and 4) boosting, bagging, and bootstrapping.
The central objective of supervised learning algorithms for prediction is to produce quality (i.e., precise and unbiased) out-of-sample forecasts of events. For this reason, predictive algorithms in machine learning are usually trained on a subset of the data (the training set), while the remainder (the test set) is used to evaluate the performance of the algorithm and its parameters on out-of-sample observations. Validation and cross-validation are statistical techniques to increase the predictive power of a model so that it performs well when used to forecast out-of-sample events. In addition to validation on the test set, there are two basic protocols to decrease out-of-sample forecasting error, \(k\)-fold validation and holdout approaches, the former being the most common. \(k\)-fold validation works by splitting the data into \(k\) folds (where \(k > 2\)) and testing each recursively to calibrate the parameters. This way of validating the performance of an algorithm allows researchers to take advantage of big datasets and prevents overfitting of the algorithm to the data at hand. There is no general rule as to the proportion of the training set to the validation set, or as to the number of folds to use in multiple-fold validation. As for the holdout, this is typically five to ten percent of the original dataset that is excluded from all analyses and used, once the final model is selected through one of the previous protocols, to fine-tune the parameters. These validation protocols are ways of using large amounts of data and computational capacity to expose the model to many different potential distributions of the observations, calibrate parameters, and achieve good performance outside the original data.
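As a minimal sketch of this workflow, the Python code below, which assumes the scikit-learn library and uses simulated rather than real criminal justice data, holds out a test set, runs 5-fold cross-validation on the training data, and only then reports out-of-sample accuracy.

\begin{verbatim}
# Minimal sketch of train/test splitting and k-fold cross-validation.
# The data are simulated; any tabular dataset with a binary outcome
# could be substituted.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Simulated data: 1,000 observations, 10 features, binary outcome.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 20 percent as a test set; the proportion is a conventional
# choice, not a fixed rule.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: the training data are split into k = 5 folds,
# and each fold serves once as a validation set while the model is fit
# on the remaining folds.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Cross-validated accuracy:", cv_scores.mean())

# Final, single evaluation on the untouched test set.
model.fit(X_train, y_train)
print("Test-set accuracy:", model.score(X_test, y_test))
\end{verbatim}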

Clustering Methods

Visualizing data: clustering and principal components analysis are both canonical examples of techniques used in unsupervised learning to find associations among variables, identify latent variables, and reduce the dimensions of the data. Clustering is a family of methods that group observations or variables according to a given definition of similarity (or dissimilarity); this approach is also used as a way of describing data, as it provides information about whether groups or subgroups exist.
The definition of similarity plays the role of the loss or cost function that is typical in prediction problems, such as the squared distance in the more traditional ordinary least squares regression. Similarity is the central concept of clustering methods. This notion, although mathematically represented, can only be defined based on knowledge of the subject matter and of which available variables are the most important for operationalizing the definition in the algorithm.
For example, while minimizing the distance across observations may be central to one definition of similarity, for others sharing a particular attribute, or producing a specific number of groups, may be key. Based on such a definition, the objective of the computations is to group elements so that pairwise dissimilarities between elements within a group are smaller than those between elements of different groups.
Clustering a large number of variables and observations quickly becomes complex. There are several popular algorithms to manage the grouping, namely k-means, k-medoids, proximity matrices, nearest neighbors, self-organizing maps, spectral clustering, and others. These algorithms can be classified into three types according to their assumptions about how the elements to be grouped are distributed: combinatorial, mixture modeling, and mode seeking. Combinatorial algorithms do not assume any particular probability distribution for the observations; mixture modeling assumes that the observations are independent and identically distributed draws from a mixture of densities; and mode seeking is a non-parametric form of clustering that looks for modes of the estimated probability density function.
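To make the notion of similarity concrete, the sketch below, which again assumes scikit-learn and simulated two-dimensional data rather than any dataset cited here, applies k-means, perhaps the most common combinatorial algorithm: each observation is assigned to the group whose centroid is closest in squared Euclidean distance, and the number of groups is an assumption the analyst must supply.

\begin{verbatim}
# Minimal k-means sketch on simulated data; squared Euclidean distance
# to the cluster centroid serves as the (dis)similarity measure.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Simulated data: 300 observations in two dimensions, drawn around three
# centers (standing in for, say, geocoded incident locations).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# The analyst chooses the number of clusters; here k = 3 is assumed.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Cluster sizes:", np.bincount(kmeans.labels_))
print("Cluster centroids:\n", kmeans.cluster_centers_)
\end{verbatim}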
Clustering methods are well suited to the analysis of geographically based data. In criminal justice, clustering is a popular technique for detecting hotspots or for deploying resources based on selected characteristics of neighborhoods in order to prevent crime. \cite{eck2015crime} conducted a review of 14 studies using different clustering techniques and concluded that mapping crime patterns is helpful to prevention strategies, as the benefits of policies diffuse to unprotected locations. This diffusion suggests that using clustering to adapt policies by geographic location is an efficient approach to prevention. One example of this approach to policy is illustrated by \cite{singh2006hierarchical}, who used hierarchical clustering techniques to identify and map areas of high crime concentration and high risk of crime in order to design a prevention program targeting those spots and specific crimes.
Clustering techniques are complex, and the definition of similarity is critical when applying such algorithms to policy, as illustrated by \cite{grubesic2006application}. The author applied several clustering techniques to crime data from Cincinnati, Ohio, to identify areas of high crime concentration, and concluded that fuzzy clustering is the appropriate approach to delineate hotspots in urban settings. Grubesic also showed that the spatial configuration of crime in his analysis changed with different algorithm specifications, demonstrating how different concepts of similarity and grouping produce varied results.

Principal Components Analysis

Principal components analysis, or PCA, is a technique that creates new variables, the so-called components, from existing ones. PCA computes the eigenvectors and eigenvalues of the covariance (or correlation) matrix of the original, typically correlated predictors; the components are linear combinations of those predictors, constructed to be orthogonal to one another, and they are helpful for reducing the number of dimensions of a dataset. Applying the technique involves matrix operations that decompose the covariance matrix of the original variables into its eigenvalues and eigenvectors; the new variables, or ``components,'' are then obtained by multiplying the original variables by the eigenvectors. PCA techniques are a common way of addressing multicollinearity.
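In matrix terms, writing \(\Sigma\) for the covariance matrix of the centered predictors, the decomposition behind PCA can be sketched as
\[
\Sigma = V \Lambda V^{\top}, \qquad z_i = V^{\top}(x_i - \bar{x}),
\]
where the columns of \(V\) are the eigenvectors (the loadings of the components), \(\Lambda\) is the diagonal matrix of eigenvalues (the variance captured by each component), and \(z_i\) collects the component scores for observation \(i\); keeping only the columns of \(V\) associated with the largest eigenvalues reduces the dimension of the data.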
It can validly be argued that the components have no practical meaning when talking about policy, as they are mixtures of the original variables; however, these operations allow researchers to identify the combinations of original variables that capture the greatest variance across observations and to summarize them in a few ``components'' that, taken together, yield a more compact measure of variation in a dataset. Examples of applications of PCA techniques in criminal justice are numerous. Among them, \cite{cooper2016risk} explore how race and ethnicity variables affect drug consumption and criminal activity in different groups, and \cite{ayoola2015estimation} estimate crime rates in Nigeria using PCA techniques.
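As a brief illustration, the sketch below, which assumes scikit-learn and simulated correlated predictors, standardizes the variables, fits PCA, and reports the share of total variance captured by each component; in applied work the first few components would then serve as lower-dimensional summaries or as regressors.

\begin{verbatim}
# Minimal PCA sketch: standardize simulated, correlated predictors and
# inspect how much variance each component captures.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Simulated data: 500 observations of 6 correlated variables built from
# 2 latent factors plus noise.
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))

# Standardizing puts all variables on the same scale before PCA.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=6).fit(X_std)
print("Variance explained by each component:",
      np.round(pca.explained_variance_ratio_, 3))

# Keep the first two components as a reduced representation of the data.
scores = pca.transform(X_std)[:, :2]
print("Reduced data shape:", scores.shape)
\end{verbatim}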

Linear Methods

Linear models for regression or classification work under the assumption that, whatever the underlying data generation process of the subject of study, it can be represented as linear. The simplicity of this type of model makes it powerful; under certain conditions it can outperform complex non-linear algorithms in regression and classification tasks \citep{HTF2001}. Linear regression, a method of the pre-computer era, is the most popular among social researchers. The parsimony of its structure and its power to identify causal effects once its assumptions are met make this approach the preferred one among social scientists, including criminologists. However, satisfying the assumptions of a causal model is a non-trivial endeavor, as in criminology data are rarely normally distributed and predictors are rarely uncorrelated. As Richard Berk puts it, the practical scope of regression analysis in criminology might be overestimated \citep{berk2010reg}.
\citet{varian2014big} considers that issues raised by big datasets require different tools. The author specifically mentions data manipulation tools such as computational software; mathematical algorithms for variable selection, given the greater availability of potential predictors for estimation; and techniques to model non-linear relationships, all of which machine learning offers.

Ordinary Least Squares

Among linear methods for regression, three categories stand out: least squares, subset selection, and shrinkage estimation. These three categories have in common an optimization problem that typically consists of minimizing a given loss function. In the case of least squares, the objective is to minimize the sum of squared errors across observations. This model is widely used, but it has two main limitations. The first is low prediction accuracy due to the low bias but high variance of the coefficient estimates. The second concerns the problem of interpreting coefficients in the presence of multiple predictors \citep[p. 56]{HTF2001}.
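In formula form, with outcome \(y_i\) and predictors \(x_{i1}, \ldots, x_{ip}\), the least squares criterion can be sketched as
\[
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 ,
\]
that is, the coefficients are chosen to minimize the sum of squared differences between observed and fitted values.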
However, ordinary least squares regression is still the most widely used method in practice, as illustrated by the share of academic articles published in journals that rely on it. Given the limitations of this approach mentioned above, it is important to consider the expectations and scope of analyses based on OLS regression. \cite{berk2010reg} characterizes three levels of possible regression analysis. Level I, descriptive regression analysis, is an exploratory exercise with no assumptions about the data generation process that simply describes patterns and relations observed across variables. Level II refers to inferential statistical analysis, which requires a well-defined population and a sample obtained through a probability sampling technique, where the probability of each observation being selected is known; this type of analysis uses hypothesis testing, confidence intervals, and estimation of key parameters to add statistical inference to the description, but does not convey any causal finding. Finally, Level III, or causal regression analysis, requires a model specification with very little room for error, and compliance with strong assumptions about the data generation process, in order to make causal claims about the regression parameters. Model specification and assumptions are necessary to make this type of analysis work, but in criminology such analysis is uncommon given the nature of the data.
From a similar perspective, \cite{varian2014big} considers linear regression a conventional statistical technique that has been widely used in the social sciences but whose relevance has changed substantially as the availability of new data and computational power has fostered the development of new algorithms for analyzing data. The author reviews the main machine learning algorithms and gives examples of their comparative advantages in identifying and modeling non-linear relationships among variables.

Logistic Regression

Logistic regression is a powerful tool to model the probability of \(K\) possible outcomes using a linear function, while ensuring that the probabilities add up to 1 and remain in the range \([0,1]\). When \(K = 2\), typically a binary classification problem, the model is simple, with a single linear function. Variations of logistic regression, such as sparse logistic regression in its binomial or multinomial version, have proven useful in complex classification problems and text analysis \citep{sculley2011results}.
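A sketch of the underlying transformations, for the binary case and for the general multinomial case, is
\[
\Pr(Y = 1 \mid x) = \frac{1}{1 + e^{-x^{\top}\beta}}, \qquad
\Pr(Y = k \mid x) = \frac{e^{x^{\top}\beta_k}}{\sum_{l=1}^{K} e^{x^{\top}\beta_l}},
\]
where the logistic function (left) and its multinomial generalization (right) guarantee that the estimated probabilities lie in \([0,1]\) and sum to one; in the multinomial case one class is usually taken as the reference category to identify the parameters.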

Subset Selection

One potential solution to the problem of multiple predictors is subset selection, in which the least squares technique is applied only to a subset of selected variables. Several techniques exist for this purpose, such as best-subset selection, forward- and backward-stepwise selection, and forward-stagewise regression.
A best-subset selection method finds, for each size \(k\), the subset of \(k\) predictors that yields the smallest residual sum of squares; this procedure can be applied efficiently for a number of predictors \(p\) of up to around 40 \citep{HTF2001}. The choice of \(k\) is usually a trade-off between bias and variance, as well as the researcher's preference for parsimony, but typically the \(k\) that minimizes the expected prediction error is chosen. To find this \(k\), the algorithm searches through all possible combinations of predictors, ordering the candidate sets.
When the number of predictors is substantially above 40, forward-stepwise selection can be applied; this method builds a model by adding one variable at a time, starting with the intercept and then adding, at each step, the variable that most improves the fit. Conversely, the backward-stepwise technique starts with the full model and excludes, one at a time, the variables that least affect the fit. When there is a large number of variables these methods are computationally heavy, but they offer several candidate models along the trade-off between explanatory power and parsimony. Because these methods add or remove one variable at a time, in a discrete process, they can lead to high variance, affecting the overall prediction accuracy of the final model. This is one of the reasons why shrinkage methods, which reduce the contribution of variables in a continuous fashion, might be preferred.
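A rough sketch of forward-stepwise selection is given below; it assumes a recent version of scikit-learn (whose SequentialFeatureSelector implements this greedy search) and simulated data, and the stopping point of five variables stands in for the researcher's trade-off between fit and parsimony.

\begin{verbatim}
# Minimal forward-stepwise selection sketch: greedily add, one at a time,
# the variable that most improves cross-validated fit.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Simulated data: 200 observations, 20 candidate predictors,
# only 5 of which actually carry signal.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,   # the researcher's parsimony choice
    direction="forward",      # "backward" would start from the full model
    cv=5)
selector.fit(X, y)

print("Selected predictor indices:", selector.get_support(indices=True))
\end{verbatim}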

Shrinkage: Ridge and Lasso

Selecting a few variables has the advantage of providing a more parsimonious model that is easier to interpret and often predicts more accurately than a full, larger model. Instead of selecting one variable at a time to achieve a more parsimonious model, shrinkage methods introduce a penalty on the size of the regression coefficients. The penalty parameter, usually denoted by \(\lambda\), determines the amount of shrinkage. There are two common shrinkage techniques, ridge and lasso.
Ridge regression adds to the sum of squared residuals a penalty proportional to the L2 norm of the coefficients (the sum of their squares), which pushes the estimators towards zero and towards each other; the intercept is not penalized \citep{hoerl1970ridge}. A similar approach is lasso regression, a method that also shrinks coefficients, with the important difference that it applies a penalty of a different form, the L1 norm (the sum of the absolute values of the coefficients); this makes the solution non-linear in the outcome and turns the estimation into a quadratic programming problem, but it also allows some coefficients to be set exactly to zero, thereby performing variable selection. An alternative approach to dealing with a large number of parameters and multicollinearity is to use principal components as regressors.
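Keeping the notation of the least squares criterion above, the two penalized objectives can be sketched as
\[
\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2},
\]
\[
\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
\]
where larger values of \(\lambda\) shrink the coefficients more aggressively, and the absolute-value penalty of the lasso is what allows some coefficients to reach exactly zero.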

Tree Methods

Decision trees are a tool for classification problems, where the objective is to predict a 0-1 outcome or, in other words, to place an observation in one of two mutually exclusive categories. Examples of classification problems include placing individuals into high-risk or low-risk segments, predicting whether an individual will develop a disease, labeling an email as spam or not spam, or other discrete categories. Although trees are mostly used for classification problems, they can also be used in a regression setting, for example by using the leaves or branches as variables.
The classification task is based on a set of predictors and might be carried out via a logit or probit model.
However, an alternative to these methods is to grow a tree classifier that models a sequence of partitions. While a partition can only handle two variables, a tree manages an unlimited number of predictors and, what is more important, there are efficient computational ways to carry out this process.
This method is particularly helpful for settings where there are relevant non-linear relationships and interactions among variables. It also happens to handle missing data very well.
A good example of the application of this technique is the Titanic survival data described by \cite{varian2014big}, in which a classification tree shows that extreme age, namely being very young or very old, was decisive for survival, whereas for passengers in the middle of the age distribution, variables other than age played a more important role. These kinds of insights are easily extracted by a tree algorithm and are not easily revealed otherwise. An example that I particularly like for its policy implications is the analysis conducted by \cite{varian2014big} using the same dataset that \cite{munnell1996mortgage} used to analyze the effects of the Home Mortgage Disclosure Act, enacted in 1975, on the access of low-income individuals to the housing market.
In their paper, using a logistic regression approach, \cite{munnell1996mortgage} concluded that minorities were more than twice as likely to be denied a mortgage as whites; therefore, they maintain, ``race continued to play an important [...] role in the decision to grant a mortgage.'' Yet, several years later, Varian used a random forest, a variation of the tree methods described in previous paragraphs, to analyze the data on which the original article was based, and reached a somewhat different conclusion: race was not the first but the second most important variable in explaining the difference in mortgage credits granted between the two groups. It turned out that ``dmi,'' or denial of mortgage insurance, a previous step in the process of applying for a mortgage credit, was the first predictor, which improved the prediction accuracy of the random forest model by 10 percent, adequately classifying 223/2380 cases.
The authors' account of this variable suggests that previous banking history, as well as other predictors of economic stability, was generally more favorable among whites than among minorities, which can be attributed to general economic trends and underlying dynamics rather than to the specific event of insurance authorization. It might be argued that race, again, was the main predictor of denial of an insurance that later in the process would influence the decision to approve or deny a mortgage; however, in terms of tracking the process by which minority individuals were denied mortgage credit, it is salient that the influence of race is partly exogenous, as insurance is necessary to access a housing credit.
While a single decision tree is relatively easy to interpret, techniques such as random forests, in which hundreds of random trees are created and de-correlated, are more difficult to make sense of, partly because of the non-linearities and interactions between variables that end up in different partitions. Another potential issue with tree methods is that they tend to overfit the data; for this reason, pruning the tree is important, and it is the researcher's decision where to stop partitioning so that the resulting classification algorithm can be used to analyze new data efficiently \citep{HTF2001}.
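As a sketch of the basic workflow, the code below, which assumes scikit-learn and simulated binary-outcome data, grows a single depth-limited classification tree (a crude stand-in for pruning, chosen here for readability) and then a random forest of de-correlated trees; the forest typically predicts better, at the cost of the single tree's interpretability.

\begin{verbatim}
# Minimal sketch: a single, depth-limited classification tree versus a
# random forest of many de-correlated trees, on simulated data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A shallow tree: max_depth limits the number of partitions.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree))                       # readable if-then rules
print("Single tree accuracy:", tree.score(X_test, y_test))

# A random forest: hundreds of trees grown on bootstrap samples, each
# split considering a random subset of predictors.
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(
    X_train, y_train)
print("Random forest accuracy:", forest.score(X_test, y_test))
print("Variable importances:", forest.feature_importances_.round(3))
\end{verbatim}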

Boosting, Bagging and Bootstrapping

Boosting is a relatively new method, introduced in the last couple of decades, mostly used to transform a weak learning algorithm into a strong one by repeatedly fitting it to reweighted versions of the training dataset and combining the resulting classifiers through a weighted majority vote to produce a final prediction. One of the best-known algorithms of this type is Adaptive Boosting, or AdaBoost \citep{freund1999short}.
Bagging stands for ``bootstrap aggregating,'' a method that uses bootstrap samples to train an algorithm and averages the results obtained for each sample to produce a classifier with reduced noise. This approach can be applied to virtually any algorithm to produce a model with parameters calibrated from as many bootstrap samples as computational capacity allows. It is particularly useful for unstable procedures such as tree methods \citep{breiman1996bagging}.
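The sketch below, which again assumes scikit-learn and simulated data, illustrates both ideas: bagging aggregates trees fit on bootstrap samples of the training data, while AdaBoost refits a weak learner, here a one-split ``stump,'' to reweighted versions of the data and combines the results by weighted majority vote.

\begin{verbatim}
# Minimal sketch of bagging and boosting applied to tree classifiers
# on simulated data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)

# Bagging: each of 200 trees is trained on a bootstrap sample and their
# predictions are aggregated by majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                            random_state=0)

# AdaBoost: a weak learner (a depth-1 "stump") is refit to reweighted
# data, with misclassified observations receiving larger weights.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=200, random_state=0)

for name, model in [("bagging", bagging), ("AdaBoost", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "cross-validated accuracy:", scores.mean().round(3))
\end{verbatim}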