# Statistics and Machine Learning[^ref]

By Julien Corriveau-Trudel

In medical and biological research, discovering and inferring new information about the underlying process of a phenomenon and predicting future outcomes are both critical objectives that usually go hand in hand. Identifying the timing and triggers of the spread of a carcinoma and then projecting where it will spread next is one example of a context where these two objectives are pursued conjointly. Mathematical and computational considerations are important to complete these tasks.

Historically, the domain of statistics offered a paradigm to answer both inference and prediction, usually under specific statistical assumptions. More recently, Machine Learning (ML) has emerged as a domain of computer science that initially tackled the development of computer systems that could "*learn*". Today, it offers algorithmic methodology for prediction and other learning tasks, such as pattern recognition and image identification. Both fields offer considerable knowledge in *how to approach* data analysis, and they are not mutually exclusive.

However, identifying which methods fall in the statistical domain and which in machine learning is a battle lost in advance. The line dividing these two domains is not traced around sets of methods or techniques; there is too much overlap for this. Take linear regression, for example: a very simple model claimed by both fields and used for both inference and prediction. On one side, the statistician will assume a linear model on the data-generating process. From there, this person can infer multiple statistical propositions about the population, such as rejecting the hypothesis that the linear relationship is non-significant. The validity of these propositions is tied to the validity of the model as a surrogate of the underlying data distribution. On the other side, an ML approach, with or without supposing an underlying data-generating model, could verify empirically whether a linear model is a good predictor for future values of the dependent variable in the context. That said, an ML specialist will use a statistical model if its use is justified in the context.

So if statistics and machine learning are not distinguished solely by the methods used in each domain, how can we characterize the distinctions between the two? In what follows, I offer an answer to this question by approaching each field through its *defining question*, then describing both approaches on a very simple case study to better illustrate the distinction.

## Defining questions

To describe a scientific domain, it is efficient to define it by stating its fundamental question(s). The defining question of ML can be summarized as:

> "How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all **learning** processes?" [^Mitchell]

This question encompasses a daunting number of tasks: building a computer program that parses historical medical records to predict how a patient will respond to a particular treatment, developing software that automatically recognizes a cancerous tumour in lymphoid tissue on an MRI, teaching a computer how to play chess or Dota 2, and so on. More formally, **learning** is defined with respect to a certain task $\mathcal{T}$, performance metric $\mathcal{P}$ and type of experience $\mathcal{E}$.
We say that a machine **learns** if the system improves its performance $\mathcal{P}$ at task $\mathcal{T}$, following experience $\mathcal{E}$. Needless to say, every ML task requires data. ML, as a branch of Computer Science, is the "go to" when an application is too complex for developers to manually design an explicit algorithm for it, or when the software must customize itself to its operational environment after it is fielded.

A task of ML that is particularly relevant for healthcare research is **Supervised Learning** (SL). Roughly, SL deals with *prediction*, and does so by searching for generalizable predictive patterns. Compared to statistics, ML places less importance on assumptions about the data-generating process, as we will illustrate later. The validity of an algorithm or method is usually based on *empirical accuracy*, not on the validity of assumptions.

On the other side, statistics has a different question to answer:

> "What can be inferred from data plus a set of modeling assumptions; with what reliability?" [^Mitchell]

Statistics is the scientific field of studying quantitative information under uncertainty. It usually does so by first producing a set of mathematical assumptions about the underlying data-generating process, called a **statistical model**, then by deriving conclusions from these assumptions and the data at hand. With small quantities of data and a simple underlying model, statistical methods shine: classical methods require little computational power under these conditions. However, with the advent of computers and electronic databases, the amount of data available for analysis grew, and statistical methods that required heavier computational power were developed. To address this growing tendency, in 1977 the International Statistical Institute (ISI) founded the International Association for Statistical Computing (IASC) as a section of the institute. This is to say that computational aspects are not restricted to the domain of ML.

All in all, statistics and ML differ in their objectives: "Statistics draws population inferences from a sample, and ML finds generalizable predictive patterns." [^Bzdok_et_al] As we said earlier, it is tortuous to divide them at the level of algorithms and methods; they differ instead in their goals. In the next section, we describe some common ground and disentangle the two approaches and their goals in a simple context: linear regression.

## An example on common grounds...

To illustrate both stances more concretely, we describe linear regression using the paradigms of both fields. Note that this example shows a setting where statistics and ML overlap; there are many contexts where one of these fields would be more relevant than the other.

| $X_1$ | $X_2$ | $X_3$ | $\ldots$ | $X_p$ | $Y$ |
| --- | --- | --- | --- | --- | --- |
| 0.8 | 16.0 | 0.09 | $\ldots$ | -0.45 | 0.9 |
| 1.3 | 18.9 | 0.76 | $\ldots$ | -0.03 | 0.7 |
| 3.1 | 25.1 | 0.15 | $\ldots$ | 0.09 | 0.9 |
| 2.8 | 11.5 | 0.77 | $\ldots$ | 0.25 | 0.4 |
| 0.7 | 14.3 | 0.24 | $\ldots$ | -0.09 | 0.4 |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |

Suppose we have in hand tabular data as shown in the table above. There are $p + 1$ columns, each representing a different variable, and $n$ rows, one for each individual data point. All values are numerical. Some variables are bounded below by $0$; some are not.
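To make this setting concrete, here is a minimal sketch that builds a synthetic dataset of this shape in Python. The sample size, the distributions of the columns and the linear data-generating mechanism are all assumptions made purely for illustration; they are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5  # assumed sample size and number of predictor variables

# Synthetic predictors: a mix of columns bounded below by 0 and unbounded
# columns, loosely mimicking the table above.
X = np.column_stack([
    rng.gamma(2.0, 1.0, size=n),    # X_1, positive
    rng.normal(15.0, 5.0, size=n),  # X_2
    rng.uniform(0.0, 1.0, size=n),  # X_3, in [0, 1]
    rng.normal(0.0, 1.0, size=n),   # X_4
    rng.normal(0.0, 0.3, size=n),   # X_5 (plays the role of X_p)
])

# A hypothetical linear data-generating process for the response Y.
true_w = np.array([0.1, 0.02, 0.3, 0.0, -0.5])
true_b = 0.2
y = true_b + X @ true_w + rng.normal(0.0, 0.1, size=n)

print(X.shape, y.shape)  # (200, 5) (200,)
```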
### ...with Machine Learning

We begin with an ML paradigm: **Supervised Learning**, the study of eliciting prediction functions from labeled "*training*" data. A question that fits this framework is the following:

> Suppose we have an interest in variable $Y$, and, in future settings, we will have access to variables $X_1$ to $X_p$, but not $Y$. How can we predict variable $Y$ in such conditions?

Say there is reason to believe that there is a linear relationship between $(X_1, \ldots, X_p)$ and $Y$. We then choose to optimize a linear predictor to answer this task. This choice is not to be taken lightly, but for the sake of this example, we will keep things simple. The linear functions considered are of the form

$$
f(\mathbf{x}, \mathbf{w}, b) = b + w_1x_1 + \ldots + w_px_p = b + \mathbf{w}^T\mathbf{x},
$$

where $\mathbf{x}$ and $\mathbf{w}$ are vectors of dimension $p$, representing a data point without the variable $Y$ and the **parameters** to adjust, respectively, and $b$ is another parameter to adjust.

To evaluate how good the predictor is and to offer a means to search for better weights $\mathbf{w}$, we use a **loss function**. Different loss functions offer different behaviours. We choose the quadratic loss function: $\ell_2(y_1, y_2) = (y_1 - y_2)^2$. Searching for the best parameters $\mathbf{w}$ and $b$ is then a matter of minimizing the sum of the loss function over the data points, so as to obtain a linear function that approximates the variable $Y$ well for each data point. Mathematically, this is written as searching for $(\mathbf{w}^*, b^*)$ such that

$$
\DeclareMathOperator*{\argmin}{arg\,min}
(\mathbf{w}^*, b^*) = \argmin_{(\mathbf{w},b) ~\in~ \mathbb{R}^{p+1}} \sum^n_{i=1} \left(f(\mathbf{x}_i, \mathbf{w}, b) - y_i\right)^2,
$$

where $(\mathbf{x}_i, y_i)$ is the $i$-th data point. This problem is called **Empirical Risk Minimization**[^Vapnik], and is fundamental to statistical learning. Since this minimization has an analytical solution, we can find the exact solution $(\mathbf{w}^*, b^*)$, which gives the parameters of the predictor function.

It is important to note that there is a statistical framework that justifies supervised learning, which builds on the notion of a minimal statistical model, even if it is not often stated. However, from the point of view of ML, the intention is to make sure that the predictor is good *empirically*.

It is possible to analyze the prediction function to extract more information on the relationship between $(X_1, \ldots, X_p)$ and $Y$. For example, the specialist can retrain the function on different subsets of input variables, compare the differences in predictive power, and so detect which variables are most valuable to the model.
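As a rough sketch of this workflow, the code below minimizes the empirical quadratic risk in closed form (ordinary least squares) and then judges the predictor *empirically* on held-out data. It assumes the synthetic `X` and `y` from the earlier snippet, and the train/test split proportion is an arbitrary choice for illustration.

```python
import numpy as np

# Assumes X (n x p) and y (length n) from the earlier synthetic-data sketch.
rng = np.random.default_rng(1)
idx = rng.permutation(len(y))
split = int(0.8 * len(y))
train, test = idx[:split], idx[split:]

# Empirical Risk Minimization with the quadratic loss has a closed-form
# solution: append a column of ones for the intercept b and solve the
# least-squares problem.
design = np.column_stack([np.ones(len(train)), X[train]])
coef, *_ = np.linalg.lstsq(design, y[train], rcond=None)
b_hat, w_hat = coef[0], coef[1:]

# The ML criterion of success: how well does the predictor do, empirically,
# on data it has never seen?
y_pred = b_hat + X[test] @ w_hat
print("held-out mean squared error:", np.mean((y_pred - y[test]) ** 2))
```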
### ...with Statistics

Now, let's have a different take on the same model, but from a frequentist statistical approach. We suppose that the $n$ data points are sampled independently from a random vector $(X_1, \ldots, X_p, Y)$. As mentioned earlier, we have reason to believe that there is a linear relationship between $(X_1, \ldots, X_p)$ and $Y$. However, our goal here may not only be the prediction of future labels $y_i$, with $i > n$; we may also want to answer other inferential questions about the data-generating process, as we will see below.

We assume that the collected tabular data adhere to the following model, which is the linear regression model: the relationship between $(X_1, \ldots, X_p)$ and $Y$ is a linear function, padded with a random variable of unknown variance, $\varepsilon$, called an error variable. Thus, for $i = 1, \ldots, n$,

$$
\begin{align*}
y_i &= b + w_1x_{i1} + w_2x_{i2} + \ldots + w_px_{ip} + \varepsilon_i \\
&= b + \mathbf{w}^T\mathbf{x}_i + \varepsilon_i,
\end{align*}
$$

where the $\varepsilon_i$ are independent and identically distributed (*iid*) for all $i$, where $(\mathbf{w}, b)$ are parameters of the model, and where $(\mathbf{x}_i, y_i)$ is the $i$-th data point. We also assume that $\mathbf{x}_i$ and $\varepsilon_i$ are uncorrelated.

To elaborate the model further, the statistician may suppose, with justification, that the error $\varepsilon_i$ follows a Gaussian distribution, written $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, also called the normal distribution. Such an assumption on the underlying distribution is often seen in a statistical setting: we enter the world of **parametric** models. A direct consequence of the distributional assumption on the error is that $y_i$ now follows a Gaussian distribution as well, $y_i \sim \mathcal{N}(b + \mathbf{w}^T\mathbf{x}_i, \sigma^2)$, since adding fixed values to a Gaussian random variable keeps it Gaussian, only with a shifted mean. This is a good example of what a parametric model is.

The next task is to fit the relationship as well as possible, by estimating the parameters of the model. Note the slight distinction between parameters in ML and in statistics. In ML, parameters are generally values of a function to adjust, using optimization or statistical estimation. In statistics, parameters are also values to adjust, but they are always ingrained in a statistical model. After choosing a statistical model, the statistician adjusts the parameters of the chosen parametric family underlying the data-generating process. Usually, these parameters can be optimized following different methodologies based on estimation theory: maximizing the likelihood, minimizing the magnitude of the bias, minimizing the Mean Squared Error (MSE) or the variance of the estimator, etc. Very conveniently, estimating the parameters $(\mathbf{w}, b)$ in this setting by maximizing the likelihood gives the same result as searching for the minimum-variance unbiased linear estimator, which is also the same result as the $\ell_2$ loss minimization seen in the ML approach.

Once the parameters are found, the statistical setting offers more than a prediction function. There is a plethora of different inferences that can be made on the model. Here are some examples:

- Giving a confidence interval (CI) on the variance of the error variable $\varepsilon_i$, $\sigma^2$, which signifies that if the experiment were repeated over and over and if the assumptions on the underlying model are good, then $\sigma^2$ would fall in the CI with a chosen probability, the confidence level $1 - \alpha$;
- Testing the marginal contribution of a variable with a hypothesis test of the form
  $$
  \begin{cases}
  \mathbb{H}_0 : w_j = 0 \\
  \mathbb{H}_1 : w_j \neq 0,
  \end{cases}
  $$
  where we either reject the null hypothesis or not;
- Testing whether the linear relationship between $(X_1, \ldots, X_p)$ and $Y$ is significant.
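For comparison with the ML sketch, here is a minimal sketch of the statistical side using the `statsmodels` package (a tool chosen for this illustration, not one mentioned in the text). It fits the same linear model on the synthetic `X` and `y` from above and reports the kinds of inferential quantities listed in the bullets: confidence intervals, per-coefficient tests of $w_j = 0$, and an overall significance test.

```python
import statsmodels.api as sm

# Assumes X (n x p) and y (length n) from the earlier synthetic-data sketch.
X_design = sm.add_constant(X)      # prepend the intercept column for b
fit = sm.OLS(y, X_design).fit()    # least-squares / maximum-likelihood fit

print(fit.params)                  # estimates of (b, w_1, ..., w_p)
print(fit.conf_int(alpha=0.05))    # 95% confidence intervals on the parameters
print(fit.pvalues)                 # tests of H_0: w_j = 0 for each coefficient
print(fit.fvalue, fit.f_pvalue)    # overall test of the linear relationship
print(fit.summary())               # the full inferential report
```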
## Conclusion

The previous example lies in the common ground between statistics and ML, but as we dug into linear regression with both approaches, the distinction between the two fields became clear: they mostly differ in their goals (the why) and sometimes in their methodology (the how). They do not necessarily differ in what model is used (the what).

The domain of statistics offers methodologies to draw conclusions about the "population" from which the data come, such as hypothesis testing or the prediction of future data points that lie outside the initial sample. It usually does so by making assumptions about the underlying data-generating process. Machine learning proposes methods and algorithms that allow a computer to "learn" by itself. Often, it borrows methods from statistics for their predictive power, sometimes with alterations to the methodology in the ML literature. All in all, these fields are definitely tied together, whether we like it or not.

[^ref]: This section is based upon Mitchell[^Mitchell], Bzdok et al.[^Bzdok_et_al], Breiman[^Breiman], and Albers[^Albers].

[^Mitchell]: T. M. Mitchell, *The Discipline of Machine Learning*. Carnegie Mellon University, School of Computer Science, Machine Learning Department, 2006.

[^Bzdok_et_al]: D. Bzdok, N. Altman, and M. Krzywinski, *Statistics versus machine learning*, Nature Methods, vol. 15, no. 4, pp. 233–234, 2018.

[^Breiman]: L. Breiman, *Statistical modeling: The two cultures*, Statistical Science, vol. 16, no. 3, pp. 199–231, 2001.

[^Albers]: C. Albers, *The statistical approach known as machine learning*, Nieuw Archief voor Wiskunde, vol. 5/19, no. 3, pp. 215–217, 2018.

[^Vapnik]: V. Vapnik, *Principles of Risk Minimization for Learning Theory*, 1992.