5.1 How are machine learning models fit?

We already have some insight into how machine learning models are fit. We have previously seen clustering methods, which are unsupervised machine learning models, and we have seen linear regression, which is a simple machine learning model if we set aside its objectives for statistical inference.

Machine learning models are optimization methods at their core. They all depend on defining a “cost” or “loss” function to minimize. For example, in linear regression the difference between the predicted and the original values is minimized. When we have a data set with the correct answers, such as original values or class labels, this is called supervised learning. We use the structure in the data to predict a value, and optimization methods help us pick up the right structure or patterns in the data.

Supervised machine learning methods use predictor variables, such as gene expression values or other genomic scores, to build a mathematical function, or a mapping if you will. This function maps a predictor variable vector or matrix from a given sample to the response variable: labels/classes or numeric values. The response variable is also called the “dependent variable”. The predictions are then simply the output of a mathematical function, \(f(X)\), which takes the predictor variables, \(X\), as input. The variables in \(X\) are also called “independent variables”, “explanatory variables” or “features”. The function also has internal parameters that help map \(X\) to the predicted values.

The optimization works on the parameters of \(f(X)\) and tries to minimize the difference between the function output and the original response variable (\(Y\)), for example \(\sum(Y-f(X))^2\). This is a simplification of the actual “cost” or “loss” function. Especially in classification problems, cost functions can take different forms, but the idea is the same: you have a mathematical expression that you can minimize by searching for the optimal parameter values. The core ingredients of a machine learning algorithm are the same, and they are listed as follows:

  1. Define a prediction function or method \(f(X)\).
  2. Devise a function (called the loss or cost function) that quantifies the difference between your predictions and the observed values, such as \(\sum (Y-f(X))^2\).
  3. Apply mathematical optimization methods to find the best parameter values for \(f(X)\) in relation to the cost/loss function.
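
The three ingredients above can be illustrated with a minimal sketch. It assumes a straight-line prediction function, the squared-error loss from the text, and plain gradient descent as the optimizer. The toy data, the function names (`predict`, `loss`, `fit`), the learning rate and the number of steps are all hypothetical choices made for illustration, not a prescribed implementation.

```python
# Toy data: hypothetical values, roughly y = 1 + 2x plus some noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]


def predict(x, intercept, slope):
    """Ingredient 1: the prediction function f(X), here a straight line."""
    return intercept + slope * x


def loss(intercept, slope):
    """Ingredient 2: the loss, the sum of squared differences (Y - f(X))^2."""
    return sum((y - predict(x, intercept, slope)) ** 2 for x, y in zip(xs, ys))


def fit(learning_rate=0.01, n_steps=5000):
    """Ingredient 3: gradient descent on the two parameters of f(X)."""
    intercept, slope = 0.0, 0.0
    for _ in range(n_steps):
        residuals = [y - predict(x, intercept, slope) for x, y in zip(xs, ys)]
        # Partial derivatives of the squared-error loss w.r.t. each parameter.
        grad_intercept = -2 * sum(residuals)
        grad_slope = -2 * sum(r * x for r, x in zip(residuals, xs))
        # Move each parameter a small step against its gradient.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


intercept, slope = fit()
print(f"intercept={intercept:.3f}, slope={slope:.3f}, loss={loss(intercept, slope):.4f}")
```

For a loss this simple, the optimal parameters can also be computed in closed form, as in ordinary least squares. The iterative search is used here only because the same recipe carries over to prediction functions and losses that have no closed-form solution.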

Similarly, clustering and dimension reduction techniques can use optimization methods, but they do so without having a correct answer to predict or train with. In this case, they find patterns or structure in the data without trying to estimate a correct answer. These patterns can be groupings of samples or variables, such as common gene expression patterns, which can be obtained from dimension reduction techniques such as PCA. In general, dimension reduction algorithms can be thought of as optimization procedures that try to minimize the difference \(X-WH\), measured for example as the sum of its squared entries. Here, \(X\) is our original data set and \(WH\) is the product of two lower-dimensional matrices, \(W\) and \(H\). The optimization procedure hopefully gives us a lower-dimensional space in which we can represent our data without losing too much information.
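
The idea of minimizing \(X-WH\) can be sketched for the simplest case, where \(W\) is a single column and \(H\) is a single row. The example below assumes a squared-error measure of the difference and alternates least-squares updates of \(W\) and \(H\). The data matrix and the names `reconstruction_error` and `factorize_rank1` are hypothetical, and this is only one of several ways such a factorization could be computed.

```python
# Hypothetical 4 x 3 data matrix (rows: samples, columns: variables).
X = [
    [2.0, 4.1, 6.0],
    [1.0, 2.0, 3.1],
    [3.0, 5.9, 9.0],
    [0.5, 1.0, 1.4],
]


def reconstruction_error(X, w, h):
    """Sum of squared entries of X - WH, with W a column (w) and H a row (h)."""
    return sum(
        (X[i][j] - w[i] * h[j]) ** 2
        for i in range(len(X))
        for j in range(len(X[0]))
    )


def factorize_rank1(X, n_iter=50):
    """Alternate least-squares updates of w and h to reduce the error."""
    n_rows, n_cols = len(X), len(X[0])
    w = [1.0] * n_rows
    h = [1.0] * n_cols
    errors = [reconstruction_error(X, w, h)]
    for _ in range(n_iter):
        # Best h for the current w (least-squares solution per column).
        ww = sum(v * v for v in w)
        h = [sum(w[i] * X[i][j] for i in range(n_rows)) / ww for j in range(n_cols)]
        # Best w for the current h (least-squares solution per row).
        hh = sum(v * v for v in h)
        w = [sum(h[j] * X[i][j] for j in range(n_cols)) / hh for i in range(n_rows)]
        errors.append(reconstruction_error(X, w, h))
    return w, h, errors


w, h, errors = factorize_rank1(X)
print(f"error before: {errors[0]:.3f}, error after: {errors[-1]:.4f}")
```

In this sketch the 12 entries of \(X\) are summarized by the 7 numbers in `w` and `h`. The remaining reconstruction error indicates how much information that compression loses.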

5.1.1 Machine learning vs. statistics

Machine learning and statistics are related and sometimes overlapping fields. Statistical inference is the main purpose of statistics. The aim of inference is to find statistical properties of the underlying data and to estimate the uncertainty about those properties. While doing so, however, the field of statistics developed the dimension reduction and regression techniques that are cornerstones of machine learning applications.

Machine learning and statistics share the same overarching goal, which is learning from the data. The difference between the two is that machine learning emphasizes optimization and performance over statistical inference. Statistics is also concerned with performance, but it additionally aims to calculate the uncertainty associated with the parameters of the model. To assess that uncertainty, it tries to model the population statistics from the sample data points. Having said that, many machine learning algorithms, including a couple we will introduce below, were developed by scientists who would describe themselves as statisticians and who work in the statistics departments of universities.