5.2 Steps in supervised machine learning

Many methods are available for supervised learning problems, but whichever method you choose to train, you will need to follow a similar set of steps. These steps are described briefly below, and we will return to each of them in detail later in the chapter:

  • Pre-processing data: We might have to normalize or otherwise transform the data before modeling.
  • Training and test data split: Decide which strategy you want to use to set aside data for evaluation. You need a test set, kept separate from the data used for training, to evaluate your model later on.
  • Training the model: This is where your choice of supervised learning algorithm becomes relevant. “Training” generally means that your data set is used to optimize the loss function in order to find the parameters of \(f(x)\).
  • Estimating performance of the model: This is about which metrics to use to evaluate performance and how to calculate those metrics.
  • Model tuning and selection: We try different values of the tuning parameters and select the best-performing model.
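
The steps above can be sketched in a few lines of R. This is only a minimal illustration, not the workflow used later in the chapter: it assumes the caret package is installed, and the built-in iris data set, the k-nearest neighbors method, the 70/30 split, and the grid of k values are all arbitrary choices made for the example. The split is done before pre-processing here so that the centering and scaling parameters are learned from the training data only.

```r
library(caret)  # assumed to be installed

set.seed(42)    # make the random split and resampling reproducible
data(iris)      # example data: 4 numeric predictors, 3-class response

# Training and test data split: hold out about 30% of samples, stratified by class
trainIdx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
training <- iris[trainIdx, ]
testing  <- iris[-trainIdx, ]

# Pre-processing: learn centering/scaling on the training predictors,
# then apply the same transformation to both sets
pp     <- preProcess(training[, 1:4], method = c("center", "scale"))
trainX <- predict(pp, training[, 1:4])
testX  <- predict(pp, testing[, 1:4])

# Training the model, with tuning and selection:
# 5-fold cross-validation on the training set picks k from a small grid
ctrl <- trainControl(method = "cv", number = 5)
fit  <- train(x = trainX, y = training$Species,
              method = "knn",
              trControl = ctrl,
              tuneGrid = data.frame(k = c(1, 3, 5, 7)))

# Estimating performance: predict on the held-out test set
pred <- predict(fit, newdata = testX)
cm   <- confusionMatrix(data = pred, reference = testing$Species)

fit$bestTune             # the selected value of k
cm$overall["Accuracy"]   # accuracy on the test set
```

The exact numbers depend on the random seed and on the split, so they should be read as an illustration of the procedure, not as a benchmark.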

Many of these steps are identical across supervised learning methods. We will therefore use the caret package, which streamlines these steps and provides a consistent interface to different supervised learning methods. Other packages, such as mlr, provide similar functionality. For now, we will focus on classification models, which are a subset of supervised learning models. In these models, we try to predict a categorical response variable from predictor variables, for example whether or not a patient has the disease, or what type of disease the patient has.
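
To give a rough idea of what a shared interface means in practice, the sketch below fits two different classifiers with the same `train()` call, changing only the `method` argument. It assumes the caret and rpart packages are installed; the iris data set and the two methods ("knn" and "rpart") are chosen only for illustration, and tuning is left at caret's defaults.

```r
library(caret)  # assumed to be installed; "rpart" also needs the rpart package

set.seed(1)

# The same resampling scheme is reused for every method
ctrl <- trainControl(method = "cv", number = 5)

# Two different learning algorithms, one calling convention
methods <- c("knn", "rpart")
fits <- lapply(methods, function(m) {
  train(Species ~ ., data = iris, method = m, trControl = ctrl)
})
names(fits) <- methods

# Best cross-validated accuracy found for each method
bestAcc <- sapply(fits, function(f) max(f$results$Accuracy))
bestAcc
```

Because the response `Species` is a factor, `train()` treats both fits as classification problems and reports accuracy by default.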