5.6 Predicting the subtype with k-nearest neighbors

One of the most intuitive ways to predict a label such as a disease subtype is to look for similar samples and assign our sample the label that those similar samples carry.

Conceptually, k-nearest neighbors (k-NN) is very similar to the clustering algorithms we have seen earlier. If we have a measure of distance between the samples, we can find the nearest \(k\) samples to our new sample and use a voting method to decide on the label of our new sample.
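The distance-plus-voting idea can be sketched in a few lines. This is a language-neutral illustration, not the caret-based R code used in this book, and the toy data and function name below are made up for demonstration:

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, new_point, k=3):
    # Euclidean distance from the new sample to every training sample
    dists = [math.dist(p, new_point) for p in train_points]
    # indices of the k closest training samples
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:k]
    # majority vote among the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy example: two well-separated groups of 1-D measurements
points = [(1.0,), (1.2,), (0.9,), (5.0,), (5.3,), (4.8,)]
labels = ["subtypeA", "subtypeA", "subtypeA",
          "subtypeB", "subtypeB", "subtypeB"]
print(knn_predict(points, labels, (1.1,), k=3))  # -> subtypeA
```

A new point near the first group gets "subtypeA" because all three of its nearest neighbors carry that label; ties can occur when k is even, which is one reason odd values of k are often preferred for two-class problems.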

Let us run the k-NN algorithm with our cancer data. For illustrative purposes, we provide the same data set as both training and test data. Using the training data as the test data shows us the training error or accuracy, which is how the model is doing on the data it was trained with. Below we run k-NN with the caret::knn3() function. The most important argument is k, the number of nearest neighbors to consider; in this case, we set it to 5. We will later discuss how to find the best k.

library(caret)
knnFit=knn3(x=training[,-1], # training set
            y=training[,1], # training set class labels
            k=5)
# predictions on the test set
trainPred=predict(knnFit,training[,-1])