5.10 How to deal with class imbalance

A common hurdle in many applications of machine learning on genomic data is severe class imbalance, meaning a large relative difference in the sizes of the groups being classified. For example, if our example data set were imbalanced, we could have many more CIMP samples in the training set than noCIMP samples, or the other way around. Another example with severe class imbalance is enhancer prediction (Libbrecht and Noble 2015). Depending on which training data set you use, you may have from a couple of hundred to a few thousand positive examples of enhancer locations in the human genome. In either case, the negative set, “not enhancer”, will overwhelm the training, because the human genome is 3 billion base pairs long and most of it does not overlap an enhancer annotation. Whatever strategy you pick to build a negative set, it will contain many more data points than the positive set. As we have mentioned in the model performance section above, if there is a severe imbalance in class sizes, the training algorithm may achieve high accuracy just by calling everything one class. This will be evident in the sensitivity and specificity metrics, and in the related balanced accuracy metric. Below, we will discuss a couple of techniques that might help when the training set has class imbalance.
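To make this concrete, here is a minimal sketch in R using simulated labels (the class counts below are hypothetical, not from our data set). It shows how plain accuracy can look good on an imbalanced test set while sensitivity and balanced accuracy reveal the problem.

```r
# Hypothetical example: 5 "CIMP" and 95 "noCIMP" test samples, and a
# classifier that predicts "noCIMP" for everything.
library(caret)

truth <- factor(c(rep("CIMP", 5), rep("noCIMP", 95)),
                levels = c("CIMP", "noCIMP"))
preds <- factor(rep("noCIMP", 100), levels = c("CIMP", "noCIMP"))

confusionMatrix(preds, truth, positive = "CIMP")
# Accuracy is 0.95, but sensitivity for "CIMP" is 0, and
# balanced accuracy ((sensitivity + specificity) / 2) is only 0.5.
```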

5.10.1 Sampling for class balance

If we think class imbalance is a problem, based on the relative sizes of the classes and the relevant accuracy metrics of a model, there are a couple of things that might help. First, we can use sampling, or “stratified” sampling, when constructing the training set. This simply means that before training we build the classification model on a sample of the data in which the classes have the same size. This could mean down-sampling the class with too many data points; for this purpose, you can simply use the sample() or caret::downSample() function to create your training set prior to modeling. Alternatively, the minority class can be up-sampled to make up the missing number of data points, using sampling with replacement similar to bootstrap sampling, with the caret::upSample() function. There are also more advanced up-sampling methods, such as the synthetic up-sampling method SMOTE (Chawla, Bowyer, Hall, et al. 2002). In this method, each data point from the minority class is up-sampled synthetically by adding variability to its predictor variable vector based on one of the k-nearest neighbors of the data point. Specifically, one neighbor is randomly chosen, and the difference between the neighbor's predictor variables and the original data point's predictor variables is multiplied by a random number between \(0\) and \(1\) and added to the original predictor variables. This creates synthetic data points that are similar, but not identical, to the original data points. This method and other similar synthetic sampling methods are available in the smotefamily package on CRAN.
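The sketch below shows how these functions could be applied. It assumes a hypothetical training data frame called training whose last column is a factor outcome named class; that layout is an assumption for illustration, not part of our example data set.

```r
# Sketch: balance a hypothetical training data frame "training"
# whose last column is a factor outcome called "class".
library(caret)

# down-sample the majority class so both classes have equal size
down_train <- downSample(x = training[, -ncol(training)],
                         y = training$class,
                         yname = "class")

# or up-sample the minority class by sampling with replacement
up_train <- upSample(x = training[, -ncol(training)],
                     y = training$class,
                     yname = "class")

# synthetic up-sampling with SMOTE (smotefamily package);
# SMOTE expects numeric predictors
library(smotefamily)
smote_res <- SMOTE(X = training[, -ncol(training)],
                   target = training$class,
                   K = 5)
smote_train <- smote_res$data  # predictors plus a "class" column
```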

In addition to the strategies above, some methods can do sampling during training to cope with the effects of class imbalance. For example, random forests have a bootstrap sampling step during training, and this step can be altered to do stratified sampling. We will introduce random forests later in the chapter.
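As a hedged illustration, the randomForest package allows this kind of stratified sampling through its strata and sampsize arguments; the data frame and column names below are carried over from the hypothetical sketch above.

```r
# Sketch: stratified sampling inside random forest training, assuming
# a data frame "training" with a factor outcome column "class".
library(randomForest)

n_min <- min(table(training$class))  # size of the smallest class

rf_fit <- randomForest(class ~ .,
                       data = training,
                       strata = training$class,
                       # draw an equal number of cases from each class
                       # for every tree
                       sampsize = rep(n_min, nlevels(training$class)))
```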

However, even if we do the sampling on the training set to avoid these problems, the test set should retain the original class proportions so that we evaluate the performance in a real-world situation.

5.10.2 Altering case weights

For some methods, we can use case weights proportional to the imbalance suffered by the minority class. This means cases from the minority class get higher case weights, which has an effect similar to up-sampling the minority class. Logistic regression-based methods and boosting methods are examples of algorithms that can use case weights; both will be introduced later.
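As a sketch, case weights could be constructed from the class frequencies and passed to glm() for logistic regression; the data frame and column names are again the hypothetical ones used above.

```r
# Sketch: case weights inversely related to class frequency, assuming
# a data frame "training" with a factor outcome column "class".
class_freq <- table(training$class)
wts <- ifelse(training$class == names(which.min(class_freq)),
              max(class_freq) / min(class_freq),  # minority class: higher weight
              1)                                  # majority class: weight 1

# logistic regression with case weights via glm()
# (glm() may warn about non-integer weights; the fit still proceeds)
fit <- glm(class ~ ., data = training, family = binomial, weights = wts)

# many models tuned through caret also accept case weights via the
# "weights" argument of train(), e.g.:
# fit2 <- caret::train(class ~ ., data = training,
#                      method = "glm", weights = wts)
```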

5.10.3 Selecting different classification score cutoffs

Another simple approach for dealing with class imbalance is to select a prediction score cutoff that reduces the excess false positives or false negatives, depending on the direction of the class imbalance. This can be done using ROC curves. For example, the classical prediction cutoff for a two-class classification problem is 0.5. We can alter this cutoff to optimize sensitivity and specificity.
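A minimal sketch with the pROC package is shown below, assuming hypothetical vectors test_labels (true class labels) and test_probs (predicted probabilities for the positive class).

```r
# Sketch: choosing a classification score cutoff from an ROC curve.
library(pROC)

roc_obj <- roc(response = test_labels, predictor = test_probs)

# cutoff that maximizes sensitivity + specificity (Youden's J statistic)
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```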

References

Chawla, Bowyer, Hall, and Kegelmeyer. 2002. “SMOTE: Synthetic Minority over-Sampling Technique.” Journal of Artificial Intelligence Research 16: 321–57.

Libbrecht, and Noble. 2015. “Machine learning applications in genetics and genomics.” Nat. Rev. Genet. 16 (6): 321–32.