Chapter 5 Predictive Modeling with Supervised Machine Learning

In this chapter we introduce supervised machine learning applications for predictive modeling. In genomics, we often face biological questions that must be answered using large amounts of data. Some of these questions fit naturally into the domain of machine learning, where an algorithm learns a mathematical model from the input data in order to make decisions about similar data that the model has not seen before. Often we are trying to predict a medical or biological variable of interest using molecular signatures obtained via genomics methods. To give you a better idea, here are some machine learning applications in genomics:

  • Predicting gene expression from epigenetic modifications (Dong, Greven, Kundaje, et al. 2012).
  • Predicting gene locations (Mathe, Sagot, Schiex, et al. 2002).
  • Predicting enhancer or other regulatory regions (Fernandez and Miranda-Saavedra 2012).
  • Predicting drug response based on genomics (Wang, McLeod, and Weinshilboum 2011).
  • Predicting healthy/disease status or disease subtypes based on genomics (Kourou, Exarchos, Exarchos, et al. 2015).
  • Predicting the effect of SNPs on gene regulation (Zhou and Troyanskaya 2015).
  • Calling SNPs (Poplin, Chang, Alexander, et al. 2018).

Beyond predicting an outcome, machine learning can be used to identify which predictor variables matter most for prediction performance, which often yields biological insight as well. Many machine learning algorithms either have a built-in variable importance measure or can be paired with a model-agnostic variable importance method. For example, we may want to find which epigenetic modifications are most important for gene expression prediction. Although decades of molecular biology research already give a pretty good idea of the answer, we could arrive at similar conclusions by building a machine learning model that predicts gene expression from the histone modifications H3K27ac, H3K27me, H3K4me1, and H3K4me3, together with DNA methylation. We can then use variable importance metrics to check which of these contribute most to the prediction.
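To make the idea of a model-agnostic variable importance method concrete, here is a minimal sketch of permutation importance in plain Python. Everything in it is an assumption made for illustration: the data are simulated, the effect sizes in TRUE_EFFECT are invented and carry no biological meaning, and the helper names (fit_linear_model, mse, permutation_importance) are hypothetical. The linear model stands in for any learner; the importance function only needs a predict function, which is what makes it model-agnostic.

```python
import random

# ASSUMPTION: invented effect sizes for simulated data, not real biology.
TRUE_EFFECT = {"H3K27ac": 1.0, "H3K27me": 0.0, "H3K4me1": 0.0,
               "H3K4me3": 2.0, "DNA_methylation": -0.5}
MARKS = list(TRUE_EFFECT)

def fit_linear_model(X, y, epochs=300, lr=0.05):
    """Fit a linear model by batch gradient descent; returns a predict function."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, target in zip(X, y):
            err = b + sum(wj * xj for wj, xj in zip(w, row)) - target
            grad_b += err
            for j in range(p):
                grad_w[j] += err * row[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return lambda row: b + sum(wj * xj for wj, xj in zip(w, row))

def mse(predict, X, y):
    """Mean squared error of predict on (X, y)."""
    return sum((predict(r) - t) ** 2 for r, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    result = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            increases.append(mse(predict, X_perm, y) - baseline)
        result.append(sum(increases) / n_repeats)
    return result

# Simulate 300 "genes": standardized signals for each mark plus noisy expression.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in MARKS] for _ in range(300)]
y = [sum(TRUE_EFFECT[m] * x for m, x in zip(MARKS, row)) + rng.gauss(0, 0.3)
     for row in X]

# Train on the first 200 genes, measure importance on the held-out 100.
predict = fit_linear_model(X[:200], y[:200])
importances = dict(zip(MARKS, permutation_importance(predict, X[200:], y[200:])))
ranking = sorted(MARKS, key=importances.get, reverse=True)

for mark in ranking:
    print(f"{mark:16s} {importances[mark]:.3f}")
```

Shuffling a feature that the model relies on degrades the predictions, so a larger increase in error indicates a more important variable. Features with no simulated effect should score close to zero. Computing the importance on held-out data, as done here, measures what matters for prediction on unseen samples rather than what the model memorized during training.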

In this chapter, we will show how to use supervised machine learning models to solve problems in genomics. We will go over the general steps of a machine learning application and show how to use some of the most popular supervised machine learning models in practice.


Dong, Greven, Kundaje, et al. 2012. “Modeling gene expression using chromatin features in various cellular contexts.” Genome Biol. 13 (9): R53.

Fernandez and Miranda-Saavedra. 2012. “Genome-wide enhancer prediction from epigenetic signatures using genetic algorithm-optimized support vector machines.” Nucleic Acids Res. 40 (10): e77.

Kourou, Exarchos, Exarchos, Karamouzis, and Fotiadis. 2015. “Machine learning applications in cancer prognosis and prediction.” Comput Struct Biotechnol J 13: 8–17.

Mathe, Sagot, Schiex, and Rouze. 2002. “Current methods of gene prediction, their strengths and weaknesses.” Nucleic Acids Res. 30 (19): 4103–17.

Poplin, Chang, Alexander, et al. 2018. “A universal SNP and small-indel variant caller using deep neural networks.” Nat. Biotechnol. 36 (10): 983–87.

Wang, McLeod, and Weinshilboum. 2011. “Genomics and drug response.” N. Engl. J. Med. 364 (12): 1144–53.

Zhou and Troyanskaya. 2015. “Predicting effects of noncoding variants with deep learning-based sequence model.” Nat. Methods 12 (10): 931–34.