5.3 Use case: Disease subtype from genomics data
We will start our illustration of machine learning with a real dataset from tumor biopsies: the gene expression data of glioblastoma tumor samples from The Cancer Genome Atlas project. We will try to predict a subtype of this disease using molecular markers. This subtype is characterized by large-scale epigenetic alterations called the “CpG island methylator phenotype” or “CIMP” (Noushmehr, Weisenberger, Diefes, et al. 2010). Half of the patients in our data set have this subtype and the rest do not, and we will try to predict which ones have the CIMP subtype. There are two data objects we need for this exercise: one holds the gene expression values per tumor sample and the other holds the subtype annotation per patient. In the expression data set as loaded below, every row is a gene and every column is a tumor sample. There are 184 tumor samples. This data set might be a bit small for real-world applications. However, it is very relevant for the genomics focus of this book, and small data sets take less time to train, which is useful for reproducibility purposes. We will now read these data sets from the compGenomRData package with the readRDS() function.
# get file paths
fileLGGexp=system.file("extdata",
                       "LGGrnaseq.rds",
                       package="compGenomRData")
fileLGGann=system.file("extdata",
                       "patient2LGGsubtypes.rds",
                       package="compGenomRData")
# gene expression values
gexp=readRDS(fileLGGexp)
head(gexp[,1:5])
##          TCGA-CS-4941 TCGA-CS-4944 TCGA-CS-5393 TCGA-CS-5394 TCGA-CS-5395
## A1BG          72.2326      24.7132      46.3789      37.9659      19.5162
## A1CF           0.0000       0.0000       0.0000       0.0000       0.0000
## A2BP1        524.4997     105.4092     323.5828      19.7390     299.5375
## A2LD1        144.0856      18.0154      29.0942       7.5945     202.1231
## A2ML1        521.3941     159.3746     164.6157      63.5664     953.4106
## A2M        17944.7205   10894.9590   16480.1130    9217.7919   10801.8461
dim(gexp)
## [1] 20501   184
# patient annotation
patient=readRDS(fileLGGann)
head(patient)
##              subtype
## TCGA-FG-8185    CIMP
## TCGA-DB-5276    CIMP
## TCGA-P5-A77X    CIMP
## TCGA-IK-8125    CIMP
## TCGA-DU-A5TR    CIMP
## TCGA-E1-5311    CIMP
dim(patient)
## [1] 184   1
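Before any modeling, it is worth checking that the two objects describe the same samples in the same order. The printed output shows genes in rows and samples in columns of the expression matrix, while the annotation table has one row per patient, and the first IDs in the two printouts differ, so the order should not be taken for granted. The following is a minimal sketch of such a check on made-up toy data. The names `toy_gexp`, `toy_patient`, `toy_tgexp`, and `class_counts` are hypothetical, and so is the `"noCIMP"` label, which is a placeholder for the non-CIMP class.

```r
# Toy stand-ins for the two objects; all values and IDs are made up.
set.seed(1)
toy_gexp <- matrix(runif(12, 0, 100), nrow = 3,
                   dimnames = list(c("geneA", "geneB", "geneC"),
                                   c("S1", "S2", "S3", "S4")))

# "noCIMP" is a placeholder label for samples without the CIMP subtype.
toy_patient <- data.frame(subtype = c("CIMP", "noCIMP", "CIMP", "noCIMP"),
                          row.names = c("S3", "S1", "S4", "S2"))

# 1. Every sample in the expression matrix should have an annotation row.
stopifnot(setequal(colnames(toy_gexp), rownames(toy_patient)))

# 2. Reorder the annotation so its rows follow the expression columns.
toy_patient <- toy_patient[colnames(toy_gexp), , drop = FALSE]

# 3. Count samples per class to see how balanced the labels are.
class_counts <- table(toy_patient$subtype)
print(class_counts)

# 4. Transpose so that rows are samples and columns are genes,
#    the layout most modeling functions expect.
toy_tgexp <- t(toy_gexp)
print(dim(toy_tgexp))
```

The same four steps could be applied to the real expression matrix and annotation table once they are loaded. Using `drop = FALSE` keeps the single-column annotation as a data frame rather than collapsing it to a vector.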
References
Noushmehr, Weisenberger, Diefes, et al. 2010. “Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma.” Cancer Cell 17 (5): 510–22.