5.3 Use case: Disease subtype from genomics data

We will start our illustration of machine learning with a real data set from tumor biopsies: gene expression data of glioblastoma tumor samples from The Cancer Genome Atlas project. We will try to predict the subtype of this disease using molecular markers. The subtype of interest is characterized by large-scale epigenetic alterations called the “CpG island methylator phenotype” or “CIMP” (Noushmehr, Weisenberger, Diefes, et al. 2010). Half of the patients in our data set have this subtype and the rest do not, and we will try to predict which ones have the CIMP subtype.

There are two data objects we need for this exercise: one holds gene expression values per tumor sample, and the other holds the subtype annotation per patient. In the expression data set as loaded, every row is a gene and every column is a tumor sample. There are 184 tumor samples and 20,501 genes. This data set might be a bit small for real-world applications; however, it is very relevant for the genomics focus of this book, and small data sets take less time to train, which is useful for reproducibility purposes. Both data sets ship as .rds files with the compGenomRData package, and we will read them now with the readRDS() function.

# get file paths
fileLGGexp=system.file("extdata",
                      "LGGrnaseq.rds",
                      package="compGenomRData")
fileLGGann=system.file("extdata",
                      "patient2LGGsubtypes.rds",
                      package="compGenomRData")
# gene expression values
gexp=readRDS(fileLGGexp)
head(gexp[,1:5])
##       TCGA-CS-4941 TCGA-CS-4944 TCGA-CS-5393 TCGA-CS-5394 TCGA-CS-5395
## A1BG       72.2326      24.7132      46.3789      37.9659      19.5162
## A1CF        0.0000       0.0000       0.0000       0.0000       0.0000
## A2BP1     524.4997     105.4092     323.5828      19.7390     299.5375
## A2LD1     144.0856      18.0154      29.0942       7.5945     202.1231
## A2ML1     521.3941     159.3746     164.6157      63.5664     953.4106
## A2M     17944.7205   10894.9590   16480.1130    9217.7919   10801.8461
dim(gexp)
## [1] 20501   184
# patient annotation
patient=readRDS(fileLGGann)
head(patient)
##              subtype
## TCGA-FG-8185    CIMP
## TCGA-DB-5276    CIMP
## TCGA-P5-A77X    CIMP
## TCGA-IK-8125    CIMP
## TCGA-DU-A5TR    CIMP
## TCGA-E1-5311    CIMP
dim(patient)
## [1] 184   1
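
Because the expression values and the subtype labels live in two separate objects, it is worth confirming that they describe the same samples in the same order before any modeling. The sketch below shows one way to do this. It uses small made-up stand-ins rather than the real objects so that it runs on its own: the names toy_gexp, toy_patient and align_samples(), the sample identifiers S1 to S4, the expression values, and the "noCIMP" label are all assumptions for illustration, not part of the data set or the package.

```r
# Toy stand-ins for gexp and patient (all values and names are made up).
# Rows are genes and columns are samples, as in the real expression matrix.
toy_gexp <- matrix(c(72.2, 0, 524.5,
                     24.7, 0, 105.4,
                     46.4, 0, 323.6,
                     38.0, 0, 19.7),
                   nrow = 3,
                   dimnames = list(c("A1BG", "A1CF", "A2BP1"),
                                   c("S1", "S2", "S3", "S4")))

# Annotation with sample identifiers as row names, deliberately out of order.
# "noCIMP" is a hypothetical label for samples without the subtype.
toy_patient <- data.frame(subtype = c("CIMP", "noCIMP", "CIMP", "noCIMP"),
                          row.names = c("S3", "S1", "S4", "S2"))

# Hypothetical helper: keep only the shared samples and put both
# objects in the same sample order.
align_samples <- function(expr, annot) {
  shared <- intersect(colnames(expr), rownames(annot))
  if (length(shared) == 0) {
    stop("no shared sample identifiers")
  }
  list(expr  = expr[, shared, drop = FALSE],
       annot = annot[shared, , drop = FALSE])
}

aligned <- align_samples(toy_gexp, toy_patient)

# Columns of the expression matrix now line up with annotation rows.
stopifnot(identical(colnames(aligned$expr), rownames(aligned$annot)))

# Class balance of the labels.
table(aligned$annot$subtype)
```

Assuming the real objects follow the layout shown in the output above, the same call would be align_samples(gexp, patient), and table() on the resulting subtype column would show how the 184 samples split between the two classes.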

References

Noushmehr, Weisenberger, Diefes, et al. 2010. “Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma.” Cancer Cell 17 (5): 510–22.