## 5.16 Exercises

### 5.16.1 Classification

For this set of exercises, we will use the gene expression and patient annotation data from the glioblastoma patients. You can read the data as shown below:

library(compGenomRData)
# get file paths
fileLGGexp=system.file("extdata",
                       "LGGrnaseq.rds",
                       package="compGenomRData")
fileLGGann=system.file("extdata",
                       "patient2LGGsubtypes.rds",
                       package="compGenomRData")
# gene expression values
gexp=readRDS(fileLGGexp)

# patient annotation
patient=readRDS(fileLGGann)

1. Our first task is to do classification without any data transformation. Run the k-NN classifier on the data without any transformation or scaling. What is the effect on classification accuracy for k-NN predicting the CIMP and noCIMP status of the patients? [Difficulty: Beginner]

2. Bootstrap resampling can be used to measure the variability of the prediction error. Use bootstrap resampling with k-NN to estimate the prediction accuracy. How do the results differ from those of cross-validation for different values of $$k$$? [Difficulty: Intermediate]

3. There are a number of ways to get variable importance for a classification problem. Run random forests on the classification problem above. Compare the variable importance metrics from random forests with those obtained from DALEX. How many variables are shared between the two top 10 lists? [Difficulty: Advanced]

4. Come up with a unified importance score by normalizing importance scores from random forests and DALEX, followed by taking the average of those scores. [Difficulty: Advanced]
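
As a starting point for the last two exercises, the comparison and averaging steps can be sketched in base R. This is only a minimal sketch under stated assumptions: the two importance vectors below are made-up toy numbers with hypothetical gene names, standing in for the scores you would extract from the random forest fit and from DALEX. Min-max scaling, done here by the hypothetical helper `rescale01()`, is just one of several reasonable ways to normalize.

```r
# Toy importance scores (made-up numbers, hypothetical gene names).
# In the exercise, these would come from the random forest model and from DALEX.
rf_imp    <- c(geneA = 12.0, geneB = 3.0, geneC = 7.5, geneD = 0.0)
dalex_imp <- c(geneB = 0.02, geneA = 0.10, geneD = 0.01, geneC = 0.06)

# Rescale a numeric vector to the [0, 1] range (min-max normalization).
# Assumption: other normalizations (ranks, z-scores) would work as well.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Guard against division by zero when all scores are equal
  if (rng[1] == rng[2]) return(setNames(rep(0, length(x)), names(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

# Align both vectors on the shared variable names before combining
shared     <- intersect(names(rf_imp), names(dalex_imp))
rf_norm    <- rescale01(rf_imp[shared])
dalex_norm <- rescale01(dalex_imp[shared])

# Unified score: average of the two normalized scores, highest first
unified <- sort((rf_norm + dalex_norm) / 2, decreasing = TRUE)

# Overlap between the top-n variables of each method
# (n = 10 in the exercise; 2 here because the toy vectors are short)
top_n        <- 2
top_rf       <- names(sort(rf_imp, decreasing = TRUE))[seq_len(top_n)]
top_dalex    <- names(sort(dalex_imp, decreasing = TRUE))[seq_len(top_n)]
n_shared_top <- length(intersect(top_rf, top_dalex))

print(unified)
print(n_shared_top)
```

The two methods report importance on different scales, so the scores are rescaled before averaging. Otherwise the method with the larger raw values would dominate the unified score.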

### 5.16.2 Regression

For this set of problems we will use the regression data set where we tried to predict the age of the sample from the methylation values. The data can be loaded as shown below:

# file path for CpG methylation and age
fileMethAge=system.file("extdata",
                        "CpGmeth2Age.rds",
                        package="compGenomRData")

ameth=readRDS(fileMethAge)