4.3 Exercises
For this set of exercises we will be using the expression data shown below:
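# locate the example data file shipped with the compGenomRData package
expFile=system.file("extdata",
                    "leukemiaExpressionSubset.rds",
                    package="compGenomRData")
# read the expression matrix
mat=readRDS(expFile)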
4.3.1 Clustering
1. We want to observe the effect of data transformation in this exercise. Scale the expression matrix with the scale() function. In addition, try taking the logarithm of the data with the log2() function prior to scaling. Make box plots of the unscaled and scaled data sets using the boxplot() function. [Difficulty: Beginner/Intermediate]
2. For the same problem above, using the unscaled data and the different data transformation strategies, apply hierarchical clustering with the "ward.D" linkage method and plot multiple heatmaps. You can use the pheatmap library or any other library that can plot a heatmap with a dendrogram. Which data-scaling strategy provides more homogeneous clusters with respect to disease types? [Difficulty: Beginner/Intermediate]
3. For the transformed and untransformed data sets used in the exercise above, use the silhouette to decide the number of clusters in hierarchical clustering. [Difficulty: Intermediate/Advanced]
4. Now, use the gap statistic to decide the number of clusters in hierarchical clustering. Do the two methods identify the same number of clusters? Is it similar to the number of clusters obtained using the k-means algorithm in the chapter? [Difficulty: Intermediate/Advanced]
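As a starting point for the clustering exercises above, the following is a minimal sketch in base R, not a reference solution. It assumes a simulated matrix (sim, with hypothetical group labels A and B) in place of the leukemia data so that it runs on its own, and it computes the average silhouette width by hand in a helper, avg_silhouette(), that is written here only for illustration.

```r
# Simulated stand-in for an expression matrix (genes in rows, samples in columns).
# This is NOT the leukemia data; it only makes the sketch self-contained.
set.seed(42)
group <- rep(c("A", "B"), each = 10)            # hypothetical disease labels
sim <- matrix(rlnorm(200 * 20, meanlog = 5, sdlog = 1), nrow = 200)
sim[1:25,  group == "B"] <- sim[1:25,  group == "B"] * 16  # genes higher in B
sim[26:50, group == "A"] <- sim[26:50, group == "A"] * 16  # genes higher in A
colnames(sim) <- paste0(group, seq_along(group))

# Three versions of the data. scale() centers and scales each column.
versions <- list(
  unscaled   = sim,
  scaled     = scale(sim),
  log_scaled = scale(log2(sim))
)

# Average silhouette width for cluster labels cl and distances d
avg_silhouette <- function(d, cl) {
  d <- as.matrix(d)
  s <- sapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)                # singleton clusters get width 0
    a <- sum(d[i, own]) / (sum(own) - 1)        # mean distance within own cluster
    b <- min(sapply(setdiff(unique(cl), cl[i]),
                    function(k) mean(d[i, cl == k])))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

# Ward-linkage hierarchical clustering of samples, scored for each k
silhouette_by_k <- function(m, ks = 2:6) {
  d  <- dist(t(m))                              # distances between samples
  hc <- hclust(d, method = "ward.D")
  sapply(ks, function(k) avg_silhouette(d, cutree(hc, k = k)))
}

ks  <- 2:6
sil <- sapply(versions, silhouette_by_k, ks = ks)  # rows = k, columns = data version
rownames(sil) <- ks
best_k <- ks[apply(sil, 2, which.max)]
names(best_k) <- colnames(sil)

# Compare the k = 2 clusters from the log2 + scaled data with the simulated groups
hc_log <- hclust(dist(t(versions$log_scaled)), method = "ward.D")
tab <- table(cluster = cutree(hc_log, k = 2), group = group)
```

Printing sil, best_k and tab shows how the choice of transformation changes both the suggested number of clusters and the agreement with the known groups. With the real data, replace sim with mat and group with the disease types. The cluster package provides silhouette() and clusGap() as packaged alternatives to the hand-written helper.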
4.3.2 Dimension reduction
We will be using the leukemia expression data set again. You can use it as shown in the clustering exercises.
1. Do PCA on the expression matrix using the princomp() function and then use the screeplot() function to visualize the variation explained by the eigenvectors. How many top components explain 95% of the variation? [Difficulty: Beginner]
2. Our next tasks are to remove eigenvectors and reconstruct the matrix using SVD, then calculate the reconstruction error as the difference between the original and the reconstructed matrix. HINT: You have to use the svd() function and set the singular value (an entry of d in the svd() output) to 0 for the component you want to remove. [Difficulty: Intermediate/Advanced]
3. Produce a 10-component ICA from the expression data set. Remove each component and measure the reconstruction error without that component. Rank the components by decreasing reconstruction error. [Difficulty: Advanced]
4. In this exercise we use the Rtsne() function on the leukemia expression data set. Try increasing and decreasing the perplexity parameter of t-SNE, and describe the changes you observe in the 2D plots. [Difficulty: Beginner]
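The PCA and SVD exercises above can be approached along the following lines. This is a minimal sketch in base R under stated assumptions: the matrix sim is simulated rather than the leukemia data, the helpers reconstruct_without() and recon_error() are hypothetical names introduced here, and the reconstruction error is measured as the Frobenius norm of the difference, which is one reasonable choice among several.

```r
# Simulated stand-in for an expression matrix (genes in rows, samples in columns).
# This is NOT the leukemia data; it only makes the sketch self-contained.
set.seed(1)
sim <- matrix(rnorm(100 * 12), nrow = 100)
sim <- sim + outer(rnorm(100, sd = 3), rnorm(12))  # add one strong shared pattern

# PCA with princomp(): proportion of variance per component and the
# number of top components needed to reach 95% of the variation
pca <- princomp(scale(sim))
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
n95 <- unname(which(cumsum(var_explained) >= 0.95)[1])

# SVD: sim = U diag(d) V'
s <- svd(sim)

# Rebuild the matrix with the singular values listed in `drop` set to 0
reconstruct_without <- function(s, drop) {
  d <- s$d
  d[drop] <- 0
  s$u %*% diag(d) %*% t(s$v)
}

# Reconstruction error as the Frobenius norm of (original - reconstructed)
recon_error <- function(x, s, drop) {
  norm(x - reconstruct_without(s, drop), type = "F")
}

# Remove one component at a time and rank components by the error this causes
errors  <- sapply(seq_along(s$d), function(i) recon_error(sim, s, i))
ranking <- order(errors, decreasing = TRUE)
```

With the Frobenius norm, removing a single component gives an error equal to that component's singular value, so comparing errors with s$d is a quick sanity check. The same remove-and-measure loop carries over to the ICA exercise once the components have been estimated with an ICA package.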