8.5 Exercises
8.5.1 Exploring the count tables
Here, import an example count table and do some exploration of the expression data.
counts_file <- system.file("extdata/rna-seq/SRP029880.raw_counts.tsv",
package = "compGenomRData")
coldata_file <- system.file("extdata/rna-seq/SRP029880.colData.tsv",
package = "compGenomRData")
- Normalize the counts using the TPM approach. [Difficulty: Beginner]
- Plot a heatmap of the top 500 most variable genes. Compare with the heatmap obtained using the 100 most variable genes. [Difficulty: Beginner]
- Re-do the heatmaps setting the
scale
argument tonone
, andcolumn
. Compare the results withscale = 'row'
. [Difficulty: Beginner] - Draw a correlation plot for the samples depicting the sample differences as ‘ellipses’, drawing only the upper end of the matrix, and order samples by hierarchical clustering results based on
average
linkage clustering method. [Difficulty: Beginner] - How else could the count matrix be subsetted to obtain quick and accurate clusters? Try selecting the top 100 genes that have the highest total expression in all samples and re-draw the cluster heatmaps and PCA plots. [Difficulty: Intermediate]
- Add an additional column to the annotation data.frame object to annotate the samples and use the updated annotation data.frame to plot the heatmaps. (Hint: Assign different batch values to CASE and CTRL samples). Make a PCA plot and color samples by the added variable (e.g. batch). [Difficulty: Intermediate]
- Try making the heatmaps using all the genes in the count table, rather than sub-selecting. [Difficulty: Advanced]
- Use the
Rtsne
package to draw a t-SNE plot of the expression values. Color the points by sample group. Compare the results with the PCA plots. [Difficulty: Advanced]
8.5.2 Differential expression analysis
Firstly, carry out a differential expression analysis starting from raw counts. Use the following datasets:
counts_file <- system.file("extdata/rna-seq/SRP029880.raw_counts.tsv",
package = "compGenomRData")
coldata_file <- system.file("extdata/rna-seq/SRP029880.colData.tsv",
package = "compGenomRData")
- Import the read counts and colData tables.
- Set up a DESeqDataSet object.
- Filter out genes with low counts.
- Run DESeq2 contrasting the
CASE
sample withCONTROL
samples.
Now, you are ready to do the following exercises:
- Make a volcano plot using the differential expression analysis results. (Hint: x-axis denotes the log2FoldChange and the y-axis represents the -log10(pvalue)). [Difficulty: Beginner]
- Use DESeq2::plotDispEsts to make a dispersion plot and find out the meaning of this plot. (Hint: Type ?DESeq2::plotDispEsts) [Difficulty: Beginner]
- Explore
lfcThreshold
argument of theDESeq2::results
function. What is its default value? What does it mean to change the default value to, for instance,1
? [Difficulty: Intermediate] - What is independent filtering? What happens if we don’t use it? Google
independent filtering statquest
and watch the online video about independent filtering. [Difficulty: Intermediate] - Re-do the differential expression analysis using the
edgeR
package. Find out how much DESeq2 and edgeR agree on the list of differentially expressed genes. [Difficulty: Advanced] - Use the
compcodeR
package to run the differential expression analysis using at least three different tools and compare and contrast the results following thecompcodeR
vignette. [Difficulty: Advanced]
8.5.3 Functional enrichment analysis
- Re-run gProfileR, this time using pathway annotations such as KEGG, REACTOME, and protein complex databases such as CORUM, in addition to the GO terms. Sort the resulting tables by columns
precision
and/orrecall
. How do the top GO terms change when sorted forprecision
,recall
, orp.value
? [Difficulty: Beginner] - Repeat the gene set enrichment analysis by trying different options for the
compare
argument of theGAGE:gage
function. How do the results differ? [Difficulty: Beginner] - Make a scatter plot of GO term sizes and obtained p-values by setting the
gProfiler::gprofiler
argumentsignificant = FALSE
. Is there a correlation of term sizes and p-values? (Hint: Take -log10 of p-values). If so, how can this bias be mitigated? [Difficulty: Intermediate] - Do a gene-set enrichment analysis using gene sets from top 10 GO terms. [Difficulty: Intermediate]
- What are the other available R packages that can carry out gene set enrichment analysis for RNA-seq datasets? [Difficulty: Intermediate]
- Use the topGO package (https://bioconductor.org/packages/release/bioc/html/topGO.html) to re-do the GO term analysis. Compare and contrast the results with what has been obtained using the
gProfileR
package. Which tool is faster,gProfileR
or topGO? Why? [Difficulty: Advanced] - Given a gene set annotated for human, how can it be utilized to work on C. elegans data? (Hint: See
biomaRt::getLDS
). [Difficulty: Advanced] - Import curated pathway gene sets with Entrez identifiers from the MSIGDB database and re-do the GSEA for all curated gene sets. [Difficulty: Advanced]
8.5.4 Removing unwanted variation from the expression data
For the exercises below, use the datasets at:
counts_file <- system.file('extdata/rna-seq/SRP049988.raw_counts.tsv',
package = 'compGenomRData')
colData_file <- system.file('extdata/rna-seq/SRP049988.colData.tsv',
package = 'compGenomRData')
- Run RUVSeq using multiple values of
k
from 1 to 10 and compare and contrast the PCA plots obtained from the normalized counts of each RUVSeq run. [Difficulty: Beginner] - Re-run RUVSeq using the
RUVr()
function. Compare PCA plots fromRUVs
,RUVg
andRUVr
using the samek
values and find out which one performs the best. [Difficulty: Intermediate] - Do the necessary diagnostic plots using the differential expression results from the EHF count table. [Difficulty: Intermediate]
- Use the
sva
package to discover sources of unwanted variation and re-do the differential expression analysis using variables from the output ofsva
and compare the results withDESeq2
results usingRUVSeq
corrected normalization counts. [Difficulty: Advanced]