8.5 Exercises
8.5.1 Exploring the count tables
Here, import an example count table and do some exploration of the expression data.
counts_file <- system.file("extdata/rna-seq/SRP029880.raw_counts.tsv",
package = "compGenomRData")
coldata_file <- system.file("extdata/rna-seq/SRP029880.colData.tsv",
package = "compGenomRData")- Normalize the counts using the TPM approach. [Difficulty: Beginner]
- Plot a heatmap of the top 500 most variable genes. Compare with the heatmap obtained using the 100 most variable genes. [Difficulty: Beginner]
- Re-do the heatmaps setting the
scaleargument tonone, andcolumn. Compare the results withscale = 'row'. [Difficulty: Beginner] - Draw a correlation plot for the samples depicting the sample differences as ‘ellipses’, drawing only the upper end of the matrix, and order samples by hierarchical clustering results based on
averagelinkage clustering method. [Difficulty: Beginner] - How else could the count matrix be subsetted to obtain quick and accurate clusters? Try selecting the top 100 genes that have the highest total expression in all samples and re-draw the cluster heatmaps and PCA plots. [Difficulty: Intermediate]
- Add an additional column to the annotation data.frame object to annotate the samples and use the updated annotation data.frame to plot the heatmaps. (Hint: Assign different batch values to CASE and CTRL samples). Make a PCA plot and color samples by the added variable (e.g. batch). [Difficulty: Intermediate]
- Try making the heatmaps using all the genes in the count table, rather than sub-selecting. [Difficulty: Advanced]
- Use the
Rtsnepackage to draw a t-SNE plot of the expression values. Color the points by sample group. Compare the results with the PCA plots. [Difficulty: Advanced]
8.5.2 Differential expression analysis
Firstly, carry out a differential expression analysis starting from raw counts. Use the following datasets:
counts_file <- system.file("extdata/rna-seq/SRP029880.raw_counts.tsv",
package = "compGenomRData")
coldata_file <- system.file("extdata/rna-seq/SRP029880.colData.tsv",
package = "compGenomRData")
- Import the read counts and colData tables.
- Set up a DESeqDataSet object.
- Filter out genes with low counts.
- Run DESeq2 contrasting the
CASEsample withCONTROLsamples.
Now, you are ready to do the following exercises:
- Make a volcano plot using the differential expression analysis results. (Hint: x-axis denotes the log2FoldChange and the y-axis represents the -log10(pvalue)). [Difficulty: Beginner]
- Use DESeq2::plotDispEsts to make a dispersion plot and find out the meaning of this plot. (Hint: Type ?DESeq2::plotDispEsts) [Difficulty: Beginner]
- Explore
lfcThresholdargument of theDESeq2::resultsfunction. What is its default value? What does it mean to change the default value to, for instance,1? [Difficulty: Intermediate] - What is independent filtering? What happens if we don’t use it? Google
independent filtering statquestand watch the online video about independent filtering. [Difficulty: Intermediate] - Re-do the differential expression analysis using the
edgeRpackage. Find out how much DESeq2 and edgeR agree on the list of differentially expressed genes. [Difficulty: Advanced] - Use the
compcodeRpackage to run the differential expression analysis using at least three different tools and compare and contrast the results following thecompcodeRvignette. [Difficulty: Advanced]
8.5.3 Functional enrichment analysis
- Re-run gProfileR, this time using pathway annotations such as KEGG, REACTOME, and protein complex databases such as CORUM, in addition to the GO terms. Sort the resulting tables by columns
precisionand/orrecall. How do the top GO terms change when sorted forprecision,recall, orp.value? [Difficulty: Beginner] - Repeat the gene set enrichment analysis by trying different options for the
compareargument of theGAGE:gagefunction. How do the results differ? [Difficulty: Beginner] - Make a scatter plot of GO term sizes and obtained p-values by setting the
gProfiler::gprofilerargumentsignificant = FALSE. Is there a correlation of term sizes and p-values? (Hint: Take -log10 of p-values). If so, how can this bias be mitigated? [Difficulty: Intermediate] - Do a gene-set enrichment analysis using gene sets from top 10 GO terms. [Difficulty: Intermediate]
- What are the other available R packages that can carry out gene set enrichment analysis for RNA-seq datasets? [Difficulty: Intermediate]
- Use the topGO package (https://bioconductor.org/packages/release/bioc/html/topGO.html) to re-do the GO term analysis. Compare and contrast the results with what has been obtained using the
gProfileRpackage. Which tool is faster,gProfileRor topGO? Why? [Difficulty: Advanced] - Given a gene set annotated for human, how can it be utilized to work on C. elegans data? (Hint: See
biomaRt::getLDS). [Difficulty: Advanced] - Import curated pathway gene sets with Entrez identifiers from the MSIGDB database and re-do the GSEA for all curated gene sets. [Difficulty: Advanced]
8.5.4 Removing unwanted variation from the expression data
For the exercises below, use the datasets at:
counts_file <- system.file('extdata/rna-seq/SRP049988.raw_counts.tsv',
package = 'compGenomRData')
colData_file <- system.file('extdata/rna-seq/SRP049988.colData.tsv',
package = 'compGenomRData')
- Run RUVSeq using multiple values of
kfrom 1 to 10 and compare and contrast the PCA plots obtained from the normalized counts of each RUVSeq run. [Difficulty: Beginner] - Re-run RUVSeq using the
RUVr()function. Compare PCA plots fromRUVs,RUVgandRUVrusing the samekvalues and find out which one performs the best. [Difficulty: Intermediate] - Do the necessary diagnostic plots using the differential expression results from the EHF count table. [Difficulty: Intermediate]
- Use the
svapackage to discover sources of unwanted variation and re-do the differential expression analysis using variables from the output ofsvaand compare the results withDESeq2results usingRUVSeqcorrected normalization counts. [Difficulty: Advanced]