6.6 Exercises
The data for the exercises is within the compGenomRData package.
Run the following to see the data files.
dir(system.file("extdata",
package="compGenomRData"))
You will need some of those files to complete the exercises.
6.6.1 Operations on genomic intervals with the GenomicRanges package
- Create a
GRangesobject using the information in the table below:[Difficulty: Beginner]
| chr | start | end | strand | score |
|---|---|---|---|---|
| chr1 | 10000 | 10300 | + | 10 |
| chr1 | 11100 | 11500 | - | 20 |
| chr2 | 20000 | 20030 | + | 15 |
Use the
start(),end(),strand(),seqnames()andwidth()functions on theGRangesobject you created. Figure out what they are doing. Can you get a subset of theGRangesobject for intervals that are only on the + strand? If you can do that, try getting intervals that are on chr1. HINT:GRangesobjects can be subset using the[ ]operator, similar to data frames, but you may need to usestart(),end()andstrand(),seqnames()within the[]. [Difficulty: Beginner/Intermediate]Import mouse (mm9 assembly) CpG islands and RefSeq transcripts for chr12 from the UCSC browser as
GRangesobjects usingrtracklayerfunctions. HINT: Check chapter content and modify the code there as necessary. If that somehow does not work, go to the UCSC browser and download it as a BED file. The track name for Refseq genes is “RefSeq Genes” and the table name is “refGene”. [Difficulty: Beginner/Intermediate]Following from the exercise above, get the promoters of Refseq transcripts (-1000bp and +1000 bp of the TSS) and calculate what percentage of them overlap with CpG islands. HINT: You have to get the promoter coordinates and use the
findOverlaps()orsubsetByOverlaps()from theGenomicRangespackage. To get promoters, type?promoterson the R console and see how to use that function to get promoters or calculate their coordinates as shown in the chapter. [Difficulty: Beginner/Intermediate]Plot the distribution of CpG island lengths for CpG islands that overlap with the promoters. [Difficulty: Beginner/Intermediate]
Get canonical peaks for SP1 (peaks that are in both replicates) on chr21. Peaks for each replicate are located in the
wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gzandwgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep2.broadPeak.gzfiles. HINT: You need to usefindOverlaps()orsubsetByOverlaps()to get the subset of peaks that occur in both replicates (canonical peaks). You can try to read “…broadPeak.gz” files using thegenomation::readBroadPeak()function; broadPeak is just an extended BED format. In addition, you can try to usethe coverage()andslice()functions to get more precise canonical peak locations. [Difficulty: Intermediate/Advanced]
6.6.2 Dealing with mapped high-throughput sequencing reads
- Count the reads overlapping with canonical SP1 peaks using the BAM file for one of the replicates. The following file in the
compGenomRDatapackage contains the alignments for SP1 ChIP-seq reads:wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam. HINT: Use functions from theGenomicAlignmentspackage. [Difficulty: Beginner/Intermediate]
6.6.3 Dealing with contiguous scores over the genome
Extract the
Viewsobject for the promoters on chr20 from theH1.ESC.H3K4me1.chr20.bwfile available atCompGenomRDatapackage. Plot the first “View” as a line plot. HINT: See the code in the relevant section in the chapter and adapt the code from there. [Difficulty: Beginner/Intermediate]Make a histogram of the maximum signal for the Views in the object you extracted above. You can use any of the view summary functions or use
lapply()and write your own summary function. [Difficulty: Beginner/Intermediate]Get the genomic positions of maximum signal in each view and make a
GRangesobject. HINT: See the?viewRangeMaxshelp page. Try to make aGRangesobject out of the returned object. [Difficulty: Intermediate]
6.6.4 Visualizing and summarizing genomic intervals
- Extract -500,+500 bp regions around the TSSes on chr21; there are refseq files for the hg19 human genome assembly in the
compGenomRDatapackage. Use SP1 ChIP-seq data in thecompGenomRDatapackage, access the file path via thesystem.file()function, the file name is:wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam. Create an average profile of read coverage around the TSSes. Following that, visualize the read coverage with a heatmap. HINT: All of these are possible using thegenomationpackage functions. Checkhelp(ScoreMatrix)to see how you can use bam files. As an example here is how you can get the file path to refseq annotation on chr21. [Difficulty: Intermediate/Advanced]
Extract -500,+500 bp regions around the TSSes on chr20. Use H3K4me3 (
H1.ESC.H3K4me3.chr20.bw) and H3K27ac (H1.ESC.H3K27ac.chr20.bw) ChIP-seq enrichment data in thecompGenomRDatapackage and create heatmaps and average signal profiles for regions around the TSSes.[Difficulty: Intermediate/Advanced]Download P300 ChIP-seq peaks data from the UCSC browser. The peaks are locations where P300 binds. The P300 binding marks enhancer regions in the genome. (HINT: group: “regulation”, track: “Txn Factor ChIP”, table:“wgEncodeRegTfbsClusteredV3”, you need to filter the rows for “EP300” name.) Check enrichment of H3K4me3, H3K27ac and DNase-seq (
H1.ESC.dnase.chr20.bw) experiments on chr20 on and arounf the P300 binding-sites, use data fromcompGenomRDatapackage. Make multi-heatmaps and metaplots. What is different from the TSS profiles? [Difficulty: Advanced]Cluster the rows of multi-heatmaps for the task above. Are there obvious clusters? HINT: Check arguments of the
multiHeatMatrix()function. [Difficulty: Advanced]Visualize one of the -500,+500 bp regions around the TSS using
Gvizfunctions. You should visualize both H3K4me3 and H3K27ac and the gene models. [Difficulty: Advanced]