9.5 Computational Analysis of ChIP-seq data

This tutorial assumes that the reader is already familiar with the technical and conceptual material listed below.

9.5.0.1 Technical requirements

The practical section presumes that the user knows how to use the following R tools:

  • ggplot2 package

  • R objects and classes, and object conversion (list, data.frame, matrix, vector)

  • construction and usage of functions

  • GenomicRanges, GenomicFeatures, and GenomicAlignments Bioconductor packages

  • Rle, coverage

9.5.0.2 Knowledge requirements for the tutorial

Before starting this chapter, you should be familiar with the following terms:

  • Read filtering

  • Read mapping

  • File format

  • Structure of bed, gtf, wig, bigWig, bam

  • Uniquely-mapping and multi-mapping reads

  • PCR duplicates

  • biological and technical replicates

  • + and - strand

  • PCR - polymerase chain reaction

9.5.1 Prerequisites

Please install the following R packages:

packages = c(
    'AnnotationHub',
    'Biostrings',
    'BSgenome',
    'BSgenome.Hsapiens.UCSC.hg38',
    'circlize',
    'ComplexHeatmap',
    'dplyr',
    'GenomicAlignments',
    'GenomicFeatures',
    'GenomicRanges',
    'GenomeInfoDb',
    'ggplot2',
    'Gviz',
    'JASPAR2018',
    'MotifDb',
    'motifRG',
    'motifStack',
    'normr',
    'rtracklayer',
    'seqLogo',
    'TFBSTools',
    'tidyr')
BiocManager::install(packages)
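
Reinstalling every package can take a long time, so it may help to first check which ones are missing. The following is a minimal sketch using only base R; the helper name `find_missing_packages` is hypothetical and not part of any package.

```r
# Hypothetical helper (not from any package): returns the subset of
# 'pkgs' that cannot be loaded from the current library paths.
find_missing_packages = function(pkgs) {
    is_installed = vapply(
        pkgs,
        function(pkg) requireNamespace(pkg, quietly = TRUE),
        logical(1),
        USE.NAMES = FALSE
    )
    pkgs[!is_installed]
}

# Example with two packages that ship with R and one made-up name
find_missing_packages(c('stats', 'utils', 'thisPackageDoesNotExist12345'))
```

The result of `find_missing_packages(packages)` could then be passed to `BiocManager::install()` in place of the full vector, so that only the missing packages are installed.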

9.5.2 The Data

The experimental data was downloaded from the public ENCODE (ref) database of ChIP-seq experiments. The experiments were performed on the lymphoblastoid cell line GM12878, and the reads were mapped to the GRCh38 (hg38) version of the human genome using the standard ENCODE ChIP-seq pipeline.

In this tutorial, for performance considerations, we have taken a subset of the data which corresponds to the human chromosome 21 (chr21).

The data sets are located in the compGenomRData package. The location of the data sets can be accessed using the system.file command, in the following way:

data_path = system.file('extdata/chip-seq',package='compGenomRData')

The available data sets can be listed using the list.files command:

chip_files = list.files(data_path, full.names=TRUE)
head(chip_files)
## [1] "/Users/aakalin/Rlibs/compGenomRData/extdata/chip-seq/CTCF_peaks.txt"                    
## [2] "/Users/aakalin/Rlibs/compGenomRData/extdata/chip-seq/GM12878_hg38_CTCF_r1.chr21.bam"    
## [3] "/Users/aakalin/Rlibs/compGenomRData/extdata/chip-seq/GM12878_hg38_CTCF_r1.chr21.bam.bai"
## [4] "/Users/aakalin/Rlibs/compGenomRData/extdata/chip-seq/GM12878_hg38_CTCF_r2.chr21.bam"    
## [5] "/Users/aakalin/Rlibs/compGenomRData/extdata/chip-seq/GM12878_hg38_CTCF_r2.chr21.bam.bai"
## [6] "/Users/aakalin/Rlibs/compGenomRData/extdata/chip-seq/GM12878_hg38_H3K27me3.chr21.bam"
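
The directory mixes peak files, BAM files, and BAM index files, so it is often convenient to select files by extension. The sketch below uses base R pattern matching on an example vector of file names copied from the listing above; the variable names `example_files`, `bam_files`, and `sample_names` are illustrative assumptions.

```r
# Example file names taken from the listing above (directory part omitted)
example_files = c(
    'CTCF_peaks.txt',
    'GM12878_hg38_CTCF_r1.chr21.bam',
    'GM12878_hg38_CTCF_r1.chr21.bam.bai',
    'GM12878_hg38_CTCF_r2.chr21.bam',
    'GM12878_hg38_CTCF_r2.chr21.bam.bai',
    'GM12878_hg38_H3K27me3.chr21.bam'
)

# Keep only the BAM files; the '$' anchor excludes the .bam.bai index files
bam_files = grep('\\.bam$', example_files, value = TRUE)

# Derive short sample names by stripping the common prefix and suffix
sample_names = sub('\\.chr21\\.bam$', '', basename(bam_files))
sample_names = sub('^GM12878_hg38_', '', sample_names)
sample_names
```

The same `grep` call can be applied to a vector of full paths, because the pattern is anchored at the end of the string and `basename` removes the directory part before the sample names are derived.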

The data set consists of the following ChIP experiments:

  1. Transcription factors: CTCF, SMC3, ZNF143, PolII (RNA polymerase II)

  2. Histone modifications: H3K4me3, H3K36me3, H3K27ac, H3K27me3

  3. Various input samples

The first step in ChIP-seq data analysis is to perform ChIP quality control.