9.4 Pre-processing ChIP data
The focus of ChIP preprocessing is to check the quality of the sequencing experiment, remove sequencing artifacts, and find the genomic location of the sequenced fragments using read mapping. Quality control consists of read quality assessment and adapter trimming. These methods are described in depth in Chapter 7.
9.4.1 Mapping of ChIP-seq data
Mapping is the procedure of locating the genomic position from which each sequenced read, and therefore each DNA fragment, originated. Several tools are available for mapping ChIP-seq data sets: Bowtie, Bowtie2, BWA (Langmead, Trapnell, Pop, et al. 2009; Langmead and Salzberg 2012b; H. Li and Durbin 2009b), and all of them have comparable sensitivity and specificity (Ruffalo, LaFramboise, and Koyutürk 2011). Read length is the variable with the biggest effect on the mapping procedure. The longer the sequenced read, the more likely it is to be assigned to a unique position in the genome.
Reads which are assigned ambiguously to multiple locations in the genome are called multi-mapping reads. Such reads are most often produced by repetitive genomic regions, such as retrotransposons, pseudogenes, or paralogous genes (Li and Freudenberg 2014). It is important to decide, a priori, whether such duplicated regions are of interest for the current experimental setup (e.g. whether we want to study transcription factor binding in olfactory receptor genes). If they are, the multi-mapping reads should be included in the analysis. If they are not, the reads should be omitted. This is done during the mapping step, by limiting the number of locations to which a read can map. The methodology of working with multi-mapping reads differs according to the use case, and will not be considered in this chapter. For more information, see Chung, Kuan, Li, et al. (2011).
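The text above describes excluding multi-mapping reads at the mapping step. A related approach, applied after mapping, is to discard alignments with a low mapping quality (MAPQ), since aligners generally give low MAPQ values to ambiguously placed reads. The sketch below is only an illustration of that idea in plain Python: the function name `filter_by_mapq` and the threshold of 10 are assumptions, and the way MAPQ encodes ambiguity differs between aligners, so the threshold would need to be checked against the documentation of the aligner used.

```python
UNMAPPED = 0x4  # SAM flag bit: the read has no reported alignment


def filter_by_mapq(sam_lines, min_mapq=10):
    """Keep header lines and mapped alignments with MAPQ >= min_mapq.

    sam_lines: iterable of SAM-formatted text lines.
    min_mapq:  hypothetical threshold, chosen here for illustration only.
    """
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines are passed through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # SAM column 2: bitwise FLAG
        mapq = int(fields[4])  # SAM column 5: mapping quality
        if not (flag & UNMAPPED) and mapq >= min_mapq:
            kept.append(line)
    return kept


# Toy alignments: one confident, one ambiguous, one unmapped.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t42\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t300\t1\t50M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
confident = filter_by_mapq(example)
```

With the toy input, only the header line and the alignment of `r1` are retained. In practice this kind of filtering is usually done with dedicated tools on BAM files rather than on text lines.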
Current Illumina sequencing procedures enable sequencing of DNA fragments from one end or from both ends. Sequencing from both ends is called paired-end sequencing, and it greatly improves mappability, the percentage of the genome to which reads can be uniquely mapped. Additionally, it provides an out-of-the-box estimate of the average DNA fragment length, a parameter which is important for quality control and peak calling. Although paired-end sequencing would always be preferable, it substantially increases the sequencing costs, which can be prohibitive.
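To illustrate how paired-end data yield a fragment length estimate, the sketch below averages the template length (the TLEN column of the SAM format) over properly paired reads. It is a minimal example in plain Python working on SAM text lines; the function name `mean_fragment_length` and the toy records are assumptions made for illustration, not part of any particular tool.

```python
from statistics import mean

PROPER_PAIR = 0x2  # SAM flag bit: both mates aligned as a proper pair


def mean_fragment_length(sam_lines):
    """Average template length over properly paired reads.

    Only records with a positive TLEN are used, so that each pair
    contributes once (the leftmost mate carries the positive value).
    Returns None if no suitable pair is found.
    """
    lengths = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # SAM column 2: bitwise FLAG
        tlen = int(fields[8])  # SAM column 9: signed template length
        if flag & PROPER_PAIR and tlen > 0:
            lengths.append(tlen)
    return mean(lengths) if lengths else None


# Two toy read pairs with fragment lengths of 200 and 250 bp.
pairs = [
    "r1\t99\tchr1\t100\t42\t50M\t=\t250\t200\t*\t*",
    "r1\t147\tchr1\t250\t42\t50M\t=\t100\t-200\t*\t*",
    "r2\t99\tchr1\t500\t42\t50M\t=\t700\t250\t*\t*",
    "r2\t147\tchr1\t700\t42\t50M\t=\t500\t-250\t*\t*",
]
estimate = mean_fragment_length(pairs)
```

For single-end data no such direct measurement exists, and the fragment length has to be inferred indirectly from the read distribution.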
Different reads that map to the same genomic location (same chromosome, position, and strand) are called duplicated reads. Such reads indicate that the same DNA fragment was present multiple times during library preparation. This can happen due to high enrichment with highly specific antibodies, or such fragments can be artificially produced during PCR amplification. Because we do not know the exact origin of the duplicated fragments, they are most often collapsed during the peak calling procedure, i.e. when multiple reads map to the same chromosome, position, and strand, only one read is used. If the transcription factor binds to a small number of regions in the genome, such data reduction might be too stringent, and we can increase the sensitivity by allowing up to N different reads per position (i.e. if more than N reads map to the same location, only N reads are kept for the downstream analysis).
Some peak calling algorithms include automated statistical methods for determining the number of reads per position that will be used in the analysis (Zhang, Liu, Meyer, et al. 2008).
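The rule of keeping at most N reads per position can be sketched in a few lines of Python. This is a minimal illustration, assuming that each read has already been reduced to a (chromosome, position, strand) tuple; the function name `cap_duplicates` and its `max_per_position` parameter are hypothetical and do not correspond to the interface of any specific peak caller.

```python
from collections import defaultdict


def cap_duplicates(reads, max_per_position=1):
    """Keep at most max_per_position reads per (chrom, pos, strand).

    reads: iterable of (chrom, pos, strand) tuples.
    Returns the retained reads in their original order.
    """
    seen = defaultdict(int)  # how many reads were kept at each location
    kept = []
    for read in reads:
        key = (read[0], read[1], read[2])
        if seen[key] < max_per_position:
            seen[key] += 1
            kept.append(read)
    return kept


# Three reads pile up at chr1:100 on the + strand.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 100, "-"),
    ("chr2", 100, "+"),
]
collapsed = cap_duplicates(reads)                    # N = 1: full collapsing
relaxed = cap_duplicates(reads, max_per_position=2)  # N = 2: less stringent
```

With N = 1 the three identical reads are reduced to one, while the read on the opposite strand and the read on the other chromosome are untouched, because strand and chromosome are part of the location key.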
An important consideration is the genome of the experimental system. Cell lines, cancer samples, and personal genomes usually contain structural genomic alterations which are not present in the reference genome (duplications, insertions, and deletions). Such regions can cause both false negatives and false positives in the ChIP-seq experiment. If a region is present multiple times in the experimental system, but only once in the reference genome, it will be relatively enriched in the final sequencing library. Reads from such a region will pile up on a single location during the mapping step and create an artificial peak, which can be falsely characterized as a binding event. Such regions are called blacklisted regions and should be removed from the downstream analysis. The UCSC Genome Browser database contains tables with such regions for the most commonly used model organisms.
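Removing blacklisted regions amounts to an interval overlap test. The sketch below shows the idea in plain Python, assuming that both the regions of interest (for example, candidate peaks) and the blacklist are given as (chromosome, start, end) tuples with BED-style half-open coordinates. The function names and the toy coordinates are made up for illustration; real analyses would typically use a genomic interval library rather than this quadratic comparison.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def remove_blacklisted(regions, blacklist):
    """Drop every region that overlaps a blacklisted interval on the same chromosome."""
    by_chrom = {}
    for chrom, start, end in blacklist:
        by_chrom.setdefault(chrom, []).append((start, end))
    return [
        (chrom, start, end)
        for chrom, start, end in regions
        if not any(
            overlaps(start, end, b_start, b_end)
            for b_start, b_end in by_chrom.get(chrom, [])
        )
    ]


# Hypothetical peaks and a single blacklisted interval on chr1.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
blacklist = [("chr1", 150, 300)]
clean_peaks = remove_blacklisted(peaks, blacklist)
```

Only the first peak is removed: it overlaps the blacklisted interval, whereas the peak at the same coordinates on chr2 is kept because the comparison is restricted to matching chromosomes.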
This chapter presumes that the reader is already familiar with the following technical and conceptual background in computational data processing. From Chapters 6 and 7, you should be familiar with the concept of multi-mapping reads and with the following file formats: BED, GTF, WIG, bigWig, and BAM. You should also be familiar with PCR and PCR duplicates, positive and negative DNA strands, and technical and biological replicates.
References
Chung, Kuan, Li, Sanalkumar, Liang, Bresnick, Dewey, and Keleş. 2011. “Discovering Transcription Factor Binding Sites in Highly Repetitive Regions of Genomes with Multi-Read Analysis of ChIP-Seq Data.” PLoS Comput Biol 7 (7): e1002111. https://doi.org/10.1371/journal.pcbi.1002111.
Langmead, and Salzberg. 2012b. “Fast Gapped-Read Alignment with Bowtie 2.” Nat Methods 9 (4): 357–59. https://doi.org/10.1038/nmeth.1923.
Langmead, Trapnell, Pop, and Salzberg. 2009. “Ultrafast and Memory-Efficient Alignment of Short DNA Sequences to the Human Genome.” Genome Biol 10 (3): R25. https://doi.org/10.1186/gb-2009-10-3-r25.
Li, and Durbin. 2009b. “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform.” Bioinformatics 25 (14): 1754–60. https://doi.org/10.1093/bioinformatics/btp324.
Li, and Freudenberg. 2014. “Mappability and Read Length.” Front Genet 5 (November): 381. https://doi.org/10.3389/fgene.2014.00381.
Ruffalo, LaFramboise, and Koyutürk. 2011. “Comparative Analysis of Algorithms for Next-Generation Sequencing Read Alignment.” Bioinformatics 27 (20): 2790–96. https://doi.org/10.1093/bioinformatics/btr477.
Zhang, Liu, Meyer, et al. 2008. “Model-Based Analysis of ChIP-Seq (MACS).” Genome Biol 9 (9): R137. https://doi.org/10.1186/gb-2008-9-9-r137.