6.2 Dealing with mapped high-throughput sequencing reads
The reads from sequencing machines are usually pre-processed and aligned to the genome with the help of specific bioinformatics tools. We have introduced the details of general read processing, quality check and alignment methods in Chapter 7. In this section we will deal with mapped reads. Since each mapped read has a start and end position the genome, mapped reads can be thought of as genomic intervals stored in a file. After mapping, the next task is to quantify the enrichment of those aligned reads in the regions of interest. You may want to count how many reads overlap with your promoter set of interest or you may want to quantify RNA-seq reads overlap with exons. This is similar to operations on genomic intervals which are described previously. If you can read all your alignments into memory and create a GRanges
object, you can apply the previously described operations. However, most of the time we can not read all mapped reads into memory, so we have to use specialized tools to query and quantify alignments on a given set of regions. One of the most common alignment formats is SAM/BAM format, most aligners will produce SAM/BAM output or you will be able to convert your specific alignment format to SAM/BAM format. The BAM format is a binary version of the human-readable SAM format. The SAM format has specific columns that contain different kinds of information about the alignment such as mismatches, qualities etc. (see [http://samtools.sourceforge.net/SAM1.pdf] for SAM format specification).
6.2.1 Counting mapped reads for a set of regions
The Rsamtools
package has functions to query BAM files. The function we will use in the first example is the countBam()
function, which takes input of the BAM file and param argument. The param
argument takes a ScanBamParam
object. The object is instantiated using ScanBamParam()
and contains parameters for scanning the BAM file. The example below is a simple example where ScanBamParam()
only includes regions of interest, promoters on chr21.
promoter.gr=tss.gr
start(promoter.gr)=start(promoter.gr)-1000
end(promoter.gr) =end(promoter.gr)+1000
promoter.gr=promoter.gr[seqnames(promoter.gr)=="chr21"]
library(Rsamtools)
bamfilePath=system.file("extdata",
"wgEncodeHaibTfbsGm12878Sp1Pcr1xAlnRep1.chr21.bam",
package="compGenomRData")
# get reads for regions of interest from the bam file
param <- ScanBamParam(which=promoter.gr)
counts=countBam(bamfilePath, param=param)
Alternatively, aligned reads can be read in using the GenomicAlignments
package (which on this occasion relies on the Rsamtools
package).