6.1 Operations on genomic intervals with GenomicRanges package

6.1 Operations on genomic intervals with `GenomicRanges` package

The Bioconductor project has a dedicated package called GenomicRanges to deal with genomic intervals. In this section, we will provide use cases involving operations on genomic intervals. The main reason we will stick to this package is that it provides tools to do overlap operations. However, the package requires that users operate on specific data types that are conceptually similar to a tabular data structure implemented in a way that makes overlapping and related operations easier. The main object we will be using is called the GRanges object and we will also see some other related objects from the GenomicRanges package.

6.1.1 How to create and manipulate a GRanges object

GRanges (from GenomicRanges package) is the main object that holds the genomic intervals and extra information about those intervals. Here we will show how to create one. Conceptually, it is similar to a data frame and some operations such as using [ ] notation to subset the table will also work on GRanges, but keep in mind that not everything that works for data frames will work on GRanges objects.

library(GenomicRanges)
gr=GRanges(seqnames=c("chr1","chr2","chr2"),
           ranges=IRanges(start=c(50,150,200),
                          end=c(100,200,300)),
           strand=c("+","-","-")
)
gr

## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1    50-100      +
##   [2]     chr2   150-200      -
##   [3]     chr2   200-300      -
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

# subset like a data frame
gr[1:2,]

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1    50-100      +
##   [2]     chr2   150-200      -
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

As you can see, it looks a bit like a data frame. Also, note that the peculiar second argument “ranges” basically contains the start and end positions of the genomic intervals. However, you cannot just give start and end positions, you actually have to provide another object of IRanges. Do not let this confuse you; GRanges actually depends on another object that is very similar to itself called IRanges and you have to provide the “ranges” argument as an IRanges object. In its simplest form, an IRanges object can be constructed by providing start and end positions to the IRanges() function. Think of it as something you just have to provide in order to construct the GRanges object.

GRanges can also contain other information about the genomic interval such as scores, names, etc. You can provide extra information at the time of the construction or you can add it later. Here is how you can do that:

gr=GRanges(seqnames=c("chr1","chr2","chr2"),
           ranges=IRanges(start=c(50,150,200),
                          end=c(100,200,300)),
           names=c("id1","id3","id2"),
           scores=c(100,90,50)
)
# or add it later (replaces the existing meta data)
mcols(gr)=DataFrame(name2=c("pax6","meis1","zic4"),
                    score2=c(1,2,3))

gr=GRanges(seqnames=c("chr1","chr2","chr2"),
           ranges=IRanges(start=c(50,150,200),
                          end=c(100,200,300)),
           names=c("id1","id3","id2"),
           scores=c(100,90,50)
)

# or appends to existing meta data
mcols(gr)=cbind(mcols(gr),
                          DataFrame(name2=c("pax6","meis1","zic4")) )
gr

## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames    ranges strand |       names    scores       name2
##          <Rle> <IRanges>  <Rle> | <character> <numeric> <character>
##   [1]     chr1    50-100      * |         id1       100        pax6
##   [2]     chr2   150-200      * |         id3        90       meis1
##   [3]     chr2   200-300      * |         id2        50        zic4
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

# elementMetadata() and values() do the same things
elementMetadata(gr)

## DataFrame with 3 rows and 3 columns
##         names    scores       name2
##   <character> <numeric> <character>
## 1         id1       100        pax6
## 2         id3        90       meis1
## 3         id2        50        zic4

values(gr)

## DataFrame with 3 rows and 3 columns
##         names    scores       name2
##   <character> <numeric> <character>
## 1         id1       100        pax6
## 2         id3        90       meis1
## 3         id2        50        zic4

# you may also add metadata using the $ operator, as for data frames
gr$name3 = c("A","C", "B")
gr

## GRanges object with 3 ranges and 4 metadata columns:
##       seqnames    ranges strand |       names    scores       name2       name3
##          <Rle> <IRanges>  <Rle> | <character> <numeric> <character> <character>
##   [1]     chr1    50-100      * |         id1       100        pax6           A
##   [2]     chr2   150-200      * |         id3        90       meis1           C
##   [3]     chr2   200-300      * |         id2        50        zic4           B
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

6.1.2 Getting genomic regions into R as GRanges objects

There are multiple ways you can read your genomic features into R and create a GRanges object. Most genomic interval data comes in a tabular format that has the basic information about the location of the interval and some other information. We already showed how to read BED files as a data frame in Chapter 2. Now we will show how to convert it to the GRanges object. This is one way of doing it, but there are more convenient ways described further in the text.

# read CpGi data set
filePath=system.file("extdata",
                      "cpgi.hg19.chr21.bed",
                      package="compGenomRData")
cpgi.df = read.table(filePath, header = FALSE,
                     stringsAsFactors=FALSE) 
# remove chr names with "_"
cpgi.df =cpgi.df [grep("_",cpgi.df[,1],invert=TRUE),]

cpgi.gr=GRanges(seqnames=cpgi.df[,1],
                ranges=IRanges(start=cpgi.df[,2],
                              end=cpgi.df[,3]))

You may need to do some pre-processing before/after reading in the BED file. Below is an example of getting transcription start sites from BED files containing RefSeq transcript locations.

# read refseq file
filePathRefseq=system.file("extdata",
                      "refseq.hg19.chr21.bed",
                      package="compGenomRData")


ref.df = read.table(filePathRefseq, header = FALSE,
                     stringsAsFactors=FALSE) 
ref.gr=GRanges(seqnames=ref.df[,1],
               ranges=IRanges(start=ref.df[,2],
                              end=ref.df[,3]),
               strand=ref.df[,6],name=ref.df[,4])
# get TSS
tss.gr=ref.gr
# end of the + strand genes must be equalized to start pos
end(tss.gr[strand(tss.gr)=="+",])  =start(tss.gr[strand(tss.gr)=="+",])
# startof the - strand genes must be equalized to end pos
start(tss.gr[strand(tss.gr)=="-",])=end(tss.gr[strand(tss.gr)=="-",])
# remove duplicated TSSes ie alternative transcripts
# this keeps the first instance and removes duplicates
tss.gr=tss.gr[!duplicated(tss.gr),]

Another way of doing this from a BED file is to use the readTranscriptfeatures() function from the genomation package. This function takes care of the steps described in the code chunk above.

Reading the genomic features as text files and converting to GRanges is not the only way to create a GRanges object. With the help of the rtracklayer package we can directly import BED files.

require(rtracklayer)

# we are reading a BED file, the path to the file
# is stored in filePathRefseq variable
import.bed(filePathRefseq)

Next, we will show how to use other methods to automatically obtain the data in the GRanges format from online databases. But you will not be able to use these methods for every data set, so it is good to know how to read data from flat files as well. We will use the rtracklayer package to download data from the UCSC Genome Browser. We will download CpG islands as GRanges objects. The rtracklayer workflow we show below works like using the UCSC table browser. You need to select which species you are working with, then you need to select which dataset you need to download and lastly you download the UCSC dataset or track as a GRanges object.

require(rtracklayer)
session <- browserSession("UCSC",url = 'http://genome-euro.ucsc.edu/cgi-bin/')
genome(session) <- "mm9"
## choose CpG island track on chr12
query <- ucscTableQuery(session, track="CpG Islands",table="cpgIslandExt",
        range=GRangesForUCSCGenome("mm9", "chr12"))
## get the GRanges object for the track
track(query)

There is also an interface to the Ensembl database called biomaRt. This package will enable you to access and import all of the datasets included in Ensembl. Another similar package is AnnotationHub. This package is an aggregator for different datasets from various sources. Using AnnotationHub one can access data sets from the UCSC browser, Ensembl browser and datasets from genomics consortia such as ENCODE and Roadmap Epigenomics. We provide examples of using Biomart package further into the chapter. In addition, the AnnotationHub package is used in Chapter 9.

6.1.2.1 Frequently used file formats and how to read them into R as a table

There are multiple file formats in genomics but some of them you will see more frequently than others. We already mentioned some of them. Here is a list of files and functions that can read them into R as GRanges objects or something coercible to GRanges objects.

BED: This format is used and popularized by the UCSC browser, and can hold a variety of information including exon/intron structure of transcripts in a single line. We will be using BED files in this chapter. In its simplest form, the BED file contains the chromosome name, the start position and end position for a genomic feature of interest.
- genomation::readBed()
- genomation::readTranscriptFeatures() good for getting intron/exon/promoters from BED12 files
- rtracklayer::import.bed()
GFF: GFF format is a tabular text format for genomic features similar to BED. However, it is a more flexible format than BED, which makes it harder to parse at times. Many gene annotation files are in this format.
- genomation::gffToGranges()
- rtracklayer::impot.gff()
BAM/SAM: BAM format is a compressed and indexed tabular file format designed for aligned sequencing reads. SAM is the uncompressed version of the BAM file. We will touch upon BAM files in this chapter. The uncompressed SAM file is similar in spirit to a BED file where you have the basic location of chromosomal location information plus additional columns that are related to the quality of alignment or other relevant information. We will introduce this format in detail later in this chapter.
- GenomicAlignments::readGAlignments
- Rsamtools::scanBam returns a data frame with columns from a SAM/BAM file.
BigWig: This is used to for storing scores associated with genomic intervals. It is an indexed format. Similar to BAM, this makes it easier to query and only necessary portions of the file could be loaded into memory.
- rtracklayer::import.bw()
Generic Text files: This represents any text file with the minimal information of chromosome, start and end coordinates.
- genomation::readGeneric()
Tabix/Bcf: These are tabular file formats indexed and compressed similar to BAM. The following functions return lists rather than tabular data structures. These formats are mostly used to store genomic variation data such as SNPs and indels.
- Rsamtools::scanTabix
- Rsamtools::scanBcf

6.1.3 Finding regions that do/do not overlap with another set of regions

This is one of the most common tasks in genomics. Usually, you have a set of regions that you are interested in and you want to see if they overlap with another set of regions or see how many of them overlap. A good example is transcription factor binding sites determined by ChIP-seq experiments. We will introduce ChIP-seq in more detail in Chapter 9. However, in these types of experiments and the following analysis, one usually ends up with genomic regions that are bound by transcription factors. One of the standard next questions would be to annotate binding sites with genomic annotations such as promoter, exon, intron and/or CpG islands, which are important for gene regulation. Below is a demonstration of how transcription factor binding sites can be annotated using CpG islands. First, we will get the subset of binding sites that overlap with the CpG islands. In this case, binding sites are ChIP-seq peaks.

In the code snippet below, we read the ChIP-seq analysis output files using the genomation::readBroadPeak() function. This function directly outputs a GRanges object. These output files are similar to BED files, where the location of the predicted binding sites are written out in a tabular format with some analysis-related scores and/or P-values. After reading the files, we can find the subset of peaks that overlap with the CpG islands using the subsetByoverlaps() function.

library(genomation)
filePathPeaks=system.file("extdata",   
              "wgEncodeHaibTfbsGm12878Sp1Pcr1xPkRep1.broadPeak.gz",
                      package="compGenomRData")

# read the peaks from a bed file
pk1.gr=readBroadPeak(filePathPeaks)

# get the peaks that overlap with CpG islands
subsetByOverlaps(pk1.gr,cpgi.gr)

## GRanges object with 44 ranges and 5 metadata columns:
##        seqnames            ranges strand |        name     score signalValue
##           <Rle>         <IRanges>  <Rle> | <character> <integer>   <numeric>
##    [1]    chr21   9825360-9826582      * |   peak14562        56      183.11
##    [2]    chr21   9968469-9968984      * |   peak14593       947     3064.92
##    [3]    chr21 15755368-15755956      * |   peak14828        90      291.90
##    [4]    chr21 19191579-19192525      * |   peak14840       290      940.03
##    [5]    chr21 26979619-26980048      * |   peak14854        32      104.67
##    ...      ...               ...    ... .         ...       ...         ...
##   [40]    chr21 46237464-46237809      * |   peak15034        32      106.36
##   [41]    chr21 46707702-46708084      * |   peak15037        67      217.02
##   [42]    chr21 46961552-46961875      * |   peak15039        38      124.31
##   [43]    chr21 47743587-47744125      * |   peak15050       353     1141.58
##   [44]    chr21 47878412-47878891      * |   peak15052       104      338.78
##           pvalue    qvalue
##        <integer> <integer>
##    [1]        -1        -1
##    [2]        -1        -1
##    [3]        -1        -1
##    [4]        -1        -1
##    [5]        -1        -1
##    ...       ...       ...
##   [40]        -1        -1
##   [41]        -1        -1
##   [42]        -1        -1
##   [43]        -1        -1
##   [44]        -1        -1
##   -------
##   seqinfo: 23 sequences from an unspecified genome; no seqlengths

For each CpG island, we can count the number of peaks that overlap with a given CpG island with GenomicRanges::countOverlaps().

counts=countOverlaps(pk1.gr,cpgi.gr)
head(counts)

## [1] 0 0 0 0 0 0

The GenomicRanges::findOverlaps() function can be used to see one-to-one overlaps between peaks and CpG islands. It returns a matrix showing which peak overlaps which CpG island.

findOverlaps(pk1.gr,cpgi.gr)

## Hits object with 45 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]     14562           1
##    [2]     14593           3
##    [3]     14828           8
##    [4]     14840          13
##    [5]     14854          16
##    ...       ...         ...
##   [41]     15034         155
##   [42]     15037         166
##   [43]     15039         176
##   [44]     15050         192
##   [45]     15052         200
##   -------
##   queryLength: 26121 / subjectLength: 205

Another interesting thing would be to look at the distances to the nearest CpG islands for each peak. In addition, just finding the nearest CpG island could also be interesting. Oftentimes, you will need to find the nearest TSS or gene to your regions of interest, and the code below is handy for doing that using the nearest() and distanceToNearest() functions, the resulting plot is shown in Figure 6.2.

# find nearest CpGi to each TSS
n.ind=nearest(pk1.gr,cpgi.gr)
# get distance to nearest
dists=distanceToNearest(pk1.gr,cpgi.gr,select="arbitrary")
dists

## Hits object with 620 hits and 1 metadata column:
##         queryHits subjectHits |  distance
##         <integer>   <integer> | <integer>
##     [1]     14440           1 |    384188
##     [2]     14441           1 |    382968
##     [3]     14442           1 |    381052
##     [4]     14443           1 |    379311
##     [5]     14444           1 |    376978
##     ...       ...         ... .       ...
##   [616]     15055         205 |     26212
##   [617]     15056         205 |     27402
##   [618]     15057         205 |     30468
##   [619]     15058         205 |     31611
##   [620]     15059         205 |     34090
##   -------
##   queryLength: 26121 / subjectLength: 205

# histogram of the distances to nearest TSS
dist2plot=mcols(dists)[,1]
hist(log10(dist2plot),xlab="log10(dist to nearest TSS)",
     main="Distances")

FIGURE 6.2: Histogram of distances of CpG islands to the nearest TSSes.