10.4 Data filtering and exploratory analysis
We assume that we start the analysis in R with the methylation call files. We will read those files in and carry out exploratory analysis, and we will show how to filter bases or regions from the data and in what circumstances we might need to do so. We will use the methylKit(Akalin, Kormaksson, Li, et al. 2012) package for the bulk of the analysis.
10.4.1 Reading methylation call files
A typical methylation call file looks like this:
## chrBase chr base strand coverage freqC freqT ## 1 chr21.9764539 chr21 9764539 R 12 25.00 75.00 ## 2 chr21.9764513 chr21 9764513 R 12 0.00 100.00 ## 3 chr21.9820622 chr21 9820622 F 13 0.00 100.00 ## 4 chr21.9837545 chr21 9837545 F 11 0.00 100.00 ## 5 chr21.9849022 chr21 9849022 F 124 72.58 27.42
Most of the time bisulfite sequencing experiments have test and control samples. The test samples can be from a disease tissue while the control samples can be from a healthy tissue. You can read a set of methylation call files that have test/control conditions giving a
treatment vector option. The treatment vector defines the sample groups and it is very important for the differential methylation analysis. For the sake of subsequent analysis, file.list, sample.id and treatment option should have the same order. In the following example, the first two files have the sample IDs “test1” and “test2” and as determined by the treatment vector they belong to the same group. The third and fourth files have sample IDs “ctrl1” and “ctrl2” and they belong to the same group as indicated by the treatment vector. We will first get a list of file paths and have a look at the content.
If you look what is inside the
file.list variable, you will see that it is a simple list of file paths. Each file contains methylation calls for a given sample. Now, we can read the files with the
Tab-separated bedgraph like formats from Bismark methylation caller can also be read in by methylkit. In those cases, we have to provide either
pipeline="bismarkCytosineReport" to the
methRead() function. In addition to the options we mentioned above,
any tab-separated text file with a generic format can be read in using methylKit,
such as methylation ratio files from BSMAP.
See here for an example.
Before we move on, let us have a look at what kind of information is stored in
myobj. This is technically a
methylRawList object, which is essentially a list of
methylRaw objects. These objects hold
the information for the genomic location of Cs, and methylated Cs and unmethylated Cs.
##  4
## chr start end strand coverage numCs numTs ## 1 chr21 9764513 9764513 - 12 0 12 ## 2 chr21 9764539 9764539 - 12 3 9 ## 3 chr21 9820622 9820622 + 13 0 13 ## 4 chr21 9837545 9837545 + 11 0 11 ## 5 chr21 9849022 9849022 + 124 90 34 ## 6 chr21 9853296 9853296 + 17 10 7
10.4.2 Further quality check
It is always a good idea to check how the data looks before proceeding further. For example, the methylation values should have bimodal distribution generally. This can be checked via the
getMethylationStats() function. Normally, we should see bimodal
distributions. Strong deviations from the bimodality may be due to poor experimental quality, such as problems with bisulfite treatment. Below we show how to get these plots using the
getMethylationStats() function. The result is shown in Figure 10.1. As expected, it has a bimodal distribution where most CpGs have either high methylation or low methylation.
In addition, we might want to see coverage values. By default, methylkit handles bases with at least 10X coverage but that can be changed. The bases with unusually high coverage are usually alarming. It might indicate a PCR bias issue in the experimental procedure. The general coverage statistics can be checked with the
getCoverageStats() function shown below. The resulting plot is shown in Figure 10.2.
It might be useful to filter samples based on coverage. Particularly, if our samples are suffering from PCR bias, it would be useful to discard bases with very high read coverage. Furthermore, we would also like to discard bases that have low read coverage; a high enough read coverage will increase the power of the statistical tests. The code below filters a
methylRawList, discards bases that have coverage below 10X, and also discards the bases that have more than 99.9th percentile of coverage in each sample.
10.4.3 Merging samples into a single table
When we first read the files, each file is stored as its own entity. If we want to compare samples in any way, we need to make a unified data structure that contains CpGs that are covered in most samples. The
unite() function creates a new object using the CpGs covered in each sample.
Let us take a look at the data content of the
## chr start end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2 ## 1 chr21 9853296 9853296 + 17 10 7 333 268 65 ## 2 chr21 9853326 9853326 + 17 12 5 329 249 79 ## 3 chr21 9860126 9860126 + 39 38 1 83 78 5 ## 4 chr21 9906604 9906604 + 68 42 26 111 97 14 ## 5 chr21 9906616 9906616 + 68 52 16 111 104 7 ## 6 chr21 9906619 9906619 + 68 59 9 111 109 2 ## coverage3 numCs3 numTs3 coverage4 numCs4 numTs4 ## 1 18 16 2 395 341 54 ## 2 16 14 2 379 284 95 ## 3 83 83 0 41 40 1 ## 4 23 18 5 37 33 4 ## 5 23 14 9 37 27 10 ## 6 22 18 4 37 29 8
By default, the
unite() function produces bases/regions covered in all samples. That requirement can be relaxed using the
min.per.group option in the
10.4.4 Filtering CpGs
We might need to filter the CpGs further before exploratory analysis or even before the downstream analysis such as differential methylation. For exploratory analysis, it is of general interest to see how samples relate to each other and we might want to remove CpGs that are not variable before doing that. Or we might remove Cs that are potentially C->T mutations. First, we show how to
filter based on variation. Below, we extract percent methylation values from CpGs as a matrix. Calculate the standard deviation for each CpG and filter based on standard deviation. We also plot the distribution of per-CpG standard deviations with the
hist() function. The resulting plot is shown in Figure 10.3.
## chr start end strand coverage1 numCs1 numTs1 coverage2 numCs2 numTs2 ## 11 chr21 9906681 9906681 + 21 12 9 60 56 4 ## 12 chr21 9906694 9906694 + 21 9 12 60 53 7 ## 13 chr21 9906700 9906700 + 13 6 7 53 43 10 ## 14 chr21 9906714 9906714 + 14 3 11 41 37 4 ## 18 chr21 9906873 9906873 + 12 8 4 41 33 8 ## 23 chr21 9927527 9927527 + 17 5 12 40 22 18 ## coverage3 numCs3 numTs3 coverage4 numCs4 numTs4 ## 11 37 14 23 26 11 15 ## 12 39 16 23 26 15 11 ## 13 30 8 22 23 10 13 ## 14 25 19 6 21 19 2 ## 18 15 4 11 22 7 15 ## 23 32 32 0 14 11 3
Now, let’s assume we know the locations of C->T mutations. These locations should be removed from the analysis as they do not represent
bisulfite-treatment-associated conversions. Mutation locations are
stored in a
GRanges object, and we can use that to remove CpGs
overlapping with mutations. In order to do the overlap operation, we will convert the methylKit object to a
GRanges object and do the filtering with the
%over% function within
[ ]. The returned object will still be a methylKit object.
##  963
##  961
10.4.5 Clustering samples
Clustering is used for grouping data points by their similarity. It is a general concept that can be achieved by many different algorithms and we introduced clustering and multiple prominent clustering algorithms in Chapter 4. In the context of DNA methylation, we are trying to find samples that are similar to each other. For example, if we sequenced 3 heart samples and 4 liver samples, we would expect liver samples will be more similar to each other than heart samples on the DNA methylation space.
The following function will cluster the samples and draw a dendrogram. It will use correlation distance, which is \(1-\rho\) , where \(\rho\) is the correlation coefficient between two pairs of samples. The cluster tree will be drawn using the “ward” method. This specific variant uses a “bottom up” approach: each data point starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. In Ward’s method, two clusters are merged if the variance is minimized compared to other possible merge operations. This bottom up approach helps build the dendrogram showing the relationship between clusters. The result of the clustering is shown in Figure 10.4.
## ## Call: ## hclust(d = d, method = HCLUST.METHODS[hclust.method]) ## ## Cluster method : ward.D ## Distance : pearson ## Number of objects: 4
plot=FALSE will return a dendrogram object which can be manipulated by users or fed in to other user functions that can work with dendrograms.
10.4.6 Principal component analysis
Principal component analysis (PCA) is a mathematical transformation of (possibly) correlated variables into a number of uncorrelated variables called principal components. The resulting components from this transformation are defined in such a way that the first principal component has the highest variance and accounts for most of the variability in the data. We have introduced PCA and other similar methods in Chapter 4. The following function will plot a scree plot for importance of components and the result is shown in Figure 10.5.
We can also plot the PC1 and PC2 axes and a scatter plot of our samples on those axes which will reveal how they cluster within these new dimensions. Similar to the clustering dendrogram, we would like to see samples that are similar to be close to each other on the scatter plot. If they are not, it might indicate problems with the experiment such as batch effects. The function below plots the samples in such a scatter plot on principal component axes. The resulting plot is shown in Figure 10.6.
In this case, we also returned an object from the plotting function. This is the output of the
prcomp() function, which includes loadings and eigenvectors which might be useful. You can also do your own PCA analysis using
prcomp(). In the case above, the methylation matrix is transposed. This allows us to compare distances between samples on the PCA scatter plot.
Akalin, Kormaksson, Li, Garrett-Bakelman, Figueroa, Melnick, and Mason. 2012. “MethylKit: A Comprehensive R Package for the Analysis of Genome-Wide DNA Methylation Profiles.” Genome Biol. 13 (10): R87.