2.1 Steps of (genomic) data analysis

Regardless of the analysis type, data analysis follows a common pattern. We will discuss this general pattern and how it applies to genomics problems. The data analysis steps typically include data collection, quality check and cleaning, processing, modeling, visualization, and reporting. Although one expects to go through these steps in a linear fashion, it is normal to go back and repeat steps with different parameters or tools. In practice, data analysis requires going through the same steps over and over again in order to do a combination of the following: a) answer other related questions, b) deal with data quality issues that are discovered later, and c) include new data sets in the analysis.

We will now go through a brief explanation of the steps within the context of genomic data analysis.

2.1.1 Data collection

Data collection refers to any source, experiment or survey that provides data for the data analysis question you have. In genomics, data collection is done by high-throughput assays, introduced in Chapter 1. One can also use publicly available data sets and specialized databases, also mentioned in Chapter 1. How much data and what type of data you should collect depend on the question you are trying to answer and on the technical and biological variability of the system you are studying.

2.1.2 Data quality check and cleaning

In general, data analysis almost always deals with imperfect data. It is common to have missing values or measurements that are noisy. Data quality check and cleaning aims to identify data quality issues and remove them from the data set.

High-throughput genomics data is produced by technologies that can embed technical biases into the data. To give an example from sequencing, not all bases in a sequenced read are called with the same quality. Towards the ends of the reads, bases are more likely to be called incorrectly. Identifying those low-quality bases and removing them will improve the read mapping step.
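As a rough illustration of trimming low-quality bases from read ends, here is a minimal base R sketch. It assumes Phred+33 encoded quality strings; the function name `trim_3prime` and the threshold of 20 are hypothetical choices made for this example, and dedicated tools typically use more refined strategies such as sliding windows.

```r
# Trim low-quality bases from the 3' end of a single read (illustrative sketch).
# Assumption: quality string is Phred+33 encoded.
# `trim_3prime` and `min_q = 20` are hypothetical, chosen for this example.
trim_3prime <- function(seq, qual, min_q = 20) {
  q <- utf8ToInt(qual) - 33L           # decode ASCII characters to Phred scores
  keep <- which(q >= min_q)            # positions that pass the threshold
  if (length(keep) == 0) {
    return(list(seq = "", qual = ""))  # nothing usable in this read
  }
  end <- max(keep)                     # last position with acceptable quality
  list(seq = substr(seq, 1, end), qual = substr(qual, 1, end))
}

read_seq  <- "ACGTACGTAC"
read_qual <- "IIIIIIII#!"   # 'I' = Q40, '#' = Q2, '!' = Q0
trimmed <- trim_3prime(read_seq, read_qual)
trimmed$seq                 # "ACGTACGT"
```

The last two bases fall below the threshold and are removed, while the high-quality prefix is kept intact.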

2.1.3 Data processing

This step refers to processing the data into a format that is suitable for exploratory analysis and modeling. Oftentimes, the data will not come in a ready-to-analyze format. You may need to convert it to other formats by transforming data points (such as log transforming, normalizing, etc.), or subset the data set with some arbitrary or pre-defined condition. In terms of genomics, processing includes multiple steps. Following the sequencing analysis example above, processing will include aligning reads to the genome and quantification over genes or regions of interest. Quantification here simply means counting how many reads cover your regions of interest. If your experimental protocol was RNA sequencing, this quantity can give you an idea of how much a gene is expressed. This can be followed by some normalization to aid the next step.
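The normalization and log transformation mentioned above can be sketched in base R as follows. The count matrix is made up for illustration, and counts per million (CPM) is only one of several possible normalization choices, used here as an assumption.

```r
# Toy read-count matrix: rows are genes, columns are samples (made-up numbers).
counts <- matrix(c(10,  20,  0,
                   200, 400, 5,
                   90,  180, 15),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("s1", "s2", "s3")))

lib_size <- colSums(counts)            # total reads per sample
cpm <- t(t(counts) / lib_size) * 1e6   # scale each column to counts per million
log_cpm <- log2(cpm + 1)               # pseudocount of 1 avoids log2(0)

round(log_cpm, 2)
```

In this toy example, sample s2 has exactly twice the reads of s1 for every gene, so after scaling by library size the two samples have identical CPM values. That is the kind of sequencing-depth difference this normalization is meant to remove.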

2.1.4 Exploratory data analysis and modeling

This phase usually takes in the processed or semi-processed data and applies machine learning or statistical methods to explore the data. Typically, one needs to see the relationships between the variables measured, and the relationships between samples based on those variables. At this point, we might check whether the samples group as expected by the experimental design, and whether there are outliers or other anomalies. After this step you might want to do additional cleanup or re-processing to deal with anomalies.
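One way to check whether samples group as expected is sketched below, using principal component analysis and hierarchical clustering from base R. The expression matrix is simulated, and the group sizes and effect size are assumptions made only for this example.

```r
set.seed(1)
# Simulated log-expression: 100 genes x 6 samples, two groups of three.
# In group B the first 30 genes are shifted up by 5; all values are made up.
expr <- matrix(rnorm(600), nrow = 100,
               dimnames = list(paste0("gene", 1:100),
                               c("A1", "A2", "A3", "B1", "B2", "B3")))
expr[1:30, 4:6] <- expr[1:30, 4:6] + 5

# PCA treats rows as observations, so samples go in rows (hence the transpose).
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
round(pca$x[, 1:2], 2)        # sample coordinates on the first two components

# Hierarchical clustering of samples on Euclidean distance.
hc <- hclust(dist(t(expr)))
groups <- cutree(hc, k = 2)   # cut the tree into two clusters
groups
```

If the first principal component or the two clusters do not follow the experimental groups, that is a hint to look for outliers, batch effects, or sample mix-ups before modeling.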

Another related step is modeling. This generally refers to modeling your variable of interest based on other variables you measured. In the context of genomics, it could be that you are trying to predict disease status of the patients from expression of genes you measured from their tissue samples. Then your variable of interest is the disease status. This kind of approach is generally called “predictive modeling”, and could be solved with regression-based machine learning methods.
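A minimal sketch of such a predictive model is shown below, using logistic regression from base R as one regression-based option. The data are simulated: the gene names `geneX` and `geneY`, the effect size, and the 70/30 train/test split are all assumptions made for illustration.

```r
set.seed(42)
n <- 100
# Simulated data: disease status depends on geneX expression; geneY is noise.
geneX <- rnorm(n)
geneY <- rnorm(n)
disease <- rbinom(n, size = 1, prob = plogis(2 * geneX))
dat <- data.frame(disease, geneX, geneY)

# Hold out part of the data to evaluate predictions on unseen patients.
train <- dat[1:70, ]
test  <- dat[71:100, ]

# Logistic regression: disease status is the variable of interest.
fit <- glm(disease ~ geneX + geneY, data = train, family = binomial)

# Predicted probability above 0.5 is called as "diseased".
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)
accuracy <- mean(pred == test$disease)
accuracy
```

Evaluating on held-out samples rather than on the training data gives a more honest estimate of how well the model would predict disease status for new patients.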

Statistical modeling is also part of this modeling step. This can cover predictive modeling as well, where we use statistical methods such as linear regression. Other analyses such as hypothesis testing, where we have an expectation and we are trying to confirm that expectation, are also related to statistical modeling. A good example of this in genomics is differential gene expression analysis. This can be formulated as comparing two data sets, in this case expression values from condition A and condition B, with the expectation that condition A and condition B have similar expression values. You will see more on this in Chapter 3.
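The hypothesis-testing view of differential expression can be sketched with a per-gene t-test in base R. The data are simulated and the choice of a plain t-test with Benjamini-Hochberg correction is an assumption made for this example; dedicated differential expression methods use more elaborate models.

```r
set.seed(7)
n_genes <- 50
# Simulated log-expression, 6 replicates per condition (made-up values).
expr_a <- matrix(rnorm(n_genes * 6, mean = 5, sd = 0.5), nrow = n_genes)
expr_b <- matrix(rnorm(n_genes * 6, mean = 5, sd = 0.5), nrow = n_genes)
expr_b[1:5, ] <- expr_b[1:5, ] + 4   # genes 1-5 are truly different in B

# Null hypothesis for each gene: conditions A and B have the same mean.
pvals <- sapply(seq_len(n_genes), function(i) {
  t.test(expr_a[i, ], expr_b[i, ])$p.value
})

# Testing many genes at once requires multiple-testing correction.
padj <- p.adjust(pvals, method = "BH")
which(padj < 0.05)                   # genes called differentially expressed
```

A small adjusted p-value means the observed difference would be unlikely if the two conditions really had similar expression values.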

2.1.5 Visualization and reporting

Visualization is useful, to varying degrees, in all of the previous steps. In the final phase, however, we need final figures, tables, and text that describe the outcome of the analysis. This will be your report. In genomics, we use common data visualization methods as well as specific visualization methods developed or popularized by genomic data analysis. You will see many popular visualization methods in Chapters 3 and 6.

2.1.6 Why use R for genomics?

R, with its statistical analysis heritage, plotting features, and rich collection of user-contributed packages, is one of the best languages for the task of analyzing genomic data. High-dimensional genomics data sets can usually be analyzed with core R packages and functions. On top of that, Bioconductor and CRAN have an array of specialized tools for doing genomics-specific analysis. Here is a list of computational genomics tasks that can be completed using R.

2.1.6.1 Data cleanup and processing

Most general data cleanup, such as removing incomplete columns and values or reorganizing and transforming data, can be achieved using R. In addition, with the help of packages, R can connect to databases in various formats such as MySQL, MongoDB, etc., and query and get the data into the R environment using database-specific tools.
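The kind of general cleanup described above might look like the following base R sketch. The data frame and its column names are made up for illustration.

```r
# Toy table with missing values (all values are made up).
df <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  geneA  = c(5.1, NA, 7.3, 6.8),
  geneB  = c(2.0, 3.5, 4.1, NA),
  batch  = c(NA, NA, NA, NA)
)

# Drop columns that contain no data at all.
df <- df[, colSums(!is.na(df)) > 0, drop = FALSE]

# Keep only rows (samples) without any missing value.
clean <- df[complete.cases(df), ]

# Transform a column, here with a log2 transformation.
clean$geneA_log2 <- log2(clean$geneA)
clean
```

Here the empty `batch` column is removed first, so that it does not cause every row to be discarded by `complete.cases()`.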

On top of these, genomic data-specific processing and quality checks can be achieved via R/Bioconductor packages. For example, sequencing read quality checks and even high-throughput read alignments can be achieved via R packages.

2.1.6.2 General data analysis and exploration

Most genomics data sets are suitable for application of general data analysis tools. In some cases, you may need to preprocess the data to get it to a state that is suitable for application of such tools. Here is a non-exhaustive list of what kind of things can be done via R. You will see popular data analysis methods in Chapters 3, 4 and 5.

Unsupervised data analysis: clustering (k-means, hierarchical), matrix factorization (PCA, ICA, etc.)
Supervised data analysis: generalized linear models, support vector machines, random forests
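As a small taste of the unsupervised methods listed above, k-means clustering is available in core R. The sketch below uses simulated two-dimensional data; the group sizes and means are assumptions made for this example.

```r
set.seed(123)
# 40 simulated samples measured on two features: two separated groups of 20.
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
true_group <- rep(c("group1", "group2"), each = 20)

# nstart > 1 reruns k-means from several random starts and keeps the best fit.
km <- kmeans(x, centers = 2, nstart = 10)

# Cross-tabulate the clusters found against the simulated groups.
table(cluster = km$cluster, truth = true_group)
```

The cluster labels themselves are arbitrary, so the comparison is about whether each simulated group falls into a single cluster, not about which number it receives.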

2.1.6.3 Genomics-specific data analysis methods

R/Bioconductor gives you access to a multitude of other bioinformatics-specific algorithms. Here are some of the things you can do. We will touch upon many of the following methods in Chapter 6 and onwards.

Sequence analysis: TF binding motifs, GC content and CpG counts of a given DNA sequence
Differential expression (for arrays and sequencing-based measurements)
Gene set/pathway analysis: What kind of genes are enriched in my gene set?
Genomic interval operations such as overlapping CpG islands with transcription start sites, and filtering based on overlaps
Overlapping aligned reads with exons and counting aligned reads per gene
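Two of the tasks above, sequence composition and counting reads over genes, can be illustrated with base R alone. The sequence, gene coordinates, and read coordinates below are made up, and the helper `count_overlaps` is hypothetical. Bioconductor packages such as GenomicRanges provide efficient implementations of interval operations for real data sets.

```r
# --- GC content and CpG count of a DNA sequence (toy sequence) ---
dna <- "ACGCGTTACGGCATCG"
bases <- strsplit(dna, "")[[1]]
gc_content <- mean(bases %in% c("G", "C"))   # fraction of G or C bases
# A CpG is a C immediately followed by a G.
cpg_count <- sum(bases[-length(bases)] == "C" & bases[-1] == "G")
gc_content   # 0.625
cpg_count    # 4

# --- Counting aligned reads per gene (toy coordinates, one chromosome) ---
genes <- data.frame(name  = c("gene1", "gene2"),
                    start = c(100, 500),
                    end   = c(300, 800))
reads <- data.frame(start = c(90, 150, 290, 310, 520, 790, 900),
                    end   = c(120, 200, 330, 350, 570, 840, 950))

# Two closed intervals overlap when each one starts before the other ends.
# `count_overlaps` is a hypothetical helper written for this example.
count_overlaps <- function(g_start, g_end) {
  sum(reads$start <= g_end & reads$end >= g_start)
}
genes$count <- mapply(count_overlaps, genes$start, genes$end)
genes
```

This brute-force comparison checks every read against every gene, which is fine for a toy example but too slow for millions of reads.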

2.1.6.4 Visualization

Visualization is an important part of all data analysis techniques including computational genomics. Again, you can use core visualization techniques in R and also genomics-specific ones with the help of specific packages. Here are some of the things you can do with R.

Basic plots: Histograms, scatter plots, bar plots, box plots, heatmaps
Ideograms and circos plots, which visualize different features over the whole genome
Meta-profiles of genomic features, such as read enrichment over all promoters
Visualization of quantitative assays for a given locus in the genome