7.4 Mapping/aligning reads to the genome

After the quality check and potential pre-processing, the reads are ready to be mapped or aligned to the reference genome. This process simply finds the most probable origin of each read in the genome. Since there might be errors in sequencing and mutations in the genomes, we may not find exact matches of reads in the genomes. An important feature of the alignment algorithms is to tolerate potential mismatches between reads and the reference genome. In addition, efficient algorithms and data structures are needed for the alignment to be completed in a reasonable amount of time. Alignment methods usually create data structures to store and efficiently search the genome for matching reads. These data structures are called genome indices and creating these indices is the first step for the read alignment. Based on how indices are created, there are two major types of methods. One class of methods relies on “hash tables”, to store and search the genomes. Hash tables are simple lookup tables in which all possible k-mers point to locations in the genome. The general idea is that overlapping k-mers constructed from a read go through this lookup table. Each k-mer points to potential locations in the genome. Then, the final location for the read is obtained by optimizing the k-mer chain by their distances in the genome and in the read. This optimization process removes k-mer locations that are distant from other k-mers that map nearby each other.

Another class of algorithms builds genome indices by creating a Burrows-Wheeler transformation of the genome. This in essence creates a compact and searchable data structure for all reads. Although details are out of the scope of this section, these alignment tools provide faster alignment and use less memory. BWA(H. Li and Durbin 2009 a), Bowtie1/2(Langmead and Salzberg 2012 a) and SOAP(R. Li, Yu, Li, et al. 2009) are examples of such algorithms.

The read mapping in R can be done with the gmapR (Barr, Wu, and Lawrence 2019), QuasR (Gaidatzis, Lerch, Hahne, et al. 2015), Rsubread (Liao, Smyth, and Shi 2013), and systemPipeR (Backman and Girke 2016) packages. We will demonstrate read mapping with QuasR which uses the Rbowtie package, which wraps the Bowtie aligner. Below, we show how to map reads from a ChIP-seq experiment using QuasR/bowtie.

We will use the qAlign() function which requires two mandatory arguments: 1) a genome file in either fasta format or as a BSgenome package and 2) a sample file which is a text file and contains file paths to fastq files and sample names. In the case below, sample file looks like this:

FileName    SampleName
chip_1_1.fq.bz2 Sample1
chip_2_1.fq.bz2 Sample2

library(QuasR)

# copy example data to current working directory
file.copy(system.file(package="QuasR", "extdata"), ".", recursive=TRUE)

# genome file in fasta format
genomeFile <- "extdata/hg19sub.fa"

# text file containing sample names and fastq file paths
sampleFile <- "extdata/samples_chip_single.txt"

# create alignments 
proj <- qAlign(sampleFile, genomeFile)

It is good to explain what is going on here as the qAlign() function makes things look simple. This function is designed to be easy. For example, it creates a genome index automatically if it does not exist, and will look for existing indices before it creates one. We provided only two arguments, a text file containing sample names and fastq file paths and a reference genome file. In fact, this function also has many knobs and you can change its behavior by supplying different arguments in order to affect the behavior of Bowtie. For example, you can supply parameters to Bowtie using the alignmentParameter argument. However the qAlign() function is optimized for different types of alignment problems and selects alignment parameters automatically. It is designed to work with alignment and quantification tasks for RNA-seq, ChIP-seq, small-RNA sequencing, Bisulfite sequencing (DNA methylation) and allele-specific analysis. If you want to change the default bowtie parameters, only do it for simple alignment problems such as ChIP-seq and RNA-seq.

Want to know more ?

More on hash tables and Burrows-Wheeler-based aligners
- A survey of sequence alignment algorithms for next-generation sequencin: (https://academic.oup.com/bib/article/11/5/473/264166) H Li, N Homer - Briefings in bioinformatics, 2010
More on QuasR and all the alignment and post-processing capabilities: (https://bioconductor.org/packages/release/bioc/vignettes/QuasR/inst/doc/QuasR.html)

References

Backman, and Girke. 2016. “systemPipeR: NGS Workflow and Report Generation Environment.” BMC Bioinformatics 17 (1). https://doi.org/10.1186/s12859-016-1241-0.

Barr, Wu, and Lawrence. 2019. GmapR: An R Interface to the Gmap/Gsnap/Gstruct Suite.

Gaidatzis, Lerch, Hahne, and Stadler. 2015. “QuasR: Quantification and Annotation of Short Reads in R.” Bioinformatics 31 (7): 1130–2. https://doi.org/10.1093/bioinformatics/btu781.

Langmead, and Salzberg. 2012a. “Fast Gapped-Read Alignment with Bowtie 2.” Nature Methods 9 (4): 357.

Liao, Smyth, and Shi. 2013. “The Subread Aligner: Fast, Accurate and Scalable Read Mapping by Seed-and-Vote.” Nucleic Acids Research 41 (10): e108–e108. https://doi.org/10.1093/nar/gkt214.

Li, and Durbin. 2009a. “Fast and Accurate Short Read Alignment with Burrows–Wheeler Transform.” Bioinformatics 25 (14): 1754–60.

Li, Yu, Li, Lam, Yiu, Kristiansen, and Wang. 2009. “SOAP2: An Improved Ultrafast Tool for Short Read Alignment.” Bioinformatics 25 (15): 1966–7.