7.3 Filtering and trimming reads

Based on the results of the quality check, you may want to trim or filter the reads. The quality check might have shown the number of reads that have low quality scores. These reads will probably not align very well because of the potential mistakes in base calling, or they may align to wrong places in the genome. Therefore, you may want to remove these reads from your fastq file. Another potential scenario is that parts of your reads need to be trimmed in order to align the reads. In some cases, adapters will be present in either side of the read; in other cases technical errors will lead to decreasing base quality towards the ends of the reads. Both in these cases, the portion of the read should be trimmed so the read can align or better align the genome. We will show how to use the QuasR package to trim the reads. Other packages such as ShortRead also have capabilities to trim and filter reads. However, the QuasR::preprocessReads() function provides a single interface to multiple preprocessing possibilities. With this function, we match adapter sequences and remove them. We can remove low-complexity reads (reads containing repetitive sequences). We can trim the start or ends of the reads by a pre-defined length.

Below we will first set up the file paths to fastq files and filter them based on their length and whether or not they contain the “N” character, which stands for unidentified base. With the same function we will also trim 3 bases from the end of the reads and also trim segments from the start of the reads if they match the “ACCCGGGA” sequence.

library(QuasR)

# obtain a list of fastq file paths
fastqFiles <- system.file(package="ShortRead",
                          "extdata/E-MTAB-1147",
                          c("ERR127302_1_subset.fastq.gz",
                            "ERR127302_2_subset.fastq.gz")
)

# defined processed fastq file names
outfiles <- paste(tempfile(pattern=c("processed_1_",
                              "processed_2_")),".fastq",sep="")

# process fastq files
# remove reads that have more than 1 N, (nBases)
# trim 3 bases from the end of the reads (truncateEndBases)
# Remove ACCCGGGA patern if it occurs at the start (Lpattern)
# remove reads shorter than 40 base-pairs (minLength)
preprocessReads(fastqFiles, outfiles, 
                nBases=1,
                truncateEndBases=3,
                Lpattern="ACCCGGGA",
                minLength=40)

As we have mentioned, the ShortRead package has low-level functions, which QuasR::preprocessReads() also depends on. We can use these low-level functions to filter reads in ways that are not possible using the QuasR::preprocessReads() function. Below we are going to read in a fastq file and filter the reads where every quality score is below 20.

library(ShortRead)

# obtain a list of fastq file paths
fastqFile <- system.file(package="ShortRead",
                          "extdata/E-MTAB-1147",
                          "ERR127302_1_subset.fastq.gz")

# read fastq file
fq = readFastq(fastqFile)

# get quality scores per base as a matrix
qPerBase = as(quality(fq), "matrix")

# get number of bases per read that have quality score below 20
# we use this
qcount = rowSums( qPerBase <= 20) 

# Number of reads where all Phred scores >= 20
fq[qcount == 0]

## class: ShortReadQ
## length: 10699 reads; width: 72 cycles

We can finally write out the filtered fastq file with the ShortRead::writeFastq() function.

# write out fastq file with only reads where all 
# quality scores per base are above 20
writeFastq(fq[qcount == 0], 
           paste(fastqFile, "Qfiltered", sep="_"))

As fastq files can be quite large, it may not be feasible to read a 30-Gigabyte file into memory. A more memory-efficient way would be to read the file piece by piece. We can do our filtering operations for each piece, write the filtered part out, and read a new piece. Fortunately, this is possible using the ShortRead::FastqStreamer() function. This function enables “streaming” the fastq file in pieces, which are blocks of the fastq file with a pre-defined number of reads. We can access the successive blocks with the yield() function. Each time we call the yield() function after opening the fastq file with FastqStreamer(), a new part of the file will be read to the memory.

# set up streaming with block size 1000
# every time we call the yield() function 1000 read portion
# of the file will be read successively. 
f <- FastqStreamer(fastqFile,readerBlockSize=1000) 

# we set up a while loop to call yield() function to
# go through the file
while(length(fq <- yield(f))) {
  
    # remove reads where all quality scores are < 20 
    # get quality scores per base as a matrix
    qPerBase = as(quality(fq), "matrix")

    # get number of bases per read that have Q score < 20
    qcount = rowSums( qPerBase <= 20) 
 
    # write fastq file with mode="a", so every new block
    # is written out to the same file
    writeFastq(fq[qcount == 0], 
               paste(fastqFile, "Qfiltered", sep="_"), 
               mode="a")
}