2.6 Reading and writing data

Most of the genomics data sets are in the form of genomic intervals associated with a score. That means mostly the data will be in table format with columns denoting chromosome, start positions, end positions, strand and score. One of the popular formats is the BED format, which is used primarily by the UCSC genome browser but most other genome browsers and tools will support the BED file format. We have all the annotation data in BED format. You will read more about data formats in Chapter 6. In R, you can easily read tabular format data with the read.table() function.

enhancerFilePath=system.file("extdata",
                "subset.enhancers.hg18.bed",
                package="compGenomRData")
cpgiFilePath=system.file("extdata",
                "subset.cpgi.hg18.bed",
                package="compGenomRData")
# read enhancer marker BED file
enh.df <- read.table(enhancerFilePath, header = FALSE) 

# read CpG island BED file
cpgi.df <- read.table(cpgiFilePath, header = FALSE) 

# check first lines to see how the data looks like
head(enh.df)

##      V1     V2     V3 V4   V5 V6    V7      V8 V9
## 1 chr20 266275 267925  . 1000  .  9.11 13.1693 -1
## 2 chr20 287400 294500  . 1000  . 10.53 13.0231 -1
## 3 chr20 300500 302500  . 1000  .  9.10 13.3935 -1
## 4 chr20 330400 331800  . 1000  .  6.39 13.5105 -1
## 5 chr20 341425 343400  . 1000  .  6.20 12.9852 -1
## 6 chr20 437975 439900  . 1000  .  6.31 13.5184 -1

head(cpgi.df)

##      V1     V2     V3       V4
## 1 chr20 195575 195851  CpG:_28
## 2 chr20 207789 208148  CpG:_32
## 3 chr20 219055 219437  CpG:_33
## 4 chr20 225831 227155 CpG:_135
## 5 chr20 252826 256323 CpG:_286
## 6 chr20 275376 276977 CpG:_116

You can save your data by writing it to disk as a text file. A data frame or matrix can be written out by using the write.table() function. Now let us write out cpgi.df. We will write it out as a tab-separated file; pay attention to the arguments.

write.table(cpgi.df,file="cpgi.txt",quote=FALSE,
            row.names=FALSE,col.names=FALSE,sep="\t")

You can save your R objects directly into a file using save() and saveRDS() and load them back in with load() and readRDS(). By using these functions you can save any R object whether or not it is in data frame or matrix classes.

save(cpgi.df,enh.df,file="mydata.RData")
load("mydata.RData")
# saveRDS() can save one object at a type
saveRDS(cpgi.df,file="cpgi.rds")
x=readRDS("cpgi.rds")
head(x)

One important thing is that with save() you can save many objects at a time, and when they are loaded into memory with load() they retain their variable names. For example, in the above code when you use load("mydata.RData") in a fresh R session, an object named cpg.df will be created. That means you have to figure out what name you gave to the objects before saving them. Conversely, when you save an object by saveRDS() and read by readRDS(), the name of the object is not retained, and you need to assign the output of readRDS() to a new variable (x in the above code chunk).

2.6.1 Reading large files

Reading large files that contain tables with base R function read.table() might take a very long time. Therefore, there are additional packages that provide faster functions to read the files. The data.table and readr packages provide this functionality. Below, we show how to use them. These functions with provided parameters will return equivalent output to the read.table() function.

library(data.table)
df.f=d(enhancerFilePath, header = FALSE,data.table=FALSE)

library(readr)
df.f2=read_table(enhancerFilePath, col_names = FALSE)