2.6 Reading and writing data
Most genomics data sets take the form of genomic intervals associated with a score. In practice, this means the data is usually a table with columns denoting chromosome, start position, end position, strand and score. One popular format is BED, which is used primarily by the UCSC genome browser and is supported by most other genome browsers and tools. We have all the annotation data in BED format. You will read more about data formats in Chapter 6. In R, you can easily read tabular data with the read.table() function.
enhancerFilePath <- system.file("extdata",
                                "subset.enhancers.hg18.bed",
                                package = "compGenomRData")
cpgiFilePath <- system.file("extdata",
                            "subset.cpgi.hg18.bed",
                            package = "compGenomRData")
# read enhancer marker BED file
enh.df <- read.table(enhancerFilePath, header = FALSE)

# read CpG island BED file
cpgi.df <- read.table(cpgiFilePath, header = FALSE)

# check the first lines to see what the data looks like
head(enh.df)
##      V1     V2     V3 V4   V5 V6    V7      V8 V9
## 1 chr20 266275 267925  . 1000  .  9.11 13.1693 -1
## 2 chr20 287400 294500  . 1000  . 10.53 13.0231 -1
## 3 chr20 300500 302500  . 1000  .  9.10 13.3935 -1
## 4 chr20 330400 331800  . 1000  .  6.39 13.5105 -1
## 5 chr20 341425 343400  . 1000  .  6.20 12.9852 -1
## 6 chr20 437975 439900  . 1000  .  6.31 13.5184 -1
head(cpgi.df)
##      V1     V2     V3       V4
## 1 chr20 195575 195851  CpG:_28
## 2 chr20 207789 208148  CpG:_32
## 3 chr20 219055 219437  CpG:_33
## 4 chr20 225831 227155 CpG:_135
## 5 chr20 252826 256323 CpG:_286
## 6 chr20 275376 276977 CpG:_116
You can save your data by writing it to disk as a text file. A data frame or matrix can be written out using the write.table() function. Now let us write out cpgi.df. We will write it out as a tab-separated file; pay attention to the arguments.
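The call described above might look like the following minimal sketch. To stay self-contained, it builds a small stand-in for cpgi.df from the first rows of the CpG island table instead of reading the package file. The output file name cpgi.txt and the temporary directory are assumptions made for illustration.

```r
# Stand-in for cpgi.df: the first three rows of the CpG island table
cpgi.df <- data.frame(
  V1 = c("chr20", "chr20", "chr20"),
  V2 = c(195575L, 207789L, 219055L),
  V3 = c(195851L, 208148L, 219437L),
  V4 = c("CpG:_28", "CpG:_32", "CpG:_33"),
  stringsAsFactors = FALSE
)

# Hypothetical output path; a temporary file keeps the working directory clean
out.file <- file.path(tempdir(), "cpgi.txt")

write.table(cpgi.df,
            file = out.file,
            quote = FALSE,      # do not wrap character fields in quotes
            sep = "\t",         # separate columns with tabs
            row.names = FALSE,  # do not write a row-name column
            col.names = FALSE)  # do not write a header line

# Read the file back to check the round trip
cpgi.check <- read.table(out.file, header = FALSE, sep = "\t",
                         stringsAsFactors = FALSE)
```

Turning off quote, row.names and col.names matters here because the defaults would add quotation marks, a row-name column and a header line, none of which appear in the BED input that was read.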
You can save your R objects directly into a file using save() and saveRDS(), and load them back in with load() and readRDS(). By using these functions you can save any R object, whether or not it is a data frame or matrix.
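As a minimal sketch of saving an object that is neither a data frame nor a matrix, the example below round-trips a list through saveRDS() and readRDS(). The list contents, the variable names and the file name myobj.rds are hypothetical.

```r
# A list with made-up contents, standing in for "any R object"
my.obj <- list(genome = "hg18",
               chroms = c("chr20", "chr21"),
               n = 6L)

# Hypothetical file path in the session's temporary directory
rds.file <- file.path(tempdir(), "myobj.rds")

# Write the single object to disk
saveRDS(my.obj, file = rds.file)

# readRDS() returns the object, which is assigned to a new name here
restored <- readRDS(rds.file)

# Compare the restored object with the original
identical(my.obj, restored)
```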
One important difference is that save() can store many objects at a time, and when they are loaded into memory with load() they retain their variable names. For example, if an object named cpgi.df was saved to mydata.RData, calling load("mydata.RData") in a fresh R session creates an object named cpgi.df. That means you have to know what names you gave the objects before saving them. Conversely, when you save an object with saveRDS() and read it back with readRDS(), the name of the object is not retained, and you need to assign the output of readRDS() to a new variable (for example, x).
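The difference in naming behavior can be sketched as follows, assuming two tiny stand-in data frames with hypothetical values. Loading into an empty environment is used here only to imitate a fresh R session. The file name cpgi.rds and the temporary directory are assumptions.

```r
# Two small stand-in objects (hypothetical values)
cpgi.df <- data.frame(V1 = "chr20", V2 = 195575L, V3 = 195851L)
enh.df  <- data.frame(V1 = "chr20", V2 = 266275L, V3 = 267925L)

rdata.file <- file.path(tempdir(), "mydata.RData")
rds.file   <- file.path(tempdir(), "cpgi.rds")

# save() stores several objects together with their names
save(cpgi.df, enh.df, file = rdata.file)

# Load into an empty environment to imitate a fresh session;
# load() returns the names of the objects it created
fresh <- new.env()
loaded.names <- load(rdata.file, envir = fresh)
loaded.names

# saveRDS() stores a single object without its name,
# so the result of readRDS() must be assigned explicitly
saveRDS(cpgi.df, file = rds.file)
x <- readRDS(rds.file)
```

Capturing the return value of load() is one way to find out which names were stored in an .RData file when you no longer remember them.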
2.6.1 Reading large files
Reading large tables with the base R function read.table() can take a very long time, so additional packages such as readr provide faster functions for reading files. Below, we show how to use them. With the parameters provided, these functions return output equivalent to the