2.10 Exercises
2.10.1 Computations in R
Sum 2 and 3 using the + operator. [Difficulty: Beginner]
Take the square root of 36, use sqrt(). [Difficulty: Beginner]
Take the log10 of 1000, use function log10(). [Difficulty: Beginner]
Take the log2 of 32, use function log2(). [Difficulty: Beginner]
Assign the sum of 2, 3 and 4 to variable x. [Difficulty: Beginner]
Find the absolute value of the expression 5 - 145 using the abs() function. [Difficulty: Beginner]
Calculate the square root of 625, divide it by 5, and assign the result to variable x. Ex: y=log10(1000)/5 takes the log10 of 1000, divides it by 5, and assigns the value to variable y. [Difficulty: Beginner]
Multiply the value you get from the previous exercise by 10000 and assign it to variable x. Ex: y=y*5 multiplies y by 5 and assigns the value to y. KEY CONCEPT: results of computations or arbitrary values can be stored in variables; we can re-use those variables later on and over-write them with new values. [Difficulty: Beginner]
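The key concept above can be seen in a short sketch. The variable names z and w and the numbers are arbitrary choices for illustration and are not part of the exercises.

```r
# store the result of a computation in a variable
z = 2 * 8      # z is 16
# re-use the stored value in a new computation
w = z / 4      # w is 4
# over-write z with a new value computed from its old value
z = z + 1      # z is now 17
print(z)
print(w)
```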
2.10.2 Data structures in R
Make a vector of 1,2,3,5 and 10 using c(), and assign it to the vec variable. Ex: vec1=c(1,3,4) makes a vector out of 1,3,4. [Difficulty: Beginner]
Check the length of your vector with length(). Ex: length(vec1) should return 3. [Difficulty: Beginner]
Make a vector of all numbers between 2 and 15. Ex: vec=1:6 makes a vector of numbers between 1 and 6, and assigns it to the vec variable. [Difficulty: Beginner]
Make a vector of 4s repeated 10 times using the rep() function. Ex: rep(x=2,times=5) makes a vector of 2s repeated 5 times. [Difficulty: Beginner]
Make a logical vector with TRUE, FALSE values of length 4, use c(). Ex: c(TRUE,FALSE). [Difficulty: Beginner]
Make a character vector of the gene names PAX6, ZIC2, OCT4 and SOX2. Ex: avec=c("a","b","c") makes a character vector of a, b and c. [Difficulty: Beginner]
Subset the vector using [] notation, and get the 5th and 6th elements. Ex: vec1[1] gets the first element. vec1[c(1,3)] gets the 1st and 3rd elements. [Difficulty: Beginner]
You can also subset any vector using a logical vector in []. Run the following:
myvec=1:5
# the length of the logical vector
# should be equal to length(myvec)
myvec[c(TRUE,TRUE,FALSE,FALSE,FALSE)]
myvec[c(TRUE,FALSE,FALSE,FALSE,TRUE)]
[Difficulty: Beginner]
The ==, >, <, >= and <= operators create logical vectors. See the results of the following operations:
[Difficulty: Beginner]
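The operations meant to accompany this exercise are not shown in this copy. As a stand-in, the sketch below applies each comparison operator to the myvec vector defined above; the specific comparisons are illustrative assumptions, not the original list.

```r
myvec = 1:5
# each comparison is applied element by element and returns
# a logical vector of the same length as myvec
myvec == 4    # TRUE only where the value equals 4
myvec > 3     # TRUE for values larger than 3
myvec < 3     # TRUE for values smaller than 3
myvec >= 3    # TRUE for values 3 or larger
myvec <= 2    # TRUE for values 2 or smaller
```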
Use the > operator in myvec[ ] to get elements larger than 2 in myvec, which is described above. [Difficulty: Beginner]
Make a 5x3 matrix (5 rows, 3 columns) using matrix(). Ex: matrix(1:6,nrow=3,ncol=2) makes a 3x2 matrix using numbers between 1 and 6. [Difficulty: Beginner]
What happens when you use byrow = TRUE in your matrix() as an additional argument? Ex: mat=matrix(1:6,nrow=3,ncol=2,byrow = TRUE). [Difficulty: Beginner]
Extract the first 3 columns and first 3 rows of your matrix using [] notation. [Difficulty: Beginner]
Extract the last two rows of the matrix you created earlier. Ex: mat[2:3,] or mat[c(2,3),] extracts the 2nd and 3rd rows. [Difficulty: Beginner]
Extract the first two columns and run class() on the result. [Difficulty: Beginner]
Extract the first column and run class() on the result, compare with the above exercise. [Difficulty: Beginner]
Make a data frame with 3 columns and 5 rows. Make sure the first column is a sequence of numbers 1:5, and the second column is a character vector. Ex: df=data.frame(col1=1:3,col2=c("a","b","c"),col3=3:1) makes a 3x3 data frame. Remember you need to make a data frame with 5 rows and 3 columns. [Difficulty: Beginner]
Extract the first two columns and first two rows. HINT: Use the same notation as matrices. [Difficulty: Beginner]
Extract the last two rows of the data frame you made. HINT: Same notation as matrices. [Difficulty: Beginner]
Extract the last two columns using the column names of the data frame you made. [Difficulty: Beginner]
Extract the second column using the column names. You can use [] or $ as in lists; use both in two different answers. [Difficulty: Beginner]
Extract rows where the 1st column is larger than 3. HINT: You can get a logical vector using the > operator, and logical vectors can be used in [] when subsetting. [Difficulty: Beginner]
Extract rows where the 1st column is larger than or equal to 3. [Difficulty: Beginner]
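The hint above can be illustrated on the small 3x3 example data frame from the earlier exercise rather than on your own 5-row data frame. This is only a sketch; the cutoff of 1 and the names keep and sub.df are arbitrary choices.

```r
# the 3x3 example data frame used earlier in this section
df = data.frame(col1=1:3, col2=c("a","b","c"), col3=3:1)
# comparing a column to a number gives one TRUE/FALSE per row
keep = df$col1 > 1
keep
# a logical vector in the row position of [] keeps the TRUE rows;
# leaving the column position empty keeps all columns
sub.df = df[keep, ]
sub.df
```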
Convert a data frame to a matrix. HINT: Use as.matrix(). Observe what happens to numeric values in the data frame. [Difficulty: Beginner]
Make a list using the list() function. Your list should have 4 elements; the one below has 2. Ex: mylist=list(a=c(1,2,3),b=c("apple","orange")) [Difficulty: Beginner]
Select the 1st element of the list you made using $ notation. Ex: mylist$a selects the first element, named “a”. [Difficulty: Beginner]
Select the 4th element of the list you made earlier using $ notation. [Difficulty: Beginner]
Select the 1st element of your list using [ ] notation. Ex: mylist[1] selects the first element, named “a”, and you get a list with one element. mylist["a"] selects the same element by name, and you again get a list with one element. [Difficulty: Beginner]
Select the 4th element of your list using [ ] notation. [Difficulty: Beginner]
Make a factor using factor(), with 5 elements. Ex: fa=factor(c("a","a","b")). [Difficulty: Beginner]
Convert a character vector to a factor using as.factor(). First, make a character vector using c(), then use as.factor(). [Difficulty: Intermediate]
Convert the factor you made above to a character using as.character(). [Difficulty: Beginner]
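The difference between the list notations in the exercises above can be checked with class(). The sketch uses the two-element example list; the [[ ]] notation is not used in the exercises and is shown here only for comparison, and the el.* variable names are hypothetical.

```r
mylist = list(a=c(1,2,3), b=c("apple","orange"))
# $ returns the element itself, here a numeric vector
el.dollar = mylist$a
class(el.dollar)
# [ ] returns a list that contains the selected element
el.single = mylist[1]
class(el.single)
length(el.single)
# [[ ]] also returns the element itself, like $
el.double = mylist[[1]]
identical(el.dollar, el.double)
```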
2.10.3 Reading in and writing data out in R
- Read CpG island (CpGi) data from the compGenomRData package file CpGi.table.hg18.txt. This is a tab-separated file. Store it in a variable called cpgi. Use
cpgFilePath=system.file("extdata",
                "CpGi.table.hg18.txt",
                package="compGenomRData")
to get the file path within the installed compGenomRData package. [Difficulty: Beginner]
Use head() on cpgi to see the first few rows. [Difficulty: Beginner]
Why doesn’t the following work? See the sep argument at help(read.table). [Difficulty: Beginner]
cpgtFilePath=system.file("extdata",
"CpGi.table.hg18.txt",
package="compGenomRData")
cpgtFilePath
cpgiSepComma=read.table(cpgtFilePath,header=TRUE,sep=",")
head(cpgiSepComma)
- What happens when you set stringsAsFactors=FALSE in read.table()? [Difficulty: Beginner]
cpgiHF=read.table("intro2R_data/data/CpGi.table.hg18.txt",
header=FALSE,sep="\t",
stringsAsFactors=FALSE)
Read only the first 10 rows of the CpGi table. [Difficulty: Beginner/Intermediate]
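read.table() has an nrows argument that limits how many rows are read. The sketch below demonstrates it on a small made-up table written to a temporary file, so it runs without the compGenomRData package; the data frame toy, its column names, and the variable names are hypothetical.

```r
# a small made-up table standing in for a real data file
toy = data.frame(chrom=c("chr1","chr1","chr2","chr3"),
                 start=c(10,50,20,5),
                 end=c(30,80,45,25))
tmpFile = tempfile(fileext=".txt")
# write it as a tab-separated file with a header line
write.table(toy, file=tmpFile, sep="\t", quote=FALSE, row.names=FALSE)
# read back only the first 2 data rows
first2 = read.table(tmpFile, header=TRUE, sep="\t", nrows=2)
first2
```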
Use cpgFilePath=system.file("extdata","CpGi.table.hg18.txt",package="compGenomRData") to get the file path, then use read.table() with the argument header=FALSE. Use head() to see the results. [Difficulty: Beginner]
Write CpG islands to a text file called “my.cpgi.file.txt”. Write the file to your home folder; you can use file="~/my.cpgi.file.txt" in Linux. ~/ denotes the home folder. [Difficulty: Beginner]
Same as above, but this time make sure to use the quote=FALSE, sep="\t" and row.names=FALSE arguments. Save the file to “my.cpgi.file2.txt” and compare it with “my.cpgi.file.txt”. [Difficulty: Beginner]
Write out the first 10 rows of the cpgi data frame. HINT: Use the subsetting for data frames we learned before. [Difficulty: Beginner]
Write the first 3 columns of the cpgi data frame. [Difficulty: Beginner]
Write CpG islands only on chr1. HINT: Use subsetting with [], feed a logical vector using the == operator. [Difficulty: Beginner/Intermediate]
Read two other data sets, “rn4.refseq.bed” and “rn4.refseq2name.txt”, with header=FALSE, and assign them to df1 and df2 respectively. They are again included in the compGenomRData package, and you can use the system.file() function to get the file paths. [Difficulty: Beginner]
Use head() to see what is inside the data frames above. [Difficulty: Beginner]
Merge the data sets using merge() and assign the results to a variable named ‘new.df’, and use head() to see the results. [Difficulty: Intermediate]
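merge() joins two data frames on their shared columns. The sketch below uses two made-up data frames, not the rn4 files, and names the key column explicitly with by; the data frames left.df and right.df and the column names id, gene and score are hypothetical.

```r
left.df = data.frame(id=c("NM_1","NM_2","NM_3"),
                     gene=c("geneA","geneB","geneC"))
right.df = data.frame(id=c("NM_2","NM_3","NM_4"),
                      score=c(10,20,30))
# by default only rows whose key appears in both data frames are kept
merged.df = merge(left.df, right.df, by="id")
merged.df
```

When the key columns have different names in the two data frames, as is typical for files read with header=FALSE, the by.x and by.y arguments of merge() name them separately.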
2.10.4 Plotting in R
Please run the following code snippet for the rest of the exercises.
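The snippet itself is missing from this copy. The sketch below is an assumed stand-in that creates two numeric vectors named x1 and y1, which the exercises in this section refer to; the seed, the vector length, and the noise level are arbitrary choices, not necessarily the original values.

```r
set.seed(1001)  # assumed seed, only for reproducibility
# x1: the numbers 1 to 100 with random noise added
x1 = 1:100 + rnorm(100, mean=0, sd=15)
# y1: the numbers 1 to 100
y1 = 1:100
```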
Make a scatter plot using the x1 and y1 vectors generated above. [Difficulty: Beginner]
Use the main argument to give a title to plot(), as in plot(x,y,main="title"). [Difficulty: Beginner]
Use the xlab argument to set a label for the x-axis. Use the ylab argument to set a label for the y-axis. [Difficulty: Beginner]
Once you have the plot, run the following expression in the R console and see what it does: mtext(side=3,text="hi there"). HINT: mtext stands for margin text. [Difficulty: Beginner]
See what mtext(side=2,text="hi there") does. Check your plot after execution. [Difficulty: Beginner]
Use mtext() and paste() to put a margin text on the plot. You can use paste() as the ‘text’ argument in mtext(). HINT: mtext(side=3,text=paste(...)). See how paste() is used below. [Difficulty: Beginner/Intermediate]
## [1] "Text here"
## [1] "Text here"
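The commands that produced the two output lines above are not shown in this copy. One way to get that output, given here as an assumption rather than as the original code, is to call paste() directly and then to store its result in a variable; myText is a hypothetical name.

```r
# paste() joins its arguments into one string,
# separated by a space by default
paste("Text", "here")
# the result can be stored and then passed on, for example to mtext()
myText = paste("Text", "here")
myText
```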
cor() calculates the correlation between two vectors. Pearson correlation is a measure of the linear correlation (dependence) between two variables X and Y. Try using the cor() function on the x1 and y1 variables. [Difficulty: Intermediate]
Try to use mtext(), cor() and paste() to display the correlation coefficient on your scatter plot. [Difficulty: Intermediate]
Change the colors of your plot using the col argument. Ex: plot(x,y,col="red"). [Difficulty: Beginner]
Use pch=19 as an argument in your plot() command. [Difficulty: Beginner]
Use pch=18 as an argument to your plot() command. [Difficulty: Beginner]
Make a histogram of x1 with the hist() function. A histogram is a graphical representation of the data distribution. [Difficulty: Beginner]
You can change colors with ‘col’, add labels with ‘xlab’ and ‘ylab’, and add a title with the ‘main’ argument. Try all these in a histogram. [Difficulty: Beginner]
Make a boxplot of y1 with boxplot(). [Difficulty: Beginner]
Make boxplots of the x1 and y1 vectors in the same plot. [Difficulty: Beginner]
In boxplot, use the horizontal = TRUE argument. [Difficulty: Beginner]
Make multiple plots with par(mfrow=c(2,1))
- run par(mfrow=c(2,1))
- make a boxplot
- make a histogram [Difficulty: Beginner/Intermediate]
Do the same as above but this time with par(mfrow=c(1,2)). [Difficulty: Beginner/Intermediate]
Save your plot using the “Export” button in RStudio. [Difficulty: Beginner]
You can make a scatter plot showing the density of points rather than the points themselves. Plotting the points directly draws each one individually, whereas the smoothScatter() function shows the densities.
Now, plot with the colramp=heat.colors argument and then use a custom color scale using the following argument.
colramp = colorRampPalette(c("white","blue","green","yellow","red"))
[Difficulty: Beginner/Intermediate]
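The code for the two plots described above is not included in this copy. A minimal sketch, assuming made-up vectors x2 and y2 with enough points for density to matter, might look like this; myRamp is a hypothetical name for the custom color scale, and smoothScatter() relies on the KernSmooth package that normally ships with R.

```r
set.seed(1)  # arbitrary seed
# made-up data: 1000 points scattered around the diagonal
x2 = 1:1000 + rnorm(1000, mean=0, sd=200)
y2 = 1:1000
# plotting every point: dense regions overlap heavily
plot(x2, y2, pch=19, col="blue")
# plotting smoothed densities with the default color ramp
smoothScatter(x2, y2)
# colorRampPalette() returns a function that generates n colors
myRamp = colorRampPalette(c("white","blue","green","yellow","red"))
smoothScatter(x2, y2, colramp=myRamp)
```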
2.10.5 Functions and control structures (for, if/else, etc.)
Read CpG island data as shown below for the rest of the exercises.
cpgtFilePath=system.file("extdata",
"CpGi.table.hg18.txt",
package="compGenomRData")
cpgi=read.table(cpgtFilePath,header=TRUE,sep="\t")
head(cpgi)
##   chrom chromStart chromEnd     name length cpgNum gcNum perCpg perGc obsExp
## 1  chr1      18598    19673 CpG: 116   1075    116   787   21.6  73.2   0.83
## 2  chr1     124987   125426  CpG: 30    439     30   295   13.7  67.2   0.64
## 3  chr1     317653   318092  CpG: 29    439     29   295   13.2  67.2   0.62
## 4  chr1     427014   428027  CpG: 84   1013     84   734   16.6  72.5   0.64
## 5  chr1     439136   440407  CpG: 99   1271     99   777   15.6  61.1   0.84
## 6  chr1     523082   523977  CpG: 94    895     94   570   21.0  63.7   1.04
Check values in the perGc column using a histogram. The ‘perGc’ column in the data stands for GC percent, that is, the percentage of C+G nucleotides. [Difficulty: Beginner]
Make a boxplot for the ‘perGc’ column. [Difficulty: Beginner]
Use an if/else structure to decide if a given GC percent is high, low or medium: low < 60, high > 75, medium is between 60 and 75. Use the greater-than or less-than operators, < or >. Fill in the values in the code below, where it is written ‘YOU_FILL_IN’. [Difficulty: Intermediate]
GCper=65
# check if GC value is lower than 60,
# assign "low" to result
if('YOU_FILL_IN'){
  result="low"
  cat("low")
}else if('YOU_FILL_IN'){ # check if GC value is higher than 75,
  # assign "high" to result
  result="high"
  cat("high")
}else{ # if those two conditions fail then it must be "medium"
  result="medium"
}
result
- Write a function that takes a value of GC percent and decides if it is low, high, or medium: low < 60, high > 75, medium is between 60 and 75. Fill in the values in the code below, where it is written ‘YOU_FILL_IN’. [Difficulty: Intermediate/Advanced]
GCclass<-function(my.gc){
  YOU_FILL_IN
  return(result)
}
GCclass(10) # should return "low"
GCclass(90) # should return "high"
GCclass(65) # should return "medium"
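The general shape of a function built around an if / else if / else chain can be sketched on a different quantity, so that the GC exercise stays open. The function lenClass and its cutoffs of 500 and 1000 are hypothetical and chosen only for illustration.

```r
# classify a CpG island length as "short", "long" or "medium"
lenClass = function(my.len){
  if(my.len < 500){
    result = "short"
  }else if(my.len > 1000){
    result = "long"
  }else{
    result = "medium"
  }
  return(result)
}
lenClass(439)
lenClass(1075)
lenClass(895)
```

At the top level of a script or the console, else has to appear on the same line as the closing brace of the preceding block; otherwise R treats the if statement as finished.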
- Use a for loop to get GC percentage classes for gcValues below. Use the function you wrote above. [Difficulty: Intermediate/Advanced]
gcValues=c(10,50,70,65,90)
for( i in YOU_FILL_IN){
  YOU_FILL_IN
}
- Use lapply to get GC percentage classes for gcValues. [Difficulty: Intermediate/Advanced]
Use sapply to get GC percentage classes for gcValues. [Difficulty: Intermediate]
Is there a way to decide on the GC percentage class of a given vector of GCpercentages without using an if/else structure and loops? If so, how can you do it? HINT: Subsetting using < and > operators. [Difficulty: Intermediate]
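The loop, lapply() and sapply() exercises differ mainly in how results are collected. The sketch below contrasts them on a toy function, halve(), which is hypothetical and unrelated to GC classes; the last lines show logical subsetting used to assign labels without a loop. The cutoff of 40 and the labels are arbitrary.

```r
vals = c(10, 50, 70)
halve = function(x){ x / 2 }
# for loop: fill a pre-allocated result vector one element at a time
res.loop = numeric(length(vals))
for(i in 1:length(vals)){
  res.loop[i] = halve(vals[i])
}
# lapply: applies the function to each element, returns a list
res.list = lapply(vals, halve)
# sapply: same, but simplifies the result to a vector when it can
res.vec = sapply(vals, halve)
# logical subsetting: label elements without a loop or if/else
size = rep("small", length(vals))
size[vals > 40] = "big"
size
```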