1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "HWr1"

From MAD
Jump to navigation Jump to search
(Created page with "<!-- NOTEX --> See also the lecture <!-- /NOTEX --> In this homework, try to read the text, execute given commands, potentially trying some small modifications. Then...")
 
 
(41 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
<!-- NOTEX -->
 
<!-- NOTEX -->
See also the [[Lr1|lecture]]
+
See also the [[Lr1|lecture]].
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
In this homework, try to read the text, execute given commands, potentially trying some small modifications. Then do tasks A-D
+
In this homework, try to read the text, execute given commands, potentially trying some small modifications. Within the tutorial, you will find the tasks to complete in this exercise.
 
<!-- NOTEX -->
 
<!-- NOTEX -->
* Submit the required files (3x .png)
+
* Submit the required files (.png).
* In your protocol, enter the commands used in tasks A-D, with explanatory comments in more complicated situations
+
* In your protocol, enter the commands used in all tasks, with explanatory comments in more complicated situations.
* In task B also enter the required output to the protocol
+
* In tasks B and D also enter the required output to the protocol.
 
* Protocol template in <tt>/tasks/r1/protocol.txt</tt>
 
* Protocol template in <tt>/tasks/r1/protocol.txt</tt>
 +
 +
Two versions of the homework based on your R proficiency (your decision which one you choose).
 +
 +
Beginners:
 +
* Do tasks A-E, skip F.
 +
 +
Intermediate / advanced
 +
* Skip to section [[#Expression data|Expression data]].
 +
* Do tasks D-F, but instead of <tt>plot</tt> and <tt>matplot</tt> functions from the basic R graphics library, use ggplot2 library. See the list of resources [[#Optional:_ggplot2_library|below]].
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
Line 14: Line 23:
 
Type a command, R writes the answer, e.g.:
 
Type a command, R writes the answer, e.g.:
 
<pre>
 
<pre>
> 1+2
+
> 1 + 2
 
[1] 3
 
[1] 3
 
</pre>
 
</pre>
Line 31: Line 40:
 
</pre>
 
</pre>
  
Surprises in the R language:
+
Several surprises in the R language:
* dots are used as parts of id's, e.g. <tt>read.table</tt> is name of a single function (not a method for the object <tt>read</tt>)
+
* Dots are used as parts of id's, e.g. <tt>read.table</tt> is name of a single function (not a method for the object <tt>read</tt>).
* assignment via <tt><-</tt> or <tt>=</tt>
+
* Assignment can be done via <tt><-</tt> or <tt>=</tt>.
* vectors etc are indexed from 1, not from 0
+
* Vectors etc are indexed from 1, not from 0.
  
 
===Vectors, basic plots===
 
===Vectors, basic plots===
A vector is a sequence of values of the same type (all are numbers or all are strings or all are booleans)
+
A vector is a sequence of values of the same type (all are numbers or all are strings or all are booleans).
 
<syntaxhighlight lang="r">
 
<syntaxhighlight lang="r">
 
# Vector can be created from a list of numbers by function named c
 
# Vector can be created from a list of numbers by function named c
a = c(1,2,4)
+
a = c(1, 2, 4)
 
a
 
a
 
# prints [1] 1 2 4
 
# prints [1] 1 2 4
  
 
# c also concatenates vectors
 
# c also concatenates vectors
c(a,a)
+
c(a, a)
 
# prints [1] 1 2 4 1 2 4
 
# prints [1] 1 2 4 1 2 4
  
Line 57: Line 66:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
====Vector arithmetics====
+
====Vector arithmetic====
Many operations can be easily applied to each member of a vector
+
Many operations can be easily applied to each member of a vector.
 
<syntaxhighlight lang="r">
 
<syntaxhighlight lang="r">
 
x = 1:10
 
x = 1:10
 
# Square each number in vector x
 
# Square each number in vector x
x*x
+
x * x
 
# prints [1]  1  4  9  16  25  36  49  64  81 100
 
# prints [1]  1  4  9  16  25  36  49  64  81 100
  
 
# New vector y: logarithm of a number in x squared
 
# New vector y: logarithm of a number in x squared
y = log(x*x)
+
y = log(x * x)
 
y
 
y
 
# prints [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519 3.891820 4.158883
 
# prints [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519 3.891820 4.158883
Line 72: Line 81:
  
 
# Draw the graph of function log(x*x) for x=1..10
 
# Draw the graph of function log(x*x) for x=1..10
plot(x,y)
+
plot(x, y)
 
# The same graph but use lines instead of dots
 
# The same graph but use lines instead of dots
plot(x,y,type="l")
+
plot(x, y, type="l")
  
 
# Addressing elements of a vector: positions start at 1
 
# Addressing elements of a vector: positions start at 1
Line 83: Line 92:
 
# Which elements of the vector satisfy certain condition?  
 
# Which elements of the vector satisfy certain condition?  
 
# (vector of logical values)
 
# (vector of logical values)
y>3
+
y > 3
 
# prints [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 
# prints [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  
 
# write only those elements from y that satisfy the condition
 
# write only those elements from y that satisfy the condition
y[y>3]
+
y[y > 3]
 
# prints [1] 3.218876 3.583519 3.891820 4.158883 4.394449 4.605170
 
# prints [1] 3.218876 3.583519 3.891820 4.158883 4.394449 4.605170
  
 
# we can also write values of x such that values of y satisfy the condition...
 
# we can also write values of x such that values of y satisfy the condition...
x[y>3]
+
x[y > 3]
 
# prints [1]  5  6  7  8  9 10
 
# prints [1]  5  6  7  8  9 10
 
</syntaxhighlight>
 
</syntaxhighlight>
 
Alternative plotting facilities: [http://ggplot2.org/ ggplot2 library], [https://cran.r-project.org/web/packages/lattice/index.html lattice library]
 
  
 
====Task A====
 
====Task A====
Line 104: Line 111:
  
 
Hints:
 
Hints:
* Create <tt>x</tt> and <tt>y</tt> by vector arithmetics
+
* Create <tt>x</tt> and <tt>y</tt> by vector arithmetic.
 
* To compute binary logarithm check help <tt>? log</tt>
 
* To compute binary logarithm check help <tt>? log</tt>
* Before running plot, use command <tt>png("log.png")</tt> to store the result, afterwards call <tt>dev.off()</tt> to close the file (in Rstudio you can also export plots manually)
+
* Before running <tt>plot</tt>, use command <tt>png("log.png")</tt> to store the result, afterwards call <tt>dev.off()</tt> to close the file (in Rstudio you can also export plots manually).
  
 
===Data frames and simple statistics===
 
===Data frames and simple statistics===
Line 112: Line 119:
  
 
We will use a table with the following columns:
 
We will use a table with the following columns:
* Country name
+
* country name,
* Region (continent)
+
* region (continent),
* Area in thousands of km2
+
* area in thousands of km2,
* Population in millions in 2019
+
* population in millions in 2019.
(source of data UN)
 
 
 
The table is stored in the csv format (columns separated by commas).
 
  
 +
Data is from the UN. The table is stored in the csv format (columns separated by commas). Here are the first six lines:
 
<pre>
 
<pre>
 +
Afghanistan,Asia,652.864,38.0418
 +
Albania,Europe,28.748,2.8809
 +
Algeria,Africa,2381.741,43.0531
 +
American Samoa,Oceania,0.199,0.0553
 +
Andorra,Europe,0.468,0.0771
 +
Angola,Africa,1246.7,31.8253
 
</pre>
 
</pre>
  
 
<syntaxhighlight lang="r">
 
<syntaxhighlight lang="r">
 
# reading a data frame from a file
 
# reading a data frame from a file
a = read.csv("countries.csv",header = TRUE)
+
a = read.csv("/tasks/r1/countries.csv",header = TRUE)
  
 
# display mean, median, etc. of each column
 
# display mean, median, etc. of each column
summary(a);
+
summary(a)
 
# Compactly display structure of a  
 
# Compactly display structure of a  
 
# (good for checking that import worked etc)
 
# (good for checking that import worked etc)
Line 145: Line 156:
 
plot(a$Area, a$Population)
 
plot(a$Area, a$Population)
  
# we will data better in log-scale (both axes)
+
# we will see smaller values better in log-scale (both axes)
 
plot(a$Area, a$Population, log='xy')
 
plot(a$Area, a$Population, log='xy')
  
# or in linear scale but omitting the biggest countries
+
# use linear scale, but zoom in on smaller countries:
 
plot(a$Area, a$Population, xlim=c(0,1500), ylim=c(0,150))
 
plot(a$Area, a$Population, xlim=c(0,1500), ylim=c(0,150))
  
Line 162: Line 173:
  
 
# Histogram of country populations in Europe
 
# Histogram of country populations in Europe
hist(a$Size)
+
hist(a$Population[a$Region=="Europe"])
 
</syntaxhighlight>
 
</syntaxhighlight>
  
 
===Task B===
 
===Task B===
Create frame <tt>europe</tt> which contains all countries from Europe from frame <tt>a</tt>. Also create a simuilar frame for African countries. Hint:
+
Create frame <tt>europe</tt> which contains data for European countries selected from frame <tt>a</tt>. Also create a similar frame for African countries. Hint:
* Try command <tt>a[c(1,2,3),]</tt>. What is it doing?
+
* One option is to select rows using boolean expressions as in the computation of median country size in Europe above. This time we apply it on a frame rather than single column. To select some rows (and all columns) from frame <tt>a</tt> use expression of the form <tt>a[which_rows, ]</tt>. Here, <tt>which_rows</tt> can be a list, such as <tt>c(1,2,3)</tt> or a boolean expression producing a list of booleans.  
* Try command <tt>a$Region=="Europe"</tt>.
+
* Alternatively, you can also explore <tt>[https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/subset subset]</tt> command.
* Combine these two commands to get rows for all European countries and store the frame in a new variable, then repeat for Africa
 
* Use a general approach which does not depend on the exact number and ordering of rows in the table.
 
  
Run the command <tt>summary</tt> separately for each new frame. How do their characteristics differ?
+
Run the command <tt>summary</tt> separately for each new frame. Comment on how their characteristics differ.
 
<!-- NOTEX -->
 
<!-- NOTEX -->
'''Write''' output and your conclusion to the protocol.
+
'''Write''' the output and your conclusion to the protocol.
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
 
===Task C===
 
===Task C===
Draw a graph comparing the area vs population in Europe and Africa; use different colors for points representing European and African countries.
+
Draw a graph comparing the area vs population in Europe and Africa; use different colors for points representing European and African countries. Apply log scale on both axes.
 
<!-- NOTEX -->
 
<!-- NOTEX -->
 
* '''Submit''' the plot in file <tt>countries.png</tt>
 
* '''Submit''' the plot in file <tt>countries.png</tt>
Line 184: Line 193:
 
To draw the graph, you can use one of the options below, or find yet another way.
 
To draw the graph, you can use one of the options below, or find yet another way.
  
Option 1: first draw the Europe with one color, then add Africa in another color
+
Option 1: first draw Europe with one color, then add Africa in another color
* Color of points can be changed by as follows: <tt>plot(1:10,1:10, col="red")</tt>
+
* Color of points can be changed by as follows: <tt>plot(1:10, 1:10, col="red")</tt>.
* After the <tt>plot</tt> command, you can add more points to the same graph by command <tt>points</tt>, which can be used similarly as <tt>plot</tt>
+
* After the <tt>plot</tt> command, you can add more points to the same graph by command <tt>points</tt>, which can be used similarly as <tt>plot</tt>.
* Warning: command <tt>points</tt> does not change the ranges of x and y axes. You have to set these manually so that points from both groups are visible. You can do this using options <tt>xlim</tt> and <tt>ylim</tt>, e.g. <tt>plot(x,y, col="red", xlim=c(1,100), ylim=c(1,100))</tt>
+
* Warning: command <tt>points</tt> does not change the ranges of x and y axes. You have to set these manually so that points from both groups are visible. You can do this using options <tt>xlim</tt> and <tt>ylim</tt>, e.g. <tt>plot(x,y, col="red", xlim=c(0.1,100), ylim=c(0.1,100)).</tt>
  
Option 2: plot both Europe and Africa in one plot command, and give it a vector of colors, one for each point. Command <tt>plot(1:10,1:10,col=c(rep("red",5),rep("blue",5)))</tt> will plot the first 5 points red and the last 5 points blue
+
Option 2: plot both Europe and Africa in one <tt>plot</tt> command, and give it a vector of colors, one for each point. Command <tt>plot(1:10,1:10,col=c(rep("red",5),rep("blue",5)))</tt> will plot the first 5 points red and the last 5 points blue
  
 
'''Bonus task:''' add a legend to the plot, showing which color is Europe and which is Africa.
 
'''Bonus task:''' add a legend to the plot, showing which color is Europe and which is Africa.
  
===Expression data and clustering===
+
===Expression data===
  
Data here is bigger. If you work on our server, it is better to use plain R rather than Rstudio (due to limited server CPU/memory).
+
The dataset was described in the lecture.
  
 
<syntaxhighlight lang="r">
 
<syntaxhighlight lang="r">
 
# Read gene expression data table
 
# Read gene expression data table
a <- read.table("/tasks/r1/microarray.csv", header = TRUE, sep = "\t", row.names=1)
+
a = read.csv("/tasks/r1/microarray.csv", row.names=1)
 
# Visual check of the first row
 
# Visual check of the first row
 
a[1,]
 
a[1,]
# Plot starting point vs. situation after 1 hour
+
# Plot control replicate 1 vs. acetate acid replicate 1
plot(a$cold_0min,a$cold_1h)
+
plot(a$control1, a$acetate1)
# To better see density in dense clouds of points, use this plot
+
# Plot control replicate 1 vs. control replicate 2
smoothScatter(a$cold_15min, a$cold_1h)
+
plot(a$control1, a$control2)
# Outliers away from diagonal in the plot above are most strongly differentially expressed genes
+
# To show density in dense clouds of points, use this plot
# These are more easy to see in MA plot:
+
smoothScatter(a$control1, a$acetate1)
# x-axis: average expression in the two conditions
 
# y-axis: difference between values (they are log-scale, so difference 1 means 2-fold)
 
smoothScatter((a$cold_15min+a$cold_1h)/2, a$cold_15min-a$cold_1h)
 
 
</syntaxhighlight>
 
</syntaxhighlight>
  
Clustering is a wide group of methods that split data points into groups with similar properties. We will group together genes that have a similar reaction to cold, i.e. their rows in gene expression data matrix have similar values. We will consider two simple clustering methods
+
===Optional: ggplot2 library===
 +
<!-- NOTEX -->
 +
Required for intermediate / advanced in R, who decide to do tasks D-F.
 +
<!-- /NOTEX -->
 +
 
 +
Several resources for ggplot2 library
 +
* Plotting library based Grammar of Graphics by Leland Wilkinson, systematic way of creating plots by specifying individual components.
 +
* Website (including more pointers) https://ggplot2.tidyverse.org/
 +
* Cheatsheet https://github.com/rstudio/cheatsheets/blob/master/data-visualization.pdf
 +
* Book Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen. ggplot2: elegant graphics for data analysis. Available online at https://ggplot2-book.org/index.html
 +
* Book Winston Chang. R Graphics Cookbook. Available online at https://r-graphics.org/index.html
 +
 
 +
===Task D===
 +
 
 +
In the plots above we compare two experiments, say A=control1 and B=acetate1. Outliers away from the diagonal in the plot are the genes whose expression changes. However distance from the diagonal is hard to judge visually, instead we will create MA plot:
 +
* As above, each gene is one dot in the plot (use <tt>plot</tt> rather than <tt>smoothScatter</tt>).
 +
* The x-axis is the average between values for conditions A and B. The points on the right have overall higher expression than points on the left.
 +
* The y-axis is the difference between condition A and B. The values in frame <tt>a</tt> are in log-scale base 2, so the difference of 1 means 2-fold change in expression.
 +
* The points far from the line y=0 have the highest change in expression. Use R functions <tt>min</tt>, <tt>max</tt>, <tt>which.min</tt> and <tt>which.max</tt> to find the largest positive and negative difference from line y=0 and which genes they correspond to. Functions <tt>min</tt> and <tt>max</tt> give you the minimum and maximum of a given vector. Functions <tt>which.min</tt> and <tt>which.max</tt> return the index where this extreme value is located. You can use this index to get the appropriate row of the dataframe <tt>a</tt>, including the gene name.
 +
<!-- NOTEX -->
 +
* '''Submit''' file <tt>ma.png</tt> with your plot. Include the genes with the extreme changes in your protocol.
 +
<!-- /NOTEX -->
 +
 
 +
===Clustering applied to expression data===
 +
Clustering is a wide group of methods that split data points into groups with similar properties. We will group together genes that have a similar reaction to acids, i.e. their rows in gene expression data matrix have similar values. We will consider two simple clustering methods.
 
* '''K means''' clustering splits points (genes) into ''k'' clusters, where ''k'' is a parameter given by the user. It finds a center of each cluster and tries to minimize the sum of distances from individual points to the center of their cluster. Note that this algorithm is randomized so you will get different clusters each time.
 
* '''K means''' clustering splits points (genes) into ''k'' clusters, where ''k'' is a parameter given by the user. It finds a center of each cluster and tries to minimize the sum of distances from individual points to the center of their cluster. Note that this algorithm is randomized so you will get different clusters each time.
 +
 +
[[Image:HW08-heatmap.png|thumb|400px|right|Examples of heatmaps]]
 
* '''Hierarchical clustering''' puts all data points (genes) to a hierarchy so that smallest subtrees of the hierarchy are the most closely related groups of points and these are connected to bigger and more loosely related groups.
 
* '''Hierarchical clustering''' puts all data points (genes) to a hierarchy so that smallest subtrees of the hierarchy are the most closely related groups of points and these are connected to bigger and more loosely related groups.
  
[[Image:HW08-heatmap.png|thumb|200px|right|Example of a heatmap]]
+
 
 
<syntaxhighlight lang="r">
 
<syntaxhighlight lang="r">
# Heatmap: creates hierarchical clustering of rows  
+
# Create a new version of frame a in which row is scaled so that
# then shows every value in the table using color ranging from red (lowest) to white (highest)
+
# it has mean 0 and standard deviation 1
# Computation may take some time
+
# Function scale does such transformation on columns instead of rows,
heatmap(as.matrix(a), Colv=NA)
+
# so we transpose the frame using function t, then transpose it back
# Previous heatmap normalized each row, the next one uses data as they are:
+
b = t(scale(t(a)))
 +
# Matrix b shows relative movements of each gene,
 +
# disregarding its overall high or low expression
 +
 
 +
# Command heatmap creates hierarchical clustering of rows,
 +
# then shows values using colors ranging from red (lowest) to white (highest)
 
heatmap(as.matrix(a), Colv=NA, scale="none")
 
heatmap(as.matrix(a), Colv=NA, scale="none")
 +
heatmap(as.matrix(b), Colv=NA, scale="none")
 +
# compare the two matrices - which phenomena influenced clusters in each of them?
 
</syntaxhighlight>
 
</syntaxhighlight>
  
 
<syntaxhighlight lang="r">
 
<syntaxhighlight lang="r">
# k means clustering to 7 clusters
+
# k means clustering to 5 clusters
k = 7
+
k = 5
cl <- kmeans(a, k)
+
cl = kmeans(b, k)
# Each gene has assigned a cluster (number between 1 and k)
+
# Each gene is assigned a cluster (number between 1 and k)
cl$cluster
+
# the command below displays the first 10 elements, i.e. clusters of first 10 genes
# Draw only cluster number 3 out of k
+
head(cl$cluster)
heatmap(as.matrix(a[cl$cluster==3,]), Rowv=NA, Colv=NA)
+
# Draw heatmap of cluster number 3 out of k, no further clustering applied
 +
# Do you see any common pattern to genes in the cluster?
 +
heatmap(as.matrix(b[cl$cluster==3,]), Rowv=NA, Colv=NA, scale="none")
  
# Reorder genes in the table according to cluster
+
# Reorder genes in the whole table according to their cluster cluster number
heatmap(as.matrix(a[order(cl$cluster),]), Rowv=NA, Colv=NA)
+
# Can you spot our k clusters?
 +
heatmap(as.matrix(b[order(cl$cluster),]), Rowv=NA, Colv=NA, scale="none")
  
 
# Compare overall column means with column means in cluster 3
 
# Compare overall column means with column means in cluster 3
 
# Function apply runs mean on every column (or row if 2 changed to 1)
 
# Function apply runs mean on every column (or row if 2 changed to 1)
apply(a, 2, mean)
+
apply(b, 2, mean)
 
# Now means within cluster 3
 
# Now means within cluster 3
apply(a[cl$cluster==3,],2,mean)
+
apply(b[cl$cluster==3,], 2, mean)
  
 
# Clusters have centers which are also computed as means
 
# Clusters have centers which are also computed as means
Line 250: Line 292:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
===Task D===
+
===Task E===
[[Image:HW08-clusters.png|thumb|200px|right|Example of a required plot]]
+
[[Image:HW08-clusters.png|thumb|200px|right|Example of a required plot (but for k=3, not k=5)]]
Draw a plot in which x-axis is the time and y-axis is the expression level and the center of each cluster is shown as a line
+
Draw a plot in which the x-axis corresponds to experiments, the y-axis is the expression level and the center of each cluster is shown as a line (use k-means clustering on the scaled frame <tt>b</tt>, computed as shown above)
* Use command <tt>matplot(x, y, type="l")</tt> which gets two matrices <tt>x</tt> and <tt>y</tt> and plots columns of <tt>x</tt> vs columns of <tt>y</tt>
+
* Use command <tt>matplot(x, y, type="l", lwd=2)</tt> which gets two matrices <tt>x</tt> and <tt>y</tt> of the same size and plots each column of matrices  <tt>x</tt> and <tt>y</tt> as one line (setting <tt>lwd=2</tt> makes lines thicker)
* Command <tt>matplot(, y, type="l")</tt> will use numbers 1,2,3... as columns of the missing matrix <tt>x</tt>
+
* In this case we omit matrix <tt>x</tt>, the command will use numbers 1,2,3... as columns of the missing matrix  
 
* Create <tt>y</tt> from <tt>cl$centers</tt> by applying function <tt>t</tt> (transpose)
 
* Create <tt>y</tt> from <tt>cl$centers</tt> by applying function <tt>t</tt> (transpose)
* To create an appropriate matrix <tt>x</tt>, first create a vector of times for individual experiments in minutes or hours (do it manually, no need to parse column names automatically). Using functions <tt>rep</tt> and <tt>matrix</tt> you can create a matrix <tt>x</tt> in which this vector is used as every column
 
* Then run <tt>matplot(x, y, type="l")</tt>
 
* Since time points are not evenly spaced, it would be better to use logarithmic scale on the x-axis: <tt>matplot(x, y, type="l", log="x")</tt>
 
* to avoid <tt>log(0)</tt>, change the first timepoint from <tt>0min</tt> to <tt>1min</tt>
 
 
<!-- NOTEX -->
 
<!-- NOTEX -->
 
* '''Submit''' file <tt>clusters.png</tt> with your final plot
 
* '''Submit''' file <tt>clusters.png</tt> with your final plot
 +
<!-- /NOTEX -->
 +
 +
 +
===Task F===
 +
 +
<!-- NOTEX -->
 +
Only for students intermediate / advanced in R.
 +
<!-- /NOTEX -->
 +
 +
In task E, each cluster is represented only by its center. However, we would like see the spread of values over all genes assigned to the cluster. Use ggplot2 to create a boxplot for each cluster, with a box for every experiment, showing the median, quartiles, range and outliers over all genes assigned to that cluster. Place plots for all clusters to a single image using [https://ggplot2-book.org/getting-started.html#qplot-faceting faceting].
 +
<!-- NOTEX -->
 +
* '''Submit''' file <tt>boxplots.png</tt> with your final plot
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->

Latest revision as of 14:54, 3 April 2023

See also the lecture.

In this homework, try to read the text, execute given commands, potentially trying some small modifications. Within the tutorial, you will find the tasks to complete in this exercise.

  • Submit the required files (.png).
  • In your protocol, enter the commands used in all tasks, with explanatory comments in more complicated situations.
  • In tasks B and D also enter the required output to the protocol.
  • Protocol template in /tasks/r1/protocol.txt

Two versions of the homework based on your R proficiency (your decision which one you choose).

Beginners:

  • Do tasks A-E, skip F.

Intermediate / advanced

  • Skip to section Expression data.
  • Do tasks D-F, but instead of plot and matplot functions from the basic R graphics library, use ggplot2 library. See the list of resources below.

The first steps

Type a command, R writes the answer, e.g.:

> 1 + 2
[1] 3

We can store values in variables and use them later:

> # population of Slovakia in millions, 2019
> population = 5.457
> population
[1] 5.457
> # area of Slovakia in thousands of km2
> area = 49.035
> density = population / area
> density
[1] 0.1112879

Several surprises in the R language:

  • Dots are used as parts of id's, e.g. read.table is name of a single function (not a method for the object read).
  • Assignment can be done via <- or =.
  • Vectors etc are indexed from 1, not from 0.

Vectors, basic plots

A vector is a sequence of values of the same type (all are numbers or all are strings or all are booleans).

# Vector can be created from a list of numbers by function named c
a = c(1, 2, 4)
a
# prints [1] 1 2 4

# c also concatenates vectors
c(a, a)
# prints [1] 1 2 4 1 2 4

# Vector of two strings 
b = c("hello", "world")

# Create a vector of numbers 1..10
x = 1:10
x
# prints [1]  1  2  3  4  5  6  7  8  9 10

Vector arithmetic

Many operations can be easily applied to each member of a vector.

x = 1:10
# Square each number in vector x
x * x
# prints [1]   1   4   9  16  25  36  49  64  81 100

# New vector y: logarithm of a number in x squared
y = log(x * x)
y
# prints [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519 3.891820 4.158883
# [9] 4.394449 4.605170

# Draw the graph of function log(x*x) for x=1..10
plot(x, y)
# The same graph but use lines instead of dots
plot(x, y, type="l")

# Addressing elements of a vector: positions start at 1
# Second element of the vector 
y[2]
# prints [1] 1.386294

# Which elements of the vector satisfy certain condition? 
# (vector of logical values)
y > 3
# prints [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

# write only those elements from y that satisfy the condition
y[y > 3]
# prints [1] 3.218876 3.583519 3.891820 4.158883 4.394449 4.605170

# we can also write values of x such that values of y satisfy the condition...
x[y > 3]
# prints [1]  5  6  7  8  9 10

Task A

Create a plot of the binary logarithm with dots in the graph more densely spaced (from 0.1 to 10 with step 0.1)

  • Store it in file log.png and submit this file

Hints:

  • Create x and y by vector arithmetic.
  • To compute binary logarithm check help ? log
  • Before running plot, use command png("log.png") to store the result, afterwards call dev.off() to close the file (in Rstudio you can also export plots manually).

Data frames and simple statistics

Data frame: a table similar to a spreadsheet. Each column is a vector, all are of the same length.

We will use a table with the following columns:

  • country name,
  • region (continent),
  • area in thousands of km2,
  • population in millions in 2019.

Data is from the UN. The table is stored in the csv format (columns separated by commas). Here are the first six lines:

Afghanistan,Asia,652.864,38.0418
Albania,Europe,28.748,2.8809
Algeria,Africa,2381.741,43.0531
American Samoa,Oceania,0.199,0.0553
Andorra,Europe,0.468,0.0771
Angola,Africa,1246.7,31.8253
# reading a data frame from a file
a = read.csv("/tasks/r1/countries.csv",header = TRUE)

# display mean, median, etc. of each column
summary(a)
# Compactly display structure of a 
# (good for checking that import worked etc)
str(a)

# print the column with the name "Area"
a$Area

# population density: divide the population by the area
a$Population / a$Area

# Add density as a new column to frame a
a = cbind(a, Density = a$Population / a$Area)

# Scatter plot of area vs population
plot(a$Area, a$Population)

# we will see smaller values better in log-scale (both axes)
plot(a$Area, a$Population, log='xy')

# use linear scale, but zoom in on smaller countries:
plot(a$Area, a$Population, xlim=c(0,1500), ylim=c(0,150))

# average country population 33.00224 million
mean(a$Population)
# median country population 5.3805 million
median(a$Population)

# median country population in Europe
median(a$Population[a$Region=="Europe"])
# Standard deviation
sd(a$Population)

# Histogram of country populations in Europe
hist(a$Population[a$Region=="Europe"])

Task B

Create frame europe which contains data for European countries selected from frame a. Also create a similar frame for African countries. Hint:

  • One option is to select rows using boolean expressions as in the computation of median country size in Europe above. This time we apply it on a frame rather than single column. To select some rows (and all columns) from frame a use expression of the form a[which_rows, ]. Here, which_rows can be a list, such as c(1,2,3) or a boolean expression producing a list of booleans.
  • Alternatively, you can also explore subset command.

Run the command summary separately for each new frame. Comment on how their characteristics differ. Write the output and your conclusion to the protocol.

Task C

Draw a graph comparing the area vs population in Europe and Africa; use different colors for points representing European and African countries. Apply log scale on both axes.

  • Submit the plot in file countries.png

To draw the graph, you can use one of the options below, or find yet another way.

Option 1: first draw Europe with one color, then add Africa in another color

  • Color of points can be changed by as follows: plot(1:10, 1:10, col="red").
  • After the plot command, you can add more points to the same graph by command points, which can be used similarly as plot.
  • Warning: command points does not change the ranges of x and y axes. You have to set these manually so that points from both groups are visible. You can do this using options xlim and ylim, e.g. plot(x,y, col="red", xlim=c(0.1,100), ylim=c(0.1,100)).

Option 2: plot both Europe and Africa in one plot command, and give it a vector of colors, one for each point. Command plot(1:10,1:10,col=c(rep("red",5),rep("blue",5))) will plot the first 5 points red and the last 5 points blue

Bonus task: add a legend to the plot, showing which color is Europe and which is Africa.

Expression data

The dataset was described in the lecture.

# Read gene expression data table
a = read.csv("/tasks/r1/microarray.csv", row.names=1)
# Visual check of the first row
a[1,]
# Plot control replicate 1 vs. acetate acid replicate 1
plot(a$control1, a$acetate1)
# Plot control replicate 1 vs. control replicate 2
plot(a$control1, a$control2)
# To show density in dense clouds of points, use this plot
smoothScatter(a$control1, a$acetate1)

Optional: ggplot2 library

Required for intermediate / advanced in R, who decide to do tasks D-F.

Several resources for ggplot2 library

Task D

In the plots above we compare two experiments, say A=control1 and B=acetate1. Outliers away from the diagonal in the plot are the genes whose expression changes. However distance from the diagonal is hard to judge visually, instead we will create MA plot:

  • As above, each gene is one dot in the plot (use plot rather than smoothScatter).
  • The x-axis is the average between values for conditions A and B. The points on the right have overall higher expression than points on the left.
  • The y-axis is the difference between condition A and B. The values in frame a are in log-scale base 2, so the difference of 1 means 2-fold change in expression.
  • The points far from the line y=0 have the highest change in expression. Use R functions min, max, which.min and which.max to find the largest positive and negative difference from line y=0 and which genes they correspond to. Functions min and max give you the minimum and maximum of a given vector. Functions which.min and which.max return the index where this extreme value is located. You can use this index to get the appropriate row of the dataframe a, including the gene name.
  • Submit file ma.png with your plot. Include the genes with the extreme changes in your protocol.

Clustering applied to expression data

Clustering is a wide group of methods that split data points into groups with similar properties. We will group together genes that have a similar reaction to acids, i.e. their rows in gene expression data matrix have similar values. We will consider two simple clustering methods.

  • K means clustering splits points (genes) into k clusters, where k is a parameter given by the user. It finds a center of each cluster and tries to minimize the sum of distances from individual points to the center of their cluster. Note that this algorithm is randomized so you will get different clusters each time.
Examples of heatmaps
  • Hierarchical clustering puts all data points (genes) to a hierarchy so that smallest subtrees of the hierarchy are the most closely related groups of points and these are connected to bigger and more loosely related groups.


# Create a new version of frame a in which row is scaled so that 
# it has mean 0 and standard deviation 1
# Function scale does such transformation on columns instead of rows, 
# so we transpose the frame using function t, then transpose it back
b = t(scale(t(a)))
# Matrix b shows relative movements of each gene, 
# disregarding its overall high or low expression

# Command heatmap creates hierarchical clustering of rows, 
# then shows values using colors ranging from red (lowest) to white (highest)
heatmap(as.matrix(a), Colv=NA, scale="none")
heatmap(as.matrix(b), Colv=NA, scale="none")
# compare the two matrices - which phenomena influenced clusters in each of them?
# k means clustering to 5 clusters
k = 5
cl = kmeans(b, k)
# Each gene is assigned a cluster (number between 1 and k)
# the command below displays the first 10 elements, i.e. clusters of first 10 genes
head(cl$cluster)
# Draw heatmap of cluster number 3 out of k, no further clustering applied
# Do you see any common pattern to genes in the cluster?
heatmap(as.matrix(b[cl$cluster==3,]), Rowv=NA, Colv=NA, scale="none")

# Reorder genes in the whole table according to their cluster cluster number
# Can you spot our k clusters?
heatmap(as.matrix(b[order(cl$cluster),]), Rowv=NA, Colv=NA, scale="none")

# Compare overall column means with column means in cluster 3
# Function apply runs mean on every column (or row if 2 changed to 1)
apply(b, 2, mean)
# Now means within cluster 3
apply(b[cl$cluster==3,], 2, mean)

# Clusters have centers which are also computed as means
# so this is the same as the previous command
cl$centers[3,]

Task E

Example of a required plot (but for k=3, not k=5)

Draw a plot in which the x-axis corresponds to experiments, the y-axis is the expression level and the center of each cluster is shown as a line (use k-means clustering on the scaled frame b, computed as shown above)

  • Use command matplot(x, y, type="l", lwd=2) which gets two matrices x and y of the same size and plots each column of matrices x and y as one line (setting lwd=2 makes lines thicker)
  • In this case we omit matrix x, the command will use numbers 1,2,3... as columns of the missing matrix
  • Create y from cl$centers by applying function t (transpose)
  • Submit file clusters.png with your final plot


Task F

Only for students intermediate / advanced in R.

In task E, each cluster is represented only by its center. However, we would like see the spread of values over all genes assigned to the cluster. Use ggplot2 to create a boxplot for each cluster, with a box for every experiment, showing the median, quartiles, range and outliers over all genes assigned to that cluster. Place plots for all clusters to a single image using faceting.

  • Submit file boxplots.png with your final plot