2-INF-185 Integrácia dátových zdrojov 2016/17
In this homework, try to read text, execute given commands, potentially trying some small modifications.
- Then do tasks A-D, submit required files (3x .png)
- In your protocol, enter commands used in tasks A-D, with explanatory comments in more complicated situations
- In task B also enter required output to protocol
- Type a command, R writes the answer, e.g.:
> 1+2  3
- We can store values in variables and use them later on
> # The size of the sequenced portion of cow's genome, in millions of base pairs > Cow_genome_size <- 2290 > Cow_genome_size  2290 > Cow_chromosome_pairs <- 30 > Cow_avg_chrom <- Cow_genome_size / Cow_chromosome_pairs > Cow_avg_chrom  76.33333
- dots are used as parts of id's, e.g. read.table is name of a single function (not method for object read)
- assignment via <- or =
- careful: a<-3 is assignment, a < -3 is comparison
- vectors etc are indexed from 1, not from 0
Vectors, basic plots
- Vector is a sequence of values of the same type (all are numbers or all are strings or all are booleans)
# Vector can be created from a list of numbers by function c a<-c(1,2,4) a # prints  1 2 4 # function c also concatenates vectors c(a,a) # prints  1 2 4 1 2 4 # Vector of two strings b<-c("hello", "world") # Create a vector of numbers 1..10 x<-1:10 x # prints  1 2 3 4 5 6 7 8 9 10
- Operations applied to each member of the vector
x<-1:10 # Square each number in vector x x*x # prints  1 4 9 16 25 36 49 64 81 100 # New vector y: logarithm from a number in x squared y<-log(x*x) y # prints  0.000000 1.386294 2.197225 2.772589 3.218876 3.583519 3.891820 4.158883 #  4.394449 4.605170 # Draw graph of function log(x*x) for x=1..10 plot(x,y) # The same graph but use lines instead of dots plot(x,y,type="l") # Addressing elements of a vector: positions start at 1 # Second element of the vector y # prints  1.386294 # Which elements of the vector satisfy certain condition? (vector of logical values) y>3 # prints  FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE # write only those elements from y that satisfy the condition y[y>3] # prints  3.218876 3.583519 3.891820 4.158883 4.394449 4.605170 # we can also write values of x such that values of y satisfy the condition... x[y>3] # prints  5 6 7 8 9 10
- Create a plot of the binary logarithm with dots in the graph more densely spaced (from 0.1 to 10 with step 0.1)
- Store it in file log.png and submit this file
- Create x and y by vector arithmetics
- To compute binary logarithm check help ? log
- Before running plot, use command png("log.png") to store the result, afterwards call dev.off() to close the file (in rstudio you can also export plots manually)
Data frames and simple statistics
- Data frame: table similar to spreadsheet, each column is a vector, all are of the same length
- We will use a table with the following columns:
- The size of the sequenced portion of a genome, in millions of base pairs
- Number of chromosome pairs
- GC content
- taxonomic group mammal or fish
- Stored in CSV format, columns separated by tabs.
- Data: Han et al Genome Biology 2008 
Species Size Chrom GC Group Human 2850 23 40.9 mammal Chimpanzee 2750 24 40.7 mammal Macaque 2650 21 40.7 mammal Mouse 2480 20 41.7 mammal ... Tetraodon 187 21 45.9 fish ...
# reading a frame from file a<-read.table("/tasks/hw05/genomes.csv", header = TRUE, sep = "\t"); # column with name size a$Size # Average chromosome length: divide size by the number of chromosomes a$Size/a$Chrom # Add average chromosome length as a new column to frame a a<-cbind(a,AvgChrom=a$Size/a$Chrom) # scatter plot average chromosome length vs GC content plot(a$AvgChrom, a$GC) # compactly display structure of a # (good for checking that import worked etc) str(a) # display mean, median, etc. of each column summary(a); # average genome size mean(a$Size) # average genome size for mammals mean(a$Size[a$Group=="mammal"]) # Standard deviation sd(a$Size) # Histogram of genome sizes hist(a$Size)
- Divide frame a to two frames, one for mammals, one for fish. Hint:
- Try command a[c(1,2,3),]. What is it doing?
- Try command a$Group=="mammal".
- Combine these two commands to get rows for all mammals and store the frame in a new variable, then repeat for fish
- Use a general approach which does not depend on the exact number and ordering of rows in the table.
- Run the command summary separately for mammals and for fish. Which of their characteristics are different?
- Write output and your conclusion to the protocol
- Draw a graph comparing genome size vs GC content; use different colors for points representing mammals and fish
- Submit the plot in file genomes.png
- To draw the graph, you can use one of the options below, or find yet another way
- Option 1: first draw mammals with one color, then add fish in another color
- Color of points can be changed by: plot(1:10,1:10, col="red")
- After plot command you can add more points to the same graph by command points, which can be used similarly as plot
- Warning: command points does not change the ranges of x and y axes. You have to set these manually so that points from both groups are visible. You can do this using options xlim and ylim, e.g. plot(x,y, col="red", xlim=c(1,100), ylim=c(1,100))
- Option 2: plot both mammals and fish in one plot command, and give it a vector of colors, one for each point
- plot(1:10,1:10,col=c(rep("red",5),rep("blue",5))) will plot the first 5 points red and the last 5 points blue
- Bonus task: add a legend to the plot, showing which color is mammal and which is fish
Expression data and clustering
Data here is bigger, better to use plain R rather than rstudio (limited server CPU/memory)
# Read gene expression data table a<-read.table("/tasks/hw05/microarray.csv", header = TRUE, sep = "\t", row.names=1) # Visual check of the first row a[1,] # plot starting point vs. situation after 1 hour plot(a$cold_0min,a$cold_1h) # to better see density in dense clouds of points, use this plot smoothScatter(a$cold_15min,a$cold_1h) # outliers away from diagonal in the plot above are most strongly differentially expressed genes # these are more easy to see in MA plot: # x-axis: average expression in the two conditions # y-axis: difference between values (they are log-scale, so difference 1 means 2-fold) smoothScatter((a$cold_15min+a$cold_1h)/2,a$cold_15min-a$cold_1h)
Clustering is a wide group of methods that split data points into groups with similar properties
- We will group together genes that have a similar reaction to cold, i.e. their rows in gene expression data matrix have similar values
We will consider two simple clustering methods
- K means clustering splits points (genes) into k clusters, where k is a parameter given by the user. It finds a center of each cluster and tries to minimize the sum of distances from individual points to the center of their cluster. Note that this algorithm is randomized so you will get different clusters each time.
- Hierarchical clustering puts all data points (genes) to a hierarchy so that smallest subtrees of the hierarchy are the most closely related groups of points and these are connected to bigger and more loosely related groups.
# Heatmap: creates hierarchical clustering of rows # then shows every value in the table using color ranging from red (lowest) to white (highest) # Computation may take some time heatmap(as.matrix(a),Colv=NA) # Previous heatmap normalized each row, the next one uses data as they are: heatmap(as.matrix(a),Colv=NA,scale="none")
# k means clustering to 7 clusters k=7 cl <- kmeans(a,k) # each gene has assigned a cluster (number between 1 and k) cl$cluster # draw only cluster number 3 out of k heatmap(as.matrix(a[cl$cluster==3,]),Rowv=NA, Colv=NA) # reorder genes in the table according to cluster heatmap(as.matrix(a[order(cl$cluster),]),Rowv=NA, Colv=NA) # compare overall column means with column means in cluster 3 # function apply uses mean on every column (or row if 2 changed to 1) apply(a,2,mean) # now means within cluster apply(a[cl$cluster==3,],2,mean) # clusters have centers which are also computed as means # so this is the same as previous command cl$centers[3,]
- Draw a plot in which x-axis is time and y-axis is the expression level and the center of each cluster is shown as a line
- use command matplot(x,y,type="l") which gets two matrices x and y and plots columns of one vs the other
- matplot(,y,type="l") will use numbers 1,2,3... as columns of the missing matrix x
- create y from cl$centers by applying function t (transpose)
- to create an appropriate matrix x, create a vector of times for individual experiments in minutes or hours (do it manually, no need to parse column names automatically)
- using functions rep and matrix you can create a matrix x in which this vector is used as every column
- then run matplot(x,y,type="l")
- since time points are not evenly spaced, it would be better to use logscale: matplot(x,y,type="l",log="x")
- to avoid log(0), change the first timepoint from 0min to 1min
- Submit file clusters.png with your final plot