2-INF-185 Integrácia dátových zdrojov 2017/18

Materiály · Úvod · Pravidlá · Kontakt
Body z HW01 a HW04 nájdete na serveri v /grades/userid.txt


From IDZ
Jump to: navigation, search


In this homework, try to read text, execute given commands, potentially trying some small modifications.

  • Then do tasks A-D, submit required files (3x .png)
  • In your protocol, enter commands used in tasks A-D, with explanatory comments in more complicated situations
  • In task B also enter required output to protocol

First steps

  • Type a command, R writes the answer, e.g.:
> 1+2
[1] 3
  • We can store values in variables and use them later on
> # The size of the sequenced portion of cow's genome, in millions of base pairs
> Cow_genome_size <- 2290
> Cow_genome_size
[1] 2290
> Cow_chromosome_pairs <- 30
> Cow_avg_chrom <- Cow_genome_size / Cow_chromosome_pairs
> Cow_avg_chrom
[1] 76.33333


  • dots are used as parts of id's, e.g. read.table is name of a single function (not method for object read)
  • assignment via <- or =
    • careful: a<-3 is an assignment, a < -3 is a comparison
  • vectors etc are indexed from 1, not from 0

Vectors, basic plots

  • Vector is a sequence of values of the same type (all are numbers or all are strings or all are booleans)
# Vector can be created from a list of numbers by function c
# prints [1] 1 2 4

# function c also concatenates vectors
# prints [1] 1 2 4 1 2 4

# Vector of two strings 
b<-c("hello", "world")

# Create a vector of numbers 1..10
# prints [1]  1  2  3  4  5  6  7  8  9 10

Vector arithmetics

  • Operations applied to each member of the vector
# Square each number in vector x
# prints [1]   1   4   9  16  25  36  49  64  81 100

# New vector y: logarithm of a number in x squared
# prints [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519 3.891820 4.158883
# [9] 4.394449 4.605170

# Draw graph of function log(x*x) for x=1..10
# The same graph but use lines instead of dots

# Addressing elements of a vector: positions start at 1
# Second element of the vector 
# prints [1] 1.386294

# Which elements of the vector satisfy certain condition? (vector of logical values)

# write only those elements from y that satisfy the condition
# prints [1] 3.218876 3.583519 3.891820 4.158883 4.394449 4.605170

# we can also write values of x such that values of y satisfy the condition...
# prints [1]  5  6  7  8  9 10

Task A

  • Create a plot of the binary logarithm with dots in the graph more densely spaced (from 0.1 to 10 with step 0.1)
  • Store it in file log.png and submit this file
  • Hints:
    • Create x and y by vector arithmetics
    • To compute binary logarithm check help ? log
    • Before running plot, use command png("log.png") to store the result, afterwards call dev.off() to close the file (in rstudio you can also export plots manually)

Data frames and simple statistics

  • Data frame: a table similar to spreadsheet, each column is a vector, all are of the same length
  • We will use a table with the following columns:
    • The size of a genome, in millions of nucleotides
    • Number of chromosome pairs
    • GC content
    • Taxonomic group mammal or fish
  • Stored in CSV format, columns separated by tabs.
  • Data: Han et al Genome Biology 2008 [1]
Species    Size    Chrom   GC      Group
Human      2850    23      40.9    mammal
Chimpanzee 2750    24      40.7    mammal
Macaque    2650    21      40.7    mammal
Mouse      2480    20      41.7    mammal
Tetraodon   187    21      45.9    fish
# reading a frame from file
a<-read.table("/tasks/hw09/genomes.csv", header = TRUE, sep = "\t");
# column with name size

# Average chromosome length: divide size by the number of chromosomes

# Add average chromosome length as a new column to frame a

# Scatter plot of average chromosome length vs GC content
plot(a$AvgChrom, a$GC)

# Compactly display structure of a 
# (good for checking that import worked etc)

# display mean, median, etc. of each column

# average genome size
# average genome size for mammals
# Standard deviation

# Histogram of genome sizes

Task B

  • Divide frame a to two frames, one for mammals, one for fish. Hint:
    • Try command a[c(1,2,3),]. What is it doing?
    • Try command a$Group=="mammal".
    • Combine these two commands to get rows for all mammals and store the frame in a new variable, then repeat for fish
    • Use a general approach which does not depend on the exact number and ordering of rows in the table.
  • Run the command summary separately for mammals and for fish. Which of their characteristics are different?
    • Write output and your conclusion to the protocol

Task C

  • Draw a graph comparing genome size vs GC content; use different colors for points representing mammals and fish
    • Submit the plot in file genomes.png
    • To draw the graph, you can use one of the options below, or find yet another way
    • Option 1: first draw mammals with one color, then add fish in another color
      • Color of points can be changed by: plot(1:10,1:10, col="red")
      • After plot command you can add more points to the same graph by command points, which can be used similarly as plot
      • Warning: command points does not change the ranges of x and y axes. You have to set these manually so that points from both groups are visible. You can do this using options xlim and ylim, e.g. plot(x,y, col="red", xlim=c(1,100), ylim=c(1,100))
    • Option 2: plot both mammals and fish in one plot command, and give it a vector of colors, one for each point
      • plot(1:10,1:10,col=c(rep("red",5),rep("blue",5))) will plot the first 5 points red and the last 5 points blue
  • Bonus task: add a legend to the plot, showing which color is mammal and which is fish

Expression data and clustering

Data here is bigger, better to use plain R rather than rstudio (limited server CPU/memory)

# Read gene expression data table
a<-read.table("/tasks/hw09/microarray.csv", header = TRUE, sep = "\t", row.names=1)
# Visual check of the first row
# plot starting point vs. situation after 1 hour
# to better see density in dense clouds of points, use this plot
# outliers away from diagonal in the plot above are most strongly differentially expressed genes
# these are more easy to see in MA plot:
# x-axis: average expression in the two conditions
# y-axis: difference between values (they are log-scale, so difference 1 means 2-fold)

Clustering is a wide group of methods that split data points into groups with similar properties

  • We will group together genes that have a similar reaction to cold, i.e. their rows in gene expression data matrix have similar values

We will consider two simple clustering methods

  • K means clustering splits points (genes) into k clusters, where k is a parameter given by the user. It finds a center of each cluster and tries to minimize the sum of distances from individual points to the center of their cluster. Note that this algorithm is randomized so you will get different clusters each time.
  • Hierarchical clustering puts all data points (genes) to a hierarchy so that smallest subtrees of the hierarchy are the most closely related groups of points and these are connected to bigger and more loosely related groups.
Example of a heatmap
# Heatmap: creates hierarchical clustering of rows 
# then shows every value in the table using color ranging from red (lowest) to white (highest)
# Computation may take some time
# Previous heatmap normalized each row, the next one uses data as they are:
# k means clustering to 7 clusters
cl <- kmeans(a,k)
# each gene has assigned a cluster (number between 1 and k)
# draw only cluster number 3 out of k
heatmap(as.matrix(a[cl$cluster==3,]),Rowv=NA, Colv=NA)

# reorder genes in the table according to cluster
heatmap(as.matrix(a[order(cl$cluster),]),Rowv=NA, Colv=NA)

# compare overall column means with column means in cluster 3
# function apply uses mean on every column (or row if 2 changed to 1)
# now means within cluster

# clusters have centers which are also computed as means
# so this is the same as previous command

Task D

Example of a required plot
  • Draw a plot in which x-axis is time and y-axis is the expression level and the center of each cluster is shown as a line
    • use command matplot(x,y,type="l") which gets two matrices x and y and plots columns of x vs columns of y
    • matplot(,y,type="l") will use numbers 1,2,3... as columns of the missing matrix x
    • create y from cl$centers by applying function t (transpose)
    • to create an appropriate matrix x, create a vector of times for individual experiments in minutes or hours (do it manually, no need to parse column names automatically)
    • using functions rep and matrix you can create a matrix x in which this vector is used as every column
    • then run matplot(x,y,type="l")
    • since time points are not evenly spaced, it would be better to use logscale: matplot(x,y,type="l",log="x")
    • to avoid log(0), change the first timepoint from 0min to 1min
  • Submit file clusters.png with your final plot