2-INF-185 Integrácia dátových zdrojov 2016/17

Materiály · Úvod · Pravidlá · Kontakt
HW10 a HW11 odovzdajte do utorka 30.5. 9:00.
Dátumy odovzdania projektov:
1. termín: nedeľa 11.6. 22:00
2. termín: streda 21.6. 22:00
Oba termíny sú riadne, prvý je určený pre študentov končiacich štúdium alebo tých, čo chcú mať predmet ukončený skôr. V oboch prípadoch sa pár dní po odvzdaní budú konať krátke osobné stretnutia s vyučujúcimi (diskusia k projektu a uzatvárane známky). Presné dni a časy dohodneme neskôr. Projekty odovzdajte podobne ako domáce úlohy do /submit/projekt


HW06

From IDZ
Jump to: navigation, search

Lecture 6

  • Do either tasks A,B,C (beginners) or B,C,D (more advanced). If you really want, you can do all four for bonus credit.
  • In your protocol write used R commands with brief comments on your approach.
  • Submit required plots with filenames as specified.
  • For each task also include results as required and a short discussion commenting the results/plots you have obtained. Is the value of interest increasing or decreasing with some parameter? Are the results as expected or surprising?
  • Outline of protocol is in /tasks/hw06/protocol.txt

Task A: sign test

  • Consider a situation in which players played n games, out of which a fraction of q were won by A (example in lecture corresponds to q=0.6 and n=10)
  • Compute a table of p-values for n=10,20,...,90,100 and for q=0.6, 0.7, 0.8, 0.9
  • Plot the table using matplot (n is x-axis, one line for each value of q)
  • Submit the plot in sign.png
  • Discuss the values you have seen in the plot / table

Outline of the code:

# create vector rows with values 10,20,...,100
rows=(1:10)*10
# create vector columns with required values of q
columns=c(0.6, 0.7, 0.8, 0.9)
# create empty matrix of pvalues 
pvalues = matrix(0,length(rows),length(columns))
# TODO: fill in matrix pvalues using binom.test

# set names of rows and columns
rownames(pvalues)=rows
colnames(pvalues)=columns
# careful: pvalues[10,] is now 10th row, i.e. value for n=100, 
#          pvalues["10",] is the first row, i.e. value for n=10

# check that for n=10 and q=0.6 you get p-value 0.3769531
pvalues["10","0.6"]

# create x-axis matrix (as in HW05, part D)
x=matrix(rep(rows,length(columns)),nrow=length(rows))
# matplot command
png("sign.png")
matplot(x,pvalues,type="l",col=c(1:length(columns)),lty=1)
legend("topright",legend=columns,col=c(1:length(columns)),lty=1)
dev.off()

Task B: Welch's t-test on microarray data

  • Read table with microarray data, transform it to log scale, then work with table a:
input=read.table("/tasks/hw06/acids.tsv", header=TRUE, row.names=1)
a = log(input)
  • Columns 1,2,3 are reference, columns 4,5,6 acetic acid, 7,8,9 benzoate, 10,11,12 propionate, and 13,14,15 sorbate
  • Write a function my.test which will take as arguments table a and 2 lists of columns (e.g. 1:3 and 4:6) and will run for each row of the table Welch's t-test of the first set of columns vs the second set. It will return the resulting vector of p-values
  • For example by calling pa <- my.test(a, 1:3, 4:6) we will compute p-values for differences between reference and acetic acids (computation may take some time)
  • The first 5 values of pa should be
> pa[1:5]
[1] 0.94898907 0.07179619 0.24797684 0.48204100 0.23177496
  • Run the test for all four acids
  • Report how many genes were significant with p-value <= 0.01 for each acid
  • Report how many genes are significant for both acetic and benzoate acids? (logical and is written as &)

Task C: multiple testing correction

Run the following snippet of code, which works on the vector of p-values pa obtained for acetate in task B

# adjusts vectors of p-vales from tasks B for using Bonferroni correction
pa.adjusted = p.adjust(pa, method ="bonferroni")
# add this adjusted vector to frame a
a <-  cbind(a, pa.adjusted)
# create permutation ordered by pa.adjusted
oa = order(pa.adjusted)
# select from table five rows with the lowest pa.adjusted (using vector oa)
# and display columns containing reference, acetate and adjusted p-value
a[oa[1:5],c(1:6,16)]

You should get output like this:

            ref1     ref2     ref3  acetate1   acetate2  acetate3 pa.adjusted
SUL1    7.581312 7.394985 7.412040 2.1633230 2.05412373 1.9169226 0.004793318
YMR244W 2.985682 2.975530 3.054001 0.3364722 0.33647224 0.1823216 0.188582576
DIP5    6.943991 7.147795 7.296955 0.6931472 0.09531018 0.5306283 0.253995075
YLR460C 5.620401 5.801212 5.502482 3.2425924 3.48431229 3.3843903 0.307639012
HXT4    2.821379 3.049273 2.772589 7.7893717 8.24446541 8.3041980 0.573813502

Do the same procedure for benzoate p-values and report the result. Comment the results for both acids.

Task D: volcano plot, test on data generated from null hypothesis

Draw a volcano plot for the acetate data

  • x-axis of this plot is the difference in the mean of reference and mean of acetate.
    • You can compute row means of a matrix by rowMeans.
  • y-axis is -log10 of the p-value (use original p-values before multiple testing correction)
  • You can quickly see genes which have low p-values (high on y-axis) and also big difference in mean expression between the two conditions (far from 0 on x-axis). You can also see if acetate increases or decreases expression of these genes.

Now create a simulated dataset sharing some features of the real data but observing the null hypothesis that the mean of reference and acetate are the same for each gene

  • Compute vector m of means for columns 1:6 from matrix a
  • Compute vectors sr and sa of standard deviations for reference columns and for acetate columns respectively
    • You can compute standard deviation for each row of a matrix by apply(some.matrix, 1, sd)
  • For each for each i in 1:6398, create three samples from normal distribution with mean m[i] and standard deviation sr[i], and three samples with mean m[i] and deviation sa[i]
    • Use function rnorm
  • On the resulting matrix apply Welches' t-test and draw the volcano plot.
  • How many random genes have p-value <=0.01? Is it roughly what we would expect under the null hypothesis?

Draw histogram of p-values from the real data (reference vs acetate) and from random data. Describe how they differ. Is it what you would expect?

  • use function hist

Submit plots volcano-real.png, volcano-random.png, hist-real.png, hist-random.png (real for real expression data and random for generated data)