2-INF-185 Integrácia dátových zdrojov 2016/17
L06
From IDZ
Topic of this lecture are statistical tests in R.
- Beginners in statistics: listen to lecture, then do tasks A, B, C
- If you know basics of statistical tests, do tasks B, C, D
- More information on this topic in 1-EFM-340 Počítačová štatistika
Introduction to statistical tests: sign test
- [1]
- Two friends A and B have played their favourite game n=10 times, A has won 6 times and B has won 4 times.
- A claims that he is a better player, B claims that such a result could easily happen by chance if they were equally good players.
- Hypothesis of player B is called null hypothesis that the pattern we see (A won more often than B) is simply a result of chance
- Null hypothesis reformulated: we toss coin n times and compute value X: the number of times we see head. The tosses are independent and each toss has equal probability of being 0 or 1
- Similar situation: comparing programs A and B on several inputs, counting how many times is program A better than B.
# simulation in R: generate 10 psedorandom bits # (1=player A won) sample(c(0,1), 10, replace = TRUE) # result e.g. 0 0 0 0 1 0 1 1 0 0 # directly compute random variable X, i.e. sum of bits sum(sample(c(0,1), 10, replace = TRUE)) # result e.g. 5 # we define a function which will m times repeat # the coin tossing experiment with n tosses # and returns a vector with m values of random variable X experiment <- function(m, n) { x = rep(0, m) # create vector with m zeroes for(i in 1:m) { # for loop through m experiments x[i] = sum(sample(c(0,1), n, replace = TRUE)) } return(x) # return array of values } # call the function for m=20 experiments, each with n tosses experiment(20,10) # result e.g. 4 5 3 6 2 3 5 5 3 4 5 5 6 6 6 5 6 6 6 4 # draw histograms for 20 experiments and 1000 experiments png("hist10.png") # open png file par(mfrow=c(2,1)) # matrix of plots with 2 rows and 1 column hist(experiment(20,10)) hist(experiment(1000,10)) dev.off() # finish writing to file
- It is easy to realize that we get binomial distribution (binomické rozdelenie)
- <math>Pr(X=k) = {n \choose k} 2^{-n}</math>
- P-value of the test is the probability that simply by chance we would get k the same or more extreme than in our data.
- In other words, what is the probability that in 10 tosses we see head 6 times or more (one sided test)
- <math>\sum_{j=k}^n {n \choose k} 2^{-n}</math>
- If the p-value is very small, say smaller than 0.01, we reject the null hypothesis and assume that player A is in fact better than B
# computing the probability that we get exactly 6 heads in 10 tosses dbinom(6, 10, 0.5) # result 0.2050781 # we get the same as our formula above: 7*8*9*10/(2*3*4*(2^10)) # result 0.2050781 # entire probability distribution: probabilities 0..10 heads in 10 tosses dbinom(0:10, 10, 0.5) # [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250 # [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250 # [11] 0.0009765625 #we can also plot the distribution plot(0:10, dbinom(0:10, 10, 0.5)) barplot(dbinom(0:10,10,0.5)) #our p-value is sum for 7,8,9,10 sum(dbinom(6:10,10,0.5)) # result: 0.3769531 # so results this "extreme" are not rare by chance, # they happen in about 38% of cases # R can compute the sum for us using pbinom # this considers all values greater than 5 pbinom(5, 10, 0.5, lower.tail=FALSE) # result again 0.3769531 # if probability is too small, use log of it pbinom(9999, 10000, 0.5, lower.tail=FALSE, log.p = TRUE) # [1] -6931.472 # the probability of getting 10000x head is exp(-6931.472) = 2^{-100000} # generating numbers from binomial distribution # - similarly to our function experiment rbinom(20, 10, 0.5) # [1] 4 4 8 2 6 6 3 5 5 5 5 6 6 2 7 6 4 6 6 5 # running the test binom.test(6, 10, p = 0.5, alternative="greater") # # Exact binomial test # # data: 6 and 10 # number of successes = 6, number of trials = 10, p-value = 0.377 # alternative hypothesis: true probability of success is greater than 0.5 # 95 percent confidence interval: # 0.3035372 1.0000000 # sample estimates: # probability of success # 0.6 # to only get p-value run binom.test(6, 10, p = 0.5, alternative="greater")$p.value # result 0.3769531
Comparing two sets of values: Welch's t-test
- Let us now consider two sets of values drawn from two normal distributions with unknown means and variances
- The null hypothesis of the Welch's t-test is that the two distributions have equal means
- The test computes test statistics (in R for vectors x1, x2):
- (mean(x1)-mean(x2))/sqrt(var(x1)/length(x1)+var(x2)/length(x2))
- This test statistics is approximately distributed according to Student's t-distribution with the degree of freedom obtained by
n1=length(x1) n2=length(x2) (var(x1)/n1+var(x2)/n2)**2/(var(x1)**2/((n1-1)*n1*n1)+var(x2)**2/((n2-1)*n2*n2))
- Luckily R will compute the test for us simply by calling t.test
x1 = rnorm(6, 2, 1) # 2.70110750 3.45304366 -0.02696629 2.86020145 2.37496993 2.27073550 x2 = rnorm(4, 3, 0.5) # 3.258643 3.731206 2.868478 2.239788 > t.test(x1,x2) # t = -1.2898, df = 7.774, p-value = 0.2341 # alternative hypothesis: true difference in means is not equal to 0 # means 2.272182 3.024529 x2 = rnorm(4, 5, 0.5) # 4.882395 4.423485 4.646700 4.515626 t.test(x1,x2) # t = -4.684, df = 5.405, p-value = 0.004435 # means 2.272182 4.617051 # to get only p-value, run t.test(x1,x2)$p.value
We will apply Welch's t-test to microarray data
- Data from GEO database [2], publication [3]
- Abbott et al 2007: Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae
- gene expression measurements under 5 conditions:
- reference: yeast grown in normal environment
- 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic)
- from each condition (reference and each acid) we have 3 replicates
- together our table has 15 columns (3 replicates from 5 conditions)
- 6398 rows (genes)
- We will test statistical difference between the reference condition and one of the acids (3 numbers vs other 2 numbers)
- See Task B in HW06
Multiple testing correction
- When we run t-tests on the reference vs. acetic acid on all 6398 genes, we get 118 genes with p-value<=0.01
- Purely by chance this would happen in 1% of cases (from definition of p-value)
- So purely by chance we would expect to get about 64 genes with p-value<=0.01
- So perhaps roughly half of our detected genes (maybe less, maybe more) are false positives
- Sometimes false positives may even overwhelm results
- Multiple testing correction tries to limit the number of false positives among results of multiple statistical tests
- Many different methods
- The simplest one is Bonferroni correction, where the threshold on p-value is divided by the number of tested genes, so instead of 0.01 we use 0.01/6398 = 1.56e-6
- This way the expected overall number of false positives in the whole set is 0.01 and so the probability of getting even a single false positive is also at most 0.01 (by Markov inequality)
- We could instead multiply all p-values by the number of tests and apply the original threshold 0.01 - such artificially modified p-values are called corrected
- After Bonferroni correction we get only 1 significant gene
# the results of p-tests are in vector pa of length 6398 # manually multiply p-values by length(pa), count those that have value <=0.01 sum(pa * length(pa) < 0.01) # in R you can use p.adjust form multiple testing correction pa.adjusted = p.adjust(pa, method ="bonferroni") # this is equivalent to multiplying by the length and using 1 if the result > 1 pa.adjusted = pmin(pa*length(pa),rep(1,length(pa))) # there are less conservative multiple testing correction methods, e.g. Holm's method # but in this case we get almost the same results pa.adjusted2 = p.adjust(pa, method ="holm")
- Other frequently used correction is false discovery rate (FDR), which is less strict and controls the overall proportion of false positives among results