1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "Lr2"
Jump to navigation
Jump to search
(3 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
<!-- NOTEX --> | <!-- NOTEX --> | ||
[[HWr2]] | [[HWr2]] | ||
− | |||
The topic of this lecture are statistical tests in R. | The topic of this lecture are statistical tests in R. | ||
* Beginners in statistics: listen to lecture, then do tasks A, B, C | * Beginners in statistics: listen to lecture, then do tasks A, B, C | ||
* If you know basics of statistical tests, do tasks B, C, D | * If you know basics of statistical tests, do tasks B, C, D | ||
− | * More information on this topic in the [https://sluzby.fmph.uniba.sk/infolist/sk/1-EFM- | + | * More information on this topic in the [https://sluzby.fmph.uniba.sk/infolist/sk/1-EFM-340.html 1-EFM-340 Computer Statistics] or [https://sluzby.fmph.uniba.sk/infolist/sk/1-DAV-303.html 1-DAV-303 Statistical Methods] courses. |
− | + | <!-- /NOTEX --> | |
==Introduction to statistical tests: sign test== | ==Introduction to statistical tests: sign test== | ||
− | * Two friends ''A'' and ''B'' | + | * Two friends ''A'' and ''B'' played their favorite game ''n''=10 times, ''A'' won 6 times and ''B'' won 4 times. |
− | * ''A'' claims that he is a better player, ''B'' claims that such a result could easily happen by chance if they were equally good players. | + | * ''A'' claims that she/he is a better player, ''B'' claims that such a result could easily happen by chance if they were equally good players. |
− | * Hypothesis of player ''B'' is called the ''null hypothesis''. It claims that the pattern we see (''A'' won more often than ''B'') is simply a result of chance | + | * Hypothesis of player ''B'' is called the ''null hypothesis''. It claims that the pattern we see (''A'' won more often than ''B'') is simply a result of chance. |
− | * The null hypothesis reformulated: we toss coin ''n'' times and compute value ''X'': the number of times we see head. The tosses are independent and each toss has equal probability of being head or tail | + | * The null hypothesis reformulated: we toss coin ''n'' times and compute value ''X'': the number of times we see head. The tosses are independent and each toss has equal probability of being head or tail. |
* Similar situation: comparing programs ''A'' and ''B'' on several inputs, and counting how many times is program ''A'' better than ''B''. | * Similar situation: comparing programs ''A'' and ''B'' on several inputs, and counting how many times is program ''A'' better than ''B''. | ||
<syntaxhighlight lang="r"> | <syntaxhighlight lang="r"> | ||
− | # simulation in R: generate 10 | + | # simulation in R: generate 10 pseudorandom bits |
# (1=player A won) | # (1=player A won) | ||
sample(c(0,1), 10, replace = TRUE) | sample(c(0,1), 10, replace = TRUE) | ||
Line 35: | Line 34: | ||
return(x) # return array of values | return(x) # return array of values | ||
} | } | ||
− | # call the function for m=20 experiments, each with n tosses | + | # call the function for m=20 experiments, each with n=10 tosses |
experiment(20,10) | experiment(20,10) | ||
# result e.g. 4 5 3 6 2 3 5 5 3 4 5 5 6 6 6 5 6 6 6 4 | # result e.g. 4 5 3 6 2 3 5 5 3 4 5 5 6 6 6 5 6 6 6 4 | ||
Line 48: | Line 47: | ||
* The probability of getting ''k'' ones out of ''n'' coin tosses is <math>\Pr(X=k) = {n \choose k} 2^{-n}</math> | * The probability of getting ''k'' ones out of ''n'' coin tosses is <math>\Pr(X=k) = {n \choose k} 2^{-n}</math> | ||
* The ''p-value'' of the test is the probability that simply by chance we would get ''k'' the same or more extreme than in our data. | * The ''p-value'' of the test is the probability that simply by chance we would get ''k'' the same or more extreme than in our data. | ||
− | * In other words, what is the probability that in ''n=10'' tosses we see head 6 times or more (one sided test) | + | * In other words, what is the probability that in ''n=10'' tosses we see head 6 times or more (this is called one sided test). |
− | * P-value for ''k'' ones out of ''n'' coin tosses <math>\sum_{j=k}^n {n \choose k} 2^{-n}</math> | + | * P-value for ''k'' ones out of ''n'' coin tosses <math>\sum_{j=k}^n {n \choose k} 2^{-n}</math>. |
− | * If the p-value is very small, say smaller than 0.01, we reject the null hypothesis and assume that player ''A'' is in fact better than ''B'' | + | * If the p-value is very small, say smaller than 0.01, we reject the null hypothesis and assume that player ''A'' is in fact better than ''B''. |
<syntaxhighlight lang="r"> | <syntaxhighlight lang="r"> | ||
Line 109: | Line 108: | ||
==Comparing two sets of values: Welch's t-test== | ==Comparing two sets of values: Welch's t-test== | ||
− | * Let us now consider two sets of values drawn from two [https://en.wikipedia.org/wiki/Normal_distribution normal distributions] with unknown means and variances | + | * Let us now consider two sets of values drawn from two [https://en.wikipedia.org/wiki/Normal_distribution normal distributions] with unknown means and variances. |
− | * The null hypothesis of the [https://en.wikipedia.org/wiki/Welch%27s_t-test Welch's t-test] is that the two distributions have equal means | + | * The null hypothesis of the [https://en.wikipedia.org/wiki/Welch%27s_t-test Welch's t-test] is that the two distributions have equal means. |
* The test computes test statistics (in R for vectors x1, x2): | * The test computes test statistics (in R for vectors x1, x2): | ||
** <tt>(mean(x1)-mean(x2))/sqrt(var(x1)/length(x1)+var(x2)/length(x2))</tt> | ** <tt>(mean(x1)-mean(x2))/sqrt(var(x1)/length(x1)+var(x2)/length(x2))</tt> | ||
Line 152: | Line 151: | ||
** 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic) | ** 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic) | ||
* From each condition (control and each acid) we have 3 replicates | * From each condition (control and each acid) we have 3 replicates | ||
− | * Together our table has 15 columns (3 replicates from 5 conditions) and 6398 rows (genes). Last time we have used only a subset of rows | + | * Together our table has 15 columns (3 replicates from 5 conditions) and 6398 rows (genes). Last time we have used only a subset of rows. |
− | * We will test statistical difference between the control condition and one of the acids (3 numbers vs other 3 numbers) | + | * We will test statistical difference between the control condition and one of the acids (3 numbers vs other 3 numbers). |
− | * See Task B in [[HWr2|the exercises]] | + | * See Task B in [[HWr2|the exercises]]. |
==Multiple testing correction== | ==Multiple testing correction== | ||
− | * When we run t-tests on the control vs. benzoate on all 6398 genes, we get 435 genes with p-value at most 0.01 | + | * When we run t-tests on the control vs. benzoate on all 6398 genes, we get 435 genes with p-value at most 0.01. |
* Purely by chance this would happen in 1% of cases (from the definition of the p-value). | * Purely by chance this would happen in 1% of cases (from the definition of the p-value). | ||
* So purely by chance we would expect to get about 64 genes with p-value at most 0.01. | * So purely by chance we would expect to get about 64 genes with p-value at most 0.01. | ||
* So roughly 15% of our detected genes (maybe less, maybe more) are false positives which happened purely by chance. | * So roughly 15% of our detected genes (maybe less, maybe more) are false positives which happened purely by chance. | ||
− | * Sometimes false positives may even overwhelm the results | + | * Sometimes false positives may even overwhelm the results. |
− | * Multiple testing correction tries to limit the number of false positives among the results of multiple statistical tests | + | * Multiple testing correction tries to limit the number of false positives among the results of multiple statistical tests. There are many different methods. |
* The simplest one is [https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf Bonferroni correction], where the threshold on the p-value is divided by the number of tested genes, so instead of 0.01 we use threshold 0.01/6398 = 1.56e-6. | * The simplest one is [https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf Bonferroni correction], where the threshold on the p-value is divided by the number of tested genes, so instead of 0.01 we use threshold 0.01/6398 = 1.56e-6. | ||
− | * This way the expected overall number of false positives in the whole set is 0.01 and so the probability of getting even a single false positive is also at most 0.01 (by Markov inequality). | + | * This way the expected overall number of false positives in the whole set is 0.01 and so the probability of getting even a single false positive is also at most 0.01 (by [https://mathworld.wolfram.com/MarkovsInequality.html Markov inequality]). |
* We could instead multiply all p-values by the number of tests and apply the original threshold 0.01 - such artificially modified p-values are called corrected. | * We could instead multiply all p-values by the number of tests and apply the original threshold 0.01 - such artificially modified p-values are called corrected. | ||
− | * After Bonferroni correction we do not get any significant gene. | + | * After the Bonferroni correction we do not get any significant gene. |
<syntaxhighlight lang="r"> | <syntaxhighlight lang="r"> | ||
# the results of t-tests are in vector pb of length 6398 | # the results of t-tests are in vector pb of length 6398 | ||
Line 175: | Line 174: | ||
pb.adjusted = p.adjust(pa, method ="bonferroni") | pb.adjusted = p.adjust(pa, method ="bonferroni") | ||
# this is equivalent to multiplying by the length and using 1 if the result > 1 | # this is equivalent to multiplying by the length and using 1 if the result > 1 | ||
− | pb.adjusted = pmin(pa*length(pa),rep(1,length(pa))) | + | pb.adjusted = pmin(pa*length(pa), rep(1, length(pa))) |
# there are less conservative multiple testing correction methods, | # there are less conservative multiple testing correction methods, |
Latest revision as of 14:49, 24 April 2023
The topic of this lecture are statistical tests in R.
- Beginners in statistics: listen to lecture, then do tasks A, B, C
- If you know basics of statistical tests, do tasks B, C, D
- More information on this topic in the 1-EFM-340 Computer Statistics or 1-DAV-303 Statistical Methods courses.
Introduction to statistical tests: sign test
- Two friends A and B played their favorite game n=10 times, A won 6 times and B won 4 times.
- A claims that she/he is a better player, B claims that such a result could easily happen by chance if they were equally good players.
- Hypothesis of player B is called the null hypothesis. It claims that the pattern we see (A won more often than B) is simply a result of chance.
- The null hypothesis reformulated: we toss coin n times and compute value X: the number of times we see head. The tosses are independent and each toss has equal probability of being head or tail.
- Similar situation: comparing programs A and B on several inputs, and counting how many times is program A better than B.
# simulation in R: generate 10 pseudorandom bits
# (1=player A won)
sample(c(0,1), 10, replace = TRUE)
# result e.g. 0 0 0 0 1 0 1 1 0 0
# directly compute random variable X, i.e. the sum of bits
sum(sample(c(0,1), 10, replace = TRUE))
# result e.g. 5
# we define a function which will m times repeat
# the coin tossing experiment with n tosses
# and returns a vector with m values of random variable X
experiment <- function(m, n) {
x = rep(0, m) # create vector with m zeroes
for(i in 1:m) { # for loop through m experiments
x[i] = sum(sample(c(0,1), n, replace = TRUE))
}
return(x) # return array of values
}
# call the function for m=20 experiments, each with n=10 tosses
experiment(20,10)
# result e.g. 4 5 3 6 2 3 5 5 3 4 5 5 6 6 6 5 6 6 6 4
# draw histograms for 20 experiments and 1000 experiments
png("hist10.png") # open png file
par(mfrow=c(2,1)) # matrix of plots with 2 rows and 1 column
hist(experiment(20,10))
hist(experiment(1000,10))
dev.off() # finish writing to file
- It is easy to realize that we get binomial distribution (binomické rozdelenie)
- The probability of getting k ones out of n coin tosses is
- The p-value of the test is the probability that simply by chance we would get k the same or more extreme than in our data.
- In other words, what is the probability that in n=10 tosses we see head 6 times or more (this is called one sided test).
- P-value for k ones out of n coin tosses .
- If the p-value is very small, say smaller than 0.01, we reject the null hypothesis and assume that player A is in fact better than B.
# computing the probability that we get exactly 6 heads in 10 tosses
dbinom(6, 10, 0.5) # result 0.2050781
# we get the same as our formula above:
7*8*9*10/(2*3*4*(2^10)) # result 0.2050781
# entire probability distribution: probabilities 0..10 heads in 10 tosses
dbinom(0:10, 10, 0.5)
# [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
# [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
# [11] 0.0009765625
# we can also plot the distribution
plot(0:10, dbinom(0:10, 10, 0.5))
barplot(dbinom(0:10, 10, 0.5))
# our p-value is the sum for k=6,7,8,9,10
sum(dbinom(6:10, 10, 0.5))
# result: 0.3769531
# so results this "extreme" are not rare by chance,
# they happen in about 38% of cases
# R can compute the sum for us using pbinom
# this considers all values greater than 5
pbinom(5, 10, 0.5, lower.tail=FALSE)
# result again 0.3769531
# if probability is too small, use log of it
pbinom(9999, 10000, 0.5, lower.tail=FALSE, log.p = TRUE)
# [1] -6931.472
# the probability of getting 10000x head is exp(-6931.472) = 2^{-100000}
# generating numbers from binomial distribution
# - similarly to our function experiment
rbinom(20, 10, 0.5)
# [1] 4 4 8 2 6 6 3 5 5 5 5 6 6 2 7 6 4 6 6 5
# running the test
binom.test(6, 10, p = 0.5, alternative="greater")
#
# Exact binomial test
#
# data: 6 and 10
# number of successes = 6, number of trials = 10, p-value = 0.377
# alternative hypothesis: true probability of success is greater than 0.5
# 95 percent confidence interval:
# 0.3035372 1.0000000
# sample estimates:
# probability of success
# 0.6
# to only get p-value, run
binom.test(6, 10, p = 0.5, alternative="greater")$p.value
# result 0.3769531
Comparing two sets of values: Welch's t-test
- Let us now consider two sets of values drawn from two normal distributions with unknown means and variances.
- The null hypothesis of the Welch's t-test is that the two distributions have equal means.
- The test computes test statistics (in R for vectors x1, x2):
- (mean(x1)-mean(x2))/sqrt(var(x1)/length(x1)+var(x2)/length(x2))
- If the null hypothesis holds, i.e. x1 and x2 were sampled from distributions with equal means, this test statistics is approximately distributed according to the Student's t-distribution with the degree of freedom obtained by
n1=length(x1)
n2=length(x2)
(var(x1)/n1+var(x2)/n2)**2/(var(x1)**2/((n1-1)*n1*n1)+var(x2)**2/((n2-1)*n2*n2))
- Luckily R will compute the test for us simply by calling t.test
# generate x1: 6 values from normal distribution with mean 2 and standard deviation 1
x1 = rnorm(6, 2, 1)
# for example 2.70110750 3.45304366 -0.02696629 2.86020145 2.37496993 2.27073550
# generate x2: 4 values from normal distribution with mean 3 and standard deviation 0.5
x2 = rnorm(4, 3, 0.5)
# for example 3.258643 3.731206 2.868478 2.239788
t.test(x1, x2)
# t = -1.2898, df = 7.774, p-value = 0.2341
# alternative hypothesis: true difference in means is not equal to 0
# means 2.272182 3.024529
# this time the test was not significant
# regenerate x2 from a distribution with a much more different mean
x2 = rnorm(4, 5, 0.5)
# 4.882395 4.423485 4.646700 4.515626
t.test(x1, x2)
# t = -4.684, df = 5.405, p-value = 0.004435
# means 2.272182 4.617051
# this time much more significant p-value
# to get only p-value, run
t.test(x1,x2)$p.value
We will apply Welch's t-test to microarray data
- Data from the same paper as in the previous lecture, i.e. Abbott et al 2007 Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae
- Recall: Gene expression measurements under 5 conditions:
- Control: yeast grown in a normal environment
- 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic)
- From each condition (control and each acid) we have 3 replicates
- Together our table has 15 columns (3 replicates from 5 conditions) and 6398 rows (genes). Last time we have used only a subset of rows.
- We will test statistical difference between the control condition and one of the acids (3 numbers vs other 3 numbers).
- See Task B in the exercises.
Multiple testing correction
- When we run t-tests on the control vs. benzoate on all 6398 genes, we get 435 genes with p-value at most 0.01.
- Purely by chance this would happen in 1% of cases (from the definition of the p-value).
- So purely by chance we would expect to get about 64 genes with p-value at most 0.01.
- So roughly 15% of our detected genes (maybe less, maybe more) are false positives which happened purely by chance.
- Sometimes false positives may even overwhelm the results.
- Multiple testing correction tries to limit the number of false positives among the results of multiple statistical tests. There are many different methods.
- The simplest one is Bonferroni correction, where the threshold on the p-value is divided by the number of tested genes, so instead of 0.01 we use threshold 0.01/6398 = 1.56e-6.
- This way the expected overall number of false positives in the whole set is 0.01 and so the probability of getting even a single false positive is also at most 0.01 (by Markov inequality).
- We could instead multiply all p-values by the number of tests and apply the original threshold 0.01 - such artificially modified p-values are called corrected.
- After the Bonferroni correction we do not get any significant gene.
# the results of t-tests are in vector pb of length 6398
# manually multiply p-values by length(pb), count those that have value <= 0.01
sum(pb * length(pb) <= 0.01)
# in R you can use p.adjust for multiple testing correction
pb.adjusted = p.adjust(pa, method ="bonferroni")
# this is equivalent to multiplying by the length and using 1 if the result > 1
pb.adjusted = pmin(pa*length(pa), rep(1, length(pa)))
# there are less conservative multiple testing correction methods,
# e.g. Holm's method, but in this case we get almost the same results
pa.adjusted2 = p.adjust(pa, method ="holm")
Another frequently used correction is false discovery rate (FDR), which is less strict and controls the overall proportion of false positives among the results.