1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Lr2

From MAD
Revision as of 13:18, 15 April 2021 by Brona (talk | contribs)
Jump to navigation Jump to search

HWr2

The topic of this lecture are statistical tests in R.

  • Beginners in statistics: listen to lecture, then do tasks A, B, C
  • If you know basics of statistical tests, do tasks B, C, D
  • More information on this topic in the 1-EFM-340 Computer Statistics course


Introduction to statistical tests: sign test

  • Two friends A and B have played their favourite game n=10 times, A has won 6 times and B has won 4 times.
  • A claims that he is a better player, B claims that such a result could easily happen by chance if they were equally good players.
  • Hypothesis of player B is called the null hypothesis. It claims that the pattern we see (A won more often than B) is simply a result of chance
  • The null hypothesis reformulated: we toss coin n times and compute value X: the number of times we see head. The tosses are independent and each toss has equal probability of being head or tail
  • Similar situation: comparing programs A and B on several inputs, and counting how many times is program A better than B.
# simulation in R: generate 10 psedorandom bits
# (1=player A won)
sample(c(0,1), 10, replace = TRUE)
# result e.g. 0 0 0 0 1 0 1 1 0 0

# directly compute random variable X, i.e. the sum of bits
sum(sample(c(0,1), 10, replace = TRUE))
# result e.g. 5

# we define a function which will m times repeat 
# the coin tossing experiment with n tosses 
# and returns a vector with m values of random variable X
experiment <- function(m, n) {
  x = rep(0, m)     # create vector with m zeroes
  for(i in 1:m) {   # for loop through m experiments
    x[i] = sum(sample(c(0,1), n, replace = TRUE)) 
  }
  return(x)         # return array of values     
}
# call the function for m=20 experiments, each with n tosses
experiment(20,10)
# result e.g.  4 5 3 6 2 3 5 5 3 4 5 5 6 6 6 5 6 6 6 4
# draw histograms for 20 experiments and 1000 experiments
png("hist10.png")  # open png file
par(mfrow=c(2,1))  # matrix of plots with 2 rows and 1 column
hist(experiment(20,10))
hist(experiment(1000,10))
dev.off() # finish writing to file
  • It is easy to realize that we get binomial distribution (binomické rozdelenie)
  • The probability of getting k ones out of n coin tosses is
  • The p-value of the test is the probability that simply by chance we would get k the same or more extreme than in our data.
  • In other words, what is the probability that in n=10 tosses we see head 6 times or more (one sided test)
  • P-value for k ones out of n coin tosses
  • If the p-value is very small, say smaller than 0.01, we reject the null hypothesis and assume that player A is in fact better than B
# computing the probability that we get exactly 6 heads in 10 tosses
dbinom(6, 10, 0.5) # result 0.2050781
# we get the same as our formula above:
7*8*9*10/(2*3*4*(2^10)) # result 0.2050781

# entire probability distribution: probabilities 0..10 heads in 10 tosses
dbinom(0:10, 10, 0.5)
# [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
# [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
# [11] 0.0009765625

# we can also plot the distribution
plot(0:10, dbinom(0:10, 10, 0.5))
barplot(dbinom(0:10, 10, 0.5))

# our p-value is the sum for k=6,7,8,9,10
sum(dbinom(6:10, 10, 0.5))
# result: 0.3769531
# so results this "extreme" are not rare by chance,
# they happen in about 38% of cases

# R can compute the sum for us using pbinom 
# this considers all values greater than 5
pbinom(5, 10, 0.5, lower.tail=FALSE)
# result again 0.3769531

# if probability is too small, use log of it
pbinom(9999, 10000, 0.5, lower.tail=FALSE, log.p = TRUE)
# [1] -6931.472
# the probability of getting 10000x head is exp(-6931.472) = 2^{-100000}

# generating numbers from binomial distribution 
# - similarly to our function experiment
rbinom(20, 10, 0.5)
# [1] 4 4 8 2 6 6 3 5 5 5 5 6 6 2 7 6 4 6 6 5

# running the test
binom.test(6, 10, p = 0.5, alternative="greater")
#
#        Exact binomial test
#
# data:  6 and 10
# number of successes = 6, number of trials = 10, p-value = 0.377
# alternative hypothesis: true probability of success is greater than 0.5
# 95 percent confidence interval:
# 0.3035372 1.0000000
# sample estimates:
# probability of success
#                   0.6

# to only get p-value, run
binom.test(6, 10, p = 0.5, alternative="greater")$p.value
# result 0.3769531

Comparing two sets of values: Welch's t-test

  • Let us now consider two sets of values drawn from two normal distributions with unknown means and variances
  • The null hypothesis of the Welch's t-test is that the two distributions have equal means
  • The test computes test statistics (in R for vectors x1, x2):
    • (mean(x1)-mean(x2))/sqrt(var(x1)/length(x1)+var(x2)/length(x2))
  • If the null hypothesis holds, i.e. x1 and x2 were sampled from distributions with equal means, this test statistics is approximately distributed according to the Student's t-distribution with the degree of freedom obtained by
n1=length(x1)
n2=length(x2)
(var(x1)/n1+var(x2)/n2)**2/(var(x1)**2/((n1-1)*n1*n1)+var(x2)**2/((n2-1)*n2*n2))
  • Luckily R will compute the test for us simply by calling t.test
# generate x1: 6 values from normal distribution with mean 2 and standard deviation 1
x1 = rnorm(6, 2, 1)
# for example 2.70110750  3.45304366 -0.02696629  2.86020145  2.37496993  2.27073550

# generate x2: 4 values from normal distribution with mean 3 and standard deviation 0.5
x2 = rnorm(4, 3, 0.5)
# for example 3.258643 3.731206 2.868478 2.239788
t.test(x1, x2)
# t = -1.2898, df = 7.774, p-value = 0.2341
# alternative hypothesis: true difference in means is not equal to 0
# means 2.272182  3.024529
# this time the test was not significant

# regenerate x2 from a distribution with a much more different mean
x2 = rnorm(4, 5, 0.5)
# 4.882395 4.423485 4.646700 4.515626
t.test(x1, x2)
# t = -4.684, df = 5.405, p-value = 0.004435
# means 2.272182  4.617051
# this time much more significant p-value

# to get only p-value, run 
t.test(x1,x2)$p.value

We will apply Welch's t-test to microarray data

Multiple testing correction

  • When we run t-tests on the control vs. benzoate on all 6398 genes, we get 435 genes with p-value at most 0.01
  • Purely by chance this would happen in 1% of cases (from the definition of the p-value).
  • So purely by chance we would expect to get about 64 genes with p-value at most 0.01.
  • So roughly 15% of our detected genes (maybe less, maybe more) are false positives which happened purely by chance.
  • Sometimes false positives may even overwhelm the results
  • Multiple testing correction tries to limit the number of false positives among the results of multiple statistical tests, there are many different methods.
  • The simplest one is Bonferroni correction, where the threshold on the p-value is divided by the number of tested genes, so instead of 0.01 we use threshold 0.01/6398 = 1.56e-6.
  • This way the expected overall number of false positives in the whole set is 0.01 and so the probability of getting even a single false positive is also at most 0.01 (by Markov inequality).
  • We could instead multiply all p-values by the number of tests and apply the original threshold 0.01 - such artificially modified p-values are called corrected.
  • After Bonferroni correction we get only one significant gene.
# the results of t-tests are in vector pb of length 6398
# manually multiply p-values by length(pb), count those that have value <= 0.01
sum(pb * length(pb) <= 0.01)
# in R you can use p.adjust for multiple testing correction
pb.adjusted = p.adjust(pa, method ="bonferroni")
# this is equivalent to multiplying by the length and using 1 if the result > 1
pb.adjusted = pmin(pa*length(pa),rep(1,length(pa)))

# there are less conservative multiple testing correction methods, 
# e.g. Holm's method, but in this case we get almost the same results
pa.adjusted2 = p.adjust(pa, method ="holm")

Another frequently used correction is false discovery rate (FDR), which is less strict and controls the overall proportion of false positives among the results.