2-INF-185 Integrácia dátových zdrojov 2016/17

Materiály · Úvod · Pravidlá · Kontakt
HW10 a HW11 odovzdajte do utorka 30.5. 9:00.
Dátumy odovzdania projektov:
1. termín: nedeľa 11.6. 22:00
2. termín: streda 21.6. 22:00
Oba termíny sú riadne, prvý je určený pre študentov končiacich štúdium alebo tých, čo chcú mať predmet ukončený skôr. V oboch prípadoch sa pár dní po odvzdaní budú konať krátke osobné stretnutia s vyučujúcimi (diskusia k projektu a uzatvárane známky). Presné dni a časy dohodneme neskôr. Projekty odovzdajte podobne ako domáce úlohy do /submit/projekt


L11

From IDZ
Jump to: navigation, search

HW11

Biological story: tiny monkeys

  • Common marmoset (Callithrix jacchus, Kosmáč bielofúzý) weights only about 1/4 kg
  • Most primates are much bigger
  • Which marmoset genes differ from other primates and are related to the small size?
  • Positive selection scan computes of each gene a p-value, whether it evolved on the marmoset lineage faster
    • Exact details, see papers [1] and [2]
  • The result is a list of p-values, one for each gene
  • Which biological functions are enriched among positively selected genes? Are any of those functions possibly related to body size?

Gene functions and GO categories

Use mysql database "marmoset" on the server.

  • We can look at the description of a particular gene:
select * from genes where prot='IGF1R';
+----------------------------+-------+-------------------------------------------------+                       
| transcriptid               | prot  | description                                     |                       
+----------------------------+-------+-------------------------------------------------+                       
| knownGene.uc010urq.1.1.inc | IGF1R | insulin-like growth factor 1 receptor precursor |                       
+----------------------------+-------+-------------------------------------------------+
  • In the database, we have stored all the P-values from positive selection tests:
select * from lrtmarmoset where transcriptid='knownGene.uc010urq.1.1.inc';
+----------------------------+---------------------+
| transcriptid               | pval                |
+----------------------------+---------------------+
| knownGene.uc010urq.1.1.inc | 0.00142731425252827 |
+----------------------------+---------------------+
  • Genes are also assigned functional categories based on automated processes (including sequence similarity to other genes) and manual curation. The corresponding database is maintained by Gene Ontology Consortium. We can use on-line sources to search for these annotations, e.g. here.
  • We can also download the whole database and preprocess it into usable form:
select * from genes2gocat,gocatdefs where transcriptid='knownGene.uc010urq.1.1.inc' and genes2gocat.cat=gocatdefs.cat;
(results in 50 categories)
  • GO categories have a hierarchical structure - see for example category GO:0005524 ATP binding:
select * from gocatparents,gocatdefs where gocatparents.parent=gocatdefs.cat and gocatparents.cat='GO:0005524';
+------------+------------+---------+------------+-------------------------------+
| cat        | parent     | reltype | cat        | def                           |
+------------+------------+---------+------------+-------------------------------+
| GO:0005524 | GO:0032559 | isa     | GO:0032559 | adenyl ribonucleotide binding |
+------------+------------+---------+------------+-------------------------------+
... and continuing further up the hierarchy:
| GO:0032559 | GO:0030554 | isa     | GO:0030554 | adenyl nucleotide binding     |
| GO:0032559 | GO:0032555 | isa     | GO:0032555 | purine ribonucleotide binding |
| GO:0030554 | GO:0001883 | isa     | GO:0001883 | purine nucleoside binding |
| GO:0030554 | GO:0017076 | isa     | GO:0017076 | purine nucleotide binding |
| GO:0032555 | GO:0017076 | isa     | GO:0017076 | purine nucleotide binding |
| GO:0032555 | GO:0032553 | isa     | GO:0032553 | ribonucleotide binding    |
| GO:0001883 | GO:0001882 | isa     | GO:0001882 | nucleoside binding |
| GO:0017076 | GO:0000166 | isa     | GO:0000166 | nucleotide binding |
| GO:0032553 | GO:0000166 | isa     | GO:0000166 | nucleotide binding |
| GO:0001882 | GO:0005488 | isa     | GO:0005488 | binding |
| GO:0000166 | GO:0005488 | isa     | GO:0005488 | binding |
| GO:0005488 | GO:0003674 | isa     | GO:0003674 | molecular_function |
  • What else can be under GO:0032559 adenyl ribonucleotide binding?
select * from gocatparents,gocatdefs where gocatparents.cat=gocatdefs.cat and gocatparents.parent='GO:0032559';
+------------+------------+---------+------------+-------------+
| cat        | parent     | reltype | cat        | def         |
+------------+------------+---------+------------+-------------+
| GO:0005524 | GO:0032559 | isa     | GO:0005524 | ATP binding |
| GO:0016208 | GO:0032559 | isa     | GO:0016208 | AMP binding |
| GO:0043531 | GO:0032559 | isa     | GO:0043531 | ADP binding |
+------------+------------+---------+------------+-------------+

Mann–Whitney U test

  • also known as Wilcoxon rank-sum test
  • In Lecture 6, we have used Welch's t-test to test if one set of expression measurements for a gene are significantly different from the second set
  • This test assumes that both sets come from normal (Gaussian) distributions with unknown parameters
  • Mann-Whitney U test is called non-parametric, because it does not make this assumption
  • The null hypothesis is that two sets of measurements were generated by the same unknown probability distribution
  • Alternative hypothesis: for X from the first distribution and Y from the second P(X>Y) is not equal P(Y>X)
  • We will use a one-side version of the alternative hypothesis: P(X>Y) > P(Y>X)
  • Compute test statistics U:
    • compare all pairs X, Y (X from first set, Y from second set)
    • if X>Y, add 1 to U
    • if X==Y, add 0.5
  • For large sets, U is approximately normally distributed under the null hypothesis

How to use in R:

# generate 20 samples from exponential distrib. with mean 1
x = rexp(20, 1)  
# generate 30 samples from exponential distrib. with mean 1/2
y = rexp(30, 2)  

# test if values of x greater than y
wilcox.test(x,y,alternative="greater")  
# W = 441, p-value = 0.002336
# alternative hypothesis: true location shift is greater than 0
# W is the U statistics above

# now generate y twice from the same distrib. as x
y = rexp(30, 1)
wilcox.test(x,y,alternative="greater")
# W = 364, p-value = 0.1053
# relatively small p-value (by chance)

y = rexp(30, 1)
wilcox.test(x,y,alternative="greater")
# W = 301, p-value = 0.4961
# now much greater p-value

Another form of the test, potentially useful for HW:

  • have a vector of values x, binary vector b indicating two classes: 0 and 1
  • test if values marked by 0 are greater than values marked by 1
# generate 10 with mean 1, 30 with mean 1/2, 10 with mean 1
x = c(rexp(10,1),rexp(30,2),rexp(10,1))
# classes 10x0, 20x1, 10x0
b = c(rep(0,10),rep(1,30),rep(0,10))
wilcox.test(x~b,alternative="greater")

# the same test by distributing into subvectors x0 and x1 for classes 0 and 1
x0 = x[b==0]
x1 = x[b==1]
wilcox.test(x0,x1,alternative="greater")
# should be the same as above