2-INF-185 Integrácia dátových zdrojov 2017/18

Materiály · Úvod · Pravidlá · Kontakt
Body z už opravených úloh nájdete na serveri v /grades/userid.txt
Dátumy odovzdania projektov:
1. termín: nedeľa 4.6. 22:00
2. termín: streda 20.6. 22:00
Oba termíny sú riadne, prvý je určený pre študentov, čo chcú mať predmet ukončený skôr. V oboch prípadoch sa pár dní po odvzdaní budú konať krátke osobné stretnutia s vyučujúcimi (diskusia k projektu a uzatváranie známky). Presné dni a časy dohodneme neskôr. Projekty odovzdajte podobne ako domáce úlohy do /submit/projekt


From IDZ
Jump to: navigation, search

Lecture 11

  • In this task, you can use a combination of any scripting languages (e.g. Perl, Python, R) but also SQL, command-line tools etc.
  • Input is in a database
  • Submit required text files (optionally also files with figures in bonus part E)
  • Also submit any scripts you have written for this HW
  • In the protocol, include shell commands you have run
  • Outline of protocol is in /tasks/hw11/protocol.txt

Available data

  • All data necessary for this task is available in the mysql database 'marmoset' on the server
  • You will find password in /tasks/hw11/readme.txt
  • You have read-only access to the 'marmoset' database
  • For creating temporary tables, etc., you can use database 'temp_youruserid' (e.g. 'temp_mrkvicka54'), where you are allowed to create new tables and store data
  • You can address tables in mysql even between databases: to start client with your writeable database as default location, use:
mysql -p temp_mrkvicka54
  • You can then access data in the table 'genes' in the database 'marmoset' simply by using 'marmoset.genes'

Getting data from database:

  • If your want to get data from database to a tab-separated file, write a select query, run with -e, redirect output:
mysql -p marmoset -e 'select transcriptid as id, pval from lrtmarmoset' > pvals.tsv

Task A: Find ancestors of each GO category

  • Compute a table (in your temporary db or in a file) which contains all pairs category and its ancestor
    • In table gocatparents you have pairs category, its parent, so you need a transitive closure over this relation
    • SQL is not very good at this, you can try repeated joins until you find no more ancestors
    • Alternatively, you can simply extract data from the database and process them in a language of your choice
  • Submit file sample-anc.txt which contains the list of all ancestors of GO:0042773, one per line, in sorted order
    • There should be 14 such ancestors, excluding this category itself; the first in sorted order is GO:0006091, the last is GO:0055114

Task B: Gene in/out data for each GO category

  • Again consider category GO:0042773
  • Create a list of all genes that occur in table lrtmarmoset
    • for each such gene list three columns separated by tabs: its transcript id, p-value from lrtmarmoset, and an indicator 0/1
    • the indicator is 1, if this gene occurs in GO:0042773 or one of its subcategories; 0 otherwise
    • to find, which gene occur directly in GO:0042773, use table genes2gocat, subcategories can be found in your table from part A
    • note that genes2gocat contains more genes, we will consider only genes from lrtmarmoset
  • Submit this file sample-genes.tsv
    • Overall, your table should have 13717 genes, out of which 28 have value 1 in the last column
    • The first lines of this list (when sorted alphabetically) might look as follos:
ensGene.ENST00000043410.1.inc   1       0
ensGene.ENST00000158526.1.inc   0.483315913388483       0
ensGene.ENST00000456284.1       1       1
  • Note that in part C, you will need to run this process for each category in the database, so make it sufficiently automated

Task C: Run Man-Whitney U test for each GO category

  • Run Man-Whitney U test for each non-trivial category
    • Non-trivial categories are such that at least one of our genes (from lrtmarmoset) is in the category and at least one of our genes is not in the category
    • You should test, if genes in a particular GO category have smaller p-values in positive selection than genes outside the category
    • List of all categories can be obtained from gocatdefs, but not all of them are non-trivial (there are 12455 non-trivial categories)
  • Submit file test.tsv in which each line contains two tab separated values:
    • GO category id
    • p-value from the test
  • For partial points test at least the category GO:0042773 from parts A and B

Task D: Report significant categories

  • Submit file report.tsv with 20 most significant GO categories (lowest p-values)
    • For each category list its ID, p-value and description
    • Order them from the most significant
    • Descriptions are in table gocatdefs
  • To your protocol, write any observations you can make
    • Do any reported categories seem interesting to you?
    • Are any reported categories likely related to each other based on their descriptions?

Task E (bonus): cluster significant categories

  • Some categories in task D appear similar according to their name
  • Try creating k-means or hierarchical clustering of categories
  • Represent each category as a binary vector in which for each gene you have one bit indicating if it is in the category
  • Thus categories with the same set of genes will have identical vectors
  • Try to report results in an appropriate form (table, text, figure), discuss them in the protocol


  • In part C, we have done many statistical tests, resulting P-values should be corrected by multiple testing correction from Lecture 10
    • This is not required in this homework, but should be done in a real study