Difference between revisions of "Genomika 2017/18"

  
 
=Preliminary information on the state exams=
This page contains preliminary, unofficial information on the master's state exam subject Bioinformatics and Machine Learning for the academic year 2017/18. Some changes may still occur (especially in the area of data structures); the final version should appear within a few days on the website of the [http://dcs.fmph.uniba.sk/ Department of Computer Science].

==Introduction==

One of the goals of the state exam is to make you aware of the connections between different courses. The courses within the state exam subject Bioinformatics and Machine Learning are related to each other, but these relationships show up only to a small extent in the syllabi of the individual courses. We have therefore selected articles from the scientific literature that connect topics from several courses and will serve as a springboard for discussion at the state exam.

At the state exam you will draw one of the articles listed below and a triple of questions related to it. In the first question, your goal will always be to summarize the main results of the article and explain them to computer scientists who are not experts in the area of the article. For this question we expect a roughly 5-minute overview of the article with emphasis on explaining the necessary concepts and basic ideas of the article, not technical details. The second question will be taken from the topic areas listed below. It may or may not be related to the topic of the article. The third question will ask you to explain in detail some technical aspect of the article (e.g. a part of an algorithm, a more complex definition, a proof of a lemma, details of an experiment and so on).

After drawing the questions you will receive a printed copy of the article and will have at least an hour to prepare, so there is no need to know these articles by heart. When preparing for the state exam, we recommend that, besides revising the material in the listed topic areas, you also look at the listed articles and the related terminology.

==Articles==

* Apostolico A, Bock ME, Lonardi S, Xu X. Efficient detection of unusual words. Journal of Computational Biology. 2000 Feb 1;7(1-2):71-94. [http://www.cs.ucr.edu/~stelo/papers/jcb.pdf]

* Štefankovič D, Vempala S, Vigoda E. A deterministic polynomial-time approximation scheme for counting knapsack solutions. SIAM Journal on Computing. 2012 Apr 19;41(2):356-66. [https://arxiv.org/pdf/1008.1687]

* Dowell RD, Eddy SR. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics. 2004 Jun 4;5(1):71. [http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-5-71]

* Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754-60. [https://doi.org/10.1093/bioinformatics/btp324]

* Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated Markov models. Nucleic Acids Research. 1998 Jan 1;26(2):544-8. [http://nar.oxfordjournals.org/content/26/2/544.long]

* Wieland SC, Cassa CA, Mandl KD, Berger B. Revealing the spatial distribution of a disease while preserving privacy. Proceedings of the National Academy of Sciences. 2008 Nov 18;105(46):17608-13. [http://www.pnas.org/content/pnas/105/46/17608.full.pdf]

* Elias I, Lagergren J. Fast neighbor joining. Theoretical Computer Science. 2009 May 17;410(21):1993-2000. [https://pdfs.semanticscholar.org/fc80/df4469c8556fed45357cea8ba65f0c97535e.pdf]

* Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning 2006 Jun 25 (pp. 369-376). ACM. [https://mediatum.ub.tum.de/doc/1292048/file.pdf]

* Bachem O, Lucic M, Hassani H, Krause A. Fast and provably good seedings for k-means. In Advances in Neural Information Processing Systems 2016 (pp. 55-63). [https://papers.nips.cc/paper/6478-fast-and-provably-good-seedings-for-k-means.pdf]

* Turk M, Pentland A. Eigenfaces for recognition. Journal of cognitive neuroscience. 1991 Jan;3(1):71-86. [http://www.academia.edu/download/30894770/jcn.pdf]

==Topic areas==
Abbreviations of the related courses are given in parentheses: AOP: Approximation of Optimization Problems; G: Genomics; IDZ: Data Source Integration; MBI: Methods in Bioinformatics; NS: Neural Networks; PaŠ: Probability and Statistics; SU: Machine Learning; VPDŠ: Selected Topics in Data Structures

* Neural networks: multilayer perceptron, backpropagation, deep neural network architectures, Hebbian learning (SU,NS)

* Modelling sequence data: hidden Markov models, conditional probability and Bayes' theorem, the Viterbi and forward algorithms, examples of use in bioinformatics (gene finding and profile HMMs), recurrent neural networks, the Hopfield model (MBI,PaŠ,NS)

* Classification models: support vector machines, decision trees, random forests, bagging, boosting (SU)

* Regression: linear and generalized linear regression, the least squares method, statistical model with normally distributed errors, regularization (PaŠ,SU)

* Machine learning theory: statistical model of machine learning, bias vs. variance, overfitting and underfitting, PAC learning, bounds using the VC dimension (SU,NS)

* Unsupervised learning: clustering, self-organizing maps, principal component analysis, application to gene expression analysis (SU,NS,MBI)

* Statistical hypothesis testing: Fisher's exact test, Welch's t-test, Mann-Whitney U test, Bonferroni correction for multiple testing, log-likelihood ratio test, examples of the use of tests in bioinformatics (PaŠ,IDZ,MBI)

* Expected value of a random variable: linearity of expectation, Markov's and Chebyshev's inequalities (PaŠ)

* Limit theorems of probability theory: central limit theorem, de Moivre-Laplace theorem, weak law of large numbers (PaŠ)

* DNA sequencing: sequencing technologies and their characteristics (Sanger, Illumina, nanopore sequencing), genome assembly, de Bruijn graphs, RNA-seq (MBI,G)

* Phylogenetics and comparative genomics: neighbor joining, the parsimony method, the Jukes-Cantor model and other substitution models, positive and negative selection and its influence on the evolution of biological sequences (MBI,G)

* Alignments and string algorithms: local and global sequence alignment, BLAST (alignment seeds), perfect hashing, Bloom filter, efficient representation of sequences (suffix trees and arrays, Burrows–Wheeler transform, FM index) (MBI,VPDŠ)

* Maximum likelihood method: estimation of distribution parameters, unbiased parameter estimators, maximum likelihood for the reconstruction of phylogenetic trees, Felsenstein's algorithm, the EM algorithm, training of hidden Markov models, finding sequence motifs (PaŠ,MBI)

* Linear programming: linear and quadratic programming, the simplex method, duality, integer linear programming and its use for solving hard problems in bioinformatics, use of linear programming in approximation algorithms (deterministic rounding, iterated rounding, randomized rounding + derandomization, primal-dual methods), semidefinite programming and max-cut, use of duality in support vector machines (kernel methods) (AOP,SU,MBI)

* Approximability: complexity classes of approximation algorithms, the PCP theorem and its use, AP-reduction, APX-complete problems, approximability of the travelling salesman problem, polynomial-time approximation schemes and examples of PTAS algorithms (AOP)

* Applications of formal languages: the Knuth-Morris-Pratt algorithm for pattern matching in text, stochastic context-free grammars, covariance models and RNA families, the Nussinov algorithm (MBI,VPDŠ)

* Models of data structures: amortized complexity and the potential function, the I/O model and B-trees, the cache-oblivious model and the static binary tree with van Emde Boas layout, succinct data structures (rank and select) (VPDŠ)

* Data structures for intervals: range minimum query, lowest common ancestor, segment trees, range trees (VPDŠ)

==Example questions==
Example questions for the article Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. [http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf]

Question 1: Summarize the main results of the article and explain why the studied problem is important for modern machine learning
(if you do not explain what a neural network is in your answer to this question, we will probably ask you for the definition)

Question 2: Explain what normalized initialization is and, using Figures 7 and 9, explain what effect normalized initialization has on the course of training.
(a projector will be available on which figures from the article can be shown)

Question 3: Statistical model of machine learning, bias vs. variance, overfitting and underfitting
 
=Genomika: Information on tracks=
Information on the course [[Genomika]]

This page contains information on the tracks that you will create in the browser (both groups). We will add further information on some of the tracks in the coming days.

===Comments to the task list===
* Task (A) is a prerequisite of all other tasks; the rest are mostly independent of each other.
* Tasks are marked as fast (no significant computation required), medium (estimated computation up to 1 hour), or slow (longer computation, possibly several hours).
** These times are only estimates; reality may vary. Consider reporting actual (approximate) running times in your documentation.
** Fast tasks can be done entirely on the genomika server.
** Students with accounts on the compbio research cluster may run medium and slow tasks there.
* If you get stuck on one task, you can try to do at least the initial stages of another one. Coordinate within your group!
* Document your work. Your documentation should be independent of this page and of the documentation created last year: copy and modify relevant passages and cite sources.

===Basic information on creating tracks===
* https://github.com/fmfi-genomika/genomika-2017/wiki/Basics-of-creating-tracks
* https://github.com/fmfi-genomika/genomika-2017/wiki/How-to-add-track-to-DB-and-display-it-on-page
* '''Important:''' compared to last year, the path <tt>/kentsrc/kent/src/hg/makeDb/trackDb/</tt> was moved to <tt>/kentsrc/trackDb/</tt>

===(A) Genome (fast)===
* Download the genome in FASTA format and add it to the browser
* Data: malGlo [https://www.ncbi.nlm.nih.gov/genome/701?genome_assembly_id=30575], malSym [https://www.ncbi.nlm.nih.gov/genome/16894?genome_assembly_id=302004]
* https://github.com/fmfi-genomika/genomika-2017/wiki/Setting-up-a-Yarrowia-lipolytica
* An important step not described there is renaming chromosomes/contigs to something reasonable
* Genome versions are numbered; we will start with malGlo1 and malSym1
* A part missing from last year's documentation is adding the species to the hgcentral database
** look at existing records, e.g. for yarLip1, to guess appropriate values
** the taxonomy ID can be found at https://www.ncbi.nlm.nih.gov/taxonomy
<pre>
hgsql hgcentral -e '
insert into dbDb values (...);

insert into defaultDb values (...);

insert into genomeClade values (...);
'
</pre>

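The renaming step mentioned in part (A) can be sketched in Python. This is only a minimal sketch, not the procedure used last year: the function name <tt>rename_fasta</tt>, the <tt>chr</tt> prefix and the sample headers are assumptions made for illustration. It replaces each FASTA header with a short sequential name and returns the mapping from old to new names, which is needed again when the annotation is converted in part (B).

```python
def rename_fasta(lines, prefix="chr"):
    """Rename FASTA records to prefix1, prefix2, ... in order of appearance.

    Returns (renamed_lines, mapping), where mapping maps each old
    sequence ID (the first word of the header) to its new name.
    """
    renamed = []
    mapping = {}
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            count += 1
            old_id = line[1:].split()[0]
            new_id = f"{prefix}{count}"
            mapping[old_id] = new_id
            renamed.append(">" + new_id)
        else:
            renamed.append(line)
    return renamed, mapping


# Hypothetical two-record input; real headers come from the downloaded file.
example = [">XX000001.1 some description\n", "ACGT\n", ">XX000002.1 other\n", "GGCC\n"]
renamed, mapping = rename_fasta(example)
print("\n".join(renamed))
print(mapping)
```

Keeping the mapping in a file next to the renamed genome makes it easier to apply the same names to GFF annotation and to document what was changed.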
===(B) Protein coding genes and other items from the annotation (fast, needs A)===
* Download the genome annotation in GFF format, convert it to genePred format, and split it into two tracks: genes and other items
* Last year this was done by 2 groups based on 2 different databases:
** https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-NCBI-RefSeq
** https://github.com/fmfi-genomika/genomika-2017/wiki/Protein-coding-genes-from-EnsembleFungi
* Coordinate with the renaming of chromosomes in step (A)
* In the first pass, use last year's scripts to convert formats, then load the tracks. Later we will work on polishing details, e.g.:
** Use appropriate IDs for naming genes
** mRNA items in the other-items track are redundant and should be omitted
** items covering an entire chromosome (type=region) should also be omitted
** protein coding genes could be displayed with codon highlighting; use the following settings in trackDb.ra:
<pre>
baseColorUseCds given
baseColorDefault genomicCodons
</pre>

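The two omissions listed in part (B), mRNA items and whole-chromosome region items, amount to a simple filter on the third column of the GFF file. The following Python sketch illustrates the idea. The function name and the sample lines are hypothetical, last year's conversion scripts may organize this step differently, and the sketch does not perform the split into the gene track and the other-items track.

```python
def other_items(gff_lines, omit=("mRNA", "region")):
    """Yield GFF feature lines whose type (column 3) is not in omit."""
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF feature line
        if fields[2] in omit:
            continue
        yield line


# Hypothetical sample lines for illustration only.
example = [
    "##gff-version 3",
    "chr1\tsrc\tregion\t1\t5000\t.\t+\t.\tID=chr1",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=rna1",
    "chr1\tsrc\ttRNA\t1200\t1272\t.\t-\t.\tID=rna2",
]
for kept in other_items(example):
    print(kept)
```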
===(C) RepeatMasker (slow, needs A)===
* Run the RepeatMasker program to find repeated sequences (use fungi as the species)
* https://github.com/fmfi-genomika/genomika-2017/wiki/RepeatMasker
* RepeatMasker and the repeat library are installed on the compbio and genomika servers. Alternatively, request a copy for your own computer, but registration may take several days

===(D) tRNAscan-SE (medium, needs A)===
* Run software for finding tRNA genes (for comparison with the annotation)
* Download the software from http://lowelab.ucsc.edu/tRNAscan-SE/ (already installed on compbio servers as the tRNAscan-SE command)
* Convert the output with the script rna/tRNAscan-SEtoBED.py on github
* trackDb.ra record:
<pre>
track tRNAs
shortLabel tRNA Genes
longLabel Transfer RNA Genes Identified with tRNAscan-SE
group genes
visibility hide
color 0,20,150
type bed 12
nextItemButton on
priority 10
</pre>

===(E) Augustus (slow, needs A)===
* Run the gene finder Augustus and create a track with the predicted genes (for comparison with the annotation)
* Download and install the software from http://bioinf.uni-greifswald.de/augustus/
** Already installed on compbio servers
* Example of a command line: <tt>augustus --uniqueGeneId=true --species=ustilago_maydis genome.fa > augustus.gtf</tt>
* ustilago_maydis is a related fungal species used for training parameters
* The result needs to be converted from gtf to genePred with gtfToGenePred (on the genomika server) using the option -genePredExt
* If you name your track augustus, the genome browser will recognize it automatically, so there is no need to modify trackDb.ra

===(F) Self-alignment (medium/slow, needs A)===
* The goal here is to find regions in the genome which have multiple approximate copies
* First we need to create local alignments of the genome to itself, then convert them to an appropriate format
* Last year this was done with blastz: https://github.com/fmfi-genomika/genomika-2017/wiki/Self-alignments---self-chain---segmental-duplications
* This year I suggest using the program last (Ubuntu package last-align, website http://last.cbrc.jp/)
* Here is an example of how I have done it in the past; it would be great to rewrite the one-liners as a nicer script
** ideally also patch the bug in pslToChain and submit the patch to the UCSC genome github
<pre>
lastdb genome.fa genome.fa
lastal genome.fa genome.fa -E 1e-20 > self.maf #slow part
maf-convert psl self.maf > tmpC.psl

# filter out trivial self-alignments as well as alignments shorter than 100bp in one of the two sequences or with identity less than 0.9
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<100 || $F[16]-$F[15]<100 || $F[0]<0.9*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC100_90.psl
pslToChain tmpC100_90.psl tmpC100_90.chain # kent tools binary, available on genomika
# fix bad coordinates on reverse strand
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC100_90.chain > self100_90.chain

# another chain for alignments with at least 70% identity and length at least 300bp
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<300 || $F[16]-$F[15]<300 || $F[0]<0.7*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC300_70.psl
/projects2/dipMag/magCap-2017/assembly/magCapA/seq-tracks/pslToChain tmpC300_70.psl tmpC300_70.chain # kent tools binary copied from genome-dev
# fix bad coordinates on reverse strand
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC300_70.chain > self300_70.chain
</pre>
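As a starting point for a nicer script, the two kinds of Perl one-liners in part (F) can be rewritten as Python functions. This is a sketch that follows the logic of the one-liners as written on this page (columns are indexed from 0, as in <tt>@F</tt>). The function names and the sample lines are made up for illustration, and the result has not been checked against real lastal output.

```python
def keep_psl(fields, min_len=100, min_identity=0.9):
    """Return True if a PSL alignment passes the self-alignment filter."""
    if len(fields) != 21:
        raise ValueError("expected 21 PSL columns")
    matches, mismatches, rep_matches, n_count = (int(x) for x in fields[0:4])
    q_base_insert, t_base_insert = int(fields[5]), int(fields[7])
    q_name, q_size = fields[9], int(fields[10])
    q_start, q_end = int(fields[11]), int(fields[12])
    t_name = fields[13]
    t_start, t_end = int(fields[15]), int(fields[16])
    # trivial alignment of a whole sequence to itself
    trivial = q_name == t_name and q_start == 0 and q_end == q_size
    too_short = q_end - q_start < min_len or t_end - t_start < min_len
    aligned = (matches + mismatches + rep_matches + n_count
               + q_base_insert + t_base_insert)
    low_identity = matches < min_identity * aligned
    return not (trivial or too_short or low_identity)


def fix_chain_line(line):
    """Recompute query start/end in chain header lines on the reverse strand."""
    fields = line.split()
    if fields and fields[0] == "chain" and fields[9] == "-":
        q_size, q_start, q_end = int(fields[8]), int(fields[10]), int(fields[11])
        fields[10], fields[11] = str(q_size - q_end), str(q_size - q_start)
        return " ".join(fields)
    return line


# Hypothetical sample lines for illustration only.
psl = "95 5 0 0 0 0 0 0 + c1 1000 0 100 c2 2000 500 600 1 100, 0, 500,".split()
print(keep_psl(psl))
print(fix_chain_line("chain 1000 c2 2000 + 500 600 c1 1000 - 100 300 1"))
```

Calling <tt>keep_psl(fields, min_len=300, min_identity=0.7)</tt> corresponds to the second filter, so a single script could produce both chain files in one pass over the psl file.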

Parts of trackDb.ra (replace magCapA5 with your genome name):
<pre>
track selfChain100_90
shortLabel Self aln >90%id
longLabel Self alignments with length >100bp, identity >90%
group varRep
type chain magCapA5

track selfChain300_70
shortLabel Self aln >70%id
longLabel Self alignments with length >300bp, identity >70%
group varRep
type chain magCapA5
</pre>

===(G) Chains between genomes (medium, needs A from both groups)===
* The goal is to create chains from malGlo to malSym and vice versa
** Each group creates chains from its browser to the other browser
* This is done similarly to the self-alignment chains in part F, but the alignments are computed between two different genomes and the filtering is different
<pre>
lastdb genome.fa genome.fa
lastal genome.fa genome2.fa -E 1e-20 > firstSecond.maf
maf-convert psl firstSecond.maf > tmp.psl

# keep only alignments of length at least 100 in both sequences
perl -lane 'die unless @F==21; $s = $F[12]-$F[11]<100 || $F[16]-$F[15]<100; print unless $s' tmp.psl > tmp100.psl
pslToChain tmp100.psl tmp100.chain # kent tools binary on genomika
# fix bad coordinates on reverse strand
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmp100.chain > firstSecond.chain
</pre>
* The trackDb.ra record is similar, but include the target species in the line with <tt>type chain</tt>

===(H) Protein-based chains between genomes (medium, needs A,B from both groups)===
* In more distant species, the DNA-based chains from part G are not sufficiently sensitive, but it is easier to find similarity between proteins
* For this type of track, you extract protein sequences based on the genome sequence and gene annotation, compare the protein sets of the two species, and map the protein alignments back to the genome
* Last year's commands create a psl file and load it, but such alignments cannot be used to move between genomes. It would be better to convert the psl to chain as in parts F and G.
* https://github.com/fmfi-genomika/genomika-2017/wiki/Chains-from-protein-alignments

===(I) Genomes for comparative genomics (fast, only one group)===
* Download genomes of additional Malassezia species (other than malGlo and malSym)
* Use the list here [https://www.ncbi.nlm.nih.gov/genome/?term=txid55193%5BOrganism%3Aexp%5D]; download M. pachydermatis, M. nana, M. equina, M. caprae, M. dermatis, M. restricta
** Download one representative assembly per species (some species have multiple strains/assemblies)
* Rename chromosomes as in A and name the fasta files in a systematic way (malPac1.fa etc.)
* Store the files in a directory on the genomika server
* Do not forget to note down in your documentation the URL of each downloaded fasta file.

===(J) Multiple whole-genome alignment (slow, needs A from both groups, I)===
* The goal of this track is to create a whole-genome multiple alignment of several genomes
* Use the genomes from part I as well as the malGlo and malSym genomes from the browser
* Beware that malSym1 and malGlo1 must be named correctly, both the genome as a whole and its chromosomes, exactly as in the browser
* The task requires some preprocessing such as renaming (fast), the alignment computation (slow, we recommend running it on compbio servers) and postprocessing (fast/medium)
** Preprocessing and possibly also part of running the alignment can be shared between groups, so collaborate
* The notes from last year consist of three parts: general introduction, Brona's notes (Example of use of tba in a different project), and student notes (Example of use of tba in a our project, display alignments).
** Probably follow the student notes.
** The notes are not finished (they end with "track does not work"), but the track was finished; see track "S. Align (L)" in the sacCer3 browser. See the final version of the sacCer3 trackDb.ra on the genomika server.
* To run the alignment, you need a phylogenetic tree of these species. Use the tree from the paper by Wu et al 2015 [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4634964/figure/pgen.1005614.g003/]; our species are in group B. Write the tree in the parenthesis notation
* https://github.com/fmfi-genomika/genomika-2017/wiki/Alignments

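Parenthesis notation, as needed for the tree in part (J), encodes each internal node of the tree as a parenthesized list of its children. A small Python sketch of this encoding follows. The species names spA to spD and the topology are placeholders only, not the tree from the paper, and the separator between siblings is a parameter because tools differ in what they accept; check the documentation of the alignment program you use.

```python
def to_parens(tree, sep=","):
    """Convert a nested tuple of species names to parenthesis notation."""
    if isinstance(tree, str):
        return tree
    return "(" + sep.join(to_parens(child, sep) for child in tree) + ")"


# Placeholder topology for illustration; replace with the published tree.
placeholder_tree = (("spA", "spB"), ("spC", "spD"))
print(to_parens(placeholder_tree))
print(to_parens(placeholder_tree, sep=" "))
```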
===(K) Conservation by phyloP (medium, needs A,I,J)===
* Based on the multiple alignment from part J, find which positions are conserved in evolution (the result is a numerical level of conservation per position, in wiggle format)
* See the tracks Align. Cons. (L) and Multiz. Cons. (L) in the sacCer3 browser (here we want only one track)
* Use the same tree as in J
* https://github.com/fmfi-genomika/genomika-2017/wiki/PhyloP-tracks

===(L) Conserved elements by phastCons (medium, needs A,I,J)===
* Similar to track K, but uses a different program from the phast package. PhastCons is based on an HMM and finds contiguous conserved regions. The result is a list of conserved regions (bed format) as well as the posterior probability of a conserved region at each position (wig format)
* On sacCer3, the wig format tracks are e.g. Cons. new (L) and Cons. old (L). The bed format track is PhastCons Most, but it was taken from the original UCSC database, so no commands for it are available; hopefully it should be easy to create and load.
* https://github.com/fmfi-genomika/genomika-2017/wiki/Conservation

===(M) Protein domain and other protein annotation from Uniprot (fast/medium, needs A,B)===
* The uniprot database (http://www.uniprot.org/) contains information about proteins. The goal is to download information about malGlo and malSym proteins, parse out info about particular regions and map these to the corresponding regions of the genome.
* See sacCer3 tracks Pfam (L), uniProtAnnot (L), uniProtStruct (L)
* Download protein info in XML format: malGlo [http://www.uniprot.org/proteomes/UP000008837], malSym [http://www.uniprot.org/proteomes/UP000186303]
* Last year's protocol links uniprot proteins to genes from the browser annotation via sequence similarity search (blat). Possibly this could also be done by cross-linking information from the databases, but blat is fine.
* https://github.com/fmfi-genomika/genomika-2017/wiki/Uniprot-data
* Last year, the Pfam track was created by running the Interproscan tool locally [https://github.com/fmfi-genomika/genomika-2017/wiki/Get-PFAM-data]. However, this is time-consuming and uniprot contains pre-computed info about Pfam domains. Therefore it would be better to modify the scripts so that they parse Pfam out of the uniprot XML files together with other info.

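Parsing protein regions out of the XML files in part (M) might look roughly like the Python sketch below. It assumes an XML layout with entry, accession and feature elements whose location carries begin and end positions; the embedded sample is hand-written and hypothetical, so verify element and attribute names against a real downloaded file. The function lists features with coordinates converted from 1-based protein positions to 0-based half-open offsets within the coding sequence (three nucleotides per amino acid). Mapping these offsets to genome coordinates still requires the exon structure of the gene and is not shown.

```python
import xml.etree.ElementTree as ET

# Hand-written, hypothetical sample; real files contain many more elements.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <feature type="domain" description="Example domain">
      <location><begin position="5"/><end position="40"/></location>
    </feature>
    <feature type="active site">
      <location><position position="17"/></location>
    </feature>
  </entry>
</uniprot>"""


def local(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def ranged_features(xml_text):
    """Return (accession, type, description, cds_start, cds_end) tuples."""
    result = []
    for entry in ET.fromstring(xml_text).iter():
        if local(entry.tag) != "entry":
            continue
        accession = next((c.text for c in entry if local(c.tag) == "accession"), None)
        for feat in entry:
            if local(feat.tag) != "feature":
                continue
            pos = {local(e.tag): e.get("position") for e in feat.iter()
                   if local(e.tag) in ("begin", "end")}
            if pos.get("begin") is None or pos.get("end") is None:
                continue  # single-position or uncertain features are skipped here
            begin, end = int(pos["begin"]), int(pos["end"])
            # amino acids begin..end (1-based) -> CDS offsets [3*(begin-1), 3*end)
            result.append((accession, feat.get("type"), feat.get("description"),
                           3 * (begin - 1), 3 * end))
    return result


for row in ranged_features(SAMPLE):
    print(row)
```

Ignoring the namespace through <tt>local</tt> keeps the sketch independent of the exact namespace URI used in the downloaded files.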
===(N) Expression data from RNA-seq (medium/slow, needs A)===
* The goal is to display the results of measuring expression (the amount of mRNA) by RNA-seq
* Workflow:
** The original data are reads in fastq format. Some preprocessing can be done (quality trimming etc.)
** Reads are aligned to the genome to produce a sam/bam file. This is SLOW. The file is then sorted and indexed.
** Bam files can be used in the browser, but they are big. We will report only the number of reads at each position in wig (wiggle) format.
** Wig files can be loaded to the database, but it is perhaps more efficient to convert them to binary bigwig files. The database then contains only a reference to the bigwig file.
* Data:
** malGlo [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA286710] - only reads provided. Out of the 27 experiments choose only 1-2 and align them to the genome, e.g. this one: [https://www.ncbi.nlm.nih.gov/sra/SRX1074608]
** malSym [https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4589/] - bam files provided
* For malGlo, the reads need to be aligned to the genome.
** The currently recommended aligner is STAR https://github.com/alexdobin/STAR
** It seems that STAR can directly create wig files; read the manual for recommended settings (e.g. the section on small genomes)
** To convert wig to bigwig, use wigToBigWig on genomika
** To load the bigwig file, see the commands below
* malSym already has bam files for several experiments
** These need to be converted to wig / bigwig
** First use the [https://github.com/arq5x/bedtools2 bedtools suite] to create a bedgraph (see commands below), then convert it to bigwig using bedGraphToBigWig (installed on genomika)
** To load the bigwig file, see the commands below
** Multiple experiments are better combined into a single composite track with individual subtracks
** Subtracks are loaded to the db normally; the composite track is defined only in the trackDb file, see below
* Useful commands (modify for your situation):
<pre>
# bam to bedgraph
faSize -detailed genome.fa > genome.sizes
bedtools genomecov -ibam reads.bam -g genome.sizes -bga -split > reads.bedgraph

# to create track, place bigwig file to appropriate place in /gbdb
# then create table with reference to this file:
hgsql malXyz1 -e "CREATE TABLE table_name (fileName varchar(255) not null);"
hgsql malXyz1 -e "insert into table_name values ('/gbdb/malXyz1/filename.bw');"

# in trackDb.ra include something like this: (change 500 to appropriate value at which read depth is clipped)
track table_name
shortLabel RNA-seq coverage
longLabel RNA-seq coverage
visibility dense
group rna
type bigWig 0 500

# composite track from multiple experiments:
track track_name
compositeTrack on
type bigWig 0 200
shortLabel RNA-seq coverage
longLabel RNA-seq coverage
group rna
visibility dense

track subtrack_name
shortLabel subtrack_label
longLabel subtrack_label
parent track_name
type bigWig 0 250
visibility full
maxHeightPixels 80:16:8
</pre>
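To pick the upper limit in the <tt>type bigWig</tt> line for an RNA-seq coverage track, it can help to look at the distribution of read depth in the bedgraph file. The Python sketch below computes a length-weighted quantile of depth. The function name, the choice of quantile and the sample lines are assumptions for illustration, not a recommended setting.

```python
def depth_quantile(bedgraph_lines, q=0.99):
    """Smallest depth d such that at least a fraction q of positions has depth <= d."""
    bases_at_depth = {}
    for line in bedgraph_lines:
        fields = line.split()
        if len(fields) < 4 or fields[0] in ("track", "browser") or line.startswith("#"):
            continue  # skip header and comment lines
        start, end, depth = int(fields[1]), int(fields[2]), float(fields[3])
        bases_at_depth[depth] = bases_at_depth.get(depth, 0) + (end - start)
    total = sum(bases_at_depth.values())
    covered = 0
    for depth in sorted(bases_at_depth):
        covered += bases_at_depth[depth]
        if covered >= q * total:
            return depth
    return None  # no data lines


# Hypothetical sample: 90 positions at depth 1, 9 at depth 10, 1 at depth 1000.
example = ["chr1\t0\t90\t1", "chr1\t90\t99\t10", "chr1\t99\t100\t1000"]
print(depth_quantile(example, q=0.95))
```

A high quantile keeps a few extreme peaks from flattening the rest of the track; the right value depends on the data and on what you want to show.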
* Last year's notes: https://github.com/fmfi-genomika/genomika-2017/wiki/Expression-tracks
** However, the steps there are mostly not recommended this year
** For last year's tracks, see RNA-seq WT1 (L) in the yarLip browser

===(O) Differences between strains (slow, needs A)===
* The goal is to compare multiple strains of the same species and display the differences between them in the browser
* The usual way is to align sequencing reads from one strain to the reference strain, identify differences and display them in vcf format
* Read files are large, therefore we directly compare assembled genomes and create the vcf file using the c-sibelia tool
* You can mostly follow last year's notes except for the final steps. Instead of placing the vcf.gz and vcf.gz.tbi files on a different server, place them on genomika in /gbdb/malXyz1/subdir, then insert them into the database using the commands below
* As in part N, you can group several strains into a single composite track; see the parts of trackDb.ra in the commands below
* https://github.com/fmfi-genomika/genomika-2017/wiki/Strain-comparison
* Last year's tracks are currently broken, but you can at least check their settings, e.g. CLIB89 variants (L) in the yarLip browser
* Download other strains:
** malGlo [https://www.ncbi.nlm.nih.gov/genome/genomes/701] use strains CBS 7966, CBS 7874
** malSym [https://www.ncbi.nlm.nih.gov/genome/genomes/16894] use all strains except ATCC 42132
* Useful commands (modify for your situation):
<pre>
# to create track, place vcf.gz and vcf.gz.tbi files to appropriate place in /gbdb
# then create table with reference to the vcf.gz file:
hgsql malXyz1 -e "CREATE TABLE table_name (fileName varchar(255) not null);"
hgsql malXyz1 -e "insert into table_name values ('/gbdb/malXyz1/subdir/filename.vcf.gz');"

# in trackDb.ra include something like this:
# composite track:
track track_name
compositeTrack on
type vcfTabix
shortLabel ...
longLabel ...
group varRep
visibility hide

# subtrack:
track subtrack_name
shortLabel ...
longLabel ...
parent track_name
visibility pack
</pre>
 
=Genomika: Development projects=
Information on the course [[Genomika]]

This page contains information on the subprojects for the final weeks of the semester.

==MalGlo group==
===User trackDb, code management===
* Think about how to better manage changes to the browser code in future instances of the course
* Explore the possibility of each user having their own trackDb
* Start by reading the short info in /kentsrc/trackDb/makefile on the genomika server
<pre>
# Browser supports multiple trackDb's so that individual developers
# can change things rapidly without stepping on other people's toes.
...
</pre>
* Write a manual describing how to make your suggested changes, and test it

===Rfam===
* Rfam http://rfam.xfam.org/ is a database of families of non-coding RNAs
* It contains a covariance model for each family
* The database can be downloaded and searched against a genome using the Infernal tool http://eddylab.org/infernal/
* Do this search, then convert the output to an appropriate format and display it in the browser
* Possibly use the BEDdetail format https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7
* Clicking on an Rfam match should display additional information about the match and a link to the Rfam database. You can achieve this with the following lines in trackDb.ra:
<pre>
type bedDetail 14
url http://rfam.xfam.org/family/$$
urlLabel Rfam:
</pre>
Example of the BEDdetail format for an Rfam match (items should be tab-separated; the last column starts at "truncated:")
<pre>
chrom chromStart chromEnd name score strand thickStart thickEnd reserved blockCount blockSizes chromStarts id description
contigA 75109 75380 Fungi_SRP-1 1002 - 75109 75109 0 1 271 0 RF01502 truncated: no, E-value: 3.5e-19
</pre>
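A line in the BEDdetail format shown in the Rfam example can be assembled with a few lines of Python. The sketch below assumes that the match coordinates are given as 1-based inclusive positions, with the start greater than the end on the reverse strand (verify this convention against your Infernal output); the function name is hypothetical. Called with the values from the Rfam example, it produces the same fields joined by tabs.

```python
def bed_detail_line(chrom, seq_from, seq_to, name, score, rfam_id, description):
    """Format one match as a 14-column tab-separated bedDetail line."""
    start = min(seq_from, seq_to) - 1   # BED start is 0-based
    end = max(seq_from, seq_to)         # BED end is exclusive
    strand = "+" if seq_from <= seq_to else "-"
    fields = [chrom, start, end, name, score, strand,
              start, start,             # thickStart, thickEnd (no thick part)
              0,                        # reserved
              1, end - start, 0,        # a single block covering the whole match
              rfam_id, description]
    return "\t".join(str(f) for f in fields)


print(bed_detail_line("contigA", 75380, 75110, "Fungi_SRP-1", 1002,
                      "RF01502", "truncated: no, E-value: 3.5e-19"))
```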
* Further things which you might want to explore:
** Remove matches that correspond to tRNAscan-SE matches (try the tool overlapSelect)
** From several overlapping matches keep only the strongest (try the tool overlapSelect)
** More ambitious: explore creating an image of each RNA structure and linking it to the info page for the match (as in the non-coding RNA track in the human genome browser; see for example http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1%3A16520585%2D16520658, display the non-coding RNA track and click on the tRNA match)

===Information for users===
* Each track should provide basic information for users in the HTML document displayed after clicking on the track name or the left bar of the browser image.
* The information should summarize what is displayed, what the source of the data was, what program was used to produce the results etc.
** keep it less technical, with a link to your github wiki page for the track for potential developers replicating your work
* See examples for tracks on the http://genome-euro.ucsc.edu/ browser
* Also, the genome as a whole should have a description page. On the title page of http://genome-euro.ucsc.edu/ you see details of the selected assembly, e.g. for the guinea pig genome you see the text
<pre>
Guinea pig Genome Browser - cavPor3 assembly
The Feb. 2008 Cavia porcellus draft assembly (Broad Institute cavPor3) was produced by the Broad Institute at MIT and Harvard.
...
</pre>
* You should create some explanatory text for your species and genome and make it display on the title page
** This already works for Yarrowia lipolytica on the genomika server, so you can try to find out how it was done

==MalSym group==

Information for the course [[Genomika]]

===Gene info pages===
* If you click on a gene or other displayed item in a well set-up genome browser, you get a page with more information about this item
* These info pages do not work satisfactorily on our genomika browser
* Look at all protein coding gene tracks in four browsers:
** sacCer3 in original UCSC genome browser [http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=sacCer3], tracks NCBI RefSeq, SGD Genes, Ensembl Genes
** sacCer3 in our genomika genome browser [http://genomika.compbio.fmph.uniba.sk/cgi-bin/hgTracks?db=sacCer3], tracks NCBI RefSeq, SGD Genes, Ens. Genes, NCBI RefSeq (L), SGD Genes (L), Ens. Genes (L)
** yarLip1 in our genomika genome browser [http://genomika.compbio.fmph.uniba.sk/cgi-bin/hgTracks?db=yarLip1], tracks Ens. Genes (L), RefSeq Genes (L)
** malSym1 in our genomika genome browser [http://genomika.compbio.fmph.uniba.sk/cgi-bin/hgTracks?db=malSym1], track Ensemble Genes (should be renamed Genes from NCBI)
* For each explored track, find out what gets displayed on the gene info page, whether there are any error messages, and whether the page contains a link to the source database (e.g. Ensembl, RefSeq, NCBI, SGD)
* Explore how the differences in these info pages are encoded in the database and trackDb.ra
* Suggest and implement improvements in these info pages on our browser in sacCer, yarLip, malSym and, after warning the other group, also in malGlo
* The most comprehensive gene info pages use additional db tables downloaded from the UniProt database. This database is too large to be completely mirrored on our server. Can you suggest and implement a method for downloading only the parts of the database for our species and loading them into the tables? (You were downloading UniProt data for one species, its "proteome", in task M; possibly it can be used here.)

Note:
* To explore how things work at UCSC, you can see setup notes in their github [https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/doc], particularly the uniProt section and sacCer3.txt
* You can also check their original trackDb.ra files [https://github.com/ucscGenomeBrowser/kent/tree/master/src/hg/makeDb/trackDb/sacCer] - see also parent directory and subdirectories
* You can even explore the UCSC mysql database through their mysql server [http://genome.ucsc.edu/goldenPath/help/mysql.html]

===Blat and name search===
Blat:
* In the blue menu bar on top of the genome browser screen find Tools->Blat. This is a fast alignment tool which finds sequences highly similar to your query.
* In the genomika browser it seems to work for sacCer3 but not for the other three genomes. Make it work for all four and document your changes.

Name search:
* The browser screen also contains a text input field, where you can enter particular coordinates but also other keywords, such as a gene name etc.
** Try searching for gene YDR157W in sacCer3
** Try searching for gene CAG83524 in yarLip1 - the gene is there but is not found, instead we get an error message
** Make the search work for gene identifiers in all 4 genomes (sacCer, yarLip, malGlo, malSym)
* Possibly also allow searching for other entities (keywords from gene descriptions, tRNA anti-codons, domains from Uniprot annotation track etc)
** For example searching for keyword "ribosomal" in UCSC sacCer genome browser returns a list of genes with ribosomal in their description - try: [http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=sacCer3]
* Get rid of the misleading error message when a search is unsuccessful (see what error you get in the UCSC browser)

See the note in the previous task for information sources on how things are set up at UCSC

===Information for users===
* Each track should provide basic information for users in the HTML document displayed after clicking on the track name or the left bar of the browser image.
* The information should summarize what is displayed, what the source of the data was, what program was used to produce the results, etc.
** keep it less technical, with a link to your github wiki page for the track for potential developers replicating your work
* See examples for tracks on the http://genome-euro.ucsc.edu/ browser
* Also, the genome as a whole should have a description page. On the title page of http://genome-euro.ucsc.edu/ you see details of the selected assembly, e.g. for the guinea pig genome you see the text
<pre>
Guinea pig Genome Browser - cavPor3 assembly
The Feb. 2008 Cavia porcellus draft assembly (Broad Institute cavPor3) was produced by the Broad Institute at MIT and Harvard.
...
</pre>
* You should create some explanatory text for your species and genome and make it display on the title page
** This already works for Yarrowia lipolytica on the genomika server, so you can try to find out how it was done

Latest revision as of 14:03, 20 February 2019

Genomika

Page for the course 2-INF-269/15 Genomika (Genomics), academic year 2017/18

Content prerequisites

  • Metódy v bioinformatike (Methods in Bioinformatics) and Integrácia dátových zdrojov (Data Source Integration)
  • If you are experienced with the Linux command line, Data Source Integration can also be taken concurrently with Genomika

Course goals

Basic goals:

  • Expose you to interdisciplinary communication and collaboration.
  • Build the ability to quickly acquire the essential knowledge of an unfamiliar field, so that you can communicate effectively with clients and colleagues who are not computer scientists.
  • Develop teamwork and work organization skills.
  • Try out a project where you board a "moving train" (working with existing software that needs your own extensions).

Knowledge content for everyone:

  • Get acquainted with modern technologies that are a major source of the "big data" phenomenon and form the basis of modern medical research.

For those seriously interested in bioinformatics:

  • Try working with real biological data.
  • Come into contact with experts from the natural sciences.

Grading

  • Written exam: 50% (shared by biologists and computer scientists)
  • Work of the group as a whole: 25%
  • (Demonstrable) individual contribution to the success of the project: 25%
  • Grades A: 90+, B: 80+, C: 70+, D: 60+, E: 50+

Notes on grading of the exercises:

  • An especially small or large share of the group's work may lead to an individual change of weights (in extreme cases the individual evaluation may make up as much as 50% of the whole grade)
  • For each phase of the group project (i.e. after each meeting) you will be assigned black and/or red points
    • Red points are for completed tasks; their number reflects the quality, quantity and difficulty of the work
    • Black points are for tasks that were assigned to you but that you did not complete, especially if the further progress of other group members depends on them.
    • Black points may also be given for behavior that disrupts the team's progress (unexcused absence from a meeting, disrupting the operation of the shared server, etc.)
    • The individual evaluation is a non-decreasing function of the number of red points and a non-increasing function of the number of black points.

Lectures

What should you take away from a lecture?

  • Understand the essential ideas of the presentation / text (what technology is being discussed, what type of data is involved, how the data are obtained, what the operating principle is).
  • It is not essential (or even possible) to master the terminology 100%
    • use the knowledge gained in MBI! (it is good to review the relevant part before the lecture)
    • accept the fact that you will not understand every word
    • it is OK to get lost in details for a few minutes (but it is not OK to be lost for 70% of the lecture)
    • keep asking sensible questions aimed at clarifying the essential points
    • (training in interdisciplinary communication goes both ways ;))
    • Don't panic! The only thing that is not in the book is Tomáš's lecture.
  • Training in the ability to distinguish the essential from the non-essential (very important for the future)
  • In case of major problems we can arrange consultations on specific questions

Exercises

  • Instructors: Broňa Brejová and Tomáš Vinař
  • Building a genome browser based on the UCSC genome browser software for selected genomes.
  • If the results are good, there is a real chance of use in the international community!
  • Two groups (with different goals), meetings roughly every two weeks in the scheduled time slot.

Is this a realistic model of something I may encounter in practice?

  • In most companies you join a project that is already under way.
  • It is not unusual for a group of people to leave, leaving behind inconsistent documentation and unfinished work that you have to continue.
  • A not very capable manager.
  • Companies with a stable product use established technologies (from your point of view, legacy approaches with elements of outdated programming languages); constantly refactoring to new platforms is not feasible financially or in terms of time
  • In this project: the main part of the software is in C/C++ and Perl; the database is MySQL; the core of the supporting software was developed around the turn of the millennium
  • The focus of the project is finding, processing and understanding data
  • Software development is a supporting element with emphasis on achieving a specific goal; reproducibility is key, reusability in other contexts is welcome

Typical course of an exercise session

  • Short presentations by team members on progress / goals achieved (including information that could be useful to colleagues in their work)
  • Discussion of current problems, brainstorming on how to solve them
  • New goals, division of work
  • You start working on the new goals; the instructors help with technical problems / answer questions. You should leave the session with an idea of what you are going to do and how long it will take.
  • After the session you continue individually until the next meeting (communication within the group is of course welcome).


Malassezia globosa and Malassezia sympodialis

  • We will use the abbreviations malGlo and malSym
  • They are microorganisms that belong to the fungi.
  • They commonly live on human skin and feed on sebum.
  • They can cause problems such as dandruff, eczema, and infections.
  • Images: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4069738/figure/F1/
  • Saunders CW, Scheynius A, Heitman J. Malassezia fungi are specialized to live on skin and associated with dandruff, eczema, and other skin diseases. PLoS pathogens. 2012 Jun 21;8(6):e1002701. [1]


Malassezia globosa

  • genome published by Procter and Gamble, the maker of the Head and Shoulders shampoo, which contains antifungal agents
  • Xu J, Saunders CW, Hu P, Grant RA, Boekhout T, Kuramae EE, Kronstad JW, DeAngelis YM, Reeder NL, Johnstone KR, Leland M. Dandruff-associated Malassezia genomes reveal convergent and divergent virulence traits shared with plant and human fungal pathogens. Proceedings of the National Academy of Sciences. 2007 Nov 20;104(47):18730-5. [2]
  • Wu G, Zhao H, Li C, Rajapakse MP, Wong WC, Xu J, Saunders CW, Reeder NL, Reilman RA, Scheynius A, Sun S. Genus-wide comparative genomics of Malassezia delineates its phylogeny, physiology, and niche adaptation on human skin. PLoS genetics. 2015 Nov 5;11(11):e1005614. [3]
  • Genome [4], proteins [5], RNA-seq [6]
  • Team: Becza, Hraška, Jariabka, Krajčovič, Smolík, Šuppa, Zeleňák

Malassezia sympodialis

  • Gioti A, Nystedt B, Li W, Xu J, Andersson A, Averette AF, Münch K, Wang X, Kappauf C, Kingsbury JM, Kraak B. Genomic insights into the atopic eczema-associated skin commensal yeast Malassezia sympodialis. MBio. 2013 Mar 1;4(1):e00572-12. [7]
  • Zhu Y, Engström PG, Tellgren-Roth C, Baudo CD, Kennell JC, Sun S, Billmyre RB, Schröder MS, Andersson A, Holm T, Sigurgeirsson B. Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis. Nucleic acids research. 2017 Jan 18;45(5):2629-43. [8]
  • Genome [9], proteins [10], RNA-seq [11]
  • Team: Ižip, Mayer, Metohajrová, Novák, Rabatin, D. Simeunovič, R. Simeunovič

Other related genomes

Tasks for you

  • Try the exercise on working with the UCSC browser
  • By Monday, February 26: send B. Brejová an email with your name and the Gmail and GitHub accounts you want to use in the course, and accept the invitation to become a member of the GitHub project
  • In your groups, think about which means of coordination you want to use; suggestions are below
  • For the next two lectures it is advisable to review from MBI the introduction to biology for computer scientists (exercise session) and the lecture on genome sequencing and assembly
  • March 1: malGlo meeting, March 8: malSym meeting

Coordination within the group and with the instructors

Each group should set up a way of organizing its work and results

  • There should be publicly available and well-organized documentation of everything you did
    • Where you downloaded the data, how you processed them (ideally the sequence of all relevant commands), notes on problematic steps
    • Ideally in English, but brief notes are sufficient
  • Likewise, there should be a publicly accessible archive of the source code of all programs you wrote for the course

A GitHub project exists from last year: https://github.com/bbrejova/genomika-2017

  • It contains scripts and documentation in the form of a wiki
  • We recommend using it unless you have a better idea of how to organize the work
  • Do not delete last year's parts; you may, however, move them, e.g. into a separate folder

Group diary

  • Each group has a Google document in which the agreed tasks and their assignees are written down at a meeting, and at the next meeting the current status of the tasks and the points awarded
  • You can also write further notes there about the current state of work and the problems you have encountered

Preliminary schedule of exercises

The schedule may still change depending on circumstances

  • April 6: MalGlo (Becza, Hraška, Jariabka, Krajčovič, Smolík, Šuppa, Zeleňák)
  • April 12: MalSym (Ižip, Mayer, Metohajrová, Novák, Rabatin, D. Simeunovič, R. Simeunovič)
  • April 19: MalGlo
  • April 26: MalSym
  • May 3: no session
  • May 10: MalGlo
  • May 17: MalSym

Genomika: UCSC browser exercise

Exercise for the course Genomika

Browser basics, genes

  • Online graphical tool for browsing genomes
  • Configurable, many options, but relatively few organisms
  • In Firefox, go to the UCSC genome browser http://genome-euro.ucsc.edu/ (the European mirror of http://genome.ucsc.edu/ )
  • In the blue menu at the top choose Genomes, then choose the human genome, version hg38. Enter HOXA2 into the search term box. In the search results (Known genes) choose the gene homeobox A2 on chromosome 7.
    • Let us look at this page together
    • At the top there are controls for moving left and right, zooming in and out
    • Below that is a chromosome diagram, with the displayed region marked in red
    • Below that is an image of the selected region with various tracks
    • Below that is a list of all tracks; they can be switched on and off and configured
    • Clicking on the image often shows additional information about the given gene or other data source (the track must be set to full or pack, otherwise the click switches the display level)
    • In genes, exons are drawn thick, UTRs thinner, and introns as horizontal lines
  • How many exons does HOXA2 have? On which chromosome and at what position is it? Note that it is on the opposite strand. How is this indicated in the image?
  • In the GENCODE track click on the gene; you should get to a page describing its various properties. Look through it.

Important tracks

Tracks are divided into several groups

  • Mapping and sequencing: quality of the sequence assembled from reads, basic properties such as GC%
  • Genes and Gene Predictions: known genes from various databases, automatic predictions
  • Phenotype and Literature: genes and other locations in the genome mentioned in the literature or in databases of human diseases, etc.
  • mRNA and EST: sequenced mRNA sequences
  • Expression: data on gene expression in various tissues, e.g. GTEx
  • Regulation: measurements of the regulation of gene activity (transcription factor binding sites, histone modifications)
  • Comparative genomics: comparison of multiple genomes
    • PhyloP: conservation level of a given base, based on a single alignment column only
    • Element Conservation/Conserved Elements: results of the phylo-HMM phastCons, which also takes neighboring columns into account
    • multiz: whole-genome alignments
    • nets and chains: corresponding segments of different genomes
  • Variation: population genomics and polymorphisms (more in older versions of the human genome)
  • Repeats: parts of the genome that are repeated many times, as well as segmental duplications

Genome versions, moving between versions (liftOver)

  • Return to the UCSC genome browser http://genome-euro.ucsc.edu/
  • We will look at several things related to genome sequencing and assembly
  • In the blue menu at the top choose Genomes, section Other
  • On the next page choose human and use the Human Assembly menu to find out when the last two versions of the human genome (hg19 and hg38) were added
  • At the bottom of the same page you will find a brief description of the selected genome version.
  • Turn on the tracks "Assembly" and "Gaps" and look at the region chr2:110,000,000-110,300,000 in hg19: [13] How long is the unsequenced gap in the middle of this region? You can read the approximate size from the image; a more precise value is shown after clicking on the black rectangle corresponding to the gap (the exact length was not known anyway, since the gap was not sequenced).
  • Using the menu item View, In other genomes, see what the displayed segment looks like in version hg38. How has its length changed from the original 300 kb?

BLAT, moving between genomes of different species

  • The sequence given below was obtained by sequencing a human mRNA
  • Go to the UCSC genome browser http://genome.ucsc.edu/ , choose BLAT on the blue bar, enter this sequence and search for it in the human genome. What identity (IDENTITY) does the strongest hit have? How long a segment of the genome does it span (SPAN)? Note that the other hits are much shorter.
  • In the ACTIONS column you can use Details to view the details of the alignment and Browser to view the corresponding genome segment.
  • In this genome segment set the track Vertebrate net to full and, by clicking on the colored line for this track in the image, find out on which chicken chromosome the homologous segment is located.
  • Let us try aligning the same sequence to the chicken genome with BLAT: first click Genomes on the top blue bar, choose Vertebrates and Chicken, and then BLAT on the top bar. Enter the same sequence in the box. What identity and length does the strongest hit have now? Which chromosome is it on?

Human sequence for BLAT

AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
CCGAAAAGCCCCCACAAAAAGCCG

Table browser

Genome browser is nice for manual browsing but also allows programmers to download data

  • each track is based on one or several tables in an SQL database
  • you can download genomic sequences and data from these tables [14]
  • you can also write queries for a public SQL server [15] or create queries using Table browser forms (blue bar: Tools->Table browser)
  • conversely, you can also display your own data in "custom tracks" of the browser
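
As a rough illustration of querying the public SQL server from a script, here is a minimal Python sketch. It assumes that the `mysql` command-line client is installed and that the host `genome-mysql.soe.ucsc.edu` and user `genome` from the UCSC help page [15] are still valid; the helper names and the example query on table `refGene` in database `hg38` are illustrative assumptions, not part of the exercise.

```python
import subprocess

# Assumptions: public host and user as listed on the UCSC MySQL help page.
UCSC_MYSQL_HOST = "genome-mysql.soe.ucsc.edu"
UCSC_MYSQL_USER = "genome"


def ucsc_mysql_command(database, query):
    """Build the argument list for one non-interactive mysql call."""
    return [
        "mysql",
        f"--user={UCSC_MYSQL_USER}",
        f"--host={UCSC_MYSQL_HOST}",
        "-A",          # do not preload table and column names
        "-N",          # do not print the header row
        "-e", query,   # execute this statement and exit
        database,
    ]


def run_ucsc_query(database, query):
    """Run the query and return rows as lists of strings (needs network access)."""
    result = subprocess.run(ucsc_mysql_command(database, query),
                            capture_output=True, text=True, check=True)
    return [row.split("\t") for row in result.stdout.splitlines()]


# Hypothetical example: coordinates of HOXA2 transcripts in hg38.
EXAMPLE_QUERY = ("SELECT name, chrom, strand, txStart, txEnd "
                 "FROM refGene WHERE name2 = 'HOXA2'")
command = ucsc_mysql_command("hg38", EXAMPLE_QUERY)
print(" ".join(command))
```

Calling `run_ucsc_query("hg38", EXAMPLE_QUERY)` would actually contact the server; the sketch only prints the command so that it can be inspected first.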

Table browser examples

  • Basic type of query: e.g. export all genes in the part of the genome displayed in the browser
  • Several output formats, e.g.:
    • sequence: file of protein or DNA sequences of these genes (various settings)
    • GTF: coordinates of genes and their exons
    • Hyperlinks to genome browser: list of genes with links to the browser for each gene
    • Instead of export we can get summary statistics (number of items, how much sequence they cover)
  • More complex query, "intersection" of two tables: e.g. all genes that are more than 50% covered by simple repeats, filtering
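
To make the "more than 50% covered" intersection concrete, the following Python sketch computes what fraction of a gene is covered by repeat intervals and filters genes by that fraction. It is only an approximation of the idea under stated assumptions: all intervals are on the same chromosome and use 0-based half-open coordinates, and the function names and the toy data are hypothetical.

```python
def covered_fraction(gene, repeats):
    """Fraction of the gene interval covered by the union of repeat intervals."""
    g_start, g_end = gene
    # Clip repeats to the gene and drop those that do not overlap it at all.
    clipped = sorted((max(s, g_start), min(e, g_end))
                     for s, e in repeats if s < g_end and e > g_start)
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # A new separate run begins; account for the finished one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or adjacent repeat: extend the current run.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (g_end - g_start)


def genes_mostly_covered(genes, repeats, threshold=0.5):
    """Names of genes whose covered fraction exceeds the threshold."""
    return [name for name, interval in genes.items()
            if covered_fraction(interval, repeats) > threshold]


# Toy data, not taken from any real genome.
toy_genes = {"geneA": (100, 200), "geneB": (300, 400)}
toy_repeats = [(90, 130), (120, 160), (190, 250), (390, 395)]
print(genes_mostly_covered(toy_genes, toy_repeats))
```

Merging overlapping repeats before summing matters here: adding up the lengths of individual overlaps would count doubly covered bases twice.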

Preliminary information on the state exam

This page contains preliminary unofficial information on the master's state exam subject Bioinformatika a strojové učenie (Bioinformatics and Machine Learning) for the academic year 2017/18. Some changes may still occur (mainly in the area of data structures); the final version should appear within a few days on the website of the Department of Computer Science (Katedra informatiky).

Introduction

One of the goals of the state exam is to realize the connections between different courses. The courses within the state exam subject Bioinformatics and Machine Learning are related, but these relationships show up only to a small extent in the syllabi of the individual courses. We have therefore selected articles from the scientific literature that connect topics from several courses and will serve as a springboard for discussion at the state exam. At the exam you will draw one of the articles listed below and three questions related to it. In the first question your goal will always be to summarize the main results of the article and to explain them even to computer scientists who are not experts in the article's field. For this question we expect a roughly 5-minute overview of the article with emphasis on explaining the necessary concepts and basic ideas of the article, not technical details. The second question will come from the topic areas listed below. It may or may not be related to the topic of the article. The third question will ask you to explain in detail some technical aspect of the article (e.g. a part of an algorithm, a more complex definition, a proof of a lemma, details of an experiment and the like). After drawing the question you will receive a printed copy of the article and have at least an hour to prepare, so it is not necessary to know these articles by heart. When preparing for the state exam, we recommend that, besides reviewing the material in the listed topic areas, you also look at the listed articles and the related terminology.

Articles

  • Apostolico A, Bock ME, Lonardi S, Xu X. Efficient detection of unusual words. Journal of Computational Biology. 2000 Feb 1;7(1-2):71-94. [16]
  • Štefankovič D, Vempala S, Vigoda E. A deterministic polynomial-time approximation scheme for counting knapsack solutions. SIAM Journal on Computing. 2012 Apr 19;41(2):356-66. [17]
  • Dowell RD, Eddy SR. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics. 2004 Jun 4;5(1):1. [18]
  • Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754-60. [19]
  • Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated Markov models. Nucleic Acids Research. 1998 Jan 1;26(2):544-8. [20]
  • Wieland SC, Cassa CA, Mandl KD, Berger B. Revealing the spatial distribution of a disease while preserving privacy. Proceedings of the National Academy of Sciences. 2008 Nov 18;105(46):17608-13. [21]
  • Elias I, Lagergren J. Fast neighbor joining. Theoretical Computer Science. 2009 May 17;410(21):1993-2000. [22]
  • Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning 2006 Jun 25 (pp. 369-376). ACM. [23]
  • Bachem O, Lucic M, Hassani H, Krause A. Fast and provably good seedings for k-means. In Advances in Neural Information Processing Systems 2016 (pp. 55-63). [24]
  • Turk M, Pentland A. Eigenfaces for recognition. Journal of cognitive neuroscience. 1991 Jan;3(1):71-86. [25]

Topic areas

Abbreviations of related courses are given in parentheses: AOP: Aproximácia optimalizačných problémov (Approximation of Optimization Problems); G: Genomika (Genomics); IDZ: Integrácia dátových zdrojov (Data Source Integration); MBI: Metódy v bioinformatike (Methods in Bioinformatics); NS: Neurónové siete (Neural Networks); PaŠ: Pravdepodobnosť a štatistika (Probability and Statistics); SU: Strojové učenie (Machine Learning); VPDŠ: Vybrané partie z dátových štruktúr (Selected Topics in Data Structures)

  • Neural networks: multilayer perceptron, backpropagation, deep neural network architectures, Hebbian learning (SU,NS)
  • Modeling sequential data: hidden Markov models, conditional probability and Bayes' theorems, the Viterbi and forward algorithms, examples of use in bioinformatics (gene finding and profile HMMs), recurrent neural networks, the Hopfield model (MBI,PaŠ,NS)
  • Classification models: support vector machines, decision trees, random forests, bagging, boosting (SU)
  • Regression: linear and generalized linear regression, least squares method, statistical model with normally distributed errors, regularization (PaŠ,SU)
  • Machine learning theory: statistical model of machine learning, bias vs. variance, overfitting and underfitting, PAC learning, bounds based on the VC dimension (SU,NS)
  • Unsupervised machine learning: clustering, self-organizing maps, principal component analysis, use in gene expression analysis (SU,NS,MBI)
  • Statistical hypothesis testing: Fisher's exact test, Welch's t-test, Mann-Whitney U test, Bonferroni correction for multiple testing, log likelihood ratio test, examples of the use of tests in bioinformatics (PaŠ,IDZ,MBI)
  • Expected value of a random variable: linearity of expectation, Markov's and Chebyshev's inequalities (PaŠ)
  • Limit theorems of probability theory: central limit theorem, de Moivre-Laplace theorem, weak law of large numbers (PaŠ)
  • DNA sequencing: sequencing technologies and their characteristics (Sanger, Illumina, nanopore sequencing), genome assembly, de Bruijn graphs, RNA-seq (MBI,G)
  • Phylogenetics and comparative genomics: neighbor joining, parsimony, the Jukes-Cantor model and other substitution models, positive and negative selection and their effect on the evolution of biological sequences (MBI, G)
  • Alignments and string algorithms: local and global sequence alignment, BLAST (alignment seeds), perfect hashing, Bloom filter, efficient representation of sequences (suffix trees and arrays, Burrows-Wheeler transform, FM index) (MBI,VPDŠ)
  • Maximum likelihood method: estimation of distribution parameters, unbiased parameter estimates, maximum likelihood for reconstructing phylogenetic trees, Felsenstein's algorithm, EM algorithm, training hidden Markov models, finding sequence motifs (PaŠ, MBI)
  • Linear programming: linear and quadratic programming, simplex method, duality, integer linear programming and its use for solving hard problems in bioinformatics, use of linear programming in approximation algorithms (deterministic rounding, iterated rounding, randomized rounding + derandomization, primal-dual methods), semidefinite programming and max-cut, use of duality in support vector machines (kernel methods) (AOP, SU, MBI)
  • Approximability: complexity classes of approximation algorithms, the PCP theorem and its use, AP-reduction, APX-complete problems, approximability of the traveling salesman problem, polynomial-time approximation schemes and examples of PTAS algorithms (AOP)
  • Applications of formal languages: Knuth-Morris-Pratt algorithm for pattern matching in text, stochastic context-free grammars, covariance models and RNA families, the Nussinov algorithm (MBI, VPDŠ)
  • Models of data structures: amortized complexity and potential function, I/O model and B-trees, cache-oblivious model and static binary tree with van Emde Boas layout, succinct data structures (rank and select) (VPDŠ)
  • Data structures for intervals: range minimum query, lowest common ancestor, segment trees, range trees (VPDŠ)

Example questions

Example questions for the article Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. [26]

Question 1: Summarize the main results of the article and explain why the studied problem is important for modern machine learning (if you do not explain what a neural network is in your answer to this question, we will probably ask you for the definition)

Question 2: Explain what normalized initialization is and use figures 7 and 9 to explain what effect normalized initialization has on the course of training. (a projector will be available on which figures from the article can be shown)

Question 3: Statistical model of machine learning, bias vs. variance, overfitting and underfitting

Genomika: Information on tracks

Information for the course Genomika

This page contains information on the tracks that you will create in the browser (both groups). We will add further information to some tracks in the coming days.

Comments to the task list

  • Task (A) is a prerequisite of all other tasks, the rest are mostly independent of each other.
  • Tasks are marked as fast (no significant computation required), medium (estimated computation up to 1 hour), slow (longer computation, possibly several hours).
    • These times are only estimates, reality may vary. Perhaps provide actual running times (approximate) in your documentation.
    • Fast tasks can be done entirely on genomika server.
    • Students having accounts on compbio research cluster may run medium and slow tasks there.
  • If you get stuck on one task, you can try to do at least initial stages of another one. Coordinate within group!
  • Document your work. Documentation should be independent of this page and of the documentation created last year - copy and modify relevant passages, cite sources.

Basic information on creating tracks

(A) Genome (fast)

hgsql hgcentral -e '
insert into dbDb values (...);

insert into defaultDb values (...);

insert into genomeClade values (...);
'

(B) Protein coding genes and other items from the annotation (fast, needs A)

baseColorUseCds given
baseColorDefault genomicCodons

(C) RepeatMasker (slow, needs A)

(D) tRNAscan-SE (medium, needs A)

  • Run software for finding tRNA genes (for comparison with annotation)
  • Download software from http://lowelab.ucsc.edu/tRNAscan-SE/ (already installed on compbio servers as tRNAscan-SE command)
  • Convert output by script rna/tRNAscan-SEtoBED.py on github
  • trackDb.ra record:
track tRNAs
shortLabel tRNA Genes
longLabel Transfer RNA Genes Identified with tRNAscan-SE
group genes
visibility hide
color 0,20,150
type bed 12
nextItemButton on
priority 10
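
For orientation, the conversion to BED 12 could look roughly like the Python sketch below. This is not the rna/tRNAscan-SEtoBED.py script from GitHub. It assumes the classic tabular tRNAscan-SE output with whitespace-separated columns (sequence name, tRNA number, begin, end, type, anticodon, intron begin, intron end, score), 1-based coordinates, begin greater than end for the minus strand, and 0 0 for tRNAs without an intron. The item naming scheme and the rounding of the score are hypothetical choices.

```python
def trnascan_line_to_bed12(line):
    """Convert one tRNAscan-SE result line to a tab-separated BED 12 line."""
    f = line.split()
    seq, number, aa, anticodon = f[0], f[1], f[4], f[5]
    begin, end, intron_begin, intron_end = (int(x) for x in (f[2], f[3], f[6], f[7]))
    strand = "+" if begin <= end else "-"
    # BED uses 0-based half-open coordinates with start < end on both strands.
    start, stop = min(begin, end) - 1, max(begin, end)
    if intron_begin == 0 and intron_end == 0:
        sizes, starts = [stop - start], [0]
    else:
        i_start = min(intron_begin, intron_end) - 1
        i_stop = max(intron_begin, intron_end)
        # Two blocks (exons) on either side of the intron.
        sizes = [i_start - start, stop - i_stop]
        starts = [0, i_stop - start]
    # Assumption: clamp the rounded score into the BED range 0..1000.
    score = max(0, min(1000, round(float(f[8]))))
    name = f"tRNA-{aa}-{anticodon}-{number}"  # hypothetical naming scheme
    fields = [seq, start, stop, name, score, strand, start, stop, 0,
              len(sizes),
              ",".join(map(str, sizes)) + ",",
              ",".join(map(str, starts)) + ","]
    return "\t".join(str(x) for x in fields)


# Made-up input lines, one without and one with an intron.
plain = trnascan_line_to_bed12("contigA 1 100 172 Ala AGC 0 0 70.1")
spliced = trnascan_line_to_bed12("contigA 2 300 210 Tyr GTA 262 250 65.3")
print(plain)
print(spliced)
```

Header lines of the real output would have to be skipped before calling the function, and the column layout should be checked against the tRNAscan-SE version actually installed.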

(E) Augustus (slow, needs A)

  • Run gene finder Augustus, create track with predicted genes (for comparison with annotation)
  • Download and install software from http://bioinf.uni-greifswald.de/augustus/
    • Already installed on compbio servers
  • Example of command line: augustus --uniqueGeneId=true --species=ustilago_maydis genome.fa > augustus.gtf
  • ustilago_maydis is a related fungal species used for training parameters
  • The result needs to be converted from gtf to genepred, by gtfToGenePred (at genomika server) with option -genePredExt
  • If you name your track augustus, genome browser will recognize it automatically, no need to modify trackDb.ra

(F) Self-alignment (medium/slow, needs A)

lastdb genome.fa genome.fa 
lastal genome.fa genome.fa -E 1e-20 > self.maf #slow part
maf-convert psl self.maf > tmpC.psl

# filter out trivial self-alignments as well as alignments shorter than 100bp in one of the two sequences or with identity less than 0.9
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<100 || $F[16]-$F[15]<100 || $F[0]<0.9*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC100_90.psl
pslToChain tmpC100_90.psl tmpC100_90.chain # kent tools binary, available on genomika
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC100_90.chain > self100_90.chain

# another chain for alignments with at least 70% identity and length at least 300bp
perl -lane 'die unless @F==21; $s=($F[9] eq $F[13] && $F[10]==$F[12] && $F[11]==0); $s = $s || $F[12]-$F[11]<300 || $F[16]-$F[15]<300 || $F[0]<0.7*($F[0]+$F[1]+$F[2]+$F[3]+$F[5]+$F[7]); print unless $s' tmpC.psl > tmpC300_70.psl
/projects2/dipMag/magCap-2017/assembly/magCapA/seq-tracks/pslToChain tmpC300_70.psl tmpC300_70.chain # kent tools binary copied from genome-dev
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmpC300_70.chain > self300_70.chain
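
For readability, the filtering rule of the Perl one-liners above can be restated as a small Python sketch. The function name and parameters are hypothetical; the field positions follow the 21-column psl layout that the one-liners rely on (matches in column 0, query name, size, start, end in columns 9-12, target name, start, end in columns 13, 15, 16).

```python
def keep_psl_alignment(fields, min_len=100, min_identity=0.9):
    """Return True if a psl alignment passes the self-alignment filter.

    `fields` is one psl line split on whitespace (21 columns, no header).
    """
    if len(fields) != 21:
        raise ValueError("expected 21 psl columns")

    matches, mismatches, rep_matches, n_count = (int(x) for x in fields[0:4])
    q_base_insert, t_base_insert = int(fields[5]), int(fields[7])
    q_name, q_size = fields[9], int(fields[10])
    q_start, q_end = int(fields[11]), int(fields[12])
    t_name, t_start, t_end = fields[13], int(fields[15]), int(fields[16])

    # Trivial hit: a sequence aligned to itself over its whole length.
    trivial = q_name == t_name and q_start == 0 and q_end == q_size
    # Too short in either the query or the target.
    too_short = (q_end - q_start) < min_len or (t_end - t_start) < min_len
    # Identity: matches relative to all aligned and inserted bases.
    total = (matches + mismatches + rep_matches + n_count
             + q_base_insert + t_base_insert)
    low_identity = matches < min_identity * total

    return not (trivial or too_short or low_identity)
```

Calling it with min_len=300 and min_identity=0.7 corresponds to the second filter.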

Parts of trackDb.ra (replace magCapA5 with your genome name):

track selfChain100_90
shortLabel Self aln >90%id
longLabel Self alignments with length >100bp, identity >90%
group varRep
type chain magCapA5

track selfChain300_70
shortLabel Self aln >70%id
longLabel Self alignments with length >300bp, identity >70%
group varRep
type chain magCapA5

(G) Chains between genomes (medium, needs A from both groups)

  • The goal is to create chains from malGlo to malSym and vice versa
    • Each group creates chains from its browser to the other browser
  • This is done similarly to the self-alignment chains in part F, but the alignments are computed between two different genomes and the filtering is different
lastdb genome.fa genome.fa 
lastal genome.fa genome2.fa -E 1e-20 > firstSecond.maf
maf-convert psl firstSecond.maf > tmp.psl

# keep only alignments of length at least 100 in both sequences
perl -lane 'die unless @F==21; $s = $F[12]-$F[11]<100 || $F[16]-$F[15]<100; print unless $s' tmp.psl > tmp100.psl
pslToChain tmp100.psl tmp100.chain # kent tools binary on genomika
# fix bad coordinates on reverse strand 
perl -lane 'if ($F[0] eq "chain" && $F[9] eq "-") { ($F[10],$F[11]) = ($F[8]-$F[11], $F[8]-$F[10]); print join(" ", @F) } else { print }' tmp100.chain > firstSecond.chain
  • The trackDb.ra record is similar to those in part F, but put the target species in the line with type chain
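
The "fix bad coordinates" Perl one-liner used in parts F and G can be read as the following Python sketch (the function name is hypothetical). It assumes the usual chain header layout, where fields 8 to 11 are the query size, strand, start and end; on the reverse strand it replaces (start, end) by (size - end, size - start) and leaves all other lines untouched.

```python
def fix_reverse_chain_header(line):
    """Flip query coordinates of a reverse-strand chain header line."""
    f = line.split()
    if f and f[0] == "chain" and f[9] == "-":
        q_size, q_start, q_end = int(f[8]), int(f[10]), int(f[11])
        # Same transformation as in the Perl one-liner.
        f[10], f[11] = str(q_size - q_end), str(q_size - q_start)
        return " ".join(f)
    # Block lines and forward-strand headers are returned unchanged.
    return line
```

Applied line by line to tmp100.chain, it should produce the same output as the one-liner.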

(H) Protein-based chains between genomes (medium, needs A,B from both groups)

  • In more distant species, DNA-based chains from part G are not sufficiently sensitive, but it is easier to find similarity between proteins
  • In this type of track you extract protein sequences based on genome sequence and gene annotation, then you compare protein sets from the two species and map protein alignments back to the genome
  • Last year's commands create a psl file and load it, but such alignments cannot be used to move between genomes. It would be better to convert psl to chain as in parts F and G.
  • https://github.com/fmfi-genomika/genomika-2017/wiki/Chains-from-protein-alignments

(I) Genomes for comparative genomics (fast, only one group)

  • Download genomes of additional Malassezia species (other than malGlo and malSym)
  • Use list here [29], download M. pachydermatis, M. nana, M. equina, M. caprae, M. dermatis, M. restricta
    • Download one representative assembly per species (some species have multiple strains/assemblies)
  • Rename chromosomes similarly as in A, name fasta files in a systematic way (malPac1.fa etc.)
  • Store files in a directory at genomika server
  • Do not forget to note down in your documentation the URL of each downloaded fasta file.

(J) Multiple whole-genome alignment (slow, needs A from both groups, I)

  • The goal of this track is to create a whole-genome multiple alignment of several genomes
  • Use genomes from part I as well as malGlo and malSym genomes from the browser
  • Beware that malSym1 and malGlo1 should be correctly named, both the genome as a whole and their chromosomes as in the browser
  • The task requires some preprocessing - renaming things etc (fast), alignment computation (slow, we recommend running on compbio servers) and postprocessing (fast/medium)
    • Preprocessing and possibly also part of running alignment can be reused between groups - collaborate
  • Last year's notes consist of three parts: general introduction, Brona's notes (Example of use of tba in a different project), and student notes (Example of use of tba in our project, display alignments).
    • Probably follow student notes.
    • The notes are not finished (end with "track does not work"), but the track was finished, see track "S. Align (L)" in sacCer3 browser. See final version of sacCer3 trackDb.ra on genomika server.
  • To run the alignment, you need a phylogenetic tree of these species. Use the tree from the paper by Wu et al. 2015 [30]; our species are in group B. Write the tree in parenthesis notation.
  • https://github.com/fmfi-genomika/genomika-2017/wiki/Alignments

(K) Conservation by phyloP (medium, needs A,I,J)

  • Based on multiple alignment from part J, find which positions are conserved in evolution (the result is a numerical level of conservation per position in a wiggle format)
  • See tracks Align. Cons. (L) and Multiz. Cons. (L) in sacCer3 browser (here we want only one track)
  • Use the same tree as in J
  • https://github.com/fmfi-genomika/genomika-2017/wiki/PhyloP-tracks

(L) Conserved elements by phastCons (medium, needs A,I,J)

  • Similar to track K, but uses a different program from the phast package. PhastCons is based on an HMM and finds contiguous conserved regions. The result is a list of conserved regions (bed format) as well as the posterior probability of a conserved region at each position (wig format)
  • On sacCer3, examples of wig tracks are Cons. new (L) and Cons. old (L). The bed track is PhastCons Most, which was taken from the original UCSC database, so no commands for it are available; it should hopefully be easy to create and load.
  • https://github.com/fmfi-genomika/genomika-2017/wiki/Conservation

(M) Protein domain and other protein annotation from Uniprot (fast/medium, needs A,B)

  • The uniprot database (http://www.uniprot.org/) contains information about proteins. The goal is to download information about malGlo and malSym proteins, parse out info about particular regions and map these to the corresponding regions of the genome.
  • See sacCer3 tracks Pfam (L), uniProtAnnot (L), uniProtStruct (L)
  • Download protein info in XML format malGlo [31], malSym [32]
  • Last year's protocol links uniprot proteins to genes from browser annotation via sequence similarity search (blat). Possibly this could be done also by cross-linking information from the databases, but blat is fine.
  • https://github.com/fmfi-genomika/genomika-2017/wiki/Uniprot-data
  • Last year, the Pfam track was created by running the Interproscan tool locally [33]. However, this is time-consuming and uniprot contains pre-computed info about Pfam domains. Therefore it would be better to modify the scripts so that they parse Pfam out of the uniprot XML files together with other info.
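
Parsing regions out of the uniprot XML can be sketched in Python as follows. This is not last year's script; the function name is hypothetical, and the element and attribute names (entry, accession, feature with type and description, location with begin/end position) and the namespace are assumptions about the uniprot XML layout, so check them against the downloaded files. Mapping the protein coordinates to the genome is not covered here.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of uniprot XML files.
NS = {"u": "http://uniprot.org/uniprot"}


def protein_regions(xml_text):
    """Return (accession, feature type, description, begin, end) tuples.

    Coordinates are 1-based protein positions as given in the XML.
    Features without an explicit begin/end range are skipped.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("u:entry", NS):
        # Use the first (primary) accession of the entry.
        acc = entry.findtext("u:accession", namespaces=NS)
        for feat in entry.findall("u:feature", NS):
            loc = feat.find("u:location", NS)
            if loc is None:
                continue
            begin, end = loc.find("u:begin", NS), loc.find("u:end", NS)
            if begin is None or end is None:
                continue  # single-position feature
            b, e = begin.get("position"), end.get("position")
            if b is None or e is None:
                continue  # position not known
            rows.append((acc, feat.get("type"),
                         feat.get("description", ""), int(b), int(e)))
    return rows
```

Filtering the returned tuples by feature type then gives the items for the individual tracks.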

(N) Expression data from RNA-seq (medium/slow, needs A)

  • The goal is to display the results of measurement of expression (amount of mRNA) by RNA-seq
  • Workflow:
    • The original data are reads in fastq format. Some preprocessing can be done (quality trimming etc)
    • Reads are aligned to the genome to produce sam/bam file. This is SLOW. The file is then sorted and indexed.
    • Bam files can be used in the browser, but they are big. We will report only the number of reads at each position in a wig (wiggle) format.
    • Wig files can be loaded to the database, but it is probably more efficient to convert them to binary bigwig files. The database then contains only a reference to the bigwig file.
  • Data:
    • malGlo [34] - only reads provided. Out of 27 experiments choose only 1-2, align to genome, e.g. this one: [35]
    • malSym [36] - bam files provided
  • For malGlo, the reads need to be aligned to the genome.
    • Currently recommended aligner is STAR https://github.com/alexdobin/STAR
    • It seems that STAR can directly create wig files, read the manual for recommended settings (e.g. the section on small genomes)
    • To convert wig to bigwig, use wigToBigWig on genomika
    • To load bigwig file, see commands below
  • malSym already has bam files for several experiments
    • These need to be converted to wig / bigwig
    • First use bedtools suite to create bedgraph (see commands below), then convert to bigwig using bedGraphToBigWig (installed on genomika)
    • To load bigwig file, see commands below
    • Multiple experiments are better combined to a single composite track with individual subtracks
    • Subtracks are loaded to the db normally, the composite track is noted only in the trackDb.ra file, see below
  • Useful commands (modify for your situation):
# bam to bedgraph 
faSize -detailed genome.fa > genome.sizes
bedtools genomecov -ibam reads.bam -g genome.sizes -bga -split > reads.bedgraph

# to create track, place bigwig file to appropriate place in /gbdb
# then create table with reference to this file:
hgsql malXyz1 -e "CREATE TABLE table_name (fileName varchar(255) not null);"
hgsql malXyz1 -e "insert into table_name values ('/gbdb/malXyz1/filename.bw');"

# in trackDb.ra include something like this: (change 500 to appropriate value at which read depth is clipped)
track table_name
shortLabel RNA-seq coverage
longLabel RNA-seq coverage
visibility dense
group rna
type bigWig 0 500

# composite track from multiple experiments:
track track_name
compositeTrack on
type bigWig 0 200
shortLabel RNA-seq coverage
longLabel RNA-seq coverage
group rna
visibility dense

track subtrack_name
shortLabel subtrack_label
longLabel subtrack_label
parent track_name
type bigWig 0 250
visibility full
maxHeightPixels 80:16:8
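
With several experiments, writing the stanzas by hand gets repetitive. A minimal Python sketch that generates them is below; the function name, its arguments and the example table names are hypothetical, and it uses only the settings shown above.

```python
def composite_stanzas(parent, subtracks, limits="0 200"):
    """Build trackDb.ra text for a bigWig composite track.

    `subtracks` is a list of (table_name, label) pairs; `limits` is the
    value range placed after "type bigWig".
    """
    lines = [
        f"track {parent}",
        "compositeTrack on",
        f"type bigWig {limits}",
        "shortLabel RNA-seq coverage",
        "longLabel RNA-seq coverage",
        "group rna",
        "visibility dense",
    ]
    for table, label in subtracks:
        lines += [
            "",  # blank line separates stanzas
            f"track {table}",
            f"shortLabel {label}",
            f"longLabel {label}",
            f"parent {parent}",
            f"type bigWig {limits}",
            "visibility full",
            "maxHeightPixels 80:16:8",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(composite_stanzas("rnaSeq", [("rnaSeqA", "Sample A"),
                                       ("rnaSeqB", "Sample B")]))
```

Each table name must match a table created by the hgsql commands above.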

(O) Differences between strains (slow, needs A)

  • The goal is to compare multiple strains of the same species and display differences between them in the browser
  • The usual way is to align sequencing reads from one strain to the reference strain, identify differences and display them in vcf format
  • Read files are large, therefore we directly compare assembled genomes and create the vcf file using c-sibelia tool
  • You can mostly follow last year's notes except for the final steps. Instead of placing the vcf.gz and vcf.gz.tbi files on a different server, place them on genomika in /gbdb/malXyz1/subdir, then insert them to the database using the commands below
  • As in part N, you can group several strains to a single composite track, see parts of trackDb.ra in commands below
  • https://github.com/fmfi-genomika/genomika-2017/wiki/Strain-comparison
  • Last year's tracks are currently broken, but you can at least check their settings, e.g. CLIB89 variants (L) in the yarLip browser
  • Download other strains:
    • malGlo [37] use strains CBS 7966, CBS 7874
    • malSym [38] use all strains except ATCC 42132
  • Useful commands (modify for your situation):
# to create track, place vcf.gz and vcf.gz.tbi files to appropriate place in /gbdb
# then create table with reference to the vcf.gz file:
hgsql malXyz1 -e "CREATE TABLE table_name (fileName varchar(255) not null);"
hgsql malXyz1 -e "insert into table_name values ('/gbdb/malXyz1/subdir/filename.vcf.gz');"

# in trackDb.ra include something like this:
# composite track:
track track_name
compositeTrack on
type vcfTabix
shortLabel ...
longLabel ...
group varRep
visibility hide

# subtrack:
track subtrack_name
shortLabel ...
longLabel ...
parent track_name
visibility pack
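
The two hgsql commands are the same for bigwig files in part N and vcf.gz files here, so they can be generated for many files at once. A minimal Python sketch is below; the function name is hypothetical and the checks on names and paths are illustrative assumptions. It only prints the commands and does not run them.

```python
import re


def file_table_commands(db, table, gbdb_path):
    """Return the two hgsql commands registering a file under /gbdb."""
    # Conservative checks so that the generated SQL and shell stay simple.
    for name in (db, table):
        if not re.fullmatch(r"[A-Za-z0-9_]+", name):
            raise ValueError(f"unexpected characters in name: {name}")
    if not re.fullmatch(r"/gbdb/[A-Za-z0-9_./-]+", gbdb_path):
        raise ValueError(f"path should be a simple path under /gbdb: {gbdb_path}")

    create = f"CREATE TABLE {table} (fileName varchar(255) not null);"
    insert = f"insert into {table} values ('{gbdb_path}');"
    return [f'hgsql {db} -e "{create}"', f'hgsql {db} -e "{insert}"']


if __name__ == "__main__":
    for cmd in file_table_commands("malXyz1", "table_name",
                                   "/gbdb/malXyz1/subdir/filename.vcf.gz"):
        print(cmd)
```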

Genomika: Development projects

Information about the Genomika course

This page contains information about the subprojects for the final weeks of the semester.

MalGlo group

User trackDb, code management

  • Think how to better manage changes to browser code in the future instances of the course
  • Explore possibilities of each user having their own trackDb
  • Start by reading short info in /kentsrc/trackDb/makefile on genomika server
# Browser supports multiple trackDb's so that individual developers
# can change things rapidly without stepping on other people's toes. 
...
  • Write a manual describing how to make your suggested changes, and test it

Rfam

  • Rfam http://rfam.xfam.org/ is a database of families of non-coding RNAs
  • It contains a covariance model for each family
  • The database can be downloaded and searched against a genome using Infernal tool http://eddylab.org/infernal/
  • Do this search, then convert the output to appropriate format and display in the browser
  • Possibly use BEDdetail format https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7
  • After clicking on an Rfam match, there should be some display of additional information about the match and a link to the Rfam database. You can achieve this by the following lines in trackDb.ra:
type bedDetail 14
url http://rfam.xfam.org/family/$$
urlLabel Rfam:

Example of BEDdetail format for a Rfam match (items should be tab-separated, the last column starts at "truncated:")

chrom chromStart chromEnd name score strand thickStart thickEnd reserved blockCount blockSizes chromStarts id description
contigA 75109 75380 Fungi_SRP-1 1002 - 75109 75109 0 1 271 0 RF01502 truncated: no, E-value: 3.5e-19
  • Further things which you might want to explore:
    • Remove matches that correspond to tRNAscan-SE matches (try tool overlapSelect)
    • From several overlapping matches keep only the strongest (try tool overlapSelect)
    • More ambitious: Explore creating image of each RNA structure and somehow linking it to the info page for the match (as in non-coding RNA track in the human genome browser - see for example http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr1%3A16520585%2D16520658, display non-coding RNA track and click on the tRNA match)
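
Writing the BEDdetail lines can be sketched in Python as follows. The function name and arguments are hypothetical, and parsing of the Infernal output is not covered. The column layout is copied from the example above: thickStart and thickEnd both equal chromStart, there is a single block covering the whole match, and the last two columns are the Rfam accession and a free-text description.

```python
def rfam_beddetail_line(chrom, start, end, family, score, strand,
                        rfam_id, truncated, evalue):
    """Format one Rfam match as a tab-separated BEDdetail line (14 columns).

    `start` and `end` are 0-based, half-open BED coordinates; `evalue`
    is passed as text so that its formatting is kept as is.
    """
    fields = [
        chrom, start, end, family, score, strand,
        start, start,        # thickStart, thickEnd as in the example
        0,                   # reserved
        1, end - start, 0,   # one block covering the whole match
        rfam_id,
        f"truncated: {truncated}, E-value: {evalue}",
    ]
    return "\t".join(str(x) for x in fields)


if __name__ == "__main__":
    print(rfam_beddetail_line("contigA", 75109, 75380, "Fungi_SRP-1", 1002,
                              "-", "RF01502", "no", "3.5e-19"))
```

The call in the last lines reproduces the example line above.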

Information for users

  • Each track should provide basic information for users in the HTML document displayed after clicking on track name or left bar of the browser image.
  • The information should summarize what is displayed, what was source of the data, what program was used to produce the results etc
    • keep it less technical, with a link to your github wiki page for the track for potential developers replicating your work
  • See examples for tracks on the http://genome-euro.ucsc.edu/ browser
  • Also, the genome as a whole should have a description page. On the title page of http://genome-euro.ucsc.edu/ you see details of the selected assembly, e.g. for the guinea pig genome you see text
Guinea pig Genome Browser - cavPor3 assembly
The Feb. 2008 Cavia porcellus draft assembly (Broad Institute cavPor3) was produced by the Broad Institute at MIT and Harvard.
...
  • You should create some explanatory text for your species and genome and make it display on the title page
    • This already works for Yarrowia lipolytica on the genomika server, so you can try to find out how it was done

MalSym group

Gene info pages

  • If you click on a gene or other displayed item in a well set up genome browser, you get a page with more information about this item
  • These info pages do not work satisfactorily on our genomika browser
  • Look at all protein coding gene tracks in four browsers:
    • sacCer3 in original UCSC genome browser [39], tracks NCBI RefSeq, SGD Genes, Ensembl Genes
    • sacCer3 in our genomika genome browser [40], tracks NCBI RefSeq, SGD Genes, Ens. Genes, NCBI RefSeq (L), SGD Genes (L), Ens. Genes (L),
    • yarLip1 in our genomika genome browser [41], tracks Ens. Genes (L), RefSeq Genes (L)
    • malSym1 in our genomika genome browser [42], track Ensemble Genes (should be renamed Genes from NCBI)
  • For each explored track, find out what gets displayed on the gene info page, whether there are any error messages, whether the page contains a link to the source database (e.g. Ensembl, RefSeq, NCBI, SGD)
  • Explore how the differences in these info pages are encoded in the database and trackDb.ra
  • Suggest and implement improvements in these info pages on our browser in sacCer, yarLip, malSym and after warning the other group also in malGlo
  • The most comprehensive gene info pages use additional db tables downloaded from the uniprot database. This database is too large to be completely mirrored on our server. Can you suggest and implement a method for downloading only parts of the database for our species and loading it to the tables? (You were downloading uniprot for one species, its "proteome" in task M, possibly it can be used here.)

Note:

  • To explore how things work at UCSC, you can see setup notes in their github [43], particularly the uniProt section and sacCer3.txt
  • You can also check their original trackDb.ra files [44] - see also parent directory and subdirectories
  • You can explore even the UCSC mysql database through their mysql server [45]

Blat and name search

Blat:

  • In the blue menu bar on top of the genome browser screen find Tool->Blat. This is a fast alignment tool which finds sequences highly similar to your query.
  • In the genomika browser it seems to work for sacCer3 but not for the other three genomes. Make it work for all four, document your changes.

Name search:

  • The browser screen also contains a text input field, where you can enter particular coordinates but also other keywords, such as a gene name etc.
    • Try searching for gene YDR157W in sacCer3
    • Try searching for gene CAG83524 in yarLip1 - the gene is there but is not found, instead we get an error message
    • Make the search work for gene identifiers in all 4 genomes (sacCer, yarLip, malGlo, malSym)
  • Possibly also allow searching for other entities (keywords from gene descriptions, tRNA anti-codons, domains from Uniprot annotation track etc)
    • For example searching for keyword "ribosomal" in UCSC sacCer genome browser returns a list of genes with ribosomal in their description - try: [46]
  • Get rid of the misleading error message when the search is unsuccessful (see what error you get in the UCSC browser)

See the note in the previous task for information sources on how things are set up at UCSC.

Information for users

  • Each track should provide basic information for users in the HTML document displayed after clicking on track name or left bar of the browser image.
  • The information should summarize what is displayed, what was source of the data, what program was used to produce the results etc
    • keep it less technical, with a link to your github wiki page for the track for potential developers replicating your work
  • See examples for tracks on the http://genome-euro.ucsc.edu/ browser
  • Also, the genome as a whole should have a description page. On the title page of http://genome-euro.ucsc.edu/ you see details of the selected assembly, e.g. for the guinea pig genome you see text
Guinea pig Genome Browser - cavPor3 assembly
The Feb. 2008 Cavia porcellus draft assembly (Broad Institute cavPor3) was produced by the Broad Institute at MIT and Harvard.
...
  • You should create some explanatory text for your species and genome and make it display on the title page
    • This already works for Yarrowia lipolytica on the genomika server, so you can try to find out how it was done