1-BIN-301, 2-AIN-501 Methods in Bioinformatics, 2021/22

The Thursday 14:00 exercise sessions are intended for BIN, INF, mINF, mAIN and DAV students. The Thursday 17:20 sessions are for students from PriFUK and from physics programs. Both sessions will already take place in the first week of the semester.


CI-en-db: Difference between revisions

(Introduction to bioinformatics databases and on-line tools)
(Blat)
 
(46 intermediate revisions by the same user not shown)
Line 13: Line 13:
 
** convenient, because no need to download large database, but also very slow
 
** convenient, because no need to download large database, but also very slow
 
* Try sequence below at http://blast.ncbi.nlm.nih.gov/Blast.cgi
 
* Try sequence below at http://blast.ncbi.nlm.nih.gov/Blast.cgi
** first search using nucleotide blast with default settings in nr database, to speed up search you can enter Homo sapiens (taxid:9606) as Organism
+
** the sequence is from the human genome but we will try to find its homolog in chicken
** nr database contains all nucleotide sequences submitted by various researchers
+
** choose nucleotide blast, database reference genomic sequence, organism chicken (taxid:9031), program blastn
** you should find a record exactly corresponding to this sequence
+
** while this search runs, start another: search the same sequence in the chicken genome: choose nucleotide blast, database reference genomic sequence, organism chicken (taxid:9031), program blastn
+
 
** on which chromosome is the best chicken homolog, what is alignment length, score, E-value, identity level?
 
** on which chromosome is the best chicken homolog, what is alignment length, score, E-value, identity level?
 
<pre>
 
<pre>
Line 33: Line 31:
 
CCGAAAAGCCCCCACAAAAAGCCG
 
CCGAAAAGCCCCCACAAAAAGCCG
 
</pre>
 
</pre>
 
  
 
===UCSC genome browser===
 
===UCSC genome browser===
Line 40: Line 37:
 
* also allows custom queries and data download
 
* also allows custom queries and data download
  
====Sequencing====
+
====Basics====
* In the blue menu at the top, choose Genomes
+
* on the front page, choose Genomes in the top blue menu bar
* On the next page, choose human and in the Assembly menu '''find out when the last two versions of the human genome (hg19 and hg38) were added'''
+
* select a genome and its version, optionally enter position or keyword, press submit
* At the bottom of the same page there is a brief description of the selected genome version. '''For which regions of the genome does hg19 contain several alternative versions?'''
+
* on the browser screen top image shows chromosome map, selected region in red
* Enter region chr21:31,200,000-31,350,000 in hg19
+
* below a view of selected region and various tracks with information about this region
* Switch the Mapability and RepeatMasker tracks to "full"
+
* for example some of the top tracks display genes (boxes are exons, lines are introns)
* Mapability: how often a given segment is repeated in the genome, and therefore whether its reads can be mapped unambiguously when using next-generation sequencing
+
* tracks can be switched on and off and configured in the bottom part of the page
* How and why do the values differ for different read lengths? (Click the "Mapability" link to read more details.)
+
** different display levels, full contains all information but takes a lot of vertical space
* Roughly in the middle of the displayed region mapability drops. '''What type of repeat does this correspond to?''' (see the RepeatMasker track)
+
* navigation at the top (move, zoom, etc.)
* Switch on the "Assembly" and "Gaps" tracks and look at region chr2:110,000,000-110,300,000. '''How long is the unsequenced gap in the middle of this region?''' You can read the approximate size from the figure; clicking the black rectangle corresponding to the gap gives a more precise value (the exact length was not known anyway, since the gap was not sequenced).
+
* various actions in the menu
 
+
* clicking at the browser figure allows you to get more information about a gene or other displayed item
====Genes====
+
* Choose the older human genome version hg18, which has more information
+
* Enter gene MAGEA2B into the position box and then choose one of its occurrences (it has two)
+
** You can also get there with this link: [http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chrX:151883119-151887095]
+
* If you zoom out 3x, you can see that this gene has several splice forms, which differ only in the 5' UTR
+
* You can learn a lot by clicking on various parts of the browser: e.g. clicking on a gene shows information about its function, clicking on the bar next to a track (left edge of the figure) tells you more about the track and lets you set display parameters
+

+
====Comparative genomics====
+
* In the '''multiz alignments''' section you see alignments to various other genomes (you can choose which ones are shown). You can see how the level of alignment detail changes as you zoom in and out.
+
* When you zoom back to gene MAGEA2B and then further to the "base" level, i.e. about 100bp displayed, the multiz alignment box shows the alignment with the homologous segment in other genomes. In MAGEA2B specifically, we see quite a few protein differences between human and rhesus macaque, which is probably why it was classified as being under positive selection.
+
* In the '''conservation by PhyloP''' section we see a plot of how strongly the individual alignment columns are conserved
+
* You can switch on the Placental Chain/Net track and see on which chromosomes the orthologous segment lies in other genomes
+
  
 
====Blat====
 
====Blat====
* Go to the UCSC genome browser (http://genome.ucsc.edu/), choose BLAT in the blue bar, enter the DNA sequence above and search for it in the human genome. '''What is the identity (IDENTITY) of the strongest match found? How long a segment of the genome does it span (SPAN)?''' Notice that the other matches are much shorter.
+
* Instead of BLAST, UCSC genome browser uses faster but less sensitive BLAT (good for the same or very closely related species)
* In the ACTIONS column, Details shows the alignment details and Browser shows the corresponding segment of the genome.
+
* Go to http://genome.ucsc.edu/, choose Blat in the top blue menu bar, enter DNA sequence above, search in the human genome
* In this genome segment, switch the Vertebrate net track to full and click the colored line in the figure for this track to find out '''on which chicken chromosome the homologous segment occurs.'''
+
** What is the identity level for the top found match? What is its span in the genome? (Notice that other matches are much shorter)
* Let us try to map the same sequence to the chicken genome: first press Genomes in the top blue bar, choose Vertebrates and Chicken, and then BLAT in the top bar. Enter the same sequence into the box. '''What are the identity and length of the strongest match now? Which chromosome is it on?'''
+
** Using Details link in the left column you can see the alignment itself, Browser link takes you to the browser at the matching region
* How does this compare with the values we obtained using BLAST at NCBI?
+
* Go to the browser, switch on Vertebrate net/chain on full
 
+
** This track allows you to move to the corresponding parts of other genomes
====Discovery of the HAR1 gene using comparative genomics====
+
** In the chicken chain, notice chromosome number of the corresponding region in chicken
* {{cite journal |author=Pollard KS, Salama SR, Lambert N, ''et al.'' |title=An RNA gene expressed during cortical development evolved rapidly in humans |journal=Nature |volume=443 |issue=7108 |pages=167–72 |year=2006 |month=September |pmid=16915236 |doi=10.1038/nature05113 |url=}} [http://ribonode.ucsc.edu/Pubs/Pollard_etal06.pdf pdf]
+
* Optionally, you can try to use BLAT to map the query to the chicken genome directly
* They took all regions at least 100bp long with > 96% identity between chimpanzee and mouse/rat (35,000)
+
** on the blue bar press genomes, choose vertebrate and chicken, then blat on the top bar in submenu Tools
* They compared them with other mammals and found those with many mutations in human but few elsewhere (probabilistic model)
+
** what is the identity level and span of the best match? Is it on the same chromosome? How does it compare with the values obtained at NCBI?
* 49 statistically significant regions, 96% in non-coding regions
+
* The most significant is HAR1: 118nt, 18 substitutions in human, where 0.27 would be expected. Only 2 changes between chimpanzee and chicken (310 million years), but it was not found in fish or frog.
+
* It does not appear to be polymorphic in human
+
* Overlapping RNA genes HAR1R and HAR1F
+
* HAR1F is expressed in the neocortex of 7- and 9-week-old embryos, later also in other parts of the brain (in human and other primates)
+
* All substitutions in human are A/T->C/G, giving a more stable RNA structure (but they are also close to the telomere, where such mutations are more frequent because of recombination and biased gene conversion)
+
* You can view this region in the browser: [http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18&position=chr20:61203911-61204071 '''chr20:61,203,911-61,204,071''' (hg18)]; if you zoom in further, you will see the alignment with individual bases and can see that many changes are specific to human
+
* The elephant is an exception; some of the changes in elephant are caused by low sequence quality. When you use the In other genomes (convert) tool under View in the top bar to map to a newer version of the human genome (hg19), you will see that even the newest elephant genome version has many changes, but part of the sequence is no longer missing as it was in the version used in hg18.
+
 
+
====Working with tables, downloading annotations====
+
* The Tables item in the top bar lets you do sophisticated things with tables that contain gene coordinates and the like.
+
* The basic task: export e.g. all genes in the displayed region in one of the formats:
+
** sequence: FASTA file of proteins, genes or mRNA with various settings
+
** GTF: coordinates
+
** Hyperlinks to genome browser: a page of clickable links
+
* Instead of exporting, we can view various statistics
+

+
* More complex: intersection of two tables, e.g. genes that are more than 50% covered by simple repeats
+
** In intersection choose group: Variation and repeats, track: RepeatMasker, and set records that have at least 50% overlap with RepeatMasker
+
** In summary/statistics we find out how many there are in the genome; we can click through them via Hyperlinks to genome browser
+

+
* Filter on a table, e.g. genes with ribosomal in their name (procedure for Drosophila):
+
** In the hg19.kgXref based filters section, put <tt>*ribosomal*</tt> into the description field
+
====Population genomics in the UCSC genome browser====
+
UCSC genome browser has several tracks concerning population genomics and polymorphisms
+
* Look for example at region [http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr2:174,862-436,468 chr2:174,862-436,468 in hg19]
+
* In the Phenotype and Disease Associations section switch on GAD view
+
* In the Variation and Repeats section switch on
+
** HGDP Allele Freq to Pack (clicking on a SNP shows a world map with the distribution of alleles)
+
** "DGV Struct Var" to Pack
+
* The Genome Variants track contains genomes of several people, e.g. Jim Watson
+
* You can also look at the genomes of people from Denisova cave and of Neanderthals
+
  
An older version of the human genome also has a triangular plot of linkage disequilibrium
+
====Sequencing and assembly====
* [http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg18&position=chr2:164,862-426,468 the region above mapped to hg18]
+
* UCSC genome browser has numbered versions of individual genomes - errors and missing parts are fixed over time
* switch "HapMap LD Phased" to Full (Variation and Repeats section)
+
* Go to genome.ucsc.edu, choose Genomes in the blue bar, select human, and see when the latest versions of the human genome were added
* notice that the LD levels differ between human subpopulations (YRI: Nigeria; CEU: Europe; JPT+CHB: Japan, China)
+
** if you are interested in detail, each assembly has a description at the bottom of the page
 +
* Go to the browser for human assembly hg19, region chr2:110,000,000-110,300,000, you can use this link: [http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr2:110000000-110300000]
 +
* Display tracks "Assembly" and "Gap" in the full mode.
 +
** What is the length of the unsequenced gap in the middle? (you can click on the gap to get details; only an estimate, not sequenced in this assembly)
 +
** This gap is closed in the most recent assembly hg38. You can have a look by transferring to the corresponding region in hg38: click View -> In other genomes (convert) on the blue bar and select hg38. Notice that the length of the region shrank from 300,000 to 158,880, so the gap length estimate was not very accurate.
  
Diversity browser for S. cerevisiae:
+
====Comparative genomics====
* [http://www.sanger.ac.uk/research/projects/genomeinformatics/browser.html]
+
Background: HAR1 gene
 +
* {{cite journal |author=Pollard KS, Salama SR, Lambert N, ''et al.'' |title=An RNA gene expressed during cortical development evolved rapidly in humans |journal=Nature |volume=443 |issue=7108 |pages=167–72 |year=2006  |doi=10.1038/nature05113 |url=}} [http://ribonode.ucsc.edu/Pubs/Pollard_etal06.pdf pdf]
 +
* Authors found regions with many human-specific mutations but conserved in other mammals (using probabilistic models)
 +
* 49 statistically significant regions
 +
* The most significant is HAR1: length 118, 18 substitutions in human, expected value 0.27. Only 2 substitutions between chimpanzee and chicken.
 +
* Overlaps RNA gene HAR1 (multiple forms)
 +
* One of the forms is expressed in embryonic neocortex and other parts of the brain
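
To get a feel for how unusual 18 substitutions are when 0.27 are expected, one can compute a Poisson tail probability. This is only a rough sketch: the Poisson model and the function name <tt>poisson_upper_tail</tt> are assumptions made here for illustration, and the authors of the paper used their own probabilistic models.

```python
import math


def poisson_upper_tail(k_min: int, mean: float, extra_terms: int = 200) -> float:
    """Approximate P(X >= k_min) for X ~ Poisson(mean), with mean > 0.

    The upper tail is summed directly (in log space per term) because
    computing 1 - P(X < k_min) would lose all precision for tiny tails.
    """
    total = 0.0
    for k in range(k_min, k_min + extra_terms):
        log_term = -mean + k * math.log(mean) - math.lgamma(k + 1)
        total += math.exp(log_term)
    return total


# Numbers quoted in the notes: 18 observed substitutions, 0.27 expected
p = poisson_upper_tail(18, 0.27)
print(f"P(X >= 18 | mean 0.27) is about {p:.1e}")
```

Under this simplified model the probability is vanishingly small, which matches the intuition that HAR1 changed much faster in human than expected.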
  
===Phylogenetic trees, mobyle portal===
+
HAR1 and comparative genomics in the browser
* In the UCSC browser we can obtain multiple alignments of individual genes (nucleotides or proteins). You do not need to follow the procedure below; download the file here: http://compbio.fmph.uniba.sk/vyuka/mbi-data/cb06/cb06-aln.fa
+
* You can see this region in the browser: [http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr20:61733466-61733626 '''chr20:61,733,466-61,733,626''' (hg19)]
** In the UCSC browser we look at the human genome segment chr6:136,214,527-136,558,402 with gene PDE7B (phosphodiesterase 7B)
+
* Make sure Conservation track is switched on full mode (perhaps press default tracks button)
** In the blue bar choose Tables, there RefSeq genes, select Region: position and Output format: CDS FASTA alignment, and press Get output
+
* If you zoom in closer, you will see a multiple sequence alignment, with many changes specific to human
** On the next screen select show nucleotides. From primates choose chimp, rhesus, tarsier, from other mammals mouse, rat, dog, elephant and from other organisms opossum, platypus, chicken, lizard, then press Get output.
+
* If you zoom out to a wider region, e.g. chr20:61,733,305-61,733,787, you can look at PhyloP subtrack which shows for every base its conservation level - increased conservation over mammals in general in the HAR1 region
** Save the output to a file, delete the common prefix NM_018945_ from the sequence names, or rewrite the names entirely to English names
+
  
* Let us try to build a tree at http://mobyle.pasteur.fr/cgi-bin/portal.py
+
====Population genomics in the browser====
* We use program quicktree, the neighbor joining method, bootstrap 100
+
Population genomics studies differences between individuals within species, e.g. between different people
* To display the tree, we pass the result on to display programs drawtree or newicktops (choose in the menu next to the further analysis button)
+
* Go to region [http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&position=chr2:174,862-436,468 chr2:174,862-436,468 in hg19]
** [[:Image:Cb07-njtree.jpg|Result from drawtree]], unrooted, does not display bootstrap values
+
* In section Phenotype and Disease Associations set GAD view track to full
** [[:Image:Cb07-njtree2.pdf|Result from newicktops]], rooted at an arbitrary place (not correctly), displays bootstrap values
+
** This track shows known associations of particular genetic regions or mutations to diseases
** in drawtree we set the output format to MS-Windows Bitmap and X,Y resolution to at least 1000; in newicktops we set show bootstrap values
+
** You can e.g. look at details of associations for gene ACP1
* "Correct tree" [http://genome.ucsc.edu/images/phylo/hg18_44way.gif] in the Conservation track settings in the UCSC browser (according to the paper Murphy WJ, Eizirik E, O'Brien SJ, Madsen O, Scally M, Douady CJ, Teeling E, Ryder OA, Stanhope MJ, de Jong WW, Springer MS. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001 Dec 14;294(5550):2348-51.)
+
* Our tree shows long branch attraction (wrong position of rodents, which have a long branch, and also of the elephant, which may be caused by sequencing errors).
+
* Other programs you can try at mobyle
+
** phyml: maximum likelihood method (you can set details of the model, bootstraps, which however can take quite long, and types of tree operations in the heuristic search for the best tree)
+
** dnapars or protpars for parsimony
+
** multiple alignment using clustalw or the more modern alternative muscle
+
** If you want to try alignments, start from unaligned sequences: http://compbio.fmph.uniba.sk/vyuka/mbi-data/cb06/cb06-seq.fa
+
  
===Gene expression===
+
* In section Variation set HGDP Allele Freq to pack
Data on the expression of human genes in various tissues and similar in the '''UCSC genome browser'''
+
** Shows positions where people differ from each other
* Go to http://genome.ucsc.edu/ and find the PTPRZ1 gene in the human genome
+
** after clicking on a particular position you get a world map with distribution of variant frequencies in different human populations
* Choose Tools->Gene Sorter, leave sort by at Expression (GNF Atlas 2), search PTPRZ1
+
* Browser also contains tracks displaying genomes of specific people (e.g. Jim Watson) or ancient humans (Neandertals, Denisovans)
** You get a table of genes with an expression profile similar to PTPRZ1 (red is high expression, green low)
+
  
* We want to find out whether some functional category is overrepresented in this list
+
====Work with tables, downloading data====
** First we need to get the list of genes without further data
+
Genome browser is nice for manual browsing but also allows programmers to download data
** Press ''configure'', use the ''hide all'' button to clear all checked types of information, check only ''Name'', and press ''submit''
+
* each track based on one or several tables in an SQL database
** Then press the ''text'' button and you get a plain list of gene names in text format
+
* you can download genomic sequences and data from these tables [http://hgdownload.cse.ucsc.edu/downloads.html]
** In case of problems you can also find it [http://compbio.fmph.uniba.sk/vyuka/mbi-data/cb08/zoznam_genov.txt here]
+
* you can also write queries for a public SQL server [http://genome.ucsc.edu/goldenPath/help/mysql.html] or create queries using Table browser forms (blue bar: Tools->Table browser)
* http://biit.cs.ut.ee/gprofiler/ copy the gene names into the ''Query'' field and press g:Profile!
+
* conversely, you can also display your own data in "custom tracks" of the browser
** In the resulting table each row is one functional category in which genes with this expression profile are overrepresented, and each column is one gene. Category names are at the far right.
+
* What would we infer about this gene from the overrepresented categories?
+
* Find this gene in Uniprot (http://www.uniprot.org/); does it confirm our hypotheses?
+

* Let us return to the genome browser and find the PTPRZ1 gene in the genome
+
Table browser examples
* The browser has various tracks concerning expression, e.g. GNF Atlas 2. Read what this track shows, switch it on and look at the expression of the genes surrounding PTPRZ1
+
* Basic type of query: e.g. export all genes in the part of the genome displayed in the browser
* Click on the gene in the UCSC known genes track. In the table you again see an overview of expression in various tissues (according to GNF Atlas), and a link to Visigene.
+
* Several output formats, e.g.:
 +
** sequence: file of protein or DNA sequences of these genes (various settings)
 +
** GTF: coordinates of genes and their exons
 +
** Hyperlinks to genome browser: list of genes with links to the browser for each gene
 +
** Instead of export we can get summary statistics (number of items, how much sequence they cover)
 +
* More complex query, "intersection" of two tables: e.g. all genes that are more than 50% covered by simple repeats
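
The idea behind the "more than 50% covered" intersection can be sketched in a few lines. This is an illustration only: the notes do not say how the Table browser computes overlap internally, and the half-open coordinates, the function name <tt>covered_fraction</tt> and the toy intervals are all assumptions.

```python
def covered_fraction(gene, repeats):
    """Fraction of the gene interval covered by the union of repeat intervals.

    Intervals are (start, end) pairs with half-open coordinates (an assumption).
    """
    g_start, g_end = gene
    if g_end <= g_start:
        raise ValueError("gene interval must have positive length")
    # Clip repeats to the gene and sort them by start position
    clipped = sorted(
        (max(s, g_start), min(e, g_end))
        for s, e in repeats
        if s < g_end and e > g_start
    )
    covered = 0
    reached = g_start  # rightmost position already counted
    for s, e in clipped:
        s = max(s, reached)  # skip the part overlapping earlier repeats
        if e > s:
            covered += e - s
            reached = e
    return covered / (g_end - g_start)


# Hypothetical gene and repeat coordinates on the same chromosome
gene = (100, 200)
repeats = [(90, 130), (120, 150), (180, 260)]
fraction = covered_fraction(gene, repeats)
print(fraction, fraction > 0.5)
```

Overlapping repeats are merged on the fly so that no base of the gene is counted twice.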
  
'''NCBI Gene Expression Omnibus''' http://www.ncbi.nlm.nih.gov/geo/
+
===Phylogenetic trees, mobyle portal===
* Database of gene expression data at NCBI
+
* Enter GDS2925 into the Data sets box
+
* We should get ''Various weak organic acids effect on anaerobic yeast chemostat cultures''
+
* We can view basic data, e.g. citation, platform
+
* The "Expression profiles" link shows plots for various genes
+
* For each profile we can click profile neighbors to see genes with a similar profile
+
* Data analysis tools, section Cluster heatmaps, K-means, try different numbers of clusters
+

===Sequence motifs, program MEME===
+
Preparing data
 +
* Skip this part, download the result here: http://compbio.fmph.uniba.sk/vyuka/mbi-data/cb06/cb06-aln.fa
 +
* UCSC browser allows us to download multiple alignments of individual genes (DNA or protein sequences)
 +
* In UCSC browser find gene PDE7B (phosphodiesterase 7B)
 +
* In the blue bar choose Tools->Table browser, track RefSeq genes, select Region: position, and Output format: CDS FASTA alignment and press Get output
 +
* At the next screen select show nucleotides. From primates select chimp, rhesus, tarsier, from other mammals mouse, rat, dog, elephant and from other species opossum, platypus, chicken, lizard, press Get output.
 +
* Store the output in a file, remove common prefix NM_018945_ from sequence names, or completely rewrite species names
  
* Transcription factor binding sites are often represented as sequence motifs
+
Building tree
* Given a group of sequences, we can look for a motif they have in common
+
* We will build the tree using tools at  http://mobyle.pasteur.fr/cgi-bin/portal.py
* A well-known program for this problem is MEME
+
* We will use program quicktree, neighbor joining method, bootstrap 100
* Go to http://meme.nbcr.net/
+
** Bootstrap means the program builds 100 replicate trees, each from alignment columns resampled at random with replacement, and shows how many of them contain each edge
* Choose the MEME tool and enter [http://compbio.fmph.uniba.sk/vyuka/mbi-data/cb11/seq.fa these sequences] into the "actual sequences" box
+
** Low bootstrap value means there is not enough evidence in the data for a particular branch of the tree
* Look at the other settings. What might they do?
+
* To display the tree you can use display plugins or send the tree to other display tools (button further analysis, first choose tool in the menu)
* If the server takes long to compute, you can view the results [http://nbcr-222.ucsd.edu/opal-jobs/appMEME_4.9.114170270799951339152135/meme.html here]
+
** [[:Image:Cb07-njtree.jpg|The result from drawtree tool]], unrooted, does not display bootstrap values (choose MS-Windows Bitmap and resolution 1000)
 +
** [[:Image:Cb07-njtree2.pdf|The result from newicktops tool]], rooted by a heuristic (incorrectly), can show bootstrap values (choose in settings)
 +
* "Correct tree" [http://genome.ucsc.edu/images/phylo/hg18_44way.gif] in Conservation track settings in the UCSC browser (based on Murphy WJ et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001 Dec 14;294(5550):2348-51.)
 +
* Our tree exhibits long branch attraction (bad position of rodents with a long branch as well as the elephant, which might be caused by sequencing errors).
 +
* Other programs you can try at mobyle
 +
** phyml: phylogenetic trees by maximum likelihood (you can choose details of the model, bootstrap, type of local moves in hill climbing,...)
 +
** dnapars and protpars for parsimony
 +
** multiple alignment by clustalw or muscle
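
The bootstrap used with quicktree above can be sketched as follows. This shows the standard column-resampling bootstrap as an assumption; how quicktree implements it internally is not described in these notes, and the function names and the toy alignment are hypothetical.

```python
import random


def bootstrap_replicate(rows, rng):
    """Build one bootstrap replicate by sampling alignment columns with replacement."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all alignment rows must have the same length")
    picked = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[i] for i in picked) for row in rows]


def bootstrap_support(edge_found):
    """Percentage of replicate trees that contain a given edge."""
    return 100.0 * sum(edge_found) / len(edge_found)


# Hypothetical toy alignment of three sequences
alignment = ["ACGTAC", "ACGTTC", "ACCTAC"]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(replicates[0])
```

A tree would then be built from each replicate, and <tt>bootstrap_support</tt> applied to a list of flags saying whether each replicate tree contains the edge of interest.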
  
===Uniprot===
+
===Sequence motifs, program MEME===
* A clearer view of proteins, many links to other databases, partly created by hand
+
** We look at the enzyme Bis(5'-adenosyl)-triphosphatase
+
** Find it at http://www.uniprot.org/ under the name FHIT_HUMAN
+
** Look through its page in detail; which parts were predicted by bioinformatics methods from the lecture?
+
** Notice the Pfam domain and look at its page; which superfamily (clan) does it belong to?
+
  
==Summerschool 2011==
+
* Program MEME gets a group of sequences and finds a motif they have in common
===BLAST homology search (local alignment)===
+
* Based on the EM algorithm and probabilistic models
Use protein BLAST to find homologs of this protein in the human genome. Go to http://blast.ncbi.nlm.nih.gov/, choose program protein blast, on the next page enter our sequence, choose Reference proteins database and Homo sapiens as species. Use all other options at default settings. Which human protein is the closest homolog of our query? What is the score and E-value of this alignment? How many gaps are in the alignment?
+
* Go to http://meme.nbcr.net/ select MEME tool in Motif discovery section
 +
* As "primary sequences" paste in [http://compbio.fmph.uniba.sk/vyuka/mbi-data/cb11/seq.fa this data]
 +
* If the server computes too long, you can see precomputed results [http://nbcr-222.ucsd.edu/opal-jobs/appMEME_4.10.01427314516179-851932843/meme.html here]
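
To make the notion of a probabilistic motif concrete, the sketch below builds a position probability matrix from sites that are already aligned. This is not MEME itself: MEME has to find the sites with the EM algorithm, while here they are simply given. The example sites, the pseudocount and the function names are assumptions for illustration.

```python
from collections import Counter


def position_probabilities(sites, alphabet="ACGT", pseudocount=1.0):
    """Per-position letter probabilities for equal-length, already aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        matrix.append({ch: (counts[ch] + pseudocount) / total for ch in alphabet})
    return matrix


def consensus(matrix):
    """Most probable letter at each position of the motif."""
    return "".join(max(column, key=column.get) for column in matrix)


# Hypothetical binding sites, not taken from the exercise data
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
matrix = position_probabilities(sites)
print(consensus(matrix))
```

The pseudocount keeps unseen letters from getting probability zero, which matters once such a matrix is used to score new sequences.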
  
===Pfam domain database===
+
===Gene expression data===
Pfam database http://pfam.sanger.ac.uk/ contains profile HMMs of protein domain families. Use Sequence search at this webpage to find which domains are in our protein.
+
  
Then study in more detail zf-C4 domain which should be among the results. In Summary tab we can see description of the domain as well as Gene ontology (GO) terms. In HMM logo tab we can see the graphical representation of the HMM for this family. Which amino acid is most frequent at positions 3 and 6 of this domain?
+
NCBI Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo/
 +
* Database of gene expression data at NCBI
 +
* Enter GDS2925 to the search box
 +
* You should get ''Various weak organic acids effect on anaerobic yeast chemostat cultures''
 +
* You can see basic data, such as citation, technology platform
 +
* Link "Expression profiles" shows plots for individual genes
 +
* For each gene we can get its profile neighbors - genes with similar expression
 +
* Data analysis tools, part Cluster heatmaps, K-means, shows results of K-means clustering for different values of K
  
===PDB database for protein structures===
+
===Proteins===
Use Sequence search at http://www.rcsb.org/ to find the closest homolog with known structure. You can see an overview of the structure, download the file with coordinates, and also find e.g. the paper where the structure was published and the secondary structure (alpha helices, beta sheets).
+
  
===Uniprot database of proteins===
+
Uniprot database http://www.uniprot.org/
Uniprot http://www.uniprot.org/ organizes known information about function, structure and other aspects of individual proteins from all organisms. Use BLAST at this webpage to find which protein was used in this exercise (it should have 100% sequence identity in BLAST results). Which protein it comes from and what is its name? Proteins denoted by golden star in BLAST results have detailed information available. Which is the closest homolog with the star?
+
* Collects experimental and computed information about proteins, some parts curated by hand, links to many other databases
 +
* Find enzyme  Bis(5'-adenosyl)-triphosphatase under name  FHIT_HUMAN
 +
* This protein is relatively well studied with a lot of available information
  
===UCSC genome browser===
+
Pfam database http://pfam.xfam.org/
The browser http://genome.ucsc.edu/ allows us to explore the gene encoding this protein and its genomic context. Enter the protein sequence to BLAT search in the blue bar and find its closest homolog in the human genome. Which chromosome is the gene at? How many exons does it have? Switch on track ''Placental Chain/Net'' in ''Comparative Genomics'' section and find out which mouse chromosome contains homolog of this gene (color key of chromosomes is located below the main figure).
+
* contains profile HMMs for domain families
 +
* FHIT_HUMAN above contains a HIT domain (id PF01230)
 +
* You can see graphical logo of the HMM, sequence alignments and more

Current revision as of 10:01, 26 March 2015

Introduction to bioinformatics databases and on-line tools

The goal of this exercise is to

  • see results of bioinformatics research in the form of on-line tools used by many biologists
  • get to know some basic tools in case you might want to try your algorithms on biology data
  • review some of the topics from the lectures

NCBI, Genbank, Pubmed, blast

  • National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
  • Collects publicly available data in molecular biology
  • We can search for keywords in various databases
  • BLAST finds alignments of query sequence and a specified sequence database
    • convenient, because no need to download large database, but also very slow
  • Try sequence below at http://blast.ncbi.nlm.nih.gov/Blast.cgi
    • the sequence is from the human genome but we will try to find its homolog in chicken
    • choose nucleotide blast, database reference genomic sequence, organism chicken (taxid:9031), program blastn
    • on which chromosome is the best chicken homolog, what is alignment length, score, E-value, identity level?
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
CCGAAAAGCCCCCACAAAAAGCCG
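
The alignment statistics asked about above can be illustrated on a toy example. The sketch below computes alignment length and identity from two gapped alignment rows; the function name and the short sequences are hypothetical, and identity is taken as matches divided by the number of alignment columns, which is assumed here to be how BLAST reports it.

```python
def alignment_stats(row_a: str, row_b: str) -> dict:
    """Summarize a pairwise alignment given as two gapped rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    matches = mismatches = gap_columns = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == "-" or b == "-":
            gap_columns += 1  # column with a gap in either row
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    length = len(row_a)
    return {
        "length": length,
        "matches": matches,
        "mismatches": mismatches,
        "gap_columns": gap_columns,
        # identity as a percentage of all alignment columns, gaps included
        "identity": 100.0 * matches / length if length else 0.0,
    }


# Hypothetical toy alignment, not real BLAST output
human_row = "ACGTTGCA-ACGT"
chicken_row = "ACGATGCATACGT"
stats = alignment_stats(human_row, chicken_row)
print(stats)
```

Score and E-value depend on the scoring scheme and database size, so they are not reproduced in this sketch.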

UCSC genome browser

  • http://genome.ucsc.edu/
  • nice interface for browsing genomes, lot of data for some genomes (particularly human), but not all sequenced genomes represented
  • also allows custom queries and data download

Basics

  • on the front page, choose Genomes in the top blue menu bar
  • select a genome and its version, optionally enter position or keyword, press submit
  • on the browser screen top image shows chromosome map, selected region in red
  • below a view of selected region and various tracks with information about this region
  • for example some of the top tracks display genes (boxes are exons, lines are introns)
  • tracks can be switched on and off and configured in the bottom part of the page
    • different display levels, full contains all information but takes a lot of vertical space
  • navigation at the top (move, zoom, etc.)
  • various actions in the menu
  • clicking at the browser figure allows you to get more information about a gene or other displayed item

Blat

  • Instead of BLAST, UCSC genome browser uses faster but less sensitive BLAT (good for the same or very closely related species)
  • Go to http://genome.ucsc.edu/, choose Blat in the top blue menu bar, enter DNA sequence above, search in the human genome
    • What is the identity level for the top found match? What is its span in the genome? (Notice that other matches are much shorter)
    • Using Details link in the left column you can see the alignment itself, Browser link takes you to the browser at the matching region
  • Go to the browser, switch on Vertebrate net/chain on full
    • This track allows you to move to the corresponding parts of other genomes
    • In the chicken chain, notice chromosome number of the corresponding region in chicken
  • Optionally, you can try to use BLAT to map the query to the chicken genome directly
    • on the blue bar press genomes, choose vertebrate and chicken, then blat on the top bar in submenu Tools
    • what is the identity level and span of the best match? Is it on the same chromosome? How does it compare with the values obtained at NCBI?
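
The identity level asked about above can be illustrated on a small pairwise alignment. The following is a minimal sketch, assuming identity is the percentage of matching columns among columns without gaps. BLAT and BLAST each use their own formula (they differ in how gaps are counted), so the numbers reported by the servers may differ slightly; the function name is hypothetical.

```python
def percent_identity(aligned_query: str, aligned_target: str) -> float:
    """Percent of gap-free alignment columns in which both sequences agree."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    aligned_columns = 0
    for q, t in zip(aligned_query.upper(), aligned_target.upper()):
        if q == "-" or t == "-":
            continue  # assumption: columns with a gap are ignored
        aligned_columns += 1
        if q == t:
            matches += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * matches / aligned_columns


# Toy alignment: 5 gap-free columns, 4 of them identical.
print(percent_identity("ACGT-A", "ACCTTA"))
```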

Sequencing and assembly

  • The UCSC genome browser keeps numbered versions (assemblies) of individual genomes; errors and missing parts are fixed over time
  • Go to genome.ucsc.edu, choose Genomes in the blue bar, select human, and see when the latest versions of the human genome were added
    • if you are interested in detail, each assembly has a description at the bottom of the page
  • Go to the browser for human assembly hg19, region chr2:110,000,000-110,300,000, you can use this link: [1]
  • Display tracks "Assembly" and "Gap" in the full mode.
    • What is the length of the unsequenced gap in the middle? (you can click on the gap to get details; only an estimate, not sequenced in this assembly)
    • This gap is closed in the most recent assembly hg38. You can have a look by transferring to the corresponding region in hg38: click on the blue bar View -> In other genomes (convert), select hg38. Notice that the length of the region shrank from 300,000 to 158,880, so the gap length estimate was not very accurate.
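
Region lengths like those compared above can be computed directly from the position string. This is a small sketch with hypothetical function names, assuming the browser's position box uses 1-based coordinates with both ends included (so the hg19 region is 300,001 bases long, which the text rounds to 300,000).

```python
import re


def parse_region(region: str) -> tuple[str, int, int]:
    """Split a position string such as 'chr2:110,000,000-110,300,000'."""
    m = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", region.strip())
    if m is None:
        raise ValueError(f"cannot parse region: {region!r}")
    chrom = m.group(1)
    start = int(m.group(2).replace(",", ""))
    end = int(m.group(3).replace(",", ""))
    return chrom, start, end


def region_length(region: str) -> int:
    """Length in bases, assuming 1-based coordinates with both ends included."""
    _, start, end = parse_region(region)
    return end - start + 1


hg19_length = region_length("chr2:110,000,000-110,300,000")
hg38_length = 158_880  # value quoted in the text, not recomputed here
print(hg19_length, hg19_length - hg38_length)
```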

Comparative genomics

Background: HAR1 gene

  • Pollard KS, Salama SR, Lambert N, et al. (2006). "An RNA gene expressed during cortical development evolved rapidly in humans". Nature 443 (7108): 167–72. doi:10.1038/nature05113. pdf
  • Authors found regions with many human-specific mutations but conserved in other mammals (using probabilistic models)
  • 49 statistically significant regions
  • The most significant is HAR1: length 118, 18 substitutions in human, expected value 0.27. Only 2 substitutions between chimpanzee and chicken.
  • Overlaps RNA gene HAR1 (multiple forms)
  • One of the forms is expressed in embryonic neocortex and other parts of the brain
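
To get a feeling for how unusual 18 substitutions are when 0.27 are expected, one can compute a tail probability under a simple Poisson model. This is only an illustration: the Poisson assumption and the function name are ours, and this is not the probabilistic model used by the authors of the paper.

```python
import math


def poisson_tail(k_min: int, mean: float, terms: int = 200) -> float:
    """P(X >= k_min) for X ~ Poisson(mean), summed directly from the pmf.

    The sum is truncated after `terms` terms, which is more than enough
    for small means because the terms decay very quickly.
    """
    total = 0.0
    for k in range(k_min, k_min + terms):
        # log of the Poisson pmf: -mean + k*log(mean) - log(k!)
        log_pmf = -mean + k * math.log(mean) - math.lgamma(k + 1)
        total += math.exp(log_pmf)
    return total


# Assumed toy model: substitutions in the region follow Poisson(0.27).
print(poisson_tail(18, 0.27))
```

Under this assumed model the probability is vanishingly small, which matches the intuition that HAR1 changed much faster in human than expected.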

HAR1 and comparative genomics in the browser

  • You can see this region in the browser: chr20:61,733,466-61,733,626 (hg19)
  • Make sure the Conservation track is switched on in full mode (perhaps press the default tracks button)
  • If you zoom in closer, you will see a multiple sequence alignment, with many changes specific to human
  • If you zoom out to a wider region, e.g. chr20:61,733,305-61,733,787, you can look at the PhyloP subtrack, which shows the conservation level of every base - conservation over mammals in general is increased in the HAR1 region

Population genomics in the browser

Population genomics studies differences between individuals within species, e.g. between different people

  • Go to region chr2:174,862-436,468 in hg19
  • In section Phenotype and Disease Associations set GAD view track to full
    • This track shows known associations of particular genetic regions or mutations to diseases
    • You can e.g. look at details of associations for gene ACP1
  • In section Variation set HGDP Allele Freq to pack
    • Shows positions where people differ from each other
    • after clicking on a particular position you get a world map with distribution of variant frequencies in different human populations
  • Browser also contains tracks displaying genomes of specific people (e.g. Jim Watson) or ancient humans (Neandertals, Denisovans)

Work with tables, downloading data

Genome browser is nice for manual browsing but also allows programmers to download data

  • each track is based on one or several tables in an SQL database
  • you can download genomic sequences and data from these tables [2]
  • you can also write queries for a public SQL server [3] or create queries using Table browser forms (blue bar: Tools->Table browser)
  • conversely, you can also display your own data in "custom tracks" of the browser
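
A query for the public SQL server might be assembled as in the sketch below. It assumes the refGene table with columns name, name2, chrom, strand, txStart and txEnd, and that table coordinates are 0-based with the end excluded, while browser positions are 1-based with both ends included. The function name is hypothetical and the code only builds the query; it does not connect to the server.

```python
def genes_in_region_query(chrom: str, start: int, end: int):
    """Return (sql, params) selecting refGene rows that overlap a browser region.

    `start` and `end` are browser-style coordinates (1-based, inclusive).
    """
    sql = (
        "SELECT name, name2, chrom, strand, txStart, txEnd "
        "FROM refGene "
        "WHERE chrom = %s AND txStart < %s AND txEnd > %s "
        "ORDER BY txStart"
    )
    # Assumption: the table stores half-open intervals [txStart, txEnd),
    # so the browser region corresponds to [start - 1, end).
    params = (chrom, end, start - 1)
    return sql, params


sql, params = genes_in_region_query("chr2", 174862, 436468)
print(sql)
print(params)
```

The returned pair can be passed to the execute method of a database cursor that uses %s placeholders, after connecting to the server described at the link above.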

Table browser examples

  • Basic type of query: e.g. export all genes in the part of the genome displayed in the browser
  • Several output formats, e.g.:
    • sequence: file of protein or DNA sequences of these genes (various settings)
    • GTF: coordinates of genes and their exons
    • Hyperlinks to genome browser: list of genes with links to the browser for each gene
    • Instead of export we can get summary statistics (number of items, how much sequence they cover)
  • More complex query, "intersection" of two tables: e.g. all genes that are more than 50% covered by simple repeats
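
The intersection query above can be sketched on plain intervals. This is a simplified illustration, not the Table browser implementation: it assumes all intervals lie on one chromosome, are half-open (start, end) pairs, and that coverage is measured over the whole gene span; the function names and the toy data are hypothetical.

```python
def covered_bases(gene, repeats):
    """Number of bases of gene=(start, end) covered by the union of repeats."""
    g_start, g_end = gene
    # Keep only repeats overlapping the gene, clipped to the gene, sorted by start.
    clipped = sorted(
        (max(s, g_start), min(e, g_end))
        for s, e in repeats
        if s < g_end and e > g_start
    )
    covered = 0
    cur_end = g_start  # everything before cur_end is already counted
    for s, e in clipped:
        s = max(s, cur_end)  # do not count overlapping repeats twice
        if e > s:
            covered += e - s
            cur_end = e
    return covered


def genes_mostly_repeats(genes, repeats, threshold=0.5):
    """Names of genes whose covered fraction exceeds the threshold."""
    result = []
    for name, (start, end) in genes.items():
        if end > start and covered_bases((start, end), repeats) / (end - start) > threshold:
            result.append(name)
    return result


genes = {"g1": (0, 100), "g2": (200, 300)}
repeats = [(10, 50), (40, 80), (250, 260)]
print(genes_mostly_repeats(genes, repeats))
```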

Phylogenetic trees, mobyle portal

Preparing data

  • Skip this part, download the result here: http://compbio.fmph.uniba.sk/vyuka/mbi-data/cb06/cb06-aln.fa
  • UCSC browser allows us to download multiple alignments of individual genes (DNA or protein sequences)
  • In UCSC browser find gene PDE7B (phosphodiesterase 7B)
  • In the blue bar choose Tools->Table browser, track RefSeq genes, select Region: position, and Output format: CDS FASTA alignment and press Get output
  • At the next screen select show nucleotides. From primates select chimp, rhesus, tarsier, from other mammals mouse, rat, dog, elephant and from other species opossum, platypus, chicken, lizard, press Get output.
  • Store the output in a file, remove the common prefix NM_018945_ from the sequence names, or rewrite the names completely as species names
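
The renaming step can be done in any text editor, or with a few lines of code. The sketch below assumes that each FASTA header starts with ">" immediately followed by the prefix; the exact header layout of the downloaded file is an assumption, and the function name is hypothetical.

```python
def strip_prefix_from_fasta(lines, prefix="NM_018945_"):
    """Remove `prefix` from the start of every FASTA header line."""
    out = []
    for line in lines:
        if line.startswith(">"):
            name = line[1:]
            if name.startswith(prefix):
                name = name[len(prefix):]
            out.append(">" + name)
        else:
            out.append(line)  # sequence lines are kept unchanged
    return out


example = [">NM_018945_hg19", "ATGTCT", ">NM_018945_panTro2", "ATGTCC"]
print("\n".join(strip_prefix_from_fasta(example)))
```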

Building tree

  • We will build the tree using tools at http://mobyle.pasteur.fr/cgi-bin/portal.py
  • We will use program quicktree, neighbor joining method, bootstrap 100
    • Bootstrap means the program builds 100 replicate trees, each from a random resampling of the alignment columns (with replacement), and shows how many of them contain each edge
    • Low bootstrap value means there is not enough evidence in the data for a particular branch of the tree
  • To display the tree you can use display plugins or send the tree to other display tools (button further analysis, first choose tool in the menu)
  • "Correct tree" [4] in Conservation track settings in the UCSC browser (based on Murphy WJ et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001 Dec 14;294(5550):2348-51.)
  • Our tree exhibits long branch attraction (bad position of rodents with a long branch as well as the elephant, which might be caused by sequencing errors).
  • Other programs you can try at mobyle
    • phyml: phylogenetic trees by maximum likelihood (you can choose details of the model, bootstrap, type of local moves in hill climbing,...)
    • dnapars and protpars for parsimony
    • multiple alignment by clustalw or muscle
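
The bootstrap procedure mentioned for quicktree can be sketched as follows. This is a generic illustration, not quicktree's code: the alignment is assumed to be a dictionary from sequence name to aligned sequence, and build_splits is a hypothetical user-supplied function that builds a tree from an alignment and returns its edges as a set of frozensets of names on one side of each edge.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have equal length")
    cols = [rng.randrange(length) for _ in range(length)]
    return {n: "".join(alignment[n][c] for c in cols) for n in names}


def bootstrap_support(alignment, build_splits, replicates=100, seed=0):
    """For each edge of the tree from the original data, count the replicates
    whose tree contains the same edge."""
    rng = random.Random(seed)
    reference = build_splits(alignment)
    counts = {split: 0 for split in reference}
    for _ in range(replicates):
        replicate_splits = build_splits(bootstrap_alignment(alignment, rng))
        for split in reference:
            if split in replicate_splits:
                counts[split] += 1
    return counts


toy = {"a": "ACGT", "b": "ACGA", "c": "TCGA"}
print(bootstrap_alignment(toy, random.Random(1)))
```

An edge found in only a few replicates depends on a small number of alignment columns, which is why a low bootstrap value signals weak evidence.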

Sequence motifs, program MEME

  • Program MEME gets a group of sequences and finds a motif they have in common
  • Based on the EM algorithm and probabilistic models
  • Go to http://meme.nbcr.net/ select MEME tool in Motif discovery section
  • As "primary sequences" paste in this data
  • If the server takes too long to compute, you can see precomputed results here

Gene expression data

NCBI Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo/

  • Database of gene expression data at NCBI
  • Enter GDS2925 into the search box
  • You should get the dataset "Various weak organic acids effect on anaerobic yeast chemostat cultures"
  • You can see basic data, such as citation, technology platform
  • Link "Expression profiles" shows plots for individual genes
  • For each gene we can get its profile neighbors - genes with similar expression
  • Data analysis tools, part Cluster heatmaps, K-means, shows results of K-means clustering for different values of K

Proteins

Uniprot database http://www.uniprot.org/

  • Collects experimental and computed information about proteins, some parts curated by hand, links to many other databases
  • Find enzyme Bis(5'-adenosyl)-triphosphatase under name FHIT_HUMAN
  • This protein is relatively well studied with a lot of available information

Pfam database http://pfam.xfam.org/

  • contains profile HMMs for domain families
  • FHIT_HUMAN above contains a HIT domain (id PF01230)
  • You can see graphical logo of the HMM, sequence alignments and more