CI-en-db: Difference between revisions

Current revision as of 09:01, 26 March 2015

Introduction to bioinformatics databases and on-line tools

The goal of this exercise is to

see results of bioinformatics research in the form of on-line tools used by many biologists
get to know some basic tools in case you might want to try your algorithms on biology data
review some of the topics from the lectures

NCBI, Genbank, Pubmed, blast

National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
Collects publicly available data in molecular biology
We can search for keywords in various databases
BLAST finds alignments between a query sequence and a specified sequence database
- the web service is convenient, because there is no need to download a large database, but it is also very slow
Try the sequence below at http://blast.ncbi.nlm.nih.gov/Blast.cgi
- the sequence is from the human genome, but we will try to find its homolog in chicken
- choose nucleotide blast, database: reference genomic sequences, organism: chicken (taxid:9031), program: blastn
- on which chromosome is the best chicken homolog, and what are the alignment length, score, E-value and identity level?

AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
CCGAAAAGCCCCCACAAAAAGCCG
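
To make the terms "alignment length" and "identity level" from the questions above concrete, here is a minimal Python sketch that computes both from a gapped pairwise alignment. The two aligned rows are made-up toy data, not BLAST output, and the function name identity_stats is only illustrative. It assumes identity is counted as matching columns divided by the total number of alignment columns, gaps included.

```python
def identity_stats(aligned_query, aligned_subject):
    """Return (alignment_length, matches, percent_identity) for two gapped rows."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned rows must have the same length")
    length = len(aligned_query)
    # A column counts as a match only if both rows carry the same base (no gap).
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return length, matches, 100.0 * matches / length


# Toy alignment (made up for illustration); '-' marks a gap.
query = "ACGTTGCA-ACGT"
subject = "ACGATGCATACGT"
length, matches, pct = identity_stats(query, subject)
print(f"length={length} matches={matches} identity={pct:.1f}%")
```

The same function can be applied to the two rows of any alignment copied from a results page, as long as gaps are written as '-'.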

UCSC genome browser

http://genome.ucsc.edu/
nice interface for browsing genomes, a lot of data for some genomes (particularly human), but not all sequenced genomes are represented
also allows custom queries and data download

Basics

on the front page, choose Genomes in the top blue menu bar
select a genome and its version, optionally enter position or keyword, press submit
on the browser screen, the top image shows the chromosome map with the selected region in red
below it is a view of the selected region and various tracks with information about this region
for example some of the top tracks display genes (boxes are exons, lines are introns)
tracks can be switched on and off and configured in the bottom part of the page
- different display levels, full contains all information but takes a lot of vertical space
navigation at the top (move, zoom, etc.)
various actions in the menu
clicking at the browser figure allows you to get more information about a gene or other displayed item

Blat

Instead of BLAST, UCSC genome browser uses faster but less sensitive BLAT (good for the same or very closely related species)
Go to http://genome.ucsc.edu/, choose Blat in the top blue menu bar, enter the DNA sequence above, search in the human genome
- What is the identity level for the top found match? What is its span in the genome? (Notice that other matches are much shorter)
- Using Details link in the left column you can see the alignment itself, Browser link takes you to the browser at the matching region
Go to the browser, switch on Vertebrate net/chain on full
- This track allows you to move to the corresponding parts of other genomes
- In the chicken chain, notice chromosome number of the corresponding region in chicken
Optionally, you can try to use BLAT to map the query to the chicken genome directly
- on the blue bar press genomes, choose vertebrate and chicken, then blat on the top bar in submenu Tools
- what is the identity level and span of the best match? Is it on the same chromosome? How does it compare with the values obtained at NCBI?
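
One reason a tool like BLAT can be fast is that it looks up short exact matches in a precomputed index of the genome instead of scanning the whole database for every query. The sketch below illustrates only that general idea with a toy k-mer index. The sequences, the value k = 4 and the function names are assumptions chosen for illustration, and this is not BLAT's actual implementation.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def find_seeds(query, index, k):
    """Return (query_pos, genome_pos) pairs where a k-mer matches exactly."""
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            seeds.append((j, i))
    return seeds


genome = "TTGACCATGGGTATATACGACTCACGG"  # toy "genome"
query = "GGGTATATAC"                    # toy query
index = build_kmer_index(genome, 4)
seeds = find_seeds(query, index, 4)
# Seeds sharing the same diagonal (genome_pos - query_pos) are consistent
# with a single ungapped match; isolated seeds are usually chance hits.
diagonals = {i - j for j, i in seeds}
print(seeds)
print(diagonals)
```

Requiring several seeds on one diagonal is what makes this kind of search quick for near-identical sequences and less sensitive for diverged ones, since diverged homologs share few exact k-mers.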

Sequencing and assembly

UCSC genome browser has numbered versions of individual genomes - errors and missing parts are fixed over time
Go to genome.ucsc.edu, choose Genomes in the blue bar, select human, see when the latest version of the human genome was added
- if you are interested in detail, each assembly has a description at the bottom of the page
Go to the browser for human assembly hg19, region chr2:110,000,000-110,300,000, you can use this link: [1]
Display tracks "Assembly" and "Gap" in the full mode.
- What is the length of the unsequenced gap in the middle? (you can click on the gap to get details; only an estimate, not sequenced in this assembly)
- This gap is closed in the most recent assembly hg38. You can have a look by transferring to the corresponding region in hg38 - click on the blue bar View -> In other genomes (convert), select hg38. Notice that the length of the region shrank from 300,000 to 158,880. So the gap length estimate was not very accurate.
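
The arithmetic behind the comparison of the two assemblies can be sketched as follows. The helper parse_position is a hypothetical name, and the span is taken simply as end minus start, which matches the figure of 300,000 quoted above. The difference is only a rough indication of how far the gap estimate was off, under the assumption that the rest of the region maps unchanged.

```python
def parse_position(position):
    """Split 'chr2:110,000,000-110,300,000' into (chrom, start, end)."""
    chrom, span = position.split(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return chrom, start, end


chrom, start, end = parse_position("chr2:110,000,000-110,300,000")
hg19_length = end - start   # span of the hg19 region
hg38_length = 158_880       # span after conversion to hg38, taken from the text
shrink = hg19_length - hg38_length
print(chrom, hg19_length, hg38_length, shrink)
```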

Comparative genomics

Background: HAR1 gene

Pollard KS, Salama SR, Lambert N, et al. (2006). "An RNA gene expressed during cortical development evolved rapidly in humans". Nature 443 (7108): 167–72. doi:10.1038/nature05113. pdf
Authors found regions with many human-specific mutations but conserved in other mammals (using probabilistic models)
49 statistically significant regions
The most significant is HAR1: length 118, 18 substitutions in human, expected value 0.27. Only 2 substitutions between chimpanzee and chicken.
Overlaps RNA gene HAR1 (multiple forms)
One of the forms is expressed in embryonic neocortex and other parts of the brain
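
To get a feeling for how surprising 18 substitutions are when only 0.27 are expected, one can compute a tail probability under a simple Poisson model. This is only a back-of-the-envelope sketch: the Poisson assumption and the function name poisson_tail are introduced here for illustration, and the authors' own test is based on probabilistic models of sequence evolution, not on this calculation.

```python
import math


def poisson_tail(k_min, mean, terms=200):
    """P(X >= k_min) for X ~ Poisson(mean), summing the tail terms directly."""
    total = 0.0
    for k in range(k_min, k_min + terms):
        # Compute each term in log space to avoid overflow in the factorial.
        log_term = -mean + k * math.log(mean) - math.lgamma(k + 1)
        total += math.exp(log_term)
    return total


p = poisson_tail(18, 0.27)
print(f"P(X >= 18 | mean 0.27) = {p:.2e}")
```

Summing the tail directly, instead of computing one minus the lower part of the distribution, avoids losing a very small probability to floating-point rounding.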

HAR1 and comparative genomics in the browser

You can see this region in the browser: chr20:61,733,466-61,733,626 (hg19)
Make sure Conservation track is switched on full mode (perhaps press default tracks button)
If you zoom in closer, you will see a multiple sequence alignment, with many changes specific to human
If you zoom out to a wider region, e.g. chr20:61,733,305-61,733,787, you can look at the PhyloP subtrack, which shows the conservation level of every base - conservation over mammals in general is increased in the HAR1 region

Population genomics in the browser

Population genomics studies differences between individuals within species, e.g. between different people

Go to region chr2:174,862-436,468 in hg19
In section Phenotype and Disease Associations set GAD view track to full
- This track shows known associations of particular genetic regions or mutations with diseases
- You can e.g. look at details of associations for gene ACP1

In section Variation set HGDP Allele Freq to pack
- Shows positions where people differ from each other
- after clicking on a particular position you get a world map with distribution of variant frequencies in different human populations
Browser also contains tracks displaying genomes of specific people (e.g. Jim Watson) or ancient humans (Neandertals, Denisovans)

Work with tables, downloading data

Genome browser is nice for manual browsing but also allows programmers to download data

each track is based on one or several tables in an SQL database
you can download genomic sequences and data from these tables [2]
you can also write queries for a public SQL server [3] or create queries using Table browser forms (blue bar: Tools->Table browser)
conversely, you can also display your own data in "custom tracks" of the browser
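
The kind of query one might send to such an SQL database can be sketched locally without any network access. The example below uses Python's built-in sqlite3 module with an in-memory table as a stand-in. The table name genes, its columns and all rows are invented toy data, so a real query would need a MySQL client, the connection details of the public server and the actual table and column names of the chosen track.

```python
import sqlite3

# In-memory stand-in for a gene annotation table (toy schema and toy rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE genes (name TEXT, chrom TEXT, strand TEXT, "
    "txStart INTEGER, txEnd INTEGER)"
)
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?, ?, ?)",
    [
        ("geneA", "chr2", "+", 1000, 5000),
        ("geneB", "chr2", "-", 7000, 9000),
        ("geneC", "chr3", "+", 2000, 4000),
    ],
)


def genes_in_region(conn, chrom, start, end):
    """Return names of genes overlapping the half-open region [start, end)."""
    rows = conn.execute(
        "SELECT name FROM genes "
        "WHERE chrom = ? AND txStart < ? AND txEnd > ? "
        "ORDER BY txStart",
        (chrom, end, start),
    )
    return [name for (name,) in rows]


print(genes_in_region(conn, "chr2", 4000, 8000))
```

The overlap condition (gene starts before the region ends and ends after the region starts) is the usual way to select all items that would be visible in a displayed region.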

Table browser examples

Basic type of query: e.g. export all genes in the part of the genome displayed in the browser
Several output formats, e.g.:
- sequence: file of protein or DNA sequences of these genes (various settings)
- GTF: coordinates of genes and their exons
- Hyperlinks to genome browser: list of genes with links to the browser for each gene
- Instead of export we can get summary statistics (number of items, how much sequence they cover)
More complex query, "intersection" of two tables: e.g. all genes that are more than 50% covered by simple repeats
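
The "more than 50% covered" criterion can be sketched in a few lines of Python. The intervals below are toy data and the function name covered_fraction is hypothetical. The Table browser computes this on the server, so the code only illustrates the quantity being thresholded.

```python
def covered_fraction(gene, repeats):
    """Fraction of the half-open interval `gene` covered by `repeats`.

    `gene` is (start, end); `repeats` is a list of (start, end) intervals
    that may overlap each other.
    """
    g_start, g_end = gene
    covered = 0
    last_end = g_start
    for r_start, r_end in sorted(repeats):
        # Clip the repeat to the gene and to the part not counted yet.
        lo = max(r_start, last_end)
        hi = min(r_end, g_end)
        if hi > lo:
            covered += hi - lo
            last_end = hi
    return covered / (g_end - g_start)


gene = (100, 200)                              # toy gene interval
repeats = [(90, 130), (120, 150), (180, 260)]  # toy repeat intervals
frac = covered_fraction(gene, repeats)
print(frac, frac > 0.5)
```

Sorting the repeats and remembering the last counted position ensures that overlapping repeats are not counted twice.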

Phylogenetic trees, mobyle portal

Preparing data

Skip this part, download the result here: http://compbio.fmph.uniba.sk/vyuka/mbi-data/cb06/cb06-aln.fa
UCSC browser allows us to download multiple alignments of individual genes (DNA or protein sequences)
In UCSC browser find gene PDE7B (phosphodiesterase 7B)
In the blue bar choose Tools->Table browser, track RefSeq genes, select Region: position and Output format: CDS FASTA alignment, and press Get output
On the next screen select show nucleotides. From primates select chimp, rhesus, tarsier, from other mammals mouse, rat, dog, elephant and from other species opossum, platypus, chicken, lizard, then press Get output.
Store the output in a file and remove the common prefix NM_018945_ from sequence names, or completely rewrite the species names
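
Removing the common prefix from the sequence names can be done in any text editor, but a short script is handy for larger files. The sketch below assumes that each FASTA header starts with the prefix followed by an assembly name and possibly further fields. That header layout, the example names and the function name strip_prefix_from_fasta are assumptions, so check them against the file you actually downloaded.

```python
def strip_prefix_from_fasta(lines, prefix):
    """Yield FASTA lines with `prefix` removed from the start of each name."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name.startswith(prefix):
                name = name[len(prefix):]
            yield ">" + name
        else:
            yield line


# Toy input; real files contain one record per selected species.
example = [
    ">NM_018945_hg19 1 100\n",
    "ATGTCT\n",
    ">NM_018945_galGal3 1 100\n",
    "ATGTCC\n",
]
renamed = list(strip_prefix_from_fasta(example, "NM_018945_"))
print("\n".join(renamed))
```

To process a real file, pass an open file object instead of the example list and write the yielded lines to a new file.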

Building tree

We will build the tree using tools at http://mobyle.pasteur.fr/cgi-bin/portal.py
We will use program quicktree, neighbor joining method, bootstrap 100
- Bootstrap means the program builds trees for 100 replicates, each obtained by resampling alignment columns with replacement, and shows how many of the replicate trees contain each edge
- Low bootstrap value means there is not enough evidence in the data for a particular branch of the tree
To display the tree you can use display plugins or send the tree to other display tools (button further analysis, first choose tool in the menu)
- The result from drawtree tool, unrooted, does not display bootstrap values (choose MS-Windows Bitmap and resolution 1000)
- The result from newicktops tool, rooted by a heuristic (incorrectly), can show bootstrap values (choose in settings)
"Correct tree" [4] in Conservation track settings in the UCSC browser (based on Murphy WJ et al. Resolution of the early placental mammal radiation using Bayesian phylogenetics. Science. 2001 Dec 14;294(5550):2348-51.)
Our tree exhibits long branch attraction (bad position of rodents with a long branch as well as the elephant, which might be caused by sequencing errors).
Other programs you can try at mobyle
- phyml: phylogenetic trees by maximum likelihood (you can choose details of the model, bootstrap, type of local moves in hill climbing,...)
- dnapars and protpars for parsimony
- multiple alignment by clustalw or muscle
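
The resampling step of the bootstrap mentioned above can be sketched in Python. The alignment is toy data, the function name bootstrap_alignment is hypothetical, and the sketch covers only the creation of replicates by sampling alignment columns with replacement. Building a tree from each replicate and counting shared edges is left to the tree-building program.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    `alignment` maps a species name to its aligned sequence; all sequences
    must have the same length. Returns a new alignment of the same size.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


# Toy alignment, made up for illustration.
alignment = {
    "human": "ATGTCTAAGC",
    "mouse": "ATGTCCAAGC",
    "chicken": "ATGACCAAGT",
}
rng = random.Random(0)  # fixed seed so that the run is repeatable
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

The same column indices are used for every species within a replicate, so each resampled column is still a true column of the original alignment.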

Sequence motifs, program MEME

Program MEME gets a group of sequences and finds a motif they have in common
Based on the EM algorithm and probabilistic models
Go to http://meme.nbcr.net/ select MEME tool in Motif discovery section
As "primary sequences" paste in this data
If the server computes too long, you can see precomputed results here
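
The idea of EM for motif finding can be sketched on a toy example. The code below is a simplified illustration, not MEME itself: it assumes exactly one motif occurrence per sequence, a fixed motif width, a uniform background and a hand-picked starting matrix, and all sequences and function names are made up for this sketch.

```python
def e_step(seqs, pwm, background=0.25):
    """Posterior probability of each motif start position in each sequence,
    assuming exactly one motif occurrence per sequence."""
    w = len(pwm)
    posteriors = []
    for seq in seqs:
        weights = []
        for start in range(len(seq) - w + 1):
            # Likelihood ratio of motif versus background for this window.
            ratio = 1.0
            for j in range(w):
                ratio *= pwm[j][seq[start + j]] / background
            weights.append(ratio)
        total = sum(weights)
        posteriors.append([x / total for x in weights])
    return posteriors


def m_step(seqs, posteriors, w, pseudocount=0.1):
    """Re-estimate the motif matrix from posterior-weighted base counts."""
    pwm = [{b: pseudocount for b in "ACGT"} for _ in range(w)]
    for seq, post in zip(seqs, posteriors):
        for start, p in enumerate(post):
            for j in range(w):
                pwm[j][seq[start + j]] += p
    for col in pwm:
        total = sum(col.values())
        for b in col:
            col[b] /= total
    return pwm


seqs = ["TTGACGTCAAT", "CCTGACGTCAG", "ATGACGTCATT"]  # toy sequences
w = 4
seed = seqs[0][2:2 + w]  # starting guess: one window of the first sequence
pwm = [{b: (0.4 if b == seed[j] else 0.2) for b in "ACGT"} for j in range(w)]
for _ in range(20):
    pwm = m_step(seqs, e_step(seqs, pwm), w)
consensus = "".join(max(col, key=col.get) for col in pwm)
print(consensus)
```

EM only finds a local optimum, so the result depends on the starting matrix; trying several starting points is a common way to reduce that dependence.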

Gene expression data

NCBI Gene Expression Omnibus http://www.ncbi.nlm.nih.gov/geo/

Database of gene expression data at NCBI
Enter GDS2925 to the search box
You should get the data set "Various weak organic acids effect on anaerobic yeast chemostat cultures"
You can see basic data, such as citation, technology platform
Link "Expression profiles" shows plots for individual genes
For each gene we can get its profile neighbors - genes with similar expression
Data analysis tools, part Cluster heatmaps, K-means, shows results of K-means clustering for different values of K
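
For readers who want to see what K-means does with expression profiles, here is a minimal pure-Python sketch. The four profiles are invented numbers, the starting centroids are chosen by hand and the function name kmeans is illustrative. It is not the procedure GEO runs, only the textbook algorithm of alternating assignment and centroid update.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on lists of numbers, starting from the given centroids."""
    centroids = [list(c) for c in centroids]  # do not modify the caller's lists
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, clusters


# Toy expression profiles: one row per gene, one column per condition.
profiles = [
    [1.0, 1.1, 0.9],
    [0.9, 1.0, 1.2],
    [3.0, 3.2, 2.9],
    [3.1, 2.8, 3.0],
]
centroids, clusters = kmeans(profiles, [profiles[0], profiles[2]])
print(centroids)
print(clusters)
```

Running the function with a different number of starting centroids corresponds to choosing a different value of K.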

Proteins

Uniprot database http://www.uniprot.org/

Collects experimental and computed information about proteins, some parts curated by hand, links to many other databases
Find enzyme Bis(5'-adenosyl)-triphosphatase under name FHIT_HUMAN
This protein is relatively well studied with a lot of available information

Pfam database http://pfam.xfam.org/

contains profile HMMs for domain families
FHIT_HUMAN above contains a HIT domain (id PF01230)
You can see graphical logo of the HMM, sequence alignments and more
