1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.


Lbioinf3


HWbioinf3

Polymorphisms

  • Individuals within species differ slightly in their genomes.
  • Polymorphisms are genome variants which are relatively frequent in a population (e.g. at least 1%).
  • SNP: single-nucleotide polymorphism (a polymorphism which is a substitution of a single nucleotide).
  • Recall that most human cells are diploid, with one set of chromosomes inherited from the mother and the other from the father.
  • At a particular chromosomal location, a single human can thus have two different alleles (heterozygous) or two copies of the same allele (homozygous).
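
The definitions above can be illustrated with a minimal Python sketch. The function names and the four sample genotypes are made up for illustration and do not come from any real dataset; the 1% threshold is the example value mentioned above.

```python
from collections import Counter


def zygosity(genotype):
    """Classify a diploid genotype given as a pair of alleles."""
    first, second = genotype
    return "homozygous" if first == second else "heterozygous"


def allele_frequencies(genotypes):
    """Return the frequency of each allele among all chromosome copies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


# Hypothetical genotypes of four individuals at one genomic position.
genotypes = [("C", "C"), ("C", "T"), ("T", "T"), ("C", "C")]

for genotype in genotypes:
    print(genotype, zygosity(genotype))

freqs = allele_frequencies(genotypes)
print(freqs)

# With the example threshold of 1%, the variant counts as a polymorphism
# if the less frequent allele reaches that frequency in the population.
is_polymorphism = min(freqs.values()) >= 0.01
print(is_polymorphism)
```

Each individual contributes two alleles, so four individuals give eight chromosome copies in total.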

Finding polymorphisms / genome variants

  • We compare sequencing reads coming from an individual to a reference genome of the species.
  • First we align them, as in the exercises on genome assembly.
  • Then we look for positions where a substantial fraction of the reads does not agree with the reference (this process is called variant calling).
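
The idea of variant calling can be sketched in a few lines of Python. This is a deliberately naive illustration, not how real variant callers work: it assumes ungapped alignments given as hypothetical (start, sequence) pairs, ignores base qualities and diploid genotypes, and uses arbitrary thresholds `min_fraction` and `min_depth` chosen only for this example.

```python
from collections import Counter


def call_variants(reference, aligned_reads, min_fraction=0.25, min_depth=4):
    """Report positions where many reads disagree with the reference.

    aligned_reads is a list of (start, sequence) pairs with 0-based start
    positions; each read is assumed to align without gaps.
    """
    # Count the bases observed at each reference position (a simple pileup).
    pileup = [Counter() for _ in reference]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    variants = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        for base, count in sorted(counts.items()):
            if base != reference[position] and count / depth >= min_fraction:
                variants.append((position, reference[position], base, count, depth))
    return variants


# Toy data: three of the five reads covering position 4 carry T instead of A.
reference = "ACGTACGTAC"
reads = [
    (0, "ACGTAC"),
    (1, "CGTTCG"),
    (2, "GTTCGT"),
    (3, "TACGTA"),
    (4, "TCGTAC"),
]
variants = call_variants(reference, reads)
print(variants)  # [(4, 'A', 'T', 3, 5)]
```

Each reported tuple contains the position, the reference base, the alternative base, the number of reads supporting the alternative, and the total read depth at that position.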

Programs and file formats

Human variants

  • For many human SNPs we already know something about their influence on phenotype and their prevalence in different parts of the world.
  • There are various databases, e.g. dbSNP, OMIM, or user-editable SNPedia.

UCSC genome browser

A short video for this section: [1]

  • The UCSC genome browser is an on-line tool similar to IGV.
  • It has a nice interface for browsing genomes and a lot of data for some genomes (particularly human), but not all sequenced genomes are represented.

Basics

  • On the front page, choose Genomes in the top blue menu bar.
  • Select a genome and its version, optionally enter a position or a keyword, press submit.
  • On the browser screen, the top image shows a chromosome map, with the selected region marked in red.
  • Below there is a view of the selected region and various tracks with information about this region.
  • For example some of the top tracks display genes (boxes are exons, lines are introns).
  • Tracks can be switched on and off and configured in the bottom part of the page (the browser supports several display levels; "full" shows all information but takes a lot of vertical space).
  • Buttons for navigation are at the top (move, zoom, etc.).
  • Clicking on the browser figure allows you to get more information about a gene or other displayed item.
  • In this lecture, we will need the GENCODE and dbSNP tracks; for example, look at gene ACTN3 and, within it, SNP rs1815739 in exon 15.

Blat

  • For sequence alignments, the UCSC genome browser offers BLAT, a fast but less sensitive tool (good for the same or very closely related species).
  • Choose Tools->Blat in the top blue menu bar, enter the DNA sequence below, and search in the human genome.
    • What is the identity level for the top match found? What is its span in the genome? (Notice that the other matches are much shorter.)
    • Using the Details link in the left column, you can see the alignment itself; the Browser link takes you to the browser at the matching region.
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
CCGAAAAGCCCCCACAAAAAGCCG
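
To make the terms identity and span from the questions above more concrete, here is a minimal Python sketch. It uses one simple definition of identity (matching columns divided by all alignment columns); the value reported by BLAT may be computed differently. The function names, the toy alignment and the coordinates are hypothetical.

```python
def percent_identity(aligned_query, aligned_target):
    """Percentage of alignment columns in which both sequences agree.

    Both strings must have the same length; gaps are written as '-'.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for q, t in zip(aligned_query, aligned_target)
        if q == t and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def genomic_span(start, end):
    """Length of a region given by 1-based inclusive coordinates."""
    return end - start + 1


# Toy alignment with one gap and one mismatch in eight columns.
identity = percent_identity("ACGT-ACG", "ACGTTACC")
print(identity)  # 75.0

# Hypothetical coordinates of a match in the genome.
span = genomic_span(1001, 1850)
print(span)  # 850
```

The span in the genome can be longer than the query sequence if the alignment contains long gaps, for example when the query lacks the introns present in the genome.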