1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Revision as of 19:32, 31 March 2020

HWbioinf3

Polymorphisms

  • Individuals within species differ slightly in their genomes
  • Polymorphisms are genome variants which are relatively frequent in a population (e.g. at least 1%)
  • SNP: single-nucleotide polymorphism (a polymorphism which is a substitution of a single nucleotide)
  • Recall that most human cells are diploid, with one set of chromosomes inherited from the mother and the other from the father
  • At a particular location, a single human can thus have two different alleles (heterozygosity) or two copies of the same allele (homozygosity)
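
The terms above can be illustrated with a minimal Python sketch. The genotypes are made-up data for five hypothetical individuals at a single position, and the function names `zygosity` and `allele_frequencies` are chosen only for this illustration.

```python
from collections import Counter

def zygosity(genotype):
    """Classify a diploid genotype given as a pair of alleles, e.g. ("C", "T")."""
    first, second = genotype
    return "homozygous" if first == second else "heterozygous"

def allele_frequencies(genotypes):
    """Return the fraction of each allele among all chromosome copies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}

# Hypothetical genotypes of five individuals at one position (made-up data).
genotypes = [("C", "C"), ("C", "T"), ("C", "C"), ("T", "T"), ("C", "C")]

for genotype in genotypes:
    print(genotype, zygosity(genotype))

# Allele T occupies 3 of the 10 chromosome copies in this toy sample.
print(allele_frequencies(genotypes))
```

In a real population study, the frequency would be estimated from many more individuals before comparing it to a threshold such as 1%.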

Finding polymorphisms / genome variants

  • We compare sequencing reads coming from an individual to a reference genome of the species
  • First we align them, as in the exercises on genome assembly
  • Then we look for positions where a substantial fraction of reads does not agree with the reference (this process is called variant calling)
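
The idea of variant calling can be sketched on toy data as follows. This is only a simple threshold rule on gap-free alignments, not the method used by real variant callers, which rely on statistical models and base qualities. The reference, the reads, and the thresholds `min_fraction` and `min_depth` are all assumptions made for this illustration.

```python
from collections import Counter

def call_variants(reference, aligned_reads, min_fraction=0.3, min_depth=2):
    """Report positions where many reads disagree with the reference.

    aligned_reads is a list of (start, sequence) pairs, where start is the
    0-based reference position of the first base and the alignment has no gaps.
    """
    # Count the bases observed in reads at each reference position.
    pileup = [Counter() for _ in reference]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    variants = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        for base, count in sorted(counts.items()):
            if base != reference[position] and count / depth >= min_fraction:
                variants.append((position, reference[position], base, count, depth))
    return variants

# Made-up reference and reads; two of the four reads covering position 4 carry T.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTAC"), (4, "ACGTAC")]

for position, ref_base, alt_base, count, depth in call_variants(reference, reads):
    print(f"position {position}: {ref_base} -> {alt_base} in {count} of {depth} reads")
```

A position where roughly half of the reads carry the alternative base, as in this toy input, is what one would expect at a heterozygous site.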

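Programs and file formats

  • For mapping, we will use BWA-MEM (https://github.com/lh3/bwa); you can also try Minimap2, as in the exercises on genome assembly
  • For variant calling, we will use Freebayes (https://github.com/ekg/freebayes)
  • For reads and read alignments, we will use FASTQ and BAM files, as in the previous lectures
  • For storing found variants, we will use VCF files (http://www.internationalgenome.org/wiki/Analysis/vcf4.0/)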
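
Found variants are typically stored in VCF, a tab-separated text format with one variant per data line (columns CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, optionally followed by FORMAT and per-sample columns). The sketch below parses one such line in Python. The example line is made up for illustration, and `parse_vcf_line` is a hypothetical helper, not part of any library; for real work, a dedicated VCF library would be a safer choice.

```python
def parse_vcf_line(line):
    """Parse one VCF data line (not a header line starting with '#')."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),            # VCF positions are 1-based
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),      # several alternative alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info,
    }
    if len(fields) > 9:
        # FORMAT column names the colon-separated values of each sample column.
        keys = fields[8].split(":")
        record["samples"] = [dict(zip(keys, sample.split(":"))) for sample in fields[9:]]
    return record

# Made-up example line: a heterozygous A -> T variant at position 1000 of chr1.
example = "chr1\t1000\t.\tA\tT\t50.0\t.\tDP=4\tGT:DP\t0/1:4"
record = parse_vcf_line(example)
print(record["chrom"], record["pos"], record["ref"], record["alt"])
print(record["samples"][0]["GT"])  # 0 is the reference allele, 1 the first alternative
```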

Human variants

  • For many human SNPs we already know something about their influence on phenotype and their prevalence in different parts of the world
  • There are various databases, e.g. dbSNP, OMIM, or user-editable SNPedia

UCSC genome browser

  • On-line tool similar to IGV
  • http://genome-euro.ucsc.edu/
  • Nice interface for browsing genomes and a lot of data for some genomes (particularly human), but not all sequenced genomes are represented

Basics

  • On the front page, choose Genomes in the top blue menu bar
  • Select a genome and its version, optionally enter position or keyword, press submit
  • On the browser screen, the top image shows a chromosome map with the selected region in red
  • Below it is a view of the selected region and various tracks with information about this region
  • For example, some of the top tracks display genes (boxes are exons, lines are introns)
  • Tracks can be switched on and off and configured in the bottom part of the page (the browser supports different display levels; "full" shows all information but takes a lot of vertical space)
  • Buttons for navigation are at the top (move, zoom, etc.)
  • Clicking on an item in the browser figure gives you more information about a gene or other displayed item
  • In this lecture, we will need tracks GENCODE and dbSNP - check e.g. gene ACTN3 and within it SNP rs1815739 in exon 15

Blat

  • For sequence alignments, the UCSC genome browser offers BLAT, a fast but less sensitive tool (good for the same or very closely related species)
  • Choose Tools->Blat in the top blue menu bar, enter the DNA sequence below, and search in the human genome
    • What is the identity level for the top found match? What is its span in the genome? (Notice that other matches are much shorter)
    • Using the Details link in the left column, you can see the alignment itself; the Browser link takes you to the browser at the matching region
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
CCGAAAAGCCCCCACAAAAAGCCG
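
The questions above ask about the identity level and the span of a match. The following Python sketch shows one simple way to define these quantities on a toy alignment. It is an assumption-laden simplification: BLAT computes its own identity score, which need not equal this definition, and the alignment strings and coordinates here are made up.

```python
def percent_identity(aligned_query, aligned_target):
    """Percentage of matching bases among alignment columns without gaps."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    columns = [(q, t) for q, t in zip(aligned_query, aligned_target)
               if q != "-" and t != "-"]
    if not columns:
        raise ValueError("alignment has no gap-free columns")
    matches = sum(q.upper() == t.upper() for q, t in columns)
    return 100.0 * matches / len(columns)

def span(start, end):
    """Length of a genomic interval given by 1-based inclusive coordinates."""
    return end - start + 1

# Made-up alignment with one mismatch and one gap in the query.
query = "ACGTTGCA-ACGT"
target = "ACGTAGCATACGT"
print(f"identity: {percent_identity(query, target):.1f}%")

# Made-up coordinates of a match in the genome.
print("span:", span(1001, 1500))
```

The span can be much longer than the query when the query is a spliced transcript and the match in the genome includes introns.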