1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Lbioinf3"

Latest revision as of 19:14, 10 April 2024

HWbioinf3

Polymorphisms

  • Individuals within species differ slightly in their genomes.
  • Polymorphisms are genome variants which are relatively frequent in a population (e.g. at least 1%).
  • SNP: single-nucleotide polymorphism (a polymorphism which is a substitution of a single nucleotide).
  • Recall that most human cells are diploid, with one set of chromosomes inherited from the mother and the other from the father.
  • At a particular chromosomal location, a single human can thus have two different alleles (heterozygous) or two copies of the same allele (homozygous).

Finding polymorphisms / genome variants

  • We compare sequencing reads coming from an individual to a reference genome of the species.
  • First we align them, as in the exercises on genome assembly.
  • Then we look for positions where a substantial fraction of the reads does not agree with the reference (this process is called variant calling).

Programs and file formats

  • For mapping, we will use BWA-MEM (https://github.com/lh3/bwa); you can also try Minimap2, as in the exercises on genome assembly.
  • For variant calling, we will use FreeBayes (https://github.com/ekg/freebayes).
  • For reads and read alignments, we will use FASTQ and BAM files, as in the previous lectures.
  • For storing found variants, we will use VCF files (http://www.internationalgenome.org/wiki/Analysis/vcf4.0/).
  • For storing genome intervals, we will use BED files (https://genome.ucsc.edu/FAQ/FAQformat.html#format1), as in the previous lecture.
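
As a rough illustration of the variant calling idea described above (looking for positions where a substantial fraction of reads disagrees with the reference), the following Python sketch classifies a single aligned position. It is a toy example, not how FreeBayes works internally: the function name call_site and the thresholds min_depth, min_alt_fraction and hom_fraction are hypothetical values chosen only for illustration.

```python
from collections import Counter


def call_site(ref_base, read_bases, min_depth=8,
              min_alt_fraction=0.2, hom_fraction=0.8):
    """Toy genotype call at one reference position.

    ref_base   -- reference nucleotide at this position
    read_bases -- nucleotides observed in the reads aligned over it
    All thresholds are illustrative assumptions, not recommended settings.
    """
    ref = ref_base.upper()
    depth = len(read_bases)
    if depth < min_depth:
        return "no-call"  # too few reads to say anything

    counts = Counter(base.upper() for base in read_bases)
    alt_counts = {b: n for b, n in counts.items() if b != ref}
    if not alt_counts:
        return "homozygous reference"

    # Consider only the most frequent non-reference base.
    alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
    alt_fraction = alt_n / depth

    if alt_fraction < min_alt_fraction:
        return "homozygous reference"  # likely sequencing errors
    if alt_fraction >= hom_fraction:
        return f"homozygous {alt_base}"
    return f"heterozygous {ref}/{alt_base}"


print(call_site("C", "CCCCTTTTTC"))  # about half of the reads show T
print(call_site("C", "TTTTTTTTTT"))  # all reads show T
```

Real variant callers also take base qualities, mapping qualities and the diploid model into account, so this sketch only shows the basic counting step.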

Human variants

  • For many human SNPs we already know something about their influence on phenotype and their prevalence in different parts of the world.
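  • There are various databases, e.g. dbSNP (https://www.ncbi.nlm.nih.gov/SNP/), OMIM (https://www.omim.org/), or the user-editable SNPedia (https://www.snpedia.com/index.php/SNPedia).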

UCSC genome browser

A short video for this section: https://youtu.be/RwEBS62Avaw

  • The UCSC genome browser (http://genome-euro.ucsc.edu/) is an on-line tool similar to IGV.
  • It has a nice interface for browsing genomes and a lot of data for some genomes (particularly human), but not all sequenced genomes are represented.

Basics

  • On the front page, choose Genomes in the top blue menu bar.
  • Select a genome and its version, optionally enter a position or a keyword, and press submit.
  • On the browser screen, the top image shows a chromosome map; the selected region is in red.
  • Below it there is a view of the selected region and various tracks with information about this region.
  • For example, some of the top tracks display genes (boxes are exons, lines are introns).
  • Tracks can be switched on and off and configured in the bottom part of the page (the browser supports different display levels; full contains all information but takes a lot of vertical space).
  • Buttons for navigation are at the top (move, zoom, etc.).
  • Clicking on the browser figure allows you to get more information about a gene or other displayed item.
  • In this lecture, we will need the GENCODE and dbSNP tracks; check e.g. gene ACTN3 (http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr11%3A66546841-66563329) and within it SNP rs1815739 in exon 15.
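
Instead of entering a position by hand, a link to a region can also be built programmatically. The following Python sketch assumes the URL pattern of the ACTN3 link used in this lecture (the hgTracks page with db and position parameters); the function name ucsc_region_url is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the ACTN3 link in this lecture (European mirror).
UCSC_HGTRACKS = "http://genome-euro.ucsc.edu/cgi-bin/hgTracks"


def ucsc_region_url(db, chrom, start, end):
    """Build a browser link for region chrom:start-end in assembly db."""
    position = f"{chrom}:{start}-{end}"
    # urlencode percent-encodes the colon as %3A.
    return f"{UCSC_HGTRACKS}?{urlencode({'db': db, 'position': position})}"


# Region of gene ACTN3 in the hg38 assembly.
print(ucsc_region_url("hg38", "chr11", 66546841, 66563329))
```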

Blat

  • For sequence alignments, the UCSC genome browser offers BLAT, a fast but less sensitive tool (good for the same or very closely related species).
  • Choose Tools->Blat in the top blue menu bar, enter the DNA sequence below, and search in the human genome.
    • What is the identity level for the top found match? What is its span in the genome? (Notice that other matches are much shorter.)
    • Using the Details link in the left column you can see the alignment itself; the Browser link takes you to the browser at the matching region.
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
CCGAAAAGCCCCCACAAAAAGCCG
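
To make the notion of identity level from the questions above more concrete, the following Python sketch computes a simplified percent identity for two already aligned sequences. This is only an illustration: the function name percent_identity is hypothetical, gap columns are simply skipped, and the identity value reported by BLAT is computed differently in its details.

```python
def percent_identity(aligned_query, aligned_target, gap="-"):
    """Percentage of matching bases among aligned (non-gap) columns."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    matches = 0
    for q, t in zip(aligned_query.upper(), aligned_target.upper()):
        if q == gap or t == gap:
            continue  # skip columns containing a gap
        compared += 1
        if q == t:
            matches += 1

    if compared == 0:
        raise ValueError("no aligned (non-gap) columns")
    return 100.0 * matches / compared


# Five columns are compared (one has a gap), four of them match.
print(percent_identity("ACGT-A", "ACCTTA"))  # 80.0
```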