1-DAV-202 Data Management 2024/25



Difference between revisions of "Lbioinf3"


Latest revision as of 20:14, 10 April 2024

HWbioinf3

Polymorphisms

  • Individuals within species differ slightly in their genomes.
  • Polymorphisms are genome variants which are relatively frequent in a population (e.g. at least 1%).
  • SNP: single-nucleotide polymorphism (a polymorphism which is a substitution of a single nucleotide).
  • Recall that most human cells are diploid, with one set of chromosomes inherited from the mother and the other from the father.
  • At a particular chromosomal location, a single human can thus have two different alleles (the individual is heterozygous at that location) or two copies of the same allele (homozygous).
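
The definitions above can be illustrated with a minimal Python sketch. The function names, the toy sample, and the default 1% threshold are assumptions made for this example only; they do not come from any particular tool.

```python
from collections import Counter


def zygosity(allele_a, allele_b):
    """Classify a diploid genotype at one location from its two alleles."""
    return "homozygous" if allele_a == allele_b else "heterozygous"


def allele_frequencies(genotypes):
    """Return the frequency of each allele among all chromosome copies.

    `genotypes` is a list of (allele, allele) pairs, one pair per individual.
    """
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_polymorphic(genotypes, min_freq=0.01):
    """True if at least two alleles each reach `min_freq` in the sample."""
    freqs = allele_frequencies(genotypes)
    return sum(f >= min_freq for f in freqs.values()) >= 2


# Toy sample of four individuals at one position (made-up data).
sample = [("C", "C"), ("C", "T"), ("T", "T"), ("C", "C")]
print(zygosity(*sample[1]))
print(allele_frequencies(sample))
print(is_polymorphic(sample))
```

In practice, allele frequencies are estimated from much larger samples; four individuals are used here only to keep the arithmetic easy to follow.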

Finding polymorphisms / genome variants

  • We compare sequencing reads coming from an individual to a reference genome of the species.
  • First we align them, as in the exercises on genome assembly.
  • Then we look for positions where a substantial fraction of the reads does not agree with the reference (this process is called variant calling).
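
The idea of variant calling can be sketched in a few lines of Python. This is a deliberately naive illustration, not how FreeBayes or any real variant caller works: it assumes gap-free alignments given as (start, sequence) pairs, ignores base qualities and indels, and uses made-up thresholds (`min_fraction`, `min_depth`) and made-up reads.

```python
from collections import Counter, defaultdict


def call_variants(reference, aligned_reads, min_fraction=0.3, min_depth=2):
    """Naive substitution calling from gap-free alignments.

    `aligned_reads` is a list of (start, sequence) pairs, where `start` is
    the 0-based reference position of the first base of the read.
    Returns a list of (position, reference_base, alternative_base, fraction).
    """
    pileup = defaultdict(Counter)  # position -> counts of observed bases
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    variants = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything about this position
        ref_base = reference[pos]
        for base, n in counts.most_common():
            if base != ref_base and n / depth >= min_fraction:
                variants.append((pos, ref_base, base, n / depth))
    return variants


# Made-up reference and reads; two of the four reads covering position 5
# carry T instead of the reference C.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTATGT"), (3, "TATGTA"), (4, "ACGTAC")]
print(call_variants(reference, reads))
```

A position where about half of the reads disagree with the reference, as in this toy input, is what one would expect for a heterozygous variant in a diploid genome.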

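Programs and file formats

  • For mapping, we will use BWA-MEM (https://github.com/lh3/bwa); you can also try Minimap2, as in the exercises on genome assembly.
  • For variant calling, we will use FreeBayes (https://github.com/ekg/freebayes).
  • For reads and read alignments, we will use FASTQ and BAM files, as in the previous lectures.
  • For storing found variants, we will use VCF files (http://www.internationalgenome.org/wiki/Analysis/vcf4.0/).
  • For storing genome intervals, we will use BED files (https://genome.ucsc.edu/FAQ/FAQformat.html#format1), as in the previous lecture.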
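
VCF files store variants and BED files store genome intervals, and the two formats use different coordinate conventions: VCF positions are 1-based, while BED intervals are 0-based and half-open. The following Python sketch parses the eight fixed columns of one VCF data line and converts the variant to a BED interval. The example line is made up for illustration, and the function names are assumptions of this sketch rather than part of any library.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),        # 1-based position of the first REF base
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles are comma-separated
        "qual": qual,
        "filter": flt,
        "info": info,
    }


def vcf_to_bed_interval(record):
    """Return (chrom, start, end) in BED coordinates: 0-based, half-open."""
    start = record["pos"] - 1
    end = start + len(record["ref"])
    return record["chrom"], start, end


# Made-up VCF data line (not a real variant).
example = "chr1\t1000\t.\tC\tT\t50\tPASS\tDP=20\n"
record = parse_vcf_line(example)
print(record)
print(vcf_to_bed_interval(record))
```

Real VCF files also contain header lines starting with `#` and optional per-sample columns, which this sketch does not handle.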

Human variants

  • For many human SNPs we already know something about their influence on phenotype and their prevalence in different parts of the world.
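  • There are various databases, e.g. dbSNP (https://www.ncbi.nlm.nih.gov/SNP/), OMIM (https://www.omim.org/), or the user-editable SNPedia (https://www.snpedia.com/index.php/SNPedia).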

UCSC genome browser

A short video for this section: https://youtu.be/RwEBS62Avaw

  • The UCSC genome browser (http://genome-euro.ucsc.edu/) is an on-line tool similar to IGV.
  • It has a nice interface for browsing genomes and a lot of data for some genomes (particularly human), but not all sequenced genomes are represented.

Basics

  • On the front page, choose Genomes in the top blue menu bar.
  • Select a genome and its version, optionally enter a position or a keyword, and press submit.
  • On the browser screen, the top image shows a chromosome map with the selected region in red.
  • Below there is a view of the selected region and various tracks with information about this region.
  • For example, some of the top tracks display genes (boxes are exons, lines are introns).
  • Tracks can be switched on and off and configured in the bottom part of the page (the browser supports different display levels; "full" shows all information but takes a lot of vertical space).
  • Buttons for navigation are at the top (move, zoom, etc.).
  • Clicking on an item in the browser figure gives you more information about the gene or other displayed item.
  • In this lecture, we will need the GENCODE and dbSNP tracks; check e.g. gene ACTN3 (http://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chr11%3A66546841-66563329) and within it SNP rs1815739 in exon 15.
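
A browser view can also be opened directly by URL. The Python sketch below builds such a link from a genome version and a region. It assumes the `db` and `position` query parameters and the `hgTracks` address exactly as they appear in the course link to gene ACTN3; the function name `browser_link` is made up for this example.

```python
from urllib.parse import urlencode

# Address of the European mirror of the browser, as used in this lecture.
BROWSER_URL = "http://genome-euro.ucsc.edu/cgi-bin/hgTracks"


def browser_link(db, chrom, start, end):
    """Build a link that opens the browser at the given region."""
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"})
    return f"{BROWSER_URL}?{query}"


# Region of gene ACTN3 in the hg38 version of the human genome.
actn3_link = browser_link("hg38", "chr11", 66546841, 66563329)
print(actn3_link)
```

`urlencode` takes care of escaping the colon in the position as `%3A`, so the region can be written in the usual `chr:start-end` form.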

Blat

  • For sequence alignments, the UCSC genome browser offers BLAT, a fast but less sensitive aligner (good for the same or very closely related species).
  • Choose Tools->Blat in the top blue menu bar, enter the DNA sequence below, and search in the human genome.
    • What is the identity level for the top match? What is its span in the genome? (Notice that other matches are much shorter.)
    • The Details link in the left column shows the alignment itself; the Browser link takes you to the browser at the matching region.
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
CCGAAAAGCCCCCACAAAAAGCCG
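
To make the questions about identity level and span concrete, here is a minimal Python sketch of both quantities. It uses a simplified definition of identity (matching columns divided by all alignment columns, with `-` marking a gap), which is an assumption of this example; the value reported by BLAT may be computed differently. The aligned strings are made up.

```python
def percent_identity(aligned_query, aligned_target):
    """Percentage of alignment columns in which both sequences agree.

    Both strings must have the same length; '-' marks a gap.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        q == t and q != "-" for q, t in zip(aligned_query, aligned_target)
    )
    return 100.0 * matches / len(aligned_query)


def span(start, end):
    """Length of a genomic region given 1-based inclusive coordinates."""
    return end - start + 1


# Made-up alignments: one substitution, and one gap in the query.
print(percent_identity("ACGTACGT", "ACGTTCGT"))
print(percent_identity("ACG-T", "ACGTT"))
print(span(66546841, 66563329))
```

The span of a match in the genome can be much longer than the query sequence when the alignment contains long gaps, for example when a query without introns is aligned to a gene with introns.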