1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "Lbioinf3"
Revision as of 09:05, 18 March 2021
Polymorphisms
- Individuals within species differ slightly in their genomes
- Polymorphisms are genome variants which are relatively frequent in a population (e.g. at least 1%)
- SNP: single-nucleotide polymorphism (a polymorphism which is a substitution of a single nucleotide)
- Recall that most human cells are diploid, with one set of chromosomes inherited from the mother and the other from the father
- At a particular location, a single human can thus have two different alleles (heterozygosity) or two copies of the same allele (homozygosity)
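The terms above can be illustrated with a minimal Python sketch. The genotypes are made up for illustration, and the helper names `zygosity` and `allele_frequency` are hypothetical, not part of any tool used in this lecture.

```python
def zygosity(allele_a, allele_b):
    """Classify a diploid genotype at one position from its two alleles."""
    return "homozygous" if allele_a == allele_b else "heterozygous"


def allele_frequency(genotypes, allele):
    """Fraction of all allele copies in the sample that equal `allele`."""
    copies = [a for genotype in genotypes for a in genotype]
    return copies.count(allele) / len(copies)


# Made-up genotypes of four individuals at a single SNP position
genotypes = [("C", "C"), ("C", "T"), ("T", "T"), ("C", "C")]

labels = [zygosity(a, b) for a, b in genotypes]
freq_t = allele_frequency(genotypes, "T")

print(labels)
print(freq_t)
```

In this toy sample, allele T accounts for 3 of the 8 allele copies, so under the 1% rule of thumb mentioned above the variant would count as a polymorphism.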
Finding polymorphisms / genome variants
- We compare sequencing reads coming from an individual to a reference genome of the species
- First we align the reads to the reference, as in the exercises on genome assembly
- Then we look for positions where a substantial fraction of the reads disagree with the reference (this process is called variant calling)
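The idea of variant calling can be sketched as a toy pileup in Python. This is only an illustration under strong assumptions: reads are given as ungapped alignments with a known 0-based start, and the thresholds `min_fraction` and `min_depth` are arbitrary hypothetical parameters. Real variant callers such as FreeBayes use a statistical model rather than a simple cutoff.

```python
from collections import Counter


def call_variants(reference, aligned_reads, min_fraction=0.2, min_depth=2):
    """Report positions where a non-reference base is frequent among reads.

    aligned_reads is a list of (start, sequence) pairs, where start is the
    0-based reference position of the first base of the read. Alignments
    are assumed to be ungapped (a simplification).
    """
    # Count the bases observed at every reference position
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        for base, count in sorted(counts.items()):
            if base != reference[pos] and count / depth >= min_fraction:
                variants.append((pos, reference[pos], base, count, depth))
    return variants


# Made-up reference and reads; two of the four reads carry T at position 4
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTAC"), (4, "ACGTAC")]

variants = call_variants(reference, reads)
print(variants)
```

Here half of the reads covering position 4 support T and half support the reference A, which is the pattern one would expect at a heterozygous site.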
Programs and file formats
- For mapping, we will use BWA-MEM (you can also try Minimap2, as in the exercises on genome assembly)
- For variant calling, we will use FreeBayes
- For reads and read alignments, we will use FASTQ and BAM files, as in the previous lectures
- For storing found variants, we will use VCF files
- For storing genome intervals, we will use BED files
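VCF and BED are both tab-separated text formats, but they use different coordinate conventions: the POS column of VCF is 1-based, while BED intervals have a 0-based start and an exclusive end. The following Python sketch shows how to check whether a variant falls into an interval. The records are made up, the function names are hypothetical, and only the first few columns are parsed.

```python
def parse_vcf_line(line):
    """Parse the first five columns (CHROM, POS, ID, REF, ALT) of a VCF data line."""
    chrom, pos, variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),        # 1-based position
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternative alleles are comma-separated
    }


def parse_bed_line(line):
    """Parse the first three columns (chrom, start, end) of a BED line."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)  # 0-based start, exclusive end


def variant_in_interval(variant, interval):
    """Check whether the variant position lies inside the BED interval."""
    chrom, start, end = interval
    zero_based_pos = variant["pos"] - 1  # convert VCF to BED coordinates
    return variant["chrom"] == chrom and start <= zero_based_pos < end


# Made-up example records
variant = parse_vcf_line("chr1\t100\t.\tA\tG\t50\tPASS\t.\n")
inside = variant_in_interval(variant, parse_bed_line("chr1\t99\t100\n"))
outside = variant_in_interval(variant, parse_bed_line("chr1\t100\t200\n"))

print(variant["pos"], inside, outside)
```

Forgetting the conversion between the two conventions leads to off-by-one errors when variants are combined with intervals.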
Human variants
- For many human SNPs we already know something about their influence on phenotype and their prevalence in different parts of the world
- There are various databases, e.g. dbSNP, OMIM, or user-editable SNPedia
UCSC genome browser
A short video for this section: [1]
- Online tool similar to IGV
- http://genome-euro.ucsc.edu/
- Nice interface for browsing genomes, with a lot of data for some genomes (particularly human), but not all sequenced genomes are represented
Basics
- On the front page, choose Genomes in the top blue menu bar
- Select a genome and its version, optionally enter a position or a keyword, and press submit
- On the browser screen, the top image shows a chromosome map, with the selected region marked in red
- Below there is a view of the selected region and various tracks with information about this region
- For example, some of the top tracks display genes (boxes are exons, lines are introns)
- Tracks can be switched on and off and configured in the bottom part of the page (the browser supports several display levels; full shows all information but takes a lot of vertical space)
- Buttons for navigation are at the top (move, zoom, etc.)
- Clicking on an item in the browser image gives you more information about the gene or other displayed item
- In this lecture, we will need the GENCODE and dbSNP tracks; for example, look at the gene ACTN3 and, within it, the SNP rs1815739 in exon 15
Blat
- For sequence alignments, the UCSC genome browser offers BLAT, a fast but less sensitive aligner (good for the same or very closely related species)
- Choose Tools->Blat in the top blue menu bar, enter the DNA sequence below, and search in the human genome
- What is the identity level of the top match? What is its span in the genome? (Notice that the other matches are much shorter)
- The Details link in the left column shows the alignment itself; the Browser link takes you to the browser at the matching region
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC CCGAAAAGCCCCCACAAAAAGCCG
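To make the notions of identity and span concrete, here is a minimal Python sketch that computes both from a gapped pairwise alignment. It assumes the simplest definition of identity (matching columns divided by all alignment columns); the value reported by BLAT may be computed differently. The alignment and the start coordinate are made up, and the function names are hypothetical.

```python
def percent_identity(aligned_query, aligned_target):
    """Percentage of alignment columns in which both sequences have the same base."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        q == t and q != "-" for q, t in zip(aligned_query, aligned_target)
    )
    return 100.0 * matches / len(aligned_query)


def target_span(target_start, aligned_target):
    """Interval (0-based start, exclusive end) covered in the target sequence."""
    covered = sum(base != "-" for base in aligned_target)
    return target_start, target_start + covered


# Made-up alignment; "-" marks a gap
aligned_query = "ACGT-ACGT"
aligned_target = "ACGTTACCT"

identity = percent_identity(aligned_query, aligned_target)
span = target_span(1000, aligned_target)

print(round(identity, 1), span)
```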