1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "HWbioinf3"

From MAD
Jump to navigation Jump to search
 
(2 intermediate revisions by the same user not shown)
Line 49: Line 49:
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
===Task B: UCSC browser===
+
===Task B: UCSC genome browser===
  
(a) Where is the sequence from <tt>regionChr7.fasta</tt> file located in the browser?
+
(a) Where is the sequence from <tt>humanChr7Region.fasta</tt> file located in the browser?
 
* Go to http://genome-euro.ucsc.edu/, from the blue menu, select <tt>Tools->Blat</tt>
 
* Go to http://genome-euro.ucsc.edu/, from the blue menu, select <tt>Tools->Blat</tt>
 
* Check that Blat uses Human, hg38 assembly
 
* Check that Blat uses Human, hg38 assembly
* Open <tt>regionChr7.fasta</tt> in a graphical editor (e.g. <tt>kate</tt>), select all, paste to the BLAT window, run BLAT.
+
* Open <tt>humanChr7Region.fasta</tt> in a graphical editor (e.g. <tt>kate</tt>), select all, paste to the BLAT window, run BLAT.
 
* In the table of results, the best result should have identity close to 100% and span close to 7kb.
 
* In the table of results, the best result should have identity close to 100% and span close to 7kb.
 
* For this best result, click on the link named Browser.
 
* For this best result, click on the link named Browser.
Line 60: Line 60:
  
 
(b) Which gene is located in this region?
 
(b) Which gene is located in this region?
* Once you are in the browser, press Default tracks button
+
* Once you are in the browser, you should see a track called GENCODE (if not, you can turn it on in the section Genes and Gene Predictions).
* Track named GENCODE contains known genes, shown as rectangles (exons) connected by lines (introns). Short gene names are next to them.
+
* This track contains known genes, shown as rectangles (exons) connected by lines (introns). Short gene names are next to them.
* Report the name of the gene in the region
+
* Report the name of the gene in the region.
  
 
(c) In which tissue is this gene most highly expressed? What is the function of this gene?
 
(c) In which tissue is this gene most highly expressed? What is the function of this gene?
Line 69: Line 69:
  
 
(d) Which SNPs are located in this gene? Which trait do they influence?
 
(d) Which SNPs are located in this gene? Which trait do they influence?
* You can see SNPs in the Common SNPs(155) track, but their IDs appear only after switching this track to pack mode. You can click on each SNPs to see more information and to copy their ID to your protocol.
+
* You can see SNPs in the dbSNPs track, but their IDs appear only after switching this track to pack or full mode. You can click on each SNPs to see more information and to copy their ID to your protocol.
* Information page of the gene (part c) also describes function of various alleles of this gene (see e.g. part POLYMORPHISM).
+
* Information page of the gene (part c) describes function of various alleles of this gene (see e.g. part POLYMORPHISM).
* You can also find information about individual SNPs by looking for them by their ID in [https://www.snpedia.com/index.php/SNPedia SNPedia] (not required in this task)
+
* You can also find information about individual SNPs by looking for them by their ID in [https://www.snpedia.com/index.php/SNPedia SNPedia] (not required in this task).
  
 
===Task C: Examining larger vcf files===
 
===Task C: Examining larger vcf files===
 
In this task, we will look at <tt>motherChr12.vcf</tt> and <tt>fatherChr12.vcf</tt> files and compute various statistics. You can use command-line tools, such as <tt>grep, wc, sort, uniq</tt> and Perl one-liners (as in [[Lbash]]), or you can write small scripts in Perl or Python (as in [[Lperl]] and [[Lsql]]).  
 
In this task, we will look at <tt>motherChr12.vcf</tt> and <tt>fatherChr12.vcf</tt> files and compute various statistics. You can use command-line tools, such as <tt>grep, wc, sort, uniq</tt> and Perl one-liners (as in [[Lbash]]), or you can write small scripts in Perl or Python (as in [[Lperl]] and [[Lsql]]).  
<!-- NOTEX -->
+
 
* Write all used commands to your protocol
+
* Write all used commands to your protocol.
* If you write any scripts, submit them as well
+
* If you write any scripts, submit them as well.
<!-- /NOTEX -->
 
  
 
Questions:
 
Questions:
  
 
(a) How many SNPs are in each file?
 
(a) How many SNPs are in each file?
* This can be found easily by <tt>wc</tt>, only make sure to exclude lines with comments
+
* This can be found easily by <tt>wc</tt>, only make sure to exclude lines with comments.
  
 
(b) How many heterozygous SNPs are in each file?
 
(b) How many heterozygous SNPs are in each file?
* The last column contains <tt>1|1</tt> for homozygous and either <tt>0|1</tt> or <tt>1|0</tt> for heterozygous SNPs
+
* The last column contains <tt>1|1</tt> for homozygous and either <tt>0|1</tt> or <tt>1|0</tt> for heterozygous SNPs.
* Character <tt>|</tt> has special meaning on the command line and in <tt>grep</tt> patterns; make sure to place it in apostrophes <tt>' '</tt> and possibly escape it with backslash <tt>\</tt>
+
* Character <tt>|</tt> has a special meaning on the command line and in <tt>grep</tt> patterns; make sure to place it in apostrophes <tt>' '</tt> and possibly escape it with backslash <tt>\</tt>.
  
 
(c) How many SNP positions are shared between the two files?
 
(c) How many SNP positions are shared between the two files?
* The second column of each file lists the position. We want to compute the size of intersection of the set of positions in <tt>motherChr12.vcf</tt> and <tt>fatherChr12.vcf</tt> files
+
* The second column of each file lists the position. We want to compute the size of intersection of the set of positions in <tt>motherChr12.vcf</tt> and <tt>fatherChr12.vcf</tt> files.
* You can e.g. create temporary files containing only positions from the two files and sort them alphabetically. Then you can find the intersection using [http://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html comm] command with options <tt>-1 -2</tt>. Alternatively, you can store positions as keys in a hash table (dictionary) in a Perl or Python script.
+
* You can create temporary files containing only positions from the two files and sort them alphabetically. Then you can find the intersection using [http://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html comm] command with options <tt>-1 -2</tt>. Alternatively, you can store positions as keys in a hash table (dictionary) in a Perl or Python script.
  
 
(d) List the 5 most frequent pairs of reference/alternate allele in <tt>motherChr12.vcf</tt> and their frequencies. Do they correspond to transitions or transversions?
 
(d) List the 5 most frequent pairs of reference/alternate allele in <tt>motherChr12.vcf</tt> and their frequencies. Do they correspond to transitions or transversions?
Line 97: Line 96:
 
* For each possible pair of nucleotides, find how many times it occurs in the <tt>motherChr12.vcf</tt> file.
 
* For each possible pair of nucleotides, find how many times it occurs in the <tt>motherChr12.vcf</tt> file.
 
* For example, pair <tt>C,A</tt> occurs 6894 times.
 
* For example, pair <tt>C,A</tt> occurs 6894 times.
* Then sort the pairs by their frequencies and report 5 most frequent pairs.
+
* Then sort the pairs by their frequencies and report the 5 most frequent pairs.
* Mutations can be classified as transitions and transversions. Transitions change purine to purine or pyrimidine to pyrimidine, transversions change a purine to pyrimidine or vice versa. For example, pair C,A is a transversion changing pyrimidine C to purine A. Which of these most frequent pairs correspond to transitions and which to transversions?  
+
* Mutations can be classified as transitions and transversions. Transitions change a purine to a purine or a pyrimidine to a pyrimidine, transversions change a purine to a pyrimidine or vice versa. For example, pair C,A is a transversion changing pyrimidine C to purine A. Which of these most frequent pairs correspond to transitions and which to transversions?  
 
* To count pairs without writing scripts, you can create a temporary file containing only columns 4 and 5 (without comments), and then use commands <tt>sort</tt> and <tt>uniq</tt> to count each pair.  
 
* To count pairs without writing scripts, you can create a temporary file containing only columns 4 and 5 (without comments), and then use commands <tt>sort</tt> and <tt>uniq</tt> to count each pair.  
  

Latest revision as of 19:47, 10 April 2024

See also the lecture

Submit the protocol and the required files to /submit/bioinf3

Input files

Copy files from /tasks/bioinf3/

mkdir bioinf3
cd bioinf3
cp -iv /tasks/bioinf3/* .

Files:

  • humanChr7Region.fasta is a 7kb piece of the human chromosome 7
  • motherChr7Region.fastq is a sample of reads from an anonymous donor known as NA12878; these reads come from region in humanChr7Region.fasta
  • fatherChr12.vcf and motherChr12.vcf are single-nucleotide variants on the chromosome 12 obtained by sequencing two individuals NA12877, NA12878 (these come from a larger family)

Task A: read mapping and variant calling

Align reads to the reference:

bwa index humanChr7Region.fasta
bwa mem humanChr7Region.fasta  motherChr7Region.fastq | \
  samtools view -S -b - |  samtools sort - motherChr7Region
samtools index motherChr7Region.bam

Call variants:

freebayes -f humanChr7Region.fasta --min-alternate-count 10 \
  motherChr7Region.bam >motherChr7Region.vcf

Run IGV, use humanChr7Region.fasta as genome, open motherChr7Region.bam and motherChr7Region.vcf. Looking at the aligned reads and the VCF file, answer the following questions:

(a) How many variants were found in the VCF file?

(b) How many variants are heterozygous and how many are homozygous?

(c) Are all variants single-nucleotide variants or do you also see some insertions/deletions (indels)?

Also export the overall view of the whole region from IGV to file motherChr7Region.png.

Submit the following files: motherChr7Region.png, motherChr7Region.bam, motherChr7Region.vcf

Task B: UCSC genome browser

(a) Where is the sequence from humanChr7Region.fasta file located in the browser?

  • Go to http://genome-euro.ucsc.edu/, from the blue menu, select Tools->Blat
  • Check that Blat uses Human, hg38 assembly
  • Open humanChr7Region.fasta in a graphical editor (e.g. kate), select all, paste to the BLAT window, run BLAT.
  • In the table of results, the best result should have identity close to 100% and span close to 7kb.
  • For this best result, click on the link named Browser.
  • Report which chromosome and which region you get.

(b) Which gene is located in this region?

  • Once you are in the browser, you should see a track called GENCODE (if not, you can turn it on in the section Genes and Gene Predictions).
  • This track contains known genes, shown as rectangles (exons) connected by lines (introns). Short gene names are next to them.
  • Report the name of the gene in the region.

(c) In which tissue is this gene most highly expressed? What is the function of this gene?

  • When you click on the gene (possibly twice), you get an information page which starts with a summary of the known function of this gene. Copy the first sentence to your protocol.
  • Further down on the gene information page you see RNA-Seq Expression Data (colorful boxplots). Find out which tissues have the highest signal.

(d) Which SNPs are located in this gene? Which trait do they influence?

  • You can see SNPs in the dbSNPs track, but their IDs appear only after switching this track to pack or full mode. You can click on each SNPs to see more information and to copy their ID to your protocol.
  • Information page of the gene (part c) describes function of various alleles of this gene (see e.g. part POLYMORPHISM).
  • You can also find information about individual SNPs by looking for them by their ID in SNPedia (not required in this task).

Task C: Examining larger vcf files

In this task, we will look at motherChr12.vcf and fatherChr12.vcf files and compute various statistics. You can use command-line tools, such as grep, wc, sort, uniq and Perl one-liners (as in Lbash), or you can write small scripts in Perl or Python (as in Lperl and Lsql).

  • Write all used commands to your protocol.
  • If you write any scripts, submit them as well.

Questions:

(a) How many SNPs are in each file?

  • This can be found easily by wc, only make sure to exclude lines with comments.

(b) How many heterozygous SNPs are in each file?

  • The last column contains 1|1 for homozygous and either 0|1 or 1|0 for heterozygous SNPs.
  • Character | has a special meaning on the command line and in grep patterns; make sure to place it in apostrophes ' ' and possibly escape it with backslash \.

(c) How many SNP positions are shared between the two files?

  • The second column of each file lists the position. We want to compute the size of intersection of the set of positions in motherChr12.vcf and fatherChr12.vcf files.
  • You can create temporary files containing only positions from the two files and sort them alphabetically. Then you can find the intersection using comm command with options -1 -2. Alternatively, you can store positions as keys in a hash table (dictionary) in a Perl or Python script.

(d) List the 5 most frequent pairs of reference/alternate allele in motherChr12.vcf and their frequencies. Do they correspond to transitions or transversions?

  • The fourth column contains the reference allele, fifth column the alternate allele. For example, the first SNP in motherChr12.vcf has a pair C,A.
  • For each possible pair of nucleotides, find how many times it occurs in the motherChr12.vcf file.
  • For example, pair C,A occurs 6894 times.
  • Then sort the pairs by their frequencies and report the 5 most frequent pairs.
  • Mutations can be classified as transitions and transversions. Transitions change a purine to a purine or a pyrimidine to a pyrimidine, transversions change a purine to a pyrimidine or vice versa. For example, pair C,A is a transversion changing pyrimidine C to purine A. Which of these most frequent pairs correspond to transitions and which to transversions?
  • To count pairs without writing scripts, you can create a temporary file containing only columns 4 and 5 (without comments), and then use commands sort and uniq to count each pair.

(e) Which parts of the chromosome have the highest and lowest density of SNPs in motherChr12.vcf?

  • First create a list of windows of size 100kb covering the whole chromosome 12 using these two commands:
perl -le 'print "chr12\t133275309"' > humanChr12.size
bedtools makewindows -g humanChr12.size -w 100000 -i srcwinnum > humanChr12-windows.bed
  • Then count SNPs in each window using this command:
bedtools coverage -a  humanChr12-windows.bed -b motherChr12.vcf > motherChr12-windows.tab
  • Find out which column of the resulting file contains the number of SNPs per window, e.g. by reading the documentation obtained by command bedtools coverage -h
  • Sort according to the column with the SNP number, look at the first and last line of the sorted file.
  • For checking: the second highest count is 387 in window with coordinates 20,800,000-20,900,000.