1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "HWbioinf2"

From MAD
Jump to navigation Jump to search
 
(11 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<!-- NOTEX -->
+
See also the [[Lbioinf2|lecture]].
See also the [[Lbioinf2|lecture]]
 
  
 
Submit the protocol and the required files to <tt>/submit/bioinf2</tt>
 
Submit the protocol and the required files to <tt>/submit/bioinf2</tt>
<!-- /NOTEX -->
 
  
 
===Input files===
 
===Input files===
Line 21: Line 19:
  
 
Run the Augustus gene finder with two versions of parameters:
 
Run the Augustus gene finder with two versions of parameters:
* one trained specifically for ''A. nidulas'' genes
+
* one trained specifically for ''A. nidulans'' genes
 
* one trained for the human genome, where genes have different statistical properties (for example, they are longer and have more introns)
 
* one trained for the human genome, where genes have different statistical properties (for example, they are longer and have more introns)
 
<syntaxhighlight lang="bash">
 
<syntaxhighlight lang="bash">
Line 28: Line 26:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
The results of gene finding are in the [http://mblab.wustl.edu/GTF22.html GTF format]. Rows starting with <tt>#</tt> are comments, each of the remaining rows describes some interval of the sequence. If the second column is <tt>CDS</tt>, it is a coding part of an exon. The reference annotation <tt>annot.gff</tt> is in a similar format called [http://gmod.org/wiki/GFF3 GFF3].
+
The results of gene finding are in the GTF format. Examine the files and try to find the answers to the following questions using command-line tools.
 
 
Examine the files and try to find the answers to the following questions using command-line tools
 
  
 
(a) How many exons are in each of the two GTF files? (Beware: simply using <tt>grep</tt> with pattern <tt>CDS</tt> may yield lines containing this string in a different column. You can use e.g. techniques from the [[Lbash|lecture]] and [[HWbash|exercises]] on command-line tools).
 
(a) How many exons are in each of the two GTF files? (Beware: simply using <tt>grep</tt> with pattern <tt>CDS</tt> may yield lines containing this string in a different column. You can use e.g. techniques from the [[Lbash|lecture]] and [[HWbash|exercises]] on command-line tools).
Line 36: Line 32:
 
(b) How many genes are in each of the two GTF files? (The files contain rows with word <tt>gene</tt> in the second column, one for each gene)
 
(b) How many genes are in each of the two GTF files? (The files contain rows with word <tt>gene</tt> in the second column, one for each gene)
  
(c) How many exons and genes are in the <tt>annot.gff</tt> file?
+
(c) How many exons and genes are in the <tt>annot.gff</tt> file with reference annotation?
  
<!-- NOTEX -->
+
Write the answers and commands to the '''protocol'''. '''Submit''' files <tt>augustus-anidulans.gtf</tt> and <tt>augustus-human.gtf</tt>.
Write the anwsers and commands to the '''protocol'''. '''Submit''' files <tt>augustus-anidulans.gtf</tt> and <tt>augustus-human.gtf</tt>.
 
<!-- /NOTEX -->
 
  
 
===Task B: Aligning RNA-seq reads===
 
===Task B: Aligning RNA-seq reads===
  
* Align RNA-seq reads to the genome
+
* Align RNA-seq reads to the genome.
* We will use a specialized tool <tt>tophat</tt>, which can recognize introns
+
* We will use a specialized tool <tt>STAR</tt>, which can recognize introns.
* Then we will sort and index the BAM file, similarly as in the [[HWbioinf1|previous lecture]]
+
* First, run the following commands:
  
 
<syntaxhighlight lang="bash">
 
<syntaxhighlight lang="bash">
bowtie2-build ref.fasta ref.fasta
+
STAR --runMode genomeGenerate --genomeDir ref-index --genomeFastaFiles ref.fasta --genomeSAindexNbases 6
tophat2 -i 10 -I 10000 --max-multihits 1 --output-dir rnaseq ref.fasta rnaseq.fastq
+
STAR --genomeDir ref-index --alignIntronMax 10000 --readFilesIn rnaseq.fastq --outFileNamePrefix rnaseq-star.
samtools sort rnaseq/accepted_hits.bam rnaseq
 
samtools index rnaseq.bam
 
 
</syntaxhighlight>
 
</syntaxhighlight>
  
In addition to the BAM file, TopHat produced several other files in the <tt>rnaseq</tt> folder. Examine them to find out answers to the following questions (you can do it manually by looking at the the files, e.g. by <tt>less</tt> command):
+
* Then sort the resulting SAM file using samtools, store it as a BAM file and create its index, similarly as in the [[HWbioinf1|previous homework]].
 +
* In addition to the BAM file, we produced a file <tt>rnaseq-star.SJ.out.tab</tt> containing the position of detected introns (here called splice junctions or splices). Find in the [https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf manual of STAR] description of this file.
 +
* There are also additional files with logs and statistics.
 +
 
 +
Examine the files to find out answers to the following questions (you can find some of the answers manually by looking at the the files, e.g. by <tt>less</tt> command):
 +
 
 +
(a) How many reads were in the FASTQ file? How many of them were successfully mapped?
  
(a) How many reads were in the the FASTQ file? How many of them were successfully mapped?
+
(b) How many introns (splice junctions) were predicted? How many of them are supported by more than one read?  
  
(b) How many introns ("junctions") were predicted? How many of them are supported by more than one read? (The 5th column of the corresponding file is the number of reads supporting a junction.)
+
Finally, convert the file with splice junctions to BED format in which each line will be one intron. It should have the following columns:
 +
* Sequence name (as in the <tt>SJ.out.tab</tt> file)
 +
* Start (beware, it should be 1 less than the number in the SJ.out.tab file because of 1-based vs. 0-based coordinates)
 +
* End (as in the <tt>SJ.out.tab</tt> file)
 +
* Name (create some identifier, e.g. numbering the junctions sequentially)
 +
* Score (use the number of supporting reads as score)
 +
* Strand (+, -, or ., needs to be converted from numeric value in <tt>SJ.out.tab</tt>)
 +
For conversion, you can write a short script in your favorite language or use a one-liner. The result should be named <tt>rnaseq-star-introns.bed</tt>.
  
<!-- NOTEX -->
+
Write your answers to the '''protocol'''. '''Submit''' the files <tt>rnaseq-star.bam</tt> and <tt>rnaseq-star-introns.bed</tt>.
Write anwsers to the '''protocol'''. '''Submit''' the file <tt>rnaseq.bam</tt>.
 
<!-- /NOTEX -->
 
  
 
===Task C: Visualizing in IGV===
 
===Task C: Visualizing in IGV===
Line 72: Line 75:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
Open additional files using menu <tt>File -> Load from File</tt>: <tt>annot.gff, augustus-anidulans.gtf, augustus-human.gtf, rnaseq.bam</tt>
+
Open additional files using menu <tt>File -> Load from File</tt>: <tt>annot.gff, augustus-anidulans.gtf, augustus-human.gtf, rnaseq.bam, rnaseq-star-introns.bed</tt>
* Exons are shown as thicker boxes, introns are thinner.  
+
* In GTF and GFF files, exons are shown as thicker boxes, introns are thinner.  
 
* For each of the following questions, select a part of the sequence illustrating the answer and export a figure using <tt>File->Save image</tt>
 
* For each of the following questions, select a part of the sequence illustrating the answer and export a figure using <tt>File->Save image</tt>
 
* You can check these images using command <tt>eog</tt>
 
* You can check these images using command <tt>eog</tt>
Line 83: Line 86:
 
(b) Find some differences between Augustus with ''A. nidulans'' parameters and the reference annotation. Store an illustrative figure as <tt>b.png</tt>. Which parameters have yielded a more accurate prediction?
 
(b) Find some differences between Augustus with ''A. nidulans'' parameters and the reference annotation. Store an illustrative figure as <tt>b.png</tt>. Which parameters have yielded a more accurate prediction?
  
(c) Zoom in to one of the genes with high expression level and try to find spliced read alignments supporting the annotated intron boundaries. Store the image as <tt>c.png</tt>.
+
(c) Zoom in to one of the genes with a high expression level and try to find spliced read alignments supporting the annotated intron boundaries. Store the image as <tt>c.png</tt>.
  
 
<!-- NOTEX -->
 
<!-- NOTEX -->
 
'''Submit''' files <tt>a.png, b.png, c.png</tt>. Write answers to your '''protocol'''.
 
'''Submit''' files <tt>a.png, b.png, c.png</tt>. Write answers to your '''protocol'''.
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->

Latest revision as of 06:54, 4 April 2024

See also the lecture.

Submit the protocol and the required files to /submit/bioinf2

Input files

Copy files from /tasks/bioinf2/

mkdir bioinf2
cd bioinf2
cp -iv /tasks/bioinf2/* .

Files:

  • ref.fasta is a 38kb piece of the genome of the fungus Aspergillus nidulans
  • rnaseq.fastq are RNA-seq reads from Illumina sequencer extracted from the Short read archive
  • annot.gff is the reference gene annotation from the database (we will consider this as correct gene positions)

Task A: Gene finding

Run the Augustus gene finder with two versions of parameters:

  • one trained specifically for A. nidulans genes
  • one trained for the human genome, where genes have different statistical properties (for example, they are longer and have more introns)
augustus --species=anidulans ref.fasta > augustus-anidulans.gtf
augustus --species=human ref.fasta > augustus-human.gtf

The results of gene finding are in the GTF format. Examine the files and try to find the answers to the following questions using command-line tools.

(a) How many exons are in each of the two GTF files? (Beware: simply using grep with pattern CDS may yield lines containing this string in a different column. You can use e.g. techniques from the lecture and exercises on command-line tools).

(b) How many genes are in each of the two GTF files? (The files contain rows with word gene in the second column, one for each gene)

(c) How many exons and genes are in the annot.gff file with reference annotation?

Write the answers and commands to the protocol. Submit files augustus-anidulans.gtf and augustus-human.gtf.

Task B: Aligning RNA-seq reads

  • Align RNA-seq reads to the genome.
  • We will use a specialized tool STAR, which can recognize introns.
  • First, run the following commands:
STAR --runMode genomeGenerate --genomeDir ref-index --genomeFastaFiles ref.fasta  --genomeSAindexNbases 6
STAR --genomeDir ref-index --alignIntronMax 10000 --readFilesIn rnaseq.fastq  --outFileNamePrefix rnaseq-star.
  • Then sort the resulting SAM file using samtools, store it as a BAM file and create its index, similarly as in the previous homework.
  • In addition to the BAM file, we produced a file rnaseq-star.SJ.out.tab containing the position of detected introns (here called splice junctions or splices). Find in the manual of STAR description of this file.
  • There are also additional files with logs and statistics.

Examine the files to find out answers to the following questions (you can find some of the answers manually by looking at the the files, e.g. by less command):

(a) How many reads were in the FASTQ file? How many of them were successfully mapped?

(b) How many introns (splice junctions) were predicted? How many of them are supported by more than one read?

Finally, convert the file with splice junctions to BED format in which each line will be one intron. It should have the following columns:

  • Sequence name (as in the SJ.out.tab file)
  • Start (beware, it should be 1 less than the number in the SJ.out.tab file because of 1-based vs. 0-based coordinates)
  • End (as in the SJ.out.tab file)
  • Name (create some identifier, e.g. numbering the junctions sequentially)
  • Score (use the number of supporting reads as score)
  • Strand (+, -, or ., needs to be converted from numeric value in SJ.out.tab)

For conversion, you can write a short script in your favorite language or use a one-liner. The result should be named rnaseq-star-introns.bed.

Write your answers to the protocol. Submit the files rnaseq-star.bam and rnaseq-star-introns.bed.

Task C: Visualizing in IGV

As before, run IGV as follows:

igv -g ref.fasta &

Open additional files using menu File -> Load from File: annot.gff, augustus-anidulans.gtf, augustus-human.gtf, rnaseq.bam, rnaseq-star-introns.bed

  • In GTF and GFF files, exons are shown as thicker boxes, introns are thinner.
  • For each of the following questions, select a part of the sequence illustrating the answer and export a figure using File->Save image
  • You can check these images using command eog

Questions:

(a) Create an image illustrating differences between Augustus with human parameters and the reference annotation, save as a.png. Briefly describe the differences in words.

(b) Find some differences between Augustus with A. nidulans parameters and the reference annotation. Store an illustrative figure as b.png. Which parameters have yielded a more accurate prediction?

(c) Zoom in to one of the genes with a high expression level and try to find spliced read alignments supporting the annotated intron boundaries. Store the image as c.png.

Submit files a.png, b.png, c.png. Write answers to your protocol.