1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
HWbioinf2
See also the lecture
Submit the protocol and the required files to /submit/bioinf2
Contents
Input files
Copy files from /tasks/bioinf2/
mkdir bioinf2
cd bioinf2
cp -iv /tasks/bioinf2/* .
Files:
- ref.fasta is a 38kb piece of the genome of the fungus Aspergillus nidulans
- rnaseq.fastq are RNA-seq reads from Illumina sequencer extracted from the Short read archive
- annot.gff is the reference gene annotation from the database (we will consider this as correct gene positions)
Task A: Gene finding
Run the Augustus gene finder with two versions of parameters:
- one trained specifically for A. nidulans genes
- one trained for the human genome, where genes have different statistical properties (for example, they are longer and have more introns)
augustus --species=anidulans ref.fasta > augustus-anidulans.gtf
augustus --species=human ref.fasta > augustus-human.gtf
The results of gene finding are in the GTF format. Rows starting with # are comments, each of the remaining rows describes some interval of the sequence. If the second column is CDS, it is a coding part of an exon. The reference annotation annot.gff is in a similar format called GFF3.
Examine the files and try to find the answers to the following questions using command-line tools
(a) How many exons are in each of the two GTF files? (Beware: simply using grep with pattern CDS may yield lines containing this string in a different column. You can use e.g. techniques from the lecture and exercises on command-line tools).
(b) How many genes are in each of the two GTF files? (The files contain rows with word gene in the second column, one for each gene)
(c) How many exons and genes are in the annot.gff file?
Write the anwsers and commands to the protocol. Submit files augustus-anidulans.gtf and augustus-human.gtf.
Task B: Aligning RNA-seq reads
- Align RNA-seq reads to the genome
- We will use a specialized tool hisat2, which can recognize introns
- Then we will sort and index the BAM file, similarly as in the previous lecture
hisat2-build ref.fasta ref.fasta
hisat2 -x ref.fasta -U rnaseq.fastq -S rnaseq.sam -k 1 --min-intronlen 20 --max-intronlen 10000 --novel-splicesite-outfile introns.txt
samtools sort -O BAM -o rnaseq.bam rnaseq.sam
samtools index rnaseq.bam
In addition to the BAM file, we produced a file containing the position of detected introns. Examine the files to find out answers to the following questions (you can do it manually by looking at the the files, e.g. by less command):
(a) How many reads were in the FASTQ file? How many of them were successfully mapped?
(b) How many introns ("junctions") were predicted?
(c) During the mapping, we used a few custom options. Inspect the manual pages of hisat2 and samtools and describe shortly what those options mean.
Write answers to the protocol. Submit the file rnaseq.bam.
Task C: Visualizing in IGV
As before, run IGV as follows:
igv -g ref.fasta &
Open additional files using menu File -> Load from File: annot.gff, augustus-anidulans.gtf, augustus-human.gtf, rnaseq.bam
- Exons are shown as thicker boxes, introns are thinner.
- For each of the following questions, select a part of the sequence illustrating the answer and export a figure using File->Save image
- You can check these images using command eog
Questions:
(a) Create an image illustrating differences between Augustus with human parameters and the reference annotation, save as a.png. Briefly describe the differences in words.
(b) Find some differences between Augustus with A. nidulans parameters and the reference annotation. Store an illustrative figure as b.png. Which parameters have yielded a more accurate prediction?
(c) Zoom in to one of the genes with a high expression level and try to find spliced read alignments supporting the annotated intron boundaries. Store the image as c.png.
Submit files a.png, b.png, c.png. Write answers to your protocol.