Exercise 2: Genome annotation, RNA-seq

In this exercise, we will annotate a short portion of the genome of the fungus Aspergillus nidulans using the Augustus gene finder, and compare the Augustus annotation with RNA-seq results and with the reference annotation from the RefSeq database.

Task A: Running Augustus

Augustus is based on a probabilistic model whose parameters describe various properties of a genome, such as codon frequencies, splicing motifs, and the lengths and numbers of introns and exons. Since these features vary between genomes, the parameters need to be trained for each new genome or for a group of closely related genomes. To see this effect, we will run Augustus with both the A. nidulans parameters and the parameters for the human genome.

Step 1: Run Augustus

You will find approximately 38 kb of the A. nidulans genome in the file ref2.fasta. Place it in your working directory, then run the following commands:

augustus --species=anidulans ref2.fasta > augustus-anidulans.gtf
augustus --species=human ref2.fasta > augustus-human.gtf

Step 2: Examine the output

Look at the resulting GTF files. Suggest a simple way to count the number of genes in each file (e.g. using the grep command). Which file contains more genes? Which file contains more exons (represented by CDS records)?
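One possible approach is sketched below on a made-up file, toy.gtf, that only imitates the tab-separated layout of a GTF file (it is not real Augustus output, and the gene names in it are hypothetical). The idea is to match the feature name together with the tabs around it, so that the word "gene" inside attributes such as gene_id is not counted.

```shell
# toy.gtf is a made-up example for illustration, not real Augustus output.
{
  printf '# comment line\n'
  printf 'chr\tAUGUSTUS\tgene\t100\t900\t0.9\t+\t.\tg1\n'
  printf 'chr\tAUGUSTUS\ttranscript\t100\t900\t0.9\t+\t.\tg1.t1\n'
  printf 'chr\tAUGUSTUS\tCDS\t100\t400\t0.9\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n'
  printf 'chr\tAUGUSTUS\tCDS\t600\t900\t0.9\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n'
  printf 'chr\tAUGUSTUS\tgene\t1500\t2000\t0.8\t-\t.\tg2\n'
  printf 'chr\tAUGUSTUS\ttranscript\t1500\t2000\t0.8\t-\t.\tg2.t1\n'
  printf 'chr\tAUGUSTUS\tCDS\t1500\t2000\t0.8\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n'
} > toy.gtf

# grep -c prints the number of matching lines. The tabs on both sides
# restrict the match to a whole tab-separated field.
TAB=$(printf '\t')
genes=$(grep -c "${TAB}gene${TAB}" toy.gtf)
cds=$(grep -c "${TAB}CDS${TAB}" toy.gtf)
echo "genes: $genes, CDS rows: $cds"
```

To apply the same idea to the exercise, replace toy.gtf with augustus-anidulans.gtf or augustus-human.gtf.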

The file annot.gff contains the gene annotation from the RefSeq database. It is in GFF3 format, which is similar to GTF. How many gene and CDS rows are in this file?
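Because GFF3 also stores the feature type in the third tab-separated column, a tally of that column gives an overview of all row types at once. The sketch below uses a made-up file, toy.gff, with hypothetical identifiers; the real annot.gff may contain other feature types as well.

```shell
# toy.gff is a made-up example for illustration, not the RefSeq annotation.
{
  printf '##gff-version 3\n'
  printf 'chr\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene0\n'
  printf 'chr\tRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna0;Parent=gene0\n'
  printf 'chr\tRefSeq\texon\t100\t400\t.\t+\t.\tParent=rna0\n'
  printf 'chr\tRefSeq\tCDS\t100\t400\t.\t+\t0\tParent=rna0\n'
  printf 'chr\tRefSeq\texon\t600\t900\t.\t+\t.\tParent=rna0\n'
  printf 'chr\tRefSeq\tCDS\t600\t900\t.\t+\t0\tParent=rna0\n'
} > toy.gff

# Drop header and comment lines (starting with #), keep column 3,
# and count how often each feature type occurs.
grep -v '^#' toy.gff | cut -f 3 | sort | uniq -c

# The same counts for two selected feature types, as plain numbers.
gff_genes=$(awk -F '\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' toy.gff)
gff_cds=$(awk -F '\t' '!/^#/ && $3 == "CDS" { n++ } END { print n + 0 }' toy.gff)
echo "genes: $gff_genes, CDS rows: $gff_cds"
```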

Task B: Aligning RNA-seq reads

The RNA-seq reads were extracted from the SRA record SRR4048918. Look at the file rnaseq.fastq. How long are the reads?
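Read lengths can also be summarized on the command line rather than by eye. The sketch below assumes the usual four-line FASTQ record layout (header, sequence, plus line, qualities) with sequences that are not wrapped over several lines. It runs on a made-up file, toy.fastq.

```shell
# toy.fastq is a made-up two-read example; each record has four lines.
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'IIIIIIIIII' \
  '@read2' 'ACGTACGT' '+' 'IIIIIIII' > toy.fastq

# The second line of every four-line record is the sequence. Print its
# length, then count how often each length occurs.
lengths=$(awk 'NR % 4 == 2 { print length($0) }' toy.fastq | sort -n | uniq -c)
echo "$lengths"

# Number of reads = number of lines divided by four.
nreads=$(awk 'END { print NR / 4 }' toy.fastq)
echo "reads: $nreads"
```

For the exercise, replace toy.fastq with rnaseq.fastq.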

Step 1: Run TopHat

We will use TopHat to align the reads to the genome. TopHat can align reads that span exon boundaries, and thereby identifies intron positions. Run TopHat, then sort and index the resulting BAM file as follows:

bowtie2-build ref2.fasta ref2.fasta
tophat2 -i 10 -I 10000 --max-multihits 1 --output-dir rnaseq ref2.fasta rnaseq.fastq
samtools sort rnaseq/accepted_hits.bam rnaseq
samtools index rnaseq.bam

Step 2: Examine the output

In addition to the BAM file, TopHat produced several other files in the rnaseq folder. What can you learn from them?
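One of these files is junctions.bed, which lists the splice junctions TopHat found. As far as the TopHat documentation describes it, each junction is a BED line with two blocks (the read overhangs on either side of the intron), and the score column holds the number of alignments spanning the junction. Under that assumption, the sketch below converts each line to intron coordinates. It runs on a made-up file, toy_junctions.bed, whose junction names and numbers are hypothetical.

```shell
# toy_junctions.bed is a made-up example in the layout described above.
{
  printf 'track name=junctions description="TopHat junctions"\n'
  printf 'chr\t90\t610\tJUNC00000001\t12\t+\t90\t610\t255,0,0\t2\t10,10\t0,510\n'
  printf 'chr\t1390\t1710\tJUNC00000002\t3\t-\t1390\t1710\t255,0,0\t2\t10,20\t0,300\n'
} > toy_junctions.bed

# Column 2 is the 0-based start of the left block, column 11 holds the
# block sizes and column 12 the block starts relative to column 2.
# The intron lies between the end of block 1 and the start of block 2.
introns=$(awk -F '\t' '!/^track/ {
  split($11, size, ",")
  split($12, start, ",")
  intron_start = $2 + size[1] + 1   # first intron base, 1-based
  intron_end = $2 + start[2]        # last intron base, 1-based
  print $1, intron_start, intron_end, $5
}' toy_junctions.bed)
echo "$introns"
```

Each output line gives the sequence name, the first and last base of the intron, and the number of supporting alignments. For the exercise, the input would be rnaseq/junctions.bed.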

Task C: Visualizing the results

You can use the IGV browser to open all three annotation files (augustus-*.gtf, annot.gff) as well as the RNA-seq BAM file and compare them.

  1. Start the IGV browser with the igv command.
  2. Select the reference file ref2.fasta using the menu Genomes -> Load Genome from File.
  3. Open all additional files using the menu File -> Load from File.
  4. In the annotation tracks, exons are shown as thicker boxes and introns as thinner lines.

Questions: