In this exercise, we will annotate a short portion of the genome of the fungus Aspergillus nidulans using the Augustus gene finder and compare the Augustus annotation with RNA-seq results and with the reference annotation from the RefSeq database.
Augustus is based on a probabilistic model whose parameters describe various properties of a genome, such as codon frequencies, splicing motifs, and the lengths and numbers of introns and exons. Since these features vary between genomes, the parameters need to be trained for each new genome or for a group of closely related genomes. To see this effect, we will run Augustus with both A. nidulans parameters and parameters for the human genome.
You will find approximately 38 kb of the A. nidulans genome in the file ref2.fasta. Copy it to your working directory, then run the following commands:
augustus --species=anidulans ref2.fasta > augustus-anidulans.gtf
augustus --species=human ref2.fasta > augustus-human.gtf
Look at the resulting GTF files. Suggest a simple way to count the number of genes in each file (e.g. using the grep command). Which file contains more genes? Which file contains more exons (represented by CDS records)?
The file annot.gff contains the gene annotation from the RefSeq database. It is in GFF3 format, which is similar to GTF. How many genes and CDS rows are in this file?
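One possible approach to counting, sketched here on a tiny invented file rather than on the exercise data, is to test the feature type in the third tab-separated column. The file name toy.gtf, the records, and the variable names are assumptions made for illustration only. The sketch also shows why a plain grep for the word gene can over-count: the word may occur in the attribute column of other rows as well.

```shell
# Build a toy annotation file (invented records, not the exercise data).
tmpdir=$(mktemp -d)
{
  printf '# toy annotation, invented for illustration\n'
  printf 'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "g1";\n'
  printf 'chr1\ttoy\tCDS\t100\t400\t.\t+\t0\tgene_id "g1";\n'
  printf 'chr1\ttoy\tCDS\t500\t900\t.\t+\t0\tgene_id "g1";\n'
  printf 'chr1\ttoy\tgene\t1500\t2000\t.\t-\t.\tgene_id "g2";\n'
  printf 'chr1\ttoy\tCDS\t1500\t2000\t.\t-\t0\tgene_id "g2";\n'
} > "$tmpdir/toy.gtf"

# Naive count: every line containing "gene" anywhere, including attributes.
naive_count=$(grep -c gene "$tmpdir/toy.gtf")

# Stricter count: skip comment lines and compare only the third column.
gene_count=$(awk -F'\t' '!/^#/ && $3 == "gene" {n++} END {print n+0}' "$tmpdir/toy.gtf")
cds_count=$(awk -F'\t' '!/^#/ && $3 == "CDS" {n++} END {print n+0}' "$tmpdir/toy.gtf")

echo "naive: $naive_count, genes: $gene_count, CDS rows: $cds_count"

# Tabulate all feature types present in the file.
grep -v '^#' "$tmpdir/toy.gtf" | cut -f3 | sort | uniq -c

rm -rf "$tmpdir"
```

The same column test should carry over to GFF3 files, since both formats keep the feature type in the third column.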
RNA-seq reads were extracted from SRA record SRR4048918. Look at the file rnaseq.fastq. How long are the reads?
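Read lengths can be summarized from the command line. The following is a minimal sketch on an invented three-read file (toy.fastq and its contents are assumptions for illustration), and it assumes the usual four-line FASTQ layout in which the sequence is the second line of each record.

```shell
# Build a toy FASTQ file with three invented reads.
tmpdir=$(mktemp -d)
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nTTGGCCAA\n+\nIIIIIIII\n' > "$tmpdir/toy.fastq"

# Take the second line of every four-line record, measure its length,
# and print one "length:count" pair per distinct read length.
length_table=$(awk 'NR % 4 == 2 {print length($0)}' "$tmpdir/toy.fastq" \
  | sort -n | uniq -c | awk '{print $2 ":" $1}')

echo "$length_table"

rm -rf "$tmpdir"
```

If every read has the same length, the table has a single row; several rows would suggest that the reads were trimmed.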
We will use TopHat to align the reads to the genome. TopHat can align reads spanning exon boundaries, thus identifying intron positions. Run TopHat and sort and index the resulting BAM file as follows:
bowtie2-build ref2.fasta ref2.fasta
tophat2 -i 10 -I 10000 --max-multihits 1 --output-dir rnaseq ref2.fasta rnaseq.fastq
samtools sort rnaseq/accepted_hits.bam rnaseq
samtools index rnaseq.bam
In addition to the BAM file, TopHat produced several other files in the rnaseq folder. What can you learn from them?
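One of these files, junctions.bed, lists the splice junctions as BED records with two blocks, where the blocks are the read overhangs on each side of the junction and the score column holds the number of supporting alignments. As a sketch of how intron coordinates could be derived from such a record, the example below uses a single invented line (toy_junctions.bed and its values are assumptions, not output from the exercise) and converts it to 1-based intron start and end positions.

```shell
# Build a toy junction file with one invented record after a track line.
tmpdir=$(mktemp -d)
{
  printf 'track name=junctions description="toy junctions"\n'
  printf 'chr1\t100\t400\tJUNC00000001\t12\t+\t100\t400\t255,0,0\t2\t30,25\t0,275\n'
} > "$tmpdir/toy_junctions.bed"

# The intron lies between the two blocks:
#   0-based start = chromStart + first block size
#   end           = chromEnd - second block size
# Adding 1 to the start gives 1-based inclusive coordinates.
introns=$(awk -F'\t' '!/^track/ {
  split($11, size, ",")
  print $1, $2 + size[1] + 1, $3 - size[2], $5
}' "$tmpdir/toy_junctions.bed")

# Columns: sequence name, intron start, intron end, supporting alignments.
echo "$introns"

rm -rf "$tmpdir"
```

Intron coordinates obtained this way can be compared by eye with the gaps between CDS records in the annotation files.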
You can use the IGV browser to open all three annotation files (augustus-*.gtf, annot.gff) as well as the RNA-seq BAM file and compare them.
Questions: