Exercise 3: Comparative genomics / Positive selection

IGF1R (insulin-like growth factor 1 receptor) is a gene central to several growth pathways. It has been found to be under positive selection in the marmoset genome, and this is likely strongly related to the small stature of marmosets.

Marmoset Genome Sequencing and Analysis Consortium. The common marmoset genome provides insight into primate biology and evolution. Nature Genetics, 46(8):850-857, 2014.

In this exercise we will attempt to reconstruct some of the findings from the paper.

Step 1: Download files and install necessary packages

You can find all data files in the data subdirectory.

You will need some additional packages, if you did not install them previously (install as root):

sudo apt-get install muscle paml seaview pymol bioperl

Step 2: Look at the files

We will start from the alignment file which stores DNA sequences of IGF1R in several mammals (the alignment is in Phylip format): data/igf1r.phy

Species are named using UCSC Genome Browser nomenclature: hg - human, panTro - chimp, ponAbe - orangutan, rheMac - macaque, calJac - marmoset, mm9 - mouse, rn4 - rat, canFam - dog

Question: Explore this file (you can either look at the file directly or open it in the seaview viewer) and look at the differences between individual sequences.
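
If you prefer numbers to eyeballing the alignment, the differences can also be counted with a few lines of Python. The following is only a minimal sketch: it assumes a sequential Phylip file with one "name sequence" record per line, and the helper names read_phylip and pairwise_differences as well as the toy alignment are made up for illustration. If data/igf1r.phy is interleaved, the length check will fail and a full parser (for example Biopython's AlignIO) would be a better choice.

```python
from itertools import combinations


def read_phylip(text):
    """Parse a sequential Phylip alignment: a header line, then one
    'name sequence' record per line (assumption, see the text above)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seqs, n_sites = (int(x) for x in lines[0].split())
    alignment = {}
    for line in lines[1:1 + n_seqs]:
        name, *chunks = line.split()
        alignment[name] = "".join(chunks).upper()
    # Sanity check against the header; fails for interleaved files.
    for name, seq in alignment.items():
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, found {len(seq)}")
    return alignment


def pairwise_differences(alignment):
    """Count columns where two sequences differ, skipping gaps and N."""
    skip = set("-?N")
    counts = {}
    for a, b in combinations(alignment, 2):
        counts[(a, b)] = sum(
            1
            for x, y in zip(alignment[a], alignment[b])
            if x != y and x not in skip and y not in skip
        )
    return counts


# Toy alignment for illustration only (NOT the real IGF1R data).
toy = """3 9
hg      ATGAAGTCT
panTro  ATGAAGTCC
calJac  ATGCAG-CT
"""

for pair, n in pairwise_differences(read_phylip(toy)).items():
    print(pair, n)
```

To try it on the real data, replace the toy string with open("data/igf1r.phy").read().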

Look at the file data/tree_marmoset.nh, which contains the phylogenetic tree that we will use in the rest of the analysis. Note that in tree_marmoset.nh the branch leading to marmoset is marked with the label #1.
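
To make the role of the #1 mark concrete, here is a small sketch that reports which leaves of a Newick string carry such a label. The tree string below is an illustrative guess built from the species prefixes above; the real tree_marmoset.nh may use different names, branch lengths, or layout. The function name foreground_leaves is likewise made up, and it only looks at labels on leaves, not on internal branches.

```python
import re


def foreground_leaves(newick):
    """Return (leaf name, label number) for every leaf that is followed
    by a PAML-style branch label such as '#1'."""
    # A leaf name comes right after '(' or ','. An optional ':length'
    # may sit between the name and the label.
    pattern = re.compile(
        r"[(,]\s*([A-Za-z0-9_.]+)\s*(?::[0-9.eE+-]+)?\s*#(\d+)"
    )
    return [(name, int(label)) for name, label in pattern.findall(newick)]


# Illustrative tree only; not copied from data/tree_marmoset.nh.
example_tree = "(((((hg,panTro),ponAbe),rheMac),calJac #1),(mm9,rn4),canFam);"

print(foreground_leaves(example_tree))
```

For the real file, pass open("data/tree_marmoset.nh").read() instead of example_tree.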

Optional question: Use the alignment to build a phylogenetic tree (e.g. using the program phyml). Does this tree differ from what you would expect? Are there any weird branch lengths?

Step 3: Identify sites under positive selection

In this step, we will try to identify sites that are under positive selection in the marmoset lineage. We will use the "Bayes Empirical Bayes" method from the PAML software.
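
In PAML, this kind of analysis is usually run with the codeml program, configured for the branch-site model A (model = 2, NSsites = 2) on a tree whose foreground branch is marked with #1. The sketch below writes such a control file from Python. It is an assumption about a typical setup, not a description of what the run_beb script does; the function name branch_site_ctl and the output file names are hypothetical, and the option values are common defaults rather than the settings used in the paper.

```python
def branch_site_ctl(seqfile, treefile, outfile, null_model=False):
    """Build the text of a codeml control file for branch-site model A.

    null_model=True fixes omega at 1 for the foreground site class;
    null_model=False lets codeml estimate it (the run in which sites
    under positive selection can be reported).
    """
    options = {
        "seqfile": seqfile,
        "treefile": treefile,
        "outfile": outfile,
        "noisy": 3,
        "verbose": 1,
        "runmode": 0,    # use the user-supplied tree
        "seqtype": 1,    # codon sequences
        "CodonFreq": 2,  # F3x4 codon frequencies
        "model": 2,      # separate omega for branches labelled #1
        "NSsites": 2,    # together with model = 2: branch-site model A
        "icode": 0,      # universal genetic code
        "fix_kappa": 0,  # estimate kappa
        "kappa": 2,      # starting value for kappa
        "fix_omega": 1 if null_model else 0,
        "omega": 1,      # fixed value (null) or starting value (alternative)
        "cleandata": 0,  # keep columns containing gaps
    }
    return "\n".join(f"{key:>10} = {value}" for key, value in options.items()) + "\n"


# Hypothetical file names for the control file and the codeml output.
ctl_text = branch_site_ctl("data/igf1r.phy", "data/tree_marmoset.nh", "example.mlc")
with open("example.ctl", "w") as handle:
    handle.write(ctl_text)
print(ctl_text)
```

The resulting file would then be passed to codeml (codeml example.ctl). With fix_omega = 0, the output file should contain a "Bayes Empirical Bayes (BEB)" section listing candidate sites together with their posterior probabilities.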

How would you examine positive selection on the branch leading to macaque instead of marmoset? (You will need the files myout.* later, so if you use run_beb again, change the last parameter.)

Step 4: Find a good reference

To determine whether the sites we have found have any relevance to the protein function, we need to find a good reference. We will look for it in the PDB (Protein Data Bank).

Step 5: Mapping sites to the reference

Step 6: Viewing the results