2-INF-185 Integrácia dátových zdrojov 2018/19
See also Lecture 3
Motivation: Building Phylogenetic Trees
The task for today will be to build a phylogenetic tree of 9 mammalian species using protein sequences
- A phylogenetic tree is a tree showing evolutionary history of these species. Leaves are target present-day species, internal nodes are their common ancestors.
- Input contains sequences of selected proteins from each species
- Step 1: Identify ortholog groups. Orthologs are proteins from different species that "correspond" to each other. This is done based on sequence similarity and we can use a tool called blast to identify sequence similarities between individual proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species
- Step 2: For each ortholog group, we need to align proteins and build a phylogenetic tree for this protein using existing methods. We can do this using tools muscle (for alignment) and phyml (for phylogenetic tree inference).
Unaligned sequences (start of protein O60568):
>human MTSSGPGPRFLLLLPLLLPPAASASDRPRGRDPVNPEKLLVITVA... >baboon MTSSRPGLRLLLLLLLLPPAASASDRPRGRDPVNPEKLLVMTVA... >dog MASSGPGLRLLLGLLLLLPPPPATSASDRPRGGDPVNPEKLLVITVA... >elephant MASWGPGARLLLLLLLLLLPPPPATSASDRSRGSDRVNPERLLVITVA... >guineapig MAFGAWLLLLPLLLLPPPPGACASDQPRGSNPVNPEKLLVITVA... >opossum SDKLLVITAA... >pig AMASGPGLRLLLLPLLVLSPPPAASASDRPRGSDPVNPDKLLVITVA... >rabbit MGCDSRKPLLLLPLLPLALVLQPWSARGRASAEEPSSISPDKLLVITVA... >rat MAASVPEPRLLLLLLLLLPPLPPVTSASDRPRGANPVNPDKLLVITVA...
rabbit MGCDSRKPLL LLPLLPLALV LQPW-SARGR ASAEEPSSIS PDKLLVITVA ... guineapig MAFGA----W LLLLPLLLLP PPPGACASDQ PRGSNP--VN PEKLLVITVA ... opossum ---------- ---------- ---------- ---------- SDKLLVITAA ... rat MAASVPEPRL LLLLLLLLPP LPPVTSASDR PRGANP--VN PDKLLVITVA ... elephant MASWGPGARL LLLLLLLLLP PPPATSASDR SRGSDR--VN PERLLVITVA ... human MTSSGPGPRF LLLLPLLL-- -PPAASASDR PRGRDP--VN PEKLLVITVA ... baboon MTSSRPGLRL LLLLLLL--- -PPAASASDR PRGRDP--VN PEKLLVMTVA ... dog MASSGPGLRL LLGLLLLL-P PPPATSASDR PRGGDP--VN PEKLLVITVA ... pig AMASGPGLR- LLLLPLLVLS PPPAASASDR PRGSDP--VN PDKLLVITVA ...
Phylogenetic tree in newick format:
- Step 3: The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using an interactive tool called phylip.
- Output is a single consensus tree.
Files and submitting
Our goal for today is to build a pipeline that automates the whole task using make and execute it remotely using qsub. Most of the work is already done, only small modifications are necessary.
- Submit by copying requested files to /submit/hw03/username/
- Do not forget to submit protocol, outline of the protocol is in /tasks/hw03/protocol.txt
Start by copying /tasks/hw03 to your user directory
- cp -ipr /tasks/hw03 ~
It contains 3 subdirectories:
- large: larger sample of proteins for task A
- tiny: very small set of proteins for task B
- small: slightly larger set of proteins for task C
- In this task, you will run a long alignment job (>2 hours)
- Use directory large with files:
- ref.fa: selected human proteins
- other.fa: selected proteins from 8 other mammalian species
- Makefile: run blast on ref.fa vs other.fa (also formats database other.fa before that)
- run make -n to see what commands will be done (you should see makeblastdb and blastp + echo for timing), copy the output to the protocol
- run qsub with appropriate options to run make (at least -cwd and -b y)
- then run qstat > queue.txt
- Submit file queue.txt showing your job waiting or running
- When your job finishes, submit also the following two files:
- the last 100 lines from the output file ref.blast under the name ref-end.blast (use tool tail -n 100)
- standard output from the qsub job, which is stored in a file named e.g. make.oX where X is the number of your job. The output shows the time when your job started and finished (this information was written by commands echo in the Makefile)
- In this task, you will finish a Makefile for splitting blast results into ortholog groups and building phylogenetic trees for each group
- This Makefile works with much smaller files and so you can run it many times on vyuka, without qsub
- Work in directory tiny
- ref.fa: 2 human proteins
- other.fa: a selected subset of proteins from 8 other mammalian species
- Makefile: a longer makefile
- brm.pl: a Perl script for finding ortholog groups and sorting them to directories
The Makefile runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3
- If you run make without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
- Stage 1: run as make ref.brm
- It runs blast as in task A, then splits proteins into ortholog groups and creates one directory for each group with file prot.fa containing protein sequences
- Stage 2: run as make alignments
- In each directory with a single gene, it will create an alignment prot.phy and link it under names lg.phy and wag.phy
- Stage 3: run as make trees (needs to be written by you)
- In each directory with a single gene, it should create lg.phy_phyml_tree and wag.phy_phyml_tree
- These corresponds to results of phyml commands run with two different evolutionary models WAG and LG, where LG is the default
- Run phyml by commands of the forms:
- phyml -i INPUT --datatype aa --bootstrap 0 --no_memory_check >LOG
- phyml -i INPUT --model WAG --datatype aa --bootstrap 0 --no_memory_check >LOG
- Change INPUT and LOG in the commands to appropriate filenames using make variables $@, $^, $* etc. Input should come from lg.phy or wag.phy in the directory of a gene and log should be the same as tree name with extension .log added (e.g. lg.phy_phyml_tree.log)
- Also add variables LG_TREES and WAG_TREES listing filenames of all desirable trees and uncomment phony target trees which uses these variables
- Stage 4: run as make consensus
- Output trees from stage 3 are concatenated for each model separately to files lg/intree, wag/intree and then phylip is run to produce consensus trees lg.tree and wag.tree
- This stage also needs variables LG_TREES and WAG_TREES to be defined by you.
- Run your Makefile
- Submit the whole directory tiny, including Makefile and all gene directories with tree files.
- Copy your Makefile from part B to directory small, which contains 9 human proteins and run make on this slightly larger set
- Again, run it on vyuka server without qsub, but it will take some time, particularly if the server is busy
- Look at the two trees from task C (wag.tree, lg.tree) using the figtree program on vyuka (you can also install it on your computer)
- In figtree, change the position of the root in the tree to make opossum the outgroup (species branching as the first away from the others).
- This is done in figtree by clicking on opossum and thus selecting it, then pressing Reroot button.
- Also switch on displaying branch labels. These labels show for each branch of the tree, how many of the input trees support this branch.
- Use the left panel with options.
- Export the trees in pdf format as wag.tree.pdf and lg.tree.pdf and include in your submission
- Compare the two trees and write your observations to the protocol
- Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
- Do the two trees differ? What is the highest and lowest support for a branch in each tree?
- Also compare your trees with the accepted "correct tree" found here http://genome-euro.ucsc.edu/images/phylo/hg38_100way.png (note that this tree contains many more species, but all ours are included)
- Submit the entire small directory (including the two pdf files)
Here are some possibilities for further experiments, in case you are interested (do not submit these):
- You could copy your extended Makefile to directory large and create trees for all ortholog groups in the big set
- This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A
- After ref.brm si done, programs for individual genes can be run in parallel, so you can try running make -j 2 and request 2 threads from qsub
- Phyml also supports other models, for example JTT (see manual), you could try to play with those.
- Command touch FILENAME will change modification time of the given file to current file
- What happens when you run touch on some of the intermediate files in the analysis in task B? Does Makefile always run properly?