2-INF-185 Integrácia dátových zdrojov 2017/18
In this homework, we will return to the example in homework 2, where we took genes from several organisms, found orthogroups of corresponding genes and built a phylogenetic tree for each orthogroup. This was all done in a single big Perl script. In this homework, we will write a similar pipeline using make and execute it remotely using qsub. We will use proteins instead of DNA and we will use a different set of species. Most of the work is already done, only small modifications are necessary.
- Submit by copying requested files to /submit/hw04/username/
- Do not forget to submit protocol, outline of the protocol is in /tasks/hw04/protocol.txt
- In this task, you will run a long alignment job (>1 hour)
- Copy directory /tasks/hw04/large to your home directory
- ref.fa: all proteins from yeast Yarrowia lipolytica
- other.fa: all proteins from 8 other yeast species
- Makefile: run blast on ref.fa vs other.fa (also formats database other.fa before that)
- run make -n to see what commands will be done (you should see formatdb and blastall + echo for timing), copy the output to the protocol
- run qsub with appropriate options to run make (at least -cwd and -b y)
- then run qstat > queue.txt
- Submit file queue.txt showing your job waiting or running
- When your job finishes, submit also the following two files:
- the last 100 lines from the output file ref.blast under the name ref-end.blast (use tool tail -n 100)
- standard output from the qsub job, which is stored in a file named e.g. make.oX where X is the number of your job. The output shows the time when your job started and finished (this information was written by commands echo in the Makefile)
- In this task, you will finish a Makefile for splitting blast results into orthogroups and building phylogenetic trees for each group
- This Makefile works with much smaller files and so you can run it many times on vyuka, without qsub
- If it runs too slowly, you can temporarily modify ref.fa to contain only the first 2 sequences, debug your makefile and then again copy the original ref.fa from /tasks/hw04/small to run the final analysis
- Copy directory /tasks/hw04/small to your home directory
- ref.fa: 6 proteins from yeast Yarrowia lipolytica
- other.fa: a selected subset of proteins from 8 other yeast species
- Makefile: a longer makefile
The Makefile runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3
- If you run make without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
- Stage 1: run as make ref.brm
- It runs blast as in task A, then splits proteins into orthogroups and creates one directory for each group with file prot.fa containing protein sequences
- Stage 2: run as make alignments
- In each directory with a single gene, it will create an alignment prot.phy and link it under names lg.phy and wag.phy
- Stage 3: run as make trees (needs to be written by you)
- In each directory with a single gene, it should create lg.phy_phyml_tree and wag.phy_phyml_tree
- These corresponds to results of phyml commands run with two different evolutionary models WAG and LG, where LG is the default
- Run phyml by commands of the forms:
- phyml -i INPUT --datatype aa --bootstrap 0 --no_memory_check >LOG
- phyml -i INPUT --model WAG --datatype aa --bootstrap 0 --no_memory_check >LOG
- Change INPUT and LOG in the commands to appropriate filenames using make variables $@, $^, $* etc. Input should come from lg.phy or wag.phy in the directory of a gene and log should be the same as tree name with extension .log added (e.g. lg.phy_phyml_tree.log)
- Also add variables LG_TREES and WAG_TREES listing filenames of all desirable trees and uncomment phony target trees which uses these variables
- Stage 4: run as make consensus
- Output trees from stage 3 are concatenated for each model separately to files lg/intree wag/intree and then phylip is run to produce consensus trees lg.tree and wag.tree
- This stage also needs variables LG_TREES and WAG_TREES to be defined by you.
- Run your Makefile
- Submit the whole directory small, including Makefile and all gene directories with tree files.
- Look at the two trees from task B (wag.tree, lg.tree) using the figtree program, switch on displaying branch labels in the left panel with options. These labels show for each branch of the tree, how many of the input trees support this branch.
- Write your observations to the protocol: Do the two trees differ? If yes, do they differ in branches supported by many different genes trees, or few? What is the highest and lowest support for a branch in each tree?
- Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
Here are some possibilities for further experiments, in case you are interested (do not submit these):
- You could copy your extended Makefile to directory large and create trees for all orthogroups in the big set
- This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A
- After ref.brm si done, programs for individual genes can be run in parallel, so you can try running make -j 2 and request 2 threads from qsub
- Phyml also supports other models, for example JTT (see manual), you could try to play with those.
- Command touch FILENAME will change modification time of the given file to current file
- What happens when you run touch on some of the intermediate files in the analysis in task B? Does Makefile always run properly?