1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


HWmake

From MAD
Jump to navigation Jump to search

See also the lecture

Motivation: Building phylogenetic trees

The task for today will be to build a phylogenetic tree of 9 mammalian species using protein sequences

  • A phylogenetic tree is a tree showing evolutionary history of these species. Leaves are the present-day species, internal nodes are their common ancestors.
  • The input contains sequences of all proteins from each species (we will use only a smaller subset)
  • The process is typically split into several stages shown below

Identify ortholog groups

Orthologs are proteins from different species that "correspond" to each other. Orthologs are found based on sequence similarity and we can use a tool called blast to identify sequence similarities between pairs of proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species.

Align proteins on each group

For each ortholog group, we need to align proteins in the group to identify corresponding parts of the proteins. This is done by a tool called muscle

Unaligned sequences (start of protein O60568):

>human
MTSSGPGPRFLLLLPLLLPPAASASDRPRGRDPVNPEKLLVITVA...
>baboon
MTSSRPGLRLLLLLLLLPPAASASDRPRGRDPVNPEKLLVMTVA...
>dog
MASSGPGLRLLLGLLLLLPPPPATSASDRPRGGDPVNPEKLLVITVA...
>elephant
MASWGPGARLLLLLLLLLLPPPPATSASDRSRGSDRVNPERLLVITVA...
>guineapig
MAFGAWLLLLPLLLLPPPPGACASDQPRGSNPVNPEKLLVITVA...
>opossum
SDKLLVITAA...
>pig
AMASGPGLRLLLLPLLVLSPPPAASASDRPRGSDPVNPDKLLVITVA...
>rabbit
MGCDSRKPLLLLPLLPLALVLQPWSARGRASAEEPSSISPDKLLVITVA...
>rat
MAASVPEPRLLLLLLLLLPPLPPVTSASDRPRGANPVNPDKLLVITVA...

Aligned sequences:

rabbit       MGCDSRKPLL LLPLLPLALV LQPW-SARGR ASAEEPSSIS PDKLLVITVA ...
guineapig    MAFGA----W LLLLPLLLLP PPPGACASDQ PRGSNP--VN PEKLLVITVA ...
opossum      ---------- ---------- ---------- ---------- SDKLLVITAA ...
rat          MAASVPEPRL LLLLLLLLPP LPPVTSASDR PRGANP--VN PDKLLVITVA ...
elephant     MASWGPGARL LLLLLLLLLP PPPATSASDR SRGSDR--VN PERLLVITVA ...
human        MTSSGPGPRF LLLLPLLL-- -PPAASASDR PRGRDP--VN PEKLLVITVA ...
baboon       MTSSRPGLRL LLLLLLL--- -PPAASASDR PRGRDP--VN PEKLLVMTVA ...
dog          MASSGPGLRL LLGLLLLL-P PPPATSASDR PRGGDP--VN PEKLLVITVA ...
pig          AMASGPGLR- LLLLPLLVLS PPPAASASDR PRGSDP--VN PDKLLVITVA ...

Build phylogenetic tree for each group

For each alignment, we build a phylogenetic tree for this group. We will use a program called IQ-tree.

Example of a phylogenetic tree in newick format:

 ((opossum:0.09636245,rabbit:0.85794020):0.05219782,
(rat:0.07263127,elephant:0.03306863):0.01043531,
(dog:0.01700528,(pig:0.02891345,
(guineapig:0.14451043,
(human:0.01169266,baboon:0.00827402):0.02619598
):0.00816185):0.00631423):0.00800806);
Tree for gene O60568 (note: this particular tree does not agree well with real evolutionary history)

Build a consensus tree

The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so the trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using a tool called Phylip. The output is a single consensus tree.

Files and submitting

Our goal is to build a pipeline that automates the whole task using make and execute it remotely using qsub. Most of the work is already done, only small modifications are necessary.

  • Submit by copying requested files to /submit/make/username/
  • Do not forget to submit protocol, outline of the protocol is in /tasks/make/protocol.txt


Start by copying folder /tasks/make to your home folder

cp -ipr /tasks/make .
cd make

The folder contains three subfolders:

  • large: a larger sample of proteins for task A
  • tiny: a very small set of proteins for task B
  • small: a slightly larger set of proteins for task C

Task A (long job)

  • In this task, you will run a long alignment job (more than two hours)
  • Go to folder large, which contains the following files:
    • ref.fa: selected human proteins
    • other.fa: selected proteins from 8 other mammalian species
    • Makefile: runs blast on ref.fa vs other.fa (also formats database other.fa before that)
  • Run make -n to see what commands will be done (you should see makeblastdb, blastp, and echo for timing).
    • copy the output to the protocol
  • Run qsub with appropriate options to run make (at least -cwd -b y).
  • Then run qstat > queue.txt
    • Submit file queue.txt showing your job waiting or running
  • When your job finishes, check the following files:
    • the output file ref.blast
    • standard output from the qsub job, which is stored in a file named e.g. make.oX where X is the number of your job. The output shows the time when your job started and finished (this information was written by commands echo in the Makefile)
  • Submit the last 100 lines from ref.blast under the name ref-end.blast (use tool tail -n 100) and the file make.oX mentioned above

Task B (finishing Makefile)

  • In this task, you will finish a Makefile for splitting blast results into ortholog groups and building phylogenetic trees for each group.
    • This Makefile works with much smaller files and so you can run it quickly many times without qsub
  • Work in folder tiny
    • ref.fa: 2 human proteins
    • other.fa: a selected subset of proteins from 8 other mammalian species
    • Makefile: a longer makefile
    • brm.pl: a Perl script for finding ortholog groups and sorting them to folders

The Makefile runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3.

  • If you run make without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
  • Stage 1: run as make ref.brm
    • It runs blast as in task A, then splits proteins into ortholog groups and creates one folder for each group with file prot.fa containing protein sequences
  • Stage 2: run as make alignments
    • In each folder with an ortholog group, it will create an alignment prot.phy.
  • Stage 3: run as make trees (needs to be written by you)
    • In each folder with an ortholog group, it should create files prot.lg.tree and prot.wag.tree containing the results of the RAxML program run with two different evolutionary models WAG and LG.
    • Create a rule that will create prot.lg.tree from prot.phy in one gene folder. Use % for the folder name, similarly as rules to make alignments. Within the rule, you can use $* variable to get the name of this folder.
    • Run RAxML by commands of the form:
      cd GENE_FOLDER; raxmlHPC -m PROTGAMMALG -p 12345 -s prot.phy -T 2 -o opossum -n LG >prot.raxml.LG
    • Change GENE_FOLDER to the appropriate filename of the folder using make variables.
    • RAxML will create several output files starting with RAxML_ in the folder with the gene. We are interested in the file containing word result in its filename. Use mv command to move this file to the desired filename prot.lg.tree in the folder with the gene. Again use an appropriate variable to specify the folder name.
    • So the rule for running RAxML with LG model will have two lines with commands, the first running cd and raxmlHPC, the second running mv.
    • When this is done, create a similar rule except replace LG by WAG (lg by wag) in the commands, including -m PROTGAMMALG option which will change to -m PROTGAMMAWAG.
    • Also add variables LG_TREES and WAG_TREES listing filenames of all desirable trees and uncomment phony target trees which uses these variables.
  • Stage 4: run as make consensus
    • Output trees from stage 3 are concatenated for each model separately to files lg/intree, wag/intree and then phylip is run to produce consensus trees lg.tree and wag.tree
    • This stage also needs variables LG_TREES and WAG_TREES to be defined by you.
  • Run your Makefile and check that the files lg.tree and wag.tree are produced
  • Submit the whole folder tiny (using cp -r), including Makefile and all gene folders with tree files.

Task C (running make)

  • Copy your Makefile from part B to folder small, which contains 9 human proteins and run make on this slightly larger set
    • Again, run it without qsub, but it will take some time, particularly if the server is busy
  • Look at the two resulting trees (wag.tree, lg.tree) using the figtree program
    • it is available on vyuka, but you can also install it on your computer if needed
  • In figtree, switch on displaying branch labels. These labels show for each branch of the tree, how many of the input trees support this branch. To do this, use the left panel with options.
  • Export the trees in pdf format as wag.tree.pdf and lg.tree.pdf
  • Compare the two trees
    • Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
    • Do the two trees differ? What is the highest and lowest support for a branch in each tree?
    • Also compare your trees with the accepted "correct tree" found here http://genome-euro.ucsc.edu/images/phylo/hg38_100way.png (note that this tree contains many more species, but all ours are included).
    • Write your observations to the protocol
  • Submit the entire small folder (including the two pdf files).

Further possibilities

Here are some possibilities for further experiments, in case you are interested (do not submit these):

  • You could copy your extended Makefile to folder large and create trees for all ortholog groups in the big set.
    • This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A.
    • After ref.brm si done, programs for individual genes can be run in parallel, so you can try running make -j 2 and request 2 threads from qsub
  • RAxML also supports other models, for example JTT (for more information, see manual); you could try to play with those.
  • Command touch FILENAME will change the modification time of the given file to the current time.
    • What happens when you run touch on some of the intermediate files in the analysis in task B? Does Makefile always run properly?