2-INF-185 Integrácia dátových zdrojov 2018/19

Materiály · Úvod · Pravidlá · Kontakt
· Od 14.3. sa presúvame do učebne F2-T3, a.k.a. F2-128.
· Body z už opravených úloh nájdete na serveri v /grades/userid.txt
· Do stredy 17.4. odovzdajte návrh projektu vo formáte .txt alebo .pdf do adresára /submit/navrh/username
  (príklady projektov pre bioinformatikov)


HW03

From IDZ
Jump to: navigation, search

See also Lecture 3

Motivation: Building Phylogenetic Trees

The task for today will be to build a phylogenetic tree of 9 mammalian species using protein sequences

  • A phylogenetic tree is a tree showing evolutionary history of these species. Leaves are target present-day species, internal nodes are their common ancestors.
  • Input contains sequences of selected proteins from each species
  • Step 1: Identify ortholog groups. Orthologs are proteins from different species that "correspond" to each other. This is done based on sequence similarity and we can use a tool called blast to identify sequence similarities between individual proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species
  • Step 2: For each ortholog group, we need to align proteins and build a phylogenetic tree for this protein using existing methods. We can do this using tools muscle (for alignment) and phyml (for phylogenetic tree inference).

Unaligned sequences (start of protein O60568):

>human
MTSSGPGPRFLLLLPLLLPPAASASDRPRGRDPVNPEKLLVITVA...
>baboon
MTSSRPGLRLLLLLLLLPPAASASDRPRGRDPVNPEKLLVMTVA...
>dog
MASSGPGLRLLLGLLLLLPPPPATSASDRPRGGDPVNPEKLLVITVA...
>elephant
MASWGPGARLLLLLLLLLLPPPPATSASDRSRGSDRVNPERLLVITVA...
>guineapig
MAFGAWLLLLPLLLLPPPPGACASDQPRGSNPVNPEKLLVITVA...
>opossum
SDKLLVITAA...
>pig
AMASGPGLRLLLLPLLVLSPPPAASASDRPRGSDPVNPDKLLVITVA...
>rabbit
MGCDSRKPLLLLPLLPLALVLQPWSARGRASAEEPSSISPDKLLVITVA...
>rat
MAASVPEPRLLLLLLLLLPPLPPVTSASDRPRGANPVNPDKLLVITVA...

Aligned sequences:

rabbit       MGCDSRKPLL LLPLLPLALV LQPW-SARGR ASAEEPSSIS PDKLLVITVA ...
guineapig    MAFGA----W LLLLPLLLLP PPPGACASDQ PRGSNP--VN PEKLLVITVA ...
opossum      ---------- ---------- ---------- ---------- SDKLLVITAA ...
rat          MAASVPEPRL LLLLLLLLPP LPPVTSASDR PRGANP--VN PDKLLVITVA ...
elephant     MASWGPGARL LLLLLLLLLP PPPATSASDR SRGSDR--VN PERLLVITVA ...
human        MTSSGPGPRF LLLLPLLL-- -PPAASASDR PRGRDP--VN PEKLLVITVA ...
baboon       MTSSRPGLRL LLLLLLL--- -PPAASASDR PRGRDP--VN PEKLLVMTVA ...
dog          MASSGPGLRL LLGLLLLL-P PPPATSASDR PRGGDP--VN PEKLLVITVA ...
pig          AMASGPGLR- LLLLPLLVLS PPPAASASDR PRGSDP--VN PDKLLVITVA ...

Phylogenetic tree in newick format:

 ((opossum:0.09636245,rabbit:0.85794020):0.05219782,(rat:0.07263127,elephant:0.03306863):0.01043531,(dog:0.01700528,(pig:0.02891345,(guineapig:0.14451043,(human:0.01169266,baboon:0.00827402):0.02619598):0.00816185):0.00631423):0.00800806);


  • Step 3: The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using an interactive tool called phylip.
  • Output is a single consensus tree.

Files and submitting

Our goal for today is to build a pipeline that automates the whole task using make and execute it remotely using qsub. Most of the work is already done, only small modifications are necessary.

  • Submit by copying requested files to /submit/hw03/username/
  • Do not forget to submit protocol, outline of the protocol is in /tasks/hw03/protocol.txt

Start by copying /tasks/hw03 to your user directory

cp -ipr /tasks/hw03 ~

It contains 3 subdirectories:

  • large: larger sample of proteins for task A
  • tiny: very small set of proteins for task B
  • small: slightly larger set of proteins for task C

Task A

  • In this task, you will run a long alignment job (>2 hours)
  • Use directory large with files:
    • ref.fa: selected human proteins
    • other.fa: selected proteins from 8 other mammalian species
    • Makefile: run blast on ref.fa vs other.fa (also formats database other.fa before that)
  • run make -n to see what commands will be done (you should see makeblastdb and blastp + echo for timing), copy the output to the protocol
  • run qsub with appropriate options to run make (at least -cwd and -b y)
  • then run qstat > queue.txt
    • Submit file queue.txt showing your job waiting or running
  • When your job finishes, submit also the following two files:
    • the last 100 lines from the output file ref.blast under the name ref-end.blast (use tool tail -n 100)
    • standard output from the qsub job, which is stored in a file named e.g. make.oX where X is the number of your job. The output shows the time when your job started and finished (this information was written by commands echo in the Makefile)

Task B

  • In this task, you will finish a Makefile for splitting blast results into ortholog groups and building phylogenetic trees for each group
    • This Makefile works with much smaller files and so you can run it many times on vyuka, without qsub
  • Work in directory tiny
    • ref.fa: 2 human proteins
    • other.fa: a selected subset of proteins from 8 other mammalian species
    • Makefile: a longer makefile
    • brm.pl: a Perl script for finding ortholog groups and sorting them to directories

The Makefile runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3

  • If you run make without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
  • Stage 1: run as make ref.brm
    • It runs blast as in task A, then splits proteins into ortholog groups and creates one directory for each group with file prot.fa containing protein sequences
  • Stage 2: run as make alignments
    • In each directory with a single gene, it will create an alignment prot.phy and link it under names lg.phy and wag.phy
  • Stage 3: run as make trees (needs to be written by you)
    • In each directory with a single gene, it should create lg.phy_phyml_tree and wag.phy_phyml_tree
    • These corresponds to results of phyml commands run with two different evolutionary models WAG and LG, where LG is the default
    • Run phyml by commands of the forms:
      • phyml -i INPUT --datatype aa --bootstrap 0 --no_memory_check >LOG
      • phyml -i INPUT --model WAG --datatype aa --bootstrap 0 --no_memory_check >LOG
    • Change INPUT and LOG in the commands to appropriate filenames using make variables $@, $^, $* etc. Input should come from lg.phy or wag.phy in the directory of a gene and log should be the same as tree name with extension .log added (e.g. lg.phy_phyml_tree.log)
    • Also add variables LG_TREES and WAG_TREES listing filenames of all desirable trees and uncomment phony target trees which uses these variables
  • Stage 4: run as make consensus
    • Output trees from stage 3 are concatenated for each model separately to files lg/intree, wag/intree and then phylip is run to produce consensus trees lg.tree and wag.tree
    • This stage also needs variables LG_TREES and WAG_TREES to be defined by you.
  • Run your Makefile
  • Submit the whole directory tiny, including Makefile and all gene directories with tree files.

Task C

  • Copy your Makefile from part B to directory small, which contains 9 human proteins and run make on this slightly larger set
    • Again, run it on vyuka server without qsub, but it will take some time, particularly if the server is busy
  • Look at the two trees from task C (wag.tree, lg.tree) using the figtree program on vyuka (you can also install it on your computer)
  • In figtree, change the position of the root in the tree to make opossum the outgroup (species branching as the first away from the others).
      • This is done in figtree by clicking on opossum and thus selecting it, then pressing Reroot button.
    • Also switch on displaying branch labels. These labels show for each branch of the tree, how many of the input trees support this branch.
      • Use the left panel with options.
    • Export the trees in pdf format as wag.tree.pdf and lg.tree.pdf and include in your submission
    • Compare the two trees and write your observations to the protocol
      • Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
      • Do the two trees differ? What is the highest and lowest support for a branch in each tree?
      • Also compare your trees with the accepted "correct tree" found here http://genome-euro.ucsc.edu/images/phylo/hg38_100way.png (note that this tree contains many more species, but all ours are included)
  • Submit the entire small directory (including the two pdf files)

Further possibilities

Here are some possibilities for further experiments, in case you are interested (do not submit these):

  • You could copy your extended Makefile to directory large and create trees for all ortholog groups in the big set
    • This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A
    • After ref.brm si done, programs for individual genes can be run in parallel, so you can try running make -j 2 and request 2 threads from qsub
  • Phyml also supports other models, for example JTT (see manual), you could try to play with those.
  • Command touch FILENAME will change modification time of the given file to current file
    • What happens when you run touch on some of the intermediate files in the analysis in task B? Does Makefile always run properly?