2-INF-185 Integrácia dátových zdrojov 2016/17

Materiály · Úvod · Pravidlá · Kontakt
HW10 a HW11 odovzdajte do utorka 30.5. 9:00.
Dátumy odovzdania projektov:
1. termín: nedeľa 11.6. 22:00
2. termín: streda 21.6. 22:00
Oba termíny sú riadne, prvý je určený pre študentov končiacich štúdium alebo tých, čo chcú mať predmet ukončený skôr. V oboch prípadoch sa pár dní po odvzdaní budú konať krátke osobné stretnutia s vyučujúcimi (diskusia k projektu a uzatvárane známky). Presné dni a časy dohodneme neskôr. Projekty odovzdajte podobne ako domáce úlohy do /submit/projekt


From IDZ
Jump to: navigation, search

Biological background and overall approach

The task for today will be to build a phylogenetic tree of several species using sequences of several genes.

  • We will use 6 mammals: human, chimp, macaque, mouse, rat and dog
  • A phylogenetic tree is a tree showing evolutionary history of these species. Leaves are target present-day species, internal nodes are their common ancestors.
  • There are methods to build trees by comparing DNA or protein sequences of several present-day species.
  • Our input contains a small selection of gene sequences from each species. In a real project we would start from all genes (cca 20,000 per species) and would do a careful filtration of problematic sequences, but we skip this step here.
  • The first step will be to identify which genes from different species "correspond" to each other. More exactly, we are looking for groups of orthologs. To do so, we will use a simple method based on sequence similarity, see details below. Again, in real project, more complex methods might be used.
  • The result of ortholog group identification will be a set of genes, each gene having one sequence from each of the 6 species
  • Next we will process each gene separately, aligning them and building a phylogenetic tree for this gene using existing methods.
  • The result of the previous step will be several trees, one for every gene. Ideally, all trees would be identical, showing the real evolutionary history of the six species. But it is not easy to infer the real tree from sequence data, so trees from different genes might differ. Therefore, in the last step, we will build a consensus tree.

Technical overview

This task can be organized in different ways, but to practice Perl, we will write a single Perl script which takes as an input a set of fasta files, each containing DNA sequences of several genes from a single species and writes on output the resulting consensus tree.

  • For most of the steps, we will use existing bioinformatics tools. The script will run these tools and do some additional simple processing.

Temporary directory

  • During its run, the script and various tools will generate many files. All these files will be stored in a single temporary directory which can be then easily deleted by the user.
  • We will use Perl library File::Temp to create this temporary directory with a unique name so that the script can be run several times simultaneously without clashing filenames.
  • The library by default creates the file in /tmp, but instead we will create it in the current directory so that it is not deleted at restart of the computer and so that it can be more easily inspected for any problems
  • The library by default deletes the directory when the script finishes but again, to allow inspection by the user, we will leave the directory in place


  • The script will have a command line option for restarting the computation and omitting the time-consuming steps that were already finished
  • This is useful in long-running scripts because during development of the script you will want to run it many times as you add more steps. In real usage the computation can also be interrupted for various reasons.
  • Our restart capabilities will be quite rudimentary: before running a potentially slow external program, the script will check if the temporary directory contains a non-empty file with the filename matching the expected output of the program. If the file is found, it is assumed to be correct and complete and the external program is not run.

Command line options

  • The script should be named build-tree.pl and as command-line arguments, it will get names of the species
    • For example, we can run the script as follows: ./build-tree.pl human chimp macaque mouse rat dog
    • The first species, in this case human, will be so called reference species (see task A)
    • The script needs at least 2 species, otherwise it will write an error message and stop
    • For each species X there should be a file X.fa in the current directory, this is also checked by the script
  • Restart is specified by command line option -r followed by the name of temporary directory
  • Command-line option handling and creation of temporary directory is already implemented in the script you are given.

Input files

  • Each input fasta X.fa file contains DNA sequences of several genes from one species X
  • Each sequence name on a line starting with > will contain species name, underscore and gene id, e.g. ">human_00008"
  • Species name matches name of the file, gene id is unique within the fasta file
  • Species names and gene ids do not contain underscore, whitespace or any other special characters
  • Sequence of each gene can be split into several lines

Files and submitting

In /tasks/hw02/ you will find the following files:

  • 6 fasta files (*.fa)
  • skeleton script build-tree.pl
    • This script already contains handling of command line options, entire task B, potentially useful functions my_run and my_delete and suggested function headers for individual tasks. Feel free to change any of this.
  • outline of protocol protocol.txt
  • directory example with files for two different groups of genes

Copy the files to your directory and continue writing the script


  • Submit the script, protocol protocol.txt or protocol.pdf and temporary directory with all files created in the run of your script on all 6 species with human as reference.
  • Since the commands and names of files are specified in the homework, you do not need to write them in the protocol (unless you change them). Therefore it is sufficient if the protocol contains self-assessment and any used information sources other than those linked from this assignment or lectures.
  • Submit by copying to /submit/hw02/your_username

Task A: run blast to find similar sequences

  • To find orthologs, we use a simple method by first finding local alignments (regions of sequence similarity) between genes from different species
  • For finding alignments, we will use tool blast (ubuntu package blast2)
  • Example of running blast:
formatdb -p F -i human.fa
blastall -p blastn -m 9 -d human.fa -i mouse.fa -e 1e-5
  • Example of output file:
# BLASTN 2.2.26 [Sep-21-2011]
# Query: mouse_00492
# Database: human.fa
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
mouse_22930     human_00008     90.79   1107    102     0       1       1107    1       1107    0.0     1386
mouse_22930     human_34035     80.29   350     69      0       745     1094    706     1055    3e-37    147
mouse_22930     human_34035     79.02   143     30      0       427     569     391     533     8e-07   46.1

(note last column - score)

  • For each non-reference species, save the result of blast search in file species.blast in the temporary directory.

Task B: find orthogroups

This part is already implemented in the skeleton file, you don't need to implement or report anything in this task

  • Here, we process all the species.blast files to find ortholog groups.
  • Matches are symmetric, and there can be multiple matches for the same gene. We are looking for reciprocal best hits: pairs of genes human_A and mouse_B, where mouse_B is the match with the highest score in mouse for human_A and human_A is the best-scoring match in human for mouse_B.
  • Some genes in reference species may have no reciprocal best hits in some of the non-reference species.
  • Gene in the reference species and all of its reciprocal best hits constitute orthogroup. If the size of an orthogroup is the same as the number of species, we will call it a complete orthogroup
  • In file genes.txt in temporary directory list we will list all orthogroups, one per line.
chimp_94013 dog_84719 human_15749 macaque_34640 mouse_17461 rat_09232
chimp_61053 human_18570 macaque_12627
chimp_41364 human_19217 macaque_88256 rat_82436

Task C: create a file for each orthogroup

  • For each complete orthogroup, we will create a fasta file with corresponding DNA sequences.
  • The file will be located in temporary directory and will be named genename.fa, where genename is the name of the orthogroup gene from reference species.
  • The fasta name for each sequence is the name of species, NOT the name of the gene.

Task D: build tree for each gene

  • For each orthogroup, we need to build a phylogenetic tree.
  • The result for file genename.fa should be saved in file genename.tree
  • Example of how to do this:
# create multiple alignment of the sequences
muscle -diags -in genename.fa -out genename.mfa
# change format of the multiple alignment
readseq -f12 genename.mfa -o=genename.phy -a
# run phylogenetic inferrence program
phyml -i genename.phy --datatype nt --bootstrap 0 --no_memory_check
# rename the result
mv genename.phy_phyml_tree.txt genename.tree
  • You can view the multiple alignment (*.mfa and *.phy) by using program seaview
  • You can view the resulting tree (*.tree) by using program njplot or figtree

Task E: build consensus tree

  • Trees built on individual genes can differ from each other.
  • Therefore we build a consensus tree: tree that only contains branches present in most gene trees; other branches are collapsed.
  • phylip is an "interactive" program for manipulation of trees. Specific command for building consensus trees is
phylip consense
  • input file for phylip needs to contain all trees of which consensus should be built, one per line
  • text you would type to phylip manually, can be instead passed on the standard input from the script
  • store the output tree from phylip in all_trees.consensus in temporary directory and also print it to standard output