2-INF-185 Integrácia dátových zdrojov 2016/17

Materiály · Úvod · Pravidlá · Kontakt
HW10 a HW11 odovzdajte do utorka 30.5. 9:00.
Dátumy odovzdania projektov:
1. termín: nedeľa 11.6. 22:00
2. termín: streda 21.6. 22:00
Oba termíny sú riadne, prvý je určený pre študentov končiacich štúdium alebo tých, čo chcú mať predmet ukončený skôr. V oboch prípadoch sa pár dní po odvzdaní budú konať krátke osobné stretnutia s vyučujúcimi (diskusia k projektu a uzatvárane známky). Presné dni a časy dohodneme neskôr. Projekty odovzdajte podobne ako domáce úlohy do /submit/projekt


L02

From IDZ
Jump to: navigation, search

Motivation: Building Phylogenetic Trees

The task for today will be to build a phylogenetic tree of several species using sequences of several genes.

  • A phylogenetic tree is a tree showing evolutionary history of these species. Leaves are target present-day species, internal nodes are their common ancestors.
  • Input contains sequences of genes from each species.
  • Step 1: Identify ortholog groups. Orthologs are genes from different species that "correspond" to each other. This is done based on sequence similarity and we can use a tool called blast to identify sequence similarities between individual genes. The result of ortholog group identification will be a set of genes, each gene having one sequence from each of the 6 species
chimp_94013 dog_84719 human_15749 macaque_34640 mouse_17461 rat_09232
  • Step 2: For each ortholog group, we need to align genes and build a phylogenetic tree for this gene using existing methods. We can do this using tools muscle (for alignment) and phyml (for phylogenetic tree inference).

Unaligned sequences:

>mouse
ATGCAGTTCCCGCACCCGGGGCCCGCGGCTGCGCCCGCCGTGGGAGTCCCGCTGTATGCG
>rat
ATGCAGTTCCCGCACCCGGGGCCCGCGGCTGCGCCCGCCGTCGGAGTCCCGCTGTACGCG
>dog
ATGCAGTACCACCCCGGGCCGGCGGCGGCGGGCGCCGTGGGGGTGCCGCTGTACGCG
>human
ATGCAGTACCCGCACCCCGGGCCGGCGGCGGGCGCCGTGGGGGTGCCGCTGTACGCG
>chimp
ATGCAGTACCCGCACCCCGGGCCGGCGGCGGGCGCCGTGGGGGTGCCGCTGTACGCG
>macaque
ATGCAGTACCCGCACCCCGGGCGGCGGCCGTGGGGGTGGC

Aligned sequences:

>mouse
ATGCAGTTCCCGCACCCGGGGCCCGCGGCTGCGCCCGCCGTGGGAGTCCCGCTGTATGCG
>rat
ATGCAGTTCCCGCACCCGGGGCCCGCGGCTGCGCCCGCCGTCGGAGTCCCGCTGTACGCG
>dog
ATGCAGTAC---CACCCCGGGCCGGCGGCGGCGGGCGCCGTGGGGGTGCCGCTGTACGCG
>human
ATGCAGTACCCGCACCCCGGGC---CGGCGGCGGGCGCCGTGGGGGTGCCGCTGTACGCG
>chimp
ATGCAGTACCCGCACCCCGGGC---CGGCGGCGGGCGCCGTGGGGGTGCCGCTGTACGCG
>macaque
ATGCAGTACCCGCACCCCGGGC----------GGCGGCCGTGGGGGTGGC----------

Phylogenetic tree:

(mouse:0.03240286,rat:0.01544553,(dog:0.03632419,(macaque:0.01505050,(human:0.00000001,chimp:0.00000001):0.00627957):0.01396920):0.10645019);
Tree for gene human_15749 (branch lengths ignored)


  • Step 3: The result of the previous step will be several trees, one for every gene. Ideally, all trees would be identical, showing the real evolutionary history of the six species. But it is not easy to infer the real tree from sequence data, so trees from different genes might differ. Therefore, in the last step, we will build a consensus tree. This can be done by usina interactive tool called phylip.
  • Output is a single consensus tree.
Tree for gene human_15749
Tree for gene human_13531
Tree for gene human_31770
Strict consensus for the three gene trees


Our goal for today is to build a pipeline that automates the whole task.

Opening files

my $in;
open $in, "<", "path/file.txt" or die;  # open file for reading
while(my $line = <$in>) {
  # process line
}
close $in;

my $out;
open $out, ">", "path/file2.txt" or die; # open file for writing
print $out "Hello world\n";
close $out;
# if we want to append to a file use the following instead:
# open $out, ">>", "cesta/subor2.txt" or die;

# standard files
print STDERR "Hello world\n";
my $line = <STDIN>;
# files as arguments of a function
citaj_subor($in);
citaj_subor(\*STDIN);

Working with files and directories

Working directories or files with automatically generated names are automagically deleted after the program finishes.

use File::Temp qw/tempdir/;
my $dir = tempdir("atoms_XXXXXXX", TMPDIR => 1, CLEANUP => 1 ); 
print STDERR "Creating temporary directory $dir\n";
open $out,">$dir/myfile.txt" or die;

Copying files

use File::Copy;
copy("file1","file2") or die "Copy failed: $!";
copy("Copy.pm",\*STDOUT);
move("/dev1/fileA","/dev2/fileB");

Other functions for working with file system, e.g. chdir, mkdir, unlink, chmod, ...

Function glob finds files with wildcard characters similarly as on command line (see also opendir, readdir, and File::Find module)

ls *.pl
perl -le'foreach my $f (glob("*.pl")) { print $f; }'

Additional functions for working with file names, paths, etc. in modules File::Spec and File::Basename.

Testing for an existence of a file (more in perldoc -f -X)

if(-r "file.txt") { ... }  # is file.txt readable?
if(-d "dir") {.... }       # is dir a directory?

Running external programs

my $ret = system("command arguments");
# returns -1 if cannot run command, otherwise pass the return code
my $allfiles = `ls`;
# returns the result of a command as a text
# cannot test return code

Using pipes

open $in, "ls |";
while(my $line = <$in>) { ... }
open $out, "| wc"; 
print $out "1234\n"; 
close $out;'

      1       1       5

Command-line arguments

# module for processing options in a standardized way
use Getopt::Std;
# string with usage manual
my $USAGE = "$0 [options] length filename

Options:
-l           switch on lucky mode
-o filename  write output to filename
";

# all arguments to the command are stored in @ARGV array
# parse options and remove them from @ARGV
my %options;
getopts("lo:", \%options);
# now there should be exactly two arguments in @ARGV
die $USAGE unless @ARGV==2;
# process options
my ($length, $filenamefile) = @ARGV;
# values of options are in the %options array
if(exists $options{'l'}) { print "Lucky mode\n"; }

For long option names, see module Getopt::Long

Defining functions

Defining new functions

sub function_name {
  # arguments are stored in @_ array
  my ($firstarg, $secondarg) = @_;
  # do something
  return ($result, $second_result);
}
  • Arrays and hashes are usually passed as references: function_name(\@array, \%hash);
  • It is advantageous to pass long string as references as well to prevent needless copying: function_name(\$sequence);
  • References need to be dereferenced, e.g. substr($$sequence) or $array->[0]

Bioperl

use Bio::Tools::CodonTable;
sub translate
{
    my ($seq, $code) = @_;
    my $CodonTable = Bio::Tools::CodonTable->new( -id => $code);
    my $result = $CodonTable->translate($seq);

    return $result;
}


Defining modules

Module with name XXX should be in file XXX.pm.

package shared;

BEGIN {
    use Exporter   ();
    our (@ISA, @EXPORT, @EXPORT_OK);
    @ISA = qw(Exporter);
    # symbols to export by default
    @EXPORT = qw(funkcia1, funkcia2);
}

sub funkcia1 {
...
}

sub funkcia2 {
...
}

#module must return true
1;

Using the module located in the same directory as .pl file:

use FindBin qw($Bin);  # $Bin is the directory with the script
use lib "$Bin";        # add bin to the library path
use shared;