2-INF-185 Integrácia dátových zdrojov 2016/17

Materiály · Úvod · Pravidlá · Kontakt
HW10 a HW11 odovzdajte do utorka 30.5. 9:00.
Dátumy odovzdania projektov:
1. termín: nedeľa 11.6. 22:00
2. termín: streda 21.6. 22:00
Oba termíny sú riadne, prvý je určený pre študentov končiacich štúdium alebo tých, čo chcú mať predmet ukončený skôr. V oboch prípadoch sa pár dní po odvzdaní budú konať krátke osobné stretnutia s vyučujúcimi (diskusia k projektu a uzatvárane známky). Presné dni a časy dohodneme neskôr. Projekty odovzdajte podobne ako domáce úlohy do /submit/projekt


From IDZ
Jump to: navigation, search

Lecture 1: Perl, part 1

Why Perl

  • From Wikipedia: It has been nicknamed "the Swiss Army chainsaw of scripting languages" because of its flexibility and power, and possibly also because of its "ugliness".

Oficial slogans:

  • There's more than one way to do it
  • Easy things should be easy and hard things should be possible


  • Good capabilities for processing text files, regular expressions, running external programs etc.
  • Closer to common programming language than shell scripts
  • Perl one-liners on the command line can replace many other tools such as sed and awk
  • Many existing libraries


  • Quirky syntax
  • It is easy to write very unreadable programs (sometimes joking called write-only language)
  • Quite slow and uses a lot of memory. If possible do no read entire input to memory, process line by line

Warning: we will use Perl 5, Perl 6 is quite a different language

Sources of Perl-related information

  • In package perl-doc man pages:
    • man perlintro introduction to Perl
    • man perlfunc list of standard functions in Perle
    • perldoc -f split describes function split, similarly other functions
    • perldoc -q sort shows answers to commonly asked questions (FAQ)
    • man perlretut and man perlre regular expressions
    • man perl list of other manual pages about Perl
  • The same content on the web http://perldoc.perl.org/
  • Various web tutorials e.g. this one
  • Books
    • Simon Cozens: Beginning Perl [1] freely downloadable
    • Larry Wall et al: Programming Perl [2] classics, Camel book
  • Bioperl [3] big library for bioinformatics
  • Perl for Windows: http://strawberryperl.com/

Hello world

It is possible to run the code directly from a command line (more later):

perl -e'print "Hello world\n"'

This is equivalen to the following code stored in a file:

#! /usr/bin/perl -w
use strict;
print "Hello world!\n";
  • First line is a path to the interpreter
  • Swith -w switches warnings on, e.g. if we manipulate with an undefined value (equivalen to "use warnings;")
  • Second line use strict will switch on a more strict syntax checks, e.g. all variables must be defined
  • Use of -w and use strict is strongly recommended
  • Store the program in a file, e.g. hello.pl
  • Make it executable (chmod a+x hello.pl)
  • Run it with command ./hello.pl
  • Also possible to run as perl hello.pl (e.g. if we don't have the path to the interpreter in the file or the executable bit set)

The first input file for today: sequence repeats

  • In genomes some sequences occur in many copies (often not exactly equal, only similar)
  • We have downloaded a table containing such sequence repeats on chromosome 2L of the fruitfly Drosophila melanogaster
  • It was done as follows: on webpage http://genome.ucsc.edu/ we select drosophila genome, then in main menu select Tools, Table browser, select group: variation and repeats, track: ReapatMasker, region: position chr2L, output format: all fields from the selected table a output file: repeats.txt
  • Each line of the file contains data about one repeat in the selected chromosome. The first line contains column names. Columns are tab-separated. Here are the first two lines:
#bin    swScore milliDiv        milliDel        milliIns        genoName        genoStart       genoEnd genoLeft        strand  repName repClass        repFamily       repStart        repEnd  repLeft id
585     778     167     7       20      chr2L   1       154     -23513558       +       HETRP_DM        Satellite       Satellite       1519    1669    -203    1
  • The file can be found at our server under filename /tasks/hw01/repeats.txt (17185 lines)
  • A small randomly selected subset of the table rows is in file /tasks/hw01/repeats-small.txt (159 lines)

A sample Perl program

For each type of repeat (column 11 of the file when counting from 0) we want to compute the number of repeats of this type

#!/usr/bin/perl -w
use strict;

#associative array (hash), with repeat type as key
my %count;  

while(my $line = <STDIN>) {  # read every line on input
    chomp $line;    # delete end of line, if any

    if($line =~ /^#/) {  # skip commented lines
       next;       # similar to "continue" in C, move to next iteration

    # split the input line to columns on every tab, store them in an array
    my @columns = split "\t", $line;  

    # check input - should have at least 17 columns
    die "Bad input '$line'" unless @columns >= 17;

    my $type = $columns[11];

    # increase counter for this type

# write out results, types sorted alphabetically
foreach my $type (sort keys %count) {
    print $type, " ", $count{$type}, "\n";

This program does the same thing as the following one-liner (more on one-liners in two weeks)

perl -F'"\t"' -lane 'next if /^#/; die unless @F>=17; $count{$F[11]}++; END { foreach (sort keys %count) { print "$_ $count{$_}" }}' filename

Variables, types

Scalar variables

  • Scalar variables start with $, they can hold undefined value (undef), string, number, reference etc.
  • Perl converts automatically between strings and numbers
perl -e'print((1 . "2")+1, "\n")'
perl -e'print(("a" . "2")+1, "\n")'
perl -we'print(("a" . "2")+1, "\n")'
Argument "a2" isn't numeric in addition (+) at -e line 1.
  • If we switch on strict parsing, each variable needs to be defined by my, several variables created and initialized as follows: my ($a,$b) = (0,1);
  • Usual set of C-style operators, power is **, string concatenation .
  • Numbers compared by <, <=, ==, != etc., strings by lt, le, eq, ne, gt, ge,
  • Comparison operator $a cmp $b for strings, $a <=> $b for numbers: returns -1 if $a<$b, 0 if they are equal, +1 if $a>$b


  • Names start with @, e.g. @a
  • Access to element 0 in array: $a[0]
    • Starts with $, because the expression as a whole is a scalar value
  • Length of array scalar(@a). In scalar context, @a is the same thing.
    • e.g. for(my $i=0; $i<@a; $i++) { ... }
  • If using non-existent indexes, they will be created, initialized to undef (++, += treat undef as 0)
  • Stack/vector using functions push and pop: push @a, (1,2,3); $x = pop @a;
  • Analogicaly shift and unshift on the left end of the array (slower)
  • Sorting
    • @a = sort @a; (sorts alphabetically)
    • @a = sort {$a <=> $b} @a; (sort numerically)
    • { } can contain arbitrary comparison function, $a and $b are the two compared elements
  • Array concatenation @c = (@a,@b);
  • Swap values of two variables: ($x,$y) = ($y,$x);
  • Iterate through values of an array (values can be changed):
perl -e'my @a = (1,2,3); foreach my $val (@a) { $val++; } print join(" ", @a), "\n";'
2 3 4

Associative array (hashes)

  • Names start with %, e.g. %b
  • Access elemtn with name "X": $b{"X"}
  • Write out all elements of associative array %b
foreach my $key (keys %b) {
  print $key, " ", $b{$key}, "\n";
  • Inicialization with constant: %b = ("key1"=>"value1","key2"=>"value2")
    • instead of => you can also use ,
  • test for existence of a key: if(exists $a{"x"}) {...}
  • (other methods will create the queried key with undef value)

Multidimensional arrays, fun with pointers

  • Pointer to a variable: \$a, \@a, \%a
  • Pointer to an anonymous array: [1,2,3], pointer to an anonymous hash: {"kluc1"=>"hodnota1"}
  • Hash of lists:
my %a = ("fruits"=>["apple","banana","orange"], "vegetables"=>["celery","carrot"]}
$x = $a{"fruits"}[1];
push @{$a{"fruits"}}, "kiwi";
my $aref = \%a;
$x = $aref->{"fruits"}[1];
  • Module Data::Dumper has function Dumper, which will recursively print complex data structures

Strings, regular expressions


  • Substring: substr($string, $start, $length)
    • used also to access individual charaters (use length 1)
    • If we omit $length, considers until the end of the string, negative start counted from the end of the stringzaciatok rata od konca,...
    • We can also used to replace a substring by something else: substr($str, 0, 1) = "aaa" (replaces the first character by "aaa")
  • Length of a string: length($str)
  • Splitting a string to parts: split reg_expression, $string, $max_number_of_parts
    • if " " instead of regular expression, splits at whitespace
  • Connecting parts join($separator, @strings)
  • Other useful functions: chomp (removes end of line), index (finds a substring), lc, uc (conversion to lowercase/uppercase), reverse (mirror image), sprintf (C-style formatting)

Regular expressions

$line =~ s/\s+$//;   # remove whitespace at the end of the line
$line =~ s/[0-9]+/X/g;  # replace each sequence of numbers with character X

#from the name of the fasta sequence (starting with >) create a string until the first space
#(\S means non-whitespace), the result is stored in $1, as specified by ()
if($line =~ /^\>(\S+)/) { $name = $1; }

perl -le'$X="123 4 567"; $X=~s/[0-9]+/X/g; print $X'

Conditionals, loops

if(expression) {  # [] and () cannot be omitted
} elsif(expression) {
} else {

command if expression;   # here () not necessary
command unless expression;
die "negative value of x: $x" unless $x>=0;

for(my $i=0; $i<100; $i++) {
   print $i, "\n";

foreach my $i (0..99) {
   print $i, "\n";

while(1) {
   $x *= 2;
   last if $x>=100;
  • Undefined value, number 0 and strings "" and "0" evaluate as false, but I would recommmend always explicitly using logical values in conditional expressions, e.g. if(defined $x), if($x eq ""), if($x==0) etc.

Input, output

  • Reading one line from standard input: $line = <STDIN>
  • If no more input data available, returns undef
  • See also [5]
  • Special idiom while(my $line = <STDIN>) equivalent to while (defined(my $line = <STDIN>))
    • iterates through all lines of input
  • chomp $line removes "\n", if any from the end of the string
  • output to stdout through print or printf

The second input file for today: DNA sequencing reads (fastq)

  • DNA sequencing machines can read only short pieces of DNA called reads
  • Reads are usually stored in fastq format
  • Files can be very large (gigabytes or more), but we will use only a small sample from bacteria Staphylococcus aureus, source [6]
  • Each read is on 4 lines:
    • line 1: ID of the read and other description, line starts with @
    • line 2: DNA sequence, A,C,G,T are bases (nucleotides) of DNA, N means unknown base
    • line 3: +
    • line 4: quality string, which is the string of the same length as DNA in line 2. Each character represents quality of one base in DNA. If p is the probability that this base is wrong, the quality string will contain character with ASCII value 33+(-10 log p), where log is decimal logarithm. This means that higher ASCII means base of higher quality. Character ! (ASCII 33) means probability 1 of error, character $ (ASCII 36) means 50% error, character + (ASCII 43) is 10% error, character 5 (ASCII 53) is 1% error.
    • Note that some sequencing platforms represent qualities differently (see article linked above)
  • Our file has all reads of equal length (this is not always the case)

The first 4 reads from file /tasks/hw01/reads-small.fastq


And now start on HW01