1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.


Data management 2022/23

From MAD
Jump to navigation Jump to search

Website for 2022/23

2023-02-13 (BB/TV) ALL Introduction to the course Contact, Introduction, Rules
Introduction to Perl Lecture, Homework, Connecting to server, Editors, Command-line basics
2023-02-20 (BB) ALL Command-line tools, Perl one-liners Lecture, Homework
2023-02-27 (BB) ALL Job scheduling and make Lecture, Homework
2023-03-06 (BB) ALL Python and SQL for beginners Lecture, Homework
2023-03-13 (VB) INF/DAV Python, web crawling, HTML parsing, sqlite3 Lecture, Homework
(BB) BIN Bioinformatics 1 (sequencing and genome assembly) Lecture, Homework
2023-03-20 (VB) INF/DAV Text data processing, flask Lecture, Homework
(TV) BIN Bioinformatics 2 (gene finding, RNA-seq) Lecture, Homework
2023-03-27 (VB) INF/DAV Data visualization in JavaScript Lecture, Homework
(TV) BIN Bioinformatics 3 (genome variants) Lecture, Homework
2023-04-03 (BB) ALL R, part 1 Lecture, Homework
Project proposals due Wednesday April 5
2023-04-10 Easter
2023-04-17 (VB) ALL Cloud computing Lecture, Homework
2023-04-24 (BB) ALL R, part 2 Lecture, Homework (deadline Friday May 12, 22:00)
2023-05-01 Holiday
2023-05-08 Holiday

Contents

Contact

Instructors

Schedule

  • Monday 16:30-18:45 I-H6 lecture + start solving tasks with our help

Introduction

Target audience

This course is offered at the Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava for the students of the second year of the bachelor Data Science and Bioinformatics study programs and the students of the bachelor and master Computer Science study programs. It is a prerequisite of the master-level state exams in Bioinformatics and Machine Learning. However, the course is open to students from other study programs if they satisfy the following informal prerequisites.

  • We assume that the students are proficient in programming in at least one programming language and are not afraid to learn new languages.
  • We assume basic knowledge of work on the Linux command-line (at least basic commands for working with files and folders, such as cd, mkdir, cp, mv, rm, chmod). If you do not have these skills, please study our tutorial before the second lecture. The first week contains detailed instructions to get you started.

Although most technologies covered in this course can be used for processing data from many application areas, we will illustrate some of them on examples from bioinformatics. We will explain necessary terminology from biology as needed.

Course objectives

Computer science courses cover many interesting algorithms, models and methods that can used for data analysis. However, when you want to use these methods for real data, you will typically need to make considerable efforts to obtain the data, pre-process it into a suitable form, test and compare different methods or settings, and arrange the final results in informative tables and graphs. Often, these activities need to be repeated for different inputs, different settings, and so on. Many jobs in data science and bioinformatics involve data processing using existing tools and small custom scripts. This course will cover some programming languages and technologies suitable for such activities.

This course is also recommended for students whose bachelor or master theses involve substantial empirical experiments (e.g. experimental evaluation of your methods and comparison with other methods on real or simulated data).

Basic guidelines for working with data

As you know, in programming it is recommended to adhere to certain practices, such as good coding style, modular design, thorough testing etc. Such practices add a little extra work, but are much more efficient in the long run. Similar good practices exist for data analysis. As an introduction we recommend the following article by a well-known bioinformatician William Stafford Noble, but his advice applies outside of bioinformatics as well.

Several important recommendations:

  • Noble 2009: "Everything you do, you will probably have to do over again."
    • After doing an entire analysis, you often find out that there was a problem with the input data or one of the early steps, and therefore everything needs to be redone.
    • Therefore it is better to use techniques that allow you to keep all details of your workflow and to repeat them if needed.
    • Try to avoid manually changing files, because this makes rerunning analyses harder and more error-prone.
  • Document all steps of your analysis
    • Note what have you done, why have you done it, what was the result.
    • Some of these things may seem obvious to you at present, but you may forgot them in a few weeks or months and you may need them to write up your thesis or to repeat the analysis.
    • Good documentation is also indispensable for collaborative projects.
  • Keep a logical structure of your files and folders
    • Their names should be indicative of the contents (create a sensible naming scheme).
    • However, if you have too many versions of the experiment, it may be easier to name them by date rather than create new long names (your notes should then detail the meaning of each dated version).
  • Try to detect problems in the data
    • Big files often hide some problems in the format, unexpected values etc. These may confuse your programs and make the results meaningless.
    • In your scripts, check that the input data conform to your expectations (format, values in reasonable ranges etc).
    • In unexpected circumstances, scripts should terminate with an error message and a non-zero exit code.
    • If your script executes another program, check its exit code.
    • Also check intermediate results as often as possible (by manual inspection, computing various statistics etc) to detect errors in the data and your code.

Rules

Grading

  • Homeworks: 45%
  • Project proposal: 5%
  • Project: 40%
  • Oral exam: 10%

Grades:

  • A: 90 and more, B: 80...89, C: 70...79, D: 60...69, E: 50...59, FX: less than 50%

You will get Fx if your oral exam is not satisfactory, even if you have sufficient points from other activities.

Course format

  • Every Monday three-hour class, we start with a short lecture. Then you start solving assigned tasks, which you complete as a homework assignment.
  • We highly recommend doing the homework during class, as we can help you as needed. We really encourage you to ask questions during this time. At other times, ask your questions via email, but you may have to wait longer for the answer.
  • Some weeks will have a separate material for Bioinformatics program and separately for others. If you would like to do a homework other than the one intended for you, you must obtain a prior consent of the instructors.
  • You will submit a project during the exam period. Afterwards there will be an oral exam concentrating on the project and submitted homework.
  • You will have an account on a Linux server dedicated to this course. Use this account only for the purposes of this course and try not to overload the server so that it serves all students. Any attempts to intentionally disrupt the operation of the server will be considered a serious violation of the course rules.

Homework

  • The deadline for each homework is 9:00 of the day of the next lecture, i.e. usually almost one week from when the homework was published.
  • You can work on your homework on any computer, preferably under Linux. However, the submitted code or commands should be executable on the course server, so do not use special software or settings on your computer.
  • The homework is submitted by copying the required files to the required directory on the server. Details will be specified in the assignment.
  • Follow any filenames specified in the assignment and use reasonable filenames for additional files.
  • Make sure the submitted source code is easy to read (indentation, reasonable variable names, comments as needed)

Protocols

  • Usually, the required part of the homework will be a text document called a protocol.
  • Write the protocol in txt format and name the file 'protocol.txt' (copy it to the upload directory)
  • The protocol can be in Slovak or English.
  • If you write with diacritics, use UTF8 encoding, but feel free to omit diacritics in protocols.
  • In most tasks, you get a protocol outline, follow it.

Self-assessment

  • At the top of the protocol, fill in a self-assessment which for every task should contain one of the following codes.
    • Use code DONE if you think the task is completely and correctly solved.
    • Use code PART if you have completed only a part of the task. After the code briefly state, which part was completed and potentially if you had problem with something.
    • Use code UNSURE, if you have completed the task but are not sure about something. Again briefly explain what are you unsure of.
    • Use code NOTHING, if you have not even started to do the task.
  • Your self-assessment will guide us in grading. Tasks marked as DONE will be checked briefly, but we will try to give you feedback to tasks marked UNSURE or PART, particularly if you note down what was causing you problems.
  • Try to fill in self-assessment the best you can. It can influence your grade.

Protocol contents

  • Unless specified otherwise, the protocol should contain the following information:
    • List of submitted files: for each file, state its meaning and whether you produced it manually, obtained it from external sources or produced it by a program. You do not need to list the files whose names are specified in the assignment. If you have a large number of files with a systematic naming scheme, just explain the naming scheme in general.
    • The sequence of commands used, or other steps you took to get the results. Include commands to process data and run your or other programs. It is not necessary to specify commands related to the programming itself (starting the editor, setting file permissions), copying the files to the server, etc. For more complex commands, also provide brief comments explaining the purpose of a particular command or group of commands.
    • Results of running the analysis of some data, as specified in the tasks. We may also ask you to make observations on the result.
    • List of resources: websites and other sources that you used to solve the task. You do not have to list the course website and resources recommended directly in the assignment.

Overall, the protocol should allow the reader to understand your files and also, in case of interest, to perform the same calculations as you used to obtain the result. You do not have write to a formal text, only clear and brief notes.

Project

The aim of the project is to extend your skills on a data processing project. Your task is to obtain data, analyze this data with some techniques from the lectures and display the results in graphs and tables.

  • In about two thirds of the semester, you will submit a short project proposal
  • A deadline for submitting the project (including the written report) will be during the exam period
  • You can also do projects in pairs, but then we require a larger project and each member should be primarily responsible for a certain part of the project

More detailed information on projects is on a separate page.

Oral exam

  • During the oral exam, we will give you our feedback to your project and ask related questions
  • We can also ask you about some of the homeworks you have submitted during the semester
  • You should be able to explain your code and do small modifications in it

Academic integrity

  • You are allowed to talk to classmates and other people about homework and projects and general strategies to solve them. However, the code, the results obtained, and the text you submit must be your own work. It is forbidden to show your code or texts to the classmates.
  • When working on the homework and the project, we expect you to use Internet resources, especially various manuals and discussion forums on the used technologies. However, do not try to find ready-made solutions to the given tasks and do not use AI to solve the problems for you (you do not learn anything that way!). List all resources used in a homework or a project.
  • If we find cases of plagiarism or unauthorized aids, all participating students will receive zero points for the relevant homework or project (including the students who provided their solutions to others to copy). Violations of academic integrity will be also referred to the faculty disciplinary committee.

Sharing materials

Assignments and materials for the course are freely available on this webpage. However, do not publish or otherwise share your homework solutions as they closely follow the outline given by us. You can publish your projects if you wish, as long as it does not conflict with your agreement with the provider of your data.

Lperl

This lecture is a brief introduction to the Perl scripting language. We recommend revisiting necessary parts of this lecture while working on the exercises.

Homework: HWperl

Why Perl

  • Very popular in 1990s and early 2000s for system scripting, also very popular in bioinformatics.
  • Probably not many of you know this language, so a good training to learn a new language quickly.

Advantages

  • Good capabilities for processing text files, regular expressions, running external programs etc. (Perl-style regular expression today used in many languages)
  • Closer to common programming languages than shell scripts
  • Perl one-liners on the command line can replace many other tools such as sed and awk (next lecture)

Disadvantages

  • Quirky syntax
  • It is easy to write very unreadable programs (Perl is sometimes joking called write-only language)
  • Quite slow and uses a lot of memory. If possible, do not read the entire input to memory, process line by line

We will use Perl 5, Perl 6 is quite a different language.

Hello world

It is possible to run the code directly from a command line (more later):

perl -e'print "Hello world\n"'

This is equivalent to the following code stored in a file:

#! /usr/bin/perl -w
use strict;
print "Hello world!\n";
  • The first line is a path to the interpreter
  • Switch -w switches warnings on, e.g. if we manipulate with an undefined value (equivalent to use warnings;)
  • The second line use strict will switch on a more strict syntax checks, e.g. all variables must be defined
  • Use of -w and use strict is strongly recommended

Running the script

  • Store the program in a file hello.pl
  • Make it executable (chmod a+x hello.pl)
  • Run it with command ./hello.pl
  • It is also possible to run as perl hello.pl (e.g. if we don't have the path to the interpreter in the file or the executable bit is not set)

The first input file for today: TV series

  • IMDb is an online database of movies and TV series with user ratings.
  • We have downloaded a preprocessed dataset of selected TV series ratings from GitHub.
  • From this dataset, we have selected only several series with a high number of voting users.
  • Each line of the file contains data about one episode of one series. Columns are tab-separated and contain the name of the series, the name of the episode, the global index of the episode within the series, the number of the season, the index of the episode with the season, rating of the episode and the number of voting users.
  • Here is a smaller version of this file with only six lines:
Black Mirror	The National Anthem	1	1	1	7.8	35156
Black Mirror	Fifteen Million Merits	2	1	2	8.2	35317
Black Mirror	The Entire History of You	3	1	3	8.6	35266
Game of Thrones	Winter Is Coming	1	1	1	9	27890
Game of Thrones	The Kingsroad	2	1	2	8.8	21414
Game of Thrones	Lord Snow	3	1	3	8.7	20232
  • The smaller and the larger version of this file can be found at our server under filenames /tasks/perl/series-small.tsv and /tasks/perl/series.tsv

A sample Perl program

For each series (column 0 of the file) we want to compute the number of episodes.

#! /usr/bin/perl -w
use strict;

#associative array (hash), with series name as key
my %count;  

while(my $line = <STDIN>) {  # read every line on input
    chomp $line;    # delete end of line, if any

    # split the input line to columns on every tab, store them in an array
    my @columns = split "\t", $line;  

    # check input - should have 7 columns
    die "Bad input '$line'" unless @columns == 7;

    my $series = $columns[0];

    # increase the counter for this series
    $count{$series}++;
}

# write out results, series sorted alphabetically
foreach my $series (sort keys %count) {
    print $series, " ", $count{$series}, "\n";
}

This program does the same thing as the following one-liner (more on one-liners in the next lecture)

perl -F'"\t"' -lane 'die unless @F==7; $count{$F[0]}++;
  END { foreach (sort keys %count) { print "$_ $count{$_}" }}' filename

When we run it for the small six-line input, we get the following output:

Black Mirror 3
Game of Thrones 3

The second input file for today: DNA sequencing reads (fastq)

  • DNA sequencing machines can read only short pieces of DNA called reads
  • Reads are usually stored in FASTQ format
  • Files can be very large (gigabytes or more), but we will use only a small sample from bacteria Staphylococcus aureus (data from the GAGE website)
  • Each read is stored in 4 lines:
    • line 1: ID of the read and other description, line starts with @
    • line 2: DNA sequence, A,C,G,T are bases (nucleotides) of DNA, N means unknown base
    • line 3: +
    • line 4: quality string, which is the string of the same length as DNA in line 2. ASCII code of each character represents quality of one base in DNA, where higher quality means lower probability of a sequencing error.
      Details (not needed today): If p is the probability that this base is wrong, the quality string will contain character with ASCII value 33+(-10 log p), where log is the decimal logarithm. Character ! (ASCII 33) means probability 1 of error, character $ (ASCII 36) means 50% error, character + (ASCII 43) is 10% error, character 5 (ASCII 53) is 1% error.
  • Our file has all reads of equal length (this is not always the case)
  • Technically, a single read and its quality can be split into multiple lines, but this is rarely done, and we will assume that each read takes 4 lines as described above

The first 4 reads from file /tasks/perl/reads-small.fastq (trimmed to 50 bases for better readability)

@SRR022868.1845/1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAG
+
IICIIIIIIIIIID%IIII8>I8III1II,II)I+III*II<II,E;-HI
@SRR022868.1846/1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACA
+
4CIIIIIIII52I)IIIII0I16IIIII2IIII;IIAII&I6AI+*+&G5

Variables, types

Scalar variables

  • The names of scalar variables start with $
  • Scalar variables can hold undefined value (undef), string, number, reference etc.
  • Perl converts automatically between strings and numbers
perl -e'print((1 . "2")+1, "\n")'
# 13
perl -e'print(("a" . "2")+1, "\n")'
# 1
perl -we'print(("a" . "2")+1, "\n")'
# Argument "a2" isn't numeric in addition (+) at -e line 1.
# 1
  • If we switch on strict parsing, each variable needs to be defined by my
    • Several variables can be created and initialized as follows: my ($a,$b) = (0,1);
  • Usual set of C-style operators, power is **, string concatenation .
  • Numbers compared by <, <=, ==, != etc., strings by lt, le, eq, ne, gt, ge
  • Comparison operator $a cmp $b for strings, $a <=> $b for numbers: returns -1 if $a<$b, 0 if they are equal, +1 if $a>$b

Arrays

  • Names start with @, e.g. @a
  • Access to element 0 in array @a: $a[0]
    • Starts with $, because the expression as a whole is a scalar value
  • Length of array scalar(@a). In scalar context, @a is the same thing.
    • e.g. for(my $i=0; $i<@a; $i++) { ... } iterates over all elements
  • If using non-existent indexes, they will be created, initialized to undef (++, += treat undef as 0)
  • Command foreach iterates through values of an array (values can be changed during iteration):
my @a = (1,2,3);
foreach my $val (@a) {  # iterate through all values
    $val++;             # increase each value in array by 1
}

Other useful commands

  • Stack/vector using functions push and pop: push @a, (1,2,3); $x = pop @a;
  • Analogically shift and unshift on the left end of the array (slower)
  • Sorting
    • @a = sort @a; (sorts alphabetically)
    • @a = sort {$a <=> $b} @a; (sorts numerically)
    • { } can contain an arbitrary comparison function, $a and $b are the two compared elements
  • Array concatenation @c = (@a,@b);
  • Swap values of two variables: ($x,$y) = ($y,$x);

Hash tables (associative array, dictionaries, maps)

  • Names start with %, e.g. %b
  • Keys are strings, values are scalars
  • Access element with key "X": $b{"X"}
  • Write out all elements of associative array %b
foreach my $key (keys %b) {
    print $key, " ", $b{$key}, "\n";
}
  • Initialization with a constant: %b = ("key1" => "value1", "key2" => "value2");
  • Test for existence of a key: if(exists $a{"X"}) {...}

Multidimensional arrays, fun with pointers

  • Pointer to a variable (scalar, array, dictionary): \$a, \@a, \%a
  • Pointer to an anonymous array: [1,2,3], pointer to an anonymous hash: {"key1" => "value1"}
  • Hash of lists is stored as hash of pointers to lists:
my %a = ("fruits" => ["apple","banana","orange"],
         "vegetables" => ["tomato","carrot"]);
$x = $a{"fruits"}[1];
push @{$a{"fruits"}}, "kiwi";
my $aref = \%a;
$x = $aref->{"fruits"}[1];
  • Module Data::Dumper has function Dumper, which recursively prints complex data structures (good for debugging)

Strings

  • Substring: substr($string, $start, $length)
    • Used also to access individual characters (use length 1)
    • If we omit $length, extracts suffix until the end of the string, negative $start counts from the end of the string,...
    • We can also replace a substring by something else: substr($str, 0, 1) = "aaa" (replaces the first character by "aaa")
  • Length of a string: length($str)
  • Splitting a string to parts: split reg_expression, $string, $max_number_of_parts
    • If " " is used instead of regular expression, splits at any whitespace
  • Connecting parts to a string join($separator, @strings)
  • Other useful functions: chomp (removes the end of line), index (finds a substring), lc, uc (conversion to lower-case/upper-case), reverse (mirror image), sprintf (C-style formatting)

Regular expressions

  • Regular expressions are a powerful tool for working with strings, now featured in many languages
  • Here only a few examples, more details can be found in the official tutorial
if($line =~ /hello/) {  
   print "line contains word hello as a substring";
}
if($line =~ /hello/i) {  # ignore letter case, also finds Hello, HELLO, hElLo
   print "line contains word hello as a substring regardless of ltter case";
}
if($line =~ /hello.*world/) {  # . is any character, * means any number of repeats
   print "line contains word hello later followed by word world";
}
if($line =~ /hello\s+world/) {  # \s is whitespace, + means at least one repeat
   print "line contains words hello and word sepearted by whitespace";
}

# editting strings
$line =~ s/\s+$//;      # remove whitespace at the end of the line
$line =~ s/[0-9]+/X/g;  # replace each sequence of numbers with character X

# if the line starts with >,
# store the word following > (until the first whitespace)
# and store it in variable $name 
# (\S means non-whitespace),
# the string matching part of expression in (..) is stored in $1
if($line =~ /^\>(\S+)/) { $name = $1; }

Conditionals, loops

if(expression) {  # () and {} cannot be omitted
   commands
} elsif(expression) {
   commands
} else {
   commands
}

command if expression;   # here () not necessary
command unless expression;
# good for checking inputs etc
die "negative value of x: $x" unless $x >= 0;

for(my $i=0; $i<100; $i++) {
   print $i, "\n";
}

foreach my $i (0..99) {
   print $i, "\n";
}

my $x = 1;
while(1) {
   $x *= 2;
   last if $x >= 100;
}

Undefined value, number 0 and strings "" and "0" evaluate as false, but we recommend always explicitly using logical values in conditional expressions, e.g. if(defined $x), if($x eq ""), if($x==0) etc.

Input, output

# Reading one line from standard input
$line = <STDIN>
# If no more input data available, returns undef


# The special idiom below reads all the lines from input until the end of input is reached:
while (my $line = <STDIN>) {
   # commands processing $line ...
}

Sources of Perl-related information

  • Man pages (included in ubuntu package perl-doc), also available online at http://perldoc.perl.org/
    • man perlintro introduction to Perl
    • man perlfunc list of standard functions in Perl
    • perldoc -f split describes function split, similarly other functions
    • perldoc -q sort shows answers to commonly asked questions (FAQ)
    • man perlretut and man perlre regular expressions
    • man perl list of other manual pages about Perl
  • Various web tutorials e.g. this one
  • Books

Further optional topics

For illustration, we briefly cover other topics frequently used in Perl scripts (these are not needed to solve the exercises).

Opening files

my $in;
open $in, "<", "path/file.txt" or die;  # open file for reading
while(my $line = <$in>) {
  # process line
}
close $in;

my $out;
open $out, ">", "path/file2.txt" or die; # open file for writing
print $out "Hello world\n";
close $out;
# if we want to append to a file use the following instead:
# open $out, ">>", "cesta/subor2.txt" or die;

# standard files
print STDERR "Hello world\n";
my $line = <STDIN>;
# files as arguments of a function
read_my_file($in);
read_my_file(\*STDIN);

Working with files and directories

Module File::Temp allows to create temporary working directories or files with automatically generated names. These are automatically deleted when the program finishes.

use File::Temp qw/tempdir/;
my $dir = tempdir("atoms_XXXXXXX", TMPDIR => 1, CLEANUP => 1 ); 
print STDERR "Creating temporary directory $dir\n";
open $out,">$dir/myfile.txt" or die;

Copying files

use File::Copy;
copy("file1","file2") or die "Copy failed: $!";
copy("Copy.pm",\*STDOUT);
move("/dev1/fileA","/dev2/fileB");

Other functions for working with file system, e.g. chdir, mkdir, unlink, chmod, ...

Function glob finds files with wildcard characters similarly as on command line (see also opendir, readdir, and File::Find module)

ls *.pl
perl -le'foreach my $f (glob("*.pl")) { print $f; }'

Additional functions for working with file names, paths, etc. in modules File::Spec and File::Basename.

Testing for an existence of a file (more in perldoc -f -X)

if(-r "file.txt") { ... }  # is file.txt readable?
if(-d "dir") {.... }       # is dir a directory?

Running external programs

Using the system command

  • It returns -1 if it cannot run command, otherwise returns the return code of the program
my $ret = system("command arguments");

Using the backtick operator with capturing standard output to a variable

  • This does not tests the return code
my $allfiles = `ls`;

Using pipes (special form of open sends output to a different command, or reads output of a different command as a file)

open $in, "ls |";
while(my $line = <$in>) { ... }
open $out, "| wc"; 
print $out "1234\n"; 
close $out;
# output of wc:
#      1       1       5

Command-line arguments

# module for processing options in a standardized way
use Getopt::Std;
# string with usage manual
my $USAGE = "$0 [options] length filename

Options:
-l           switch on lucky mode
-o filename  write output to filename
";

# all arguments to the command are stored in @ARGV array
# parse options and remove them from @ARGV
my %options;
getopts("lo:", \%options);
# now there should be exactly two arguments in @ARGV
die $USAGE unless @ARGV==2;
# process options
my ($length, $filenamefile) = @ARGV;
# values of options are in the %options array
if(exists $options{'l'}) { print "Lucky mode\n"; }

For long option names, see module Getopt::Long

Defining functions

sub function_name {
  # arguments are stored in @_ array
  my ($firstarg, $secondarg) = @_;
  # do something
  return ($result, $second_result);
}
  • Arrays and hashes are usually passed as references: function_name(\@array, \%hash);
  • It is advantageous to pass very long string as references to prevent needless copying: function_name(\$sequence);
  • References need to be dereferenced, e.g. substr($$sequence) or $array->[0]

Bioperl

A large library useful for bioinformatics. This snippet translates DNA sequence to a protein using the standard genetic code:

use Bio::Tools::CodonTable;
sub translate
{
    my ($seq, $code) = @_;
    my $CodonTable = Bio::Tools::CodonTable->new( -id => $code);
    my $result = $CodonTable->translate($seq);

    return $result;
}

HWperl

Materials: the lecture, Connecting to server, Editors

Files and setup

We recommend creating a directory (folder) for this set of tasks:

mkdir perl  # make directory
cd perl     # change to the new directory

The folder /tasks/perl/ contains input files, an example script working with FASTQ files and correct outputs for some of your scripts; see further details below. In this first homework we will provide almost all commands you need to run.


Write your protocol starting from an outline provided in /tasks/perl/protocol.txt. Make your own copy of the protocol and open it in an editor, e.g. kate:

cp -i /tasks/perl/protocol.txt .   # copy protocol
kate protocol.txt &                # open editor, run in the background

Submitting

  • Directory /submit/perl/your_username will be created for you
  • Copy the required files to this directory, including the protocol named protocol.txt
  • You can modify these files freely until deadline, but after the deadline of the homework, you will lose access rights to this directory
  • Here is an example of the commands you can use for submitting and checking submitted files. Change it to include your username
# protocol
cp -ipv protocol.txt /submit/perl/your_username
# tasks A,B
cp -ipv series-stats.pl series-stats.txt fastq2fasta.pl reads-small.fasta reads-tiny.fasta /submit/perl/your_username
# tasks C, D
cp -ipv fastq-quality.pl reads-qualities.tsv fastq-trim.pl reads-small-trim.fastq reads-trim-qualities.tsv /submit/perl/your_username
# check what was submitted
ls -l /submit/perl/your_username

Task A (series)

Running the script from the lecture

  • Consider the program for counting series in the lecture 1, save it to file series-stats.pl
  • Open editor running in the background: kate series-stats.pl &
  • Copy and paste text to the editor, save it
  • Make the script executable: chmod a+x series-stats.pl
  • Try running it on the small input: ./series-stats.pl < /tasks/perl/series-small.tsv
  • You should get the following output:
Black Mirror 3
Game of Thrones 3

Extending the script

  • Extend the script to compute the average rating of each series (averaging over all episodes in the series)
  • Each row of the input table contains rating in column 5.
  • Output a table with three columns: name of series, the number of episodes, the average rating.
  • Use printf to print these three items right-justified in columns of a sufficient width; print the average rating to 1 decimal place.

Running the script

  • If you run your script on the small file, the output should look something like this (the exact column widths may differ):
./series-stats.pl < /tasks/perl/series-small.tsv
        Black Mirror        3        8.2
     Game of Thrones        3        8.8
  • Run your script also on the large file: ./series-stats.pl < /tasks/perl/series.tsv > series-stats.txt
  • Check the output, e.g. by running cat series-stats.txt

Submitting

  • Submit your script, series-stats.pl and the output series-stats.txt

Task B (FASTQ to FASTA)

Introduction

  • In the rest of the assignment, we will write several scripts for working with FASTQ files introduced in the lecture. Similar scripts are often used by bioinformaticians working with DNA sequencing data.
  • We will work with three input files:
    • /tasks/perl/reads-tiny.fastq a tiny version of the read file, including some corner cases
    • /tasks/perl/reads-small.fastq a small version of the read file
    • /tasks/perl/reads.fastq a bigger version of the read file

Goal

  • Write a script which reformats FASTQ file to FASTA format, call it fastq2fasta.pl
    • FASTQ file should be on standard input, FASTA file written to standard output
  • FASTA format is a typical format for storing DNA and protein sequences.
    • Each sequence consists of several lines of the file. The first line starts with ">" followed by identifier of the sequence and optionally some further description separated by whitespace
    • The sequence itself is on the second line, long sequences can be split into multiple lines
  • In our case, the name of the sequence will be the ID of the read with the initial @ replaced by > and each / replaced by (_)
  • For example, the first two reads of the file /tasks/perl/reads.fastq are as follows (only the first 50 columns shown)
@SRR022868.1845/1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAG...
+
IICIIIIIIIIIID%IIII8>I8III1II,II)I+III*II<II,E;-HI...
@SRR022868.1846/1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACA...
+
4CIIIIIIII52I)IIIII0I16IIIII2IIII;IIAII&I6AI+*+&G5...
  • These should be reformatted as follows (again only first 50 columns shown, but you include entire reads):
>SRR022868.1845_1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGA...
>SRR022868.1846_1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACAC...

Start programming

  • You can start by modifying our script /tasks/perl/fastq-lengths.pl, which prints the length of each read in the FASTQ input file. You can use the following commands to start:
# copy our script to your folder under the new name
cp -i /tasks/perl/fastq-lengths.pl fastq2fasta.pl
# open in editor in background
kate fastq2fasta.pl & 

Running the script

  • Run your script on the tiny file, compare with our precomputed correct answer:
./fastq2fasta.pl < /tasks/perl/reads-tiny.fastq > reads-tiny.fasta
diff reads-tiny.fasta /tasks/perl/reads-tiny.fasta
  • Command diff prints differences between files. Here the output of diff should be empty. Otherwise try to look at the input and output files and fix your program to obtain the same output as we have.
  • Also run your script on the small read file
./fastq2fasta.pl < /tasks/perl/reads-small.fastq > reads-small.fasta

Submitting

  • Submit files fastq2fasta.pl, reads-small.fasta, reads-tiny.fasta

Task C (FASTQ quality)

Goal

  • Write a script fastq-quality.pl which for each position in a read computes the average quality
  • FASTQ file should be on standard input. It contains multiple reads, possibly of different lengths
  • As quality we will use ASCII values of characters in the quality string with value 33 subtracted
    • For example character 'A' (ASCII 65) means quality 32.
    • ASCII value can be computed by function ord
  • Positions in reads will be numbered from 0
  • Since reads can differ in length, some positions are used in more reads, some in fewer
  • For each position from 0 up to the highest position used in some read, print three numbers separated by tabs "\t": the position index, the number of times this position was used in reads, the average quality at that position with 1 decimal place (you can again use printf)

Example

  • The first and last lines when you run ./fastq-quality.pl < /tasks/perl/reads-tiny.fastq should be
0	5	28.2
1	5	33.0
...
20	2	32.0
21	2	32.0

For example position 1 occurs in all reads and the qualities are B,A,B,A,D, which have ASCII values 66, 65, 66, 65, 68. After subtracting 33 from each and computing average we get 33. On the last position (21), we have only two reads with qualities A, which translates to value 32.


Running

  • Run the following commands, which runs your script on the large FASTQ file and selects every 10th position. The output is stored in reads-qualities.tsv and printed using cat
./fastq-quality.pl < /tasks/perl/reads.fastq | perl -lane 'print if $F[0] % 10 == 0' > reads-qualities.tsv
cat reads-qualities.tsv
  • This input is a sample of real sequencing data from Illumina technology. What trends (if any) do you see in quality values with increasing position?


Submitting

  • Submit fastq-quality.pl, reads-qualities.tsv
  • In the protocol, discuss the question stated above regarding reads-qualities.tsv

Task D (FASTQ trim)

Goal

  • Write script fastq-trim.pl that trims low quality bases from the end of each read and filters out short reads
  • This script should read a FASTQ file from the standard input and write trimmed FASTQ file to the standard output
  • It should also accept two command-line arguments: character Q and integer L
    • We have not covered processing command line arguments, but you can use the code snippet below
  • Q is the minimum acceptable quality (characters from quality string with ASCII value >= ASCII value of Q are ok)
  • L is the minimum acceptable length of a read
  • First find the last base in a read which has quality at least Q (if any). All bases after this base will be removed from both the sequence and quality string
  • If the resulting read has fewer than L bases, it is omitted from the output

Testing

  • You can check your program by the following tests.
  • If you run the following two commands, nothing should be filtered out and thus you should get file tmp identical with input and thus output of the diff command should be empty
# trim at quality ASCII >=33 and length >=1
./fastq-trim.pl '!' 1 < /tasks/perl/reads-tiny.fastq > tmp
diff /tasks/perl/reads-tiny.fastq tmp
  • If you run the following two commands, nothing is trimmed but the shortest read should be filtered out. Comparing with our solution should produce no differences.
# trim at quality ASCII >=33 and length >=1
./fastq-trim.pl '!' 5 < /tasks/perl/reads-tiny.fastq > reads-tiny-trim1.fastq
diff reads-tiny-trim1.fastq /tasks/perl/reads-tiny-trim1.fastq
  • If you run the following two commands, you should see actual trimming of bases with quality A in many reads and one read is omitted. Again, you should see no differences from our output.
./fastq-trim.pl B 1 < /tasks/perl/reads-tiny.fastq > reads-tiny-trim2.fastq
diff reads-tiny-trim2.fastq /tasks/perl/reads-tiny-trim2.fastq
  • If you run the following commands, you should get empty output (no reads meet the criteria):
./fastq-trim.pl d 1 < /tasks/perl/reads-small.fastq           # quality ASCII >=100, length >= 1
./fastq-trim.pl '!' 102 < /tasks/perl/reads-small.fastq       # quality ASCII >=33 and length >=102

Further runs

  • ./fastq-trim.pl '(' 95 < /tasks/perl/reads-small.fastq > reads-small-trim.fastq # quality ASCII >= 40
  • If you have done task C, run quality statistics on the trimmed version of the bigger file using the commands below. Comment on the differences between statistics on the whole file in part C and filtered file in D. Are they as you expected?
# "2" means quality ASCII >= 50
./fastq-trim.pl 2 50 < /tasks/perl/reads.fastq | ./fastq-quality.pl | perl -lane 'print if $F[0] % 10 == 0' > reads-trim-qualities.tsv
cat reads-trim-qualities.tsv

Submitting

  • Submit files fastq-trim.pl, reads-small-trim.fastq, reads-trim-qualities.tsv
  • In your protocol, include your discussion of the results for reads-trim-qualities.tsv

Note

  • In this task set, you have created tools which can be combined, e.g. you can first trim FASTQ and then convert it to FASTA.

Parsing command-line arguments

  • Use the following snippet, which stores command-line arguments in variables $Q and $L:
#!/usr/bin/perl -w
use strict;

my $USAGE = "
Usage:
$0 Q L < input.fastq > output.fastq

Trim from the end of each read bases with ASCII quality value less
than the given threshold Q. If the length of the read after trimming
is less than L, the read will be omitted from output.

L is a non-negative integer, Q is a character
";

# check that we have exactly 2 command-line arguments
die $USAGE unless @ARGV==2;
# copy command-line arguments to variables Q and L
my ($Q, $L) = @ARGV;
# check that $Q is one character and $L looks like a non-negative integer
die $USAGE unless length($Q)==1 && $L=~/^[0-9]+$/;

Connecting to server

In the course, you will be working on a Linux server. You can connect to this server via ssh, using the same username and password as for AIS2. In the computer classroom at the faculty, we recommend connecting to the server from Linux.

Connection through ssh

If connecting from a Linux computer, open a console (command-line window) and run:

ssh your_username@vyuka.compbio.fmph.uniba.sk -XC

The server will prompt you for password, but it will not display anything while you type. Just type your password and press Enter.

If connecting from a Windows 10 computer, open command-line window in Ubuntu subsystem for Windows or Powershell or cmd.exe Command Prompt and run

ssh your_username@vyuka.compbio.fmph.uniba.sk

When prompted, type your password and press Enter, as for Linux. See also more detailed instructions here [1]

For Windows, this command allows text-only connection, which is sufficient for most of the course. To support for graphics applications, follow instructions in the next section.

Installation of X server on Windows

This is not needed for Linux, just use -XC option in ssh.

To use applications with GUIs, you need to tunnel X-server commands from the server to your local machine (this is accomplished by your ssh client), and you need a program that can interpret these commands on you local machine (this is called X server).

  • Install putty client which you will use instead of ssh.
  • Install X server, such as xming
  • Make sure that X server is running (you should have "X" icon in your app control bar)
  • Run putty, connect using ssh connection type and in your settings choose Connection->SSH->X11 and check "Enable X11 forwarding" box
  • Login to the vyuka.compbio.fmph.uniba.sk server in putty
  • echo $DISPLAY command on the server should show a non-empty string (e.g. localhost:11.0)
  • Try running xeyes &: this simple testing application should display a pair of eyes tracking your mouse cursor

Copying files to/from the server via scp

  • You can copy files using scp command on the command line, both in Windows and Linux
  • Alternatively use the graphical WinSCP program for Windows
  • On Linux, you can mount the filesystem from the server as a directory on your machine and then work with it as with local folders, using both command-line and graphical file managers.


Examples of scp commands

# copies file protocol.txt to /submit/perl/username on server
scp protocol.txt username@vyuka.compbio.fmph.uniba.sk:/submit/perl/username/

# copies file protocol.txt to the home folder of the user on the server
scp protocol.txt username@vyuka.compbio.fmph.uniba.sk:

# copies file protocol.txt from home directory at the server to the current folder on the local computer
scp username@vyuka.compbio.fmph.uniba.sk:protocol.txt .

# copies folder /tasks/perl from the server to the current folder on the local computer
# notice -r option for copying whole directories
scp -r username@vyuka.compbio.fmph.uniba.sk:/tasks/perl .

Example of sshfs command:

mkdir vyuka  # create an empty folder with arbitrary name
sshfs username@vyuka.compbio.fmph.uniba.sk: vyuka
# from now on, folder vyuka will contain you home folder on vyuka server
# you can copy files to and form the server and even open them in editor as if they were on your computer
# however with network-related slowdown

Editors

  • During the course, you will have to edit scripts, protocols and other files
  • Our server has several editors installed
  • If you can use graphical interface, we recommend kate, which is used similarly as Windows-based editors
  • If you need to work in a text-only mode (slow connection or no X support):
    • Simple editor is nano (see keyboard shortcuts at the bottom of the screen)
    • More advanced are vim, emacs, ne (read tutorials before starting using them)
  • When working from Windows, you can also connect to the server via WinScp and edit the files using WinScp built-in editors or other editors installed on your computer
  • When working from Linux, you can mount your home directory using sshfs and again use editors installed on your computer

Command-line basics

Files and folders

  • Images, texts, data, etc. are stored in files.
  • Files are grouped in folders (directories) for better organization.
  • A folder can also contain other folders, forming a tree structure.

Moving around folders (ls, cd)

  • One folder is always selected as the current one; it is shown on the command line
  • The list of files and folders in the current folder can be obtained with the ls command
  • The list of files in some other folder can be obtained with the command ls other_folder
  • The command cd new_folder changes the current folder to the specified new folder
  • Notes: ls is an abbreviation of "list", cd is an abbreviation of "change directory"

Example:

  • When we login to the server, we are in the folder /home/username.
  • We then execute several commands listed below
  • Using cd command, we move to folder /tasks/perl/ (the computer does not print anything, only changes the current folder).
  • Using ls command, we print all files in the /tasks/perl/ folder.
  • Finally we use ls /tasks command to print the folders in /tasks
username@vyuka:~$ cd /tasks/perl/
username@vyuka:/tasks/perl$ ls
fastq-lengths.pl  reads-small.fastq  reads-tiny-trim1.fastq  series.tsv
protocol.txt      reads-tiny.fasta   reads-tiny-trim2.fastq
reads.fastq       reads-tiny.fastq   series-small.tsv
username@vyuka:/tasks/perl$ ls /tasks
bash  bioinf1  bioinf2  bioinf3  cloud  flask  make  perl  python  r1  r2

Absolute a relative paths

  • Absolute path determines how to get to a given file or folder from the root of the whole filesystem.
  • For example /tasks/perl/, /tasks/perl/series.tsv, /home/username etc.
  • Individual folders are separated by a slash / in the path.
  • Absolute paths start with a slash /.
  • Relative path determines how to get to a given file or folder from the current folder.
  • For example, if the current folder is /tasks/perl/, the relative path to file /tasks/perl/series.tsv is simply series.tsv
  • If the current folder is /tasks/, the relative path to file /tasks/perl/series.tsv is perl/series.tsv
  • Relative paths do not start with a slash.
  • A relative path can also go "upwards" to the containing folder using ..
  • For example, if the current folder is /tasks/perl/, the relative path .. will give us the same as /tasks and ../../home will give us /home

Commands ls, cd and others accept both relative and absolute paths.

Important folders

  • Root is the folder with absolute path /, the starting point of the tree structure of folders.
  • Home directory with absolute path /home/username is set as the current folder after login.
    • Users typically store most of their files within their home directory and its subfolders, if there is no good reason to place them elsewhere.
    • Tilde ~ is an abbreviation for your home directory. For example cd ~ will place you there.

Wildcards

  • We can use wildcards to work with only selected files in a folder
  • For example, all files starting with letter x in the current folder can be printed as ls x*
  • The star represents any number of characters in the filename
  • All files containing letter x anywhere in the name can be printed as ls *x*

Examining file content (less)

  • Type less filename
  • This will print the first page of the file on your screen. Then you can move around the file using space or keys Page up and Page down. You can quit the viewer pressing letter q (abbreviation of quit). Help with additional keys is accessed by pressing h (abbreviation of help).
  • If you have a short file and want to just print it all on your screen, use cat filename
  • Try for example the following commands:
less /tasks/perl/reads-small.fastq  # move around the file, then press q
cat /tasks/perl/reads-tiny.fasta    # see the whole file on the screen

Creating new folders (mkdir)

  • To create a new folder (directory), use a command of the form mkdir new_folder_path
  • The path to the new folder can be relative or absolute
  • For example, assume that we are in the home folder, the following two commands both create a new folder called test and folder test2 within it.
mkdir test
# change the next command according to your username
mkdir /home/username/test/test2  

Copying files (cp)

  • To copy files, use a command of the form cp source destination
  • This will copy file specified as the source to the destination.
  • Both source and destination can be specified via absolute or relative paths.
  • The destination can be a folder into which the file is copied or an entire name of the copied file.
  • We can also copy several files to the same folder as follows: cp file1 file2 file3 destination

Example: Let us assume that the current directory is /home/username and that directories test and test2 were created as above. The following will copy file /tasks/perl/reads-small.fastq to the new directory test and afterwards also to the current folder (which is the home directory). In the third step it will be copied again to the current folder under a new name test.fastq. In the final steps it will be copied to the test directory under this new name as well.

# change the next command according to your username
# this copies an existing file to the home directory using absolute paths
cp /tasks/perl/reads-small.fastq /home/username/
# now we use relative paths to copy the file from home to the new folder called test
cp reads-small.fastq test/
# now the file is copied within current folder under a new filename test.fastq
cp reads-small.fastq test.fastq
# change directory to test
cd test
# copy the file again from the home directory to the test directory under name test.fastq
cp ../test.fastq .
# we can copy several files to a different folder
cp test.fastq reads-small.fastq test2/

Other file-related commands (cp -r, mv, rm, rmdir)

  • Copying whole folders can be done via cp -r source destination
  • While using cp, it is good to add -i option which will warn us in case we are going to overwrite some existing file. For example:
cd ~
cp -i reads-small.fastq test/
  • To move files to a new folder or rename them, you can use mv command, which works similarly to cp, i.e. you specify first source, then destination. Option -i can be used here as well.
  • Command rm will delete specified files, rm -r whole folders.
  • An empty folder can be deleted using rmdir

Beware: be very careful on the command-line

  • The command-line will execute whatever you type, it generally does not ask for confirmation, even for dangerous actions.
  • You can very easily remove or overwrite some important file by mistake.
  • There is no undo.
  • Therefore always check your command before pressing Enter. Use -i option for cp, mv, and possibly even rm.

See also

Lbash

HWbash

This lecture introduces command-line tools and Perl one-liners.

  • We will do simple transformations of text files using command-line tools without writing any scripts or longer programs.

When working on the exercises, record all the commands used

  • We strongly recommend making a log of commands for data processing also outside of this course
  • If you have a log of executed commands, you can easily execute them again by copy and paste
  • For this reason any comments are best preceded in the log by #
  • If you use some sequence of commands often, you can turn it into a script

Efficient use of the Bash command line

Some tips for bash shell:

  • use tab key to complete command names, path names etc
  • use up and down keys to walk through the history of recently executed commands, then edit and execute the chosen command
  • press ctrl-r to search in the history of executed commands
  • at the end of session, history stored in ~/.bash_history
  • command history -a appends history to this file right now
    • you can then look into the file and copy appropriate commands to your log
    • for example, print the last commands using tail ~/.bash_history
  • various other history tricks, e.g. special variables [2]
  • cd - goes to previously visited directory (also see pushd and popd)
  • ls -lt | head shows 10 most recent files, useful for seeing what you have done last in a folder

Instead of bash, you can use more advanced command-line environments, e.g. iPhyton notebook

Redirecting and pipes

# redirect standard output to file
command > file

# append to file
command >> file

# redirect standard error
command 2>file

# redirect file to standard input
command < file

# do not forget to quote > in other uses, 
# e.g. when searching for string ">" in a file sequences.fasta
grep '>' sequences.fasta
# (without quotes rewrites sequences.fasta)
# other special characters, such as ;, &, |, # etc
# should be quoted in '' as well

# send stdout of command1 to stdin of command2 
# this is called a pipeline
command1 | command2

# backtick operator executes command, 
# removes trailing \n from stdout, 
# substitutes to command line
# the following commands do the same thing:
head -n 2 file
head -n `echo 2` file

# redirect a string in ' ' to stdin of command head
head -n 2 <<< 'line 1
line 2
line 3'

# in some commands, file argument can be taken from stdin
# if denoted as - or stdin or /dev/stdin
# the following compares uncompressed version of file1 with file2
zcat file1.gz | diff - file2

Make piped commands fail properly:

set -o pipefail

If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default, the pipeline then returns exit status of the rightmost command.

Text file manipulation

Commands echo and cat (creating and printing files)

# print text Hello and end of line to stdout
echo "Hello" 
# interpret backslash combinations \n, \t etc:
echo -e "first line\nsecond\tline"
# concatenate several files to stdout
cat file1 file2

Commands head and tail (looking at start and end of files)

# print 10 first lines of file (or stdin)
head file
some_command | head 
# print the first 2 lines
head -n 2 file
# print the last 5 lines
tail -n 5 file
# print starting from line 100 (line numbering starts at 1)
tail -n +100 file
# print lines 81..100
head -n 100 file | tail -n 20

Documentation: head, tail

Commands wc, ls -lh, od (exploring file statistics and details)

# prints three numbers:
# the number of lines (-l), number of words (-w), number of bytes (-c)
wc file

# prints the size of file in human-readable units (K,M,G,T)
ls -lh file

# od -a prints file or stdin with named characters 
#   allows checking whitespace and special characters
echo "hello world!" | od -a
# prints:
# 0000000   h   e   l   l   o  sp   w   o   r   l   d   !  nl
# 0000015

Documentation: wc, ls, od

Command grep (getting lines matching a regular expression)

# get all lines in file known.fa containing string wall
grep wall known.fa
# -i ignores case (upper case and lowercase letters are the same)
grep -i wall known.fa
# -c counts the number of matching lines in each file in the current folder
grep -c '^[IJ]' *

# other options (there is more, see the manual):
# -v print/count non-matching lines (inVert)
# -n show also line numbers
# -B 2 -A 1 print 2 lines before each match and 1 line after match
# -E extended regular expressions (allows e.g. |)
# -F no regular expressions, set of fixed strings
# -f patterns in a file 
#    (good for selecting e.g. only lines matching one of "good" ids)

Documentation: grep

Commands sort, uniq

# sort lines of a file alphabetically
sort names.txt

# some useful options of sort:
# -g numeric sort
# -k which column(s) to use as key
# -r reverse (from largest values)
# -s stable
# -t fields separator

# sorting output of the first command
# first compare column 3 numerically (-k3,3g),
# in case of ties use column 1 (-k1,1)
# the long result is piped to less to look at manually
grep -v '#' matches.tsv | sort -k3,3g -k1,1 | less

# uniq outputs one line from each group of consecutive identical lines
# uniq -c adds the size of each group as the first column
# the following finds all unique lines
# and sorts them by frequency from the most frequent
sort protocol.txt | uniq -c | sort -gr

Documentation: sort, uniq

Commands diff, comm (comparing files)

Command diff compares two files. It is good for manual checking of differences. Useful options:

  • -b (ignore whitespace differences)
  • -r for comparing whole directories
  • -q for fast checking for identity
  • -y show differences side-by-side

Command comm compares two sorted files. It is good for finding set intersections and differences. It writes three columns:

  • lines occurring only in the first file
  • lines occurring only in the second file
  • lines occurring in both files

Some columns can be suppressed with options -1, -2, -3


Commands cut, paste, join (working with columns)

  • Command cut selects only some columns from file (perl/awk more flexible)
  • Command paste puts two or more files side by side, separated by tabs or other characters
  • Command join is a powerful tool for making joins and left-joins as in databases on specified columns in two files

Commands split, csplit (splitting files to parts)

  • Command split splits into fixed-size pieces (size in lines, bytes etc.)
  • Command csplit splits at occurrence of a pattern. For example, splitting a FASTA file into individual sequences:
csplit sequences.fa '/^>/' '{*}'

Programs sed and awk

Both sed and awk process text files line by line, allowing to do various transformations

  • awk newer, more advanced
  • several examples below
  • More info on awk, sed on Wikipedia
# replace text "Chr1" by "Chromosome 1"
sed 's/Chr1/Chromosome 1/'
# prints the first two lines, then quits (like head -n 2)
sed 2q  

# print the first and second column from a file
awk '{print $1, $2}' 

# print the line if the difference between the first and second column > 10
awk '{ if ($2-$1>10) print }'  

# print lines matching pattern
awk '/pattern/ { print }' 

# count the lines (like wc -l)
awk 'END { print NR }'

Perl one-liners

Instead of sed and awk, we will cover Perl one-liners

# -e executes commands
perl -e 'print 2+3,"\n"'
perl -e '$x = 2+3; print $x, "\n"';

# -n wraps commands in a loop reading lines from stdin
# or files listed as arguments
# the following is roughly the same as cat:
perl -ne 'print'
# how to use:
perl -ne 'print' < input > output
perl -ne 'print' input1 input2 > output
# lines are stored in a special variable $_
# this variable is default argument of many functions, 
# including print, so print is the same as print $_

# simple grep-like commands:
perl -ne 'print if /pattern/'
# simple regular expression modifications
perl -ne 's/Chr(\d+)/Chromosome $1/; print'
# // and s/// are applied by default to $_

# -l removes end of line from each input line and adds "\n" after each print
# the following adds * at the end of each line
perl -lne 'print $_, "*"' 

# -a splits line into words separated by whitespace and stores them in array @F
# The next example prints difference in the numbers stored in columns 7 and 6.
# This is size of an interval which has start and end stored in these columns
grep -v '#' matches.tsv | perl -lane 'print $F[7]-$F[6]' | less

# -F allows to set separator used for splitting (regular expression)
# the next example splits at tabs
perl -F '"\t"' -lane 'print $F[7]-$F[6]'

# END { commands } is run at the very end, after we finish reading input
# the following example computes the average of interval lengths
grep -v '#' matches.tsv | perl -lane '$count++; $sum += $F[7]-$F[6]; END { print $sum/$count; }'
# similarly BEGIN { command } before we start

Other interesting possibilities:

# -i replaces each file with a new transformed version (DANGEROUS!)
# the next example removes empty lines from all .txt files
# in the current directory
perl -lne 'print if length($_)>0' -i *.txt
# the following example replaces a sequence of whitespace by exactly one space 
# and removes leading and trailing spaces from lines in all .txt files
perl -lane 'print join(" ", @F)' -i *.txt

# variable $. contains the line number, $ARGV the name of file or - for stdin
# the following prints filename and line number in front of every line
perl -ne'printf "%s.%d: %s", $ARGV, $., $_' names.txt protocol.txt  | less

# moving files *.txt to have extension .tsv:
#   first print commands 
#   then execute by hand or replace print with system
#   mv -i asks if something is to be rewritten
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; print("mv -i $_ $s")'
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; system("mv -i $_ $s")'

HWbash

Lecture on Perl, Lecture on command-line tools

  • In this set of tasks, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
  • Your commands should work also for other input files with the same format (do not try to generalize them too much, but also do not use very specific properties of a particular input, such as the number of lines etc.)

Preparatory steps and submitting

# create a folder for this homework
mkdir bash
# move to the new folder
cd bash
# link input files to the current folder
ln -s /tasks/bash/known.fa /tasks/bash/yarLip.fa /tasks/bash/matches.tsv /tasks/bash/names.txt .
# copy protocol to the current folder
cp -i /tasks/bash/protocol.txt .
  • Now you can open protocol.txt in your favorite editor and start working
  • Command ln created symbolic links (shortcuts) to the input files, so you can use them under names such as known.fa rather than full paths such as /tasks/bash/known.fa.

When you are done, you can submit all required files as follows (substitute your username):

cp -ipv protocol.txt known.txt pairs.txt frequency.txt best.txt function.txt passwords.csv /submit/bash/your_username

# check what was submitted
ls -l /submit/bash/your_username

Introduction to tasks A-C

  • In these tasks we will again process bioinformatics data. We have two files of sequences in the FASTA format. This time the sequences represent proteins, not DNA, and therefore they use 20 different letters representing different amino acids. Lines starting with '>' contain the identifier of a protein and potentially an additional description. This is followed by the sequence of this protein, which will not be needed in this task. This data comes from the Uniprot database.
  • File /tasks/bash/yarLip.fa is a FASTA file with proteins from the yeast Yarrowia lipolytica. Each protein is identified in the FASTA file only by its ID such as Q6CFX1. We will call the proteins from yarLip.fa query proteins.
  • File /tasks/bash/known.fa is a FASTA file with proteins from several yeast species. Each ID is followed by a description of the biological function of the protein. We will call the proteins from known.fa target proteins.
  • These two sets of proteins were compared by the bioinformatics tool called BLAST, which finds proteins with similar sequences. The results of BLAST are in file /tasks/bash/matches.tsv. This file contains a section for each query protein. This section starts with several comments, i.e. lines starting with # symbol. This is followed by a table with the found matches in the TSV format, i.e., several values delimited by tab characters \t. We will be interested in the first two columns representing the IDs of the query and target proteins, respectively.

Task A (counting proteins)

Steps (1) and (2)

  • Use files known.fa and yarLip.fa to find out how many proteins are in each. Each protein starts with a line starting with the > symbol, so it is sufficient to count those.
  • Beware that > symbol means redirect in bash. Therefore you have to enclose it in single quotation marks '>' so that it is taken literally.
  • For each file write a single command or a pipeline of several commands that will produce the number with the answer. Write the commands and the resulting protein counts to the appropriate sections of your protocol.

Step 3

  • Create file known.txt which contains sequence IDs and descriptions extracted from known.fa. This file will be used in Task C.
  • Leading > should be removed. Any text after OS= in the description should be also removed.
  • This file should be sorted alphabetically.
  • The file should start as follows:
1433_CANAL 14-3-3 protein homolog 
1A1D_SCHPO Probable 1-aminocyclopropane-1-carboxylate deaminase 
2A5D_YEAST Serine/threonine-protein phosphatase 2A 56 kDa regulatory subunit delta isoform 
2AAA_SCHPO Protein phosphatase PP2A regulatory subunit A 
2AAA_YEAST Protein phosphatase PP2A regulatory subunit A 
  • Submit file known.txt, write your commands to the protocol.

Task B (counting matches)

Step (1)

  • From file matches.tsv extract pairs of similar proteins and store them in file pairs.txt.
  • Each line of the file should contain a pair of protein IDs extracted from the first two columns of the matches.tsv file.
  • These IDs should be separated by a single space and the file should be sorted alphabetically.
  • Do not forget to omit lines with comments.
  • Each pair from the input should be listed only once in the output.
  • Commands grep, sort and uniq would be helpful. To select only some columns, you can use cut, awk or a perl one-liner.
  • The file pairs.txt should have 71834 lines (command wc) and it should start as follows:
B5FVA8 PLB1_CANAL
B5FVA8 PLB1_SCHPO
B5FVA8 PLB1_YEAST
  • Submit file pairs.txt and write your commands to the protocol.

Step (2)

  • Find out how many query proteins (from yarLip.fa) have at least one similarity found in matches.tsv. This can be done by counting distinct values in the first column of your pairs.txt file from step (1).
  • We suggest commands cut/awk/perl, sort, uniq, wc
  • The result of your commands should be an output consisting of a single number (and the end-of-line character).
  • Write your answer and commands to the protocol. Compare this number with the total number of query proteins found in Task A(2).

Step (3)

  • For each query protein in the first column of pairs.txt file, count how many times it occurs in the file. The result should be a file named frequency.txt with pairs query protein ID, count separated by space, sorted from the proteins with the highest to the lowest count.
  • To check you answer, look at lines 69 and 70 of the file as follows head -n 70 frequency.txt | tail -n 2
  • You should get the following two lines:
Q6CBP9 207
Q6C6A5 165
  • This means that query protein Q6CBP9 occurs 207 times in the first column of pairs.txt, which means 207 target proteins are similar to it. Protein Q6C6A5 has 165 similar target proteins.
  • Submit file frequency.txt, write your commands to the protocol. Also write to the protocol what is the highest and lowest count in the second column of your file.
  • Note: The query proteins with zero matches are not listed in your file. Their number could be deduced from your results in step (2) and Task A(2) if needed.

Task C (joining information)

Step (1)

  • For each query protein, the first (top) match in matches.tsv represents the strongest similarity.
  • In this step, we want to extract such strongest match for each query protein which has at least one match.
  • The result should be a file best.txt listing the two IDs separated by a space. The file should be sorted by the second column (target ID).
  • The file should start as follows:
F2Z5Y1 1433_CANAL
F2Z6F8 1433_CANAL
Q6C7K1 2A5D_YEAST
Q6C3C5 2AAA_SCHPO
  • This task can be done by printing the lines that are not comments but follow a comment line starting with #.
  • In a Perl one-liner, you can create a state variable which will remember if the previous line was a comment and based on that you decide if you print the current line.
  • Instead of using Perl, you can play with grep. Option -A 1 prints the matching lines as well as one line after each match.
  • Submit file best.txt with the result and write your command to the protocol.

Step 2:

  • Now we want to extend file best.txt with a description of each target protein.
  • Since similar proteins often have similar functions, this will allow somebody studying query proteins from yarLip.fa to learn something about their possible functions based similarity to well-studied proteins from other species.
  • To achieve this, we join together files best.txt and known.txt created in Task A(3). Conveniently, they are both sorted by the ID of the target protein.
  • Use command join to join these files.
  • Use option -1 2 to use the second column of best.txt as a key for joining
  • The output of join may start as follows:
1433_CANAL F2Z5Y1 14-3-3 protein homolog 
1433_CANAL F2Z6F8 14-3-3 protein homolog 
2A5D_YEAST Q6C7K1 Serine/threonine-protein phosphatase 2A 56 kDa regulatory subunit delta isoform 
  • Further reformat the output so that the query ID goes first (e.g. F2Z5Y1), followed by target ID (e.g. 1433_CANAL), followed by the rest of the text.
  • Sort by query ID, store as function.txt
  • The output should start as follows:
B5FVA8 Q5A7D5_CANAL Lysophospholipase
B5FVB0 Q59T91_CANAL Ubiquitin-conjugating enzyme E2 H
B5FVB1 RPAB5_SCHPO DNA-directed RNA polymerases I, II, and III subunit RPABC5
  • Files best.txt and function.txt should have the same number of lines.
  • Which target protein is the best match for the query protein Q6C7M8 and what its function?
  • Submit file best.txt. Write your commands and the answer to the question above to your protocol.

Task D (passwords)

  • The file /tasks/bash/names.txt contains data about several people, one per line.
  • Each line consists of given name(s), surname and email separated by spaces.
  • Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form username@uniba.sk.
  • The task is to generate file passwords.csv which contains a randomly generated password for each of these users
    • The output file has columns separated by commas ','
    • The first column contains username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password
  • Submit file passwords.csv with the result of your commands. Write your commands to the protocol.

Example line from input:

Pavol Orszagh Hviezdoslav hviezdoslav32@uniba.sk

Example line from output (password will differ):

hviezdoslav32,Hviezdoslav,Pavol Orszagh,3T3Pu3un

Hints:

  • Passwords can be generated using pwgen (e.g. pwgen -N 10 -1 prints 10 passwords, one per line)
  • We also recommend using perl, wc, paste (check option -d in paste)
  • In Perl, function pop may be useful for manipulating @F and function join for connecting strings with a separator.

Lmake

HWmake


Job Scheduling

  • Some computing jobs take a lot of time: hours, days, weeks,...
  • We do not want to keep a command-line window open the whole time; therefore we run such jobs in the background
  • Simple commands to do it in Linux:
    • To run the program immediately, then switch the whole console to the background: screen, tmux
    • To run the command when the computer becomes idle: batch
  • Now we will concentrate on Sun Grid Engine, a complex software for managing many jobs from many users on a cluster consisting of multiple computers
  • Basic workflow:
    • Submit a job (command) to a queue
    • The job waits in the queue until resources (memory, CPUs, etc.) become available on some computer
    • The job runs on the computer
    • Output of the job is stored in files
    • User can monitor the status of the job (waiting, running)
  • Complex possibilities for assigning priorities and deadlines to jobs, managing multiple queues etc.
  • Ideally all computers in the cluster share the same environment and filesystem
  • We have a simple training cluster for this exercise:
    • You submit jobs to queue on vyuka
    • They will run on computers runner01 and runner02
    • This cluster is only temporarily available until the next Monday


Submitting a job (qsub)

Basic command: qsub -b y -cwd command 'parameter < input > output 2> error'

  • quoting around command parameters allows us to include special characters, such as <, > etc. and not to apply it to qsub command itself
  • -b y treats command as binary, usually preferable for both binary programs and scripts
  • -cwd executes command in the current directory
  • -N name allows to set name of the job
  • -l resource=value requests some non-default resources
  • for example, we can use -l threads=2 to request 2 threads for parallel programs
  • Grid engine will not check if you do not use more CPUs or memory than requested, be considerate (and perhaps occasionally watch your jobs by running top at the computer where they execute)
  • qsub will create files for stdout and stderr, e.g. s2.o27 and s2.e27 for the job with name s2 and jobid 27

Monitoring and deleting jobs (qstat, qdel)

Command qstat displays jobs of the current user

  • job 28 is running of server runner02 (status <t>r), job 29 is waiting in queue (status qw)
job-ID  prior   name       user         state submit/start at     queue       
---------------------------------------------------------------------------------
     28 0.50000 s3         bbrejova     r     03/15/2016 22:12:18 main.q@runner02
     29 0.00000 s3         bbrejova     qw    03/15/2016 22:14:08             
  • Command qstat -u '*' displays jobs of all users.
  • Finished jobs disappear from the list.
  • Command qstat -F threads shows how many threads available:
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@runner01                BIP   0/1/2          0.00     lx26-amd64    
	hc:threads=1
    238 0.25000 sleeper.pl bbrejova     r     03/05/2020 13:12:28     1        
---------------------------------------------------------------------------------
main.q@runner02                BIP   0/1/2          0.00     lx26-amd64    
    237 0.75000 sleeper.pl bbrejova     r     03/05/2020 13:12:13     1        
  • Command qdel deletes a job (waiting or running)

Interactive work on the cluster (qrsh), screen

Command qrsh creates a job which is a normal interactive shell running on the cluster

  • In this shell you can manually run commands
  • When you close the shell, the job finishes
  • Therefore it is a good idea to run qrsh within screen
    • Run screen command, this creates a new shell
    • Within this shell, run qrsh, then whatever commands
    • By pressing Ctrl-a d you "detach" the screen, so that both shells (local and qrsh) continue running but you can close your local window
    • Later by running screen -r you get back to your shells

Running many small jobs

For example, we many need to run some computation for each human gene (there are roughly 20,000 such genes). Here are some possibilities:

  • Run a script which iterates through all jobs and runs them sequentially
    • Problems: This does not use parallelism, needs more programming to restart after some interruption
  • Submit processing of each gene as a separate job to cluster (submitting done by a script/one-liner)
    • Jobs can run in parallel on many different computers
    • Problem: Queue gets very long, hard to monitor progress, hard to resubmit only unfinished jobs after some failure.
  • Array jobs in qsub (option -t): runs jobs numbered 1,2,3...; number of the current job is in an environment variable, used by the script to decide which gene to process
    • Queue contains only running sub-jobs plus one line for the remaining part of the array job.
    • After failure, you can resubmit only unfinished portion of the interval (e.g. start from job 173).
  • Next: using make in which you specify how to process each gene and submit a single make command to the queue
    • Make can execute multiple tasks in parallel using several threads on the same computer (qsub array jobs can run tasks on multiple computers)
    • It will automatically skip tasks which are already finished, so restart is easy

Make

Make is a system for automatically building programs (running compiler, linker etc)

  • In particular, we will use GNU make
  • Rules for compilation are written in a file typically called Makefile
  • Rather complex syntax with many features, we will only cover basics

Rules

  • The main part of a Makefile are rules specifying how to generate target files from some source files (prerequisites).
  • For example the following rule generates file target.txt by concatenating files source1.txt and source2.txt:
target.txt : source1.txt source2.txt
      cat source1.txt source2.txt > target.txt
  • The first line describes target and prerequisites, starts in the first column
  • The following lines list commands to execute to create the target
  • Each line with a command starts with a tab character
  • If we have a directory with this rule in file called Makefile and files source1.txt and source2.txt, running make target.txt will run the cat command
  • However, if target.txt already exists, the command will be run only if one of the prerequisites has more recent modification time than the target
  • This allows to restart interrupted computations or rerun necessary parts after modification of some input files
  • make automatically chains the rules as necessary:
    • If we run make target.txt and some prerequisite does not exist, make checks if it can be created by some other rule and runs that rule first.
    • In general it first finds all necessary steps and runs them in appropriate order so that each rules has its prerequisites ready.
    • Option make -n target will show which commands would be executed to build target (dry run) - good idea before running something potentially dangerous.

Pattern rules

We can specify a general rule for files with a systematic naming scheme. For example, to create a .pdf file from a .tex file, we use the pdflatex command:

%.pdf : %.tex
      pdflatex $^
  • In the first line, % denotes some variable part of the filename, which has to agree in the target and all prerequisites.
  • In commands, we can use several variables:
    • Variable $^ contains the names of the prerequisites (source).
    • Variable $@ contains the name of the target.
    • Variable $* contains the string matched by %.

Other useful tricks in Makefiles

Variables

Store some reusable values in variables, then use them several times in the Makefile:

MYPATH := /projects/trees/bin

target : source
       $(MYPATH)/script < $^ > $@

Wildcards, creating a list of targets from files in the directory

The following Makefile automatically creates .png version of each .eps file simply by running make:

EPS := $(wildcard *.eps)
EPSPNG := $(patsubst %.eps,%.png,$(EPS))

all:  $(EPSPNG)

clean:
        rm $(EPSPNG)

%.png : %.eps
        convert -density 250 $^ $@
  • Variable EPS contains names of all files matching *.eps.
  • Variable EPSPNG contains names of desired .png files.
    • It is created by taking filenames in EPS and changing .eps to .png
  • all is a "phony target" which is not really created
    • Its rule has no commands but all .png files are prerequisites, so are done first.
    • The first target in a Makefile (in this case all) is default when no other target is specified on the command-line.
  • clean is also a phony target for deleting generated .png files.
  • Thus run make all or just make to create png files, make clean to delete them.

Useful special built-in target names

Include these lines in your Makefile if desired

.SECONDARY:
# prevents deletion of intermediate targets in chained rules

.DELETE_ON_ERROR:
# delete targets if a rule fails

Parallel make

Running make with option -j 4 will run up to 4 commands in parallel if their dependencies are already finished. This allows easy parallelization on a single computer.

Alternatives to Makefiles

  • Bioinformaticians often uses "pipelines" - sequences of commands run one after another, e.g. by a script or make
  • There are many tools developed for automating computational pipelines in bioinformatics, see e.g. this review: Jeremy Leipzig; A review of bioinformatic pipeline frameworks. Brief Bioinform 2016.
  • For example Snakemake
    • Snake workflows can contain shell commands or Python code
    • Big advantage compared to make: pattern rules may contain multiple variable portions (in make only one % per filename)
    • For example, assume we have several FASTA files and several profiles (HMMs) representing protein families and we want to run each profile on each FASTA file:
rule HMMER:
     input: "{filename}.fasta", "{profile}.hmm"
     output: "{filename}_{profile}.hmmer"
     shell: "hmmsearch --domE 1e-5 --noali --domtblout {output} {input[1]} {input[0]}"

Overall advantages of using command-line one-liners and make

  • Compared to popular Python frameworks, such as Pandas, many command-line utilities process files line-by-line and thus do not need to load big files into memory.
  • For small easy tasks (e.g. counting lines, looking occurrences of a specific string, looking at first lines of the file etc.) it might be less effort to run a simple one-liner than to open the file in some graphical interface (Jupyter notebook, Excel) and finding the answer. This is true provided that you take time to learn command-line utilities.
  • Command-line commands, scripts and makefiles can be easily run in background on your computer or even on remote servers.
  • It is easy to document what exactly was run by keeping log of executed commands as well as any scripts used in the process.

HWmake

See also the lecture

Motivation: Building phylogenetic trees

The task for today will be to build a phylogenetic tree of 9 mammalian species using protein sequences

  • A phylogenetic tree is a tree showing evolutionary history of these species. Leaves are the present-day species, internal nodes are their common ancestors.
  • The input contains sequences of all proteins from each species (we will use only a smaller subset)
  • The process is typically split into several stages shown below

Identify ortholog groups

Orthologs are proteins from different species that "correspond" to each other. Orthologs are found based on sequence similarity and we can use a tool called blast to identify sequence similarities between pairs of proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species.

Align proteins on each group

For each ortholog group, we need to align proteins in the group to identify corresponding parts of the proteins. This is done by a tool called muscle

Unaligned sequences (start of protein O60568):

>human
MTSSGPGPRFLLLLPLLLPPAASASDRPRGRDPVNPEKLLVITVA...
>baboon
MTSSRPGLRLLLLLLLLPPAASASDRPRGRDPVNPEKLLVMTVA...
>dog
MASSGPGLRLLLGLLLLLPPPPATSASDRPRGGDPVNPEKLLVITVA...
>elephant
MASWGPGARLLLLLLLLLLPPPPATSASDRSRGSDRVNPERLLVITVA...
>guineapig
MAFGAWLLLLPLLLLPPPPGACASDQPRGSNPVNPEKLLVITVA...
>opossum
SDKLLVITAA...
>pig
AMASGPGLRLLLLPLLVLSPPPAASASDRPRGSDPVNPDKLLVITVA...
>rabbit
MGCDSRKPLLLLPLLPLALVLQPWSARGRASAEEPSSISPDKLLVITVA...
>rat
MAASVPEPRLLLLLLLLLPPLPPVTSASDRPRGANPVNPDKLLVITVA...

Aligned sequences:

rabbit       MGCDSRKPLL LLPLLPLALV LQPW-SARGR ASAEEPSSIS PDKLLVITVA ...
guineapig    MAFGA----W LLLLPLLLLP PPPGACASDQ PRGSNP--VN PEKLLVITVA ...
opossum      ---------- ---------- ---------- ---------- SDKLLVITAA ...
rat          MAASVPEPRL LLLLLLLLPP LPPVTSASDR PRGANP--VN PDKLLVITVA ...
elephant     MASWGPGARL LLLLLLLLLP PPPATSASDR SRGSDR--VN PERLLVITVA ...
human        MTSSGPGPRF LLLLPLLL-- -PPAASASDR PRGRDP--VN PEKLLVITVA ...
baboon       MTSSRPGLRL LLLLLLL--- -PPAASASDR PRGRDP--VN PEKLLVMTVA ...
dog          MASSGPGLRL LLGLLLLL-P PPPATSASDR PRGGDP--VN PEKLLVITVA ...
pig          AMASGPGLR- LLLLPLLVLS PPPAASASDR PRGSDP--VN PDKLLVITVA ...

Build phylogenetic tree for each grup

For each alignment, we build a phylogenetic tree for this group. We will use a program called RAxML.

Example of a phylogenetic tree in newick format:

 ((opossum:0.09636245,rabbit:0.85794020):0.05219782,
(rat:0.07263127,elephant:0.03306863):0.01043531,
(dog:0.01700528,(pig:0.02891345,
(guineapig:0.14451043,
(human:0.01169266,baboon:0.00827402):0.02619598
):0.00816185):0.00631423):0.00800806);
Tree for gene O60568 (note: this particular tree does not agree well with real evolutionary history)

Build a consensus tree

The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so the trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using a tool called Phylip. The output is a single consensus tree.


Files and submitting

Our goal is to build a pipeline that automates the whole task using make and execute it remotely using qsub. Most of the work is already done, only small modifications are necessary.

  • Submit by copying requested files to /submit/make/username/
  • Do not forget to submit protocol, outline of the protocol is in /tasks/make/protocol.txt

Start by copying folder /tasks/make to your home folder

cp -ipr /tasks/make .
cd make

The folder contains three subfolders:

  • large: a larger sample of proteins for task A
  • tiny: a very small set of proteins for task B
  • small: a slightly larger set of proteins for task C

Task A (long job)

  • In this task, you will run a long alignment job (more than two hours)
  • Go to folder large, which contains the following files:
    • ref.fa: selected human proteins
    • other.fa: selected proteins from 8 other mammalian species
    • Makefile: runs blast on ref.fa vs other.fa (also formats database other.fa before that)
  • Run make -n to see what commands will be done (you should see makeblastdb, blastp, and echo for timing).
    • copy the output to the protocol
  • Run qsub with appropriate options to run make (at least -cwd -b y).
  • Then run qstat > queue.txt
    • Submit file queue.txt showing your job waiting or running
  • When your job finishes, check the following files:
    • the output file ref.blast
    • standard output from the qsub job, which is stored in a file named e.g. make.oX where X is the number of your job. The output shows the time when your job started and finished (this information was written by commands echo in the Makefile)
  • Submit the last 100 lines from ref.blast under the name ref-end.blast (use tool tail -n 100) and the file make.oX mentioned above

Task B (finishing Makefile)

  • In this task, you will finish a Makefile for splitting blast results into ortholog groups and building phylogenetic trees for each group.
    • This Makefile works with much smaller files and so you can run it quickly many times without qsub
  • Work in folder tiny
    • ref.fa: 2 human proteins
    • other.fa: a selected subset of proteins from 8 other mammalian species
    • Makefile: a longer makefile
    • brm.pl: a Perl script for finding ortholog groups and sorting them to folders

The Makefile runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3.

  • If you run make without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
  • Stage 1: run as make ref.brm
    • It runs blast as in task A, then splits proteins into ortholog groups and creates one folder for each group with file prot.fa containing protein sequences
  • Stage 2: run as make alignments
    • In each folder with an ortholog group, it will create an alignment prot.phy.
  • Stage 3: run as make trees (needs to be written by you)
    • In each folder with an ortholog group, it should create files prot.lg.tree and prot.wag.tree containing the results of the RAxML program run with two different evolutionary models WAG and LG.
    • Create a rule that will create prot.lg.tree from prot.phy in one gene folder. Use % for the folder name, similarly as rules to make alignments. Within the rule, you can use $* variable to get the name of this folder.
    • Run RAxML by commands of the form:
      cd GENE_FOLDER; raxmlHPC -m PROTGAMMALG -p 12345 -s prot.phy -T 2 -o opossum -n LG >prot.raxml.LG
    • Change GENE_FOLDER to the appropriate filename of the folder using make variables.
    • RAxML will create several output files starting with RAxML_ in the folder with the gene. We are interested in the file containing word result in its filename. Use mv command to move this file to the desired filename prot.lg.tree in the folder with the gene. Again use an appropriate variable to specify the folder name.
    • So the rule for running RAxML with LG model will have two lines with commands, the first running cd and raxmlHPC, the second running mv.
    • When this is done, create a similar rule except replace LG by WAG (lg by wag) in the commands, including -m PROTGAMMALG option which will change to -m PROTGAMMAWAG.
    • Also add variables LG_TREES and WAG_TREES listing filenames of all desirable trees and uncomment phony target trees which uses these variables.
  • Stage 4: run as make consensus
    • Output trees from stage 3 are concatenated for each model separately to files lg/intree, wag/intree and then phylip is run to produce consensus trees lg.tree and wag.tree
    • This stage also needs variables LG_TREES and WAG_TREES to be defined by you.
  • Run your Makefile and check that the files lg.tree and wag.tree are produced
  • Submit the whole folder tiny (using cp -r), including Makefile and all gene folders with tree files.

Task C (running make)

  • Copy your Makefile from part B to folder small, which contains 9 human proteins and run make on this slightly larger set
    • Again, run it without qsub, but it will take some time, particularly if the server is busy
  • Look at the two resulting trees (wag.tree, lg.tree) using the figtree program
    • it is available on vyuka, but you can also install it on your computer if needed
  • In figtree, switch on displaying branch labels. These labels show for each branch of the tree, how many of the input trees support this branch. To do this, use the left panel with options.
  • Export the trees in pdf format as wag.tree.pdf and lg.tree.pdf
  • Compare the two trees
    • Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
    • Do the two trees differ? What is the highest and lowest support for a branch in each tree?
    • Also compare your trees with the accepted "correct tree" found here http://genome-euro.ucsc.edu/images/phylo/hg38_100way.png (note that this tree contains many more species, but all ours are included).
    • Write your observations to the protocol
  • Submit the entire small folder (including the two pdf files).

Further possibilities

Here are some possibilities for further experiments, in case you are interested (do not submit these):

  • You could copy your extended Makefile to folder large and create trees for all ortholog groups in the big set.
    • This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A.
    • After ref.brm si done, programs for individual genes can be run in parallel, so you can try running make -j 2 and request 2 threads from qsub
  • RAxML also supports other models, for example JTT (for more information, see manual); you could try to play with those.
  • Command touch FILENAME will change the modification time of the given file to the current time.
    • What happens when you run touch on some of the intermediate files in the analysis in task B? Does Makefile always run properly?

Lpython

HWpython

This lecture introduces the basics of the Python programming language and the basics of working with databases using the SQL language and SQLite3 lightweight database system.

The next three lectures

  • Computer science students will use Python, SQLite3 and several advanced Python libraries for complex data processing
  • Bioinformatics students will use several bioinformatics command-line tools

Overview, documentation

Outline of this lecture

  • We introduce a simple data set.
  • We show several Python scripts for processing this data set. We will skip this part because most of you know Python.
  • We introduce basics of working with SQLite3 and writing SQL queries.
  • We look at how to combine Python and SQLite.

Python

SQL

  • SQL is a language for working with relational databases. It is covered in more detail in dedicated database courses.
  • We will cover basics of SQL and work with a simple DB system SQLite3.
  • Advantages of storing data in a database rather than CSV files and similar:
    • A database can contain multiple tables and connect them through shared identifiers.
    • You can create indices which allow fast access to records matching your criteria without searching through all data.
    • A database system can compute complex queries without loading all data into memory.
  • Typical database systems are complex, use server-client architecture. In contrast, SQLite3 is a simple "database" stored in one file.
    • It is best used by a single process at a time, although multiple processes can read concurrently.
    • It can be easily created, does not need extensive configuration typical for more complex database systems.
  • SQLite3 documentation
  • SQL tutorial
  • SQLite3 in Python documentation

Dataset for this lecture

  • IMDb is an online database of movies and TV series with user ratings.
  • We have downloaded a preprocessed dataset of selected TV series ratings from GitHub.
  • From this dataset, we have selected 10 series with high average number of voting users.
  • Data are two files in the CSV format: list of series, list of episodes.
  • CSV stands for comma-separated values.

File series.cvs contains one row per series

  • Columns: (0) series id, (1) series title, (2) TV channel:
3,Breaking Bad,AMC
2,Sherlock,BBC
1,Game of Thrones,HBO 

File episodes.csv contains one row per episode:

  • Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
  • Here is a sample of 4 episodes from the Game of Thrones series.
  • If the episode title contains a comma, the whole title is in quotation marks.
1,"Dark Wings, Dark Words",22,3,2,8.6,12714
1,No One,58,6,8,8.3,20709
1,Battle of the Bastards,59,6,9,9.9,138353
1,The Winds of Winter,60,6,10,9.9,93680

Note that a different version of this data was used already in the lecture on Perl.

Several Python scripts

We will illustrate basic features of Python on several scripts working with these CSV files.

  • In the first two examples we just use standard Python functions for reading files and split lines into columns by split command.
  • This does not work well for episodes.csv file where comma sometimes separates columns and sometimes is in quotes within a column. Therefore we use csv module, which is one the the standard Python modules.
  • Alternatively, one could import CSV files via more complex libraries, such as Pandas.
  • We will see that similar tasks as these scripts can be done by very short SQL commands. In Pandas we could also achieve similar results using a few commands.

prog1.py

The first script prints the second column (series title) from series.csv

#! /usr/bin/python3

# open a file for reading
with open('series.csv') as csvfile:
    # iterate over lines of the input file
    for line in csvfile:
        # split a line into columns at commas
        columns = line.split(",")
        # print the second column
        print(columns[1])
  • Python uses indentation to delimit blocks. In this example, the with command starts a block and within this block the for command starts another block containing commands columns=... and print. The body of each block is indented several spaces relative to the enclosing block.
  • Variables are not declared, but directly used. This program uses variables csvfile, line, columns.
  • The open command opens a file (here for reading, but other options are available).
  • The with command opens the file, stores the file handle in csvfile variable, executes all commands within its block and finally closes the file.
  • The for-loop iterates over all lines in the file, assigning each in variable line and executing the body of the block.
  • Method split of the built-in string type str splits the line at every comma and returns a list of strings, one for every column of the table (see also other string methods)
  • The final line prints the second column and the end of line character.

prog2.py

The following script prints the list of series of each TV channel.

  • For illustration we also separately count the number of the series for each channel, but the count could be also obtained as the length of the list.
#! /usr/bin/python3
from collections import defaultdict

# Create a dictionary in which default value
# for non-existent key is 0 (type int)
# For each channel we will count the series
channel_counts = defaultdict(int)

# Create a dictionary for keeping a list of series per channel
# default value empty list
channel_lists = defaultdict(list)

# open a file and iterate over lines
with open('series.csv') as csvfile:
    for line in csvfile:
        # strip whitespace (e.g. end of line) from end of line
        line = line.rstrip()
        # split line into columns, find channel and series names
        columns = line.split(",")
        channel = columns[2]
        series = columns[1]
        # increase counter for channel
        channel_counts[channel] += 1
        # add series to list for the channel
        channel_lists[channel].append(series)

# print counts
print("Counts:")
for (channel, count) in channel_counts.items():
    print(f"Channel \"{channel}\" has {count} series.")

# print series lists
print("\nLists:")
for channel in channel_lists:
    list = ", ".join(channel_lists[channel]) 
    print("Series for channel \"%s\": %s" % (channel,list))
  • In this script, we use two dictionaries (maps, associative arrays), both having channel names as keys. Dictionary channel_counts stores the number of series, channel_lists stores the list of series names.
  • For simplicity we use a library data structure called defaultdict instead of a plain python dictionary. The reason is easier initialization: keys do not need to be explicitly inserted to the dictionary, but are initialized with a default value at the first access.
  • Reading of the input file is similar to the previous script.
  • Afterwards we iterate through the keys of both dictionaries and print both the keys and the values.
  • We format the output string using f-strings f"..." where expressions in { } are evaluated and substituted to the string. Formatting similar to C-style printf, e.g. print(f"{2/3:.3f}").
  • Notice that when we print counts, we iterate through pairs (channel, count) returned by channel_counts.items(), while when we print series, we iterate through keys of the dictionary.

prog3.py

This script finds the episode with the highest number of votes among all episodes.

  • We use a library for csv parsing to deal with quotation marks around episode names with commas, such as "Dark Wings, Dark Words".
  • This is done by first opening a file and then passing it as an argument to csv.reader, which returns a reader object used to iterate over rows.
#! /usr/bin/python3
import csv

# keep maximum number of votes and its episode
max_votes = 0
max_votes_episode = None

# open a file
with open('episodes.csv') as csvfile:
    # create a reader for parsing csv files
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    # iterate over rows already split into columns
    for row in reader:
        votes = int(row[6])
        if votes > max_votes:
            max_votes = votes
            max_votes_episode = row[1]
        
# print result
print(f"Maximum votes {max_votes} in episode \"{max_votes_episode}\"")

prog4.py

The following script shows an example of function definition.

  • The function reads a whole csv file into a 2d array.
  • The rest of the program calls this function twice for each of the two files.
  • This could be followed by some further processing of these 2d arrays.
#! /usr/bin/python3
import csv

def read_csv_to_list(filename):
    # create empty list
    rows = []
    # open a file
    with open(filename) as csvfile:
        # create a reader for parsing csv files
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        # iterate over rows already split into columns
        for row in reader:
            rows.append(row)
    return rows

series = read_csv_to_list('series.csv')
episodes = read_csv_to_list('episodes.csv')
print(f"The number of episodes is {len(episodes)}.")
# further processing of series and episodes...

SQL and SQLite

Creating a database

SQLite3 database is a file with your data stored in a special format. To load our csv files to a SQLite database, run command:

sqlite3 series.db < create_db.sql

Contents of create_db.sql file:

CREATE TABLE series (
  id INT,
  title TEXT,
  channel TEXT
);
.mode csv
.import series.csv series
CREATE TABLE episodes (
  seriesId INT,
  title TEXT,
  orderInSeries INT,
  season INT,
  orderInSeason INT,
  rating REAL,
  votes INT
);
.mode csv
.import episodes.csv episodes
  • The two CREATE TABLE commands create two tables named series and episodes.
  • For each column (attribute) of the table we list its name and type.
  • Commands starting with a dot are special SQLite3 commands, not part of SQL itself. Command .import reads a CSV file and stores it in a table.

Other useful SQLite3 commands;

  • .schema tableName (lists columns of a given table)
  • .mode column and .headers on (turn on human-friendly formatting, not good for further processing)

SQL queries

  • Run sqlite3 series.db to get an SQLite command-line where you can interact with your database.
  • Then type the queries below which illustrate the basic features of SQL.
  • In these queries, we use uppercase for SQL keywords and lowercase for our names of tables and columns (SQL keywords are not case sensitive).
/*  switch on human-friendly formatting */
.mode column
.headers on

/* print title of each series (as prog1.py) */
SELECT title FROM series;

/* sort titles alphabetically */
SELECT title FROM series ORDER BY title;

/* find the highest vote number among episodes */
SELECT MAX(votes) FROM episodes;

/* find the episode with the highest number of votes, as prog3.py */
SELECT title, votes FROM episodes
  ORDER BY votes DESC LIMIT 1;

/* print all episodes with at least 50k votes, order by votes */
SELECT title, votes FROM episodes
  WHERE votes>50000 ORDER BY votes DESC;

/* join series and episodes tables, print 10 episodes
 * with the highest number of votes */
SELECT s.title, e.title, votes
  FROM episodes AS e, series AS s
  WHERE e.seriesId=s.id
  ORDER BY votes DESC 
  LIMIT 10;

/* compute the number of series per channel, as prog2.py */
SELECT channel, COUNT() AS series_count
  FROM series GROUP BY channel;

/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;

Parts of a typical SQL query:

  • SELECT followed by column names, or functions MAX, COUNT etc. These columns or expressions are printed for each row of the table, unless filtered out (see below). Individual columns of the output can be given aliases by keyword AS.
  • FROM followed by a list of tables. Tables also can get aliases (FROM episodes AS e).
  • WHERE followed by a condition used for filtering the rows.
  • ORDER BY followed by an expression used for sorting the rows.
  • LIMIT followed by the maximum number of rows to print.

More complicated concepts:

  • GROUP BY allows to group rows based on common value of some columns and compute statistics per group (count, maximum, sum etc).
  • If you list two tables in FROM, you will conceptually create all pairs of rows, one from one table, one from the other. These are then typically filtered in the FROM clause to only those that have a matching ID (for example WHERE e.seriesId=s.id in one of the queries above).

Accessing a database from Python

We will use sqlite3 library for Python to access data from the database and to process them further in the Python program.

read_db.py

  • The following script illustrates running a SELECT query and getting the results.
  • To start using a database, we first create objects called Connection and Cursor. We use the cursor to run individual SQl queries.
  • If we want to use particular values in the queries, which we have stored in variables, we use placeholders "?" in the query itself and pass the values to the execute command as a tuple.
    • sqlite3 module ensures that the values are passed to the database correctly. If we inserted the values directly into the query string, we could face problems if the values contain e.g. quotation marks leading to SQL syntax errors or even SQL injection attacks.
#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('series.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

# run a select query
# supply parameters of the query using placeholders ?
threshold = 40000
cursor.execute("""SELECT title, votes FROM episodes
  WHERE votes>? ORDER BY votes desc""", (threshold,))

# retrieve results of the query
for row in cursor:
    print(f"Episode \"{row[0]}\" votes {row[1]}")
    
# close db connection
connection.close()

write_db.py

This script illustrates creating a new database containing a multiplication table.

  • When we modify the database, it is important to run commit command to actually write them to the disk.
#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('multiplication.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

cursor.execute("""
CREATE TABLE mult_table (
a INT, b INT, mult INT)
""")

for a in range(1,11):
    for b in range(1,11):
        cursor.execute("INSERT INTO mult_table (a, b, mult) VALUES (?, ?, ?)",
                       (a, b, a * b))

# important: save the changes
connection.commit()
    
# close db connection
connection.close()

We can check the result by running command

sqlite3 multiplication.db "SELECT * FROM mult_table;"

HWpython

See also the lecture.

Introduction

  • Task A: introduction to Python
  • Tasks B1, B2: introduction to SQL
  • Task C: Python+SQL
  • Task D: SQL and optionally also Python

If you are a beginner in Python (not DAV students!), do tasks A,B1,B2,C. Otherwise do tasks B1,B2,C,D.

Preparation and submitting

Before you start working, copy files to your home folder:

mkdir python
cd python
cp -iv /tasks/python/* .

The folder contains the following files:

  • *.py: python scripts from the lecture, included for convenience
  • series.csv, episodes.csv: data files introduced in the lecture
  • create_db.sql: SQL commands to create the database needed in tasks B1, B2, C, D
  • protocol.txt: fill in and submit the protocol.

Submit by copying requested files to /submit/python/username/ as follows (use the version for beginners or non-beginners):

# for beginners in Python:
cp -ipv protocol.txt taskA.py taskA.txt taskB1.sql taskB1.txt taskB2.sql taskB2.txt taskC.py series.db /submit/python/username/

# for non-beginners in Python:
cp -ipv protocol.txt taskB1.sql taskB1.txt taskB2.sql taskB2.txt taskC.py series.db taskD.py taskD.sql taskD.txt /submit/python/username/
# one of taskD.py and taskD.sql will be missing
# this is ok

Task A (Python)

Write a script taskA.py which reads both csv files and outputs for each TV channel the total number of episodes in their series combined. Run your script as follows:

./taskA.py > taskA.txt

One of the lines of your output should be:

The number of episodes for channel "HBO" is 76

Submit file taskA.py with your script and the output file taskA.txt:

Hints:

  • A good place to start is prog4.py with reading both csv files and prog2.py with a dictionary of counters
  • It might be useful to build a dictionary linking the series id to the channel name for that series

Task B1 (SQL)

To prepare the database for tasks B1, B2, C and D, run the command:

sqlite3 series.db < create_db.sql

To verify that your database was created correctly, you can run the following commands:

sqlite3 series.db ".tables"
# output should be  episodes  series  

sqlite3 series.db "select count() from episodes; select count() from series;"
# output should be 348 and 10

The last query in the lecture counts the number of episodes and average rating per each season of each series:

SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;

Use join with the series table to replace the numeric series id with the series title and add the channel name. Write your SQL query to file taskB1.sql. The first two lines of the file should be

.mode column
.headers on

Run your query as follows:

sqlite3 series.db < taskB1.sql > taskB1.txt

For example, both seasons of True Detective by HBO have 8 episodes and average ratings 9.3 and 8.25

True Detective   HBO         1           8              9.3       
True Detective   HBO         2           8              8.25      

Submit taskB1.sql and taskB1.txt

Task B2 (SQL)

For each channel, compute the total count and average rating of all their episodes. Write your SQL query to file taskB2.sql. As before, the first two lines of the file should be

.mode column
.headers on

Run your query as follows:

sqlite3 series.db < taskB2.sql > taskB2.txt

For example, all 76 episodes for the two HBO series have average rating as follows:

HBO         76          8.98947368421053

Submit taskB2.sql and taskB2.txt

Task C (Python+SQL)

If you have not done so already, create an SQLite database, as explained at the beginning of task B1.

Write a Python script that runs the last query from the lecture (shown below) and stores its results in a separate table called seasons in the series.db database file

/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;
  • SQL can store the results from a query directly in a table, but in this task you should instead read each row of the SELECT query in Python and to store it by running INSERT command from Python.
  • Also do not forget to create the new table in the database with appropriate column names and types. Execute CREATE TABLE command from Python.
  • The cursor from the SELECT query is needed while you iterate over the results. Therefore create two cursors - one for reading the database and one for writing.
  • If you change your database during debugging, you can start over by running the command for creating the database above.
  • Store the script as taskC.py.

To check that your table was created, you can run command

sqlite3 series.db "SELECT * FROM seasons;"

This will print many lines, including this one: "5|1|8|9.3" which is for season 1 of series 5 (True Detective).

Submit your script taskC.py and the modified database series.db.

Task D (SQL, optionally Python)

For each pair of consecutive seasons within each series, compute how much has the average rating increased or decreased.

  • For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is 0.44166666666667
  • Print a table containing series name, season number x, average rating in season x and average rating in season x+1
  • The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons with the highest drop.
  • One option is to run a query in SQL in which you join the seasons table from task C with itself and select rows that belong to the same series and successive seasons.
  • You can also read the rows of the seasons table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting.
  • Submit your code as taskD.py or taskD.sql and the resulting table as taskD.txt

The output should start like this (the formatting may differ):

series      season x    rating for x  rating for x+1  
----------  ----------  ------------  ----------------
Sherlock    1           8.825         9.26666666666667
Breaking B  4           9.0           9.375           

When using SQL without Python, include the following two lines in taskD.sql

.mode column
.headers on

and run your query as sqlite3 series.db < taskD.sql > taskD.txt

Lweb

HWweb

It is 2021. Use python3! The default `python` command on vyuka server is python 2.7. Some of the packages do not work with python2. If you type `python3` you will get python3.

Sometimes you may be interested in processing data which is available in the form of a website consisting of multiple webpages (for example an e-shop with one page per item or a discussion forum with pages of individual users and individual discussion topics).

In this lecture, we will extract information from such a website using Python and existing Python libraries. We will store the results in an SQLite database. These results will be analyzed further in the following lectures.

Scraping webpages

In Python, the simplest tool for downloading webpages is requests package:

import requests
r = requests.get("http://en.wikipedia.org")
print(r.text[:10])

Parsing webpages

When you download one page from a website, it is in HTML format and you need to extract useful information from it. We will use beautifulsoup4 library for parsing HTML.

Parsing a webpage:

import requests
from bs4 import BeautifulSoup

text = requests.get("http://en.wikipedia.org").text

parsed = BeautifulSoup(text, 'html.parser')

Parsed variable contains HTML tree. Since our homework is mostly cared about user comments, we need to find appropriate element with user comments in the HTML tree. Here we show several examples of finding elements of interest and extracting data from them.

To select all nodes with tag <a> you might use:

>>> links = parsed.select('a')
>>> links[1]
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>

Getting inner text from the element is done by:

>>> links[1].string
'Jump to navigation'

To select and <li> element and traversing children of it, you can use following code. Also note difference between string and text (text contains text from all descendants):

>>> li = parsed.select('li')
>>> li[10]
<li><a href="/wiki/Fearless_(Taylor_Swift_album)" title="Fearless (Taylor Swift album)"><i>Fearless</i> (Taylor Swift album)</a></li>
>>> for ch in li[10].children:
...   print(ch.name)
... 
a
>>> ch.string
>>> ch.text
'Fearless (Taylor Swift album)'

Here are examples of more complicated selection of HTML elements.

parsed.select('li i')  # tag inside a tag (not direct descendant), returns inner tag
parsed.select('li > i')  # tag inside a tag (direct descendant), returns inner tag
parsed.select('li > i')[0].parent  # gets the parent tag
parsed.select('.interlanguage-link-target') # select anything with class = "..." atribute
parsed.select('a.interlanguage-link-target') # select <a> tag with class = "..." atribute
parsed.select('li .interlanguage-link-target') # select <li> tag followed by anything with class = "..." atribute
parsed.select('#top') # select anything with id="top"
  • In your code, we recommend following the examples at the beginning of the documentation and the example of CSS selectors. Also you can check out general syntax of CSS selectors.
  • Information you need to extract is located within the structure of the HTML document
  • To find out, how is the document structured, use Inspect element feature in Chrome or Firefox (right click on the text of interest within the website). For example this text on the course webpage is located within LI element, which is within UL element, which is in 4 nested DIV elements, one BODY element and one HTML element. Some of these elements also have a class (starting with a dot) or an ID (starting with #).

Web-screenshot1.png Web-screenshot2.png

  • Based on this information, create a CSS selector and then extract relevant text from selected nodes.

Parsing dates

To parse dates (written as a text), you have two options:

>>> import datetime
>>> datetime_str = '9.10.18 13:55:26'
>>> datetime.datetime.strptime(datetime_str, '%d.%m.%y %H:%M:%S')
datetime.datetime(2018, 10, 9, 13, 55, 26)
  • dateutil package. Beware, that default setting prefer "month.day.year" format. This can be fixed with "dayfirst" flag.
>>> import dateutil.parser
>>> dateutil.parser.parse('2012-01-02 15:20:30')
datetime.datetime(2012, 1, 2, 15, 20, 30)
>>> dateutil.parser.parse('02.01.2012 15:20:30')
datetime.datetime(2012, 2, 1, 15, 20, 30)
>>> dateutil.parser.parse('02.01.2012 15:20:30', dayfirst=True)
datetime.datetime(2012, 1, 2, 15, 20, 30)


Other useful tips

  • Don't forget to commit changes to your SQLite3 database (call db.commit()).
  • SQL command CREATE TABLE IF NOT EXISTS can be useful at the start of your script.
  • Use screen command for long running scripts. Screen allows to you keep script running, even if you close the ssh connection (or it drops due to bad wifi).
  • To get out if screen use Ctrl+A and then press D.
  • To get back in write screen -r. (Or screen -r -d if screen is attached somewhere else in the background).
  • All packages are installed on our server. If you use your own laptop, you need to install them using pip (preferably in an virtualenv).

HWweb

See the lecture

Submit by copying requested files to /submit/web/username/

General goal: Scrape comments from user discussions at the pravda.sk website. Store comments from several (hundreds) users from the last month in an SQLite3 database.

Task A

Create SQLite3 "database" with appropriate schema for storing comments from pravda.sk discussions. For each comment you should be able to tell its content, title, date, time and who wrote it. You don't need to store which comment replies to which one. For each user you should be able to retrieve her/his comments and also her/his name. You will probably need tables for users and comments.

Submit two files:

  • db.sqlite3 - the database
  • schema.txt - a brief description of your schema and rationale behind it


Task B

Build a crawler, which crawls comments in pravda.sk discussions. You have two options:

  • For fewer points: Script which gets URL of a user (e.g. https://debata.pravda.sk/profil/debata/anysta/) and crawls the comments of this user from the last month.
  • For more points: Scripts which gets one starting URL (either user profile or some discussion, your choice) and automatically discovers users and crawls their comments.

This crawler should store the comments in SQLite3 database built in the previous task.

Submit the following:

  • db.sqlite3 - the database
  • every python script used for crawling
  • README (how to start your crawler)

Lbioinf1

HWbioinf1

The next three lectures at targeted at the students in the Bioinformatics program and the goal is to get experience with several common bioinformatics tools. Students will learn more about the algorithms and models behind these tools in the Methods in bioinformatics course.

Overview of DNA sequencing and assembly

DNA sequencing is a technology of reading the order of nucleotides along a DNA strand.

  • The result is represented as a string of A,C,G,T.
  • Only fragments of DNA of limited length can be read, these are called sequencing reads.
  • Different technologies produce reads of different characteristics.
  • Examples:
    • Illumina sequencers produce short reads (typical length 100-200bp), but in great quantities and very low error rate (<0.1%).
    • Illumina reads usually come in pairs sequenced from both ends of a DNA fragment of an approximately known length.
    • Oxford nanopore sequencers produce longer reads (thousands of base pairs or more), but the error rates are higher (2-15%).


The goal of genome sequencing is to read all chromosomes of an organism.

  • Sequencing machines produce many reads coming from different parts of the genome.
  • Using software tools called sequence assemblers, these reads are glued together based on overlaps.
  • Ideally we would get the true chromosomes, but often we get only shorter fragments called contigs.
  • Assembled contigs sometimes contain errors.
  • We prefer longer contigs with fewer errors.

Sequence alignments and dotplots

A short video for this section: [4]

  • Sequence alignment is the task of finding similarities between DNA (or protein) sequences.
  • Here is an example, a short similarity between region at positions 344,447..344,517 of one sequence and positions 3,261..3,327 of another sequence.
Query: 344447 tctccgacggtgatggcgttgtgcgtcctctatttcttttatttctttttgttttatttc 344506
              |||||||| |||||||||||||||||| ||||||| |||||||||||| ||   ||||||
Sbjct: 3261   tctccgacagtgatggcgttgtgcgtc-tctatttattttatttctttgtg---tatttc 3316

Query: 344507 tctgactaccg 344517
              |||||||||||
Sbjct: 3317   tctgactaccg 3327
  • Alignments can be stored in many formats and visualized as dotplots.
  • In a dotplot, the x-axis correspond to positions in one sequence and the y-axis in another sequence.
  • Diagonal lines show alignments between the sequences (direction of the diagonal shows which DNA strand was aligned).
Dotplot of human and Drosophila mitochondrial genomes

File formats

FASTA

  • FASTA is a format for storing DNA, RNA and protein sequences.
  • We have already seen FASTA files in Perl exercises.
  • Each sequence is given on several lines of the file. The first line starts with ">" followed by an identifier of the sequence and optionally some further description separated by whitespace.
  • The sequence itself is on the second line; long sequences are split into multiple lines.
>SRR022868.1845_1
AAATTTAGGAAAAGATGATTTAGCAACATTTAGCCTTAATGAAAGACCAGATTCTGTTGCCATGTTTGAA...
>SRR022868.1846_1
TAGCGTTGTAAAATAAATTTCTAGAATGGAAGTGATGATATTGAAATACACTCAGATCCTGAATGAAAGA...

FASTQ

  • FASTQ is a format for storing sequencing reads, containing DNA sequences but also quality information about each nucleotide.
  • More in the lecture on Perl.

SAM/BAM

  • SAM and BAM are formats for storing alignments of sequencing reads (or other sequences) to a genome.
  • For each read, the file contains the read itself, its quality, but also the chromosome/contig name and position where this read is likely coming from, and an additional information e.g. about mapping quality (confidence in the correct location).
  • SAM files are text-based, thus easier to check manually; BAM files are binary and compressed, thus smaller and faster to read.
  • We can easily convert between SAM and BAM using samtools.
  • Full documentation of the format

PAF format

Gzip

  • Gzip is a general-purpose tool for file compression.
  • It is often used in bioinformatics on large FASTQ or FASTA files.
  • Running command gzip filename.ext will create compressed file filename.ext.gz and the original file will be deleted.
  • The reverse process is done by gunzip filename.ext.gz. This deletes the gziped file and creates the uncompressed version.
  • However, we can access the file without uncompressing it. Command zcat filename.ext.gz prints the content of a gzipped file and keeps the gzipped file as is. We can use pipes | to do further processing on the file.
  • To manually page through the content of a gzipped file use zless filename.ext.gz.
  • Some bioinformatics tools can work directly with gzipped files.

HWbioinf1

See also the lecture

Submit the protocol and the required files to /submit/bioinf1

Technical notes

  • Task D and task E ask you to look at data visualizations.
  • If you are unable to open graphical applications from our server, you can download appropriate files and view them on your computer (in task D these are simply pdf files, in task E you can install IGV software on your computer or use the online version).

Task A: examine input files

Copy files from /tasks/bioinf1/ as follows:

mkdir bioinf1
cd bioinf1
cp -iv /tasks/bioinf1/* .
  • File ref.fasta contains a piece of genome from Escherichia coli.
  • Files miseq_R1.fastq.gz and miseq_R2.fastq.gz contain sequencing reads from Illumina MiSeq sequencer. First reads in pairs are in the R1 file, second reads in the R2 file. These reads come from the region in ref.fasta.
  • File nanopore.fasta contains nanopore sequencing reads in FASTA format (without qualities). These reads are also from the region in ref.fasta.

Try to find the answers to the following questions using command-line tools. In your protocol, note down the commands as well as the answers.

(a) How many reads are in the MiSeq files? Is the number of reads the same in both files?

  • Try command zcat miseq_R1.fastq.gz | wc -l
  • Can you figure out the answer from the result of this command?

(b) How long are individual reads in the MiSeq files?

  • Look at the file using zless - do all reads appear to be of an equal length?
  • Extend the following command with tail and wc -c to get the length of the first read: zcat miseq_R1.fastq.gz | head -n 2
  • Do not forget to consider the end of the line character.
  • Repeat for both MiSeq files.

(c) How many reads are in the nanopore file (beware - different format)

(d) What is the average length of the reads in the nanopore file?

  • Try command: samtools faidx nanopore.fasta
  • This creates nanopore.fasta.fai file, where the second column contains sequence length of each read.
  • Compute the average of this column by a one-liner: perl -lane '$s+=$F[1]; $n++; END { print $s/$n }' nanopore.fasta.fai

(e) How long is the sequence in the ref.fasta file?

Task B: assemble the sequence from the reads

  • We will pretend that the correct answer (ref.fasta) is not known and we will try to assemble it from the reads.
  • We will assemble Illumina reads by program SPAdes and nanopore reads by miniasm.
  • Assembly takes several minutes; we will run it in the background using screen command.

SPAdes

  • Run screen -S spades
  • Press Enter to get the command-line, then run the following command:
spades.py -t 1 -m 1 --pe1-1 miseq_R1.fastq.gz --pe1-2 miseq_R2.fastq.gz -o spades > spades.log
  • Press Ctrl-a followed by d
  • This will take you out of screen command.
  • Run top command to check that your command is running.

Miniasm

  • Create file miniasm.sh containing the following commands:
# Find alignments between pairs of reads
minimap2 -x ava-ont -t 1 nanopore.fasta nanopore.fasta | gzip -1 > nanopore.paf.gz 
# Use overlaps to compute the assembled genome
miniasm -f nanopore.fasta nanopore.paf.gz > miniasm.gfa 2> miniasm.log
# Convert genome to fasta format
perl -lane 'print ">$F[1]\n$F[2]" if $F[0] eq "S"' miniasm.gfa > miniasm.fasta
# Align reads to the assembled genome
minimap2 -x map-ont --secondary=no -t 1 miniasm.fasta nanopore.fasta | gzip -1 > miniasm.paf.gz
# Polish the genome by finding consensus of aligned reads at each position
racon -t 1 -u nanopore.fasta miniasm.paf.gz miniasm.fasta > miniasm2.fasta
  • Run screen -S miniasm.
  • In screen, run source ./miniasm.sh
  • Press Ctrl-a d to exit screen.

To check if your commands have finished:

  • Re-enter the screen environment using screen -r spades or screen -r miniasm
  • If the command finished, terminate screen by pressing Ctrl-d or typing exit

Examine the outputs. Write commands and answers to your protocol.

  • Move the output of SPAdes to a new filename: mv -i spades/contigs.fasta spades.fasta
  • Output of miniasm should be in miniasm2.fasta

(a) How many contigs are in each of these two files?

(b) What can you find out from the names of contigs in spades.fasta? What is the length of the shortest and longest contigs? String cov in the names is abbreviation of read coverage - the average number of reads covering a position in the contig. Do the reads have similar coverage, or are there big differences?

  • Use command grep '>' spades.fasta

(c) What are the lengths of contigs in miniasm2.fa file? (You can use LN:i: in the name of contigs.)

Submit files miniasm2.fasta and spades.fasta

Task C: compare assemblies using Quast command

We have found basic characteristics of the two assemblies in task B. Now we will use program Quast to compare both assemblies to the correct answer in ref.fa

quast.py -R ref.fasta miniasm2.fasta spades.fasta -o stats

Submit file stats/report.txt.

Look at the results in stats/report.txt and answer the following questions.

(a) How many contigs has quast reported in the two assemblies? Does it agree with your counts in part B?

(b) What is the number of mismatches per 100kb in the two assemblies? Which one is better? Why do you think it is so? (look at the properties of used sequencing technologies in the lecture)

(c) What portion of the reference sequence is covered by the two assemblies (reported as genome fraction)? Which assembly is better in this aspect?

(d) What is the length of the longest alignment between contigs and the reference in the two assemblies? Which assembly is better in this aspect?

Task D: create dotplots of assemblies

We will now visualize alignments between each assembly and the reference genome using dotplots. As in other tasks, write commands and answers to your protocol.

(a) Create a dotplot comparing miniasm assembly to the reference sequence

# alignments
minimap2 -x asm10 -t 1 ref.fasta miniasm2.fasta > ref-miniasm2.paf
# creating dotplot
/usr/local/share/miniasm/miniasm/minidot -f 12 ref-miniasm2.paf | \
  ps2pdf -dEPSCrop - ref-miniasm2.pdf
# displaying dotplot
# if evince does not work, copy the pdf file to your commputer and view there
evince ref-miniasm2.pdf &
  • x-axis is reference, y-axis assembly.
  • Which part of the reference is missing in the assembly?
  • Do you see any other big differences between the assembly and the reference?

(b) Use analogous commands to create a dotplot for spades assembly, call it ref-spades.pdf.

  • What are vertical gray lines in the dotplot?
  • Is any contig aligning to multiple places in the reference? To how many places?

(c) Use analogous commands to create a dotplot of reference to itself, call it ref-ref.pdf.

  • However, in the minimap2 command add option -p 0 to include also weaker self-alignments.
  • Do you see any self-alignments, showing repeated sequences in the reference? Does this agree with the dotplot in part (b)?

Submit all three pdf files ref-miniasm2.pdf, ref-spades.pdf, ref-ref.pdf

Task E: Align reads and assemblies to reference, visualize in IGV

Finally, we will align all source reads as well as assemblies to the reference genome, then visualize the alignments in IGV tool or its online version.

A short video introducing IGV: [5]

  • Write commands and answers to your protocol
  • Submit all four BAM files ref-miseq.bam, ref-nanopore.bam, ref-spades.bam, ref-miniasm2.bam

(a) Align illumina reads (MiSeq files) to the reference sequence

# align illumina reads to reference
# minimap produces SAM file, samtools view converts to BAM, 
# samtools sort orders reads by coordinate
minimap2 -a -x sr --secondary=no -t 1 ref.fasta  miseq_R1.fastq.gz miseq_R2.fastq.gz | \
  samtools view -S -b -o - |  samtools sort - -o ref-miseq.bam
# index BAM file for faster access
samtools index ref-miseq.bam

(b) Similarly align nanopore reads, but instead of -x sr use -x map-ont, call the result ref-nanopore.bam, ref-nanopore.bam.bai

(c) Similarly align spades.fasta, but instead of -x sr use -x asm10, call the result ref-spades.bam

(d) Similarly align miniasm2.fasta, but instead of -x sr use -x asm10, call the result ref-miniasm2.bam

(e) Run the IGV viewer. Beware: It needs a lot of memory, do not keep open on the server unnecessarily.

  • igv -g ref.fasta &
  • Using Menu->File->Load from File, open all four BAM files.
  • Look at region ecoli-frag:224,000-244,000.
  • How many spades contigs do you see aligning in this region?
  • Look at region ecoli-frag:227,300-227,600.
  • Comment on what you see. How frequent are errors in the individual assemblies and read sets?
  • If you are unable to run igv from home, you can install it on your computer [6] and download ref.fasta and all .bam and .bam.bai files.

Lflask

HWflask

In this lecture, we will use Python to process user comments obtained in the previous lecture.

  • We will display information about individual users as a dynamic website written in Flask framework.
  • We will use simple text processing utilities from ScikitLearn library to extract word use statistics from the comments.

Flask

Flask is a simple web server for Python. Using Flask you can write a simple dynamic website in Python.


Running Flask

You can find a sample Flask application at /tasks/flask/simple_flask. Beware, the database included in this folder is just an empty one. You need to either copy in your db from previous exercise or use one from directory above.

You can run the example Flask app using these commands:

cd <your directory>
export FLASK_APP=main.py
# this is optional, but recommended for debugging
# If you are running flask on your own machine you might want to use add `--debug` flag in the `flask run` command
# instead of the FLASK_ENV environment variable.
export FLASK_ENV=development 

# before running the following, change the port number
# so that no two users use the same number
flask run --port=PORT

PORT is a random number greater than 1024. This number should be different from other people running flask on the same machine (if you run into the problem where flask writes out lot of error messages complaining about permissions, select a different port number). Flask starts a webserver on port PORT and serves the pages created in your Flask application. Keep it running while you need to access these pages.

To view these pages, open a web browser on the same computer where the Flask is running, e.g. chromium-browser http://localhost:PORT/ (use the port number you have selected to run Flask). If you are running flask on a server, you probably want to run the web browser on your local machine. In such case, you need to use ssh tunnel to channel the traffic through ssh connection:

  • On your local machine, open another console window and create an ssh tunnel as follows: ssh -L PORT:localhost:PORT username@vyuka.compbio.fmph.uniba.sk (replace PORT with the port number you have selected to run Flask)
  • For Windows machines, follow a tutorial how to create an ssh tunnel. Destination should be localhost:PORT, source port should be PORT. Do not forget to click add.
  • Ideally do not use Putty, but use Ubuntu subsystem for Windows or Powershell ssh, where -L options works out of box. See [7].
  • Keep this ssh connection open while you need to access your Flask web pages; it makes port PORT available on your local machine
  • In your browser, you can now access your Flask webpages, using e.g. chromium-browser http://localhost:PORT/

Structure of a Flask application

  • The provided Flask application resides in the main.py script.
  • Some functions in this script are annotated with decorators starting with @app.
  • Decorator @app.before_request marks a function which will be executed before processing a particular request from a web browser. In this case we open a database connection and store it in a special variable g which can be used to store variables for a particular request. (Sidenote: Opening connection before every requests is quite bad practice. Also using sqlite3 for web application is not ideal. If you are want to build a serious web app you should use Postgresql and something like SQLAlchemy for handling connections. We are simplifying stuff here for educational purposes). If you open db connection using any other way than through g.db, e.g. as a normal global variable, you might get various unpleasant errors.
  • Decorator @app.route('/') marks a function which will serve the main page of the application with URL http://localhost:4247/. Similarly decorator @app.route('/wat/<random_id>/') marks a function which will serve URLs of the form http://localhost:4247/wat/100 where the particular string which the user uses in the URL (here 100) will be stored in random_id variable accessible within the function.
  • Functions serving a request return a string containing the requested webpage (typically a HTML document). For example, function wat returns a simple string without any HTML markup.
  • To more easily construct a full HTML document, you can use jinja2 templating language, as is done in the home function. The template itself is in file templates/main.html. You might want to construct different template for different webpages (main menu, use page).
  • To fill in variables in template we use {{ ... }} notation. There are also {% for x in something %} statemens and also {% if ... %} statements.
  • To get url of some other page you can use url_for (see provided template).


Processing text

The main tool we will use for processing text is CountVectorizer class from the Scikit-learn library. It transforms a text into a bag of words representation. In this representation we get the list of words and the count for each word. Example:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(strip_accents='unicode')

texts = [
 "Ema ma mamu.",
 "Zirafa sa vo vani kupe a hneva sa."
]

t = vec.fit_transform(texts).toarray()

print(t)
# prints:
# [[1 0 0 1 1 0 0 0 0]
#  [0 1 1 0 0 2 1 1 1]]

print(vec.vocabulary_)
# prints:
# {'vani': 6, 'ema': 0, 'kupe': 2, 'mamu': 4, 
# 'hneva': 1, 'sa': 5, 'ma': 3, 'vo': 7, 'zirafa': 8}

NumPy arrays

Array t in the example above is a NumPy array provided by the NumPy library. This library has also lots of nice tricks. First let us create two matrices:

>>> import numpy as np
>>> a = np.array([[1,2,3],[4,5,6]])
>>> b = np.array([[7,8],[9,10],[11,12]])
>>> a
array([[1, 2, 3],
       [4, 5, 6]])
>>> b
array([[ 7,  8],
       [ 9, 10],
       [11, 12]])

We can sum these matrices or multiply them by some number:

>>> 3 * a
array([[ 3,  6,  9],
       [12, 15, 18]])
>>> a + 3 * a
array([[ 4,  8, 12],
       [16, 20, 24]])

We can calculate sum of elements in each matrix, or sum by some axis:

>>> np.sum(a)
21
>>> np.sum(a, axis=1)
array([ 6, 15])
>>> np.sum(a, axis=0)
array([5, 7, 9])

There are many other useful functions, check the documentation.

Other useful frameworks (for general interest, not for this lecture)

  • FastAPI is sort of similar to Flask but more focused on making API (not webpages).
  • Django is big web framework with all belts and whistles (e.g. database support, i18n, ...).
  • Dash is another fully featured (read bloated) web framework for creating analytics pages (has extensive support, for graphs, tables, ...).

HWflask

See the lecture

General goal: Build a simple website, which lists all crawled users and for each users has a page with simple statistics regarding the posts of this user.


Submit your source code (web server and preprocessing scripts) and database files. Copy these files to /submit/flask/username/

This lesson requires crawled data from previous lesson. If you don't have one, you can find it at /tasks/flask/db.sqlite3

Task A

Create a simple Flask web application which:

  • Has a homepage with a list of all users (with links to their pages).
  • Has a page for each user with basic information: the nickname, the number of posts and the last 10 posts of this user.
  • Warning: Next lecture homework depends on this task.

Task B

Make a separate script which computes and stores in the database the following information for each user:

  • the list of 10 most frequently used words,
  • the list of top 10 words typical for this user (words which this user uses much more often than other users). Come up with some simple heuristics for measuring this.

Show this information on the page of each user.

Hint: To get the most frequently used words for each user, you can use argsort from NumPy.

Task C

Preprocess and store the list of top three similar users for each user (try to come up with some simple definition of similarity based on the text in the posts). Again show this information on the user page.

Bonus: Try to use some simple topic modeling (e.g. PCA as in TruncatedSVD from scikit-learn) and use it for finding similar users.

Lbioinf2

HWbioinf2

Eukaryotic gene structure

  • Recall the Central dogma of molecular biology: the flow of genetic information from DNA to RNA to protein (gene expression)
  • In eukaryotes, mRNA often undergoes splicing, where introns are removed and exons are joined together
  • The very start and end of mRNA remain untranslated (UTR = untranslated region)
  • The coding part of the gene starts with a start codon, contains a sequence of additional codons and ends with a stop codon. Codons can be interrupted by introns.
Gene expression in eukaryotes

Computational gene finding

  • Input: DNA sequence (an assembled genome or a part of it)
  • Output: positions of protein coding genes and their exons
  • If we know the exact position of coding regions of a gene, we can use the genetic code table to predict the protein sequence encoded by it.
  • Gene finders use statistical features observed from known genes, such as typical sequence motifs near the start codons, stop codons and splice sites, typical codon frequencies, typical exon and intron lengths etc.
  • These statistical parameters need to be adjusted for each genome.
  • We will use a gene finder called Augustus.

Gene expression

  • Not all genes undergo transcription and translation all the time and at the same level.
  • The processes of transcription and translation are regulated according to cell needs.
  • The term "gene expression" has two meanings:
    • the process of transcription and translation (synthesis of a gene product),
    • the amount of mRNA or protein produced from a single gene (genes with high or low expression).

RNA-seq technology can sequence mRNA extracted from a sample of cells.

  • We can align sequenced reads back to the genome.
  • The number of reads coming from a gene depends on its expression level (and on its length).

HWbioinf2

See also the lecture

Submit the protocol and the required files to /submit/bioinf2

Input files

Copy files from /tasks/bioinf2/

mkdir bioinf2
cd bioinf2
cp -iv /tasks/bioinf2/* .

Files:

  • ref.fasta is a 38kb piece of the genome of the fungus Aspergillus nidulans
  • rnaseq.fastq are RNA-seq reads from Illumina sequencer extracted from the Short read archive
  • annot.gff is the reference gene annotation from the database (we will consider this as correct gene positions)

Task A: Gene finding

Run the Augustus gene finder with two versions of parameters:

  • one trained specifically for A. nidulans genes
  • one trained for the human genome, where genes have different statistical properties (for example, they are longer and have more introns)
augustus --species=anidulans ref.fasta > augustus-anidulans.gtf
augustus --species=human ref.fasta > augustus-human.gtf

The results of gene finding are in the GTF format. Rows starting with # are comments, each of the remaining rows describes some interval of the sequence. If the second column is CDS, it is a coding part of an exon. The reference annotation annot.gff is in a similar format called GFF3.

Examine the files and try to find the answers to the following questions using command-line tools

(a) How many exons are in each of the two GTF files? (Beware: simply using grep with pattern CDS may yield lines containing this string in a different column. You can use e.g. techniques from the lecture and exercises on command-line tools).

(b) How many genes are in each of the two GTF files? (The files contain rows with word gene in the second column, one for each gene)

(c) How many exons and genes are in the annot.gff file?

Write the anwsers and commands to the protocol. Submit files augustus-anidulans.gtf and augustus-human.gtf.

Task B: Aligning RNA-seq reads

  • Align RNA-seq reads to the genome
  • We will use a specialized tool hisat2, which can recognize introns
  • Then we will sort and index the BAM file, similarly as in the previous lecture


hisat2-build ref.fasta ref.fasta
hisat2 -x ref.fasta -U rnaseq.fastq -S rnaseq.sam -k 1 --min-intronlen 20 --max-intronlen 10000 --novel-splicesite-outfile introns.txt


After the hisat2 command, sort the resulting SAM file using samtools and store it as a BAM file. Create the index for the BAM file as well. In addition to the BAM file, we produced a file containing the position of detected introns. Examine the files to find out answers to the following questions (you can do it manually by looking at the the files, e.g. by less command):

(a) How many reads were in the FASTQ file? How many of them were successfully mapped?

(b) How many introns ("junctions") were predicted?

(c) During the mapping, we used a few custom options. Inspect the manual pages of hisat2 and samtools and describe shortly what those options mean.


Write answers to the protocol. Submit the file rnaseq.bam.

Task C: Visualizing in IGV

As before, run IGV as follows:

igv -g ref.fasta &

Open additional files using menu File -> Load from File: annot.gff, augustus-anidulans.gtf, augustus-human.gtf, rnaseq.bam

  • Exons are shown as thicker boxes, introns are thinner.
  • For each of the following questions, select a part of the sequence illustrating the answer and export a figure using File->Save image
  • You can check these images using command eog

Questions:

(a) Create an image illustrating differences between Augustus with human parameters and the reference annotation, save as a.png. Briefly describe the differences in words.

(b) Find some differences between Augustus with A. nidulans parameters and the reference annotation. Store an illustrative figure as b.png. Which parameters have yielded a more accurate prediction?

(c) Zoom in to one of the genes with a high expression level and try to find spliced read alignments supporting the annotated intron boundaries. Store the image as c.png.

Submit files a.png, b.png, c.png. Write answers to your protocol.

Ljavascript

HWjavascript


In this lecture we will extend the website from the previous lecture with interactive visualizations written in JavaScript. We will not cover details of the JavaScript programming language, only use visualization contained in the Google Charts library.

Your goal is to take examples from the documentation and tweak them for your purposes.

A short explanation how things fit together:

  • Your Python server (in Flask) produces a HTML page with some content.
  • You can embed some Javascript code inside the HTML. This code will be run in browser.

Tips:

  • Each graph contains also HTML+JS code example. That is a good startpoint. You can just put it inside the flask template a see what it does.
  • You can write your data into JavaScript data structures (`var data` from examples) in a Flask template. You might need a jinja for loop (https://jinja.palletsprojects.com/en/2.11.x/templates/#for). Or you can produce string in Python, which you will put into a HTML. You might need to turnoff autoescaping (https://stackoverflow.com/questions/3206344/passing-html-to-template-using-flask-jinja2). It is a (very) bad practice, but sufficient for this lecture. (A better way is to load data in JSON format through API).
  • Debugging tip: When looking for errors, limit amount of data you send through template, so you can easily read the resulting JS code.
  • Consult the previous lecture on running and accessing Flask applications.

Merging multiple examples together:

  • You need to include `<script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>` just once. This is similar to `import` in python.
  • You can either call `google.charts.load("current", {packages:["calendar"]});` separatelly for each package or list all packages in one line in one array.
  • After loading of library is done, this calls you function named `drawChart` `google.charts.setOnLoadCallback(drawChart);`. You can either draw multiples charts in one fuction, or call multiple function (you need to set multiple callbacks). Be careful to not name functions with same name.
  • Each chart needs its own element with different ID. That's this thing: <div id="chart_div" style="width: 900px; height: 500px;">. Refferenced here: `new google.visualization.Histogram(document.getElementById('chart_div'));`
  • Follow instructions here: [8]

HWjavascript

See the lecture

General goal: Extend the user pages from the previous lecture with simple visualizations.

Submit your source code to /submit/javascript/username/

Task A

Display a calendar, which shows during which days was the user active. Use calendar from Google Charts.

Task B

Show a histogram of comment lengths. Use histogram from Google Charts.

Task C

Either: Show a word tree for a user using word tree from Google Charts. Try to normalize the text before building the tree (convert to lowercase, remove accents). CountVectorizer has build_analyzer method, which returns a function, which does this for you.

Or: Pick some other appropriate visualization from the gallery, feed it with data a show it. Also add some description to it.

Lbioinf3

HWbioinf3

Polymorphisms

  • Individuals within species differ slightly in their genomes.
  • Polymorphisms are genome variants which are relatively frequent in a population (e.g. at least 1%).
  • SNP: single-nucleotide polymorphism (a polymorphism which is a substitution of a single nucleotide).
  • Recall that most human cells are diploid, with one set of chromosomes inherited from the mother and the other from the father.
  • At a particular location, a single human can thus have two different alleles (heterozygosity) or two copies of the same allele (homozygosity).

Finding polymorphisms / genome variants

  • We compare sequencing reads coming from an individual to a reference genome of the species.
  • First we align them, as in the exercises on genome assembly.
  • Then we look for positions where a substantial fraction of the reads does not agree with the reference (this process is called variant calling).

Programs and file formats

Human variants

  • For many human SNPs we already know something about their influence on phenotype and their prevalence in different parts of the world.
  • There are various databases, e.g. dbSNP, OMIM, or user-editable SNPedia.

UCSC genome browser

A short video for this section: [9]

  • The UCSC genome browser is an on-line tool similar to IGV.
  • It has nice interface for browsing genomes, lot of data for some genomes (particularly human), but not all sequenced genomes represented.

Basics

  • On the front page, choose Genomes in the top blue menu bar.
  • Select a genome and its version, optionally enter a position or a keyword, press submit.
  • On the browser screen, the top image shows chromosome map, the selected region is in red.
  • Below there is a view of the selected region and various tracks with information about this region.
  • For example some of the top tracks display genes (boxes are exons, lines are introns).
  • Tracks can be switched on and off and configured in the bottom part of the page (browser supports different display levels, full contains all information but takes a lot of vertical space).
  • Buttons for navigation are at the top (move, zoom, etc.).
  • Clicking at the browser figure allows you to get more information about a gene or other displayed item.
  • In this lecture, we will need tracks GENCODE and dbSNP - check e.g. gene ACTN3 and within it SNP rs1815739 in exon 15.

Blat

  • For sequence alignments, UCSC genome browser offers a fast but less sensitive BLAT (good for the same or very closely related species).
  • Choose Tools->Blat in the top blue menu bar, enter DNA sequence below, search in the human genome.
    • What is the identity level for the top found match? What is its span in the genome? (Notice that other matches are much shorter).
    • Using Details link in the left column you can see the alignment itself, Browser link takes you to the browser at the matching region.
AACCATGGGTATATACGACTCACTATAGGGGGATATCAGCTGGGATGGCAAATAATGATTTTATTTTGAC
TGATAGTGACCTGTTCGTTGCAACAAATTGATAAGCAATGCTTTCTTATAATGCCAACTTTGTACAAGAA
AGTTGGGCAGGTGTGTTTTTTGTCCTTCAGGTAGCCGAAGAGCATCTCCAGGCCCCCCTCCACCAGCTCC
GGCAGAGGCTTGGATAAAGGGTTGTGGGAAATGTGGAGCCCTTTGTCCATGGGATTCCAGGCGATCCTCA
CCAGTCTACACAGCAGGTGGAGTTCGCTCGGGAGGGTCTGGATGTCATTGTTGTTGAGGTTCAGCAGCTC
CAGGCTGGTGACCAGGCAAAGCGACCTCGGGAAGGAGTGGATGTTGTTGCCCTCTGCGATGAAGATCTGC
AGGCTGGCCAGGTGCTGGATGCTCTCAGCGATGTTTTCCAGGCGATTCGAGCCCACGTGCAAGAAAATCA
GTTCCTTCAGGGAGAACACACACATGGGGATGTGCGCGAAGAAGTTGTTGCTGAGGTTTAGCTTCCTCAG
TCTAGAGAGGTCGGCGAAGCATGCAGGGAGCTGGGACAGGCAGTTGTGCGACAAGCTCAGGACCTCCAGC
TTTCGGCACAAGCTCAGCTCGGCCGGCACCTCTGTCAGGCAGTTCATGTTGACAAACAGGACCTTGAGGC
ACTGTAGGAGGCTCACTTCTCTGGGCAGGCTCTTCAGGCGGTTCCCGCACAAGTTCAGGACCACGATCCG
GGTCAGTTTCCCCACCTCGGGGAGGGAGAACCCCGGAGCTGGTTGTGAGACAAATTGAGTTTCTGGACCC
CCGAAAAGCCCCCACAAAAAGCCG

HWbioinf3

See also the lecture

Submit the protocol and the required files to /submit/bioinf3

Input files

Copy files from /tasks/bioinf3/

mkdir bioinf3
cd bioinf3
cp -iv /tasks/bioinf3/* .

Files:

  • humanChr7Region.fasta is a 7kb piece of the human chromosome 7
  • motherChr7Region.fastq is a sample of reads from an anonymous donor known as NA12878; these reads come from region in humanChr7Region.fasta
  • fatherChr12.vcf and motherChr12.vcf are single-nucleotide variants on the chromosome 12 obtained by sequencing two individuals NA12877, NA12878 (these come from a larger family)

Task A: read mapping and variant calling

Align reads to the reference:

bwa index humanChr7Region.fasta
bwa mem humanChr7Region.fasta  motherChr7Region.fastq | \
  samtools view -S -b - |  samtools sort - motherChr7Region
samtools index motherChr7Region.bam

Call variants:

freebayes -f humanChr7Region.fasta --min-alternate-count 10 \
  motherChr7Region.bam >motherChr7Region.vcf

Run IGV, use humanChr7Region.fasta as genome, open motherChr7Region.bam and motherChr7Region.vcf. Looking at the aligned reads and the VCF file, answer the following questions:

(a) How many variants were found in the VCF file?

(b) How many variants are heterozygous and how many are homozygous?

(c) Are all variants single-nucleotide variants or do you also see some insertions/deletions (indels)?

Also export the overall view of the whole region from IGV to file motherChr7Region.png.

Submit the following files: motherChr7Region.png, motherChr7Region.bam, motherChr7Region.vcf

Task B: UCSC browser

(a) Where is the sequence from regionChr7.fasta file located in the browser?

  • Go to http://genome-euro.ucsc.edu/, from the blue menu, select Tools->Blat
  • Check that Blat uses Human, hg38 assembly
  • Open regionChr7.fasta in a graphical editor (e.g. kate), select all, paste to the BLAT window, run BLAT.
  • In the table of results, the best result should have identity close to 100% and span close to 7kb.
  • For this best result, click on the link named Browser.
  • Report which chromosome and which region you get.

(b) Which gene is located in this region?

  • Once you are in the browser, press Default tracks button
  • Track named GENCODE contains known genes, shown as rectangles (exons) connected by lines (introns). Short gene names are next to them.
  • Report the name of the gene in the region

(c) In which tissue is this gene most highly expressed? What is the function of this gene?

  • When you click on the gene (possibly twice), you get an information page which starts with a summary of the known function of this gene. Copy the first sentence to your protocol.
  • Further down on the gene information page you see RNA-Seq Expression Data (colorful boxplots). Find out which tissues have the highest signal.

(d) Which SNPs are located in this gene? Which trait do they influence?

  • You can see SNPs in the Common SNPs(155) track, but their IDs appear only after switching this track to pack mode. You can click on each SNPs to see more information and to copy their ID to your protocol.
  • Information page of the gene (part c) also describes function of various alleles of this gene (see e.g. part POLYMORPHISM).
  • You can also find information about individual SNPs by looking for them by their ID in SNPedia (not required in this task)

Task C: Examining larger vcf files

In this task, we will look at motherChr12.vcf and fatherChr12.vcf files and compute various statistics. You can use command-line tools, such as grep, wc, sort, uniq and Perl one-liners (as in Lbash), or you can write small scripts in Perl or Python (as in Lperl and Lpython).

  • Write all used commands to your protocol
  • If you write any scripts, submit them as well

Questions:

(a) How many SNPs are in each file?

  • This can be found easily by wc, only make sure to exclude lines with comments

(b) How many heterozygous SNPs are in each file?

  • The last column contains 1|1 for homozygous and either 0|1 or 1|0 for heterozygous SNPs
  • Character | has special meaning on the command line and in grep patterns; make sure to place it in apostrophes ' ' and possibly escape it with backslash \

(c) How many SNP positions are shared between the two files?

  • The second column of each file lists the position. We want to compute the size of intersection of the set of positions in motherChr12.vcf and fatherChr12.vcf files
  • You can e.g. create temporary files containing only positions from the two files and sort them alphabetically. Then you can find the intersection using comm command with options -1 -2. Alternatively, you can store positions as keys in a hash table (dictionary) in a Perl or Python script.

(d) List the 5 most frequent pairs of reference/alternate allele in motherChr12.vcf and their frequencies. Do they correspond to transitions or transversions?

  • The fourth column contains the reference value, fifth column the alternate value. For example, the first SNP in motherChr12.vcf has a pair C,A.
  • For each possible pair of nucleotides, find how many times it occurs in the motherChr12.vcf
  • For example, pair C,A occurs 6894 times
  • Then sort the pairs by their frequencies and report 5 most frequent pairs
  • Mutations can be classified as transitions and transversions. Transitions change purine to purine or pyrimidine to pyrimidine, transversions change a purine to pyrimidine or vice versa. For example, pair C,A is a transversion changing pyrimidine C to purine A. Which of these most frequent pairs correspond to transitions and which to transversions?
  • To count pairs without writing scripts, you can create a temporary file containing only columns 4 and 5 (without comments), and then use commands sort and uniq to count each pair.

(e) Which parts of the chromosome have the highest and lowest density of SNPs in motherChr12.vcf?

  • First create a list of windows of size 100kb covering the whole chromosome 12 using these two commands:
perl -le 'print "chr12\t133275309"' > humanChr12.size
bedtools makewindows -g humanChr12.size -w 100000 -i srcwinnum > humanChr12-windows.bed
  • Then count SNPs in each window using this command:
bedtools coverage -a  humanChr12-windows.bed -b motherChr12.vcf > motherChr12-windows.tab
  • Find out which column of the resulting file contains the number of SNPs per window, e.g. by reading the documentation obtained by command bedtools coverage -h
  • Sort according to the column with the SNP number, look at the first and last line of the sorted file
  • For checking: the second highest count is 387 in window with coordinates 20,800,000-20,900,000

Lr1

HWr1 · Video introduction from an older edition of the course

Program for this lecture: basics of R

  • A very short introduction will be given as a lecture.
  • Exercises have the form of a tutorial: read a bit of text, try some commands, extend/modify them as requested in individual tasks

In this course we cover several languages popular for scripting and data processing: Perl, Python, R.

  • Their capabilities overlap, many extensions emulate strengths of one in another.
  • Choose a language based on your preference, level of knowledge, existing code for the task, the rest of the team.
  • Quickly learn a new language if needed.
  • Also possibly combine, e.g. preprocess data in Perl or Python, then run statistical analyses in R, automate the entire pipeline with bash or make.

Introduction

  • R is an open-source system for statistical computing and data visualization.
  • Programming language, command-line interface
  • Many built-in functions, additional libraries
  • We will concentrate on useful commands rather than language features.

Working in R

Option 1: Run command R, type commands in a command-line interface

  • It supports history of commands (arrows, up and down, Ctrl-R) and completing command names with the tab key

Option 2: Write a script to a file, run it from the command-line as follows:
R --vanilla --slave < file.R

Option 3: Use rstudio command to open a graphical IDE

  • Sub-windows with editor of R scripts, console, variables, plots.
  • Ctrl-Enter in editor executes the current command in console.
  • You can also install RStudio on your home computer and work there.

Option 4: If you like Jupyter notebooks, you can use them also to run R code, see for example an explanation of how to enable it in Google Colab [10].

In R, you can easily create plots. In command-line interface these open as a separate window, in Rstudio they open in one of the sub-windows.

x = c(1:10)
plot(x, x * x)

Suggested workflow

  • Work interactively in Rstudio, notebook or on command line, try various options.
  • Select useful commands, store in a script.
  • Run the script automatically on new data/new versions, potentially as a part of a bigger pipeline.

Additional information

Gene expression data

  • DNA molecules contain regions called genes, which "recipes" for making proteins.
  • Gene expression is the process of creating a protein according to the "recipe".
  • It works in two stages: first a gene is copied (transcribed) from DNA to RNA, then translated from RNA to protein.
  • Different proteins are created in different quantities and their amount depends on the needs of a cell.
  • There are several technologies (microarray, RNA-seq) for measuring the amount of RNA for individual genes, this gives us some measure how active each gene is under given circumstances.

Gene expression data is typically a table with numeric values.

  • Rows represent genes.
  • Columns represent experiments (e.g. different conditions or different individuals).
  • Each value is the expression of a gene, i.e. the relative amount of mRNA for one gene in one experiment.

We will use a data set for yeast:

  • Abbott, Derek A., et al. "Generic and specific transcriptional responses to different weak organic acids in anaerobic chemostat cultures of Saccharomyces cerevisiae." FEMS yeast research 7.6 (2007): 819-833.
  • Downloaded from the GEO database
  • Data was preprocessed: normalized, converted to logarithmic scale.
  • Only 1220 genes with the biggest changes in expression are included in our dataset.
  • The dataset contains gene expression measurements under 5 conditions:
    • Control: yeast grown in a normal environment.
    • 4 different acids added so that cells grow 50% slower (acetic, propionic, sorbic, benzoic).
  • From each condition (reference and each acid) we have 3 replicates, together 15 experiments.
  • The goal is to observe how the acids influence the yeast and the activity of its genes.

Part of the file (only the first 4 experiments and first 3 genes shown), strings 2mic_D_protein, AAC3, AAD15 are identifiers of genes.

,control1,control2,control3,acetate1,acetate2,acetate3,...
2mic_D_protein,1.33613934199651,1.13348900359964,1.2726678684356,1.42903234691804,...
AAC3,0.558482767397578,0.608410781454015,0.6663002997292,0.231622581964063,...
AAD15,-0.927871996497105,-1.04072379902481,-1.01885986692013,-2.62459941525552,...

HWr1

See also the lecture.

In this homework, try to read the text, execute given commands, potentially trying some small modifications. Within the tutorial, you will find the tasks to complete in this exercise.

  • Submit the required files (.png).
  • In your protocol, enter the commands used in all tasks, with explanatory comments in more complicated situations.
  • In tasks B and D also enter the required output to the protocol.
  • Protocol template in /tasks/r1/protocol.txt

Two versions of the homework based on your R proficiency (your decision which one you choose).

Beginners:

  • Do tasks A-E, skip F.

Intermediate / advanced

  • Skip to section Expression data.
  • Do tasks D-F, but instead of plot and matplot functions from the basic R graphics library, use ggplot2 library. See the list of resources below.

The first steps

Type a command, R writes the answer, e.g.:

> 1 + 2
[1] 3

We can store values in variables and use them later:

> # population of Slovakia in millions, 2019
> population = 5.457
> population
[1] 5.457
> # area of Slovakia in thousands of km2
> area = 49.035
> density = population / area
> density
[1] 0.1112879

Several surprises in the R language:

  • Dots are used as parts of id's, e.g. read.table is name of a single function (not a method for the object read).
  • Assignment can be done via <- or =.
  • Vectors etc are indexed from 1, not from 0.

Vectors, basic plots

A vector is a sequence of values of the same type (all are numbers or all are strings or all are booleans).

# Vector can be created from a list of numbers by function named c
a = c(1, 2, 4)
a
# prints [1] 1 2 4

# c also concatenates vectors
c(a, a)
# prints [1] 1 2 4 1 2 4

# Vector of two strings 
b = c("hello", "world")

# Create a vector of numbers 1..10
x = 1:10
x
# prints [1]  1  2  3  4  5  6  7  8  9 10

Vector arithmetic

Many operations can be easily applied to each member of a vector.

x = 1:10
# Square each number in vector x
x * x
# prints [1]   1   4   9  16  25  36  49  64  81 100

# New vector y: logarithm of a number in x squared
y = log(x * x)
y
# prints [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519 3.891820 4.158883
# [9] 4.394449 4.605170

# Draw the graph of function log(x*x) for x=1..10
plot(x, y)
# The same graph but use lines instead of dots
plot(x, y, type="l")

# Addressing elements of a vector: positions start at 1
# Second element of the vector 
y[2]
# prints [1] 1.386294

# Which elements of the vector satisfy certain condition? 
# (vector of logical values)
y > 3
# prints [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

# write only those elements from y that satisfy the condition
y[y > 3]
# prints [1] 3.218876 3.583519 3.891820 4.158883 4.394449 4.605170

# we can also write values of x such that values of y satisfy the condition...
x[y > 3]
# prints [1]  5  6  7  8  9 10

Task A

Create a plot of the binary logarithm with dots in the graph more densely spaced (from 0.1 to 10 with step 0.1)

  • Store it in file log.png and submit this file

Hints:

  • Create x and y by vector arithmetic.
  • To compute binary logarithm check help ? log
  • Before running plot, use command png("log.png") to store the result, afterwards call dev.off() to close the file (in Rstudio you can also export plots manually).

Data frames and simple statistics

Data frame: a table similar to a spreadsheet. Each column is a vector, all are of the same length.

We will use a table with the following columns:

  • country name,
  • region (continent),
  • area in thousands of km2,
  • population in millions in 2019.

Data is from the UN. The table is stored in the csv format (columns separated by commas). Here are the first six lines:

Afghanistan,Asia,652.864,38.0418
Albania,Europe,28.748,2.8809
Algeria,Africa,2381.741,43.0531
American Samoa,Oceania,0.199,0.0553
Andorra,Europe,0.468,0.0771
Angola,Africa,1246.7,31.8253
# reading a data frame from a file
a = read.csv("/tasks/r1/countries.csv",header = TRUE)

# display mean, median, etc. of each column
summary(a)
# Compactly display structure of a 
# (good for checking that import worked etc)
str(a)

# print the column with the name "Area"
a$Area

# population density: divide the population by the area
a$Population / a$Area

# Add density as a new column to frame a
a = cbind(a, Density = a$Population / a$Area)

# Scatter plot of area vs population
plot(a$Area, a$Population)

# we will see smaller values better in log-scale (both axes)
plot(a$Area, a$Population, log='xy')

# use linear scale, but zoom in on smaller countries:
plot(a$Area, a$Population, xlim=c(0,1500), ylim=c(0,150))

# average country population 33.00224 million
mean(a$Population)
# median country population 5.3805 million
median(a$Population)

# median country population in Europe
median(a$Population[a$Region=="Europe"])
# Standard deviation
sd(a$Population)

# Histogram of country populations in Europe
hist(a$Population[a$Region=="Europe"])

Task B

Create frame europe which contains data for European countries selected from frame a. Also create a similar frame for African countries. Hint:

  • One option is to select rows using boolean expressions as in the computation of median country size in Europe above. This time we apply it on a frame rather than single column. To select some rows (and all columns) from frame a use expression of the form a[which_rows, ]. Here, which_rows can be a list, such as c(1,2,3) or a boolean expression producing a list of booleans.
  • Alternatively, you can also explore subset command.

Run the command summary separately for each new frame. Comment on how their characteristics differ. Write the output and your conclusion to the protocol.

Task C

Draw a graph comparing the area vs population in Europe and Africa; use different colors for points representing European and African countries. Apply log scale on both axes.

  • Submit the plot in file countries.png

To draw the graph, you can use one of the options below, or find yet another way.

Option 1: first draw Europe with one color, then add Africa in another color

  • Color of points can be changed by as follows: plot(1:10, 1:10, col="red").
  • After the plot command, you can add more points to the same graph by command points, which can be used similarly as plot.
  • Warning: command points does not change the ranges of x and y axes. You have to set these manually so that points from both groups are visible. You can do this using options xlim and ylim, e.g. plot(x,y, col="red", xlim=c(0.1,100), ylim=c(0.1,100)).

Option 2: plot both Europe and Africa in one plot command, and give it a vector of colors, one for each point. Command plot(1:10,1:10,col=c(rep("red",5),rep("blue",5))) will plot the first 5 points red and the last 5 points blue

Bonus task: add a legend to the plot, showing which color is Europe and which is Africa.

Expression data

The dataset was described in the lecture.

# Read gene expression data table
a = read.csv("/tasks/r1/microarray.csv", row.names=1)
# Visual check of the first row
a[1,]
# Plot control replicate 1 vs. acetate acid replicate 1
plot(a$control1, a$acetate1)
# Plot control replicate 1 vs. control replicate 2
plot(a$control1, a$control2)
# To show density in dense clouds of points, use this plot
smoothScatter(a$control1, a$acetate1)

Optional: ggplot2 library

Required for intermediate / advanced in R, who decide to do tasks D-F.

Several resources for ggplot2 library

Task D

In the plots above we compare two experiments, say A=control1 and B=acetate1. Outliers away from the diagonal in the plot are the genes whose expression changes. However distance from the diagonal is hard to judge visually, instead we will create MA plot:

  • As above, each gene is one dot in the plot (use plot rather than smoothScatter).
  • The x-axis is the average between values for conditions A and B. The points on the right have overall higher expression than points on the left.
  • The y-axis is the difference between condition A and B. The values in frame a are in log-scale base 2, so the difference of 1 means 2-fold change in expression.
  • The points far from the line y=0 have the highest change in expression. Use R functions min, max, which.min and which.max to find the largest positive and negative difference from line y=0 and which genes they correspond to. Functions min and max give you the minimum and maximum of a given vector. Functions which.min and which.max return the index where this extreme value is located. You can use this index to get the appropriate row of the dataframe a, including the gene name.
  • Submit file ma.png with your plot. Include the genes with the extreme changes in your protocol.

Clustering applied to expression data

Clustering is a wide group of methods that split data points into groups with similar properties. We will group together genes that have a similar reaction to acids, i.e. their rows in gene expression data matrix have similar values. We will consider two simple clustering methods.

  • K means clustering splits points (genes) into k clusters, where k is a parameter given by the user. It finds a center of each cluster and tries to minimize the sum of distances from individual points to the center of their cluster. Note that this algorithm is randomized so you will get different clusters each time.
Examples of heatmaps
  • Hierarchical clustering puts all data points (genes) to a hierarchy so that smallest subtrees of the hierarchy are the most closely related groups of points and these are connected to bigger and more loosely related groups.


# Create a new version of frame a in which row is scaled so that 
# it has mean 0 and standard deviation 1
# Function scale does such transformation on columns instead of rows, 
# so we transpose the frame using function t, then transpose it back
b = t(scale(t(a)))
# Matrix b shows relative movements of each gene, 
# disregarding its overall high or low expression

# Command heatmap creates hierarchical clustering of rows, 
# then shows values using colors ranging from red (lowest) to white (highest)
heatmap(as.matrix(a), Colv=NA, scale="none")
heatmap(as.matrix(b), Colv=NA, scale="none")
# compare the two matrices - which phenomena influenced clusters in each of them?
# k means clustering to 5 clusters
k = 5
cl = kmeans(b, k)
# Each gene is assigned a cluster (number between 1 and k)
# the command below displays the first 10 elements, i.e. clusters of first 10 genes
head(cl$cluster)
# Draw heatmap of cluster number 3 out of k, no further clustering applied
# Do you see any common pattern to genes in the cluster?
heatmap(as.matrix(b[cl$cluster==3,]), Rowv=NA, Colv=NA, scale="none")

# Reorder genes in the whole table according to their cluster cluster number
# Can you spot our k clusters?
heatmap(as.matrix(b[order(cl$cluster),]), Rowv=NA, Colv=NA, scale="none")

# Compare overall column means with column means in cluster 3
# Function apply runs mean on every column (or row if 2 changed to 1)
apply(b, 2, mean)
# Now means within cluster 3
apply(b[cl$cluster==3,], 2, mean)

# Clusters have centers which are also computed as means
# so this is the same as the previous command
cl$centers[3,]

Task E

Example of a required plot (but for k=3, not k=5)

Draw a plot in which the x-axis corresponds to experiments, the y-axis is the expression level and the center of each cluster is shown as a line (use k-means clustering on the scaled frame b, computed as shown above)

  • Use command matplot(x, y, type="l", lwd=2) which gets two matrices x and y of the same size and plots each column of matrices x and y as one line (setting lwd=2 makes lines thicker)
  • In this case we omit matrix x, the command will use numbers 1,2,3... as columns of the missing matrix
  • Create y from cl$centers by applying function t (transpose)
  • Submit file clusters.png with your final plot


Task F

Only for students intermediate / advanced in R.

In task E, each cluster is represented only by its center. However, we would like see the spread of values over all genes assigned to the cluster. Use ggplot2 to create a boxplot for each cluster, with a box for every experiment, showing the median, quartiles, range and outliers over all genes assigned to that cluster. Place plots for all clusters to a single image using faceting.

  • Submit file boxplots.png with your final plot

Project

The goal of the project is to apply and extend the skills acquired during the course while working on a data analysis project of your choice. Your task is to obtain data, analyze it and display the obtained results in clear graphs and tables. It is ideal if you obtain interesting or useful conclusions, but we will mainly evaluate your choice of suitable methods and technical difficulty of the project. The scope of programming or data analysis should correspond to roughly three homework assignments, but overall the project will be more demanding, because unlike assignments, you are not provided with data and a sequence of tasks, but you have to come up with them yourself, and the first ideas do not always work.

The emphasis should be on tools run on the command line and the use of technologies covered during the course, but if needed, you can supplement them by other methods. While prototyping your tool and creating visualizations for your final report, you may prefer to work in interactive environments such as the Jupyter notebook, but in the submitted version of the project, most of the code should be in scripts runnable from the command line, potentially excluding the final visualization, which can remain as a notebook or an interactive website (flask).

Project proposal

In roughly two-thirds of the semester, you will submit a project proposal with length of about half a page. In this proposal, state what data you will process, how you will collect it, what the purpose of the analysis is and what technologies you plan to use. You can slightly change the goals and technologies as you work on the project, but you should have an initial idea. We will give you feedback on the proposal, and in some cases it may be necessary to change the theme slightly or completely. For a suitable project proposal submitted on time, you will receive 5% of the total grade. We recommend consulting the instructors before submitting the proposal.

How to submit the proposal: copy a file in txt alebo pdf format to /submit/navrh/username on the course server.

Submitting projects

Similarly as for homeworks, submit the required files to the specified folder. Submit the following:

  • Your programs and data files (do not submit large data above 50Mb)
  • Protocol similarly as for homeworks
    • in txt or pdf format, brief notes
    • contains the list of files, detailed steps done in data analysis (commands used), list of sources (data, programs, documentation and other literature etc.)
  • Project report in pdf format. Unlike the less format protocol, the report should be a coherent text written in a technical style. You can write in English or Slovak, but avoid grammar mistakes. The report should contain:
    • introduction explaining goals of the project, necessary background from the area you work in, and what data you have used
    • brief method description which does not contain details of individual steps but rather overview of the method and its justification
    • the results of your analysis (tables, graphs etc.), description of these results and discussion about conclusions that can be made from your results
      • do not forget to explain the meaning of individual tables and graphs (axes, colors etc.)
      • include the final results as well initial exploration of your data and analyses that you have used to verify correctness of the data and your analysis method
    • conclusion where you discuss which parts of the project were difficult and what problems you have encountered, which parts you have in contrast were able to do easily, which parts you would recommend in retrospect to do differently, what you have learned during the project etc.

Project can be done by pairs of students, however the project should be then bigger. Each student should be primarily responsible for some parts of the project and the division of labor should be listed in the project report. Pairs submit only one report and protocol, but the oral exam is separate for each student.

Typical scheme of a project

Most projects consist of the following stages which should also be represented in the report

  • Acquiring data. This can be easy if someone gives you the data or if you download it from internet as a single file. It can also be more difficult if you need to parse the data from many data files or webpages. Do not forget to check if the data was downloaded correctly (at least by manually checking several records). The report should clearly indicate where and how you obtained the data.
  • Preprocessing data to a suitable form. This stage involves parsing input formats, selecting useful data, checking them for correctness, filtering incomplete records etc. Store your data set in a file or database suitable for further processing. Do not forget to check if data seems to be correct. In the report, include basic characteristics of the data such as the overall number of records, attributes of each record and value ranges for these records etc. This will help the reader to understand the character of your data set.
  • Further analyses and data visualization. At this stage try to arrive to interesting or useful conclusions. The results can be static figures and tables or an interactive webpage in flask. However, even for an interactive webpage include selected results in the report as static images.

If your project significantly differs from this scheme, consult the instructors.

Project topics

  • You can process some data useful for your bachelor or master thesis or data necessary for another course (in that case mention in the report which course it is and also notify the instructors of the other course that you have used results from this course). For DAV and BIN students, this project can be a good opportunity to select a topic of the bachelor thesis and start working on it.
  • You can try to find someone who needs to process some data set but does not have the necessary skills (this could be scientists from different fields, non-profit organizations or even companies) If you contact third parties in this way, it is especially important that you try to produce the best project you can, in order to maintain the good reputation of your study program.
  • In your project, you can compare speed or accuracy of several programs for the same task. The project will consist of preparing the input data for the programs, support for automated running of the programs and evaluation of results.
  • You can also find interesting data on the internet and analyze them. Students often choose topics related to their hobbies and activities, such as sports, computer games, programming contests etc.
  • You can also try to replicate an analysis published in a scientific paper or a blog. You can check if you obtain the same results as the original authors and try variations of the original analysis, such as trying different parameters, changing settings, adding new visualizations etc.

Lcloud

Today we will work with Google Cloud (GCP), which is a cloud computing platform. GCP contains many services (virtual machines, kubernetes, storage, databases, ...). We are mainly interested in Dataflow and Storage. Dataflow allows highly parallel computation on large datasets. We will use an educational account which gives you a certain amount of resources for free.

Basic setup

You should have received instructions how to create GCloud account via MS Teams. You should be able to login to google cloud console. (TODO picture).

Now:

  • Login to some Linux machine (ideally vyuka)
  • If the machine is not vyuka, install gcloud command line package (I recommend via snap: [11]).
  • Run
    gcloud init --console-only
    
  • Follow instructions (copy link to browser, login and then copy code back to console).

Input files and data storage

Today we will use Gcloud storage to store input files and outputs. Think of it as some limited external disk (more like gdrive, than dropbox). You can just upload and download files, no random access to the middle of the file.

Run the following two commands to check if you can see the "bucket" (data storage) associated with this lecture:

# the following command should give you a big list of files
gsutil ls gs://fmph-mad-2023-public/

# this command downloads one file from the bucket
gsutil cp gs://fmph-mad-2023-public/splitaa splitaa

# the following command prints the file in your console 
# (no need to do this).
gsutil cat gs://fmph-mad-2023-public/splitaa

You should also create your own bucket (storage area). Pick your own name, must be globally unique:

gsutil mb gs://mysuperawesomebucket

If you get "AccessDeniedException: 403 The billing account for the owning project is disabled in state absent", you should open project in web UI (console.cloud.google.com), head to page Billing -> Link billing account and select "XY for Education".

Billing.png.png

Apache beam and Dataflow

We will be using Apache Beam in this session (because Pyspark stinks).

Running locally

If you want to use your own machine, please install packages with

pip install 'apache-beam[gcp]'

You are given basic template with comments in /tasks/cloud/example_job.py

You can run it locally as follows:

python3 example_job.py --output out

This job downloads one file from cloud storage and stores it into file starting with name out. You can change the name if you want. This is very useful for any debugging.

The actual job just counts amount of each base in the input data (and discards any fastq headers).

You can even use parameter --input to use some input file from your harddrive.

Running in Dataflow

First we need to create service account (account for machine). Run following commands:

# This will show you the project-id, you will need it.
gcloud projects list
# This will create service account named mad-sacc
gcloud iam service-accounts create mad-sacc
 
# This will give your service account some permissions. Do not forget to change PROJECT_ID to your project-id. Also note that we are quite liberal with permissions, this is not ideal in production.
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:mad-sacc@PROJECT_ID.iam.gserviceaccount.com" --role=roles/storage.objectAdmin
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:mad-sacc@PROJECT_ID.iam.gserviceaccount.com" --role=roles/dataflow.admin
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:mad-sacc@PROJECT_ID.iam.gserviceaccount.com" --role=roles/editor

# This creates key for your service account, named key.json (in your current directory).
gcloud iam service-accounts keys create key.json --iam-account=mad-sacc@PROJECT_ID.iam.gserviceaccount.com

# This will setup environment variable for some tools, which communicate with google cloud (change path to key to relevant path). You will need to set this everytime you open a console (or put it into .bashrc).

export GOOGLE_APPLICATION_CREDENTIALS=/home/jano/hrasko/key.json


Now you can run Beam job in Dataflow on small sample:

python3 example_job.py --output gs://YOUR_BUCKET/out/outxy --region europe-west1 --runner DataflowRunner --project PROJECT_ID --temp_location gs://YOUR_BUCKET/temp/ --input gs://fmph-mad-2023-public/splitaa

You will probably get an error like:

Dataflow API has not been used in project XYZ before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview?project=XYZ then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.

Visit the URL (from you error, not from this lecture) and click enable API, then run command again.

You can then download output using:

gsutil cp gs://YOUR_BUCKET/out/outxy* .

Now you can run job on full dataset (this is what you should be doing in homework):

python3 example_job.py --output gs://YOUR_BUCKET/out/outxy --region europe-west1 --runner DataflowRunner --project PROJECT_ID --temp_location gs://YOUR_BUCKET/temp/ --input gs://fmph-mad-2023-public/* --num_workers 5 --worker_machine_type n2-standard-4

If you want to watch progress:

  • Go to web console. Find dataflow in the menu (or type it into search bar), and go to jobs and select your job.
  • If you want to see machines magically created, go to VM Instances.

Apache beam

This is the relevant part of the code:

  with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | 'Read' >> ReadFromText(known_args.input)

    counts = (
        lines
        | 'Filter' >> beam.Filter(good_line)
        | 'Split' >> (beam.ParDo(WordExtractingDoFn()))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word, count):
      return '%s: %d' % (word, count)

    output = counts | 'Format' >> beam.MapTuple(format_result)

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | 'Write' >> WriteToText(known_args.output)

First we create a collection of data (thing about it as a big array, where indices are not significant). Then we apply various beam functions over it. First we filter it to keep only good lines, then we extract relevant parts of line (we emit (c, 1) for each letter c) and then we group results by key (first part of the tuple) and sum values (second part of the tuple).

Note that this is just template for the job. Beam decides what part of computation is run where and parallelizes things automatically.

One might ask what is the difference between ParDo and Map. Map only outputs one element per one input. ParDo can output as many as it wants.

You might want to check out more examples at beam documentation.

Tips

  • First run your code locally. It is much faster to iterate. Only if you are satisfied with the result, run it in cloud on full dataset.
  • If you run code locally, you can use print in processing functions (e.g. inside WordExtractionFN::process)
  • CombinePerKey requires called function to be associative and commutative. If you want something more complicated look at averaging example 5 here: https://beam.apache.org/documentation/transforms/python/aggregation/combineperkey/

HWcloud

See also the lecture

For both tasks, submit your source code and the result, when run on whole dataset (gs://fmph-mad-2023-public/). The code is expected to use the Apache beam framework and Dataflow presented in the lecture. Submit directory is /submit/cloud/

Task A for bonus points

Document (with screenshots) your login journey until step `gcloud projects list`.

Task B

Count the number of occurrences of each 4-mer in the provided data.

Task C

Count the number of pairs of reads which overlap in exactly 30 bases (end of one read overlaps beginning of the second read). You can ignore reverse complement.

Hints:

  • Try counting pairs for each 30-mer first.
  • You can yield something structured from Map/ParDo operatation (e.g. tuple).
  • You can have another Map/ParDo after CombinePerKey.
  • Run code locally on small data to quickly iterate and test :)

Lr2

HWr2

The topic of this lecture are statistical tests in R.

Introduction to statistical tests: sign test

  • Two friends A and B played their favorite game n=10 times, A won 6 times and B won 4 times.
  • A claims that she/he is a better player, B claims that such a result could easily happen by chance if they were equally good players.
  • Hypothesis of player B is called the null hypothesis. It claims that the pattern we see (A won more often than B) is simply a result of chance.
  • The null hypothesis reformulated: we toss coin n times and compute value X: the number of times we see head. The tosses are independent and each toss has equal probability of being head or tail.
  • Similar situation: comparing programs A and B on several inputs, and counting how many times is program A better than B.
# simulation in R: generate 10 pseudorandom bits
# (1=player A won)
sample(c(0,1), 10, replace = TRUE)
# result e.g. 0 0 0 0 1 0 1 1 0 0

# directly compute random variable X, i.e. the sum of bits
sum(sample(c(0,1), 10, replace = TRUE))
# result e.g. 5

# we define a function which will m times repeat 
# the coin tossing experiment with n tosses 
# and returns a vector with m values of random variable X
experiment <- function(m, n) {
  x = rep(0, m)     # create vector with m zeroes
  for(i in 1:m) {   # for loop through m experiments
    x[i] = sum(sample(c(0,1), n, replace = TRUE)) 
  }
  return(x)         # return array of values     
}
# call the function for m=20 experiments, each with n=10 tosses
experiment(20,10)
# result e.g.  4 5 3 6 2 3 5 5 3 4 5 5 6 6 6 5 6 6 6 4
# draw histograms for 20 experiments and 1000 experiments
png("hist10.png")  # open png file
par(mfrow=c(2,1))  # matrix of plots with 2 rows and 1 column
hist(experiment(20,10))
hist(experiment(1000,10))
dev.off() # finish writing to file
  • It is easy to realize that we get binomial distribution (binomické rozdelenie)
  • The probability of getting k ones out of n coin tosses is
  • The p-value of the test is the probability that simply by chance we would get k the same or more extreme than in our data.
  • In other words, what is the probability that in n=10 tosses we see head 6 times or more (this is called one sided test).
  • P-value for k ones out of n coin tosses .
  • If the p-value is very small, say smaller than 0.01, we reject the null hypothesis and assume that player A is in fact better than B.
# computing the probability that we get exactly 6 heads in 10 tosses
dbinom(6, 10, 0.5) # result 0.2050781
# we get the same as our formula above:
7*8*9*10/(2*3*4*(2^10)) # result 0.2050781

# entire probability distribution: probabilities 0..10 heads in 10 tosses
dbinom(0:10, 10, 0.5)
# [1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250
# [6] 0.2460937500 0.2050781250 0.1171875000 0.0439453125 0.0097656250
# [11] 0.0009765625

# we can also plot the distribution
plot(0:10, dbinom(0:10, 10, 0.5))
barplot(dbinom(0:10, 10, 0.5))

# our p-value is the sum for k=6,7,8,9,10
sum(dbinom(6:10, 10, 0.5))
# result: 0.3769531
# so results this "extreme" are not rare by chance,
# they happen in about 38% of cases

# R can compute the sum for us using pbinom 
# this considers all values greater than 5
pbinom(5, 10, 0.5, lower.tail=FALSE)
# result again 0.3769531

# if probability is too small, use log of it
pbinom(9999, 10000, 0.5, lower.tail=FALSE, log.p = TRUE)
# [1] -6931.472
# the probability of getting 10000x head is exp(-6931.472) = 2^{-100000}

# generating numbers from binomial distribution 
# - similarly to our function experiment
rbinom(20, 10, 0.5)
# [1] 4 4 8 2 6 6 3 5 5 5 5 6 6 2 7 6 4 6 6 5

# running the test
binom.test(6, 10, p = 0.5, alternative="greater")
#
#        Exact binomial test
#
# data:  6 and 10
# number of successes = 6, number of trials = 10, p-value = 0.377
# alternative hypothesis: true probability of success is greater than 0.5
# 95 percent confidence interval:
# 0.3035372 1.0000000
# sample estimates:
# probability of success
#                   0.6

# to only get p-value, run
binom.test(6, 10, p = 0.5, alternative="greater")$p.value
# result 0.3769531

Comparing two sets of values: Welch's t-test

  • Let us now consider two sets of values drawn from two normal distributions with unknown means and variances.
  • The null hypothesis of the Welch's t-test is that the two distributions have equal means.
  • The test computes test statistics (in R for vectors x1, x2):
    • (mean(x1)-mean(x2))/sqrt(var(x1)/length(x1)+var(x2)/length(x2))
  • If the null hypothesis holds, i.e. x1 and x2 were sampled from distributions with equal means, this test statistics is approximately distributed according to the Student's t-distribution with the degree of freedom obtained by
n1=length(x1)
n2=length(x2)
(var(x1)/n1+var(x2)/n2)**2/(var(x1)**2/((n1-1)*n1*n1)+var(x2)**2/((n2-1)*n2*n2))
  • Luckily R will compute the test for us simply by calling t.test
# generate x1: 6 values from normal distribution with mean 2 and standard deviation 1
x1 = rnorm(6, 2, 1)
# for example 2.70110750  3.45304366 -0.02696629  2.86020145  2.37496993  2.27073550

# generate x2: 4 values from normal distribution with mean 3 and standard deviation 0.5
x2 = rnorm(4, 3, 0.5)
# for example 3.258643 3.731206 2.868478 2.239788
t.test(x1, x2)
# t = -1.2898, df = 7.774, p-value = 0.2341
# alternative hypothesis: true difference in means is not equal to 0
# means 2.272182  3.024529
# this time the test was not significant

# regenerate x2 from a distribution with a much more different mean
x2 = rnorm(4, 5, 0.5)
# 4.882395 4.423485 4.646700 4.515626
t.test(x1, x2)
# t = -4.684, df = 5.405, p-value = 0.004435
# means 2.272182  4.617051
# this time much more significant p-value

# to get only p-value, run 
t.test(x1,x2)$p.value

We will apply Welch's t-test to microarray data

Multiple testing correction

  • When we run t-tests on the control vs. benzoate on all 6398 genes, we get 435 genes with p-value at most 0.01.
  • Purely by chance this would happen in 1% of cases (from the definition of the p-value).
  • So purely by chance we would expect to get about 64 genes with p-value at most 0.01.
  • So roughly 15% of our detected genes (maybe less, maybe more) are false positives which happened purely by chance.
  • Sometimes false positives may even overwhelm the results.
  • Multiple testing correction tries to limit the number of false positives among the results of multiple statistical tests. There are many different methods.
  • The simplest one is Bonferroni correction, where the threshold on the p-value is divided by the number of tested genes, so instead of 0.01 we use threshold 0.01/6398 = 1.56e-6.
  • This way the expected overall number of false positives in the whole set is 0.01 and so the probability of getting even a single false positive is also at most 0.01 (by Markov inequality).
  • We could instead multiply all p-values by the number of tests and apply the original threshold 0.01 - such artificially modified p-values are called corrected.
  • After the Bonferroni correction we do not get any significant gene.
# the results of t-tests are in vector pb of length 6398
# manually multiply p-values by length(pb), count those that have value <= 0.01
sum(pb * length(pb) <= 0.01)
# in R you can use p.adjust for multiple testing correction
pb.adjusted = p.adjust(pa, method ="bonferroni")
# this is equivalent to multiplying by the length and using 1 if the result > 1
pb.adjusted = pmin(pa*length(pa), rep(1, length(pa)))

# there are less conservative multiple testing correction methods, 
# e.g. Holm's method, but in this case we get almost the same results
pa.adjusted2 = p.adjust(pa, method ="holm")

Another frequently used correction is false discovery rate (FDR), which is less strict and controls the overall proportion of false positives among the results.

HWr2

See also the current and the previous lecture.

  • Do either tasks A,B,C (beginners) or B,C,D (more advanced). You can also do all four for bonus credit.
  • In your protocol write used R commands with brief comments on your approach.
  • Submit required plots with filenames as specified.
  • For each task also include results as required and a short discussion commenting the results/plots you have obtained. Is the value of interest increasing or decreasing with some parameter? Are the results as expected or surprising?
  • Outline of protocol is in /tasks/r2/protocol.txt

Task A: sign test

  • Consider a situation in which players played n games, out of which a fraction of q were won by A (the example in the lecture corresponds to q=0.6 and n=10).
  • Compute a table of p-values for combinations of n=10,20,...,90,100 and q=0.6, 0.7, 0.8, 0.9.
  • Plot the table using matplot (n is x-axis, one line for each value of q).
  • Submit the plot in sign.png.
  • Discuss the values you see in the plot / table.

Outline of the code:

# create vector rows with values 10,20,...,100
rows = (1:10) * 10
# create vector columns with required values of q
columns = c(0.6, 0.7, 0.8, 0.9)
# create empty matrix of pvalues 
pvalues = matrix(0, length(rows), length(columns))
# TODO: fill in matrix pvalues using binom.test

# set names of rows and columns
rownames(pvalues) = rows
colnames(pvalues) = columns
# careful: pvalues[10,] is now 10th row, i.e. value for n=100, 
#          pvalues["10",] is the first row, i.e. value for n=10

# check that for n=10 and q=0.6 you get p-value 0.3769531
pvalues["10","0.6"]

# create x-axis matrix (as in HWr1, part D)
x=matrix(rep(rows, length(columns)), nrow=length(rows))
# matplot command
png("sign.png")
matplot(x, pvalues, type="l", col=c(1:length(columns)), lty=1)
legend("topright", legend=columns, col=c(1:length(columns)), lty=1)
dev.off()

Task B: Welch's t-test on microarray data

Run the code below which reads the microarray data and preprocesses them (the last time we worked with preprocessed data). We first transform it to log scale and then shift and scale values in each column so that median is 0 and sum of squares of values is 1. This makes values more comparable between experiments; in practice more elaborate normalization is often performed. In the rest, work with table a containing preprocessed data.

# read the input file
input = read.table("/tasks/r2/acids.tsv", header=TRUE, row.names=1)
# take logarithm of all the values in the table
input = log2(input)
# compute median of each column
med = apply(input, 2, median)
# shift and scale values
a = scale(input, center=med)

Columns 1,2,3 are control, columns 4,5,6 acetic acid, 7,8,9 benzoate, 10,11,12 propionate, and 13,14,15 sorbate.

Write a function my.test which will take as arguments table a and 2 lists of columns (e.g. 1:3 and 4:6) and will run for each row of the table Welch's t-test of the first set of columns versus the second set. It will return the resulting vector of p-values, one for each gene.

  • For example by calling pb <- my.test(a, 1:3, 7:9) we will compute p-values for differences between control and benzoate (computation may take some time)
  • The first 5 values of pb should be
> pb[1:5]
[1] 0.02358974 0.05503082 0.15354833 0.68060345 0.04637482
  • Run the test for all four acids.
  • Report for each acid how many genes were significant with p-value at most 0.01.
  • Report also how many genes are significant for both acetic and benzoate acids simultaneously (logical "and" is written as &).

Task C: multiple testing correction

Run the following snippet of code, which works on the vector of p-values pb obtained for benzoate in task B.

# adjusts vectors of p-values from tasks B for using Bonferroni correction
pb.adjusted = p.adjust(pb, method ="bonferroni")
# add both p-value columns pb and pb.adjusted to frame a
a <-  cbind(a, pb, pb.adjusted)
# create permutation ordered by pb.adjusted
ob = order(pb.adjusted)
# select from table five rows with the lowest pb.adjusted (using vector ob)
# and display columns containing control, benzoate and both original and adjusted p-value
a[ob[1:5],c(1:3,7:9,16,17)]

You should get an output like this:

      control1  control2  control3  benzoate1  benzoate2  benzoate3
PTC4 0.5391444 0.5793445 0.5597744  0.2543546  0.2539317  0.2202997
GDH3 0.2480624 0.2373752 0.1911501 -0.3697303 -0.2982495 -0.3616723
AGA2 0.6735964 0.7860222 0.7222314  1.4807101  1.4885581  1.3976753
CWP2 1.4723713 1.4582596 1.3802390  2.3759288  2.2504247  2.2710695
LSP1 0.7668296 0.8336119 0.7643181  1.3295121  1.2744859  1.2986457
               pb pb.adjusted
PTC4 4.054985e-05   0.2594379
GDH3 5.967727e-05   0.3818152
AGA2 8.244790e-05   0.5275016
CWP2 1.041416e-04   0.6662979
LSP1 1.095217e-04   0.7007201

Do the same procedure for acetate p-values and report the result (in your table, report both p-values and expression levels for acetate, not bezoate). Comment on the results for both acids.

Task D: volcano plot, test on data generated from null hypothesis

Draw a volcano plot for the acetate data.

  • The x-axis of this plot is the difference between the mean of control and the mean of acetate. You can compute row means of a matrix by rowMeans.
  • The y-axis is -log10 of the p-value (use the p-values before multiple testing correction).
  • You can quickly see the genes that have low p-values (high on y-axis) and also big difference in the mean expression between the two conditions (far from 0 on x-axis). You can also see if acetate increases or decreases the expression of these genes.

Now create a simulated dataset sharing some features of the real data but observing the null hypothesis that the mean of control and acetate are the same for each gene.

  • Compute vector m of means for columns 1:6 from matrix a
  • Compute vectors sc and sa of standard deviations for control columns and for acetate columns respectively. You can compute standard deviation for each row of a matrix by apply(some.matrix, 1, sd).
  • For each i in 1:6398, create three samples from the normal distribution with mean m[i] and standard deviation sc[i] and three samples with mean m[i] and deviation sa[i] (use the rnorm function).
  • On the resulting matrix, apply Welch's t-test and draw the volcano plot.
  • How many random genes have p-value at most 0.01? Is it roughly what we would expect under the null hypothesis?

Draw a histogram of p-values from the real data (control vs acetate) and from the random data (use function hist). Describe how they differ. Is it what you would expect?

Submit plots volcano-real.png, volcano-random.png, hist-real.png, hist-random.png (real for real expression data and random for generated data).