2-INF-185 Integrácia dátových zdrojov 2016/17

Materiály · Úvod · Pravidlá · Kontakt
HW10 a HW11 odovzdajte do utorka 30.5. 9:00.
Dátumy odovzdania projektov:
1. termín: nedeľa 11.6. 22:00
2. termín: streda 21.6. 22:00
Oba termíny sú riadne, prvý je určený pre študentov končiacich štúdium alebo tých, čo chcú mať predmet ukončený skôr. V oboch prípadoch sa pár dní po odvzdaní budú konať krátke osobné stretnutia s vyučujúcimi (diskusia k projektu a uzatvárane známky). Presné dni a časy dohodneme neskôr. Projekty odovzdajte podobne ako domáce úlohy do /submit/projekt


From IDZ
Jump to: navigation, search

See Lecture 1


We have 4 input files for this homework. We recommend creating soft links to your working directory as follows:

ln -s /tasks/hw01/repeats-small.txt .  # small version of the repeat file
ln -s /tasks/hw01/repeats.txt .        # full version of the repeat file
ln -s /tasks/hw01/reads-small.fastq .  # smaller version of the read file
ln -s /tasks/hw01/reads.fastq .        # bigger version of the read file

We recommend writing your protocol starting from an outline provided in /tasks/hw01/protocol.txt


  • Directory /submit/hw01/your_username will be created for you
  • Copy required files to this directory, including the protocol named protocol.txt or protocol.pdf
  • You can modify these files freely until deadline, but after the deadline of the homework, you will lose access rights to this directory

Task A

  • Consider the program for counting repeat types in the lecture 1, save it to file repeat-stat.pl
  • Extend it to compute the average length of each type of repeat
    • Each row of the input table contains the start and end coordinates of the repeat in columns 7 and 6. The length is simply the difference of these two values.
  • Output a table with three columns: type of repeat, the number of occurrences, the average length of the repeat.
    • Use printf to print these three items right-justified in columns of sufficient width, print the average length to 1 decimal place.
  • If you run your script on the small file, the output should look something like this (exact column widths may differ):
./repeat-stat.pl < repeats-small.txt
                 DNA         5     377.4
                LINE         4     410.2
                 LTR        13     355.4
      Low_complexity        22      47.2
                  RC         8     236.2
       Simple_repeat       106      39.0
  • Include in your protocol the output when you run your script on the large file: ./repeat-stat.pl < repeats.txt
  • Find out on Wikipedia, what acronyms LINE and LTR stand for. Do their names correspond to their lengths? (Write a short answer in the protocol.)
  • Submit only your script, repeat-stat.pl

Task B

  • Write a script which reformats FASTQ file to FASTA format, call it fastq2fasta.pl
    • fastq file should be on standard input, fasta file written to standard output
  • FASTA format is a typical format for storing DNA and protein sequences.
    • Each sequence consists of several lines of the file. The first line starts with ">" followed by identifier of the sequence and optinally some further description separated by whitespace
    • The sequence itself is on the second line, long sequences are split into multiple lines
  • In our case, the name of the sequence will be the ID of the read with @ replaced by > and / replaced by _
  • For example, the first two reads of reads.fastq are:
  • These should be reformatted as follows:
  • Submit files fastq2fasta.pl and reads-small.fasta
    • the latter file is created by running ./fastq2fasta.pl < reads-small.fastq > reads-small.fasta

Task C

  • Write a script fastq-quality.pl which for each position in a read computes the average quality
  • Standard input has fastq file with multiple reads, possibly of different lengths
  • As quality we will use simply ASCII values of characters in the quality string with value 33 substracted, so the quality is -10 log p
    • ASCII value can be computed by function ord
  • Positions in reads will be numbered from 0
  • Since reads can differ in length, some positions are used in more reads, some in fewer
  • For each position from 0 up to the highest position used in some read, print three numbers separated by tabs "\t": the position index, the number of times this position was used in reads, average quality at that position with 1 decimal place (you can again use printf)
  • The last two lines when you run ./fastq-quality.pl < reads-small.fastq should be
99      86      5.5
100     86      8.6
  • Run the following command, which runs your script on the larger file and selects every 10th position. Include the output in your protocol. Do you see any trend in quality values with increasing position? (Include a short comment in protocol.)
./fastq-quality.pl < reads.fastq | perl -lane 'print if $F[0]%10==0'
  • Submit only fastq-quality.pl

Task D

  • Write script fastq-trim.pl that trims low quality bases from the end of each read and filters out short reads
  • This script should read a fastq file from standard input and write trimmed fastq file to standard output
  • It should also accept two command-line arguments: character Q and integer L
    • We have not covered processing command line arguments, but you can use the code snippet below for this
  • Q is the minimum acceptable quality (characters from quality string with ASCII value >= ASCII value of Q are ok)
  • L is the minimum acceptable length of a read
  • First find the last base in a read which has quality at least Q (if any). All bases after this base will be removed from both sequence and quality string
  • If the resulting read has fewer than L bases, it is omitted from the output

You can check your program by the following tests:

  • If you run the following two commands, you should get tmp identical with input and thus output of diff should be empty
./fastq-trim.pl '!' 101 < reads-small.fastq > tmp  # trim at quality ASCII >=33 and length >=101
diff reads-small.fastq tmp                         # output should be empty (no differences)
  • If you run the following two commands, you should see differences in 4 reads, 2 bases trimmed from each
./fastq-trim.pl '"' 1 < reads-small.fastq > tmp   # trim at quality ASCII >=34 and length >=1
diff reads-small.fastq tmp                        # output should be differences in 4 reads
  • If you run the following commands, you should get empty output (no reads meet the criteria):
./fastq-trim.pl d 1 < reads-small.fastq           # quality ASCII >=100, length >= 1
./fastq-trim.pl '!' 102 < reads-small.fastq       # quality ASCII >=33 and length >=102

Further runs and submitting

  • Run ./fastq-trim.pl '(' 95 < reads-small.fastq > reads-small-filtered.fastq # quality ASCII >= 40
  • Submit files fastq-trim.pl and reads-small-filtered.fastq
  • If you have done task C, run quality statistics on a trimmed version of the bigger file using command below and include in the protocol the result. Comment in the protocol on differences between statistics on the whole file in part C and D. Are they as you expected?
./fastq-trim.pl 2 50 < reads.fastq | ./fastq-quality.pl | perl -lane 'print if $F[0]%10==0'  # quality ASCII >= 50
  • Note: you have created tools which can be combined, e.g. you can create quality-trimmed version of the fasta file by first trimming fastq and then converting to fasta (no need to submit these files)

Parsing command-line arguments in this task (they will be stored in variables $Q and $L):

#!/usr/bin/perl -w
use strict;

my $USAGE = "
$0 Q L < input.fastq > output.fastq

Trim from the end of each read bases with ASCII quality value less
than the given threshold Q. If the length of the read after trimming
is less than L, the read will be omitted from output.

L is a non-negative integer, Q is a character

# check that we have exactly 2 command-line arguments
die $USAGE unless @ARGV==2;
# copy command-line arguments to variables Q and L
my ($Q, $L) = @ARGV;
# check that $Q is one character and $L looks like a non-negative integer
die $USAGE unless length($Q)==1 && $L=~/^[0-9]+$/;