2-INF-185 Integrácia dátových zdrojov 2016/17

Submit HW10 and HW11 by Tuesday 30.5. at 9:00.
Project submission deadlines:
1st deadline: Sunday 11.6. 22:00
2nd deadline: Wednesday 21.6. 22:00
Both deadlines are regular; the first is intended for students who are finishing their studies or who want to complete the course earlier. In both cases, short in-person meetings with the instructors (discussion of the project and finalizing the grade) will take place a few days after submission. Exact days and times will be arranged later. Submit projects in the same way as homework, to /submit/projekt



Lecture 1, Lecture 2, Lecture 3

  • In this homework, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
  • Each task can be split into several stages and intermediate files written to disk, but you can also use pipelines to reduce the number of temporary files.
  • Your commands should also work for other input files with the same format. Do not try to generalize them too much, but also do not rely on very specific properties of a particular input, such as the number of lines.
  • Include all relevant commands you used in your protocol and add a short description of your approach.
  • Submit the protocol and required output files.
  • An outline of the protocol is in /tasks/hw03/protocol.txt; submit to directory /submit/hw03/yourname


  • If you are bored, you can try to write a solution to Task B using as few characters as possible.
  • In the protocol, include both the normal readable form and the condensed form.
  • The winner with the shortest set of commands gets some bonus points.

Task A

  • /tasks/hw03/names.txt contains data about several people, one per line.
  • Each line consists of given name(s), surname and email separated by spaces.
  • Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form username@uniba.sk.
  • The task is to generate file passwords.csv which contains a randomly generated password for each of these users
  • The output file has columns separated by commas ','
  • The first column contains username extracted from email address, second column surname, third column all given names and fourth column the randomly generated password
  • Submit file passwords.csv with the result of your commands.
  • In addition to commands that produce the desired file, also run some commands that could help you to detect potential problems in the input.
    • Examples of such problems include input lines with fewer than 3 columns, missing @ in email, or fields containing commas.
    • For every type of error, run a command which should have a different result for correct and incorrect files, and this output should be short even if the input file is long. In your protocol, explain what output you expect for correct and incorrect files (e.g. you can use grep to find incorrect lines and count their number - in a good file the number would be 0, in a bad file higher than 0)
  • Such checks are not necessary in the other tasks, but in a real project it is a good idea to check your input and intermediate files.
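
The kind of input check described above can be sketched as follows. This is only an illustration under stated assumptions: the three sample lines are made up (they are not taken from names.txt), and the variable names are hypothetical. Each check prints a single number, which is 0 for a correct file and higher for a file with problems.

```shell
# Hypothetical sample input (not the real names.txt):
# line 2 contains no '@', line 3 has only 2 fields.
sample='Pavol Országh Hviezdoslav hviezdoslav32@uniba.sk
Jan Novak novak.uniba.sk
Kovac kovac1@uniba.sk'

# Number of lines containing no '@' at all; 0 is expected for a correct file.
# grep exits with a non-zero status when the count is 0, hence "|| true".
missing_at=$(printf '%s\n' "$sample" | grep -c -v '@' || true)

# Number of lines with fewer than 3 whitespace-separated fields; 0 is expected.
too_short=$(printf '%s\n' "$sample" | awk 'NF < 3 { n++ } END { print n + 0 }')

# Number of lines containing a comma, which would break the CSV output; 0 is expected.
with_comma=$(printf '%s\n' "$sample" | grep -c ',' || true)

echo "missing_at=$missing_at too_short=$too_short with_comma=$with_comma"
```

On the real input, the same commands would read /tasks/hw03/names.txt instead of the sample variable.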

Example line from input:

Pavol Országh Hviezdoslav hviezdoslav32@uniba.sk

Example line from output (password will differ):

hviezdoslav32,Hviezdoslav,Pavol Országh,3T3Pu3un


  • Passwords can be generated using pwgen; we also recommend using perl, wc and paste.
  • In Perl, function pop may be useful for manipulating @F and function join for connecting strings with a separator.

Task B


  • /tasks/hw03/saccharomyces_cerevisiae.gff contains annotation of the yeast genome
    • Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [1].
    • It was further processed to omit DNA sequences from the end of file.
    • The size of the file is 5.6M.
  • For easier work, link the file to your directory by ln -s /tasks/hw03/saccharomyces_cerevisiae.gff yeast.gff
  • The file is in GFF3 format [2]
  • Lines starting with # are comments, other lines contain tab-separated data about one interval of some chromosome in the yeast genome
  • Meaning of the first 5 columns (numbered from 0, as in Perl's @F):
    • column 0 chromosome name
    • column 1 source (can be ignored)
    • column 2 type of interval
    • column 3 start of interval (1-based coordinates)
    • column 4 end of interval (1-based coordinates)
  • You can assume that these first 5 columns do not contain whitespace


  • For each chromosome, the file contains a line with the string chromosome in column 2, whose interval covers the whole chromosome.
  • To file chromosomes.txt, print a tab-separated list of chromosome names and sizes in the same order as in the input
  • The last line of chromosomes.txt should list the total size of all chromosomes combined.
  • Submit file chromosomes.txt
  • Hint: tab is written in Perl as "\t". Command cat may be useful.
  • Your output should start and end as follows:
chrI    230218
chrII   813184
...
chrXVI  948066
chrmt   85779
total   12157105

Task C

  • Continue processing the file from Task B. For each type of interval (column 2), print how many times it occurs in the file.
  • Sort from the most common to the least common interval types.
  • Hint: commands sort and uniq will be useful. Do not forget to skip comments.
  • Submit file types.txt with the output formatted as follows:
   7058 CDS
   6600 mRNA
   ...
      1 telomerase_RNA_gene
      1 mating_type_region
      1 intein_encoding_region
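
The sort and uniq hint can be illustrated on a tiny made-up input (not the real yeast.gff). The sketch assumes tab-separated lines with the interval type in column 2 (counting from 0), which is field 3 for cut. The exact width of the leading padding printed by uniq -c may differ between systems.

```shell
# Hypothetical sample with one comment line and six data lines.
gff=$(printf '# comment\nchrI\tSGD\tCDS\t1\t9\nchrI\tSGD\tmRNA\t1\t9\nchrI\tSGD\tCDS\t20\t30\nchrI\tSGD\tgene\t1\t30\nchrII\tSGD\tCDS\t5\t50\nchrII\tSGD\tmRNA\t5\t50\n')

# Skip comments, keep the type column, group equal values together,
# count each group, then sort by count in descending numeric order.
counts=$(printf '%s\n' "$gff" | grep -v '^#' | cut -f 3 | sort | uniq -c | sort -rn)

echo "$counts"
```

The first sort is needed because uniq only merges adjacent identical lines.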

Task D

Overall goal:

  • Proteins from several well-studied yeast species were downloaded from database http://www.uniprot.org/ on 2016-03-09
  • We have also downloaded proteins from the yeast Yarrowia lipolytica. We will pretend that nothing is known about these proteins (as if they were produced by a gene-finding program in a newly sequenced genome).
  • We have run blast of known proteins vs. Y.lip. proteins.
  • Now we want to find for each protein in Y.lip. its closest match among all known proteins.


  • /tasks/hw03/known.fa is a fasta file with known proteins from several species
  • /tasks/hw03/yarLip.fa is a fasta file with proteins from Y.lip.
  • /tasks/hw03/known.blast is the result of running blast of yarLip.fa versus known.fa by these commands:
formatdb -i known.fa
blastall -p blastp -d known.fa -i yarLip.fa -m 9 -e 1e-5 > known.blast
  • you can link these files to your directory as follows:
ln -s /tasks/hw03/known.fa .
ln -s /tasks/hw03/yarLip.fa .
ln -s /tasks/hw03/known.blast .

Step 1:

  • Get the first (strongest) match for each query from known.blast.
  • This can be done by printing each non-comment line that immediately follows a comment line (comment lines start with #).
  • In a Perl one-liner, you can create a state variable which remembers whether the previous line was a comment, and based on that decide whether to print the current line.
  • Instead of using Perl, you can play with grep. Option -A 1 prints the matching lines as well as one line after each match.
  • Print only the first two columns separated by tab (name of query, name of target), sort the file by second column.
  • Submit file best.tsv with the result
  • File should start as follows:
Q6CBS2  sp|B5BP46|YP52_SCHPO
Q6C8R4  sp|B5BP48|YP54_SCHPO
Q6CG80  sp|B5BP48|YP54_SCHPO
Q6CH56  sp|B5BP48|YP54_SCHPO

Step 2:

  • Submit file known.tsv which contains sequence names extracted from known.fa with leading > removed
  • This file should be sorted alphabetically.
  • File should start as follows:
sp|A0A023PXA5|YA19A_YEAST Putative uncharacterized protein YAL019W-A OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=YAL019W-A PE=5 SV=1
sp|A0A023PXB0|YA019_YEAST Putative uncharacterized protein YAR019W-A OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=YAR019W-A PE=5 SV=1
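
A minimal sketch of extracting header lines from a fasta file is shown below, using a made-up two-sequence input instead of known.fa. It assumes that sequence names are exactly the lines starting with >.

```shell
# Hypothetical fasta input with two sequences in unsorted order.
fasta=$(printf '>sp|B2|NAME_B second protein\nMKV\n>sp|A1|NAME_A first protein\nMSL\nGGA\n')

# Keep only header lines, strip the leading '>', and sort the result.
names=$(printf '%s\n' "$fasta" | grep '^>' | sed 's/^>//' | sort)

echo "$names"
```

Note that the order produced by sort depends on the locale; the file passed to join in the next step should be sorted in the same locale that join runs in.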

Step 3:

  • Use command join to join the files best.tsv and known.tsv so that each line of best.tsv is extended with the text describing the corresponding target in known.tsv
  • Use option -1 2 to use second column of best.tsv as a key for joining
  • The output of join may look as follows:
sp|B5BP46|YP52_SCHPO Q6CBS2 Putative glutathione S-transferase C1183.02 OS=Schizosaccharomyces pombe (strain 972 / ATCC 24843) GN=SPBC460.02c PE=3 SV=1
sp|B5BP48|YP54_SCHPO Q6C8R4 Putative alpha-ketoglutarate-dependent sulfonate dioxygenase OS=Schizosaccharomyces pombe (strain 972 / ATCC 24843) GN=SPBC460.04c PE=3 SV=1
  • Further reformat the output so that query name goes first (e.g. Q6CBS2), followed by target name (e.g. sp|B5BP46|YP52_SCHPO), followed by the rest of the text, but remove all text after OS=
  • Sort by query name
  • Submit file best.txt with the result
  • The output should start as follows:
B5FVA8  tr|Q5A7D5|Q5A7D5_CANAL  Lysophospholipase
B5FVB0  sp|O74810|UBC1_SCHPO    Ubiquitin-conjugating enzyme E2 1
B5FVB1  sp|O13877|RPAB5_SCHPO   DNA-directed RNA polymerases I, II, and III subunit RPABC5


  • Not all Y.lip. proteins are necessarily included in your final output (some proteins do not have a blast match).
    • You can think about how to find the list of such proteins, but this is not part of the assignment.
  • However, files best.txt and best.tsv should have the same number of lines.
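
The behavior of join with option -1 2 from Step 3 can be tried on two tiny made-up files. The file contents and the temporary directory are assumptions for illustration only. Both files must be sorted by their join key (column 2 of the first file, column 1 of the second).

```shell
tmp=$(mktemp -d)

# Hypothetical best.tsv: query and target, sorted by the second column.
printf 'Q1\ttA\nQ2\ttB\nQ3\ttB\n' > "$tmp/best.tsv"
# Hypothetical known.tsv: target name followed by a description, sorted by name.
printf 'tA Protein alpha OS=Species one\ntB Protein beta OS=Species two\n' > "$tmp/known.tsv"

# -1 2 uses field 2 of the first file as the key; the second file uses field 1 by default.
# The key is printed first, then the remaining fields of each file, separated by spaces.
joined=$(join -1 2 "$tmp/best.tsv" "$tmp/known.tsv")

echo "$joined"
rm -r "$tmp"
```

Each line of the first file appears once in the output, so a target shared by several queries (tB here) is repeated with the same description.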