2-INF-185 Integration of Data Sources 2017/18




Lecture 1 (Perl 1), Lecture 2 (Perl 2), Lecture 3 (command-line)

  • In this homework, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
  • Each task can be split into several stages and intermediate files written to disk, but you can also use pipelines to reduce the number of temporary files.
  • Your commands should also work for other input files with the same format (do not try to generalize them too much, but also do not rely on very specific properties of a particular input, such as the number of lines).
  • Include all relevant commands you used in your protocol and add a short description of your approach.
  • Submit the protocol and required output files.
  • An outline of the protocol is in /tasks/hw03/protocol.txt; submit to directory /submit/hw03/yourname

Task A

  • /tasks/hw03/names.txt contains data about several people, one per line.
  • Each line consists of given name(s), surname and email separated by spaces.
  • Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form username@uniba.sk.
  • The task is to generate the file passwords.csv, which contains a randomly generated password for each of these users.
    • The output file has columns separated by commas ','
    • The first column contains username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password
  • Submit file passwords.csv with the result of your commands.

Example line from input:

Pavol Országh Hviezdoslav hviezdoslav32@uniba.sk

Example line from output (password will differ):

hviezdoslav32,Hviezdoslav,Pavol Országh,3T3Pu3un


  • Passwords can be generated using pwgen (e.g. pwgen -N 10 -1 prints 10 passwords, one per line)
  • We also recommend using perl, wc, paste (check option -d in paste)
  • In Perl, function pop may be useful for manipulating @F and function join for connecting strings with a separator.

Task B


  • /tasks/hw03/saccharomyces_cerevisiae.gff contains annotation of the yeast genome
    • Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [1].
    • It was further processed to omit DNA sequences from the end of file.
    • The size of the file is 5.6M.
  • For easier work, link the file to your directory by ln -s /tasks/hw03/saccharomyces_cerevisiae.gff yeast.gff
  • The file is in GFF3 format [2]
  • Lines starting with # are comments, other lines contain tab-separated data about one interval of some chromosome in the yeast genome
  • Meaning of the first 5 columns:
    • column 0 chromosome name
    • column 1 source (can be ignored)
    • column 2 type of interval
    • column 3 start of interval (1-based coordinates)
    • column 4 end of interval (1-based coordinates)
  • You can assume that these first 5 columns do not contain whitespace


  • Print for each type of interval (column 2), how many times it occurs in the file.
  • Sort from the most common to the least common interval types.
  • Hint: commands sort and uniq will be useful. Do not forget to skip comments, for example using grep -v '^#'
  • Submit file types.txt with the output formatted as follows (only the start and end of the expected output are shown):
   7058 CDS
   6600 mRNA
...
      1 telomerase_RNA_gene
      1 mating_type_region
      1 intein_encoding_region
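
The sort and uniq hint can be illustrated on a tiny input. The sketch below assumes a simplified demo file (demo.gff, with invented contents and only five columns); the output file name types_demo.txt is also hypothetical.

```shell
# Simplified demo GFF: one comment line and three tab-separated data lines.
printf '# demo comment\nchrI\tdemo\tgene\t10\t90\nchrI\tdemo\tCDS\t10\t50\nchrI\tdemo\tCDS\t60\t90\n' > demo.gff

# Skip comments and keep the type column (cut counts columns from 1,
# so column 2 in the numbering above is -f 3).
# sort groups equal types together, uniq -c counts each group,
# and sort -rn orders the counts from largest to smallest.
grep -v '^#' demo.gff | cut -f 3 | sort | uniq -c | sort -rn > types_demo.txt
cat types_demo.txt
```

The first sort is needed because uniq only merges adjacent identical lines.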

Task C

  • Continue processing file from task B.
  • For each chromosome, the file contains a line with the string chromosome in column 2; the interval on this line spans the whole chromosome.
  • To file chromosomes.txt, print a tab-separated list of chromosome names and sizes in the same order as in the input.
  • The last line of chromosomes.txt should list the total size of all chromosomes combined.
  • Submit file chromosomes.txt
  • Hints:
    • The total size can be computed by a perl one-liner.
    • Example from the lecture: compute the sum of interval sizes if each line of the file contains the start and end of one interval: perl -lane '$sum += $F[1]-$F[0]; END { print $sum; }'
    • Grepping for the word chromosome does not check whether this word is indeed in column 2 (the type of interval); it also matches lines where the word occurs elsewhere.
    • Tab character is written in Perl as "\t".
  • Your output should start and end as follows:
chrI    230218
chrII   813184
...
chrXVI  948066
chrmt   85779
total   12157105

Task D

Overall goal:

  • Proteins from several well-studied yeast species were downloaded from database http://www.uniprot.org/ on 2016-03-09
  • We have also downloaded proteins from the yeast Yarrowia lipolytica. We will pretend that nothing is known about these proteins (as if they were produced by a gene-finding program in a newly sequenced genome).
  • For each Y.lip. protein, we have found similar proteins from other yeasts by blast.
  • Now we want to find for each protein in Y.lip. its closest match among all known proteins.


  • /tasks/hw03/known.fa is a fasta file with known proteins from several species
  • /tasks/hw03/yarLip.fa is a fasta file with proteins from Y.lip.
  • /tasks/hw03/known.blast is the result of running blast of yarLip.fa versus known.fa by these commands:
formatdb -i known.fa
blastall -p blastp -d known.fa -i yarLip.fa -m 9 -e 1e-5 > known.blast
  • you can link these files to your directory as follows:
ln -s /tasks/hw03/known.fa .
ln -s /tasks/hw03/yarLip.fa .
ln -s /tasks/hw03/known.blast .

Step 1:

  • Get the first (strongest) match for each query from known.blast.
  • This can be done by printing the lines that are not comments but follow a comment line starting with #.
  • In a perl one-liner, you can create a state variable which remembers whether the previous line was a comment, and based on that decide whether to print the current line.
  • Instead of using perl, you can play with grep. Option -A 1 prints the matching lines as well as one line after each match.
  • Print only the first two columns separated by tab (name of query, name of target), sort the file by the second column.
  • Submit file best.tsv with the result
  • File should start as follows:
Q6CBS2  sp|B5BP46|YP52_SCHPO
Q6C8R4  sp|B5BP48|YP54_SCHPO
Q6CG80  sp|B5BP48|YP54_SCHPO
Q6CH56  sp|B5BP48|YP54_SCHPO

Step 2:

  • Submit file known.tsv which contains sequence names extracted from known.fa with leading > removed
  • This file should be sorted alphabetically.
  • File should start as follows:
sp|A0A023PXA5|YA19A_YEAST Putative uncharacterized protein YAL019W-A OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=YAL019W-A PE=5 SV=1
sp|A0A023PXB0|YA019_YEAST Putative uncharacterized protein YAR019W-A OS=Saccharomyces cerevisiae (strain ATCC 204508 / S288c) GN=YAR019W-A PE=5 SV=1
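
One way to extract the header lines is sketched below on an invented two-sequence file. The names known_demo.fa and known_demo.tsv and the sequences are hypothetical, and LC_ALL=C is an assumption that keeps the ordering consistent with a later join.

```shell
# Demo fasta file: header lines start with ">", other lines are sequence.
printf '>sp|P2|B_DEMO second protein OS=Demo species GN=b\nMKV\n>sp|P1|A_DEMO first protein OS=Demo species GN=a\nMAA\nLLK\n' > known_demo.fa

# Keep only header lines, strip the leading ">", and sort alphabetically.
grep '^>' known_demo.fa | sed 's/^>//' | LC_ALL=C sort > known_demo.tsv
cat known_demo.tsv
```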

Step 3:

  • Use command join to join the files best.tsv and known.tsv so that each line of best.tsv is extended with the text describing the corresponding target in known.tsv
  • Use option -1 2 to use the second column of best.tsv as a key for joining
  • The output of join may look as follows:
sp|B5BP46|YP52_SCHPO Q6CBS2 Putative glutathione S-transferase C1183.02 OS=Schizosaccharomyces pombe (strain 972 / ATCC 24843) GN=SPBC460.02c PE=3 SV=1
sp|B5BP48|YP54_SCHPO Q6C8R4 Putative alpha-ketoglutarate-dependent sulfonate dioxygenase OS=Schizosaccharomyces pombe (strain 972 / ATCC 24843) GN=SPBC460.04c PE=3 SV=1
  • Further reformat the output so that query name goes first (e.g. Q6CBS2), followed by target name (e.g. sp|B5BP46|YP52_SCHPO), followed by the rest of the text, but remove all text after OS=
  • Sort by query name
  • Submit file best.txt with the result
  • The output should start as follows:
B5FVA8  tr|Q5A7D5|Q5A7D5_CANAL  Lysophospholipase
B5FVB0  sp|O74810|UBC1_SCHPO    Ubiquitin-conjugating enzyme E2 1
B5FVB1  sp|O13877|RPAB5_SCHPO   DNA-directed RNA polymerases I, II, and III subunit RPABC5


  • Not all Y.lip. proteins are necessarily included in your final output (some proteins do not have a blast match).
    • You can think about how to find the list of such proteins, but this is not part of the assignment.
  • Files best.txt and best.tsv should have the same number of lines.