1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "HWbash"

Revision as of 21:49, 19 February 2023

Lecture on Perl, Lecture on command-line tools

  • In this set of tasks, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
  • Each task can be split into several stages and intermediate files written to disk, but you can also use pipelines to reduce the number of temporary files.
  • Your commands should also work for other input files with the same format. Do not try to generalize them too much, but also do not rely on very specific properties of a particular input, such as the number of lines.

Preparatory steps and submitting

# create a folder for this homework
mkdir bash
# move to the new folder
cd bash
# link input files to the current folder
ln -s /tasks/bash/known.fa /tasks/bash/yarLip.fa /tasks/bash/matches.tsv /tasks/bash/names.txt .
# copy protocol to the current folder
cp -i /tasks/bash/protocol.txt .
  • Now you can open protocol.txt in your favorite editor and start working
  • Command ln created symbolic links to the input files, so you can use them under names such as known.fa rather than full paths such as /tasks/bash/known.fa.
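The effect of ln -s can be tried safely in a scratch directory. This is only a sketch: the directory layout and the file name input.txt are made up for illustration and do not exist on the course server.

```shell
# Scratch directories standing in for /tasks/bash and your homework folder
tmp=$(mktemp -d)
mkdir "$tmp/tasks" "$tmp/work"
echo "example" > "$tmp/tasks/input.txt"

# Create a symbolic link in the work folder pointing to the original file
ln -s "$tmp/tasks/input.txt" "$tmp/work/"

# The link can be read under its short name; readlink shows where it points
content=$(cat "$tmp/work/input.txt")
target=$(readlink "$tmp/work/input.txt")
echo "$content"
echo "$target"
```

The link takes almost no disk space, and reading it always gives the current content of the original file.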

When you are done, you can submit all required files as follows (substitute your username):

cp -ipv protocol.txt known.tsv pairs.txt frequency.txt best.txt function.txt passwords.csv /submit/bash/your_username

# check what was submitted
ls -l /submit/bash/your_username


Introduction to tasks A-C

  • In these tasks we will again process bioinformatics data. We have two files of sequences in the FASTA format. This time the sequences represent proteins, not DNA, and therefore they use 20 different letters representing different amino acids. Lines starting with '>' contain the identifier of a protein and potentially an additional description. Each such line is followed by the sequence of the protein, which will not be needed in these tasks. The data comes from the UniProt database.
  • File /tasks/bash/yarLip.fa is a FASTA file with proteins from the yeast Yarrowia lipolytica. Each protein is identified in the FASTA file only by its identifier such as Q6CFX1.
  • File /tasks/bash/known.fa is a FASTA file with proteins from several yeast species. Each identifier is followed by a description of the biological function of the protein.
  • These two sets of proteins were compared by a bioinformatics tool called BLAST, which finds proteins with similar sequences. The results of BLAST are in file /tasks/bash/matches.tsv. This file contains a section for each protein from yarLip.fa. Each section starts with several comments, i.e. lines starting with the # symbol. The comments are followed by a table with the found matches in the TSV format, i.e. several values delimited by tab characters \t. We will be interested in the first two columns, representing the IDs of proteins from yarLip.fa and from known.fa, respectively.
  • We will call the proteins from yarLip.fa query proteins and proteins from known.fa target proteins.

Task A (counting proteins)

Steps (1) and (2)

  • Use files known.fa and yarLip.fa to find out how many proteins are in each. Each protein starts with a line starting with the > symbol, so it is sufficient to count those.
  • Beware that the > symbol means redirect in bash. Therefore you have to enclose it in single quotation marks '>' so that it is taken literally.
  • For each file write a single command or a pipeline of several commands that will produce the number with the answer. Write the commands and the resulting protein counts to the appropriate sections of your protocol.
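The counting idea can be sketched on a toy FASTA file as follows. The file toy.fa and its content are invented for this illustration; the pattern is anchored with ^ so that only lines starting with > are counted.

```shell
# Toy FASTA file (hypothetical data, not from the course server)
tmp=$(mktemp -d)
cat > "$tmp/toy.fa" <<'EOF'
>P1 first protein
MKV
LLA
>P2 second protein
MSS
>P3
MAA
EOF

# Count header lines; the single quotes keep bash from treating > as a redirect
count=$(grep -c '^>' "$tmp/toy.fa")
echo "$count"
```

Without the quotes, bash would redirect the output of grep into a file and no pattern would be searched for.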

Step (3)

  • Create file known.tsv which contains sequence IDs and descriptions extracted from known.fa.
  • The leading > should be removed. The description should be cut at OS=, i.e. OS= and any text after it should also be removed.
  • This file should be sorted alphabetically.
  • The file should end as follows:
tr|Q5AQ78|Q5AQ78_CANAL Uncharacterized protein 
tr|Q5AQ79|Q5AQ79_CANAL Potential SET3 histone deacetylase complex component 
tr|Q5AQ80|Q5AQ80_CANAL Potential SET3 histone deacetylase complex component 
tr|Q5AQ81|Q5AQ81_CANAL Uncharacterized protein 
tr|Q9P3E3|Q9P3E3_SCHPO NAD-dependent malic enzyme (Predicted), partial (Fragment) 
  • Submit file known.tsv, write your command(s) to the protocol.
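One possible way to do this kind of header clean-up is shown below on made-up input. The file names toy.fa and toy.tsv and the protein entries are assumptions for illustration, and LC_ALL=C is used here only to make the sort order reproducible; the assignment itself does not prescribe a locale.

```shell
# Toy FASTA file with descriptions (hypothetical entries)
tmp=$(mktemp -d)
cat > "$tmp/toy.fa" <<'EOF'
>tr|B2|B2_TOY Kinase OS=Toy species GN=abc
MKV
>sp|A1|A1_TOY Uncharacterized protein OS=Toy species
MSS
EOF

# Keep header lines, drop the leading >, cut the description at OS=, then sort
grep '^>' "$tmp/toy.fa" \
  | sed -e 's/^>//' -e 's/OS=.*$//' \
  | LC_ALL=C sort > "$tmp/toy.tsv"
cat "$tmp/toy.tsv"
```

Note that each output line keeps the space that preceded OS=, similarly to the sample lines of known.tsv shown above.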

Task B (counting matches)

Step (1)

  • From file matches.tsv extract pairs of similar proteins and store them in file pairs.txt.
  • Each line of the file should contain a pair of protein IDs extracted from the first two columns of the matches.tsv file.
  • These IDs should be separated by a single space and the file should be sorted alphabetically.
  • Do not forget to omit lines with comments.
  • Each pair from the input should be listed only once in the output.
  • Commands grep, sort and uniq would be helpful. To select only some columns, you can use cut, awk or a perl one-liner.
  • The file pairs.txt should have 66622 lines (command wc) and it should start as follows:
B5FVA8 sp|O13857|PLB2_SCHPO
B5FVA8 sp|P39105|PLB1_YEAST
B5FVA8 sp|P53541|SPO1_YEAST
  • Submit file pairs.txt and write your commands to the protocol.
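The following sketch shows how the suggested tools can be chained on a tiny table in the same spirit as matches.tsv. The input rows and the file names toy_matches.tsv and toy_pairs.txt are invented, and the real file has more columns and more comment lines per section.

```shell
# Toy BLAST-like table: comment lines plus tab-separated rows (hypothetical data)
tmp=$(mktemp -d)
printf '# BLAST comment\n# Fields: query, target, score\n' > "$tmp/toy_matches.tsv"
printf 'Q1\tT2\t90.0\nQ1\tT1\t80.0\nQ1\tT2\t70.0\n' >> "$tmp/toy_matches.tsv"
printf '# BLAST comment\nQ2\tT1\t60.0\n' >> "$tmp/toy_matches.tsv"

# Drop comments, keep the first two columns, turn the tab into a space,
# then sort and remove repeated pairs
grep -v '^#' "$tmp/toy_matches.tsv" \
  | cut -f 1,2 \
  | tr '\t' ' ' \
  | LC_ALL=C sort \
  | uniq > "$tmp/toy_pairs.txt"
cat "$tmp/toy_pairs.txt"
```

The pair Q1 T2 occurs twice in the toy input but only once in the output, because uniq removes adjacent duplicates after sorting.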

Step (2)

  • Find out how many proteins from yarLip.fa have at least one similarity found in matches.tsv. This can be done by counting distinct values in the first column of your pairs.txt file from step (1).
  • Write your answer and commands to the protocol. Compare this number with the total number of proteins from yarLip.fa found in Task A(2).
  • We suggest commands cut/awk/perl, sort, uniq, wc
  • The result of your commands should be an output consisting of a single number (and the end-of-line character).
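Counting distinct values in a column can be sketched like this; the four toy pairs and the file name toy_pairs.txt are made up for illustration.

```shell
# Toy file of pairs separated by a single space (hypothetical data)
tmp=$(mktemp -d)
printf 'Q1 T1\nQ1 T2\nQ2 T1\nQ3 T4\n' > "$tmp/toy_pairs.txt"

# Take the first column, collapse repeated IDs, count the remaining lines
distinct=$(cut -d ' ' -f 1 "$tmp/toy_pairs.txt" | sort | uniq | wc -l)
echo "$distinct"
```

Sorting before uniq matters, because uniq only merges identical lines that are adjacent.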

Step (3)

  • For each protein in the first column of the pairs.txt file, count how many times it occurs in the file. The result should be a file named frequency.txt in which each line contains a protein ID and its count separated by a space. Lines should be sorted from the proteins with the highest count to those with the lowest.
  • To check your answer, look at lines 79 and 80 of the file as follows: head -n 80 frequency.txt | tail -n 2
  • You should get the following two lines:
Q6C607 118
Q6CHE8 117
  • This means that protein Q6C607 from yarLip.fa occurs 118 times in the first column of pairs.txt, which means 118 proteins from known.fa are similar to it. Protein Q6CHE8 has 117 such similar proteins.
  • Submit file frequency.txt, write your commands to the protocol. Also write to the protocol what is the highest and lowest count in the second column of your file.
  • Note: The highest count is actually caused by a setting in the BLAST algorithm that produced the file matches.tsv. Proteins with zero matches are not listed in your file, but their number could be deduced from your results in step (2) and Task A(2) if needed.
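A frequency table of this kind can be sketched with uniq -c, as shown below on invented pairs. The file names are assumptions, and the final awk call is just one of several ways to swap the count and the ID.

```shell
# Toy file of pairs (hypothetical data): Q3 occurs 3 times, Q1 twice, Q2 once
tmp=$(mktemp -d)
printf 'Q1 T1\nQ1 T2\nQ2 T1\nQ3 T1\nQ3 T2\nQ3 T4\n' > "$tmp/toy_pairs.txt"

# uniq -c prefixes each distinct ID with its count; sort numerically in
# descending order by that count, then print the ID first and the count second
cut -d ' ' -f 1 "$tmp/toy_pairs.txt" \
  | sort \
  | uniq -c \
  | sort -k1,1nr \
  | awk '{ print $2, $1 }' > "$tmp/toy_frequency.txt"
cat "$tmp/toy_frequency.txt"
```

The first and last lines of the result give the highest and lowest counts directly.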

Task C (joining information)

Step (1)

  • For each protein from yarLip.fa the first (top) match in matches.tsv represents the strongest similarity.
  • In this step, we want to extract the strongest match for each protein from yarLip.fa that has at least one match.
  • The result should be a file best.txt listing the two IDs separated by a space. The file should be sorted by the second column.
  • The file should start as follows:
Q6CBS2 sp|B5BP46|YP52_SCHPO
Q6C8R4 sp|B5BP48|YP54_SCHPO
Q6CG80 sp|B5BP48|YP54_SCHPO
Q6CH56 sp|B5BP48|YP54_SCHPO
  • This task can be done by printing the lines that are not comments but follow a comment line starting with #.
  • In a Perl one-liner, you can create a state variable which will remember if the previous line was a comment and based on that you decide if you print the current line.
  • Instead of using Perl, you can play with grep. Option -A 1 prints the matching lines as well as one line after each match.
  • Submit file best.txt with the result and write your command to the protocol.
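The state-variable idea from the hints is sketched below in awk rather than Perl; the logic is the same in both. The toy input, the variable name prev_comment and the file names are made up. The toy file also includes a section consisting only of comments, standing in for a query protein without matches.

```shell
# Toy BLAST-like file (hypothetical data); the middle section has no matches
tmp=$(mktemp -d)
printf '# section Q1\n# fields\nQ1\tT9\t90.0\nQ1\tT1\t80.0\n' > "$tmp/toy_matches.tsv"
printf '# section Q5\n# no hits\n' >> "$tmp/toy_matches.tsv"
printf '# section Q2\n# fields\nQ2\tT3\t60.0\nQ2\tT8\t50.0\n' >> "$tmp/toy_matches.tsv"

# prev_comment remembers whether the previous line was a comment;
# a non-comment line is printed only if it directly follows a comment
awk -F '\t' '
  /^#/         { prev_comment = 1; next }
  prev_comment { print $1, $2 }
               { prev_comment = 0 }
' "$tmp/toy_matches.tsv" | LC_ALL=C sort -k 2,2 > "$tmp/toy_best.txt"
cat "$tmp/toy_best.txt"
```

Only the top row of each section survives, and the section without matches contributes nothing to the output.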

Step (2)

  • Now we want to extend file best.txt with a description of each protein from known.fa.
  • Since similar proteins often have similar functions, this will allow somebody studying proteins from yarLip.fa to learn something about their possible functions based on well-studied proteins from other species.
  • We join together file best.txt and file known.tsv created in Task A(3). Conveniently, they are both sorted by the ID of the protein from known.fa.
  • Use command join to join these files.
  • Use option -1 2 to use the second column of best.txt as a key for joining
  • The output of join may look as follows:
sp|B5BP46|YP52_SCHPO Q6CBS2 Putative glutathione S-transferase C1183.02
sp|B5BP48|YP54_SCHPO Q6C8R4 Putative alpha-ketoglutarate-dependent sulfonate dioxygenase
  • Further reformat the output so that the query name goes first (e.g. Q6CBS2), followed by target name (e.g. sp|B5BP46|YP52_SCHPO), followed by the rest of the text.
  • Sort by query name, store as function.txt
  • The output should start as follows:
B5FVA8  tr|Q5A7D5|Q5A7D5_CANAL  Lysophospholipase
B5FVB0  sp|O74810|UBC1_SCHPO    Ubiquitin-conjugating enzyme E2 1
B5FVB1  sp|O13877|RPAB5_SCHPO   DNA-directed RNA polymerases I, II, and III subunit RPABC5
  • Submit file function.txt and write your commands to the protocol.
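The join and the subsequent reordering can be sketched on toy files as follows. All IDs, descriptions and file names here are invented. Both inputs must be sorted on the join key in the same collation order, which is why the sketch fixes LC_ALL=C; that choice is an assumption, not something the assignment prescribes.

```shell
tmp=$(mktemp -d)
# Toy best matches, sorted by the second column (hypothetical data)
printf 'Q2 T3\nQ1 T9\n' > "$tmp/toy_best.txt"
# Toy descriptions, sorted by the first column (hypothetical data)
printf 'T3 Putative kinase\nT5 Unused entry\nT9 Heat shock protein\n' > "$tmp/toy_known.tsv"

# Join on column 2 of the first file and column 1 of the second;
# the key comes first in the output, so swap the first two fields afterwards
LC_ALL=C join -1 2 -2 1 "$tmp/toy_best.txt" "$tmp/toy_known.tsv" \
  | awk '{ t = $1; $1 = $2; $2 = t; print }' \
  | LC_ALL=C sort > "$tmp/toy_function.txt"
cat "$tmp/toy_function.txt"
```

The entry T5 has no partner in the toy best matches, so join silently leaves it out. Because awk rebuilds each line with single spaces, any trailing space in a description is lost.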

Note:

  • Not all Y. lipolytica proteins are necessarily included in your final output (some proteins do not have a BLAST match).
  • Files best.txt and function.txt should have the same number of lines.

Task D (passwords)

  • The file /tasks/bash/names.txt contains data about several people, one per line.
  • Each line consists of given name(s), surname and email separated by spaces.
  • Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form username@uniba.sk.
  • The task is to generate file passwords.csv which contains a randomly generated password for each of these users
    • The output file has columns separated by commas ','
    • The first column contains username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password
  • Submit file passwords.csv with the result of your commands.

Example line from input:

Pavol Orszagh Hviezdoslav hviezdoslav32@uniba.sk

Example line from output (password will differ):

hviezdoslav32,Hviezdoslav,Pavol Orszagh,3T3Pu3un

Hints:

  • Passwords can be generated using pwgen (e.g. pwgen -N 10 -1 prints 10 passwords, one per line)
  • We also recommend using perl, wc, paste (check option -d in paste)
  • In Perl, function pop may be useful for manipulating @F and function join for connecting strings with a separator.