1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


HWbash

Revision as of 13:24, 19 February 2023

Lecture on Perl, Lecture on command-line tools

  • In this set of tasks, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
  • Each task can be split into several stages and intermediate files written to disk, but you can also use pipelines to reduce the number of temporary files.
  • Your commands should also work for other input files with the same format (do not try to generalize them too much, but also do not rely on very specific properties of a particular input, such as the number of lines).
  • Include all relevant used commands in your protocol and add a short description of your approach.
  • Submit the protocol and required output files.
  • Outline of the protocol is in /tasks/bash/protocol.txt, submit to directory /submit/bash/yourname


Preparatory steps

We will not include such detailed steps in future homeworks; you will have to adapt these as needed.

# create folder for this homework
mkdir bash
# move to the new folder
cd bash
# link input files to current folder
ln -s /tasks/bash/known.fa /tasks/bash/yarLip.fa /tasks/bash/pairs.tsv /tasks/bash/names.txt .
# copy protocol to the current folder
cp -i /tasks/bash/protocol.txt .

  • Now you can open protocol.txt in your favorite editor and start working.
  • Command ln created symbolic links to the input files, so you can use them under names such as known.fa rather than full paths such as /tasks/bash/known.fa.

Introduction to tasks A-C

  • In these tasks we will again process bioinformatics data. We have two files in the FASTA format, which you have seen in HWperl. Unlike before, this time the sequences represent proteins, not DNA, and therefore they use 20 different letters representing different amino acids. Lines starting with > contain the identifier of a protein and potentially an additional description. Such a line is followed by the sequence of the protein, which will not be needed in these tasks. The data comes from the UniProt database (https://www.uniprot.org/).
  • File /tasks/bash/yarLip.fa is a FASTA file with proteins from the yeast Yarrowia lipolytica. We will pretend that nothing is known about the function of these proteins and indeed, each protein is identified in the FASTA file only by its identifier, such as Q6CFX1.
  • File /tasks/bash/known.fa is a FASTA file with proteins from several yeast species. Each identifier is followed by a description of the protein's biological function.
  • These two sets of proteins were compared by a bioinformatics tool called BLAST, which finds proteins or their parts with similar sequences. The results of BLAST are in file /tasks/bash/pairs.tsv. This file contains a section for each protein in yarLip.fa, which starts with several comments, i.e. lines starting with the # symbol. The comments are followed by a table of the matches found, in TSV format, i.e. several values delimited by tab characters (\t). We will be most interested in the first two columns, which contain the ID of a protein from yarLip.fa and from known.fa, respectively.

Task A

Steps (1) and (2)

  • Use files yarLip.fa and known.fa to find out how many proteins are in each. Each protein starts with a line beginning with the > symbol, so it is sufficient to count those lines.
  • Beware that in the shell the > symbol means output redirection, so you have to enclose it in single quotation marks ('>') so that it is taken literally.
  • For each file, write a single command or a pipeline of several commands that prints the answer as a number. Write both the command and the resulting number to the appropriate section of your protocol.
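
The quoting issue above can be tried out on a toy file first. The following is only a sketch: the file toy.fa and its contents are invented for illustration, and grep -c is one of several ways to count matching lines.

```shell
# Toy FASTA file; the names and sequences are invented for illustration.
tmp=$(mktemp -d)
printf '>protA some description\nMKV\nLLA\n>protB\nMST\n' > "$tmp/toy.fa"

# The quoted '^>' reaches grep as a literal pattern: ^ anchors the match
# to the start of the line, > is an ordinary character.
count=$(grep -c '^>' "$tmp/toy.fa")
echo "$count"

# Without the quotes, the shell would treat > as redirection:
#   grep -c ^> toy.fa
# would run "grep -c ^" on standard input and overwrite toy.fa with its output.
```

Testing a command on a small file whose answer you know (here two headers) is a cheap way to check a pipeline before running it on the real data.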

Step 3

  • Create file known.tsv which contains sequence names and descriptions extracted from known.fa.
  • The leading > should be removed. Any text after OS= should also be removed.
  • This file should be sorted alphabetically.
  • The file should start as follows (lines are trimmed below):
sp|A0A023PXA5|YA19A_YEAST Putative uncharacterized protein YAL019W-A
sp|A0A023PXB0|YA019_YEAST Putative uncharacterized protein YAR019W-A
  • Submit file known.tsv, write your command(s) to the protocol.
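
One possible shape of such a pipeline, shown on invented data, is sketched below. The toy headers only mimic the layout of known.fa, and the sketch assumes (based on the sample output above) that the OS= marker itself and the space before it are removed as well.

```shell
tmp=$(mktemp -d)
# Toy FASTA file with invented identifiers and descriptions.
cat > "$tmp/toy.fa" <<'EOF'
>sp|P2|B_TOY Second toy protein OS=Toy species OX=1
MKV
>sp|P1|A_TOY First toy protein OS=Toy species OX=1
MST
EOF

# Keep only header lines, drop the leading '>', cut everything from " OS=" on,
# and sort. LC_ALL=C gives a plain byte-wise order independent of the locale.
grep '^>' "$tmp/toy.fa" | sed -e 's/^>//' -e 's/ OS=.*$//' | LC_ALL=C sort > "$tmp/toy.tsv"
cat "$tmp/toy.tsv"
```

Sorting with a fixed locale is a design choice rather than a requirement of the task; it helps if the file is later processed by tools such as join, which expect a consistent sort order.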

Task B

  • From file pairs.tsv extract pairs of similar proteins
  • For each query, get only the first (strongest) match listed in pairs.tsv.
  • This can be done by printing the lines that are not comments but follow a comment line starting with #.
  • In a Perl one-liner, you can create a state variable which will remember if the previous line was a comment and based on that you decide if you print the current line.
  • Instead of using Perl, you can play with grep. Option -A 1 prints the matching lines as well as one line after each match
  • Print only the first two columns separated by tab (name of query, name of target), sort the file by the second column.
  • Store the result in file best.tsv. The file should start as follows:
Q6CBS2  sp|B5BP46|YP52_SCHPO
Q6C8R4  sp|B5BP48|YP54_SCHPO
Q6CG80  sp|B5BP48|YP54_SCHPO
Q6CH56  sp|B5BP48|YP54_SCHPO
  • Submit file best.tsv with the result

Step 2:

  • Create file known.tsv which contains sequence names extracted from known.fa with leading > removed
  • This file should be sorted alphabetically.
  • The file should start as follows (lines are trimmed below):
sp|A0A023PXA5|YA19A_YEAST Putative uncharacterized protein YAL019W-A OS=Saccharomyces...
sp|A0A023PXB0|YA019_YEAST Putative uncharacterized protein YAR019W-A OS=Saccharomyces...
  • Submit file known.tsv

Step 3:

  • For each Y.lipolytica protein, we have found similar proteins from other yeasts
  • Now we want to extract for each protein in Y.lipolytica its closest match among all known proteins and see what is its function. This will give a clue about the potential function of the Y.lipolytica protein.


  • Use command join to join the files best.tsv and known.tsv so that each line of best.tsv is extended with the text describing the corresponding target in known.tsv
  • Use option -1 2 to use the second column of best.tsv as a key for joining
  • The output of join may look as follows:
sp|B5BP46|YP52_SCHPO Q6CBS2 Putative glutathione S-transferase C1183.02 OS=Schizosaccharomyces...
sp|B5BP48|YP54_SCHPO Q6C8R4 Putative alpha-ketoglutarate-dependent sulfonate dioxygenase OS=...
  • Further reformat the output so that the query name goes first (e.g. Q6CBS2), followed by target name (e.g. sp|B5BP46|YP52_SCHPO), followed by the rest of the text, but remove all text after OS=
  • Sort by query name, store as best.txt
  • The output should start as follows:
B5FVA8  tr|Q5A7D5|Q5A7D5_CANAL  Lysophospholipase
B5FVB0  sp|O74810|UBC1_SCHPO    Ubiquitin-conjugating enzyme E2 1
B5FVB1  sp|O13877|RPAB5_SCHPO   DNA-directed RNA polymerases I, II, and III subunit RPABC5
  • Submit file best.txt

Note:

  • Not all Y.lipolytica proteins are necessarily included in your final output (some proteins do not have a BLAST match).
    • You can think about how to find the list of such proteins, but this is not part of the task.
  • Files best.txt and best.tsv should have the same number of lines.


Task D (passwords)

  • The file /tasks/bash/names.txt contains data about several people, one per line.
  • Each line consists of given name(s), surname and email separated by spaces.
  • Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form username@uniba.sk.
  • The task is to generate file passwords.csv which contains a randomly generated password for each of these users
    • The output file has columns separated by commas ','
    • The first column contains username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password
  • Submit file passwords.csv with the result of your commands.

Example line from input:

Pavol Orszagh Hviezdoslav hviezdoslav32@uniba.sk

Example line from output (password will differ):

hviezdoslav32,Hviezdoslav,Pavol Orszagh,3T3Pu3un

Hints:

  • Passwords can be generated using pwgen (e.g. pwgen -N 10 -1 prints 10 passwords, one per line)
  • We also recommend using perl, wc, paste (check option -d in paste)
  • In Perl, function pop may be useful for manipulating @F and function join for connecting strings with a separator.