1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


HWbash

Latest revision as of 12:02, 29 February 2024

Lecture on Perl, Lecture on command-line tools

  • In this set of tasks, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
  • Your commands should also work for other input files with the same format (do not try to generalize them too much, but also do not rely on very specific properties of a particular input, such as the number of lines).

Preparatory steps and submitting

# create a folder for this homework
mkdir bash
# move to the new folder
cd bash
# link input files to the current folder
ln -s /tasks/bash/human.fa /tasks/bash/dog.fa /tasks/bash/matches.tsv /tasks/bash/names.txt .
# copy protocol to the current folder
cp -i /tasks/bash/protocol.txt .
  • Now you can open protocol.txt in your favorite editor and start working
  • Command ln created symbolic links (shortcuts) to the input files, so you can use them under names such as human.fa rather than full paths such as /tasks/bash/human.fa.

When you are done, you can submit all required files as follows (substitute your username):

cp -ipv protocol.txt human.txt pairs.txt frequency.txt best.txt function.txt passwords.csv /submit/bash/your_username

# check what was submitted
ls -l /submit/bash/your_username

Introduction to tasks A-C

  • In these tasks we will again process bioinformatics data. We have two files of sequences in the FASTA format. This time the sequences represent proteins, not DNA, and therefore they use 20 different letters representing different amino acids. Lines starting with '>' contain the identifier of a protein and potentially an additional description. This is followed by the sequence of this protein, which will not be needed in this task. This data comes from the Uniprot database.
  • File /tasks/bash/dog.fa is a FASTA file containing about 10% of randomly selected dog proteins. Each protein is identified in the FASTA file only by its ID such as A0A8I3MJS8_CANLF.
  • File /tasks/bash/human.fa is a FASTA file with all human proteins. Each ID is followed by a description of the biological function of the protein.
  • These two sets of proteins were compared by the bioinformatics tool called BLAST, which finds proteins with similar sequences. The results of BLAST are in file /tasks/bash/matches.tsv. This file contains a section for each dog protein. This section starts with several comments, i.e. lines starting with # symbol. This is followed by a table with the found matches in the TSV format, i.e., several values delimited by tab characters \t. We will be interested in the first two columns representing the IDs of the dog and human proteins, respectively.

Task A (counting proteins)

Steps (1) and (2)

  • Use files human.fa and dog.fa to find out how many proteins are in each. Each protein starts with a line starting with the > symbol, so it is sufficient to count those.
  • Beware that the > symbol means output redirection in bash. Therefore you have to enclose it in single quotation marks '>' so that it is taken literally.
  • For each file write a single command or a pipeline of several commands that will produce the number with the answer. Write the commands and the resulting protein counts to the appropriate sections of your protocol.
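A minimal sketch of the counting idea on a made-up FASTA file (the file toy.fa, its protein IDs and the variable names are hypothetical and only illustrate the quoting of the > symbol):

```shell
# Toy FASTA file standing in for human.fa or dog.fa (contents are made up)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fa" <<'EOF'
>P1_TOY first toy protein
MKVLAA
GGH
>P2_TOY second toy protein
MTTQ
>P3_TOY
MA
EOF

# Count lines starting with '>'; the single quotes stop bash from
# interpreting > as output redirection
protein_count=$(grep -c '^>' "$tmpdir/toy.fa")
echo "$protein_count"
```

Without the quotes, bash would create (or truncate) a file instead of passing the symbol to grep, so it is worth testing such commands on a copy or a toy file first.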

Step (3)

  • Create file human.txt which contains sequence IDs and descriptions extracted from human.fa. This file will be used in Task C.
  • Leading > should be removed. Any text after OS= in the description should be also removed.
  • This file should be sorted alphabetically.
  • The file should start as follows:
1433B_HUMAN 14-3-3 protein beta/alpha 
1433E_HUMAN 14-3-3 protein epsilon 
1433F_HUMAN 14-3-3 protein eta 
1433G_HUMAN 14-3-3 protein gamma 
1433S_HUMAN 14-3-3 protein sigma 
  • Submit file human.txt, write your commands to the protocol.
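One possible shape of such a pipeline, sketched on a made-up FASTA file (the IDs and descriptions are hypothetical). It assumes that the OS= marker itself and the space before it should be dropped as well; check this against the expected first lines of human.txt:

```shell
# Toy FASTA file with UniProt-like headers (contents are made up)
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fa" <<'EOF'
>ZZ9_TOY Hypothetical kinase OS=Toy species OX=1
MKV
>AB1_TOY Toy transporter OS=Toy species OX=1
MTT
EOF

# Keep only header lines, remove the leading '>', remove everything
# from ' OS=' to the end of the line, then sort alphabetically
grep '^>' "$tmpdir/toy.fa" \
  | sed -e 's/^>//' -e 's/ *OS=.*$//' \
  | LC_ALL=C sort > "$tmpdir/toy.txt"
cat "$tmpdir/toy.txt"
```

Setting LC_ALL=C is a choice made here for a reproducible byte-wise order; the sort order has to be the same one that join later expects.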

Task B (counting matches)

Step (1)

  • From file matches.tsv extract pairs of similar proteins and store them in file pairs.txt.
  • Each line of the file should contain a pair of protein IDs extracted from the first two columns of the matches.tsv file.
  • These IDs should be separated by a single space and the file should be sorted alphabetically.
  • Do not forget to omit lines with comments.
  • Each pair from the input should be listed only once in the output.
  • Commands grep, sort and uniq would be helpful. To select only some columns, you can use commands cut, awk or a Perl one-liner.
  • The file pairs.txt should have 18939 lines (verify using command wc) and it should start as follows:
A0A1Y6D565_CANLF P_HUMAN
A0A1Y6D565_CANLF S13A4_HUMAN
A0A222YTD8_CANLF S39A4_HUMAN
  • Submit file pairs.txt and write your commands to the protocol.
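The steps above can be sketched on a small made-up table (the file toy.tsv and all IDs are hypothetical; the real matches.tsv has more columns and longer comment blocks):

```shell
# Toy BLAST-like table: comment lines start with '#', data lines are tab-separated
tmpdir=$(mktemp -d)
printf '# query D2_TOY\nD2_TOY\tH5_TOY\t90.1\nD2_TOY\tH5_TOY\t55.0\n# query D1_TOY\nD1_TOY\tH7_TOY\t80.0\nD1_TOY\tH3_TOY\t70.2\n' > "$tmpdir/toy.tsv"

# Drop comments, keep the first two columns, turn the tab into a space,
# then sort and remove duplicate pairs in one step (sort -u)
grep -v '^#' "$tmpdir/toy.tsv" \
  | cut -f 1,2 \
  | tr '\t' ' ' \
  | LC_ALL=C sort -u > "$tmpdir/toy_pairs.txt"
cat "$tmpdir/toy_pairs.txt"
```

Here sort -u plays the role of sort followed by uniq; the duplicate pair D2_TOY H5_TOY appears only once in the output.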

Step (2)

  • Find out how many proteins from dog.fa have at least one similarity found in matches.tsv. This can be done by counting distinct values in the first column of your pairs.txt file from step (1).
  • We suggest commands cut/awk/perl, sort, uniq, wc
  • The result of your commands should be an output consisting of a single number (and the end-of-line character).
  • Write your answer and commands to the protocol. What percentage is this number out of all dog proteins found in Task A(2)?
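A small sketch of counting distinct first-column values and turning the count into a percentage. The toy pairs file and the total of 8 proteins are made up; in the homework the total would come from Task A(2):

```shell
# Toy pairs file in the same format as pairs.txt (contents are made up)
tmpdir=$(mktemp -d)
printf 'D1_TOY H3_TOY\nD1_TOY H7_TOY\nD2_TOY H5_TOY\n' > "$tmpdir/toy_pairs.txt"

# Number of distinct IDs in the first column
matched=$(cut -d ' ' -f 1 "$tmpdir/toy_pairs.txt" | sort -u | wc -l)

# Hypothetical total number of proteins in the toy data set
total=8

# awk does the floating-point arithmetic that bash itself lacks
percent=$(awk -v m="$matched" -v t="$total" 'BEGIN { printf "%.1f", 100 * m / t }')
echo "$matched of $total toy proteins have a match ($percent%)"
```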

Step (3)

  • For each dog protein in the first column of the pairs.txt file, count how many times it occurs in the file. The result should be a file named frequency.txt in which each line contains a dog protein ID and its count, separated by a space. It should be sorted by the second column (count) from highest to lowest and in case of ties by the first column alphabetically.
  • To check your answer, look at lines 999 and 1000 of the file as follows: head -n 1000 frequency.txt | tail -n 2
  • You should get the following two lines:
A0A8I3P697_CANLF 6
A0A8I3P812_CANLF 6
  • This means that dog proteins A0A8I3P697_CANLF and A0A8I3P812_CANLF both occur 6 times in the first column of pairs.txt, which means 6 human proteins are similar to each.
  • Submit file frequency.txt, write your commands to the protocol. Also write to the protocol what is the highest and lowest count in the second column of your file.
  • Note: The dog proteins with zero matches are not listed in your file. Their number could be deduced from your results in step (2) and Task A(2) if needed.
  • Note2: The highest number of matches per dog protein is actually restricted by a parameter in the search algorithm to produce at most that many answers. Without this setting the number would be much higher.
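The counting and two-key sorting described in this step can be sketched as follows on made-up data (file names and IDs are hypothetical):

```shell
# Toy pairs file in the same format as pairs.txt (contents are made up)
tmpdir=$(mktemp -d)
printf 'D1_TOY H3_TOY\nD1_TOY H7_TOY\nD2_TOY H5_TOY\nD3_TOY H1_TOY\nD3_TOY H2_TOY\n' > "$tmpdir/toy_pairs.txt"

# uniq -c needs sorted input and prints "count ID", so awk swaps the columns;
# the final sort uses the count (numeric, reversed) as the primary key
# and the ID (alphabetical) as the tie-breaker
cut -d ' ' -f 1 "$tmpdir/toy_pairs.txt" \
  | LC_ALL=C sort \
  | uniq -c \
  | awk '{ print $2, $1 }' \
  | LC_ALL=C sort -k2,2nr -k1,1 > "$tmpdir/toy_frequency.txt"
cat "$tmpdir/toy_frequency.txt"
```

Restricting each key to a single field (-k2,2 and -k1,1) keeps the tie-breaker from being influenced by the rest of the line.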

Task C (joining information)

Step (1)

  • For each dog protein, the first (top) match in matches.tsv represents the strongest similarity.
  • In this step, we want to extract this strongest match for each dog protein that has at least one match.
  • The result should be a file best.txt listing the two IDs separated by a space. The file should be sorted by the second column (human protein ID).
  • The file should start as follows:
A0A8I3MVQ9_CANLF 1433E_HUMAN
A0A8I3MT06_CANLF 1433G_HUMAN
A0A8I3QRX3_CANLF 1433T_HUMAN
  • This task can be done by printing the lines that are not comments but follow a comment line starting with #.
  • In a Perl one-liner, you can create a state variable which will remember if the previous line was a comment and based on that you decide if you print the current line.
  • Instead of using Perl, you can play with grep. Option -A 1 prints the matching lines as well as one line after each match.
  • Submit file best.txt with the result and write your commands to the protocol.
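The state-variable idea from the hints can be sketched in awk instead of Perl (a substitution made here; the toy table, its IDs and the variable name after_comment are hypothetical). The middle section has no hits, so it contributes no line:

```shell
# Toy BLAST-like table with three sections; the middle one has no hits
tmpdir=$(mktemp -d)
printf '# query D2_TOY\n# 2 hits found\nD2_TOY\tH5_TOY\t90.1\nD2_TOY\tH9_TOY\t55.0\n# query D4_TOY\n# 0 hits found\n# query D1_TOY\n# 1 hits found\nD1_TOY\tH3_TOY\t70.2\n' > "$tmpdir/toy.tsv"

# after_comment remembers whether the previous line was a comment;
# only the first data line after a comment block is printed
awk -F '\t' '
  /^#/          { after_comment = 1; next }
  after_comment { print $1, $2; after_comment = 0 }
' "$tmpdir/toy.tsv" | LC_ALL=C sort -k2,2 > "$tmpdir/toy_best.txt"
cat "$tmpdir/toy_best.txt"
```

The grep -A 1 route mentioned above reaches the same lines but also prints the comments and, in GNU grep, group separators, which then have to be filtered out.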

Step (2)

  • Now we want to extend file best.txt with a description of each human protein.
  • Since similar proteins often have similar functions, this will allow somebody studying dog proteins to learn something about their possible functions based on similarity to well-studied human proteins.
  • To achieve this, we join together file best.txt from step 1 and human.txt created in Task A(3). Conveniently, they are both sorted by the ID of the human protein.
  • Use command join to join these files.
  • Use option -1 2 to use the second column of best.txt as a key for joining.
  • The output of join may start as follows:
1433E_HUMAN A0A8I3MVQ9_CANLF 14-3-3 protein epsilon 
1433G_HUMAN A0A8I3MT06_CANLF 14-3-3 protein gamma 
1433T_HUMAN A0A8I3QRX3_CANLF 14-3-3 protein theta 
  • Further reformat the output so that the dog ID goes first (e.g. A0A8I3MVQ9_CANLF), followed by human protein ID (e.g. 1433E_HUMAN), followed by the rest of the text.
  • Sort by dog protein ID, store in file function.txt.
  • The output should start as follows:
A0A1Y6D565_CANLF P_HUMAN P protein
A0A222YTD8_CANLF S39AA_HUMAN Zinc transporter ZIP10
A0A5F4CW23_CANLF ELA_HUMAN Apelin receptor early endogenous ligand
  • Files best.txt and function.txt should have the same number of lines.
  • Which human protein is the best match for the dog protein A0A8I3RVG4_CANLF and what is its function?
  • Submit file function.txt. Write your commands and the answer to the question above to your protocol.
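A sketch of the join and reformatting on two made-up files (all IDs and descriptions are hypothetical). Both inputs must be sorted on the join key in the same order, which is why the same LC_ALL=C setting is assumed throughout:

```shell
# Toy inputs: best matches sorted by column 2, descriptions sorted by column 1
tmpdir=$(mktemp -d)
printf 'D1_TOY H3_TOY\nD2_TOY H5_TOY\n' > "$tmpdir/toy_best.txt"
printf 'H3_TOY Toy kinase 3\nH4_TOY Toy unused protein\nH5_TOY Toy zinc transporter\n' > "$tmpdir/toy_human.txt"

# join on column 2 of the first file and column 1 of the second;
# the key is printed first, so awk swaps the first two fields back,
# and the result is sorted by the first column
LC_ALL=C join -1 2 -2 1 "$tmpdir/toy_best.txt" "$tmpdir/toy_human.txt" \
  | awk '{ key = $1; $1 = $2; $2 = key; print }' \
  | LC_ALL=C sort -k1,1 > "$tmpdir/toy_function.txt"
cat "$tmpdir/toy_function.txt"
```

Lines of the second file without a partner (H4_TOY here) are silently dropped by join, so the output has as many lines as the first file when every key is found.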

Task D (passwords)

  • The file /tasks/bash/names.txt contains data about several people, one per line.
  • Each line consists of given name(s), surname and email separated by spaces.
  • Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form username@uniba.sk.
  • The task is to generate file passwords.csv which contains a randomly generated password for each of these users
    • The output file has columns separated by commas ','
    • The first column contains username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password
  • Submit file passwords.csv with the result of your commands. Write your commands to the protocol.

Example line from input:

Pavol Orszagh Hviezdoslav hviezdoslav32@uniba.sk

Example line from output (password will differ):

hviezdoslav32,Hviezdoslav,Pavol Orszagh,3T3Pu3un

Hints:

  • Passwords can be generated using pwgen (e.g. pwgen -N 10 -1 prints 10 passwords, one per line)
  • We also recommend using perl, wc, paste (check option -d in paste)
  • In Perl, function pop may be useful for manipulating @F and function join for connecting strings with a separator.
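A sketch of how paste -d can glue a column of passwords onto reformatted names. It uses awk rather than Perl, the names are made up, and a file with two fixed strings stands in for the output of pwgen so that the example is reproducible:

```shell
# Toy input in the same format as names.txt (people are made up)
tmpdir=$(mktemp -d)
printf 'Anna Maria Toyova toyova1@uniba.sk\nJan Vzor vzor7@uniba.sk\n' > "$tmpdir/toy_names.txt"

# Fixed strings standing in for pwgen output; in real use this file
# would hold one generated password per input line
printf 'aaaa1111\nbbbb2222\n' > "$tmpdir/toy_pw.txt"

# The last field is the email, the one before it the surname,
# everything else is the given names; paste appends the password column
awk '{
  email = $NF; surname = $(NF - 1)
  sub(/@.*/, "", email)
  given = $1
  for (i = 2; i <= NF - 2; i++) given = given " " $i
  print email "," surname "," given
}' "$tmpdir/toy_names.txt" | paste -d , - "$tmpdir/toy_pw.txt" > "$tmpdir/toy_passwords.csv"
cat "$tmpdir/toy_passwords.csv"
```

The dash tells paste to read the first column from standard input, so no temporary file is needed for the reformatted names.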