This lecture introduces command-line tools and Perl one-liners.
* We will do simple transformations of text files using command-line tools without writing any scripts or longer programs.

When working on the exercises, record all the commands used.
* We strongly recommend making a log of commands for data processing also outside of this course.
* If you have a log of executed commands, you can easily execute them again by copy and paste.
* For this reason any comments are best preceded in the log by <tt>#</tt>.
* If you use some sequence of commands often, you can turn it into a script.
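For illustration, a short fragment of such a log might look as follows (the file names and commands are made up):
<syntaxhighlight lang="bash">
# 2024-03-01 count the sequences in the downloaded file
grep -c '>' sequences.fasta
# keep only columns 1 and 3 for further processing
cut -f 1,3 annotation.tsv > annotation-small.tsv
</syntaxhighlight>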
  
 
==Efficient use of the Bash command line==

Some tips for bash shell:
* Use ''tab'' key to complete command names, path names etc.
** Tab completion [https://www.gnu.org/software/bash/manual/html_node/Programmable-Completion.html can be customized]; a small example is shown below this list.
* Use ''up'' and ''down'' keys to walk through the history of recently executed commands, then edit and execute the chosen command.
* Press ''ctrl-r'' to search in the history of executed commands.
* At the end of a session, the history is stored in <tt>~/.bash_history</tt>.
* Command <tt>history -a</tt> appends history to this file right now.
** You can then look into the file and copy appropriate commands to your log.
** For example, print the last commands using <tt>tail ~/.bash_history</tt>.
* Various other history tricks exist, e.g. special variables [http://samrowe.com/wordpress/advancing-in-the-bash-shell/].
* <tt>cd -</tt> goes to the previously visited directory (also see <tt>pushd</tt> and <tt>popd</tt>).
* <tt>ls -lt | head</tt> shows the 10 most recent files, useful for seeing what you have done last in a folder.
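As a small illustration of customized completion, the bash builtin <tt>complete</tt> can register a fixed list of completions for a command (the command <tt>mytool</tt> here is hypothetical):
<syntaxhighlight lang="bash">
# after this, typing "mytool st<tab>" offers start, stop, status
complete -W 'start stop status' mytool
</syntaxhighlight>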
  
Instead of bash, you can use more advanced command-line environments, e.g. [http://ipython.org/notebook.html IPython notebook].
 
==Redirecting and pipes==
<syntaxhighlight lang="bash">
# redirect standard output to file
command > file

# append to file
command >> file

# redirect standard error
command 2>file

# redirect file to standard input
command < file

# do not forget to quote > in other uses,
# e.g. when searching for string ">" in a file sequences.fasta
grep '>' sequences.fasta
# (without quotes rewrites sequences.fasta)
# other special characters, such as ;, &, |, # etc.
# should be quoted in '' as well

# send stdout of command1 to stdin of command2
# this is called a pipeline
command1 | command2

# backtick operator executes command,
# removes trailing \n from stdout,
# substitutes to command line
# the following commands do the same thing:
head -n 2 file
head -n `echo 2` file

# redirect a string in ' ' to stdin of command head
head -n 2 <<< 'line 1
line 2
line 3'

# in some commands, file argument can be taken from stdin
# if denoted as - or stdin or /dev/stdin
# the following compares uncompressed version of file1 with file2
zcat file1.gz | diff - file2
</syntaxhighlight>

Make piped commands fail properly:
<syntaxhighlight lang="bash">
set -o pipefail
</syntaxhighlight>
If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default; the pipeline then returns the exit status of the rightmost command.
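A minimal demonstration of the difference (safe to run in any bash session):
<syntaxhighlight lang="bash">
false | true
echo $?   # prints 0: the exit status of the rightmost command (true)
set -o pipefail
false | true
echo $?   # prints 1: the exit status of the failing command (false)
</syntaxhighlight>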
  
 
==Text file manipulation==

===Commands echo and cat (creating and printing files)===
<syntaxhighlight lang="bash">
# print text Hello and end of line to stdout
echo "Hello"
# interpret backslash combinations \n, \t etc.:
echo -e "first line\nsecond\tline"
# concatenate several files to stdout
cat file1 file2
</syntaxhighlight>

===Commands head and tail (looking at start and end of files)===
<syntaxhighlight lang="bash">
# print 10 first lines of file (or stdin)
head file
some_command | head
# print the first 2 lines
head -n 2 file
# print the last 5 lines
tail -n 5 file
# print starting from line 100 (line numbering starts at 1)
tail -n +100 file
# print lines 81..100
head -n 100 file | tail -n 20
</syntaxhighlight>
Documentation: [http://www.gnu.org/software/coreutils/manual/html_node/head-invocation.html head], [http://www.gnu.org/software/coreutils/manual/html_node/tail-invocation.html tail]

===Commands wc, ls -lh, od (exploring file statistics and details)===
<syntaxhighlight lang="bash">
# prints three numbers:
# the number of lines (-l), number of words (-w), number of bytes (-c)
wc file

# prints the size of file in human-readable units (K,M,G,T)
ls -lh file

# od -a prints file or stdin with named characters
#   allows checking whitespace and special characters
echo "hello world!" | od -a
# prints:
# 0000000   h   e   l   l   o  sp   w   o   r   l   d   !  nl
# 0000015
</syntaxhighlight>
Documentation: [http://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html wc], [http://www.gnu.org/software/coreutils/manual/html_node/ls-invocation.html ls], [http://www.gnu.org/software/coreutils/manual/html_node/od-invocation.html od]
 
===Command grep (getting lines matching a regular expression)===
<syntaxhighlight lang="bash">
# get all lines in file human.fa containing string wall
grep wall human.fa
# -i ignores case (upper case and lowercase letters are the same)
grep -i wall human.fa
# -c counts the number of matching lines in each file in the current folder
grep -c '^[IJ]' *

# other options (there are more, see the manual):
# -v print/count non-matching lines (inVert)
# -n show also line numbers
# -B 2 -A 1 print 2 lines before each match and 1 line after match
# -E extended regular expressions (allows e.g. |)
# -F no regular expressions, set of fixed strings
# -f patterns in a file
#    (good for selecting e.g. only lines matching one of "good" ids)
</syntaxhighlight>
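For instance, options <tt>-F</tt> and <tt>-f</tt> can be combined to select lines containing any string from a list (the file names here are made up):
<syntaxhighlight lang="bash">
# print lines of annotation.tsv containing any of the ids listed in good_ids.txt
grep -F -f good_ids.txt annotation.tsv
</syntaxhighlight>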
Documentation: [http://www.gnu.org/software/grep/manual/grep.html grep]
 
===Commands sort, uniq===
<syntaxhighlight lang="bash">
# sort lines of a file alphabetically
sort names.txt

# some useful options of sort:
# -g numeric sort
# -k which column(s) to use as key
# -r reverse (from largest values)
# -t fields separator

# sorting output of the first command
# first compare column 3 numerically (-k3,3g),
# in case of ties use column 1 (-k1,1)
# the long result is piped to less to look at manually
grep -v '#' matches.tsv | sort -k3,3g -k1,1 | less

# uniq outputs one line from each group of consecutive identical lines
# uniq -c adds the size of each group as the first column
# the following finds all unique lines
# and sorts them by frequency from the most frequent
sort protocol.txt | uniq -c | sort -gr
</syntaxhighlight>
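As a further illustration of the <tt>-t</tt> and <tt>-k</tt> options (the file <tt>prices.csv</tt> is hypothetical):
<syntaxhighlight lang="bash">
# sort a comma-separated file numerically by its second column
sort -t , -k2,2g prices.csv
</syntaxhighlight>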
 
Documentation: [http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html sort], [http://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html uniq]
 
===Commands diff, comm (comparing files)===
Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/diff-invocation.html diff]</tt> compares two files. It is good for manual checking of differences. Useful options:
* <tt>-b</tt> (ignore whitespace differences)
* <tt>-r</tt> for recursively comparing whole directories
* <tt>-q</tt> for fast checking for identity
* <tt>-y</tt> show differences side-by-side

Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/comm-invocation.html comm]</tt> compares two sorted files. It is good for finding set intersections and differences. It writes three columns:
* lines occurring only in the first file
* lines occurring only in the second file
* lines occurring in both files
Some columns can be suppressed with options <tt>-1, -2, -3</tt>
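For example, to find ids present in both of two unsorted lists (hypothetical files):
<syntaxhighlight lang="bash">
sort ids1.txt > ids1-sorted.txt
sort ids2.txt > ids2-sorted.txt
# suppress columns 1 and 2, printing only lines occurring in both files
comm -1 -2 ids1-sorted.txt ids2-sorted.txt
</syntaxhighlight>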
 
  
 
===Commands cut, paste, join (working with columns)===
* Command <tt>cut</tt> selects only some columns from file (perl/awk more flexible)
* Command <tt>paste</tt> puts two or more files side by side, separated by tabs or other characters
* Command <tt>[http://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html join]</tt> is a powerful tool for making joins and left-joins as in databases on specified columns in two files
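A small illustration of these three commands (the file names are made up):
<syntaxhighlight lang="bash">
# select columns 1 and 3 from a tab-separated file
cut -f 1,3 table.tsv
# put two files side by side, separated by tabs
paste left.txt right.txt
# join two files on their first columns
# (both files must be sorted by the join key)
join a-sorted.tsv b-sorted.tsv
</syntaxhighlight>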
===Commands split, csplit (splitting files to parts)===
* Command <tt>split</tt> splits into fixed-size pieces (size in lines, bytes etc.)
* Command <tt>csplit</tt> splits at occurrence of a pattern. For example, splitting a FASTA file into individual sequences:
<syntaxhighlight lang="bash">
csplit sequences.fa '/^>/' '{*}'
</syntaxhighlight>
  
 
==Programs sed and awk==
Both <tt>sed</tt> and <tt>awk</tt> process text files line by line, allowing to do various transformations.
* Program <tt>awk</tt> is newer, more advanced.
* We will not cover these in detail; several examples are shown below.
* More info on [https://en.wikipedia.org/wiki/AWK awk], [https://en.wikipedia.org/wiki/Sed sed] on Wikipedia
<syntaxhighlight lang="bash">
# replace text "Chr1" by "Chromosome 1"
sed 's/Chr1/Chromosome 1/'
# prints the first two lines, then quits (like head -n 2)
sed 2q

# print the first and second column from a file
awk '{print $1, $2}'

# print the line if the difference between the first and second column > 10
awk '{ if ($2-$1>10) print }'

# print lines matching pattern
awk '/pattern/ { print }'

# count the lines (like wc -l)
awk 'END { print NR }'
</syntaxhighlight>
  
 
==Perl one-liners==
Instead of sed and awk, we will cover Perl one-liners.
* More examples can be found on various websites, e.g. [https://learnbyexample.github.io/learn_perl_oneliners/one-liner-introduction.html].
* Documentation for [http://perldoc.perl.org/perlrun.html Perl switches]
 
<syntaxhighlight lang="bash">
# -e executes commands
perl -e 'print 2+3,"\n"'
perl -e '$x = 2+3; print $x, "\n"'

# -n wraps commands in a loop reading lines from stdin
# or files listed as arguments
# the following is roughly the same as cat:
perl -ne 'print'
# how to use:
perl -ne 'print' < input > output
perl -ne 'print' input1 input2 > output
# lines are stored in a special variable $_
# this variable is default argument of many functions,
# including print, so print is the same as print $_

# simple grep-like commands:
perl -ne 'print if /pattern/'
# simple regular expression modifications
perl -ne 's/Chr(\d+)/Chromosome $1/; print'
# // and s/// are applied by default to $_
 
# -l removes end of line from each input line and adds "\n" after each print
# the following adds * at the end of each line
perl -lne 'print $_, "*"'

# -a splits line into words separated by whitespace and stores them in array @F
# the next example prints the difference of the numbers stored in columns 7 and 6;
# this is the size of an interval whose start and end are stored in these columns
grep -v '#' matches.tsv | perl -lane 'print $F[7]-$F[6]' | less

# -F allows to set separator used for splitting (regular expression)
# the next example splits at tabs
perl -F '"\t"' -lane 'print $F[7]-$F[6]'

# END { commands } is run at the very end, after we finish reading input
# the following example computes the average of interval lengths
grep -v '#' matches.tsv | perl -lane '$count++; $sum += $F[7]-$F[6]; END { print $sum/$count; }'
# similarly BEGIN { commands } is run before we start
</syntaxhighlight>
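For instance, a <tt>BEGIN</tt> block can print a header line before any input is processed (a made-up illustration):
<syntaxhighlight lang="bash">
# print a header, then prefix every input line with its line number
perl -ne 'BEGIN { print "line\ttext\n" } print "$.\t$_"' names.txt
</syntaxhighlight>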
  
Other interesting possibilities:
<syntaxhighlight lang="bash">
# -i replaces each file with a new transformed version (DANGEROUS!)
# the next example removes empty lines from all .txt files
# in the current directory
perl -lne 'print if length($_)>0' -i *.txt
# the following example replaces each sequence of whitespace by exactly one space
# and removes leading and trailing spaces from lines in all .txt files
perl -lane 'print join(" ", @F)' -i *.txt

# variable $. contains the line number, $ARGV the name of file or - for stdin
# the following prints filename and line number in front of every line
perl -ne 'printf "%s.%d: %s", $ARGV, $., $_' names.txt protocol.txt | less

# moving files *.txt to have extension .tsv:
#   first print the commands
#   then execute by hand or replace print with system
#   mv -i asks before overwriting
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; print("mv -i $_ $s")'
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; system("mv -i $_ $s")'
</syntaxhighlight>
 
 
 
==HWbash==
 
<!-- NOTEX -->
 
[[#Lperl|Lecture on Perl]], [[#Lbash|Lecture on command-line tools]]
 
<!-- /NOTEX -->
 
 
 
* In this set of tasks, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
 
* Each task can be split into several stages and intermediate files written to disk, but you can also use pipelines to reduce the number of temporary files.
 
* Your commands should also work for other input files with the same format (do not try to generalize them too much, but also do not rely on very specific properties of a particular input, such as the number of lines etc.).
 
<!-- NOTEX -->
 
* Include in your protocol all relevant commands you used, and add a short description of your approach.
 
* Submit the protocol and required output files.
 
* An outline of the protocol is in <tt>/tasks/bash/protocol.txt</tt>; submit it to directory <tt>/submit/bash/yourname</tt>.
 
<!-- /NOTEX -->
 
 
 
===Task A (passwords)===
 
* The file <tt>/tasks/bash/names.txt</tt> contains data about several people, one per line.
 
* Each line consists of given name(s), surname and email separated by spaces.
 
* Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form <tt>username@uniba.sk</tt>.
 
* The task is to generate file <tt>passwords.csv</tt>, which contains a randomly generated password for each of these users.

** The output file has columns separated by commas ','.

** The first column contains the username extracted from the email address, the second column the surname, the third column all given names and the fourth column the randomly generated password.
 
<!-- NOTEX -->
 
* '''Submit''' file <tt>passwords.csv</tt> with the result of your commands.
 
<!-- /NOTEX -->
 
 
 
Example line from input:
 
<pre>
 
Pavol Orszagh Hviezdoslav hviezdoslav32@uniba.sk
 
</pre>
 
 
 
Example line from output (password will differ):
 
<pre>
 
hviezdoslav32,Hviezdoslav,Pavol Orszagh,3T3Pu3un
 
</pre>
 
 
 
Hints:
 
* Passwords can be generated using <tt>pwgen</tt> (e.g. <tt>pwgen -N 10 -1</tt> prints 10 passwords, one per line)
 
* We also recommend using <tt>perl</tt>, <tt>wc</tt>, <tt>paste</tt> (check option <tt>-d</tt> in <tt>paste</tt>)
 
* In Perl, function <tt>[http://perldoc.perl.org/functions/pop.html pop]</tt> may be useful for manipulating <tt>@F</tt> and function <tt>[http://perldoc.perl.org/functions/join.html join]</tt> for connecting strings with a separator.
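For illustration, <tt>paste -d</tt> can append passwords produced by <tt>pwgen</tt> as a new comma-separated column (the file <tt>columns.csv</tt> is a hypothetical intermediate file, not a complete solution):
<syntaxhighlight lang="bash">
pwgen -N 3 -1 > pw.txt         # three random passwords, one per line
paste -d , columns.csv pw.txt  # append them as the last column
</syntaxhighlight>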
 
 
 
===Task B (yeast genome)===
 
 
 
'''The input file:'''
 
* <tt>/tasks/bash/saccharomyces_cerevisiae.gff</tt> contains annotation of the yeast genome
 
** Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff].
 
** It was further processed to omit DNA sequences from the end of the file.
 
** The size of the file is 5.6M.
 
* For easier work, link the file to your directory by <tt>ln -s /tasks/bash/saccharomyces_cerevisiae.gff yeast.gff</tt>
 
* The file is in [http://www.sequenceontology.org/gff3.shtml GFF3 format]
 
* The lines starting with <tt>#</tt> are comments; other lines contain tab-separated data about one interval of some chromosome in the yeast genome.
 
* Meaning of the first 5 columns:
 
** column 0 chromosome name
 
** column 1 source (can be ignored)
 
** column 2 type of interval
 
** column 3 start of interval (1-based coordinates)
 
** column 4 end of interval (1-based coordinates)
 
* You can assume that these first 5 columns do not contain whitespace
 
 
 
'''Task:'''
 
* Print for each type of interval (column 2), how many times it occurs in the file.
 
* Sort from the most common to the least common interval types.
 
* Hint: commands <tt>sort</tt> and <tt>uniq</tt> will be useful. Do not forget to skip comments, for example using <tt>grep -v '^#'</tt>
 
* The result should be a file  <tt>types.txt</tt> formatted as follows:
 
<pre>
 
  7058 CDS
 
  6600 mRNA
 
...
 
...
 
      1 telomerase_RNA_gene
 
      1 mating_type_region
 
      1 intein_encoding_region
 
 
 
</pre>
 
<!-- NOTEX -->
 
'''Submit''' the file <tt>types.txt</tt>
 
<!-- /NOTEX -->
 
 
 
===Task C (chromosomes)===
 
* Continue processing the file from task B.
 
* For each chromosome, the file contains a line which has the string <tt>chromosome</tt> in column 2, and the interval is the whole chromosome.

* To file <tt>chromosomes.txt</tt>, print a tab-separated list of chromosome names and sizes, in the same order as in the input.
 
* The last line of <tt>chromosomes.txt</tt> should list the total size of all chromosomes combined.
 
<!-- NOTEX -->
 
* '''Submit''' file <tt>chromosomes.txt</tt>
 
<!-- /NOTEX -->
 
* Hints:
 
** The total size can be computed by a perl one-liner.
 
** Example based on the lecture: compute the sum of interval sizes if each line of the file contains the start and end of one interval: <tt>perl -lane '$sum += $F[1]-$F[0]; END { print $sum; }'</tt>
 
** Grepping for the word chromosome does not check if this word is indeed in column 2.
 
** Tab character is written in Perl as <tt>"\t"</tt>.
 
* Your output should start and end as follows:
 
<pre>
 
chrI    230218
 
chrII  813184
 
...
 
...
 
chrXVI  948066
 
chrmt  85779
 
total  12157105
 
</pre>
 
 
 
===Task D (blast)===
 
'''Overall goal:'''
 
* Proteins from several well-studied yeast species were downloaded from the database http://www.uniprot.org/ on 2016-03-09. The file contains the sequence of each protein as well as a short description of its biological function.

* We have also downloaded proteins from the yeast ''Yarrowia lipolytica''. We will pretend that nothing is known about the function of these proteins (as if they were produced by a gene finding program in a newly sequenced genome).

* For each ''Y.lipolytica'' protein, we have found similar proteins from other yeasts.

* Now we want to extract for each protein in ''Y.lipolytica'' its closest match among all known proteins and see what its function is. This will give a clue about the potential function of the ''Y.lipolytica'' protein.
 
 
 
'''Files:'''
 
* <tt>/tasks/bash/known.fa</tt> is a FASTA file containing sequences of known proteins from several species
 
* <tt>/tasks/bash/yarLip.fa</tt> is a FASTA file with proteins from ''Y.lipolytica''
 
* <tt>/tasks/bash/known.blast</tt> is the result of searching for proteins from <tt>yarLip.fa</tt> similar to proteins in <tt>known.fa</tt>, produced by these commands (already run by us):
 
<syntaxhighlight lang="bash">
 
formatdb -i known.fa
 
blastall -p blastp -d known.fa -i yarLip.fa -m 9 -e 1e-5 > known.blast
 
 
</syntaxhighlight>
* You can link these files to your directory as follows:
 
<syntaxhighlight lang="bash">
 
ln -s /tasks/bash/known.fa .
 
ln -s /tasks/bash/yarLip.fa .
 
ln -s /tasks/bash/known.blast .
 
</syntaxhighlight>
 
 
'''Step 1:'''
 
* Get the first (strongest) match for each query from <tt>known.blast</tt>.
 
* This can be done by printing the lines that are not comments but follow a comment line starting with #.
 
* In a Perl one-liner, you can create a state variable which will remember if the previous line was a comment and, based on that, decide if you print the current line.

* Instead of using Perl, you can play with <tt>grep</tt>. Option <tt>-A 1</tt> prints the matching lines as well as one line after each match.
 
* Print only the first two columns separated by tab (name of query, name of target), sort the file by the second column.
 
* Store the result in file <tt>best.tsv</tt>. The file should start as follows:
 
<pre>
 
Q6CBS2  sp|B5BP46|YP52_SCHPO
 
Q6C8R4  sp|B5BP48|YP54_SCHPO
 
Q6CG80  sp|B5BP48|YP54_SCHPO
 
Q6CH56  sp|B5BP48|YP54_SCHPO
 
</pre>
 
<!-- NOTEX -->
 
* '''Submit''' file <tt>best.tsv</tt> with the result
 
<!-- /NOTEX -->
 
 
'''Step 2:'''
 
* Create file <tt>known.tsv</tt>, which contains the sequence names extracted from <tt>known.fa</tt> with the leading <tt>></tt> removed.
 
* This file should be sorted alphabetically.
 
* The file should start as follows (lines are trimmed below):
 
<pre>
 
sp|A0A023PXA5|YA19A_YEAST Putative uncharacterized protein YAL019W-A OS=Saccharomyces...
 
sp|A0A023PXB0|YA019_YEAST Putative uncharacterized protein YAR019W-A OS=Saccharomyces...
 
</pre>
 
* '''Submit''' file <tt>known.tsv</tt>
 
 
'''Step 3:'''
 
* Use command [http://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html join] to join the files <tt>best.tsv</tt> and <tt>known.tsv</tt> so that each line of <tt>best.tsv</tt> is extended with the text describing the corresponding target in <tt>known.tsv</tt>
 
* Use option <tt>-1 2</tt> to use the second column of <tt>best.tsv</tt> as a key for joining
 
* The output of <tt>join</tt> may look as follows:
 
<pre>
 
sp|B5BP46|YP52_SCHPO Q6CBS2 Putative glutathione S-transferase C1183.02 OS=Schizosaccharomyces...
 
sp|B5BP48|YP54_SCHPO Q6C8R4 Putative alpha-ketoglutarate-dependent sulfonate dioxygenase OS=...
 
</pre>
 
* Further reformat the output so that the query name goes first (e.g. <tt>Q6CBS2</tt>), followed by the target name (e.g. <tt>sp|B5BP46|YP52_SCHPO</tt>), followed by the rest of the text, but remove <tt>OS=</tt> and everything after it.
 
* Sort by query name, store as <tt>best.txt</tt>
 
* The output should start as follows:
 
<pre>
 
B5FVA8  tr|Q5A7D5|Q5A7D5_CANAL  Lysophospholipase
 
B5FVB0  sp|O74810|UBC1_SCHPO    Ubiquitin-conjugating enzyme E2 1
 
B5FVB1  sp|O13877|RPAB5_SCHPO  DNA-directed RNA polymerases I, II, and III subunit RPABC5
 
</pre>
 
<!-- NOTEX -->
 
* '''Submit''' file  <tt>best.txt</tt>
 
<!-- /NOTEX -->
 
 
'''Note:'''
 
* Not all ''Y.lipolytica'' proteins are necessarily included in your final output (some proteins do not have a blast match).
 
** You can think about how to find the list of such proteins, but this is not part of the task.
 
* Files <tt>best.txt</tt> and <tt>best.tsv</tt> should have the same number of lines.
 
