Lbash

HWbash

This lecture introduces command-line tools and Perl one-liners.

  • We will do simple transformations of text files using command-line tools without writing any scripts or longer programs.

When working on the exercises, record all the commands used.

  • We strongly recommend making a log of commands for data processing also outside of this course.
  • If you have a log of executed commands, you can easily execute them again by copy and paste.
  • For this reason any comments are best preceded in the log by #.
  • If you use some sequence of commands often, you can turn it into a script.
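For example, a fragment of such a log might look like this (a sketch; headers.txt is a made-up name for the example):

# count the sequences in the downloaded file
grep -c '>' sequences.fasta
# keep only the sequence names for later checks
grep '>' sequences.fasta > headers.txt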

Efficient use of the Bash command line

Some tips for bash shell:

  • Use tab key to complete command names, path names etc.
  • Use up and down keys to walk through the history of recently executed commands, then edit and execute the chosen command.
  • Press ctrl-r to search in the history of executed commands.
  • At the end of a session, the history is stored in ~/.bash_history.
  • Command history -a appends history to this file right now.
    • You can then look into the file and copy appropriate commands to your log.
    • For example, print the last commands using tail ~/.bash_history.
  • Various other history tricks exist, e.g. special variables (http://samrowe.com/wordpress/advancing-in-the-bash-shell/).
  • cd - goes to previously visited directory (also see pushd and popd).
  • ls -lt | head shows 10 most recent files, useful for seeing what you have done last in a folder.

Instead of bash, you can use more advanced command-line environments, e.g. the IPython notebook.

Redirecting and pipes

# redirect standard output to file
command > file

# append to file
command >> file

# redirect standard error
command 2>file

# redirect file to standard input
command < file
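
# redirect both standard output and standard error to the same file
command > file 2>&1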

# do not forget to quote > in other uses, 
# e.g. when searching for string ">" in a file sequences.fasta
grep '>' sequences.fasta
# (without quotes, the command would overwrite sequences.fasta)
# other special characters, such as ;, &, |, # etc
# should be quoted in '' as well

# send stdout of command1 to stdin of command2 
# this is called a pipeline
command1 | command2

# backtick operator executes a command, 
# removes the trailing \n from its stdout, 
# and substitutes the result into the command line
# the following commands do the same thing:
head -n 2 file
head -n `echo 2` file
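
# the equivalent $( ) syntax substitutes in the same way and nests more easily
head -n $(echo 2) file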

# redirect a string in ' ' to stdin of command head
head -n 2 <<< 'line 1
line 2
line 3'

# in some commands, file argument can be taken from stdin
# if denoted as - or stdin or /dev/stdin
# the following compares uncompressed version of file1 with file2
zcat file1.gz | diff - file2

Make piped commands fail properly:

set -o pipefail

If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default; the pipeline then returns the exit status of the rightmost command.
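
A quick sketch of the difference (false always fails, true always succeeds):

false | true; echo $?   # without pipefail prints 0 (status of true)
set -o pipefail
false | true; echo $?   # with pipefail prints 1 (status of false)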

Text file manipulation

Commands echo and cat (creating and printing files)

# print text Hello and end of line to stdout
echo "Hello" 
# interpret backslash combinations \n, \t etc:
echo -e "first line\nsecond\tline"
# concatenate several files to stdout
cat file1 file2

Commands head and tail (looking at start and end of files)

# print the first 10 lines of a file (or stdin)
head file
some_command | head 
# print the first 2 lines
head -n 2 file
# print the last 5 lines
tail -n 5 file
# print starting from line 100 (line numbering starts at 1)
tail -n +100 file
# print lines 81..100
head -n 100 file | tail -n 20

Documentation: head, tail

Commands wc, ls -lh, od (exploring file statistics and details)

# prints three numbers:
# the number of lines (-l), number of words (-w), number of bytes (-c)
wc file

# prints the size of file in human-readable units (K,M,G,T)
ls -lh file

# od -a prints file or stdin with named characters 
#   allows checking whitespace and special characters
echo "hello world!" | od -a
# prints:
# 0000000   h   e   l   l   o  sp   w   o   r   l   d   !  nl
# 0000015

Documentation: wc, ls, od

Command grep (getting lines matching a regular expression)

# get all lines in file human.fa containing string wall
grep wall human.fa
# -i ignores case (upper case and lowercase letters are the same)
grep -i wall human.fa
# -c counts the number of matching lines in each file in the current folder
grep -c '^[IJ]' *

# other options (there are more, see the manual):
# -v print/count non-matching lines (inVert)
# -n show also line numbers
# -B 2 -A 1 print 2 lines before each match and 1 line after match
# -E extended regular expressions (allows e.g. |)
# -F no regular expressions, set of fixed strings
# -f patterns in a file 
#    (good for selecting e.g. only lines matching one of "good" ids)
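
# for example, combining -F and -f (a sketch; good_ids.txt is a
# hypothetical file with one id per line):
grep -F -f good_ids.txt data.tsv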

Documentation: grep

Commands sort, uniq

# sort lines of a file alphabetically
sort names.txt

# some useful options of sort:
# -g numeric sort
# -k which column(s) to use as key
# -r reverse (from largest values)
# -t fields separator

# sorting output of the first command
# first compare column 3 numerically (-k3,3g),
# in case of ties use column 1 (-k1,1)
# the long result is piped to less to look at manually
grep -v '#' matches.tsv | sort -k3,3g -k1,1 | less

# uniq outputs one line from each group of consecutive identical lines
# uniq -c adds the size of each group as the first column
# the following counts occurrences of each distinct line
# and sorts them by frequency from the most frequent
sort protocol.txt | uniq -c | sort -gr

Documentation: sort, uniq

Commands diff, comm (comparing files)

Command diff compares two files. It is good for manual checking of differences. Useful options:

  • -b (ignore whitespace differences)
  • -r for recursively comparing whole directories
  • -q for fast checking for identity
  • -y show differences side-by-side
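
For example, a quick recursive check whether two directories differ (directory names are hypothetical):

diff -r -q dir1 dir2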

Command comm compares two sorted files. It is good for finding set intersections and differences. It writes three columns:

  • lines occurring only in the first file
  • lines occurring only in the second file
  • lines occurring in both files

Some columns can be suppressed with options -1, -2, -3
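
For example, assuming two sorted files a.txt and b.txt (hypothetical names):

# lines occurring only in a.txt (columns 2 and 3 suppressed)
comm -23 a.txt b.txt
# lines occurring in both files (columns 1 and 2 suppressed)
comm -12 a.txt b.txt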

Commands cut, paste, join (working with columns)

  • Command cut selects only some columns from a file (perl/awk are more flexible)
  • Command paste puts two or more files side by side, separated by tabs or other characters
  • Command join is a powerful tool for making joins and left joins, as in databases, on specified columns of two files
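
A few sketches (the file names and column choices are hypothetical; both inputs of join must be sorted on the join column):

# select columns 1 and 3 of a tab-separated file
cut -f 1,3 data.tsv
# put lines of two files side by side, separated by tabs
paste file1 file2
# join lines of two files that share a value in their first column
join a.txt b.txt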

Commands split, csplit (splitting files to parts)

  • Command split splits a file into fixed-size pieces (size in lines, bytes, etc.)
  • Command csplit splits at occurrence of a pattern. For example, splitting a FASTA file into individual sequences:
csplit sequences.fa '/^>/' '{*}'
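
Similarly, splitting a file into pieces of 1000 lines each (the pieces are named xaa, xab, ...):
split -l 1000 file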

Programs sed and awk

Both sed and awk process text files line by line, allowing you to perform various transformations.

  • Program awk is newer and more advanced.
  • We will not cover these in detail; several examples are shown below.
  • More info on awk, sed on Wikipedia
# replace text "Chr1" by "Chromosome 1"
sed 's/Chr1/Chromosome 1/'
# prints the first two lines, then quits (like head -n 2)
sed 2q  

# print the first and second column from a file
awk '{print $1, $2}' 

# print the line if the difference between the first and second column > 10
awk '{ if ($2-$1>10) print }'  

# print lines matching pattern
awk '/pattern/ { print }' 

# count the lines (like wc -l)
awk 'END { print NR }'

Perl one-liners

Instead of sed and awk, we will cover Perl one-liners.

  • More examples can be found on various websites, e.g. https://learnbyexample.github.io/learn_perl_oneliners/one-liner-introduction.html
  • Documentation for Perl switches
# -e executes commands
perl -e 'print 2+3,"\n"'
perl -e '$x = 2+3; print $x, "\n"';

# -n wraps commands in a loop reading lines from stdin
# or files listed as arguments
# the following is roughly the same as cat:
perl -ne 'print'
# how to use:
perl -ne 'print' < input > output
perl -ne 'print' input1 input2 > output
# lines are stored in a special variable $_
# this variable is default argument of many functions, 
# including print, so print is the same as print $_

# simple grep-like commands:
perl -ne 'print if /pattern/'
# simple regular expression modifications
perl -ne 's/Chr(\d+)/Chromosome $1/; print'
# // and s/// are applied by default to $_

# -l removes end of line from each input line and adds "\n" after each print
# the following adds * at the end of each line
perl -lne 'print $_, "*"' 

# -a splits line into words separated by whitespace and stores them in array @F
# the next example prints the difference of the numbers stored in
# fields $F[6] and $F[7] (@F is indexed from 0);
# this is the size of an interval whose start and end are stored in these fields
grep -v '#' matches.tsv | perl -lane 'print $F[7]-$F[6]' | less

# -F allows to set separator used for splitting (regular expression)
# the next example splits at tabs
perl -F '"\t"' -lane 'print $F[7]-$F[6]'

# END { commands } is run at the very end, after we finish reading input
# the following example computes the average of interval lengths
grep -v '#' matches.tsv | perl -lane '$count++; $sum += $F[7]-$F[6]; END { print $sum/$count; }'
# similarly BEGIN { command } before we start
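
# for example, print a header line before processing any input
# (a sketch reusing the matches.tsv example above):
grep -v '#' matches.tsv | perl -lane 'BEGIN { print "interval lengths:" } print $F[7]-$F[6]'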

Other interesting possibilities:

# -i replaces each file with a new transformed version (DANGEROUS!)
# the next example removes empty lines from all .txt files
# in the current directory
perl -lne 'print if length($_)>0' -i *.txt
# the following example replaces each run of whitespace by exactly one space 
# and removes leading and trailing spaces from lines in all .txt files
perl -lane 'print join(" ", @F)' -i *.txt

# variable $. contains the line number, $ARGV the name of the file or - for stdin
# the following prints filename and line number in front of every line
perl -ne 'printf "%s.%d: %s", $ARGV, $., $_' names.txt protocol.txt | less

# moving files *.txt to have extension .tsv:
#   first print commands 
#   then execute by hand or replace print with system
#   mv -i asks if something is to be rewritten
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; print("mv -i $_ $s")'
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; system("mv -i $_ $s")'