1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Lbash

From MAD
Jump to navigation Jump to search

HWbash

This lecture introduces command-line tools and Perl one-liners.

  • We will do simple transformations of text files using command-line tools without writing any scripts or longer programs.

When working on the exercises, record all the commands used

  • We strongly recommend making a log of commands for data processing also outside of this course
  • If you have a log of executed commands, you can easily execute them again by copy and paste
  • For this reason any comments are best preceded in the log by #
  • If you use some sequence of commands often, you can turn it into a script

Efficient use of the Bash command line

Some tips for bash shell:

  • use tab key to complete command names, path names etc
  • use up and down keys to walk through the history of recently executed commands, then edit and execute the chosen command
  • press ctrl-r to search in the history of executed commands
  • at the end of session, history stored in ~/.bash_history
  • command history -a appends history to this file right now
    • you can then look into the file and copy appropriate commands to your log
  • various other history tricks, e.g. special variables [1]
  • cd - goes to previously visited directory (also see pushd and popd)
  • ls -lt | head shows 10 most recent files, useful for seeing what you have done last in a folder

Instead of bash, you can use more advanced command-line environments, e.g. iPhyton notebook

Redirecting and pipes

# redirect standard output to file
command > file

# append to file
command >> file

# redirect standard error
command 2>file

# redirect file to standard input
command < file

# do not forget to quote > in other uses, 
# e.g. when searching for string ">" in a file sequences.fasta
grep '>' sequences.fasta
# (without quotes rewrites sequences.fasta)
# other special characters, such as ;, &, |, # etc
# should be quoted in '' as well

# send stdout of command1 to stdin of command2 
# this is called a pipeline
command1 | command2

# backtick operator executes command, 
# removes trailing \n from stdout, 
# substitutes to command line
# the following commands do the same thing:
head -n 2 file
head -n `echo 2` file

# redirect a string in ' ' to stdin of command head
head -n 2 <<< 'line 1
line 2
line 3'

# in some commands, file argument can be taken from stdin
# if denoted as - or stdin or /dev/stdin
# the following compares uncompressed version of file1 with file2
zcat file1.gz | diff - file2

Make piped commands fail properly:

set -o pipefail

If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default, the pipeline then returns exit status of the rightmost command.

Text file manipulation

Commands echo and cat (creating and printing files)

# print text Hello and end of line to stdout
echo "Hello" 
# interpret backslash combinations \n, \t etc:
echo -e "first line\nsecond\tline"
# concatenate several files to stdout
cat file1 file2

Commands head and tail (looking at start and end of files)

# print 10 first lines of file (or stdin)
head file
some_command | head 
# print the first 2 lines
head -n 2 file
# print the last 5 lines
tail -n 5 file
# print starting from line 100 (line numbering starts at 1)
tail -n +100 file
# print lines 81..100
head -n 100 file | tail -n 20

Documentation: head, tail

Commands wc, ls -lh, od (exploring file statistics and details)

# prints three numbers:
# the number of lines (-l), number of words (-w), number of bytes (-c)
wc file

# prints the size of file in human-readable units (K,M,G,T)
ls -lh file

# od -a prints file or stdout with named characters 
#   allows checking whitespace and special characters
echo "hello world!" | od -a
# prints:
# 0000000   h   e   l   l   o  sp   w   o   r   l   d   !  nl
# 0000015

Documentation: wc, ls, od

Command grep (getting lines matching a regular expression)

# get all lines containing string chromosome
grep chromosome file
# -i ignores case (upper case and lowercase letters are the same)
grep -i chromosome file
# -c counts the number of matching lines in each file
grep -c '^[12][0-9]' file1 file2

# other options (there is more, see the manual):
# -v print/count not matching lines (inVert)
# -n show also line numbers
# -B 2 -A 1 print 2 lines before each match and 1 line after match
# -E extended regular expressions (allows e.g. |)
# -F no regular expressions, set of fixed strings
# -f patterns in a file 
#    (good for selecting e.g. only lines matching one of "good" ids)

Documentation: grep

Commands sort, uniq

# sort lines of a file alphabetically
sort file

# some useful options of sort:
# -g numeric sort
# -k which column(s) to use as key
# -r reverse (from largest values)
# -s stable
# -t fields separator

# sorting first by column 2 numerically (-k2,2g),
# in case of ties use column 1 (-k1,1)
sort -k2,2g -k1,1 file 

# uniq outputs one line from each group of consecutive identical lines
# uniq -c adds the size of each group as the first column
# the following finds all unique lines
# and sorts them by frequency from the most frequent
sort file | uniq -c | sort -gr

Documentation: sort, uniq

Commands diff, comm (comparing files)

Command diff compares two files. It is good for manual checking of differences. Useful options:

  • -b (ignore whitespace differences)
  • -r for comparing whole directories
  • -q for fast checking for identity
  • -y show differences side-by-side

Command comm compares two sorted files. It is good for finding set intersections and differences. It writes three columns:

  • lines occurring only in the first file
  • lines occurring only in the second file
  • lines occurring in both files

Some columns can be suppressed with options -1, -2, -3


Commands cut, paste, join (working with columns)

  • Command cut selects only some columns from file (perl/awk more flexible)
  • Command paste puts two or more files side by side, separated by tabs or other characters
  • Command join is a powerful tool for making joins and left-joins as in databases on specified columns in two files

Commands split, csplit (splitting files to parts)

  • Command split splits into fixed-size pieces (size in lines, bytes etc.)
  • Command csplit splits at occurrence of a pattern. For example, splitting a FASTA file into individual sequences:
csplit sequences.fa '/^>/' '{*}'

Programs sed and awk

Both sed and awk process text files line by line, allowing to do various transformations

  • awk newer, more advanced
  • several examples below
  • More info on awk, sed on Wikipedia
# replace text "Chr1" by "Chromosome 1"
sed 's/Chr1/Chromosome 1/'
# prints the first two lines, then quits (like head -n 2)
sed 2q  

# print the first and second column from a file
awk '{print $1, $2}' 

# print the line if the difference between the first and second column > 10
awk '{ if ($2-$1>10) print }'  

# print lines matching pattern
awk '/pattern/ { print }' 

# count the lines (like wc -l)
awk 'END { print NR }'

Perl one-liners

Instead of sed and awk, we will cover Perl one-liners

# -e executes commands
perl -e'print 2+3,"\n"'
perl -e'$x = 2+3; print $x, "\n"';

# -n wraps commands in a loop reading lines from stdin
# or files listed as arguments
# the following is roughly the same as cat:
perl -ne'print'
# how to use:
perl -ne'print' < input > output
perl -ne'print' input1 input2 > output
# lines are stored in a special variable $_
# this variable is default argument of many functions, 
# including print, so print is the same as print $_

# simple grep-like commands:
perl -ne 'print if /pattern/'
# simple regular expression modifications
perl -ne 's/Chr(\d+)/Chromosome $1/; print'
# // and s/// are applied by default to $_

# -l removes end of line from each input line and adds "\n" after each print
# the following adds * at the end of each line
perl -lne'print $_, "*"' 

# -a splits line into words separated by whitespace and stores them in array @F
# the next example prints difference in the numbers stored
# in the second and first column
# (e.g. interval size if each line coordinates of one interval)
perl -lane'print $F[1]-$F[0]'

# -F allows to set separator used for splitting (regular expression)
# the next example splits at tabs
perl -F '"\t"' -lane'print $F[1]-$F[0]'

# END { commands } is run at the very end, after we finish reading input
# the following example computes the sum of interval lengths
perl -lane'$sum += $F[1]-$F[0]; END { print $sum; }'
# similarly BEGIN { command } before we start

Other interesting possibilities:

# -i replaces each file with a new transformed version (DANGEROUS!)
# the next example removes empty lines from all .txt files
# in the current directory
perl -lne 'print if length($_)>0' -i *.txt
# the following example replaces sequence of whitespace by exactly one space 
# and removes leading and trailing spaces from lines in all .txt files
perl -lane 'print join(" ", @F)' -i *.txt

# variable $. contains the line number. $ARGV the name of file or - for stdin
# the following prints filename and line number in front of every line
perl -ne'printf "%s.%d: %s", $ARGV, $., $_' file1 file2

# moving files *.txt to have extension .tsv:
#   first print commands 
#   then execute by hand or replace print with system
#   mv -i asks if something is to be rewritten
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt/.tsv/; print("mv -i $_ $s")'
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt/.tsv/; system("mv -i $_ $s")'