2-INF-185 Integration of Data Sources (Integrácia dátových zdrojov) 2017/18

Points for already graded assignments can be found on the server in /grades/userid.txt
Project submission deadlines:
1st deadline: Sunday 4.6. 22:00
2nd deadline: Wednesday 20.6. 22:00
Both deadlines are regular; the first is intended for students who want to complete the course earlier. In both cases, short in-person meetings with the instructors (discussion of the project and finalizing the grade) will take place a few days after submission. Exact days and times will be arranged later. Submit projects the same way as homework, to /submit/projekt



Today: using command-line tools and Perl one-liners.

  • We will do simple transformations of text files using command-line tools without writing any scripts or longer programs.
  • You will record the commands used in your protocol
    • We strongly recommend making a log of commands for data processing also outside of this course
  • If you have a log of executed commands, you can easily execute them again by copy and paste
  • For this reason any comments are best preceded by #
  • If you use some sequence of commands often, you can turn it into a script

Most commands have man pages or are described within man bash

Efficient use of command line

Some tips for bash shell:

  • use tab key to complete command names, path names etc
    • tab completion can be customized [1]
  • use up and down keys to walk through history of recently executed commands, then edit and resubmit chosen command
  • press ctrl-r to search in the history of executed commands
  • at the end of a session, the history is stored in ~/.bash_history
  • command history -a appends history to this file right now
    • you can then look into the file and copy appropriate commands to your protocol
  • various other history tricks, e.g. special variables [2]
  • cd - goes to previously visited directory, also see pushd and popd
  • ls -lt | head shows the most recently modified files (the first of the 10 lines is the "total" line), useful for seeing what you have done last

Instead of bash, you can use more advanced command-line environments, e.g. IPython notebook

Redirecting and pipes

# redirect standard output to file
command > file

# append to file
command >> file

# redirect standard error
command 2>file

# redirect file to standard input
command < file

# do not forget to quote > in other uses, e.g. when searching for string ">" in a file sequences.fasta
grep '>' sequences.fasta
# (without quotes the shell treats > as a redirection and overwrites sequences.fasta)
# other special characters, such as ;, &, |, # etc should be quoted in '' as well

# send stdout of command1 to stdin of command2
command1 | command2

# backtick operator executes command, 
# removes trailing \n from stdout, substitutes to command line
# the following commands do the same thing:
head -n 2 file
head -n `echo 2` file

# redirect a string in ' ' to stdin of command head
head -n 2 <<< 'line 1
line 2
line 3'

# in some commands, file argument can be taken from stdin if denoted as - or stdin or /dev/stdin
# the following compares uncompressed version of file1 with file2
zcat file1.gz | diff - file2
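
The redirections above can be tried out safely in a temporary directory. This is only a small sketch: the function report and the file names out.txt and err.txt are made up for the illustration.

```shell
# Hypothetical helper that writes one line to stdout and one to stderr
report() {
    echo "result line"
    echo "warning line" >&2
}

tmpdir=$(mktemp -d)

# stdout goes to out.txt, stderr goes to err.txt
report > "$tmpdir/out.txt" 2> "$tmpdir/err.txt"

# >> appends instead of overwriting, so out.txt now has two lines;
# stderr is thrown away this time
report >> "$tmpdir/out.txt" 2> /dev/null

# $( ) does the same thing as the backtick operator
out_lines=$(wc -l < "$tmpdir/out.txt")
err_text=$(cat "$tmpdir/err.txt")

echo "lines in out.txt: $out_lines"
echo "content of err.txt: $err_text"

rm -r "$tmpdir"
```

The form $(command) is usually preferred over backticks because it can be nested without escaping.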

Make piped commands fail properly:

set -o pipefail

If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default; the pipeline then returns the exit status of the rightmost command.
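
The difference can be seen on a pipeline whose first command fails. The sketch below runs each variant in a separate bash process so that the option does not leak into the current shell; the variable names are chosen for this example only.

```shell
# Without pipefail the failure of the first command is hidden,
# the pipeline returns the status of true
status_default=0
bash -c 'false | true' || status_default=$?

# With pipefail the pipeline reports the failing command
status_pipefail=0
bash -c 'set -o pipefail; false | true' || status_pipefail=$?

echo "default: $status_default, with pipefail: $status_pipefail"
# prints: default: 0, with pipefail: 1
```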

Text file manipulation

Commands echo and cat (creating and printing files)

# print text Hello and end of line to stdout
echo "Hello" 
# interpret backslash combinations \n, \t etc:
echo -e "first line\nsecond\tline"
# concatenate several files to stdout
cat file1 file2

Commands head and tail (looking at start and end of files)

# print 10 first lines of file (or stdin)
head file
some_command | head 
# print the first 2 lines
head -n 2 file
# print the last 5 lines
tail -n 5 file
# print starting from line 100 (line numbering starts at 1)
tail -n +100 file
# print lines 81..100
head -n 100 file | tail -n 20 
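
The combination of head and tail for printing a range of lines can be wrapped in a small function. This is a sketch; the function name lines_between is hypothetical and the input is generated by seq so that the result is easy to check.

```shell
# Hypothetical helper: print lines FROM..TO of a file (numbering starts at 1)
lines_between() {
    from=$1; to=$2; file=$3
    # keep the first TO lines, then print starting from line FROM
    head -n "$to" "$file" | tail -n +"$from"
}

tmpfile=$(mktemp)
seq 1 200 > "$tmpfile"     # line i contains the number i

range=$(lines_between 81 100 "$tmpfile")
first=$(echo "$range" | head -n 1)
last=$(echo "$range" | tail -n 1)
count=$(echo "$range" | wc -l)

echo "first=$first last=$last"
# prints: first=81 last=100

rm "$tmpfile"
```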

Commands wc, ls -lh, od (exploring file stats and details)

# prints three numbers: number of lines (-l), number of words (-w), number of bytes (-c)
wc file

# prints size of file in human-readable units (K,M,G,T)
ls -lh file

# od -a prints file or stdin with named characters 
#   allows checking whitespace and special characters
echo "hello world!" | od -a
# prints:
# 0000000   h   e   l   l   o  sp   w   o   r   l   d   !  nl
# 0000015
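
A typical use of od -a is finding invisible characters such as tabs or Windows line endings. A short sketch on made-up input follows; the exact spacing of the od output may differ between systems, so only the character names are checked.

```shell
# A line with a tab and a Windows-style line ending (\r\n)
dump=$(printf 'a\tb\r\n' | od -a)
echo "$dump"
# the dump shows ht for the tab, cr for the carriage return and nl for the newline

# Count lines in a file that end with a carriage return
tmpfile=$(mktemp)
printf 'unix line\nwindows line\r\n' > "$tmpfile"
cr=$(printf '\r')
cr_lines=$(grep -c "$cr\$" "$tmpfile")
echo "lines ending with CR: $cr_lines"
# prints: lines ending with CR: 1

rm "$tmpfile"
```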

Command grep (getting lines matching a regular expression)

# -i ignores case (upper case and lowercase letters are the same)
grep -i chromosome file
# -c counts the number of matching lines in each file
grep -c '^[12][0-9]' file1 file2

# other options (there is more, see the manual):
# -v print/count not matching lines (inVert)
# -n show also line numbers
# -B 2 -A 1 print 2 lines before each match and 1 line after match
# -E extended regular expressions (allows e.g. |)
# -F no regular expressions, set of fixed strings
# -f patterns in a file 
#    (good for selecting e.g. only lines matching one of "good" ids)
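
Selecting lines by a list of "good" ids might look as follows. The file names table.tsv and good_ids.txt and their contents are invented for this sketch. It also uses the option -w (match whole words only), which is not in the list above; it is added here because otherwise the id gene1 would also select the line for gene10.

```shell
tmpdir=$(mktemp -d)
printf 'gene1\t10\ngene2\t20\ngene10\t30\ngene3\t40\n' > "$tmpdir/table.tsv"
printf 'gene1\ngene3\n' > "$tmpdir/good_ids.txt"

# -f reads patterns from a file, -F treats them as fixed strings,
# -w requires the match to be a whole word
selected=$(grep -w -F -f "$tmpdir/good_ids.txt" "$tmpdir/table.tsv")
echo "$selected"
# prints the lines for gene1 and gene3

# -v inverts the selection, -c counts the lines
rejected_count=$(grep -v -c -w -F -f "$tmpdir/good_ids.txt" "$tmpdir/table.tsv")
echo "rejected lines: $rejected_count"
# prints: rejected lines: 2

rm -r "$tmpdir"
```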

Commands sort, uniq

# some useful options of sort:
# -g general numeric sort (also understands e.g. 1e-5; -n is the plain numeric sort)
# -k which column(s) to use as key
# -r reverse (from largest values)
# -s stable
# -t fields separator

# sorting first by column 2 numerically (-k2,2g), in case of ties use column 1 (-k1,1)
sort -k2,2g -k1,1 file 

# uniq outputs one line from each group of consecutive identical lines
# uniq -c adds the size of each group as the first column
# the following finds all unique lines and sorts them by frequency from the most frequent
sort file | uniq -c | sort -gr
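
Both commands on a small made-up input, as a sketch: the first part sorts a tab-separated table by the number in column 2 and breaks ties by the name in column 1, the second part finds the most frequent line.

```shell
tmpfile=$(mktemp)
tab=$(printf '\t')
printf 'carol\t2\nalice\t10\nbob\t2\n' > "$tmpfile"

# column 2 numerically, ties resolved by column 1
sorted=$(sort -t "$tab" -k2,2g -k1,1 "$tmpfile")
echo "$sorted"
# prints bob, carol (both have 2), then alice (10)

# frequency table, most frequent line first; awk keeps only the line itself
most_frequent=$(printf 'x\ny\nx\nz\nx\ny\n' | sort | uniq -c | sort -gr | head -n 1 | awk '{ print $2 }')
echo "most frequent: $most_frequent"
# prints: most frequent: x

rm "$tmpfile"
```

Without -k2,2g the value 10 would be placed before 2, because the default comparison is by characters.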

Commands diff, comm (comparing files)

diff compares two files, useful for manual checking of differences

  • useful options
    • -b (ignore whitespace differences)
    • -r for comparing whole directories
    • -q only reports whether the files differ (fast check for identity)
    • -y show differences side-by-side

comm compares two sorted files

  • writes 3 columns:
    • 1: lines occurring only in the first file
    • 2: lines occurring only in the second file
    • 3: lines occurring in both files
  • some columns can be suppressed with -1, -2, -3
  • good for finding set intersections and differences
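
Set operations with comm could be sketched like this. The files a.txt and b.txt are invented for the example and are already sorted; unsorted files would have to go through sort first.

```shell
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmpdir/a.txt"
printf 'banana\ncherry\ndate\n' > "$tmpdir/b.txt"

# suppress columns 1 and 2: only lines present in both files (intersection)
both=$(comm -12 "$tmpdir/a.txt" "$tmpdir/b.txt")
# suppress columns 2 and 3: lines only in the first file (difference a minus b)
only_a=$(comm -23 "$tmpdir/a.txt" "$tmpdir/b.txt")
# suppress columns 1 and 3: lines only in the second file (difference b minus a)
only_b=$(comm -13 "$tmpdir/a.txt" "$tmpdir/b.txt")

echo "only in a: $only_a, only in b: $only_b"
# prints: only in a: apple, only in b: date

rm -r "$tmpdir"
```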

Commands cut, paste, join (working with columns)

  • cut selects only some columns from a file (perl/awk are more flexible)
  • paste puts 2 or more files side by side, separated by tabs or other character
  • join is a powerful tool for making joins and left-joins as in databases on specified columns in two files
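
A sketch of the three commands on two small tab-separated tables. The file names names.tsv and counts.tsv and their contents are assumptions made for the example. Note that join expects both inputs to be sorted on the join column.

```shell
tmpdir=$(mktemp -d)
tab=$(printf '\t')
printf 'g1\tgeneA\ng2\tgeneB\ng3\tgeneC\n' > "$tmpdir/names.tsv"
printf 'g1\t5\ng3\t7\n' > "$tmpdir/counts.tsv"

# join on the first column of both files (the default), only paired lines
inner=$(join -t "$tab" "$tmpdir/names.tsv" "$tmpdir/counts.tsv")
# left join: -a 1 also prints lines of the first file without a partner
left=$(join -t "$tab" -a 1 "$tmpdir/names.tsv" "$tmpdir/counts.tsv")

# cut extracts columns, paste puts them back side by side in swapped order
cut -f 1 "$tmpdir/names.tsv" > "$tmpdir/col1.txt"
cut -f 2 "$tmpdir/names.tsv" > "$tmpdir/col2.txt"
swapped=$(paste "$tmpdir/col2.txt" "$tmpdir/col1.txt")

echo "$left"
# prints 3 lines, the line for g2 has no count

rm -r "$tmpdir"
```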

Commands split, csplit (splitting files to parts)

  • split splits into fixed-size pieces (size in lines, bytes etc.)
  • csplit splits at occurrence of a pattern (e.g. fasta file into individual sequences)
csplit sequences.fa '/^>/' '{*}'
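
The csplit command above can be tried on a tiny made-up fasta file. This sketch assumes GNU csplit (the repeat count '{*}' and the option -z are GNU extensions). The options -s (do not print piece sizes), -z (drop the empty piece before the first header) and -f (prefix of output files) are additions to the command above; the prefix seq_ is arbitrary.

```shell
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$tmpdir/sequences.fa"

# one output file per sequence: seq_00, seq_01, seq_02
csplit -s -z -f "$tmpdir/seq_" "$tmpdir/sequences.fa" '/^>/' '{*}'

n_pieces=$(ls "$tmpdir"/seq_* | wc -l)
first_header=$(head -n 1 "$tmpdir/seq_00")
second_size=$(wc -l < "$tmpdir/seq_01")

echo "pieces: $n_pieces, first header: $first_header"

rm -r "$tmpdir"
```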

Programs sed and awk

Both programs process text files line by line and allow various transformations

  • awk is newer and more advanced
  • several examples below
  • More info on Wikipedia: awk, sed
# replace text "Chr1" by "Chromosome 1"
sed 's/Chr1/Chromosome 1/'
# prints first two lines, then quits (like head -n 2)
sed 2q  

# print first and second column from a file
awk '{print $1, $2}' 

# print the line if difference in first and second column > 10
awk '{ if ($2-$1>10) print }'  

# print lines matching pattern
awk '/pattern/ { print }' 

# count lines
awk 'END { print NR }'
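
The commands above read from stdin or from files given as arguments. A sketch on a made-up tab-separated file with a chromosome name and interval start and end follows; the option -F of awk (field separator) is not listed above and is used here to split at tabs only.

```shell
tmpfile=$(mktemp)
printf 'Chr1\t100\t250\nChr2\t300\t305\nChr1\t400\t520\n' > "$tmpfile"

# sed: replace the prefix Chr by the word Chromosome and a space
renamed=$(sed 's/Chr/Chromosome /' "$tmpfile")

# awk: sum of interval lengths, printed after the last line
total=$(awk -F '\t' '{ sum += $3 - $2 } END { print sum }' "$tmpfile")

# awk: a condition without an action prints the matching lines
long_count=$(awk -F '\t' '$3 - $2 > 10' "$tmpfile" | wc -l)

echo "total length: $total"
# prints: total length: 275

rm "$tmpfile"
```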

Perl one-liners

Instead of sed and awk, we will cover Perl one-liners

  • more examples [3], [4]
  • documentation for Perl switches [5]
# -e executes commands
perl -e'print 2+3,"\n"'
perl -e'$x = 2+3; print $x, "\n"';

# -n wraps commands in a loop reading lines from stdin or files listed as arguments
# the following is roughly the same as cat:
perl -ne'print'
# how to use:
perl -ne'print' < input > output
perl -ne'print' input1 input2 > output
# lines are stored in a special variable $_
# this variable is default argument of many functions, 
# including print, so print is the same as print $_

# simple grep-like commands:
perl -ne 'print if /pattern/'
# simple regular expression modifications
perl -ne 's/Chr(\d+)/Chromosome $1/; print'
# // and s/// are applied by default to $_

# -l removes end of line from each input line and adds "\n" after each print
# the following adds * at the end of each line
perl -lne'print $_, "*"' 

# -a splits line into words separated by whitespace and stores them in array @F
# the next example prints difference in numbers stored in the second and first column
# (e.g. interval size if each line contains coordinates of one interval)
perl -lane'print $F[1]-$F[0]'

# -F sets the separator used for splitting (a regular expression)
# the separator is written directly after -F, without a space
# the next example splits at tabs
perl -F'"\t"' -lane'print $F[1]-$F[0]'

# END { commands } is run at the very end, after we finish reading input
# the following example computes the sum of interval lengths
perl -lane'$sum += $F[1]-$F[0]; END { print $sum; }'
# similarly BEGIN { command } before we start

Other interesting possibilities:

# -i replaces each file with a new transformed version (DANGEROUS!)
# (-i.bak keeps a backup copy of each original file with suffix .bak)
# the next example removes empty lines from all .txt files in the current directory
perl -lne 'print if length($_)>0' -i *.txt
# the following example replaces sequence of whitespace by exactly one space 
# and removes leading and trailing spaces from lines in all .txt files
perl -lane 'print join(" ", @F)' -i *.txt

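# variable $. contains the line number, $ARGV the name of the current file (or - for stdin)
# the following prints filename and line number in front of every line
# ($. keeps counting across files unless ARGV is closed at the end of each file)
perl -ne'printf "%s.%d: %s", $ARGV, $., $_' file1 file2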

# renaming files *.txt to have extension .tsv:
#   first print the commands
#   then execute them by hand or replace print with system
#   mv -i asks before overwriting an existing file
#   \.txt$ matches the extension only at the end of the name
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; print("mv -i $_ $s")'
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; system("mv -i $_ $s")'
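
The renaming can be rehearsed in a temporary directory before it is applied to real data. A sketch with made-up file names follows. As a variation on the commands above, it calls system with a list of arguments, which passes the file names to mv directly without going through the shell and so also works for names containing spaces.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/a.txt" "$tmpdir/b.txt" "$tmpdir/notes.txt.old"

# dry run: only print the commands and check them by eye
commands=$(ls "$tmpdir"/*.txt | perl -lne '$s = $_; $s =~ s/\.txt$/.tsv/; print "mv -i $_ $s"')
echo "$commands"
n_commands=$(echo "$commands" | wc -l)

# real run: list form of system, arguments are not parsed by the shell
ls "$tmpdir"/*.txt | perl -lne '$s = $_; $s =~ s/\.txt$/.tsv/; system("mv", "-i", $_, $s)'

n_tsv=$(ls "$tmpdir"/*.tsv | wc -l)
old_kept=no
if [ -e "$tmpdir/notes.txt.old" ]; then old_kept=yes; fi

echo "renamed files: $n_tsv, notes.txt.old untouched: $old_kept"

rm -r "$tmpdir"
```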