1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Lbash
This lecture introduces command-line tools and Perl one-liners.
- We will do simple transformations of text files using command-line tools without writing any scripts or longer programs.
When working on the exercises, record all the commands used
- We strongly recommend making a log of commands for data processing also outside of this course
- If you have a log of executed commands, you can easily execute them again by copy and paste
- For this reason any comments are best preceded in the log by #
- If you use some sequence of commands often, you can turn it into a script
Contents
- 1 Efficient use of the Bash command line
- 2 Redirecting and pipes
- 3 Text file manipulation
- 3.1 Commands echo and cat (creating and printing files)
- 3.2 Commands head and tail (looking at start and end of files)
- 3.3 Commands wc, ls -lh, od (exploring file statistics and details)
- 3.4 Command grep (getting lines matching a regular expression)
- 3.5 Commands sort, uniq
- 3.6 Commands diff, comm (comparing files)
- 3.7 Commands cut, paste, join (working with columns)
- 3.8 Commands split, csplit (splitting files to parts)
- 4 Programs sed and awk
- 5 Perl one-liners
- 6 HWbash
Efficient use of the Bash command line
Some tips for bash shell:
- use tab key to complete command names, path names etc
- tab completion can be customized
- use up and down keys to walk through the history of recently executed commands, then edit and execute the chosen command
- press ctrl-r to search in the history of executed commands
- at the end of a session, the history is stored in ~/.bash_history
- command history -a appends history to this file right now
- you can then look into the file and copy appropriate commands to your log
- various other history tricks, e.g. special variables [1]
- cd - goes to previously visited directory (also see pushd and popd)
- ls -lt | head shows 10 most recent files, useful for seeing what you have done last in a directory
Instead of bash, you can use more advanced command-line environments, e.g. IPython notebook
Redirecting and pipes
# redirect standard output to file
command > file
# append to file
command >> file
# redirect standard error
command 2>file
# redirect file to standard input
command < file
# do not forget to quote > in other uses, e.g. when searching for string ">" in a file sequences.fasta
grep '>' sequences.fasta
# (without quotes rewrites sequences.fasta)
# other special characters, such as ;, &, |, # etc
# should be quoted in '' as well
# send stdout of command1 to stdin of command2
command1 | command2
# backtick operator executes command,
# removes trailing \n from stdout, substitutes to command line
# the following commands do the same thing:
head -n 2 file
head -n `echo 2` file
# redirect a string in ' ' to stdin of command head
head -n 2 <<< 'line 1
line 2
line 3'
# in some commands, file argument can be taken from stdin
# if denoted as - or stdin or /dev/stdin
# the following compares uncompressed version of file1 with file2
zcat file1.gz | diff - file2
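The redirections above can be tried out safely in a scratch directory. The following is only a minimal sketch: the file names out.txt and err.txt are hypothetical, and `$(...)` is used as the modern equivalent of backticks.

```shell
# work in a temporary directory so that no existing files are touched
tmp=$(mktemp -d)
cd "$tmp"

# > creates or overwrites out.txt, >> appends to it
echo "first" > out.txt
echo "second" >> out.txt

# ls writes its error message to stderr; 2> stores it in err.txt
# (|| true keeps the session going even though ls fails)
ls no_such_file 2> err.txt || true

# command substitution: $(...) behaves like backticks
n=$(wc -l < out.txt)

# here-string: head reads the quoted string from its standard input
first=$(head -n 1 <<< 'line 1
line 2')

# - stands for stdin: compare the piped text with the content of out.txt
same=no
printf 'first\nsecond\n' | diff - out.txt && same=yes

echo "lines: $n; first line: $first; identical: $same"
```

Keeping such experiments in a temporary directory avoids accidentally overwriting real data with `>`.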
Make piped commands fail properly:
set -o pipefail
If set, the return value of a pipeline is the value of the last (rightmost) command to exit with a non-zero status, or zero if all commands in the pipeline exit successfully. This option is disabled by default; the pipeline then returns the exit status of the rightmost command, even if an earlier command failed.
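The difference can be seen in a small sketch that uses the standard commands false (always fails) and true (always succeeds) in place of real processing steps:

```shell
# default behaviour: only the status of the last command (true) counts
set +o pipefail
if false | true; then status_default=ok; else status_default=failed; fi

# with pipefail the failure of the first command is reported
set -o pipefail
if false | true; then status_pipefail=ok; else status_pipefail=failed; fi

# restore the default setting
set +o pipefail
echo "default: $status_default, pipefail: $status_pipefail"
# prints: default: ok, pipefail: failed
```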
Text file manipulation
Commands echo and cat (creating and printing files)
# print text Hello and end of line to stdout
echo "Hello"
# interpret backslash combinations \n, \t etc:
echo -e "first line\nsecond\tline"
# concatenate several files to stdout
cat file1 file2
Commands head and tail (looking at start and end of files)
# print 10 first lines of file (or stdin)
head file
some_command | head
# print the first 2 lines
head -n 2 file
# print the last 5 lines
tail -n 5 file
# print starting from line 100 (line numbering starts at 1)
tail -n +100 file
# print lines 81..100
head -n 100 file | tail -n 20
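To check such line ranges, it helps to use an input where each line contains its own line number. The sketch below assumes the GNU seq command is available and shows two equivalent ways of selecting lines 81..100:

```shell
# seq 1 200 prints the numbers 1..200, one per line,
# so line k of the input contains the number k

# first 100 lines, then the last 20 of them
range1=$(seq 1 200 | head -n 100 | tail -n 20)

# everything from line 81 on, then the first 20 of these lines
range2=$(seq 1 200 | tail -n +81 | head -n 20)

echo "$range1" | head -n 1   # prints 81
echo "$range1" | tail -n 1   # prints 100
```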
Commands wc, ls -lh, od (exploring file statistics and details)
# prints three numbers:
# the number of lines (-l), number of words (-w), number of bytes (-c)
wc file
# prints the size of file in human-readable units (K,M,G,T)
ls -lh file
# od -a prints file or stdin with named characters
# allows checking whitespace and special characters
echo "hello world!" | od -a
# prints:
# 0000000 h e l l o sp w o r l d ! nl
# 0000015
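A typical use of od is checking for invisible characters. The following sketch uses a hypothetical file win.txt that contains a tab and a Windows-style line ending:

```shell
tmp=$(mktemp -d)
cd "$tmp"

# hypothetical file with a tab and a Windows line ending (\r\n)
printf 'a\tb\r\n' > win.txt

# reading from stdin makes wc print only the number, without the file name
lines=$(wc -l < win.txt)
bytes=$(wc -c < win.txt)

# od -a shows the tab as ht and the carriage return as cr
dump=$(od -a win.txt)

echo "lines: $lines, bytes: $bytes"
echo "$dump"
```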
Command grep (getting lines matching a regular expression)
# get all lines containing string chromosome
grep chromosome file
# -i ignores case (upper case and lowercase letters are the same)
grep -i chromosome file
# -c counts the number of matching lines in each file
grep -c '^[12][0-9]' file1 file2
# other options (there is more, see the manual):
# -v print/count not matching lines (inVert)
# -n show also line numbers
# -B 2 -A 1 print 2 lines before each match and 1 line after match
# -E extended regular expressions (allows e.g. |)
# -F no regular expressions, set of fixed strings
# -f patterns in a file
# (good for selecting e.g. only lines matching one of "good" ids)
Documentation: grep
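Several of these options can be combined. The sketch below works with hypothetical files ids.txt and good.txt. It also uses option -x (match whole lines only), which is not listed above but is a standard grep option.

```shell
tmp=$(mktemp -d)
cd "$tmp"

# hypothetical input: one sequence id per line
printf 'chr1\nchr2\nchrX\nscaffold_7\nchr10\n' > ids.txt
# ids we want to keep, one per line
printf 'chr1\nchrX\n' > good.txt

# -F -f: fixed strings read from good.txt
# -x: the whole line must match (otherwise chr1 would also select chr10)
selected=$(grep -F -x -f good.txt ids.txt)

# -v -c: count the lines that do NOT start with chr
others=$(grep -v -c '^chr' ids.txt)

# -E: extended regular expression with alternatives
xy=$(grep -E '^chr(X|Y)$' ids.txt)

# -n: prefix each matching line with its line number
numbered=$(grep -n 'scaffold' ids.txt)

echo "$selected"; echo "$others"; echo "$xy"; echo "$numbered"
```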
Commands sort, uniq
# sort lines of a file alphabetically
sort file
# some useful options of sort:
# -g numeric sort
# -k which column(s) to use as key
# -r reverse (from largest values)
# -s stable
# -t fields separator
# sorting first by column 2 numerically (-k2,2g),
# in case of ties use column 1 (-k1,1)
sort -k2,2g -k1,1 file
# uniq outputs one line from each group of consecutive identical lines
# uniq -c adds the size of each group as the first column
# the following finds all unique lines
# and sorts them by frequency from the most frequent
sort file | uniq -c | sort -gr
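As a small worked example, the following sketch builds a frequency table and sorts a tab-separated table by two keys. The files letters.txt and table.tsv are hypothetical, and LC_ALL=C is set only to make the ordering independent of the locale.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'b\na\nb\nc\nb\na\n' > letters.txt
printf 'x\t10\ny\t9\na\t10\n' > table.tsv

# frequency table: b occurs 3 times, a twice, c once
freq=$(LC_ALL=C sort letters.txt | uniq -c | LC_ALL=C sort -gr)

# the most frequent line is in the second field of the first row
top=$(echo "$freq" | head -n 1 | awk '{ print $2 }')

# sort by column 2 numerically, ties broken by column 1;
# -t sets the tab as field separator ($'\t' is bash syntax for a tab)
by_value=$(LC_ALL=C sort -t $'\t' -k2,2g -k1,1 table.tsv)

echo "$freq"
echo "$by_value"
```

Without sorting first, uniq would report the three occurrences of b as separate groups, because it only merges consecutive identical lines.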
Commands diff, comm (comparing files)
Command diff compares two files. It is good for manual checking of differences. Useful options:
- -b (ignore whitespace differences)
- -r for comparing whole directories
- -q for fast checking for identity
- -y show differences side-by-side
Command comm compares two sorted files. It is good for finding set intersections and differences. It writes three columns:
- lines occurring only in the first file
- lines occurring only in the second file
- lines occurring in both files
Some columns can be suppressed with options -1, -2, -3
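For example, set intersection and difference can be sketched as follows. The files set1.txt and set2.txt are hypothetical, and LC_ALL=C ensures that sort and comm agree on the ordering.

```shell
tmp=$(mktemp -d)
cd "$tmp"

# comm requires sorted input
printf 'geneC\ngeneA\ngeneB\n' | LC_ALL=C sort > set1.txt
printf 'geneB\ngeneD\ngeneC\n' | LC_ALL=C sort > set2.txt

# -12 suppresses columns 1 and 2, leaving lines present in both files
both=$(LC_ALL=C comm -12 set1.txt set2.txt)

# -23 leaves lines occurring only in the first file
only1=$(LC_ALL=C comm -23 set1.txt set2.txt)

# diff -q only reports whether the files differ (exit status 1 if they do)
if diff -q set1.txt set2.txt > /dev/null; then identical=yes; else identical=no; fi

echo "both: $both"
echo "only in set1: $only1"
echo "identical: $identical"
```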
Commands cut, paste, join (working with columns)
- Command cut selects only some columns from file (perl/awk more flexible)
- Command paste puts two or more files side by side, separated by tabs or other characters
- Command join is a powerful tool for making joins and left-joins as in databases on specified columns in two files
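The three commands can be illustrated on two small tab-separated tables. This is only a sketch: the files names.tsv and ages.tsv and their content are made up for the example.

```shell
tmp=$(mktemp -d)
cd "$tmp"

# two hypothetical tables sharing an id in the first column
printf 'id1\tAlice\nid2\tBob\nid3\tCarol\n' > names.tsv
printf 'id1\t25\nid3\t31\n' > ages.tsv

# cut -f selects tab-separated columns
cut -f 1 names.tsv > ids.txt
cut -f 2 names.tsv > given.txt

# paste glues corresponding lines together; -d sets the separator
pasted=$(paste -d ',' ids.txt given.txt)

# join matches lines with equal keys (inputs must be sorted by the key);
# -t sets the separator, by default the first column is the key
inner=$(join -t $'\t' names.tsv ages.tsv)

# -a 1 also keeps unmatched lines of the first file (left join)
left=$(join -t $'\t' -a 1 names.tsv ages.tsv)

echo "$pasted"
echo "$inner"
echo "$left"
```

In the left join, the line for id2 is printed without an age because ages.tsv has no matching key.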
Commands split, csplit (splitting files to parts)
- Command split splits into fixed-size pieces (size in lines, bytes etc.)
- Command csplit splits at occurrence of a pattern. For example, splitting a FASTA file into individual sequences:
csplit sequences.fa '/^>/' '{*}'
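A minimal sketch of both commands on a tiny hypothetical FASTA file follows. It assumes GNU csplit; the options -s (do not print piece sizes) and -z (do not create empty pieces) are additions not mentioned above.

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf '>seq1\nACGT\nGGCC\n>seq2\nTTAA\n' > sequences.fa

# split before every line starting with >; output files are named xx00, xx01, ...
# without -z the first piece (the text before the first >) would be an empty file
csplit -s -z sequences.fa '/^>/' '{*}'

# split into pieces of 2 lines each, named part_aa, part_ab, ...
split -l 2 sequences.fa part_

ls
```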
Programs sed and awk
Both sed and awk process text files line by line and can perform various transformations
# replace text "Chr1" by "Chromosome 1"
sed 's/Chr1/Chromosome 1/'
# prints the first two lines, then quits (like head -n 2)
sed 2q
# print the first and second column from a file
awk '{print $1, $2}'
# print the line if the second column exceeds the first column by more than 10
awk '{ if ($2-$1>10) print }'
# print lines matching pattern
awk '/pattern/ { print }'
# count the lines (like wc -l)
awk 'END { print NR }'
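The commands above read standard input when no file is given. As a sketch of how they behave on concrete data, the following uses a hypothetical file intervals.txt with a name, start and end on each line:

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'Chr1 100 105\nChr2 200 250\nChr1 300 320\n' > intervals.txt

# sed: s/// replaces the first occurrence on each line (add g to replace all)
renamed=$(sed 's/Chr1/Chromosome 1/' intervals.txt)

# awk: a condition before { } selects lines; here intervals longer than 10
long=$(awk '$3 - $2 > 10 { print $1, $3 - $2 }' intervals.txt)

# awk: accumulate a sum over all lines and print it at the end
total=$(awk '{ sum += $3 - $2 } END { print sum }' intervals.txt)

echo "$renamed"
echo "$long"
echo "$total"   # prints 75
```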
Perl one-liners
Instead of sed and awk, we will cover Perl one-liners
- more examples on various websites (example 1, example 2)
- documentation for Perl switches
# -e executes commands
perl -e'print 2+3,"\n"'
perl -e'$x = 2+3; print $x, "\n"';
# -n wraps commands in a loop reading lines from stdin
# or files listed as arguments
# the following is roughly the same as cat:
perl -ne'print'
# how to use:
perl -ne'print' < input > output
perl -ne'print' input1 input2 > output
# lines are stored in a special variable $_
# this variable is default argument of many functions,
# including print, so print is the same as print $_
# simple grep-like commands:
perl -ne 'print if /pattern/'
# simple regular expression modifications
perl -ne 's/Chr(\d+)/Chromosome $1/; print'
# // and s/// are applied by default to $_
# -l removes end of line from each input line and adds "\n" after each print
# the following adds * at the end of each line
perl -lne'print $_, "*"'
# -a splits line into words separated by whitespace and stores them in array @F
# the next example prints difference in the numbers stored
# in the second and first column
# (e.g. interval size if each line contains coordinates of one interval)
perl -lane'print $F[1]-$F[0]'
# -F sets the separator used for splitting (regular expression);
# the pattern must follow -F directly, without a space
# the next example splits at tabs
perl -F'"\t"' -lane'print $F[1]-$F[0]'
# END { commands } is run at the very end, after we finish reading input
# the following example computes the sum of interval lengths
perl -lane'$sum += $F[1]-$F[0]; END { print $sum; }'
# similarly BEGIN { command } before we start
Other interesting possibilities:
# -i replaces each file with a new transformed version (DANGEROUS!)
# the next example removes empty lines from all .txt files
# in the current directory
perl -lne 'print if length($_)>0' -i *.txt
# the following example replaces sequence of whitespace by exactly one space
# and removes leading and trailing spaces from lines in all .txt files
perl -lane 'print join(" ", @F)' -i *.txt
# variable $. contains the line number. $ARGV the name of file or - for stdin
# the following prints filename and line number in front of every line
perl -ne'printf "%s.%d: %s", $ARGV, $., $_' file1 file2
# renaming files *.txt to have extension .tsv:
# first print commands
# then execute by hand or replace print with system
# mv -i asks before overwriting an existing file
# \.txt$ matches .txt only at the end of the file name
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; print("mv -i $_ $s")'
ls *.txt | perl -lne '$s=$_; $s=~s/\.txt$/.tsv/; system("mv -i $_ $s")'
HWbash
Lecture on Perl, Lecture on command-line tools
- In this set of tasks, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
- Each task can be split into several stages and intermediate files written to disk, but you can also use pipelines to reduce the number of temporary files.
- Your commands should work also for other input files with the same format (do not try to generalize them too much, but also do not use very specific properties of a particular input, such as the number of lines etc.)
- Include all relevant used commands in your protocol and add a short description of your approach.
- Submit the protocol and required output files.
- Outline of the protocol is in /tasks/bash/protocol.txt, submit to directory /submit/bash/yourname
Task A (passwords)
- The file /tasks/bash/names.txt contains data about several people, one per line.
- Each line consists of given name(s), surname and email separated by spaces.
- Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form username@uniba.sk.
- The task is to generate file passwords.csv which contains a randomly generated password for each of these users
- The output file has columns separated by commas ','
- The first column contains username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password
- Submit file passwords.csv with the result of your commands.
Example line from input:
Pavol Orszagh Hviezdoslav hviezdoslav32@uniba.sk
Example line from output (password will differ):
hviezdoslav32,Hviezdoslav,Pavol Orszagh,3T3Pu3un
Hints:
- Passwords can be generated using pwgen (e.g. pwgen -N 10 -1 prints 10 passwords, one per line)
- We also recommend using perl, wc, paste (check option -d in paste)
- In Perl, function pop may be useful for manipulating @F and function join for connecting strings with a separator.
Task B (yeast genome)
The input file:
- /tasks/bash/saccharomyces_cerevisiae.gff contains annotation of the yeast genome
- Downloaded from http://yeastgenome.org/ on 2016-03-09, in particular from [2].
- It was further processed to omit DNA sequences from the end of file.
- The size of the file is 5.6M.
- For easier work, link the file to your directory by ln -s /tasks/bash/saccharomyces_cerevisiae.gff yeast.gff
- The file is in GFF3 format
- The lines starting with # are comments, other lines contain tab-separated data about one interval of some chromosome in the yeast genome
- Meaning of the first 5 columns:
- column 0 chromosome name
- column 1 source (can be ignored)
- column 2 type of interval
- column 3 start of interval (1-based coordinates)
- column 4 end of interval (1-based coordinates)
- You can assume that these first 5 columns do not contain whitespace
Task:
- Print for each type of interval (column 2), how many times it occurs in the file.
- Sort from the most common to the least common interval types.
- Hint: commands sort and uniq will be useful. Do not forget to skip comments, for example using grep -v '^#'
- The result should be a file types.txt formatted as follows:
7058 CDS
6600 mRNA
...
...
1 telomerase_RNA_gene
1 mating_type_region
1 intein_encoding_region
Submit the file types.txt
Task C (chromosomes)
- Continue processing file from task B.
- For each chromosome, the file contains a line which has in column 2 string chromosome, and the interval is the whole chromosome.
- To file chromosomes.txt, print a tab-separated list of chromosome names and sizes in the same order as in the input
- The last line of chromosomes.txt should list the total size of all chromosomes combined.
- Submit file chromosomes.txt
- Hints:
- The total size can be computed by a perl one-liner.
- Example from the lecture: compute the sum of interval sizes if each line of the file contains start and end of one interval: perl -lane'$sum += $F[1]-$F[0]; END { print $sum; }'
- Grepping for the word chromosome does not check whether this word is indeed in column 2 (the type of interval)
- Tab character is written in Perl as "\t".
- Your output should start and end as follows:
chrI 230218
chrII 813184
...
...
chrXVI 948066
chrmt 85779
total 12157105
Task D (blast)
Overall goal:
- Proteins from several well-studied yeast species were downloaded from database http://www.uniprot.org/ on 2016-03-09. The file contains sequence of the protein as well as a short description of its biological function.
- We have also downloaded proteins from the yeast Yarrowia lipolytica. We will pretend that nothing is known about the function of these proteins (as if they were produced by gene finding program in a newly sequenced genome).
- For each Y.lipolytica protein, we have found similar proteins from other yeasts
- Now we want to extract for each protein in Y.lipolytica its closest match among all known proteins and see what is its function. This will give a clue about the potential function of the Y.lipolytica protein.
Files:
- /tasks/bash/known.fa is a FASTA file containing sequences of known proteins from several species
- /tasks/bash/yarLip.fa is a FASTA file with proteins from Y.lipolytica
- /tasks/bash/known.blast is the result of finding similar proteins in yarLip.fa versus known.fa by these commands (already done by us):
formatdb -i known.fa
blastall -p blastp -d known.fa -i yarLip.fa -m 9 -e 1e-5 > known.blast
- you can link these files to your directory as follows:
ln -s /tasks/bash/known.fa .
ln -s /tasks/bash/yarLip.fa .
ln -s /tasks/bash/known.blast .
Step 1:
- Get the first (strongest) match for each query from known.blast.
- This can be done by printing the lines that are not comments but follow a comment line starting with #.
- In a Perl one-liner, you can create a state variable which will remember if the previous line was a comment, and based on that you decide if you print the current line.
- Instead of using Perl, you can play with grep. Option -A 1 prints the matching lines as well as one line after each match
- Print only the first two columns separated by tab (name of query, name of target), sort the file by the second column.
- Store the result in file best.tsv. The file should start as follows:
Q6CBS2 sp|B5BP46|YP52_SCHPO
Q6C8R4 sp|B5BP48|YP54_SCHPO
Q6CG80 sp|B5BP48|YP54_SCHPO
Q6CH56 sp|B5BP48|YP54_SCHPO
- Submit file best.tsv with the result
Step 2:
- Create file known.tsv which contains sequence names extracted from known.fa with leading > removed
- This file should be sorted alphabetically.
- The file should start as follows (lines are trimmed below):
sp|A0A023PXA5|YA19A_YEAST Putative uncharacterized protein YAL019W-A OS=Saccharomyces...
sp|A0A023PXB0|YA019_YEAST Putative uncharacterized protein YAR019W-A OS=Saccharomyces...
- Submit file known.tsv
Step 3:
- Use command join to join the files best.tsv and known.tsv so that each line of best.tsv is extended with the text describing the corresponding target in known.tsv
- Use option -1 2 to use the second column of best.tsv as a key for joining
- The output of join may look as follows:
sp|B5BP46|YP52_SCHPO Q6CBS2 Putative glutathione S-transferase C1183.02 OS=Schizosaccharomyces...
sp|B5BP48|YP54_SCHPO Q6C8R4 Putative alpha-ketoglutarate-dependent sulfonate dioxygenase OS=...
- Further reformat the output so that the query name goes first (e.g. Q6CBS2), followed by target name (e.g. sp|B5BP46|YP52_SCHPO), followed by the rest of the text, but remove all text after OS=
- Sort by query name, store as best.txt
- The output should start as follows:
B5FVA8 tr|Q5A7D5|Q5A7D5_CANAL Lysophospholipase
B5FVB0 sp|O74810|UBC1_SCHPO Ubiquitin-conjugating enzyme E2 1
B5FVB1 sp|O13877|RPAB5_SCHPO DNA-directed RNA polymerases I, II, and III subunit RPABC5
- Submit file best.txt
Note:
- Not all Y.lipolytica proteins are necessarily included in your final output (some proteins do not have blast match).
- You can think how to find the list of such proteins, but this is not part of the task.
- Files best.txt and best.tsv should have the same number of lines.