2-INF-185 Integrácia dátových zdrojov 2018/19

· From 14 March we are moving to classroom F2-T3, a.k.a. F2-128.
· Points for already graded assignments can be found on the server in /grades/userid.txt
· By Wednesday 17 April, submit your project proposal in .txt or .pdf format to the directory /submit/navrh/username
  (examples of projects for bioinformaticians)



Job Scheduling

  • Some computing jobs take a lot of time: hours, days, weeks,...
  • We do not want to keep a command-line window open the whole time; therefore we run such jobs in the background
  • Simple commands to do it in Linux:
    • To run the program immediately, then switch the whole console to the background: screen, tmux
    • To run the command when the computer becomes idle: batch
  • Now we will concentrate on Sun Grid Engine, a complex software system for managing many jobs from many users on a cluster of multiple computers
  • Basic workflow:
    • Submit a job (command) to a queue
    • The job waits in the queue until resources (memory, CPUs, etc.) become available on some computer
    • The job runs on the computer
    • Output of the job is stored in files
    • User can monitor the status of the job (waiting, running)
  • Complex possibilities for assigning priorities and deadlines to jobs, managing multiple queues etc.
  • Ideally all computers in the cluster share the same environment and filesystem
  • We have a simple training cluster for this exercise:
    • You submit jobs to a queue on vyuka
    • They will run on computer cpu02
    • This cluster is only temporarily available until next Thursday

Submitting a job (qsub)

  • qsub -b y -cwd 'command < input > output 2> error'
    • quoting the command lets us include special characters such as < and >, so that the redirections apply to the job and not to the qsub command itself
    • -b y treats command as binary, usually preferable for both binary programs and scripts
    • -cwd executes command in the current directory
    • -N name sets the name of the job
    • -l resource=value requests some non-default resources
    • for example, we can use -l threads=2 to request 2 threads for parallel programs
    • Grid Engine does not check whether you use more CPUs or memory than requested, so be considerate (and perhaps occasionally watch your jobs by running top on the computer where they execute)
  • qsub will create files for stdout and stderr, e.g. s2.o27 and s2.e27 for the job with name s2 and jobid 27
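
As a minimal sketch of scripted submission, the shell function below wraps the qsub call from this section. The function name submit_sort, the sort command and the file names are illustrative assumptions, not part of the course setup. Setting QSUB=echo prints the qsub arguments instead of submitting anything, which is a convenient way to check the quoting.

```shell
# Sketch only: submit_sort and all file names are hypothetical.
# Set QSUB=echo to preview the arguments without submitting a job.
QSUB=${QSUB:-qsub}

submit_sort() {
    local name="$1"
    local input="$2"
    # The whole command is one quoted argument, so the redirections
    # are applied when the job runs, not to the qsub command itself.
    $QSUB -b y -cwd -N "$name" "sort < ${input} > ${name}.sorted 2> ${name}.err"
}
```

For example, after QSUB=echo, calling submit_sort s2 data.txt only prints the arguments that would be passed to qsub; without the override, the same call submits a job named s2.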

Monitoring and deleting jobs (qstat, qdel)

  • qstat displays jobs of the current user
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
     28 0.50000 s3         bbrejova     r     03/15/2016 22:12:18 main.q@cpu02.compbio.fmph.unib     1
     29 0.00000 s3         bbrejova     qw    03/15/2016 22:14:08                                    1
  • qstat -u '*' displays jobs of all users
    • finished jobs disappear from the list
  • qstat -F threads shows how many threads are available
queuename                      qtype resv/used/tot. load_avg arch          states
main.q@cpu02.compbio.fmph.unib BIP   0/2/8          0.03     lx26-amd64
     28 0.75000 s3         bbrejova     r     03/15/2016 22:12:18     1
     29 0.25000 s3         bbrejova     r     03/15/2016 22:14:18     1
  • Command qdel allows you to delete a job (waiting or running)
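
When many jobs are queued, a short filter can summarize the qstat listing. The sketch below assumes the column layout shown above (job-ID first, state in the fifth column); the function name count_job_states is made up for this example.

```shell
# Sketch: count jobs per state in qstat output read from standard input.
# Assumes job lines start with a numeric job-ID and the state is column 5.
count_job_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}
```

Typical use would be qstat | count_job_states, or qstat -u '*' | count_job_states for all users. Header lines are skipped because their first field is not a number.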

Interactive work on the cluster (qrsh), screen

  • qrsh creates a job which is a normal interactive shell running on the cluster
  • in this shell you can manually run commands
  • when you close the shell, the job finishes
  • therefore it is a good idea to run qrsh within screen
    • run screen command, this creates a new shell
    • within this shell, run qrsh, then any commands you need
    • by pressing Ctrl-a d you "detach" the screen, so that both shells (local and qrsh) continue running but you can close your local window
    • later by running screen -r you get back to your shells

Running many small jobs

For example, suppose we have tens of thousands of genes and need to run some computation for each gene. Possible approaches:

  • Have a script which iterates through all genes and processes them sequentially
    • Problems: Does not use parallelism, needs more programming to restart after some interruption
  • Submit processing of each gene as a separate job to cluster (submitting done by a script/one-liner)
    • Jobs can run in parallel on many different computers
    • Problem: Queue gets very long, hard to monitor progress, hard to resubmit only unfinished jobs after some failure.
  • Array jobs in qsub (option -t): runs jobs numbered 1,2,3...; the number of the job is stored in an environment variable (SGE_TASK_ID in Grid Engine), which the script uses to decide which gene to process
    • Queue contains only running sub-jobs plus one line for the remaining part of the array job.
    • After failure, you can resubmit only unfinished portion of the interval (e.g. start from job 173).
  • Next: use make, in which you specify how to process each gene, and submit a single make command to the queue
    • Make can execute multiple tasks in parallel using several threads on the same computer (qsub array jobs can run tasks on multiple computers)
    • It will automatically skip tasks which are already finished
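
The array-job approach can be sketched as a small script. It assumes a hypothetical file genes.txt with one gene name per line, and the echo command is only a stand-in for the real computation. Grid Engine sets SGE_TASK_ID for each task of an array job; the script does nothing when the variable is not a number, for example when it is run by hand.

```shell
#!/bin/bash
# process_gene.sh: sketch of an array-job script (all file names are hypothetical).

# Print the gene name stored on line number $2 of the list file $1.
gene_for_task() {
    sed -n "${2}p" "$1"
}

# Run the per-gene work only if SGE_TASK_ID holds a task number.
case "${SGE_TASK_ID:-}" in
    ''|*[!0-9]*) ;;
    *)
        gene=$(gene_for_task genes.txt "$SGE_TASK_ID")
        # stand-in for the real computation on one gene
        echo "processing $gene" > "$gene.out"
        ;;
esac
```

If the script is executable, it could be submitted for 20000 genes with qsub -b y -cwd -N genes -t 1-20000 ./process_gene.sh, and after a failure the rest could be resubmitted with -t 173-20000.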


Make

  • Make is a system for automatically building programs (running compiler, linker etc)
  • Rules for compilation are written in a Makefile
  • Rather complex syntax with many features, we will only cover basics


  • The main part of a Makefile are rules specifying how to generate target files from some source files (prerequisites).
  • For example, the following rule generates target.txt by concatenating source1.txt and source2.txt:
target.txt : source1.txt source2.txt
      cat source1.txt source2.txt > target.txt
  • The first line describes target and prerequisites, starts in the first column
  • The following lines list commands to execute to create the target
  • Each line with a command starts with a tab character
  • If we have a directory with this rule in Makefile and files source1.txt and source2.txt, running make target.txt will run the cat command
  • However, if target.txt already exists, the command will be run only if one of the prerequisites has more recent modification time than the target
  • This allows us to restart interrupted computations or rerun only the necessary parts after modification of some input files
  • Make automatically chains the rules as necessary:
    • if we run make target.txt and some prerequisite does not exist, make checks if it can be created by some other rule and runs that rule first
    • In general it first finds all necessary steps and runs them in topological order so that each rule has its prerequisites ready
    • Option make -n target will show what commands would be executed to build target (dry run) - good idea before running something potentially dangerous
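
To experiment with chained rules and timestamps, the following sketch creates a scratch directory with two input files and a Makefile with two rules. The function name setup_chain_demo and all file names are made up for this illustration. The Makefile is written with printf so that each command line starts with a real tab character.

```shell
# Sketch: create a scratch directory with two chained rules.
# All names here are hypothetical and chosen only for the demonstration.
setup_chain_demo() {
    local dir="$1"
    mkdir -p "$dir"
    printf 'line A\nline B\n' > "$dir/source1.txt"
    printf 'line C\n' > "$dir/source2.txt"
    # \t in the format puts a tab in front of each command, as make requires
    printf '%s\n\t%s\n\n%s\n\t%s\n' \
        'report.txt : merged.txt' \
        'wc -l < merged.txt > report.txt' \
        'merged.txt : source1.txt source2.txt' \
        'cat source1.txt source2.txt > merged.txt' \
        > "$dir/Makefile"
}
```

After setup_chain_demo demo and cd demo, running make -n report.txt lists the cat command followed by the wc command without executing them. Running make report.txt executes both, a second make report.txt reports that the target is up to date, and touch source1.txt followed by make report.txt reruns both commands.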

Pattern rules

  • We can specify a general rule for files with a systematic naming scheme. For example, to create a .pdf file from a .tex file, we use pdflatex command:
%.pdf : %.tex
      pdflatex $^
  • In the first line, % denotes some variable part of the filename, which has to agree in the target and all prerequisites
  • In commands, we can use several variables:
    • $^ contains the names of all prerequisites (here the single source file)
    • $@ contains the name of the target
    • $* contains the string matched by %
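
The variables can be observed with a small experiment. The sketch below writes a Makefile with one pattern rule that prints $@, $^ and $* before sorting a file; the .sorted suffix, the file genes.txt and the function name setup_pattern_demo are assumptions made for this example.

```shell
# Sketch: a pattern rule that reports its automatic variables.
# File names and the .sorted suffix are hypothetical.
setup_pattern_demo() {
    local dir="$1"
    mkdir -p "$dir"
    printf 'geneB\ngeneA\n' > "$dir/genes.txt"
    # single quotes keep $@, $^ and $* literal so that make expands them
    printf '%s\n\t%s\n\t%s\n' \
        '%.sorted : %.txt' \
        '@echo "building $@ from $^ (stem $*)"' \
        'sort $^ > $@' \
        > "$dir/Makefile"
}
```

In the created directory, make genes.sorted matches % to genes, so the rule reads genes.txt and writes genes.sorted. The leading @ only stops make from printing the echo command itself.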

Other useful tricks in Makefiles


  • Store some reusable values in variables, then use them several times in the Makefile:
MYPATH := /projects/trees/bin

target : source
       $(MYPATH)/script < $^ > $@

Wildcards, creating a list of targets from files in the directory

The following Makefile automatically creates .png version of each .eps file simply by running make:

EPS := $(wildcard *.eps)
EPSPNG := $(patsubst %.eps,%.png,$(EPS))

all:  $(EPSPNG)

clean:
        rm $(EPSPNG)

%.png : %.eps
        convert -density 250 $^ $@
  • variable EPS contains names of all files matching *.eps
  • variable EPSPNG contains desirable names of png files
    • it is created by taking filenames in EPS and changing .eps to .png
  • all is a "phony target" which is not really created
    • its rule has no commands, but all png files are its prerequisites, so they are built first
    • the first target in the Makefile (in this case all) is the default when no other target is specified on the command line
  • clean is also a phony target for deleting generated png files

Useful special built-in target names

Include these lines in your Makefile if desired

.SECONDARY:
# prevents deletion of intermediate targets in chained rules

.DELETE_ON_ERROR:
# delete targets if a rule fails

Parallel make

  • running make with option -j 4 will run up to 4 commands in parallel if their dependencies are already finished
  • easy parallelization on a single computer
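
Putting the pieces together for the gene example, the sketch below creates a Makefile with one target per gene, so that make -j 4 can process several genes at once and skip those already finished. The gene names, the .fa and .len suffixes and the wc command (a stand-in for a real analysis) are assumptions made for this illustration.

```shell
# Sketch: one make target per gene listed in genes.txt (hypothetical names).
setup_gene_demo() {
    local dir="$1"
    mkdir -p "$dir"
    for g in gene1 gene2 gene3; do
        printf '>%s\nACGT\n' "$g" > "$dir/$g.fa"
        echo "$g"
    done > "$dir/genes.txt"
    # wc -c is only a placeholder for the real per-gene computation
    printf '%s\n\n%s\n\n%s\n\t%s\n' \
        'GENES := $(shell cat genes.txt)' \
        'all : $(addsuffix .len,$(GENES))' \
        '%.len : %.fa' \
        'wc -c < $^ > $@' \
        > "$dir/Makefile"
}
```

In the created directory, make -j 4 builds all .len files locally. On the cluster, the same computation could be submitted as a single job, for example with qsub -b y -cwd -l threads=4 'make -j 4'. If the job is interrupted, resubmitting it only builds the targets that are still missing.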

Alternatives to Makefiles

  • Bioinformatics often uses "pipelines" - sequences of commands run one after another, e.g. by a script or Makefile
  • There are many tools developed for automating computational pipelines, see e.g. this review: Jeremy Leipzig; A review of bioinformatic pipeline frameworks. Brief Bioinform 2016 bbw020.
  • For example Snakemake
    • Workflows can contain shell commands or Python code
    • Big advantage compared to Make: pattern rules may contain multiple variable portions (in make only one % per filename)
    • For example, you have several fasta files and several HMMs representing protein families and you want to run each HMM on each fasta file:
rule HMMER:
     input: "{filename}.fasta", "{hmm}.hmm"
     output: "{filename}_{hmm}.hmmer"
     shell: "hmmsearch --domE 1e-5 --noali --domtblout {output} {input[1]} {input[0]}"