1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration



Lmake
<!-- NOTEX -->
[[HWmake]]
<!-- /NOTEX -->

==Job scheduling==
  
 
* Some computing jobs take a lot of time: hours, days, weeks,...
* We do not want to keep a command-line window open the whole time; therefore we run such jobs in the background
* Simple commands to do it in Linux:
** To run the program immediately, then switch the whole console to the background: [https://www.gnu.org/software/screen/manual/screen.html screen], [https://tmux.github.io/ tmux] (see a [[Screen|separate section]])
 
** To run the command when the computer becomes idle: [http://pubs.opengroup.org/onlinepubs/9699919799/utilities/batch.html batch]
* Now we will concentrate on '''[https://en.wikipedia.org/wiki/Oracle_Grid_Engine Sun Grid Engine]''', a complex software package for managing many jobs from many users on a cluster consisting of multiple computers
 
* We have a simple training cluster for this exercise:
** You submit jobs to a queue on vyuka
** They will run on computers runner01 and runner02
** This cluster is only temporarily available until next Monday
 
<!-- /NOTEX -->
  
 
===Submitting a job (qsub)===
  
Basic command: <tt>qsub -b y -cwd command 'parameter < input > output 2> error'</tt>
* Quoting around the command parameters allows us to include special characters, such as <tt><, ></tt>, and prevents them from being applied to the <tt>qsub</tt> command itself
 
* <tt>-b y</tt> treats command as binary, usually preferable for both binary programs and scripts
* <tt>-cwd</tt> executes command in the current directory
* <tt>-N name</tt> sets the name of the job
* <tt>-l resource=value</tt> requests some non-default resources
 
* For example, we can use <tt>-l threads=2</tt> to request 2 threads for parallel programs
* Grid engine will not check whether you use more CPUs or memory than requested; be considerate (and perhaps occasionally watch your jobs by running <tt>top</tt> on the computer where they execute)
* <tt>qsub</tt> will create files for <tt>stdout</tt> and <tt>stderr</tt>, e.g. <tt>s2.o27</tt> and <tt>s2.e27</tt> for the job with name <tt>s2</tt> and jobid 27
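The redirections inside the quoted parameter string can be tried locally without the grid engine. A minimal sketch (the scratch filenames <tt>input</tt>, <tt>output</tt>, <tt>error</tt> are just examples):

```shell
# What the quoted part '... < input > output 2> error' does:
# stdin is read from 'input', stdout goes to 'output', stderr to 'error'.
cd "$(mktemp -d)"                # work in a scratch directory
echo "hello" > input
# tr copies stdin to stdout; the extra echo simulates an error message
{ tr a-z A-Z; echo "something failed" >&2; } < input > output 2> error
cat output    # HELLO
cat error     # something failed
```

On the cluster, the quotes prevent the local shell on vyuka from interpreting these characters; the redirections take effect only when the job actually runs.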
  
 
===Monitoring and deleting jobs (qstat, qdel)===
 
Command <tt>qstat</tt> displays jobs of the current user
* job 28 is running on server runner02 (status <tt>r</tt>), job 29 is waiting in the queue (status <tt>qw</tt>)
 
<pre>
job-ID  prior   name       user         state submit/start at     queue       
---------------------------------------------------------------------------------
     28 0.50000 s3         bbrejova     r     03/15/2016 22:12:18 main.q@runner02
     29 0.00000 s3         bbrejova     qw    03/15/2016 22:14:08             
</pre>
  
* Command <tt>qstat -u '*'</tt> displays jobs of all users.
* Finished jobs disappear from the list.
* Command <tt>qstat -F threads</tt> shows how many threads are available:
 
<pre>
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@runner01                BIP   0/1/2          0.00     lx26-amd64    
	hc:threads=1
    238 0.25000 sleeper.pl bbrejova     r     03/05/2020 13:12:28     1        
---------------------------------------------------------------------------------
main.q@runner02                BIP   0/1/2          0.00     lx26-amd64    
    237 0.75000 sleeper.pl bbrejova     r     03/05/2020 13:12:13     1        
</pre>
  
===Interactive jobs (qrsh)===
Command <tt>qrsh</tt> creates an interactive job: instead of submitting a command to the queue, it opens a shell running on one of the cluster computers
 
* In this shell you can manually run commands
* When you close the shell, the job finishes
* Therefore it is a good idea to run <tt>qrsh</tt> within the <tt>screen</tt> command
** Run the <tt>screen</tt> command; this creates a new shell
** Within this shell, run <tt>qrsh</tt>, then whatever commands on the remote server
** By pressing <tt>Ctrl-a d</tt> you "detach" the screen, so that both shells (local and <tt>qrsh</tt>) continue running but you can close your local window
** Later, by running <tt>screen -r</tt>, you get back to your shells
** See more details in the [[screen|separate section]]
  
 
===Running many small jobs===
For example, we may need to run some computation for each human gene (there are roughly 20,000 such genes). Here are some possibilities:
 
* Run a script which iterates through all jobs and runs them sequentially
** Problems: This does not use parallelism, and needs more programming to restart after some interruption
 
* Submit processing of each gene as a separate job to the cluster (submitting done by a script/one-liner)
** Jobs can run in parallel on many different computers
** Problem: The queue gets very long; it is hard to monitor progress and hard to resubmit only unfinished jobs after some failure.
* Array jobs in <tt>qsub</tt> (option <tt>-t</tt>): runs jobs numbered 1,2,3...; number of the current job is in an environment variable, used by the script to decide which gene to process
 
** The queue contains only running sub-jobs plus one line for the remaining part of the array job.
** After a failure, you can resubmit only the unfinished portion of the interval (e.g. start from job 173).
* Next: using <tt>make</tt> in which you specify how to process each gene and submit a single <tt>make</tt> command to the queue
 
** Make can execute multiple tasks in parallel using several threads on the same computer (<tt>qsub</tt> array jobs can run tasks on multiple computers)
** It will automatically skip tasks which are already finished, so restart is easy
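The array-job approach above can be sketched with a tiny per-task script: the grid engine exports the current sub-job number in the <tt>SGE_TASK_ID</tt> environment variable, and the script uses it to pick its gene. Here we set the variable manually to simulate sub-job 2 (the file names are made up):

```shell
# task.sh: per-task logic for an array job, submitted e.g. as
#   qsub -t 1-3 -b y -cwd ./task.sh
cd "$(mktemp -d)"
printf 'geneA\ngeneB\ngeneC\n' > genes.txt     # one gene per line
SGE_TASK_ID=2                                  # set by the grid engine in a real run
gene=$(sed -n "${SGE_TASK_ID}p" genes.txt)     # gene assigned to this sub-job
echo "result for $gene" > "$gene.out"          # placeholder for the real computation
```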
  
 
==Make==
 
[https://en.wikipedia.org/wiki/Make_(software) Make] is a system for automatically building programs (running compiler, linker, etc.)
* In particular, we will use [https://www.gnu.org/software/make/manual/ GNU make]
* Rules for compilation are written in a file typically called <tt>Makefile</tt>
 
* Rather complex syntax with many features; we will only cover the basics
  
 
* This allows us to restart interrupted computations or rerun the necessary parts after modification of some input files
* <tt>make</tt> automatically chains the rules as necessary:
** If we run <tt>make target.txt</tt> and some prerequisite does not exist, <tt>make</tt> checks if it can be created by some other rule and runs that rule first.
** In general it first finds all necessary steps and runs them in an appropriate order so that each rule has its prerequisites ready.
** Option <tt>make -n target</tt> will show which commands would be executed to build the target (a dry run) - a good idea before running something potentially dangerous.
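The skipping of already-finished targets can be seen on a toy Makefile (this sketch assumes GNU make is installed):

```shell
cd "$(mktemp -d)"
# one rule: result.txt is produced from input.txt
printf 'result.txt: input.txt\n\tcp input.txt result.txt\n' > Makefile
echo data > input.txt
make     # runs: cp input.txt result.txt
make     # does nothing, the target is already up to date
```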
  
 
===Pattern rules===
A pattern rule for building a <tt>.pdf</tt> file from a <tt>.tex</tt> file:
<syntaxhighlight lang="make">
%.pdf : %.tex
	pdflatex $^
</syntaxhighlight>
* In the first line, <tt>%</tt> denotes some variable part of the filename, which has to agree in the target and all prerequisites.
* In commands, we can use several variables:
** Variable <tt>$^</tt> contains the names of the prerequisites (source).
** Variable <tt>$@</tt> contains the name of the target.
** Variable <tt>$*</tt> contains the string matched by <tt>%</tt>.
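A quick way to see these automatic variables in action (GNU make assumed; the <tt>.in</tt>/<tt>.out</tt> suffixes are made up for the demonstration):

```shell
cd "$(mktemp -d)"
# pattern rule that just records $*, $^ and $@ into the target file
printf '%%.out: %%.in\n\techo "stem=$* prereq=$^ target=$@" > $@\n' > Makefile
touch report.in
make report.out >/dev/null
cat report.out    # stem=report prereq=report.in target=report.out
```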
  
 
===Other useful tricks in Makefiles===
Creating <tt>.png</tt> files from all <tt>.eps</tt> files in a directory:
<syntaxhighlight lang="make">
EPS = $(wildcard *.eps)
EPSPNG = $(patsubst %.eps,%.png,$(EPS))

all: $(EPSPNG)

clean:
	rm -f $(EPSPNG)

%.png: %.eps
	convert -density 250 $^ $@
</syntaxhighlight>
* Variable <tt>EPS</tt> contains names of all files matching <tt>*.eps</tt>.
* Variable <tt>EPSPNG</tt> contains names of desired <tt>.png</tt> files.
** It is created by taking filenames in <tt>EPS</tt> and changing <tt>.eps</tt> to <tt>.png</tt>.
* <tt>all</tt> is a "phony target" which is not really created
** Its rule has no commands, but all <tt>.png</tt> files are its prerequisites, so they are built first.
** The first target in a <tt>Makefile</tt> (in this case <tt>all</tt>) is the default when no other target is specified on the command line.
* <tt>clean</tt> is also a phony target, for deleting generated <tt>.png</tt> files.
* Thus run <tt>make all</tt> or just <tt>make</tt> to create the png files, and <tt>make clean</tt> to delete them.
  
 
====Useful special built-in target names====
  
 
===Parallel make===
Running make with option <tt>-j 4</tt> will run up to 4 commands in parallel if their dependencies are already finished. This allows easy parallelization on a single computer.
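A toy illustration (GNU make assumed): with two targets that do not depend on each other, <tt>-j 2</tt> may build both at once.

```shell
cd "$(mktemp -d)"
# two targets with no dependency between them
printf 'all: a.txt b.txt\na.txt:\n\techo A > a.txt\nb.txt:\n\techo B > b.txt\n' > Makefile
make -j 2 >/dev/null    # a.txt and b.txt can be built in parallel
cat a.txt b.txt
```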
  
 
==Alternatives to Makefiles==
 
* Bioinformaticians often use "pipelines" - sequences of commands run one after another, e.g. by a script or <tt>make</tt>
* There are many tools developed for automating computational pipelines in bioinformatics; see e.g. this review: [https://academic.oup.com/bib/article/doi/10.1093/bib/bbw020/2562749/A-review-of-bioinformatic-pipeline-frameworks Jeremy Leipzig; A review of bioinformatic pipeline frameworks. Brief Bioinform 2016.]
* For example [https://snakemake.readthedocs.io/en/stable/ Snakemake]
 
** Snakemake workflows can contain shell commands or Python code
** Big advantage compared to <tt>make</tt>: pattern rules may contain multiple variable portions (in <tt>make</tt> only one <tt>%</tt> per filename)
  
==Overall advantages of using command-line one-liners and make==
* Compared to popular Python frameworks, such as Pandas, many command-line utilities process files line by line and thus do not need to load big files into memory.
* For small easy tasks (e.g. counting lines, looking for occurrences of a specific string, looking at the first lines of a file) it might be less effort to run a simple one-liner than to open the file in some graphical interface (Jupyter notebook, Excel) to find the answer. This is true provided that you take the time to learn the command-line utilities.
* Command-line commands, scripts and makefiles can be easily run in the background on your computer or even on remote servers.
* It is easy to document what exactly was run by keeping a log of executed commands, as well as any scripts used in the process.
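For instance, each of the small tasks mentioned above is a single short command (the file name is made up):

```shell
cd "$(mktemp -d)"
printf 'apple\nbanana\napple\n' > fruits.txt
wc -l < fruits.txt            # number of lines: 3
grep -c '^apple$' fruits.txt  # lines equal to "apple": 2
head -n 1 fruits.txt          # first line: apple
```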
 
 
==HWmake==

===Motivation: Building Phylogenetic Trees===
The task for today will be to build a phylogenetic tree of 9 mammalian species using protein sequences
* A '''phylogenetic tree''' is a tree showing the evolutionary history of these species. Leaves are the present-day species; internal nodes are their common ancestors.
* The '''input''' contains sequences of all proteins from each species (we will use only a smaller subset)
* The process is typically split into several stages shown below
 
 
 
====Identify ortholog groups====
''Orthologs'' are proteins from different species that "correspond" to each other. Orthologs are found based on sequence similarity, and we can use a tool called [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download blast] to identify sequence similarities between pairs of proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species.
 
 
 
====Align proteins in each group====
For each ortholog group, we need to align the proteins in the group to identify corresponding parts of the proteins. This is done by a tool called <tt>muscle</tt>.
 
 
Unaligned sequences (start of protein [https://www.uniprot.org/uniprot/O60568 O60568]):
<pre>
>human
MTSSGPGPRFLLLLPLLLPPAASASDRPRGRDPVNPEKLLVITVA...
>baboon
MTSSRPGLRLLLLLLLLPPAASASDRPRGRDPVNPEKLLVMTVA...
>dog
MASSGPGLRLLLGLLLLLPPPPATSASDRPRGGDPVNPEKLLVITVA...
>elephant
MASWGPGARLLLLLLLLLLPPPPATSASDRSRGSDRVNPERLLVITVA...
>guineapig
MAFGAWLLLLPLLLLPPPPGACASDQPRGSNPVNPEKLLVITVA...
>opossum
SDKLLVITAA...
>pig
AMASGPGLRLLLLPLLVLSPPPAASASDRPRGSDPVNPDKLLVITVA...
>rabbit
MGCDSRKPLLLLPLLPLALVLQPWSARGRASAEEPSSISPDKLLVITVA...
>rat
MAASVPEPRLLLLLLLLLPPLPPVTSASDRPRGANPVNPDKLLVITVA...
</pre>
 
 
 
Aligned sequences:
<pre>
rabbit       MGCDSRKPLL LLPLLPLALV LQPW-SARGR ASAEEPSSIS PDKLLVITVA ...
guineapig    MAFGA----W LLLLPLLLLP PPPGACASDQ PRGSNP--VN PEKLLVITVA ...
opossum      ---------- ---------- ---------- ---------- SDKLLVITAA ...
rat          MAASVPEPRL LLLLLLLLPP LPPVTSASDR PRGANP--VN PDKLLVITVA ...
elephant     MASWGPGARL LLLLLLLLLP PPPATSASDR SRGSDR--VN PERLLVITVA ...
human        MTSSGPGPRF LLLLPLLL-- -PPAASASDR PRGRDP--VN PEKLLVITVA ...
baboon       MTSSRPGLRL LLLLLLL--- -PPAASASDR PRGRDP--VN PEKLLVMTVA ...
dog          MASSGPGLRL LLGLLLLL-P PPPATSASDR PRGGDP--VN PEKLLVITVA ...
pig          AMASGPGLR- LLLLPLLVLS PPPAASASDR PRGSDP--VN PDKLLVITVA ...
</pre>
 
 
 
====Build a phylogenetic tree for each group====
For each alignment, we build a phylogenetic tree for this group. We will use a program called <tt>phyml</tt>.

Example of a phylogenetic tree in newick format:
<pre>
((opossum:0.09636245,rabbit:0.85794020):0.05219782,
(rat:0.07263127,elephant:0.03306863):0.01043531,
(dog:0.01700528,(pig:0.02891345,
(guineapig:0.14451043,
(human:0.01169266,baboon:0.00827402):0.02619598
):0.00816185):0.00631423):0.00800806);
</pre>
 
 
 
[[Image:Tree.png|center|thumb|200px|Tree for gene O60568 (note: this particular tree does not agree well with real evolutionary history)]]
 
 
 
====Build a consensus tree====
 
 
 
The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so the trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using a tool called Phylip. The output is a single consensus tree.
 
 
 
<!-- NOTEX -->
 
===Files and submitting===
 
<!-- /NOTEX -->
 
<!-- TEX
 
===Files===
 
/TEX -->
 
 
 
Our goal is to build a pipeline that automates the whole task using <tt>make</tt>, and to execute it remotely using <tt>qsub</tt>. Most of the work is already done; only small modifications are necessary.
 
 
 
<!-- NOTEX -->
 
* Submit by copying requested files to <tt>/submit/make/username/</tt>
 
* Do not forget to submit the protocol; an outline of the protocol is in <tt>/tasks/make/protocol.txt</tt>
 
<!-- /NOTEX -->
 
 
 
Start by copying directory <tt>/tasks/make</tt> to your user directory
 
<syntaxhighlight lang="bash">
 
cp -ipr /tasks/make .
 
cd make
 
</syntaxhighlight>
 
 
 
The directory contains three subdirectories:
 
* <tt>large</tt>: a larger sample of proteins for task A
 
* <tt>tiny</tt>: a very small set of proteins for task B
 
* <tt>small</tt>: a slightly larger set of proteins for task C
 
 
 
===Task A (long job)===
 
 
 
* In this task, you will run a long alignment job (more than two hours)
 
* Use directory <tt>large</tt> with files:
 
** <tt>ref.fa</tt>: selected human proteins
 
** <tt>other.fa</tt>: selected proteins from 8 other mammalian species
 
** <tt>Makefile</tt>: runs blast on <tt>ref.fa</tt> vs <tt>other.fa</tt> (also formats database <tt>other.fa</tt> before that)
 
* run <tt>make -n</tt> to see what commands would be executed (you should see <tt>makeblastdb</tt>, <tt>blastp</tt>, and <tt>echo</tt> for timing)
 
<!-- NOTEX -->
 
** copy the output to the '''protocol'''
 
<!-- /NOTEX -->
 
* run <tt>qsub</tt> with appropriate options to run <tt>make</tt> (at least <tt>-cwd -b y</tt>)
 
<!-- NOTEX -->
 
* then run <tt>qstat > queue.txt </tt>
 
** '''Submit''' file <tt>queue.txt</tt> showing your job waiting or running
 
<!-- /NOTEX -->
 
<!-- TEX
 
* Use <tt>qsub</tt> to check the status of your job
 
/TEX -->
 
* When your job finishes, check the following files:
 
** the output file <tt>ref.blast</tt>
 
** standard output from the <tt>qsub</tt> job, which is stored in a file named e.g. <tt>make.oX</tt> where <tt>X</tt> is the number of your job. The output shows the time when your job started and finished (this information was written by commands <tt>echo</tt> in the <tt>Makefile</tt>)
 
<!-- NOTEX -->
 
* '''Submit''' the last 100 lines from <tt>ref.blast</tt> under the name <tt>ref-end.blast</tt> (use tool <tt>tail -n 100</tt>) and the file <tt>make.oX</tt> mentioned above
 
<!-- /NOTEX -->
 
 
 
===Task B (finishing Makefile)===
 
 
 
* In this task, you will finish a <tt>Makefile</tt> for splitting blast results into ortholog groups and building phylogenetic trees for each group
 
** This <tt>Makefile</tt> works with much smaller files and so you can run it quickly many times without <tt>qsub</tt>
 
* Work in directory <tt>tiny</tt>
 
** <tt>ref.fa</tt>: 2 human proteins
 
** <tt>other.fa</tt>: a selected subset of proteins from 8 other mammalian species
 
** <tt>Makefile</tt>: a longer makefile
 
** <tt>brm.pl</tt>: a Perl script for finding ortholog groups and sorting them to directories
 
 
 
The <tt>Makefile</tt> runs the analysis in four stages. Stages 1, 2 and 4 are done; you have to finish stage 3
 
* If you run <tt>make</tt> without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
 
* Stage 1: run as <tt>make ref.brm</tt>
 
** It runs <tt>blast</tt> as in task A, then splits proteins into ortholog groups and creates one directory for each group with file <tt>prot.fa</tt> containing protein sequences
 
* Stage 2: run as <tt>make alignments</tt>
 
** In each directory with an ortholog group, it will create an alignment <tt>prot.phy</tt> and link it under names <tt>lg.phy</tt> and <tt>wag.phy</tt>
 
* Stage 3: run as <tt>make trees</tt> (needs to be written by you)
 
** In each directory with an ortholog group, it should create files <tt>lg.phy_phyml_tree</tt> and <tt>wag.phy_phyml_tree</tt> containing the results of the <tt>phyml</tt> program run with two different evolutionary models WAG and LG, where LG is the default
 
** Run <tt>phyml</tt> by commands of the form:<br><tt>phyml -i INPUT --datatype aa --bootstrap 0 --no_memory_check >LOG</tt><br><tt>phyml -i INPUT --model WAG --datatype aa --bootstrap 0 --no_memory_check >LOG</tt>
 
** Change <tt>INPUT</tt> and <tt>LOG</tt> in the commands to the appropriate filenames using <tt>make</tt> variables <tt>$@, $^, $*</tt> etc. The input should come from lg.phy or wag.phy in the directory of a gene and log should be the same as tree name with extension <tt>.log</tt> added (e.g. <tt>lg.phy_phyml_tree.log</tt>)
 
** Also add variables <tt>LG_TREES</tt> and <tt>WAG_TREES</tt> listing the filenames of all desired trees, and uncomment the phony target <tt>trees</tt> which uses these variables
 
* Stage 4:  run as <tt>make consensus</tt>
 
** Output trees from stage 3 are concatenated for each model separately to files <tt>lg/intree</tt>, <tt>wag/intree</tt> and then <tt>phylip</tt> is run to produce consensus trees <tt>lg.tree</tt> and <tt>wag.tree</tt>
 
** This stage also needs variables <tt>LG_TREES</tt> and <tt>WAG_TREES</tt> to be defined by you.
 
 
 
* Run your <tt>Makefile</tt> and check that the files <tt>lg.tree</tt> and <tt>wag.tree</tt> are produced
 
<!-- NOTEX -->
 
* '''Submit''' the whole directory <tt>tiny</tt>, including <tt>Makefile</tt> and all gene directories with tree files.
 
<!-- /NOTEX -->
 
 
 
 
 
===Task C (running make)===
 
* Copy your <tt>Makefile</tt> from part B to directory <tt>small</tt>, which contains 9 human proteins and run <tt>make</tt> on this slightly larger set
 
** Again, run it without <tt>qsub</tt>, but it will take some time, particularly if the server is busy
 
* Look at the two resulting trees (<tt>wag.tree</tt>, <tt>lg.tree</tt>) using the <tt>figtree</tt> program
 
<!-- NOTEX -->
 
** it is available on vyuka, but you can also [https://github.com/rambaut/figtree/releases install it] on your computer if needed
 
<!-- /NOTEX -->
 
* In <tt>figtree</tt>, change the position of the root in the tree to make the opossum the outgroup (species branching as the first away from the others). This is done by clicking on opossum and thus selecting it, then pressing the Reroot button.
 
* Also switch on displaying branch labels. These labels show, for each branch of the tree, how many of the input trees support this branch. To do this, use the left panel with options.
 
* Export the trees in pdf format as <tt>wag.tree.pdf</tt> and <tt>lg.tree.pdf</tt>
 
* Compare the two trees
 
** Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
 
** Do the two trees differ? What is the highest and lowest support for a branch in each tree?
 
** Also compare your trees with the accepted "correct tree" found here http://genome-euro.ucsc.edu/images/phylo/hg38_100way.png (note that this tree contains many more species, but all ours are included)
 
<!-- NOTEX -->
 
** Write '''your observations to the protocol'''
 
* '''Submit''' the entire <tt>small</tt> directory (including the two pdf files)
 
<!-- /NOTEX -->
 
 
 
===Further possibilities===
 
 
 
<!-- NOTEX -->
 
Here are some possibilities for further experiments, in case you are interested (do not submit these):
 
<!-- /NOTEX -->
 
<!-- TEX
 
Here are some possibilities for further experiments, in case you are interested:
 
/TEX -->
 
* You could copy your extended <tt>Makefile</tt> to directory <tt>large</tt> and create trees for all ortholog groups in the big set
 
<!-- NOTEX -->
 
** This would take a long time, so submit it through <tt>qsub</tt>, and only some time after the lecture is over, to allow classmates to work on task A
 
<!-- /NOTEX -->
 
** After <tt>ref.brm</tt> is done, programs for individual genes can be run in parallel, so you can try running <tt>make -j 2</tt> and requesting 2 threads from <tt>qsub</tt>
 
* Phyml also supports other models, for example JTT  (see [http://www.atgc-montpellier.fr/download/papers/phyml_manual_2012.pdf manual]); you could try to play with those.
 
* Command <tt>touch FILENAME</tt> will change the modification time of the given file to the current time
 
** What happens when you run <tt>touch</tt> on some of the intermediate files in the analysis in task B? Does <tt>Makefile</tt> always run properly?
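The mechanism behind the <tt>touch</tt> experiment can be sketched on a toy rule (GNU make assumed) before trying it on the real files from task B:

```shell
cd "$(mktemp -d)"
printf 'out: in\n\tcp in out\n' > Makefile
echo x > in
make >/dev/null    # builds out from in
sleep 1
touch in           # in is now newer than out again
make               # the rule runs again: cp in out
```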
 

Latest revision as of 20:07, 25 January 2024

HWmake


Job scheduling

  • Some computing jobs take a lot of time: hours, days, weeks,...
  • We do not want to keep a command-line window open the whole time; therefore we run such jobs in the background
  • Simple commands to do it in Linux:
    • To run the program immediately, then switch the whole console to the background: screen, tmux (see a separate section)
    • To run the command when the computer becomes idle: batch
  • Now we will concentrate on Sun Grid Engine, a complex software for managing many jobs from many users on a cluster consisting of multiple computers
  • Basic workflow:
    • Submit a job (command) to a queue
    • The job waits in the queue until resources (memory, CPUs, etc.) become available on some computer
    • The job runs on the computer
    • Output of the job is stored in files
    • User can monitor the status of the job (waiting, running)
  • Complex possibilities for assigning priorities and deadlines to jobs, managing multiple queues etc.
  • Ideally all computers in the cluster share the same environment and filesystem
  • We have a simple training cluster for this exercise:
    • You submit jobs to a queue on vyuka
    • They will run on computers runner01 and runner02
    • This cluster is only temporarily available, until next Monday


Submitting a job (qsub)

Basic command: qsub -b y -cwd command 'parameter < input > output 2> error'

  • Quoting the command's parameters lets us include special characters, such as <, > etc., and have them apply to the command itself rather than to qsub
  • -b y treats the command as a binary, which is usually preferable for both binary programs and scripts
  • -cwd executes the command in the current directory
  • -N name sets the name of the job
  • -l resource=value requests a non-default resource
  • For example, we can use -l threads=2 to request 2 threads for parallel programs
  • The grid engine will not check whether you use more CPUs or memory than requested, so be considerate (and perhaps occasionally watch your jobs by running top on the computer where they execute)
  • qsub will create files for stdout and stderr, e.g. s2.o27 and s2.e27 for a job named s2 with job ID 27
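As a concrete sketch (the input file, output file, and job name below are invented for illustration), you can wrap the work in a small script and submit that script; the qsub lines themselves only work on the cluster, so they are shown as comments:

```shell
# Hypothetical example: sort a file of numbers as a batch job.
printf '3\n1\n2\n' > input.txt
cat > job.sh <<'EOF'
#!/bin/bash
sort -n input.txt > output.txt
EOF
chmod +x job.sh
./job.sh                       # run locally first to check that the script works
# On the cluster you would submit the same script (not runnable locally):
#   qsub -b y -cwd -N sortjob -l threads=1 ./job.sh
# qsub would then write stdout/stderr to files such as sortjob.oNN / sortjob.eNN
```

Testing a job script locally on a small input before submitting it to the queue saves a round trip through the scheduler when there is a simple bug.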

Monitoring and deleting jobs (qstat, qdel)

Command qstat displays jobs of the current user

  • job 28 is running on server runner02 (status r), job 29 is waiting in the queue (status qw)
job-ID  prior   name       user         state submit/start at     queue       
---------------------------------------------------------------------------------
     28 0.50000 s3         bbrejova     r     03/15/2016 22:12:18 main.q@runner02
     29 0.00000 s3         bbrejova     qw    03/15/2016 22:14:08             
  • Command qstat -u '*' displays jobs of all users.
  • Finished jobs disappear from the list.
  • Command qstat -F threads shows how many threads are available:
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
main.q@runner01                BIP   0/1/2          0.00     lx26-amd64    
	hc:threads=1
    238 0.25000 sleeper.pl bbrejova     r     03/05/2020 13:12:28     1        
---------------------------------------------------------------------------------
main.q@runner02                BIP   0/1/2          0.00     lx26-amd64    
    237 0.75000 sleeper.pl bbrejova     r     03/05/2020 13:12:13     1        
  • Command qdel deletes a job (waiting or running)

Interactive work on the cluster (qrsh), screen

Command qrsh creates a job which is a normal interactive shell running on the cluster

  • In this shell you can manually run commands
  • When you close the shell, the job finishes
    • Therefore it is a good idea to run qrsh within screen
    • Run the screen command; this creates a new shell
    • Within this shell, run qrsh, then run whatever commands you need on the remote server
    • By pressing Ctrl-a d you "detach" the screen, so that both shells (local and qrsh) keep running, but you can close your local window
    • Later, run screen -r to get back to your shells
    • See more details in the separate section

Running many small jobs

For example, we may need to run some computation for each human gene (there are roughly 20,000 such genes). Here are some possibilities:

  • Run a script which iterates through all jobs and runs them sequentially
    • Problems: This does not use parallelism, needs more programming to restart after some interruption
  • Submit processing of each gene as a separate job to cluster (submitting done by a script/one-liner)
    • Jobs can run in parallel on many different computers
    • Problems: the queue gets very long, it is hard to monitor progress, and it is hard to resubmit only the unfinished jobs after a failure.
  • Array jobs in qsub (option -t): runs jobs numbered 1,2,3...; number of the current job is in an environment variable, used by the script to decide which gene to process
    • Queue contains only running sub-jobs plus one line for the remaining part of the array job.
    • After failure, you can resubmit only unfinished portion of the interval (e.g. start from job 173).
  • Next: using make in which you specify how to process each gene and submit a single make command to the queue
    • Make can execute multiple tasks in parallel using several threads on the same computer (qsub array jobs can run tasks on multiple computers)
    • It will automatically skip tasks which are already finished, so restart is easy
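The array-job idea above can be sketched as follows. The file genes.txt and the processing command are placeholders; SGE_TASK_ID is the environment variable Sun Grid Engine sets for each sub-job, and here we set it by hand to simulate one sub-job locally:

```shell
# Hypothetical setup: genes.txt lists one gene ID per line.
printf 'geneA\ngeneB\ngeneC\n' > genes.txt
cat > process_gene.sh <<'EOF'
#!/bin/bash
# SGE sets SGE_TASK_ID to 1,2,3,... in the sub-jobs of an array job
GENE=$(sed -n "${SGE_TASK_ID}p" genes.txt)  # take line number SGE_TASK_ID
echo "processing $GENE" > "$GENE.out"       # placeholder for the real computation
EOF
chmod +x process_gene.sh
SGE_TASK_ID=2 ./process_gene.sh             # simulate sub-job 2 locally
# On the cluster (not runnable locally):
#   qsub -b y -cwd -t 1-20000 ./process_gene.sh      # submit all sub-jobs
#   qsub -b y -cwd -t 173-20000 ./process_gene.sh    # resubmit the unfinished tail
```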

Make

Make is a system for automatically building programs (running compiler, linker etc)

  • In particular, we will use GNU make
  • Rules for compilation are written in a file typically called Makefile
  • Rather complex syntax with many features, we will only cover basics

Rules

  • The main part of a Makefile are rules specifying how to generate target files from some source files (prerequisites).
  • For example the following rule generates file target.txt by concatenating files source1.txt and source2.txt:
target.txt : source1.txt source2.txt
      cat source1.txt source2.txt > target.txt
  • The first line names the target and its prerequisites and starts in the first column
  • The following lines list commands to execute to create the target
  • Each line with a command starts with a tab character
  • If we have a directory containing this rule in a file called Makefile, together with files source1.txt and source2.txt, running make target.txt will run the cat command
  • However, if target.txt already exists, the command will be run only if one of the prerequisites has more recent modification time than the target
  • This allows restarting interrupted computations or rerunning the necessary parts after some input files are modified
  • make automatically chains the rules as necessary:
    • If we run make target.txt and some prerequisite does not exist, make checks whether it can be created by some other rule and runs that rule first.
    • In general, make first finds all the necessary steps and runs them in an appropriate order so that each rule has its prerequisites ready.
    • Running make -n target shows which commands would be executed to build target (a dry run) - a good idea before running something potentially dangerous.
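The chaining behaviour can be seen in a toy two-rule Makefile (the filenames are invented for the example). Asking for final.txt first builds the missing prerequisite mid.txt; a second run finds everything up to date and does nothing:

```shell
# Two chained rules; recipe lines must start with a tab character.
cat > Makefile <<'EOF'
final.txt : mid.txt
	cat mid.txt mid.txt > final.txt

mid.txt : base.txt
	tr a-z A-Z < base.txt > mid.txt
EOF
echo hello > base.txt
make final.txt   # runs the tr rule, then the cat rule
make final.txt   # reports that final.txt is up to date
```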

Pattern rules

We can specify a general rule for files with a systematic naming scheme. For example, to create a .pdf file from a .tex file, we use the pdflatex command:

%.pdf : %.tex
      pdflatex $^
  • In the first line, % denotes some variable part of the filename, which has to agree in the target and all prerequisites.
  • In commands, we can use several variables:
    • Variable $^ contains the names of the prerequisites (source).
    • Variable $@ contains the name of the target.
    • Variable $* contains the string matched by %.
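A minimal demonstration of these automatic variables (demo.txt is an invented input file):

```shell
# Pattern rule using the automatic variables; the recipe lines start with a tab.
cat > Makefile <<'EOF'
%.upper : %.txt
	tr a-z A-Z < $^ > $@
	@echo "stem was $*"
EOF
printf 'abc\n' > demo.txt
make demo.upper   # builds demo.upper from demo.txt and prints "stem was demo"
```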

Other useful tricks in Makefiles

Variables

Store some reusable values in variables, then use them several times in the Makefile:

MYPATH := /projects/trees/bin

target : source
       $(MYPATH)/script < $^ > $@

Wildcards, creating a list of targets from files in the directory

The following Makefile automatically creates .png version of each .eps file simply by running make:

EPS := $(wildcard *.eps)
EPSPNG := $(patsubst %.eps,%.png,$(EPS))

all:  $(EPSPNG)

clean:
        rm $(EPSPNG)

%.png : %.eps
        convert -density 250 $^ $@
  • Variable EPS contains names of all files matching *.eps.
  • Variable EPSPNG contains names of desired .png files.
    • It is created by taking filenames in EPS and changing .eps to .png
  • all is a "phony target" which is not really created
    • Its rule has no commands, but all the .png files are its prerequisites, so they are built first.
    • The first target in a Makefile (here all) is the default when no target is given on the command line.
  • clean is also a phony target, used for deleting the generated .png files.
  • Thus run make all (or just make) to create the png files, and make clean to delete them.

Useful special built-in target names

Include these lines in your Makefile if desired

.SECONDARY:
# prevents deletion of intermediate targets in chained rules

.DELETE_ON_ERROR:
# delete targets if a rule fails

Parallel make

Running make with option -j 4 will run up to 4 commands in parallel if their dependencies are already finished. This allows easy parallelization on a single computer.
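A small sketch of the effect (targets and contents invented): with -j 2 the two independent sleep commands run at the same time, so the build finishes in about one second instead of two.

```shell
# Two independent targets; recipe lines start with a tab character.
cat > Makefile <<'EOF'
all : a.txt b.txt

a.txt :
	sleep 1; echo a > a.txt

b.txt :
	sleep 1; echo b > b.txt
EOF
make -j 2 all   # both rules run in parallel, ~1 second in total
```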

Alternatives to Makefiles

  • Bioinformaticians often use "pipelines" - sequences of commands run one after another, e.g. by a script or make
  • There are many tools developed for automating computational pipelines in bioinformatics, see e.g. this review: Jeremy Leipzig; A review of bioinformatic pipeline frameworks. Brief Bioinform 2016.
  • For example Snakemake
    • Snake workflows can contain shell commands or Python code
    • Big advantage compared to make: pattern rules may contain multiple variable portions (in make only one % per filename)
    • For example, assume we have several FASTA files and several profiles (HMMs) representing protein families and we want to run each profile on each FASTA file:
rule HMMER:
     input: "{sample}.fasta", "{profile}.hmm"
     output: "{sample}_{profile}.hmmer"
     shell: "hmmsearch --domE 1e-5 --noali --domtblout {output} {input[1]} {input[0]}"

Overall advantages of using command-line one-liners and make

  • Compared to popular Python libraries, such as Pandas, many command-line utilities process files line by line and thus do not need to load big files into memory.
  • For small, easy tasks (e.g. counting lines, looking for occurrences of a specific string, inspecting the first lines of a file), it may be less effort to run a simple one-liner than to open the file in some graphical interface (Jupyter notebook, Excel) and find the answer there. This holds provided that you take the time to learn the command-line utilities.
  • Command-line commands, scripts and makefiles can be easily run in background on your computer or even on remote servers.
  • It is easy to document what exactly was run by keeping a log of the executed commands as well as any scripts used in the process.