1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "HWmake"
(30 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
<!-- NOTEX --> | <!-- NOTEX --> | ||
− | See also the [[ | + | See also the [[Lmake|lecture]] |
<!-- /NOTEX --> | <!-- /NOTEX --> | ||
Line 7: | Line 7: | ||
* A '''phylogenetic tree''' is a tree showing evolutionary history of these species. Leaves are the present-day species, internal nodes are their common ancestors. | * A '''phylogenetic tree''' is a tree showing evolutionary history of these species. Leaves are the present-day species, internal nodes are their common ancestors. | ||
* The '''input''' contains sequences of all proteins from each species (we will use only a smaller subset) | * The '''input''' contains sequences of all proteins from each species (we will use only a smaller subset) | ||
− | * The process is typically split into | + | * The process is typically split into several stages shown below |
====Identify ortholog groups==== | ====Identify ortholog groups==== | ||
− | ''Orthologs'' are proteins from different species that "correspond" to each other. Orthologs are found based on sequence similarity and we can use a tool called [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download blast] to identify sequence similarities between pairs of proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species | + | ''Orthologs'' are proteins from different species that "correspond" to each other. Orthologs are found based on sequence similarity and we can use a tool called [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download blast] to identify sequence similarities between pairs of proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species. |
====Align proteins on each group==== | ====Align proteins on each group==== | ||
Line 50: | Line 50: | ||
</pre> | </pre> | ||
− | ====Build phylogenetic tree for each | + | ====Build phylogenetic tree for each group==== |
− | For each alignment, we build a phylogenetic tree for this group. We will use a program called | + | For each alignment, we build a phylogenetic tree for this group. We will use a program called [http://www.iqtree.org/ IQ-tree]. |
Example of a phylogenetic tree in newick format: | Example of a phylogenetic tree in newick format: | ||
Line 64: | Line 64: | ||
</pre> | </pre> | ||
− | [[Image:Tree.png|center|thumb|200px|Tree for gene O60568 (note: this particular tree does not agree well with real evolutionary history)]] | + | [[Image:Tree.png|center|thumb|200px|Tree for gene O60568 (note: this particular tree does not agree well with real evolutionary history)]] |
====Build a consensus tree==== | ====Build a consensus tree==== | ||
− | The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so the trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using a tool called Phylip. The output is a single consensus tree. | + | The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so the trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using a tool called [https://phylipweb.github.io/phylip/ Phylip]. The output is a single consensus tree. |
− | |||
− | |||
===Files and submitting=== | ===Files and submitting=== | ||
− | |||
− | |||
− | |||
− | |||
Our goal is to build a pipeline that automates the whole task using make and execute it remotely using <tt>qsub</tt>. Most of the work is already done, only small modifications are necessary. | Our goal is to build a pipeline that automates the whole task using make and execute it remotely using <tt>qsub</tt>. Most of the work is already done, only small modifications are necessary. | ||
− | |||
* Submit by copying requested files to <tt>/submit/make/username/</tt> | * Submit by copying requested files to <tt>/submit/make/username/</tt> | ||
* Do not forget to submit protocol, outline of the protocol is in <tt>/tasks/make/protocol.txt</tt> | * Do not forget to submit protocol, outline of the protocol is in <tt>/tasks/make/protocol.txt</tt> | ||
− | |||
− | Start by copying | + | |
+ | Start by copying folder <tt>/tasks/make</tt> to your home folder | ||
<syntaxhighlight lang="bash"> | <syntaxhighlight lang="bash"> | ||
cp -ipr /tasks/make . | cp -ipr /tasks/make . | ||
Line 91: | Line 84: | ||
</syntaxhighlight> | </syntaxhighlight> | ||
− | The | + | The folder contains three subfolders: |
* <tt>large</tt>: a larger sample of proteins for task A | * <tt>large</tt>: a larger sample of proteins for task A | ||
* <tt>tiny</tt>: a very small set of proteins for task B | * <tt>tiny</tt>: a very small set of proteins for task B | ||
Line 98: | Line 91: | ||
===Task A (long job)=== | ===Task A (long job)=== | ||
− | * In this task, you will run a long alignment job (more than two hours) | + | * In this task, you will run a long alignment job (more than two hours). |
− | * | + | * Go to folder <tt>large</tt>, which contains the following files: |
** <tt>ref.fa</tt>: selected human proteins | ** <tt>ref.fa</tt>: selected human proteins | ||
** <tt>other.fa</tt>: selected proteins from 8 other mammalian species | ** <tt>other.fa</tt>: selected proteins from 8 other mammalian species | ||
** <tt>Makefile</tt>: runs blast on <tt>ref.fa</tt> vs <tt>other.fa</tt> (also formats database <tt>other.fa</tt> before that) | ** <tt>Makefile</tt>: runs blast on <tt>ref.fa</tt> vs <tt>other.fa</tt> (also formats database <tt>other.fa</tt> before that) | ||
− | * | + | * Run command <tt>make -n > make-n.txt; cat make-n.txt</tt> to see what commands will be done (you should see <tt>makeblastdb</tt>, <tt>blastp</tt>, and <tt>echo</tt> for timing). |
− | + | * Run <tt>qsub</tt> with appropriate options to run <tt>make</tt> (at least <tt>-cwd -b y</tt>). | |
− | * | + | * Then run <tt>qstat > queue.txt ; cat queue.txt </tt> |
− | + | ** The output should show your jobs waitning or running | |
− | |||
− | |||
− | * | ||
− | ** | ||
− | |||
− | |||
− | |||
− | |||
* When your job finishes, check the following files: | * When your job finishes, check the following files: | ||
** the output file <tt>ref.blast</tt> | ** the output file <tt>ref.blast</tt> | ||
** standard output from the <tt>qsub</tt> job, which is stored in a file named e.g. <tt>make.oX</tt> where <tt>X</tt> is the number of your job. The output shows the time when your job started and finished (this information was written by commands <tt>echo</tt> in the <tt>Makefile</tt>) | ** standard output from the <tt>qsub</tt> job, which is stored in a file named e.g. <tt>make.oX</tt> where <tt>X</tt> is the number of your job. The output shows the time when your job started and finished (this information was written by commands <tt>echo</tt> in the <tt>Makefile</tt>) | ||
− | + | * '''Submit''' the following files: | |
− | * '''Submit''' the last 100 lines from <tt>ref.blast</tt> under the name <tt>ref-end.blast</tt> (use tool <tt>tail -n 100</tt>) | + | ** File <tt>queue.txt</tt> showing your job waiting or running in the queue |
− | + | ** File <tt>make.oX</tt> with output of your job | |
+ | ** The last 100 lines from <tt>ref.blast</tt> under the name <tt>ref-end.blast</tt> (use tool <tt>tail -n 100</tt>) | ||
===Task B (finishing Makefile)=== | ===Task B (finishing Makefile)=== | ||
− | * In this task, you will finish a <tt>Makefile</tt> for splitting blast results into ortholog groups and building phylogenetic trees for each group | + | * In this task, you will finish a <tt>Makefile</tt> for splitting blast results into ortholog groups and building phylogenetic trees for each group. |
** This <tt>Makefile</tt> works with much smaller files and so you can run it quickly many times without <tt>qsub</tt> | ** This <tt>Makefile</tt> works with much smaller files and so you can run it quickly many times without <tt>qsub</tt> | ||
− | * Work in | + | * Work in folder <tt>tiny</tt> |
** <tt>ref.fa</tt>: 2 human proteins | ** <tt>ref.fa</tt>: 2 human proteins | ||
** <tt>other.fa</tt>: a selected subset of proteins from 8 other mammalian species | ** <tt>other.fa</tt>: a selected subset of proteins from 8 other mammalian species | ||
** <tt>Makefile</tt>: a longer makefile | ** <tt>Makefile</tt>: a longer makefile | ||
− | ** <tt>brm.pl</tt>: a Perl script for finding ortholog groups and sorting them to | + | ** <tt>brm.pl</tt>: a Perl script for finding ortholog groups and sorting them to folders |
− | The <tt>Makefile</tt> runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3 | + | The <tt>Makefile</tt> runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3. |
* If you run <tt>make</tt> without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing | * If you run <tt>make</tt> without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing | ||
* Stage 1: run as <tt>make ref.brm</tt> | * Stage 1: run as <tt>make ref.brm</tt> | ||
− | ** It runs <tt>blast</tt> as in task A, then splits proteins into ortholog groups and creates one | + | ** It runs <tt>blast</tt> as in task A, then splits proteins into ortholog groups and creates one folder for each group with file <tt>prot.fa</tt> containing protein sequences |
* Stage 2: run as <tt>make alignments</tt> | * Stage 2: run as <tt>make alignments</tt> | ||
− | ** In each | + | ** In each folder with an ortholog group, it will create an alignment <tt>prot.mfa</tt>. |
* Stage 3: run as <tt>make trees</tt> (needs to be written by you) | * Stage 3: run as <tt>make trees</tt> (needs to be written by you) | ||
− | ** In each | + | ** In each folder with an ortholog group, it should create files <tt>LG.treefile</tt> and <tt>JTT.treefile</tt> containing the results of the <tt>IQ-tree</tt> program run with two different evolutionary models LG and JTT+G4. |
− | ** | + | ** Create a rule that will create <tt>LG.treefile</tt> from <tt>prot.mfa</tt> in one gene folder. Use <tt>%</tt> for the folder name, similarly as rules to make alignments. Within the rule, you can use <tt>$*</tt> variable to get the name of this folder. |
− | ** Change <tt> | + | ** Run IQ-tree by commands of the form:<br><tt>iqtree2 -redo --seqtype AA -s GENE_FOLDER/prot.mfa -m LG -o opossum --seed 42 --prefix GENE_FOLDER/LG -n 50</tt> |
− | ** Also add variables <tt>LG_TREES</tt> and <tt> | + | ** Change <tt>GENE_FOLDER</tt> to the appropriate filename of the folder using <tt>make</tt> variables. |
+ | ** IQ-tree will create several output files starting with <tt>LG</tt> in the folder with the gene. | ||
+ | ** When this is done, create a similar rule except replace LG by JTT in the file names and change evolutionary model in the command by replacing <tt>-m LG</tt> option with <tt>-m JTT+G4</tt>. | ||
+ | ** Also add variables <tt>LG_TREES</tt> and <tt>JTT_TREES</tt> to the top of the Makefile listing filenames of all desirable trees and uncomment phony target <tt>trees</tt> which uses these variables. | ||
* Stage 4: run as <tt>make consensus</tt> | * Stage 4: run as <tt>make consensus</tt> | ||
− | ** Output trees from stage 3 are concatenated for each model separately to files <tt>lg/intree</tt>, <tt> | + | ** Output trees from stage 3 are concatenated for each model separately to files <tt>lg/intree</tt>, <tt>jtt/intree</tt> and then <tt>phylip</tt> is run to produce consensus trees <tt>lg.tree</tt> and <tt>jtt.tree</tt> |
− | ** This stage also needs variables <tt>LG_TREES</tt> and <tt> | + | ** This stage also needs variables <tt>LG_TREES</tt> and <tt>JTT_TREES</tt> to be defined by you. |
+ | |||
+ | * Run your <tt>Makefile</tt> and check that the files <tt>lg.tree</tt> and <tt>jtt.tree</tt> are produced, as well as their ASCII-art "images" <tt>lg.tree.txt</tt> and <tt>jtt.tree.txt</tt>. | ||
+ | * Instructions for submitting see after Task C (you can submit B even if you do not do C). | ||
+ | |||
+ | ===Task C (adding statistics to Makefile)=== | ||
+ | |||
+ | IQ-tree also produced files named <tt>LG.iqtree</tt> and <tt>JTT.iqtree</tt> in each gene folder. These contain various statistics about the run of the program. We want to compare these values across genes and models, so we want to extract them to two tables in csv format. Namely, we are interested in two lines from the <tt>*.iqtree</tt> files: | ||
+ | * Line starting "Log-likelihood of the tree" which gives a (negative) number saying how well the data fit the model. Higher number means a better fit. We can compare these numbers to see which model fits the data better. | ||
+ | * Line starting "Total CPU time used" which gives computation time in seconds. We can see if the computation time is different for different models. | ||
+ | |||
+ | * Uncomment the line adding a new goal to the <tt>Makefile</tt> named <tt>stats</tt>. | ||
+ | * This goal has two subgoals producing files <tt>stats-time.tsv</tt> and <tt>stats-likelihood.tsv</tt>. | ||
+ | * Write a rule for each of these subgoals producing the desired file. | ||
+ | |||
+ | To gather appropriate lines from all files and also know which file they come from, you can use grep command of the form<br> | ||
+ | <tt>grep '^Log-likelihood of the tree:' gene_Q*/*.iqtree</tt><br> | ||
+ | Its output will look like this: | ||
+ | <pre> | ||
+ | gene_Q6ZWJ1_human/JTT.iqtree:Log-likelihood of the tree: -4475.9510 (s.e. 121.3816) | ||
+ | gene_Q6ZWJ1_human/LG.iqtree:Log-likelihood of the tree: -4648.0902 (s.e. 130.9257) | ||
+ | gene_Q8WUG5_human/JTT.iqtree:Log-likelihood of the tree: -4097.1127 (s.e. 64.7212) | ||
+ | gene_Q8WUG5_human/LG.iqtree:Log-likelihood of the tree: -4144.8778 (s.e. 67.7092) | ||
+ | </pre> | ||
+ | You should reformat the output to <tt>stats-likelihood.tsv</tt> that looks like this (the columns are separated with tabs <tt>"\t"</tt>): | ||
+ | <pre> | ||
+ | Q6ZWJ1_human JTT -4475.9510 | ||
+ | Q6ZWJ1_human LG -4648.0902 | ||
+ | Q8WUG5_human JTT -4097.1127 | ||
+ | Q8WUG5_human LG -4144.8778 | ||
+ | </pre> | ||
+ | * To do reformatting, you can use perl one-liners or sed/awk directly in the Makefile or write your own script in Perl or Python and call it from Makefile. | ||
+ | * If you write one-lines in Makefile, you have to replace <tt>$</tt> with <tt>$$</tt> (e.g. <tt>$$F[0]</tt> to get the first column in a Perl one-liner). | ||
− | + | File <tt>stats-time.tsv</tt> should be similar, but it should contain time in seconds in the first column as below. Your time measurements may differ. | |
− | + | <pre> | |
− | + | Q6ZWJ1_human JTT 24.6512 | |
− | < | + | Q6ZWJ1_human LG 6.47912 |
+ | Q8WUG5_human JTT 28.7618 | ||
+ | Q8WUG5_human LG 7.47675 | ||
+ | </pre> | ||
+ | === Submitting tasks B and C === | ||
+ | * '''Submit''' the whole folder <tt>tiny</tt> (using <tt>cp -ri tiny /submit/make/username/</tt>), including <tt>Makefile</tt>, scripts you wrote of task C (if any), all gene folders with tree files and the resulting combined trees and statistics files. | ||
− | ===Task | + | ===Task D (running make, final obersvations)=== |
− | * Copy your <tt>Makefile</tt> | + | * Copy your <tt>Makefile</tt> and any scripts you wrote in tasks B and C to folder <tt>small</tt>, which contains 9 human proteins and run <tt>make</tt> on this slightly larger set. |
** Again, run it without <tt>qsub</tt>, but it will take some time, particularly if the server is busy | ** Again, run it without <tt>qsub</tt>, but it will take some time, particularly if the server is busy | ||
− | * Look at the two resulting trees (<tt> | + | * Look at the two resulting trees (<tt>jtt.tree.txt</tt>, <tt>lg.tree.txt</tt>) |
− | |||
− | |||
− | < | ||
− | |||
− | |||
− | |||
− | |||
** Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter. | ** Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter. | ||
− | ** Do the two trees differ? What is the highest and lowest support for a branch in each tree? | + | ** Do the two trees differ from each other? If yes, how? |
− | ** | + | ** What is the highest and lowest support for a branch in each tree? |
− | < | + | ** Do your trees differ from the accepted "correct tree" shown below? If yes, how? |
− | ** Write '''your observations to the protocol''' | + | * Run <tt>make stats</tt> and look at the resulting tsv files. |
− | + | ** Which model produced higher likelihoods? | |
− | < | + | ** Which model runs faster? |
+ | * Write '''your observations to the protocol''' and '''submit''' the entire <tt>small</tt> folder. | ||
+ | |||
+ | The correct tree extracted from a larger tree at the [http://genome-euro.ucsc.edu/images/phylo/hg38_100way.png UCSC genome browser]: | ||
+ | <pre> | ||
+ | +-------human | ||
+ | +---------------| | ||
+ | | +-------baboon | ||
+ | +-------| | ||
+ | | | +---------------rabbit | ||
+ | | +-------| | ||
+ | | | +-------rat | ||
+ | +-------| +-------| | ||
+ | | | +-------guineapig | ||
+ | | | | ||
+ | +-------| | +-------pig | ||
+ | | | +-----------------------| | ||
+ | | | +-------dog | ||
+ | | | | ||
+ | | +---------------------------------------elephant | ||
+ | | | ||
+ | +-----------------------------------------------opossum | ||
+ | </pre> | ||
===Further possibilities=== | ===Further possibilities=== | ||
− | |||
Here are some possibilities for further experiments, in case you are interested (do not submit these): | Here are some possibilities for further experiments, in case you are interested (do not submit these): | ||
− | + | * You could copy your extended <tt>Makefile</tt> to folder <tt>large</tt> and create trees for all ortholog groups in the big set. | |
− | + | ** This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A. | |
− | |||
− | |||
− | * You could copy your extended <tt>Makefile</tt> to | ||
− | |||
− | ** This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A | ||
− | |||
** After <tt>ref.brm</tt> si done, programs for individual genes can be run in parallel, so you can try running <tt>make -j 2</tt> and request 2 threads from <tt>qsub</tt> | ** After <tt>ref.brm</tt> si done, programs for individual genes can be run in parallel, so you can try running <tt>make -j 2</tt> and request 2 threads from <tt>qsub</tt> | ||
− | * | + | * IQ-tree also supports many other models (for more information, see [http://www.iqtree.org/doc/Substitution-Models documentation]); you could try to play with those. Or run IQ-tree without <tt>-m</tt> when it tries to find the best model for your input. |
− | * Command <tt>touch FILENAME</tt> will change the modification time of the given file to the current time | + | * We were looking at the trees in text mode but you can create nice figures of the trees using <tt>figtree</tt> program. It is available on vyuka, but you can also [https://github.com/rambaut/figtree/releases install it] on your computer if interested. |
+ | * Command <tt>touch FILENAME</tt> will change the modification time of the given file to the current time. | ||
** What happens when you run <tt>touch</tt> on some of the intermediate files in the analysis in task B? Does <tt>Makefile</tt> always run properly? | ** What happens when you run <tt>touch</tt> on some of the intermediate files in the analysis in task B? Does <tt>Makefile</tt> always run properly? |
Latest revision as of 17:25, 21 March 2024
See also the lecture
Contents
Motivation: Building phylogenetic trees
The task for today will be to build a phylogenetic tree of 9 mammalian species using protein sequences
- A phylogenetic tree is a tree showing evolutionary history of these species. Leaves are the present-day species, internal nodes are their common ancestors.
- The input contains sequences of all proteins from each species (we will use only a smaller subset)
- The process is typically split into several stages shown below
Identify ortholog groups
Orthologs are proteins from different species that "correspond" to each other. Orthologs are found based on sequence similarity and we can use a tool called blast to identify sequence similarities between pairs of proteins. The result of ortholog group identification will be a set of groups, each group having one sequence from each of the 9 species.
Align proteins on each group
For each ortholog group, we need to align proteins in the group to identify corresponding parts of the proteins. This is done by a tool called muscle
Unaligned sequences (start of protein O60568):
>human MTSSGPGPRFLLLLPLLLPPAASASDRPRGRDPVNPEKLLVITVA... >baboon MTSSRPGLRLLLLLLLLPPAASASDRPRGRDPVNPEKLLVMTVA... >dog MASSGPGLRLLLGLLLLLPPPPATSASDRPRGGDPVNPEKLLVITVA... >elephant MASWGPGARLLLLLLLLLLPPPPATSASDRSRGSDRVNPERLLVITVA... >guineapig MAFGAWLLLLPLLLLPPPPGACASDQPRGSNPVNPEKLLVITVA... >opossum SDKLLVITAA... >pig AMASGPGLRLLLLPLLVLSPPPAASASDRPRGSDPVNPDKLLVITVA... >rabbit MGCDSRKPLLLLPLLPLALVLQPWSARGRASAEEPSSISPDKLLVITVA... >rat MAASVPEPRLLLLLLLLLPPLPPVTSASDRPRGANPVNPDKLLVITVA...
Aligned sequences:
rabbit MGCDSRKPLL LLPLLPLALV LQPW-SARGR ASAEEPSSIS PDKLLVITVA ... guineapig MAFGA----W LLLLPLLLLP PPPGACASDQ PRGSNP--VN PEKLLVITVA ... opossum ---------- ---------- ---------- ---------- SDKLLVITAA ... rat MAASVPEPRL LLLLLLLLPP LPPVTSASDR PRGANP--VN PDKLLVITVA ... elephant MASWGPGARL LLLLLLLLLP PPPATSASDR SRGSDR--VN PERLLVITVA ... human MTSSGPGPRF LLLLPLLL-- -PPAASASDR PRGRDP--VN PEKLLVITVA ... baboon MTSSRPGLRL LLLLLLL--- -PPAASASDR PRGRDP--VN PEKLLVMTVA ... dog MASSGPGLRL LLGLLLLL-P PPPATSASDR PRGGDP--VN PEKLLVITVA ... pig AMASGPGLR- LLLLPLLVLS PPPAASASDR PRGSDP--VN PDKLLVITVA ...
Build phylogenetic tree for each group
For each alignment, we build a phylogenetic tree for this group. We will use a program called IQ-tree.
Example of a phylogenetic tree in newick format:
((opossum:0.09636245,rabbit:0.85794020):0.05219782, (rat:0.07263127,elephant:0.03306863):0.01043531, (dog:0.01700528,(pig:0.02891345, (guineapig:0.14451043, (human:0.01169266,baboon:0.00827402):0.02619598 ):0.00816185):0.00631423):0.00800806);
Build a consensus tree
The result of the previous step will be several trees, one for every group. Ideally, all trees would be identical, showing the real evolutionary history of the 9 species. But it is not easy to infer the real tree from sequence data, so the trees from different groups might differ. Therefore, in the last step, we will build a consensus tree. This can be done by using a tool called Phylip. The output is a single consensus tree.
Files and submitting
Our goal is to build a pipeline that automates the whole task using make and execute it remotely using qsub. Most of the work is already done, only small modifications are necessary.
- Submit by copying requested files to /submit/make/username/
- Do not forget to submit protocol, outline of the protocol is in /tasks/make/protocol.txt
Start by copying folder /tasks/make to your home folder
cp -ipr /tasks/make .
cd make
The folder contains three subfolders:
- large: a larger sample of proteins for task A
- tiny: a very small set of proteins for task B
- small: a slightly larger set of proteins for task C
Task A (long job)
- In this task, you will run a long alignment job (more than two hours).
- Go to folder large, which contains the following files:
- ref.fa: selected human proteins
- other.fa: selected proteins from 8 other mammalian species
- Makefile: runs blast on ref.fa vs other.fa (also formats database other.fa before that)
- Run command make -n > make-n.txt; cat make-n.txt to see what commands will be done (you should see makeblastdb, blastp, and echo for timing).
- Run qsub with appropriate options to run make (at least -cwd -b y).
- Then run qstat > queue.txt ; cat queue.txt
- The output should show your jobs waitning or running
- When your job finishes, check the following files:
- the output file ref.blast
- standard output from the qsub job, which is stored in a file named e.g. make.oX where X is the number of your job. The output shows the time when your job started and finished (this information was written by commands echo in the Makefile)
- Submit the following files:
- File queue.txt showing your job waiting or running in the queue
- File make.oX with output of your job
- The last 100 lines from ref.blast under the name ref-end.blast (use tool tail -n 100)
Task B (finishing Makefile)
- In this task, you will finish a Makefile for splitting blast results into ortholog groups and building phylogenetic trees for each group.
- This Makefile works with much smaller files and so you can run it quickly many times without qsub
- Work in folder tiny
- ref.fa: 2 human proteins
- other.fa: a selected subset of proteins from 8 other mammalian species
- Makefile: a longer makefile
- brm.pl: a Perl script for finding ortholog groups and sorting them to folders
The Makefile runs the analysis in four stages. Stages 1,2 and 4 are done, you have to finish stage 3.
- If you run make without argument, it will attempt to run all 4 stages, but stage 3 will not run, because it is missing
- Stage 1: run as make ref.brm
- It runs blast as in task A, then splits proteins into ortholog groups and creates one folder for each group with file prot.fa containing protein sequences
- Stage 2: run as make alignments
- In each folder with an ortholog group, it will create an alignment prot.mfa.
- Stage 3: run as make trees (needs to be written by you)
- In each folder with an ortholog group, it should create files LG.treefile and JTT.treefile containing the results of the IQ-tree program run with two different evolutionary models LG and JTT+G4.
- Create a rule that will create LG.treefile from prot.mfa in one gene folder. Use % for the folder name, similarly as rules to make alignments. Within the rule, you can use $* variable to get the name of this folder.
- Run IQ-tree by commands of the form:
iqtree2 -redo --seqtype AA -s GENE_FOLDER/prot.mfa -m LG -o opossum --seed 42 --prefix GENE_FOLDER/LG -n 50 - Change GENE_FOLDER to the appropriate filename of the folder using make variables.
- IQ-tree will create several output files starting with LG in the folder with the gene.
- When this is done, create a similar rule except replace LG by JTT in the file names and change evolutionary model in the command by replacing -m LG option with -m JTT+G4.
- Also add variables LG_TREES and JTT_TREES to the top of the Makefile listing filenames of all desirable trees and uncomment phony target trees which uses these variables.
- Stage 4: run as make consensus
- Output trees from stage 3 are concatenated for each model separately to files lg/intree, jtt/intree and then phylip is run to produce consensus trees lg.tree and jtt.tree
- This stage also needs variables LG_TREES and JTT_TREES to be defined by you.
- Run your Makefile and check that the files lg.tree and jtt.tree are produced, as well as their ASCII-art "images" lg.tree.txt and jtt.tree.txt.
- Instructions for submitting see after Task C (you can submit B even if you do not do C).
Task C (adding statistics to Makefile)
IQ-tree also produced files named LG.iqtree and JTT.iqtree in each gene folder. These contain various statistics about the run of the program. We want to compare these values across genes and models, so we want to extract them to two tables in csv format. Namely, we are interested in two lines from the *.iqtree files:
- Line starting "Log-likelihood of the tree" which gives a (negative) number saying how well the data fit the model. Higher number means a better fit. We can compare these numbers to see which model fits the data better.
- Line starting "Total CPU time used" which gives computation time in seconds. We can see if the computation time is different for different models.
- Uncomment the line adding a new goal to the Makefile named stats.
- This goal has two subgoals producing files stats-time.tsv and stats-likelihood.tsv.
- Write a rule for each of these subgoals producing the desired file.
To gather appropriate lines from all files and also know which file they come from, you can use grep command of the form
grep '^Log-likelihood of the tree:' gene_Q*/*.iqtree
Its output will look like this:
gene_Q6ZWJ1_human/JTT.iqtree:Log-likelihood of the tree: -4475.9510 (s.e. 121.3816) gene_Q6ZWJ1_human/LG.iqtree:Log-likelihood of the tree: -4648.0902 (s.e. 130.9257) gene_Q8WUG5_human/JTT.iqtree:Log-likelihood of the tree: -4097.1127 (s.e. 64.7212) gene_Q8WUG5_human/LG.iqtree:Log-likelihood of the tree: -4144.8778 (s.e. 67.7092)
You should reformat the output to stats-likelihood.tsv that looks like this (the columns are separated with tabs "\t"):
Q6ZWJ1_human JTT -4475.9510 Q6ZWJ1_human LG -4648.0902 Q8WUG5_human JTT -4097.1127 Q8WUG5_human LG -4144.8778
- To do reformatting, you can use perl one-liners or sed/awk directly in the Makefile or write your own script in Perl or Python and call it from Makefile.
- If you write one-lines in Makefile, you have to replace $ with $$ (e.g. $$F[0] to get the first column in a Perl one-liner).
File stats-time.tsv should be similar, but it should contain time in seconds in the first column as below. Your time measurements may differ.
Q6ZWJ1_human JTT 24.6512 Q6ZWJ1_human LG 6.47912 Q8WUG5_human JTT 28.7618 Q8WUG5_human LG 7.47675
Submitting tasks B and C
- Submit the whole folder tiny (using cp -ri tiny /submit/make/username/), including Makefile, scripts you wrote of task C (if any), all gene folders with tree files and the resulting combined trees and statistics files.
Task D (running make, final obersvations)
- Copy your Makefile and any scripts you wrote in tasks B and C to folder small, which contains 9 human proteins and run make on this slightly larger set.
- Again, run it without qsub, but it will take some time, particularly if the server is busy
- Look at the two resulting trees (jtt.tree.txt, lg.tree.txt)
- Note that the two children of each internal node are equivalent, so their placement higher or lower in the figure does not matter.
- Do the two trees differ from each other? If yes, how?
- What is the highest and lowest support for a branch in each tree?
- Do your trees differ from the accepted "correct tree" shown below? If yes, how?
- Run make stats and look at the resulting tsv files.
- Which model produced higher likelihoods?
- Which model runs faster?
- Write your observations to the protocol and submit the entire small folder.
The correct tree extracted from a larger tree at the UCSC genome browser:
+-------human +---------------| | +-------baboon +-------| | | +---------------rabbit | +-------| | | +-------rat +-------| +-------| | | +-------guineapig | | +-------| | +-------pig | | +-----------------------| | | +-------dog | | | +---------------------------------------elephant | +-----------------------------------------------opossum
Further possibilities
Here are some possibilities for further experiments, in case you are interested (do not submit these):
- You could copy your extended Makefile to folder large and create trees for all ortholog groups in the big set.
- This would take a long time, so submit it through qsub and only some time after the lecture is over to allow classmates to work on task A.
- After ref.brm si done, programs for individual genes can be run in parallel, so you can try running make -j 2 and request 2 threads from qsub
- IQ-tree also supports many other models (for more information, see documentation); you could try to play with those. Or run IQ-tree without -m when it tries to find the best model for your input.
- We were looking at the trees in text mode but you can create nice figures of the trees using figtree program. It is available on vyuka, but you can also install it on your computer if interested.
- Command touch FILENAME will change the modification time of the given file to the current time.
- What happens when you run touch on some of the intermediate files in the analysis in task B? Does Makefile always run properly?