TreeArrange and Treeps: User guide
----------------------------------
The current version of this software can be downloaded from
http://monod.uwaterloo.ca/downloads/treearrange/


1. Introduction
---------------

Treeps is a tool for displaying expression array data, optionally
together with an associated hierarchical clustering, in the form of
encapsulated postscript files.

TreeArrange is a program that reorders the leaves of a hierarchical
clustering tree to place similar leaves together. For details of the
reordering algorithms see [1].

Both programs are command-line programs that can be controlled by a
number of switches. This makes them especially suitable for processing 
multiple inputs. 

Encapsulated postscript files created by Treeps can be included in
other documents. The output of Treeps can also be viewed in a
suitable postscript viewer (free postscript viewers include gv and
Ghostview). Such viewers often allow the user to zoom in on different
parts of the picture, which can substitute for the interactive
features provided by the TreeView program by Michael Eisen. Files
produced by TreeArrange can also be viewed in TreeView.

TreeArrange and Treeps input files can be produced for example 
by the following programs:
 - Cluster by Michael Eisen (for Windows)
     http://rana.lbl.gov/EisenSoftware.htm
 - XCluster by Gavin Sherlock (for Windows, Unix, Macintosh)
     http://genome-www.stanford.edu/~sherlock/cluster.html


2. Installation
---------------

a) Installation for Linux and UNIX platforms
--------------------------------------------
TreeArrange and Treeps are distributed as a .tgz file containing
source code. The source code is written in GNU C++, so it should be
easy to compile on major UNIX platforms (so far it has been tested on
Linux only). In the future we may prepare binary distributions as
well.

To install the software on a UNIX platform, follow these steps:

1. Unpack the source distribution. 

   gunzip treearrange.tgz
   tar xf treearrange.tar

   This will create a new directory treearrange.

2. Compile the source code. Be sure to use GNU make (on some platforms
   it is installed under a different name, for example gmake).

   cd treearrange
   make

3. If everything goes well, the compilation produces two files,
   Treeps and TreeArrange, in the treearrange directory. 
   Copy these two files to your favourite location for binary
   executable files (e.g. /usr/local/bin, ~/bin, etc.).

You can now try a sample run on the files provided in the source
distribution (the sample files were created by the program XCluster):

1. Change to the directory treearrange/sample

2. Run (this assumes that TreeArrange and Treeps are in your executable 
   search path; otherwise add the path):

   TreeArrange sample reordered
   Treeps reordered out.eps

3. Now you can view the file out.eps in your favourite postscript viewer
   (e.g. ghostview or gv).
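
Because both programs are driven entirely from the command line, the
sample run above extends naturally to many data sets. The following
Python sketch shows one way to do this. The function names, the
"_reordered" suffix and the directory layout are assumptions made for
this illustration and are not part of the distribution; only the two
command lines themselves follow this guide.

```python
import subprocess
from pathlib import Path


def build_commands(dataset, out_dir, method="I"):
    """Return the two command lines for one data set (name without .cdt)."""
    dataset = Path(dataset)
    reordered = Path(out_dir) / (dataset.name + "_reordered")
    eps = Path(out_dir) / (dataset.name + ".eps")
    return [
        ["TreeArrange", "-m", method, str(dataset), str(reordered)],
        ["Treeps", str(reordered), str(eps)],
    ]


def run_all(cdt_dir, out_dir):
    """Run both programs on every .cdt file found in cdt_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cdt in sorted(Path(cdt_dir).glob("*.cdt")):
        # Both programs expect the data set name without the .cdt suffix.
        for cmd in build_commands(cdt.with_suffix(""), out_dir):
            subprocess.run(cmd, check=True)  # stops at the first failure
```

Calling run_all("data", "out") would then process every data set in the
(hypothetical) directory data, assuming TreeArrange and Treeps are in
your executable search path.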


b) Installation for Windows platforms
-------------------------------------
Executable files for Windows are provided in a zipped format. 




3. Input and output files
-------------------------

Both programs accept input files in the format produced by the Cluster
program written by Michael Eisen. Each data set consists of at most
three files:

<name>.cdt contains the expression data themselves
<name>.gtr contains the hierarchical clustering of genes
<name>.atr contains the hierarchical clustering of experiments

The atr file is optional. It is ignored by Treeps and is simply copied by
TreeArrange. The gtr file is optional for Treeps but is required by
most of the methods in TreeArrange (all except T, see section 4).

It is assumed that the user first uses the Cluster program to preprocess
the data and to produce the cdt, gtr and atr files. Other programs using
the same output format can be used as well. A cdt file can also be easily
created in a spreadsheet (see the format description below).

TreeArrange produces output in the same format as its input, creating
files with a different name. The output cdt file is the original cdt file
with its lines reordered. The gtr and atr files stay the same.

Treeps outputs encapsulated postscript (an .eps file).

3.1 Format of cdt file
----------------------

A cdt file is a text file with entries separated by tabs. Each line
contains data for one gene and each column corresponds to one
experiment. Unknown values are left empty. The first several rows and
columns have special meaning. You can create such a file by loading the
data into a spreadsheet, adding the required special rows and columns and
then saving the table as tab-separated text. Note that the programs are
case sensitive: use keywords in capital letters as shown below.

There are one, two, or three rows with special meaning. All other rows
contain data for genes. The first row always contains column names.
The first several columns have special meanings and fixed names that
have to be used (see below). Columns with experimental data can be
named arbitrarily, for example by the meaning or identifier of the
experiment. The first row can optionally be followed by a row
containing weights of individual experiments. This row starts with
the word EWEIGHT. Also optional is a row containing experiment
identifiers. This row starts with the word AID. Experiment identifiers are
created by the Cluster program automatically and are not used by our programs. 

There are at most 4 columns with special meaning. The name of the
first column should be GID and this column should contain a gene
identifier. This is a special identifier added by the Cluster program when
doing hierarchical clustering to identify leaves of the hierarchical
tree. If you do not use hierarchical clustering this column can be
omitted. The second column (or the first if there is no GID) should be
named UNIQID and it should contain a unique identifier for each gene. You
can use any kind of identifier as long as no two rows share the same
one. The following column should be named NAME and it should contain any
description of the gene you want to display. It does not need to be
unique. This column can also be safely omitted. The last special
column is called GWEIGHT and it contains weights of genes. It is not
used by the programs and can be safely omitted.

Empty lines and lines starting with word REMARK are ignored.

Some examples of tables:
GID       UNIQID      NAME        GWEIGHT EXPERIMENT1 EXPERIMENT2
EWEIGHT                                       1.0        2.5
GENE2X    SID112179   EST T91987     1.5      0.5        0.4
GENE1X    SID112153   EST N5632      2.0     -0.1        0.2
GENE0X    SID314213   EST FF3456     0.5      0.1        2.6


GID       UNIQID      EXPERIMENT1 EXPERIMENT2
GENE2X    SID112179       0.5        0.4
GENE1X    SID112153      -0.1        0.2
GENE0X    SID314213       0.1        2.6

These two tables only illustrate the layout as it would appear in a
spreadsheet; actual cdt files need to be tab separated.
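
To make the rules above concrete, here is a minimal Python sketch of a
cdt reader. It is not the parser used by TreeArrange or Treeps; the
function name read_cdt and the returned data structure are assumptions
made for this illustration. It skips empty lines and REMARK lines,
treats the leading GID, UNIQID, NAME and GWEIGHT columns as special,
ignores the EWEIGHT and AID rows, and stores empty cells as None.

```python
import csv
import io

SPECIAL_COLUMNS = ("GID", "UNIQID", "NAME", "GWEIGHT")
SPECIAL_ROWS = ("EWEIGHT", "AID")


def read_cdt(stream):
    """Parse a cdt table from an open text stream.

    Returns (experiments, genes). Each gene is a dict holding the special
    columns that are present plus a "values" list (None = missing value).
    """
    header = None
    first_data = 0
    genes = []
    for row in csv.reader(stream, delimiter="\t"):
        if not row or not any(cell.strip() for cell in row):
            continue  # empty line
        if row[0].startswith("REMARK"):
            continue
        if header is None:
            header = row
            # Special columns come first; data columns follow them.
            while first_data < len(header) and header[first_data] in SPECIAL_COLUMNS:
                first_data += 1
            continue
        if row[0] in SPECIAL_ROWS:
            continue  # weights and experiment ids are not needed here
        row = row + [""] * (len(header) - len(row))  # pad short lines
        gene = dict(zip(header[:first_data], row[:first_data]))
        gene["values"] = [float(c) if c.strip() else None
                          for c in row[first_data:len(header)]]
        genes.append(gene)
    experiments = header[first_data:] if header else []
    return experiments, genes


# Usage with a small in-memory table (second gene has one missing value).
sample = ("GID\tUNIQID\tEXPERIMENT1\tEXPERIMENT2\n"
          "GENE2X\tSID112179\t0.5\t0.4\n"
          "GENE1X\tSID112153\t\t0.2\n")
experiments, genes = read_cdt(io.StringIO(sample))
```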


3.2 Format of gtr file
----------------------

Each line of a gtr file describes one internal node of the hierarchical
clustering tree. It contains four fields separated by whitespace.
The first field is a unique identifier of this node and the next two
fields are the unique identifiers of its two children. If a child is a
leaf (i.e. a gene), its identifier is the GID from the GID column of the
cdt file. For internal nodes new identifiers are introduced. The last
field contains the length of an edge, given as the average correlation
between the two clusters (i.e. a number between -1 and 1). Example:

NODE1X	     GENE2X  GENE1X 0.99
NODE2X	     GENE0X  NODE1X 0.98
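
The following Python sketch shows how such a file can be read into a
tree and how the leaves below a node can be listed, which is the set of
genes a subtree contains. The function names are assumptions made for
this illustration and do not correspond to the programs' source code.

```python
def read_gtr(lines):
    """Map each internal node id to (left child, right child, correlation)."""
    tree = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        node, left, right, corr = fields
        tree[node] = (left, right, float(corr))
    return tree


def find_root(tree):
    """The root is the only internal node that never appears as a child."""
    children = {c for left, right, _ in tree.values() for c in (left, right)}
    roots = [n for n in tree if n not in children]
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return roots[0]


def leaves_in_order(tree, node):
    """List the GIDs below node, left to right, without recursion."""
    order, stack = [], [node]
    while stack:
        current = stack.pop()
        if current in tree:
            left, right, _ = tree[current]
            stack.extend([right, left])  # left is popped first
        else:
            order.append(current)  # not an internal node, so a gene
    return order


example = ["NODE1X\t     GENE2X  GENE1X 0.99",
           "NODE2X\t     GENE0X  NODE1X 0.98"]
tree = read_gtr(example)
root = find_root(tree)
```

Swapping the two children of any internal node changes the leaf order
but not the tree itself; orderings obtained this way are what this
guide calls orderings consistent with the tree.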


4. TreeArrange command line options
-----------------------------------

TreeArrange accepts the following command line parameters:

    TreeArrange [options] <input> <output>

<input> and <output> are the names of the input and output data sets.
They do not include the .cdt suffix, but they do include the path if the
files are not in the current directory. It is assumed that 
<input>.cdt exists and that <input>.gtr and <input>.atr optionally exist.
<output> is similarly the name for the output files. We recommend using a
different name in order to preserve the original files. The program will 
generate <output>.cdt and, if input gtr and atr files were supplied,
also <output>.gtr and <output>.atr.

Options:
(each option is followed by a space and a value)

-m (possible values are O, I, W, R, T)
   Method for reordering to use (default: I)
     O: optimal reordering consistent with the tree
     I: as O, but with improvements saving running time
     W: order by average expression level (see [2])
        consistently with the tree
     R: random ordering consistent with the tree
     T: order heuristically ignoring the tree constraint
        ("Travelling Salesman" 2-OPT heuristic)
   All methods except T require a hierarchical tree in the gtr file

-d (possible values P, U, E)
   Distance measure used to compare genes (default: P)
     P: Pearson correlation (centred)
     U: Pearson correlation (uncentred)
     E: Euclidean distance
 
-i (possible values are positive integers)          
   Number of iterations for T heuristic (default: 30)
   (more iterations mean longer running time but a slightly better
    chance of getting a better result)

For a more detailed explanation of methods and distance measures see
[1].
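
For orientation, the three measures selectable with -d can be sketched
in Python using their standard textbook definitions. This is only an
illustration: how TreeArrange treats missing values and how it turns a
correlation into a distance is not described in this guide, so the
choices below (skipping experiments in which either gene has no value,
and returning the raw correlation) are assumptions. See [1] for the
definitions actually used.

```python
import math


def _paired(x, y):
    """Keep only positions where both genes have a value (None = missing)."""
    return [(a, b) for a, b in zip(x, y) if a is not None and b is not None]


def pearson(x, y, centred=True):
    """Pearson correlation; centred=False skips subtracting the means."""
    pairs = _paired(x, y)
    if not pairs:
        raise ValueError("no experiments in common")
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n if centred else 0.0
    my = sum(b for _, b in pairs) / n if centred else 0.0
    num = sum((a - mx) * (b - my) for a, b in pairs)
    den = math.sqrt(sum((a - mx) ** 2 for a, _ in pairs)
                    * sum((b - my) ** 2 for _, b in pairs))
    if den == 0:
        raise ValueError("correlation undefined for a constant profile")
    return num / den


def euclidean(x, y):
    """Euclidean distance over the experiments both genes share."""
    return math.sqrt(sum((a - b) ** 2 for a, b in _paired(x, y)))
```

The centred and uncentred variants differ only in whether each
profile's mean is subtracted first, so two profiles with the same shape
but a different offset correlate perfectly under P but not necessarily
under U.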


5. Treeps command line options
------------------------------

Treeps accepts the following command line parameters:

    Treeps [options] <input> <output>

<input> has the same meaning as in TreeArrange. <output> is the name of the 
encapsulated postscript file (including the .eps or .ps suffix).

Options: 
(each option is followed by a space and a value. The list below shows
 each option with the format of its value and a short description. 
 For more details see below.)

-t 0|1             Display tree on/off (default: 1)
-l 0|1             Display gene labels on/off (default: 1)
-c 0|1             Display gene groups on/off (default: 1)
-d 0|1             Display node labels on/off (default: 0)
-s <num>,<num>     Size of one cell should be <num>x<num>pt
-S <num>,<num>     Size of the map thumbnail should be <num>x<num>pt
                   (1in=72pt)
 
-b <num>           Start with <num>th gene
-e <num>           End with <num>th gene
-f <node name>     Display only subtree rooted in given node
 
-p <R>,<G>,<B>     Colour for positive values (default: 255,0,0)
-n <R>,<G>,<B>     Colour for negative values (default: 0,255,0)
-z <R>,<G>,<B>     Colour for zero values (default: 0,0,0)
-m <R>,<G>,<B>     Colour for missing values (default: 100,100,100)
-a <num>           Contrast (positive number; default: 3)
-P <filename>      Filename with palette for type colours


Options -t, -l, -c, -d switch the display of certain features on and off. 
Value 0 means switch off, 1 means switch on. These features are:
- hierarchical clustering tree for genes (trees for experiments
  are never displayed),
- gene labels (i.e. names of genes or other material included 
  in the NAME column). If the NAME column is missing, the UNIQID column is
  used. Parts of the NAME column between dollar signs are ignored (see
  gene groups).
- gene groups - users can specify several groups of genes they want
  to highlight by a colour bar. Genes in one group may constitute a 
  cluster or they can be dispersed all around (e.g. genes with one function
  etc.). The group of a gene is given in the NAME column between dollar
  signs. Groups are numbered from 0. E.g. if the NAME column contains the
  string "My favourite $3$gene", this gene will belong to group number 3
  and the string $3$ will be removed, i.e. the displayed name will be 
  "My favourite gene". The group number can be inside the 
  gene name, at the beginning as in "$0$My gene", or at the end as in
  "My gene $6$". Genes without a group number are marked in white instead
  of with a colour bar.
- node labels are unique identifiers of internal nodes of the hierarchical
  clustering tree. These identifiers are not nicely aligned and are not meant
  to appear in the final output. However, you need to know the identifier
  of a certain node if you want to use the -f option. Therefore we recommend
  first running with the -d option to find the label and then using the -f
  option with this label.
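
The gene group convention can be illustrated with a short Python
sketch that splits a NAME entry into its group number and the label
that remains. The function name split_group is an assumption made for
this example, and trimming leading and trailing spaces from the label
is a choice made here, not documented behaviour of Treeps.

```python
import re

# A group tag is a non-negative integer enclosed in dollar signs, e.g. $3$.
GROUP_TAG = re.compile(r"\$(\d+)\$")


def split_group(name):
    """Return (group number or None, label with the $n$ tag removed)."""
    match = GROUP_TAG.search(name)
    if match is None:
        return None, name
    label = name[:match.start()] + name[match.end():]
    return int(match.group(1)), label.strip()
```

A script like this can be handy for checking which genes fall into
which group before producing the picture.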


Options -s and -S determine the size of the picture. Postscript can be
rendered at an arbitrary resolution and can therefore be enlarged without
loss of quality (provided your tools handle postscript well). However, it
is convenient to produce the postscript at the right size directly. Both
options take a value consisting of two numbers (width and height)
separated by a comma (no spaces). Both lengths are given in points,
where one inch is 72 points. The -s option gives the size of one cell of
the colour map of gene expressions, -S gives the size of the entire map.
Use only one of these options. The hierarchical tree has a fixed width.
The font size for gene labels is set according to the height of the cell.
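
The relation between the two options is simple arithmetic, sketched
below in Python. The sketch assumes that the map has one column per
experiment and one row per displayed gene and that the map size
excludes the tree and the gene labels; the function names are
hypothetical and the exact rounding done by Treeps is not documented
here.

```python
POINTS_PER_INCH = 72


def cell_size_for_map(map_w, map_h, n_experiments, n_genes):
    """Cell size in points if a map_w x map_h map is split evenly."""
    return map_w / n_experiments, map_h / n_genes


def map_size_for_cell(cell_w, cell_h, n_experiments, n_genes):
    """Map size in points for a given cell size."""
    return cell_w * n_experiments, cell_h * n_genes


def to_inches(points):
    """Convert a length in points to inches."""
    return points / POINTS_PER_INCH
```

For example, under these assumptions a data set with 20 experiments
and 500 genes drawn with -s 10,8 gives a map of 200 by 4000 points,
which is a very tall picture; in such cases -S or a gene range selected
with -b and -e may be more practical.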

Options -b, -e and -f determine which genes to display (if they are
not specified, all genes are displayed). -f displays the subtree rooted
at the node with the given node identifier. To find the appropriate node
identifier, use the -d option to view identifiers. -b and -e allow you to
choose the first and last gene to be displayed. Genes are numbered from
0 in the order in which they appear in the cdt file. If the cdt file was
reordered by TreeArrange, you need to use the positions in the new 
cdt file. Genes in the specified range do not need to be in one subtree.

Options -p, -n, -z, -m specify the colours of the colour map. Each colour
is given as three numbers separated by commas (no spaces). Each number is
between 0 and 255. These three values describe the red, green and blue
components of the colour. For example, 255,0,0 is red, 0,0,255 is blue,
255,255,255 is white, 0,0,0 is black etc. Option -p specifies the colour
for positive values, -n for negative values, -z for zero values, -m
for missing values. For example, if we specify -p as red and -z as
black, then zero will be black and positive numbers will range from
black to red, with more intense red for higher values. 

Option -a specifies contrast. It is a positive real number. 
The higher this number, the more the zero colour prevails. 
If the values in your data are close to zero and your diagram is too
black, use a lower contrast value. If your diagram is too red/green,
use a higher contrast value to bring in more black areas. 
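
The exact formula used by Treeps is not given in this guide. The
Python sketch below shows one plausible mapping that behaves as
described: the colour is blended linearly from the zero colour towards
the positive or negative colour, and values whose magnitude reaches the
contrast are drawn in the full colour. Both the linear blend and the
saturation point are assumptions, as are the function names; the
default colours and contrast are the documented defaults.

```python
def blend(c0, c1, t):
    """Linear blend of two RGB triples; t=0 gives c0, t=1 gives c1."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def cell_colour(value, contrast=3.0,
                positive=(255, 0, 0), negative=(0, 255, 0),
                zero=(0, 0, 0), missing=(100, 100, 100)):
    """Colour of one cell; value None stands for a missing measurement."""
    if value is None:
        return missing
    # Assumed scaling: |value| >= contrast saturates the colour.
    t = min(abs(value) / contrast, 1.0)
    return blend(zero, positive if value > 0 else negative, t)
```

With this mapping a larger contrast makes every non-zero value darker,
which matches the advice above to lower the contrast when the diagram
is too black.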

Finally, option -P allows you to specify the colours to use for gene
groups. Put these colours in one file. This file should contain one
colour on each line. The colour on the first line is the colour for
group 0, the colour on the second line is the colour for group 1, etc.
Each colour is given by three numbers (as in the -p, -n, -z etc.
arguments). The only difference is that these numbers should be
separated by spaces, not commas. The three numbers can be followed by a
comment on the same line. For example:
255 0 0  //group 0 will be red
0 255 0  //group 1 will be green
0 0 255  //group 2 will be blue
255 244 0 //group 3 will be yellow
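
A palette file in this format can be checked before use with a small
Python sketch such as the following. The function name read_palette is
hypothetical; skipping blank lines and requiring whitespace between the
third number and the comment are assumptions made here, not documented
behaviour of Treeps.

```python
def read_palette(lines):
    """Return a list of (R, G, B) tuples; list index = group number."""
    palette = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or incomplete line (assumed to be skipped)
        # Only the first three fields are read; the rest is a comment.
        rgb = tuple(int(f) for f in fields[:3])
        if any(not 0 <= c <= 255 for c in rgb):
            raise ValueError("colour component out of range: %r" % (rgb,))
        palette.append(rgb)
    return palette
```

Reading the four example lines above with this function gives a list
of four colours, so group numbers 0 to 3 can be used in the NAME
column.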


References
----------

1. Therese Biedl, Brona Brejova, Erik D. Demaine, Angele M. Hamel, Tomas
   Vinar. Optimal Arrangement of Leaves in the Tree Representing
   Hierarchical Clustering of Gene Expression Data. Technical Report
   CS-2001-14, Dept. of Computer Science, University of Waterloo, April
   2001. 
   http://monod.uwaterloo.ca/papers/expanded.php3?paper=2001004

2. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster
   Analysis and Display of Genome-Wide Expression
   Patterns. Proceedings of the National Academy of Sciences of the
   U.S.A., 95(25):14863-14868, 1998.


