DESCAL - Descriptor Alignment
Prototype implementation for the paper
"OB-Fold Recognition Combining Sequence and Structural Motifs"
Martin Macko, Martin Kralik, Brona Brejova, Tomas Vinar
contact: vinar@fmph.uniba.sk

Remote protein homology detection is an important step towards
understanding protein function in living organisms. The problem is
notoriously difficult; distant homologs can often be detected only by
a combination of sequence and structural features.  We propose a new
framework, where important sequence and structural features are
described by the user in the form of a descriptor, and the descriptor
is then used to search a database of protein sequences and score
potential candidates. We develop the algorithms necessary to support
such a search using support vector machines and discrete optimization
methods. We demonstrate our approach on the example of the
telomere-binding OB-fold domain, showing that we can not only
distinguish between Telo_bind family members and negatives, but
also identify proteins from related protein families carrying similar
OB-fold domains.

===
SYSTEM REQUIREMENTS:

operating system: 64-bit Linux, at least 4 GB of RAM
software required on the system:
* Java >= 1.6
* PSI-Pred: http://bioinf.cs.ucl.ac.uk/psipred/
* SVM-light: http://svmlight.joachims.org/

===
EXAMPLE OF HOW TO RUN THIS PROGRAM:

INPUT:
- descriptor file (Telo_bind.desc)
- protein sequence to which the descriptor should be aligned
  OR .ss2 file (results of PSI-Pred secondary structure prediction)
  (domain_POTE1_MACFA.fasta, domain_POTE1_MACFA.ss2)

1. Review and configure paths in DescAl.properties file
   (descriptor file is specified here as well)

2. (if you do not have .ss2 file; otherwise skip this step)
   runpsipred domain_POTE1_MACFA.fasta
   --or--
   runpsipred_single domain_POTE1_MACFA.fasta

3. Run DescAl to align the descriptor (Telo_bind.desc) to
   the sequence (file domain_POTE1_MACFA.ss2)
   java -ea -jar DescAl.jar DescAl.properties domain_POTE1_MACFA.ss2 1 1 1

OUTPUT: The test example produces two files characterizing the
alignment of the descriptor to the sequence:
- domain_POTE1_MACFA.ss2_results.txt
- domain_POTE1_MACFA.Telo_bind.desc.html
The format of the output files is described below.
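
For batch runs, step 3 above can be wrapped in a small script. The
following is only a sketch: the helper names are hypothetical, and it
assumes that the three trailing numbers are the score weights in the
order listed under PROGRAM ARGUMENTS.

```python
import subprocess


def build_descal_command(properties_file, ss2_file,
                         w_ss=1, w_motif=1, w_hbond=1, jar="DescAl.jar"):
    """Build the argument list for one DescAl run.

    Assumption: the three weights are passed positionally in the order
    secondary structure, sequence motif, hydrogen bond.
    """
    return ["java", "-ea", "-jar", jar, properties_file, ss2_file,
            str(w_ss), str(w_motif), str(w_hbond)]


def run_descal(properties_file, ss2_file, **weights):
    """Run DescAl; needs Java and DescAl.jar in the working directory."""
    cmd = build_descal_command(properties_file, ss2_file, **weights)
    # check=True raises CalledProcessError if DescAl exits with an error
    return subprocess.run(cmd, check=True)
```

Calling run_descal("DescAl.properties", "domain_POTE1_MACFA.ss2")
issues the same command as in step 3.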

===
PROGRAM ARGUMENTS
 
.properties file - specifies important parameters of the alignment
algorithm (see DESCAL.PROPERTIES FILE below)

.ss2 sequence file - output of PSI-Pred prediction software
(http://bioinf.cs.ucl.ac.uk/psipred/) containing the amino acid
sequence and posterior probabilities of secondary structure elements.
           
secondary structure segment score weight - specifies relative weight
of the secondary structure score (recommended: 1)
	
sequence motif score weight - specifies relative weight of the sequence
motif score (recommended: 1)

hydrogen bond score weight - specifies relative weight of the hydrogen
bond score (recommended: 1)
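
To inspect the .ss2 argument before running DescAl, the file can be
read with a few lines of Python. This is a sketch, not part of DescAl.
It assumes the usual PSI-Pred vertical format, in which each residue
line holds the index, the amino acid, the predicted state and three
probabilities in the order coil, helix, strand. The sample values
below are made up for illustration.

```python
from typing import List, NamedTuple


class Residue(NamedTuple):
    index: int
    amino_acid: str
    predicted: str   # C (coil), H (helix) or E (strand)
    p_coil: float
    p_helix: float
    p_strand: float


def parse_ss2(text: str) -> List[Residue]:
    """Parse residue lines of a .ss2 file; other lines are skipped."""
    residues = []
    for line in text.splitlines():
        fields = line.split()
        # Residue lines have exactly six fields and start with an index
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        residues.append(Residue(int(fields[0]), fields[1], fields[2],
                                float(fields[3]), float(fields[4]),
                                float(fields[5])))
    return residues


# Made-up two-residue example in the assumed format
SAMPLE = """# PSIPRED VFORMAT (PSIPRED V3.2)

   1 M C   0.998  0.001  0.001
   2 K E   0.100  0.050  0.850
"""
```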

===
DESCAL.PROPERTIES FILE

descriptor_file: specifies local path to descriptor file

score_matrix_file: specifies local path to the file with the scoring
matrix that is used for generating SVM samples

svm_classify_path: specifies GLOBAL path to SVM classifier program

svm_model_path_Y: specifies GLOBAL path to trained SVM model that
estimates whether two amino acids are in a parallel hydrogen bond

svm_model_path_Z: specifies GLOBAL path to trained SVM model that
estimates whether two amino acids are in an antiparallel hydrogen bond

svm_generator_classname: specifies class that is used to generate svm
samples to evaluate with svm models

identifier_prefix: specifies prefix that names of result files will
begin with

result_folder: specifies folder to which result alignments will be
saved

cutoff: specifies cutoff parameter for hydrogen bonds.  For a hydrogen
bond to be aligned to an amino acid, the logarithm of the predicted
secondary structure probability of that amino acid must be greater
than this cutoff parameter.
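
Before a run it can be useful to check that a properties file defines
all of the keys listed above. The sketch below is an illustration
only: it handles a simplified subset of the Java properties syntax
(one "key=value" or "key: value" pair per line, "#" or "!" comments),
and it assumes that every key listed above is required.

```python
REQUIRED_KEYS = [
    "descriptor_file", "score_matrix_file", "svm_classify_path",
    "svm_model_path_Y", "svm_model_path_Z", "svm_generator_classname",
    "identifier_prefix", "result_folder", "cutoff",
]


def parse_properties(text):
    """Parse simple 'key=value' / 'key: value' lines into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue  # blank line or comment
        # Split at whichever separator ('=' or ':') comes first
        positions = [p for p in (line.find("="), line.find(":")) if p != -1]
        if not positions:
            continue
        cut = min(positions)
        props[line[:cut].strip()] = line[cut + 1:].strip()
    return props


def missing_keys(props):
    """Return the documented keys that are absent from props."""
    return [key for key in REQUIRED_KEYS if key not in props]
```

For example, missing_keys(parse_properties(open("DescAl.properties").read()))
returns an empty list when all keys are present.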

===
DESCRIPTOR FILE FORMAT

The first line, of the form [segment_name|].[segment_name], contains
an ordered list of descriptor segments separated by "|". The first
letter of each segment name (A, B or C) specifies the secondary
structure corresponding to that segment (alpha helix, beta sheet or
coil).
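
The first-letter convention can be illustrated with a short sketch.
The segment names in the usage note are hypothetical, and the function
covers only the first line of a descriptor file, not the full format.

```python
STRUCTURE_BY_LETTER = {
    "A": "alpha helix",
    "B": "beta sheet",
    "C": "coil",
}


def parse_segment_line(line):
    """Split the first descriptor line into (name, structure) pairs."""
    segments = []
    for name in line.strip().split("|"):
        if not name:
            continue  # tolerate a trailing "|"
        letter = name[0]
        if letter not in STRUCTURE_BY_LETTER:
            raise ValueError("unknown segment type in %r" % name)
        segments.append((name, STRUCTURE_BY_LETTER[letter]))
    return segments
```

With hypothetical names, parse_segment_line("C1|B1|C2|A1") yields the
segments in order, each paired with its secondary structure type.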
  
For each segment, the minimal and maximal length must be set. If a
sequence motif is defined in a segment, the name of the motif is
specified as well.  After the length constraints for all segments
listed in the first line, hydrogen bonds between segments may be
defined, each on a new line. Each bond specifies its direction ('+'
for parallel, '-' for antiparallel), the names of the interacting
segments, and the number of interacting amino acid pairs.
     
After the definitions of all bonds in the descriptor, there is a
separator line
(***********LOGO_DEFINITIONS*************). 
  
Each motif defined in the segments has its length specified. A motif
is represented by a matrix whose number of rows corresponds to the
length of the motif.  The matrix has 20 columns, one per amino acid.
Each value is the logarithm of the probability that the amino acid
given by the column occurs at the motif position given by the row.
With these matrices specified, the descriptor definition is complete.
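
One natural way to use such a matrix is to sum the log-probabilities
of the residues in a sequence window of the motif's length. The sketch
below illustrates this idea only and is not necessarily how DescAl
computes its motif score. The column order of the amino acids is an
assumption (alphabetical by one-letter code); check the descriptor
file for the actual order.

```python
# Assumed column order of the 20 amino acids (one-letter codes)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def motif_score(matrix, window):
    """Sum log-probabilities of a window against a motif matrix.

    matrix: list of rows, one per motif position, each with 20 values
    window: string of the same length as the motif
    """
    if len(window) != len(matrix):
        raise ValueError("window length must equal motif length")
    total = 0.0
    for row, amino_acid in zip(matrix, window):
        # Column lookup; raises ValueError for non-standard residues
        total += row[AMINO_ACIDS.index(amino_acid)]
    return total
```

Higher (less negative) sums indicate windows that agree better with
the motif.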
  
===
OUTPUT FILE FORMATS

Text file with "_result.txt" suffix contains the score of the
resulting alignment, and also scores for each element of the scoring
scheme. It also contains the position of the alignment.  The file has
the following format:

[A] [B] [C] = [S] start descriptor: [i1] end position [i2] #[F] weights: [W_A] [W_B] [W_C]

                A       ...     secondary_structure_score
                B       ...     sequence_motifs_score
                C       ...     hydrogen_bond_score
                S       ...     sum_of_scores
                i1      ...     index_of_start_position
                i2      ...     index_of_end_position
                F       ...     input filename
                W_A     ...     weight for secondary_structure_score 
                W_B     ...     weight for sequence_motifs_score
                W_C     ...     weight for hydrogen_bond_score
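
When many sequences are scored, the result line can be collected with
a small parser. This is a sketch under assumptions: it treats the
square brackets above as placeholders that do not appear in the actual
file, and it expects plain decimal numbers. The sample line in the
usage note is made up.

```python
import re

_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"

RESULT_LINE = re.compile(
    rf"^\s*(?P<A>{_NUM})\s+(?P<B>{_NUM})\s+(?P<C>{_NUM})\s*=\s*(?P<S>{_NUM})\s+"
    rf"start descriptor:\s*(?P<i1>\d+)\s+end position:?\s*(?P<i2>\d+)\s+"
    rf"#(?P<F>\S+)\s+weights:\s*(?P<W_A>{_NUM})\s+(?P<W_B>{_NUM})\s+(?P<W_C>{_NUM})\s*$"
)


def parse_result_line(line):
    """Parse one result line into a dict keyed by the legend names."""
    match = RESULT_LINE.match(line)
    if match is None:
        raise ValueError("line does not match the expected format")
    g = match.groupdict()
    return {
        "secondary_structure_score": float(g["A"]),
        "sequence_motifs_score": float(g["B"]),
        "hydrogen_bond_score": float(g["C"]),
        "sum_of_scores": float(g["S"]),
        "start": int(g["i1"]),
        "end": int(g["i2"]),
        "filename": g["F"],
        "weights": (float(g["W_A"]), float(g["W_B"]), float(g["W_C"])),
    }
```

For instance, a made-up line such as
"-12.5 3.25 -1.0 = -10.25 start descriptor: 4 end position 137 #domain_POTE1_MACFA.ss2 weights: 1 1 1"
is parsed into the ten fields of the legend above.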

HTML file with "_result.html" suffix contains a semigraphical
representation of the alignment.
        
The alignment is represented as a table with a fixed number of 6 rows;
the number of columns depends on the length of the input sequence.
The rows have the following format:

Row 1: indices of amino acids in the input sequence
Row 2: amino acid sequence
Row 3: secondary structure specified in the input file (the highest
posterior probability secondary structure element predicted by
PSI-Pred)
Row 4: segments of the descriptor aligned to the sequence
Row 5: motifs of the descriptor aligned to the sequence
Row 6: aligned hydrogen bonds. The starting pair of linked amino acids
of each bond is marked in bold.  A skipped bond that was aligned after
computing the initial alignment is marked with the letter X
===
CONTENTS OF THIS PACKAGE

DescAl.jar ... executable archive file with descriptor searcher

DescAl.properties ... properties file that specifies settings for
DescAl algorithm 

svm_classify ... executable of SVM classifier
(http://svmlight.joachims.org/) used to evaluate possible bonds
between amino acids.

antipar_svm_polyn_7.dat ... SVM model for evaluating antiparallel
hydrogen bonds 

par_svm_polyn_7.dat ... SVM model for evaluating parallel hydrogen
bonds 

triplets_score_matrix.txt ... file containing a 20x20 score matrix that
represents the expectation of a hydrogen bond between two amino acids

Telo_bind.desc ... file with descriptor for Telo_bind domain from
OB-fold family

domain_POTE1_MACFA.fasta 
domain_POTE1_MACFA.ss2 ...  example sequence file of protein POTE1_MACFA and
corresponding output of PSI-Pred software

README.TXT ... this readme file