DESCAL - Descriptor Alignment

Prototype implementation for the paper
"OB-Fold Recognition Combining Sequence and Structural Motifs"
Martin Macko, Martin Kralik, Brona Brejova, Tomas Vinar
contact: vinar@fmph.uniba.sk

Remote protein homology detection is an important step towards
understanding protein function in living organisms. The problem is
notoriously difficult; distant homologs can often be detected only by a
combination of sequence and structural features. We propose a new
framework, where important sequence and structural features are described
by the user in the form of a descriptor, and the descriptor is then used
to search a database of protein sequences and score potential candidates.
We develop algorithms necessary to support such search using support
vector machines and discrete optimization methods. We demonstrate our
approach on the example of the telomere-binding OB-fold domain, showing
that not only can we distinguish between Telo_bind family members and
negatives, but we also identify proteins from related protein families
carrying similar OB-fold domains.

=== SYSTEM REQUIREMENTS:

operating system: 64-bit Linux, at least 4 GB RAM

software required on the system:
 * Java >= 1.6
 * PSI-Pred: http://bioinf.cs.ucl.ac.uk/psipred/
 * SVM-light: http://svmlight.joachims.org/

=== EXAMPLE OF HOW TO RUN THIS PROGRAM:

INPUT:
 - descriptor file (Telo_bind.desc)
 - protein sequence to which the descriptor should be aligned
   (domain_POTE1_MACFA.fasta)
   OR
   .ss2 file with the results of PSI-Pred secondary structure prediction
   (domain_POTE1_MACFA.ss2)

1. Review and configure the paths in the DescAl.properties file
   (the descriptor file is specified here as well).

2. (Only if you do not have the .ss2 file; otherwise skip this step.)

     runpsipred domain_POTE1_MACFA.fasta
   --or--
     runpsipred_single domain_POTE1_MACFA.fasta

3. Run DescAl to align the descriptor (Telo_bind.desc) to the sequence
   (file domain_POTE1_MACFA.ss2):

     java -ea -jar DescAl.jar DescAl.properties domain_POTE1_MACFA.ss2 1 1 1

OUTPUT:
The test example produces two files characterizing the alignment of the
descriptor to the sequence:
 - domain_POTE1_MACFA.ss2_results.txt
 - domain_POTE1_MACFA.Telo_bind.desc.html
The format of the output files is described below.

=== PROGRAM ARGUMENTS

.properties file
 - file that specifies important parameters for the alignment algorithm

.ss2 sequence file
 - output of the PSI-Pred prediction software
   (http://bioinf.cs.ucl.ac.uk/psipred/) containing the amino acid
   sequence and posterior probabilities of secondary structure elements

secondary structure segment score weight
 - specifies the relative weight of the secondary structure score
   (recommended: 1)

sequence motif score weight
 - specifies the relative weight of the sequence motif score
   (recommended: 1)

hydrogen bond score weight
 - specifies the relative weight of the hydrogen bond score
   (recommended: 1)

=== DESCAL.PROPERTIES FILE

descriptor_file:
    specifies the local path to the descriptor file
score_matrix_file:
    specifies the local path to the file with the scoring matrix that is
    used for generating SVM samples
svm_classify_path:
    specifies the GLOBAL path to the SVM classifier program
svm_model_path_Y:
    specifies the GLOBAL path to the trained SVM model that estimates
    whether two amino acids are in a parallel hydrogen bond
svm_model_path_Z:
    specifies the GLOBAL path to the trained SVM model that estimates
    whether two amino acids are in an antiparallel hydrogen bond
svm_generator_classname:
    specifies the class that is used to generate the SVM samples that are
    evaluated with the SVM models
identifier_prefix:
    specifies the prefix that the names of result files will begin with
result_folder:
    specifies the folder to which the resulting alignments will be saved
cutoff:
    specifies the cutoff parameter for hydrogen bonds. To align a hydrogen
    bond to an amino acid, the logarithm of its predicted secondary
    structure probability must be greater than this cutoff parameter.
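
Before running DescAl it can help to check that the properties file
defines every key listed above. The following Python sketch is not part
of the package; it is a minimal illustration under two assumptions: that
all nine keys are required, and that the file uses the plain "key=value"
or "key: value" line syntax of Java properties files (escapes and line
continuations are not handled). The function names and the sample values
are hypothetical.

```python
# Keys taken from the DESCAL.PROPERTIES FILE section of this README.
# Treating all of them as mandatory is an assumption of this sketch.
REQUIRED_KEYS = [
    "descriptor_file",
    "score_matrix_file",
    "svm_classify_path",
    "svm_model_path_Y",
    "svm_model_path_Z",
    "svm_generator_classname",
    "identifier_prefix",
    "result_folder",
    "cutoff",
]


def parse_properties(text):
    """Parse simple 'key=value' / 'key: value' lines into a dict.

    Blank lines and lines starting with '#' or '!' are skipped.
    This is a simplified reader, not a full Java properties parser.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        # Split at the first '=' or ':' found in the line.
        positions = [p for p in (line.find("="), line.find(":")) if p != -1]
        if not positions:
            continue
        cut = min(positions)
        props[line[:cut].strip()] = line[cut + 1:].strip()
    return props


def missing_keys(props):
    """Return the required keys that are absent from the parsed properties."""
    return [key for key in REQUIRED_KEYS if key not in props]


if __name__ == "__main__":
    # Hypothetical fragment; the cutoff value is a placeholder, not a
    # recommended setting.
    sample = "descriptor_file=Telo_bind.desc\ncutoff: -3\n"
    print("missing:", missing_keys(parse_properties(sample)))
```

To check a real file, read it with open("DescAl.properties").read() and
pass the text to parse_properties.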
=== DESCRIPTOR FILE FORMAT

First line - [segment_name|].[segment_name]
contains a sorted list of descriptor segments, separated by "|". The
first letter of a segment name (A, B or C) specifies the secondary
structure corresponding to that segment (alpha helix, beta sheet or
coil).

For each segment, the minimal and maximal length must be set. The name of
a sequence motif is also specified if a motif is defined in that segment.

After the length constraints for all segments named in the first line,
hydrogen bonds between segments may be defined, each on a new line. Each
bond specifies its direction ('+' for parallel, '-' for antiparallel),
the names of the interacting segments and the number of interacting
amino acid pairs.

After the definition of all bonds in the descriptor, there is a separator
line (***********LOGO_DEFINITIONS*************).

Each motif defined in the segments has its length specified. A motif is
represented by a matrix whose number of rows equals the length of the
motif. The matrix has 20 columns, and each value is the logarithm of the
probability that the amino acid specified by the column occurs at a
particular position in the motif. The position corresponds to the row
number of the value. With these matrices specified, the descriptor
definition is complete.

=== OUTPUT FILE FORMATS

The text file with the "_result.txt" suffix contains the score of the
resulting alignment and the scores for each element of the scoring
scheme. It also contains the position of the alignment. The file has the
following format:

  [A] [B] [C] = [S] start descriptor: [i1] end position [i2] #[F] weights: [W_A] [W_B] [W_C]

  A   ... secondary_structure_score
  B   ... sequence_motifs_score
  C   ... hydrogen_bond_score
  S   ... sum_of_scores
  i1  ... index_of_start_position
  i2  ... index_of_end_position
  F   ... input filename
  W_A ... weight for secondary_structure_score
  W_B ... weight for sequence_motifs_score
  W_C ... weight for hydrogen_bond_score

The HTML file with the "_result.html" suffix contains a semigraphical
representation of the alignment. The alignment is represented as a table
with a fixed number of 6 rows; the number of columns depends on the
length of the input sequence. The rows have the following format:

  Row 1: indexes of the amino acids in the input sequence
  Row 2: amino acid sequence
  Row 3: secondary structure specified in the input file (the secondary
         structure element with the highest posterior probability
         predicted by PSI-Pred)
  Row 4: segments of the descriptor aligned to the sequence
  Row 5: motifs of the descriptor aligned to the sequence
  Row 6: aligned hydrogen bonds. The starting pair of linked amino acids
         of each bond is marked by bold letters. A skipped bond that is
         aligned after computing the initial alignment is marked with
         the letter X.

=== CONTENTS OF THIS PACKAGE

DescAl.jar
    ... executable archive file with the descriptor searcher
DescAl.properties
    ... properties file that specifies settings for the DescAl algorithm
svm_classify
    ... executable of the SVM classifier (http://svmlight.joachims.org/)
        used to evaluate possible bonds between amino acids
antipar_svm_polyn_7.dat
    ... SVM model for evaluating antiparallel hydrogen bonds
par_svm_polyn_7.dat
    ... SVM model for evaluating parallel hydrogen bonds
triplets_score_matrix.txt
    ... file containing a 20x20 score matrix that represents the
        expectation of a hydrogen bond between two amino acids
Telo_bind.desc
    ... file with the descriptor for the Telo_bind domain from the
        OB-fold family
domain_POTE1_MACFA.fasta
domain_POTE1_MACFA.ss2
    ... example sequence file of protein POTE1_MACFA and the
        corresponding output of the PSI-Pred software
README.TXT
    ... this readme file
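
As an illustration of the text output format described under OUTPUT FILE
FORMATS, the following Python sketch reads the fields A, B, C, S, i1, i2,
F and the three weights from the contents of a result file. It is not
part of the package and makes several assumptions: that the fields appear
in the documented order separated by arbitrary whitespace or line breaks,
that the numbers are written as plain decimals, and that the input
filename contains no whitespace. The function name parse_result and the
dictionary keys are hypothetical, and the numbers in the demo are made up.

```python
import re

# A plain decimal number, optionally signed, optionally with an exponent.
NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"

# Mirrors the layout documented in this README:
# [A] [B] [C] = [S] start descriptor: [i1] end position [i2] #[F]
# weights: [W_A] [W_B] [W_C]
RESULT_RE = re.compile(
    rf"(?P<A>{NUM})\s+(?P<B>{NUM})\s+(?P<C>{NUM})\s*=\s*(?P<S>{NUM})\s+"
    rf"start descriptor:\s*(?P<i1>\d+)\s+end position\s*(?P<i2>\d+)\s*"
    rf"#(?P<F>\S+)\s+weights:\s*"
    rf"(?P<W_A>{NUM})\s+(?P<W_B>{NUM})\s+(?P<W_C>{NUM})"
)


def parse_result(text):
    """Extract scores, positions, filename and weights from result text."""
    match = RESULT_RE.search(text)
    if match is None:
        raise ValueError("text does not match the documented result layout")
    g = match.groupdict()
    return {
        "secondary_structure_score": float(g["A"]),
        "sequence_motifs_score": float(g["B"]),
        "hydrogen_bond_score": float(g["C"]),
        "sum_of_scores": float(g["S"]),
        "start": int(g["i1"]),
        "end": int(g["i2"]),
        "input_filename": g["F"],
        "weights": (float(g["W_A"]), float(g["W_B"]), float(g["W_C"])),
    }


if __name__ == "__main__":
    # Made-up values, used only to exercise the parser.
    demo = ("1.5 2.0 -0.5 = 3.0 start descriptor: 12 end position 87 "
            "#domain_POTE1_MACFA.ss2 weights: 1 1 1")
    print(parse_result(demo))
```

If the real files deviate from the documented layout, the regular
expression needs to be adjusted accordingly.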