About:

Matt is a multiple protein structure alignment program. It uses local geometry to align segments of two sets of proteins, allowing limited bends in the backbones between the segments.  If you use Matt, please cite: M. Menke, B. Berger, L. Cowen, "Matt: Local Flexibility Aids Protein Multiple Structure Alignment", 2007, preprint.

Matt is licensed under the GNU General Public License version 2.0. If you would like to license Matt in an environment where the GNU General Public License is unacceptable (such as inclusion in a non-GPL software package), commercial Matt licensing is available through the MIT and Tufts offices of Technology Transfer. Contact betawrap@csail.mit.edu or cowen@cs.tufts.edu for more information. Contact mmenke@mit.edu for issues involving the code itself.


Compilation:

To compile under Linux, simply type "make". Note that the makefile will build Matt without OpenMP support.  Building with OpenMP support requires gcc 4.2 or higher; add the "-fopenmp" switch to the compile command.  Matt has not yet been tested with OpenMP under Linux.

Microsoft Visual Studio 6.0 and 2005 project files are included.  To compile with either one, just open the corresponding project file and compile.  By default, the Visual Studio 2005 project is set to compile with OpenMP enabled.  The Express Edition does not support OpenMP, so it will report an error unless OpenMP is disabled.  The option to disable it is under Project Properties > C/C++ > Language.


Installation:

To install, simply copy the binary to the directory you want Matt to run from and type in the command to run it.  Matt needs no environment variables and does not need to be in the active directory to run properly.


Overview:

Matt takes a set of pdb files as input.  Individual chains can optionally be specified for individual source files.  Source files can be compressed in gzip or compress file formats.  The ".Z" and ".gz" extensions can optionally be left off the file name.

Up to eight files will be created.  Their names are <outprefix>.<extension> and <outprefix>_bent.<extension>, where <extension> is one of fasta, txt, pdb, or spt.  <outprefix> is specified as a command line option.  In all files, proteins are listed in the input order, except for the assembly order section of the txt files.  By default, the "bent" files will not be created.
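
As a rough illustration of the naming scheme above, the following Python sketch lists the files to expect for a given prefix.  The helper name expected_outputs is hypothetical and is not part of Matt; it only restates the naming rule.

```python
def expected_outputs(out_prefix, bent=False):
    """Return the output file names Matt is described as writing.

    Hypothetical helper: it only restates the naming rule from this README.
    """
    extensions = ["fasta", "txt", "pdb", "spt"]
    names = [f"{out_prefix}.{ext}" for ext in extensions]
    if bent:
        # The "_bent" variants are only written when bent output is enabled.
        names += [f"{out_prefix}_bent.{ext}" for ext in extensions]
    return names


if __name__ == "__main__":
    for name in expected_outputs("alignment", bent=True):
        print(name)
```

A script that post-processes Matt's results could use such a list to check that every expected file exists before continuing.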

The fasta file contains the alignment in fasta format, using periods to indicate unoccupied positions.  Only residues in the common core (i.e. those aligned across all input structures) are currently aligned.  This will be changed in a future version.

The txt file is a visual alignment of the sequences of the input structures.  It also includes the assembly order, RMSD, number of core residues, raw score, p-value (for pairwise alignments), and the reference structure.  The reference structure is the one that is untransformed in the final pass of the algorithm.  It is also left untransformed in the output pdb files.  Other than this and the fact that it's one of the two structures used to calculate rotation angles in the final alignment, the reference structure has no special significance.

The pdb files contain 3D atomic coordinates of the structural alignment.  The spt files are Rasmol/Jmol scripts that highlight aligned regions.  To run the scripts with Jmol, just open the PDB files and drag the script to the Jmol window.  With Rasmol, open the pdb file, and type "script <filename>.spt".  The core residues from each structure will be set to a different color.  The colors repeat after 10 structures.

Note that the pdb and txt files will have insertion codes in them if they were present in the pdb file.  To get rid of the codes, use the "-r" option.

The bent files contain the results generated before the final pass, which aligns the unbent structures and fills in gaps.  The bent pdb files are the output structures generated by the first phase of the algorithm.  In them, the source structures may be broken apart between different fragments.  The RMSD in the bent txt file is the RMSD of the alignment of deformed structures, so it should not be compared directly to the RMSD of other structural alignment algorithms.  The -b switch enables creating the bent files.


Multithreading:

Matt uses the OpenMP multithreading extensions when built with OpenMP enabled with a compliant compiler, such as gcc 4.2 and retail versions of Microsoft Visual Studio 2005.  Note that for compatibility reasons, the included makefile will not create a binary with OpenMP support.

Each thread works on aligning a different pair of structures, so multithreading only helps when aligning more than two structures.  Unless otherwise specified, Matt will use OpenMP's default number of threads, which is generally one thread per CPU core.  More information on the command line option that sets the thread count (-t) is below.  When run on long proteins, particularly those with a lot of self-similarity, each thread can use a fairly large amount of memory.  If Matt takes up too much memory when run on a particular set of structures, running slowly or crashing as a result, try reducing the number of threads.

Matt will report the number of threads that are created, not the number of threads that are active.  Therefore, when running a pairwise alignment on a multi-core system, it may report multiple threads, even though only one is doing any work.

Running Matt with no parameters will display version and usage information.  If built with multithreading support, the version number will be followed by "OpenMP".


Syntax:
Matt -o outprefix [-c cutoff] [-t threads] [-[brlsVd][01]]*
     [file[:chain[,chain]*]]* [-L listfile]*

Command line notes:

For options that take no space before their parameter (b, r, l, s, V, and d), giving the option with no parameter is equivalent to specifying a parameter of 1.  The order of parameters is irrelevant.  For example, "-s", which affects how pdb files are read, applies to pdb files listed both before and after it.  Multiple options can also be combined after a single hyphen, so the following two lines are equivalent:

Matt 1plu.pdb 1tsp.pdb -r1 -s -d0 -c 4.0 -o alignment
Matt -rsd0c 4.0 -o alignment 1plu.pdb 1tsp.pdb


Mandatory command line parameters:

-o outprefix:  Specifies prefix of output file names.

[file[:chain[,chain]*]]*:  Specifies the files and chain names to load.  Each chain within a single file should only be listed once.  If no chains are specified, and the file has named chains, all named chains are loaded.  If the file does not have named chains, the single unnamed chain is loaded.  Commas are required when more than one chain is specified.  A colon followed by no chain names, or two commas in a row, indicates the chain with no name.  Chain names are case sensitive.
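
The chain syntax above can be sketched as a small Python parser.  This is an illustration of the rules as written here, not Matt's own parsing code: parse_input_spec is a hypothetical name, and the sketch assumes the chain list follows the last colon in the argument, so it would misread file paths that themselves contain a colon.

```python
def parse_input_spec(spec):
    """Split 'file[:chain[,chain]*]' into (file_name, chains).

    chains is None when no chain list is given (all chains are loaded).
    An empty string in the list stands for the chain with no name.
    """
    if ":" not in spec:
        return spec, None
    # Assumption: the chain list follows the last colon in the argument.
    file_name, chain_list = spec.rsplit(":", 1)
    # "".split(",") yields [""], so a bare trailing colon selects the
    # unnamed chain, and "A,,B" yields ["A", "", "B"].
    chains = chain_list.split(",")
    if len(set(chains)) != len(chains):
        raise ValueError(f"chain listed more than once in {spec!r}")
    return file_name, chains


if __name__ == "__main__":
    for example in ["1plu.pdb", "1tsp.pdb:A,B", "1tsp.pdb:", "1tsp.pdb:A,,b"]:
        print(example, "->", parse_input_spec(example))
```

Because chain names are case sensitive, the sketch compares them exactly, so "A" and "a" count as different chains.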

-L listfile:  Specifies a file containing a list of pdb files to load.  Each line of the file must specify a file.  Individual chains can optionally be specified using the same syntax as above.  Leading and trailing white space is ignored.  Blank lines are allowed.
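
For scripted runs, a list file in the format described above could be produced and checked with a sketch like the following.  The function names and the file name structures.txt are hypothetical; the reader mirrors the stated rules (one entry per line, surrounding white space ignored, blank lines allowed) and is not Matt's own code.

```python
def write_list_file(path, entries):
    """Write one 'file[:chain[,chain]*]' entry per line."""
    with open(path, "w") as handle:
        for entry in entries:
            handle.write(entry.strip() + "\n")


def read_list_file(path):
    """Read entries back, skipping blank lines and surrounding white space."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    write_list_file("structures.txt", ["1plu.pdb", "1tsp.pdb:A"])
    print(read_list_file("structures.txt"))
```

The resulting file would then be passed on the command line as "Matt -o alignment -L structures.txt".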


Optional command line parameters:

-b[01]:  Disables or enables creation of bent files.  Disabled by default.

-c <cutoff>:  Sets the distance cutoff value, in angstroms, for the final pass, which fills in some of the gaps.  Cutoff can be any non-negative floating point number.  This does not affect any of the bent files.  The default value is 5.0 angstroms, which is what was used in the paper.  A value of 0 prevents the last pass from running at all.  Note that there should be a space after the c.  P-values are calculated before the final extension pass, so the cutoff does not affect reported p-values.

-r[01]:  Disables or enables renumbering of all residues in all proteins.  Each protein will start from residue 1 and all residues will be numbered consecutively.  Insertion codes will be removed.  Note that all loaded residues will be given a number, so if SEQRES entries are loaded or some residues have no alpha carbons, the first residue used in the alignment may not be residue 1.  Disabled by default.
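
The effect of renumbering can be illustrated with a short Python sketch.  It is a simplified model of the behavior described above, not Matt's implementation: each residue is represented as a hypothetical (number, insertion code) pair, and every loaded residue receives the next consecutive number.

```python
def renumber(residues):
    """Map (number, insertion_code) pairs to consecutive numbers from 1.

    Simplified model: every loaded residue gets a number, in order,
    and the insertion code is dropped from the new numbering.
    """
    return {residue: index for index, residue in enumerate(residues, start=1)}


if __name__ == "__main__":
    # Residues 52, 52A, 52B, 53 from a hypothetical source file.
    original = [(52, ""), (52, "A"), (52, "B"), (53, "")]
    for old, new in renumber(original).items():
        print(old, "->", new)
```

In this model, residue 52B becomes residue 3 and residue 53 becomes residue 4, which is why renumbered output no longer needs insertion codes.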

-l[01]:  Disables or enables renaming chains in the output pdb files.  When enabled, chains are first labeled by capital letters, then numbers, then symbols, then lowercase letters, and then the pattern repeats when there are over 90 chains.  The limit is due to the fact that chains in a pdb file can only have a single character label.  Chains are labeled according to the order they're specified in the command line.  Enabled by default.

-s[01]:  Disables or enables reading SEQRES lines in source pdb files.  When enabled, the program tries to align residues in the ATOM entries to residues in the SEQRES entries.  This allows detecting gaps between residues that would otherwise be assumed to be adjacent.  Fragments cannot cross over regions with no alpha carbon coordinates.  Note that residues with ATOM entries but no alpha carbon coordinates will always be loaded.  -s also affects residue renumbering if -r is set.  Enabled by default.

-V[01]:  Sets verbosity of feedback to stdout.  A value of 0 will only display errors and warnings, and 1 will display a list of chains as they are loaded.  Default value is 1.

-d[01]:  Disables or enables sending current progress to stderr.  Enabled by default.

-t <thread count>:  Sets the number of threads Matt uses.  If not specified, Matt will use OpenMP's default number of threads, which is implementation dependent, though it is generally the number of threads a system is capable of running simultaneously.  When not compiled with OpenMP support, a warning will be displayed and the option will be ignored.  Note that there should be a space between t and the number of threads.
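
When Matt is driven from a script, the spacing rules for the options above are easy to get wrong.  The following Python sketch assembles an argument list from them.  The function matt_command is a hypothetical wrapper, it assumes the binary is invoked as "Matt", and it only uses options documented in this file.

```python
def matt_command(out_prefix, inputs, cutoff=None, threads=None, **toggles):
    """Build an argument list for Matt from the documented options.

    toggles maps a single-letter on/off option (b, r, l, s, V, d)
    to True (1) or False (0).
    """
    args = ["Matt", "-o", out_prefix]
    if cutoff is not None:
        args += ["-c", str(cutoff)]   # -c takes a space before its value
    if threads is not None:
        args += ["-t", str(threads)]  # -t takes a space before its value
    for flag, enabled in toggles.items():
        if flag not in ("b", "r", "l", "s", "V", "d"):
            raise ValueError(f"unknown toggle: {flag}")
        # On/off options take 0 or 1 with no space in between.
        args.append(f"-{flag}{1 if enabled else 0}")
    return args + list(inputs)


if __name__ == "__main__":
    print(matt_command("alignment", ["1plu.pdb", "1tsp.pdb"],
                       cutoff=4.0, threads=2, r=True, d=False))
```

The returned list could be handed to subprocess.run from the Python standard library.  Lowering the threads argument is the simplest way to act on the memory advice in the Multithreading section.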


Notes:

Matt will list residues with no alpha-carbon coordinates in its sequence alignments, but will not align them.  The recommended way to unambiguously determine which residue in the alignment files corresponds to which entry in the created pdb files is to enable renumbering, keeping in mind that Matt includes HETATM entries with alpha carbons in the alignment.

Matt currently makes no effort to align residues not in the common core in the sequence alignments it produces.  This will be changed in a future version.