This is the command theseus that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator

**PROGRAM:**

**NAME**

**theseus**- Maximum likelihood, multiple simultaneous superpositions with statistical

analysis

**SYNOPSIS**

**theseus**[options]

__pdbfile1__

__[pdbfile2__

__...]__

and

**theseus_align**[options] -f

__pdbfile1__

__[pdbfile2__

__...]__

**DESCRIPTION**

**Theseus**superposes a set of macromolecular structures simultaneously using the method of

maximum likelihood (ML), rather than the conventional least-squares criterion.

**Theseus**

assumes that the structures are distributed according to a matrix Gaussian distribution

and that the eigenvalues of the atomic covariance matrix are hierarchically distributed

according to an inverse gamma distribution. This ML superpositioning model produces much

more accurate results by essentially downweighting variable regions of the structures and

by correcting for correlations among atoms.

**Theseus**operates in two main modes: (1) a mode for superimposing structures with identical

sequences and (2) a mode for structures with different sequences but similar structures:

(1) A mode for superpositioning macromolecules with identical sequences and numbers

of residues, for instance, multiple models in an NMR family or multiple structures

from different crystal forms of the same protein.

In this mode,

**Theseus**will read every model in every file on the command line and

superpose them.

Example:

**theseus**

__1s40.pdb__

In the above example,

__1s40.pdb__is a pdb file of 10 NMR models.

(2) An ``alignment'' mode for superpositioning structures with different sequences,

for example, multiple structures of the cytochrome c protein from different species

or multiple mutated structures of hen egg white lysozyme.

This mode requires the user to supply a sequence alignment file of the structures

being superpositioned (see option

**-A**and ``FILE FORMATS'' below). Additionally, it

may be necessary to supply a mapfile that tells

**theseus**which PDB structure files

correspond to which sequences in the alignment (see option

**-M**and ``FILE FORMATS''

below). The mapfile is unnecessary if the sequence names and corresponding pdb

filenames are identical. In this mode, if there are multiple structural models in

a PDB file,

**theseus**only reads the first model in each file on the command line. In

other words,

**theseus**treats the files on the command line as if there were only one

structure per file.

Example 1:

**theseus**-A cytc.aln -M cytc.filemap d1cih__.pdb d1csu__.pdb d1kyow_.pdb

In the above example, d1cih__.pdb, d1csu__.pdb, and d1kyow_.pdb are pdb files of

cytochrome c domains from the SCOP database.

Example 2:

**theseus_align**-f d1cih__.pdb d1csu__.pdb d1kyow_.pdb

In this example, the

**theseus_align**script is called to do the hard work for you.

It will calculate a sequence alignment and then superpose based on that alignment.

The script

**theseus_align**takes the same options as the

**theseus**program. Note, the

first few lines of this script must be modified for your system, since it calls an

external multiple sequence alignment program to do the alignment. See the

**examples/**directory for more details, including example files.

**OPTIONS**

**Algorithmic**

**options,**

**defaults**

**in**

**{brackets}:**

**--amber**

Do special processing for AMBER8 formatted PDB files

Most people will never need to use this long option, unless you are processing MD

traces from AMBER. AMBER puts the atom names in the wrong column in the PDB file.

**-a**

**[**

__selection__

**]**

Atoms to include in the superposition. This option takes two types of arguments,

either (1) a number specifying a preselected set of atom types, or (2) an explict

PDB-style, colon-delimited list of the atoms to include.

For the preselected atom type subsets, the following integer options are available:

· 0, alpha carbons for proteins, C1´ atoms for nucleic acids

· 1, backbone

· 2, all

· 3, alpha and beta carbons

· 4, all heavy atoms (no hydrogens)

Note, only the

**-a0**option is available when superpositioning structures with

different sequences.

To custom select an explicit set of atom types, the atom types must be specified

exactly as given in the PDB file field, including spaces, and the atom-types must

encapsulated in quotation marks. Multiple atom types must be delimited by a colon.

For example,

**-a**

**`**

**N**

**:**

**CA**

**:**

**C**

**:**

**O**

**'**

would specify the atom types in the peptide backbone.

**-f**Only read the first model of a multi-model PDB file

**-h**Help/usage

**-i**

**[**

__nnn__

**]**

Maximum iterations, {200}

**-p**

**[**

__precision__

**]**

Requested relative precision for convergence, {1e-7}

**-r**

**[**

__root__

__name__

**]**

Root name to be used in naming the output files, {theseus}

**-s**

**[**

__n-n:...__

**]**

Residue selection (e.g. -s15-45:50-55), {all}

**-S**

**[**

__n-n:...__

**]**

Residues to exclude (e.g. -S15-45:50-55) {none}

The previous two options have the same format. Residue (or alignment column) ranges

are indicated by beginning and end separated by a dash. Multiple ranges, in any

arbitrary order, are separated by a colon. Chains may also be selected by giving

the chain ID immediately preceding the residue range. For example,

**-sA1-20:A40-71**

will only include residues 1 through 20 and 40 through 70 in chain A. Chains cannot

be specified when superposing structures with different sequences.

**-v**use ML variance weighting (no correlations) {default}

**Input/output**

**options:**

**-A**

**[**

__sequence__

__alignment__

__file__

**]**

Sequence alignment file to use as a guide (CLUSTAL or A2M format)

For use when superposing structures with different sequences. See ``FILE FORMATS''

below.

**-E**Print expert options

**-F**Print FASTA files of the sequences in PDB files and quit

A useful option when superposing structures with different sequences. The files

output with this option can be aligned with a multiple sequence alignment program

such as CLUSTAL or MUSCLE, and the resulting output alignment file used as

**theseus**

input with the

**-A**option.

**-h**Help/usage

**-I**Just calculate statistics for input file; don't superpose

**-M**

**[**

__mapfile__

**]**

File that maps PDB files to sequences in the alignment.

A simple two-column formatted file; see ``FILE FORMATS'' below. Used with mode 2.

**-n**Don't write transformed pdb file

**-o**

**[**

__reference__

__structure__

**]**

Reference file to superpose on, all rotations are relative to the first model in

this file

For example, 'theseus -o cytc1.pdb cytc1.pdb cytc2.pdb cytc3.pdb' will superpose

the structures and rotate the entire final superposition so that the structure from

cytc1.pdb is in the same orientation as the structure in the original cytc1.pdb PDB

file.

**-V**Version

**Principal**

**components**

**analysis:**

**-C**Use covariance matrix for PCA (correlation matrix is default)

**-P**

**[**

__nnn__

**]**

Number of principal components to calculate {0}

In both of the above, the corresponding principal component is written in the B-

factor field of the output PDB file. Usually only the first few PCs are of any

interest (maybe up to six).

EXAMPLES

**theseus**

__2sdf.pdb__

**theseus**-l -r new2sdf

__2sdf.pdb__

**theseus**-s15-45 -P3

__2sdf.pdb__

**theseus**-A

__cytc.aln__-M

__cytc.mapfile__-o

__cytc1.pdb__-s1-40

__cytc1.pdb__

__cytc2.pdb__

__cytc3.pdb__

__cytc4.pdb__

**ENVIRONMENT**

You can set the environment variable 'PDBDIR' to your PDB file directory and

**theseus**will

look there after the present working directory. For example, in the C shell (tcsh or

csh), you can put something akin to this in your .cshrc file:

setenv PDBDIR '/usr/share/pdbs/'

**FILE** **FORMATS**

**Theseus**will read standard PDB formatted files (see <http://www.rcsb.org/pdb/>). Every

effort has been made for the program to accept nonstandard CNS and X-PLOR file formats

also.

Two other files deserve mention, a sequence alignment file and a mapfile.

**Sequence**

**alignment**

**file**

When superposing structures with different residue identities (where the lengths of each

the macromolecules in terms of residues are not necessarily equal), a sequence alignment

file must be included for

**theseus**to use as a guide (specified by the

**-A**option).

**Theseus**

accepts both CLUSTAL and A2M (FASTA) formatted multiple sequence alignment files.

NOTE 1: The residue sequence in the alignment must match exactly the residue sequence

given in the coordinates of the PDB file. That is, there can be no missing or extra

residues that do not correspond to the sequence in the PDB file. An easy way to ensure

that your sequences exactly match the PDB files is to generate the sequences using

**theseus'**

**-F**option, which writes out a FASTA formatted sequence file of the chain(s) in

the PDB files. The files output with this option can then be aligned with a multiple

sequence alignment program such as CLUSTAL or MUSCLE, and the resulting output alignment

file used as

**theseus**input with the

**-A**option.

NOTE 2: Every PDB file must have a corresponding sequence in the alignment. However, not

every sequence in the alignment needs to have a corresponding PDB file. That is, there can

be extra sequences in the alignment that are not used for guiding the superposition.

**PDB**

**->**

**Sequence**

**mapfile**

If the names of the PDB files and the names of the corresponding sequences in the

alignemnt are identical, the mapfile may be omitted. Otherwise,

**Theseus**needs to know

which sequences in the alignment file correspond to which PDB structure files. This

information is included in a mapfile with a very simple format (specified with the

**-M**

option). There are only two columns separated by whitespace: the first column lists the

names of the PDB structure files, while the second column lists the corresponding sequence

names exactly as given in the multiple sequence alignment file.

An example of the mapfile:

cytc1.pdb seq1

cytc2.pdb seq2

cytc3.pdb seq3

**SCREEN** **OUTPUT**

Theseus provides output describing both the progress of the superposing and several

statistics for the final result:

**Classical**

**LS**

**pairwise**

**<RMSD>:**

The conventional RMSD for the superposition, the average RMSD for all pairwise

combinations of structures in the ensemble.

**Least-squares**

**<sigma>:**

The standard deviation for the superposition, based on the conventional assumption

of no correlation and equal variances. Basically equal to the RMSD from the average

structure.

**Maximum**

**Likelihood**

**<sigma>:**

The ML analog of the standard deviation for the superposition. When assuming that

the correlations are zero (a diagonal covariance matrix), this is equal to the

square root of the harmonic average of the variances for each atom. In contrast,

the ``Least-squares <sigma>'' given above reports the square root of the arithmetic

average of the variances. The harmonic average is always less than the arithmetic

average, and the harmonic average downweights large values proportional to their

magnitude. This makes sense statistically, because when combining values one should

weight them by the reciprocal of their variance (which is in fact what the ML

superposing method does).

**Marginal**

**Log**

**Likelihood:**

The final marginal log likelihood of the superposition, assuming the matrix

Gaussian distribution of the structures and the hierarchical inverse gamma

distribution of the eigenvalues of the covariance matrix. The marginal log

likelihood is the likelihood with the covariance matrix integrated out.

**AIC:**The Akaike Information Criterion for the final superposition. This is an important

statistic in likelihood analysis and model selection theory. It allows an objective

comparison of multiple theoretical models with different numbers of parameters. In

this case, the higher the number the better. There is a tradeoff between fit to the

data and the number of parameters being fit. Increasing the number of parameters

in a model will always give a better fit to the data, but it also increases the

uncertainty of the estimated values. The AIC criterion finds the best combination

by (1) maximizing the fit to the data while (2) minimizing the uncertainty due to

the number of parameters. In the superposition case, one can compare the least

squares superposition to the maximum likelihood superposition. The method (or

model) with the higher AIC is preferred. A difference in the AIC of 2 or more is

considered strong statistical evidence for the better model.

**BIC:**The Bayesian Information Criterion. Similar to the AIC, but with a Bayesian

emphasis.

**Omnibus**

**chi2:**

The overall reduced chi2 statistic for the entire fit, including the rotations,

translations, covariances, and the inverse gamma parameters. This is probably the

most important statistic for the superposition. In some cases, the inverse gamma

fit may be poor, yet the overall fit is still very good. Again, it should ideally

be close to 1.0, which would indicate a perfect fit. However, if you think it is

too large, make sure to compare it to the chi2 for the least-squares fit; it's

probably not that bad after all. A large chi2 often indicates a violation of the

assumptions of the model. The most common violation is when superposing two or

more independent domains that can rotate relative to each other. If this is the

case, then there will likely be not just one Gaussian distribution, but several

mixed Gaussians, one for each domain. Then, it would be better to superpose each

domain independently.

**Hierarchical**

**var**

**(alpha,**

**gamma)**

**chi2:**

The reduced chi2 for the inverse gamma fit of the covariance matrix eigenvalues. As

before, it should ideally be close to 1.0. The two values in the parentheses are

the ML estimates of the scale and shape parameters, respectively, for the inverse

gamma distribtuion.

**Rotational,**

**translational,**

**covar**

**chi2:**

The reduced chi2 statistic for the fit of the structures to the model. With a good

fit it should be close to 1.0, which indicates a perfect fit of the data to the

statistical model. In the case of least-squares, the assumed model is a matrix

Gaussian distribution of the structures with equal variances and no correlations.

For the ML fits, the assumed model is unequal variances and no correlations, as

calculated with the

**-v**option [default]. This statistic is for the superposition

only, and does not include the fit of the covariance matrix eigenvalues to an

inverse gamma distribution. See ``Omnibus chi2'' below.

**Hierarchical**

**minimum**

**var:**

The hierarchical fit of the inverse gamma distribution constrains the variances of

the atoms by making large ones smaller and small ones larger. This statistic

reports the minimum possible variance given the inferred inverse gamma parameters.

**skewness,**

**skewness**

**Z-value,**

**kurtosis**

**&**

**kurtosis**

**Z-value:**

The skewness and kurtosis of the residuals. Both should be 0.0 if the residuals fit

a Gaussian distribution perfectly. They are followed by the P-value for the

statistics. This is a very stringent test; residuals can be very non-Gaussian and

yet the estimated rotations, translations, and covariance matrix may still be

rather accurate.

**Data**

**pts,**

**Free**

**params,**

**D/P:**

The total number of data points given all observed structures, the number of

parameters being fit in the model, and the data-to-parameter ratio.

**Median**

**structure:**

The structure that is overall most similar to the average structure. This can be

considered to be the most ``typical'' structure in the ensemble.

**Total**

**rounds:**

The number of iterations that the algorithm took to converge.

**Fractional**

**precision:**

The actual precision that the algorithm converged to.

**OUTPUT** **FILES**

Theseus writes out the following files:

**theseus_sup.pdb**

The final superposition, rotated to the principle axes of the mean structure.

**theseus_ave.pdb**

The estimate of the mean structure.

**theseus_residuals.txt**

The normalized residuals of the superposition. These can be analyzed for deviations

from normality (whether they fit a standard Gaussian distribution). E.g., the chi2,

skewness, and kurtosis statistics are based on these values.

**theseus_transf.txt**

The final transformation rotation matrices and translation vectors.

**theseus_variances.txt**

The vector of estimated variances for each atom.

When Principal Components are calculated (with the

**-P**option), the following files are

also produced:

**theseus_pcvecs.txt**

The principal component vectors.

**theseus_pcstats.txt**

Simple statistics for each principle component (loadings, variance explained,

etc.).

**theseus_pcN_ave.pdb**

The average structure with the Nth principal component written in the temperature

factor field.

**theseus_pcN.pdb**

The final superposition with the Nth principal component written in the temperature

factor field. This file is omitted when superposing molecules with different

residue sequences (mode 2).

**theseus_cor.mat,**

**theseus_cov.mat**

The atomic correlation matrix and covariance matrices, based on the final

superposition. The format is suitable for input to GNU's

**octave.**These are the

matrices used in the Principal Components Analysis.

Use theseus online using onworks.net services