**PROGRAM:**

**NAME**

cmcalibrate - fit exponential tails for covariance model E-value determination

**SYNOPSIS**

**cmcalibrate** __[options]__ __cmfile__

**DESCRIPTION**

**cmcalibrate** determines exponential tail parameters for E-value determination by generating random sequences, searching them with the CM and collecting the scores of the resulting hits. A histogram of the bit scores of the hits is fit to an exponential tail, and the parameters of the fitted tail are saved to the CM file. The exponential tail parameters are then used to estimate the statistical significance of hits found in **cmsearch** and **cmscan**.
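As a rough illustration of what fitting an exponential tail involves, the sketch below keeps the highest-scoring fraction of a list of bit scores and estimates the rate lambda by maximum likelihood. It is a generic textbook estimator, not necessarily the procedure Infernal implements; the function name `fit_exponential_tail`, the `tail_fraction` parameter and the synthetic scores are all assumptions made for the example.

```python
import math
import random


def fit_exponential_tail(scores, tail_fraction):
    """Fit P(S >= x) ~ exp(-lam * (x - mu)) to the highest-scoring fraction.

    Hypothetical helper: mu is the lowest score kept in the tail, and lam is
    the maximum-likelihood rate for the excesses of the scores above mu.
    """
    ordered = sorted(scores, reverse=True)
    n_tail = int(len(ordered) * tail_fraction)
    if n_tail < 2:
        raise ValueError("need at least two scores in the tail")
    tail = ordered[:n_tail]
    mu = tail[-1]
    # Excess of every tail score over the threshold score mu.
    total_excess = sum(s - mu for s in tail[:-1])
    if total_excess <= 0:
        raise ValueError("tail scores are all identical")
    lam = (n_tail - 1) / total_excess
    return mu, lam


# Synthetic scores drawn from an exponential with rate ln(2); the seed is arbitrary.
rng = random.Random(181)
synthetic = [rng.expovariate(math.log(2)) for _ in range(20000)]
mu, lam = fit_exponential_tail(synthetic, tail_fraction=0.1)
print(f"mu = {mu:.3f}, lambda = {lam:.3f}")
```

With enough tail scores, the estimated lambda should land close to the rate used to generate the data.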

A CM file must be calibrated with **cmcalibrate** before it can be used in **cmsearch** or **cmscan**, with a single exception: it is not necessary to calibrate CM files that include only models with zero basepairs before running **cmsearch**.

**cmcalibrate** is very slow. It takes a couple of hours to calibrate a single average-sized CM on a single CPU. **cmcalibrate** will run in parallel on all available cores if Infernal was built on a system that supports POSIX threading (see the Installation section of the user guide for more information). Using **<n>** cores will result in roughly **<n>**-fold acceleration versus a single CPU. MPI (Message Passing Interface) can also be used for parallelization with the **--mpi** option if Infernal was built with MPI enabled, but using more than 161 processors is not recommended because increasing past 161 won't accelerate the calibration. See the Installation section of the user guide for more information.
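The scaling described above can be turned into a back-of-the-envelope estimate. The sketch below assumes an idealised n-fold speedup and an optional ceiling on useful workers (161 for MPI, per the text); the function name and parameters are hypothetical, and real timings are better obtained from **--forecast**.

```python
def estimated_hours(single_cpu_hours, n_workers, useful_cap=None):
    """Idealised running time, assuming a perfect n-fold speedup.

    useful_cap models a ceiling beyond which extra workers give no gain,
    for example 161 for an MPI run as stated in the text above.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    effective = n_workers if useful_cap is None else min(n_workers, useful_cap)
    return single_cpu_hours / effective


# A calibration assumed to take 2 hours on one CPU, run on 8 cores.
print(estimated_hours(2.0, 8))
# Past the cap, adding more MPI processors does not change the estimate.
print(estimated_hours(2.0, 500, useful_cap=161) == estimated_hours(2.0, 161, useful_cap=161))
```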

The **--forecast** option can be used to estimate how long the program will take to run for a given __cmfile__ on the current machine. To predict the running time on __<n>__ processors with MPI, additionally use the **--nforecast** __<n>__ option.

The random sequences searched in **cmcalibrate** are generated by an HMM that was trained on real genomic sequences with various GC contents. The goal is to have the GC distributions in the random sequences be similar to those in actual genomic sequences.

Four rounds of searches and subsequent exponential tail fits are performed, one each for the four different CM algorithms that can be used in **cmsearch** and **cmscan**: glocal CYK, glocal Inside, local CYK and local Inside.

The E-value parameters determined by **cmcalibrate** are only used by the **cmsearch** and **cmscan** programs. If you are not going to use these programs then do not waste time calibrating your models.

**OPTIONS**

**-h** Help; print a brief reminder of command line usage and available options.

**-L** __<x>__

Set the total length of random sequences to search to __<x>__ megabases (Mb). By default, __<x>__ is 1.6 Mb. Increasing __<x>__ will make the exponential tail fits more precise and E-values more accurate, but will take longer (doubling __<x>__ will roughly double the running time). Decreasing __<x>__ is not recommended as it will make the fits less precise and the E-values less accurate.

**OPTIONS FOR PREDICTING REQUIRED TIME AND MEMORY**

**--forecast**

Predict the running time of the calibration of __cmfile__ (with provided options) on the current machine and exit. The calibration is not performed. The predictions should be considered rough estimates. If multithreading is enabled (see the Installation section of the user guide), the timing will take into account the number of available cores.

**--nforecast** __<n>__

With **--forecast**, specify that __<n>__ processors will be used for the calibration. This might be useful for predicting the running time of an MPI run with __<n>__ processors.

**--memreq**

Predict the amount of required memory for calibrating __cmfile__ (with provided options) on the current machine and exit. The calibration is not performed.

**OPTIONS CONTROLLING EXPONENTIAL TAIL FITS**

**--gtailn** __<x>__

Fit the exponential tail for glocal Inside and glocal CYK to the __<n>__ highest scores in the histogram tail, where __<n>__ is __<x>__ times the number of Mb searched. The default value of __<x>__ is 250. The value 250 was chosen because it works well empirically relative to other values.

**--ltailn** __<x>__

Fit the exponential tail for local Inside and local CYK to the __<n>__ highest scores in the histogram tail, where __<n>__ is __<x>__ times the number of Mb searched. The default value of __<x>__ is 750. The value 750 was chosen because it works well empirically relative to other values.
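The arithmetic behind **--gtailn** and **--ltailn** can be checked with a short sketch: the number of scores used in a fit is the option value times the number of megabases searched. The helper name `tail_hit_count` is hypothetical, and rounding to the nearest integer is an assumption, since the text does not say how a non-integer product is handled.

```python
DEFAULT_MB = 1.6      # default value of -L, in megabases
DEFAULT_GTAILN = 250  # default value of --gtailn
DEFAULT_LTAILN = 750  # default value of --ltailn


def tail_hit_count(x, megabases=DEFAULT_MB):
    """Number of highest scores used in a fit: x times the Mb searched.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return int(round(x * megabases))


print("glocal fits use", tail_hit_count(DEFAULT_GTAILN), "scores")
print("local fits use", tail_hit_count(DEFAULT_LTAILN), "scores")
```

With the defaults this gives 400 scores for the glocal fits and 1200 for the local fits; doubling **-L** doubles both counts.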

**--tailp** __<x>__

Ignore the **--gtailn** and **--ltailn** options and fit the __<x>__ fraction tail of the histogram to an exponential tail, for all search modes.

**OPTIONAL OUTPUT FILES**

**--hfile** __<f>__

Save the fitted histograms to file __<f>__. The format of this file is two space-delimited columns per line. The first column is the x-axis value (bit score) of each bin. The second column is the y-axis value (number of hits) of each bin. Each series is delimited by a line with a single character "&". The file will contain one series for each of the four exponential tail fits in the following order: glocal CYK, glocal Inside, local CYK, and local Inside.
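A file in the layout described for **--hfile** could be read with a few lines of Python. This is a minimal sketch based only on the description above: the function names are hypothetical, and it tolerates a "&" line either after every series or only between series, since the text does not say which applies.

```python
MODES = ["glocal CYK", "glocal Inside", "local CYK", "local Inside"]


def parse_series(text):
    """Split '&'-delimited two-column text into a list of series of (x, y) floats."""
    series, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line == "&":
            series.append(current)
            current = []
            continue
        x, y = line.split()[:2]
        current.append((float(x), float(y)))
    if current:  # last series when the file has no trailing '&'
        series.append(current)
    return series


def label_histograms(text):
    """Pair the four series with the search modes in the documented order."""
    return dict(zip(MODES, parse_series(text)))


# Tiny made-up example with one bin per search mode.
example = "1.0 5\n&\n1.5 7\n&\n0.5 2\n&\n0.7 1\n&\n"
for mode, bins in label_histograms(example).items():
    print(mode, bins)
```

The same `parse_series` helper would also apply to the other "&"-delimited output files, with a different grouping of the series.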

**--sfile** __<f>__

Save survival plot information to file __<f>__. The format of this file is two space-delimited columns per line. The first column is the x-axis value (bit score) of each bin. The second column is the y-axis value: the fraction of hits that meet or exceed the score for that bin. Each series is delimited by a line with a single character "&". The file will contain three series of data for each of the four CM search modes in the following order: glocal CYK, glocal Inside, local CYK, and local Inside. The first series is the empirical survival plot from the histogram of hits to the random sequence. The second series is the exponential tail fit to the empirical distribution. The third series is the exponential tail fit if lambda were fixed and set as the natural log of 2 (0.69314718).
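To make the survival-plot quantities concrete, the sketch below computes an empirical survival curve from per-bin hit counts and evaluates an exponential tail with lambda fixed at ln 2. Both helpers are hypothetical, and the exponential curve assumes, for simplicity, that survival is 1 at the location parameter mu; the scaling Infernal actually writes to the file may differ.

```python
import math


def empirical_survival(counts):
    """Fraction of hits at or above each bin, for counts in ascending score order."""
    total = sum(counts)
    running = 0
    fractions = []
    for count in reversed(counts):
        running += count
        fractions.append(running / total)
    return fractions[::-1]


def exponential_survival(score, mu, lam=math.log(2)):
    """exp(-lam * (score - mu)), capped at 1.0 for scores below mu."""
    return min(1.0, math.exp(-lam * (score - mu)))


print(empirical_survival([4, 3, 2, 1]))
print(math.log(2))
print(exponential_survival(11.0, mu=10.0))
```

With lambda fixed at ln 2, each additional bit of score halves the expected fraction of hits.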

**--qqfile** __<f>__

Save quantile-quantile plot information to file __<f>__. The format of this file is two space-delimited columns per line. The first column is the x-axis values, and the second column is the y-axis values. The distance of the points from the identity line (y=x) is a measure of how good the exponential tail fit is: the closer the points are to the identity line, the better the fit. Each series is delimited by a line with a single character "&". The file will contain one series of empirical data for each of the four exponential tail fits in the following order: glocal CYK, glocal Inside, local CYK and local Inside.
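One simple way to summarise how far a quantile-quantile series sits from the identity line is the mean absolute difference between y and x. The helper below is a hypothetical sketch of that idea, using vertical deviation rather than perpendicular distance; it is not a statistic that cmcalibrate itself reports.

```python
def mean_abs_deviation_from_identity(points):
    """Average |y - x| over (x, y) points; 0.0 means every point lies on y = x."""
    if not points:
        raise ValueError("need at least one point")
    return sum(abs(y - x) for x, y in points) / len(points)


# Made-up series: the first lies on the identity line, the second does not.
print(mean_abs_deviation_from_identity([(1.0, 1.0), (2.0, 2.0)]))
print(mean_abs_deviation_from_identity([(1.0, 1.5), (2.0, 1.5)]))
```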

**--ffile** __<f>__

Save space-delimited statistics of different exponential tail fits to file __<f>__. The file will contain the lambda and mu values for exponential tails fit to histogram tails of different sizes. The fields in the file are labelled informatively.

**--xfile** __<f>__

Save a list of the scores in each fit histogram tail to file __<f>__. Each line of this file holds one score, indicating that one hit with that score existed in the tail. Each series is delimited by a line with a single character "&". The file will contain one series for each of the four exponential tail fits in the following order: glocal CYK, glocal Inside, local CYK, and local Inside.

**OTHER OPTIONS**

**--seed** __<n>__

Seed the random number generator with __<n>__, an integer >= 0. If __<n>__ is nonzero, stochastic simulations will be reproducible; the same command will give the same results. If __<n>__ is 0, the random number generator is seeded arbitrarily, and stochastic simulations will vary from run to run of the same command. The default seed is 181.

**--beta** __<x>__

By default, query-dependent banding (QDB) is used to accelerate the CM search algorithms with a beta tail loss probability of 1E-15. This beta value can be changed to __<x>__ with **--beta** __<x>__. The beta parameter is the amount of probability mass excluded during band calculation; higher values of beta give greater speedups but sacrifice more accuracy than lower values. (For more information on QDB see Nawrocki and Eddy, PLoS Computational Biology 3(3): e56.)

**--nonbanded**

Turn off QDB during E-value calibration. This will slow down calibration.

**--nonull3**

Turn off the null3 post hoc additional null model. This is not recommended unless you plan on using the same option to **cmsearch** and/or **cmscan**.

**--random**

Use the background null model of the CM to generate the random sequences, instead of the more realistic HMM. Unless the CM was built using the **--null** option to **cmbuild**, the background null model will be 25% each A, C, G and U.
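For intuition, a sequence drawn from the uniform background described under **--random** can be sketched as below. The function `random_rna` is hypothetical and uses Python's own random number generator, so it will not reproduce the sequences cmcalibrate generates; the default seed of 181 only mirrors the **--seed** default for illustration.

```python
import random


def random_rna(length, seed=181, probs=(0.25, 0.25, 0.25, 0.25)):
    """Draw an i.i.d. RNA sequence; probs are the weights for A, C, G and U."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGU", weights=probs, k=length))


# The same seed gives the same sequence on repeated calls.
print(random_rna(40))
print(random_rna(40) == random_rna(40))
```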

**--gc** __<f>__

Generate the random sequences using the nucleotide distribution from the sequence file __<f>__.

**--cpu** __<n>__

Specify that __<n>__ parallel CPU workers be used. If __<n>__ is set to "0", the program will run in serial mode, without using threads. You can also control this number by setting an environment variable, __INFERNAL_NCPU__. This option will only be available if the machine on which Infernal was built is capable of using POSIX threading (see the Installation section of the user guide for more information).

**--mpi** Run as an MPI parallel program. This option will only be available if Infernal has been configured and built with the "--enable-mpi" flag (see the Installation section of the user guide for more information).
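When cmcalibrate is driven from a script, the command line can be assembled from the options documented above. The helper below is a hypothetical sketch that only builds the argument list and does not run anything; the file name `my.cm` is a placeholder.

```python
import shlex


def build_cmcalibrate_argv(cmfile, length_mb=None, cpu=None, seed=None,
                           forecast=False, nforecast=None):
    """Assemble a cmcalibrate argument list from a few documented options."""
    if nforecast is not None and not forecast:
        raise ValueError("--nforecast is only meaningful together with --forecast")
    argv = ["cmcalibrate"]
    if length_mb is not None:
        argv += ["-L", str(length_mb)]
    if cpu is not None:
        argv += ["--cpu", str(cpu)]
    if seed is not None:
        argv += ["--seed", str(seed)]
    if forecast:
        argv.append("--forecast")
    if nforecast is not None:
        argv += ["--nforecast", str(nforecast)]
    argv.append(cmfile)  # the CM file comes last, after the options
    return argv


# Forecast the running time of an MPI run on 16 processors.
argv = build_cmcalibrate_argv("my.cm", forecast=True, nforecast=16)
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list could be handed to `subprocess.run` on a machine where Infernal is installed.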
