.TH "cmcalibrate" 1 "October 2009" "Infernal 1.0.2" "Infernal Manual"

.SH NAME
.TP 
cmcalibrate - fit exponential tails for E-values and determine HMM
	      filter thresholds for a CM

.SH SYNOPSIS
.B cmcalibrate
.I [options]
.I cmfile

.SH DESCRIPTION

.B cmcalibrate
calibrates E-value statistics and HMM filter thresholds for the 
covariance models (CMs) in
.I cmfile. 
The E-values and HMM filter threshold statistics are added to the 
.I cmfile
and are used by
.B cmsearch
for increased sensitivity and acceleration in RNA homology search.

.PP
CMs are profiles of RNA consensus sequence and secondary structure. A
CM file is produced by the 
.B cmbuild 
program, from a given RNA sequence alignment of known 
consensus structure.

.B cmcalibrate
is very slow. It takes several hours
to calibrate a single average-sized CM. 
.B cmcalibrate
can be run in parallel with MPI.
To do this, use the
.B --mpi 
option and run
.B cmcalibrate
inside an MPI wrapper program such as 
.B mpirun. 
For example: 
.B mpirun C
.B cmcalibrate
.I --mpi 
.I [other options]
.I cmfile.
Note that 
.B cmcalibrate
must have been compiled in MPI mode to use the
.I --mpi
option. See the Installation section of the user's guide for
instructions on how to compile in MPI mode.

The 
.BI --forecast " <n>" 
option can be used to estimate how long the program will take to run
on 
.I <n>
processors. Unless you plan on running
.B cmcalibrate
in MPI mode, 
.I <n>
should be set to 1.

.PP
.B cmcalibrate
performs two main tasks. The first is to calibrate E-value
statistics.  This is done by generating random
sequences and searching them with the CM and collecting hits. The histogram of the bit
scores of the hits is fit to an exponential tail, and the parameters
of the fitted tail are saved to the CM file. The exponential tail is
used to predict the expected number of hits (E-values) at a given bit score in 
.B cmsearch. 
The random sequences are generated by an HMM that was
trained on real genomic sequences with various GC contents, so that
the GC content distribution of the random sequences is similar
to that of actual genomic sequences. 
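.PP
As a rough illustration of the exponential tail described above, the
following Python sketch fits a tail to simulated scores and converts a
bit score into an E-value. It is not the code used by
.BR cmcalibrate ;
the function names, the simulated scores, the choice of mu as the
lowest score in the tail, and the db_ratio scaling are all assumptions
made for this example.
.nf
```python
import math
import random

def fit_exp_tail(scores, tail_fraction):
    """Fit an exponential tail to the top tail_fraction of scores.

    Returns (mu, lam, n_tail): mu is the lowest score in the tail and
    lam is the maximum likelihood rate for the excesses above mu.
    """
    ranked = sorted(scores, reverse=True)
    n_tail = max(2, int(len(ranked) * tail_fraction))
    tail = ranked[:n_tail]
    mu = tail[-1]
    mean_excess = sum(s - mu for s in tail) / n_tail
    if mean_excess <= 0.0:
        raise ValueError("tail scores are all identical")
    lam = 1.0 / mean_excess
    return mu, lam, n_tail

def evalue(score, mu, lam, n_tail, db_ratio=1.0):
    """Expected number of hits scoring at least score.

    n_tail hits reached mu in the calibration sequence; db_ratio is the
    target database length divided by the calibration sequence length.
    """
    return n_tail * db_ratio * math.exp(-lam * (score - mu))

random.seed(1)
# Simulated scores whose right tail is exponential with rate ln 2.
scores = [random.expovariate(math.log(2)) for _ in range(20000)]
mu, lam, n_tail = fit_exp_tail(scores, 0.01)
print(round(lam, 3), evalue(mu + 10.0, mu, lam, n_tail))
```
.fi
In this sketch a score 10 bits above mu has an E-value far below the
number of hits in the fitted tail, because the expected count decays
exponentially with the score.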

The second task is to determine appropriate HMM filter
thresholds for the CM over the possible range of final CM bit score
thresholds. This is done by sampling 10,000 sequences from the CM
itself and searching them with the CM and HMM. The appropriate HMM bit
score threshold for a given CM threshold is set as the HMM threshold that
will recognize 99.5% of the hits that score above the CM threshold. This
HMM threshold is calculated over the range of reasonable CM
thresholds. 
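.PP
The threshold selection described above can be sketched in Python as
follows. This is a simplified illustration under stated assumptions,
not the procedure implemented in
.BR cmcalibrate :
the function name and the synthetic (CM score, HMM score) pairs are
hypothetical, and each sampled sequence is assumed to yield exactly one
score from each model.
.nf
```python
import math

def hmm_filter_threshold(pairs, cm_threshold, fraction=0.995):
    """Return the highest HMM score cutoff that still passes at least
    fraction of the sampled hits whose CM score is >= cm_threshold.

    pairs is a list of (cm_score, hmm_score) tuples, one per sampled
    sequence. Returns None if no sampled hit reaches cm_threshold.
    """
    hmm_scores = sorted(
        (hmm for cm, hmm in pairs if cm >= cm_threshold), reverse=True
    )
    if not hmm_scores:
        return None
    # Number of hits that must score at or above the HMM cutoff.
    need = math.ceil(fraction * len(hmm_scores))
    return hmm_scores[need - 1]

# Synthetic scores for 1000 sampled sequences (illustration only).
pairs = [(20.0 + i, 10.0 + 0.5 * i) for i in range(1000)]
cutoff = hmm_filter_threshold(pairs, 520.0, fraction=0.9)
passed = sum(1 for cm, hmm in pairs if cm >= 520.0 and hmm >= cutoff)
print(cutoff, passed)
```
.fi
Repeating the call over a range of CM thresholds gives one HMM cutoff
per CM threshold, which mirrors the idea of calibrating over the range
of reasonable CM thresholds.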

Both tasks must be performed for each configuration and
algorithm that 
.B cmsearch 
might use. These include HMM Viterbi, HMM Forward, CM CYK and CM
Inside algorithms for E-value calibration, and CM CYK and CM Inside
algorithms for HMM filter thresholds. Additionally, for each
algorithm, each task must be performed twice, once for a locally
configured model and once for a globally configured model.

The E-values and HMM filter thresholds determined by 
.B cmcalibrate
are only used by the
.B cmsearch 
program.
If you are not going to use 
.B cmsearch,
do not waste time calibrating your models.

The majority of the options to 
.B cmcalibrate
fall into one of two categories, depending on which of the two main
tasks they're associated with. Options that affect the
exponential tail E-value fitting are prefixed with 
.B --exp.
Options that affect the HMM filter threshold determination are
prefixed with 
.B --fil.

The calibration of E-value statistics takes the majority of the
running time of
.B cmcalibrate.
This is because CM search algorithms are slow, and the random
sequences that must be searched have to be long enough to include enough
random hits that can be binned into a histogram to which an
exponential tail can be reliably fit. By default the random sequence
length for CM searches is 1.5 megabases (Mb), for all search modes,
but 1.5 can be changed to
.I <x>
with 
.BI --exp-cmL-glc " <x>"
or
.BI --exp-cmL-loc " <x>"
options 
for glocal and local CM search calibrations respectively.
Because 
.B cmsearch 
uses HMM search algorithms to filter, 
.B cmcalibrate 
must also fit exponential tails for HMM search algorithms. HMMs are
much faster than CMs, so it is possible to search a much longer random
sequence than 1.5 Mb without significantly increasing the running time
of
.B cmcalibrate. 
The length of sequence searched with the HMM is controlled by the
.BI --exp-fract " <x>",
.BI --exp-hmmLn-glc " <x>",
.BI --exp-hmmLn-loc " <x>",
and the
.BI --exp-hmmLx " <x>"
options. 
By default, the sequence length for HMM calibration is set as the
length that will require 0.10 times the number of dynamic programming
calculations as a CM E-value calibration step. (The value 0.10 can be
changed to 
.I <x>
with the 
.BI --exp-fract " <x>" 
option).
If this sequence length is less than a minimum value, which by default is
15.0 Mb, then the minimum value is used. The minimum value can be 
changed to 
.I <x>
with 
.BI --exp-hmmLn-glc " <x>" 
and
.BI --exp-hmmLn-loc " <x>" 
for glocal and local HMM search calibrations separately. Similarly, if this
sequence length is more than a maximum value, which by default is
1000.0 Mb, then the maximum value is used. The maximum value can be 
changed to 
.I <x>
with the 
.BI --exp-hmmLx " <x>"
option.
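.PP
The length rule described above amounts to a scale-then-clamp
calculation, sketched here in Python. The function name and the
per-residue dynamic programming costs are hypothetical inputs chosen
for illustration; only the default values 0.10, 15.0 and 1000.0 come
from this manual page.
.nf
```python
def hmm_calibration_length(cm_length_mb, cm_dp_per_residue,
                           hmm_dp_per_residue, fract=0.10,
                           min_mb=15.0, max_mb=1000.0):
    """Length in Mb of random sequence to search with the HMM."""
    # DP calculations spent on one CM calibration step.
    cm_work = cm_length_mb * cm_dp_per_residue
    # HMM sequence length that costs fract times that much work.
    length_mb = fract * cm_work / hmm_dp_per_residue
    # Clamp to the configured minimum and maximum lengths.
    return min(max(length_mb, min_mb), max_mb)

# Hypothetical costs: the CM needs 2000 DP calculations per residue
# and the HMM needs 10.
print(hmm_calibration_length(1.5, 2000.0, 10.0))
```
.fi
With a much cheaper CM the computed length falls below the minimum and
the minimum is returned; with a much more expensive CM the result is
capped at the maximum.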


.SH OPTIONS

.TP
.B -h
Print brief help; includes version number and summary of
all options, including expert options.

.TP
.BI -s " <n>"
Set the random number generator seed to 
.I <n>, 
where 
.I <n> 
is a positive integer. 
The default is to use time() to
generate a different seed for each run, which means that two different
runs of 
.B cmcalibrate
on the same CM will give slightly different
E-value and HMM filter threshold parameters. You can use this option
to generate reproducible results.

.TP
.BI --forecast " <n>"
Predict the running time of the calibration for 
.I cmfile 
and provided options
and exit without performing the calibration. 
The predictions should be used as rough
estimates. The value 
.I <n>
is the number of processors the calibration will be run on, so 
.I <n>
equal to 1 is appropriate unless you will run 
.B cmcalibrate
in parallel with MPI.

.TP
.B --devhelp
Print help, as with  
.B "-h",
but also include undocumented developer options. These options are not
listed below, are under development or experimental, and are not
guaranteed to even work correctly. Use developer options at your own
risk. The only resources for understanding what they actually do are
the brief one-line description printed when
.B "--devhelp"
is enabled, and the source code.

.TP
.BI --mpi
Run as an MPI parallel program. This option will only be available if
Infernal 
has been configured and built with the "--enable-mpi" flag (see User's
Guide for details).

.SH EXPERT OPTIONS

.TP
.BI --exp-cmL-glc " <x>"
Set the length of random sequence to search for the CM 
.B glocal 
exponential tail fits to 
.I <x> 
megabases (Mb). By default, 
.I <x>
is 1.5 Mb. Searching more sequence will make the exponential tail fits
more precise, but will take longer: using 
.I <x> 
of 3.0 instead of the default of 1.5 will cause the running time of
.B cmcalibrate 
to increase by roughly 50%.

.TP
.BI --exp-cmL-loc " <x>"
Set the length of random sequence to search for the CM 
.B local 
exponential tail fits to 
.I <x> 
megabases (Mb). By default, 
.I <x>
is 1.5 Mb. Searching more sequence will make the exponential tail fits
more precise, but will take longer: using 
.I <x> 
of 3.0 instead of the default of 1.5 will cause the running time of
.B cmcalibrate 
to increase by roughly 50%.

.TP
.BI --exp-hmmLn-glc " <x>"
Set the minimum random sequence length to search for the HMM 
.B glocal
exponential tail fits to 
.I <x>
megabases (Mb). By default, 
.I <x>
is 15.0.
For more information, see the explanation regarding sequence lengths
for E-value calibration above before the Options section.

.TP
.BI --exp-hmmLn-loc " <x>"
Set the minimum random sequence length to search for the HMM 
.B local
exponential tail fits to 
.I <x>
megabases (Mb). By default, 
.I <x>
is 15.0.
For more information, see the explanation regarding sequence lengths
for E-value calibration above before the Options section.

.TP
.BI --exp-hmmLx " <x>"
Set the maximum random sequence length to search when determining HMM
E-values to  
.I <x>
megabases (Mb). By default, 
.I <x>
is 1000.0.
For more information, see the explanation regarding sequence lengths
for E-value calibration above before the Options section.

.TP
.BI --exp-fract " <x>"
Set the HMM/CM fraction of dynamic programming calculations to 
.I <x>.
By default, 
.I <x>
is 0.10. 
For more information, see the explanation regarding sequence lengths
for E-value calibration above before the Options section.

.TP
.BI --exp-tailn-cglc " <x>"
During E-value calibration of 
.B glocal CM 
search modes fit the exponential tail to the high scores in the
histogram tail that includes 
.I <x>
hits per Mb searched.
By default this 
.I <x>
is 25. The value 25 was chosen because it works well empirically
for glocal CM modes relative to other values.

.TP
.BI --exp-tailn-cloc " <x>"
During E-value calibration of 
.B local CM 
search modes fit the exponential tail to the high scores in the
histogram tail that includes 
.I <x>
hits per Mb searched.
By default this 
.I <x>
is 75. The value 75 was chosen because it works well empirically
for local CM modes relative to other values.

.TP
.BI --exp-tailn-hglc " <x>"
During E-value calibration of 
.B glocal HMM 
search modes fit the exponential tail to the high scores in the
histogram tail that includes 
.I <x>
hits per Mb searched.
By default this 
.I <x>
is 250. The value 250 was chosen because it works well empirically
for glocal HMM modes relative to other values.

.TP
.BI --exp-tailn-hloc " <x>"
During E-value calibration of 
.B local HMM 
search modes fit the exponential tail to the high scores in the
histogram tail that includes 
.I <x>
hits per Mb searched.
By default this 
.I <x>
is 750. The value 750 was chosen because it works well empirically
for local HMM modes relative to other values.

.TP
.BI --exp-tailp " <x>"
Ignore the
.B --exp-tailn
prefixed options and fit the 
.I <x>
fraction right tail of the histogram to exponential tails, for all
search modes.

.TP
.BI --exp-tailxn " <n>"
With 
.B --exp-tailp 
enforce that the maximum number of hits in the tail that is fit is
.I <n>.

.TP
.BI --exp-beta " <x>"
During E-value calibration, by default query-dependent banding (QDB)
is used to accelerate the CM search algorithms with a beta tail loss
probability of 1E-15.
This beta value can be changed to 
.I <x>
using the 
.BI --exp-beta " <x>"
option. The beta parameter is the amount of probability mass excluded
during band calculation; higher values of beta give greater speedups
but sacrifice more accuracy than lower values. A recommended value is
1E-7 (0.00001%).  QDB is explained in more detail in the manual page
for 
.B cmsearch 
and in (Nawrocki and Eddy, PLoS Computational Biology 3(3): e56). 

.TP
.B --exp-no-qdb
Turn off QDB during E-value calibration. This will slow down
calibration, and is not recommended unless you plan on using 
.B --no-qdb
in 
.B cmsearch.

.TP 
.BI --exp-hfile " <f>"
Save the histograms fit for the E-value calibration to file
.I <f>.
The format of this file is two tab delimited columns. The first column
is the x-axis values of bit scores of each bin. The second column is the y-axis
values of number of hits per bin. Each series is delimited by a line
with a single character "&". The file will contain one series for each
exponential tail fit, i.e. one series of empirical data for each line of output from
.B cmcalibrate
that begins with "exp tail".
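As an illustration of the format just described, the following Python
sketch splits such a file into its series. The function name is
hypothetical and the example lines are made up; the sketch also accepts
a final series that is not followed by an "&" line, which is an
assumption rather than documented behavior.
.nf
```python
def read_series(lines):
    """Split lines of a two-column, tab-delimited file into series.

    A line holding only "&" ends the current series. Returns a list
    of series, each a list of (x, y) float pairs.
    """
    all_series, current = [], []
    for line in lines:
        line = line.strip()
        if line == "&":
            all_series.append(current)
            current = []
        elif line:
            # Whitespace splitting also handles tab-delimited columns.
            fields = line.split()
            current.append((float(fields[0]), float(fields[1])))
    if current:
        all_series.append(current)
    return all_series

tab = chr(9)
example = ["10.0" + tab + "5", "11.0" + tab + "2", "&",
           "9.5" + tab + "7", "&"]
for series in read_series(example):
    print(len(series), series[0])
```
.fi
To read a real file, pass an open file object in place of the example
list.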

.TP 
.BI --exp-sfile " <f>"
Save a survival plot for the E-value calibration to file
.I <f>.
The format of this file is two tab delimited columns. The first column
is the x-axis values of bit scores of each bin. The second column is the y-axis
values of fraction of hits that meet or exceed the score for each
bin. Each series is delimited by a line with a single character "&". 
The file will contain three series of data for each
exponential tail fit, i.e. three series for each line of output from
.B cmcalibrate
that begins with "exp tail".
The first series is the empirical survival plot from the histogram of hits
to the random sequence. The second series is the exponential tail fit
to the empirical distribution. The third series is the exponential
tail fit if lambda were fixed and set as the natural log of 2 (0.693147181).

.TP 
.BI --exp-qqfile " <f>"
Save a quantile-quantile plot for the E-value calibration to file
.I <f>.
The format of this file is two tab delimited columns. The first column
is the x-axis values, and the second column is the y-axis
values. The distance of the points from the identity line (y=x) is a
measure of how good the exponential tail fit is: the closer the points
are to the identity line, the better the fit.
Each series is delimited by a line with a single character "&". 
The file will contain one series of empirical data for each
exponential tail fit, i.e. one series for each line of output from
.B cmcalibrate
that begins with "exp tail".

.TP 
.BI --exp-ffile " <f>"
Save statistics on the exponential tail fits to file
.I <f>.
The file will contain the lambda and mu values for exponential tails 
fit to tails of different sizes. For example, by default
.B cmcalibrate 
fits exponential tails to the rightmost 0.01 (1%) of the score histogram and
stores the parameters of that exponential tail to the CM file.
(The value of 0.01 can be changed to 
.I <x>
with the 
.BI --exp-tailp " <x>"
option).
When 
.BI --exp-ffile " <f>" 
is used the file
.I " <f>" 
will include the exponential tail parameters for fits to various
fractions of the histogram tail, instead of just to 0.01.

.TP 
.BI --fil-N " <n>"
Set the number of sequences sampled and searched for the HMM filter
threshold calibration to 
.I <n>.
By default, 
.I <n>
is 10,000.

.TP 
.BI --fil-F " <x>"
Set the fraction of sample sequences the HMM filter must be able to
recognize, and allow to survive, to 
.I <x>,
where 
.I <x>
is a positive real number less than or equal to 1.0.
By default,
.I <x>
is 0.995.

.TP 
.BI --fil-tau " <x>"
Set the tail loss probability during HMM band calculation for HMM
filter threshold calibration to 
.I <x>. 
This is the amount of probability mass within the HMM posterior
probabilities that is considered negligible. The default value is 1E-7.
In general, higher values will result in greater acceleration, but
increase the chance of missing the optimal alignment due to the HMM
bands. 

.TP 
.B --fil-gemit
During HMM filter calibration, always sample sequences from a globally
configured CM, even when calibrating local modes. By default,
sequences are sampled from a globally configured CM when calibrating
the global search modes, and sampled from a locally configured CM
when calibrating the local search modes.
	
.TP 
.BI --fil-dfile " <f>"
Save statistics on filter threshold calibration, including HMM and CM scores for all sampled sequences,
to file 
.I <f>.

.TP
.BI --mxsize " <x>"
Set the maximum allowable DP matrix size to 
.I <x>
megabytes. By default this size is 2,048 MB. 
This should be large enough for the vast majority of calibrations.
If it is not, 
.B cmcalibrate
will exit prematurely and report an error message that 
the matrix exceeded its maximum allowable size. In this case, the
.B --mxsize 
option can be used to raise the limit.


.SH SEE ALSO

For complete documentation, see the User's Guide (Userguide.pdf) that
came with the distribution; or see the Infernal web page,
http://infernal.janelia.org/.

.SH COPYRIGHT

.nf
Copyright (C) 2009 HHMI Janelia Farm Research Campus.
Freely distributed under the GNU General Public License (GPLv3).
.fi
See the file COPYING that came with the source
for details on redistribution conditions.

.SH AUTHOR

.nf
Eric Nawrocki, Diana Kolbe, and Sean Eddy
HHMI Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147
http://selab.janelia.org/
.fi

