NAME

exonerate - a generic tool for sequence comparison

SYNOPSIS

exonerate [ options ] <query path> <target path>

DESCRIPTION

exonerate is a general tool for sequence comparison.

It uses the C4 dynamic programming library. It is designed to be both general and fast.
It can produce either gapped or ungapped alignments, according to a variety of different
alignment models. The C4 library allows sequence alignment using a reduced space full
dynamic programming implementation, but also allows automated generation of heuristics
from the alignment models, using bounded sparse dynamic programming, so that these
alignments may also be rapidly generated. Alignments generated using these heuristics
will represent a valid path through the alignment model, yet (unlike the exhaustive
alignments), the results are not guaranteed to be optimal.

CONVENTIONS

A number of conventions (and idiosyncrasies) are used within exonerate. An understanding
of them facilitates interpretation of the output.

Coordinates
An in-between coordinate system is used, where the positions are counted between
the symbols, rather than on the symbols. This numbering scheme starts from zero.
This numbering is shown below for the sequence "ACGT":

 A C G T
0 1 2 3 4

Hence the subsequence "CG" would have start=1, end=3, and length=2. This
coordinate system is used internally in exonerate, and for all the output formats
produced, with the exception of the "human readable" alignment display and the GFF
output, where convention and standards dictate otherwise.
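
The in-between convention corresponds to half-open slicing. The Python sketch below
works through the "ACGT" example above and shows one possible conversion to 1-based
inclusive positions. The function name is hypothetical and is not part of exonerate.

```python
def to_one_based_inclusive(start, end):
    """Convert in-between coordinates (start < end) to 1-based inclusive positions."""
    return start + 1, end


seq = "ACGT"
start, end = 1, 3            # the subsequence "CG" in in-between coordinates
subseq = seq[start:end]      # Python slices count between the symbols as well
length = end - start         # length is simply end minus start
first, last = to_one_based_inclusive(start, end)
```

With in-between coordinates the length never needs a "+1" correction, which is one
reason the convention is convenient for interval arithmetic.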

Reverse Complements
When an alignment is reported on the reverse complement of a sequence, the
coordinates are simply given on the reverse complement copy of the sequence. Hence
positions on the sequences are never negative. Generally, the forward strand is
indicated by '+', the reverse strand by '-', and an unknown or not-applicable
strand (as in the case of a protein sequence) is indicated by '.'
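
If positions on the forward strand are needed, an interval reported on the reverse
complement can be mapped back once the sequence length is known. The following is a
minimal Python sketch, assuming in-between coordinates as described above; the
function names are hypothetical and are not part of exonerate.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def revcomp_to_forward(start, end, seq_length):
    """Map an in-between interval on the reverse complement to the forward strand."""
    return seq_length - end, seq_length - start


seq = "AACCG"
rc = reverse_complement(seq)                   # the reverse complement copy
rc_start, rc_end = 0, 3                        # interval on the reverse complement
fwd_start, fwd_end = revcomp_to_forward(rc_start, rc_end, len(seq))
# The forward-strand interval covers the complementary bases of the same region.
same_region = reverse_complement(seq[fwd_start:fwd_end]) == rc[rc_start:rc_end]
```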

Alignment Scores
Currently, only the raw alignment scores are displayed. This score is simply the sum
of the transition scores used in the dynamic programming. For example, in the case of
a Smith-Waterman alignment, this will be the sum of the substitution matrix scores
and the gap penalties.
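
As an illustration of the Smith-Waterman case, the sketch below sums substitution
scores and affine gap penalties over a gapped alignment given as two rows. All score
values here are hypothetical (they are not exonerate's defaults), a flat
match/mismatch score stands in for a substitution matrix, and the convention that
the first gap position costs gap_open and each further position costs gap_extend is
an assumption.

```python
def raw_score(query_row, target_row, match=5, mismatch=-4,
              gap_open=-12, gap_extend=-4):
    """Sum per-column scores of an alignment written with '-' for gaps."""
    score = 0
    gap_side = None                       # which row the current gap is in, if any
    for q, t in zip(query_row, target_row):
        if q == "-" or t == "-":
            side = "query" if q == "-" else "target"
            # Continuing a gap in the same row is an extension, otherwise an opening.
            score += gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            score += match if q == t else mismatch
            gap_side = None
    return score
```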

GENERAL OPTIONS

Most arguments have short and long forms. The long forms
are more likely to be stable over time, and hence should be used in scripts which
call exonerate.

-h | --shorthelp <boolean>
Show help. This will display a concise summary of the available options, defaults
and values currently set.

--help <boolean>
This shows all the help options including the defaults, the value currently set,
and the environment variable which may be used to set each parameter. There will
be an indication of which options are mandatory. Mandatory options have no
default, and must have a value supplied for exonerate to run. If mandatory options
are used in order, their flags may be skipped from the command line (see examples
below). Unlike this man page, the information from this option will always be up
to date with the latest version of the program.

-v | --version <boolean>
Display the version number. Also displays other information such as the build date
and glib version used.

SEQUENCE INPUT OPTIONS

Pairwise comparisons will be performed between all query sequences and all target
sequences. Generally, for the best performance, shorter sequences (eg. ESTs, shotgun
reads, proteins) should be used as the query sequences, and longer sequences (eg. genomic
sequences) should be used as the target sequences.

-q | --query <paths>
Specify the query sequences required. These must be in a FASTA format file.
Single or multiple query sequences may be supplied. Additionally, multiple
FASTA files may be supplied following a single --query flag, or by using
multiple --query flags.

-t | --target <paths>
Specify the target sequences required. These must also be in a FASTA format file. As
with the query sequences, single or multiple target sequences and files may be
supplied. NEW: the target filename may be replaced by a server name and port number
in the form of hostname:port when using exonerate-server. See the man page for
exonerate-server for more information on running exonerate in client:server mode.

-Q | --querytype <dna | protein>
Specify the query type to use. If this is not supplied, the query type is assumed
to be DNA when the first sequence in the file contains more than 85% [ACGTN] bases.
Otherwise, it is assumed to be peptide. This option forces the query type, which is
useful because some nucleotide and peptide sequences can fall on either side of this
threshold.
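
The 85% rule can be sketched as follows. This is only an illustration of the
threshold described above: the function name is hypothetical, and details such as
case handling are assumptions rather than exonerate's actual implementation.

```python
def guess_sequence_type(sequence, threshold=0.85):
    """Guess 'dna' when more than `threshold` of the symbols are in [ACGTN]."""
    if not sequence:
        raise ValueError("cannot guess the type of an empty sequence")
    dna_like = sum(1 for symbol in sequence.upper() if symbol in "ACGTN")
    return "dna" if dna_like / len(sequence) > threshold else "protein"


typical_dna = guess_sequence_type("ACGTNACGTA")
typical_peptide = guess_sequence_type("MKVLAAGIVGLC")
# A peptide made only of G, A, T and C residues is indistinguishable from DNA,
# which is the situation where forcing the type becomes necessary.
ambiguous_peptide = guess_sequence_type("GATTACA")
```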

-T | --targettype <dna | protein>
Specify the target type to use. The same as --querytype (above), except that it
applies to the target. Specifying the sequence type will avoid the overhead of
having to read the first sequence in the database twice (which may be significant
with chromosome-sized sequences).

--querychunkid <id>

--querychunktotal <total>

--targetchunkid <id>

--targetchunktotal <total>
These options facilitate running exonerate on compute farms, and avoid having to
split up sequence databases into small chunks to run on different nodes. If, for
example, you wished to split the target database into three parts, you would run
three exonerate jobs on different nodes including the options:

--targetchunkid 1 --targetchunktotal 3
--targetchunkid 2 --targetchunktotal 3
--targetchunkid 3 --targetchunktotal 3
NB. The granularity offered by this option only goes down to a single sequence, so
when there are more chunks than sequences in the database, some processes will do
nothing.
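
A job submission script might generate one command per chunk. The Python sketch
below builds the argument lists for such jobs using only the options described
above; the file names are placeholders and the function name is hypothetical. How
the commands are dispatched to nodes depends on the local scheduler and is not shown.

```python
def chunk_commands(query, target, total):
    """Build one exonerate argument list per target chunk (chunk ids start at 1)."""
    commands = []
    for chunk_id in range(1, total + 1):
        commands.append([
            "exonerate",
            "--query", query,
            "--target", target,
            "--targetchunkid", str(chunk_id),
            "--targetchunktotal", str(total),
        ])
    return commands


jobs = chunk_commands("query.fa", "target.fa", 3)   # placeholder file names
for job in jobs:
    print(" ".join(job))
```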

-V | --verbose <int>
Be verbose - show information about what is going on during the analysis. The
default is 1 (little information); the higher the number given, the more
information is printed. To silence all the default output from exonerate, use:
--verbose 0 --showalignment no --showvulgar no

ANALYSIS OPTIONS

-E | --exhaustive <boolean>
Specify whether or not exhaustive alignment should be used. By default, this is
FALSE, and alignment heuristics will be used. If it is set to TRUE, an exhaustive
alignment will be calculated. This requires quadratic time, and will be much, much
slower, but will provide the optimal result for the given model.

-B | --bigseq <int>
Perform alignment of large (multi-megabase) sequences. This is very memory
efficient and fast when both sequences are chromosome-sized, but does not
currently permit the use of a word neighbourhood (ie. exactly matching seeds only).

--forcescan <none | query | target>
Force the FSM to scan the query sequence rather than the target. This option is
useful, for example, if you have a single piece of genomic sequence and you wish to
compare it to the whole of dbEST. By scanning the database, rather than the query,
the analysis will be completed much more quickly, as the overheads of multiple
query FSM construction, multiple target reading and splice site predictions will be
removed. By default, exonerate will guess the optimal strategy based on database
sequence sizes.

--saturatethreshold <number>
When set to zero, this option does nothing. Otherwise, once more than this number
of words (in addition to the expected number of words by chance) have matched a
position on the query, the position on the query will be 'numbed' (ignore further
matches) for the current pairwise comparison.

--customserver <command>
NEW: When using exonerate in client:server mode with a non-standard server, this
command allows you to send a custom command to the server. This command is sent by
the client (exonerate) before any other commands, and is provided as a way of
passing parameters or other commands specific to the custom server. See the
exonerate-server man page for more information on running exonerate in
client:server mode.

FASTA DATABASE OPTIONS

--fastasuffix <extension>
If any of the inputs given with --query or --target are directories, then exonerate
will recursively descend these directories, reading all files ending with this
suffix as fasta format input.

GAPPED ALIGNMENT OPTIONS

-m | --model <alignment model>
Specify the alignment model to use. The models currently supported are:
ungapped
The simplest type of model, used by default. An appropriate model will be
selected automatically for the type of input sequences provided.
ungapped:trans
This ungapped model includes translation of all frames of both the query and
target sequences. This is similar to an ungapped tblastx type search.
affine:global
This performs gapped global alignment, similar to the Needleman-Wunsch
algorithm, except with affine gaps. Global alignment requires that both the
sequences in their entirety are included in the alignment.
affine:bestfit
This performs a best fit or best location alignment of the query onto the
target sequence. The entire query sequence will be included in the
alignment, but only the best location for its alignment on the target
sequence.
affine:local
This is local alignment with affine gaps, similar to the Smith-Waterman-
Gotoh algorithm. A general-purpose alignment algorithm. As this is local
alignment, any subsequence of the query and target sequence may appear in
the alignment.
affine:overlap
This type of alignment finds the best overlap between the query and target.
The overlap alignment must include the start of the query or target and the
end of the query or the target sequence, to align sequences which overlap at
the ends, or in the mid-section of a longer sequence. This is the type of
alignment frequently used in assembly algorithms.
est2genome
This model is similar to the affine:local model, but it also includes intron
modelling on the target sequence to allow alignment of spliced to unspliced
coding sequences for both forward and reversed genes. This is similar to
the alignment models used in programs such as EST_GENOME and sim4.
ner
NERs are non-equivalenced regions - large regions in both the query and
target which are not aligned. This model can be used for protein alignments
where strongly conserved helix regions will be aligned, but weakly conserved
loop regions are not. Similarly, this model could be used to look for co-
linearly conserved regions in comparison of genomic sequences.
protein2dna
This model compares a protein sequence to a DNA sequence, incorporating all
the appropriate gaps and frameshifts.
protein2dna:bestfit
NEW: This is a bestfit version of the protein2dna model, with which the
entire protein is included in the alignment. It is currently only available
when using exhaustive alignment.
protein2genome
This model allows alignment of a protein sequence to genomic DNA. This is
similar to the protein2dna model, with the addition of modelling of introns
and intron phases. This model is similar to those used by genewise.
protein2genome:bestfit
NEW: This is a bestfit version of the protein2genome model, with which the
entire protein is included in the alignment. It is currently only available
when using exhaustive alignment.
coding2coding
This model is similar to the ungapped:trans model, except that gaps and
frameshifts are allowed. It is similar to a gapped tblastx search.
coding2genome
This is similar to the est2genome model, except that the query sequence is
translated during comparison, allowing a more sensitive comparison.
cdna2genome
This combines properties of the est2genome and coding2genome models, to
allow modeling of a whole cDNA where a central coding region can be flanked
by non-coding UTRs. When the CDS start and end are known they may be specified
using the --annotation option (see below) to permit only the correct coding
region to appear in the alignment.
genome2genome
This model is similar to the coding2coding model, except introns are
modelled on both sequences. (not working well yet)

The short names u, u:t, a:g, a:b, a:l, a:o, e2g, ner,
p2d, p2d:b, p2g, p2g:b, c2c, c2g, cd2g and g2g can also be used for specifying
models.
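
For scripts that accept either form, the short names can be expanded with a lookup
table. The sketch below assumes that the short names are listed in the same order as
the models above, which is how the pairing was inferred; the table and function
names are hypothetical.

```python
MODEL_SHORT_NAMES = {
    "u": "ungapped",
    "u:t": "ungapped:trans",
    "a:g": "affine:global",
    "a:b": "affine:bestfit",
    "a:l": "affine:local",
    "a:o": "affine:overlap",
    "e2g": "est2genome",
    "ner": "ner",
    "p2d": "protein2dna",
    "p2d:b": "protein2dna:bestfit",
    "p2g": "protein2genome",
    "p2g:b": "protein2genome:bestfit",
    "c2c": "coding2coding",
    "c2g": "coding2genome",
    "cd2g": "cdna2genome",
    "g2g": "genome2genome",
}


def expand_model_name(name):
    """Return the long model name for a short name; other names pass through."""
    return MODEL_SHORT_NAMES.get(name, name)
```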

-s | --score <threshold>
This is the overall score threshold. Alignments will not be reported below this
threshold. For heuristic alignments, the higher this threshold, the less time the
analysis will take.

--percent <percentage>
Report only alignments scoring at least this percentage of the maximal score for
each query. eg. use --percent 90 to report alignments with 90% of the maximal
score obtainable for that query. This option is useful not only because it reduces
the spurious matches in the output, but because it generates query-specific
thresholds (unlike --score ) for a set of queries of differing lengths, and will
also speed up the search considerably. NB. with this option, it is possible to
have a cDNA match its corresponding gene exactly, yet still score less than 100%,
due to the addition of the intron penalty scores, hence this option must be used
with caution.

--showalignment <boolean>
Show the alignments in a human readable form.

--showsugar <boolean>
Display "sugar" output for ungapped alignments. Sugar is Simple UnGapped Alignment
Report, which displays ungapped alignments one-per-line. The sugar line starts
with the string "sugar:" for easy extraction from the output, and is followed by
the following 9 fields in the order below:

query_id        Query identifier
query_start     Query position at alignment start
query_end       Query position at alignment end
query_strand    Strand of query matched
target_id       |
target_start    | the same 4 fields
target_end      | for the target sequence
target_strand   |
score           The raw alignment score
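
A sugar line can be split into these 9 fields with a few lines of Python. This is a
minimal sketch, assuming the fields are whitespace-separated and that identifiers
contain no whitespace; the class and function names are hypothetical, and the
example line is invented for illustration rather than taken from a real run.

```python
from typing import NamedTuple


class Sugar(NamedTuple):
    query_id: str
    query_start: int
    query_end: int
    query_strand: str
    target_id: str
    target_start: int
    target_end: int
    target_strand: str
    score: int


def parse_sugar(line):
    """Parse one 'sugar:' line into its 9 fields."""
    prefix = "sugar:"
    if not line.startswith(prefix):
        raise ValueError("not a sugar line")
    fields = line[len(prefix):].split()
    if len(fields) != 9:
        raise ValueError("expected 9 fields, found %d" % len(fields))
    q_id, q_start, q_end, q_strand, t_id, t_start, t_end, t_strand, score = fields
    return Sugar(q_id, int(q_start), int(q_end), q_strand,
                 t_id, int(t_start), int(t_end), t_strand, int(score))


# Invented example line, not real exonerate output.
hit = parse_sugar("sugar: est1 0 120 + chr1 5000 5120 + 600")
```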

--showcigar <boolean>
Show the alignments in "cigar" format. Cigar is a Compact Idiosyncratic Gapped
Alignment Report, which displays gapped alignments one-per-line. The format starts
with the same 9 fields as sugar output (see above), and is followed by a series of
<operation, length> pairs where operation is one of match, insert or delete, and
the length describes the number of times this operation is repeated.
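
The operation and length pairs can be read two fields at a time after the sugar
portion. The sketch below makes two assumptions that are not stated above: that the
line starts with "cigar:" by analogy with sugar output, and that the operations are
written as the single letters M, I and D. The function name is hypothetical and the
example line is invented.

```python
def parse_cigar(line):
    """Split a 'cigar:' line into its 9 sugar fields and (operation, length) pairs."""
    prefix = "cigar:"
    if not line.startswith(prefix):
        raise ValueError("not a cigar line")
    fields = line[len(prefix):].split()
    header, rest = fields[:9], fields[9:]
    if len(header) != 9 or len(rest) % 2 != 0:
        raise ValueError("malformed cigar line")
    operations = [(rest[i], int(rest[i + 1])) for i in range(0, len(rest), 2)]
    for operation, _ in operations:
        if operation not in ("M", "I", "D"):     # assumed letter codes
            raise ValueError("unknown operation: %s" % operation)
    return header, operations


# Invented example line, not real exonerate output.
header, operations = parse_cigar("cigar: est1 0 25 + chr1 100 127 + 98 M 10 I 2 M 15")
```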

--showvulgar <boolean>
Shows the alignments in "vulgar" format. Vulgar is Verbose Useful Labelled Gapped
Alignment Report. This format also starts with the same 9 fields as sugar output
(see above), and is followed by a series of <label, query_length, target_length>
triplets. The label may be one of the following:

M Match
C Codon
G Gap
N Non-equivalenced region
5 5' splice site
3 3' splice site
I Intron
S Split codon
F Frameshift
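
The triplets can be parsed three fields at a time after the sugar portion. This
Python sketch assumes the line starts with "vulgar:" by analogy with sugar output;
the function names are hypothetical and the example line is invented. Summing the
query and target lengths gives a simple sanity check, since one would expect the
totals to equal the spans implied by the start and end coordinates.

```python
def parse_vulgar(line):
    """Split a 'vulgar:' line into 9 sugar fields and (label, q_len, t_len) triplets."""
    prefix = "vulgar:"
    if not line.startswith(prefix):
        raise ValueError("not a vulgar line")
    fields = line[len(prefix):].split()
    header, rest = fields[:9], fields[9:]
    if len(header) != 9 or len(rest) % 3 != 0:
        raise ValueError("malformed vulgar line")
    triplets = [(rest[i], int(rest[i + 1]), int(rest[i + 2]))
                for i in range(0, len(rest), 3)]
    return header, triplets


def aligned_spans(triplets):
    """Total query and target lengths covered by the triplets."""
    return (sum(q_len for _, q_len, _ in triplets),
            sum(t_len for _, _, t_len in triplets))


# Invented example: two matched blocks separated by an intron with splice sites.
line = "vulgar: est1 0 30 + chr1 100 234 + 120 M 10 10 5 0 2 I 0 100 3 0 2 M 20 20"
header, triplets = parse_vulgar(line)
query_span, target_span = aligned_spans(triplets)
```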

--showquerygff <boolean>
Report GFF output for features on the query sequence. See
http://www.sanger.ac.uk/Software/formats/GFF for more information.

--showtargetgff <boolean>
Report GFF output for features on the target sequence.

--ryo <format>
Roll-your-own output format. This allows specification of a printf-esque format
line which is used to specify which information to include in the output, and how
it is to be shown. The format field may contain the following fields:

%[qt][idlsSt]
For either {query,target}, report the
{id,definition,length,sequence,Strand,type}. Sequences are reported in a
fasta-format like block (no headers).
%[qt]a[bels]
For either {query,target} region which occurs in the alignment, report the
{begin,end,length,sequence}
%[qt]c[bels]
For either {query,target} region which occurs in the coding sequence in the
alignment, report the {begin,end,length,sequence}
%s The raw score
%r The rank (in results from a bestn search)
%m Model name
%e[tism]
Equivalenced {total,id,similarity,mismatches} (ie. %em == (%et - %ei))
%p[is] Percent {id,similarity} over the equivalenced portions of the alignment.
(ie. %pi == 100*(%ei / %et))
%g Gene orientation ('+' = forward, '-' = reverse, '.' = unknown)
%S Sugar block (the 9 fields used in sugar output; see above)
%C Cigar block (the fields of a cigar line after the sugar portion)
%V Vulgar block (the fields of a vulgar line after the sugar portion)
%% Expands to a percentage sign (%)
\n Newline
\t Tab
\\ Expands to a backslash (\)
\{ Open curly brace
\} Close curly brace
{ Begin per-transition output section
} End per-transition output section
%P[qt][sabe]
Per-transition output for {query,target} {sequence,advance,begin,end}
%P[nsl]
Per-transition output for {name,score,label}

This option is very useful and flexible. For example, to report all the sections of query
sequences which feature in alignments in fasta format, use:

--ryo ">%qi %qd\n%qas\n"

To output all the symbols and scores in an alignment, try something like:

--ryo "%V{%Pqs %Pts %Ps\n}"
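
When exonerate is called from a script, the format string should reach the program
with its backslash escapes intact, since \n and \t are listed above as fields of the
format itself. The Python sketch below builds such a command using a raw string; the
file names are placeholders and the function name is hypothetical. The command is
only printed here, not executed.

```python
import shlex


def ryo_command(query, target, ryo_format):
    """Build an exonerate argument list that prints only the --ryo output."""
    return [
        "exonerate",
        "--query", query,
        "--target", target,
        "--showalignment", "no",
        "--showvulgar", "no",
        "--ryo", ryo_format,
    ]


# A raw string keeps "\n" as backslash + n, leaving the expansion to exonerate.
fasta_format = r">%qi %qd\n%qas\n"
command = ryo_command("query.fa", "target.fa", fasta_format)
print(shlex.join(command))      # a shell-quoted form for logging or copy-paste
```

Passing the argument list directly to a process launcher, rather than a single shell
string, avoids a second layer of quoting for the % and > characters.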

-n | --bestn <number>
Report the best N results for each query. (Only results scoring better than the
score threshold will be reported). This option reduces the amount of output
generated, and also allows exonerate to speed up the search.

-S | --subopt <boolean>
This option allows for the reporting of (Waterman-Eggert style) suboptimal
alignments. (It is on by default.) All suboptimal (ie. non-intersecting)
alignments will be reported for each pair of sequences scoring at least the
threshold provided by --score.

When this option is used with exhaustive alignments, several full quadratic time
passes will be required, so the running time will be considerably increased.

-g | --gappedextension <boolean>
Causes a gapped extension stage to be performed ie. dynamic programming is applied
in arbitrarily shaped and dynamically sized regions surrounding HSP seeds. The
extension threshold is controlled by the --extensionthreshold option.

Although sometimes slower than BSDP, gapped extension improves sensitivity with
weak, gap-rich alignments such as during cross-species comparison.

NB. This option is now the default. Set it to false to revert to the old BSDP type
alignments. This option may be slower than BSDP for some large scale analyses with
simple alignment models.

--refine <strategy>
Force exonerate to refine alignments generated by heuristics using dynamic
programming over larger regions. This takes more time, but improves the quality of
the final alignments.

The strategies available for refinement are:

none The default - no refinement is used.
full An exhaustive alignment is calculated from the pair of sequences in their
entirety.
region DP is applied just to the region of the sequences covered by the heuristic
alignment.

--refineboundary <size>
Specify an extra boundary to be included in the region subject to alignment during
refinement by region.

VITERBI ALGORITHM OPTIONS

-D | --dpmemory <Mb>
The exhaustive alignment traceback routines use a Hughey-style reduced memory
technique. This option specifies how much memory will be used for this.
Generally, the more memory is permitted here, the faster the alignments will be
produced.

CODE GENERATION OPTIONS

-C | --compiled <boolean>
This option allows disabling of generated code for dynamic programming. It is
mainly used during development of exonerate. When set to FALSE, an "interpreted"
version of the dynamic programming implementation is used, which is much slower.

HEURISTIC OPTIONS

--terminalrangeint
--terminalrangeext
--joinrangeint
--joinrangeext
--spanrangeint
--spanrangeext
These options are used to specify the size of the sub-alignment regions to which DP
is applied around the ends of the HSPs. This can be at the HSP ends (terminal
range), between HSPs (join range), or between HSPs which may be connected by a
large region such as an intron or non-equivalenced region (span range). These
ranges can be specified for a number of matches back onto the HSP (internal range)
or out from the HSP (external range).