This is the command sim4 that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator
PROGRAM:
NAME
sim4 - align an expressed DNA sequence with a genomic sequence
SYNOPSIS
sim4 seqfile1 seqfile2 {[WXKCRDAPNB]=value}
DESCRIPTION
sim4 is a similarity-based tool for aligning an expressed DNA sequence (EST, cDNA, mRNA)
with a genomic sequence for the gene. It also detects end matches when the two input
sequences overlap at one end (i.e., the start of one sequence overlaps the end of the
other). If seqfile2 is a database of sequences, the sequence in seqfile1 will be aligned
with each of the sequences in seqfile2.
sim4 employs a blast-based technique to first determine the basic matching blocks
representing the "exon cores". In this first stage, it detects all possible exact matches
of W-mers (i.e., DNA words of size W) between the two sequences and extends them to
maximal scoring gap-free segments. In the second stage, the exon cores are extended into
the adjacent as-yet-unmatched fragments using greedy alignment algorithms, and heuristics
are used to favor configurations that conform to the splice-site recognition signals (GT-
AG, CT-AC). If necessary, the process is repeated with less stringent parameters on the
unmatched fragments.
By default, sim4 searches both strands and reports the best match, measured by the number
of matching nucleotides found in the alignment. The R command line option can be used to
restrict the search to one orientation (strand) only.
Currently, five major alignment display options are supported, controlled by the A option.
By default (A=0), only the endpoints, overall similarity, and orientation of the introns
are reported. An arrow sign (`->' or `<-') indicates the orientation of the intron (`+' or
`-' strand), when the signals flanking the intron have three or more position matches with
either the GT-AG or the CT-AC splice recognition signals. When the same number of matches
is found for both orientations, the intron is reported as ambiguous, and represented by
`--'. The sign `==' marks the absence from the alignment of a cDNA fragment starting at
that position. Alternative formats (lav-block format, text, PipMaker-type `exons file', or
certain combinations of these options) can be requested by specifying a different value
for A.
If the P option is specified with a non-zero value, sim4 will remove any 3'-end poly-A
tails that it detects in the alignment.
Occasionally, sim4 may miss an internal exon when surrounded by very large introns,
typically longer than 100 Kb. When this is suspected, the H option can be used to reset
the exons' weight to compensate for the intron gap penalty.
Ambiguity codes are by default allowed in sequence data, but sim4 treats them non-
differentially. If desired, the B command option can restrict the set of acceptable
characters to A,C,G,T,N and X only.
sim4 compares the lengths of the input sequences to distinguish between the cDNA (`short')
and the genomic (`long') components in the comparison. When seqfile2 contains a collection
of sequences, the first entry in the file will be used to determine the type of this and
all subsequent comparisons.
In the description below, the term MSP denotes a Maximal Segment Pair, that is, a pair of
highly similar fragments in the two sequences, obtained during the blast-like procedure by
extending a W-mer hit by matches and perhaps a few mismatches.
OPTIONS
The algorithm parameters (included in the first two sections below) have already been
tuned and do not normally require adjustment by the user.
Parameters internal to the blast-like procedure:
W Sets the word size for blast hits in the first stage of the algorithm. The default
value is 12, but it can be increased for a more stringent search or decreased to
find weaker matches.
X Controls the limits for terminating word extensions in the blast-like stage of the
algorithm. The default value is 12.
K Sets the threshold for the MSP scores when determining the basic `exon cores',
during the first stage of the algorithm. (If this option is not specified, the
threshold is computed from the lengths of the sequences, using statistical
criteria.) For example, a good value for genomic sequences in the range of a few
hundred Kb is 16. To avoid spurious matches, however, a larger value may be needed
for longer sequences.
C Sets the threshold for the MSP scores when aligning the as-yet-unmatched fragments,
during the second stage of the algorithm. By default, the smaller of the constant
12 and a statistics-based threshold is chosen.
Additional algorithm parameters:
D Sets the bound for the "diagonal" distance within consecutive MSPs in an exon. The
default value is 10.
Context parameters:
R Specifies the direction of the search. If R=0, only the "+" (direct) strand is
searched. If R=1, only the "-" (reverse complement) matches are sought. By default
(R=2), sim4 searches both strands and reports the best match, measured by the
number of matching pairs in the alignment.
A Specifies the format of the output: exon endpoints only (A=0), exon endpoints and
boundaries of the coding region (CDS) in the genomic sequence, when specified for
the input mRNA (A=5), alignment text (A=1), alignment in lav-block format (A=2), or
both exon endpoints and alignment text (A=3 or A=4). If a reverse complement match
is found, A=0,1,2,3,5 will give its position in the "+" strand of the longer
sequence and the "-" strand of the shorter sequence. A=4 will give its position in
the "+" strand of the first sequence (seqfile1) and the "-" strand of the second
sequence (seqfile2), regardless of which sequence is longer. The A=5 option can be
used with the S command line option to specify the endpoints of the CDS in the
mRNA, and produces output in the `exons file' format required by PipMaker.
P Specifies whether or not the program should report the fragment of the alignment
containing the poly-A tail (if found). By default (P=0) the alignment is displayed
as computed, but specifying a non-zero value will request sim4 to remove the poly-A
tails. When this feature is enabled, all display options produce additional lav
alignment headers.
H Resets the MSPs' weight to compensate for very large introns. The default value is
H=500, but some introns larger than 100 Kb may require higher values, typically
between 1000 and 2500. This option should be used cautiously, generally in cases
where an unmatched internal portion of the cDNA may disguise a missed exon within a
very large intron. It is not recommended for ESTs, where they may produce spurious
exons.
N Requests an additional search for small marginal exons (N=1) guided by the splice-
site recognition signals. This option can be used when a high accuracy match is
expected. The default value is N=0, specifying no additional search.
B Controls the set of characters allowed in the input sequences. By default (B=1),
ambiguity characters (ABCDGHKMNRSTVWXY) are allowed. By specifying B=0, the set of
acceptable characters is restricted to A,C,G,T,N and X only.
S Allows the user to specify the endpoints of the CDS in the input mRNA, with the
syntax: S=n1..n2. This option is only available with the A=5 flag, which produces
output in the format required by PipMaker. Alternatively, the CDS coordinates could
appear in a construct CDS=n1..n2 in the FastA header of the mRNA sequence. When the
second file is an mRNA database, the command line specification for the CDS will
apply to the first sequence in the file only.
EXAMPLES
sim4 est genomic
sim4 genomic estdb
sim4 est genomic A=1 P=1
sim4 est1 est2 R=1
sim4 mRNA genomic A=5 S=123..1020
sim4 mouse_cDNA human_genomic K=15 C=11 A=3 W=10
AUTHORS
sim4 was written by Liliana Florea <[email protected]> and Scott Schwartz.
This manual page was written by Nelson A. de Oliveira <[email protected]>, based on the
online documentation at http://globin.cse.psu.edu/html/docs/sim4.html, for the Debian
project (but may be used by others).
Wed, 03 Aug 2005 18:40:58 -0300 SIM4(1)
Use sim4 online using onworks.net services