gsnap - Genomic Short-read Nucleotide Alignment Program
gsnap [OPTIONS...] <FASTA file>, or cat <FASTA file> | gsnap [OPTIONS...]
Input options (must include -d)
Genome directory. Default (as specified by --with-gmapdb to the configure program)
Whether to use a suffix array, which will give increased speed. Allowed values: 0
(no), 1 (yes, plus GSNAP/GMAP algorithm, default), or 2 (yes, and use only suffix
array algorithm). Note that suffix arrays will bias against SNP alleles in
kmer size to use in genome database (allowed values: 16 or less). If not specified,
the program will find the highest available kmer size in the genome database
Sampling to use in genome database. If not specified, the program will find the
smallest available sampling value in the genome database within selected k-mer size
Process only the i-th out of every n sequences e.g., 0/100 or 99/100 (useful for
distributing jobs to a computer farm).
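The partitioning described above can be pictured with a short Python sketch. It assumes that "i-th out of every n" means zero-based position modulo n, which fits the examples 0/100 and 99/100 but is not confirmed by this help text; `select_part` is a hypothetical helper, not part of GSNAP.

```python
def select_part(sequences, i, n):
    """Return the sequences that part i/n would process.

    Assumption: a sequence at zero-based position p belongs to part p % n.
    """
    if not 0 <= i < n:
        raise ValueError("i must satisfy 0 <= i < n")
    return [seq for p, seq in enumerate(sequences) if p % n == i]


reads = [f"read{k}" for k in range(10)]
# Three jobs, each handling a disjoint third of the input.
parts = [select_part(reads, i, 3) for i in range(3)]
```

Under this assumption, every sequence is handled by exactly one of the n jobs, which is what makes the option useful on a computer farm.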
Size of input buffer (program reads this many sequences at a time for efficiency)
Amount of barcode to remove from start of read (default 0)
Orientation of paired-end reads. Allowed values: FR (fwd-rev, or typical Illumina;
default), RF (rev-fwd, for circularized inserts), or FF (fwd-fwd, same strand)
Starting position of identifier in FASTQ header, space-delimited (>= 1)
Ending position of identifier in FASTQ header, space-delimited (>= 1)
start=1, end=1 (default) => identifier is HWUSI-EAS100R:6:73:941:1973#0
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
start=1, end=1 => identifier is SRR001666.1
start=2, end=2 => identifier is 071112_SLXA-EAS1_s_7:5:1:817:345
start=1, end=2 => identifier is SRR001666.1
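As a rough illustration of the start and end positions, the following Python sketch splits a FASTQ header on whitespace and selects the 1-based fields from start to end. `fastq_identifier` is a hypothetical helper; joining several selected fields with a single space is an assumption, not documented behavior.

```python
def fastq_identifier(header, start=1, end=1):
    """Pick space-delimited fields start..end (1-based, inclusive) from a header."""
    if start < 1 or end < start:
        raise ValueError("need 1 <= start <= end")
    text = header[1:] if header.startswith("@") else header
    fields = text.split()
    # Assumption: multiple selected fields are joined by one space.
    return " ".join(fields[start - 1:end])


header = "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36"
first_field = fastq_identifier(header, 1, 1)
second_field = fastq_identifier(header, 2, 2)
```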
When multiple FASTQ files are provided on the command line, GSNAP assumes they are
matching paired-end files. This flag treats each file as single-end.
Skips reads marked by the Illumina chastity program. Expecting a string after the
accession having a 'Y' after the first colon, like this:
where the 'Y' signifies filtering by chastity. Values: off (default), either,
both. For 'either', a 'Y' on either end of a paired-end read will be filtered.
For 'both', a 'Y' is required on both ends of a paired-end read (or on the only end
of a single-end read).
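The three filter values can be sketched in Python as follows. The header layout used here (an accession, a space, then colon-separated fields with 'Y' in the second one) is an assumption based on the description above, and the function names and sample headers are hypothetical.

```python
def is_chastity_flagged(header):
    """True if the string after the accession has 'Y' after its first colon."""
    parts = header.split(None, 1)
    if len(parts) < 2:
        return False
    fields = parts[1].split(":")
    return len(fields) > 1 and fields[1] == "Y"


def should_skip(headers, mode="off"):
    """Decide whether a read (one or two headers) is filtered out."""
    flags = [is_chastity_flagged(h) for h in headers]
    if mode == "off":
        return False
    if mode == "either":
        return any(flags)
    if mode == "both":
        return all(flags)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical paired-end headers: only the first end is flagged.
pair = ["@read1 1:Y:0:ACGT", "@read1 2:N:0:ACGT"]
```

With this pair, 'either' would filter the read and 'both' would keep it.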
Allows accession names of reads to mismatch in paired-end files
Uncompress gzipped input files
Uncompress bzip2-compressed input files
Note: GSNAP has an ultrafast algorithm for calculating mismatches up to and including
((readlength+2)/kmer - 2) ("ultrafast mismatches"). The program will run fastest if
max-mismatches (plus suboptimal-levels) is within that value. Also, indels, especially
end indels, take longer to compute, although the algorithm is still designed to be fast.
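To make the ultrafast threshold concrete, here is a small Python sketch. It uses the general form given in the max-mismatches description, (readlength+index_interval-1)/kmer - 2, which equals the (readlength+2) form in the note for the default index interval of 3. Integer (floor) division is an assumption, and both function names are hypothetical.

```python
def ultrafast_mismatches(readlength, kmer, index_interval=3):
    """Ultrafast mismatch level; floor division is assumed."""
    return (readlength + index_interval - 1) // kmer - 2


def within_ultrafast(max_mismatches, suboptimal_levels, readlength, kmer):
    """True if max-mismatches plus suboptimal-levels stays within the level."""
    level = ultrafast_mismatches(readlength, kmer)
    return max_mismatches + suboptimal_levels <= level


level_100 = ultrafast_mismatches(100, 15)  # (100 + 2) // 15 - 2
level_36 = ultrafast_mismatches(36, 12)    # (36 + 2) // 12 - 2
```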
Batch mode (default = 2)
Mode  Offsets   Positions       Genome          Suffix array
0     see note  mmap            mmap            mmap
1     see note  mmap & preload  mmap            mmap
2     see note  mmap & preload  mmap & preload  mmap & preload
3     see note  allocate        mmap & preload  mmap & preload
4     see note  allocate        allocate        mmap & preload  (default)
5     see note  allocate        allocate        allocate
Note: For a single sequence, all data structures use mmap
If mmap not available and allocate not chosen, then will use fileio (very slow)
Note about offsets: Expansion of offsets can be controlled
independently by the --expand-offsets flag. However, offsets are accessed
relatively fast in this version of GSNAP.
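The batch-mode table can also be read as a lookup from mode number to memory strategy. The Python sketch below transcribes the table rows and applies the single-sequence note; it leaves out the offsets column, since the table only says "see note" for it. `memory_plan` and the dictionary keys are hypothetical names.

```python
# Rows from the batch-mode table: (positions, genome, suffix array).
BATCH_MODES = {
    0: ("mmap", "mmap", "mmap"),
    1: ("mmap & preload", "mmap", "mmap"),
    2: ("mmap & preload", "mmap & preload", "mmap & preload"),
    3: ("allocate", "mmap & preload", "mmap & preload"),
    4: ("allocate", "allocate", "mmap & preload"),
    5: ("allocate", "allocate", "allocate"),
}


def memory_plan(mode, single_sequence=False):
    """Return the strategy per data structure for a batch mode."""
    if mode not in BATCH_MODES:
        raise ValueError("batch mode must be 0..5")
    # Per the note: for a single sequence, all data structures use mmap.
    strategies = ("mmap",) * 3 if single_sequence else BATCH_MODES[mode]
    return dict(zip(("positions", "genome", "suffix_array"), strategies))


plan = memory_plan(3)
```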
If 1 (default), then allocated memory is shared among all processes on this node.
If 0, then each process has private allocated memory
Whether to expand the genomic offsets index. Values: 0 (no, default), or 1 (yes).
Expansion gives faster alignment, but requires more memory
Maximum number of mismatches allowed (if not specified, then defaults to the
ultrafast level of ((readlength+index_interval-1)/kmer - 2)) (By default, the
genome index interval is 3, but this can be changed by providing a different value
for -q to gmap_build when processing the genome.)
If specified between 0.0 and 1.0, then treated as a fraction
of each read length. Otherwise, treated as an integral number of mismatches
(including indel and splicing penalties). For RNA-Seq, you may need to increase this
value slightly to align reads extending past the ends of an exon.
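The fraction-versus-integer rule can be sketched in Python as below. Treating only values strictly between 0.0 and 1.0 as fractions, and truncating the product to an integer, are both assumptions; the help text does not say how boundary values or rounding are handled. `resolve_max_mismatches` is a hypothetical name.

```python
def resolve_max_mismatches(value, readlength):
    """Turn a max-mismatches setting into an integral mismatch count."""
    if 0.0 < value < 1.0:
        # Assumption: fractional values scale with read length and truncate.
        return int(value * readlength)
    return int(value)


from_fraction = resolve_max_mismatches(0.25, 40)
from_integer = resolve_max_mismatches(3, 40)
```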
Minimum coverage required for an alignment. If specified between 0.0 and 1.0, then
treated as a fraction of each read length. Otherwise, treated as an integral
number of base pairs. Default value is 0.0.
Whether to count unknown (N) characters in the query as a mismatch (0=no (default), 1=yes)
Whether to count unknown (N) characters in the genome as a mismatch (0=no, 1=yes
Maximum number of alignments to find (default 1000). Must be larger than --npaths,
which is the number to report. Keeping this number large will allow for random
selection among multiple alignments. Reducing this number can speed up the program.
Penalty for an indel (default 2). Counts against mismatches allowed. To find
indels, make indel-penalty less than or equal to max-mismatches. A value < 2 can
lead to false positives at read ends
Minimum length at end required for indel alignments (default 4)
Maximum number of middle insertions allowed (default 9)
-z, --max-middle-deletions=INT Maximum number of middle deletions allowed (default 30)
Maximum number of end insertions allowed (default 3)
Maximum number of end deletions allowed (default 6)
Report suboptimal hits beyond best hit (default 0). All hits with best score plus
suboptimal-levels are reported
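One way to read the suboptimal-levels rule is shown in the Python sketch below. It assumes scores are penalty counts where lower is better, and that a hit is reported when its score is at most the best score plus suboptimal-levels. Both points are interpretations of the description, and `hits_to_report` is a hypothetical helper.

```python
def hits_to_report(hits, suboptimal_levels=0):
    """Keep (name, score) hits whose score is within the allowed margin.

    Assumption: lower scores are better (fewer mismatches and penalties).
    """
    if not hits:
        return []
    best = min(score for _, score in hits)
    return [hit for hit in hits if hit[1] <= best + suboptimal_levels]


hits = [("chr1:100", 1), ("chr2:500", 2), ("chr7:42", 4)]
only_best = hits_to_report(hits)
with_one_level = hits_to_report(hits, suboptimal_levels=1)
```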
Method for removing adapters from reads. Currently allowed values: off, paired.
Default is "off". To turn on, specify "paired", which removes adapters from
paired-end reads if they appear to be present.
Score to use for mismatches when trimming at ends (default is -3; to turn off
trimming, specify 0). Warning: turning trimming off will give false positive
mismatches at the ends of reads
Score to use for indels when trimming at ends (default is -2; to turn off trimming,
specify 0). Warning: turning trimming off will give false positive indels at the
ends of reads
Directory for SNPs index files (created using snpindex) (default is location of
genome index files specified using -D and -d)
Use database containing known SNPs (in <STRING>.iit, built previously using
snpindex) for tolerance to SNPs
Directory for methylcytosine index files (created using cmetindex) (default is
location of genome index files specified using -D, -V, and -d)
Directory for A-to-I RNA editing index files (created using atoiindex) (default is
location of genome index files specified using -D, -V, and -d)
Alignment mode: standard (default), cmet-stranded, cmet-nonstranded, atoi-stranded,
atoi-nonstranded, ttoc-stranded, or ttoc-nonstranded. Non-standard modes require
you to have previously run the cmetindex or atoiindex programs (which also cover
the ttoc modes) on the genome
Number of worker threads
Options for GMAP alignment within GSNAP
Cases to use GMAP for complex alignments containing multiple splices or indels.
Allowed values: none, all, pairsearch, indel_knownsplice, terminal, improve
(or multiple values, separated by commas).
Default: all, i.e., pairsearch,indel_knownsplice,terminal,improve
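The way a comma-separated value expands can be sketched in Python. The case names and the meaning of "all" come from the text above; letting "none" simply contribute nothing when combined with other values is an assumption, and `parse_gmap_mode` is a hypothetical helper.

```python
GMAP_CASES = ("pairsearch", "indel_knownsplice", "terminal", "improve")


def parse_gmap_mode(value="all"):
    """Expand a comma-separated mode string into a list of enabled cases."""
    selected = set()
    for name in (part.strip() for part in value.split(",")):
        if name == "all":
            selected.update(GMAP_CASES)
        elif name == "none" or name == "":
            continue  # assumption: "none" adds no cases
        elif name in GMAP_CASES:
            selected.add(name)
        else:
            raise ValueError(f"unknown gmap mode: {name}")
    # Report the cases in the order the help text lists them.
    return [case for case in GMAP_CASES if case in selected]


default_cases = parse_gmap_mode()
two_cases = parse_gmap_mode("terminal,pairsearch")
```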
Try GMAP pairsearch on nearby genomic regions if best score (the total of both ends
if paired-end) exceeds this value (default 5)
Keep GMAP hit only if it has this many consecutive matches (default 20)
Extra mismatch/indel score allowed for GMAP alignments (default 3)
Perform GMAP pairsearch on nearby genomic regions up to this many candidate
ends (default 50). Requires pairsearch in --gmap-mode
Perform GMAP terminal on nearby genomic regions up to this many candidate ends
(default 50). Requires terminal in --gmap-mode
Perform GMAP improvement on nearby genomic regions up to this many candidate ends
(default 5). Requires improve in --gmap-mode
Allow microexons only if one of the splice site probabilities is greater than this
value (default 0.95)
Splicing options for DNA-Seq
Look for distant splicing in DNA-Seq data (0=no (default), 1=yes). Automatically
inactivated for RNA-Seq data if -N or -s is specified.
Splicing options for RNA-Seq
Look for novel splicing (0=no (default), 1=yes)
Directory for splicing involving known sites or known introns, as specified by the
-s or --use-splicing flag (default is directory computed from -D and -d flags).
Note: can just give full pathname to the -s flag instead.
Look for splicing involving known sites or known introns (in <STRING>.iit), at
short or long distances. See README instructions for the distinction between known
sites and known introns
For ambiguous known splicing at ends of the read, do not clip at the splice site,
but extend instead into the intron. This flag makes sense only if you provide the
--use-splicing flag, and you are trying to eliminate all soft clipping with
Definition of local novel splicing event (default 200000)
Distance to look for novel splices at the ends of reads (default 50000)
Penalty for a local splice (default 0). Counts against mismatches allowed
Penalty for a distant splice (default 1). A distant splice is one where the intron
length exceeds the value of -w, or --localsplicedist, or is an inversion, scramble,
or translocation between two different chromosomes. Counts against mismatches allowed
Minimum length at end required for distant spliced alignments (default 20, min
allowed is the value of -k, or kmer size)
Minimum length at end required for short-end spliced alignments (default 2, but
unless known splice sites are provided with the -s flag, GSNAP may still need the
end length to be the value of -k, or kmer size, to find a given splice)
Minimum identity at end required for distant spliced alignments (default 0.95)
(Not currently implemented, since it leads to poor results) Penalty for
antistranded splicing when using stranded RNA-Seq protocols. A positive value,
such as 1, expects antisense on the first read and sense on the second read.
Default is 0, which treats sense and antisense equally well
Report distant splices on the same chromosome as a single splice, if possible.
Will produce a single SAM line instead of two SAM lines, which is also done for
translocations, inversions, and scramble events
Options for paired-end reads
Max total genomic length for DNA-Seq paired reads, or other reads without splicing
(default 1000). Used if -N or -s is not specified.
Max total genomic length for RNA-Seq paired reads, or other reads that could have a
splice (default 200000). Used if -N or -s is specified. Should probably match the
value for -w, --localsplicedist.
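The choice between the two paired-end length limits can be sketched in Python as follows. The defaults are the ones quoted above; the function and parameter names are hypothetical stand-ins for the -N and -s settings.

```python
def effective_pairmax(novel_splicing=False, use_splicing=None,
                      pairmax_dna=1000, pairmax_rna=200000):
    """Pick the max total genomic length that applies to a run.

    novel_splicing stands in for -N, use_splicing for the -s argument.
    """
    splicing_enabled = novel_splicing or bool(use_splicing)
    return pairmax_rna if splicing_enabled else pairmax_dna


dna_limit = effective_pairmax()
rna_limit = effective_pairmax(novel_splicing=True)
```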
Expected paired-end length, used for calling splices in medial part of paired-end
reads (default 200). Was turned off in previous versions, but reinstated.
Allowable deviation from expected paired-end length, used for calling splices in
medial part of paired-end reads (default 100). Was turned off in previous
versions, but reinstated.
Options for quality scores
Protocol for input quality scores. Allowed values:
  illumina (ASCII 64-126) (equivalent to -J 64 -j -31)
  sanger (ASCII 33-126) (equivalent to -J 33 -j 0)
Default is sanger (no quality print shift)
SAM output files should have quality scores in sanger protocol
Or you can customize this behavior with these flags:
FASTQ quality scores are zero at this ASCII value (default is 33 for sanger
protocol; for Illumina, select 64)
Shift FASTQ quality scores by this amount in output (default is 0 for sanger
protocol; to change Illumina input to Sanger output, select -31)
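The effect of the zero-score and print-shift values can be shown with a short Python sketch. It applies the shift directly to each ASCII code, which matches the Illumina-to-Sanger example (-J 64 -j -31) but is otherwise a simplified model; the function names are hypothetical.

```python
def phred_scores(quality, zero_score=33):
    """Decode a FASTQ quality string; zero_score plays the role of -J."""
    scores = [ord(ch) - zero_score for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("quality character below the zero-score ASCII value")
    return scores


def shift_quality(quality, print_shift=0):
    """Shift each quality character for output; print_shift plays the role of -j."""
    return "".join(chr(ord(ch) + print_shift) for ch in quality)


illumina_quality = "@h"  # ASCII 64 and 104, i.e. scores 0 and 40
sanger_quality = shift_quality(illumina_quality, print_shift=-31)
```

The decoded scores stay the same before and after the shift, provided each string is read with its own zero-score value (64 for the input, 33 for the output).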
Maximum number of paths to print (default 100).
If more than maximum number of paths are found, then nothing is printed.
Print output in same order as input (relevant only if there is more than one worker thread)
For GSNAP output in SNP-tolerant alignment, shows all differences relative to the
reference genome as lower case (otherwise, it shows all differences relative to
both the reference and alternate genome)
For paired-end reads whose alignments overlap, clip the overlapping region.
For paired-end reads whose alignments overlap, merge the two ends into a single end
Print detailed information about SNPs in reads (works only if -v also selected)
(not fully implemented yet)
Print only failed alignments, those with no results
Exclude printing of failed alignments
Another format type, other than default. Currently implemented: sam, m8 (BLAST
Basename for multiple-file output, separately for nomapping, halfmapping_uniq,
halfmapping_mult, unpaired_uniq, unpaired_mult, paired_uniq, paired_mult,
concordant_uniq, and concordant_mult results
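The set of files produced for a given basename can be listed with a small Python sketch. The category names are taken from the text above. Joining basename and category with a dot is inferred from the ".nomapping file" wording in the --failed-input description and should be treated as an assumption; `split_output_paths` is a hypothetical helper.

```python
SPLIT_CATEGORIES = (
    "nomapping",
    "halfmapping_uniq", "halfmapping_mult",
    "unpaired_uniq", "unpaired_mult",
    "paired_uniq", "paired_mult",
    "concordant_uniq", "concordant_mult",
)


def split_output_paths(basename):
    """Map each result category to its assumed output file name."""
    return {category: f"{basename}.{category}" for category in SPLIT_CATEGORIES}


paths = split_output_paths("sample1")
```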
File name for a single stream of output results.
Print completely failed alignments as input FASTA or FASTQ format, to the given
file, appending .1 or .2, for paired-end data. If the --split-output flag is also
given, this file is generated in addition to the output in the .nomapping file.
When --split-output or --failed-input is given, this flag will append output to the
existing files. Otherwise, the default is to create new files.
Among alignments tied with the best score, order those alignments in this order.
Allowed values: genomic, random (default)
Buffer size, in queries, for output thread (default 1000). When the number of
results to be printed exceeds this size, the worker threads are halted until the
backlog is cleared
Options for SAM output
Do not print headers beginning with '@'
Add nomapper lines as needed to make all paired-end results alternate between first
end and second end
Whether the paired bit in the SAM flags means concordant only (1) or paired plus
concordant (0, default)
Print headers only for this batch, as specified by -q
Insert 0M in CIGAR between adjacent insertions and deletions. Required by Picard,
but can cause errors in other tools
Allows multiple alignments to be marked as primary if they have equally good
For RNA-Seq alignments, disallows XS:A:? when the sense direction is unclear, and
replaces this value arbitrarily with XS:A:+. May be useful for some programs, such
as Cufflinks, that cannot handle XS:A:?. However, if you use this flag, the
reported value of XS:A:+ in these cases will not be meaningful.
In MD string, when known SNPs are given by the -v flag, prints difference
nucleotides as lower-case when they differ from reference but match a known
Extends alignments through soft clipped regions
Action to take if there is a disagreement between CIGAR length and sequence length.
Allowed values: ignore, warning, noprint (default), abort
Value to put into read-group id (RG-ID) field
Value to put into read-group name (RG-SM) field
Value to put into read-group library (RG-LB) field
Value to put into read-group platform (RG-PL) field
Check compiler assumptions
--help Show this help message
Other tools of GMAP suite are located in /usr/lib/gmap