NAME

gsnap - Genomic Short-read Nucleotide Alignment Program

SYNOPSIS

gsnap [OPTIONS...] <FASTA file>, or cat <FASTA file> | gsnap [OPTIONS...]

OPTIONS

Input options (must include -d)
-D, --dir=directory
Genome directory. Default (as specified by --with-gmapdb to the configure program)
is /var/cache/gmap

-d, --db=STRING
Genome database

--use-sarray=INT
Whether to use a suffix array, which will give increased speed. Allowed values: 0
(no), 1 (yes, plus GSNAP/GMAP algorithm, default), or 2 (yes, and use only suffix
array algorithm). Note that suffix arrays will bias against SNP alleles in
SNP-tolerant alignment.

-k, --kmer=INT
kmer size to use in genome database (allowed values: 16 or less). If not specified,
the program will find the highest available kmer size in the genome database

--sampling=INT
Sampling to use in genome database. If not specified, the program will find the
smallest available sampling value in the genome database within selected k-mer size

-q, --part=INT/INT
Process only the i-th out of every n sequences e.g., 0/100 or 99/100 (useful for
distributing jobs to a computer farm).
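
As a rough illustration of how an i/n split divides the input, the following Python sketch keeps every n-th sequence starting at index i. The 0-based modulo rule and the helper name select_part are assumptions made for illustration; they are not taken from the GSNAP source.

```python
def select_part(sequences, part):
    """Return the sequences that a job started with -q i/n might process.

    Assumption: sequences are counted from 0 and sequence k is kept
    when k % n == i.
    """
    i_text, n_text = part.split("/")
    i, n = int(i_text), int(n_text)
    if not 0 <= i < n:
        raise ValueError("expected 0 <= i < n in i/n")
    return [seq for k, seq in enumerate(sequences) if k % n == i]


reads = ["read%d" % k for k in range(10)]
print(select_part(reads, "0/4"))
print(select_part(reads, "3/4"))
```

Under this rule, running the parts 0/n through (n-1)/n covers every sequence exactly once.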

--input-buffer-size=INT
Size of input buffer (program reads this many sequences at a time for efficiency)
(default 1000)

--barcode-length=INT
Amount of barcode to remove from start of read (default 0)

--orientation=STRING
Orientation of paired-end reads. Allowed values: FR (fwd-rev, or typical Illumina;
default), RF (rev-fwd, for circularized inserts), or FF (fwd-fwd, same strand)

--fastq-id-start=INT
Starting position of identifier in FASTQ header, space-delimited (>= 1)

--fastq-id-end=INT
Ending position of identifier in FASTQ header, space-delimited (>= 1)

Examples:

@HWUSI-EAS100R:6:73:941:1973#0/1
start=1, end=1 (default) => identifier is HWUSI-EAS100R:6:73:941:1973#0

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
start=1, end=1 => identifier is SRR001666.1
start=2, end=2 => identifier is 071112_SLXA-EAS1_s_7:5:1:817:345
start=1, end=2 => identifier is SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345
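
The header examples above can be reproduced with a short Python sketch. The function name fastq_identifier is hypothetical, and dropping a trailing /1 or /2 is an assumption inferred from the first example, not a documented rule.

```python
def fastq_identifier(header, start=1, end=1):
    """Join space-delimited header fields start..end (1-based, inclusive).

    Assumption: a trailing /1 or /2 pair suffix is dropped, as the
    first example in the text suggests.
    """
    fields = header.lstrip("@").split(" ")
    identifier = " ".join(fields[start - 1:end])
    if identifier.endswith(("/1", "/2")):
        identifier = identifier[:-2]
    return identifier


illumina = "@HWUSI-EAS100R:6:73:941:1973#0/1"
sra = "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36"
print(fastq_identifier(illumina))
print(fastq_identifier(sra, 2, 2))
print(fastq_identifier(sra, 1, 2))
```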

--force-single-end
When multiple FASTQ files are provided on the command line, GSNAP assumes they are
matching paired-end files. This flag treats each file as single-end.

--filter-chastity=STRING
Skips reads marked by the Illumina chastity program. Expecting a string after the
accession having a 'Y' after the first colon, like this:

@accession 1:Y:0:CTTGTA
where the 'Y' signifies filtering by chastity. Values: off (default), either,
both. For 'either', a 'Y' on either end of a paired-end read will be filtered.
For 'both', a 'Y' is required on both ends of a paired-end read (or on the only end
of a single-end read).
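
A minimal Python sketch of the either/both logic described above. The helper names are hypothetical and the header parsing is a simplification that assumes the layout shown in the example (accession, a space, then colon-separated fields).

```python
def is_chastity_flagged(header):
    """True if the field after the accession has 'Y' after its first colon."""
    parts = header.split(" ")
    if len(parts) < 2:
        return False
    fields = parts[1].split(":")
    return len(fields) > 1 and fields[1] == "Y"


def skip_read(headers, mode="off"):
    """headers: one header for a single-end read, two for a paired-end read."""
    flags = [is_chastity_flagged(h) for h in headers]
    if mode == "either":
        return any(flags)
    if mode == "both":
        return all(flags)
    return False  # mode "off": nothing is skipped


pair = ["@accession 1:Y:0:CTTGTA", "@accession 2:N:0:CTTGTA"]
print(skip_read(pair, "either"), skip_read(pair, "both"))
```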

--allow-pe-name-mismatch
Allows accession names of reads to mismatch in paired-end files

--gunzip
Uncompress gzipped input files

--bunzip2
Uncompress bzip2-compressed input files

Computation options

Note: GSNAP has an ultrafast algorithm for calculating mismatches up to and including
((readlength+2)/kmer - 2) ("ultrafast mismatches"). The program will run fastest if
max-mismatches (plus suboptimal-levels) is within that value. Also, indels, especially
end indels, take longer to compute, although the algorithm is still designed to be fast.
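
To get a feel for the ultrafast level, the formula can be evaluated in a few lines of Python. This is a sketch: integer (floor) division is an assumption, the function names are hypothetical, and the index_interval parameter follows the more general form given under --max-mismatches (with the default interval of 3 it reduces to readlength+2).

```python
def ultrafast_mismatches(readlength, kmer, index_interval=3):
    # Assumption: the division is integer division, as in C.
    return (readlength + index_interval - 1) // kmer - 2


def within_ultrafast(max_mismatches, suboptimal_levels, readlength, kmer):
    """True if max-mismatches plus suboptimal-levels stays within the level."""
    level = ultrafast_mismatches(readlength, kmer)
    return max_mismatches + suboptimal_levels <= level


print(ultrafast_mismatches(100, 15))
print(within_ultrafast(3, 1, 100, 15))
```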

-B, --batch=INT
Batch mode (default = 2)

           Mode  Offsets   Positions       Genome          Suffix array
           0     see note  mmap            mmap            mmap
           1     see note  mmap & preload  mmap            mmap
           2     see note  mmap & preload  mmap & preload  mmap & preload
           3     see note  allocate        mmap & preload  mmap & preload
(default)  4     see note  allocate        allocate        mmap & preload
           5     see note  allocate        allocate        allocate

Note: For a single sequence, all data structures use mmap.
If mmap is not available and allocate is not chosen, the program will use fileio (very slow).

Note about offsets: Expansion of offsets can be controlled
independently by the --expand-offsets flag. However, offsets are accessed
relatively fast in this version of GSNAP.

--use-shared-memory=INT
If 1 (default), then allocated memory is shared among all processes on this node.
If 0, then each process has private allocated memory

--expand-offsets=INT
Whether to expand the genomic offsets index. Values: 0 (no, default), or 1 (yes).
Expansion gives faster alignment, but requires more memory

-m, --max-mismatches=FLOAT
Maximum number of mismatches allowed (if not specified, then defaults to the
ultrafast level of ((readlength+index_interval-1)/kmer - 2)) (By default, the
genome index interval is 3, but this can be changed by providing a different value
for -q to gmap_build when processing the genome.)

If specified between 0.0 and 1.0, then treated as a fraction
of each read length. Otherwise, treated as an integral number of mismatches
(including indel and splicing penalties). For RNA-Seq, you may need to increase this
value slightly to align reads extending past the ends of an exon.
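
The fraction-or-count rule can be sketched as follows. The function name is hypothetical, and two details are assumptions because the text does not state them: only values strictly between 0.0 and 1.0 are treated as fractions, and the fractional result is rounded down.

```python
import math


def mismatch_limit(max_mismatches, readlength):
    """Turn a --max-mismatches value into a per-read mismatch count."""
    # Assumption: strictly between 0.0 and 1.0 means "fraction of read length".
    if 0.0 < max_mismatches < 1.0:
        # Assumption: fractional counts are rounded down.
        return math.floor(max_mismatches * readlength)
    return int(max_mismatches)


print(mismatch_limit(0.25, 100))
print(mismatch_limit(5, 100))
```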

--min-coverage=FLOAT
Minimum coverage required for an alignment. If specified between 0.0 and 1.0, then
treated as a fraction of each read length. Otherwise, treated as an integral
number of base pairs. Default value is 0.0.

--query-unk-mismatch=INT
Whether to count unknown (N) characters in the query as a mismatch (0=no (default),
1=yes)

--genome-unk-mismatch=INT
Whether to count unknown (N) characters in the genome as a mismatch (0=no, 1=yes
(default))

--maxsearch=INT
Maximum number of alignments to find (default 1000). Must be larger than --npaths,
which is the number to report. Keeping this number large will allow for random
selection among multiple alignments. Reducing this number can speed up the
program.

-i, --indel-penalty=INT
Penalty for an indel (default 2). Counts against mismatches allowed. To find
indels, make indel-penalty less than or equal to max-mismatches. A value < 2 can
lead to false positives at read ends

--indel-endlength=INT
Minimum length at end required for indel alignments (default 4)

-y, --max-middle-insertions=INT
Maximum number of middle insertions allowed (default 9)

-z, --max-middle-deletions=INT
Maximum number of middle deletions allowed (default 30)

-Y, --max-end-insertions=INT
Maximum number of end insertions allowed (default 3)

-Z, --max-end-deletions=INT
Maximum number of end deletions allowed (default 6)

-M, --suboptimal-levels=INT
Report suboptimal hits beyond best hit (default 0). All hits scoring up to the best
score plus suboptimal-levels are reported
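
One way to read this option is as a score window above the best hit. The sketch below assumes that scores are penalty counts (lower is better) and that hits are simple (name, score) pairs; both are illustrative assumptions, as is the function name.

```python
def reportable_hits(hits, suboptimal_levels=0):
    """Keep hits whose score is at most best score + suboptimal_levels.

    Assumption: score is a penalty count, so the lowest score is best.
    """
    if not hits:
        return []
    best = min(score for _, score in hits)
    return [hit for hit in hits if hit[1] <= best + suboptimal_levels]


hits = [("chr1:100", 1), ("chr2:500", 2), ("chr3:900", 4)]
print(reportable_hits(hits))
print(reportable_hits(hits, suboptimal_levels=1))
```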

-a, --adapter-strip=STRING
Method for removing adapters from reads. Currently allowed values: off, paired.
Default is "off". To turn on, specify "paired", which removes adapters from
paired-end reads if they appear to be present.

--trim-mismatch-score=INT
Score to use for mismatches when trimming at ends (default is -3; to turn off
trimming, specify 0). Warning: turning trimming off will give false positive
mismatches at the ends of reads

--trim-indel-score=INT
Score to use for indels when trimming at ends (default is -2; to turn off trimming,
specify 0). Warning: turning trimming off will give false positive indels at the
ends of reads

-V, --snpsdir=STRING
Directory for SNPs index files (created using snpindex) (default is location of
genome index files specified using -D and -d)

-v, --use-snps=STRING
Use database containing known SNPs (in <STRING>.iit, built previously using
snpindex) for tolerance to SNPs

--cmetdir=STRING
Directory for methylcytosine index files (created using cmetindex) (default is
location of genome index files specified using -D, -V, and -d)

--atoidir=STRING
Directory for A-to-I RNA editing index files (created using atoiindex) (default is
location of genome index files specified using -D, -V, and -d)

--mode=STRING
Alignment mode: standard (default), cmet-stranded, cmet-nonstranded, atoi-stranded,
atoi-nonstranded, ttoc-stranded, or ttoc-nonstranded. Non-standard modes require
you to have previously run the cmetindex or atoiindex programs (which also cover
the ttoc modes) on the genome

-t, --nthreads=INT
Number of worker threads

Options for GMAP alignment within GSNAP

--gmap-mode=STRING
Cases to use GMAP for complex alignments containing multiple splices or indels.
Allowed values: none, all, pairsearch, indel_knownsplice, terminal, improve
(or multiple values, separated by commas).
Default: all, i.e., pairsearch,indel_knownsplice,terminal,improve
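
A small Python sketch of how a comma-separated --gmap-mode value expands into individual cases. The parser itself is hypothetical; in particular, letting "none" or "all" override other listed values is an assumption.

```python
GMAP_MODES = ("pairsearch", "indel_knownsplice", "terminal", "improve")


def parse_gmap_mode(value="all"):
    """Expand a --gmap-mode string into the set of enabled cases."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    if "none" in names:  # assumption: "none" wins over anything else
        return set()
    if "all" in names:
        return set(GMAP_MODES)
    unknown = set(names) - set(GMAP_MODES)
    if unknown:
        raise ValueError("unknown gmap mode: %s" % ", ".join(sorted(unknown)))
    return set(names)


enabled = parse_gmap_mode("pairsearch,terminal")
print(sorted(enabled))
print("improve" in enabled)
```

A check such as "terminal" in enabled mirrors the notes on the --max-gmap-* options, each of which requires its case to be present in --gmap-mode.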

--trigger-score-for-gmap=INT
Try GMAP pairsearch on nearby genomic regions if best score (the total of both ends
if paired-end) exceeds this value (default 5)

--gmap-min-match-length=INT
Keep GMAP hit only if it has this many consecutive matches (default 20)

--gmap-allowance=INT
Extra mismatch/indel score allowed for GMAP alignments (default 3)

--max-gmap-pairsearch=INT
Perform GMAP pairsearch on nearby genomic regions up to this many candidate
ends (default 50). Requires pairsearch in --gmap-mode

--max-gmap-terminal=INT
Perform GMAP terminal on nearby genomic regions up to this many candidate ends
(default 50). Requires terminal in --gmap-mode

--max-gmap-improvement=INT
Perform GMAP improvement on nearby genomic regions up to this many candidate ends
(default 5). Requires improve in --gmap-mode

--microexon-spliceprob=FLOAT
Allow microexons only if one of the splice site probabilities is greater than this
value (default 0.95)

Splicing options for DNA-Seq

--find-dna-chimeras=INT
Look for distant splicing in DNA-Seq data (0=no (default), 1=yes). Automatically
inactivated for RNA-Seq data (if -N or -s is specified)

Splicing options for RNA-Seq

-N, --novelsplicing=INT
Look for novel splicing (0=no (default), 1=yes)

--splicingdir=STRING
Directory for splicing involving known sites or known introns, as specified by the
-s or --use-splicing flag (default is directory computed from -D and -d flags).
Note: can just give full pathname to the -s flag instead.

-s, --use-splicing=STRING
Look for splicing involving known sites or known introns (in <STRING>.iit), at
short or long distances. See README instructions for the distinction between known
sites and known introns

--ambig-splice-noclip
For ambiguous known splicing at ends of the read, do not clip at the splice site,
but extend instead into the intron. This flag makes sense only if you provide the
--use-splicing flag, and you are trying to eliminate all soft clipping with
--trim-mismatch-score=0

-w, --localsplicedist=INT
Definition of local novel splicing event (default 200000)

--novelend-splicedist=INT
Distance to look for novel splices at the ends of reads (default 50000)

-e, --local-splice-penalty=INT
Penalty for a local splice (default 0). Counts against mismatches allowed

-E, --distant-splice-penalty=INT
Penalty for a distant splice (default 1). A distant splice is one where the intron
length exceeds the value of -w, or --localsplicedist, or is an inversion, scramble,
or translocation between two different chromosomes. Counts against mismatches
allowed
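
The local/distant distinction can be sketched in Python using the documented defaults (-w 200000, -e 0, -E 1). The function and its boolean inputs are hypothetical simplifications: same_chromosome stands for the translocation test and rearranged for an inversion or scramble.

```python
def splice_penalty(intron_length, same_chromosome=True, rearranged=False,
                   localsplicedist=200000, local_penalty=0, distant_penalty=1):
    """Return the penalty that counts against the mismatches allowed."""
    distant = (
        not same_chromosome                  # translocation
        or rearranged                        # inversion or scramble
        or intron_length > localsplicedist   # intron longer than -w
    )
    return distant_penalty if distant else local_penalty


print(splice_penalty(5000))
print(splice_penalty(300000))
print(splice_penalty(100, same_chromosome=False))
```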

-K, --distant-splice-endlength=INT
Minimum length at end required for distant spliced alignments (default 20, min
allowed is the value of -k, or kmer size)

-l, --shortend-splice-endlength=INT
Minimum length at end required for short-end spliced alignments (default 2, but
unless known splice sites are provided with the -s flag, GSNAP may still need the
end length to be the value of -k, or kmer size, to find a given splice)

--distant-splice-identity=FLOAT
Minimum identity at end required for distant spliced alignments (default 0.95)

--antistranded-penalty=INT
(Not currently implemented, since it leads to poor results) Penalty for
antistranded splicing when using stranded RNA-Seq protocols. A positive value,
such as 1, expects antisense on the first read and sense on the second read.
Default is 0, which treats sense and antisense equally well

--merge-distant-samechr
Report distant splices on the same chromosome as a single splice, if possible.
Will produce a single SAM line instead of two SAM lines, which is also done for
translocations, inversions, and scramble events

Options for paired-end reads

--pairmax-dna=INT
Max total genomic length for DNA-Seq paired reads, or other reads without splicing
(default 1000). Used if -N or -s is not specified.

--pairmax-rna=INT
Max total genomic length for RNA-Seq paired reads, or other reads that could have a
splice (default 200000). Used if -N or -s is specified. Should probably match the
value for -w, --localsplicedist.

--pairexpect=INT
Expected paired-end length, used for calling splices in medial part of paired-end
reads (default 200). Was turned off in previous versions, but reinstated.

--pairdev=INT
Allowable deviation from expected paired-end length, used for calling splices in
medial part of paired-end reads (default 100). Was turned off in previous
versions, but reinstated.

Options for quality scores

--quality-protocol=STRING
Protocol for input quality scores. Allowed values:
illumina (ASCII 64-126) (equivalent to -J 64 -j -31)
sanger (ASCII 33-126) (equivalent to -J 33 -j 0)

Default is sanger (no quality print shift)
SAM output files should have quality scores in sanger protocol

Or you can customize this behavior with these flags:

-J, --quality-zero-score=INT
FASTQ quality scores are zero at this ASCII value (default is 33 for sanger
protocol; for Illumina, select 64)

-j, --quality-print-shift=INT
Shift FASTQ quality scores by this amount in output (default is 0 for sanger
protocol; to change Illumina input to Sanger output, select -31)
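
The effect of -J and -j on a quality string can be sketched in Python. The function name is hypothetical; the defaults below correspond to the illumina protocol settings given above (-J 64 -j -31).

```python
def shift_qualities(quality_string, zero_score=64, print_shift=-31):
    """Re-encode a FASTQ quality string by adding print_shift to each character.

    zero_score is the ASCII value that represents quality 0 in the input.
    """
    shifted = []
    for ch in quality_string:
        if ord(ch) < zero_score:
            raise ValueError("character %r is below the zero score" % ch)
        shifted.append(chr(ord(ch) + print_shift))
    return "".join(shifted)


# 'h' is ASCII 104 (quality 40 at offset 64); '@' is ASCII 64 (quality 0).
print(shift_qualities("hh@"))
```

With zero_score=33 and print_shift=0 (the sanger settings) the string is returned unchanged.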

Output options

-n, --npaths=INT
Maximum number of paths to print (default 100).

-Q, --quiet-if-excessive
If more than maximum number of paths are found, then nothing is printed.

-O, --ordered
Print output in same order as input (relevant only if there is more than one worker
thread)

--show-refdiff
For GSNAP output in SNP-tolerant alignment, shows all differences relative to the
reference genome as lower case (otherwise, it shows all differences relative to
both the reference and alternate genome)

--clip-overlap
For paired-end reads whose alignments overlap, clip the overlapping region.

--merge-overlap
For paired-end reads whose alignments overlap, merge the two ends into a single end
(beta implementation)

--print-snps
Print detailed information about SNPs in reads (works only if -v also selected)
(not fully implemented yet)

--failsonly
Print only failed alignments, those with no results

--nofails
Exclude printing of failed alignments

-A, --format=STRING
Another format type, other than default. Currently implemented: sam, m8 (BLAST
tabular format)

--split-output=STRING
Basename for multiple-file output, separately for nomapping, halfmapping_uniq,
halfmapping_mult, unpaired_uniq, unpaired_mult, paired_uniq, paired_mult,
concordant_uniq, and concordant_mult results
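
For planning downstream steps it can help to list the files a given basename would produce. The sketch below assumes each file is named basename.category; that pattern is inferred from the ".nomapping file" mentioned under --failed-input, and the function name is hypothetical.

```python
SPLIT_OUTPUT_CATEGORIES = (
    "nomapping",
    "halfmapping_uniq", "halfmapping_mult",
    "unpaired_uniq", "unpaired_mult",
    "paired_uniq", "paired_mult",
    "concordant_uniq", "concordant_mult",
)


def split_output_files(basename):
    # Assumption: output files are named "<basename>.<category>".
    return ["%s.%s" % (basename, cat) for cat in SPLIT_OUTPUT_CATEGORIES]


for name in split_output_files("sample1"):
    print(name)
```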

-o, --output-file=STRING
File name for a single stream of output results.
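
As a worked example of combining the options above, the following Python sketch assembles a gsnap command line for gzipped paired-end input with SAM output. It only builds the argument list and does not run gsnap. The helper name, the database name hg19, and the file names are hypothetical placeholders; the flags are those documented in this page.

```python
import shlex


def gsnap_command(db, fastq_files, genome_dir=None, threads=None,
                  novel_splicing=False, gunzip=False, output_file=None):
    """Build an argument list for gsnap from a few documented options."""
    argv = ["gsnap", "-d", db]
    if genome_dir is not None:
        argv += ["-D", genome_dir]
    if threads is not None:
        argv += ["-t", str(threads)]
    if novel_splicing:
        argv += ["-N", "1"]
    if gunzip:
        argv.append("--gunzip")
    argv += ["-A", "sam"]
    if output_file is not None:
        argv += ["-o", output_file]
    return argv + list(fastq_files)


argv = gsnap_command(
    "hg19",                                        # hypothetical database name
    ["reads_1.fastq.gz", "reads_2.fastq.gz"],      # hypothetical input files
    threads=4, novel_splicing=True, gunzip=True, output_file="out.sam",
)
print(shlex.join(argv))
```

Because two FASTQ files are given without --force-single-end, they would be treated as matching paired-end files.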

--failed-input=STRING
Print completely failed alignments as input FASTA or FASTQ format, to the given
file, appending .1 or .2, for paired-end data. If the --split-output flag is also
given, this file is generated in addition to the output in the .nomapping file.

--append-output
When --split-output or --failed-input is given, this flag will append output to the
existing files. Otherwise, the default is to create new files.

--order-among-best=STRING
Among alignments tied with the best score, order those alignments in this order.
Allowed values: genomic, random (default)

--output-buffer-size=INT
Buffer size, in queries, for output thread (default 1000). When the number of
results to be printed exceeds this size, the worker threads are halted until the
backlog is cleared

Options for SAM output

--no-sam-headers
Do not print headers beginning with '@'

--add-paired-nomappers
Add nomapper lines as needed to make all paired-end results alternate between first
end and second end

--paired-flag-means-concordant=INT
Whether the paired bit in the SAM flags means concordant only (1) or paired plus
concordant (0, default)

--sam-headers-batch=INT
Print headers only for this batch, as specified by -q

--sam-use-0M
Insert 0M in CIGAR between adjacent insertions and deletions. Required by Picard,
but can cause errors in other tools

--sam-multiple-primaries
Allows multiple alignments to be marked as primary if they have equally good
mapping scores

--force-xs-dir
For RNA-Seq alignments, disallows XS:A:? when the sense direction is unclear, and
replaces this value arbitrarily with XS:A:+. May be useful for some programs, such
as Cufflinks, that cannot handle XS:A:?. However, if you use this flag, the
reported value of XS:A:+ in these cases will not be meaningful.

--md-lowercase-snp
In MD string, when known SNPs are given by the -v flag, prints difference
nucleotides as lower-case when they differ from reference but match a known
alternate allele

--extend-soft-clips
Extends alignments through soft clipped regions

--action-if-cigar-error
Action to take if there is a disagreement between CIGAR length and sequence length.
Allowed values: ignore, warning, noprint (default), abort

--read-group-id=STRING
Value to put into read-group id (RG-ID) field

--read-group-name=STRING
Value to put into read-group name (RG-SM) field

--read-group-library=STRING
Value to put into read-group library (RG-LB) field

--read-group-platform=STRING
Value to put into read-group platform (RG-PL) field

Help options

--check
Check compiler assumptions

--version
Show version

--help
Show this help message

Other tools of GMAP suite are located in /usr/lib/gmap