This is the command PerM that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator
PROGRAM:
NAME
perm - Efficient mapping of short reads with periodic spaced seeds
If you have any usage questions, please email "yanghoch at usc dot edu".
SYNOPSIS
To use the command line, type perm with the args in the order.
EXAMPLES
For single-end reads:
perm Ref Reads [options]
Examples:
perm Ref.fasta Reads.fasta -v 5 -o out.mapping -u ummappedReads.fa
perm RefFilesList.txt ReadsSetFilesList.txt -v 5 -u unmappedReads.fa -E
perm Ref.fasta Reads.csfasta -v 5 -m -s my.index --delimiter ´,´ --seed F3
perm my.index SingleEndReads.csfasta -v 5 -o out.sam -k 10 -a ambiguous10.csfasta
For paired-end reads:
perm Ref -1 F3_Reads -2 R3_Reads [options]
Examples:
perm ref.fa -1 F3.fa -2 R3.fa -U 3000 -L 100 -v 5 -A -m -s -o out.sam
perm ref.txt -1 F3.fq -2 R3.fq -v 5 -m -s my.index -o out.mapping --seed F3
perm my.index -1 F3.fq -2 R3.fq -U 3000 -L 100 -v 5 -A -o out.sam
To build an index only:
perm Ref Read_Length --readFormat <.csfasta|.fasta> -m -s index path --seed F3
Example:
perm hg18.txt 50 --readFormat .csfasta -m -s hg18_50_SOLiD.index
OPTIONS
Required Arguments
· The reference file should be in FASTA format with the either the .fasta, .fna, or .fa
file extension. For a transcriptome with multiple genes or isoforms as reference,
concatenate all FASTA sequences in a single FASTA file. Alternatively, if there are
many files, for example one per chromosome, ex: chr1.fa to chrY.fa, list the FASTA
filenames one per line in a file which has the .txt extension. The .txt is important
because PerM examines the file extension to know if the input file is a list of
filenames. The filenames need to include the file path (relative or absolute) unless
the FASTA files are all in the same directory that PerM is run from.
· The read file(s) should be in the .fasta, .fastq, .csfasta or .csfastq format. PerM
parses a file according to its extension, or the format explicitly specified by the
--readFormat <format> flag. If there are multiple read files, list each filename, one
per line, in a .txt file. PerM takes it as the input and can map multiple read sets in
parallel by [http://en.wikipedia.org/wiki/OpenMP OpenMP].
Short Options (grouped by related functionality)
-A Output all alignments within the mismatch threshold (see -v option), end-to-end.
-B Output best alignments in terms of mismatches in the threshold (see -v option). For
example, if a read has no perfect match alignments, two single base mismatch
alignments, and additional alignments with more mismatches, only the two single
base mismatch alignments will be output. -B is the default mode if neither -A or -B
is specified.
-E Output only uniquely mapped reads remaining after the best down selection has been
applied if applicable. When combined with the -A option, only reads with a single
alignment within the mismatch threshold (see -v option) will be output.
-v Maximum number of mismatches allowed (or allowed in each end for pair-end reads).
The default value is the number of mismatches that the seed used is fully sensitive
to.
-k Specifies maximum number of alignments to output. The default value is 200 if the
-k flag is not given. Alignments for reads mapping to more than the maximum number
positions will not be output. Use the -a option to collect reads which exceeded the
maximum.
-t Number of bases at the 5´ end of each read to ignore. For example, if the first 5
bases are used as a barcode or to index multiple samples together, use -t 5. If not
specified, no initial bases will be ignored.
-T Number of bases in each read to use, starting after any bases ignored by -t option.
Later bases at the 3´ of the read are ignored. For example, -T 30 means use only
first 30 bases (signals) after the any bases ignored due to the -t option.
-m Create the reference index without reusing the saved index even if available.
-s path
Save the reference index to accelerate the mapping in the future. If path is not
specified, the index will be created in the current working directory (i.e. where
PerM is run from) using the default index name. If path is a directory, the index
will be created in the specified directory using the default index name (directory
must exist; it will not automatically be created). If path is a file path, the
index will be created with the specified name.
-o filepath
Name of mapping output file when mapping a single read set. The output file format
will be either the .mapping tab delimited text format or the SAM format as
determined by the extension of the output filename. For example {{{-o out.sam}}}
will output in SAM format; {{{-o /path/to/out.mapping}}} will output in .mapping
format. Use --outputFormat to override this behavior. The -o option does not apply
when multiple reads sets are being mapped at once to take advantage of multiple
CPUs (cores); see the -d option for that case.
-d dirpath
Output directory for mapping output files when mapping multiple read sets (output
files will be named automatically). If the directory specified does not exist, the
output directory will be created provided the parent directory exists. If the -d
switch is not specified, files will be written to the directory PerM is run from.
Note: if -d filepath is specified when mapping a single read set, dirpath will be
prepended to filepath; however, this usage is not recommended.
-a filepath
Create a FASTA (FASTQ) file for reads mapped to more positions than the threshold
specified by -k or the default of 200.
-b filepath
Create a FASTA (FASTQ) file for reads that is shorter than expected length or with
strange characters.
-u filepath
Create a FASTA (FASTAQ) file of unmapped reads. When a single read set is mapped,
filename specifies the name of the output file. When multiple read sets are mapped,
filename is irrelevant and should be omitted; the files of unmapped sequences will
automatically be named and created in the directory PerM is run from.
Long Options
--ambiguosReadOnly
Output only ambiguous mapping to find repeats (similar regions within substitution
threshold). When this option is specified, reads that mapped to over mapping number
threshold that specified by -k will still be printed.
--ambiguosReadInOneLine
utput reads mapped to more than k places in one line. When this option is
specified, reads that mapped to over mapping number threshold specified by -k will
still be printed but printed in one line.
--noSamHeader
Do not include a SAM header. This makes it easier to concatenate multiple SAM
output files.
--includeReadsWN
Map reads with equal or fewer N or ´.´ bases than the specified threshold by
encoding N or ´.´ as A or 3. Reads with more ´N´ will be discarded. The default
setting discards read with any ´N´.
--statsOnly
Output the mapping statistics to stdout only, without saving alignments to files.
--ignoreQS
Ignore the quality scores in FASTQ or QUAL files.
--printNM
When quality scores are available, use this flag to print number of mismatches,
instead of mismatch scores in mapping format.
--seed {F,,0,, | F,,1,, | F,,2,, | F,,3,, | F,,4,, | S,,11,, | S,,20,, | S,,12,,}
Specify the seed pattern. The F,,0,,, F,,1,,, F,,2,,, F,,3,,, and F,,4,, seeds are
fully sensitive to 0-4 mismatches respectively. The S,,11,, S,,20,, S,,12,, seeds
are designed for the SOLiD sequencer. An S,,kj,, seed is full sensitive to k
adjacent mismatch pairs (SNP signature is color space) and j isolated mismatches.
See [http://code.google.com/p/perm/wiki/Algorithms the algorithm page] for more
information about the seed patterns.
--refFormat {fasta | list | index }
Assume references sequence(s) are in the specified format, instead of guessing
according to file´s extension.
--readFormat |{fasta | fastq | csfasta | csfastq}
Assume reads are in the specified format, instead of guessing according to the
file(s)´ extension.
--outputFormat { sam | mapping }
Override the default output mapping format option or specify it explicitly when the
output file extension is not .sam or .mapping.
--delimiter char
char is a character used as the delimiter to separate the the read id, and the
additional info in the line with > when reading a FASTA or CSFASTA file.
--log filepath
filepath specifies the name of the log file which contains the mapping statistics
that will also be printed on the screen.
--forwardOnly
Map reads to the forward strand only: (This is for SOLiD Strand specific
sequencing).
--reverseOnly
Map reads to the reverse strand only: (This is for SOLiD Strand specific
sequencing)
Options for Paired-end Reads
PerM deals with mate-paired reads by mapping each end separately. All combinations of
mated pairs mapping to the same reference sequence will be output if their separation are
in the allowed ranged as specified by the -L and -U flags.
-e Exclude ambiguous paired.
-L / --lowerBound Int
lower bound for mate-paired separation distance
-U / --upperBound Int
upper bound for mate-paired separation distance
The upper bound and lower bound can be negative, which may catch the re-arrangement
variations. Use the -A argument to avoid missing the correct pairs. However, this may
greatly increase the running time if both ends are in repetitive regions.
--fr Map paired-end reads to different strand only
--ff Map paired-end reads to the same strand only
--printRefSeq
Print the mapped reference paired sequence as the two last columns in .mapping
format. | The default option output mapping in both the same or different strand.
DEFAULT SETTINGS
The following are the default settings when the corresponding command line option is not
specified. Please specify the option to change the default settings.
· Allow only two mismatches only in each end and use seed F,,2,, S,,11,, or F,,3,,
,selected according to the read lengths and types.
· Print the best alignments for each read in terms of the number of mismatches.
· Output files in *.mapping format.
· Searches for a saved index with the default file name before building the new index.
· Won´t save the index in file, unless {{{-s}}} is specified.
· For paired end reads, the default allowed separation distance is 0-3000 bp. Change
with the -L and -U options.
Parallel Mapping
PerM simultaneously maps multiple reads sets in a list by querying the same index. It will
detect how many CPUs (cores) are available and assign each of them a read set. If a read
set is done, the next read set in the list will be processed automatically. Each read set
will have its own mapping output file. To better utilize all CPUs on a node, large reads
set should be split into many small read sets and put in a list. When multiple nodes are
used in the same file system, the index should be pre-built first by one node; the other
nodes will read the pre-built index without building index again. Without pre-built index,
each machine will try to build its own index, wasting CPU time and storage space.
Exit codes
PerM sets the exit code to 0 upon successful completion, the normal Unix behavior. If the
program is terminated via Ctrl-C (SIGINT), the exit code will be 2, the number for SIGINT
(see man kill). If you invoke PerM from another language, you can check the return code
and do something intelligent. Here is a Perl pseudo-code example:
while (... some sort of loop ...) {
my $cmd = "PerM ... arguments and switches";
my $ec = system($cmd);
if ($ec == 2) {
print STDERR "PerM terminated via Ctrl-C. Stopping run.\n\n";
# Maybe do some cleanup such as deleting the small files the read file was
# split into for parallel processing.
exit($ec);
}
}
Use PerM on Galaxy
Thanks to Prof Anton Nekrutenko and Kelly Vincent at PSU, you can now use PerM on
[http://test.g2.bx.psu.edu/ Galaxy´s test server]. Follow the hyperlink to Galaxy´s page,
and click NGS:Mapping in the tool menu. Please choose Map with PerM for SOLiD and
Illumina. You can upload your own reference or use the pre-built hg19 index in the system.
Please email me if you encounter any difficulties. Once the system proves its stability,
it will be moved to Galaxy´s main server with more pre-built reference index.
Unit Test
When PerM was developed, a unit cppUnit test module was also prepared. If you are
interested in the test code for PerM, please email me.
Use PerM online using onworks.net services