This is the command maq that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator
Maq - Mapping and Assembly with Qualities
maq command [options] arguments
maq.pl command [options] arguments
Maq is a software that builds mapping assemblies from short reads generated by the next-
generation sequencing machines. It is particularly designed for Illumina-Solexa 1G Genetic
Analyzer, and has a preliminary functionality to handle AB SOLiD data.
With Maq you can:
· Fast align Illumina/SOLiD reads to the reference genome. With the default options, one
million pairs of reads can be mapped to the human genome in about 10 CPU hours with less
than 1G memory.
· Accurately measure the error probability of the alignment of each individual read.
· Call the consensus genotypes, including homozygous and heterozygous polymorphisms, with
a Phred probabilistic quality assigned to each base.
· Find short indels with paired end reads.
· Accurately find large scale genomic deletions and translocations with paired end reads.
· Discover potential CNVs by checking read depth.
· Evaluate the accuracy of raw base qualities from sequencers and help to check the
However, Maq can NOT:
· Do de novo assembly. (Maq can only call the consensus by mapping reads to a known
· Map shorts reads against themselves. (Maq can only find complete overlap between reads.)
· Align capillary reads or 454 reads to the reference. (Maq cannot align reads longer than
fasta2bfa maq fasta2bfa in.ref.fasta out.ref.bfa
Convert sequences in FASTA format to Maq's BFA (binary FASTA) format.
fastq2bfq maq fastq2bfq [-n nreads] in.read.fastq out.read.bfq⎪out.prefix
Convert reads in FASTQ format to Maq's BFQ (binary FASTQ) format.
-n INT number of reads per file [not specified]
map maq map [-n nmis] [-a maxins] [-c] [-1 len1] [-2 len2] [-d adap3] [-m mutrate]
[-u unmapped] [-e maxerr] [-M c⎪g] [-N] [-H allhits] [-C maxhits] out.aln.map
in.ref.bfa in.read1.bfq [in.read2.bfq] 2> out.map.log
Map reads to the reference sequences.
-n INT Number of maximum mismatches that can always be found 
-a INT Maximum outer distance for a correct read pair 
-A INT Maximum outer distance of two RF paied read (0 for disable) 
-c Map reads in the colour space (for SOLiD only)
-1 INT Read length for the first read, 0 for auto 
-2 INT Read length for the second read, 0 for auto 
-m FLOAT Mutation rate between the reference sequences and the reads [0.001]
-d FILE Specify a file containing a single line of the 3'-adapter sequence
-u FILE Dump unmapped reads and reads containing more than nmis mismatches to
a separate file [null]
-e INT Threshold on the sum of mismatching base qualities 
-H FILE Dump multiple/all 01-mismatch hits to FILE [null]
-C INT Maximum number of hits to output. Unlimited if larger than 512. 
-M c⎪g methylation alignment mode. All C (or G) on the forward strand will be
changed to T (or A). This option is for testing only.
-N store the mismatch position in the output file out.aln.map. When this
option is in use, the maximum allowed read length is 55bp.
* Paired end reads should be prepared in two files, one for each end, with
reads are sorted in the same order. This means the k-th read in the first
file is mated with the k-th read in the second file. The corresponding read
names must be identical up to the tailing `/1' or `/2'. For example, such a
pair of read names are allowed: `EAS1_1_5_100_200/1' and
`EAS1_1_5_100_200/2'. The tailing `/' is usually generated by the
GAPipeline to distinguish the two ends in a pair.
* The output is a compressed binary file. It is affected by the endianness.
* The best way to run this command is to provide about 1 to 3 million reads as
input. More reads consume more memory.
* Option -n controls the sensitivity of the alignment. By default, a hit with
up to 2 mismatches can be always found. Higher -n finds more hits and also
improves the accuracy of mapping qualities. However, this is done at the cost
* Alignments with many high-quality mismatches should be discarded as false
alignments or possible contaminations. This behaviour is controlled by option
-e. The -e threshold is only calculated approximately because base qualities
are divided by 10 at a certain stage of the alignment. The -Q option in the
assemble command precisely set the threshold.
* A pair of reads are said to be correctly paired if and only if the
orientation is FR and the outer distance of the pair is no larger than
maxins. There is no limit on the minimum insert size. This setting is
determined by the paired end alignment algorithm used in Maq. Requiring a
minimum insert size will lead to some wrong alignments with highly
overestimated mapping qualities.
* Currently, read pairs from Illumina/Solexa long-insert library have RF read
orientation. The maximum insert size is set by option -A. However, long-
insert library is also mixed with a small fraction of short-insert read
pairs. -a should also be set correctly.
* Sometimes 5'-end or even the entire 3'-adapter sequence may be sequenced.
Providing -d renders Maq to eliminate the adapter contaminations.
* Given 2 million reads as input, maq usually takes 800MB memory.
mapmerge maq mapmerge out.aln.map in.aln1.map in.aln2.map [...]
Merge a batch of read alignments together.
* In theory, this command can merge unlimited number of alignments. However, as
mapmerge will be reading all the inputs at the same time, it may hit the
limit of the maximum number of opening files set by the OS. At present, this
has to be manually solved by endusers.
* Command mapmerge can be used to merge alignment files with different read
lengths. All the subsequent analyses do not assume fixed length any more.
rmdup maq rmdup out.rmdup.map in.ori.map
Remove pairs with identical outer coordinates. In principle, pairs with
identical outer coordinates should happen rarely. However, due to the
amplification in sample preparation, this occurs much more frequently than by
chance. Practical analyses show that removing duplicates helps to improve the
overall accuracy of SNP calling.
assemble maq assemble [-sp] [-m maxmis] [-Q maxerr] [-r hetrate] [-t coef] [-q minQ] [-N
nHap] out.cns in.ref.bfa in.aln.map 2> out.cns.log
Call the consensus sequences from read mapping.
-t FLOAT Error dependency coefficient [0.93]
-r FLOAT Fraction of heterozygotes among all sites [0.001]
-s Take single end mapping quality as the final mapping quality;
otherwise paired end mapping quality will be used
-p Discard paired end reads that are not mapped in correct pairs
-m INT Maximum number of mismatches allowed for a read to be used in
consensus calling 
-Q INT Maximum allowed sum of quality values of mismatched bases 
-q INT Minimum mapping quality allowed for a read to be used in consensus
-N INT Number of haplotypes in the pool (>=2) 
* Option -Q sets a limit on the maximum sum of mismatching base qualities.
Reads containing many high-quality mismatches should be discarded.
* Option -N sets the number of haplotypes in a pool. It is designed for
resequencing of samples by pooling multiple strains/individuals together. For
diploid genome resequencing, this option equals 2.
glfgen maq glfgen [-sp] [-m maxmis] [-Q maxerr] [-r hetrate] [-t coef] [-q minQ] [-N
nHap] out.cns in.ref.bfa in.aln.map 2> out.cns.log
Calculate log-likelihood for all genotypes and store the results in GLF format
(Genotyping Likelihood Format). Please check MAQ website for detailed
descriptions of the file format and the related utilities.
indelpe maq indelpe in.ref.bfa in.aln.map > out.indelpe
Call consistent indels from paired end reads. The output is TAB delimited with
each line consisting of chromosome, start position, type of the indel, number
of reads across the indel, size of the indel and inserted/deleted nucleotides
(separated by colon), number of indels on the reverse strand, number of indels
on the forward strand, 5' sequence ahead of the indel, 3' sequence following
the indel, number of reads aligned without indels and three additional columns
At the 3rd column, type of the indel, a star indicates the indel is confirmed
by reads from both strands, a plus means the indel is hit by at least two reads
but from the same strand, a minus shows the indel is only found on one read,
and a dot means the indel is too close to another indel and is filtered out.
Users are recommended to run through `maq.pl indelpe' to correct the number of
reads mapped without indels. For more details, see the `maq.pl indelpe'
indelsoa maq indelsoa in.ref.bfa in.aln.map > out.indelsoa
Call potential homozygous indels and break points by detecting the abnormal
alignment pattern around indels and break points. The output is also TAB
delimited with each line consisting of chromosome, approximate coordinate,
length of the abnormal region, number of reads mapped across the position,
number of reads on the left-hand side of the position and number of reads on
the right-hand side. The last column can be ignored.
The output contains many false positives. A recommended filter could be:
awk '$5+$6-$4 >= 3 && $4 <= 1' in.indelsoa
Note that this command does not aim to be an accurate indel detector, but
mainly helps to avoid some false positives in substitution calling. In
addition, it only works well given deep depth (~40X for example); otherwise the
false negative rate would be very high.
sol2sanger maq sol2sanger in.sol.fastq out.sanger.fastq
Convert Solexa FASTQ to standard/Sanger FASTQ format.
bfq2fastq maq bfq2fastq in.read.bfq out.read.fastq
Convert Maq's BFQ format to standard FASTQ format.
mapass2maq maq mapass2maq in.mapass2.map out.maq.map
Convert obsolete mapass2's map format to Maq's map format. The old format does
not contain read names.
mapview maq mapview [-bN] in.aln.map > out.aln.txt
Display the read alignment in plain text. For reads aligned before the Smith-
Waterman alignment, each line consists of read name, chromosome, position,
strand, insert size from the outer coorniates of a pair, paired flag, mapping
quality, single-end mapping quality, alternative mapping quality, number of
mismatches of the best hit, sum of qualities of mismatched bases of the best
hit, number of 0-mismatch hits of the first 24bp, number of 1-mismatch hits of
the first 24bp on the reference, length of the read, read sequence and its
quality. Alternative mapping quality always equals to mapping quality if the
reads are not paired. If reads are paired, it equals to the smaller mapping
quality of the two ends. This alternative mapping quality is actually the
mapping quality of an abnormal pair.
The fifth column, paired flag, is a bitwise flag. Its lower 4 bits give the
orientation: 1 stands for FF, 2 for FR, 4 for RF, and 8 for RR, where FR means
that the read with smaller coordinate is on the forward strand, and its mate is
on the reverse strand. Only FR is allowed for a correct pair. The higher bits
of this flag give further information. If the pair meets the paired end
requirement, 16 will be set. If the two reads are mapped to different
chromosomes, 32 will be set. If one of the two reads cannot be mapped at all,
64 will be set. The flag for a correct pair always equals to 18.
For reads aligned by the Smith-Waterman alignment afterwards, the flag is
always 130. A line consists of read name, chromosome, position, strand, insert
size, flag (always 130), position of the indel on the read (0 if no indel),
length of the indels (positive for insertions and negative for deletions),
mapping quality of its mate, number of mismatches of the best hit, sum of
qualities of mismatched bases of the best hit, two zeros, length of the read,
read sequence and its quality. The mate of a 130-flagged read always gets a
Flag 192 indicates that the read is not mapped but its mate is mapped. For such
a read pair, one read has flag 64 and the other has 192.
-b do not display the read sequence and the quality
-N display the positions where mismatches occur. This flag only works
with a .map file generated by `maq map -N'.
mapcheck maq mapcheck [-s] [-m maxmis] [-q minQ] in.ref.bfa in.aln.map > out.mapcheck
Read quality check. The mapcheck first reports the composition and the depth of
the reference. After that there is a form. The first column indicates the
position on a read. Following four columns which show the nucleotide
composition, substitution rates between the reference and reads will be given.
These rates and the numbers in the following columns are scaled to 999 and
rounded to nearest integer. The next group of columns show the distribution of
base qualities along the reads at a quality interval of 10. A decay in quality
can usually be observed, which means bases at the end of read are less
accurate. The last group of columns present the fraction of substitutions for
read bases at a quality interval. This measures the accuracy of base quality
estimation. Idealy, we expect to see 1 in the 3? column, 10 in the 2? column
and 100 in the 1? column.
-s Take single end mapping quality as the final mapping quality
-m INT Maximum number of mismatahces allowed for a read to be counted 
-q INT Minimum mapping quality allowed for a read to be counted 
pileup maq pileup [-spvP] [-m maxmis] [-Q maxerr] [-q minQ] [-l sitefile] in.ref.bfa
in.aln.map > out.pileup
Display the alignment in a `pileup' text format. Each line consists of
chromosome, position, reference base, depth and the bases on reads that cover
this position. If -v is added on the command line, base qualities and mapping
qualities will be presented in the sixth and seventh columns in order.
The fifth column always starts with `@'. In this column, read bases identical
to the reference are showed in comma `,' or dot `.', and read bases different
from the reference in letters. A comma or a upper case indicates that the base
comes from a read aligned on the forward strand, while a dot or a lower case on
the reverse strand.
This command is for users who want to develop their own SNP callers.
-s Take single end mapping quality as the final mapping quality
-p Discard paired end reads that are not mapped as correct pairs
-v Output verbose information including base qualities and mapping
-m INT Maximum number of mismatches allowed for a read to be used 
-Q INT Maximum allowed number of quality values of mismatches 
-q INT Minimum mapping quality allowed for a read to be used 
-l FILE File containing the sites at which pileup will be printed out. In this
file the first column gives the names of the reference and the second
the coordinates. Additional columns will be ignored. [null]
-P also output the base position on the read
cns2fq maq cns2fq [-Q minMapQ] [-n minNeiQ] [-d minDepth] [-D maxDepth] in.cns >
Extract the consensus sequences in FASTQ format. In the sequence lines, bases
in lower case are essentially repeats or do not have sufficient coverage; bases
in upper case indicate regions where SNPs can be reliably called. In the
quality lines, ASCII of a character minus 33 gives the PHRED quality.
-Q INT Minimum mapping quality 
-d INT Minimum read depth 
-n INT Minimum neighbouring quality 
-D INT Maximum read dpeth. >=255 for unlimited. 
cns2snp maq cns2snp in.cns > out.snp
Extract SNP sites. Each line consists of chromosome, position, reference base,
consensus base, Phred-like consensus quality, read depth, the average number of
hits of reads covering this position, the highest mapping quality of the reads
covering the position, the minimum consensus quality in the 3bp flanking
regions at each side of the site (6bp in total), the second best call, log
likelihood ratio of the second best and the third best call, and the third best
The 5th column is the key criterion when you judge the reliability of a SNP.
However, as this quality is only calculated assuming site independency, you
should also consider other columns to get more accurate SNP calls. Script
command `maq.pl SNPfilter' is designed for this (see below).
The 7th column implies whether the site falls in a repetitive region. If no
read covering the site can be mapped with high mapping quality, the flanking
region is possibly repetitive or in the lack of good reads. A SNP at such site
is usually not reliable.
The 8th column roughly gives the copy number of the flanking region in the
reference genome. In most cases, this number approaches 1.00, which means the
region is about unique. Sometimes you may see non-zero read depth but 0.00 at
the 7th column. This indicates that all the reads covering the position have at
least two mismatches. Maq only counts the number of 0- and 1-mismatch hits to
the reference. This is due to a complex technical issue.
The 9th column gives the neighbouring quality. Filtering on this column is also
required to get reliable SNPs. This idea is inspired by NQS, although NQS is
initially designed for a single read instead of a consensus.
cns2view maq cns2view in.cns > out.view
Show detailed information at all sites. The output format is identical to
cns2ref maq cns2ref in.cns > out.ref.fasta
Extract the reference sequence.
cns2win maq cns2win [-w winsize] [-c chr] [-b begin] [-e end] [-q minQ] in.cns >
Extract information averaged in a tilling window. The output is TAB delimited,
which consists of reference name, coordinate divided by 1,000,000, SNP rate,
het rate, raw read depth, read depth in approximately unique regions, the
average number of hits of reads in the window and percent GC.
-w INT Size of a window 
-c STR Destinated reference sequence; otherwise all references will be used
-b INT Start position, 0 for no constraint 
-e INT End position, 0 for no constraint 
-q INT Minimum consensus quality of the sites to be used 
fakemut maq fakemut [-r mutrate] [-R indelfrac] in.ref.fasta > out.fakeref.fasta 2>
Randomly introduce substitutions and indels to the reference. Substitutions and
sinlge base-pair indels can be added.
-r FLOAT Mutation rate [0.001]
-R FLOAT Fraction of mutations to be indels [0.1]
simutrain maq simutrain out.simupars.dat in.read.fastq
Estimate/train parameters for read simulation.
simulate maq simulate [-d insize] [-s stdev] [-N nReads] [-1 readLen1] [-2 readLen2] [-r
mutRate] [-R indelFrac] [-h] out.read1.fastq out.read2.fastq in.ref.fasta
Simulate paired end reads. File in.simupars.dat determines the read lengths and
quality distribution. It is generated from simutrain, or can be downloaded from
Maq website. In the output read files, a read name consists of the reference
sequence name and the outer coordinates of the pair of simulated reads. By
default, simulate assumes reads come from a diploid sequence which is generated
by adding two different sets of mutations, including one base-pair indels, to
-d INT mean of the outer distance of insert sizes 
-s INT standard deviation of insert sizes 
-N INT number of pairs of reads to be generated 
-1 INT length of the first read [set by in.simupars.dat]
-2 INT length of the second read [set by in.simupars.dat]
-r FLOAT mutation rate [0.001]
-R FLOAT fraction of 1bp indels [0.1]
-h add all mutations to in.ref.fasta and generate reads from the single
mutated sequence (haploid mode)
* Reads generated from this command are independent, which deviates from the
truth. Whereas alignment evaluation is less affected by this, evaluation on
SNP calling should be performed with caution. Error dependency may be one of
the major causes of wrong SNP calls.
simustat maq simustat in.simu-aln.map > out.simustat
Evaluate mapping qualities from simulated reads.
fasta2csfa maq fasta2csfa in.nucl-ref.fasta > out.colour-ref.fasta
Convert nucleotide FASTA to colour-coded FASTA. Flag -c should be then applied
to map command. In the output, letter `A' stands for color 0, `C' for 1, `G'
for 2 and `T' for 3. Each sequence in the output is 1bp shorter than the input.
csmap2nt maq csmap2nt out.nt.map in.ref.nt.bfa in.cs.map
Convert color alignment to nucleotide alignment. The input in.ref.nt.bfa is the
nucleotide binary FASTA reference file. It must correspond to the original file
from which the color reference is converted. Nucleotide consensus can be called
from the resultant alignment.
submap maq submap [-q minMapQ] [-Q maxSumErr] [-m maxMM] [-p] out.map in.map
Filter bad alignments in in.map. Command-line options are described in the
eland2maq maq eland2maq [-q defqual] out.map in.list in.eland
Convert eland alignment to maq's .map format. File in.list consists of the
sequence names that appear at the seventh column of the eland alignment file
in.eland and the name you expect to see in maq alignment. The following is an
If you are aligning reads in several batches using eland, it is important to
use the same in.list for the conversion. In addition, maq will load all the
alignments and sort them in the memory. If you have concatenate several eland
outputs into one huge file, you should separate it into smaller files to
prevent maq from eating all your machine memory.
This command actually aims to show Eland alignment in Maqview. As no quality
information is available, the resultant maq alignment file should not be used
to call consensus genotypes.
export2maq maq export2maq [-1 read1len] [-2 read2len] [-a maxdist] [-n] out.map in.list
Convert Illumina's Export format to Maq's .map format. Export format is a new
alignment format since SolexaPipeline-0.3.0 which also calculates mapping
qualities like maq. The resultant file can be used to call consensus genotypes
as most of necessary information is available for maq to do this accurately.
-1 INT Length of the first read 
-2 INT Length of the second read 
-a INT Maximum outer distance for a correct read pair 
-n Retain filtered reads
demo maq.pl demo [-h] [-s] [-N nPairs] [-d outDir] in.fasta in.simudat
Demonstrate the use of maq and its companion scripts. This command will
simulate reads from a FASTA file in.fasta. The sequence length and qualities
are determined by in.simudat which is generated from maq simutrain or can be
downloaded from Maq website. The simulated reads will then be mapped with
maq.pl easyrun. The alignment accuracy is evaluated by maq simustat, the
consensus accuracy by maq simucns, and the SNP accuracy by maq_eval.pl.
By default, paired end reads will be simulated and a diploid sequence will be
generated from the input by adding mutations to either haploid type. The insert
size and mutation rate are controlled by maq simulate.
-h simulate a haploid sequence instead of a diploid sequence
-s use single-end mode to align reads instead of paired-end mode
-N INT number of pairs of reads to be simulated 
-d DIR output directory [maqdemo]
* The output files from maq_eval.pl have not been documented, but you may make
a good guess at some of these files.
* This command just demonstrates the use of the maq suite. The accuracy on real
data is almost always lower than what you see from pure simulation.
easyrun maq.pl easyrun [-1 read1Len] [-d out.dir] [-n nReads] [-A 3adapter] [-e minDep]
[-q minCnsQ] [-p] [-2 read2Len] [-a maxIns] [-S] [-N] in.ref.fasta in1.fastq
Analyses pipeline for small genomes. Easyrun command will run most of analyses
implemented in maq. By default, easyrun assumes all the input read sequences
files are single-end and independent; when -p is specified, two read sequence
files are required, one for each end.
Several files will be generated in out.dir, among which the following files are
the key output:
cns.final.snp final SNP calls with low quality ones filtered out
cns.fq consensus sequences and qualities in the FASTQ format
-d DIR output directory [easyrun]
-n INT number of reads/pairs in one batch of alignment 
-S apply split-read analysis of short indels (maybe very slow)
-N INT number of haplotypes/strains in the pool (>=2) 
-A FILE file for 3'-adapter. The file should contain a single line of sequence
-1 INT length of the first read, 0 for auto 
-e INT minimum read depth required to call a SNP (for SNPfilter) 
-q INT minimum consensus quality for SNPs in cns.final.snp 
-p switch to paired end alignment mode
-2 INT length of the second read when -p is applied 
-a INT maximum insert size when -p is applied 
* For SNP calling on pooled samples, users should set correct `-N' as well as
* The input file can be maq's binary format. maq.pl will automatically detect
the file format.
SNPfilter maq.pl SNPfilter [-d minDep] [-D maxDep] [-Q maxMapQ] [-q minCnsQ] [-w
indelWinSize] [-n minNeiQ] [-F in.indelpe] [-f in.indelsoa] [-s minScore] [-m
maxAcross] [-a] [-N maxWinSNP] [-W densWinSize] in.cns2snp.snp >
Rule out SNPs that are covered by few reads (specified by -d), by too many
reads (specified by -D), near (specified by -w) to a potential indel, falling
in a possible repetitve region (characterized by -Q), or having low-quality
neighbouring bases (specified by -n). If maxWinSNP or more SNPs appear in any
densWinSize window, they will also be filtered out together.
-d INT Minimum read depth required to call a SNP 
-D INT Maximum read depth required to call a SNP (<255, otherwise ignored)
-Q INT Required maximum mapping quality of reads covering the SNP 
-q INT Minimum consensus quality 
-n INT Minimum adjacent consensus quality 
-w INT Size of the window around the potential indels. SNPs that are close
to indels will be suppressed 
-F FILE The indelpe output [null]
-f FILE The indelsoa output [null]
-s INT Minimum score for a soa-indel to be considered 
-m INT Maximum number of reads that can be mapped across a soa-indel 
-a Alternative filter for single end alignment
indelpe maq.pl indelpe in.indelpe > out.indelpe
Correct the number of reads mapped without indels for homopolymer tracts. This
command modify the 4th, 10th and the last three columns of in.indelpe and
output the result in out.indelpe. After the correction, the following awk
command gives putative homozygous indels:
awk '($3=="*"⎪⎪$3=="+") && $6+$7>=3 && ($6+$7)/$4>=0.75'
and the following gives heterozygotes:
awk '($3=="*"⎪⎪$3=="+") && $6+$7>=3 && ($6+$7)/$4<0.75'
Please note that this indelpe command just implements several heuristic rules.
It does not correct for impure homopolymer runs or di-nucleotide/triplet
repeats. Consequently, the two awk commands only give approximate hom/het
· Easyrun script:
maq.pl easyrun -d easyrun ref.fasta part1.fastq part2.fastq
· Key commands behind easyrun:
maq fasta2bfa ref.fasta ref.bfa;
maq fastq2bfq part1.fastq part1.bfq;
maq fastq2bfq part2.fastq part2.bfq;
maq map part1.map ref.bfa part1.bfq;
maq map part2.map ref.bfa part2.bfq;
maq mapmerge aln.map part1.map part2.map;
maq assemble cns.cns ref.bfa aln.map;
Use maq online using onworks.net services