ipdSummary - Online in the Cloud

Run ipdSummary in OnWorks free hosting provider over Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator

This is the command ipdSummary that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator

Run in Ubuntu Run in Fedora Run in Windows Sim Run in MACOS Sim

PROGRAM:

NAME

ipdSummary - Detect DNA base-modifications from kinetic signatures.

DESCRIPTION

kineticsTool loads IPDs observed at each position in the genome, and compares those IPDs
to value expected for unmodified DNA, and outputs the result of this statistical test.
The expected IPD value for unmodified DNA can come from either an in-silico control or an
amplified control. The in silico control is trained by PacBio and shipped with the
package. It predicts predicts the IPD using the local sequence context around the current
position. An amplified control dataset is generated by sequencing unmodified DNA with the
same sequence as the test sample. An amplified control sample is usually generated by
whole-genome amplification of the original sample.

Modification Detection
The basic mode of kineticsTools does an independent comparison of IPDs at each position on
the genome, for each strand, and emits various statistics to CSV and GFF (after applying a
significance filter).

Modifications Identification
kineticsTools also has a Modification Identification mode that can decode multi-site IPD
'fingerprints' into a reduced set of calls of specific modifications. This feature has the
following benefits:

· Different modifications occuring on the same base can be distinguished (for
example m5C and m4C)

· The signal from one modification is combined into one statistic, improving
sensitivity, removing extra peaks, and correctly centering the call

OPTIONS

Please call this program with --help to see the available options.

ALGORITHM

Synthetic Control
Studies of the relationship between IPD and sequence context reveal that most of the
variation in mean IPD across a genome can be predicted from a 12-base sequence context
surrounding the active site of the DNA polymerase. The bounds of the relevant context
window correspond to the window of DNA in contact with the polymerase, as seen in
DNA/polymerase crystal structures. To simplify the process of finding DNA modifications
with PacBio data, the tool includes a pre-trained lookup table mapping 12-mer DNA
sequences to mean IPDs observed in C2 chemistry.

Filtering and Trimming
kineticsTools uses the Mapping QV generated by BLASR and stored in the cmp.h5 file to
ignore reads that aren't confidently mapped. The default minimum Mapping QV required is
10, implying that BLASR has 90\% confidence that the read is correctly mapped. Because of
the range of readlengths inherent in PacBio dataThis can be changed in using the
--mapQvThreshold command line argument, or via the SMRTPortal configuration dialog for
Modification Detection.

There are a few features of PacBio data that require special attention in order to acheive
good modification detection performance. kineticsTools inspects the alignment between the
observed bases and the reference sequence -- in order for an IPD measurement to be
included in the analysis, the PacBio read sequence must match the reference sequence for k
around the cognate base. In the current module k=1 The IPD distribution at some locus be
thought of as a mixture between the 'normal' incorporation process IPD, which is sensitive
to the local sequence context and DNA modifications and a contaminating 'pause' process
IPD which have a much longer duration (mean >10x longer than normal), but happen rarely
(~1% of IPDs). Note: Our current understanding is that pauses do not carry useful
information about the methylation state of the DNA, however a more careful analysis may be
warranted. Also note that modifications that drastically increase the Roughly 1% of
observed IPDs are generated by pause events. Capping observed IPDs at the global 99th
percentile is motivated by theory from robust hypothesis testing. Some sequence contexts
may have naturally longer IPDs, to avoid capping too much data at those contexts, the cap
threshold is adjusted per context as follows: capThreshold = max(global99,
5*modelPrediction, percentile(ipdObservations, 75))

Statistical Testing
We test the hypothesis that IPDs observed at a particular locus in the sample have a
longer means than IPDs observed at the same locus in unmodified DNA. If we have generated
a Whole Genome Amplified dataset, which removes DNA modifications, we use a case-control,
two-sample t-test. This tool also provides a pre-calibrated 'synthetic control' model
which predicts the unmodified IPD, given a 12 base sequence context. In the synthetic
control case we use a one-sample t-test, with an adjustment to account for error in the
synthetic control model.

INPUTS

aligned_reads.cmp.h5
A standard cmp.h5 file contain alignments and IPD information supplies the kinetic data
used to perform modification detection. The standard cmp.h5 file of a SMRTportal jobs is
data/aligned_read.cmp.h5.

Reference Sequence
The tool requires the reference sequence used to perform alignments. Currently this must
be supplied via the path to a SMRTportal reference repository entry.

OUTPUTS

The modification detection tool provides results in a variety of formats suitable for
in-depth statistical analysis, quick reference, and comsumption by visualization tools
such as PacBio SMRTView. Results are generally indexed by reference position and
reference strand. In all cases the strand value refers to the strand carrying the
modification in DNA sample. Remember that the kinetic effect of the modification is
observed in read sequences aligning to the opposite strand. So reads aligning to the
positive strand carry information about modification on the negative strand and vice
versa, but in this toolkit we alway report the strand containing the putative
modification.

modifications.csv
The modifications.csv file contains one row for each (reference position, strand) pair
that appeared in the dataset with coverage at least x. x defaults to 3, but is
configurable with '--minCoverage' flag to ipdSummary.py. The reference position index is
1-based for compatibility with the gff file the R environment.

Output columns
in-silico control mode

┌────────────────┬──────────────────────────────────┐
│Column │ Description │
├────────────────┼──────────────────────────────────┤
│refId │ reference sequence ID of this │
│ │ observation │
├────────────────┼──────────────────────────────────┤
│tpl │ 1-based template position │
├────────────────┼──────────────────────────────────┤
│strand │ native sample strand where │
│ │ kinetics were generated. '0' is │
│ │ the strand of the original │
│ │ FASTA, '1' is opposite strand │
│ │ from FASTA │
├────────────────┼──────────────────────────────────┤
│base │ the cognate base at this │
│ │ position in the reference │
├────────────────┼──────────────────────────────────┤
│score │ Phred-transformed pvalue that a │
│ │ kinetic deviation exists at this │
│ │ position │
└────────────────┴──────────────────────────────────┘

│tMean │ capped mean of normalized IPDs │
│ │ observed at this position │
├────────────────┼──────────────────────────────────┤
│tErr │ capped standard error of │
│ │ normalized IPDs observed at this │
│ │ position (standard deviation / │
│ │ sqrt(coverage) │
├────────────────┼──────────────────────────────────┤
│modelPrediction │ normalized mean IPD predicted by │
│ │ the synthetic control model for │
│ │ this sequence context │
├────────────────┼──────────────────────────────────┤
│ipdRatio │ tMean / modelPrediction │
├────────────────┼──────────────────────────────────┤
│coverage │ count of valid IPDs at this │
│ │ position (see Filtering section │
│ │ for details) │
├────────────────┼──────────────────────────────────┤
│frac │ estimate of the fraction of │
│ │ molecules that carry the │
│ │ modification │
├────────────────┼──────────────────────────────────┤
│fracLow │ 2.5% confidence bound of frac │
│ │ estimate │
├────────────────┼──────────────────────────────────┤
│fracUpp │ 97.5% confidence bound of frac │
│ │ estimate │
└────────────────┴──────────────────────────────────┘

case-control mode

┌────────────────┬──────────────────────────────────┐
│Column │ Description │
├────────────────┼──────────────────────────────────┤
│refId │ reference sequence ID of this │
│ │ observation │
├────────────────┼──────────────────────────────────┤
│tpl │ 1-based template position │
├────────────────┼──────────────────────────────────┤
│strand │ native sample strand where │
│ │ kinetics were generated. '0' is │
│ │ the strand of the original │
│ │ FASTA, '1' is opposite strand │
│ │ from FASTA │
├────────────────┼──────────────────────────────────┤
│base │ the cognate base at this │
│ │ position in the reference │
├────────────────┼──────────────────────────────────┤
│score │ Phred-transformed pvalue that a │
│ │ kinetic deviation exists at this │
│ │ position │
├────────────────┼──────────────────────────────────┤
│caseMean │ mean of normalized case IPDs │
│ │ observed at this position │
├────────────────┼──────────────────────────────────┤
│controlMean │ mean of normalized control IPDs │
│ │ observed at this position │
├────────────────┼──────────────────────────────────┤
│caseStd │ standard deviation of case IPDs │
│ │ observed at this position │
├────────────────┼──────────────────────────────────┤
│controlStd │ standard deviation of control │
│ │ IPDs observed at this position │
└────────────────┴──────────────────────────────────┘

│ipdRatio │ tMean / modelPrediction │
├────────────────┼──────────────────────────────────┤
│testStatistic │ t-test statistic │
├────────────────┼──────────────────────────────────┤
│coverage │ mean of case and control │
│ │ coverage │
├────────────────┼──────────────────────────────────┤
│controlCoverage │ count of valid control IPDs at │
│ │ this position (see Filtering │
│ │ section for details) │
├────────────────┼──────────────────────────────────┤
│caseCoverage │ count of valid case IPDs at this │
│ │ position (see Filtering section │
│ │ for details) │
└────────────────┴──────────────────────────────────┘

modifications.gff
The modifications.gff is compliant with the GFF Version 3 specification (‐
http://www.sequenceontology.org/gff3.shtml). Each template position / strand pair whose
p-value exceeds the pvalue threshold appears as a row. The template position is 1-based,
per the GFF spec. The strand column refers to the strand carrying the detected
modification, which is the opposite strand from those used to detect the modification. The
GFF confidence column is a Phred-transformed pvalue of detection.

Note on genome browser compatibility

The modifications.gff file will not work directly with most genome browsers. You will
likely need to make a copy of the GFF file and convert the _seqid_ columns from the
generic 'ref0000x' names generated by PacBio, to the FASTA headers present in the original
reference FASTA file. The mapping table is written in the header of the modifications.gff
file in #sequence-header tags. This issue will be resolved in the 1.4 release of
kineticsTools.

The auxiliary data column of the GFF file contains other statistics which may be useful
downstream analysis or filtering. In particular the coverage level of the reads used to
make the call, and +/- 20bp sequence context surrounding the site.

┌───────────┬──────────────────────────────────┐
│Column │ Description │
├───────────┼──────────────────────────────────┤
│seqid │ Fasta contig name │
├───────────┼──────────────────────────────────┤
│source │ Name of tool -- 'kinModCall' │
├───────────┼──────────────────────────────────┤
│type │ Modification type -- in │
│ │ identification mode this will be │
│ │ m6A, m4C, or m5C for identified │
│ │ bases, or the generic tag │
│ │ 'modified_base' if a kinetic │
│ │ event was detected that does not │
│ │ match a known modification │
│ │ signature │
├───────────┼──────────────────────────────────┤
│start │ Modification position on contig │
├───────────┼──────────────────────────────────┤
│end │ Modification position on contig │
├───────────┼──────────────────────────────────┤
│score │ Phred transformed p-value of │
│ │ detection - this is the │
│ │ single-site detection p-value │
├───────────┼──────────────────────────────────┤
│strand │ Sample strand containing │
│ │ modification │
└───────────┴──────────────────────────────────┘

│phase │ Not applicable │
├───────────┼──────────────────────────────────┤
│attributes │ Extra fields relevant to base │
│ │ mods. IPDRatio is traditional │
│ │ IPDRatio, context is the │
│ │ reference sequence -20bp to │
│ │ +20bp around the modification, │
│ │ and coverage level is the number │
│ │ of IPD observations used after │
│ │ Mapping QV filtering and │
│ │ accuracy filtering. If the row │
│ │ results from an identified │
│ │ modification we also include an │
│ │ identificationQv tag with the │
│ │ from the modification │
│ │ identification procedure. │
│ │ identificationQv is the │
│ │ phred-transformed probability of │
│ │ an incorrect identification, for │
│ │ bases that were identified as │
│ │ having a particular │
│ │ modification. frac, fracLow, │
│ │ fracUp are the estimated │
│ │ fraction of molecules carrying │
│ │ the modification, and the 5% │
│ │ confidence intervals of the │
│ │ estimate. The methylated │
│ │ fraction estimation is a │
│ │ beta-level feature, and should │
│ │ only be used for exploratory │
│ │ purposes. │
└───────────┴──────────────────────────────────┘

motifs.gff
If the Motif Finder tool is run, it will generate motifs.gff, which a reprocessed version
of modifications.gff with the following changes. If a detected modification occurs on a
motif detected by the motif finder, the modification is annotated with motif data. An
attribute 'motif' is added containing the motif string, and an attribute 'id' is added
containing the motif id, which is the motif string for unpaired motifs or
'motifString1/motifString2' for paired motifs. If a motif instance exists in the genome,
but was not detected in modifications.gff, an entry is added to motifs.gff, indicating the
presence of that motif and the kinetics that were observed at that site.

motif_summary.csv
If the Motif Finder tool is run, motif_summary.csv is generated, summarizing the modified
motifs discovered by the tool. The CSV contains one row per detected motif, with the
following columns

┌───────────────────┬──────────────────────────────────┐
│Column │ Description │
├───────────────────┼──────────────────────────────────┤
│motifString │ Detected motif sequence │
├───────────────────┼──────────────────────────────────┤
│centerPos │ Position in motif of │
│ │ modification (0-based) │
├───────────────────┼──────────────────────────────────┤
│fraction │ Fraction of instances of this │
│ │ motif with modification QV above │
│ │ the QV threshold │
├───────────────────┼──────────────────────────────────┤
│nDetected │ Number of instances of this │
│ │ motif with above threshold │
└───────────────────┴──────────────────────────────────┘

│nGenome │ Number of instances of this │
│ │ motif in reference sequence │
├───────────────────┼──────────────────────────────────┤
│groupTag │ A string idetifying the motif │
│ │ grouping. For paired motifs this │
│ │ is │
│ │ "<motifString1>/<motifString2>", │
│ │ For unpaired motifs this equals │
│ │ motifString │
├───────────────────┼──────────────────────────────────┤
│partnerMotifString │ motifString of paired motif │
│ │ (motif with │
│ │ reverse-complementary │
│ │ motifString) │
├───────────────────┼──────────────────────────────────┤
│meanScore │ Mean Modification Qv of detected │
│ │ instances │
├───────────────────┼──────────────────────────────────┤
│meanIpdRatio │ Mean IPD ratio of detected │
│ │ instances │
├───────────────────┼──────────────────────────────────┤
│meanCoverage │ Mean coverage of detected │
│ │ instances │
├───────────────────┼──────────────────────────────────┤
│objectiveScore │ Objective score of this motif in │
│ │ the motif finder algorithm │
└───────────────────┴──────────────────────────────────┘

Use ipdSummary online using onworks.net services