NAME
slmbuild - generate language model from idngram file
SYNOPSIS
slmbuild [option]... idngram_file...
DESCRIPTION
slmbuild generates a back-off smoothing language model from a given idngram file.
Generally, the idngram_file is created by ids2ngram.
OPTIONS
All the following options are mandatory.
-n, --NMax N
1 for unigram, 2 for bigram, 3 for trigram. Any number not in the range of 1..3 is not
valid.
-o, --out output-file
Specify the output file name.
-l, --log
Store -log(pr) in the model. By default, pr is stored directly.
-w, --wordcount N
Lexicon size, i.e. the number of different words.
-b, --brk id...
Set the ids which should be treated as breakers.
-e, --e id...
Set the ids which should not be put into LM.
-c, --cut c...
k-grams whose freq <= c[k] are dropped.
-d, --discount method, param...
The k-th -d param specifies the discount method for k-grams.
Possible values for method,param are:
GT,R,dis : Good-Turing (GT) discount for r <= R, where r is the frequency of an n-gram.
Linear discount for r > R, i.e. r'=r*dis,
0 << dis < 1.0, for example 0.999.
ABS,[dis] : Absolute discount, r'=r-dis. dis is optional;
0 << dis < cut[k]+1.0, normally dis < 1.0.
LIN,[dis] : Linear discount, r'=r*dis. dis is optional;
0 < dis < 1.0.
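The discount formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not slmbuild's implementation: all function names are hypothetical, and the Good-Turing estimate uses the textbook formula r* = (r+1)*n[r+1]/n[r], which is an assumption, since this page does not state the exact formula slmbuild applies.

```python
def apply_cutoff(counts, cut):
    """Drop k-grams whose freq <= cut, as described for -c."""
    return {gram: r for gram, r in counts.items() if r > cut}


def linear_discount(r, dis):
    """LIN: r' = r * dis, with 0 < dis < 1.0."""
    if not 0.0 < dis < 1.0:
        raise ValueError("dis must satisfy 0 < dis < 1.0")
    return r * dis


def absolute_discount(r, dis):
    """ABS: r' = r - dis, with dis > 0 (normally dis < 1.0)."""
    if dis <= 0.0:
        raise ValueError("dis must be positive")
    return r - dis


def good_turing_count(r, count_of_counts):
    """Textbook Good-Turing estimate (assumed, not taken from slmbuild).

    count_of_counts[r] is the number of distinct n-grams seen r times.
    Falls back to r when the needed counts are missing.
    """
    n_r = count_of_counts.get(r, 0)
    n_r1 = count_of_counts.get(r + 1, 0)
    if n_r == 0 or n_r1 == 0:
        return float(r)
    return (r + 1) * n_r1 / n_r


def gt_discount(r, big_r, dis, count_of_counts):
    """GT,R,dis: Good-Turing for r <= R, linear discount for r > R."""
    if r <= big_r:
        return good_turing_count(r, count_of_counts)
    return linear_discount(r, dis)
```

For instance, with a cut-off of 3, a bigram seen 3 times is dropped while one seen 4 times is kept and then discounted.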
NOTE
-n must be given before -c and -b. -c must give the right number of cut-offs, and -d must
appear exactly N times, specifying the discounts for 1-gram, 2-gram, ..., respectively.
BREAKER-IDs could be SentenceTokens or ParagraphTokens. Conceptually, these ids have no
meaning when they appear in the middle of an n-gram.
EXCLUDE-IDs could be ambiguous ids. Conceptually, n-grams which contain those ids are
meaningless.
We cannot erase n-grams according to BREAKER-IDs and EXCLUDE-IDs directly from the IDNGRAM
file, because some low-level information in them is still useful.
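As a rough illustration of the two concepts, the sketch below labels an n-gram of word ids. It is not how slmbuild processes the file: the function name and the labels are hypothetical, and reading "in the middle" as the strictly interior positions of the n-gram is an assumption.

```python
def classify_ngram(ngram, breaker_ids, exclude_ids):
    """Label an n-gram (a tuple of word ids).

    "excluded": it contains an EXCLUDE-ID anywhere.
    "broken":   a BREAKER-ID sits at an interior position
                (assumed meaning of "in the middle").
    "ok":       neither of the above.
    """
    if any(w in exclude_ids for w in ngram):
        return "excluded"
    if any(w in breaker_ids for w in ngram[1:-1]):
        return "broken"
    return "ok"
```

With breakers 10, 11, 12 and exclude id 9, the trigram (5, 10, 7) would be labeled "broken" and (5, 9, 7) "excluded".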
EXAMPLE
The following example reads 'all.id3gram' and writes the trigram model 'all.slm'.
At the 1-gram level, use Good-Turing discount with cut-off 0, R=8, dis=0.9995. At the 2-gram
level, use Absolute discount with cut-off 3, dis auto-calculated. At the 3-gram level, use
Absolute discount with cut-off 2, dis auto-calculated. Word ids 10, 11, 12 are breakers
(sentence/paragraph/paper breakers, etc). The exclude id is 9. The lexicon contains 200000
words. The resulting language model uses -log(pr).
slmbuild -l -n 3 -o all.slm -w 200000 -c 0,3,2 -d GT,8,0.9995 -d ABS -d ABS -b 10,11,12 -e 9 all.id3gram
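When the command is generated from a script, the ordering and counting rules from NOTE are easy to get wrong. The following Python sketch assembles the argument list and checks those rules first. The helper name and its parameters are hypothetical, and it assumes that "the right number of cut-offs" means exactly N values, as in the example above.

```python
def build_slmbuild_args(n, out, wordcount, cuts, discounts, idngram_file,
                        breakers=(), excludes=(), use_log=False):
    """Return an argv list for slmbuild, validating the NOTE constraints."""
    if n not in (1, 2, 3):
        raise ValueError("-n must be 1, 2 or 3")
    if len(cuts) != n:
        # Assumption: one cut-off per n-gram level.
        raise ValueError("-c needs exactly N cut-offs")
    if len(discounts) != n:
        raise ValueError("-d must appear exactly N times")

    args = ["slmbuild"]
    if use_log:
        args.append("-l")
    # -n is emitted before -c and -b, as NOTE requires.
    args += ["-n", str(n), "-o", out, "-w", str(wordcount)]
    args += ["-c", ",".join(str(c) for c in cuts)]
    for spec in discounts:  # 1-gram first, then 2-gram, ...
        args += ["-d", spec]
    if breakers:
        args += ["-b", ",".join(str(i) for i in breakers)]
    if excludes:
        args += ["-e", ",".join(str(i) for i in excludes)]
    args.append(idngram_file)
    return args


example_args = build_slmbuild_args(
    n=3, out="all.slm", wordcount=200000, cuts=[0, 3, 2],
    discounts=["GT,8,0.9995", "ABS", "ABS"],
    idngram_file="all.id3gram",
    breakers=[10, 11, 12], excludes=[9], use_log=True,
)
```

The resulting list can be passed to subprocess.run(example_args, check=True) on a machine where slmbuild is installed.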