fst-compiler-utf8 - Online in the Cloud

This is the command fst-compiler-utf8 that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator

Run in Ubuntu Run in Fedora Run in Windows Sim Run in MACOS Sim

PROGRAM:

NAME

fst-compiler, fst-compiler-utf8 - Two compilers for SFST programs

SYNOPSIS

fst-compiler grammar-file [ output-file ]
fst-compiler-utf8 grammar-file [ output-file ]

OPTIONS

-c Store the transducer in compact format which is used by fst-infl2.

-l Store the transducer in lowmem format.

-s Switch surface and analysis layer of the transducer. You have to use this switch in
order to use fst-infl (fst-infl2, fst-infl3) for generation rather than analysis.

DESCRIPTION

fst-compiler is a compiler for finite-state transducer programs. It generates a minimized
finite state transducer which can be used with fst-mor, fst-infl, fst-print, fst-compare,
fst-parse, and fst-lattice. The compact transducer representation which is generated with
the -c flag, is supported by fst-infl2, fst-train, and fst-match. The memory-efficient
transducer representation which is generated with the -l flag, is only supported by fst-
infl3.

The first program argument is the name of a file which contains the transducer program.
The programming language is described below. The second argument is the name of the file
to which the resulting transducer will be written in binary form. If a second argument is
missing, the output will be written to stdout.

fst-compiler-utf8 differs from fst-compiler only in the character encoding. fst-compiler-
utf8 supports UTF8 encoding of the source files whereas fst-compiler is to be used for
8-Bit character codes like latin1 which are an extension of the ASCII code. Information
about the encoding is stored in the transducer files and used by the other SFST programs.

FILE FORMATS

A transducer program consists of an (optional) sequence of alphabet and variable
definitions followed by a single transducer expression which defines the result
transducer.

Alphabet

An alphabet definition consists of the keyword ALPHABET followed by = and some transducer
expression e.g.

ALPHABET = [a-z]:[A-Z]

This command redefines the alphabet as the set of symbol pairs occurring on the
transitions of the transducer. Occurrences of two-level operators, negation operators and
unquoted periods always have to be preceded by an alphabet definition.

Variables

There are two different types of variables. Symbol set variables are enclosed by hash
signs (#) and take symbol sequences (see below) as values:

#UC# = A-Z
#LC# = a-z

Transducer variables are enclosed by dollar signs and take transducer expressions as
values:

$MAP$ = [a-z]:[A-Z]+
$MAP$ = [#LC#]:[#UC#]+

Variables whose name starts with the symbol `=' are special agreement variables. If an
agreement variable occurs more than once in a transducer expression, it will always have
the same value. Consider the following transducer program:

$=1$ = [abc]
$=1$ X $=1$

The result transducer recognizes the strings aXa, bXb, and cXc. Only acyclic transducers
(i.e. transducers with a finite set of string mappings) can be assigned to agreement
variables.

Symbols

A symbol is either

- a single character like A s 5,

- a quoted character like \* or \_,

- a multi-character symbol like <X> or <ab.c5> (which is always
enclosed in angle brackets) or

- a backslash followed by a number which is the numeric code of the
designated character

- the null symbol <>.

Symbol sequence

A symbol sequence is a sequence of characters, multi-character symbols and character
ranges, e.g. a-z \. <x>.

symbol range

A symbol range is either

- a single symbol

- a symbol sequence enclosed in square brackets like [A-Za-z] or

- a symbol sequence starting with ^ and enclosed in square brackets like [^A-Za-z]
(designating the complement of [a-zA-Z]) or

- the period (which represents any symbol from the alphabet)

Transducer expressions

A transducer expression (TE) is recursively defined as follows:

- A pair of two symbol ranges separated by a colon is a TE.

[a-z]:[a-Z]

- A single symbol range like [a-z] is a TE.
It is a short form for [a-z]:[a-z].

- Two symbol sequences enclosed in braces and separated by a colon are
a TE. {a[bc]}:{def} is equivalent to a:d b:e <>:f | a:d c:e <>:f.

- X Y is a TE if X and Y are TEs.
(Blanks are ignored unless they are quoted.)

- (X) is a TE if X is a TE.

- X op is a TE is X is a TE and op is either * (Kleene's star operator), +
(Kleene's plus operator), or ? (optionality operator)

- op X is a TE is X is a TE and op is either ! (negation operator), ^
(target language extraction operator), _ (source language extraction operator), or ^_
(source and target switch operator).

- X op Y is a TE is X and Y are TEs and op is either & (conjunction
operator), | (disjunction operator), || (composition operator), or - (subtraction
operator)

- L x op y R is a TE if L and R are TEs, x and y are symbol ranges and
op is either => (two-level restriction), <= (two-level coercion), or <=> (two-level
restriction and coercion).

- X op L__R is a TE if X, L and R are TEs and op is either ^-> (upward
replacement), _-> (downward replacement), /-> (leftward replacement) or \-> (rightward
replacement). Furthermore, L and R must define automata (i.e. which map their strings
onto themselves). These operators correspond to Karttunen's replace operators. If the
arrow is followed by a question mark (?), the replacement becomes optional.

- X << l is a TE if X is a TE, and l is either of the form
a or the form a:b where a and b are single characters or symbols. The result is a
transducer where l was freely inserted into X. The transducer ab << c for instance is
equivalent to c*ac*bc*.

- X op Y L1__R2, ... , LN__RN is a TE if X,Y, L1 through LN and R1
through RN are TEs, and op is either => (general restriction), <= (general coercion), ^=>
(general surface restriction), ^<= (general surface coercion), ^<=> (general surface
restriction and coercion), _=> (general deep restriction), _<= (general deep coercion),
_<=> (general deep restriction and coercion). (These operators were implemented following
a suggestion by Anssi Yli-Jyra.)

- "fname" is a TE. The compiler reads the file named fname and turns
it into a transducer of the form line1|line2|line3|... where linex is the x-th line of
the file. All characters other than : and \ are interpreted literally (i.e. not as
operators). This TE is typically used e.g. to read morpheme list from a file.

- "<fname>" is a TE. The compiler reads a pre-compiled transducer from
the file named fname. This

Further Features

Comments start with the symbol % and extend up to the end of the line. Blanks are ignored
unless they are quoted. Expressions terminate at the end of a line unless the end of line
is preceded by a backslash. The command

#include "fname"

can be used to insert source code from a file named fname. The command

RE >> "fname"

stores the regular expression RE in the file fname. The command

#use hopcroft

tells the compiler to use the Hopcroft minimisation algorithm from now on, and

#use default

switches back to the default minimisation algorithm (Brzozowski). The command

EXAMPLE

Here is an example of a simple transducer program. Assuming that the file "adj-stems"
contains the two lines

easy
late
big

this transducer will correctly analyze the adjective forms easy, easier, easiest and late,
later, and latest.

ALPHABET = [a-zA-Z] y:i e:<> <ADJ>:<>

$R$ = y<=>i (<ADJ>:<> e)

$R2$ = e<=><> (<ADJ>:<> e)

$R$ = $R$ & $R2$

$Stems$ = "adj-stems"

$S$ = $Stems$ <ADJ> (<pos>:<>|<cmp>:{er}|<sup>:{est})

$S$ || $R$

EXIT STATUS

fst-compiler returns 0 unless some error occurs.

Use fst-compiler-utf8 online using onworks.net services

Latest Linux & Windows online programs

Ky targets modern browsers and Deno.
For older browsers, you will need to
transpile and use a fetch polyfill and
globalThis polyfill. For Node.js, check
out Go...

Enter

NATS Go Client

With flexible deployment models using
clusters, superclusters, and leaf nodes,
optimize communications for your unique
deployment. The NATS Adaptive Edge Archi...

Enter

Alist

Alist is a file?listing and WebDAV
server built with Gin (Go) and SolidJS.
It supports multiple storage
backendslocal disk, OneDrive, Google
Drive, Aliyun Dri...

Enter

Gravity language

Gravity is a powerful, dynamically
typed, lightweight, embeddable
programming language written in C
without any external dependencies
(except for stdlib). It i...

Enter

SentenceTransformers

SentenceTransformers is a Python
framework for state-of-the-art sentence,
text and image embeddings. The initial
work is described in our paper
Sentence-BERT: ...

Enter

OpenMAIC

OpenMAIC is an open-source multi-agent
learning platform built to turn a topic
or uploaded material into a fully
interactive classroom experience with
minimal ...

Enter

fswatch

A cross-platform file change monitor
with multiple backends: Apple OS X File
System Events API, BSD kqueue,
Solaris/Illumos File Events
Notification, Linux ino...

Enter

XcodeGen

XcodeGen is a command line tool written
in Swift that generates your Xcode
project using your folder structure and
a project spec. The project spec is a
YAML o...

Enter

V2Fly

V2Fly Project V is a set of network
tools that helps you to build your own
computer network. It secures your
network connections and thus protects
your privacy...

Enter

fst-compiler-utf8 - Online in the Cloud

PROGRAM:

NAME

SYNOPSIS

OPTIONS

DESCRIPTION

FILE FORMATS

EXAMPLE

EXIT STATUS

Latest Linux & Windows online programs

Categories to download Software & Programs for Windows & Linux