fst_runtime package
Module contents
fst-runtime
This package acts as a lightweight runtime for querying finite-state transducers (FSTs) stored in the AT&T .att
format.
The development of this package was motivated by the need for a lightweight way to run FSTs compiled with foma or other toolkits in Python, separate from the compilation technology. Previously, installing all of foma into a Docker container and compiling by source was necessary to query the FST during deployment. This package provides a solution to that.
Note
Epsilon in this AT&T format is represented by the string ‘@0@’.
This package is used via the Fst
object. This object requires a path to the AT&T-compiled FST, e.g., fst = Fst('/path/to/file.att')
.
You can get the multi-character symbols used in the FST via fst.multichar_symbols
.
The Fst
object has a public method named down_generation
, which follows the FST convention of the “down” direction being the
direction of “generation” (i.e., creating forms from some tagged word form). For example, in English, for the lemma “do” with
possible affixes of prefixes = [['de', 'un']]
and suffixes = [['+VERB'], ['+GER', '+INF', '+PAST']]
,
the call would be fst.down_generation([['un']], 'do', [['+VERB'], ['+GER', '+INF', '+PAST']])
.
This would return ['do', 'doing', 'did', 'undo', 'undoing', 'undid']
.
Submodules
fst_runtime.att_format_error module
att_format_error
This module defines custom exceptions used for handling errors specific to the AT&T file format (.att file) processing.
- fst_runtime.att_format_error.AttFormatError[source]
Exception raised for errors in the input AT&T file format (.att file).
- Type:
class
- exception fst_runtime.att_format_error.AttFormatError(message: str)[source]
Bases:
Exception
Exception raised for errors in the input AT&T file format (.att file) that stores an FST.
- Parameters:
message (str) – The error message to be displayed.
- message
The error message to be displayed.
- Type:
str
fst_runtime.fst module
This module provides the main class Fst
which defines a finite-state transducer (FST) in-memory as a directed graph.
- fst_runtime.fst.EPSILON
The epsilon character as encoded in the AT&T
.att
FST format; this representation is the string:@0@
.- Type:
str
- fst_runtime.fst.EPSILON: str = '@0@'
This is the epsilon character as encoded in the AT&T
.att
FST format.
- class fst_runtime.fst.Fst(att_file_path: str, *, semiring: Semiring | None = None, recursion_limit: int | None = None)[source]
Bases:
object
Represents a finite-state transducer as a directed graph.
- recursion_limit
Sets the recursion limit for the generation/analysis functionality, to prevent epsilon cycles from running amok.
- Type:
int
- multichar_symbols
A copy of the set of multi-character symbols defined in the FST.
- Type:
set[str]
- down_generation[source]
Generates wordforms from a lemma and sets of prefix and suffix tags.
- Type:
method
- down_generations[source]
Generates wordforms from many lemmas and common sets of prefix and suffix tags.
- Type:
method
- up_analysis[source]
Analyzes a wordform and returns any associated tagged lemmas of the wordform.
- Type:
method
- up_analyses[source]
Analyzes many wordforms and returns their associated tagged lemmas of each wordform in a dictionary keyed to the wordform.
- Type:
method
- down_generation(lemma: str, *, prefixes: list[list[str]] | None = None, suffixes: list[list[str]] | None = None) Generator[FstOutput] [source]
Queries the FST in the down/generation direction.
- Parameters:
lemma (str) – The lemma to process.
prefixes (list[list[str]], optional) – A list of lists containing prefix sequences. Default is None.
suffixes (list[list[str]], optional) – A list of lists containing suffix sequences. Default is None.
- Returns:
A generator of generated forms that are accepted by the FST along with their weights.
- Return type:
Generator[FstOutput]
Note
When provided lists of prefixes and suffixes as well as the lemma, it fully permutes the tags based on the slots of the affixes. For example, the lemma “wal” in English (for the lemma “walk”), with prefix tags
[["+VERB"], ["+INF", "+PAST", "+GER", "+PRES"]]
. Then, these would be fully permuted to “wal+VERB+INF”, “wal+VERB+PAST”, “wal+VERB+GER”, and “wal+VERB+PRES”; likewise with any prefixes. All of these constructions are then walked over the FST to see if we end at an accepting state. If so, the generated forms (i.e., walk, walked, walking, walks) will be added to a list and returned.
- down_generations(lemmas: list[str], *, prefixes: list[list[str]] | None = None, suffixes: list[list[str]] | None = None) dict[str, Generator[FstOutput]] [source]
Calls
down_generation
for each lemma and returns a dictionary keyed on each lemma.The values in the dictionary are generators of wordforms returned by the FST.
- Parameters:
lemmas (list[str]) – The list of lemmas to process.
prefixes (list[list[str]], optional) – A list of lists containing prefix sequences. Default is None.
suffixes (list[list[str]], optional) – A list of lists containing suffix sequences. Default is None.
- Returns:
A dictionary where each key is a lemma and the value is a generator of wordforms generated by the FST.
- Return type:
dict[str, Generator[str]]
See also
down_generation
For more information on how each lemma is processed.
- property multichar_symbols: set[str]
Public getter for the multichar_symbols variable.
- Returns:
A copy of the set of multi-character symbols.
- Return type:
set[str]
- property recursion_limit: int | None
Public getter for the recursion_limit variable.
- Returns:
The recursion limit that has been set. A value less than 1 represents that no recursion limit has been set, and so the current system recursion limit will be used (default for Python applications).
- Return type:
int
- up_analyses(wordforms: list[str]) dict[str, Generator[FstOutput]] [source]
Calls
up_analysis
for each wordform and returns a dictionary keyed on each wordform.The values in the dictionary are generators of tagged forms returned by the FST.
- Parameters:
wordforms (list[str]) – The list of wordforms to process.
- Returns:
A dictionary where each key is a wordform and the value is a generator of tagged forms generated by the FST, along with their weights.
- Return type:
dict[str, Generator[FstOutput]]
See also
up_analysis
For more information on how each wordform is processed.
- up_analysis(wordform: str) Generator[FstOutput] [source]
Queries the FST up, or in the direction of analysis.
- Parameters:
wordform (str) – The wordform to process.
- Returns:
A generator of tagged forms that could lead to the provided wordform, along with their weights.
- Return type:
Generator[FstOutput]
Note
This function queries the FST in the direction of analysis by starting at the accepting states. Instead of looking at the input symbols for a node and the out transitions, it looks at the output symbols of the node and the in transitions. In this way, the FST becomes reversed. For example,
walking -> wal+GER
. There can be several tagged forms that lead to a single word. For instance, the wordwalk
can have forms likewal+VERB+1Sg+Pres
,wal+VERB+2Sg+Pres
, etc., that lead to its generation. All these tagged forms are aggregated and returned.This method starts at the accepting states and looks at the output symbols of the node and the in transitions, effectively reversing the FST. While down/generation generates forms from a lemma plus some tags, the up/analysis direction takes a word form and generates the tagged forms that could lead to that particular word form.
- class fst_runtime.fst.FstOutput(output_string: str, path_weight: Any)[source]
Bases:
object
A dataclass for holding the output from a given node to another in an FST.
- ouput_string
This string represents the current state of the FST output; e.g. this could be “r”, then “ru”, then “run” as you walk through the FST.
- Type:
str
- path_weight
This is the current weight of the path being walked. This value is computed via the semiring provided to the FST.
- Type:
Any
- output_string: str
This string represents the current state of the FST output; e.g. this could be “r”, then “ru”, then “run” as you walk through the FST.
- path_weight: Any
This is the current weight of the path being walked. This value is computed via the semiring provided to the FST.
fst_runtime.tokenize_input module
This module holds a tokenization function that splits an input string into its constituent parts, while considering the set of provided multi-character symbols.
- fst_runtime.tokenize_input.tokenize_input_string[source]
Tokenizes the input string while respecting the multichar_symbols.
- Type:
function
- fst_runtime.tokenize_input.tokenize_input_string(input_string: str, multichar_symbols: set[str]) list[str] [source]
Returns a list containing the individual tokens that make up the
input_string
.- Parameters:
input_string (str) – The input string to be tokenized.
multichar_symbols (set[str]) – A set of multi-character symbols that need to be recognized as single tokens.
- Returns:
A list of individual tokens that make up the input string.
- Return type:
list[str]
Note
This function tokenizes the input string into individual tokens, taking into account the multi-character symbols specified in the
multichar_symbols
set. It ensures that the multi-character symbols are recognized as single tokens rather than being split into multiple tokens.
fst_runtime.semiring module
This module defines a semiring as well as several semirings commonly used with weighted FSTs.
- fst_runtime.semiring.BooleanSemiring[bool][source]
A semiring whose underlying set and operations are defined over the boolean values
True
andFalse
.- Type:
class
- fst_runtime.semiring.LogSemiring[float][source]
A semiring whose underlying set of values are the reals with +/- infinity, with addition as logadd and multiplication as standard addition.
- Type:
class
- fst_runtime.semiring.ProbabilitySemiring[float][source]
This is the probability semiring that is defined on the non-negative reals and standard additiona and multiplication.
- Type:
class
- fst_runtime.semiring.TropicalSemiring[float][source]
The tropical semiring is defined on the reals with +/- infinity, where addition is the minimum and multiplication is standard addition.
- Type:
class
- class fst_runtime.semiring.BooleanSemiring[source]
Bases:
Semiring
[bool
]A semiring whose underlying set and operations are defined over the boolean values
True
andFalse
.- convert_string_into_domain[source]
Converts the string representation of a value into the
bool
type.- Type:
method
Note
The boolean semiring defines
add
as theor
operator andmultiply
as theand
operator. The additive identity of the semiring isFalse
, and the multiplicative idenity isTrue
.This is also apparently the smallest semiring that is not a ring.
See also
Semiring
The base class of the
BooleanSemiring
withT = bool
.
References
- Wikipedia article on two-element boolean algebra:
- check_membership(*values: Any) bool [source]
Checks that all provided values are boolean.
- Parameters:
*values (Any) – The values to check for boolean membership.
- Returns:
Whether or not every provided value is in the underlying set or not.
- Return type:
bool
- convert_string_into_domain(string_representation_of_value: str) bool [source]
Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).
- Parameters:
string_representation_of_value (str) – This is the string representation of the value to be converted.
- Returns:
The value converted into the underlying domain of the semiring.
- Return type:
T
- class fst_runtime.semiring.LogSemiring[source]
Bases:
Semiring
[float
]A semiring whose underlying set of values are the reals with +/- infinity, with addition as logadd and multiplication as standard addition.
- check_membership[source]
Checks that all provided values are real numbers or +/- infinity.
- Type:
method
- convert_string_into_domain[source]
Converts the string representation of a value into the
float
type.- Type:
method
Note
This is also known as the minimum logarithmic semiring, given the negation of the log and the exponents of e.
This semiring defines
add
as-math.log(math.exp(-a) + math.exp(-b))
andmultiply
asa + b
. It defines the additive identity asfloat('inf')
, and the multiplicative identity as0.0
.This
add
function is a smooth approximation of the minimum of the valuesa
andb
. This sort of operation is known as the log-sum-exp trick, which allows for higher precision when doing floating-point arithmetic on large or small values by shifting the values into a domain that’s better suited for floating-point precision. This sort of equation is often used in probability theory, as logarithms can have a bunch of benefits for calculations.See also
Semiring
The base class of the
LogSemiring
withT = float
.
References
- Wikipedia article on the LogSumExp function:
- Wikipedia article on the log semiring:
- Numpy has the maximum version of this function, see this and the Wikipedia article on the log semiring:
https://numpy.org/doc/stable/reference/generated/numpy.logaddexp.html
- check_membership(*values: Any) bool [source]
Checks that all provided values are real numbers or +/- infinity.
- Parameters:
*values (Any) – The values to check for real number membership.
- Returns:
Whether or not every provided value is in the underlying set or not.
- Return type:
bool
- convert_string_into_domain(string_representation_of_value: str) float [source]
Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).
- Parameters:
string_representation_of_value (str) – This is the string representation of the value to be converted.
- Returns:
The value converted into the underlying domain of the semiring.
- Return type:
T
- class fst_runtime.semiring.ProbabilitySemiring[source]
Bases:
Semiring
[float
]This is the probability semiring that is defined on the non-negative reals and standard additiona and multiplication.
- check_membership[source]
Checks that all provided values are non-negative real numbers.
- Type:
method
- convert_string_into_domain[source]
Converts the string representation of a value into the
float
type.- Type:
method
Note
This semiring uses standard addition and multiplication, and is meant for managing weights that are probabilities.
See also
Semiring
The base class of the
ProbabilitySemiring
withT = float
.
- check_membership(*values: Any) bool [source]
Checks that all provided values are non-negative real numbers.
- Parameters:
*values (Any) – The values to check for membership in the non-negative reals.
- Returns:
Whether or not every provided value is in the underlying set or not.
- Return type:
bool
- convert_string_into_domain(string_representation_of_value: str) float [source]
Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).
- Parameters:
string_representation_of_value (str) – This is the string representation of the value to be converted.
- Returns:
The value converted into the underlying domain of the semiring.
- Return type:
T
- class fst_runtime.semiring.Semiring(add: Callable[[T, T], T], multiply: Callable[[T, T], T], additive_identity: T, multiplicative_identity: T)[source]
Bases:
ABC
,Generic
An abstract class that defines a semiring.
- additive_identity
The additive identity of the semiring.
- Type:
T
- multiplicative_identity
The multiplicative identity of the semiring.
- Type:
T
- get_path_weight[source]
Computes the overall weight of a single path by multiplying the weights of all edges in the path.
- Type:
method
- get_path_set_weight[source]
Computes the overall weight of a set of paths by adding the weights of individual paths.
- Type:
method
- check_membership[source]
This method ensures that the values provided to it are members of the underlying set of the semiring. Raises a
ValueError
if not.- Type:
abstract method
- convert_string_into_domain[source]
This takes the string representation of a value and converts it into the underlying domain of the semiring.
- Type:
abstract method
Examples
An example of initializing this object for the tropical semiring would be:
class TropicalSemiring(Semiring[float]): def __init__(self) -> None: super().__init__( add=min, multiply=lambda a, b: a + b, additive_identity=float('inf'), multiplicative_identity=0.0 )
References
- See this OpenFST paper for a relatively high-level discussion of weighted FSTs and semirings.
https://www.openfst.org/twiki/pub/FST/FstBackground/ciaa.pdf
- Wikipedia discussion on semirings:
- See this paper for a more in-depth and technical weighted FST design discussion:
https://www.cs.mun.ca/~harold/Courses/Old/Ling6800.W06/Diary/tcs.pdf
See this textbook for the definitions of the different semirings used here, as well as the general mathematical underpinning of them, and their uses in/for FSTs:
Lothaire, Applied Combinatorics on Words (Cambridge: Cambridge University Press, 2004), 200.
- add(a: T, b: T) T [source]
Performs the addition operation of the semiring.
- Parameters:
a (T) – The first operand.
b (T) – The second operand.
- Returns:
The result of the addition.
- Return type:
T
Note
Please note that this addition is not the standard “+” operation, but could be any associative, commutative binary operation that has an identity element 0.
- property additive_identity: T
The additive identity of the semiring.
- Returns:
The additive identity.
- Return type:
T
- abstract check_membership(*values: Any) bool [source]
Checks that the given values are members of the underlying set of the semiring.
- Parameters:
*values (Any) – The values that will be checked to guarantee they are of the type of the underlying set of the semiring.
- Returns:
Whether or not every provided value is in the underlying set or not.
- Return type:
bool
- abstract convert_string_into_domain(string_representation_of_value: str) T [source]
Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).
- Parameters:
string_representation_of_value (str) – This is the string representation of the value to be converted.
- Returns:
The value converted into the underlying domain of the semiring.
- Return type:
T
- get_path_set_weight(*set_of_path_weights: T) T [source]
Computes the overall weight of a set of paths by adding the weights of individual paths.
- Parameters:
*set_of_path_weights (T) – A list of weights corresponding to individual paths.
- Returns:
The overall weight of the set of paths, computed as the sum of the individual path weights.
- Return type:
T
References
Lothaire, Applied Combinatorics on Words (Cambridge: Cambridge University Press, 2004), 201.
- get_path_weight(*path_weights: T) T [source]
Computes the overall weight of a single path by multiplying the weights of all edges in the path.
- Parameters:
*path_weights (T) – Weights corresponding to the edges in a path.
- Returns:
The overall weight of the path, computed as the product of the individual edge weights.
- Return type:
T
References
Lothaire, Applied Combinatorics on Words (Cambridge: Cambridge University Press, 2004), 201.
- property multiplicative_identity: T
The multiplicative identity of the semiring.
- Returns:
The multiplicative identity.
- Return type:
T
- multiply(a: T, b: T) T [source]
Performs the multiplication operation of the semiring.
- Parameters:
a (T) – The first operand.
b (T) – The second operand.
- Returns:
The result of the multiplication.
- Return type:
T
Note
Please note that this multiplication is not the standard “*” operation, but could be any associative binary operation that distributes over the defined addition with identity element 1 and that has 0 as an annhilator. Multiplication retains higher precedence over the defined addition.
- class fst_runtime.semiring.TropicalSemiring[source]
Bases:
Semiring
[float
]The tropical semiring is defined on the reals with +/- infinity, where addition is the minimum and multiplication is standard addition.
- check_membership[source]
Checks that all provided values are real numbers or +/- infinity.
- Type:
method
- convert_string_into_domain[source]
Converts the string representation of a value into the
float
type.- Type:
method
Note
This is also known as the minimum tropical semiring for its use of
min
, instead ofmax
, as the addition function.As mentioned,
add
is defined asmin{a, b}
. Multiplication is defined as standard addition. The additive identity isfloat('inf')
.The way this works is that for a given output form, you may end up with a bunch of different paths that got you there. Each of those paths will have its own weight, and, because addition is
min
, that means when you sum the paths together, the result you get is the lowest weight among paths that led to the output. This can be useful because some paths may be penalized for having maybe non-standard forms, etc., but which lead to a perfectly valid output. We therefore only care about the minimum weight which is therefore the determiner of the validity/order of the output form. The rest of the weights can be thought of as superfluous.See also
Semiring
The base class of the
TropicalSemiring
withT = float
.
References
- The Wikipedia article on tropical semirings:
- check_membership(*values: Any) bool [source]
Checks that all provided values are real numbers or +/- infinity.
- Parameters:
*values (Any) – The values to check for real number membership.
- Returns:
Whether or not every provided value is in the underlying set or not.
- Return type:
bool
- convert_string_into_domain(string_representation_of_value: str) float [source]
Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).
- Parameters:
string_representation_of_value (str) – This is the string representation of the value to be converted.
- Returns:
The value converted into the underlying domain of the semiring.
- Return type:
T