fst_runtime package

Module contents

fst-runtime

This package acts as a lightweight runtime for querying finite-state transducers (FSTs) stored in the AT&T .att format.

The development of this package was motivated by the need for a lightweight way to run FSTs compiled with foma or other toolkits in Python, separate from the compilation technology. Previously, installing all of foma into a Docker container and compiling by source was necessary to query the FST during deployment. This package provides a solution to that.

Note

Epsilon in this AT&T format is represented by the string ‘@0@’.

This package is used via the Fst object. This object requires a path to the AT&T-compiled FST, e.g., fst = Fst('/path/to/file.att'). You can get the multi-character symbols used in the FST via fst.multichar_symbols.

The Fst object has a public method named down_generation, which follows the FST convention of the “down” direction being the direction of “generation” (i.e., creating forms from some tagged word form). For example, in English, for the lemma “do” with possible affixes of prefixes = [['de', 'un']] and suffixes = [['+VERB'], ['+GER', '+INF', '+PAST']], the call would be fst.down_generation([['un']], 'do', [['+VERB'], ['+GER', '+INF', '+PAST']]). This would return ['do', 'doing', 'did', 'undo', 'undoing', 'undid'].

Submodules

fst_runtime.att_format_error module

att_format_error

This module defines custom exceptions used for handling errors specific to the AT&T file format (.att file) processing.

fst_runtime.att_format_error.AttFormatError[source]

Exception raised for errors in the input AT&T file format (.att file).

Type:

class

exception fst_runtime.att_format_error.AttFormatError(message: str)[source]

Bases: Exception

Exception raised for errors in the input AT&T file format (.att file) that stores an FST.

Parameters:

message (str) – The error message to be displayed.

message

The error message to be displayed.

Type:

str

fst_runtime.fst module

This module provides the main class Fst which defines a finite-state transducer (FST) in-memory as a directed graph.

fst_runtime.fst.Fst[source]

Defines an FST in-memory as a directed graph.

Type:

class

fst_runtime.fst.EPSILON

The epsilon character as encoded in the AT&T .att FST format; this representation is the string: @0@.

Type:

str

fst_runtime.fst.EPSILON: str = '@0@'

This is the epsilon character as encoded in the AT&T .att FST format.

class fst_runtime.fst.Fst(att_file_path: str, *, semiring: Semiring | None = None, recursion_limit: int | None = None)[source]

Bases: object

Represents a finite-state transducer as a directed graph.

recursion_limit

Sets the recursion limit for the generation/analysis functionality, to prevent epsilon cycles from running amok.

Type:

int

multichar_symbols

A copy of the set of multi-character symbols defined in the FST.

Type:

set[str]

down_generation[source]

Generates wordforms from a lemma and sets of prefix and suffix tags.

Type:

method

down_generations[source]

Generates wordforms from many lemmas and common sets of prefix and suffix tags.

Type:

method

up_analysis[source]

Analyzes a wordform and returns any associated tagged lemmas of the wordform.

Type:

method

up_analyses[source]

Analyzes many wordforms and returns their associated tagged lemmas of each wordform in a dictionary keyed to the wordform.

Type:

method

down_generation(lemma: str, *, prefixes: list[list[str]] | None = None, suffixes: list[list[str]] | None = None) Generator[FstOutput][source]

Queries the FST in the down/generation direction.

Parameters:
  • lemma (str) – The lemma to process.

  • prefixes (list[list[str]], optional) – A list of lists containing prefix sequences. Default is None.

  • suffixes (list[list[str]], optional) – A list of lists containing suffix sequences. Default is None.

Returns:

A generator of generated forms that are accepted by the FST along with their weights.

Return type:

Generator[FstOutput]

Note

When provided lists of prefixes and suffixes as well as the lemma, it fully permutes the tags based on the slots of the affixes. For example, the lemma “wal” in English (for the lemma “walk”), with prefix tags [["+VERB"], ["+INF", "+PAST", "+GER", "+PRES"]]. Then, these would be fully permuted to “wal+VERB+INF”, “wal+VERB+PAST”, “wal+VERB+GER”, and “wal+VERB+PRES”; likewise with any prefixes. All of these constructions are then walked over the FST to see if we end at an accepting state. If so, the generated forms (i.e., walk, walked, walking, walks) will be added to a list and returned.

down_generations(lemmas: list[str], *, prefixes: list[list[str]] | None = None, suffixes: list[list[str]] | None = None) dict[str, Generator[FstOutput]][source]

Calls down_generation for each lemma and returns a dictionary keyed on each lemma.

The values in the dictionary are generators of wordforms returned by the FST.

Parameters:
  • lemmas (list[str]) – The list of lemmas to process.

  • prefixes (list[list[str]], optional) – A list of lists containing prefix sequences. Default is None.

  • suffixes (list[list[str]], optional) – A list of lists containing suffix sequences. Default is None.

Returns:

A dictionary where each key is a lemma and the value is a generator of wordforms generated by the FST.

Return type:

dict[str, Generator[str]]

See also

down_generation

For more information on how each lemma is processed.

property multichar_symbols: set[str]

Public getter for the multichar_symbols variable.

Returns:

A copy of the set of multi-character symbols.

Return type:

set[str]

property recursion_limit: int | None

Public getter for the recursion_limit variable.

Returns:

The recursion limit that has been set. A value less than 1 represents that no recursion limit has been set, and so the current system recursion limit will be used (default for Python applications).

Return type:

int

up_analyses(wordforms: list[str]) dict[str, Generator[FstOutput]][source]

Calls up_analysis for each wordform and returns a dictionary keyed on each wordform.

The values in the dictionary are generators of tagged forms returned by the FST.

Parameters:

wordforms (list[str]) – The list of wordforms to process.

Returns:

A dictionary where each key is a wordform and the value is a generator of tagged forms generated by the FST, along with their weights.

Return type:

dict[str, Generator[FstOutput]]

See also

up_analysis

For more information on how each wordform is processed.

up_analysis(wordform: str) Generator[FstOutput][source]

Queries the FST up, or in the direction of analysis.

Parameters:

wordform (str) – The wordform to process.

Returns:

A generator of tagged forms that could lead to the provided wordform, along with their weights.

Return type:

Generator[FstOutput]

Note

This function queries the FST in the direction of analysis by starting at the accepting states. Instead of looking at the input symbols for a node and the out transitions, it looks at the output symbols of the node and the in transitions. In this way, the FST becomes reversed. For example, walking -> wal+GER. There can be several tagged forms that lead to a single word. For instance, the word walk can have forms like wal+VERB+1Sg+Pres, wal+VERB+2Sg+Pres, etc., that lead to its generation. All these tagged forms are aggregated and returned.

This method starts at the accepting states and looks at the output symbols of the node and the in transitions, effectively reversing the FST. While down/generation generates forms from a lemma plus some tags, the up/analysis direction takes a word form and generates the tagged forms that could lead to that particular word form.

class fst_runtime.fst.FstOutput(output_string: str, path_weight: Any)[source]

Bases: object

A dataclass for holding the output from a given node to another in an FST.

ouput_string

This string represents the current state of the FST output; e.g. this could be “r”, then “ru”, then “run” as you walk through the FST.

Type:

str

path_weight

This is the current weight of the path being walked. This value is computed via the semiring provided to the FST.

Type:

Any

output_string: str

This string represents the current state of the FST output; e.g. this could be “r”, then “ru”, then “run” as you walk through the FST.

path_weight: Any

This is the current weight of the path being walked. This value is computed via the semiring provided to the FST.

fst_runtime.tokenize_input module

This module holds a tokenization function that splits an input string into its constituent parts, while considering the set of provided multi-character symbols.

fst_runtime.tokenize_input.tokenize_input_string[source]

Tokenizes the input string while respecting the multichar_symbols.

Type:

function

fst_runtime.tokenize_input.tokenize_input_string(input_string: str, multichar_symbols: set[str]) list[str][source]

Returns a list containing the individual tokens that make up the input_string.

Parameters:
  • input_string (str) – The input string to be tokenized.

  • multichar_symbols (set[str]) – A set of multi-character symbols that need to be recognized as single tokens.

Returns:

A list of individual tokens that make up the input string.

Return type:

list[str]

Note

This function tokenizes the input string into individual tokens, taking into account the multi-character symbols specified in the multichar_symbols set. It ensures that the multi-character symbols are recognized as single tokens rather than being split into multiple tokens.

fst_runtime.semiring module

This module defines a semiring as well as several semirings commonly used with weighted FSTs.

fst_runtime.semiring.Semiring[T][source]

An abstract class that defines a semiring.

Type:

class

fst_runtime.semiring.BooleanSemiring[bool][source]

A semiring whose underlying set and operations are defined over the boolean values True and False.

Type:

class

fst_runtime.semiring.LogSemiring[float][source]

A semiring whose underlying set of values are the reals with +/- infinity, with addition as logadd and multiplication as standard addition.

Type:

class

fst_runtime.semiring.ProbabilitySemiring[float][source]

This is the probability semiring that is defined on the non-negative reals and standard additiona and multiplication.

Type:

class

fst_runtime.semiring.TropicalSemiring[float][source]

The tropical semiring is defined on the reals with +/- infinity, where addition is the minimum and multiplication is standard addition.

Type:

class

class fst_runtime.semiring.BooleanSemiring[source]

Bases: Semiring[bool]

A semiring whose underlying set and operations are defined over the boolean values True and False.

check_membership[source]

Checks that all provided values are boolean.

Type:

method

convert_string_into_domain[source]

Converts the string representation of a value into the bool type.

Type:

method

Note

The boolean semiring defines add as the or operator and multiply as the and operator. The additive identity of the semiring is False, and the multiplicative idenity is True.

This is also apparently the smallest semiring that is not a ring.

See also

Semiring

The base class of the BooleanSemiring with T = bool.

References

Wikipedia article on two-element boolean algebra:

https://en.wikipedia.org/wiki/Two-element_Boolean_algebra

check_membership(*values: Any) bool[source]

Checks that all provided values are boolean.

Parameters:

*values (Any) – The values to check for boolean membership.

Returns:

Whether or not every provided value is in the underlying set or not.

Return type:

bool

convert_string_into_domain(string_representation_of_value: str) bool[source]

Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).

Parameters:

string_representation_of_value (str) – This is the string representation of the value to be converted.

Returns:

The value converted into the underlying domain of the semiring.

Return type:

T

class fst_runtime.semiring.LogSemiring[source]

Bases: Semiring[float]

A semiring whose underlying set of values are the reals with +/- infinity, with addition as logadd and multiplication as standard addition.

check_membership[source]

Checks that all provided values are real numbers or +/- infinity.

Type:

method

convert_string_into_domain[source]

Converts the string representation of a value into the float type.

Type:

method

Note

This is also known as the minimum logarithmic semiring, given the negation of the log and the exponents of e.

This semiring defines add as -math.log(math.exp(-a) + math.exp(-b)) and multiply as a + b. It defines the additive identity as float('inf'), and the multiplicative identity as 0.0.

This add function is a smooth approximation of the minimum of the values a and b. This sort of operation is known as the log-sum-exp trick, which allows for higher precision when doing floating-point arithmetic on large or small values by shifting the values into a domain that’s better suited for floating-point precision. This sort of equation is often used in probability theory, as logarithms can have a bunch of benefits for calculations.

See also

Semiring

The base class of the LogSemiring with T = float.

References

Wikipedia article on the LogSumExp function:

https://en.wikipedia.org/wiki/LogSumExp

Wikipedia article on the log semiring:

https://en.wikipedia.org/wiki/Log_semiring

Numpy has the maximum version of this function, see this and the Wikipedia article on the log semiring:

https://numpy.org/doc/stable/reference/generated/numpy.logaddexp.html

check_membership(*values: Any) bool[source]

Checks that all provided values are real numbers or +/- infinity.

Parameters:

*values (Any) – The values to check for real number membership.

Returns:

Whether or not every provided value is in the underlying set or not.

Return type:

bool

convert_string_into_domain(string_representation_of_value: str) float[source]

Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).

Parameters:

string_representation_of_value (str) – This is the string representation of the value to be converted.

Returns:

The value converted into the underlying domain of the semiring.

Return type:

T

class fst_runtime.semiring.ProbabilitySemiring[source]

Bases: Semiring[float]

This is the probability semiring that is defined on the non-negative reals and standard additiona and multiplication.

check_membership[source]

Checks that all provided values are non-negative real numbers.

Type:

method

convert_string_into_domain[source]

Converts the string representation of a value into the float type.

Type:

method

Note

This semiring uses standard addition and multiplication, and is meant for managing weights that are probabilities.

See also

Semiring

The base class of the ProbabilitySemiring with T = float.

check_membership(*values: Any) bool[source]

Checks that all provided values are non-negative real numbers.

Parameters:

*values (Any) – The values to check for membership in the non-negative reals.

Returns:

Whether or not every provided value is in the underlying set or not.

Return type:

bool

convert_string_into_domain(string_representation_of_value: str) float[source]

Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).

Parameters:

string_representation_of_value (str) – This is the string representation of the value to be converted.

Returns:

The value converted into the underlying domain of the semiring.

Return type:

T

class fst_runtime.semiring.Semiring(add: Callable[[T, T], T], multiply: Callable[[T, T], T], additive_identity: T, multiplicative_identity: T)[source]

Bases: ABC, Generic

An abstract class that defines a semiring.

additive_identity

The additive identity of the semiring.

Type:

T

multiplicative_identity

The multiplicative identity of the semiring.

Type:

T

add[source]

The addition operation for the semiring.

Type:

method

multiply[source]

The multiplication operation for the semiring.

Type:

method

get_path_weight[source]

Computes the overall weight of a single path by multiplying the weights of all edges in the path.

Type:

method

get_path_set_weight[source]

Computes the overall weight of a set of paths by adding the weights of individual paths.

Type:

method

check_membership[source]

This method ensures that the values provided to it are members of the underlying set of the semiring. Raises a ValueError if not.

Type:

abstract method

convert_string_into_domain[source]

This takes the string representation of a value and converts it into the underlying domain of the semiring.

Type:

abstract method

Examples

An example of initializing this object for the tropical semiring would be:

class TropicalSemiring(Semiring[float]):
    def __init__(self) -> None:
        super().__init__(
            add=min,
            multiply=lambda a, b: a + b,
            additive_identity=float('inf'),
            multiplicative_identity=0.0
        )

References

See this OpenFST paper for a relatively high-level discussion of weighted FSTs and semirings.

https://www.openfst.org/twiki/pub/FST/FstBackground/ciaa.pdf

Wikipedia discussion on semirings:

https://en.wikipedia.org/wiki/Semiring

See this paper for a more in-depth and technical weighted FST design discussion:

https://www.cs.mun.ca/~harold/Courses/Old/Ling6800.W06/Diary/tcs.pdf

See this textbook for the definitions of the different semirings used here, as well as the general mathematical underpinning of them, and their uses in/for FSTs:

Lothaire, Applied Combinatorics on Words (Cambridge: Cambridge University Press, 2004), 200.

add(a: T, b: T) T[source]

Performs the addition operation of the semiring.

Parameters:
  • a (T) – The first operand.

  • b (T) – The second operand.

Returns:

The result of the addition.

Return type:

T

Note

Please note that this addition is not the standard “+” operation, but could be any associative, commutative binary operation that has an identity element 0.

property additive_identity: T

The additive identity of the semiring.

Returns:

The additive identity.

Return type:

T

abstract check_membership(*values: Any) bool[source]

Checks that the given values are members of the underlying set of the semiring.

Parameters:

*values (Any) – The values that will be checked to guarantee they are of the type of the underlying set of the semiring.

Returns:

Whether or not every provided value is in the underlying set or not.

Return type:

bool

abstract convert_string_into_domain(string_representation_of_value: str) T[source]

Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).

Parameters:

string_representation_of_value (str) – This is the string representation of the value to be converted.

Returns:

The value converted into the underlying domain of the semiring.

Return type:

T

get_path_set_weight(*set_of_path_weights: T) T[source]

Computes the overall weight of a set of paths by adding the weights of individual paths.

Parameters:

*set_of_path_weights (T) – A list of weights corresponding to individual paths.

Returns:

The overall weight of the set of paths, computed as the sum of the individual path weights.

Return type:

T

References

Lothaire, Applied Combinatorics on Words (Cambridge: Cambridge University Press, 2004), 201.

get_path_weight(*path_weights: T) T[source]

Computes the overall weight of a single path by multiplying the weights of all edges in the path.

Parameters:

*path_weights (T) – Weights corresponding to the edges in a path.

Returns:

The overall weight of the path, computed as the product of the individual edge weights.

Return type:

T

References

Lothaire, Applied Combinatorics on Words (Cambridge: Cambridge University Press, 2004), 201.

property multiplicative_identity: T

The multiplicative identity of the semiring.

Returns:

The multiplicative identity.

Return type:

T

multiply(a: T, b: T) T[source]

Performs the multiplication operation of the semiring.

Parameters:
  • a (T) – The first operand.

  • b (T) – The second operand.

Returns:

The result of the multiplication.

Return type:

T

Note

Please note that this multiplication is not the standard “*” operation, but could be any associative binary operation that distributes over the defined addition with identity element 1 and that has 0 as an annhilator. Multiplication retains higher precedence over the defined addition.

class fst_runtime.semiring.TropicalSemiring[source]

Bases: Semiring[float]

The tropical semiring is defined on the reals with +/- infinity, where addition is the minimum and multiplication is standard addition.

check_membership[source]

Checks that all provided values are real numbers or +/- infinity.

Type:

method

convert_string_into_domain[source]

Converts the string representation of a value into the float type.

Type:

method

Note

This is also known as the minimum tropical semiring for its use of min, instead of max, as the addition function.

As mentioned, add is defined as min{a, b}. Multiplication is defined as standard addition. The additive identity is float('inf').

The way this works is that for a given output form, you may end up with a bunch of different paths that got you there. Each of those paths will have its own weight, and, because addition is min, that means when you sum the paths together, the result you get is the lowest weight among paths that led to the output. This can be useful because some paths may be penalized for having maybe non-standard forms, etc., but which lead to a perfectly valid output. We therefore only care about the minimum weight which is therefore the determiner of the validity/order of the output form. The rest of the weights can be thought of as superfluous.

See also

Semiring

The base class of the TropicalSemiring with T = float.

References

The Wikipedia article on tropical semirings:

https://en.wikipedia.org/wiki/Tropical_semiring

check_membership(*values: Any) bool[source]

Checks that all provided values are real numbers or +/- infinity.

Parameters:

*values (Any) – The values to check for real number membership.

Returns:

Whether or not every provided value is in the underlying set or not.

Return type:

bool

convert_string_into_domain(string_representation_of_value: str) float[source]

Returns the underlying value for a given string representation of a that value (i.e. as read in from a file).

Parameters:

string_representation_of_value (str) – This is the string representation of the value to be converted.

Returns:

The value converted into the underlying domain of the semiring.

Return type:

T