Finite-State Software at JHU CS: Getting Started (HOWTO)

This tutorial page should help you get started with the 3 different finite-state packages we have installed: FSA Utilities, the Xerox finite-state tool (xfst), and the AT&T FSM library together with lextools.

If you have trouble getting the programs to run, please contact Radu Florian (rflorian@cs.jhu.edu). If you have corrections or suggestions for this page, email Jason Eisner (jason@cs.jhu.edu).

Note on other finite-state packages

Note that you can find other finite-state packages on the web (e.g., a Mathematica library and a Java applet), but most of them only deal with finite-state acceptors, not with weights or transducers.

As you probably know, some useful unambiguous transductions are straightforward to carry out in Perl or Python or with regular expression libraries (the GNU regex library for C, C++, Java; other Java packages; and the extended Regex++ library for C/C++). The idea is to simply match a generalized regular expression against different pieces of a text, recording the locations of subexpression matches so that they can then be manipulated by arbitrary code. This hybrid approach is useful if one really needs more power than a finite-state approach - thanks to the possibility of backreferences (which require backtracking) and arbitrary replacement code (most often used to swap substrings of unbounded length). But it sacrifices efficiency, economy, elegance, intermediate nondeterminism, and specification via best weighted path.
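For instance, here is a small Python sketch of that hybrid style (my own illustration, not from the original page): a backreference constrains the match in a way a pure finite-state machine cannot, and arbitrary replacement code swaps two captured substrings of unbounded length.

    import re

    text = "noun phrases: the big dog ; the dog big"
    # A backreference (\1) insists that the same word recur later in the string.
    print(re.findall(r"\b(\w+)\b.*\b\1\b", text))        # ['the']

    # Arbitrary replacement code: swap two captured substrings, however long.
    print(re.sub(r"(\w+) of (\w+)",
                 lambda m: m.group(2) + " of " + m.group(1),
                 "bank of america"))                      # america of bank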

Changing your PATH and MANPATH variables

This section assumes you use csh or a csh derivative as your shell. If not, you may have to modify these instructions slightly.
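The exact installation directories are not reproduced here; the lines to add to your .cshrc would look something like the following (the paths below are placeholders, not the real ones; substitute the directories where the packages are actually installed):

    setenv PATH ${PATH}:/path/to/fsa/bin:/path/to/xfst/bin:/path/to/fsm/bin:/path/to/lextools/bin
    setenv MANPATH ${MANPATH}:/path/to/fsm/man:/path/to/lextools/man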

Comparison Chart

(Each entry lists, where available, the notation in FSA Utilities (FSA), the Xerox finite-state tool (XFST), and the AT&T FSM library with lextools (FSM); a missing line means that package has no direct equivalent.)

interactive startup
    FSA:   fsa -tk
    XFST:  xfst
    FSM:   N/A
manual
    FSA:   Holland, local (local is accessible from CS research network only)
    XFST:  book draft (only accessible from CS research and undergrad networks; don't distribute)
    FSM:   man fsmintro; man lextools
complete regexp guide
    FSA:   Holland, local
    XFST:  sects 2.3-2.4 (p. 42ff), XRCE
    FSM:   man -s5 lextools
automaton format guide
    FSA:   Holland, local
    FSM:   man -s5 fsm; man -s3 fsmclass; man -s5 lextools
interface guide
    FSA:   chaps 11, 12, 13; Prolog manual; Prolog-Emacs interface
    XFST:  overview + ref; chap 3 (p. 73ff); help, apropos
    FSM:   N/A
parentheses
    FSA:   (E)
    XFST:  [E]
    FSM:   (E)
comments
    FSA:   % comment
    XFST:  # comment
    FSM:   # comment
atomic expressions
    FSA:   a, a:b, a::3, a:b:3
    XFST:  a, a:b
    FSM:   a, a:b, <3>  (<3> = epsilon with weight 3)
literal symbol
    FSA:   '*' (or escape(*))
    XFST:  "*" (or %*)
    FSM:   \*
symbol escape codes
    FSA:   as in Prolog
    XFST:  as in C
long symbol names
    FSA:   foo bar
    XFST:  foobar (greedy)
    FSM:   [foo][bar]
complex symbol
    FSA:   predicates
    FSM:   [noun num=pl gender=fem]
symbol class
    FSA:   a..z (or predicates)
    FSM:   superclass defn in .sym file (lexmakelab compiles to .scl file for -S option)
any symbol
    FSA:   ? (surround with spaces)
    XFST:  ?
    FSM:   [<sigma>] (if defined as superclass)
symbol complement (? - E)
    FSA:   `E
    XFST:  \E
edge of string
    XFST:  .#.
    FSM:   [<bos>], [<eos>]
concat
    FSA:   [E1,E2,E3]
    XFST:  E1 E2
    FSM:   E1 E2 (fsmconcat)
character concat
    XFST:  {abcd}
    FSM:   abcd
union
    FSA:   {E1,E2,E3}
    XFST:  E1 | E2
    FSM:   E1 | E2 (fsmunion)
empty string (epsilon)
    FSA:   []
    XFST:  0, []
    FSM:   0, [<epsilon>] (if defined)
empty language
    FSA:   {}
    XFST:  \?
optionality
    FSA:   E^
    XFST:  (E)
    FSM:   E?
Kleene closure
    FSA:   E*
    XFST:  E*
    FSM:   E* (fsmclosure)
Kleene plus
    FSA:   E+
    XFST:  E+
    FSM:   E+
repetition
    XFST:  E^n, E^<n, E^>n, E^{n,m}
    FSM:   E^n
contains
    FSA:   $E
    XFST:  $E
reverse
    FSA:   reverse(E)
    XFST:  E.r
    FSM:   fsmreverse
intersect
    FSA:   E1 & E2
    XFST:  E1 & E2
    FSM:   & (fsmintersect)
difference
    FSA:   E1 - E2
    XFST:  E1 - E2
    FSM:   E1 - E2 (fsmdifference)
complement (?* - E)
    FSA:   ~E
    XFST:  ~E
    FSM:   !E (fsmcomplement)
cross-product
    FSA:   E1 x E2; E1:E2 (high-precedence)
    XFST:  E1 .x. E2, also {abcd}:{aceg}
    FSM:   E1:E2
same-length cross-product
    FSA:   E1 xx E2
project
    FSA:   domain(E), range(E)
    XFST:  E.u, E.l
    FSM:   fsmproject
epsilon-remove
    FSA:   efree(E), et al.
    FSM:   fsmrmepsilon
determinize
    FSA:   E!; {t,w,wt}_determinize(E)
    FSM:   fsmdeterminize
minimize
    FSA:   E#; {t,w,wt}_minimize(E)
    FSM:   fsmminimize
compose
    FSA:   E1 o E2
    XFST:  E1 .o. E2
    FSM:   E1 @ E2 (fsmcompose)
invert
    FSA:   invert(E)
    XFST:  E.i
    FSM:   fsminvert
ignore (insert freely)
    XFST:  A / B; A ./. B (blocked at edges)
restriction
    XFST:  [A => L1 _ R1, L2 _ R2]
replacement
    FSA:   Relevant macros in ~rflorian/software/lib/fsa
    XFST:  A -> B (exhaustive nondeterm.)
           A (->) B (optional)
           A @-> B (LR longest)
           A @> B (LR shortest)
           A ->@ B (RL longest)
           A >@ B (RL shortest)
           Reverse arrow swaps lower, upper.
           A may have form [. E .] (blocks repeat matches of epsilon in same place).
           B may have form L ... R (inserts L, R around A).
           Contextual restrictions:
              || L _ R (upper upper)
              // L _ R (lower upper)
              \\ L _ R (upper lower)
              \/ L _ R (lower lower)
           Parallel replacements joined by , or by ,, if restricted.
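As a quick illustration of the chart (my own example, not part of the original table), here is the same little language, "an optional article followed by one or more nouns," written in each notation:

    FSA Utilities:  [art^, noun+]
    Xerox xfst:     (Art) Noun+
    lextools:       [Art]?[Noun]+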

Walkthrough: A Basic Example

This tutorial section tells you what to type in order to get the same example working in all 3 programs. It is easiest to follow if you work through the 3 programs in the order given. The example involves roughly the following steps: define an acceptor for simple noun phrases over a small alphabet of part-of-speech tags; test some strings against it; build a transducer that rewrites each noun immediately preceding another noun as nmod; compose the two; make the result insert brackets around the noun phrase; and finally apply it to arbitrary input that may contain noun phrases anywhere.

FSA Utilities

  1. To start fsa with the graphical interface, type fsa -tk. This opens an interactive Prolog window (where any errors and outputs will appear) and an interactive Tk/Tcl window.

  2. In the "Regex:" field, type [{art,quant}^, adj*, noun+]. Look at the above to see what these symbols mean. (Note: Emacs editing keys work in the "Regex:" field.)

  3. Notice that the automaton that appears has been automatically determinized and minimized (this often happens but not always). Convince yourself that it's correct. The initial state is green; the final state is red.

  4. To get a better view of the automaton, drag the states around and point to the arcs. Clicking on an arc will print its description in the Prolog window.

  5. Request a slower but excellent view made by the dot program for automatic directed graph layout. Select dot_ghostview or dot_xv from the Visualization menu. (This calls dot to lay out a postscript or gif file, and then invokes the appropriate viewer on that file.)

  6. In your regexp, art, noun, etc. are elements of the formal alphabet. Try changing noun to Noun and hit return; you will get an error message in the Prolog window. The problem is that capitalized symbols have a special meaning to the Prolog back-end. So try changing Noun to the quoted 'Noun': this should work. Then change it back.

    (You also have to quote symbols like '*' to block their special meaning. If you want numbers in your formal alphabet, be warned that the digit character '5' (which you can enter into the "String:" field) is distinct in Prolog from the number 5, but both can be in the alphabet, as can 23 and '23'.)

  7. Type art adj adj noun into the "String:" field. The Prolog window will tell you "no" (i.e., it doesn't match). The problem is that the string was interpreted as a sequence of characters 'a','r','t',' ', ... To fix this, on the Settings menu, change symbol_separator to 32, which is the ASCII code for space. (You can also arrange this at startup by including symbol_separator=32 on the command line.) Now when you submit the string, the Prolog window should say "yes".

  8. Another way to test a string is to intersect it with the automaton. If you get the empty language (easy to see after minimization), it didn't match. Thus, type ([art,adj,adj,noun] & [{art,quant}^, adj*, noun+])# into the "Regex:" field: what happens? What if you delete the first noun from the string? (Notice that we are not really using the string itself, but rather the regular language that consists of only that string.)

  9. Now we'll compose our automaton with a transducer that replaces each noun that immediately precedes another noun with nmod. Such a transducer can be constructed by one of the directed replacement operators of Karttunen (1997). We need to load a macro file that defines this operator. On the File menu, choose LoadAux... and load the file ~rflorian/software/lib/fsa/Karttunen97/Karttunen97_2.pl. (You can also load a file at startup time by adding -aux file to the command line; you may use the -aux flag multiple times on the same line to load multiple files.)
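    For example, combining this with the symbol_separator setting from step 7 (both flags are described above; the file path is the one just given), you could start everything in one go with:

      fsa -tk symbol_separator=32 -aux ~rflorian/software/lib/fsa/Karttunen97/Karttunen97_2.pl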

  10. Having loaded the file, we can now write replace(Old,New,Left,Right), to get a directed replacement transducer that scans its input from left to right. At every position i, it replaces the longest substring (if any) that starts at i, matches the regexp Old, and falls directly between substrings matching Left and Right. The match for Old is replaced nondeterministically by every string matching New. The transducer then resumes scanning immediately after the end of the match text.

    (This is actually an abbreviation for Old => New lu Left -- Right. => chooses left-to-right scanning, and lu says that the left context is matched against the "lower" (output) version of the string -- i.e., it can see the previous changes -- whereas the right context is matched against the "upper" (input) version that has not been changed yet. <=, ul, uu, and ll are also available.)

  11. Specifically, type replace(noun,nmod,[],noun) into the "Regex:" field. (The resulting transducer will mention a few harmless odd characters that were introduced during its construction.) Try typing foo bar noun baz noun noun bing noun noun noun noun into the "String:" field. You should get the expected output, but in 16 identical copies.

  12. The multiple copies arise from multiple equivalent paths in the transducer. Change the regexp to t_determinize(replace(noun,nmod,[],noun)). (This does transducer (t_) determinization, which is not automatically requested by the graphical interface because it may not terminate.) Now try again; you should get only one copy.

  13. Let's compose our original regexp with the transducer, using the composition operator o. If R and S are relations (i.e., sets of ordered pairs), then R o S = {(x,z): for some y, (x,y) in R and (y,z) in S}. This is a generalization of function composition.

    The original regexp matches a particular language L. It will automatically be coerced to an "identity" transduction (namely [{art:art,quant:quant}^, adj:adj*, noun:noun+]) that maps every string in L to itself, and rejects all other strings. In other words, it describes the relation {(x,x): x in L}.

    So try the composition: [{art,quant}^, adj*, noun+] o replace(noun,nmod,[],noun). Determinize it using t_determinize (again introducing some odd characters that you can ignore). How does this machine work? How does it manage to be deterministic, given that it must look ahead to the next symbol in order to decide whether to change noun to nmod?

  14. Modify the transducer so that it also introduces angle brackets around the noun phrase in the output: [ []:'<', [{art,quant}^, adj*, noun+] o replace(noun,nmod,[],noun), []:'>' ]. Try it on the inputs art adj noun noun noun and art adj. Again, determinize it and try again.

  15. The symbol ? matches any character, so ? * matches any string. (Warning: the space after ? is crucial!) If ? * is used as a transduction, the usual rules mean it will be coerced to the transduction that maps any string to itself - i.e., leaves the input unchanged in the output.

    Therefore, given a transducer T, we can write [? *,[T, ? *]*] for a transducer that applies T to some arbitrary set of disjoint substrings of the input that it matches, and leaves the rest of the input unchanged.

    Take T to be the deterministic transducer we have just constructed, and apply this transformation to it. Try it on the strings verb art adj noun noun noun and art adj. How would you characterize the output?
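    Spelled out, this is just the expression from step 14 wrapped inside the pattern from step 15 (shown here for concreteness; you may prefer to type it in pieces):

      [? *, [t_determinize([ []:'<', [{art,quant}^, adj*, noun+] o replace(noun,nmod,[],noun), []:'>' ]), ? *]*]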

  16. Use the LoadAux command as before to load ~rflorian/software/lib/fsa/GerdemannVannoord99/replace.pl. This defines 1-argument and 3-argument versions of the replace macro; in these versions the first argument is a transducer. If T is our noun phrase transducer as above, then replace(T) is a transducer that scans the input from left to right, applying T to the maximal substrings that it matches and leaving the rest of the input unchanged. If T is deterministic, then so is replace(T).

    Try applying replace(T) to strings like verb art adj noun noun noun prep art adj noun. Fun, eh? (Note: replace(T,Left,Right) will only replace matches that fall directly between substrings matching Left and Right.)

  17. Try doing the above application by using composition. That is, build the automaton [verb, art, adj, ... ] o replace(T). This yields the automaton that maps a particular string to its output, and nothing else. Try applying the function range to this automaton, to get only the output.
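    Concretely (a sketch, writing T as shorthand for the full transducer expression built in the previous steps, which you would paste in place of it), the composition and its range might look like:

      range([verb, art, adj, noun, noun, noun, prep, art, adj, noun] o replace(T))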

  18. Finally, try applying the transducer to a class of strings in parallel: in the "range" expression above, replace the specific string by the regexp [quant, noun+, verb]. The resulting automaton describes the set of outputs for these inputs. Add # to the end of the expression to determinize and minimize it. Have a look.

  19. Quit FSA Utilities, either by typing halt. in the Prolog window, or by exiting through the File menu.

Some other nice features of FSA Utilities that we didn't explore:

XFST: Xerox Finite-State Tool

Note: The tutorial material in chapters 2-3 of the xfst book will be very helpful if you want to do real work with xfst.

Note: A somewhat nicer and more standalone version of the following instructions is found in problem 1 of this assignment (postscript version).

  1. To start xfst, type xfst. This gives you a command line. Useful commands are help, help command, and apropos topic.

    Because xfst doesn't have good command-line editing and recall facilities, you may want to start a shell in emacs and run xfst in that shell. (M-x shell starts the shell and C-h m tells you how it works.)

  2. Define a regular expression over an alphabet of part-of-speech tags: define nounphrase (Art|Quant) Adj* Noun+ ;

    Note: When we type in strings to match against this regexp, xfst will manage to interpret ArtAdjAdjNoun as a length-4 string. It tokenizes by greedy left-to-right longest match, but it is safest not to rely on that. Instead, the convention of starting all multicharacter symbols with a distinguished character or character class prevents any accidental ambiguities. Here we use capital letters for this. In practice, the authors of xfst recommend using + (typed in a regexp as %+ or "+" since + has a special meaning in regexps).

  3. Get information about the nounphrase machine: print words nounphrase and print net nounphrase.

    The former command appears to list all words that can be accepted along acyclic paths (a finite number).

    The latter command lists the transitions from each state. States are named sn or fsn depending on whether they are final; s0 or fs0 is the start state. Notice that as in FSA Utilities, the machine has been automatically determinized and minimized for us, since it is an acceptor rather than a transducer.

  4. Let's see whether Art Adj Adj Noun is a noun phrase. We'll test that by intersection:
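    The commands for this test were omitted from this page; one plausible sequence (my reconstruction, using only commands introduced elsewhere in this tutorial) is to push both machines, intersect them, and list the words of the result:

      read regex Art Adj Adj Noun ;
      push nounphrase
      intersect net
      print words

    If the intersection is empty, print words lists nothing, and the string is not a noun phrase.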

  5. There is an easier way to do the test. Just as in FSA Utilities, an automaton is also regarded as an identity transducer on the language it accepts. xfst makes it easy to apply transducers to raw strings, as follows:

    push nounphrase pushes the nounphrase machine onto the stack, ready to apply; and then apply down enters a mode where you can type in test strings and discover what nounphrase transduces them to, in the usual ("down") direction of transduction. If the input string is copied to the output, then it's part of the language; if there are no output strings, then the input was not part of the language. Try typing in ArtAdjAdjNoun and ArtAdjAdj.

    Press Ctrl-D to exit this mode, and do clear stack.

  6. Type read regex Noun -> Nmod || _ Noun ; , which pushes an unnamed transducer that replaces Noun with Nmod before any Noun. Test it out: apply down FooBarNounBazNounNounBingNounNounNounNoun.

    Note: The "down" direction of application (and in the stack) is identified with "right" in regular expressions (and in strings). Applying down transforms along the direction of the rightward arrow in Noun -> Nmod, and from a to b in a:b or a .x. b.

  7. We now want to compose nounphrase with this transducer. Type push nounphrase and then compose net.

    Did this compose the transducers in the right order? We want the lower language (the usual output) of nounphrase to feed into the upper language (input) of the new transducer. Since nounphrase is sitting on top of the stack, in the usual metaphor, it is above the new transducer, so its lower language was "touching" the new transducer's upper language, as desired.

    Now that we have composed them, we can apply down, which is as if we send a string downward through nounphrase first and then through the noun-to-nmod transducer. Try some inputs like ArtAdjNounNounNoun and VerbAdjNounNounNoun.

  8. We want to modify the composed machine so it inserts angle brackets around the noun phrase. First let's pop it into a named variable: define noun2nmod. (If the stack isn't clear now, type clear.) Now let's push the three things we have to concatenate:
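    The three push commands were omitted from this page; a plausible sequence (my reconstruction, using 0:"<" for a transducer that inserts a literal < out of nothing) is:

      read regex [0:"<"] ;
      push noun2nmod
      read regex [0:">"] ;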

    However, the stack is now in reverse order: since down in the stack corresponds to right in the string, we want the topmost thing on the stack to be leftmost in the string. So let's reverse the order of stack elements (without exchanging up and down for any element): turn stack. (Type print stack to see the stack.) Now we can finally concatenate together all three transducers on the stack: concatenate net. Try using apply down to test it on the same strings as before.

  9. Give a name to our new transducer: define npbracket. We can use this name directly in a regexp: read regex [?* [npbracket ?*]*];

    As we did with the FSA utilities, try this on the strings VerbArtAdjNounNounNoun and ArtAdj.

  10. For our last major trick, we again want to "correct" the previous transducer so that it behaves like replace(T) in the FSA Utilities. This means applying npbracket to maximal noun phrases from left to right. This will be a little tricky, since xfst has no operator that uses a transducer to do directed replacement. However, the solution is completely general.

    We will do it as a composition of two transducers:
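    The two transducers themselves are not spelled out on this page. One standard ingredient of such a construction (an assumption on my part, not necessarily the construction intended here) is a directed-replacement transducer that marks maximal noun phrases from left to right with temporary marker symbols, chosen arbitrarily here as %[ and %]:

      read regex [(Art|Quant) Adj* Noun+] @-> %[ ... %] ;

    A second transducer then has to make npbracket's changes between each pair of markers and delete the markers themselves; working out that second machine is a good exercise.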
  11. Check that the machine we just constructed does the right thing: apply down VerbArtAdjNounNounNounPrepArtAdjNoun. And as before, try applying it to a class of strings:
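    The commands for applying it to a class of strings were omitted here; a plausible sequence (my reconstruction, assuming the composed machine from the previous step is still on top of the stack) is to push the input class on top, compose, and look at the lower side:

      read regex Quant Noun+ Verb ;
      compose net
      print lower-words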

  12. Questions for you to try yourself:

  13. Type print defined and print storage to see what you've done. Then quit xfst by typing exit or quit.

Some other nice features of xfst that we didn't explore:

Warning: In the case of transducers, xfst appears to use only the automaton algorithms for determinization and minimization. That is, it temporarily treats an arc label like a:b as a single symbol. So the result is not necessarily deterministic with respect to the input (there may be a:b and a:c arcs from the same state); nor is it necessarily the minimal transducer implementing this relation, since xfst will not change the alignment between input and output (i.e., it will not split up the two halves of a:b and put them on different arcs).

FSM and lextools: AT&T finite-state library

  1. This software comes in the form of C libraries (type man -s3 fsm) and command-line tools (type man -s1 fsm and man lextools). There is no interactive interface other than the shell, makefiles, etc. So you will need to create some files representing the machines. Make a directory to hold the files for this exercise.

  2. The FSM package denotes states and symbols internally by numbers. Let's start by giving them symbolic labels for our convenience. Create a file called partofspeech.lab with these contents:

    eps     0
    Noun	1
    Verb	2
    Quant	3
    Art	4
    Adj	5
    Nmod	6
    Prep	7
    Adv	8
             
    0 is reserved for the empty string. (It should be assigned a name like epsilon or #.) Otherwise the numbers here are arbitrary; they need not be consecutive or even small.

  3. Now let's create a textual description of a FSA (finite-state acceptor) for the language (Art | Quant)? Adj* Noun+. Put this in a file nounphrase.txt:

    0	1	Art
    0	1	Quant
    0	1	Adj
    0	2	Noun
    1	1	Adj
    1	2	Noun
    2	2	Noun
    2
             
    The first line means, for example, that there is an arc from state 0 to state 1 labeled with the symbol Art. The last line means that state 2 is a final state.

    The first state mentioned in the file (state 0 in this case) is the start state. The state numbering is arbitrary (in fact, the FSM programs may renumber the states, so the -s option for assigning names to states usually won't do what you want when printing or drawing a machine, although it can be helpful when compiling one from input you create).

    The order of lines in the file is irrelevant, and any amount of whitespace may be used to separate fields on a line. Any line can take an additional field at the end giving the cost of the arc or the final state.

  4. Now compile this textual representation into a binary representation:
    fsmcompile -ipartofspeech.lab nounphrase.txt > nounphrase.fsa
    The -i flag gives our file that turns input symbol labels into numbers. (Use -o for output symbols and -s for state labels.) Compilation or other operations may renumber the states.

  5. Let's look at our compiled FSA:
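    The commands were omitted here; a plausible pair (my reconstruction, mirroring the pipelines used later on this page) is fsmprint for a textual dump and fsmdraw piped through dot for a picture:

      fsmprint -ipartofspeech.lab nounphrase.fsa
      fsmdraw -ipartofspeech.lab nounphrase.fsa | dot -Tps | ghostview -landscape -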

    You may be interested to know that fsmdraw produces a directed graph specification in the dot language (documentation: man dot, user's guide). Although fsmdraw has several options, it is sometimes convenient to edit the specification it produces, or to filter it with a Perl script - for example, to get color. You can also use dotty nounphrase.dot to make minor changes to the layout, e.g., drag states around with the mouse or set colors and styles.

    dot -Tps then lays out the graph and produces postscript output, which can be saved in a file or piped directly to ghostview - as above.

  6. There are some easier ways to produce these machines than by hand-coding them! The lextools package is designed for that. In particular, lexcompre compiles a regular expression specified with -s:

    lexcompre -l partofspeech.lab -s '([Art]|[Quant])?[Adj]*[Noun]+' > nounphrase2.fsa
    View the output and make sure it's right!

    Other options:
  7. Just for fun, try reversing the automaton:

    fsmreverse nounphrase2.fsa | fsmdraw -ipartofspeech.lab | dot -Tps | ghostview -landscape -
    The result contains an epsilon arc and other instances of nondeterminism. Thus, try
    fsmreverse nounphrase2.fsa | fsmrmepsilon | fsmdeterminize | fsmminimize | fsmdraw -ipartofspeech.lab | dot -Tps | ghostview -landscape -
    Notice that we needed a sequence of 3 commands: we must remove epsilons before determinization (which would otherwise treat them as ordinary symbols), and determinization is required for minimization to run.

    Note also that some of the FSM commands are being used as pipes here. As for many Unix commands, the one-argument commands will conveniently read from standard input if their filename argument is omitted, and all the commands take the special filename - to refer to standard input.

  8. Now let's create our transducer that maps Noun to Nmod before another noun. First by hand: create a file noun2nmod.txt with the following contents (but delete the comments).

     
    ### state 0: neither inside nor immediately after a noun sequence
    0   # final      
    0	1	Noun	Nmod   # nondeterministically start multi-word noun sequence:
    0	2	Noun	Noun   # nondeterministically start single-word noun sequence:
    0	0	Verb	Verb
    0	0	Quant	Quant
    0	0	Art	Art
    0	0	Adj	Adj
    0	0	Nmod	Nmod
    0	0	Prep	Prep
    0	0	Adv	Adv
    ### state 1: strictly inside a maximal input noun sequence
    ###          must continue it
    1	1	Noun	Nmod
    1	2	Noun	Noun
    ### state 2: right at the end of a maximal input noun sequence
    ###          no way to leave here on a Noun input
    2    # final
    2	0	Verb	Verb
    2	0	Quant	Quant
    2	0	Art	Art
    2	0	Adj	Adj
    2	0	Nmod	Nmod
    2	0	Prep	Prep
    2	0	Adv	Adv
    
    Compile this via fsmcompile -t -ipartofspeech.lab -opartofspeech.lab noun2nmod.txt > noun2nmod.fst. (The -t option indicates that the input describes a transducer.) View the result with fsmdraw.

  9. (coming soon! But you do know enough now to use these tools in assignment 1. You could check out this tutorial if you want more info on FSM.)
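    Although the remaining steps were never written, they would presumably parallel the other two walkthroughs. For example (a sketch; the output filenames and the test string are my own choices, and I am assuming fsmprint accepts -i and -o symbol files the way fsmcompile does), you might compose the acceptor with the transducer and then apply the result to a tag sequence compiled by lexcompre:

      fsmcompose nounphrase.fsa noun2nmod.fst > np2nmod.fst
      lexcompre -l partofspeech.lab -s '[Art][Adj][Noun][Noun][Noun]' > input.fsa
      fsmcompose input.fsa np2nmod.fst | fsmprint -ipartofspeech.lab -opartofspeech.lab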



CS 405 home - Jason Eisner - jason@cs.jhu.edu - $Date: 2001/11/19 16:08:31 $