Next: 4. Using The Toolkit
Up: Fast Transformation-Based Learning Toolkit
Previous: 2. An Introduction to
  Contents
Subsections
3. System Description
The fnTBL system has 2 executables:
- fnTBL-train - that learns the rules that are to be applied
- fnTBL - that applies the rules to the new test data
Both of them have a set of command line parameters, which will be
discussed in the following subsections, and read also from a parameter
file, where some parameters that can be changed from problem to problem,
are specified.
3.1 The Parameter File
The parameter file is used to store all the parameters that do not
change often during the development of a particular task. We found
it to be useful, because it reduces significantly the number of command-line
parameters that need to be specified, and modifying one parameter
does not require the program to be recompiled (as it would if the
parameters would be hard-wired in the program).
Its format is as follows:
where:
- parameter_name represents the name of the parameter
- parameter_value represents the value of the parameter
The parameter value can also contain previously defined parameters,
if they have the dollar sign
in them. For instance,
is a legal value if the variable MAIN was previously defined. I found
this to be useful if one wants to define a main directory, where all
the other files reside; when one changes the location of the main
directory, only one line needs to be modified.
The following are legal examples of lines in the parameter file:
MAIN = /remote/bigram/usr/home/rflorian/workdir/research/data/mtbl-toolkit/text_chunking;
FILE_TEMPLATE = ${MAIN}/file.templ;
RULE_TEMPLATES = ${MAIN}/lexical_predicates.richer.templ;
REASONABLE_SPLIT = 10;
REASONABLE_DT_SPLIT = 5;
EMPTY_LINES_ARE_SEPARATORS = 0;
ELIMINATION_THRESHOLD = 0;
The example is actually extracted from a parameter file associated
with the fnTBL POS-tagger. A complete set of valid parameters are
provided in the Appendix. The most important ones are:
- FILE_TEMPLATE - contains the name of the file with the sample feature
names (see 3.2.2);
- RULE_TEMPLATES - contains the name of the file with the rule description
(see 3.2.3);
- CONSTRAINTS_FILE - contains the name of the constraints file (see
3.2.5);
- LOG_FILE - contains the name of a file where all the commands will
be written - useful to keep track what commands were used, if the
parameter is defined;
- EMPTY_LINES_ARE_SEPARATORS - describes if the empty lines are
to be considered separators or not. If the samples are interdependent
(e.g. POS tagging), then the value should be 1 (they are separators).
If the samples are independent, then it should be 0. The reason for
its existence is that sometimes even if the samples are independent,
one might want to separate the samples into sentences (for instance,
in lexical POS tagging, Section 5.2)
This section will briefly describe the file formats used with the
fnTBL toolkit. In all the following files, the lines starting with
a '#' sign are considered as comments and, therefore, are ignored.
Both the training file and the test file need to be in a particular
format for the fnTBL tools to work properly. The format requires that
each sample be on a separate line, with the features separated by
white space (spaces or tabular characters). In addition, if the samples
are interdependent (like in the POS tagging case), and organized into
``blocks'' (i.e. sentences), the blocks need to be separated by
a blank line.
Here's an example of such a file:
Revenue NN NN
rose VBD VBD
5 CD CD
% NN NN
to TO TO
$ $ $
282 NN CD
million CD CD
from IN IN
$ $ $
268.3 NN CD
million CD CD
. . .
The DT DT
drop NN NN
in IN IN
earnings NNS NNS
had VBD VBD
been VBN VBN
anticipated VBN VBN
by IN IN
most JJS JJS
Wall NNP NNP
Street NNP NNP
analysts NNS NNS
, , ,
but CC CC
the DT DT
results NNS NNS
were VBD VBD
reported VBD VBN
after IN IN
the DT DT
market NN NN
closed VBD VBD
. . .
There are 3 fields for each word (in this case)3.1:
- The word itself (e.g. closed);
- The most likely part-of-speech associated with the word (e.g. VBD);
- The true POS of the word3.2 (e.g. VBN).
As presented in the Section 2.1, the system will start
to learn transformations that correct the second field to match, as
much as possible, the third field.
Both the training and test data should have this format. Some tools
that compute the most likely tag given the word (for instance) are
provided with the fnTBL distribution and are described in the Appendix
section.
3.2.2 FILE_TEMPLATES
The names of the fields associated with the samples are defined in
a file whose name is specified in the parameter file. The name of
the parameter is FILE_TEMPLATES and the file contained in
this parameter should have the following format:
where
-
is the name of the
feature associated with the sample;
-
is the name of the
``guess'' associated with the sample;
-
is the name of the
``truth'' associated with the sample; there are the same number
of
and
features.
To make this clearer, here's an example of a template file, taken
from POS tagging:
In this example, the feature is word, pos is the name
of the ``guess'' of the system at some point in time, and tpos
is the name of the truth. If one wants to perform a simultaneous classification
of POS tagging and text chunking (see [5]), then the
corresponding file should be:
where the features are again word, the guesses are named pos
and chunk and the truths (there are 2 classifications in this
case) are named tpos and tchunk.
3.2.3 RULE_TEMPLATES
Because the fnTBL is a general-purpose tool, the user needs to specify
the rule templates - in other words, to describe the kind of rules
the system will try to learn. The description is done by the templates:
a template defines which features are checked when a transformation
is proposed. For instance, if one wants to implement a bigram look-up,
then the template consists of the previous word and the current word.
The predicate of a rule is created as a conjunction of smaller, atomic,
predicates that can be as simple as feature identities; other types
verify if the word ends/begins with a specific suffix/prefix and yet
others verify the identity of one of the previous words. This particular
choice of conjunction is selected for:
- Simplicity - it is easier to represent the rules and templates this
way, both internally and externally (as text);
- Speed-up - the algorithm exploits heavily this choice; if disjunction
had been allowed between atomic predicates, the algorithm would have
been a lot more complicated and no significant speed-up would have
been obtained.
A list of atomic predicates' notation is presented after the rule
description.
The format of a rule template is
where:
-
are the atomic predicates,
with
as arguments
-
are the classifications that
are to be changed by the rule, and are chosen from
For example
is defining a rule that based on the previous and current word will
change the POS feature, while
will change the POS of a word based on a POS trigram ending on the
current position.
Motivated by some of the problems TBL has been applied to, some types
of atomic predicates have been implemented. Here is the list of how
they can be specified:
- Feature identity:
will check the identity of a particular feature.
- A feature of a neighboring sample (in the case of interdependent samples):
e.g. word_1, pos_-1, chunk_0.
can be negative as well; there is a restriction that the number has
to be an integer belonging to the range
;
- A feature that checks the presence of a particular feature in sequence
of samples:
e.g. word:[1,3] checks the presence of the particular word
in the samples on positions +1,+2 or +3, that is:
- A feature checking that one of the features present in an enumeration
has a particular value (useful for independent samples, see the PP
attachment case study):
e.g.
will return true on a given sample
if and only if one of
the features
or
of sample
is
3.3.
One of the big advantages of the fnTBL toolkit is that the predicate
structure is modular, and adding a new atomic predicate type to the
toolkit should not present a big programming challenge; see the section
about the code design for more information.
To make life easier, an additional construction is allowed
in the rule file format - definitions of a rule placeholder:
once a variable is defined, it can be recalled as
in the rule template file. This construction helps mostly with ``set''
atomic predicates (the predicate will return true if any one of a
set of features has a given value). The following is an rule template
example taken from the PP-attachment problem:
In here, a predicate-set placeholder is defined -
can replace the predicate
in subsequent rules. The rule template
will try to predict the attachment of a preposition using wordnet
parents of the verb and the first noun of the sample.
Both the main programs from the fnTBL toolkit (fnTBL
and fnTBL-learn) use rule files files which list the rules
learned by the system. These rules have meaningful linguistic information
in them, and can be easily read and understood. The format they are
presented in is similar to the one of the template files, rules being
actually instantiated templates. The format is:
where
is an atomic predicate,
are valid values for the respective predicates and
are the classes to be changed. For instance, again for the POS tagging
problem, the rule
will change all the occurrences of infinitive verbs to nouns if the
previous part-of-speech is a determiner; similarly, the rule
will change the attachment from noun to verb if the preposition is
at, in the case of prepositional phrase attachment.
3.2.5 The Constraints File
One less usual parameter for the fnTBL algorithm is the constraints
file. The TBL algorithm suffers from one relatively major problem
- since it's output is not probabilistic, it has no mechanism of
estimating the posterior probability of a class given a sample
and therefore it might assign classes to samples that might not make
any sense at all. For instance, it might assign to the word the
the POS tag VB, which is, of course, very unlikely. Most probabilistic
classifiers do not suffer from this problem, as they have appropriate
ways of computing the posterior probability
.
To avoid this problem, the TBL algorithm proposed originally by Eric
Brill enforced that no new classes should be assigned to samples:
if a sample has been seen in training, then a rule that would change
the classification to a classification that has never been seen with
that particular sample is not allowed to apply.
We are relaxing here this constraint by allowing the user to specify
the set of allowable associations feature-class for a set of examples,
by using constraint files. If a particular feature is not present
in these constraint files, then a rule that examines that feature
can change it as it sees fit. However, if the feature is present in
the file, only rules that change it to some specified class are allowed
to apply to it. This mechanism allows the user to specify, for instance,
that only words which were seen enough times in the data are not allowed
to be associated with new classes.
The format of the constraint file is:
where
are features
on which the conditioning needs to be done,
is the classification that is constrained and
is the file containing the constraints - a file containing a list
of the pairs that need be constrained. For instance, in the POS tagging
example, a constraint in the constraint file could be
and the file constraint.dat might contain
saying NN VBG
the DT
their PRP$
them PRP
Any combination of features that does not appear in the constraint
file are not constrained at all, and it's possible for those samples
to be assigned any classification. Also, it is useful to note that
the constraints are stackable; in other words, first a sample is checked
against all the constraints - if it passes, then the class change
is performed.
Next: 4. Using The Toolkit
Up: Fast Transformation-Based Learning Toolkit
Previous: 2. An Introduction to
  Contents
Radu Florian
2002-02-07