Next: Bibliography
Up: Fast Transformation-Based Learning Toolkit
Previous: Acknowledgements
A. Appendix
A.1 Parameter File Variables
- CONSTRAINTS_FILE - defines a set of constraints on the data; a rule
will not be applied to a particular sample if doing so would break a constraint.
For instance, if a constraint of the form
is imposed on a POS task, then no rule will change the tag of the
word ``the'' to anything other than a determiner. See Section 3.2.5
for more information on constraints and why they are useful for TBL.
- COOCCURRENCE_CONFIGURATION_FILE - defines the bigram file used
by lexical tagging (see Section 5.2.1). A sample of such
a file is presented in Table A.1.
Table A.1: Bigram cooccurrence file
- ELIMINATION_THRESHOLD - if defined, it sets a threshold on the number
of rules that are generated initially. Every time the number of generated
rules is a multiple of a specified number, all rules with count less
than ELIMINATION_THRESHOLD are eliminated - this feature has the
advantage of keeping the initial number of rules down.
- ERASE_USELESS_RULES - if defined, the rules that have 0 good counts
and a number of bad counts greater than the value of this parameter
are eliminated. This keeps the space occupied by the rules smaller,
at the cost of making the program run a little slower. The main
advantage is that the space needed to represent the rules will probably
not grow considerably beyond what is needed to store the initial rule
set (as many rules that have low good scores are eliminated).
- FILE_TEMPLATE - defines the file that holds the names assigned to
the features. The file should contain one line describing the sample,
and it should have the following format:
where
is the name associated
with feature number
,
is the name associated with the
class and
is the name associated with truth number
. Note that
there are exactly as many classifications as true values, and that
there can be more than one classification, since the fnTBL
toolkit can handle multiple simultaneous classifications. The names
provided here are the ones used when the rules are generated (the
features will appear in the rules under these names).
- LARGE_WORD_VOCABULARY - defines the large vocabulary file used
by the lexical tagging program (see Section 5.2.1). The
file should contain a list of words, one per line, that can appear
in the language. Such a file is easily extracted from a large unannotated
corpus, using a command such as
- MINIMUM_ENTROPY_GAIN - this parameter is used to control the further
expansion of the probabilistic tree (see [4]):
the expansion is stopped when the gain of the current split is
less than the value of this parameter;
- NULL_FEATURES - When considering a set-like predicate, some of the
features might not be ``real'' features, but rather the absence
of a feature. These features are ignored when rules are generated
- it speeds up the program considerably, while eliminating rules
that make no sense at all. The features to be ignored are to be separated
by a single comma (,).
- ORDER_BASED_ON_SIZE - determines how ties are broken. Possible values:
- 0 - the rule whose predicate has more atomic predicates is chosen
first;
- 1 - the rule whose predicate has fewer atomic predicates is chosen
first;
- 2 - the rule whose template is declared first in the rule template
file is chosen first.
- REASONABLE_SPLIT - this parameter is required by the probabilistic
tree generation (see [4]) and controls the
threshold under which rules do not split the data: if a rule applies
fewer times than the value of this parameter, then the rule is not considered
in the split;
- REASONABLE_DT_SPLIT - this parameter is required by the further
expansion of the probabilistic tree (see [4]);
the expansion is stopped when any of the generated nodes has fewer
samples than the value of this parameter;
- RULE_TEMPLATES - defines the file that holds the rule templates.
The file should list a series of rules, one on a line, respecting
the following pattern:
where
is a basic predicate
template and
is a valid
name for a classification (as defined by the file specified in
FILE_TEMPLATE). The basic (also called atomic) predicates check
one particular feature (i.e. they have one argument), and the rule
is formed as a conjunction of these atomic predicates. The rule templates
define the search space of the algorithm: at each step it finds the
particular instantiations of these templates that correct the most
errors. An open research question in this context
is how to obtain these templates automatically, rather than have the user
specify them.
- TRUTH_SEPARATOR - in some cases, it might be useful to have multiple
values for the truth. For instance, in the task of word sense disambiguation,
words might have multiple senses (e.g. ``Roger Rabbit''
will have both the proper name sense - a cartoon character - and
the real rabbit sense - animal). In this case, the multiple true
values are to be separated by the character defined in this parameter.
For instance, if TRUTH_SEPARATOR=|, then the value P|bar%1:14:00::
is a valid value as the truth for a sample. The fnTBL training program
will interpret this as a union of senses, rather than a single value,
and will generate the appropriate rules.
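The TRUTH_SEPARATOR convention can be illustrated with a minimal Python sketch. It is not part of the toolkit: the function names split_truth and is_correct are hypothetical, and counting a guess as correct when it matches any member of the union is an assumption made for illustration, not a description of how fnTBL scores such samples.

```python
def split_truth(value, separator="|"):
    """Split a truth field into the set of values it stands for.

    `separator` plays the role of TRUTH_SEPARATOR; an empty separator
    means the field is treated as a single value.
    """
    if not separator:
        return frozenset([value])
    # Drop empty pieces so that a trailing separator does not add "".
    return frozenset(part for part in value.split(separator) if part)


def is_correct(guess, truth_field, separator="|"):
    """Assumed scoring rule: a guess is correct if it is in the union."""
    return guess in split_truth(truth_field, separator)


if __name__ == "__main__":
    # The truth value used as an example in the text above.
    print(sorted(split_truth("P|bar%1:14:00::")))
```

With TRUTH_SEPARATOR=|, the field P|bar%1:14:00:: is read as the two values P and bar%1:14:00::, so either one is accepted by is_correct.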
A.2 Additional Tools Provided
There are a number of additional tools provided with the toolkit,
mainly Perl scripts, each built to solve one specific problem.
Here is a (partial) list of scripts, together with their descriptions:
- mcreate_lexicon.prl - generates the feature lexicon
(all the possible classes for a given feature); it is used to create
the restriction files. It should be run as:
where:
-
is a feature text file, with
the features separated by whitespace;
can also be '-', in which case the input is read from stdin (useful
when used in pipes);
- -d
will generate a list for feature number
and classification
(it assumes
that column
contains features
while on column
classifications
are listed)
- -n
sets a listing threshold
- any feature with a frequency less than
is not displayed;
- -c - if this flag is present, classification frequencies are also
output, together with the classification.
The output is written to stdout, so you might want to redirect it to
a file. For each feature with a frequency over the threshold, the script
outputs the feature and a list of classifications separated by spaces. The
list is sorted by decreasing frequency (i.e. the most likely
classification comes first).
- most_likely_tag.prl - applies a lexicon file to
a feature file. For instance, it can be used to generate the most
likely POS tag associated with a sequence of words, given a lexicon
file created with the previous program (mcreate_lexicon.prl).
Its command line is
where:
-
is a feature file, with features
separated by whitespace;
can
also be ``-``, in which case the input is read from stdin (useful
for pipelining);
-
- uses
the specified
; if
this parameter is unspecified, it will use the input file to first
extract the most likely tag, given the provided pattern;
-
- uses this
pattern to assign the most likely classification; a pattern has the
form
(similar to the -d option of the mcreate_lexicon.prl
script). If this pattern is not specified,
is considered
and
is considered
.
-
is used to specify the most likely tag in case the feature classification
is undefined. There are 2 values specified, one for common nouns,
and one for proper nouns. If this doesn't make sense (it does only
for POS), then specify just one value, the value of the most likely
classification overall.
The output is, again, written to standard output. The output corresponding
to a feature sequence is done as follows: the feature on position
is replaced with the most
likely classification associated with the feature present on position
; if no feature list is associated
with the feature value on position
(i.e. the feature is ``unknown''), then the most likely tag is
output on position
(with a
difference for POS tagging, where if the word begins with a capital
letter, the second value provided is output, otherwise the first one
is output).
- brill-to-tbl.prl converts the rules generated by
Eric Brill's POS tagger into rules that can be used by the fnTBL
toolkit. Provided just for fun - and comparison. It receives the
list of rules in Brill format and writes the list of rules
in fnTBL format to stdout.
- rm-to-tbl.prl converts the rules generated by the
Ramshaw & Marcus baseNP chunking TBL program (see [8])
into the corresponding rules in fnTBL format. It accepts
as its argument a file containing the rules in Ramshaw & Marcus format,
and outputs the corresponding rules in the toolkit format.
- mcompute_error.prl evaluates the performance of the
fnTBL output for a given task. Obviously, one needs
to have the correct classification for this feature to work properly.
The input it requires is the output obtained from the fnTBL
command. It is run as:
where
-
is the output of the fnTBL
program (assuming the last
columns correspond to the true
classifications); the value can also be ``-``, in which case
the input is read from stdin;
-
specifies which fields are the system's output and which ones are
the true classification:
is
the index where the system's output begins,
is the index where the true classification begins and
is the number of elements to be classified; the default values are
,
,
.
- number-rules.prl is a really silly program that numbers
the rules in the rule file so that the rule numbers correspond to the
numbers output with the option
of the fnTBL
command.
- pos-train.prl and pos-apply.prl are
described in detail in Section 5.2.3, and
we will not bore the reader with another description.
- A number of other simple, less-than-useful scripts. Please feel free
to poke around them to see what they do.
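The behaviour described above for mcreate_lexicon.prl and most_likely_tag.prl can be sketched in a few lines of Python. This is not the code of the Perl scripts, only an illustration of the logic as the text describes it. The function names, the default column positions (feature in column 0, classification in column 1) and the tags NN and NNP used in the example are assumptions.

```python
from collections import Counter, defaultdict


def build_lexicon(rows, feature_col=0, class_col=1, min_count=1):
    """Map each feature to its classifications, most frequent first.

    Features seen fewer than `min_count` times are left out, mirroring
    the listing threshold described for mcreate_lexicon.prl.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature_col]][row[class_col]] += 1
    lexicon = {}
    for feature, classes in counts.items():
        if sum(classes.values()) >= min_count:
            # most_common() returns entries by decreasing count.
            lexicon[feature] = [c for c, _ in classes.most_common()]
    return lexicon


def most_likely(feature, lexicon, default_common, default_proper=None):
    """Return the most likely classification for `feature`.

    Unknown features get a default; if a second default is supplied,
    it is used for features that start with a capital letter (the POS
    case described for most_likely_tag.prl).
    """
    if feature in lexicon:
        return lexicon[feature][0]
    if default_proper is not None and feature[:1].isupper():
        return default_proper
    return default_common


if __name__ == "__main__":
    rows = [("the", "DT"), ("run", "VB"), ("run", "NN"), ("run", "NN")]
    lexicon = build_lexicon(rows)
    for word in ["the", "run", "Paris", "dog"]:
        print(word, most_likely(word, lexicon, "NN", "NNP"))
```

Passing a single default (and leaving default_proper unset) corresponds to the non-POS case, where only the most likely classification overall is given.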
Radu Florian
2002-02-07