Next: 3. System Description
Up: Fast Transformation-Based Learning Toolkit
Previous: 1. Preface
  Contents
Subsections
2. An Introduction to Transformation-Based Learning
In 1992, Eric Brill introduced the formalism of the transformation-based
learning algorithm in its current formulation, but probably a more
well-known reference to the technique is Eric Brill's article from
1995, [1].
The central idea behind transformation based learning (henceforth
TBL) is to start with some simple solution to the problem, and apply
transformations - at each step the transformation which results in
the largest benefit is selected and applied to the problem. The algorithm
stops when the selected transformation does not modify the data in
enough places, or there are no more transformations to be selected.
While TBL can be used in a more general setup, the toolkit we are
describing here is merely concerned with classification tasks, where
the goal is to assign classifications to a set of samples. Although
this assumption might be restrictive (not all tasks of interest may
be transformed into a classification task - e.g. parsing), most natural
language related tasks can be cast as classification tasks. It is
not inconceivable, though, that an extension of the current toolkit
for such tasks might be build, but some extensive modifications to
the code are needed.
The fast TBL system (henceforth fnTBL2.1) implements ideas from 3 articles: [4], [7]
and [5]. However, for the time being, the probability
generation is buggy and it should not be used. As soon as we will
find time, we will re-integrate the module back into the code.
2.1 The Algorithm Description
To make the presentation clearer, we will introduce some notations
that will be used throughout the documentation (the notation is compatible
with the above-mentioned articles):
- X will denote the space of samples. How a sample is represented
is dependent on the task. For example, in prepositional phrase attachment
(the problem of deciding the point of attachment of a prepositional
phrase), a sample can be formed from a sentence by extracting the
head verb of the main verb phrase, the head noun of the noun, the
preposition and the head noun of the noun phrase following the preposition:
In part-of-speech tagging (the task of assigning a label to each
word in a sentence, corresponding to its part-of-speech function),
a sample might be represented as the word itself
This example is contrasting with the previous one in the following
way: in the first one the samples in the training set form a set
(the samples are independent), while in the second one they form a
sequence (they are interdependent). TBL (and fnTBL in particular)
can handle easily both cases; in fact, in the case when they are interdependent,
the TBL shows it's full learning power. In the case where the samples
are independent, it might be wise to try also some other method (decision
trees, for instance), as the greedy approach of TBL can sometimes
lead to sub-optimal results. Still, TBL possesses an advantage over
standard machine learning techniques such as decision trees and decision
lists, as it's error driven, therefore optimizing directly the error-rate
(rather than other presumably correlated metrics, such as entropy
or Gini index).
-
will denote the set of possible classifications
of the samples. For example, in the first task presented at point
1. the classification set
consists of
the valid POS tags (NN, NNS, VBP, etc), while in the second case the
set
consists of the labels s and n,
corresponding to the verb attachment and noun attachment, respectively.
-
will denote the state space
(the cross-product between the sample space and the classification
space - each point in this space will be a pair of a sample and a
classification)
will usually denote a predicate defined on the space
- basically a predicate on a sequence of states. For instance, in
the part-of-speech case,
might be
or
- A rule r is defined as a pair (
,
) of a predicate
and a target state, and it will be written as
c for simplicity. A rule r=(
,
) is said to apply
on a sample
if the predicate
returns true on
the sample
. Examples of rules include
and
Obviously, if the samples are interdependent, then the application
of a rule will be context-dependent.
- Given a state
and a rule
,
the state resulting from applying rule
to state
,
, is defined as
- We assume that we have some labeled training data
which can
be either a set of samples, or a sequence of samples, depending on
whether the samples are interdependent or not; We also assume the
existence of some test data
on which to compute
performance evaluations.
- The score associated with a rule
is
usually the difference in performance (on the training data) that
results from applying the rule:
where
The TBL algorithm can be now described as follows:
- Initialize each sample in the training data with a classification
(e.g. for each sample
determine its most likely classification
); let
be the starting training data.
- Considering all the transformations (rules)
to the training
data
, select the one with the highest score
and apply it to the training data to obtain
If there are no more possible transformations, or
,
stop2.2.
-
- Repeat from step 2.
An easy observation that needs to be made is that the pseudo-code
presented above is an algorithm, i.e. it will finish on any data that
is given. The argument is simple: let us denote the number of errors
in
by
. Then we have the following recurrence:
 |
(2.1) |
Since
at each stage where we apply a rule, (2.1) implies that
for any
. Since
, it results that
the algorithm will stop at some point in time, therefore avoiding,
for instance, garden paths.
The most expensive step in the TBL computation is the computation
of the rules' score. There have been many approaches that try to solve
the problem in different ways:
- In Brill's tagger [1] (probably the most used
TBL software package), the rules are regenerated at each step: only
the rules that correct the errors are generated and the good counts
are computed this way; then they are examined in the decreasing order
of their good counts and the bad counts are computed by applying the
rules to each sample -- this approach has the advantage that it can
stop early (i.e. not evaluate rules whose score cannot be better than
the best rule found so far). The disadvantage is that it slows down
considerably as the best rule score decreases.
- Ramshaw & Marcus [8] proposed a different approach:
for each rule store all the positions in the corpus where it applies
and for each sample store the list of rules that apply to it. This
trick makes the update very fast, but unfortunately requires very
large amounts of memory.
- Samuel [9] presented a Monte-Carlo approach
to TBL, where only a part of the rule space is explored to determine
the ``best'' transformation to apply to the data.
- Lager, in his
TBL system [6], implements
an efficient Prolog version of TBL, which achieves speed-ups by ``lazy''
learning.
fnTBL (as described in [7], provided with the distribution)
achieves its speedup by generating the rules both for the good and
bad counts computation. The main idea is to store the rule counts
and to recompute them as necessary, when a newly selected rule gets
applied to the corpus. The advantage is that only samples in the vicinity
of the application of the best rule need to be examined and the rules
that apply on them need to have their counts updated -- and to identify
those rules, we are ``generating'' them -- using the set of templates
to identify the rules that apply on those positions and update their
counts (good and bad) if necessary. This approach can obtain up to
2 orders of magnitude speed-up relative to the approach present in
Brill's tagger.
Next: 3. System Description
Up: Fast Transformation-Based Learning Toolkit
Previous: 1. Preface
  Contents
Radu Florian
2002-02-07