Introduction to Natural Language Processing (600.465)
Maximum Entropy
CS Dept., Johns Hopkins Univ.
Maximum Entropy??
- Recall: so far, we always “liked”
- minimum entropy...
= minimum uncertainty
= maximum predictive power
.... distributions
- always: relative to some “real world” data
- always: clear relation between the data, model and parameters: e.g., n-gram language model
- This is still the case! But...
The Maximum Entropy Principle
- Given some set of constraints (“relations”, “facts”), which must hold (i.e., we believe they correspond to the real world we model):
What is the best distribution among those available?
- Answer: the one with maximum entropy
(of such distributions satisfying the constraints)
- Why? ...philosophical answer:
- Occam’s razor; Jaynes, ...:
- make things as simple as possible, but not simpler;
- do not pretend you know something you don’t
Example
- Throwing the “unknown” die
- do not know anything → we should assume a fair die
(uniform distribution ~ max. entropy distribution)
- Throwing unfair die
- we know: p(4) = 0.4, p(6) = 0.2, nothing else
- best distribution?
- do not assume anything
about the rest: the remaining 0.4 is split evenly, p(1) = p(2) = p(3) = p(5) = 0.1
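A minimal sketch of the unfair-die case, assuming the only constraints are p(4) = 0.4 and p(6) = 0.2. It spreads the unconstrained mass uniformly over the free faces and compares the entropy with one other distribution that meets the same constraints. The helper names `entropy` and `max_entropy_die`, and the comparison distribution `other`, are illustrative and not from the lecture.

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def max_entropy_die(fixed, faces=range(1, 7)):
    """Spread the unconstrained probability mass uniformly over the free faces."""
    free = [f for f in faces if f not in fixed]
    rest = 1.0 - sum(fixed.values())
    p = dict(fixed)
    for f in free:
        p[f] = rest / len(free)
    return p

# Faces 1, 2, 3 and 5 each receive (1 - 0.4 - 0.2) / 4.
p = max_entropy_die({4: 0.4, 6: 0.2})

# Hypothetical alternative that satisfies the same two constraints.
other = {1: 0.25, 2: 0.05, 3: 0.05, 4: 0.4, 5: 0.05, 6: 0.2}
print(entropy(p) > entropy(other))
```

Any other way of dividing the free mass lowers the entropy, which is why the uniform split is the one chosen.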
Using Non-Maximum Entropy Distribution
- Result depends on the real world:
- real world ~ our constraints (p(4) = 0.4, p(6) = 0.2), with no specific constraints on anything else:
- our average error when using some other distribution q in place of the ME distribution p: D(q||p) [recall: Kullback-Leibler distance]
- real world ~ orig. constraints + p(1) = 0.25:
- q is best (but hey, then we should have started with all 3 constraints!)
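As a rough illustration of the error term D(q||p), the sketch below compares the ME distribution p with a distribution q that additionally assumes p(1) = 0.25. The values of q are hypothetical, chosen only to satisfy the constraints mentioned on this slide, and `kl` is an illustrative helper name.

```python
import math

def kl(a, b):
    """D(a||b) in bits, summed over outcomes where a is non-zero."""
    return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)

# ME distribution under p(4) = 0.4, p(6) = 0.2.
p = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.4, 5: 0.1, 6: 0.2}
# Hypothetical q: same two constraints, plus the extra assumption p(1) = 0.25.
q = {1: 0.25, 2: 0.05, 3: 0.05, 4: 0.4, 5: 0.05, 6: 0.2}

print(kl(q, p))  # strictly positive: q and p disagree on faces 1, 2, 3, 5
```

The divergence is zero only when the two distributions coincide, so any unjustified extra assumption in q costs something whenever the real world follows only the original constraints.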
Things in Perspective: n-gram LM
- Is an n-gram model a ME model?
- yes if we believe that trigrams are the all and only constraints
- trigram model constraints: p(z|x,y) = c(x,y,z)/c(x,y)
- no room for any “adjustments”
- as if we had fixed p(2) = 0.7, p(6) = 0.3 for a die: nothing is left for the other outcomes
- Accounting for the apparent inadequacy:
- smoothing
- ME solution: (sort of) smoothing “built in”
- constraints from training, maximize entropy on training + heldout
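The trigram constraint p(z|x,y) = c(x,y,z)/c(x,y) can be sketched as plain relative-frequency estimation. This is a minimal sketch on a toy token list; `trigram_mle` is an illustrative name, and c(x,y) is counted here as the number of times (x,y) occurs as a trigram history.

```python
from collections import Counter

def trigram_mle(tokens):
    """Return a function p(z, x, y) = c(x,y,z) / c(x,y) from raw counts."""
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    tri = Counter(trigrams)
    hist = Counter((x, y) for x, y, _ in trigrams)

    def p(z, x, y):
        if hist[(x, y)] == 0:
            return 0.0  # unseen history: the unsmoothed model has nothing to say
        return tri[(x, y, z)] / hist[(x, y)]

    return p

p = trigram_mle("a b c a b d a b c".split())
print(p("c", "a", "b"), p("d", "a", "b"), p("a", "a", "b"))
```

Every continuation not seen after a given history gets probability 0, which is the "no room for adjustments" problem that smoothing addresses.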
Features and Constraints
- Introducing...
- binary valued selector functions (“features”):
- $f_i(y,x) \in \{0,1\}$, where
- $y \in Y$ (sample space of the event being predicted, e.g. words, tags, ...),
- $x \in X$ (space of contexts, e.g. word/tag bigrams, unigrams, weather conditions, of - in general - unspecified nature/length/size)
- constraints:
- $E_p(f_i(y,x)) = E'(f_i(y,x))$ (= empirical expectation)
- recall: expectation relative to distribution p: $E_p(f_i) = \sum_{y,x} p(x,y) f_i(y,x)$
- empirical expectation: $E'(f_i) = \sum_{y,x} p'(x,y) f_i(y,x) = \frac{1}{|T|} \sum_{t=1}^{|T|} f_i(y_t,x_t)$
- notation: $E'(f_i(y,x)) = d_i$: constraints of the form $E_p(f_i(y,x)) = d_i$
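A minimal sketch of a binary feature and its empirical expectation d_i, computed as the fraction of training events on which the feature fires. The context representation (a dict with a `prev_tag` key), the feature itself and the toy data T are assumptions made for illustration.

```python
def f_dt_after_verb(y, x):
    """1 if the previous tag is a verb tag and the predicted tag is DT, else 0."""
    return 1 if x["prev_tag"] in {"VB", "VBZ", "VBP"} and y == "DT" else 0

def empirical_expectation(feature, data):
    """d_i = (1/|T|) * sum over training events (y_t, x_t) of f_i(y_t, x_t)."""
    return sum(feature(y, x) for y, x in data) / len(data)

# Toy training data T: (predicted tag y, context x) pairs.
T = [
    ("DT", {"prev_tag": "VBZ"}),
    ("NN", {"prev_tag": "DT"}),
    ("DT", {"prev_tag": "IN"}),
    ("DT", {"prev_tag": "VB"}),
]

d = empirical_expectation(f_dt_after_verb, T)  # fires on 2 of the 4 events
```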
Additional Constraint (Ensuring Probability Distribution)
- The model’s p(y|x) should be a probability distribution:
- add an “omnipresent” feature $f_0(y,x) = 1$ for all y,x
- constraint: $E_p(f_0(y,x)) = 1$
- Now, assume:
- We know the set $S = \{f_i(y,x),\ i=0..N\}$ ($|S| = N+1$)
- We know all the constraints
- i.e. a vector $d_i$, one for each feature, i=0..N
- Where are the parameters?
- ...we do not even know the form of the model yet
The Model
- Given the constraints, what is the form of the model which maximizes the entropy of p?
- Use Lagrange multipliers:
- minimizing some function f(z) in the presence of N constraints $g_i(z) = d_i$ means to minimize
$f(z) - \sum_{i=1}^{N} \lambda_i (g_i(z) - d_i)$ (w.r.t. all $\lambda_i$ and z)
- our case, minimize
$A(p) = -H(p) - \sum_{i=1}^{N} \lambda_i (E_p(f_i(y,x)) - d_i)$ (w.r.t. all $\lambda_i$ and p!)
- i.e. $f(z) = -H(p)$, $g_i(z) = E_p(f_i(y,x))$ (variable z ~ distribution p)
Loglinear (Exponential) Model
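- Maximize the entropy: take the partial derivative with respect to p and solve A’(p) = 0 (the sums now start at i=0 to include the normalization constraint on $f_0$):
$\partial[-H(p) - \sum_{i=0}^{N} \lambda_i (E_p(f_i(y,x)) - d_i)] / \partial p = 0$
$\partial[\sum p \log(p) - \sum_{i=0}^{N} \lambda_i ((\sum p f_i) - d_i)] / \partial p = 0$
$1 + \log(p) - \sum_{i=0}^{N} \lambda_i f_i = 0$
$1 + \log(p) = \sum_{i=1}^{N} \lambda_i f_i + \lambda_0$ (since $f_0 = 1$)
$p = e^{\sum_{i=1}^{N} \lambda_i f_i + \lambda_0 - 1}$
- $p(y,x) = \frac{1}{Z} e^{\sum_{i=1}^{N} \lambda_i f_i(y,x)}$ ($Z = e^{1-\lambda_0}$, the normalization factor)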
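A minimal sketch of evaluating the loglinear form for given weights. It assumes the conditional variant, where Z is computed per context x by summing over all y in Y; the function name `loglinear` and the toy feature are illustrative.

```python
import math

def loglinear(y_space, x, features, lambdas):
    """p(y|x) proportional to exp(sum_i lambda_i * f_i(y, x)), normalized over y."""
    scores = {
        y: math.exp(sum(l * f(y, x) for l, f in zip(lambdas, features)))
        for y in y_space
    }
    z = sum(scores.values())  # normalization factor Z for this context
    return {y: s / z for y, s in scores.items()}

def is_a(y, x):
    return int(y == "a")

uniform = loglinear(["a", "b"], None, [is_a], [0.0])         # all weights 0
skewed = loglinear(["a", "b"], None, [is_a], [math.log(3)])  # odds 3:1 for "a"
```

With all weights at zero the model is uniform, which matches the maximum entropy answer when no constraint says otherwise.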
Getting the Lambdas: Setup
- Model: $p(y,x) = \frac{1}{Z} e^{\sum_{i=1}^{N} \lambda_i f_i(y,x)}$
- Generalized Iterative Scaling (G.I.S.)
- obeys form of model & constraints:
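- G.I.S. needs, in order to work, $\forall y,x:\ \sum_{i=1}^{N} f_i(y,x) = C$
- to fulfill, define an additional (correction) feature and its constraint:
- $f_{N+1}(y,x) = C_{max} - \sum_{i=1}^{N} f_i(y,x)$, where $C_{max} = \max_{x,y} \sum_{i=1}^{N} f_i(y,x)$
- also, approximate (because $\sum_{x \in \text{All contexts}}$ is not (never) feasible)
- $E_p(f_i) = \sum_{y,x} p(x,y) f_i(y,x) \approx \frac{1}{|T|} \sum_{t=1}^{|T|} \sum_{y \in Y} p(y|x_t) f_i(y,x_t)$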
(use p(y,x)=p(y|x)p’(x), where p’(x) is empirical i.e. from data T)
Generalized Iterative Scaling
- 1. Initialize $\lambda_i^{(1)}$ (any values, e.g. 0), compute $d_i$, i=1..N+1
- 2. Set iteration number n to 1.
- 3. Compute the expected values of all features under the current model distribution:
$E_{p^{(n)}}(f_i)$ (based on $p^{(n)}(y|x_t)$)
- [pass through data, see previous slide;
at each data position t, compute $p^{(n)}(y,x_t)$, normalize]
- 4. Update $\lambda_i^{(n+1)} = \lambda_i^{(n)} + \frac{1}{C} \log(d_i / E_{p^{(n)}}(f_i))$
- 5. Increase n and repeat 3., 4. until convergence.
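The five steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a conditional model normalized per context, takes the maximum for C over training contexts only, runs a fixed number of iterations instead of testing convergence, and skips the update for features whose empirical or model expectation is zero. The name `gis` and the toy data are illustrative.

```python
import math

def gis(data, y_space, features, iterations=200):
    """Generalized Iterative Scaling.

    data: list of (y, x) training events; features: list of f(y, x) -> 0/1.
    Returns the weights and the feature list including the correction feature.
    """
    features = list(features)
    # Correction feature f_{N+1} makes the feature sum equal C for every (y, x).
    c = max(sum(f(y, x) for f in features) for _, x in data for y in y_space)
    feats = features + [lambda y, x: c - sum(f(y, x) for f in features)]
    n = len(data)
    # Step 1: weights start at 0; d_i are the empirical expectations.
    lambdas = [0.0] * len(feats)
    d = [sum(f(y, x) for y, x in data) / n for f in feats]
    for _ in range(iterations):
        # Step 3: model expectations, one pass through the data.
        expected = [0.0] * len(feats)
        for _, x in data:
            scores = {
                y: math.exp(sum(l * f(y, x) for l, f in zip(lambdas, feats)))
                for y in y_space
            }
            z = sum(scores.values())
            for i, f in enumerate(feats):
                expected[i] += sum(scores[y] / z * f(y, x) for y in y_space) / n
        # Step 4: additive update in log space (simplification: skip zeros).
        for i in range(len(feats)):
            if d[i] > 0 and expected[i] > 0:
                lambdas[i] += math.log(d[i] / expected[i]) / c
    return lambdas, feats

# Toy data with a single (empty) context: "a" occurs in half of the events.
data = [("a", None), ("a", None), ("b", None), ("c", None)]
lambdas, feats = gis(data, ["a", "b", "c"], [lambda y, x: int(y == "a")])
```

On this toy data the fitted model should give p(a) = 0.5 and split the rest evenly between "b" and "c", since nothing in the constraints distinguishes them.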
Comments on Features
- Advantage of “variable” (~ not fixed) context in f(y,x):
- any feature o.k. (examples mostly for tagging):
- previous word’s part of speech is VBZ or VB or VBP, y is DT
- next word: capitalized, current: “.”, and y is a sentence break (SB detect)
- y is MD, and the current sentence is a question (last word: question mark)
- tag assigned by a different tagger is VBP, and y is VB
- it is before Thanksgiving and y is “turkey” (Language modeling)
- even (God forbid!) manually written rules, e.g. y is VBZ and there is ...
- remember, the predicted event plays a role in a feature:
- also, a set of events: f(y,x) is true if y is NNS or NN, and x is ...
- x can be ignored as well (“unigram” features)
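A few of the feature types listed above, written as binary functions f(y, x). This is a sketch under assumed conventions: the context x is a dict, and the keys `current`, `next` and `last` as well as the helper `make_feature` are hypothetical.

```python
def make_feature(tags, context_test=lambda x: True):
    """Build f(y, x) that fires when y is in `tags` and the context passes the test."""
    tags = frozenset(tags)
    return lambda y, x: int(y in tags and context_test(x))

# Sentence-break detection: current token is "." and the next word is capitalized.
f_sb = make_feature(
    {"SB"}, lambda x: x["current"] == "." and x["next"][:1].isupper()
)

# y is MD and the sentence is a question (its last token is a question mark).
f_md_question = make_feature({"MD"}, lambda x: x["last"] == "?")

# A set of predicted events with the context ignored ("unigram" feature).
f_noun = make_feature({"NN", "NNS"})
```

Because the context test is an arbitrary predicate, the same construction covers output from another tagger, calendar information, or hand-written rules.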
Feature Selection
- Advantage:
- throw in many features
- typical case: specify templates manually (pool of features P), fill in from data, possibly add some specific manually written features
- let the machine select
- Maximum Likelihood ~ Minimum Entropy on training data
- after, of course, computing the $\lambda_i$’s using the MaxEnt algorithm
- Naive (greedy of course) algorithm:
- start with empty S, add one feature at a time (MLE after ME)
- too costly for full computation (|S| × |P| × |ME-time|)
- Solution: see Berger & DellaPietras
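The naive greedy loop can be sketched as below. The callback `score` is a hypothetical stand-in for "train the ME model on the feature set, then return the training-data likelihood"; it is not an existing API, and the toy scoring function at the end only demonstrates the control flow.

```python
def greedy_select(pool, score, max_features):
    """Add one feature at a time, always the one that improves `score` the most."""
    selected = []
    current = score(selected)
    while len(selected) < max_features:
        candidates = [f for f in pool if f not in selected]
        if not candidates:
            break
        # Each call to score() stands for a full ME training run: the costly part.
        best = max(candidates, key=lambda f: score(selected + [f]))
        best_score = score(selected + [best])
        if best_score <= current:
            break  # no remaining feature helps
        selected.append(best)
        current = best_score
    return selected

# Toy stand-in for likelihood: each feature contributes a fixed gain.
gain = {"a": 1, "b": 3, "c": 0}
chosen = greedy_select(["a", "b", "c"], lambda s: sum(gain[f] for f in s), 3)
```

Every round re-scores every remaining candidate, which is where the |S| × |P| × |ME-time| cost comes from.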
References
- Jelinek:
- Chapter 13 (includes application to LM)
- Chapter 14 (other applications)
- Berger & DellaPietras in CL, 1996, 1997
- Improved Iterative Scaling (does not need $\sum_{i=1}^{N} f_i(y,x) = C$)
- “Fast” Feature Selection!
- Hildebrand, F.B.: Methods of Applied Math., 1952