600.465 Introduction to NLP (Fall 1999)
Midterm Exam Answers
Date: Nov 01 2pm (30 min.)
If asked to compute something for which you have the numbers, that
really means to compute the final number, not just to write the
formula. If asked for a formula, write down the formula.
1. Probability
Let S = { a, b, c } (the sample space), and p be the joint
distribution on a sequence of two events (i.e. on S x S, ordered). If
you know that p(a,a) [a followed by a] = 0.25, p(a,b) [a followed by
b] = 0.125, p(b,c) [b followed by c] = 0.125, p(c,a) [c followed by a]
= 0.25, and p(c,c) [c followed by c] = 0.25, is it enough to compute
p(b|a) (i.e., the probability of seeing b if we already know that the
preceding event generated a)?
- Yes / No: __Yes__
- why? __The listed probabilities sum to 1, so every other pair has probability 0___
__and p is fully defined. Hence pL(a) = sum over i of p(a,i), and p(b|a) = p(a,b)/pL(a)._
- If yes, compute: p(b|a) = ___1/3__( = p(a,b) / pL(a) = .125 / .375)____
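The answer above can be checked numerically with a minimal Python sketch (not
part of the exam); the joint probabilities are copied from the question, and
every pair not listed is taken to have probability 0.

    # Joint bigram distribution from question 1; pairs not listed have probability 0.
    p = {('a', 'a'): 0.25, ('a', 'b'): 0.125, ('b', 'c'): 0.125,
         ('c', 'a'): 0.25, ('c', 'c'): 0.25}
    S = ['a', 'b', 'c']
    assert abs(sum(p.values()) - 1.0) < 1e-12   # the five entries already sum to 1

    # Left marginal pL(a) = sum over second events i of p(a, i).
    pL_a = sum(p.get(('a', i), 0.0) for i in S)
    print(pL_a)                  # 0.375
    print(p[('a', 'b')] / pL_a)  # 0.3333... = 1/3 = p(b|a)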
2. Estimation and Cross-entropy
Use the bigram distribution from question 1.
- Write one example of a data sequence which faithfully follows the
distribution (i.e., a training data from which we would get the above
bigram distribution using the MLE method):
E.g.: __c a a b c c c a a__
- What is the cross-entropy Hdata(p) in bits and the
perplexity(1) Gdata(p) of the bigram distribution from
question 1 if computed against the following data:
data = b c a
Hdata(p) = ____4/3____ bits ( = -(1/3)(log2 pL(b) + log2 p(c|b) + log2 p(a|c)) = -(1/3)(-3 + 0 - 1))
Gdata(p) = ____16^(1/3)____ (the cube root of 16, i.e. 2^(4/3))
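To reproduce these numbers, here is a minimal Python sketch (not part of the
exam). It assumes the first symbol of the data is scored with its left-marginal
(unigram) probability and every following symbol with the bigram conditional,
which is the convention that yields the 4/3 bits above.

    import math

    # Joint bigram distribution from question 1 (pairs not listed are 0).
    p = {('a', 'a'): 0.25, ('a', 'b'): 0.125, ('b', 'c'): 0.125,
         ('c', 'a'): 0.25, ('c', 'c'): 0.25}
    S = ['a', 'b', 'c']
    pL = {x: sum(p.get((x, i), 0.0) for i in S) for x in S}   # left marginals

    def cond(y, x):
        return p.get((x, y), 0.0) / pL[x]                     # p(y | x)

    data = ['b', 'c', 'a']
    # First symbol scored by its marginal, the rest by bigram conditionals.
    logprob = math.log2(pL[data[0]])
    for prev, cur in zip(data, data[1:]):
        logprob += math.log2(cond(cur, prev))

    H = -logprob / len(data)
    print(H)        # 1.3333... = 4/3 bits
    print(2 ** H)   # 2.5198... = 16 ** (1/3)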
3. Mutual information
Use the bigram distribution from question 1.
- What is the pointwise mutual information of c and a (in this order)?
Ipointwise(c,a) = _ 0 _ ( = log2(p(c,a) / (pL(c) pR(a))) = log2(0.25 / (0.5 * 0.5)) = log2(1) = 0)__
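Again a minimal Python sketch (not part of the exam) that reproduces the two
marginals and the value 0:

    import math

    # Joint bigram distribution from question 1 (pairs not listed are 0).
    p = {('a', 'a'): 0.25, ('a', 'b'): 0.125, ('b', 'c'): 0.125,
         ('c', 'a'): 0.25, ('c', 'c'): 0.25}
    S = ['a', 'b', 'c']
    pL = {x: sum(p.get((x, i), 0.0) for i in S) for x in S}   # first position
    pR = {y: sum(p.get((i, y), 0.0) for i in S) for y in S}   # second position

    I_ca = math.log2(p[('c', 'a')] / (pL['c'] * pR['a']))
    print(pL['c'], pR['a'], I_ca)   # 0.5 0.5 0.0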
4. Smoothing and the sparse data problem
- Name three methods of smoothing:
- ____"Add 1" (or "Add lambda")_____________________________
- ____Good-Turing___________________________________________
- ____Linear Interpolation__________________________________
-
If you were to design a bigram language model, how would the final smoothed
distribution be defined if you use the linear interpolation smoothing method?
- ____p2'(wi|wi-1) = λ2 p2(wi|wi-1) + λ1 p1(wi) + λ0 (1/|V|), where λ0 + λ1 + λ2 = 1 (λi >= 0)___
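As a concrete illustration, a minimal Python sketch of that interpolated
distribution (not part of the exam). The function name and the λ values are
placeholders; in practice the λ's are estimated on held-out data, e.g. by EM.

    # p'(w | w_prev) = l2*p2(w|w_prev) + l1*p1(w) + l0*(1/|V|)
    # p2 and p1 are dicts of MLE bigram / unigram probabilities;
    # unseen events simply fall back to the lower-order terms.
    def smoothed_bigram(w, w_prev, p2, p1, vocab_size,
                        l2=0.7, l1=0.25, l0=0.05):
        assert abs(l2 + l1 + l0 - 1.0) < 1e-12   # the weights must sum to 1
        return (l2 * p2.get((w_prev, w), 0.0)
                + l1 * p1.get(w, 0.0)
                + l0 / vocab_size)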
5. Classes based on Mutual Information
Suppose you have the following data:
Is this question really so easy , or was it rather
the previous question , that was so difficult ?
What is the best pair of candidates for the first merge, if you use the
greedy algorithm for classes based on bigram mutual information
(i.e. the homework #2 algorithm)? Use your judgment, not computation.
-
Word 1: ___or_______________
Word 2: ___that_____________
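Intuitively, "or" and "that" are the best first merge because they occur in
exactly the same context in this data (both follow "," and both precede "was"),
so putting them into one class loses essentially no bigram mutual information.
The Python sketch below (not part of the exam; variable names are mine) only
tabulates the neighbours to show this; the actual homework #2 algorithm computes
the loss in average mutual information for every candidate pair and takes the minimum.

    from collections import defaultdict

    # The toy data from question 5, tokenized on whitespace.
    text = ("Is this question really so easy , or was it rather "
            "the previous question , that was so difficult ?").split()

    left, right = defaultdict(set), defaultdict(set)
    for prev, cur in zip(text, text[1:]):
        right[prev].add(cur)   # words that follow `prev`
        left[cur].add(prev)    # words that precede `cur`

    # Both candidates occur only in the context ", _ was":
    print(left['or'], right['or'])       # {','} {'was'}
    print(left['that'], right['that'])   # {','} {'was'}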
6. Hidden Markov Models
- What is the Viterbi algorithm good for? (Use max. 5 sentences for
the answer.)
____To find the most probable sequence of states (the best path through_______
____the HMM trellis) given a parametrized (trained) HMM and some input________
____data (observations) presumably generated by that HMM.________________
_________________________________________________________________________
_________________________________________________________________________
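For concreteness, a minimal Python sketch of the algorithm (not part of the
exam). The two-state tag set, the probability tables and all names below are
made-up illustrations, not data from the course.

    import math

    def viterbi(obs, states, start_p, trans_p, emit_p):
        """Most probable state sequence for `obs` under a discrete HMM.
        Probabilities are plain dicts; log-space avoids underflow."""
        log = lambda x: math.log(x) if x > 0 else float('-inf')
        # Each trellis column maps state -> (log score of best path, that path).
        V = [{s: (log(start_p[s]) + log(emit_p[s][obs[0]]), [s]) for s in states}]
        for o in obs[1:]:
            column = {}
            for s in states:
                best_prev, best_score = max(
                    ((q, V[-1][q][0] + log(trans_p[q][s])) for q in states),
                    key=lambda item: item[1])
                column[s] = (best_score + log(emit_p[s][o]), V[-1][best_prev][1] + [s])
            V.append(column)
        return max(V[-1].values(), key=lambda item: item[0])[1]

    # Hypothetical two-state tagger; the numbers are arbitrary.
    states = ['N', 'V']
    start  = {'N': 0.6, 'V': 0.4}
    trans  = {'N': {'N': 0.3, 'V': 0.7}, 'V': {'N': 0.8, 'V': 0.2}}
    emit   = {'N': {'the': 0.3, 'dog': 0.6, 'barks': 0.1},
              'V': {'the': 0.1, 'dog': 0.1, 'barks': 0.8}}
    print(viterbi(['the', 'dog', 'barks'], states, start, trans, emit))  # ['N', 'N', 'V']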
- What is the Baum-Welch algorithm good for? (Use max. 5 sentences for
the answer.)
____For estimating the parameters of an HMM (initial, transition and______
____emission probabilities) from a set of training data, by (locally)_____
____maximizing its likelihood. The HMM topology has to be specified, too._
_________________________________________________________________________
_________________________________________________________________________
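A minimal, unscaled Baum-Welch sketch for a discrete HMM trained on a single
short symbol sequence (not part of the exam; the function name, the toy
sequence and the number of states are illustrative assumptions). Real
implementations scale alpha/beta or work in log space and train on much more data.

    import numpy as np

    def baum_welch(obs, n_states, n_symbols, n_iter=20, seed=0):
        """Re-estimate HMM parameters (pi, A, B) from one observation
        sequence `obs` (a list of symbol indices) by EM; unscaled
        forward/backward, so only suitable for short toy sequences."""
        rng = np.random.default_rng(seed)
        obs = np.asarray(obs)
        T = len(obs)
        pi = rng.dirichlet(np.ones(n_states))                     # initial probs
        A = rng.dirichlet(np.ones(n_states), size=n_states)       # A[i, j] = P(j | i)
        B = rng.dirichlet(np.ones(n_symbols), size=n_states)      # B[i, k] = P(symbol k | i)
        for _ in range(n_iter):
            # E-step: forward (alpha) and backward (beta) probabilities.
            alpha = np.zeros((T, n_states))
            beta = np.zeros((T, n_states))
            alpha[0] = pi * B[:, obs[0]]
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            beta[-1] = 1.0
            for t in range(T - 2, -1, -1):
                beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
            evidence = alpha[-1].sum()
            gamma = alpha * beta / evidence                        # P(state_t = i | obs)
            xi = np.zeros((T - 1, n_states, n_states))             # P(i -> j at t | obs)
            for t in range(T - 1):
                xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / evidence
            # M-step: re-estimate the parameters from the expected counts.
            pi = gamma[0]
            A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
            B = np.zeros_like(B)
            for k in range(n_symbols):
                B[:, k] = gamma[obs == k].sum(axis=0)
            B /= gamma.sum(axis=0)[:, None]
        return pi, A, B

    # Example: two hidden states, three observable symbols 0/1/2.
    pi, A, B = baum_welch([0, 1, 0, 2, 0, 1, 0, 2], n_states=2, n_symbols=3)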
Now check if you have filled in your name and SSN. Also,
please carefully check your answers and hand the exam in.
(1) The perplexity computation is the only computation
here for which you might need a calculator; it is ok if you use an
expression (use the appropriate (integer) numbers, though!).