600.665 - Statistical Language Learning

Statistical Language Learning
Prof. Jason Eisner
Course # 600.665 - Spring 2002

"When the going gets tough, the tough get empirical" -- Jon Carroll

Course Description

Catalog description: This course focuses on past and present research that has attempted, with mixed success, to induce the structure of language from raw data such as text. Lectures will be intermixed with reading and discussion of the primary literature. Students will critique the readings, answer open-ended homework questions, and undertake a final project. [Applications]
Prereq: 600.465 or perm req'd.

The main goals of the seminar are (a) to cover some techniques people have tried for inducing hidden structure from text, (b) to get you thinking about how to do it better.

Since most of the techniques in (a) don't perform that well, (b) is more important.

The course should also help to increase your comfort with the building blocks of statistical NLP - weighted transducers, probabilistic grammars, graphical models, etc., and the supervised training procedures for these building blocks.

Links:

Vital Statistics

Lectures:	MTW 2-3 pm, Shaffer 304 (but we'll move to the NEB 325a conference room if we're not too big)
Prof:	Jason Eisner - `jason@cs.jhu.edu`
Office hrs:	MW 3-4 pm, or by appt, in NEB 326
Web page:	http://cs.jhu.edu/~jason/665
Mailing list:	cs665@cs.jhu.edu (cs665 also works on NLP lab machines)
Textbook:	none, but the textbooks for 465 may come in handy
Policies:	Grading: 30% written responses (graded as check/check-plus, etc.), 30% class participation, 40% project. Announcements: New readings announced by email and posted below. Submission: Email me written responses to the whole week's readings by 11 am each Monday. Academic honesty: dept. policy (but you can work in pairs on reading responses)

Readings and Responses

Generally we will discuss about 3 related papers each week. Since we may flit from paper to paper, comparing and contrasting, you should read all the papers by the start of the week.

A centerpiece of the course is the requirement to respond thoughtfully to each paper in writing. You should email me your responses to the upcoming week's papers, in separate plaintext or postscript messages, by noon each Monday. (Include "665 response" and the paper's authors in the subject line.) I will print the responses out for everyone, and they will anchor our class discussion. They will also be a useful source of ideas for your final projects.

A typical response is 1-3 paragraphs; in a given week you might respond at greater length to some papers than others. It's okay to work with another person. What should you write about? Some possibilities:

Idea for a new experiment, model or other research opportunity inspired by the reading
A clearer explanation of some point that everyone probably had to struggle with
Unremarked consequences of the experimental design or results
Additional experiments you really wish the author had done
Other ways the research could be improved (e.g., flaws you spotted)
Non-obvious connections to other work you know about from class or elsewhere

Please be as concrete as possible - and write clearly, since your classmates will be reading your words of wisdom!

The Readings

Suggestions for readings are welcome, especially well in advance.

Week of Jan. 28: Bootstrapping
We will read one or two of these for Wednesday (to be chosen in class on Monday).
- David Yarowsky (1995). Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL '95, 189-196. http://www.cs.jhu.edu/~yarowsky/acl95.ps.gz
- Yael Karov and Shimon Edelman (1996). Learning similarity-based word sense disambiguation from sparse data. Proc. of the 4th Workshop on Very Large Corpora, Copenhagen. http://www.ai.mit.edu/~edelman/mirror/karov-96.ps.Z
- I. Dan Melamed (1997). A word-to-word model of translational equivalence. Proceedings of ACL/EACL '97, 490-497. http://xxx.lanl.gov/abs/cmp-lg/9706026
Week of Feb. 4: Classes of "interchangeable" words
- Chapter 3 of: Lillian Lee (1997). Similarity-based approaches to natural language processing. Ph.D. thesis. Harvard University Technical Report TR-11-97. http://xxx.lanl.gov/ps/cmp-lg/9708011
- Chapter 4 of: The same thing.
- Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the Society for Information Science, 41(6), 391-407. http://lsi.research.telcordia.com/lsi/papers/JASIS90.pdf; scanned version with figures
An optional short reading comparing the above methods for IR: L. Douglas Baker and Andrew McCallum (1998). Distributional clustering of words for text classification. SIGIR '98. http://www.cs.cmu.edu/~mccallum/papers/clustering-sigir98.ps.gz

Optional reading explaining the advanced clustering method that Lee uses: K. Rose (1998). "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 80, pp. 2210-2239, November. http://scl.ece.ucsb.edu/pubs/pubs_B/b98_2.pdf
Week of Feb. 11: Word meanings, word boundaries
- Carl de Marcken (1996). Linguistic structure as composition and perturbation. Proceedings of ACL-96. http://xxx.lanl.gov/ps/cmp-lg/9606027
  Lexicon learning. Many additional details are given in de Marcken's thesis (1996), if you're curious or just need clarification. Note that de Marcken starts by assuming that the lexicon consists of single characters, which he hierarchically agglomerates; for comparison, Brent & Cartwright (1996) start by putting the corpus in the lexicon as a single word, and hierarchically divide it on MDL principles similar to de Marcken's. (Compare agglomerative vs. divisive clustering.) While Brent & Cartwright use a very greedy algorithm, Brent (1999) optimizes the segmentation via dynamic programming (like de Marcken), as well as using a more carefully chosen prior (abandoning MDL rhetoric for Bayesian) and processing the input one utterance at a time; it also compares the work empirically to several other unsupervised segmentation methods. Venkataraman (1999) is a variation on Brent (1999) that models the corpus as generated by a traditional smoothed bigram or trigram model rather than Brent's unigram-ish model. Goldsmith (1997) has a non-hierarchical method that learns how to divide words into stem + suffix, using something like de Marcken's MDL-plus-EM approach.
- Chengxiang Zhai (1997). Exploiting context to identify lexical atoms: A statistical view of linguistic context. Proceedings of the International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), Rio de Janeiro, Brzil, Feb. 4-6, 1997. 119-129. http://arXiv.org/ps/cmp-lg/9701001
  This short paper tries to identify two-word phrases. Unlike the very well-known methods such as smoothed mutual information, or the log-likelihood independence test of Dunning (1993), it looks at context.
- Jeffrey Mark Siskind:
  - (1995) `Robust Lexical Acquisition Despite Extremely Noisy Input,' Proceedings of the 19th Boston University Conference on Language Development (edited by D. MacLaughlin and S. McEwen), Cascadilla Press, March. ftp://ftp.nj.nec.com/pub/qobi/bucld95.ps.Z
    Try to relate this problem and Siskind's solution to the term x document and verb x object matrices we studied last week.
  - Section 6 of: (1996) A Computational Study of Cross-Situational Techniques for Learning Word-to-Meaning Mappings. Cognition 61(1-2): 39-91, October/November. ftp://ftp.nj.nec.com/pub/qobi/cognition96.ps.Z
    This fills in some missing material from the previous (short) article. Just read section 6, "The Noise-Free Monosemous Case," which starts on p. 54 of the print version and p. 19 of the online version. Try to relate it to Melamed (1997). (Warning: The layout is a bit confusing because each rule is explained before it is stated.) See Thompson & Mooney (1999) for a different approach and an empirical comparison to Siskind's system.
Week of Feb. 18: HMMs and Part-of-Speech Tagging
- Bernard Merialdo (1994). Tagging English text with a probabilistic model. Computational Linguistics 20(2):155-172. scanned PDF version
- David Elworthy (1994). Does Baum-Welch re-estimation help taggers? Proceedings of ANLP, Stuttgart, 53-58. http://xxx.lanl.gov/abs/cmp-lg/9410012
- Emmanuel Roche and Yves Schabes (1995). Deterministic Part-of-Speech Tagging with Finite State Transducers. Computational Linguistics, March. http://www.merl.com/reports/TR94-07/
  This gives a good account of Eric Brill's supervised tagging method, and shows how to compile it into a finite-state transducer once it's trained. (There are also tricks to speed up training, some of which are used by the fnTBL toolkit developed right here by Florian & Ngai.)
  
  To read about the excellent performance of Brill's method, see his 1995 Computational Linguistics article or his 1992 thesis. For discussion of why it works well and a comparison to decision trees, see Ramshaw & Marcus (1996). (And here's a bibliography of papers on Brill's "transformation-based learning.") Next week we'll probably look at Brill's extension to the unsupervised case.
Week of Feb. 25: Unsupervised Finite-State Topology
- Eric Brill (1995). Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. Proc. of 3rd Workshop on Very Large Corpora, MIT, June. Also appears in Natural Language Processing Using Very Large Corpora, 1997. http://www.cs.jhu.edu/~brill/acl-wkshp.ps.
  Shows how to use a dictionary and a lot of text to train an unsupervised tagger, by bootstrapping off the unambiguous words. This tagger could be compiled into a transducer as in last week's Roche & Schabes paper.
- Sections 2.4-2.5 and Chapter 3 of: Andreas Stolcke (1994). Bayesian Learning of Probabilistic Language Models. Ph.D., thesis, University of California at Berkeley. ftp://ftp.icsi.berkeley.edu/pub/ai/stolcke/thesis.ps.Z
  Sections 2.4-2.5 are motivation and discussion. Chapter 3 learns HMM topologies (probabilistic finite-state grammars) - joint work with Stephen Omohundro.
- Jose Oncina (1998). The data driven approach applied to the OSTIA algorithm. In Proceedings of the Fourth International Colloquium on Grammatical Inference Lecture Notes on Artificial Intelligence Vol. 1433, pp. 50-56 Springer-Verlag, Berlin 1998. ftp://altea.dlsi.ua.es/people/oncina/articulos/icgi98.ps.gz (draft)
  
  Please also glance at the following papers so that you roughly understand a couple of the variants that Oncina and his colleagues have proposed: section 1 of this paper on learning stochastic DFAs, and section 3 of this paper dealing with OSTIA-D and OSTIA-R.
Week of Mar. 4: Learning Tied Finite-State Parameters
- Kevin Knight and Jonathan Graehl (1998). Machine Transliteration. Computational Linguistics 24(4):599-612, December. [Hardcopy available and preferred; in a pinch, read the slightly less detailed ACL-97 version.]
  One stage of this work effectively learns weights for a weighted edit distance, a problem discussed more directly by Ristad & Yianilos.
- Richard Sproat and Michael Riley (1996). Compilation of Weighted Finite-State Transducers from Decision Trees. Proceedings of ACL. http://arXiv.org/ps/cmp-lg/9606018
- Jason Eisner (2002). Parameter Estimation for Probabilistic Finite-State Transducers. Submitted to ACL. http://cs.jhu.edu/~jason/papers/#acl02-fst
  In your response, please try to think of some interesting new uses for this algorithm. Look at the first page of this shorter version for some thoughts about this.
Week of Mar. 11: Inside-Outside Algorithm
If you need to review the inside-outside algorithm, check my course slides before reading the following papers. The slide fonts are unfortunately a bit screwy unless you view under Windows.
- K. Lari and S. Young (1990). The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language 4:35-56. scanned PDF version
- Fernando Pereira and Yves Schabes (1992). Inside-outside reestimation from partially bracketed corpora. Proceedings of the 20th Meeting of the Association for Computational Linguistics. scanned PDF version
  A follow-up paper is Schabes, Roth & Osborne (1993), which uses a fully-bracketed corpus (on harder data - WSJ rather than ATIS). Since the brackets do not come labeled with nonterminal categories, there is still some unsupervised learning going on. The paper tries a few interesting tricks, including training on short sentences only.
  
  For evaluating the performance of a (learned) grammar, you may also be interested in chapter 3 (pp. 120-165) of Goodman (1998). He demonstrates that if you're supposed to output just one parse from your parse chart, and you get partial credit for incorrect parses, then you should extract not the Viterbi (highest-probability) parse, but rather the one with (e.g.) the highest expected number of correct constituents.
- Carl de Marcken (1995). On the unsupervised induction of phrase-structure grammars. Proc. of the 3rd Workshop on Very Large Corpora. http://bobo.link.cs.cmu.edu/grammar/demarcken.ps
  A thoughtful paper showing why the Inside-Outside algorithm gets stuck even on a tiny toy example, and even when restricted to rules of the appropriate X-bar form. (Note: Briscoe & Waegner (1992) were the first to try such a restriction.) De Marcken suggests that link grammars may be easier to learn, using arguments that also apply to dependency grammar.
Week of Mar. 18: Spring break!
Week of Mar. 25: More CFG Learning
- Chapter 4 of: Andreas Stolcke (1994). Bayesian Learning of Probabilistic Language Models. Ph.D., thesis, University of California at Berkeley. ftp://ftp.icsi.berkeley.edu/pub/ai/stolcke/thesis.ps.Z [Same thesis as before. This week, read only chapter 4.]
- Stanley Chen (1995). Bayesian grammar induction for language modeling. In Proceedings of the 33rd ACL, pp. 228-235. http://www-2.cs.cmu.edu/~sfc/papers/acl95.ps.gz
- Glenn Carroll and Mats Rooth (1998). Valence induction with a head-lexicalized PCFG. Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP). http://xxx.lanl.gov/abs/cmp-lg/9805001
  You may prefer to read the longer writeup of this work here. Excellent background on lexicalized PCFG models is Charniak (1997).
Week of Apr. 2: Maximum Entropy Parsing Models
This week's readings use supervised training. Please read carefully and make sure you really understand the models. Could such models be trained with less supervision?
- Adwait Ratnaparkhi (1997). A linear observed time statistical parser based on maximum entropy models. Proceedings of EMNLP. http://xxx.lanl.gov/ps/cmp-lg/9706014
  The general spirit of this work is somewhat similar to Magerman (1995), the first successful statistical parser. Magerman used decision trees rather than maxent.
- Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler (1999). Estimators for stochastic "unification-based" grammars. Proceedings of ACL. http://www.cog.brown.edu/~mj/papers/lfg-fs.ps.gz
  Various follow-up papers include an application to German by Riezler, Prescher, Kuhn and Johnson (2000), which plays interesting and successful games with scoring lexical preferences (see section 3.2).
- Eugene Charniak (2000). A maximum-entropy-inspired parser. Proceedings of NAACL. http://www.cs.brown.edu/~ec/papers/shortMeP.ps.gz
Week of Apr. 9: Bootstrapping Syntax
- David Magerman and Mitchell Marcus (1990). Parsing a natural language using mutual information statistics. Proceedings of AAAI. http://www-cs-students.stanford.edu/~magerman/papers/aaai90.ps
- Eric Brill and Mitchell Marcus (1992). Automatically acquiring phrase structure using distributional analysis. DARPA Workshop on Speech and Natural Language. http://nlp.cs.jhu.edu/~nasmith/temp/Darpa92_parser.ps
- Menno van Zaanen (2000). Automatically acquiring phrase structure using distributional analysis. Proceedings of ICML. http://turing.wins.uva.nl/~mvzaanen/docs/p_icml00.ps
  Noah writes: There are several papers on van Zaanen's site about "Alignment-based Learning." If this one isn't at the right level, take a look at the COLING '00, CLUK '00, or CLN '99 papers ... or his dissertation (2001).
Week of Apr. 16: Neural nets
- Chalmers, D. (1990). Syntactic transformations on distributed representations. Connection Science 2, 53--62. http://www.u.arizona.edu/~chalmers/papers/transformations.ps
  Based on Pollack 1990, which can be regarded as the context-free generalization of Elman (1990)'s Simple Recurrent Nets ("Finding Structure In Time"). See also the "continuous stack" idea of Mozer & Das (1993).
- The following two 6-page papers overlap considerably, so read one and flip through the other to find the differences.
  - Simon Levy and Jordan Pollack (2001). Infinite RAAM: A Principled Connectionist Substrate for Cognitive Modeling. ICCM. http://citeseer.nj.nec.com/440008.html
  - Simon Levy, Ofer Melnik, and Jordan Pollack (2000). Infinite RAAM: A Principled Connectionist Basis for Grammatical Competence. COGSCII. http://citeseer.nj.nec.com/305778.html
- Terry Regier (1995). A Model of the Human Capacity for Categorizing Spatial Relations. Cognitive Linguistics 6(1). http://www.psych.uchicago.edu/faculty/Terry_Regier/ftp/cogling94.ps
Week of Apr. 23
- John M. Zelle and Raymond J. Mooney (1996). Comparative Results on Using Inductive Logic Programming for Corpus-based Parser Construction. In S. Wermter, E. Riloff and G. Scheler (Eds.), Symbolic, Connectionist, and Statistical Approaches to Learning for Natural Language Processing. Springer Verlag. http://www.cs.utexas.edu/users/ml/papers/chill-bkchapter-95.ps.gz
  Lots more papers about this approach from Mooney's group. In particular, Tang and Mooney (2000) stochasticize this logic-based approach. The standard reference for the CHILLIN algorithm is Zelle, Mooney and Konvisser (1994)
- Robert C. Berwick and Sam Pilato (1987). Learning Syntax by Automata Induction. Machine Learning 2: 9-38. scanned individual pages
  Use "xv -rotate 90" to view the TIFF files. (Sorry about the format - my copy was missing and I had to get a colleague elsewhere to send me a scanned version.) Note: If anyone wants it, I have a hardcopy of the original, more theoretical paper by Angluin on k-reversible learning.
Note: No class on Wednesday April 24.
Week of Apr. 30
- Makoto Kanazawa (1996). Identification in the Limit of Categorial Grammars. Journal of Logic, Language and Information 5(2), 115-155. scanned PDF version
  This turned into Kanazawa's thesis book (1998), Learnable Classes of Categorial Grammars.
- Jason Eisner (2002). Discovering Syntactic Deep Structure via Bayesian Statistics. Cognitive Science 26(3), May. http://cs.jhu.edu/~jason/papers/#cogsci02
- Jason Eisner (2002). Transformational Priors Over Grammars. Submitted to EMNLP. http://cs.jhu.edu/~jason/papers/#emnlp02
  Only one response needed for the two Eisner papers, since they describe the same work. The first paper is high-level and easy to read. The second is a hastily written draft but has a different perspective and many more experimental results.
Monday, May 13: Due date for final project
Wednesday, May 15, 9am-12pm: Project presentation party (in lieu of final exam) with 20-minute talks

Statistical Language Learning Prof. Jason Eisner Course # 600.665 - Spring 2002

Course Description

Vital Statistics

Readings and Responses

The Readings

Statistical Language Learning
Prof. Jason Eisner
Course # 600.665 - Spring 2002