For a crisper, current version, see my research summary.

The narrative on this page tries to tie together my work in a coherent way. It explains how papers I've written fit together, so you can decide which ones are best for you to read ...

(Alternatively, here's a list of papers and list of abstracts).

Unified Approaches

Syntax (Parsing, Learning, MT, ...)

Finite-State Methods

Optimality Theory

Pedagogy

Other Topics

In computational linguistics, we use formal techniques to describe and process language. The development cycle is as follows:

- come up with useful linguistic insights,
- formalize those insights mathematically,
- develop algorithms for the formalisms, and
- evaluate their results on real data.

If you're new to the area, here are a few outreach papers for a general audience:

- The Science of Language: Computational Linguistics (*Imagine...* 2000) is a short overview of computational linguistics, which I was asked to write for a magazine for gifted high school students.
- Cognitive Science and the Search for Intelligence (1991) is even broader, an introduction to cognitive science as a whole. It was the evening lecture to a philosophical society.
- Introduction to the Special Section on Linguistically Apt Statistical Methods (*Cognitive Science* 2002) briefly explains why statistics and linguistics should fall in love.

Since I'm interested not only in competence grammars, but also in robust techniques for comprehension and acquisition, statistical models form an important part of my repertoire. I tend to think of the world as a big parametric probability distribution.

In early 1999, when I went on the job market, I wrote two brief overviews of my work at the time (mainly lexicalized syntax and optimality-theoretic phonology):

- 4-page overview for computer scientists (ps pdf)
- 5-page overview for linguists (ps pdf)

Temperamentally I'm more of a mathematician than an engineer. That is not a value judgment about engineering, nor a scientific judgment that human cognition is neat rather than scruffy. It's just a research style. It means that I prefer formalisms and algorithms that are clean enough to be easily communicated, analyzed, and modified. As a result, I tend to focus on fundamental architectures and tools more than incremental improvements in performance. I'm not above engineering tweaks, but I do try to do them in an elegant and general way.

There are many recurring techniques that appear in NLP and more generally in AI. This is true both at the high level of algorithms and at the low level of implementation. You can spend a long time designing, building, and tuning an experimental system, while having the nagging feeling that you've done a lot of this before.

In recent work, my students and I have been trying to build a unified perspective on the techniques that we use in this field. Our Dyna programming language lets you write down models and algorithms at a high level. They are then compiled into efficient C++ code. We have an accompanying library of techniques for learning model parameters.

To design the language, the compiler, and the machine learning library, we have had to embed the various techniques we know into a common framework. This has been fruitful and enlightening work, and we all use the result to get our research done.

- Dyna: A Declarative Language for Implementing Dynamic Programs (ACL 2004 poster/demo session) gave a first quick look at Dyna.
- Compiling Comp Ling: Practical Weighted Dynamic Programming and the Dyna Language (EMNLP 2005) is a longer overview of Dyna. It gives more details on all fronts -- especially fundamental algorithms, speed measurements, program transformations, and lit review.
- dyna.org is an ever-growing website that explains what, why, and how in great detail. You can also download the language there.

My first real parsing research, summers 1989-1992, was a
collaboration with Mark A. Jones at AT&T Bell Labs. This was early in
the days of probabilistic NLP. The two of us eventually built a
probabilistic left-to-right LFG parser, with a hand-built domain
grammar and history-based probabilities trained from data.
Significantly, our probabilistic parsing model was the first to model
word-to-word ("bilexical") selectional preferences between phrase
heads, such as *eat*'s preference for *hamburger* rather
than *John* as its deep object. Two papers reported the state of
the work as of 1991:

- A Probabilistic Parser Applied to Software Testing Documents (AAAI 1992) gives the clearer description of the application domain and the probability model.
- A Probabilistic Parser and Its Application (workshop 1992) has more discussion of the parsing algorithm, including efficiency issues. It also gives more complete results, including the huge performance gain from using bilexical statistics.

The bilexical idea ultimately proved to be my most important contribution to that project, and it so dramatically improved performance that it continued to influence my direction. A 1994 comment by Mike Schultz got me curious about dependency grammars. This led to a probabilistic dependency parser that I built for fun in 1994, which was interesting for both its cubic-time algorithm and its bilexical parameterization. Unfortunately I didn't bother to evaluate it on a sufficient volume of training and test data until 1996, spurred by the strong results of my friend and colleague Mike Collins, who was working along similar lines. (At that time our parsers had complementary strengths and comparable results; in 1997 Mike built an even better model by combining the strengths of both, in particular switching to the generative approach that I'd found worked best.) The experimental work was reported in two papers:

- Three New Probabilistic Models for Dependency Parsing: An Exploration (COLING 1996) sketched the parsing algorithm and 3 rather different probability models for it, evaluating them on the Penn Treebank. The prettiest model (which was generative) won.
- An Empirical Comparison of Probability Models for Dependency Grammar (TR 1996) extended the experiments from 4,000 to 25,000 training sentences, showing that the results were comparable with Collins (1996). Full experimental details are given here, not to mention two additional models. However, the above COLING paper should probably be read first.

In subsequent papers I developed the algorithmic point further. Others
had been using O(n^{5}) algorithms for bilexical parsing models, whereas the above papers
had found an O(n^{3}) algorithm:

- Bilexical Grammars and Their Cubic-Time Parsing Algorithms (IWPT 1997, book chapter 2000) presents a formal definition of bilexical dependency grammars, and both declarative and procedural versions of the O(n^{3}) parsing algorithm. It also shows how to extend the approach to other cases, including CFGs, link grammars (whose original parsing algorithm is quite similar), and the composition of a bilexical dependency grammar with a finite-state transducer. Note: The paper includes a generic agenda-based parsing algorithm.
- Efficient Parsing for Bilexical Context-Free Grammars and Head Automaton Grammars (ACL 1999, with G. Satta) is probably the best of these papers to read and implement. It improved the above algorithm to run with a smaller grammar constant. The presentation here was in terms of CFGs and HAGs (more general than dependency grammars), and includes both O(n^{3}) and O(n^{4}) algorithms.
- A Faster Parsing Algorithm for Lexicalized Tree-Adjoining Grammars (TAG+5 2000, with G. Satta) used the same insight to speed up TAG parsing from O(n^{8}) to O(n^{7}).
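As an illustration of the span-based dynamic programming behind these papers, here is a minimal sketch of a cubic-time projective dependency parser. It is not the published implementation: the function name, the data layout, and the assumption that arc scores arrive in a precomputed matrix are all mine, and the backpointers needed to recover the actual tree are omitted for brevity.

```python
def eisner_parse(score):
    """Best projective dependency parse score, in O(n^3) time.

    score[h][m] = weight of an arc from head h to modifier m;
    word 0 is an artificial ROOT (no arcs may point into it)."""
    n = len(score)
    NEG = float("-inf")
    # C[s][t][d]: best "complete" half-constituent over span s..t with its
    # head at the left end (d=1) or the right end (d=0).
    # I[s][t][d]: best "incomplete" span s..t whose arc s->t (d=1) or
    # t->s (d=0) is still waiting to collect further modifiers.
    C = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    I = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
    for s in range(n):
        C[s][s][0] = C[s][s][1] = 0.0   # single words are complete
    for w in range(1, n):               # span width
        for s in range(n - w):
            t = s + w
            # attach: two adjacent complete spans joined by a new arc
            best = max(C[s][r][1] + C[r + 1][t][0] for r in range(s, t))
            I[s][t][0] = best + score[t][s]
            I[s][t][1] = best + score[s][t]
            # complete: absorb a finished incomplete span
            C[s][t][0] = max(C[s][r][0] + I[r][t][0] for r in range(s, t))
            C[s][t][1] = max(I[s][r][1] + C[r][t][1]
                             for r in range(s + 1, t + 1))
    return C[0][n - 1][1]               # ROOT heads the whole sentence
```

Because every span is split at a single index r, there are O(n^{2}) spans with O(n) split points each, giving the cubic bound; the head indices never enter the item, which is exactly the trick that avoids the O(n^{5}) behavior mentioned above.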

The above algorithms find word-to-word dependencies in order to let the parser evaluate whether the dependencies are plausible. Usually this evaluation considers the two words being linked. But there are other ways to evaluate a dependency:

- Parsing with Soft and Hard Constraints on Dependency Length (IWPT 2005, with N. A. Smith) argues that dependencies in natural language tend to be short. A parser that is biased toward short dependencies can be faster and sometimes more accurate, as shown by experiments. In fact, placing a hard limit on dependency length allows O(n) (linear-time) parsing, by modifying the algorithms above. The paper also shows how to use the approach for partial parsing.

In unrelated work, en route to grad school I devised a normal-form parsing algorithm for Combinatory Categorial Grammar (CCG), whose correctness I proved the following year. Rather than falling prey to "spurious ambiguity" (the bane of CCG), it was guaranteed to find exactly one parse from every semantic equivalence class.

In summer 2002, a team at the Johns Hopkins CLSP Summer Workshop investigated "tree-to-tree" translation (on dependency trees). That is, the training and decoding methods are given parse trees rather than sentences, and pay attention to syntax. My contribution was the formal architecture we used; a number of researchers expressed interest in reading more about it, so I wrote it up as an ACL short paper (4 pp. + a 10-minute talk -- that's all it deserves until the team gets real empirical results).

- Learning Non-Isomorphic Tree Mappings for Machine Translation (ACL poster/demo session 2003) argues for synchronous TSG as an appropriately expressive model. It also points out that EM alignment, training and decoding are quite tractable by dynamic programming: there are not so many ways to decompose a given tree into elementary trees. The formalism and algorithms are spelled out as carefully and as generally as possible in 4 pages.
- Natural Language Generation in the Context of Machine Translation (tech report 2004) was the workshop team's 87-page final report.

In my thesis work ("transformational smoothing"), I took up the question of learning deep structure, showing how to learn the probabilities of transformations or lexical redundancy rules that could turn one lexicalized context-free rule (or other lexical entry) into another:

- Smoothing a Probabilistic Lexicon Via Syntactic Transformations (Ph.D. thesis 2001) is the full thesis, and includes lots of interesting commentary, algorithms, and detail. Chapter 1 is designed to be a readable overview of the work: it explains the proposed "transformation models" by following a concrete example through the separate perspectives of engineering, linguistics, statistics, and child language acquisition.
- Discovering Syntactic Deep Structure via Bayesian Statistics (*Cognitive Science* 2002) is a synopsis aimed at cognitive scientists. It emphasizes the linguistic motivation for the work.
- Transformational Priors Over Grammars (EMNLP 2002) is aimed at engineers and summarizes the experimental work.

My thesis algorithms led directly into my more general work on flexible parameter estimation for finite-state machines. This is because the "transformation models" that I developed in the thesis can be regarded as probabilistic finite-state automata. Most arcs are epsilon-arcs (hence there are many epsilon-cycles), and the arc probabilities have a log-linear parameterization.
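To make that last point concrete, here is a tiny sketch of a log-linearly parameterized arc distribution: the probabilities of the arcs leaving one state are proportional to the exponentiated dot product of a weight vector with each arc's features. The feature names below are invented for illustration; they are not the thesis's actual feature set.

```python
import math

def arc_probs(arcs, theta):
    """Log-linear arc probabilities for one state of a probabilistic FSM.

    arcs: list of outgoing arcs, each given as a list of feature names.
    theta: dict mapping feature name -> weight (missing features = 0).
    Returns p_i proportional to exp(theta . features(arc_i))."""
    scores = [math.exp(sum(theta.get(f, 0.0) for f in feats))
              for feats in arcs]
    Z = sum(scores)                    # per-state normalizer
    return [s / Z for s in scores]

# Hypothetical arcs and weights, purely for illustration:
p = arc_probs([["epsilon", "delete-C"], ["epsilon"], ["emit-a"]],
              {"epsilon": -0.5, "delete-C": 1.2, "emit-a": 0.3})
```

Tying the same feature weight to many arcs is what lets a few parameters govern a large machine, which is the point of the parameterization.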

Unsupervised learning is inherently difficult. It involves looking for the "best" parameters for a grammar. But how do we define "best"? And how do we find the globally "best" parameters when the objective function has many local maxima? Noah A. Smith and I have considered ways of changing the objective function to improve learning.

- Annealing Techniques for Unsupervised Statistical Language Learning (ACL 2004, with Noah A. Smith) smooths the optimization surface and reintroduces local maxima only gradually. Some of the empirical results improve on the state of the art; others are inconclusive but are discussed.
- Contrastive Estimation: Training Log-Linear Models on Unlabeled Data (ACL 2005, with Noah A. Smith) shows dramatic gains on unsupervised part-of-speech tagging. Instead of maximizing the likelihood of the example strings, it tries to make them more likely than other, superficially similar strings in the same "neighborhood." This makes the learner more like a linguist. Furthermore, it makes unsupervised log-linear estimation tractable, which allows the beneficial incorporation of arbitrary features such as spelling.
- Guiding Unsupervised Grammar Induction Using Contrastive Estimation (IJCAI 2005 Workshop on Grammatical Inference Applications, with Noah A. Smith) shows that the previous idea pays dramatic dividends for grammar induction as well. We argue further that the choice of neighborhood can be used to induce task-appropriate grammars.

Finite-state automata and their stochastic cousins, Hidden Markov Models, are respectively standard topics in the CS and EE curricula. Recently there has been a surge of interest in finite-state approaches to natural language. Though Chomsky argued convincingly that such approaches cannot fully model grammaticality judgments on word sequences, there are reasons to use them:

- Many *other* linguistic phenomena do appear to be genuinely finite-state: phonology, morphology, subcategorization (the right-hand sides of phrase-structure rules), and syntactic chunking.
- Even context-free phenomena can be approximated with finite-state methods. Such approximations can also be reasonable theories of human performance, since in practice humans cannot process more than a couple of levels of center-embedding.
- Finite-state machines can implement common engineering models for many tasks: speech recognition, machine translation, segmentation, normalization, part-of-speech tagging, and other markup. They have also been used for selectional preferences, information extraction, and lexical redundancy rules.
- Finite-state machines are a pleasure to work with: they are efficient, flexible, scalable, and easy to design.

The finite-state machines used for language are generalizations of the
usual FSAs and HMMs. They can produce output as well as reading
input, and they can do so nondeterministically. In short, they
implement a class of *relations*, just as in relational
databases. Relations are more powerful than languages or functions.

- A Note on Join and Auto-Intersection of *n*-ary Rational Relations (FASTAR 2004) focuses on how an *n*-tape FSM can be regarded as a specification of a (possibly infinite) *n*-column relational database. It explores which database operations can be carried out effectively and which are theoretically problematic.

*Learning* of finite-state machines seems particularly
important. Currently I'm focusing on the case of hand-built
probabilistic machines with learnable parameters. Besides giving the
first EM algorithm for the basic case of learning independent arc
probabilities, I showed how my solution generalized to more complex
parameterizations. With such a general learning algorithm at hand,
the extended regular expression calculus becomes a language for
specifying domain-appropriate statistical models. This combines two
finite-state traditions: algebraic (hand-built expert systems) and
statistical (empirically trainable systems).

- Expectation Semirings: Flexible EM for Finite-State Transducers (FSMNLP 2001) was the initial workshop publication (a dense 4-page extended abstract).
- Parameter Estimation for Probabilistic Finite-State Transducers (ACL 2002) is the conference paper version. While still dense, it has rather more detail and an extended example. Read this first, then read the workshop version (which has some additional material on algebraic structure and possible applications).
- A leisurely journal paper on this work is coming soon. In their general form, some of the ideas were adapted from my thesis.
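The semiring itself is small enough to sketch. The operations below follow the papers' definition of the expectation semiring, whose weights are pairs of a probability and an expectation numerator; the toy two-path example in the test is mine.

```python
def oplus(a, b):
    """Semiring addition: sum probabilities and expectation numerators."""
    return (a[0] + b[0], a[1] + b[1])

def otimes(a, b):
    """Semiring multiplication: multiply probabilities; the second
    components combine by the product rule, like derivatives."""
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

ONE = (1.0, 0.0)   # multiplicative identity

def arc(p, count=0.0):
    """Arc weight: (probability, probability * feature count on this arc)."""
    return (p, p * count)

def path(arcs):
    """Total weight of one path: otimes over its arcs."""
    w = ONE
    for a in arcs:
        w = otimes(w, a)
    return w
```

Summing path weights with `oplus` yields (total probability Z, Z times the expected feature count), so dividing the second component by the first gives exactly the expectation that the E step of EM needs -- computed by ordinary forward-style machinery, just over a different semiring.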

The above learning work uses weighted finite-state machines whose arc weights are probabilities or fall in a novel "expectation semiring." Most finite-state constructions can be easily adapted to weighted machines, but minimization is a bit harder.

- Simpler and More General Minimization for Weighted Finite-State Machines (HLT-NAACL 2003) presents nice minimization algorithms for a general class of weighted finite-state machines. It also gives theoretical limits on minimization, which can be NP-complete and inapproximable if you get *too* creative with the weights.

Since Feb. 2001 I have served as President of the ACL's Special Interest Group for Computational Phonology (SIGPHON). It was through a phonology course in 1995 that I first got interested in Optimality Theory (Prince and Smolensky 1993) and the formal questions it raises. OT has eclipsed previous approaches to phonology. It says that when you're trying to pronounce a word ("resign" or "resignation"), you consider all possible pronunciations and pick the best one. A grammar in OT simply specifies the criteria on which pronunciations are judged, and especially the relative importance of those criteria, which are known as violable constraints.

- Review of *Optimality Theory* by René Kager (*Computational Linguistics* 2000) reviews a leading OT textbook. It also discusses the relation of OT to other methods in computational linguistics, such as log-linear models (Gibbs distributions) and decision lists.
- Comprehension and Compilation in Optimality Theory (ACL 2002) (discussed below) might help you understand OT at an abstract level: it includes a cute metaphor and some helpful notation.
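The core of OT evaluation is simple enough to sketch in a few lines: rank the constraints, score each candidate's violations, and compare the violation vectors lexicographically. The toy constraints below (a NoCoda-style markedness constraint and a Max-style faithfulness constraint) are invented for illustration and drastically simplified.

```python
def ot_winner(candidates, constraints):
    """Optimality Theory evaluation over a finite candidate set.

    constraints: violation-counting functions, highest-ranked first.
    Lists compare lexicographically in Python, so one violation of a
    high-ranked constraint outweighs any number of lower-ranked ones."""
    return min(candidates, key=lambda cand: [c(cand) for c in constraints])

# Toy grammar, purely illustrative:
def no_coda(cand):
    """Markedness: penalize a word-final consonant."""
    return 1 if cand and cand[-1] not in "aeiou" else 0

def max_io(cand, underlying="bat"):
    """Faithfulness: penalize segments deleted from the underlying form."""
    return len(underlying) - len(cand)
```

With NoCoda ranked above Max, the candidate "ba" beats the faithful "bat"; reversing the ranking reverses the outcome, which is how OT derives cross-linguistic variation from one constraint set.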

Newcomers to OT always ask two questions: What kinds of constraints may OT grammars include? And how can a speaker manage to sift through an infinite set to find the best pronunciation? The two questions are related. Only after pinning down the formalism can one say much about computation. So I tried to do both:

- What Constraints Should OT Allow? (LSA 1997) is the seminal paper on Primitive OT (OTP). (Updated version: Doing OT in a Straitjacket (1999).) It argues that a very simple OT formalism is enough -- that with appropriate phonological representations, nearly all the OT constraints in the linguistic literature can be replaced by simple, local constraints of just two kinds.
- Efficient Generation in Primitive Optimality Theory (ACL 1997) gives the computational take on OTP. It explains how to find the best pronunciation in an OTP grammar. The trick is to turn OTP's representations into strings and its constraints into weighted finite-state automata.
- FootForm Decomposed: Using primitive constraints in OT (MITWPL 1997) defends OTP's decision to disallow Generalized Alignment (GA) constraints, which are overly powerful on both linguistic and computational grounds. This paper successfully reanalyzes "iterative" metrical stress within OTP, not using GA at all. The new analysis is arguably better: it explains typological gaps that were previously mysterious or unremarked.

- Directional Constraint Evaluation in Optimality Theory (COLING 2000) offers a solution to a known obstacle: because standard OT constraints count violations without bound, an OT grammar cannot in general be compiled into a finite-state transducer. If constraints are evaluated "directionally" rather than by counting, then grammars can be compiled into FSTs. Directional evaluation is also attractive on empirical grounds (it naturally describes iterative phenomena without using GA) and on aesthetic grounds (it resembles constraint ranking).
- Comprehension and Compilation in Optimality Theory (ACL 2002) unifies directional constraints with other proposals for changing the constraint evaluation strategy. The other half of the paper discusses theoretical and practical issues with comprehension (i.e., how to map a pronunciation back to the underlying forms for which it is optimal).

- Easy and Hard Constraint Ranking in Optimality Theory: Algorithms and Complexity (SIGPHON 2000) shows that easy OT learning problems are easier than previously known, but hard problems are much harder, falling into high complexity classes.

The following papers might be useful if you are planning to teach these topics to someone else or to yourself.

- An Interactive Spreadsheet for Teaching the Forward-Backward Algorithm (TNLP 2002) will amaze your students, and help them develop intuitions about HMMs and EM.
- State-of-the-Art Algorithms for Minimum Spanning Trees: A Tutorial Discussion (1997) tries to provide "deep" insight into some advanced graph algorithms. It does not cover the most recent algorithms (e.g., Pettie & Ramachandran), but faculty elsewhere have enthusiastically recommended it to their students.
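For readers who prefer code to spreadsheets, the forward-backward algorithm itself fits in a short sketch, assuming a discrete HMM given as plain probability tables (the parameter values used in the test are mine).

```python
def forward_backward(obs, pi, A, B):
    """Posterior state probabilities gamma[t][i] = P(state i at time t | obs)
    for a discrete HMM: pi[i] initial, A[i][j] transition, B[i][o] emission."""
    n, S = len(obs), len(pi)
    alpha = [[0.0] * S for _ in range(n)]   # forward probabilities
    beta = [[0.0] * S for _ in range(n)]    # backward probabilities
    for i in range(S):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, n):                   # forward pass
        for j in range(S):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t - 1][i] * A[i][j] for i in range(S))
    for i in range(S):
        beta[n - 1][i] = 1.0
    for t in range(n - 2, -1, -1):          # backward pass
        for i in range(S):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(S))
    Z = sum(alpha[n - 1][i] for i in range(S))  # total prob of obs sequence
    return [[alpha[t][i] * beta[t][i] / Z for i in range(S)]
            for t in range(n)]
```

These posteriors are the expected counts that the M step of EM renormalizes; a real implementation would also rescale alpha and beta per time step to avoid underflow on long sequences.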

You might also be interested in the Dyna programming language, which provides a unified perspective that might be useful for teaching.

These papers were one-off efforts, so they don't fit neatly into the narrative. But that doesn't mean they're not worth reading! (Especially the first two.)

- **Predictive text entry:** Radiology Report Entry with Automatic Phrase Completion Driven by Language Modeling (*Radiographics* 2004, with John Eng) speeds up text entry in a low-entropy domain by predicting what the user will type. The user may ignore, accept, or partially accept the predicted phrase. The tricky part, mathematically, is choosing how many words to predict at once.
- **Semantics:** `All'-less in Wonderland? Revisiting *any* (ESCOL 1995) resurrects an old position about the meaning and usage of the English word *any*.
- **Clustering:** Unsupervised Classification via Decision Trees: An Information-Theoretic Perspective gives a divisive method for hierarchical clustering of sequences. It is based on information-theoretic principles and finds the true clustering when the sequences are sufficiently long.
- **Neural nets:** Dynamical-Systems Behavior in Recurrent and Non-Recurrent Connectionist Nets (A.B. thesis 1990) proposes "gradual-response" neural networks in which node activation changes continuously over time. It gives a general learning algorithm that can train the trajectory through activation space to have a desired shape.
- **Coreference resolution:** Description of the University of Pennsylvania entry in the MUC-6 competition (1995).
- **Information filtering**, **text compression**, etc.: Patents with Fred Herz and others. Most of this work tries to use statistical prediction to help users cope with information overload. The lawyers uglified the writing, though.

This page online:

`http://cs.jhu.edu/~jason/papers/narrative.html`

Jason Eisner - jason@cs.jhu.edu (tech correspondence welcome) |
Last modified 2006/10/16 |