The narrative on this page tries to tie together my work in a coherent way. It explains how papers I've written fit together, so you can decide which ones are best for you to read ...
(Alternatively, here's a list of papers and list of abstracts).
In computational linguistics, we use formal techniques to describe and process language. The development cycle is as follows:
If you're new to the area, here are a few outreach papers for a general audience:
Since I'm interested not only in competence grammars, but also in robust techniques for comprehension and acquisition, statistical models form an important part of my repertoire. I tend to think of the world as a big parametric probability distribution.In early 1999 when I went on the job market, I wrote two brief overviews of my work at the time (mainly lexicalized syntax and optimality-theoretic phonology):
Temperamentally I'm more of a mathematician than an engineer. That is not a value judgment about engineering, nor a scientific judgment that human cognition is neat rather than scruffy. It's just a research style. It means that I prefer formalisms and algorithms that are clean enough to be easily communicated, analyzed, and modified. As a result, I tend to focus on fundamental architectures and tools more than incremental improvements in performance. I'm not above engineering tweaks, but I do try to do them in an elegant and general way.
There are many recurring techniques that appear in NLP and more generally in AI. This is true both at the high level of algorithms and at the low level of implementation. You can spend a long time designing, building, and tuning an experimental system, while having the nagging feeling that you've done a lot of this before.
In recent work, my students and I have been trying to build a unified perspective on the techniques that we use in this field. Our Dyna programming language lets you write down models and algorithms at a high level. They are then compiled into efficient C++ code. We have an accompanying library of techniques for learning model parameters.
To design the language, the compiler, and the machine learning library, we have had to embed the various techniques we know into a common framework. This has been fruitful and enlightening work, and we all use the result to get our research done.
My first real parsing research, summers 1989-1992, was a collaboration with Mark A. Jones at AT&T Bell Labs. This was early in the days of probabilistic NLP. The two of us eventually built a probabilistic left-to-right LFG parser, with a hand-built domain grammar and history-based probabilities trained from data. Significantly, our probabilistic parsing model was the first to model word-to-word ("bilexical") selectional preferences between phrase heads, such as eat's preference for hamburger rather than John as its deep object. Two papers reported the state of the work as of 1991:
The bilexical idea ultimately proved to be my most important contribution to that project, and it so dramatically improved performance that it continued to influence my direction. A 1994 comment by Mike Schultz got me curious about dependency grammars. This led to a probabilistic dependency parser that I built for fun in 1994, which was interesting for both its cubic-time algorithm and its bilexical parameterization. Unfortunately I didn't bother to evaluate it on a sufficient volume of training and test data until 1996, spurred by the strong results of my friend and colleague Mike Collins, who was working along similar lines. (At that time our parsers had complementary strengths and comparable results; in 1997 Mike built an even better model by combining the strengths of both, in particular switching to the generative approach that I'd found worked best.) The experimental work was reported in two papers:
In subsequent papers I developed the algorithmic point further. Others had been using O(n5) algorithms for bilexical parsing models, whereas the above papers had found an O(n3) algorithm:
The above algorithms find word-to-word dependencies in order to let the parser evaluate whether the dependencies are plausible. Usually this evaluation considers the two words being linked. But there are other ways to evalute a dependency:
In unrelated work, en route to grad school I devised a normal-form parsing algorithm for Combinatory Categorial Grammar (CCG), whose correctness I proved the following year. Rather than falling prey to "spurious ambiguity" (the bane of CCG), it was guaranteed to find exactly one parse from every semantic equivalence class:
In summer 2002, a team at the Johns Hopkins CLSP Summer Workshop investigated "tree-to-tree" translation (on dependency trees). That is, the training and decoding methods are given parse trees rather than sentences, and pay attention to syntax. My contribution was the formal architecture we used; a number of researchers expressed interest in reading more about it, so I wrote it up as an ACL short paper (4 pp. + a 10-minute talk -- that's all it deserves until the team gets real empirical results).
In my thesis work ("transformational smoothing"), I took up the question of learning deep structure, showing how to learn the probabilities of transformations or lexical redundancy rules that could turn one lexicalized context-free rule (or other lexical entry) into another:
My thesis algorithms led directly into my more general work on flexible parameter estimation for finite-state machines. This is because the "transformation models" that I developed in the thesis can be regarded as probabilistic finite-state automata. Most arcs are epsilon-arcs (hence there are many epsilon-cycles), and the arc probabilities have a log-linear parameterization.
Unsupervised learning is inherently difficult. It involves looking for the "best" parameters for a grammar. But how do we define "best"? And how do we find the globally "best" parameters when the objective function has many local maxima? Noah A. Smith and I have considered ways of changing the objective function to improve learning.
Finite-state automata and their stochastic cousins, Hidden Markov Models, are respectively standard topics in the CS and EE curricula. Recently there has been a surge of interest in finite-state approaches to natural language. Though Chomsky argued convincingly that such approaches cannot fully model grammaticality judgments on word sequences, there are reasons to use them:
The finite-state machines used for language are generalizations of the usual FSAs and HMMs. They can produce output as well as reading input, and they can do so nondeterministically. In short, they implement a class of relations, just as in relational databases. Relations are more powerful than languages or functions.
Learning of finite-state machines seems particularly important. Currently I'm focusing on the case of hand-built probabilistic machines with learnable parameters. Besides giving the first EM algorithm for the basic case of learning independent arc probabilities, I showed how my solution generalized to more complex parameterizations. With such a general learning algorithm at hand, the extended regular expression calculus becomes a language for specifying domain-appropriate statistical models. This combines two finite-state traditions: algebraic (hand-built expert systems) and statistical (empirically trainable systems).
The above learning work uses weighted finite-state machines whose arc weights are probabilities or fall in a novel "expectation semiring." Most finite-state constructions can be easily adapted to weighted machines, but minimization is a bit harder.
Since Feb. 2001 I have served as President of the ACL's Special Interest Group for Computational Phonology (SIGPHON). It was through a phonology course in 1995 that I first got interested in Optimality Theory (Prince and Smolensky 1993) and the formal questions it raises. OT has eclipsed previous approaches to phonology. It says that when you're trying to pronounce a word ("resign" or "resignation"), you consider all possible pronunciations and pick the best one. A grammar in OT simply specifies the criteria on which pronunciations are judged, and especially the relative importance of those criteria, which are known as violable constraints.
Newcomers to OT always ask two questions: What kinds of constraints may OT grammars include? And how can a speaker manage to sift through an infinite set to find the best pronunciation? The two questions are related. Only after pinning down the formalism can one say much about computation. So I tried to do both:
The following papers might be useful if you are planning to teach these topics to someone else or to yourself.
You might also be interested in the Dyna
programming language, which provides a unified perspective that
might be useful for teaching.
These papers were one-off efforts, so they don't fit neatly into the narrative. But that doesn't mean they're not worth reading! (Especially the first two.)
|Jason Eisner - email@example.com (tech correspondence welcome)||Last Mod $Date: 2006/10/16 01:26:17 $|