Jason Eisner - All Papers and Slides

Publications listed here in reverse chronological order.
Also available:

Technical correspondence welcome! (email)


Learning Speed-Accuracy Tradeoffs in Nondeterministic Inference Algorithms (2011)

Abstract:
Could we explicitly train test-time inference heuristics to trade off accuracy and efficiency? We focus our discussion on agenda-based natural language parsing under a weighted context-free grammar. We frame the problem as reinforcement learning, discuss its special properties, and propose new strategies.
Citation (BibTeX):
Jason Eisner and Hal Daumé III (2011). Learning speed-accuracy tradeoffs in nondeterministic inference algorithms. Proceedings of COST: NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning, Sierra Nevada, Spain, December.
On-line document:
5 pp. PDF (199K)

Learning Cost-Aware, Loss-Aware Approximate Inference Policies for Probabilistic Graphical Models (2011)

Abstract:
Probabilistic graphical models are typically trained to maximize the likelihood of the training data and evaluated on some measure of accuracy on the test data. However, we are also interested in learning to produce predictions quickly. For example, one can speed up loopy belief propagation by choosing sparser models and by stopping at some point before convergence. We manage the speed-accuracy tradeoff by explicitly optimizing for a linear combination of speed and accuracy. Although this objective is not differentiable, we can compute the gradient of a smoothed version.
Citation (BibTeX):
Veselin Stoyanov and Jason Eisner (2011). Learning cost-aware, loss-aware approximate inference policies for probabilistic graphical models. Proceedings of COST: NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning, Sierra Nevada, Spain, December.
On-line document:
5 pp. PDF (367K)

Transformation Process Priors (2011)

Abstract:
Because of the neutrality property, a Dirichlet (process) prior on a discrete distribution cannot capture correlations among the probabilities of "similar" events. We propose obtaining the discrete distribution instead from a random walk model or transformation model, in which each observed event has evolved via a latent sequence of transformations. We are exploring transformation models in which the conditional distributions have infinite support and the prior over them is nonparametric.
Citation (BibTeX):
Nicholas Andrews and Jason Eisner (2011). Transformation process priors. Proceedings of NIPS 2011 Workshop on Bayesian Nonparametrics: Hope or Hype?, Sierra Nevada, Spain, December.
On-line document:
3 pp. PDF (213K)

Shared Components Topic Models with Application to Selectional Preference (2011)

Abstract:
Latent Dirichlet Allocation (LDA) has been used to learn selectional preferences as soft disjunctions over flat semantic classes. Our model, the SCTM, also learns the structure of each class as a soft conjunction of high-level semantic features.
Citation (BibTeX):
Matthew R. Gormley, Mark Dredze, Benjamin Van Durme, and Jason Eisner (2011). Shared components topic models with application to selectional preference. Proceedings of NIPS 2011 Workshop on Learning Semantics, Sierra Nevada, Spain, December.
On-line document:
3 pp. PDF (129K)

Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model (2011)

Abstract:
We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50-100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.
Citation (BibTeX):
Markus Dreyer and Jason Eisner (2011). Discovering morphological paradigms from plain text using a Dirichlet process mixture model. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 616-627, Edinburgh, July.
On-line document:
21 pp. PDF including supplement (2.2M)
Slides:
PDF from thesis defense (12M)

Minimum Imputed Risk: Unsupervised Discriminative Training for Machine Translation (2011)

Abstract:
Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough "reverse" translation system. Intuitively, our method strives to ensure that probabilistic "round-trip" translation from a target-language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.
Citation (BibTeX):
Zhifei Li, Jason Eisner, Ziyuan Wang, Sanjeev Khudanpur, and Brian Roark (2011). Minimum imputed risk: Unsupervised discriminative training for machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 920-929, Edinburgh, July.
On-line document:
10 pp. PDF (256K)

Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure (2011)

Abstract:
Graphical models are often used "inappropriately," with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic meta-descent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.
Citation (BibTeX):
Veselin Stoyanov, Alexander Ropson, and Jason Eisner (2011). Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, April. JMLR Workshop and Conference Proceedings, 15:725-733.
On-line document:
13 pp. PDF including supplement (636K)
Poster (4+ feet by 3 feet):
PDF (1.4M)

Dyna: Extending Datalog for Modern AI (2011)

Abstract:
Modern statistical AI systems are quite large and complex; this interferes with research, development, and education. We point out that most of the computation involves database-like queries and updates on complex views of the data. Specifically, recursive queries look up and aggregate relevant or potentially relevant values. If the results of these queries are memoized for reuse, the memos may need to be updated through change propagation. We propose a declarative language, which generalizes Datalog, to support this work in a generic way. Through examples, we show that a broad spectrum of AI algorithms can be concisely captured by writing down systems of equations in our notation. Many strategies could be used to actually solve those systems. Our examples motivate certain extensions to Datalog, which are connected to functional and object-oriented programming paradigms.
Citation (BibTeX):
Jason Eisner and Nathaniel W. Filardo (to appear 2011). Dyna: Extending Datalog for Modern AI. In Tim Furche, Georg Gottlob, Giovanni Grasso, Oege de Moor, and Andrew Sellers (eds.), Datalog 2.0. Lecture Notes in Computer Science.
On-line document:
Long version 55 pp. PDF (842K), short version 40 pp. with wide margins PDF (66K)

Novel System Combination Approaches for MT (2011)

Citation (BibTeX):
Antti-Veikko Rosti, Eugene Matusov, Jason Smith, Necip Ayan, Jason Eisner, Damianos Karakos, Sanjeev Khudanpur, Gregor Leusch, Zhifei Li, Spyros Matsoukas, Hermann Ney, Richard Schwartz, B. Zhang, and J. Zheng (2011). Novel system combination approaches for MT. In Handbook of Natural Language Processing and Machine Translation, pp. 333-361, Springer.
On-line document:
29 pp. PDF (863K)

Unsupervised Discriminative Language Model Training for Machine Translation using Simulated Confusion Sets (2010)

Abstract:
An unsupervised discriminative training procedure is proposed for estimating a language model (LM) for machine translation (MT). An English-to-English synchronous context-free grammar is derived from a baseline MT system to capture translation alternatives: pairs of words, phrases or other sentence fragments that potentially compete to be the translation of the same source-language fragment. Using this grammar, a set of impostor sentences is then created for each English sentence to simulate confusions that would arise if the system were to process an (unavailable) input whose correct English translation is that sentence. An LM is then trained to discriminate between the original sentences and the impostors. The procedure is applied to the IWSLT Chinese-to-English translation task, and promising improvements on a state-of-the-art MT system are demonstrated.
Citation (BibTeX):
Zhifei Li and Ziyuan Wang and Sanjeev Khudanpur and Jason Eisner (2010). Unsupervised discriminative language model training for machine translation using simulated confusion sets. Proceedings of COLING, pages 656-664, Beijing, August.
On-line document:
9 pp. PDF (536K)

Favor Short Dependencies: Parsing with Soft and Hard Constraints on Dependency Length (2010)

Abstract:
In lexicalized phrase-structure or dependency parses, a word's modifiers tend to fall near it in the string. This fact can be exploited by parsers. We first show that a crude way to use dependency length as a parsing feature can substantially improve parsing speed and accuracy in English and Chinese, with more mixed results on German. We then show similar improvements by imposing hard bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. The approach with hard bounds, "vine grammar," accepts only a regular language, even though it happily retains a context-free parameterization and defines meaningful parse trees. We show how to parse this language in O(n) time, using a novel chart parsing algorithm with a low grammar constant (rather than an impractically large finite-state recognizer with an exponential grammar constant). For a uniform hard bound of k on dependencies of all types, our algorithm's runtime is O(nk2). We also extend our algorithm to parse weighted-FSA inputs such as lattices.
Citation (BibTeX):
Jason Eisner and Noah A. Smith (2010). Favor short dependencies: Parsing with soft and hard constraints on dependency length. In Harry Bunt, Paola Merlo, and Joakim Nivre (eds.), Trends in Parsing Technology: Dependency Parsing, Domain Adaptation, and Deep Parsing, pages 121-150. Springer.
On-line document:
31 pp. PDF (280K; the original publication is available at www.springerlink.com)
Slides:
powerpoint (1295K, animation)
Relationship to other papers:
This extends the work in Smith & Eisner (2005) with considerable new material.

Weighted Deduction as an Abstraction Level for AI (2009)

Invited talk at ILP+MLG+SRL, July 2009.

Abstract:
The field of AI has become implementation-bound. We have plenty of ideas, but it is increasingly laborious to try them out, as our models become more ambitious and our datasets become larger, noisier, and more heterogeneous. The software engineering burden makes it hard to start new work; hard to reuse and combine existing ideas; and hard to educate our students.

In this talk, I'll propose to hide many common implementation details behind a new level of abstraction that we are developing. Dyna is a declarative programming language that combines logic programming with functional programming. It also supports modularity. It may be regarded as a kind of deductive database, theorem prover, truth maintenance system, or equation solver.

I will illustrate how Dyna makes it easy to specify the combinatorial structure of typical computations needed in natural language processing, machine learning, and elsewhere in AI. Then I will sketch implementation strategies and program transformations that can help to make these computations fast and memory-efficient. Finally, I will suggest that machine learning should be used to search for the right strategies for a program on a particular workload.

Slides:
powerpoint (2.8M), video lecture

Joint Models with Missing Data for Semi-Supervised Learning (2009)

Keynote talk at the NAACL HLT Workshop on Semi-supervised Learning for Natural Language Processing, June 2009. To all of you who asked in person for the slides, please email me (jason at cs dot jhu dot edu). Note that the talk referred to this work, this work, this work, and this work, among others.


First- and Second-Order Expectation Semirings with Applications to Minimum-Risk Training on Translation Forests (2009)

Abstract:
Many statistical translation models can be regarded as weighted logical deduction. Under this paradigm, we use weights from the expectation semiring (Eisner, 2002), to compute first-order statistics (e.g., the expected hypothesis length or feature counts) over packed forests of translations (lattices or hypergraphs). We then introduce a novel second-order expectation semiring, which computes second-order statistics (e.g., the variance of the hypothesis length or the gradient of entropy). This second-order semiring is essential for many interesting training paradigms such as minimum risk, deterministic annealing, active learning, and semi-supervised learning, where gradient descent optimization requires computing the gradient of entropy or risk. We use these semirings in an open-source machine translation toolkit, Joshua, enabling minimum-risk training for a benefit of up to 1.0 BLEU point.
Citation (BibTeX):
Zhifei Li and Jason Eisner (2009). First- and second-order expectation semirings with applications to minimum-risk training on translation forests. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 40-51, Singapore, August.
On-line document:
12 pp. PDF (633K)
Slides:
animated PDF (7539K), printable PDF (4271K)

Graphical Models over Multiple Strings (2009)

Abstract:
We study graphical modeling in the case of string-valued random variables. Whereas a weighted finite-state transducer can model the probabilistic relationship between two strings, we are interested in building up joint models of three or more strings. This is needed for inflectional paradigms in morphology, cognate modeling or language reconstruction, and multiple-string alignment. We propose a Markov Random Field in which each factor (potential function) is a weighted finite-state machine, typically a transducer that evaluates the relationship between just two of the strings. The full joint distribution is then a product of these factors. Though decoding is actually undecidable in general, we can still do efficient joint inference using approximate belief propagation; the necessary computations and messages are all finite-state. We demonstrate the methods by jointly predicting morphological forms.
Citation (BibTeX):
Markus Dreyer and Jason Eisner (2009). Graphical models over multiple strings. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 101-110, August.
On-line document:
10 pp. PDF (347K)
Slides:
keynote (5.3M), quicktime (2.7M), PDF (5.5M)

Parser Adaptation and Projection with Quasi-Synchronous Grammar Features (2009)

Abstract:
We connect two scenarios in structured learning: adapting a parser trained on one corpus to another annotation style, and projecting syntactic annotations from one language to another. We propose quasi-synchronous grammar (QG) features for these structured learning tasks. That is, we score a aligned pair of source and target trees based on local features of the trees and the alignment. Our quasi-synchronous model assigns positive probability to any alignment of any trees, in contrast to a synchronous grammar, which would insist on some form of structural parallelism.

In monolingual dependency parser adaptation, we achieve high accuracy in translating among multiple annotation styles for the same sentence. On the more difficult problem of cross-lingual parser projection, we learn a dependency parser for a target language by using bilingual text, an English parser, and automatic word alignments. Our experiments show that unsupervised QG projection improves on parses trained using only high-precision projected annotations and far outperforms, by more than 35% absolute dependency accuracy, learning an unsupervised parser from raw target-language text alone. When a few target-language parse trees are available, projection gives a boost equivalent to doubling the number of target-language trees.

Citation (BibTeX):
David A. Smith and Jason Eisner (2009). Parser adaptation and projection with quasi-synchronous grammar features. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 822-831, August.
On-line document:
10 pp. PDF (421K)
Slides:
powerpoint (535K)

Learning Linear Ordering Problems for Better Translation (2009)

Abstract:
We apply machine learning to the Linear Ordering Problem in order to learn sentence-specific reordering models for machine translation. We demonstrate that even when these models are used as a mere preprocessing step for German-English translation, they significantly outperform Moses' integrated lexicalized reordering model.

Our models are trained on automatically aligned bitext. Their form is simple but novel. They assess, based on features of the input sentence, how strongly each pair of input word tokens wi;wj would like to reverse their relative order. Combining all these pairwise preferences to find the best global reordering is NP-hard. However, we present a non-trivial O(n3) algorithm, based on chart parsing, that at least finds the best reordering within a certain exponentially large neighborhood. We show how to iterate this reordering process within a local search algorithm, which we use in training.

Citation (BibTeX):
Roy Tromble and Jason Eisner (2009). Learning linear ordering problems for better translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1007-1016, August.
On-line document:
10 pp. PDF (192K)
Slides:
PDF (341K), OpenOffice (188K)

Variational Decoding for Statistical Machine Translation (2009)

Nominated for best paper award.
Abstract:
Statistical models in machine translation exhibit spurious ambiguity. That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations). In principle, the goodness of a string is measured by the total probability of its many derivations. However, finding the best string (e.g., during decoding) is then computationally intractable. Therefore, most systems use a simple Viterbi approximation that measures the goodness of a string using only its most probable derivation. Instead, we develop a variational approximation, which considers all the derivations but still allows tractable decoding. Our particular variational distributions are parameterized as n-gram models. We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). Experiments show that our approach improves the state of the art.
Citation (BibTeX):
Zhifei Li, Jason Eisner, and Sanjeev Khudanpur (2009). Variational decoding for statistical machine translation. Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL), pages 593-601, Singapore, August.
On-line document:
9 pp. PDF (231K)
Slides:
animated PDF (3874K), printable PDF (2982K)

Cross-Document Coreference Resolution: A Key Technology for Learning by Reading (2009)

Abstract:
Automatic knowledge base population from text is an important technology for a broad range of approaches to learning by reading. Effective automated knowledge base population depends critically upon coreference resolution of entities across sources. Use of a wide range of features, both those that capture evidence for entity merging and those that argue against merging, can significantly improve machine learning-based cross-document coreference resolution. Results from the Global Entity Detection and Recognition task of the NIST Automated Content Extraction (ACE) 2008 evaluation support this conclusion.
Citation (BibTeX):
James Mayfield, David Alexander, Bonnie Dorr, Jason Eisner, Tamer Elsayed, Tim Finin, Clay Fink, Marjorie Freedman, Nikesh Garera, Paul McNamee, Saif Mohammad, Douglas Oard, Christine Piatko, Asad Sayeed, Zareen Syed, Ralph Weischedel, Tan Xu and David Yarowsky (2009). Cross-document coreference resolution: A key technology for learning by reading. Proceedings of the AAAI 2009 Spring Symposium on Learning by Reading and Learning to Read, March.
On-line document:
6 pp. PDF (278K)

Dyna: A Non-Probabilistic Programming Language for Probabilistic AI (2008)

Abstract:
The Dyna programming language is intended to provide an declarative abstraction layer for building systems in ML and AI. It extends logic programming with weights in a way that resembles functional programming. The weights are often probabilities. Yet Dyna does not enforce a probabilistic semantics, since many AI and ML methods work with inexact probabilities (e.g., bounds) and other numeric and non-numeric quantities. Instead Dyna aims to provide a flexible abstraction layer that is "one level lower," and whose efficient implementation will be able to serve as infrastructure for building a variety of toolkits, languages, and specific systems.
Citation (BibTeX):
Jason Eisner (2008). Dyna: A non-probabilistic programming language for probabilistic AI. Extended abstract for talk at the NIPS*2008 Workshop on Probabilistic Programming, December.
On-line document:
2 pp. PDF (88K, extended abstract)
Slides:
longer talk (6764K, animation; includes extra slides at end about assorted applications)

Machine Learning with Annotator Rationales to Reduce Annotation Cost (2008)

Abstract:
We review two novel methods for text categorization, based on a new framework that utilizes richer annotations that we call annotator rationales. A human annotator provides hints to a machine learner by highlighting contextual "rationales" in support of each of his or her annotations. We have collected such rationales, in the form of substrings, for an existing document sentiment classification dataset [1]. We have developed two methods, one discriminative [2] and one generative [3], that use these rationales during training to obtain significant accuracy improvements over two strong baselines. Our generative model in particular could be adapted to help learn other kinds of probabilistic classifiers for quite different tasks. Based on a small study of annotation speed, we posit that for some tasks, providing rationales can be a more fruitful use of an annotator's time than annotating more examples.
Citation (BibTeX):
Omar F. Zaidan, Jason Eisner, and Christine Piatko (2008). Machine learning with annotator rationales to reduce annotation cost. Proceedings of the NIPS*2008 Workshop on Cost Sensitive Learning, December.
On-line document:
10 pp. PDF (246K)
Slides:
powerpoint (4641K, much animation)
Poster (4 feet by 3 feet):
powerpoint (887K), PDF (2779K)
Relationship to other papers:
This paper is just a synthesis of Zaidan et al. (2007) and Zaidan & Eisner (2008). (I hate doing multiple papers on the same work, but I wanted the ML community outside of NLP to see these results, so I asked the workshop organizers for permission to republish them here.)

Dependency Parsing by Belief Propagation (2008)

Abstract:
We formulate dependency parsing as a graphical model with the novel ingredient of global constraints. We show how to apply loopy belief propagation (BP), a simple and effective tool for approximate learning and inference. As a parsing algorithm, BP is both asymptotically and empirically efficient. Even with second-order features or latent variables, which would make exact parsing considerably slower or NP-hard, BP needs only O(n3) time with a small constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features would increase the runtime additively rather than multiplicatively.
Citation (BibTeX):
David A. Smith and Jason Eisner (2008). Dependency parsing by belief propagation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 145-156, Honolulu, October.
On-line document:
12 pp. PDF (682K)
Slides:
powerpoint (this is a longer talk that includes some background; 3449K, much animation)

Modeling Annotators: A Generative Approach to Learning from Annotator Rationales (2008)

Abstract:
A human annotator can provide hints to a machine learner by highlighting contextual "rationales" for each of his or her annotations (Zaidan et al., 2007). How can one exploit this side information to better learn the desired parameters θ? We present a generative model of how a given annotator, knowing the true θ, stochastically chooses rationales. Thus, observing the rationales helps us infer the true θ. We collect substring rationales for a sentiment classification task (Pang and Lee, 2004) and use them to obtain significant accuracy improvements for each annotator. Our new generative approach exploits the rationales more effectively than our previous "masking SVM" approach. It is also more principled, and could be adapted to help learn other kinds of probabilistic classifiers for quite different tasks.
Citation (BibTeX):
Omar F. Zaidan and Jason Eisner (2008). Modeling annotators: A generative approach to learning from annotator rationales. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 31-40, Honolulu, October.
On-line document:
10 pp. PDF (248K)
Slides:
powerpoint (1871K, animation?)

Latent-Variable Modeling of String Transductions with Finite-State Methods (2008)

Abstract:
String-to-string transduction is a central problem in computational linguistics and natural language processing. It occurs in tasks as diverse as name transliteration, spelling correction, pronunciation modeling and inflectional morphology. We present a conditional log-linear model for string-to-string transduction, which employs overlapping features over latent alignment sequences, and which learns latent classes and latent string pair regions from incomplete training data. We evaluate our approach on morphological tasks and demonstrate that latent variables can dramatically improve results, even when trained on small data sets. On the task of generating morphological forms, we outperform a baseline method reducing the error rate by up to 48%. On a lemmatization task, we reduce the error rates in Wicentowski (2002) by 38-92%.
Citation (BibTeX):
Markus Dreyer, Jason R. Smith, and Jason Eisner (2008). Latent-variable modeling of string transductions with finite-state methods. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1080-1089, Honolulu, October.
On-line document:
10 pp. PDF (531K)

Competitive Grammar Writing (2008)

Abstract:
Just as programming is the traditional introduction to computer science, writing grammars by hand is an excellent introduction to many topics in computational linguistics. We present and justify a well-tested introductory activity in which teams of mixed background compete to write probabilistic context-free grammars of English. The exercise brings together symbolic, probabilistic, algorithmic, and experimental issues in a way that is accessible to novices and enjoyable.
Citation (BibTeX):
Jason Eisner and Noah A. Smith (2008). Competitive grammar writing. Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics, pages 97-105, Columbus, June.
On-line document:
8 pp. PDF (152K)
Slides:
powerpoint (4038K, some animation)
Other resources:
our reusable instructional materials

Machine Translation System Combination Using ITG-Based Alignments (2008)

Abstract:
Given several systems' automatic translations of the same sentence, we show how to combine them into a confusion network, whose various paths represent composite translations that could be considered in a subsequent rescoring step. We build our confusion networks using the method of Rosti et al. (2007), but, instead of forming alignments using the tercom script (Snover et al., 2006), we create alignments that minimize invWER (Leusch et al., 2003), a form of edit distance that permits properly nested block movements of substrings. Oracle experiments with Chinese newswire and weblog translations show that our confusion networks contain paths which are significantly better (in terms of BLEU and TER) than those in tercom-based confusion networks.
Citation (BibTeX):
Damianos Karakos, Jason Eisner, Sanjeev Khudanpur, and Markus Dreyer (2008). Machine translation system combination using ITG-based alignments. Proceedings of ACL-08: HLT, Short Papers, pages 81-84, Columbus, June.
On-line document:
4 pp. PDF (76K)

Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors (2007)

Abstract:
One may need to build a statistical parser for a new language, using only a very small labeled treebank together with raw text. We argue that bootstrapping a parser is most promising when the model uses a rich set of redundant features, as in recent models for scoring dependency parses (McDonald et al., 2005). Drawing on Abney's (2004) analysis of the Yarowsky algorithm, we perform bootstrapping by entropy regularization: we maximize a linear combination of conditional likelihood on labeled data and confidence (negative Rényi entropy) on unlabeled data. In initial experiments, this surpassed EM for training a simple feature-poor generative model, and also improved the performance of a feature-rich, conditionally estimated model where EM could not easily have been applied. For our models and training sets, more peaked measures of confidence, measured by Rényi entropy, outperformed smoother ones. We discuss how our feature set could be extended with cross-lingual or cross-domain features, to incorporate knowledge from parallel or comparable corpora during bootstrapping.
Citation (BibTeX):
David A. Smith and Jason Eisner (2007). Bootstrapping feature-rich dependency parsers with entropic priors. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pp. 667-677, Prague, June.
On-line document:
11 pp. PDF (210K)
Slides:
powerpoint (1165K, some animation)

Using "Annotator Rationales" to Improve Machine Learning for Text Categorization (2007)

Abstract:
We propose a new framework for supervised machine learning. Our goal is to learn from smaller amounts of supervised training data, by collecting a richer kind of training data: annotations with "rationales." When annotating an example, the human teacher will also highlight evidence supporting this annotation -- thereby teaching the machine learner why the example belongs to the category. We provide some rationale-annotated data and present a learning method that exploits the rationales during training to boost performance significantly on a sample task, namely sentiment classification of movie reviews. We hypothesize that in some situations, providing rationales is a more fruitful use of an annotator's time than annotating more examples.
Citation (BibTeX):
Omar Zaidan, Jason Eisner, and Christine Piatko (2007). Using "annotator rationales" to improve machine learning for text categorization. Human Language Technologies: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pp. 260-267, Rochester, NY, April.
On-line document:
8 pp. PDF (206K)
Slides:
powerpoint (1.8M, animation), PDF (730K, 4 per page)
Other resources:
our dataset of hand-annotated rationales on movie reviews
Relationship to other papers:
jump to this paper in my research summary

Cross-Instance Tuning of Unsupervised Document Clustering Algorithms (2007)

Abstract:
In unsupervised learning, where no training takes place, one simply hopes that the unsupervised learner will work well on any unlabeled test collection. However, when the variability in the data is large, such hope may be unrealistic; a tuning of the unsupervised algorithm may then be necessary in order to perform well on new test collections. In this paper, we show how to perform such a tuning in the context of unsupervised document clustering, by (i) introducing a degree of freedom, α, into two leading information-theoretic clustering algorithms, through the use of generalized mutual information quantities; and (ii) selecting the value of α based on clusterings of similar, but supervised document collections (cross-instance tuning). One option is to perform a tuning that directly minimizes the error on the supervised data sets; another option is to use "strapping" (Eisner and Karakos, 2005), which builds a classifier that learns to distinguish good from bad clusterings, and then selects the α with the best predicted clustering on the test set. Experiments from the "20 Newsgroups" corpus show that, although both techniques improve the performance of the baseline algorithms, "strapping" is clearly a better choice for cross-instance tuning.
Citation (BibTeX):
Damianos Karakos, Jason Eisner, Sanjeev Khudanpur, and Carey E. Priebe (2007). Cross-instance tuning of unsupervised document clustering algorithms. Human Language Technologies: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), pp. 252-259, Rochester, NY, April.
On-line document:
8 pp. PDF (165K)
Slides:
PDF (449K)
Relationship to other papers:
jump to this paper in my research summary

Iterative Denoising Using Jensen-Renyi Divergences with an Application to Unsupervised Document Categorization (2007)

Abstract:
Iterative denoising trees were used by Karakos et al. [1] for unsupervised hierarchical clustering. The tree construction involves projecting the data onto low-dimensional spaces, as a means of smoothing their empirical distributions, as well as splitting each node based on an information-theoretic maximization objective. In this paper, we improve upon the work of [1] in two ways: (i) the amount of computation spent searching for a good projection at each node now adapts to the intrinsic dimensionality of the data observed at that node; (ii) the objective at each node is to find a split which maximizes a generalized form of mutual information, the Jensen-Renyi divergence; this is followed by an iterative Naive Bayes classification. The single parameter alpha of the Jensen-Renyi divergence is chosen based on the "strapping" methodology [2], which learns a meta-classifer on a related task. Compared with the sequential Information Bottleneck method [3], our procedure produces state-of-the-art results on an unsupervised categorization task of documents from the "20 Newsgroups" dataset.
Citation (BibTeX):
Damianos Karakos, Sanjeev Khudanpur, Jason Eisner, and Carey E. Priebe (2007). Iterative denoising using Jensen-Renyi divergences with an application to unsupervised document categorization. Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Honolulu, April.
On-line document:
4 pp. PDF (96K)
Relationship to other papers:
jump to this paper in my research summary

Program Transformations for Optimization of Parsing Algorithms and Other Weighted Logic Programs (2007)

Abstract:

Dynamic programming algorithms in statistical natural language processing can be easily described as weighted logic programs. We give a notation and semantics for such programs. We then describe several source-to-source transformations that affect a program's efficiency, primarily by rearranging computations for better reuse or by changing the search strategy. We present practical examples of using these transformations, mainly to optimize context-free parsing algorithms, and we formalize them for use with new weighted logic programs.

Specifically, we define weighted versions of the folding and unfolding transformations, whose unweighted versions are used in the logic programming and deductive database communities. We then present a novel transformation called speculation -- a powerful generalization of folding that is motivated by gap-passing in categorial grammar. Finally, we give a simpler and more powerful formulation of the magic templates transformation.

Citation (BibTeX):
Jason Eisner and John Blatz (2007). Program transformations for optimization of parsing algorithms and other weighted logic programs. Proceedings of FG 2006: The 11th Conference on Formal Grammar, pp. 45-85. CSLI Publications.

This supersedes the pre-proceedings version (BibTeX):
Eisner, Jason and John Blatz (2006). Program transformations for optimization of parsing algorithms and other weighted logic programs. Pre-proceedings of the 11th Conference on Formal Grammar (FG-2006), pp. 39-59, Malaga, July.

On-line document:
41 pp. PDF (449K)
Slides:
powerpoint (2159K, animation)
Relationship to other papers:
jump to this paper in my research summary

Visual Navigation Through Large Directed Graphs and Hypergraphs (2006)

Abstract:
We describe Dynasty, a system for browsing large (possibly infinite) directed graphs and hypergraphs. Only a small subgraph is visible at any given time. We sketch how we lay out the visible subgraph, and how we update the layout smoothly and dynamically in an asynchronous environment. We also sketch our user interface for browsing and annotating such graphs -- in particular, how we try to make keyboard navigation usable.
Citation (BibTeX):
Jason Eisner, Michael Kornbluh, Gordon Woodhull, Raymond Buse, Samuel Huang, Constantinos Michael, and George Shafer (2006). Visual navigation through large directed graphs and hypergraphs. Proceedings of the IEEE Symposium on Information Visualization (InfoVis'06), Poster/Demo Session, pp. 116-117, Baltimore, October.
On-line document:
2 pp. PDF (93K)
Poster (4 feet by 3 feet):
powerpoint (434K), PDF (846K, minor imperfections)
Other resources:
The Dynasty website (includes screenshots and documentation)
Relationship to other papers:
jump to this paper in my research summary

A Natural-Language Approach to Automated Cryptanalysis of Two-Time Pads (2006)

Abstract:
While keystream reuse in stream ciphers and one-time pads has been a well known problem for several decades, the risk to real systems has been underappreciated. Previous techniques have relied on being able to accurately guess words and phrases that appear in one of the plaintext messages, making it far easier to claim that "an attacker would never be able to do that." In this paper, we show how an adversary can automatically recover messages encrypted under the same keystream if only the type of each message is known (e.g. an HTML page in English). Our method, which is related to HMMs, recovers the most probable plaintext of this type by using a statistical language model and a dynamic programming algorithm. It produces up to 99% accuracy on realistic data and can process ciphertexts at 200ms per byte on a $2,000 PC. To further demonstrate the practical effectiveness of the method, we show that our tool can recover documents encrypted by Microsoft Word 2002.
Citation (BibTeX):
Joshua Mason, Kathryn Watkins, Jason Eisner, and Adam Stubblefield (2006). A natural-language approach to automated cryptanalysis of two-time pads. Proceedings of the ACM Conference on Computer and Communications Security (CCS), pp. 235-244, Alexandria, VA, October.
On-line document:
10 pp. PDF (298K)
Slides:
PDF (1481K)
Relationship to other papers:
jump to this paper in my research summary

Better Informed Training of Latent Syntactic Features (2006)

Abstract:
We study unsupervised methods for learning refinements of the nonterminals in a treebank. Following Matsuzaki et al. (2005) and Prescher (2005), we may for example split NP without supervision into NP[0] and NP[1], which behave differently. We first propose to learn a PCFG that adds such features to nonterminals in such a way that they respect patterns of linguistic feature passing: each node's nonterminal features are either identical to, or independent of, those of its parent. This linguistic constraint reduces runtime and the number of parameters to be learned. However, it did not yield improvements when training on the Penn Treebank. An orthogonal strategy was more successful: to improve the performance of the EM learner by treebank preprocessing and by annealing methods that split nonterminals selectively. Using these methods, we can maintain high parsing accuracy while dramatically reducing the model size.
Citation (BibTeX):
Markus Dreyer and Jason Eisner (2006). Better informed training of latent syntactic features. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 317-326, Sydney, July.
On-line document:
10 pp. PDF (195K)
Poster:
PDF (5146K)
Relationship to other papers:
jump to this paper in my research summary

Minimum-Risk Annealing for Training Log-Linear Models (2006)

Abstract:
When training the parameters for a natural language system, one would prefer to minimize 1-best loss (error) on an evaluation set. Since the error surface for many natural language problems is piecewise constant and riddled with local minima, many systems instead optimize log-likelihood, which is conveniently differentiable and convex. We propose training instead to minimize the expected loss, or risk. We define this expectation using a probability distribution over hypotheses that we gradually sharpen (anneal) to focus on the 1-best hypothesis. Besides the linear loss functions used in previous work, we also describe techniques for optimizing nonlinear functions such as precision or the BLEU metric. We present experiments training log-linear combinations of models for dependency parsing and for machine translation. In machine translation, annealed minimum risk training achieves significant improvements in BLEU over standard minimum error training. We also show improvements in labeled dependency parsing.
Citation (BibTeX):
David A. Smith and Jason Eisner (2006). Minimum-risk annealing for training log-linear models. Proceedings of the International Conference on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL), Companion Volume, pp. 787-794, Sydney, July.
On-line document:
8 pp. PDF (269K)
Relationship to other papers:
jump to this paper in my research summary

Annealing Structural Bias in Multilingual Weighted Grammar Induction (2006)

Abstract:
We first show how a structural locality bias can improve the accuracy of state-of-the-art dependency grammar induction models trained by EM from unannotated examples (Klein and Manning, 2004). Next, by annealing the free parameter that controls this bias, we achieve further improvements. We then describe an alternative kind of structural bias, toward "broken" hypotheses consisting of partial structures over segmented sentences, and show a similar pattern of improvement. We relate this approach to contrastive estimation (Smith and Eisner, 2005a), apply the latter to grammar induction in six languages, and show that our new approach improves accuracy by 1-17% (absolute) over CE (and 8-30% over EM), achieving to our knowledge the best results on this task to date. Our method, structural annealing, is a general technique with broad applicability to hidden-structure discovery problems.
Citation (BibTeX):
Noah A. Smith and Jason Eisner (2006). Annealing structural bias in multilingual weighted grammar induction. Proceedings of the International Conference on Computational Linguistics and the Association for Computational Linguistics (COLING-ACL), pp. 569-576, Sydney, July.
On-line document:
8 pp. PDF (191K)
Relationship to other papers:
jump to this paper in my research summary

Local Search with Very Large-Scale Neighborhoods for Optimal Permutations in Machine Translation (2006)

Abstract:
We introduce a novel decoding procedure for statistical machine translation and other ordering tasks based on a family of Very Large-Scale Neighborhoods, some of which have previously been applied to other NP-hard permutation problems. We significantly generalize these problems by simultaneously considering three distinct sets of ordering costs. We discuss how these costs might apply to MT, and some possibilities for training them. We show how to search and sample from exponentially large neighborhoods using efficient dynamic programming algorithms that resemble statistical parsing. We also incorporate techniques from statistical parsing to improve the runtime of our search. Finally, we report results of preliminary experiments indicating that the approach holds promise.
Citation (BibTeX):
Jason Eisner and Roy W. Tromble (2006). Local search with very large-scale neighborhoods for optimal permutations in machine translation. Proceedings of the HLT-NAACL Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing, pp. 57-75, New York, June.
On-line document:
19 pp. PDF (214K)
Slides:
powerpoint (556K, much animation)
longer talk (see 2nd half) (1704K, much animation)
Relationship to other papers:
jump to this paper in my research summary

Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies (2006)

Nominated for best paper award.
Abstract:
Many syntactic models in machine translation are channels that transform one tree into another, or synchronous grammars that generate trees in parallel. We present a new model of the translation process: quasi-synchronous grammar (QG). Given a source-language parse tree T1, a QG defines a monolingual grammar that generates translations of T1. The trees T2 allowed by this monolingual grammar are inspired by pieces of substructure in T1 and aligned to T1 at those points. We describe experiments learning quasi-synchronous context-free grammars from bitext. As with other monolingual language models, we evaluate the crossentropy of QGs on unseen text and show that a better fit to bilingual data is achieved by allowing greater syntactic divergence. When evaluated on a word alignment task, QG matches standard baselines.
Citation (BibTeX):
David A. Smith and Jason Eisner (2006). Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. Proceedings of the HLT-NAACL Workshop on Statistical Machine Translation, pages 23-30, New York, June.
On-line document:
8 pp. PDF (190K)
Slides:
powerpoint (499K, some animation)
see also 1st half of this talk (1704K, much animation)
Relationship to other papers:
jump to this paper in my research summary

A Fast Finite-State Relaxation Method for Enforcing Global Constraints on Sequence Decoding (2006)

Abstract:
We describe finite-state constraint relaxation, a method for applying global constraints, expressed as automata, to sequence model decoding. We present algorithms for both hard constraints and binary soft constraints. On the CoNLL-2004 semantic role labeling task, we report a speedup of at least 16x over a previous method that used integer linear programming.
Citation (BibTeX):
Roy W. Tromble and Jason Eisner (2006). A fast finite-state relaxation method for enforcing global constraints on sequence decoding. Proceedings of HLT-NAACL, pages 423-430, New York, June.
On-line document:
8 pp. PDF (205K)
Slides:
powerpoint (1495K, animation)
Relationship to other papers:
jump to this paper in my research summary

Finite-State Dirichlet Allocation: Learned Priors on Finite-State Models (2006)

Abstract:

To model a collection of documents, suppose that each document was generated by a different hidden Markov model or probabilistic finite-state automaton (PFSA). Further suppose that all these PFSAs are similar because they are drawn from a single (but unknown) prior distribution over PFSAs. We wish to infer the prior, obtain smoothed estimates of the individual PFSAs, and reconstruct the hidden paths by which the unknown PFSAs generated their documents.

As an initial application, particularly hard for our model because of its sparse data, we derive an FSA topology from WordNet. For each verb, we construct the “document” of all nouns that have appeared as its object. Our method then learns a better estimate of p(object | verb), as well as which paths in WordNet, and hence which senses of ambiguous objects, tend to be favored. Our method improves 14.6% over Witten-Bell smoothing on the conditional perplexity of objects given the verb, and 27.5% over random on detecting the most common senses of nouns in the SemCor corpus.

Citation (BibTeX):
Jia Cui and Jason Eisner (2006). Finite-state Dirichlet allocation: Learned priors on finite-state models.. Technical Report 53, Center for Language and Speech Processing, Johns Hopkins University, April.
On-line document:
18 pp. PDF (317K)
Relationship to other papers:
jump to this paper in my research summary

Parsing with Soft and Hard Constraints on Dependency Length (2005)

Abstract:
In lexicalized phrase-structure or dependency parses, a word's modifiers tend to fall near it in the string. We show that a crude way to use dependency length as a parsing feature can substantially improve parsing speed and accuracy in English and Chinese, with more mixed results on German. We then show similar improvements by imposing hard bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. This simple "vine grammar" formalism has only finite-state power, but a context-free parameterization with some extra parameters for stringing fragments together. We exhibit a linear-time chart parsing algorithm with a low grammar constant.
Citation (BibTeX):
Jason Eisner and Noah A. Smith (2005). Parsing with soft and hard constraints on dependency length. Proceedings of the International Workshop on Parsing Technologies (IWPT), Vancouver, October, pp. 30-41.
On-line document:
12 pp. PDF (264K)
Slides:
powerpoint (1295K, animation)
Relationship to other papers:
jump to this paper in my research summary

Bootstrapping Without the Boot (2005)

Abstract:
"Bootstrapping" methods for learning require a small amount of supervision to seed the learning process. We show that it is sometimes possible to eliminate this last bit of supervision, by trying many candidate seeds and selecting the one with the most plausible outcome. We discuss such "strapping" methods in general, and exhibit a particular method for strapping word-sense classifiers for ambiguous words. Our experiments on the Canadian Hansards show that our unsupervised technique is significantly more effective than picking seeds by hand (Yarowsky, 1995), which in turn is known to rival supervised methods.
Citation (BibTeX):
Jason Eisner and Damianos Karakos (2005). Bootstrapping without the boot. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP), pp. 395-402, Vancouver, October.
On-line document:
8 pp. PDF (205K), postscript (654K)
Slides with audio:
powerpoint (2118K, inessential animation), PDF (3358K, 6 per page)
slightly longer powerpoint (2507K) with spoken lecture (1-hour MP3 audio, 55M)
Relationship to other papers:
jump to this paper in my research summary

Compiling Comp Ling: Weighted Dynamic Programming and the Dyna Language (2005)

Abstract:
Weighted deduction with aggregation is a powerful theoretical formalism that encompasses many NLP algorithms. This paper proposes a declarative specification language, Dyna; gives general agenda-based algorithms for computing weights and gradients; briefly discusses Dyna-to-Dyna program transformations; and shows that a first implementation of a Dyna-to-C++ compiler produces code that is efficient enough for real NLP research, though still several times slower than hand-crafted code.
Citation (BibTeX):
Jason Eisner, Eric Goldlust, and Noah A. Smith (2005). Compiling comp ling: Weighted dynamic programming and the Dyna language. Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP), pp. 281-290, Vancouver, October.
On-line document:
10 pp. PDF (246K), postscript (843K)
Slides with audio:
powerpoint (1223K, animation), PDF (474K, 6 per page)
spoken lecture accompanying above slides (25-minute Windows Media audio, 23M)
longer talk (6764K, animation; includes extra slides at end about assorted applications)
Appendix:
Derivation of gradient algorithm in figure 4
Other resources:
The Dyna website
Relationship to other papers:
jump to this paper in my research summary

Guiding Unsupervised Grammar Induction Using Contrastive Estimation (2005)

Abstract:
We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization of the function maximized by the Expectation-Maximization algorithm [Dempster et al., 1977]. CE is a natural fit for log-linear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, log-linear dependency grammar models trained using CE can drastically outperform EM-trained generative models on the task of matching human linguistic annotations (the MATCHLINGUIST task). The selection of an implicit negative evidence class -- a "neighborhood" -- appropriate to a given task has strong implications, but a good neighborhood can target the objective of grammar induction to a specific application.
Citation (BibTeX):
Noah A. Smith and Jason Eisner (2005). Guiding unsupervised grammar induction using contrastive estimation. International Joint Conference on Artificial Intelligence (IJCAI) Workshop on Grammatical Inference Applications, pp. 73-82, Edinburgh, July.
On-line document:
10 pp. PDF (220K, minor corrections), postscript (1009K, minor corrections)
Relationship to other papers:
jump to this paper in my research summary

Contrastive Estimation: Training Log-Linear Models on Unlabeled Data (2005)

Nominated for best paper award.
Abstract:
Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and namedentity extraction (McCallum and Li, 2003). CRFs are log-linear, allowing the incorporation of arbitrary features into the model. To train on unlabeled data, we require unsupervised estimation methods for log-linear models; few exist. We describe a novel approach, contrastive estimation. We show that the new technique can be intuitively understood as exploiting implicit negative evidence and is computationally efficient. Applied to a sequence labeling problem -- POS tagging given a tagging dictionary and unlabeled text -- contrastive estimation outperforms EM (with the same feature set), is more robust to degradations of the dictionary, and can largely recover by modeling additional features.
Citation (BibTeX):
Noah A. Smith and Jason Eisner (2005). Contrastive estimation: Training log-linear models on unlabeled data. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 354-362, Ann Arbor, Michigan, June.
Slides: PPT (2558K)
On-line document:
9 pp. PDF (248K), postscript (1007K)
Other resources:
mapping used here from 45 fine-grained to 17 coarse-grained tags (also appears in Smith (2006) dissertation)
Relationship to other papers:
jump to this paper in my research summary

A Class of Rational n-WFSM Auto-Intersections (2005)

Abstract:
Weighted finite-state machines with n tapes describe n-ary rational string relations. The join n-ary relation is very important in applications. It is shown how to compute it via a more simple operation, the auto-intersection. Join and auto-intersection generally do not preserve rationality. We define a class of triples (A,i,j) such that the auto-intersection of the machine A on tapes i and j can be computed by a delay-based algorithm. We point out how to extend this class and hope that it is sufficient for many practical applications.
Citation (BibTeX):
André Kempe, Jean-Marc Champarnaud, Jason Eisner, Franck Guingne, and Florent Nicart (2005). A class of rational n-WFSM auto-intersections. Proceedings of the Tenth International Conference on Implementation and Application of Automata (CIAA-2005). Lecture Notes in Computer Science 3845:189-200. Springer-Verlag.
On-line document:
12 pp. PDF (201K)
Relationship to other papers:
jump to this paper in my research summary

Unsupervised Classification via Decision Trees: An Information-Theoretic Perspective (2005)

Abstract:
Integrated Sensing and Processing Decision Trees (ISPDTs) were introduced in [1] as a tool for supervised classification of high-dimensional data. In this paper, we consider the problem of unsupervised classification, through a recursive construction of ISPDTs, where at each internal node the data (i) are split into clusters, and (ii) are transformed independently of other clusters, guided by some optimization objective. We show that the maximization of information-theoretic quantities such as mutual information and alpha-divergences is theoretically justified for growing ISPDTs, assuming that each data point is generated by a finite-memory random process given the class label. Furthermore, we present heuristics that perform the maximization in a greedy manner, and we demonstrate their effectiveness with empirical results from multi-spectral imaging.
Citation (BibTeX):
Damianos Karakos, Sanjeev Khudanpur, Jason Eisner, and Carey E. Priebe (2005). Unsupervised classification via decision trees: An information-theoretic perspective. Proceedings of the 2005 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1081-1084, Philadelphia, March. Invited talk.
On-line document:
4 pp. PDF (712K)
Relationship to other papers:
jump to this paper in my research summary

A Note on Join and Auto-Intersection of n-ary Rational Relations (2004)

Abstract:

A finite-state machine with n tapes describes a rational (or regular) relation on n strings. It is more expressive than a relational database table with n columns, which can only describe a finite relation.

We describe some basic operations on n-ary rational relations and propose notation for them. (For generality we give the semiring-weighted case in which each tuple has a weight.) Unfortunately, the join operation is problematic: if two rational relations are joined on more than one tape, it can lead to non-rational relations with undecidable properties. We recast join in terms of "auto-intersection" and illustrate some cases in which difficulties arise. We close with the hope that partial or restricted algorithms may be found that are still powerful enough to have practical use.

Citation (BibTeX):
André Kempe, Jean-Marc Champarnaud, and Jason Eisner (2004). A note on join and auto-intersection of n-ary rational relations. In Loek Cleophas and Bruce Watson, editors, Proceedings of the Eindhoven FASTAR Days (Computer Science Technical Report 04-40), pages 64-78. Department of Mathematics and Computer Science, Technische Universiteit Eindhoven, Netherlands, December.
On-line document:
15 pp. PDF (276K), postscript (429K)
Relationship to other papers:
jump to this paper in my research summary

Dyna: A Declarative Language for Implementing Dynamic Programs (2004)

Abstract:
We present the first version of a new declarative programming language. Dyna has many uses but was designed especially for rapid development of new statistical NLP systems. A Dyna program is a small set of equations, resembling Prolog inference rules, that specify the abstract structure of a dynamic programming algorithm. It compiles into efficient, portable, C++ classes that can be easily invoked from a larger application. By default, these classes run a generalization of agenda-based parsing, prioritizing the partial parses by some figure of merit. The classes can also perform an exact backward (outside) pass in the service of parameter training. The compiler already knows several implementation tricks, algorithmic transforms, and numerical optimization techniques. It will acquire more over time: we intend for it to generalize and encapsulate best practices, and serve as a testbed for new practices. Dyna is now being used for parsing, machine translation, morphological analysis, grammar induction, and finite-state modeling.
Citation (BibTeX):
Jason Eisner, Eric Goldlust, and Noah A. Smith (2004). Dyna: A declarative language for implementing dynamic programs. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Companion Volume, pages 218-221, Barcelona, July.
On-line document:
4 pp. PDF (60K), postscript (68K)
Consider this longer, more recent paper instead
Other resources:
The Dyna website
Relationship to other papers:
jump to this paper in my research summary

Annealing Techniques for Unsupervised Statistical Language Learning (2004)

Abstract:
Exploiting unannotated natural language data is hard largely because unsupervised parameter estimation is hard. We describe deterministic annealing (Rose et al., 1990) as an appealing alternative to the Expectation-Maximization algorithm (Dempster et al., 1977). Seeking to avoid search error, DA begins by globally maximizing an easy concave function and maintains a local maximum as it gradually morphs the function into the desired non-concave likelihood function. Applying DA to parsing and tagging models is shown to be straightforward; significant improvements over EM are shown on a part-of-speech tagging task. We describe a variant, skewed DA, which can incorporate a good initializer when it is available, and show significant improvements over EM on a grammar induction task.
Citation (BibTeX):
Noah A. Smith and Jason Eisner (2004). Annealing techniques for unsupervised statistical language learning. Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 486-493, Barcelona, July.
On-line document:
8 pp. PDF (144K), postscript (171K)
Relationship to other papers:
jump to this paper in my research summary

Radiology Report Entry with Automatic Phrase Completion Driven by Language Modeling (2004)

Abstract:
Language modeling, a technology found in many computerized speech recognition systems, can also be used in a text editor to implement an automated phrase completion feature that significantly reduces the number of keystrokes required to generate a radiology report, therefore increasing typing speed.

Radiology reports have especially low entropy, which allows prediction of multi-word phrases. Our system therefore chooses an optimal phrase length for each prediction, using Bellman-style dynamic programming to minimize the expected cost of typing the rest of the document. This computation considers what the user is likely to type in the future, and how many keystrokes it will take, considering the future effect of phrase completion as well.

Citation (BibTeX):
John Eng and Jason M. Eisner (2004). Radiology report entry with automatic phrase completion driven by language modeling. Radiographics 24(5):1493-1501, September-October.
On-line:
9 pp. PDF (636K)
Relationship to other papers:
jump to this paper in my research summary

Natural Language Generation in the Context of Machine Translation (2004)

Abstract:
Final report from the team at the JHU CLSP 2002 summer workshop. See project description.
Citation (BibTeX):
Jan Hajic, Martin Cmejrek, Bonnie Dorr, Yuan Ding, Jason Eisner, Daniel Gildea, Terry Koo, Kristen Parton, Gerald Penn, Dragomir Radev, and Owen Rambow (2004). Natural language generation in the context of machine translation. Technical report, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, March. Final report from 2002 CLSP summer workshop. 87 pp.
On-line document:
87 pp. PDF (395K)
Slides:
powerpoint (1072K)
Other resources:
Official workshop team page
Relationship to other papers:
jump to this paper in my research summary

Learning Non-Isomorphic Tree Mappings for Machine Translation (2003)

Abstract:
Often one may wish to learn a tree-to-tree mapping, training it on unaligned pairs of trees, or on a mixture of trees and strings. Unlike previous statistical formalisms (limited to isomorphic trees), synchronous tree substitution grammar allows local distortion of the tree topology. We reformulate it to permit dependency trees, and sketch EM/Viterbi algorithms for alignment, training, and decoding.
[Note: At a reviewer's request, the paper describes TSG more formally than in the previous literature, which might be helpful for some readers and implementers.]
Citation (BibTeX):
Jason Eisner (2003). Learning non-isomorphic tree mappings for machine translation. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (Companion Volume), 205-208, Sapporo, July.
On-line document:
4 pp. PDF (91K), postscript (103K) - see also correction
Slides:
powerpoint (320K, much animation), PDF (267K, 6 per page)
Relationship to other papers:
jump to this paper in my research summary

Simpler and More General Minimization for Weighted Finite-State Automata (2003)

Abstract:

Previous work on minimizing weighted finite-state automata (including transducers) is limited to particular types of weights. We present efficient new minimization algorithms that apply much more generally, while being simpler and about as fast.

We also point out theoretical limits on minimization algorithms. We characterize the kind of ``well-behaved'' weight semirings where our methods work. Outside these semirings, minimization is not well-defined (in the sense of producing a unique minimal automaton), and even finding the minimum number of states is in general NP-complete and inapproximable.

Citation (BibTeX):
Jason Eisner (2003). Simpler and more general minimization for weighted finite-state automata. Proceedings of the Joint Meeting of the Human Language Technology Conference and the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 64-71, Edmonton, May.
On-line document:
8 pp. PDF (263K), postscript (2027K)
Slides:
powerpoint (615K, much animation), PDF (397K, 6 per page)
Relationship to other papers:
jump to this paper in my research summary

Parameter Estimation for Probabilistic Finite-State Transducers (2002)

Abstract:
Weighted finite-state transducers suffer from the lack of a training algorithm. Training is even harder for transducers that have been assembled via finite-state operations such as composition, minimization, union, concatenation, and closure, as this yields tricky parameter tying. We formulate a "parameterized FST" paradigm and give training algorithms for it, including a general bookkeeping trick ("expectation semirings") that cleanly and efficiently computes expectations and gradients.
Citation (BibTeX):
Jason Eisner (2002). Parameter estimation for probabilistic finite-state transducers. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 1-8, Philadelphia, July.
On-line document:
8 pp. PDF (230K), postscript (1758K)
Slides from a longer version of this talk:
powerpoint (850K, some animation, a few speaker notes), PDF (479K, 6 per page)
Relationship to other papers:
jump to this paper in my research summary

Comprehension and Compilation in Optimality Theory (2002)

Abstract:
This paper ties up some loose ends in finite-state Optimality Theory. First, it discusses how to perform comprehension under Optimality Theory grammars consisting of finite-state constraints. Comprehension has not been much studied in OT; we show that unlike production, it does not always yield a regular set, making finite-state methods inapplicable. However, after giving a suitably flexible presentation of OT, we show carefully how to treat comprehension under recent variants of OT in which grammars can be compiled into finite-state transducers. We then unify these variants, showing that compilation is possible if all components of the grammar are regular relations, including the harmony ordering on scored candidates. A side benefit of our construction is a far simpler implementation of directional OT (Eisner, 2000).
Citation (BibTeX):
Jason Eisner (2002). Comprehension and compilation in Optimality Theory. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 56-63, Philadelphia, July.
On-line document:
8 pp. postscript (242K), PDF (191K, missing tableau shading)
Slides:
powerpoint (302K, a few special effects), PDF (240K, 6 per page)
Other resources:
related homework assignment (see problem 4) with an implementation
overlapping paper by Gerhard Jäger published independently at the same time
Relationship to other papers:
jump to this paper in my research summary

An Interactive Spreadsheet for Teaching the Forward-Backward Algorithm (2002)

Abstract:
This paper offers a detailed lesson plan on the forward-backward algorithm. The lesson is taught from a live, commented spreadsheet that implements the algorithm and graphs its behavior on a whimsical toy example. By experimenting with different inputs, one can help students develop intuitions about HMMs in particular and Expectation Maximization in general. The spreadsheet and a coordinated follow-up assignment are available.
Citation (BibTeX):
Jason Eisner (2002). An interactive spreadsheet for teaching the forward-backward algorithm. In Dragomir Radev and Chris Brew (eds.), Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 10-18, Philadelphia, July.
On-line document (includes lesson plan):
9 pp. PDF (242K, prettier), postscript (1026K, uglier)
Other resources:
Lecture video (see part 2)
Excel spreadsheet (348K - revised to better fit 800x600 projector), Viterbi version (324K)
Tricks and tips for quick navigation when teaching from a live spreadsheet
homework assignment tied to the spreadsheet (contact me for latex source)
A textbook explanation of HMMs, centered on the "ice cream" example from this paper (Jurafsky & Martin, 2nd edition)
Another EM spreadsheet of mine, animating soft clustering (245K)
Animated Powerpoint examples of Earley's algorithm (1482K) and fast bilexical parsing (517K)
Related sites (discovered after publication):
Spreadsheets in Education has a long bibliography, many links, and examples (including Fourier synthesis!)
Visualizations with Excel explains how to do algorithm animation with Excel macros; examples include edit distance and Huffman coding
Relationship to other papers:
jump to this paper in my research summary

Transformational Priors Over Grammars (2002)

Nominated for best paper award.
Abstract:
This paper proposes a novel class of PCFG parameterizations that support linguistically reasonable priors over PCFGs. To estimate the parameters is to discover a notion of relatedness among context-free rules such that related rules tend to have related probabilities. The prior favors grammars in which the relationships are simple to describe and have few major exceptions. A basic version that bases relatedness on weighted edit distance yields superior smoothing of grammars learned from the Penn Treebank (20% reduction of rule perplexity over the best previous method).
Citation (BibTeX):
Jason Eisner (2002). Transformational priors over grammars. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 63-70, Philadelphia, July.
On-line document:
8 pp. postscript (205K), PDF (166K)
Slides:
powerpoint (749K, some speaker notes & animation), PDF (338K, 6 per page), PDF (579K, with speaker notes)
Relationship to other papers:
jump to this paper in my research summary

Discovering Syntactic Deep Structure via Bayesian Statistics (2002)

Abstract:
In the Bayesian framework, a language learner should seek a grammar that explains observed data well and is also a priori probable. This paper proposes such a measure of prior probability. Indeed it develops a full statistical framework for lexicalized syntax. The learner's job is to discover the system of probabilistic transformations (often called lexical redundancy rules) that underlies the patterns of regular and irregular syntactic constructions listed in the lexicon. Specifically, the learner discovers what transformations apply in the language, how often they apply, and in what contexts. It considers simpler systems of transformations to be more probable a priori. Experiments show that the learned transformations are more effective than previous statistical models at predicting the probabilities of lexical entries, especially those for which the learner had no direct evidence.
Citation (BibTeX):
Jason Eisner (2002). Discovering syntactic deep structure via Bayesian statistics. Cognitive Science 26(3):255-268, May-June.
On-line document:
9 pp. PDF (145K)
Slides: see this talk
Relationship to other papers:
jump to this paper in my research summary

Introduction to the Special Section on Linguistically Apt Statistical Methods (2002)

Abstract:
This brief introduction, from the editor of the special section, reviews why and how statistical and linguistic approaches to language can help each other. It also asks how statistical modeling fits into the broader program of cognitive science.
Citation (BibTeX):
Jason Eisner (2002). Introduction to the special section on linguistically apt statistical methods. Cognitive Science 26(3):235-237, May-June.
On-line document:
2 pp. PDF (30K)
Relationship to other papers:
jump to this paper in my research summary

Expectation Semirings: Flexible EM for Finite-State Transducers (2001)

Abstract:
This paper gives the first EM algorithm for general probabilistic finite-state transducers (with epsilon). Furthermore, the approach is powerful enough to fit machines' parameters even after the machines are combined by operations of the finite-state calculus, such as composition and minimization. This allows an expert to build a parameterized transducer in any way that is appropriate to the domain, and then fit the parameters automatically from data. Many standard algorithms are special cases, and there are many further applications. Yet the algorithm remains surprisingly simple because all the difficult work is subcontracted to existing algorithms for semiring-weighted automata. The trick is to use a novel semiring.
Citation (BibTeX):
Jason Eisner (2001). Expectation semirings: Flexible EM for finite-state transducers. In Gertjan van Noord (ed.), Proceedings of the ESSLLI Workshop on Finite-State Methods in Natural Language Processing, Helsinki, August.
On-line document:
5 pp. postscript (176K), PDF (240K)
Slides: see this talk
Relationship to other papers:
jump to this paper in my research summary

Smoothing a Probabilistic Lexicon Via Syntactic Transformations (2001)

Abstract:

Probabilistic parsing requires a lexicon that specifies each word's syntactic preferences in terms of probabilities. To estimate these probabilities for words that were poorly observed during training, this thesis assumes the existence of arbitrarily powerful transformations (also known to linguists as lexical redundancy rules or metarules) that can add, delete, retype or reorder the argument and adjunct positions specified by a lexical entry.

In a given language, some transformations apply frequently and others rarely. We describe how to estimate the rates of the transformations from a sample of lexical entries. More deeply, we learn which properties of a transformation increase or decrease its rate in the language. As a result, we can smooth the probabilities of lexical entries. Given enough direct evidence about a lexical entry's probability, our Bayesian approach trusts the evidence; but when less evidence or no evidence is available, it relies more on the transformations' rates to guess how often the entry will be derived from related entries.

Abstractly, the proposed ``transformation models'' are probability distributions that arise from graph random walks with a log-linear parameterization. A domain expert constructs the parameterized graph, and a vertex is likely according to whether random walks tend to halt at it. Transformation models are suited to any domain where ``related'' events (as defined by the graph) may have positively covarying probabilities. Such models admit a natural prior that favors simple regular relationships over stipulative exceptions. The model parameters can be locally optimized by gradient-based methods or by Expectation-Maximization. Exact algorithms (matrix inversion) and approximate ones (relaxation) are provided, with optimizations. Variations on the idea are also discussed.

We compare the new technique empirically to previous techniques from the probabilistic parsing literature, using comparable features, and obtain a 20% perplexity reduction (similar to doubling the amount of training data). Some of this reduction is shown to stem from the transformation model's ability to match observed probabilities, and some from its ability to generalize. Model averaging yields a final 24% perplexity reduction.

Citation (BibTeX):
Jason M. Eisner (2001). Smoothing a Probabilistic Lexicon Via Syntactic Transformations. Ph.D. thesis, University of Pennsylvania, July.
On-line document:
318 double-spaced pp. PDF (2850K), PS (2909K)
"Chapter 1: An Executive Summary": 33 double-spaced pp. , PDF (364K), PS (586K)
Slides: see this talk
Relationship to other papers:
jump to this paper in my research summary

Easy and Hard Constraint Ranking in Optimality Theory: Algorithms and Complexity (2000)

Abstract:
We consider the problem of ranking a set of OT constraints in a manner consistent with data.

We speed up Tesar and Smolensky's RCD algorithm to be linear on the number of constraints. This finds a ranking so each attested form xi beats or ties a particular competitor yi.

We also generalize RCD so each xi beats or ties all possible competitors. Alas, neither this more realistic version of learning, nor even generation, has any polynomial algorithm unless P=NP! That is, one cannot improve qualitatively upon brute force:

Merely checking that a single (given) ranking is consistent with given forms is coNP-complete if the surface forms are fully observed and Delta2p-complete if not. Indeed, OT generation is OptP-complete. As for ranking, determining whether any consistent ranking exists is coNP-hard (but in Delta2p) if the forms are fully observed, and Sigma2p-complete if not.

Finally, we show that generation and ranking are easier in derivational theories: in P, and NP-complete.

Citation (BibTeX):
Jason Eisner (2000). Easy and hard constraint ranking in optimality theory: Algorithms and complexity. In Jason Eisner, Lauri Karttunen and Alain Thériault (eds.), Finite-State Phonology: Proceedings of the 5th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), pages 22-33, Luxembourg, August.
On-line document:
12 pp. A4 postscript (321K), PDF (335K)
Slides:
powerpoint (265K, some animation), PDF (172K, 6 per page)
Relationship to other papers:
jump to this paper in my research summary

Directional Constraint Evaluation in Optimality Theory (2000)

Abstract:
Weighted finite-state constraints that can count unboundedly many violations make Optimality Theory more powerful than finite-state transduction (Frank and Satta, 1998). This result is empirically and computationally awkward. We propose replacing these unbounded constraints, as well as non-finite-state Generalized Alignment constraints, with a new class of finite-state directional constraints. We give linguistic applications, results on generative power, and algorithms to compile grammars into transducers.
Citation (BibTeX):
Jason Eisner (2000). Directional constraint evaluation in Optimality Theory. Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), pages 257-263, Saarbrücken, August.
On-line document:
7 pp. postscript (266K), PDF (266K)
Slides (detailed speaker notes, light animation):
powerpoint (855K, detailed speaker notes, light animation), PDF (263K, 6 per page), PDF (417K, with speaker notes)
Relationship to other papers:
jump to this paper in my research summary

The Science of Language: Computational Linguistics (2000)

Citation (BibTeX):
Jason Eisner (2000). The science of language: Computational linguistics. Imagine Magazine 7(4):14-15, Center for Talented Youth, Johns Hopkins University, Baltimore, March/April.
On-line document:
2 pp. HTML, PDF (4640K)

Review of Optimality Theory by René Kager (2000)

Abstract:
This book review also sketches why OT is interesting to computational linguists, and how it relates to other approaches for combining non-orthogonal surface features, such as maximum-entropy modeling.
Citation (BibTeX):
Jason Eisner (2000). Review of Optimality Theory by René Kager. Computational Linguistics 26(2):286-290, June.
On-line document:
5 pp. postscript (41K), PDF (196K)
Relationship to other papers:
jump to this paper in my research summary

Bilexical Grammars and Their Cubic-Time Parsing Algorithms (1997, 2000)

Abstract:
This paper introduces weighted bilexical grammars, a formalism in which individual lexical items, such as verbs and their arguments, can have idiosyncratic selectional influences on each other. Such ``bilexicalism'' has been a theme of much current work in parsing. The new formalism can be used to describe bilexical approaches to both dependency and phrase-structure grammars, and a slight modification yields link grammars. Its scoring approach is compatible with a wide variety of probability models.

The obvious parsing algorithm for bilexical grammars (used by most authors) takes time O(n5). A more efficient O(n3) method is exhibited. The new algorithm has been implemented and used in a large parsing experiment (Eisner 1996). We also give a useful extension to the case where the parser must undo a stochastic transduction that has altered the input.

Citation (BibTeX):
Jason Eisner (2000). Bilexical grammars and their cubic-time parsing algorithms. In Harry Bunt and Anton Nijholt (eds.), Advances in Probabilistic and Other Parsing Technologies, pages 29-62. Kluwer Academic Publishers, October.

This is an improved and extended version of an earlier paper (BibTeX):
Eisner, Jason (1997). Bilexical grammars and a cubic-time probabilistic parser. Proceedings of the International Workshop on Parsing Technologies, pages 54-65, MIT, September.

On-line document:
33 pp. postscript (316K), PDF (330K)
Earlier 1997 version: 12 pp. postscript (318K), PDF (383K)
Slides from 1997 talk (black-and-white, no animation):
powerpoint (680K), PDF (122K, 6 per page)
Relationship to other papers:
jump to this paper in my research summary

A Faster Parsing Algorithm for Lexicalized Tree-Adjoining Grammars (2000)

Abstract:
This paper points out some computational inefficiencies of standard TAG parsing algorithms when applied to LTAGs. We propose a novel algorithm with an asymptotic improvement, from O(n8 g2 t) to O(n6 max(n,g) g t), where n is the input length and g, t are grammar constants that are independent of vocabulary size.
Citation (BibTeX):
Jason Eisner and Giorgio Satta (2000). A faster parsing algorithm for lexicalized tree-adjoining grammars. Proceedings of the 5th Workshop on Tree-Adjoining Grammars and Related Formalisms (TAG+5), pages 14-19, Paris, May.
On-line document:
6 pp. postscript (70K), PDF (82K)
Relationship to other papers:
jump to this paper in my research summary

Efficient Parsing for Bilexical Context-Free Grammars and Head Automaton Grammars (1999)

Abstract:
Several recent stochastic parsers use bilexical grammars, where each word type idiosyncratically prefers particular complements with particular head words. We present O(n4) parsing algorithms for two bilexical formalisms (see title), improving the previous upper bounds of O(n5). Also, for a common special case that was known to allow O(n3) parsing (Eisner, 1997), we present an O(n3) algorithm with an improved grammar constant.
Citation (BibTeX):
Jason Eisner and Giorgio Satta (1999). Efficient parsing for bilexical context-free grammars and head automaton grammars. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 457-464, College Park, Maryland, June.
On-line document:
8 pp. postscript (536K), PDF (413K)
Slides:
powerpoint (517K, animation, speaker notes), PDF (203K, 6 per page), PDF (284K, with speaker notes)
Note that the slides include experimental speed comparisons that were not in the paper.
Relationship to other papers:
jump to this paper in my research summary

Doing OT in a Straitjacket (1999)

Note:
This is an extended version of What Constraints Should OT Allow? (1997).
Abstract:
A universal theory of human phonology should be clearly specified and falsifiable. To turn Optimality Theory (OT) into a complete proposal for phonological Universal Grammar, one must put some cards on the table: What kinds of constraints may an OT grammar state? And how can anyone tell what data this grammar predicts, without constructing infinite tableaux?

In this talk I'll motivate a restrictive formalization of OT that allows just two types of simple, local constraint. Gen freely proposes gestures and prosodic constituents; the constraints try to force these to coincide or not coincide temporally. An efficient algorithm exists to find the optimal candidate.

I will argue that despite its simplicity, primitive OT is expressive enough to describe and unify nearly all current work in OT phonology. However, it is provably more constrained: because it is unable to mimic deeply non-local mechanisms like Generalized Alignment, it forces a new and arguably better account of metrical stress typology. I will even discuss a proposal for constraining it further.

Citation (BibTeX):
Jason M Eisner (1999). Doing OT in a straitjacket. Talk handout (27 pages), UCLA Linguistics Dept., June. http://cs.jhu.edu/~jason/papers/#ucla99. Extended version of talk at the 1997 LSA.
On-line document:
27 pp. postscript (297K), PDF (458K)
14 pp. postscript (260K), PDF (407K) (as handout)
Relationship to other papers:
jump to this paper in my research summary

Efficient Generation in Primitive Optimality Theory (1997)

Abstract:
This paper introduces computational linguists to primitive Optimality Theory (OTP), a clean and linguistically motivated formalization of OT. OTP specifies the class of autosegmental representations, the universal generator Gen, and the two simple families of permissible constraints. It is therefore possible to study its computational generation, comprehension, and learning properties.

Some results on generation are presented. Unlike less restricted theories using Generalized Alignment, OTP grammars can derive optimal surface forms with finite-state methods adapted from Ellison (1994). Unfortunately these methods take time exponential on the size of the grammar. Indeed the generation problem is shown NP-complete in this sense. However, techniques are discussed for making Ellison's approach fast and practical in the typical case, including a simple trick that alone provides a 100-fold speedup on a grammar fragment of moderate size. One avenue for future improvements is a new finite-state notion, ``factored automata,'' where regular languages are represented compactly via formal intersections of FSAs.

Citation (BibTeX):
Jason Eisner (1997). Efficient generation in primitive Optimality Theory. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 313-320, Madrid, July.
On-line document:
8 pp. postscript (178K), PDF (235K)
Slides (black-and-white, no animation, no speaker notes):
powerpoint (263K), PDF (104K)
Other resources:
Some proof details suppressed from the paper for space reasons
Relationship to other papers:
jump to this paper in my research summary

State-of-the-Art Algorithms for Minimum Spanning Trees: A Tutorial Discussion (1997)

Abstract:
The classic ``easy'' optimization problem is to find the MST of a connected, undirected graph. Good polynomial-time algorithms have been known since 1930. Over the last 10 years, however, the standard O(m log n) results of Kruskal and Prim have been improved to linear or near-linear time. The new methods use several tricks of general interest in order to reduce the number of edge weight comparisons and the amount of other work. This tutorial reviews those methods, building up strategies step by step so as to expose the insights behind the algorithms. Implementation details are clarified, and some generalizations are given.

Specifically, the paper attempts to shed light on the classical algorithms of Kruskal, Prim, and Boruvka; the improved approach of Gabow, Galil, and Spencer, which takes time only O(m log (lg* n - lg* m/n)); and the randomized O(m) algorithm of Karger, Klein, and Tarjan, which relies on an O(m) MST verification algorithm by King. It also considers Frederickson's method for maintaining an MST in time O(sqrt(m)) per change to the graph. An appendix explains Fibonacci heaps.

Citation (BibTeX):
Jason Eisner (1997). State-of-the-art algorithms for minimum spanning trees: A tutorial discussion. Manuscript, University of Pennsylvania, April. 78 pp. (To be turned into a technical report, with more diagrams, as soon as I get a chance.)
On-line document:
78 pp. postscript (725K), PDF (1037K)
Relationship to other papers:
jump to this paper in my research summary

FootForm Decomposed: Using primitive constraints in OT (1997)

Abstract:
Hayes (1995) gives a typology of the world's metrical stress systems, which is marked by several striking asymmetries (parametric gaps). Most work on metrical stress within Optimality Theory (OT) has adopted this typology without explaining the gaps. Moreover, the OT versions use uncomfortably non-local constraints (Align, FootForm, FtBin).

This paper presents a rather different and in some ways more explanatory typology of stress, couched in the framework of primitive Optimality Theory (OTP), which allows only primitive, radically local constraints. For example, Generalized Alignment is not allowed. The paper presents a single, coherent system of rerankable constraints that yields the basic facts about iambic and trochaic foot form, iambic lengthening, quantity sensitivity, unbounded feet, simple word-initial and word-final stress, directionality of footing, syllable (and foot) extrametricality, degenerate feet, and word-level stress.

The metrical part of the account rests on the following intuitions:

An interesting prediction of (b) and (c) is that left-to-right trochees should be incompatible with extrametricality. This prediction is robustly confirmed in Hayes.
Citation (BibTeX):
Jason Eisner (1997). FootForm decomposed: Using primitive constraints in OT. In Benjamin Bruening (ed.), MIT Working Papers in Linguistics, vol. 31, pages 115-143.
On-line document:
29 pp. postscript (301K), PDF (305K)
Relationship to other papers:
jump to this paper in my research summary

What Constraints Should OT Allow? (1997)

Note:
A more recent, extended version of this talk is Doing OT in a Straitjacket (1999).
Abstract:
Optimality Theory (OT) has shown itself to be an elegant framework for phonological description. Two important questions remain to be settled, however: What constraints are allowed? And what kind of representations do they constrain? Formalizing what OT can and cannot say is part of stating UG.

This talk proposes an approach to constraining OT, called "primitive Optimality Theory" (OTP). Most constraints given in the literature can be reformulated (not always obviously) as coming from one of two simple, local families of ``primitive constraints'':

Here, a and b may be constituents, edges of constituents, or restricted kinds of conjunctive or disjunctive configurations.

We will formalize these families and the representations that they constrain. As in Optimal Domains Theory, neither the constraints nor the representations use association lines. The constraints control only the relative timing of articulatory gestures, and other phonological or morphological constituents, along a continuous timeline.

A list of hundreds of constraints drawn from the literature is presented, showing how every degree of freedom of OTP is exploited in each of several areas: features, prosody, feature-prosody interaction, input-output relationships, and morphophonology. To show that the primitive constraints are not merely necessary, but also close to sufficient, we also discuss how to handle a few apparently difficult cases of non-local phenomena.

Citation (BibTeX):
Jason M. Eisner (1997). What constraints should OT allow? Talk handout (22 pages), Annual Meeting of the Linguistic Society of America, Chicago, January. Available on the Rutgers Optimality Archive (http://ruccs.rutgers.edu/roa.html) and at http://cs.jhu.edu/~jason/papers/#lsa97.

With some additions and corrections:
Eisner, Jason (1997). Constraining OT: Primitive Optimality Theory. Talk handout, MIT, September. http://www.cs.jhu.edu/~jason/papers/#mit97.

On-line document (Click here instead for a more recent version!)
22 pp. postscript (176K), PDF (324K) (corrected 9/1997 version)
11 pp. postscript (171K), PDF (317K) (corrected 9/1997 version, as handout)
10 pp. postscript (158K) (original 1/1997 version, as handout)
Relationship to other papers:
jump to this paper in my research summary

An Empirical Comparison of Probability Models for Dependency Grammar (1996)

Abstract:
This technical report is an appendix to Eisner (1996): it gives superior experimental results that were reported only in the talk version of that paper. Eisner (1996) trained three probability models on a small set of about 4,000 conjunction-free, dependency-grammar parses derived from the Wall Street Journal section of the Penn Treebank, and then evaluated the models on a held-out test set, using a novel O(n3) parsing algorithm.

The present paper describes some details of the experiments and repeats them with a larger training set of 25,000 sentences. As reported at the talk, the more extensive training yields greatly improved performance. Nearly half the sentences are parsed with no misattachments; two-thirds are parsed with at most one misattachment.

Of the models described in the original written paper, the best score is still obtained with the generative (top-down) ``model C.'' However, slightly better models are also explored, in particular, two variants on the comprehension (bottom-up) ``model B.'' The better of these has an attachment accuracy of 90%, and (unlike model C) tags words more accurately than the comparable trigram tagger. Differences are statistically significant.

If tags are roughly known in advance, search error is all but eliminated and the new model attains an attachment accuracy of 93%. We find that the parser of Collins (1996), when combined with a highly-trained tagger, also achieves 93% when trained and tested on the same sentences. Similarities and differences are discussed.

Citation (BibTeX):
Jason M. Eisner (1996). An empirical comparison of probability models for dependency grammar. Technical report IRCS-96-11, Institute for Research in Cognitive Science, Univ. of Pennsylvania. 18 pp.
On-line document:
18 pp. postscript (327K), PDF (357K)
Relationship to other papers:
jump to this paper in my research summary

Three New Probabilistic Models for Dependency Parsing: An Exploration (1996)

Abstract:
After presenting a novel O(n3) parsing algorithm for dependency grammar, we develop three contrasting ways to stochasticize it. We propose (a) a lexical affinity model where words struggle to modify each other, (b) a sense tagging model where words fluctuate randomly in their selectional preferences, and (c) a generative model where the speaker fleshes out each word's syntactic and conceptual structure without regard to the implications for the hearer. We also give preliminary empirical results from evaluating the three models' parsing performance on annotated Wall Street Journal training text (derived from the Penn Treebank). In these results, the generative (i.e., top-down) model performs significantly better than the others, and does about equally well at assigning part-of-speech tags.
Citation (BibTeX):
Jason M. Eisner (1996). Three new probabilistic models for dependency parsing: An exploration. Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pages 340-345, Copenhagen, August.
On-line document:
6 pp. postscript (153K), PDF (218K)
Relationship to other papers:
jump to this paper in my research summary

Efficient Normal-Form Parsing for Combinatory Categorial Grammar (1996)

Abstract:
Under categorial grammars that have powerful rules like composition, a simple n-word sentence can have exponentially many parses that are semantically equivalent. Generating all parses is inefficient and obscures whatever true semantic ambiguities are in the input. This paper addresses the problem for a fairly general form of Combinatory Categorial Grammar, by means of an efficient, correct, and easy to implement normal-form parsing technique. The parser is proved to find exactly one parse in each semantic equivalence class of allowable parses; that is, spurious ambiguity (as carefully defined) is shown to be both safely and completely eliminated.
Citation (BibTeX):
Jason Eisner (1996). Efficient normal-form parsing for Combinatory Categorial Grammar. Proceedings of the 34th Annual Meeting of the ACL, pages 79-86, Santa Cruz, June.
On-line document:
8 pp. postscript (149K), PDF (226K)
Slides (black-and-white):
PDF (50K)
Other resources:
proof details suppressed from the paper for space reasons
Relationship to other papers:
jump to this paper in my research summary

Description of the University of Pennsylvania entry in the MUC-6 competition (1995)

Citation (BibTeX):
B. Baldwin, J.C. Reynar, M. Collins, J. Eisner, A. Ratnaparkhi, J. Rosenzweig, A. Sarkar, and B. Srinivas (1995). Description of the University of Pennsylvania entry in the MUC-6 competition. Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 177-191, Maryland, October.
Summary:
A competitive system for coreference resolution, hacked together in our spare time.
On-line document:
15 pp. postscript (166K)
Relationship to other papers:
jump to this paper in my research summary

`All'-less in Wonderland? Revisiting any (1995)

Abstract:
English any is often treated as two unrelated or semi-related lexemes: a negative-polarity item, NPI any, and a universal quantifier, free-choice (FC) any. The latter is idiosyncratic in that it must appear in the scope of a licenser, but moves to take scope immediately over that licenser at LF. I give a semantic account of FC any as an ``irrealis'' quantifier. This explains some curious (new and old) facts about FC any's semantics and licensing environments. Furthermore, it predicts that negation and other NPI-licensing environments should license FC any, which would then have just the same meaning as NPI any (pace Ladusaw (1979), Carlson (1980)). Thus, we may unify the two any's as a single universal quantifier, as originally proposed by Lasnik (1972) and others. Such an account implies that NPI any moves over negation at LF -- which is confirmed by scope tests. It also explains some well-known problems concerning NPI any in non-downward-entailing environments and under sorry vs. glad.
Citation (BibTeX):
Jason Eisner (1995). `All'-less in Wonderland? Revisiting any. In Fuller, Janet, Ho Han, and David Parkinson (eds.), Proceedings of ESCOL 11 (October 1994), pages 92-103. Ithaca, NY: DMLL Publications.
On-line document:
12 pp. postscript (96K), PDF (131K)
Relationship to other papers:
jump to this paper in my research summary

A Probabilistic Parser Applied to Software Testing Documents (1992)

Abstract:
We describe an approach to training a statistical parser from a bracketed corpus, and demonstrate its use in a software testing application that translates English specifications into an automated testing language. A grammar is not explicitly specified; the rules and contextual probabilities of occurrence are automatically generated from the corpus. The parser is extremely successful at producing and identifying the correct parse, and nearly deterministic in the number of parses that it produces. To compensate for undertraining, the parser also uses general, linguistic subtheories which aid in guessing some types of novel structures.
Citation (BibTeX):
Mark A. Jones and Jason M. Eisner (1992). A probabilistic parser applied to software testing documents. Proceedings of National Conference on Artificial Intelligence (AAAI-92), pages 322-328, San Jose
On-line document:
7 pp. postscript (161K), PDF (206K)
Relationship to other papers:
jump to this paper in my research summary

A Probabilistic Parser and Its Application (1992)

Abstract:
We describe a general approach to the probabilistic parsing of context-free grammars. The method integrates context-sensitive statistical knowledge of various types (e.g., syntactic and semantic) and can be trained incrementally from a bracketed corpus. We introduce a variant of the GHR context-free recognition algorithm, and explain how to adapt it for efficient probabilistic parsing. On a real-world corpus of sentences from software testing documents, with 23 possible parses for a sentence of average length, the system accurately finds the correct parse in 99% of cases, while producing only 1.02 parses per sentence. Significantly, the success rate would be only 66% without the semantic statistics.
Citation (BibTeX):
Mark A. Jones and Jason M. Eisner (1992). A probabilistic parser and its application. In Carl Weir (ed.), Statistically-Based Natural Language Processing Techniques: Papers from the 1992 Workshop, pages 20-27, Technical Report W-92-01, AAAI Press, Menlo Park.
On-line document:
8 pp. postscript (147K), PDF (220K)
Relationship to other papers:
jump to this paper in my research summary

Indirect STV Election: A Voting System for South Africa (1991)

Abstract:
"Winner take all" electoral systems are not fully representative. Unfortunately, the ANC's proposed system of proportional representation is not much better. Because it ensconces party politics, it is only slightly more representative, and poses a serious threat to accountability.

Many modern students of democracy favor proportional representation through the Single Transferable Vote (STV). In countries with high illiteracy, however, this system may be unworkable.

This paper proposes a practical modification of STV. In the modified system, each citizen votes for only one candidate. Voters need not specify their second, third, and fourth choices. Instead, each candidate specifies his or her second, third, and fourth choices. The modified system is no more difficult for voters than current proposals -- and it provides virtually all the benefits of STV, together with some new ones.

Citation (BibTeX):
Jason Eisner (1991). Indirect STV election: A voting system for South Africa. White paper, University of Cape Town, June. 16 pp.
On-line document:
16 pp. PDF (54K), Microsoft Word (59K)
Relationship to other papers:
jump to this paper in my research summary

Cognitive Science and the Search for Intelligence (1991)

Abstract:
This talk for a general audience gives a sketch of what the field of cognitive science is about. In its latter half, it turns to the philosophical question of defining intelligence, and proposes a non-operational alternative to the Turing Test.
Citation (BibTeX):
Jason Eisner (1991). Cognitive science and the search for intelligence. Invited paper presented to the Socratic Society, University of Cape Town, South Africa, May.
On-line document:
24 pp. postscript (1017K), PDF (217K), RTF (100K), MSWord (102K)
Relationship to other papers:
jump to this paper in my research summary

Dynamical-Systems Behavior in Recurrent and Non-Recurrent Connectionist Nets (1990)

Abstract:
A broad approach is developed for training dynamical behaviors in connectionist networks. General recurrent networks are powerful computational devices, necessary for difficult tasks like constraint satisfaction and temporal processing. These tasks are discussed here in some detail. From both theoretical and empirical considerations, it is concluded that such tasks are best addressed by recurrent networks that operate continuously in time -- and further, that effective learning rules for these continuous-time networks must be able to prescribe their dynamical properties. A general class of such learning rules is derived and tested on simple problems. Where existing learning algorithms for recurrent and non-recurrent networks only attempt to train a network's position in activation space, the models presented here can also explicitly and successfully prescribe the nature of its movement through activation space.
Citation (BibTeX):
Jason M. Eisner (1990). Dynamical-systems behavior in recurrent and non-recurrent connectionist nets. Undergraduate honors thesis, Harvard University, April.
On-line document:
57 pp. postscript (338K), PDF (417K)
Relationship to other papers:
jump to this paper in my research summary

Patents

A Lempel-Ziv Data Compression Technique Utilizing a Dictionary Pre-Filled with Frequent Letter Combinations, Words and/or Phrases (1996)

Abstract:
An adaptive compression technique which is an improvement to Lempel-Ziv (LZ) compression techniques, both as applied for purposes of reducing required storage space and for reducing the transmission time associated with transferring data from point to point. Pre-filled compression dictionaries are utilized to address the problem with prior Lempel-Ziv techniques in which the compression software starts with an empty compression dictionary, whereby little compression is achieved until the dictionary has been filled with sequences common in the data being compressed. In accordance with the invention, the compression dictionary is pre-filled, prior to the beginning of the data compression, with letter sequences, words and/or phrases frequent in the domain from which the data being compressed is drawn. The letter sequences, words, and/or phrases used in the pre-filled compression dictionary may be determined by statistically sampling text data from the same genre of text. Multiple pre-filled dictionaries may be utilized by the compression software at the beginning of the compression process, where the most appropriate dictionary for maximum compression is identified and used to compress the current data. These modifications are made to any of the known Lempel-Ziv compression techniques based on the variants detailed in 1977 and 1978 articles by Ziv and Lempel.
Citation (BibTeX):
Jeffrey C. Reynar, Fred Herz, Jason Eisner, and Lyle Ungar. A Lempel-Ziv data compression technique utilizing a dictionary pre-filled with frequent letter combinations, words and/or phrases. U.S. Patent #5,951,623 issued 9/14/1999, filed 1996.
On-line document:
#5,951,623 (at Patent Storm)
Relationship to other papers:
jump to this paper in my research summary

System for the Automatic Determination of Customized Prices and Promotions (1996)

Citation (BibTeX):
Frederick Herz, Lyle Ungar, and Jason M. Eisner. System for the automatic determination of customized prices and promotions. Patent pending, filed 1996.
Relationship to other papers:
jump to this paper in my research summary

A System for Customized Electronic Identification of Desirable Objects (1995)

Abstract:
This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a "target profile" for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a "target profile interest summary" for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media. Users' target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically-based pseudonym proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.
Citations (BibTeX):
  • Frederick S. M. Herz, Jason M. Eisner, and Lyle H. Ungar. System for generation of object profiles for a system for customized electronic identification of desirable objects. U.S. Patent #5,835,087 issued 11/10/1998, filed 1995.
  • Frederick S. M. Herz, Jason M. Eisner, Lyle H. Ungar, and Mitchell P. Marcus. System for generation of user profiles for a system for customized electronic identification of desirable objects. U.S. Patent #5,754,939 issued 5/19/1998, filed 1995.
  • Frederick S. M. Herz, Jason Eisner, and Marcos Salganicoff. Pseudonymous server for system for customized electronic identification of desirable objects. U.S. Patent #5,754,938 issued 5/19/1998, filed 1995.
On-line documents:
#5,835,087; #5,754,939; #5,754,983 (at Patent Storm)
Relationship to other papers:
jump to this paper in my research summary

This page online: http://cs.jhu.edu/~jason/papers
Jason Eisner - jason@cs.jhu.edu (tech correspondence welcome) Last Mod $Date: 2012/01/10 20:56:35 $