Here are some recommended papers that give a good sense of what we've worked on over the years.

Natural language problems often demand new algorithms. The main challenges are

a combinatorially large discrete space of linguistic structures a high-dimensional continuous space of statistical parameters

Many of our algorithmic papers give general solutions to some formal problem, and thus have multiple uses. I have occasionally proved hardness results.

What are all these discrete algorithms doing? “Structured prediction” is the problem of modeling unknown variables that are themselves complex structures, such as vectors, strings, or trees. The number of values for such a variable may be astronomically large. Searching for the most likely structure is a combinatorial optimization problem. Other combinatorial problems compute the probability of a particular structure or sub-structure.

Doesn't the machine learning community do structured prediction? Yes: graphical model inference must predict discrete vectors. But linguistics must predict strings and trees. Our systems must guess the syntactic structure of a sentence, the translation of a sentence, the grammar of a language, or the set of real-world facts that is consistent with a set of documents.

Dynamic programming is extremely useful for analyzing sequence data. The papers below introduce novel dynamic programming algorithms (primarily for parsing and machine translation).

Other cool papers, not listed below, show how dynamic programming algorithms can be embedded as efficient subroutines within variational inference (belief propagation), relaxation (dual decomposition, row generation), and large-neighborhood local search.

Machine learning often searches for parameters that maximize some non-convex and possibly expensive objective function. Our contributions include semiring computations of objective functions and their gradients, a special case of gradients by automatic differentiation (“back-propagation”); variational training objectives that are tractable to compute; and deterministic annealing methods that smooth a non-convex or non-differentiable objective function during early iterations of training.

My students and I often identify a pesky formal problem in statistical NLP or ML and try to give a general solution to it. The formal settings for our algorithms often involve finite-state machines, various kinds of grammars and synchronous grammars, and graphical models.

These objects are usually equipped with real-valued weights that define a structured prediction problem (see here). One can treat a wider variety of problems by allowing the weights to be elements of an arbitrary semiring.

Our work on weighted logic programming (see papers on the Dyna language) has led us to develop flexible algorithms for maintaining truth values in arithmetic circuits. Once upon a time, I was into neural networks.

These papers classify problems into specific computational complexity classes.

My students and I have worked in several ML settings, some of them novel. I have a relatively well-developed statistical philosophy, leading to the design of novel training objectives. We've also offered techniques for optimizing such objectives.

Other papers, not listed below, also do unsupervised learning, but using traditional approaches such as EM and MCEM. We develop such approaches for our transformation models and nonparametric models.

What does it mean to learn? What objective should a learning algorithm maximize?

Decision-making systems should be traineddiscriminatively even if they are structured generatively. Contrastive estimation guides an unsupervised learner to learn the “right” latent variables, by asking it to discriminate between positive data and certain implicit negative data. It is also popular for being fast. Structural annealing applies a domain-specific search bias during early learning. Finally, we have designed objectives for semi-supervised learning.

Bootstrapping is a general strategy for semi-supervised learning. Bootstrapping algorithms sometimes get confused and perform poorly. To cure this, our completely unsupervised “strapping” method is remarkably effective at selecting a successful run of bootstrapping.

We also relate bootstrapping to entropy regularization and apply it in a feature-rich setting of grammar induction.

Intelligent systems may be structured to do approximate probabilistic inference under some carefully crafted model. However, they should be trained discriminatively, so that their actual decisions or predictions maximize task-specific performance. This allows the parameters to compensate any approximations in modeling, inference, and decoding.

My philosophy comes from Bayesian decision theory:

The task of science is generative modeling. Given a data sample and some knowledge of the underlying processes, what is the true data distribution D? The task of engineering is to find a decision rule that is expected to be accurate and perhaps also fast (on distribution D).

This is the proper relation between generative and discriminative modeling. One should design a sensible space of decision rules (policies) and explicitly search for one having high expected performance over D. In practice this can greatly improve accuracy.

Do you have a realistic model of your domain? Then probabilistic inference will be intractable or slow. But do you really need exact inference? Often, you could confidently make a prediction without reasoning about all of the potentially relevant variables. Many variables have redundant or negligible influence on the final decision.

“Cost-aware learning” refers to learning policies for cheap but accurate inference. This is a form of discriminative training (discussed here) where the cost function includes terms for inference speed, data acquisition cost, model size, etc. Our papers are below.

We are particularly interested in learning inference policies that make dynamic decisions at runtime about where to spend computation (e.g., for parsing or arithmetic circuits). More simply, however, one can choose a static policy for each domain or each example.

A human annotator is an AI system hidden inside a skull. Fortunately, it is not a black-box system. We shouldn't just ask annotators to give us the right answers on training data—while they're at it, they can also mark why they chose those answers.

We showed below that annotator “rationales” are efficient to gather and can be exploited to improve classification accuracy. The idea has been adopted elsewhere in NLP and in computer vision.

How do young children listen to their native language and figure out its linguistic structure? I'm dying to know, but I would settle for solving the related NLP problem of grammar induction. Given word sequences or part-of-speech sequences that were presumably generated by a (natural language) grammar, we seek to identify the grammar or the resulting tree structures.

Locally maximizing likelihood (the inside-outside algorithm) does quite poorly on this task for a variety of reasons. We have tried to get some insight into the problems and address them with a variety of search methods and modified objective functions. However, this area is far from solved and we are considering new angles.

I have also worked on learning “deep” grammar from “surface” grammar by modeling syntactic transformations. Beyond syntactic grammars, see also our work on inducing morphological and phonological grammars.

Below are papers that present technically innovative models of linguistic domains (not just algorithms for existing models).

We have designed various classes of generative models. These models are of general interest although they were motivated by linguistic problems. They include transformation models and variations on topic models. Some of our models are nonparametric. We have also extended Markov random fields to string-valued random variables.

In Bayesian modeling, one often uses a Dirichlet distribution or Dirichlet process as a prior for a discrete distribution. These priors have the neutrality property: if event x is observed, we raise our posterior estimate of p(x) and correspondingly scale down the estimate of p(y) for all other y.

However, what if x and y are “related” events? In that case, their probability should covary——observing one should raise the estimated probability of the other. A transformation model captures this by positing that some instances of y were derived by transformation from x. Indeed, p(y) is defined by summing over all transformation sequences that would terminate at y. We fit a feature-based model of the transformation probabilities, permitting generalization to new events.

Beyond devising exact algorithms, we have developed several principled approximations for speeding up parsing, both for basic models and for enriched models where exact parsing would be impractical.

A number of our papers (not all shown below) try to improve the actual models of linguistic syntax that are used in parsing. For example, several of these algorithms aim to preserve speed for lexicalized models of grammar, which acknowledge that two different verbs (for example) may behave differently.

Remark: The probabilities under lexicalized models can capture some crude semantic preferences along with syntax (i.e., selectional preferences). In fact, in our very early work, we actually conditioned probabilities on words according to their role in a semantic representation. I subsequently argued for bilexical parsing as an approximation to this, and gave the first generative model for dependency parsing (which was also the first non-edge-factored model).

A natural-language grammar will generally contain many related syntactic constructions for a given word (e.g., active and passive). Most grammar formalisms explain this redundancy by assuming some mechanism for generating new constructions systematically from old ones.

My dissertation work showed how to model these “syntactic transformations” statistically, learning how deeply related rules covary. It inferred the deep relationships from a sample of observed constructions, enabling it to generalize to unseen constructions (“transformational smoothing”). This work introduced the more general technique of transformation modeling.

The 2008, 2009, and 2011 papers below built up an elegant model of inflectional morphology, with each paper building on the previous one. The work is gathered together in Dreyer's dissertation. Further work beginning in 2015 extended the approach to use latent underlying morphs, allowing it to treat derivational morphology as well.

Most syntax-based models of translation assume that in training data, a sentence and its translation have isomorphic syntactic structure. The papers below work to weaken that assumption, which often fails in practice.

Below are all our papers on machine translation—an assortment of interesting techniques motivated by different search, learning, and modeling challenges in MT. If there is any consistent theme, it is that we usually work with a full probability distribution over possible translations, not just its mode.

I was in graduate school when Optimality Theory took over phonology. There was no computational treatment of OT yet. I provided key finite-state algorithms for generation and comprehension, and proposed plausible modifications to the formalism to keep it within finite-state power. I also analyzed the computational complexity of grammar learning.

In order to do this computational work, I first had to conjecture a universal set of legal constraints (i.e., universal grammar). Nearly all the constraints I found in hundreds of OT papers fit into my simple taxonomy of “primitive constraints.” For those that didn't fit, I exhibited an alternative analysis within my framework of the linguistic data. The “primitive OT” analysis of metrical stress is arguably superior because it predicted previously unexplained typological gaps.

More recently, my students and I have worked on recovering the phonological underlying forms of a language, jointly with a probabilistic phonology.

The Dyna language is our bid to provide a unifying framework for data and algorithms across many settings.

Programming in Dyna is meant to be easy. A program is a short, high-level schematic description of the structure of a computation. It simply defines various values in terms of other values. The user can query and update values at runtime.

It is the system's job to choose efficient data structures and computations to support the possible queries and updates. This leads to interesting algorithmic problems, connected to query planning, deductive databases, and adaptive systems.

The forthcoming version of the language is described in Eisner & Filardo (2011), which illustrates its power on a wide range of problems in statistical AI and beyond. We released a prototype back in 2005, which was limited to semiring-weighted computations but has been used profitably in a number of NLP papers. The new working implementation under development is available here on github.

We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n^{3}) runtime. It outputs the parse with maximum expected recall—but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find the gradient by back-propagation, treating the entire parser (approximations and all) as a differentiable circuit, as Stoyanov et al. (2011) and Domke (2010) did for loopy CRFs.

We investigate dual decomposition for joint MAP inference of many strings. Given an arbitrary graphical model, we decompose it into small acyclic sub-models, whose MAP configurations can be found by finite-state composition and dynamic programming. We force the solutions of these subproblems to agree on overlapping variables, by tuning Lagrange multipliers for an adaptively expanding set of variable-length n-gram count features.

This is the first inference method for arbitrary graphical models over strings that does not require approximations such as random sampling, message simplification, or a bound on string length. Provided that the inference method terminates, it gives a certificate of global optimality (though MAP inference in our setting is undecidable in general). On our global phonological inference problems, it does indeed terminate, and achieves more accurate results than max-product and sum-product loopy belief propagation.

We present penalized expectation propagation, a novel algorithm for approximate inference in graphical models. Expectation propagation is a well-known message-passing algorithm that prevents messages from growing in complexity by forcing them back into a given family of distributions. Our extension uses a structured-sparsity penalty to prefer simpler messages. In the case of string-valued random variables, this allows us to use a non-parametric message family, related to variable-order n-gram models. The method automatically calibrates the complexity of each message to balance speed and accuracy. We test the algorithm on phonological inference problems, showing substantial speedup over previous algorithms with no significant loss in accuracy.

The observed pronunciations or spellings of words are often explained as arising from the “underlying forms” of their morphemes. These forms are latent strings that linguists try to reconstruct by hand. We propose to reconstruct them automatically at scale, enabling generalization to new words. Given some surface word types of a concatenative language along with the abstract morpheme sequences that they express, we show how to recover consistent underlying forms for these morphemes, together with the (stochastic) phonology that maps each concatenation of underlying forms to a surface form. Our technique involves loopy belief propagation in a natural directed graphical model whose variables are unknown strings and whose conditional distributions are encoded as finite-state machines with trainable weights. We define training and evaluation paradigms for the task of surface word prediction, and report results on subsets of 7 languages.

Branch-and-bound is a widely used method in combinatorial optimization, including mixed integer programming, structured prediction and MAP inference. While most work has been focused on developing problem-specific techniques, little is known about how to systematically design the node searching strategy on a branch-and-bound tree. We address the key challenge of learning an adaptive node searching order for any class of problem solvable by branch-and-bound. Our strategies are learned by imitation learning. We apply our algorithm to linear programming based branch-and-bound for solving mixed integer programs (MIP). We compare our method with one of the fastest open-source solvers, SCIP; and a very efficient commercial solver, Gurobi. We demonstrate that our approach achieves better solutions faster on four MIP libraries.

Under multi-headed dependency grammar, a parse is a connected DAG rather than a tree. Such formalisms can be more syntactically and semantically expressive. However, it is hard to train, test, or improve multi-headed parsers because few multi-headed corpora exist, particularly for the projective case. To help fill this gap, we observe that link grammar already produces undirected projective graphs. We use Integer Linear Programming to assign consistent directions to the labeled links in a corpus of several thousand parses produced by the Link Grammar Parser, which has broad-coverage hand-written grammars of English as well as Russian and other languages. We find that such directions can indeed be consistently assigned in a way that yields valid multi-headed dependency parses. The resulting parses in English appear reasonably linguistically plausible, though differing in style from CoNLL-style parses of the same sentences; we discuss the differences.

Entity clustering must determine when two named-entity mentions refer to the same entity. Typical approaches use a pipeline architecture that clusters the mentions using fixed or learned measures of name and context similarity. In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Clustering the mentions into entities depends on recovering this copying tree jointly with estimating models of the mutation process and parent selection process. We present a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets.

String similarity is most often measured by weighted or unweighted edit distance d(x,y). Ristad and Yianilos (1998) defined stochastic edit distance—a probability distribution p(y|x) whose parameters can be trained from data. We generalize this so that the probability of choosing each edit operation can depend on contextual features. We show how to construct and train a probabilistic finite-state transducer that computes our stochastic contextual edit distance. To illustrate the improvement from conditioning on context, we model typos found in social media text.

Note: The PFSTs developed in this paper are used within the Bayesian network of Cotterell et al. (2015).

Feature computation and exhaustive search have significantly restricted the speed of graph-based dependency parsing. We propose a faster framework of dynamic feature selection, where features are added sequentially as needed, edges are pruned early, and decisions are made online for each sentence. We model this as a sequential decision-making problem and solve it by imitation learning techniques. Our dynamic parser can achieve accuracies comparable or even superior to parsers using a full set of features, while computing fewer than 30% of the feature templates.

We present an open-source virtual manipulative for conditional log-linear models. This web-based interactive visualization lets the user tune the probabilities of various shapes—which grow and shrink accordingly—by dragging sliders that correspond to feature weights. The visualization displays a regularized training objective; it supports gradient ascent by optionally displaying gradients on the sliders and providing “Step” and “Solve” buttons. The user can sample parameters and datasets of different sizes and compare their own parameters to the truth. Our website, http://cs.jhu.edu/~jason/tutorials/loglin/, guides the user through a series of interactive lessons and provides auxiliary readings, explanations, practice problems and resources.

Linguistics Olympiads, now offered in more than 20 countries, provide secondary-school students a compelling introduction to an unfamiliar field. The North American Computational Linguistics Olympiad (NACLO) includes computational puzzles in addition to purely linguistic ones. This paper explores the computational subject matter we want to convey via NACLO, as well as some of the challenges that arise when adapting problems in computational linguistics to an audience that may have no background in computer science, linguistics, or advanced mathematics. We present a small library of reusable design patterns that have proven useful when composing puzzles appropriate for secondary-school students.

Many models in NLP involve latent variables, such as unknown parses, tags, or alignments. Finding the optimal model parameters is then usually a difficult nonconvex optimization problem. The usual practice is to settle for local optimization methods such as EM or gradient ascent.

We explore how one might instead search for a global optimum in parameter space, using branch-and-bound. Our method would eventually find the global maximum (up to a user-specified ε) if run for long enough, but at any point can return a suboptimal solution together with an upper bound on the global maximum.

As an illustrative case, we study a generative model for dependency parsing. We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. We show how to formulate this as a mixed integer quadratic programming problem with nonlinear constraints. We use the Reformulation Linearization Technique to produce convex relaxations during branch-and-bound. Although these techniques do not yet provide a practical solution to our instance of this NP-hard problem, they sometimes find better solutions than Viterbi EM with random restarts, in the same time.

Message scheduling is shown to be very effective in belief propagation (BP) algorithms. However, most existing scheduling algorithms use ﬁxed heuristics regardless of the structure of the graphs or properties of the distribution. On the other hand, designing diﬀerent scheduling heuristics for all graph structures is not feasible. In this paper, we propose a reinforcement learning based message scheduling framework (RLBP) to learn the heuristics automatically which generalizes to any graph structures and distributions. In the experiments, we show that the learned problem-speciﬁc heuristics largely outperform other baselines in speed.

Grammar induction aims to induce the parameters of a probabilistic context-free grammar (PCFG). Crucially, the same parameters should be used not only at different positions in the sentence (as in convolutional networks for vision) but also at different levels in the tree.

We consider how several ideas from deep learning can help construct a PCFG from the bottom up while resisting bad local optima. Our proposed architecture learns to model a sentence as a sequence of phrases, each generated by a PCFG. The root nonterminal of each phrase is predicted from the context of that phrase. Formally this is a kind of autoencoder that predicts the sentence from itself, but structured as a semi-Markov CRF whose emission distribution is a PCFG, and which predicts each phrase only from context. During learning, we “anneal” the search bias from generating a long sequence of 1-word phrases (so the method finds word embeddings based on context) to a single phrase that covers the whole sentence (at which point we have an ordinary PCFG with no dependence on context).

Each run of the learner derives features of the context from the grammars found on previous, more naive runs. This stacking of multiple runs is what makes the method deep. We also mention extensions that involve supervised fine-tuning or richer, vector-valued representations of words and nonterminals.

Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing (Kay, 1986). Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” is far too good to successfully imitate with our inexpensive features. Moreover, it is not specifically tuned for the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with only a few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.

Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle's ability and the learner's policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to a novel cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic features selection and two static feature selection methods.

We describe an approach to coreference resolution that relies on the intuition that easy decisions should be made early, while harder decisions should be left for later when more information is available. We are inspired by the recent success of the rule-based system of Raghunathan et al. (2010), which relies on the same intuition. Our system, however, automatically learns from training data what constitutes an easy decision. Thus, we can utilize more features, learn more precise weights, and adapt to any dataset for which training data is available. Experiments show that our system outperforms recent state-of-the-art coreference systems including Raghunathan et al.’s system as well as a competitive baseline that uses a pairwise classifier.

Arithmetic circuits arise in the context of weighted logic programming languages, such as Datalog with aggregation, or Dyna. A weighted logic program defines a generalized arithmetic circuit—the weighted version of a proof forest, with nodes having arbitrary rather than boolean values. In this paper, we focus on finite circuits. We present a flexible algorithm for efficiently querying node values as they change under updates to the circuit's inputs. Unlike traditional algorithms, ours is agnostic about which nodes are tabled (materialized), and can vary smoothly between the traditional strategies of forward and backward chaining. Our algorithm is designed to admit future generalizations, including cyclic and infinite circuits and propagation of delta updates.

Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.

Note: The experiments in this paper are incomplete—we regrettably had to omit some other experimental results because of a bug in the code. You can find the proper results on the slides (and on the poster, from the October 2012 Mid-Atlantic Student Colloquium on Speech, Language and Learning).

In many domains, our best models are computationally intractable. This problem will only get worse as we manage to build more richly detailed models of specific domains. Fortunately, the practical goal of artificial or natural intelligence is not to do perfect detailed inference, but rather to answer specific questions by reasoning from observed data. Thus, we should seek policies for fast approximate inference that will actually achieve low expected loss on our target task. The target task is a distribution not only over test-time data but also over which variables will be observed and queried. The loss function may explicitly penalize for runtime (or data acquisition).

This story leaves open an engineering question: What space of policies should we search? I will review a range of options and point out past work for each. Among others, I will show our own recent successes using message-passing approximate inference policies for graphical models. The form of these policies is determined by the structure of our intractable and surely mismatched domain model, but we tune the parameters to minimize loss (Stoyanov & Eisner, 2012).

One may search a space of sophisticated policies or a space of crude hacks, but crucially, one should tune the policy parameters to optimize expected error and runtime. This expectation can be taken over training data (empirical risk minimization), or over samples from a posterior belief about the target task (which I will call imputed risk minimization). The latter case requires some sort of prior model, but this is necessary when the empirical risk estimate suffers from sparse, non-independent, or out-of-domain training data.

Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs for particular problems or datasets. We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing (Kay, 1986). Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is too large to explore naively. We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.

Note: Consider instead the NIPS 2012 paper that is the "final version" of this workshop paper.

We present an instance-specific test-time dynamic feature selection algorithm. Our algorithm sequentially chooses features given previously selected features and their values. It stops the selection process to make a prediction according to a user-specified accuracy-cost trade-off. We cast the sequential decision-making problem as a Markov Decision Process and apply imitation learning techniques. We address the problem of learning and inference jointly in a simple multiclass classification setting. Experimental results on UCI datasets show that our approach achieves the same or higher accuracy using only a small fraction of features than static feature selection methods.

We are interested in speeding up approximate inference in Markov Random Fields (MRFs). We present a new method that uses gates—binary random variables that determine which factors of the MRF to use. Which gates are open depends on the observed evidence; when many gates are closed, the MRF takes on a sparser and faster structure that omits “unnecessary” factors. We train parameters that control the gates, jointly with the ordinary MRF parameters, in order to locally minimize an objective that combines loss and runtime.

With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters.

We propose an algorithm to find the best path through an intersection of arbitrarily many weighted automata, without actually performing the intersection. The algorithm is based on dual decomposition: the automata attempt to agree on a string by communicating about features of the string. We demonstrate the algorithm on the Steiner consensus string problem, both on synthetic data and on consensus decoding for speech recognition. This involves implicitly intersecting up to 100 automata.

Unsupervised learning techniques can take advantage of large amounts of unannotated text, but the largest text corpus (the Web) is not easy to use in its full form. Instead, we have statistics about this corpus in the form of n-gram counts (Brants and Franz, 2006). While n-gram counts do not directly provide sentences, a distribution over sentences can be estimated from them in the same way that n-gram language models are estimated. We treat this distribution over sentences as an approximate corpus and show how unsupervised learning can be performed on such a corpus using variational inference. We compare hidden Markov model (HMM) training on exact and approximate corpora of various sizes, measuring speed and accuracy on unsupervised part-of-speech tagging.

Conditional Random Fields (CRFs) are a popular formalism for structured prediction in NLP. It is well known how to train CRFs with certain topologies that admit exact inference, such as linear-chain CRFs. Some NLP phenomena, however, suggest CRFs with more complex topologies. Should such models be used, considering that they make exact inference intractable? Stoyanov et al. (2011) recently argued for training parameters to minimize the task-specific loss of whatever approximate inference and decoding methods will be used at test time. We apply their method to three NLP problems, showing that (i) using more complex CRFs leads to improved performance, and that (ii) minimum-risk training learns more accurate models.

We present a new framework for learning high-dimensional multivariate probability distributions from estimated marginals. The approach is motivated by compositional models and Bayesian networks, and designed to adapt to small sample sizes. We start with a large, overlapping set of elementary statistical building blocks, or “primitives,” which are low-dimensional marginal distributions learned from data. Each variable may appear in many primitives. Subsets of primitives are combined in a lego-like fashion to construct a probabilistic graphical model; only a small fraction of the primitives will participate in any valid construction. Since primitives can be precomputed, parameter estimation and structure search are separated. Model complexity is controlled by strong biases; we adapt the primitives to the amount of training data and impose rules which restrict the merging of them into allowable compositions. The likelihood of the data decomposes into a sum of local gains, one for each primitive in the final structure. We focus on a specific subclass of networks which are binary forests. Structure optimization corresponds to an integer linear program and the maximizing composition can be computed for reasonably large numbers of variables. Performance is evaluated using both synthetic data and real datasets from natural language processing and computational biology.

Could we explicitly train test-time inference heuristics to trade off accuracy and efficiency? We focus our discussion on agenda-based natural language parsing under a weighted context-free grammar. We frame the problem as reinforcement learning, discuss its special properties, and propose new strategies.

Probabilistic graphical models are typically trained to maximize the likelihood of the training data and evaluated on some measure of accuracy on the test data. However, we are also interested in learning to produce predictions quickly. For example, one can speed up loopy belief propagation by choosing sparser models and by stopping at some point before convergence. We manage the speed-accuracy tradeoff by explicitly optimizing for a linear combination of speed and accuracy. Although this objective is not differentiable, we can compute the gradient of a smoothed version.

Because of the neutrality property, a Dirichlet (process) prior on a discrete distribution cannot capture correlations among the probabilities of “similar” events. We propose obtaining the discrete distribution instead from a random walk model or transformation model, in which each observed event has evolved via a latent sequence of transformations. We are exploring transformation models in which the conditional distributions have infinite support and the prior over them is nonparametric.

Latent Dirichlet Allocation (LDA) has been used to learn selectional preferences as soft disjunctions over flat semantic classes. Our model, the SCTM, also learns the structure of each class as a soft conjunction of high-level semantic features.

We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50-100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.

Note: Additional details are given in the dissertation of Dreyer (2011), and on the associated slides.

Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target-language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.

Note: Additional details are given in the dissertation of Li (2010).

The field of statistical natural language processing has been turning toward morphologically rich languages. These languages have vocabularies that are often orders of magnitude larger than that of English, since words may be inflected in various different ways. This leads to problems with data sparseness and calls for models that can deal with this abundance of related words—models that can learn, analyze, reduce and generate morphological inflections. But surprisingly, statistical approaches to morphology are still rare, which stands in contrast to the many recent advances of sophisticated models in parsing, grammar induction, translation and many other areas of natural language processing.

This thesis presents a novel, unified statistical approach to inflectional morphology, an approach that can decode and encode the inflectional system of a language. At the center of this approach stands the notion of inflectional paradigms. These paradigms cluster the large vocabulary of a language into structured chunks; inflections of the same word, like break, broke, breaks, breaking,..., all belong in the same paradigm. And moreover, each of these inflections has an exact place within a paradigm, since each paradigm has designated slots for each possible inflection; for verbs, there is a slot for the first person singular indicative present, one for the third person plural subjunctive past and slots for all other possible forms. The main goal of this thesis is to build probability models over inflectional paradigms, and therefore to sort the large vocabulary of a morphologically rich language into structured clusters. These models can be learned with minimal supervision for any language that has inflectional morphology. As training data, some sample paradigms and a raw, unannotated text corpus can be used.

The models over morphological paradigms are developed in three main chapters that start with smaller components and build up to larger ones.

The first of these chapters (Chapter 2) presents novel probability models over strings and string pairs. These are applicable to lemmatization or to relate a past tense form to its associated present tense form, or for similar morphological tasks. It turns out they are general enough to tackle the popular task of transliteration very well, as well as other string-to-string tasks.

The second (Chapter 3) introduces the notion of a probability model over multiple strings, which is a novel variant of Markov Random Fields. These are used to relate the many inflections in an inflectional paradigm to one another, and they use the probability models from Chapter 2 as components. A novel version of belief propagation is presented, which propagates distributions over strings through a network of connected finite-state transducers, to perform inference in morphological paradigms (or other string fields).

Finally (Chapter 4), a non-parametric joint probability model over an unannotated text corpus and the morphological paradigms from Chapter 3 is presented. This model is based on a generative story for inflectional morphology that naturally incorporates common linguistic notions, such as lexemes, paradigms and inflections. Sampling algorithms are presented that perform inference over large text corpora and their implicit, hidden morphological paradigms. We show that they are able to discover the morphological paradigms that are implicit in the corpora. The model is based on finite-state operations and seamlessly handles concatenative and nonconcatenative morphology.

Note: Dr. Dreyer's dissertation advisor was Jason Eisner.

Graphical models are often used “inappropriately,” with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic meta-descent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.

Note:Stoyanov and Eisner (2012a) subsequently applied this training method to three structured prediction problems in NLP, getting striking accuracy improvements. Gormley et al. (2015) applied it to dependency parsing by belief propagation. Stoyanov and Eisner (2011, 2012b) gave preliminary extensions to optimize speed jointly with accuracy.

Modern statistical AI systems are quite large and complex; this interferes with research, development, and education. We point out that most of the computation involves database-like queries and updates on complex views of the data. Specifically, recursive queries look up and aggregate relevant or potentially relevant values. If the results of these queries are memoized for reuse, the memos may need to be updated through change propagation. We propose a declarative language, which generalizes Datalog, to support this work in a generic way. Through examples, we show that a broad spectrum of AI algorithms can be concisely captured by writing down systems of equations in our notation. Many strategies could be used to actually solve those systems. Our examples motivate certain extensions to Datalog, which are connected to functional and object-oriented programming paradigms.

Note: The followup paper Filardo and Eisner (2012) explains execution mechanisms for handling queries and updates.

Confusion network decoding for MT system combination.
Antti-Veikko Rosti, Eugene Matusov, Jason Smith, Necip Ayan, Jason
Eisner, Damianos Karakos, Sanjeev Khudanpur, Gregor Leusch, Zhifei Li, Spyros
Matsoukas, Hermann Ney, Richard Schwartz, B. Zhang, and J. Zheng (2011).
In Handbook of Natural Language Processing and Machine
Translation. [ paper | book | bib ]

Note: This chapter includes an speedup of Karakos et al. (2008) using an A^{*} heuristic. See p. 355.

Much recent work in natural language processing treats linguistic analysis as an inference problem over graphs. This development opens up useful connections between machine learning, graph theory, and linguistics.

The first part of this dissertation formulates syntactic dependency parsing as a dynamic Markov random field with the novel ingredient of global constraints. Global constraints are enforced by calling combinatorial optimization algorithms as subroutines during message-passing inference in the graphical model, and these global constraints greatly improve on the accuracy of collections of local constraints. In particular, combinatorial subroutines enforce the constraint that the parser's output must form a tree. This is the first application that uses efficient computation of marginals for combinatorial problems to improve the speed and accuracy of belief propagation. If the dependency tree is projective, the tree constraint exploits the inside-outside algorithm; if non-projective, with discontiguous constituents, it exploits the directed matrix-tree theorem, here newly applied to NLP problems. Even with second-order features or latent variables, which would make exact parsing asymptotically slower or NP-hard, approximate inference with belief propagation is as efficient as a simple edge-factored parser times a constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features increases the runtime additively rather than multiplicatively.

The second part extends these models to capture correspondences among non-isomorphic structures. When bootstrapping a parser in a low-resource target language by exploiting a parser in a high-resource source language, models that score the alignment and the correspondence of divergent syntactic configurations in translational sentence pairs achieve higher accuracy in parsing the target language. These noisy (quasi-synchronous) mappings have further applications in adapting parsers across domains, in learning features of the syntax-semantics interface, and in question answering, paraphrasing, and information retrieval.

Note: Dr. Smith's dissertation advisor was Jason Eisner.

An unsupervised discriminative training procedure is proposed for estimating a language model (LM) for machine translation (MT). An English-to-English synchronous context-free grammar is derived from a baseline MT system to capture translation alternatives: pairs of words, phrases or other sentence fragments that potentially compete to be the translation of the same source-language fragment. Using this grammar, a set of impostor sentences is then created for each English sentence to simulate confusions that would arise if the system were to process an (unavailable) input whose correct English translation is that sentence. An LM is then trained to discriminate between the original sentences and the impostors. The procedure is applied to the IWSLT Chinese-to-English translation task, and promising improvements on a state-of-the-art MT system are demonstrated.

Note: Additional details are given in the dissertation of Li (2010).

In lexicalized phrase-structure or dependency parses, a word's modifiers tend to fall near it in the string. This fact can be exploited by parsers. We first show that a crude way to use dependency length as a parsing feature can substantially improve parsing speed and accuracy in English and Chinese, with more mixed results on German. We then show similar improvements by imposing hard bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. The approach with hard bounds, “vine grammar,” accepts only a regular language, even though it happily retains a context-free parameterization and defines meaningful parse trees. We show how to parse this language in O(n) time, using a novel chart parsing algorithm with a low grammar constant (rather than an impractically large finite-state recognizer with an exponential grammar constant). For a uniform hard bound of k on dependencies of all types, our algorithm's runtime is O(nk^{2}). We also extend our algorithm to parse weighted-FSA inputs such as lattices.

Note: This book chapter extends Eisner & Smith (2005) with considerable new material (e.g., lattice parsing).

A hypergraph or “packed forest” is a compact data structure that uses structure-sharing to represent exponentially many trees in polynomial space. A probabilistic/weighted hypergraph also defines a probability (or other weight) for each tree, and can be used to represent the hypothesis space considered (for a given input) by a monolingual parser or a tree-based translation system (e.g., tree to string, string to tree, tree to tree, or string to string with latent tree structures).

Given a weighted/probabilistic hypergraph, we might ask three questions. What atomic operations can we perform on the weighted hypergraph? How do we set the weights in the hypergraph? Which particular translation (among the possible translations encoded in a hypergraph) should we present to an end user? These correspond to three fundamental problems: inference, training, and decoding, for which this dissertation will present novel techniques.

The atomic inference operations we may want to perform include finding one-best, k-best, or expectations over the hypergraph. To perform each operation, we may implement a dedicated dynamic programming algorithm. However, a more general framework to specify these algorithms is semiring-weighted logic programming. Within this framework, we first extend the expectation semiring, which is originally proposed for a finite state automaton, to a hypergraph. We then propose a novel second-order expectation semiring. These semirings can be used to compute a large number of expectations (e.g., entropy and its gradient) over the exponentially many trees presented in a hypergraph.

The weights used in a hypergraph are usually learnt by a discriminative training method. One common drawback of such method is that it relies on the existence of high-quality supervised data (i.e., bilingual data), which may be expensive to obtain. We present two unsupervised discriminative training methods: minimum imputed-risk training, and contrastive language model estimation, both can exploit monolingual English data to perform discriminative training. In minimum imputed-risk training, we first use a reverse translation model to impute the missing inputs, and then train a discriminative forward model by minimizing the expected loss of the forward translations of the missing inputs.

In contrast, the contrastive language model estimation does not use a reverse system. It first extracts a confusion grammar, then generates many alternative sentences (i.e., a contrastive set) for each English sentence using the confusion grammar, and finally trains a discriminative language model on the contrastive sets such that the model will prefer the original English sentences (over the sentences in the contrastive sets).

During decoding, we are interested in finding a translation that has a maximum posterior probability (i.e., MAP decoding). However, this is intractable due to spurious ambiguity, a situation where the probability of a translation string is split among many distinct derivations (e.g., trees or segmentations). Therefore, most systems use a simple Viterbi decoding that approximates the string probability with its most probable derivation's probability. Instead, we develop a variational approximation, which considers all the derivations but still allows tractable decoding. Our particular variational distributions are parameterized as n-gram models. We also analytically show that interpolating these n-gram models for different n is similar to lattice-based minimum-risk decoding for BLEU. Experiments show that our approach improves the state of the art.

All the above methods have been implemented in an open-source machine translation toolkit Joshua. In this dissertation, the methods have mainly been applied to a machine translation task, butwe expect that they will also find applications in other areas of natural language processing (e.g., parsing and speech recognition).

Note: Dr. Li's dissertation advisor and co-advisor were Sanjeev Khudanpur and Jason Eisner.

Many statistical translation models can be regarded as weighted logical deduction. Under this paradigm, we use weights from the expectation semiring (Eisner, 2002), to compute first-order statistics (e.g., the expected hypothesis length or feature counts) over packed forests of translations (lattices or hypergraphs). We then introduce a novel second-order expectation semiring, which computes second-order statistics (e.g., the variance of the hypothesis length or the gradient of entropy). This second-order semiring is essential for many interesting training paradigms such as minimum risk, deterministic annealing, active learning, and semi-supervised learning, where gradient descent optimization requires computing the gradient of entropy or risk. We use these semirings in an open-source machine translation toolkit, Joshua, enabling minimum-risk training for a benefit of up to 1.0 BLEU point.

Note: Additional introduction and details are given in the dissertation of Li (2010).

We study graphical modeling in the case of string-valued random variables. Whereas a weighted finite-state transducer can model the probabilistic relationship between two strings, we are interested in building up joint models of three or more strings. This is needed for inflectional paradigms in morphology, cognate modeling or language reconstruction, and multiple-string alignment. We propose a Markov Random Field in which each factor (potential function) is a weighted finite-state machine, typically a transducer that evaluates the relationship between just two of the strings. The full joint distribution is then a product of these factors. Though decoding is actually undecidable in general, we can still do efficient joint inference using approximate belief propagation; the necessary computations and messages are all finite-state. We demonstrate the methods by jointly predicting morphological forms.

Note: Additional details are given in the dissertation of Dreyer (2011), and on the associated slides. Cotterell and Eisner (2015) give an improved inference method for graphical models over strings. Cotterell et al. (2015) extended the morphological modeling approach to handle latent underlying morphs and derivational morphology.

We connect two scenarios in structured learning: adapting a parser trained on one corpus to another annotation style, and projecting syntactic annotations from one language to another. We propose quasi-synchronous grammar (QG) features for these structured learning tasks. That is, we score a aligned pair of source and target trees based on local features of the trees and the alignment. Our quasi-synchronous model assigns positive probability to any alignment of any trees, in contrast to a synchronous grammar, which would insist on some form of structural parallelism.

In monolingual dependency parser adaptation, we achieve high accuracy in translating among multiple annotation styles for the same sentence. On the more difficult problem of cross-lingual parser projection, we learn a dependency parser for a target language by using bilingual text, an English parser, and automatic word alignments. Our experiments show that unsupervised QG projection improves on parses trained using only high-precision projected annotations and far outperforms, by more than 35% absolute dependency accuracy, learning an unsupervised parser from raw target-language text alone. When a few target-language parse trees are available, projection gives a boost equivalent to doubling the number of target-language trees.

Note: Additional details are given in the dissertation of Smith (2010).

We apply machine learning to the Linear Ordering Problem in order to learn sentence-specific reordering models for machine translation. We demonstrate that even when these models are used as a mere preprocessing step for German-English translation, they significantly outperform Moses' integrated lexicalized reordering model.

Our models are trained on automatically aligned bitext. Their form is simple but novel. They assess, based on features of the input sentence, how strongly each pair of input word tokens wi;wj would like to reverse their relative order. Combining all these pairwise preferences to find the best global reordering is NP-hard. However, we present a non-trivial O(n^{3}) algorithm, based on chart parsing, that at least finds the best reordering within a certain exponentially large neighborhood. We show how to iterate this reordering process within a local search algorithm, which we use in training.

Note: Additional details are given in the dissertation of Tromble (2009).

Statistical models in machine translation exhibit spurious ambiguity. That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations). In principle, the goodness of a string is measured by the total probability of its many derivations. However, finding the best string (e.g., during decoding) is then computationally intractable. Therefore, most systems use a simple Viterbi approximation that measures the goodness of a string using only its most probable derivation. Instead, we develop a variational approximation, which considers all the derivations but still allows tractable decoding. Our particular variational distributions are parameterized as n-gram models. We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). Experiments show that our approach improves the state of the art.

Note: Nominated for best paper award. Additional introduction and details are given in the dissertation of Li (2010).

The field of AI has become implementation-bound. We have plenty of ideas, but it is increasingly laborious to try them out, as our models become more ambitious and our datasets become larger, noisier, and more heterogeneous. The software engineering burden makes it hard to start new work; hard to reuse and combine existing ideas; and hard to educate our students.

In this talk, I'll propose to hide many common implementation details behind a new level of abstraction that we are developing. Dyna is a declarative programming language that combines logic programming with functional programming. It also supports modularity. It may be regarded as a kind of deductive database, theorem prover, truth maintenance system, or equation solver.

I will illustrate how Dyna makes it easy to specify the combinatorial structure of typical computations needed in natural language processing, machine learning, and elsewhere in AI. Then I will sketch implementation strategies and program transformations that can help to make these computations fast and memory-efficient. Finally, I will suggest that machine learning should be used to search for the right strategies for a program on a particular workload.

This dissertation is about ordering. The problem of arranging a set of n items in a desired order is quite common, as well as fundamental to computer science. Sorting is one instance, as is the Traveling Salesman Problem. Each problem instance can be thought of as optimization of a function that applies to the set of permutations.

The dissertation treats word reordering for machine translation as another instance of a combinatorial optimization problem. The approach introduced is to combine three different functions of permutations. The first function is based on finite-state automata, the second is an instance of the Linear Ordering Problem, and the third is an entirely new permutation problem related to the LOP.

The Linear Ordering Problem has the most attractive computational properties of the three, all of which are NP-hard optimization problems. The dissertation expends significant effort developing neighborhoods for local search on the LOP, and uses grammars and other tools from natural language parsing to introduce several new results, including a state-of-the-art local search procedure.

Combinatorial optimization problems such as the TSP or the LOP are usually given the function over permutations. In the machine translation setting, the weights are not given, only words. The dissertation applies machine learning techniques to derive a LOP from each given sentence using a corpus of sentences and their translations for training. It proposes a set of features for such learning and argues their propriety for translation based on an analogy to dependency parsing. It adapts a number of parameter optimization procedures to the novel setting of the LOP.

The signature result of the thesis is the application of a machine learned set of linear ordering problems to machine translation. Using reorderings found by search as a preprocessing step significantly improves translation of German to English, and significantly more than the lexicalized reordering model that is the default of the translation system.

In addition, the dissertation provides a number of new theoretical results, and lays out an ambitious program for potential future research. Both the reordering model and the optimization techniques have broad applicability, and the availability of machine learning makes even new problems without obvious structure approachable.

Note: Dr. Tromble's dissertation advisor was Jason Eisner.

Cross-document coreference resolution: A key technology for learning by
reading.
James Mayfield, David Alexander, Bonnie Dorr, Jason Eisner, Tamer
Elsayed, Tim Finin, Clay Fink, Marjorie Freedman, Nikesh Garera, Paul
McNamee, Saif Mohammad, Douglas Oard, Christine Piatko, Asad Sayeed, Zareen
Syed, Ralph Weischedel, Tan Xu, and David Yarowsky (2009).
In AAAI 2009 Spring Symposium on Learning by Reading and
Learning to Read. [ paper | scholar | bib ]

The Dyna programming language is intended to provide an declarative abstraction layer for building systems in ML and AI. It extends logic programming with weights in a way that resembles functional programming. The weights are often probabilities. Yet Dyna does not enforce a probabilistic semantics, since many AI and ML methods work with inexact probabilities (e.g., bounds) and other numeric and non-numeric quantities. Instead Dyna aims to provide a flexible abstraction layer that is “one level lower,” and whose efficient implementation will be able to serve as infrastructure for building a variety of toolkits, languages, and specific systems.

We review two novel methods for text categorization, based on a new framework that utilizes richer annotations that we call annotator rationales. A human annotator provides hints to a machine learner by highlighting contextual “rationales” in support of each of his or her annotations. We have collected such rationales, in the form of substrings, for an existing document sentiment classification dataset [1]. We have developed two methods, one discriminative [2] and one generative [3], that use these rationales during training to obtain significant accuracy improvements over two strong baselines. Our generative model in particular could be adapted to help learn other kinds of probabilistic classifiers for quite different tasks. Based on a small study of annotation speed, we posit that for some tasks, providing rationales can be a more fruitful use of an annotator's time than annotating more examples.

Note: This paper is just a synthesis of Zaidan et al. (2007) and Zaidan & Eisner (2008). (I hate doing multiple papers on the same work, but I wanted the ML community outside of NLP to see these results, so I asked the workshop organizers for permission to republish them here.)

We formulate dependency parsing as a graphical model with the novel ingredient of global constraints. We show how to apply loopy belief propagation (BP), a simple and effective tool for approximate learning and inference. As a parsing algorithm, BP is both asymptotically and empirically efficient. Even with second-order features or latent variables, which would make exact parsing considerably slower or NP-hard, BP needs only O(n^{3}) time with a small constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features would increase the runtime additively rather than multiplicatively.

Note: Additional details are given in the dissertation of Smith (2010).

A human annotator can provide hints to a machine learner by highlighting contextual “rationales” for each of his or her annotations (Zaidan et al., 2007). How can one exploit this side information to better learn the desired parameters θ? We present a generative model of how a given annotator, knowing the true θ, stochastically chooses rationales. Thus, observing the rationales helps us infer the true θ. We collect substring rationales for a sentiment classification task (Pang and Lee, 2004) and use them to obtain significant accuracy improvements for each annotator. Our new generative approach exploits the rationales more effectively than our previous “masking SVM” approach. It is also more principled, and could be adapted to help learn other kinds of probabilistic classifiers for quite different tasks.

String-to-string transduction is a central problem in computational linguistics and natural language processing. It occurs in tasks as diverse as name transliteration, spelling correction, pronunciation modeling and inflectional morphology. We present a conditional log-linear model for string-to-string transduction, which employs overlapping features over latent alignment sequences, and which learns latent classes and latent string pair regions from incomplete training data. We evaluate our approach on morphological tasks and demonstrate that latent variables can dramatically improve results, even when trained on small data sets. On the task of generating morphological forms, we outperform a baseline method reducing the error rate by up to 48%. On a lemmatization task, we reduce the error rates in Wicentowski (2002) by 38-92%.

Note: Additional details are given in the dissertation of Dreyer (2011).

Just as programming is the traditional introduction to computer science, writing grammars by hand is an excellent introduction to many topics in computational linguistics. We present and justify a well-tested introductory activity in which teams of mixed background compete to write probabilistic context-free grammars of English. The exercise brings together symbolic, probabilistic, algorithmic, and experimental issues in a way that is accessible to novices and enjoyable.

One may need to build a statistical parser for a new language, using only a very small labeled treebank together with raw text. We argue that bootstrapping a parser is most promising when the model uses a rich set of redundant features, as in recent models for scoring dependency parses (McDonald et al., 2005). Drawing on Abney's (2004) analysis of the Yarowsky algorithm, we perform bootstrapping by entropy regularization: we maximize a linear combination of conditional likelihood on labeled data and confidence (negative Rényi entropy) on unlabeled data. In initial experiments, this surpassed EM for training a simple feature-poor generative model, and also improved the performance of a feature-rich, conditionally estimated model where EM could not easily have been applied. For our models and training sets, more peaked measures of confidence, measured by Rényi entropy, outperformed smoother ones. We discuss how our feature set could be extended with cross-lingual or cross-domain features, to incorporate knowledge from parallel or comparable corpora during bootstrapping.

We propose a new framework for supervised machine learning. Our goal is to learn from smaller amounts of supervised training data, by collecting a richer kind of training data: annotations with “rationales.” When annotating an example, the human teacher will also highlight evidence supporting this annotation—thereby teaching the machine learner why the example belongs to the category. We provide some rationale-annotated data and present a learning method that exploits the rationales during training to boost performance significantly on a sample task, namely sentiment classification of movie reviews. We hypothesize that in some situations, providing rationales is a more fruitful use of an annotator's time than annotating more examples.

In unsupervised learning, where no training takes place, one simply hopes that the unsupervised learner will work well on any unlabeled test collection. However, when the variability in the data is large, such hope may be unrealistic; a tuning of the unsupervised algorithm may then be necessary in order to perform well on new test collections. In this paper, we show how to perform such a tuning in the context of unsupervised document clustering, by (i) introducing a degree of freedom, α, into two leading information-theoretic clustering algorithms, through the use of generalized mutual information quantities; and (ii) selecting the value of α based on clusterings of similar, but supervised document collections (cross-instance tuning). One option is to perform a tuning that directly minimizes the error on the supervised data sets; another option is to use “strapping” (Eisner and Karakos, 2005), which builds a classifier that learns to distinguish good from bad clusterings, and then selects the α with the best predicted clustering on the test set. Experiments from the “20 Newsgroups” corpus show that, although both techniques improve the performance of the baseline algorithms, “strapping” is clearly a better choice for cross-instance tuning.

Iterative denoising trees were used by Karakos et al. [1] for unsupervised hierarchical clustering. The tree construction involves projecting the data onto low-dimensional spaces, as a means of smoothing their empirical distributions, as well as splitting each node based on an information-theoretic maximization objective. In this paper, we improve upon the work of [1] in two ways: (i) the amount of computation spent searching for a good projection at each node now adapts to the intrinsic dimensionality of the data observed at that node; (ii) the objective at each node is to find a split which maximizes a generalized form of mutual information, the Jensen-Renyi divergence; this is followed by an iterative Naive Bayes classification. The single parameter alpha of the Jensen-Renyi divergence is chosen based on the “strapping” methodology [2], which learns a meta-classifer on a related task. Compared with the sequential Information Bottleneck method [3], our procedure produces state-of-the-art results on an unsupervised categorization task of documents from the “20 Newsgroups” dataset.

Dynamic programming algorithms in statistical natural language processing can be easily described as weighted logic programs. We give a notation and semantics for such programs. We then describe several source-to-source transformations that affect a program's efficiency, primarily by rearranging computations for better reuse or by changing the search strategy. We present practical examples of using these transformations, mainly to optimize context-free parsing algorithms, and we formalize them for use with new weighted logic programs.

Specifically, we define weighted versions of the folding and unfolding transformations, whose unweighted versions are used in the logic programming and deductive database communities. We then present a novel transformation called speculation—a powerful generalization of folding that is motivated by gap-passing in categorial grammar. Finally, we give a simpler and more powerful formulation of the magic templates transformation.

Note: The speculation transform is really a form of lifted inference (a term that gained currency later).

This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood, and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias.

Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and supervised model selection, where the most accurate model on the development set (now with annotations) is selected. The latter is far superior, but surprisingly few annotated examples are required. The experimentation focuses on a single dependency grammar induction task, in depth. The aim is to give strong support for the usefulness of the new techniques in one scenario. It must be noted, however, that the task (as defined here and in prior work) is somewhat artificial, and improved performance on this particular task is not a direct contribution to the greater field of natural language processing. The real problem the task seeks to simulate—the induction of syntactic structure in natural language text—is certainly of interest to the community, but this thesis does not directly approach the problem of exploiting induced syntax in applications. We also do not attempt any realistic simulation of human language learning, as our newspaper text data do not resemble the data encountered by a child during language acquisition. Further, our iterative learning algorithms assume a fixed batch of data that can be repeatedly accessed, not a long stream of data observed over time in tandem with acquisition. (Of course, the cognitive criticisms apply to virtually all existing learning methods in natural language processing, not just the new ones presented here.) Nonetheless, the novel estimation methods presented are, we will argue, better suited to adaptation for real engineering tasks than the maximum likelihood baseline.

Our new methods are shown to achieve significant improvements over maximum likelihood estimation and maximum a posteriori estimation, using the EM algorithm, for a state-of-the-art probabilistic model used in dependency grammar induction (Klein and Manning, 2004). The task is to induce dependency trees from part-of-speech tag sequences; we follow standard practice and train and test on sequences of ten tags or fewer. Our results are the best published to date for six languages, with supervised model selection: English (improvement from 41.6% directed attachment accuracy to 66.7%, a 43% relative error rate reduction), German (54.4 -> 71.8%, a 38% error reduction), Bulgarian (45.6% -> 58.3%, a 23% error reduction), Mandarin (50.0% -> 58.0%, a 16% error reduction), Turkish (48.0% -> 62.4%, a 28% error reduction, but only 2% error reduction from a left-branching baseline, which gives 61.8%), and Portuguese (42.5% -> 71.8%, a 51% error reduction). We also demonstrate the success of contrastive estimation at learning to disambiguate part-of-speech tags (from unannotated English text): 78.0% to 88.7% tagging accuracy on a known-dictionary task (a 49% relative error rate reduction), and 66.5% to 78.4% on a more difficult task with less dictionary knowledge (a 35% error rate reduction).

The experiments presented in this thesis give one of the most thorough explorations to date of unsupervised parameter estimation for models of discrete structures. Two sides of the problem are considered in depth: the choice of objective function to be optimized during training, and the method of optimizing it. We find that both are important in unsupervised learning. Our best results on most of the six languages involve both improved objectives and improved search.

The methods presented in this thesis were originally presented in Smith and Eisner (2004, 2005a,b, 2006). The thesis gives a more thorough exposition, relating the methods to other work, presents more experimental results and error analysis, and directly compares the methods to each other.

Note: Dr. Smith's dissertation advisor was Jason Eisner.

We describe Dynasty, a system for browsing large (possibly infinite) directed graphs and hypergraphs. Only a small subgraph is visible at any given time. We sketch how we lay out the visible subgraph, and how we update the layout smoothly and dynamically in an asynchronous environment. We also sketch our user interface for browsing and annotating such graphs—in particular, how we try to make keyboard navigation usable.

While keystream reuse in stream ciphers and one-time pads has been a well known problem for several decades, the risk to real systems has been underappreciated. Previous techniques have relied on being able to accurately guess words and phrases that appear in one of the plaintext messages, making it far easier to claim that “an attacker would never be able to do that.” In this paper, we show how an adversary can automatically recover messages encrypted under the same keystream if only the type of each message is known (e.g. an HTML page in English). Our method, which is related to HMMs, recovers the most probable plaintext of this type by using a statistical language model and a dynamic programming algorithm. It produces up to 99% accuracy on realistic data and can process ciphertexts at 200ms per byte on a $2,000 PC. To further demonstrate the practical effectiveness of the method, we show that our tool can recover documents encrypted by Microsoft Word 2002.

We study unsupervised methods for learning refinements of the nonterminals in a treebank. Following Matsuzaki et al. (2005) and Prescher (2005), we may for example split NP without supervision into NP[0] and NP[1], which behave differently. We first propose to learn a PCFG that adds such features to nonterminals in such a way that they respect patterns of linguistic feature passing: each node's nonterminal features are either identical to, or independent of, those of its parent. This linguistic constraint reduces runtime and the number of parameters to be learned. However, it did not yield improvements when training on the Penn Treebank. An orthogonal strategy was more successful: to improve the performance of the EM learner by treebank preprocessing and by annealing methods that split nonterminals selectively. Using these methods, we can maintain high parsing accuracy while dramatically reducing the model size.

When training the parameters for a natural language system, one would prefer to minimize 1-best loss (error) on an evaluation set. Since the error surface for many natural language problems is piecewise constant and riddled with local minima, many systems instead optimize log-likelihood, which is conveniently differentiable and convex. We propose training instead to minimize the expected loss, or risk. We define this expectation using a probability distribution over hypotheses that we gradually sharpen (anneal) to focus on the 1-best hypothesis. Besides the linear loss functions used in previous work, we also describe techniques for optimizing nonlinear functions such as precision or the BLEU metric. We present experiments training log-linear combinations of models for dependency parsing and for machine translation. In machine translation, annealed minimum risk training achieves significant improvements in BLEU over standard minimum error training. We also show improvements in labeled dependency parsing.

We introduce a novel decoding procedure for statistical machine translation and other ordering tasks based on a family of Very Large-Scale Neighborhoods, some of which have previously been applied to other NP-hard permutation problems. We significantly generalize these problems by simultaneously considering three distinct sets of ordering costs. We discuss how these costs might apply to MT, and some possibilities for training them. We show how to search and sample from exponentially large neighborhoods using efficient dynamic programming algorithms that resemble statistical parsing. We also incorporate techniques from statistical parsing to improve the runtime of our search. Finally, we report results of preliminary experiments indicating that the approach holds promise.

Many syntactic models in machine translation are channels that transform one tree into another, or synchronous grammars that generate trees in parallel. We present a new model of the translation process: quasi-synchronous grammar (QG). Given a source-language parse tree T_{1}, a QG defines a monolingual grammar that generates translations of T_{1}. The trees T_{2} allowed by this monolingual grammar are inspired by pieces of substructure in T_{1} and aligned to T_{1} at those points. We describe experiments learning quasi-synchronous context-free grammars from bitext. As with other monolingual language models, we evaluate the crossentropy of QGs on unseen text and show that a better fit to bilingual data is achieved by allowing greater syntactic divergence. When evaluated on a word alignment task, QG matches standard baselines.

Note: Nominated for 5-year retrospective best paper award.

We describe finite-state constraint relaxation, a method for applying global constraints, expressed as automata, to sequence model decoding. We present algorithms for both hard constraints and binary soft constraints. On the CoNLL-2004 semantic role labeling task, we report a speedup of at least 16x over a previous method that used integer linear programming.

To model a collection of documents, suppose that each document was generated by a different hidden Markov model or probabilistic finite-state automaton (PFSA). Further suppose that all these PFSAs are similar because they are drawn from a single (but unknown) prior distribution over PFSAs. We wish to infer the prior, obtain smoothed estimates of the individual PFSAs, and reconstruct the hidden paths by which the unknown PFSAs generated their documents.

As an initial application, particularly hard for our model because of its sparse data, we derive an FSA topology from WordNet. For each verb, we construct the “document” of all nouns that have appeared as its object. Our method then learns a better estimate of p(object | verb), as well as which paths in WordNet, and hence which senses of ambiguous objects, tend to be favored. Our method improves 14.6% over Witten-Bell smoothing on the conditional perplexity of objects given the verb, and 27.5% over random on detecting the most common senses of nouns in the SemCor corpus.

In lexicalized phrase-structure or dependency parses, a word's modifiers tend to fall near it in the string. We show that a crude way to use dependency length as a parsing feature can substantially improve parsing speed and accuracy in English and Chinese, with more mixed results on German. We then show similar improvements by imposing hard bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. This simple “vine grammar” formalism has only finite-state power, but a context-free parameterization with some extra parameters for stringing fragments together. We exhibit a linear-time chart parsing algorithm with a low grammar constant.

Note: Consider instead the 2010 book chapter that is an expanded version of this paper.

“Bootstrapping” methods for learning require a small amount of supervision to seed the learning process. We show that it is sometimes possible to eliminate this last bit of supervision, by trying many candidate seeds and selecting the one with the most plausible outcome. We discuss such “strapping” methods in general, and exhibit a particular method for strapping word-sense classifiers for ambiguous words. Our experiments on the Canadian Hansards show that our unsupervised technique is significantly more effective than picking seeds by hand (Yarowsky, 1995), which in turn is known to rival supervised methods.

Weighted deduction with aggregation is a powerful theoretical formalism that encompasses many NLP algorithms. This paper proposes a declarative specification language, Dyna; gives general agenda-based algorithms for computing weights and gradients; briefly discusses Dyna-to-Dyna program transformations; and shows that a first implementation of a Dyna-to-C++ compiler produces code that is efficient enough for real NLP research, though still several times slower than hand-crafted code.

We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization of the function maximized by the Expectation-Maximization algorithm [Dempster et al., 1977]. CE is a natural fit for log-linear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, log-linear dependency grammar models trained using CE can drastically outperform EM-trained generative models on the task of matching human linguistic annotations (the MatchLinguist task). The selection of an implicit negative evidence class—a “neighborhood”—appropriate to a given task has strong implications, but a good neighborhood can target the objective of grammar induction to a specific application.

Note: This version of the paper has minor corrections. Additional details are given in the dissertation of Smith (2006).

Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and namedentity extraction (McCallum and Li, 2003). CRFs are log-linear, allowing the incorporation of arbitrary features into the model. To train on unlabeled data, we require unsupervised estimation methods for log-linear models; few exist. We describe a novel approach, contrastive estimation. We show that the new technique can be intuitively understood as exploiting implicit negative evidence and is computationally efficient. Applied to a sequence labeling problem—POS tagging given a tagging dictionary and unlabeled text—contrastive estimation outperforms EM (with the same feature set), is more robust to degradations of the dictionary, and can largely recover by modeling additional features.

Note: Nominated for best paper award. Additional details are given in the dissertation of Smith (2006). In particular, here is the mapping from 45 fine-grained to 17 coarse-grained tags.

Weighted finite-state machines with n tapes describe n-ary rational string relations. The join n-ary relation is very important in applications. It is shown how to compute it via a more simple operation, the auto-intersection. Join and auto-intersection generally do not preserve rationality. We define a class of triples (A,i,j) such that the auto-intersection of the machine A on tapes i and j can be computed by a delay-based algorithm. We point out how to extend this class and hope that it is sufficient for many practical applications.

Note:Dreyer & Eisner (2009) and Paul & Eisner (2012) make some progress on the uncomputable problem of joining or auto-intersecting FSMs. The first paper gives an approximate algorithm for the real or tropical semiring; the second gives an algorithm for the tropical semiring that is exact if it terminates.

Integrated Sensing and Processing Decision Trees (ISPDTs) were introduced in [1] as a tool for supervised classification of high-dimensional data. In this paper, we consider the problem of unsupervised classification, through a recursive construction of ISPDTs, where at each internal node the data (i) are split into clusters, and (ii) are transformed independently of other clusters, guided by some optimization objective. We show that the maximization of information-theoretic quantities such as mutual information and alpha-divergences is theoretically justified for growing ISPDTs, assuming that each data point is generated by a finite-memory random process given the class label. Furthermore, we present heuristics that perform the maximization in a greedy manner, and we demonstrate their effectiveness with empirical results from multi-spectral imaging.

A finite-state machine with n tapes describes a rational (or regular) relation on n strings. It is more expressive than a relational database table with n columns, which can only describe a finite relation.

We describe some basic operations on n-ary rational relations and propose notation for them. (For generality we give the semiring-weighted case in which each tuple has a weight.) Unfortunately, the join operation is problematic: if two rational relations are joined on more than one tape, it can lead to non-rational relations with undecidable properties. We recast join in terms of “auto-intersection” and illustrate some cases in which difficulties arise. We close with the hope that partial or restricted algorithms may be found that are still powerful enough to have practical use.

Note:Dreyer & Eisner (2009) and Paul & Eisner (2012) make some progress on the uncomputable problem of joining or auto-intersecting FSMs. The first paper gives an approximate algorithm for the real or tropical semiring; the second gives an algorithm for the tropical semiring that is exact if it terminates.

We present the first version of a new declarative programming language. Dyna has many uses but was designed especially for rapid development of new statistical NLP systems. A Dyna program is a small set of equations, resembling Prolog inference rules, that specify the abstract structure of a dynamic programming algorithm. It compiles into efficient, portable, C++ classes that can be easily invoked from a larger application. By default, these classes run a generalization of agenda-based parsing, prioritizing the partial parses by some figure of merit. The classes can also perform an exact backward (outside) pass in the service of parameter training. The compiler already knows several implementation tricks, algorithmic transforms, and numerical optimization techniques. It will acquire more over time: we intend for it to generalize and encapsulate best practices, and serve as a testbed for new practices. Dyna is now being used for parsing, machine translation, morphological analysis, grammar induction, and finite-state modeling.

Note: Consider the longer 2005 version of this paper instead.

Exploiting unannotated natural language data is hard largely because unsupervised parameter estimation is hard. We describe deterministic annealing (Rose et al., 1990) as an appealing alternative to the Expectation-Maximization algorithm (Dempster et al., 1977). Seeking to avoid search error, DA begins by globally maximizing an easy concave function and maintains a local maximum as it gradually morphs the function into the desired non-concave likelihood function. Applying DA to parsing and tagging models is shown to be straightforward; significant improvements over EM are shown on a part-of-speech tagging task. We describe a variant, skewed DA, which can incorporate a good initializer when it is available, and show significant improvements over EM on a grammar induction task.

Note: Additional details are given in the dissertation of Smith (2006).

Language modeling, a technology found in many computerized speech recognition systems, can also be used in a text editor to implement an automated phrase completion feature that significantly reduces the number of keystrokes required to generate a radiology report, therefore increasing typing speed.

Radiology reports have especially low entropy, which allows prediction of multi-word phrases. Our system therefore chooses an optimal phrase length for each prediction, using Bellman-style dynamic programming to minimize the expected cost of typing the rest of the document. This computation considers what the user is likely to type in the future, and how many keystrokes it will take, considering the future effect of phrase completion as well.

Often one may wish to learn a tree-to-tree mapping, training it on unaligned pairs of trees, or on a mixture of trees and strings. Unlike previous statistical formalisms (limited to isomorphic trees), synchronous tree substitution grammar allows local distortion of the tree topology. We reformulate it to permit dependency trees, and sketch EM/Viterbi algorithms for alignment, training, and decoding.

Note: At a reviewer's request, the paper describes TSG more formally than in the previous literature, which might be helpful for some readers and implementers.

Previous work on minimizing weighted finite-state automata (including transducers) is limited to particular types of weights. We present efficient new minimization algorithms that apply much more generally, while being simpler and about as fast.

We also point out theoretical limits on minimization algorithms. We characterize the kind of “well-behaved” weight semirings where our methods work. Outside these semirings, minimization is not well-defined (in the sense of producing a unique minimal automaton), and even finding the minimum number of states is in general NP-complete and inapproximable.

Weighted finite-state transducers suffer from the lack of a training algorithm. Training is even harder for transducers that have been assembled via finite-state operations such as composition, minimization, union, concatenation, and closure, as this yields tricky parameter tying. We formulate a “parameterized FST” paradigm and give training algorithms for it, including a general bookkeeping trick (“expectation semirings”) that cleanly and efficiently computes expectations and gradients.

Note:Expectation semirings are included in the excellent OpenFST library. I believe that Zhifei Li, Ariya Rastrow, Markus Dreyer, and Roy Tromble have all written implementations that work with OpenFST or other packages. Some of these support the higher-order expectation semirings of Li & Eisner (2009).

This paper ties up some loose ends in finite-state Optimality Theory. First, it discusses how to perform comprehension under Optimality Theory grammars consisting of finite-state constraints. Comprehension has not been much studied in OT; we show that unlike production, it does not always yield a regular set, making finite-state methods inapplicable. However, after giving a suitably flexible presentation of OT, we show carefully how to treat comprehension under recent variants of OT in which grammars can be compiled into finite-state transducers. We then unify these variants, showing that compilation is possible if all components of the grammar are regular relations, including the harmony ordering on scored candidates. A side benefit of our construction is a far simpler implementation of directional OT (Eisner, 2000).

Note: A related paper was published independently by Gerhard Jäger at the same time.

This paper offers a detailed lesson plan on the forward-backward algorithm. The lesson is taught from a live, commented spreadsheet that implements the algorithm and graphs its behavior on a whimsical toy example. By experimenting with different inputs, one can help students develop intuitions about HMMs in particular and Expectation Maximization in general. The spreadsheet and a coordinated follow-up assignment are available.

This paper proposes a novel class of PCFG parameterizations that support linguistically reasonable priors over PCFGs. To estimate the parameters is to discover a notion of relatedness among context-free rules such that related rules tend to have related probabilities. The prior favors grammars in which the relationships are simple to describe and have few major exceptions. A basic version that bases relatedness on weighted edit distance yields superior smoothing of grammars learned from the Penn Treebank (20% reduction of rule perplexity over the best previous method).

Note: Nominated for best paper award. See also the linguistic perspective in Eisner (2002). Additional details are given in my dissertation.

In the Bayesian framework, a language learner should seek a grammar that explains observed data well and is also a priori probable. This paper proposes such a measure of prior probability. Indeed it develops a full statistical framework for lexicalized syntax. The learner's job is to discover the system of probabilistic transformations (often called lexical redundancy rules) that underlies the patterns of regular and irregular syntactic constructions listed in the lexicon. Specifically, the learner discovers what transformations apply in the language, how often they apply, and in what contexts. It considers simpler systems of transformations to be more probable a priori. Experiments show that the learned transformations are more effective than previous statistical models at predicting the probabilities of lexical entries, especially those for which the learner had no direct evidence.

Note: See also the engineering perspective in Eisner (2002). Additional details are given in my dissertation.

This brief introduction, from the editor of the special section, reviews why and how statistical and linguistic approaches to language can help each other. It also asks how statistical modeling fits into the broader program of cognitive science.

This paper gives the first EM algorithm for general probabilistic finite-state transducers (with epsilon). Furthermore, the approach is powerful enough to fit machines' parameters even after the machines are combined by operations of the finite-state calculus, such as composition and minimization. This allows an expert to build a parameterized transducer in any way that is appropriate to the domain, and then fit the parameters automatically from data. Many standard algorithms are special cases, and there are many further applications. Yet the algorithm remains surprisingly simple because all the difficult work is subcontracted to existing algorithms for semiring-weighted automata. The trick is to use a novel semiring.

Note: Extended abstract. Consider the longer 2002 version instead.

Probabilistic parsing requires a lexicon that specifies each word's syntactic preferences in terms of probabilities. To estimate these probabilities for words that were poorly observed during training, this thesis assumes the existence of arbitrarily powerful transformations (also known to linguists as lexical redundancy rules or metarules) that can add, delete, retype or reorder the argument and adjunct positions specified by a lexical entry.

In a given language, some transformations apply frequently and others rarely. We describe how to estimate the rates of the transformations from a sample of lexical entries. More deeply, we learn which properties of a transformation increase or decrease its rate in the language. As a result, we can smooth the probabilities of lexical entries. Given enough direct evidence about a lexical entry's probability, our Bayesian approach trusts the evidence; but when less evidence or no evidence is available, it relies more on the transformations' rates to guess how often the entry will be derived from related entries.

Abstractly, the proposed “transformation models” are probability distributions that arise from graph random walks with a log-linear parameterization. A domain expert constructs the parameterized graph, and a vertex is likely according to whether random walks tend to halt at it. Transformation models are suited to any domain where “related” events (as defined by the graph) may have positively covarying probabilities. Such models admit a natural prior that favors simple regular relationships over stipulative exceptions. The model parameters can be locally optimized by gradient-based methods or by Expectation-Maximization. Exact algorithms (matrix inversion) and approximate ones (relaxation) are provided, with optimizations. Variations on the idea are also discussed.

We compare the new technique empirically to previous techniques from the probabilistic parsing literature, using comparable features, and obtain a 20% perplexity reduction (similar to doubling the amount of training data). Some of this reduction is shown to stem from the transformation model's ability to match observed probabilities, and some from its ability to generalize. Model averaging yields a final 24% perplexity reduction.

We consider the problem of ranking a set of OT constraints in a manner consistent with data. (1) We speed up Tesar and Smolensky's RCD algorithm to be linear on the number of constraints. This finds a ranking so each attested form x_{i} beats or ties a particular competitor y_{i}. (2) We also generalize RCD so each x_{i} beats or ties all possible competitors.

Alas, neither the more realistic version of ranking in (2), nor even generation, has any polynomial algorithm unless P=NP! That is, one cannot improve qualitatively upon brute force: (3) Merely checking that a single (given) ranking is consistent with given forms is coNP-complete if the surface forms are fully observed and Δ_{2}^{p}-complete if not. Indeed, OT generation is OptP-complete. (4) As for ranking, determining whether any consistent ranking exists is coNP-hard (but in Δ_{2}^{p}) if the forms are fully observed, and Σ_{2}^{p}-complete if not.

Finally, we show (5) generation and ranking are easier in derivational theories: in P, and NP-complete.

Weighted finite-state constraints that can count unboundedly many violations make Optimality Theory more powerful than finite-state transduction (Frank and Satta, 1998). This result is empirically and computationally awkward. We propose replacing these unbounded constraints, as well as non-finite-state Generalized Alignment constraints, with a new class of finite-state directional constraints. We give linguistic applications, results on generative power, and algorithms to compile grammars into transducers.

Note: This paper makes the linguistic case for directional OT, and gives an interesting transducer construction. However, Eisner (2002) shows that the mathematical result can actually be obtained as a special case of a more general theorem that is based on a simpler algebraic construction.

This book review also sketches why OT is interesting to computational linguists, and how it relates to other approaches for combining non-orthogonal surface features, such as maximum-entropy modeling.

This paper introduces weighted bilexical grammars, a formalism in which individual lexical items, such as verbs and their arguments, can have idiosyncratic selectional influences on each other. Such “bilexicalism” has been a theme of much current work in parsing. The new formalism can be used to describe bilexical approaches to both dependency and phrase-structure grammars, and a slight modification yields link grammars. Its scoring approach is compatible with a wide variety of probability models.

The obvious parsing algorithm for bilexical grammars (used by most authors) takes time O(n^{5}). A more efficient O(n^{3}) method is exhibited. The new algorithm has been implemented and used in a large parsing experiment (Eisner 1996). We also give a useful extension to the case where the parser must undo a stochastic transduction that has altered the input.

Note: This book chapter is an improved and extended version of Eisner (1997).

This paper points out some computational inefficiencies of standard TAG parsing algorithms when applied to LTAGs. We propose a novel algorithm with an asymptotic improvement, from O(n^{8}g^{2}t) to O(n^{6} max(n,g) gt), where n is the input length and g,t are grammar constants that are independent of vocabulary size.

Several recent stochastic parsers use bilexical grammars, where each word type idiosyncratically prefers particular complements with particular head words. We present O(n^{4}) parsing algorithms for two bilexical formalisms (see title), improving the previous upper bounds of O(n^{5}). Also, for a common special case that was known to allow O(n^{3}) parsing (Eisner, 1997), we present an O(n^{3}) algorithm with an improved grammar constant.

Note: Note that the slides include experimental speed comparisons that were not in the paper.

A universal theory of human phonology should be clearly specified and falsifiable. To treat Optimality Theory (OT) as a real proposal, one must put some cards on the table: What kinds of constraints may an OT grammar state? And how can anyone tell what data this grammar predicts, without constructing infinite tableaux?

In this talk I'll motivate a restrictive formalization of OT that allows just two types of simple, local constraint. Gen freely proposes gestures and prosodic constituents; the constraints try to force these to coincide or not coincide temporally. An efficient algorithm exists to find the optimal candidate.

I will argue that despite its simplicity, primitive OT is expressive enough to describe and unify most of the work in OT phonology. However, it is provably more constrained: because it is unable to mimic deeply non-local mechanisms like Generalized Alignment, it forces a new and arguably better account of metrical stress typology.

Finally, I will sketch a more radical extension, directional evaluation, which changes how a constraint ranks candidates. This change brings back some of the descriptive convenience of Generalized Alignment, but it also constrains OT grammars to describe only regular relations, which is linguistically and computationally desirable.

Hayes (1995) gives a typology of the world's metrical stress systems, which is marked by several striking asymmetries (parametric gaps). Most work on metrical stress within Optimality Theory (OT) has adopted this typology without explaining the gaps. Moreover, the OT versions use uncomfortably non-local constraints (Align, FootForm, FtBin).

This paper presents a rather different and in some ways more explanatory typology of stress, couched in the framework of primitive Optimality Theory (OTP), which allows only primitive, radically local constraints. For example, Generalized Alignment is not allowed. The paper presents a single, coherent system of rerankable constraints that yields the basic facts about iambic and trochaic foot form, iambic lengthening, quantity sensitivity, unbounded feet, simple word-initial and word-final stress, directionality of footing, syllable (and foot) extrametricality, degenerate feet, and word-level stress.

The metrical part of the account rests on the following intuitions: <UL> <LI> (a) iambs are special because syllable structure allows them to lengthen their strong ends; <LI> (b) directionality of footing is really the result of local lapse avoidance; <LI> (c) any lapses are forced by a (localist) generalization of right extrametricality; <LI> (d) although degenerate feet are absolutely banned, primary stress does not require a foot in all languages. </UL> An interesting prediction of (b) and (c) is that left-to-right trochees should be incompatible with extrametricality. This prediction is robustly confirmed in Hayes.

This paper introduces weighted bilexical grammars, a formalism in which individual lexical items, such as verbs and their arguments, can have idiosyncratic selectional influences on each other. Such “bilexicalism” has been a theme of much current work in parsing. The new formalism can be used to describe bilexical approaches to both dependency and phrase-structure grammars, and a slight modification yields link grammars. Its scoring approach is compatible with a wide variety of probability models.

The obvious parsing algorithm for bilexical grammars (used by most authors) takes time O(n^{5}). A more efficient O(n^{3}) method is exhibited. The new algorithm has been implemented and used in a large parsing experiment (Eisner 1996). We also give a useful extension to the case where the parser must undo a stochastic transduction that has altered the input.

Note: Consider instead the 2000 book chapter that is an expanded version of this paper.

This paper introduces computational linguists to primitive Optimality Theory (OTP), a clean and linguistically motivated formalization of OT. OTP specifies the class of autosegmental representations, the universal generator Gen, and the two simple families of permissible constraints. It is therefore possible to study its computational generation, comprehension, and learning properties.

Some results on generation are presented. Unlike less restricted theories using Generalized Alignment, OTP grammars can derive optimal surface forms with finite-state methods adapted from Ellison (1994). Unfortunately these methods take time exponential on the size of the grammar. Indeed the generation problem is shown NP-complete in this sense. However, techniques are discussed for making Ellison's approach fast and practical in the typical case, including a simple trick that alone provides a 100-fold speedup on a grammar fragment of moderate size. One avenue for future improvements is a new finite-state notion, “factored automata,” where regular languages are represented compactly via formal intersections of FSAs.

The classic “easy” optimization problem is to find the MST of a connected, undirected graph. Good polynomial-time algorithms have been known since 1930. Over the last 10 years, however, the standard O(m logn) results of Kruskal and Prim have been improved to linear or near-linear time. The new methods use several tricks of general interest in order to reduce the number of edge weight comparisons and the amount of other work. This tutorial reviews those methods, building up strategies step by step so as to expose the insights behind the algorithms. Implementation details are clarified, and some generalizations are given.

Specifically, the paper attempts to shed light on the classical algorithms of Kruskal, Prim, and Boruvka; the improved approach of Gabow, Galil, and Spencer, which takes time only O(m log(log^{*}n - log^{*}m/n)); and the randomized O(m) algorithm of Karger, Klein, and Tarjan, which relies on an O(m) MST verification algorithm by King. It also considers Frederickson's method for maintaining an MST in time O(sqrt((m)) per change to the graph. An appendix explains Fibonacci heaps.)

This technical report is an appendix to Eisner (1996): it gives superior experimental results that were reported only in the talk version of that paper. Eisner (1996) trained three probability models on a small set of about 4,000 conjunction-free, dependency-grammar parses derived from the Wall Street Journal section of the Penn Treebank, and then evaluated the models on a held-out test set, using a novel O(n^{3}) parsing algorithm.

The present paper describes some details of the experiments and repeats them with a larger training set of 25,000 sentences. As reported at the talk, the more extensive training yields greatly improved performance. Nearly half the sentences are parsed with no misattachments; two-thirds are parsed with at most one misattachment.

Of the models described in the original written paper, the best score is still obtained with the generative (top-down) “model C.” However, slightly better models are also explored, in particular, two variants on the comprehension (bottom-up) “model B.” The better of these has an attachment accuracy of 90%, and (unlike model C) tags words more accurately than the comparable trigram tagger. Differences are statistically significant.

If tags are roughly known in advance, search error is all but eliminated and the new model attains an attachment accuracy of 93%. We find that the parser of Collins (1996), when combined with a highly-trained tagger, also achieves 93% when trained and tested on the same sentences. Similarities and differences are discussed.

After presenting a novel O(n^{3}) parsing algorithm for dependency grammar, we develop three contrasting ways to stochasticize it. We propose (a) a lexical affinity model where words struggle to modify each other, (b) a sense tagging model where words fluctuate randomly in their selectional preferences, and (c) a generative model where the speaker fleshes out each word's syntactic and conceptual structure without regard to the implications for the hearer. We also give preliminary empirical results from evaluating the three models' parsing performance on annotated Wall Street Journal training text (derived from the Penn Treebank). In these results, the generative (i.e., top-down) model performs significantly better than the others, and does about equally well at assigning part-of-speech tags.

Under categorial grammars that have powerful rules like composition, a simple n-word sentence can have exponentially many parses that are semantically equivalent. Generating all parses is inefficient and obscures whatever true semantic ambiguities are in the input. This paper addresses the problem for a fairly general form of Combinatory Categorial Grammar, by means of an efficient, correct, and easy to implement normal-form parsing technique. The parser is proved to find exactly one parse in each semantic equivalence class of allowable parses; that is, spurious ambiguity (as carefully defined) is shown to be both safely and completely eliminated.

Note: A competitive system for coreference resolution, hacked together in our spare time.

English any is often treated as two unrelated or semi-related lexemes: a negative-polarity item, NPI any, and a universal quantifier, free-choice (FC) any. The latter is idiosyncratic in that it must appear in the scope of a licenser, but moves to take scope immediately over that licenser at LF. I give a semantic account of FC any as an “irrealis” quantifier. This explains some curious (new and old) facts about FC any's semantics and licensing environments. Furthermore, it predicts that negation and other NPI-licensing environments should license FC any, which would then have just the same meaning as NPI any (pace Ladusaw (1979), Carlson (1980)). Thus, we may unify the two any's as a single universal quantifier, as originally proposed by Lasnik (1972) and others. Such an account implies that NPI any moves over negation at LF—which is confirmed by scope tests. It also explains some well-known problems concerning NPI any in non-downward-entailing environments and under sorry vs. glad.

We describe an approach to training a statistical parser from a bracketed corpus, and demonstrate its use in a software testing application that translates English specifications into an automated testing language. A grammar is not explicitly specified; the rules and contextual probabilities of occurrence are automatically generated from the corpus. The parser is extremely successful at producing and identifying the correct parse, and nearly deterministic in the number of parses that it produces. To compensate for undertraining, the parser also uses general, linguistic subtheories which aid in guessing some types of novel structures.

We describe a general approach to the probabilistic parsing of context-free grammars. The method integrates context-sensitive statistical knowledge of various types (e.g., syntactic and semantic) and can be trained incrementally from a bracketed corpus. We introduce a variant of the GHR context-free recognition algorithm, and explain how to adapt it for efficient probabilistic parsing. On a real-world corpus of sentences from software testing documents, with 23 possible parses for a sentence of average length, the system accurately finds the correct parse in 99% of cases, while producing only 1.02 parses per sentence. Significantly, the success rate would be only 66% without the semantic statistics.

“Winner take all” electoral systems are not fully representative. Unfortunately, the ANC's proposed system of proportional representation is not much better. Because it ensconces party politics, it is only slightly more representative, and poses a serious threat to accountability.

Many modern students of democracy favor proportional representation through the Single Transferable Vote (STV). In countries with high illiteracy, however, this system may be unworkable.

This paper proposes a practical modification of STV. In the modified system, each citizen votes for only one candidate. Voters need not specify their second, third, and fourth choices. Instead, each candidate specifies his or her second, third, and fourth choices. The modified system is no more difficult for voters than current proposals—and it provides virtually all the benefits of STV, together with some new ones.

This talk for a general audience gives a sketch of what the field of cognitive science is about. In its latter half, it turns to the philosophical question of defining intelligence, and proposes a non-operational alternative to the Turing Test.

A broad approach is developed for training dynamical behaviors in connectionist networks. General recurrent networks are powerful computational devices, necessary for difficult tasks like constraint satisfaction and temporal processing. These tasks are discussed here in some detail. From both theoretical and empirical considerations, it is concluded that such tasks are best addressed by recurrent networks that operate continuously in time—and further, that effective learning rules for these continuous-time networks must be able to prescribe their dynamical properties. A general class of such learning rules is derived and tested on simple problems. Where existing learning algorithms for recurrent and non-recurrent networks only attempt to train a network's position in activation space, the models presented here can also explicitly and successfully prescribe the nature of its movement through activation space.

We present a method of using natural language processing (NLP) techniques to extract information from online news feeds and then using the information so extracted to predict changes in stock prices or volatilities. These predictions can be used to make profitable trading strategies. More specifically, company names can be recognized and simple templates describing company actions can be automatically filled using parsing or pattern matching on words in or near the sentence containing the company name. These templates can be clustered into groups which are statistically correlated with changes in the stock prices.

A communications method utilizes memory areas to buffer portions of the media streams. These buffer areas are shared by user applications, with the desirable consequence of reducing workload for the server system distributing media to the user (client) applications. The preferred method allows optimal balancing of buffering delays and server loads, as well as optimal choice of buffer contents for the shared memory buffers.

Secure data interchange.
Frederick S. M. Herz, Walter Paul Labys, David C. Parkes, Sampath
Kannan, and Jason M. Eisner (filed 2000).
Patent pending. [ patent | scholar | bib ]

A secure data interchange system enables information about bilateral and multilateral interactions between multiple persistent parties to be exchanged and leveraged within an environment that uses a combination of techniques to control access to information, release of information, and matching of information back to parties. Access to data records can be controlled using an associated price rule. A data owner can specify a price for different types and amounts of information access.

The system for the automatic determination of customized prices and promotions automatically constructs product offers tailored to individual shoppers, or types of shopper, in a way that attempts to maximize the vendor's profits. These offers are represented digitally. They are communicated either to the vendor, who may act on them as desired, or to an on-line computer shopping system that directly makes such offers to shoppers. Largely by tracking the behavior of shoppers, the system accumulates extensive profiles of the shoppers and the offers that they consider. The system can then select, present, price, and promote goods and services in ways that are tailored to an individual consumer. Likely shoppers can be identified, then enticed with the most effective visual and textual advertisements; deals can be offered to them, either on-line or off-line; detailed product information screens can be subtly rearranged from one type of shopper to the next. Furthermore, when a product can be tailored to a particular shopper, a general technique or expert system can offer each consumer an appropriately customized product.

An adaptive compression technique which is an improvement to Lempel-Ziv (LZ) compression techniques, both as applied for purposes of reducing required storage space and for reducing the transmission time associated with transferring data from point to point. Pre-filled compression dictionaries are utilized to address the problem with prior Lempel-Ziv techniques in which the compression software starts with an empty compression dictionary, whereby little compression is achieved until the dictionary has been filled with sequences common in the data being compressed. In accordance with the invention, the compression dictionary is pre-filled, prior to the beginning of the data compression, with letter sequences, words and/or phrases frequent in the domain from which the data being compressed is drawn. The letter sequences, words, and/or phrases used in the pre-filled compression dictionary may be determined by statistically sampling text data from the same genre of text. Multiple pre-filled dictionaries may be utilized by the compression software at the beginning of the compression process, where the most appropriate dictionary for maximum compression is identified and used to compress the current data. These modifications are made to any of the known Lempel-Ziv compression techniques based on the variants detailed in 1977 and 1978 articles by Ziv and Lempel.

This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a “target profile” for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a “target profile interest summary” for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media. Users' target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically-based pseudonym proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.