Here are some recommended papers that give a good sense of what we've worked on over the years.

Natural language problems often demand new algorithms. The main challenges are a combinatorially large discrete space of linguistic structures and a high-dimensional continuous space of statistical parameters.

Many of our algorithmic papers give general solutions to some formal problem, and thus have multiple uses. I have occasionally proved hardness results.

What are all these discrete algorithms doing? “Structured prediction” is the problem of modeling unknown variables that are themselves complex structures, such as vectors, strings, or trees. The number of values for such a variable may be astronomically large. Searching for the most likely structure is a combinatorial optimization problem. Other combinatorial problems compute the probability of a particular structure or sub-structure.

Doesn't the machine learning community do structured prediction? Yes: graphical model inference must predict discrete vectors. But linguistics must predict strings and trees. Our systems must guess the syntactic structure of a sentence, the translation of a sentence, the grammar of a language, or the set of real-world facts that is consistent with a set of documents.

Dynamic programming is extremely useful for analyzing sequence data. The papers below introduce novel dynamic programming algorithms (primarily for parsing and machine translation).

Other cool papers, not listed below, show how dynamic programming algorithms can be embedded as efficient subroutines within variational inference (belief propagation), relaxation (dual decomposition, row generation), and large-neighborhood local search.

"Deep learning" usually refers to the use of multi-layer neural networks to model functions or probability distributions. The advantage of these highly parametric models is that they are expressive enough to fit a wide range of real-world phenomena. We have been particularly interested in combining deep learning with graphical models and other approaches to structured prediction, in order to marry the flexibility of deep learning with the insight of domain-specific modeling.

Deep architectures for NLP typically include parameters for vector embeddings of words, which is an important subtopic.

Machine learning often searches for parameters that maximize some non-convex and possibly expensive objective function. Our contributions include semiring computations of objective functions and their gradients, a special case of gradients by automatic differentiation (“back-propagation”); variational training objectives that are tractable to compute; and deterministic annealing methods that smooth a non-convex or non-differentiable objective function during early iterations of training.
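To make the annealing idea concrete, here is a minimal sketch of one way to smooth a non-differentiable objective. It is an illustration under stated assumptions, not the exact method of any paper listed here. It assumes a small candidate set with model scores and task losses, and replaces the loss of the single best candidate with the expected loss under a distribution proportional to exp(gamma * score). The function name `annealed_expected_loss` and the toy numbers are hypothetical.

```python
import math

def annealed_expected_loss(scores, losses, gamma):
    """Expected loss under p(y) proportional to exp(gamma * score(y)).

    gamma is an inverse temperature (an assumption of this sketch):
    gamma = 0 gives a uniform average of the losses (very smooth), while a
    large gamma approaches the loss of the highest-scoring candidate.
    """
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(gamma * (s - top)) for s in scores]
    normalizer = sum(weights)
    return sum(w * l for w, l in zip(weights, losses)) / normalizer

# Hypothetical toy candidates: the top-scoring one happens to have loss 1.0.
scores = [1.0, 2.0, 0.5]
losses = [0.0, 1.0, 0.3]

smooth = annealed_expected_loss(scores, losses, gamma=0.0)   # plain average
sharp = annealed_expected_loss(scores, losses, gamma=100.0)  # near 1-best loss
```

Raising gamma gradually over the course of training moves from the smooth objective toward the original one.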

My students and I often identify a pesky formal problem in statistical NLP or ML and try to give a general solution to it. The formal settings for our algorithms often involve finite-state machines, various kinds of grammars and synchronous grammars, and graphical models.

These objects are usually equipped with real-valued weights that define a structured prediction problem (see here). One can treat a wider variety of problems by allowing the weights to be elements of an arbitrary semiring.
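As a rough illustration of why semiring weights are convenient, the sketch below runs one dynamic program over a toy hidden Markov model with two different semirings. The `Semiring` tuple, the `forward` function, and the toy model are all assumptions made for this example, not code from our papers. With (+, ×) the program sums over all state paths to get the total probability of the observations; with (max, ×) the same program returns the probability of the single best path.

```python
from typing import Callable, NamedTuple

class Semiring(NamedTuple):
    plus: Callable[[float, float], float]
    times: Callable[[float, float], float]
    zero: float  # identity for plus
    one: float   # identity for times (unused below, kept for completeness)

SUM_PRODUCT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
MAX_PRODUCT = Semiring(max, lambda a, b: a * b, 0.0, 1.0)

def forward(obs, states, init, trans, emit, sr):
    """Semiring-sum, over all state paths, of the weight of each path."""
    alpha = {s: sr.times(init[s], emit[s][obs[0]]) for s in states}
    for symbol in obs[1:]:
        new_alpha = {}
        for s in states:
            total = sr.zero
            for r in states:
                total = sr.plus(total, sr.times(alpha[r], trans[r][s]))
            new_alpha[s] = sr.times(total, emit[s][symbol])
        alpha = new_alpha
    result = sr.zero
    for s in states:
        result = sr.plus(result, alpha[s])
    return result

# Hypothetical two-state model and a two-symbol observation sequence.
states = ["H", "C"]
init = {"H": 0.5, "C": 0.5}
trans = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit = {"H": {"a": 0.9, "b": 0.1}, "C": {"a": 0.2, "b": 0.8}}
obs = ["a", "b"]

total_prob = forward(obs, states, init, trans, emit, SUM_PRODUCT)
best_path_prob = forward(obs, states, init, trans, emit, MAX_PRODUCT)
```

Only the semiring argument changes between the two calls; the control flow of the algorithm is identical.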

Our work on weighted logic programming (see papers on the Dyna language) has led us to develop flexible algorithms for maintaining truth values in arithmetic circuits.

These papers classify problems into specific computational complexity classes.

My students and I have worked in several ML settings, some of them novel. I have a relatively well-developed statistical philosophy, leading to the design of novel training objectives. We've also offered techniques for optimizing such objectives. Once upon a time, I was into neural networks and have started to use deep learning again.

Our novel contributions to unsupervised learning were primarily developed on grammar induction, including an approach for converting unsupervised learning to supervised learning on synthetically generated data. We have also proposed unsupervised bootstrapping and done a little work on clustering.

Other papers, not listed below, also do unsupervised learning, but using traditional approaches such as EM and MCEM. We develop such approaches for our transformation models and nonparametric models.

What does it mean to learn? What objective should a learning algorithm maximize?

Decision-making systems should be trained discriminatively even if they are structured generatively. Contrastive estimation guides an unsupervised learner to learn the “right” latent variables, by asking it to discriminate between positive data and certain implicit negative data. It is also popular for being fast. Structural annealing applies a domain-specific search bias during early learning. Finally, we have designed objectives for semi-supervised learning.

Bootstrapping is a general strategy for semi-supervised learning. Bootstrapping algorithms sometimes get confused and perform poorly. To cure this, our completely unsupervised “strapping” method is remarkably effective at selecting a successful run of bootstrapping.

We also relate bootstrapping to entropy regularization and apply it in a feature-rich setting of grammar induction.

Intelligent systems may be structured to do approximate probabilistic inference under some carefully crafted model. However, they should be trained discriminatively, so that their actual decisions or predictions maximize task-specific performance. This allows the parameters to compensate for any approximations in modeling, inference, and decoding.

My philosophy comes from Bayesian decision theory:

The task of science is generative modeling. Given a data sample and some knowledge of the underlying processes, what is the true data distribution D? The task of engineering is to find a decision rule that is expected to be accurate and perhaps also fast (on distribution D).

This is the proper relation between generative and discriminative modeling. One should design a sensible space of decision rules (policies) and explicitly search for one having high expected performance over D. In practice this can greatly improve accuracy.

Do you have a realistic model of your domain? Then probabilistic inference will be intractable or slow. But do you really need exact inference? Often, you could confidently make a prediction without reasoning about all of the potentially relevant variables. Many variables have redundant or negligible influence on the final decision.

“Cost-aware learning” refers to learning policies for cheap but accurate inference. This is a form of discriminative training (discussed here) where the cost function includes terms for inference speed, data acquisition cost, model size, etc. Our papers are below.

We are particularly interested in learning inference policies that make dynamic decisions at runtime about where to spend computation (e.g., for parsing or arithmetic circuits). More simply, however, one can choose a static policy for each domain or each example.

A human annotator is an AI system hidden inside a skull. Fortunately, it is not a black-box system. We shouldn't just ask annotators to give us the right answers on training data—while they're at it, they can also mark why they chose those answers.

We showed below that annotator “rationales” are efficient to gather and can be exploited to improve classification accuracy. The idea has been adopted elsewhere in NLP and in computer vision. In addition, several papers on interpretable machine learning have asked artificial systems to produce their own “rationales” in the same form.

How do young children listen to their native language and figure out its typological properties and detailed linguistic structure? I'm dying to know, but I would settle for solving the related NLP problem of grammar induction. Given word sequences or part-of-speech sequences that were presumably generated by a (natural language) grammar, we seek to identify the grammar or the resulting tree structures.

Locally maximizing likelihood (the inside-outside algorithm) does quite poorly on this task for a variety of reasons. We have tried to get some insight into the problems and address them with a variety of search methods and modified objective functions. However, this area is far from solved and we are considering new angles.

I have also worked on learning “deep” grammar from “surface” grammar by modeling syntactic transformations. Beyond syntactic grammars, see also our work on inducing morphological and phonological grammars.

Below are papers that present technically innovative models of linguistic domains (not just algorithms for existing models).

Other papers are more general in nature—they develop generative models or finite-state models that could be applied in other domains. In general, my students and I like to build deep probabilistic models that are intuitively plausible as domain models, while retaining plenty of parametric flexibility to fit unexpected patterns in the data.

We have designed various classes of generative models. These models are of general interest although they were motivated by linguistic problems. They include transformation models and variations on topic models. Some of our models are nonparametric or have deep learning architectures. We have also extended Markov random fields to string-valued random variables.

In Bayesian modeling, one often uses a Dirichlet distribution or Dirichlet process as a prior for a discrete distribution. These priors have the neutrality property: if event x is observed, we raise our posterior estimate of p(x) and correspondingly scale down the estimate of p(y) for all other y.

However, what if x and y are “related” events? In that case, their probabilities should covary: observing one should raise the estimated probability of the other. A transformation model captures this by positing that some instances of y were derived by transformation from x. Indeed, p(y) is defined by summing over all transformation sequences that would terminate at y. We fit a feature-based model of the transformation probabilities, permitting generalization to new events.
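The neutrality property of the Dirichlet prior, which transformation models are designed to relax, can be checked numerically. The sketch below uses the standard posterior-mean formula for a Dirichlet prior with multinomial counts; the helper name `posterior_mean` and the three-event example are hypothetical. After one observation of x, the estimate of p(x) rises and every other event is scaled down by the same factor, regardless of how "related" it is to x.

```python
def posterior_mean(alpha, counts):
    """Posterior mean of a discrete distribution under a Dirichlet(alpha) prior."""
    total = sum(alpha.values()) + sum(counts.values())
    return {event: (alpha[event] + counts.get(event, 0)) / total for event in alpha}

# Hypothetical prior pseudo-counts over three events.
alpha = {"x": 1.0, "y": 2.0, "z": 3.0}

before = posterior_mean(alpha, {})        # no observations yet
after = posterior_mean(alpha, {"x": 1})   # one observation of x

# Factor by which each unobserved event's estimate shrinks.
scale = {event: after[event] / before[event] for event in ("y", "z")}
```

Here `scale["y"]` and `scale["z"]` are equal, which is exactly the behavior a transformation model is meant to change when y is related to x.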

Each of the papers below uses finite-state machines to help model some linguistic domain. In most cases, the model combines multiple machines, or combines finite-state machines with deep learning. Many of these papers also present algorithmic methods.

Natural language data is very rich and can be analyzed at many levels. My students and I have happily worked on problems all over NLP:

Beyond devising exact algorithms, we have developed several principled approximations for speeding up parsing, both for basic models and for enriched models where exact parsing would be impractical.

A number of our papers (not all shown below) try to improve the actual models of linguistic syntax that are used in parsing. For example, several of these algorithms aim to preserve speed for lexicalized models of grammar, which acknowledge that two different verbs (for example) may behave differently.

Remark: The probabilities under lexicalized models can capture some crude semantic preferences along with syntax (i.e., selectional preferences). In fact, in our very early work, we actually conditioned probabilities on words according to their role in a semantic representation. I subsequently argued for bilexical parsing as an approximation to this, and gave the first generative model for dependency parsing (which was also the first non-edge-factored model).

A natural-language grammar will generally contain many related syntactic constructions for a given word (e.g., active and passive). Most grammar formalisms explain this redundancy by assuming some mechanism for generating new constructions systematically from old ones.

My dissertation work showed how to model these “syntactic transformations” statistically, learning how deeply related rules covary. It inferred the deep relationships from a sample of observed constructions, enabling it to generalize to unseen constructions (“transformational smoothing”). This work introduced the more general technique of transformation modeling.

The 2008, 2009, and 2011 papers below built up an elegant model of inflectional morphology, with each paper building on the previous one. The work is gathered together in Dreyer's dissertation. Further work beginning in 2015 extended the approach to use latent underlying morphs, allowing it to treat derivational morphology as well.

Most syntax-based models of translation assume that in training data, a sentence and its translation have isomorphic syntactic structure. The papers below work to weaken that assumption, which often fails in practice.

Below are all our papers on machine translation—an assortment of interesting techniques motivated by different search, learning, and modeling challenges in MT. If there is any consistent theme, it is that we usually work with a full probability distribution over possible translations, not just its mode.

I was in graduate school when Optimality Theory took over phonology. There was no computational treatment of OT yet. I provided key finite-state algorithms for generation and comprehension, and proposed plausible modifications to the formalism to keep it within finite-state power. I also analyzed the computational complexity of grammar learning.

In order to do this computational work, I first had to conjecture a universal set of legal constraints (i.e., universal grammar). Nearly all the constraints I found in hundreds of OT papers fit into my simple taxonomy of “primitive constraints.” For those that didn't fit, I exhibited an alternative analysis within my framework of the linguistic data. The “primitive OT” analysis of metrical stress is arguably superior because it predicted previously unexplained typological gaps.

More recently, my students and I have worked on recovering the phonological underlying forms of a language, jointly with a probabilistic phonology. We have also worked on probabilistically modeling the typology of vowel systems.

Learning French in high school was so slow and artificial compared to learning my native language, English. Why all these vocabulary lists and toy-data sentences? Why couldn't I pick up French words and constructions in context, by reading something interesting?

In high school, I wanted to write a novel that gradually morphed from English to French, starting with English words in French order, and gradually dropping in French function words and content words when they were clear from context. Now with machine translation, we're starting to create this kind of hybrid text automatically ...

The Dyna language is our bid to provide a unifying framework for data and algorithms across many settings.

Programming in Dyna is meant to be easy. A program is a short, high-level schematic description of the structure of a computation. It simply defines various values in terms of other values. The user can query and update values at runtime.

It is the system's job to choose efficient data structures and computations to support the possible queries and updates. This leads to interesting algorithmic problems, connected to query planning, deductive databases, and adaptive systems.

The forthcoming version of the language is described in Eisner & Filardo (2011), which illustrates its power on a wide range of problems in statistical AI and beyond. We released a prototype back in 2005, which was limited to semiring-weighted computations but has been used profitably in a number of NLP papers. The new working implementation under development is available here on GitHub.

This paper introduces a framework for the automatic evaluation of natural language text. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges; indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be combined to predict each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. The framework accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that the framework with 8 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1–4, with RMS error < 0.5, a 2× improvement over the uncalibrated baseline.

Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.

We design probes trained on the internal representations of a transformer language model that are predictive of its hallucinatory behavior on in-context generation tasks. To facilitate this detection, we create a span-annotated dataset of organic and synthetic hallucinations over several tasks. We find that probes trained on the force-decoded states of synthetic hallucinations are generally ecologically invalid in organic hallucination detection. Furthermore, hidden state information about hallucination appears to be task and distribution-dependent. Intrinsic and extrinsic hallucination saliency varies across layers, hidden state types, and tasks; notably, extrinsic hallucinations tend to be more salient in a transformer's internal representations. Outperforming multiple contemporary baselines, we show that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available.

Large language models (LLMs) have demonstrated impressive capabilities in storing and recalling factual knowledge, but also in adapting to novel in-context information. Yet, the mechanisms underlying their in-context grounding remain unknown, especially in situations where in-context information contradicts factual knowledge embedded in the parameters. This is critical for retrieval-augmented generation methods, which enrich the context with up-to-date information, hoping that grounding can rectify the outdated parametric knowledge. In this study, we introduce Fakepedia, a counterfactual dataset designed to evaluate grounding abilities when the parametric knowledge clashes with the in-context information. We benchmark various LLMs with Fakepedia and discover that GPT-4-turbo has a strong preference for its parametric knowledge. Mistral-7B, on the contrary, is the model that most robustly chooses the grounded answer. Then, we conduct causal mediation analysis on LLM components when answering Fakepedia queries. We demonstrate that inspection of the computational graph alone can predict LLM grounding with 92.8% accuracy, especially because few MLPs in the Transformer can predict non-grounded behavior. Our results, together with existing findings about factual recall mechanisms, provide a coherent narrative of how grounding and factual recall mechanisms interact within LLMs.

A language model may be viewed as a stochastic process over some alphabet. However, in some pathological situations, such a stochastic process may “leak” probability mass onto the set of infinite strings and hence is not equivalent to the conventional view of a language model as a distribution over strings. Such ill-behaved language processes are referred to as non-tight in the literature. In this work, we study conditions of tightness through the lens of stochastic processes. In particular, formulating EOS as a stopping time and using results from martingale theory, we give characterizations of tightness that generalize previous works.
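A small numerical sketch can make the notion of a non-tight process concrete. It assumes a simplified process in which the probability of emitting EOS depends only on the position t; the function names and the two EOS schedules are hypothetical choices for illustration and are not taken from the paper. The probability of having terminated within n steps is one minus the product of the per-step survival probabilities.

```python
def termination_probability(eos_prob, max_len):
    """Probability that EOS has been generated within max_len steps.

    eos_prob(t) is the probability of emitting EOS at step t (t = 1, 2, ...),
    given that the string has not ended before step t.
    """
    survive = 1.0
    for t in range(1, max_len + 1):
        survive *= 1.0 - eos_prob(t)
    return 1.0 - survive

def slow_eos(t):
    # EOS probability 1/(t+1): shrinks slowly, so the process still terminates.
    return 1.0 / (t + 1)

def fast_eos(t):
    # EOS probability 1/(t+1)^2: shrinks so quickly that mass leaks away.
    return 1.0 / (t + 1) ** 2

N = 10_000
tight_mass = termination_probability(slow_eos, N)   # tends to 1 as N grows
leaky_mass = termination_probability(fast_eos, N)   # tends to 1/2 as N grows
```

In the second schedule, about half of the probability mass is assigned to infinite strings in the limit, so the process does not define a distribution over finite strings.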

We consider conditional sampling from neural language models, for example in controlled generation and infilling. This is equivalent to sampling from an energy-based model. Although this target distribution is discrete, the internal computations of the energy function (given by the language model) are differentiable, so one would like to exploit gradient information within a method such as MCMC. Alas, all previous attempts to generalize gradient-based MCMC to discrete state spaces such as text fail to sample correctly from the target distribution. We propose a solution, along with variants, and study its theoretical properties. Through experiments on various forms of text generation, we demonstrate that our unbiased samplers are able to generate more fluent text while adhering to the control objectives better. The same methods could be applied to sample from discrete energy-based models unrelated to text.

We describe a class of tasks called decision-oriented dialogues, in which AI assistants such as large language models (LMs) must collaborate with one or more humans via natural language to help them make complex decisions. We formalize three domains in which users face everyday decisions: (1) choosing an assignment of reviewers to conference papers, (2) planning a multi-step itinerary in a city, and (3) negotiating travel plans for a group of friends. In each of these settings, AI assistants and users have disparate abilities that they must combine to arrive at the best decision: assistants can access and process large amounts of information, while users have preferences and constraints external to the system. For each task, we build a dialogue environment where agents receive a reward based on the quality of the final decision they reach. We evaluate LMs in self-play and in collaboration with humans and find that they fall short compared to human assistants, achieving much lower rewards despite engaging in longer dialogues. We highlight a number of challenges models face in decision-oriented dialogues, ranging from goal-directed behavior to reasoning and optimization, and release our environments as a testbed for future work.

Users of natural language interfaces, generally powered by Large Language Models (LLMs), often must repeat their preferences each time they make a similar request. To alleviate this, we propose including some of a user's preferences and instructions in natural language (collectively termed standing instructions) as additional context for such interfaces. For example, when a user states “I'm hungry,” their previously expressed preference for Persian food will be automatically added to the LLM prompt, so as to influence the search for relevant restaurants. We develop NLSI, a language-to-program dataset consisting of over 2.4K dialogues spanning 17 domains, where each dialogue is paired with a user profile (a set of user-specific standing instructions) and corresponding structured representations (API calls). A key challenge in NLSI is to identify which subset of the standing instructions is applicable to a given dialogue. NLSI contains diverse phenomena, from simple preferences to interdependent instructions such as triggering a hotel search whenever the user is booking tickets to an event. We conduct experiments on NLSI using prompting with large language models and various retrieval approaches, achieving a maximum of 44.7% exact match on API prediction. Our results demonstrate the challenges in identifying the relevant standing instructions and their interpretation into API calls.

I present a new approach to implementing weighted logic programming languages. I first present a bag-relational algebra that is expressive enough to capture the desired denotational semantics, directly representing the recursive conjunctions, disjunctions, and aggregations that are specified by a source program. For the operational semantics, I develop a term-rewriting system that executes a program by simplifying its corresponding algebraic expression.

I have used this approach to create the first complete implementation of the Dyna programming language. A Dyna program consists of rules that define a potentially infinite and cyclic computation graph, which is queried to answer data-dependent questions. Dyna is a unified declarative framework for machine learning and artificial intelligence researchers that supports dynamic programming, constraint logic programming, reactive programming, and object-oriented programming. I have further modernized Dyna to support functional programming with lambda closures and embedded domain-specific languages.

The implementation includes a front-end that translates Dyna programs to bag-relational expressions, a Python API, hundreds of term rewriting rules, and a procedural engine for determining which rewrite rules to apply. The rewrite rules generalize techniques used in constraint logic programming. In practice, our system is usually able to provide simple answers to queries.

Mixing disparate programming paradigms is not without challenges. We had to rethink the classical techniques used to implement logic programming languages. This includes the development of a novel approach for memoization (dynamic programming) that supports partial memoization of fully or partially simplified algebraic expressions, which may contain delayed, unevaluated constraints. Furthermore, real-world Dyna programs require fast and efficient execution. For this reason, I present a novel approach to just-in-time (JIT) compile sequences of term rewrites using a custom tracing JIT.

Note: Dr. Francis-Landau's dissertation advisor was Jason Eisner.

Report of the ACL committee on the anonymity policy.
Leon Derczynski, Jason Eisner, Yoav Goldberg, Iryna Gurevych, Lillian
Lee, Joakim Nivre, Brian Roark, Thamar Solorio, Yue Zhang, Chengqing Zong,
and Diyi Yang (2023).
Report available on the wiki of the Association for Computational
Linguistics. [ report | policies | bib ]

This is the report of a working group appointed by the ACL Executive Committee to redesign policies on conference submissions and social media discussion. The group was chaired by ACL President Iryna Gurevych. The policies were adopted by the Association for Computational Linguistics effective January 12, 2024.

Neural finite-state transducers (NFSTs) form an expressive family of neurosymbolic sequence transduction models. An NFST models each string pair as having been generated by a latent path in a finite-state transducer. As they are deep generative models, both training and inference of NFSTs require inference networks that approximate posterior distributions over such latent variables. In this paper, we focus on the resulting challenge of imputing the latent alignment path that explains a given pair of input and output strings (e.g., during training). We train three autoregressive approximate models for amortized inference of the path, which can then be used as proposal distributions for importance sampling. All three models perform lookahead. Our most sophisticated (and novel) model leverages the FST structure to consider the graph of future paths; unfortunately, we find that it loses out to the simpler approaches—except on an artificial task that we concocted to confuse the simpler approaches.

Recent work has shown that generation from a prompted or fine-tuned language model can perform well at semantic parsing when the output is constrained to be a valid semantic representation. We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing, that includes context-free grammars for seven semantic parsing datasets and two syntactic parsing datasets with varied output representations, as well as a constrained decoding interface to generate only valid outputs covered by these grammars. We provide low, medium, and high resource splits for each dataset, allowing accurate comparison of various language models under different data regimes. Our benchmark supports evaluation of language models using prompt-based learning as well as fine-tuning. We benchmark eight language models, including two GPT-3 variants available only through an API. Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.

Can non-programmers annotate natural language utterances with complex programs that represent their meaning? We introduce APEL, a framework in which non-programmers select among candidate programs generated by a seed semantic parser (e.g., Codex). Since they cannot understand the candidate programs, we ask them to select indirectly by examining the programs’ input-output examples. For each utterance, APEL actively searches for a simple input on which the candidate programs tend to produce different outputs. It then asks the non-programmers only to choose the appropriate output, thus allowing us to infer which program is correct and could be used to fine-tune the parser. As a first case study, we recruited human non-programmers to use APEL to re-annotate SPIDER, a text-to-SQL dataset. Our approach achieved the same annotation accuracy as the original expert annotators (75%) and exposed many subtle errors in the original annotations.

Large language models (LLMs) can improve their accuracy on various tasks through iteratively refining and revising their output based on feedback. We observe that these revisions can introduce errors, in which case it is better to roll back to a previous result. Further, revisions are typically homogeneous: they use the same reasoning method that produced the initial answer, which may not correct errors. To enable exploration in this space, we present SCREWS, a modular framework for reasoning with revisions. It is comprised of three main modules: Sampling, Conditional Resampling, and Selection, each consisting of sub-modules that can be hand-selected per task. We show that SCREWS not only unifies several previous approaches under a common framework, but also reveals several novel strategies for identifying improved reasoning chains. We evaluate our framework with state-of-the-art LLMs (ChatGPT and GPT-4) on a diverse set of reasoning tasks and uncover useful new reasoning strategies for each: arithmetic word problems, multi-hop question answering, and code debugging. Heterogeneous revision strategies prove to be important, as does selection between original and revised candidates.

Language models have always been a fundamental NLP tool and application. This thesis
focuses on open-vocabulary language models, i.e., models that can deal with novel and
unknown words at runtime. We will propose both new ways to construct such models as
well as use such models in cross-linguistic evaluations to answer questions of difficulty and
language-specificity in modern NLP tools.

We start by surveying linguistic background as well as past and present NLP approaches
to tokenization and open-vocabulary language modeling (Mielke et al., 2021). Thus
equipped, we establish desirable principles for such models, both from an engineering
mindset as well as a linguistic one and hypothesize a model based on the marriage of neural
language modeling and Bayesian nonparametrics to handle a truly infinite vocabulary,
boasting attractive theoretical properties and mathematical soundness, but presenting
practical implementation difficulties. As a compromise, we thus introduce a word-based
two-level language model that still has many desirable characteristics while being highly
feasible to run (Mielke and Eisner, 2019). Unlike the more dominant approaches of
characters or subword units as one-layer tokenization, it uses words; its key feature is the
ability to generate novel words in context and in isolation.

Moving on to evaluation, we ask: how do such models deal with the wide variety of
languages of the world—are they struggling with some languages? Relating this question
to a more linguistic one, are some languages inherently more difficult to deal with? Using
simple methods, we show that indeed they are, starting with a small pilot study that
suggests typological predictors of difficulty (Cotterell et al., 2018). Thus encouraged,
we design a far bigger study with more powerful methodology, a principled and highly
feasible evaluation and comparison scheme based again on multi-text likelihood (Mielke
et al., 2019). This larger study shows that the earlier conclusion of typological predictors is
difficult to substantiate, but also offers a new insight on the complexity of Translationese.
Following that theme, we end by extending this scheme to machine translation models to
answer questions traditional evaluation metrics like BLEU cannot (Bugliarello et al., 2020).

Note: Dr. Mielke's dissertation advisor was Jason Eisner.

There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual population. However, existing datasets are limited in size. It is expensive and difficult to collect real transcribed spoken CS data due to the challenges of finding and identifying CS data in the wild. As a result, many attempts have been made to generate synthetic CS data. Existing methods either require the existence of CS data during training, or are driven by linguistic knowledge. We introduce a novel approach of forcing a multilingual MT system that was trained on non-CS data to generate CS translations. Comparing against two prior methods, we show that simply leveraging the shared representations of two languages (Mandarin and English) yields better CS text generation and, ultimately, better CS ASR.

Many NLP algorithms have been described in terms of deduction systems. Unweighted deduction allows a generic forward-chaining execution strategy. For weighted deduction, however, efficient execution should propagate the weight of each item only after it has converged. This means visiting the items in topologically sorted order (as in dynamic programming). Toposorting is fast on a materialized graph; unfortunately, materializing the graph would take extra space. Is there a generic weighted deduction strategy which, for every acyclic deduction system and every input, uses only a constant factor more time and space than generic unweighted deduction? After reviewing past strategies, we answer this question in the affirmative by combining ideas of Goodman (1999) and Kahn (1962). We also give an extension to cyclic deduction systems, based on Tarjan (1972).
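
To make the topological-order idea concrete, here is a minimal Python sketch of weighted deduction on a materialized acyclic hypergraph, using Kahn-style counters. It is only the baseline that the abstract contrasts against (it stores the whole graph), not the paper's space-efficient strategy. The rule encoding `(head, body, weight)` and the use of the ordinary sum-product semiring over floats are assumptions of this sketch.

```python
from collections import defaultdict, deque

def weighted_deduction(edges):
    """edges: list of (head, body, weight), where body is a tuple of items.
    An edge with an empty body is an axiom. Returns item -> total weight,
    i.e. the sum over derivations of the product of edge weights."""
    incoming = defaultdict(int)     # item -> number of incoming edges not yet fired
    waiting = []                    # edge index -> number of body items not yet converged
    consumers = defaultdict(list)   # item -> edges that use it in their body
    for i, (head, body, _) in enumerate(edges):
        incoming[head] += 1
        waiting.append(len(set(body)))
        for item in set(body):
            consumers[item].append(i)

    value = defaultdict(float)
    ready = deque(i for i, n in enumerate(waiting) if n == 0)  # axioms fire first
    while ready:
        i = ready.popleft()
        head, body, weight = edges[i]
        product = weight
        for item in body:
            product *= value[item]  # every body item has already converged
        value[head] += product
        incoming[head] -= 1
        if incoming[head] == 0:     # head has converged: propagate it exactly once
            for j in consumers[head]:
                waiting[j] -= 1
                if waiting[j] == 0:
                    ready.append(j)
    return dict(value)

edges = [
    ("a", (), 2.0),
    ("b", (), 3.0),
    ("c", ("a", "b"), 1.0),
    ("c", ("a",), 0.5),
    ("d", ("c", "c"), 1.0),
]
print(weighted_deduction(edges))
```

The item `c` is used to build `d` only after both rules deriving `c` have fired, so `d` never sees a partial weight for `c`.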

This thesis develops a system for automatically analyzing and improving dynamic programs, such as those that have driven progress in natural language processing and computer science, more generally, for decades. Finding a correct program with the optimal asymptotic runtime can be unintuitive, time-consuming, and error-prone. This thesis aims to automate this laborious process. To this end, we develop an approach based on (1) a high-level, domain-specific language called Dyna for concisely specifying dynamic programs; (2) a general-purpose solver to efficiently execute these programs; (3) a static analysis system that provides type analysis and worst-case time/space complexity analyses; (4) a rich collection of meaning-preserving transformations to programs, which systematizes the repeated insights of numerous authors when speeding up algorithms in the literature; (5) a search algorithm for identifying a good sequence of transformations that reduce the runtime complexity given an initial, correct program. We show that, in practice, automated search—like the mental search performed by human programmers—can find substantial improvements to the initial program. Empirically, we show that many speed-ups described in the NLP literature could have been discovered automatically by our system. We provide a freely available prototype system at https://github.com/timvieira/dyna-pi.

Note: Dr. Vieira's dissertation advisor was Jason Eisner.

In a real-world dialogue system, generated text must be truthful and informative while remaining fluent and adhering to a prescribed style. Satisfying these constraints simultaneously is difficult for the two predominant paradigms in language generation: neural language modeling and rule-based generation. We describe a hybrid architecture for dialogue response generation that combines the strengths of both paradigms. The first component of this architecture is a rule-based content selection model defined using a new formal framework called dataflow transduction, which uses declarative rules to transduce a dialogue agent's actions and their results (represented as dataflow graphs) into context-free grammars representing the space of contextually acceptable responses. The second component is a constrained decoding procedure that uses these grammars to constrain the output of a neural language model, which selects fluent utterances. Our experiments show that this system outperforms both rule-based and learned approaches in human evaluations of fluency, relevance, and truthfulness.

Voice dictation is an increasingly important text input modality. Existing systems that allow both dictation and editing-by-voice restrict their command language to flat templates invoked by trigger words. In this work, we study the feasibility of allowing users to interrupt their dictation with spoken editing commands in open-ended natural language. We introduce a new task and dataset to experiment with such systems. To support this flexibility in real-time, a system must incrementally segment and classify spans of speech as either dictation or command, and interpret the spans that are commands. We experiment with using large pre-trained language models to predict the edited text, or alternatively, to predict a small text-editing program. Experiments show a natural trade-off between model accuracy and latency: a smaller model achieves 28% single-command interpretation accuracy with 1.3 seconds of latency, while a larger model achieves 55% with 7 seconds of latency.

Task-oriented dialogue systems often assist users with personal or confidential matters. For this reason, the developers of such a system are generally prohibited from observing actual usage. So how can they know where the system is failing and needs more training data or new functionality? In this work, we study ways in which realistic user utterances can be generated synthetically, to help increase the linguistic and functional coverage of the system, without compromising the privacy of actual users. To this end, we propose a two-stage Differentially Private (DP) generation method which first generates latent semantic parses, and then generates utterances based on the parses. Our proposed approach improves MAUVE by 3.8× and parse tree node-type overlap by 1.4× relative to current approaches for private synthetic data generation, improving both on fluency and semantic coverage. We further validate our approach on a realistic domain adaptation task of adding new functionality from private user data to a semantic parser, and show gains of 1.3× on its accuracy with the new feature.

Given a language model (LM), maximum probability is a poor decoding objective for open-ended generation, because it produces short and repetitive text. On the other hand, sampling can often produce incoherent text that drifts from the original topics. We propose contrastive decoding (CD), a reliable decoding approach that optimizes a contrastive objective subject to a plausibility constraint. The contrastive objective returns the difference between the likelihood under a large LM (called the expert, e.g., OPT-13B) and a small LM (called the amateur, e.g., OPT-125M), and the constraint ensures that the outputs are plausible. CD is inspired by the fact that the failures of larger LMs (e.g., repetition, incoherence) are even more prevalent in smaller LMs, and that this difference signals which texts should be preferred. CD requires zero additional training, and produces higher quality text than decoding from the larger LM alone. It also works across model scales (OPT-13B and GPT2-1.5B) and significantly outperforms four strong decoding algorithms (e.g., nucleus, top-k) in automatic and human evaluations across Wikipedia, news and story domains.
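
A single decoding step under this objective might look like the following Python sketch. The next-token distributions are made-up toy numbers, and the specific form of the plausibility constraint used here (keep only tokens whose expert probability is at least a fraction `alpha` of the expert's highest probability) is an assumption of the sketch, as are the function names.

```python
import math

def contrastive_scores(p_expert, p_amateur, alpha=0.1):
    """Score each plausible token by log p_expert - log p_amateur.
    Tokens that fail the plausibility cutoff are dropped entirely."""
    cutoff = alpha * max(p_expert.values())
    return {
        tok: math.log(pe) - math.log(p_amateur[tok])
        for tok, pe in p_expert.items()
        if pe >= cutoff
    }

def pick_next_token(p_expert, p_amateur, alpha=0.1):
    """Greedy choice of the token with the highest contrastive score."""
    scores = contrastive_scores(p_expert, p_amateur, alpha)
    return max(scores, key=scores.get)

# Toy next-token distributions (hypothetical numbers, each summing to 1).
p_expert = {"the": 0.5, "Paris": 0.3, "a": 0.199, "zzz": 0.001}
p_amateur = {"the": 0.6, "Paris": 0.05, "a": 0.3499, "zzz": 0.0001}

print(pick_next_token(p_expert, p_amateur))             # with the constraint
print(pick_next_token(p_expert, p_amateur, alpha=0.0))  # constraint disabled
```

In this toy case the constraint matters: the rare token "zzz" has the largest expert-to-amateur ratio, so it would win if implausible tokens were not filtered out first.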

Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, the estimated distribution sums to 1 over all finite strings. However, in some pathological cases, probability mass can “leak” onto the set of infinite sequences. In order to characterize the notion of leakage more precisely, this paper offers a measure-theoretic treatment of language modeling. We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense. We also generalize characterizations of tightness proposed in previous works.
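
A small numerical illustration of leakage, as a sketch: the two toy models below are hypothetical, and they assume that the probability of emitting end-of-sequence (EOS) depends only on the position `t`. The probability of producing a finite string is one minus the probability of never emitting EOS.

```python
def prob_terminated_by(eos_prob, max_len):
    """Probability of emitting EOS within the first max_len steps, where
    eos_prob(t) is the probability of EOS at step t given no EOS so far."""
    survive = 1.0
    for t in range(max_len):
        survive *= 1.0 - eos_prob(t)
    return 1.0 - survive

def tight(t):
    """Constant EOS probability: the surviving mass decays geometrically."""
    return 0.1

def leaky(t):
    """EOS probability shrinking like 1/(t+2)^2: termination is never certain."""
    return 1.0 / (t + 2) ** 2

for n in (10, 100, 1000):
    print(n, prob_terminated_by(tight, n), prob_terminated_by(leaky, n))
```

For the second model the surviving mass is the product of (1 - 1/n^2) for n = 2, 3, ..., which telescopes to (N+1)/(2N) after the factor n = N and so converges to 1/2. Half of the probability mass therefore sits on infinite sequences, and the model is not tight.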

This paper provides a reference description, in the form of a deduction system, of Earley's (1970) context-free parsing algorithm with various speed-ups. Our presentation includes a known worst-case runtime improvement from Earley's O(N^{3}|G||R|), which is unworkable for the large grammars that arise in natural language processing, to O(N^{3}|G|), which matches the runtime of CKY on a binarized version of the grammar G. Here N is the length of the sentence, |R| is the number of productions in G, and |G| is the total length of those productions. We also provide a version that achieves runtime of O(N^{3}|M|) with |M|<=|G| when the grammar is represented compactly as a single finite-state automaton M (this is partly novel). We carefully treat the generalization to semiring-weighted deduction, preprocessing the grammar like Stolcke (1995) to eliminate deduction cycles, and further generalize Stolcke's method to compute the weights of sentence prefixes. We also provide implementation details for efficient execution, ensuring that on a preprocessed grammar, the semiring-weighted versions of our methods have the same asymptotic runtime and space requirements as the unweighted methods, including sub-cubic runtime on some grammars.

To predict the next token, autoregressive models ordinarily examine the past. Could they also benefit from examining hypothetical futures? We consider a novel Transformer-based autoregressive architecture that estimates the next-token distribution by extrapolating multiple continuations of the past, according to some proposal distribution, and attending to these extended strings. This architecture draws insights from classical AI systems such as board game players: when making a local decision, a policy may benefit from exploring possible future trajectories and analyzing them. On multiple tasks including morphological inflection and Boolean satisfiability, our lookahead model is able to outperform the ordinary Transformer model of comparable size. However, on some tasks, it appears to be benefiting from the extra computation without actually using the lookahead information. We discuss possible variant architectures as well as future speedups.

Weighted finite-state automata (WFSAs) are commonly used in NLP. Failure transitions are a useful extension for compactly representing backoffs or interpolation in n-gram models and CRFs, which are special cases of WFSAs. The pathsum in ordinary acyclic WFSAs is efficiently computed by the backward algorithm in time O(|E|), where E is the set of transitions. However, this does not allow failure transitions, and preprocessing the WFSA to eliminate failure transitions could greatly increase |E|. We extend the backward algorithm to handle failure transitions directly. Our approach is efficient when the average state has outgoing arcs for only a small fraction s << 1 of the alphabet Σ. We propose an algorithm for general acyclic WFSAs which runs in O(|E| + s |Σ| |Q| T_{max} log |Σ|), where Q is the set of states and T_{max} is the size of the largest connected component of failure transitions. When the failure transition topology satisfies a condition exemplified by CRFs, the T_{max} factor can be dropped, and when the weight semiring is a ring, the log |Σ| factor can be dropped. In the latter case (ring-weighted acyclic WFSAs), we also give an alternative algorithm with complexity O(|E| + |Σ| |Q| min(1,s π_{max}) ), where π_{max} is the size of the longest failure path.
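
For reference, the ordinary backward algorithm that serves as the starting point can be sketched in a few lines of Python. This sketch does not handle failure transitions. It assumes real-valued weights under ordinary addition and multiplication, and it assumes that states are numbered in topological order, so every arc goes from a lower-numbered to a higher-numbered state. Arc labels are omitted because the pathsum does not depend on them.

```python
def backward_pathsum(num_states, arcs, final_weight, start=0):
    """Total weight of all accepting paths from `start` in an acyclic WFSA.
    arcs: list of (src, dst, weight) with src < dst.
    final_weight: dict mapping final states to their stopping weights."""
    out = [[] for _ in range(num_states)]
    for src, dst, w in arcs:
        out[src].append((dst, w))

    beta = [0.0] * num_states            # beta[q] = pathsum from q to acceptance
    for q in reversed(range(num_states)):
        total = final_weight.get(q, 0.0)
        for dst, w in out[q]:
            total += w * beta[dst]       # beta[dst] is final because dst > q
        beta[q] = total
    return beta[start]

arcs = [(0, 1, 2.0), (0, 2, 3.0), (1, 3, 0.5), (2, 3, 1.0)]
print(backward_pathsum(4, arcs, {3: 1.0}))
```

Each arc is visited exactly once, which gives the O(|E|) running time mentioned above (plus O(|Q|) to initialize the table).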

In natural language understanding (NLU) production systems, users' evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation into this incremental symbol learning scenario. Our analyses reveal a troubling quirk in building (broad-coverage) NLU systems: as the training dataset grows, more data is needed to learn new symbols, forming a vicious cycle. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated with an effect we call source signal dilution, where strong lexical cues for the new symbol become diluted as the training dataset grows. Selectively dropping training examples to prevent dilution often reverses the trend, showing the over-reliance of mainstream neural NLU models on simple lexical cues and their lack of contextual understanding.

Huge neural autoregressive sequence models have achieved impressive performance across different applications, such as NLP, reinforcement learning, and bioinformatics. However, some lingering problems (e.g., consistency and coherency of generated texts) continue to exist, regardless of the parameter count. In the first part of this thesis, we chart a taxonomy of the expressiveness of various sequence model families (Ch 3). In particular, we put forth complexity-theoretic proofs that string latent-variable sequence models are strictly more expressive than energy-based sequence models, which in turn are more expressive than autoregressive sequence models. Based on these findings, we introduce residual energy-based sequence models, a family of energy-based sequence models (Ch 4) whose sequence weights can be evaluated efficiently, and also perform competitively against autoregressive models. However, we show how unrestricted energy-based sequence models can suffer from uncomputability; and how such a problem is generally unfixable without knowledge of the true sequence distribution (Ch 5). In the second part of the thesis, we study practical sequence model families and algorithms based on theoretical findings in the first part of the thesis. We introduce neural particle smoothing (Ch 6), a family of approximate sampling methods that work with conditional latent variable models. We also introduce neural finite-state transducers (Ch 7), which extend weighted finite state transducers with the introduction of mark strings, allowing scoring transduction paths in a finite state transducer with a neural network. Finally, we propose neural regular expressions (Ch 8), a family of neural sequence models that are easy to engineer, allowing a user to design flexible weighted relations using Marked FSTs, and combine these weighted relations together with various operations.

Note: Dr. Lin's dissertation advisor was Jason Eisner.

The Bar-Hillel construction is a classic result in formal language theory. It shows, by construction, that the intersection between a context-free language and a regular language is itself context-free. However, neither its original formulation (Bar-Hillel et al., 1961) nor its weighted extension (Nederhof and Satta, 2003) can handle automata with ε-arcs. In this short note, we generalize the Bar-Hillel construction to correctly compute the intersection even when the automaton contains ε-arcs. We further prove that our generalized construction leads to a grammar that encodes the structure of both the input automaton and grammar while retaining the asymptotic size of the original construction.

Standard conversational semantic parsing maps a complete user utterance into an executable program, after which the program is executed to respond to the user. This could be slow when the program contains expensive function calls. We investigate the opportunity to reduce latency by predicting and executing function calls while the user is still speaking. We introduce the task of online semantic parsing for this purpose, with a formal latency reduction metric inspired by simultaneous machine translation. We propose a general framework with first a learned prefix-to-program prediction module, and then a simple yet effective thresholding heuristic for subprogram selection for early execution. Experiments on the SMCalFlow and TreeDST datasets show our approach achieves large latency reduction with good parsing quality, with a 30–60% latency reduction depending on function execution time and allowed cost.

The typology of sound systems in spoken human languages should be explained in part by functional pressures on communication. Two competing pressures are per-phoneme transmission rate and ease of communication. A system with few phonemes transmits limited information per phoneme, but the phonemes can be easily pronounced and perceived. Adding more phonemes can increase the amount of information per phoneme—but that requires using “outlier” sounds, which are more difficult to pronounce or perceive, or else splitting phonemes, which requires more speaker and hearer effort to keep them distinct. We encode these two competing pressures into a proposed universal prior for a generative probability model. We find that a model of vowel token formants is more predictive of held-out data if it is trained with the help of this prior (that is, by MAP rather than ML).

The neural Hawkes process (Mei & Eisner, 2017) is a generative model of irregularly spaced sequences of discrete events. To handle complex domains with many event types, Mei et al. (2020) further consider a setting in which each event in the sequence updates a deductive database of facts (via domain-specific pattern-matching rules); future events are then conditioned on the database contents. They show how to convert such a symbolic system into a neuro-symbolic continuous-time generative model, in which each database fact and possible event has a time-varying embedding that is derived from its symbolic provenance.

In this paper, we modify both models, replacing their recurrent LSTM-based architectures with flatter attention-based architectures (Vaswani et al., 2017), which are simpler and more parallelizable. This does not appear to hurt our accuracy, which is comparable to or better than that of the original models as well as (where applicable) previous attention-based methods (Zuo et al., 2020; Zhang et al., 2020a).

Computational models of human language often involve combinatorial problems. For instance, a probabilistic parser may marginalize over exponentially many trees to make predictions. Algorithms for such problems often employ dynamic programming and are not always unique. Finding one with optimal asymptotic runtime can be unintuitive, time-consuming, and error-prone. Our work aims to automate this laborious process. Given an initial correct declarative program, we search for a sequence of semantics-preserving transformations to improve its running time as much as possible. To this end, we describe a set of program transformations, a simple metric for assessing the efficiency of a transformed program, and a heuristic search procedure to improve this metric. We show that in practice, automated search—like the mental search performed by human programmers—can find substantial improvements to the initial program. Empirically, we show that many speed-ups described in the NLP literature could have been discovered automatically by our system.

We explore the use of large pretrained language models as few-shot semantic parsers. The goal in semantic parsing is to generate a structured meaning representation given a natural language input. However, language models are trained to generate natural language. To bridge the gap, we use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation. With a small amount of data and very little code to convert into English-like representations, we provide a blueprint for rapidly bootstrapping semantic parsers and demonstrate good performance on multiple tasks.

Linguistic typology studies the range of structures present in human language. The main goal of the field is to discover which sets of possible phenomena are universal, and which are merely frequent. For example, all languages have vowels, while most—but not all—languages have a [u] sound. In this paper we present the first probabilistic treatment of a basic question in phonological typology: What makes a natural vowel inventory? We introduce a series of deep probability models. In Chapter 1, we give an overview of the relevant background material in phonetics and the typology of vowel systems. In Chapter 2, we introduce a series of deep stochastic point processes, and contrast them with previous computational, simulation-based approaches. We provide a comprehensive suite of experiments on over 200 distinct languages. In Chapter 3, we work directly with the acoustic information—the first two formant values—rather than modeling discrete sets of phonemic symbols (IPA). We develop a novel generative probability model and report results based on the same corpus of over 200 languages. In Chapter 4, we focus on a functional account of vowel systems. The typology of vowel systems can be explained in part by functional pressures on communication. We find that a model of vowel token formants is more predictive of held-out data if it is trained with the help of this prior (that is, by MAP rather than ML).

Note: Dr. Cotterell's dissertation advisor was Jason Eisner.

This thesis focuses on modeling event sequences, namely, sequences of discrete events in continuous time. We build a family of generative probabilistic models that is able to reason about what events will happen in the future and when, given the history of previous events. Under our models, each event—as it happens—is allowed to update the future intensities of multiple event types, and the intensity of each event type—as nothing happens—is allowed to evolve with time along a trajectory. We use neural networks to allow the “updates” and “trajectories” to be complex and realistic. In the purely neural version of our model, all future event intensities are conditioned on the hidden state of a continuous-time LSTM, which has consumed every past event as it happened. To exploit domain-specific knowledge of how an event might only affect a few—but not all—future event intensities, we propose to introduce domain-specific structure into the model. We design a modeling language, by which a domain expert can write down the rules of a temporal deductive database. The database tracks facts over time; the rules deduce facts from other facts and from past events. Each fact has a time-varying state, computed by a neural network whose topology is determined by the fact’s provenance, including its experience of the past events that have contributed to deducing it. The possible event types at any time are given by special facts, whose intensities are neurally modeled alongside their states. We develop efficient methods for training our models, and doing inference with them. Applying the general principle of noise-contrastive estimation, we work out a stochastic training objective that is less expensive to optimize than the log-likelihood, which people typically maximize for parameter estimation. As in the discrete-time case that inspired us, the parameters that maximize our objective will provably maximize the log-likelihood as well. 
For the scenarios where we are given incomplete sequences, we propose particle smoothing—a form of sequential importance sampling—to impute the missing events. This thesis includes extensive experiments, demonstrating the effectiveness of our models and algorithms. On many synthetic and real-world datasets, on held-out sequences, we show empirically: (1) our purely neural model achieves competitive likelihood and predictive accuracy; (2) our neural-symbolic model improves prediction by encoding appropriate domain knowledge in the architecture; (3) for models to achieve the same level of log-likelihood, our noise-contrastive estimation needs considerably fewer function evaluations and less wall-clock time than maximum likelihood estimation; (4) our particle smoothing method is effective at inferring the ground-truth unobserved events. In this thesis, I will also discuss a few future research directions, including embedding our models within a reinforcement learner to discover causal structure and learn an intervention policy.

Note: Dr. Mei's dissertation advisor was Jason Eisner.

Standard autoregressive language models perform only polynomial-time computation to compute the probability of the next symbol. While this is attractive, it means they cannot model distributions whose next-symbol probability is hard to compute. Indeed, they cannot even model them well enough to solve associated easy decision problems for which an engineer might want to consult a language model. These limitations apply no matter how much computation and data are used to train the model, unless the model is given access to oracle parameters that grow superpolynomially in sequence length.

Thus, simply training larger autoregressive language models is not a panacea for NLP. Alternatives include energy-based models (which give up efficient sampling) and latent-variable autoregressive models (which give up efficient scoring of a given string). Both are powerful enough to escape the above limitations.

Natural-language prompts have recently been used to coax pretrained language models into performing other AI tasks, using a fill-in-the-blank paradigm (Petroni et al., 2019) or a few-shot extrapolation paradigm (Brown et al., 2020). For example, language models retain factual knowledge from their training corpora that can be extracted by asking them to “fill in the blank” in a sentential prompt. However, where does this prompt come from? We explore the idea of learning prompts by gradient descent—either fine-tuning prompts taken from previous work, or starting from random initialization. Our prompts consist of “soft words,” i.e., continuous vectors that are not necessarily word type embeddings from the language model. Furthermore, for each task, we optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them. Across multiple English LMs and tasks, our approach hugely outperforms previous methods, showing that the implicit factual knowledge in language models was previously underestimated. Moreover, this knowledge is cheap to elicit: random initialization is nearly as good as informed initialization.

The log-likelihood of a generative model often involves both positive and negative terms. For a temporal multivariate point process, the negative term sums over all the possible event types at each time and also integrates over all the possible times. As a result, maximum likelihood estimation is expensive. We show how to instead apply a version of noise-contrastive estimation—a general parameter estimation method with a less expensive stochastic objective. Our specific instantiation of this general idea works out in an interestingly non-trivial way and has provable guarantees for its optimality, consistency and efficiency. On several synthetic and real-world datasets, our method shows benefits: for the model to achieve the same level of log-likelihood on held-out data, our method needs considerably fewer function evaluations and less wall-clock time.

Note: This is part of a series of papers on the neural Hawkes process, although the method also applies to other continuous temporal models.

Task-oriented dialogue as dataflow synthesis.
Semantic Machines, Jacob Andreas, John Bufe, David Burkett, Charles
Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner,
Jason Eisner, Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill,
Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo
Lanman, Percy Liang, Christopher H. Lin, Ilya Lintsbakh, Andy McGovern,
Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth,
Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Snyder, Stephon
Striplin, Yu Su, Zachary Tellman, Sam Thomson, Andrei Vorobev, Izabela
Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, and Alexander Zotov (2020). TACL.

We describe an approach to task-oriented dialogue in
which dialogue state is represented as a dataflow
graph. A dialogue agent maps each user utterance to
a program that extends this graph. Programs include
metacomputation operators for reference and revision
that reuse dataflow fragments from previous
turns. Our graph-based state enables the expression
and manipulation of complex user intents, and
explicit metacomputation makes these intents easier
for learned models to predict. We introduce a new
dataset, SMCalFlow, featuring complex dialogues
about events, weather, places, and people.
Experiments show that dataflow graphs and
metacomputation substantially improve
representability and predictability in these natural
dialogues. Additional experiments on the MultiWOZ
dataset show that our dataflow representation
enables an otherwise off-the-shelf
sequence-to-sequence model to match the best
existing task-specific state tracking model. The
SMCalFlow dataset, code for replicating experiments,
and a public leaderboard are available at https://www.microsoft.com/en-us/research/project/dataflow-based-dialogue-semantic-machines.

Learning how to predict future events from patterns of past events is difficult when the set of possible event types is large. Training an unrestricted neural model might overfit to spurious patterns. To exploit domain-specific knowledge of how past events might affect an event's present probability, we propose using a temporal deductive database to track structured facts over time. Rules serve to prove facts from other facts and from past events. Each fact has a time-varying state—a vector computed by a neural net whose topology is determined by the fact's provenance, including its experience of past events. The possible event types at any time are given by special facts, whose probabilities are neurally modeled alongside their states. In both synthetic and real-world domains, we show that neural probabilistic models derived from concise Datalog programs improve prediction by encoding appropriate domain knowledge in their architecture.

Note: This is part of a series of papers on the neural Hawkes process for modeling irregular time series, but can also be used for discrete-time sequence modeling as in our work on neural finite-state methods. It is also related through Datalog to our work on Dyna.

A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions. We present Vox Clamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants. Access to such data can greatly facilitate investigation of phonetic typology at a large scale and across many languages. However, it is non-trivial and computationally intensive to obtain such alignments for hundreds of languages, many of which have few to no resources presently available. We describe the methodology to create our corpus, discuss caveats with current methods and their impact on the utility of this data, and illustrate possible research directions through a series of case studies on the 48 highest-quality readings. Our corpus and scripts are publicly available for non-commercial use at https://voxclamantisproject.github.io/.

We present a scheme for translating logic programs with built-ins and aggregation into algebraic expressions that denote bag relations over ground terms of the Herbrand universe. To evaluate queries against these relations, we develop an operational semantics based on term rewriting of the algebraic expressions. This approach can exploit arithmetic identities and recovers a range of useful strategies, including lazy strategies that defer work until it becomes both possible and necessary.

Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.

We present a machine foreign-language teacher that modifies text in a student's native language (L1) by replacing selected word tokens with glosses in a foreign language (L2), in such a way that L2 vocabulary can be learned simply by reading the resulting macaronic text. The machine teacher uses no supervised data from human students. Instead, to guide the machine teacher's choices, we equip a cloze language model with a training procedure that can incrementally learn representations for novel words, and use this model as a proxy for the word guessing and learning ability of real human students. We use Mechanical Turk to evaluate two variants of the student model: (i) one that generates a representation for a novel word using only surrounding context and (ii) an extension that also uses the spelling of the novel word.

This thesis focuses on unsupervised dependency parsing—parsing sentences of a language into dependency trees without accessing the training data of that language. Different from most prior work that uses unsupervised learning to estimate the parsing parameters, we estimate the parameters by supervised training on synthetic languages. Our parsing framework has three major components: Synthetic language generation gives a rich set of training languages by mix-and-match over the real languages; surface-form feature extraction maps an unparsed corpus of a language into a fixed-length vector as the syntactic signature of that language; and, finally, language-agnostic parsing incorporates the syntactic signature during parsing so that the decision on each word token is reliant upon the general syntax of the target language.

The fundamental question we are trying to answer is whether some useful information about the syntax of a language could be inferred from its surface-form evidence (unparsed corpus). This is the same question that has been implicitly asked by previous papers on unsupervised parsing, which assume that only an unparsed corpus is available for the target language. We show that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well.

This thesis contains several large-scale experiments requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of unsupervised parsing yet attempted. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous works' interpretable typological features that require parsed corpora or expert categorization of the language.

Note: Dr. Wang's dissertation advisor was Jason Eisner.

We present a machine foreign-language teacher that takes documents written in a student's native language and detects situations where it can replace words with their foreign glosses such that new foreign vocabulary can be learned simply through reading the resulting mixed-language text. We show that it is possible to design such a machine teacher without any supervised data from (human) students. We accomplish this by modifying a language model to incrementally learn new vocabulary items, and use this language model as a proxy for the word guessing ability of real students. Our machine foreign-language teacher consults this language model and creates mixed-language documents.

We evaluate three variants of our student proxy language models through a study on Amazon Mechanical Turk. We find that Mechanical Turk “students” were able to guess the meanings of foreign words introduced by the machine teacher with high accuracy for both function words and content words in two out of the three word guessing models.

Note: The best of these models was further improved in our followup paper (Renduchintala et al., 2019b). The followup paper also evaluated on real languages instead of artificial ones.

How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that “translationese” is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.

Events in the world may be caused by other, unobserved events. We consider sequences of events in continuous time. Given a probability model of complete sequences, we propose particle smoothing—a form of sequential importance sampling—to impute the missing events in an incomplete sequence. We develop a trainable family of proposal distributions based on a type of bidirectional continuous-time LSTM. Bidirectionality lets the proposals condition on future observations, not just on the past as in particle filtering. Our method can sample an ensemble of possible complete sequences (particles), from which we form a single consensus prediction that has low Bayes risk under our chosen loss metric. We experiment in multiple synthetic and real domains, using different missingness mechanisms, and modeling the complete sequences in each domain with a neural Hawkes process (Mei & Eisner, 2017). On held-out incomplete sequences, our method is effective at inferring the ground-truth unobserved events, with particle smoothing consistently improving upon particle filtering.

We introduce neural finite state transducers (NFSTs), a family of string transduction models defining joint and conditional probability distributions over pairs of strings. The probability of a string pair is obtained by marginalizing over the scores of all its accepting paths in a finite state transducer. In contrast to ordinary weighted FSTs, however, each path is scored using a recurrent neural network, which breaks the usual conditional independence assumption (Markov property). NFSTs are more powerful than previous finite-state models with neural features (Rastogi et al., 2016). We present training and inference algorithms for locally and globally normalized variants of NFSTs. In experiments on different transduction tasks, they compete favorably against seq2seq models while offering interpretable paths that correspond to hard monotonic alignments.

Critical to natural language generation is the production of correctly inflected text. In this paper, we isolate the task of predicting a fully inflected sentence from its partially lemmatized version. Unlike traditional morphological inflection or surface realization, our task input does not provide “gold” tags that specify what morphological features to realize on each lemmatized word; rather, such features must be inferred from sentential context. We develop a neural hybrid graphical model that explicitly reconstructs morphological features before predicting the inflected forms, and compare this to a system that directly predicts the inflected forms without relying on any morphological annotation. We experiment on several typologically diverse languages from the Universal Dependencies treebanks, showing the utility of incorporating linguistically-motivated latent variables into NLP models.

Treebanks traditionally treat punctuation marks as ordinary words, but linguists have suggested that a tree's “true” punctuation marks are not observed (Nunberg, 1990). These latent “underlying” marks serve to delimit or separate constituents in the syntax tree. When the tree's yield is rendered as a written sentence, a string rewriting mechanism transduces the underlying marks into “surface” marks, which are part of the observed (surface) string but should not be regarded as part of the tree. We formalize this idea in a generative model of punctuation that admits efficient dynamic programming. We train it without observing the underlying marks, by locally maximizing the incomplete data likelihood (similarly to EM). When we use the trained model to reconstruct the tree's underlying punctuation, the results appear plausible across 5 languages, and in particular are consistent with Nunberg's analysis of English. We show that our generative model can be used to beat baselines on punctuation restoration. Also, our reconstruction of a sentence's underlying punctuation lets us appropriately render the surface punctuation (via our trained underlying-to-surface mechanism) when we syntactically transform the sentence.

We quantify the linguistic complexity of different languages' morphological systems. We verify that there is a statistically significant empirical trade-off between paradigm size and irregularity: a language's inflectional paradigms may be either large in size or highly irregular, but never both. We define a new measure of paradigm irregularity based on the conditional entropy of the surface realization of a paradigm—how hard it is to jointly predict all the word forms in a paradigm from the lemma. We estimate irregularity by training a predictive model. Our measurements are taken on large morphological paradigms from 36 typologically diverse languages.

We show how the spellings of known words can help us deal with unknown words in open-vocabulary NLP tasks. The method we propose can be used to extend any closed-vocabulary generative model, but in this paper we specifically consider the case of neural language modeling. Our Bayesian generative story combines a standard RNN language model (generating the word tokens in each sentence) with an RNN-based spelling model (generating the letters in each word type). These two RNNs respectively capture sentence structure and word structure, and are kept separate as in linguistics. By invoking the second RNN to generate spellings for novel words in context, we obtain an open-vocabulary language model. For known words, embeddings are naturally inferred by combining evidence from type spelling and token context. Comparing to baselines (including a novel strong baseline), we beat previous work and establish state-of-the-art results on multiple datasets.

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.

Note: This was a refereed abstract and presentation of previously published work (Cotterell et al., 2018).

We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages (Wang and Eisner, 2016) in the training achieves further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work's interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.65 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).

Note: This paper builds on the synthetic Galactic Dependencies treebanks developed by Wang and Eisner (2016), and extends their use from typology prediction (Wang and Eisner, 2017) to parsing of new languages.

To approximately parse an unfamiliar language, it helps to have a treebank of a similar language. But what if the only available treebanks have the wrong word order? We show how to (stochastically) permute the constituents of an existing dependency treebank so that its surface part-of-speech statistics approximately match those of the target language. The parameters of the permutation model can be evaluated for quality by dynamic programming and tuned by gradient descent. This optimization procedure yields trees for a new artificial language that resembles the target language. We show that delexicalized parsers for the target language can be successfully trained using such “made to order” artificial languages.

Note: This method of creating synthetic treebanks is analogous to biological mutation. It differs from the method of Wang and Eisner (2016), which is analogous to sexual reproduction.

The CoNLL–SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a cloze task. This second task featured seven languages. Task 1 received 27 submissions and task 2 received 6 submissions. Both tasks featured a low, medium, and high data condition. Nearly all submissions featured a neural component and built on highly-ranked systems from the earlier 2017 shared task. In the inflection task (task 1), 41 of the 52 languages present in last year’s inflection task showed improvement by the best systems in the low-resource setting. The cloze task (task 2) proved to be difficult, and few submissions managed to consistently improve upon both a simple neural baseline system and a lemma-repeating baseline.

Lexical ambiguity makes it difficult to compute various useful statistics of a corpus. A given word form might represent any of several morphological feature bundles. One can, however, use unsupervised learning (as in EM) to fit a model that probabilistically disambiguates word forms. We present such an approach, which employs a neural network to smoothly model a prior distribution over feature bundles (even rare ones). Although this basic model does not consider a token's context, that very property allows it to operate on a simple list of unigram type counts, partitioning each count among different analyses of that unigram. We discuss evaluation metrics for this novel task and report results on 5 languages.
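The idea of partitioning each unigram count among competing analyses can be sketched with a few lines of EM. This is only a toy illustration, not the model from the paper: it swaps the neural prior for a plain categorical distribution over feature bundles, treats each form's candidate list as a hard constraint, and uses a made-up lexicon. The names `em_partition`, `counts`, and `analyses` are hypothetical.

```python
from collections import defaultdict

def em_partition(counts, analyses, iterations=50):
    """Split each word form's count among its candidate feature bundles.

    counts:   dict mapping word form -> unigram count
    analyses: dict mapping word form -> list of candidate feature bundles
    Returns (prior over bundles, fractional counts per form and bundle).
    """
    bundles = sorted({b for cands in analyses.values() for b in cands})
    prior = {b: 1.0 / len(bundles) for b in bundles}  # uniform start

    def split(form, n):
        # E-step for one form: share its count in proportion to the prior.
        z = sum(prior[b] for b in analyses[form])
        return {b: n * prior[b] / z for b in analyses[form]}

    for _ in range(iterations):
        expected = defaultdict(float)
        for form, n in counts.items():
            for b, share in split(form, n).items():
                expected[b] += share
        # M-step: re-estimate the categorical prior from expected counts.
        total = sum(expected.values())
        prior = {b: expected[b] / total for b in bundles}

    partition = {form: split(form, n) for form, n in counts.items()}
    return prior, partition

# Made-up example: "walks" is ambiguous between a verb and a plural noun.
counts = {"walks": 10, "walk": 20, "dogs": 5}
analyses = {
    "walks": ["V;3;SG", "N;PL"],
    "walk": ["V;NFIN", "N;SG"],
    "dogs": ["N;PL"],
}
prior, partition = em_partition(counts, analyses)
```

Because the unambiguous form "dogs" supplies evidence for the plural-noun bundle, EM shifts the count of "walks" toward that analysis. A smoother prior, such as the neural one described above, would instead share statistical strength across related bundles.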

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both n-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.

Note: Our followup paper with a more sophisticated analysis and weaker conclusions is Mielke et al. (2019).

We introduce neural particle smoothing, a sequential Monte Carlo method for sampling annotations of an input string from a given probability model. In contrast to conventional particle filtering algorithms, we train a proposal distribution that looks ahead to the end of the input string by means of a right-to-left LSTM. We demonstrate that this innovation can improve the quality of the sample. To motivate our formal choices, we explain how neural transduction models and our sampler can be viewed as low-dimensional but nonlinear approximations to working with HMMs over very large state spaces.
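To make the benefit of looking ahead concrete, here is a small sketch on an ordinary HMM. It is an idealized stand-in, not the method of the paper: the look-ahead scores come from the exact backward algorithm rather than a trained right-to-left LSTM, there is no resampling step, and the two-state HMM and all function names are invented for illustration.

```python
import random

def backward_lookahead(A, B, obs):
    """beta[t][i] = p(obs[t+1:] | state i at time t), via the backward algorithm."""
    K, T = len(A), len(obs)
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(K))
    return beta

def sample_particle(pi, A, B, obs, lookahead, rng):
    """Draw one state sequence left to right; return (path, importance weight)."""
    K = len(pi)
    path, weight, prev = [], 1.0, None
    for t, x in enumerate(obs):
        # Model factor p(z_t | z_{t-1}) * p(x_t | z_t) for each candidate state.
        model = [(pi[j] if prev is None else A[prev][j]) * B[j][x] for j in range(K)]
        # The proposal is proportional to the model factor times a look-ahead score.
        scores = [model[j] * lookahead[t][j] for j in range(K)]
        total = sum(scores)
        state = rng.choices(range(K), weights=scores)[0]
        weight *= model[state] * total / scores[state]  # model / proposal
        path.append(state)
        prev = state
    return path, weight

# Toy two-state HMM (all numbers are made up for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
obs = [0, 1, 1, 0]

rng = random.Random(0)
smoothing = backward_lookahead(A, B, obs)    # proposal sees future observations
filtering = [[1.0] * len(pi) for _ in obs]   # proposal ignores the future
smooth_weights = [sample_particle(pi, A, B, obs, smoothing, rng)[1] for _ in range(200)]
filter_weights = [sample_particle(pi, A, B, obs, filtering, rng)[1] for _ in range(200)]
```

With exact look-ahead, every particle receives the same weight (the likelihood of the observations), so the sample is an exact draw from the posterior. With the filtering proposal the weights vary from particle to particle, which is the degradation that a learned look-ahead aims to reduce when exact backward computation is infeasible.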

What makes some types of languages more probable than others? For instance, we know that almost all spoken languages contain the vowel phoneme /i/; why should that be? The field of linguistic typology seeks to answer these questions and, thereby, divine the mechanisms that underlie human language. In our work, we tackle the problem of vowel system typology, i.e., we propose a generative probability model of which vowels a language contains. In contrast to previous work, we work directly with the acoustic information—the first two formant values—rather than modeling discrete sets of phonemic symbols (IPA). We develop a novel generative probability model and report results based on a corpus of 233 languages.

UniMorph 2.0: Universal morphology.
Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine
Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke,
Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans
Hulden (2018).
In LREC.

The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology across the world’s languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a bundle of morphological features from our schema. Additional supporting data and tools are also released on a per-language basis when available. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland. This paper details advances made to the collection, annotation, and dissemination of project resources since the initial UniMorph release described at LREC 2016.

Many languages' inflectional morphological systems are replete with irregulars, i.e., words that do not seem to follow standard inflectional rules. In this work, we quantitatively investigate the conditions under which irregulars can survive in a language over the course of time. Using recurrent neural networks to simulate language learners, we test the diachronic relation between frequency of words and their irregularity.

Note: Accepted to NAACL 2018, but withdrawn in order to add more
thorough experiments before full publication.

We show how to predict the basic word-order facts of a novel language, and even obtain approximate syntactic parses, given only a corpus of part-of-speech (POS) sequences. We are motivated by the longstanding challenge of determining the structure of a language from its superficial features. While this is usually regarded as an unsupervised learning problem, there are good reasons that generic unsupervised learners are not up to the challenge. We do much better with a supervised approach where we train a system – a kind of language acquisition device – to predict how linguists will annotate a language. Our system uses a neural network to extract predictions from a large collection of numerical measurements. We train it on a mixture of real treebanks and synthetic treebanks obtained by systematically permuting the real trees, which we can motivate as sampling from an approximate prior over possible human languages.

Note: The work in this talk was mainly reported in Wang and Eisner (2016) and its followup papers.

A language's lexicon of surface forms and constructions includes many systematic regularities, as well as semi-regular and irregular exceptions. Generative linguists often explain regularities using shared latent representations and regular derivational processes. A probabilistic model with those elements can naturally allow for deviations from regularity and model the fact that some deviations are improbable. The probability of a derivational change can be sensitive to subtle properties of the context. I will outline several probabilistic models of the morphophonological and syntactic lexicons, which can extrapolate predictions based on their reconstruction of latent structure: e.g., underlying forms, cyclic derivations, and input-output alignments.

Note: The work in this talk was mainly reported in Cotterell et al. (2015) and its followup papers.

We show how to predict the basic word-order facts of a novel language given only a corpus of its part-of-speech (POS) sequences. We predict how often direct objects follow their verbs, how often adjectives follow their nouns, and in general the directionalities of all dependency relations. Although recovering syntactic structure is usually regarded as unsupervised learning, we train our predictor on languages of known structure. It outperforms the state-of-the-art unsupervised learning by a large margin, especially when we augment the training data with many synthetic languages.

Note: This was a refereed abstract and presentation of previously published work (Wang and Eisner, 2017).

Many events occur in the world. Some event types are stochastically excited or inhibited—in the sense of having their probabilities elevated or decreased—by patterns in the sequence of previous events. Discovering such patterns can help us predict which type of event will happen next and when. We model streams of discrete events in continuous time, by constructing a neurally self-modulating multivariate point process in which the intensities of multiple event types evolve according to a novel continuous-time LSTM. This generative model allows past events to influence the future in complex and realistic ways, by conditioning future event intensities on the hidden state of a recurrent neural network that has consumed the stream of past events. Our model has desirable qualitative properties. It achieves competitive likelihood and predictive accuracy on real and synthetic datasets, including under missing-data conditions.
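For readers unfamiliar with point processes, the notion of past events exciting future ones can be illustrated with the classical multivariate Hawkes process, which the neural model generalizes. The sketch below is that classical process with an exponential decay kernel, not the continuous-time LSTM described above; it can only express excitation (with non-negative `excite` entries), and the function name and all parameter values are assumptions chosen for illustration.

```python
import math

def hawkes_intensity(t, history, base, excite, decay):
    """Intensity of each event type at time t under a classical Hawkes process.

    history: list of (time, type) pairs for past events
    base:    base[k] is the background rate of type k
    excite:  excite[j][k] is the jump in type k's intensity caused by a type-j event
    decay:   positive rate at which each jump fades exponentially
    """
    rates = list(base)
    for when, kind in history:
        if when < t:  # only strictly earlier events influence time t
            kernel = math.exp(-decay * (t - when))
            for k in range(len(rates)):
                rates[k] += excite[kind][k] * kernel
    return rates

# Made-up example with two event types: type 0 excites type 1 and vice versa.
base = [0.5, 0.2]
excite = [[0.0, 1.0], [0.3, 0.0]]
history = [(1.0, 0)]
before = hawkes_intensity(0.5, history, base, excite, decay=1.0)
after = hawkes_intensity(2.0, history, base, excite, decay=1.0)
```

Before the event at time 1.0 the intensities equal the base rates; afterward the intensity of type 1 is raised and then decays back toward its base rate. The model in the abstract replaces this fixed additive form with intensities computed from a recurrent hidden state, which also allows inhibition.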

We investigate the design of an expressive, purely-declarative, weighted logic programming language, Dyna. Dyna is a decade-plus effort in pushing the boundaries of declarative programming and “executable mathematics;” it instantiates an unusual point in the design space, as it is both Turing-complete (unlike Datalog) and devoid of a specified execution order (unlike Prolog). That is, it is designed to be, at once, both highly expressive and rich in opportunities for automated optimization. This thesis contains two major thrusts. We first consider both the denotational and operational aspects of Dyna. In particular, for operational semantics, we introduce and extend our EarthBound solver for finite circuits; the next chapter considers the generalization to logic programs proper. We then turn our attention to the static analysis of this language, considering mechanisms for reasoning both about abstract notions of well-formedness of programs as well as more mundane concerns of realizability of programs in actual computation. Along the way we endeavour to place our work in the context of the larger field of logic programming languages and present our current thoughts on future avenues of exploration.

Note: Dr. Filardo's dissertation advisor was Jason Eisner.

ACL policies and guidelines for submission, review and citation.
Jason Eisner, Jennifer Foster, Iryna Gurevych, Marti Hearst, Heng Ji,
Lillian Lee, Christopher Manning, Paola Merlo, Yusuke Miyao, Joakim Nivre,
Amanda Stent, and Ming Zhou (2017).
Report available on the wiki of the Association for Computational
Linguistics.

This is the report of a working group appointed by the ACL Executive Committee to review policies and guidelines for conference submissions. The group was chaired by ACL President Joakim Nivre. The policies were adopted by the Association for Computational Linguistics effective January 1, 2018.

The CoNLL-SIGMORPHON 2017 shared task on supervised morphological generation required systems to be trained and tested in each of 52 typologically diverse languages. In sub-task 1, submitted systems were asked to predict a specific inflected form of a given lemma. In sub-task 2, systems were given a lemma and some of its specific inflected forms, and asked to complete the inflectional paradigm by predicting all of the remaining inflected forms. Both sub-tasks included high, medium, and low-resource conditions. Sub-task 1 received 24 system submissions, while sub-task 2 received 3 system submissions. Following the success of neural sequence-to-sequence models in the SIGMORPHON 2016 shared task, all but one of the submissions included a neural component. The results show that high performance can be achieved with small training datasets, so long as models have appropriate inductive bias or make use of additional unlabeled data or synthetic data. However, different biasing and data augmentation resulted in disjoint sets of inflected forms being predicted correctly, suggesting that there is room for future improvement.

We present a feature-rich knowledge tracing method that captures a student's acquisition and retention of knowledge during a foreign language phrase learning task. We model the student's behavior as making predictions under a log-linear model, and adopt a neural gating mechanism to model how the student updates their log-linear parameters in response to feedback. The gating mechanism allows the model to learn complex patterns of retention and acquisition for each feature, while the log-linear parameterization results in an interpretable knowledge state. We collect human data and evaluate several versions of the model.

Linguistic typology studies the range of structures present in human language. The main goal of the field is to discover which sets of possible phenomena are universal, and which are merely frequent. For example, all languages have vowels, while most—but not all—languages have a /u/ sound. In this paper we present the first probabilistic treatment of a basic question in phonological typology: What makes a natural vowel inventory? We introduce a series of deep stochastic point processes, and contrast them with previous computational, simulation-based approaches. We provide a comprehensive suite of experiments on over 200 distinct languages.

Lexical resources such as dictionaries and gazetteers are often used as auxiliary data for tasks such as part-of-speech induction and named-entity recognition. However, discriminative training with lexical features requires annotated data to reliably estimate the lexical feature weights and may result in overfitting the lexical features at the expense of features which generalize better. In this paper, we investigate a more robust approach: we stipulate that the lexicon is the result of an assumed generative process. Practically, this means that we may treat the lexical resources as observations under the proposed generative model. The lexical resources provide training data for the generative model without requiring separate data to estimate lexical feature weights. We evaluate the proposed approach in two settings: part-of-speech induction and low-resource named-entity recognition.

Declarative programming is a paradigm that allows programmers to specify what they want to compute, leaving how to compute it to a solver. Our declarative programming language, Dyna, is designed to compactly specify computations like those that are frequently encountered in machine learning. As a declarative language, Dyna's solver has a large space of (correct) strategies available to it. We describe a reinforcement learning framework for adaptively choosing among these strategies to maximize efficiency for a given workload. Adaptivity in execution is especially important for software that will run under a variety of workloads, where no fixed policy works well. We hope that reinforcement learning will identify good policies reasonably quickly—offloading the burden of writing efficient code from human programmers.

The popular skip-gram model induces word embeddings by exploiting the signal from word-context co-occurrence. We offer a new interpretation of skip-gram based on exponential family PCA (a form of matrix factorization), which lets us generalize the skip-gram model to tensor factorization. In turn, this lets us train embeddings through richer higher-order co-occurrences, e.g., triples that include positional information (to incorporate syntax) or morphological information (to share parameters across related words). We experiment on 40 languages and show that our model improves upon skip-gram.

We show how to predict the basic word-order facts of a novel language given only a corpus of part-of-speech (POS) sequences. We predict how often direct objects follow their verbs, how often adjectives follow their nouns, and in general the directionalities of all dependency relations. Such typological properties could be helpful in grammar induction. While such a problem is usually regarded as unsupervised learning, our innovation is to treat it as supervised learning, using a large collection of realistic synthetic languages as training data. The supervised learner must identify surface features of a language's POS sequence (hand-engineered or neural features) that correlate with the language's deeper structure (latent trees). In the experiment, we show: 1) Given a small set of real languages, it helps to add many synthetic languages to the training data. 2) Our system is robust even when the POS sequences include noise. 3) Our system on this task outperforms a grammar induction baseline by a large margin.

Note: This paper builds on the synthetic Galactic Dependencies treebanks developed by Wang and Eisner (2016).
Caveat lector: Our experimental design evaluated on held-out treebanks. Readers should be aware that two of the 17 evaluation treebanks were for languages that were also represented in the training (albeit with different text and different annotation efforts). Withholding those treebanks from the evaluation does not qualitatively change the published results. We are preparing an addendum that gives the results for this second experimental design and compares the two designs.

Pruning hypotheses during dynamic programming is commonly used to speed up inference in settings such as parsing. Unlike prior work, we train a pruning policy under an objective that measures end-to-end performance: we search for a fast and accurate policy. This poses a difficult machine learning problem, which we tackle with the algorithm. We apply our approach to constituency parsing. Our experimental results show that accounting for performance of the end-to-end system leads to a better Pareto frontier—i.e., parsers which are more accurate for a given runtime.

Structured prediction algorithms—used when applying machine learning to tasks like natural language parsing and image understanding—present some opportunities for fine-grained parallelism, but also have problem-specific serial dependencies. Most implementations exploit only simple opportunities such as parallel BLAS, or embarrassing parallelism over input examples. In this work we explore an orthogonal direction: using the fact that these algorithms can be described as specialized forward-chaining theorem provers (Pereira and Warren, 1983; Eisner et al., 2005), and implementing fine-grained parallelization of the forward-chaining mechanism. We study context-free parsing as a simple canonical example, but the approach is more general.
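To make the forward-chaining view concrete, here is a minimal serial sketch of context-free recognition as agenda-based forward chaining. The toy grammar, the names `LEXICON`, `BINARY_RULES`, and `forward_chain_recognize`, and the restriction to binary rules are all assumptions for illustration; the sketch shows only the forward-chaining mechanism and does not attempt the fine-grained parallelization that the paper studies.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form.
LEXICON = {"the": ["Det"], "dog": ["N"], "barks": ["VP"]}
BINARY_RULES = {("Det", "N"): ["NP"], ("NP", "VP"): ["S"]}

def forward_chain_recognize(words, lexicon, binary_rules, goal="S"):
    """Return True if `goal` can be proved to span the whole sentence.

    An item (X, i, j) is a proved theorem: nonterminal X covers words[i:j].
    """
    # Axioms: one item per (word, tag) pair.
    agenda = [(tag, i, i + 1)
              for i, w in enumerate(words)
              for tag in lexicon.get(w, ())]
    chart = set()
    starting_at = defaultdict(set)  # i -> proved items that start at i
    ending_at = defaultdict(set)    # j -> proved items that end at j
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue  # already proved; nothing new to propagate
        chart.add(item)
        x, i, j = item
        starting_at[i].add(item)
        ending_at[j].add(item)
        # Use the new item as the left child of a binary rule.
        for (y, _, k) in list(starting_at[j]):
            for parent in binary_rules.get((x, y), ()):
                agenda.append((parent, i, k))
        # Use the new item as the right child of a binary rule.
        for (w, h, _) in list(ending_at[i]):
            for parent in binary_rules.get((w, x), ()):
                agenda.append((parent, h, j))
    return (goal, 0, len(words)) in chart
```

Each pair of adjacent items is combined when the later of the two is popped from the agenda, so the order in which items are popped does not change the final chart.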

One commonly needs to obtain the expected counts of states, transitions, constituents, or rules under probabilistic or weighted grammars. This requires an algorithm such as inside-outside or forward-backward that is tailored to the grammar formalism. Conveniently, each such algorithm can be obtained by automatically differentiating an “inside” algorithm that merely computes the log-probability of the evidence. This mechanical procedure produces correct and efficient code. Just as for any instance of back-propagation, it can be carried out manually or by software. This pedagogical paper carefully spells out the construction and relates it to traditional and non-traditional views of these algorithms.
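As a small check on this claim, the sketch below takes a toy hidden Markov model (the parameters `LOG_INIT`, `LOG_TRANS`, `LOG_EMIT`, and `OBS` are made up for illustration) and compares two quantities: the expected transition counts from a hand-written forward-backward pass, and the derivative of the log-probability of the evidence with respect to each log transition weight. The derivative here is approximated by central finite differences as a stand-in for automatic differentiation, purely to keep the example dependency-free.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_z(log_init, log_trans, log_emit, obs):
    """'Inside' pass for an HMM: log-probability of obs, summed over state paths."""
    n = len(log_init)
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [logsumexp([alpha[r] + log_trans[r][s] for r in range(n)])
                 + log_emit[s][o] for s in range(n)]
    return logsumexp(alpha)

def expected_transition_counts(log_init, log_trans, log_emit, obs):
    """Forward-backward: expected number of r -> s transitions given obs."""
    n, T = len(log_init), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[0.0] * n for _ in range(T)]  # beta[T-1][s] = log 1 = 0
    for s in range(n):
        alpha[0][s] = log_init[s] + log_emit[s][obs[0]]
    for t in range(1, T):
        for s in range(n):
            alpha[t][s] = logsumexp(
                [alpha[t - 1][r] + log_trans[r][s] for r in range(n)]
            ) + log_emit[s][obs[t]]
    for t in range(T - 2, -1, -1):
        for r in range(n):
            beta[t][r] = logsumexp(
                [log_trans[r][s] + log_emit[s][obs[t + 1]] + beta[t + 1][s]
                 for s in range(n)])
    log_z = logsumexp(alpha[T - 1])
    counts = [[0.0] * n for _ in range(n)]
    for t in range(T - 1):
        for r in range(n):
            for s in range(n):
                counts[r][s] += math.exp(
                    alpha[t][r] + log_trans[r][s]
                    + log_emit[s][obs[t + 1]] + beta[t + 1][s] - log_z)
    return counts

def numeric_gradient(log_init, log_trans, log_emit, obs, r, s, h=1e-5):
    """Central-difference estimate of d log Z / d log_trans[r][s]."""
    def shifted(delta):
        lt = [row[:] for row in log_trans]
        lt[r][s] += delta  # perturb one log-weight, without renormalizing
        return forward_log_z(log_init, lt, log_emit, obs)
    return (shifted(h) - shifted(-h)) / (2 * h)

# Made-up two-state model and observation sequence.
LOG_INIT = [math.log(0.6), math.log(0.4)]
LOG_TRANS = [[math.log(0.7), math.log(0.3)], [math.log(0.2), math.log(0.8)]]
LOG_EMIT = [[math.log(0.9), math.log(0.1)], [math.log(0.3), math.log(0.7)]]
OBS = [0, 1, 1, 0]
```

Note that the log-weights are treated as free parameters when differentiating: each one is perturbed on its own, without renormalizing the row, which is what makes the gradient equal the expected count.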

We propose a method for learning the structure of variable-order CRFs, a more flexible variant of higher-order linear-chain CRFs. Variable-order CRFs achieve faster inference by including features for only some of the tag n-grams. Our learning method discovers the useful higher-order features at the same time as it trains their weights, by maximizing an objective that combines log-likelihood with a structured-sparsity regularizer. An active-set outer loop allows the feature set to grow as far as needed. On part-of-speech tagging in 5 randomly chosen languages from the Universal Dependencies dataset, our method of shrinking the model achieved a 2–6x speedup over a baseline, with no significant drop in accuracy.

We release Galactic Dependencies 1.0—a large set of synthetic languages not found on Earth, but annotated in Universal Dependencies format. This new resource aims to provide training and development data for NLP methods that aim to adapt to unfamiliar languages. Each synthetic treebank is produced from a real treebank by stochastically permuting the dependents of nouns and/or verbs to match the word order of other real languages. We discuss the usefulness, realism, parsability, perplexity, and diversity of the synthetic languages. As a simple demonstration of the use of Galactic Dependencies, we consider single-source transfer, which attempts to parse a real target language using a parser trained on a “nearby” source language. We find that including synthetic source languages somewhat increases the diversity of the source pool, which significantly improves results for most target languages.

Note: In later papers, we successfully used the Galactic Dependencies treebanks to extrapolate fine-grained typology prediction (Wang and Eisner, 2018) and multilingual parsing (Wang and Eisner, 2018) to unseen languages. Some of the material in these 3 papers is synthesized into this overview talk. We also developed an alternative method for creating synthetic languages on demand (Wang and Eisner, 2018).

Caveat lector: Our experimental design evaluated on held-out treebanks. Readers should be aware that two of the 17 evaluation treebanks were for languages that were also represented in the training (albeit with different text and different annotation efforts). Withholding those treebanks from the evaluation does not qualitatively change the published results. We are preparing an addendum that gives the results for this second experimental design and compares the two designs.

The 2016 SIGMORPHON Shared Task was devoted to the problem of morphological reinflection. It introduced morphological datasets for 10 languages with diverse typological characteristics. The shared task drew submissions from 9 teams representing 11 institutions reflecting a variety of approaches to addressing supervised learning of reinflection. For the simplest task, inflection generation from lemmas, the best system averaged 95.56% exact-match accuracy across all languages, ranging from Maltese (88.99%) to Hungarian (99.30%). With the relatively large training datasets provided, recurrent neural network architectures consistently performed best—in fact, there was a significant margin between neural and non-neural approaches. The best neural approach, averaged over all tasks and languages, outperformed the best non-neural one by 13.76% absolute; on individual tasks and languages the gap in accuracy sometimes exceeded 60%. Overall, the results show a strong state of the art, and serve as encouragement for future shared tasks that explore morphological analysis and generation with varying degrees of supervision.

In this work, we explore how learners can infer second-language noun meanings in the context of their native language. Motivated by an interest in building interactive tools for language learning, we collect data on three word-guessing tasks, analyze their difficulty, and explore the types of errors that novice learners make. We train a log-linear model for predicting our subjects' guesses of word meanings in varying kinds of contexts. The model's predictions correlate well with subject performance, and we provide quantitative and qualitative analyses of both human and model performance.

Note: This study is closely related to (Renduchintala et al., 2016), but it uses a more controlled experimental design that permits a simpler model, and it examines more features. Our companion paper (Renduchintala et al., 2016) describes an interactive user interface for reading “macaronic” text. See also this talk.

We present a prototype of a novel technology for second language instruction. Our learn-by-reading approach lets a human learner acquire new words and constructions by encountering them in context. To facilitate reading comprehension, our technology presents mixed native language (L1) and second language (L2) sentences to a learner and allows them to interact with the sentences to make the sentences easier (more L1-like) or harder (more L2-like) to read. Eventually, our system should continuously track a learner's knowledge and learning style by modeling their interactions, including performance on a pop quiz feature. This will allow our system to generate personalized mixed-language texts for learners.

Foreign language learners can acquire new vocabulary by using cognate and context clues when reading. To measure such incidental comprehension, we devise an experimental framework that involves reading mixed-language “macaronic” sentences. Using data collected via Amazon Mechanical Turk, we train a graphical model to simulate a human subject's comprehension of foreign words, based on cognate clues (edit distance to an English word), context clues (pointwise mutual information), and prior exposure. Our model does a reasonable job at predicting which words a user will be able to understand, which should facilitate the automatic construction of comprehensible text for personalized foreign language education.

Languages with rich inflectional morphology exhibit lexical data sparsity, since the word used to express a given concept will vary with the syntactic context. For instance, each count noun in Czech has 12 forms (where English uses only singular and plural). Even in large corpora, we are unlikely to observe all inflections of a given lemma. This reduces the vocabulary coverage of methods that induce continuous representations for words from distributional corpus information. We solve this problem by exploiting existing morphological resources that can enumerate a word's component morphemes. We present a latent-variable Gaussian graphical model that allows us to extrapolate continuous representations for words not observed in the training corpus, as well as smoothing the representations provided for the observed words. The latent variables represent embeddings of morphemes, which combine to create embeddings of words. Over several languages and training sizes, our model improves the embeddings for words, when evaluated on an analogy task, skip-gram predictive accuracy, and word similarity.

Rigid Tree Automata (RTAs) are a strict super-class of Regular Tree Automata (TAs), additionally capable of recognizing certain nonlinear patterns such as {f〈x,x〉: x ∈ X}. RTAs were developed for use in tree-automata-based model checking; we hope to use them as part of a static analysis system for a logic programming language. In developing that system, we noted that RTAs are not closed under Kleene-star or pre-concatenation with a regular language. We now introduce a strict super-class of RTA, called Isolating Rigid Tree Automata, which can accept rigid structures with arbitrarily many isolated rigid substructures, such as “lists of equal pairs,” by allowing rigidity to be confined within subtrees. This class is Kleene-star and concatenation closed and retains many features of RTAs, including linear-time emptiness testing and NP-complete membership testing. However, it gives up closure under intersection.

How should one apply deep learning to tasks such as morphological reinflection, which stochastically edit one string to get another? A recent approach to such sequence-to-sequence tasks is to compress the input string into a vector that is then used to generate the output string, using recurrent neural networks. In contrast, we propose to keep the traditional architecture, which uses a finite-state transducer to score all possible output strings, but to augment the scoring function with the help of recurrent networks. A stack of bidirectional LSTMs reads the input string from left-to-right and right-to-left, in order to summarize the input context in which a transducer arc is applied. We combine these learned features with the transducer to define a probability distribution over aligned output strings, in the form of a weighted finite-state automaton. This reduces hand-engineering of features, allows learned features to examine unbounded context in the input string, and still permits exact inference through dynamic programming. We illustrate our method on the tasks of morphological reinflection and lemmatization.

Note: Jason's talk for this paper was based around the movie Cowboys & Aliens. The method developed in this paper should be called a BiLSTM-FST, by analogy with related architectures such as the BiLSTM-CRF and the BiLSTM-CFG. Aharoni and Goldberg (2017) refer to latent alignments as "hard monotonic attention."

Learning from unlabeled data is a long-standing challenge in machine learning. A principled solution involves modeling the full joint distribution over inputs and the latent structure of interest, and imputing the missing data via marginalization. Unfortunately, such marginalization is expensive for most non-trivial problems, which places practical limits on the expressiveness of generative models. As a result, joint models often encode strict assumptions about the underlying process such as fixed-order Markovian assumptions and employ simple count-based features of the inputs.

In contrast, conditional models, which do not directly model the observed data, are free to incorporate rich overlapping features of the input in order to predict the latent structure of interest. It would be desirable to develop expressive generative models that retain tractable inference. This is the topic of this thesis. In particular, we explore joint models which relax fixed-order Markov assumptions, and investigate the use of recurrent neural networks for automatic feature induction in the generative process.

We focus on two structured prediction problems: (1) imputing labeled segmentations of input character sequences, and (2) imputing directed spanning trees relating strings in text corpora. These problems arise in many applications of practical interest, but we are primarily concerned with named-entity recognition and cross-document coreference resolution in this work.

For named-entity recognition, we propose a generative model in which the observed characters originate from a latent non-Markov process over words, and where the characters are themselves produced via a non-Markov process: a recurrent neural network (RNN). We propose a sampler for the proposed model in which sequential Monte Carlo is used as a transition kernel for a Gibbs sampler. The kernel is amenable to a fast parallel implementation, and results in fast mixing in practice.

For cross-document coreference resolution, we move beyond sequence modeling to consider string-to-string transduction. We stipulate a generative process for a corpus of documents in which entity names arise from copying—and optionally transforming—previous names of the same entity. Our proposed model is sensitive to both the context in which the names occur as well as their spelling. The string-to-string transformations correspond to systematic linguistic processes such as abbreviation, typos, and nicknaming, and by analogy to biology, we think of them as mutations along the edges of a phylogeny. We propose a novel block Gibbs sampler for this problem that alternates between sampling an ordering of the mentions and a spanning tree relating all mentions in the corpus.

Note: Dr. Andrews's dissertation advisors were Jason Eisner and Mark Dredze.

We propose that one should learn a foreign language by reading interesting prose. But how can one get started? We are building an intelligent user interface that partially translates text, leaning at first on the learner's native vocabulary but gradually introducing new foreign words and constructions in context. Faced with hybrid text of this sort, the learner can also use the mouse to translate or untranslate portions of a sentence; as a side benefit, this provides feedback about what the learner currently understands. We give an overview of the project, including pedagogical motivation, modeling of the learner, data collection, user interface design, linguistic issues, and our use of machine translation and reinforcement learning inside the system.

Natural language processing must sometimes consider the internal structure of words, e.g., in order to understand or generate an unfamiliar word. Unfamiliar words are systematically related to familiar ones due to linguistic processes such as morphology, phonology, abbreviation, copying error, and historical change.

We will show how to build joint probability models over many strings. These models are capable of predicting unobserved strings, or predicting the relationships among observed strings. However, computing the predictions of these models can be computationally hard. We outline approximate algorithms based on Markov chain Monte Carlo, expectation propagation, and dual decomposition. We give results on some NLP tasks.

Note: The work in this talk was mainly reported in Cotterell et al. (2015) and its follow-up papers.

This thesis broadens the space of rich yet practical models for structured prediction. We introduce a general framework for modeling with four ingredients: (1) latent variables, (2) structural constraints, (3) learned (neural) feature representations of the inputs, and (4) training that takes the approximations made during inference into account. The thesis builds up to this framework through an empirical study of three NLP tasks: semantic role labeling, relation extraction, and dependency parsing—obtaining state-of-the-art results on the former two. We apply the resulting graphical models with structured and neural factors, and approximation-aware learning to jointly model part-of-speech tags, a syntactic dependency parse, and semantic roles in a low-resource setting where the syntax is unobserved. We also present an alternative view of these models as neural networks with a topology inspired by inference on graphical models that encode our intuitions about the data.

Note: Dr. Gormley's dissertation advisors were Jason Eisner and Mark Dredze.

We investigate dual decomposition for joint MAP inference of many strings. Given an arbitrary graphical model, we decompose it into small acyclic sub-models, whose MAP configurations can be found by finite-state composition and dynamic programming. We force the solutions of these subproblems to agree on overlapping variables, by tuning Lagrange multipliers for an adaptively expanding set of variable-length n-gram count features.

This is the first inference method for arbitrary graphical models over strings that does not require approximations such as random sampling, message simplification, or a bound on string length. Provided that the inference method terminates, it gives a certificate of global optimality (though MAP inference in our setting is undecidable in general). On our global phonological inference problems, it does indeed terminate, and achieves more accurate results than max-product and sum-product loopy belief propagation.

We present penalized expectation propagation, a novel algorithm for approximate inference in graphical models. Expectation propagation is a well-known message-passing algorithm that prevents messages from growing in complexity by forcing them back into a given family of distributions. Our extension uses a structured-sparsity penalty to prefer simpler messages. In the case of string-valued random variables, this allows us to use a non-parametric message family, related to variable-order n-gram models. The method automatically calibrates the complexity of each message to balance speed and accuracy. We test the algorithm on phonological inference problems, showing substantial speedup over previous algorithms with no significant loss in accuracy.

We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n³) runtime. It outputs the parse with maximum expected recall, but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find this gradient by back-propagation. That is, we treat the entire parser (approximations and all) as a differentiable circuit, as others have done for loopy CRFs (Domke, 2010; Stoyanov et al., 2011; Domke, 2011; Stoyanov and Eisner, 2012). The resulting parser obtains higher accuracy with fewer iterations of belief propagation than one trained by conditional log-likelihood.

The observed pronunciations or spellings of words are often explained as arising from the “underlying forms” of their morphemes. These forms are latent strings that linguists try to reconstruct by hand. We propose to reconstruct them automatically at scale, enabling generalization to new words. Given some surface word types of a concatenative language along with the abstract morpheme sequences that they express, we show how to recover consistent underlying forms for these morphemes, together with the (stochastic) phonology that maps each concatenation of underlying forms to a surface form. Our technique involves loopy belief propagation in a natural directed graphical model whose variables are unknown strings and whose conditional distributions are encoded as finite-state machines with trainable weights. We define training and evaluation paradigms for the task of surface word prediction, and report results on subsets of 7 languages.

Branch-and-bound is a widely used method in combinatorial optimization, including mixed integer programming, structured prediction and MAP inference. While most work has focused on developing problem-specific techniques, little is known about how to systematically design the node searching strategy on a branch-and-bound tree. We address the key challenge of learning an adaptive node searching order for any class of problem solvable by branch-and-bound. Our strategies are learned by imitation learning. We apply our algorithm to linear programming based branch-and-bound for solving mixed integer programs (MIP). We compare our method with one of the fastest open-source solvers, SCIP, and with a very efficient commercial solver, Gurobi. We demonstrate that our approach achieves better solutions faster on four MIP libraries.

Under multi-headed dependency grammar, a parse is a connected DAG rather than a tree. Such formalisms can be more syntactically and semantically expressive. However, it is hard to train, test, or improve multi-headed parsers because few multi-headed corpora exist, particularly for the projective case. To help fill this gap, we observe that link grammar already produces undirected projective graphs. We use Integer Linear Programming to assign consistent directions to the labeled links in a corpus of several thousand parses produced by the Link Grammar Parser, which has broad-coverage hand-written grammars of English as well as Russian and other languages. We find that such directions can indeed be consistently assigned in a way that yields valid multi-headed dependency parses. The resulting parses in English appear reasonably linguistically plausible, though differing in style from CoNLL-style parses of the same sentences; we discuss the differences.

Entity clustering must determine when two named-entity mentions refer to the same entity. Typical approaches use a pipeline architecture that clusters the mentions using fixed or learned measures of name and context similarity. In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Clustering the mentions into entities depends on recovering this copying tree jointly with estimating models of the mutation process and parent selection process. We present a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets.

String similarity is most often measured by weighted or unweighted edit distance d(x,y). Ristad and Yianilos (1998) defined stochastic edit distance—a probability distribution p(y|x) whose parameters can be trained from data. We generalize this so that the probability of choosing each edit operation can depend on contextual features. We show how to construct and train a probabilistic finite-state transducer that computes our stochastic contextual edit distance. To illustrate the improvement from conditioning on context, we model typos found in social media text.
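For readers unfamiliar with the idea, the following is a minimal sketch of a context-free stochastic edit model p(y|x), computed by dynamic programming over all alignments. It is not the contextual model of the paper and not necessarily the exact parameterization of Ristad and Yianilos: the four probabilities `p_ins`, `p_del`, `p_copy`, `p_sub`, the uniform choice of inserted or substituted characters, and the function names are assumptions chosen to keep the example small.

```python
from itertools import product

def edit_prob(x, y, alphabet, p_ins=0.1, p_del=0.1, p_copy=0.7, p_sub=0.1):
    """p(y | x) under a memoryless edit process, summed over all alignments.

    Before each input character, the process either inserts a uniformly
    chosen output character (p_ins), deletes the input character (p_del),
    copies it (p_copy), or substitutes a different character (p_sub).
    After the last input character it inserts (p_ins) or stops (1 - p_ins).
    Assumes len(alphabet) >= 2 and that y uses only alphabet characters.
    """
    assert abs(p_ins + p_del + p_copy + p_sub - 1.0) < 1e-12
    k = len(alphabet)
    n, m = len(x), len(y)
    # f[i][j] = probability of consuming x[:i] while emitting y[:j].
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if j > 0:  # insertion of y[j-1]
                total += f[i][j - 1] * p_ins / k
            if i > 0:  # deletion of x[i-1]
                total += f[i - 1][j] * p_del
            if i > 0 and j > 0:  # copy or substitution
                if x[i - 1] == y[j - 1]:
                    total += f[i - 1][j - 1] * p_copy
                else:
                    total += f[i - 1][j - 1] * p_sub / (k - 1)
            f[i][j] = total
    return f[n][m] * (1.0 - p_ins)  # finally, choose to stop

def total_mass(x, alphabet, max_len):
    """Sum of p(y | x) over all output strings y with len(y) <= max_len."""
    return sum(edit_prob(x, "".join(chars), alphabet)
               for length in range(max_len + 1)
               for chars in product(alphabet, repeat=length))
```

Because the choices at every step sum to one, the probabilities of all output strings sum to one as well, and `total_mass` approaches 1 as `max_len` grows. A contextual model in the spirit of the paper would let these four probabilities depend on features of the neighboring characters rather than being constants.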

Note: The PFSTs developed in this paper are used within the Bayesian network of Cotterell et al. (2015).

Feature computation and exhaustive search have significantly restricted the speed of graph-based dependency parsing. We propose a faster framework of dynamic feature selection, where features are added sequentially as needed, edges are pruned early, and decisions are made online for each sentence. We model this as a sequential decision-making problem and solve it by imitation learning techniques. Our dynamic parser can achieve accuracies comparable or even superior to parsers using a full set of features, while computing fewer than 30% of the feature templates.

We present an open-source virtual manipulative for conditional log-linear models. This web-based interactive visualization lets the user tune the probabilities of various shapes—which grow and shrink accordingly—by dragging sliders that correspond to feature weights. The visualization displays a regularized training objective; it supports gradient ascent by optionally displaying gradients on the sliders and providing “Step” and “Solve” buttons. The user can sample parameters and datasets of different sizes and compare their own parameters to the truth. Our website, https://cs.jhu.edu/~jason/tutorials/loglin/, guides the user through a series of interactive lessons and provides auxiliary readings, explanations, practice problems and resources.
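The quantities that such a visualization displays can be sketched in a few lines. The code below is not taken from the website: it assumes a single fixed context (so the conditioning is dropped), binary features, an L2 regularizer, and a fixed step size, and the names `step` and `solve` merely echo the buttons mentioned above.

```python
import math

def probs(theta, features):
    """Log-linear distribution: p(y) proportional to exp(theta . f(y))."""
    scores = [sum(t * f for t, f in zip(theta, fv)) for fv in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def objective(theta, features, counts, l2):
    """Regularized log-likelihood of the observed counts."""
    p = probs(theta, features)
    log_lik = sum(c * math.log(pi) for c, pi in zip(counts, p))
    return log_lik - 0.5 * l2 * sum(t * t for t in theta)

def gradient(theta, features, counts, l2):
    """Observed minus expected feature counts, minus the regularizer's pull."""
    p = probs(theta, features)
    n = sum(counts)
    grad = []
    for k in range(len(theta)):
        observed = sum(c * fv[k] for c, fv in zip(counts, features))
        expected = n * sum(pi * fv[k] for pi, fv in zip(p, features))
        grad.append(observed - expected - l2 * theta[k])
    return grad

def step(theta, features, counts, l2, rate=0.01):
    """One gradient-ascent step."""
    g = gradient(theta, features, counts, l2)
    return [t + rate * gk for t, gk in zip(theta, g)]

def solve(theta, features, counts, l2, rate=0.01, iters=2000):
    """Repeated gradient-ascent steps."""
    for _ in range(iters):
        theta = step(theta, features, counts, l2, rate)
    return theta

# Made-up data: four shapes described by (is_circle, is_solid).
FEATURES = [(1, 1), (1, 0), (0, 1), (0, 0)]
COUNTS = [30, 10, 15, 5]
```

At the optimum, the gradient vanishes, which means the expected feature counts match the observed ones up to the regularization term.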

Linguistics Olympiads, now offered in more than 20 countries, provide secondary-school students a compelling introduction to an unfamiliar field. The North American Computational Linguistics Olympiad (NACLO) includes computational puzzles in addition to purely linguistic ones. This paper explores the computational subject matter we want to convey via NACLO, as well as some of the challenges that arise when adapting problems in computational linguistics to an audience that may have no background in computer science, linguistics, or advanced mathematics. We present a small library of reusable design patterns that have proven useful when composing puzzles appropriate for secondary-school students.

Many models in NLP involve latent variables, such as unknown parses, tags, or alignments. Finding the optimal model parameters is then usually a difficult nonconvex optimization problem. The usual practice is to settle for local optimization methods such as EM or gradient ascent.

We explore how one might instead search for a global optimum in parameter space, using branch-and-bound. Our method would eventually find the global maximum (up to a user-specified ε) if run for long enough, but at any point can return a suboptimal solution together with an upper bound on the global maximum.
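The anytime behavior described here can be illustrated on a much simpler problem. The sketch below maximizes a one-dimensional function over an interval, using a known Lipschitz constant to bound each subinterval; this bounding rule, the function name `branch_and_bound_max`, and its parameters are assumptions for illustration and have nothing to do with the relaxations used in the paper. What carries over is the interface: the search keeps an incumbent and a global upper bound, and stops when they are within ε.

```python
import heapq

def branch_and_bound_max(f, lo, hi, lipschitz, eps=1e-3, max_nodes=100_000):
    """Maximize f on [lo, hi], given that |f(a) - f(b)| <= lipschitz * |a - b|.

    Returns (best_x, best_value, upper_bound), where upper_bound is a
    guaranteed bound on the global maximum even if the search is cut off.
    """
    def make_node(a, b):
        mid = 0.5 * (a + b)
        val = f(mid)
        # No point in [a, b] is farther than (b - a) / 2 from mid.
        upper = val + lipschitz * 0.5 * (b - a)
        return (-upper, a, b, mid, val)  # negated so heapq pops largest bound

    root = make_node(lo, hi)
    heap = [root]
    best_x, best_val = root[3], root[4]
    upper = -root[0]
    for _ in range(max_nodes):
        # Global bound: most optimistic open node, or the incumbent if none.
        upper = max(best_val, -heap[0][0]) if heap else best_val
        if upper - best_val <= eps:
            break
        _, a, b, mid, _ = heapq.heappop(heap)
        for child in (make_node(a, mid), make_node(mid, b)):
            if child[4] > best_val:
                best_x, best_val = child[3], child[4]
            if -child[0] > best_val:  # otherwise prune: cannot beat incumbent
                heapq.heappush(heap, child)
    return best_x, best_val, upper
```

For example, `branch_and_bound_max(lambda x: -(x - 1.0) ** 2, -2.0, 3.0, lipschitz=6.0)` returns a point near x = 1 together with a bound that brackets the true maximum of 0.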

As an illustrative case, we study a generative model for dependency parsing. We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. We show how to formulate this as a mixed integer quadratic programming problem with nonlinear constraints. We use the Reformulation Linearization Technique to produce convex relaxations during branch-and-bound. Although these techniques do not yet provide a practical solution to our instance of this NP-hard problem, they sometimes find better solutions than Viterbi EM with random restarts, in the same time.

Message scheduling has been shown to be very effective in belief propagation (BP) algorithms. However, most existing scheduling algorithms use fixed heuristics regardless of the structure of the graphs or properties of the distribution. On the other hand, designing different scheduling heuristics for all graph structures is not feasible. In this paper, we propose a reinforcement learning based message scheduling framework (RLBP) that learns the heuristics automatically and generalizes to any graph structure and distribution. In the experiments, we show that the learned problem-specific heuristics largely outperform other baselines in speed.

Grammar induction aims to induce the parameters of a probabilistic context-free grammar (PCFG). Crucially, the same parameters should be used not only at different positions in the sentence (as in convolutional networks for vision) but also at different levels in the tree.

We consider how several ideas from deep learning can help construct a PCFG from the bottom up while resisting bad local optima. Our proposed architecture learns to model a sentence as a sequence of phrases, each generated by a PCFG. The root nonterminal of each phrase is predicted from the context of that phrase. Formally this is a kind of autoencoder that predicts the sentence from itself, but structured as a semi-Markov CRF whose emission distribution is a PCFG, and which predicts each phrase only from context. During learning, we “anneal” the search bias from generating a long sequence of 1-word phrases (so the method finds word embeddings based on context) to a single phrase that covers the whole sentence (at which point we have an ordinary PCFG with no dependence on context).

Each run of the learner derives features of the context from the grammars found on previous, more naive runs. This stacking of multiple runs is what makes the method deep. We also mention extensions that involve supervised fine-tuning or richer, vector-valued representations of words and nonterminals.

Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing (Kay, 1986). Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” is far too good to successfully imitate with our inexpensive features. Moreover, it is not specifically tuned for the known reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with only a few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.

Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle's ability and the learner's policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a coach that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to a novel cost-sensitive dynamic feature selection task, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods.

We describe an approach to coreference resolution that relies on the intuition that easy decisions should be made early, while harder decisions should be left for later when more information is available. We are inspired by the recent success of the rule-based system of Raghunathan et al. (2010), which relies on the same intuition. Our system, however, automatically learns from training data what constitutes an easy decision. Thus, we can utilize more features, learn more precise weights, and adapt to any dataset for which training data is available. Experiments show that our system outperforms recent state-of-the-art coreference systems including Raghunathan et al.’s system as well as a competitive baseline that uses a pairwise classifier.

Arithmetic circuits arise in the context of weighted logic programming languages, such as Datalog with aggregation, or Dyna. A weighted logic program defines a generalized arithmetic circuit—the weighted version of a proof forest, with nodes having arbitrary rather than boolean values. In this paper, we focus on finite circuits. We present a flexible algorithm for efficiently querying node values as they change under updates to the circuit's inputs. Unlike traditional algorithms, ours is agnostic about which nodes are tabled (materialized), and can vary smoothly between the traditional strategies of forward and backward chaining. Our algorithm is designed to admit future generalizations, including cyclic and infinite circuits and propagation of delta updates.

Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.

Note: The experiments in this paper are incomplete—we regrettably had to omit some other experimental results because of a bug in the code. You can find the proper results on the slides (and on the poster, from the October 2012 Mid-Atlantic Student Colloquium on Speech, Language and Learning).

In many domains, our best models are computationally intractable. This problem will only get worse as we manage to build more richly detailed models of specific domains. Fortunately, the practical goal of artificial or natural intelligence is not to do perfect detailed inference, but rather to answer specific questions by reasoning from observed data. Thus, we should seek policies for fast approximate inference that will actually achieve low expected loss on our target task. The target task is a distribution not only over test-time data but also over which variables will be observed and queried. The loss function may explicitly penalize for runtime (or data acquisition).

This story leaves open an engineering question: What space of policies should we search? I will review a range of options and point out past work for each. Among others, I will show our own recent successes using message-passing approximate inference policies for graphical models. The form of these policies is determined by the structure of our intractable and surely mismatched domain model, but we tune the parameters to minimize loss (Stoyanov & Eisner, 2012).

One may search a space of sophisticated policies or a space of crude hacks, but crucially, one should tune the policy parameters to optimize expected error and runtime. This expectation can be taken over training data (empirical risk minimization), or over samples from a posterior belief about the target task (which I will call imputed risk minimization). The latter case requires some sort of prior model, but this is necessary when the empirical risk estimate suffers from sparse, non-independent, or out-of-domain training data.

Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs for particular problems or datasets. We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing (Kay, 1986). Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is too large to explore naively. We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.

Note: Consider instead the NeurIPS 2012 paper that is the "final version" of this workshop paper.

We present an instance-specific test-time dynamic feature selection algorithm. Our algorithm sequentially chooses features given previously selected features and their values. It stops the selection process to make a prediction according to a user-specified accuracy-cost trade-off. We cast the sequential decision-making problem as a Markov Decision Process and apply imitation learning techniques. We address the problem of learning and inference jointly in a simple multiclass classification setting. Experimental results on UCI datasets show that our approach achieves the same or higher accuracy than static feature selection methods while using only a small fraction of the features.

We are interested in speeding up approximate inference in Markov Random Fields (MRFs). We present a new method that uses gates—binary random variables that determine which factors of the MRF to use. Which gates are open depends on the observed evidence; when many gates are closed, the MRF takes on a sparser and faster structure that omits “unnecessary” factors. We train parameters that control the gates, jointly with the ordinary MRF parameters, in order to locally minimize an objective that combines loss and runtime.

With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters.
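The "normalized product" construction can be illustrated with a minimal sketch. The function name `topic_from_components`, the toy component distributions, and the choice of which components are active are all assumptions made for illustration; the SCTM learns both the components and the combination structure, which this sketch does not attempt.

```python
def topic_from_components(components, active):
    """Return a topic as the normalized elementwise product of the active components.

    components: list of distributions over the vocabulary (each a list of floats).
    active: indices of the components that this topic combines (hypothetical input).
    """
    vocab_size = len(components[0])
    unnormalized = [1.0] * vocab_size
    for c in active:
        for w in range(vocab_size):
            unnormalized[w] *= components[c][w]
    z = sum(unnormalized)  # normalizing constant
    return [u / z for u in unnormalized]


# Toy example: two components over a three-word vocabulary.
COMPONENTS = [
    [0.50, 0.25, 0.25],
    [0.20, 0.20, 0.60],
]
topic = topic_from_components(COMPONENTS, active=[0, 1])
```

A word receives high probability under the topic only if every active component assigns it reasonably high probability, which is why the product acts like a soft conjunction.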

We propose an algorithm to find the best path through an intersection of arbitrarily many weighted automata, without actually performing the intersection. The algorithm is based on dual decomposition: the automata attempt to agree on a string by communicating about features of the string. We demonstrate the algorithm on the Steiner consensus string problem, both on synthetic data and on consensus decoding for speech recognition. This involves implicitly intersecting up to 100 automata.

Unsupervised learning techniques can take advantage of large amounts of unannotated text, but the largest text corpus (the Web) is not easy to use in its full form. Instead, we have statistics about this corpus in the form of n-gram counts (Brants and Franz, 2006). While n-gram counts do not directly provide sentences, a distribution over sentences can be estimated from them in the same way that n-gram language models are estimated. We treat this distribution over sentences as an approximate corpus and show how unsupervised learning can be performed on such a corpus using variational inference. We compare hidden Markov model (HMM) training on exact and approximate corpora of various sizes, measuring speed and accuracy on unsupervised part-of-speech tagging.

Conditional Random Fields (CRFs) are a popular formalism for structured prediction in NLP. It is well known how to train CRFs with certain topologies that admit exact inference, such as linear-chain CRFs. Some NLP phenomena, however, suggest CRFs with more complex topologies. Should such models be used, considering that they make exact inference intractable? Stoyanov et al. (2011) recently argued for training parameters to minimize the task-specific loss of whatever approximate inference and decoding methods will be used at test time. We apply their method to three NLP problems, showing that (i) using more complex CRFs leads to improved performance, and that (ii) minimum-risk training learns more accurate models.

We present a new framework for learning high-dimensional multivariate probability distributions from estimated marginals. The approach is motivated by compositional models and Bayesian networks, and designed to adapt to small sample sizes. We start with a large, overlapping set of elementary statistical building blocks, or “primitives,” which are low-dimensional marginal distributions learned from data. Each variable may appear in many primitives. Subsets of primitives are combined in a lego-like fashion to construct a probabilistic graphical model; only a small fraction of the primitives will participate in any valid construction. Since primitives can be precomputed, parameter estimation and structure search are separated. Model complexity is controlled by strong biases; we adapt the primitives to the amount of training data and impose rules which restrict the merging of them into allowable compositions. The likelihood of the data decomposes into a sum of local gains, one for each primitive in the final structure. We focus on a specific subclass of networks which are binary forests. Structure optimization corresponds to an integer linear program and the maximizing composition can be computed for reasonably large numbers of variables. Performance is evaluated using both synthetic data and real datasets from natural language processing and computational biology.

Could we explicitly train test-time inference heuristics to trade off accuracy and efficiency? We focus our discussion on agenda-based natural language parsing under a weighted context-free grammar. We frame the problem as reinforcement learning, discuss its special properties, and propose new strategies.

Probabilistic graphical models are typically trained to maximize the likelihood of the training data and evaluated on some measure of accuracy on the test data. However, we are also interested in learning to produce predictions quickly. For example, one can speed up loopy belief propagation by choosing sparser models and by stopping at some point before convergence. We manage the speed-accuracy tradeoff by explicitly optimizing for a linear combination of speed and accuracy. Although this objective is not differentiable, we can compute the gradient of a smoothed version.

Because of the neutrality property, a Dirichlet (process) prior on a discrete distribution cannot capture correlations among the probabilities of “similar” events. We propose obtaining the discrete distribution instead from a random walk model or transformation model, in which each observed event has evolved via a latent sequence of transformations. We are exploring transformation models in which the conditional distributions have infinite support and the prior over them is nonparametric.

Latent Dirichlet Allocation (LDA) has been used to learn selectional preferences as soft disjunctions over flat semantic classes. Our model, the SCTM, also learns the structure of each class as a soft conjunction of high-level semantic features.

We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50-100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.

Note: Additional details are given in the dissertation of Dreyer (2011), and on the associated slides.

Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target-language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.

Note: Additional details are given in the dissertation of Li (2010).

The field of statistical natural language processing has been turning toward morphologically rich languages. These languages have vocabularies that are often orders of magnitude larger than that of English, since words may be inflected in various different ways. This leads to problems with data sparseness and calls for models that can deal with this abundance of related words—models that can learn, analyze, reduce and generate morphological inflections. But surprisingly, statistical approaches to morphology are still rare, which stands in contrast to the many recent advances of sophisticated models in parsing, grammar induction, translation and many other areas of natural language processing.

This thesis presents a novel, unified statistical approach to inflectional morphology, an approach that can decode and encode the inflectional system of a language. At the center of this approach stands the notion of inflectional paradigms. These paradigms cluster the large vocabulary of a language into structured chunks; inflections of the same word, like break, broke, breaks, breaking,..., all belong in the same paradigm. And moreover, each of these inflections has an exact place within a paradigm, since each paradigm has designated slots for each possible inflection; for verbs, there is a slot for the first person singular indicative present, one for the third person plural subjunctive past and slots for all other possible forms. The main goal of this thesis is to build probability models over inflectional paradigms, and therefore to sort the large vocabulary of a morphologically rich language into structured clusters. These models can be learned with minimal supervision for any language that has inflectional morphology. As training data, some sample paradigms and a raw, unannotated text corpus can be used.

The models over morphological paradigms are developed in three main chapters that start with smaller components and build up to larger ones.

The first of these chapters (Chapter 2) presents novel probability models over strings and string pairs. These are applicable to lemmatization, to relating a past tense form to its associated present tense form, or to similar morphological tasks. It turns out they are general enough to tackle the popular task of transliteration very well, as well as other string-to-string tasks.

The second (Chapter 3) introduces the notion of a probability model over multiple strings, which is a novel variant of Markov Random Fields. These are used to relate the many inflections in an inflectional paradigm to one another, and they use the probability models from Chapter 2 as components. A novel version of belief propagation is presented, which propagates distributions over strings through a network of connected finite-state transducers, to perform inference in morphological paradigms (or other string fields).

Finally (Chapter 4), a non-parametric joint probability model over an unannotated text corpus and the morphological paradigms from Chapter 3 is presented. This model is based on a generative story for inflectional morphology that naturally incorporates common linguistic notions, such as lexemes, paradigms and inflections. Sampling algorithms are presented that perform inference over large text corpora and their implicit, hidden morphological paradigms. We show that they are able to discover the morphological paradigms that are implicit in the corpora. The model is based on finite-state operations and seamlessly handles concatenative and nonconcatenative morphology.

Graphical models are often used “inappropriately,” with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic meta-descent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.

Note: Stoyanov and Eisner (2012a) subsequently applied this training method to three structured prediction problems in NLP, getting striking accuracy improvements. Gormley et al. (2015) applied it to dependency parsing by belief propagation. Stoyanov and Eisner (2011, 2012b) gave preliminary extensions to optimize speed jointly with accuracy.

Modern statistical AI systems are quite large and complex; this interferes with research, development, and education. We point out that most of the computation involves database-like queries and updates on complex views of the data. Specifically, recursive queries look up and aggregate relevant or potentially relevant values. If the results of these queries are memoized for reuse, the memos may need to be updated through change propagation. We propose a declarative language, which generalizes Datalog, to support this work in a generic way. Through examples, we show that a broad spectrum of AI algorithms can be concisely captured by writing down systems of equations in our notation. Many strategies could be used to actually solve those systems. Our examples motivate certain extensions to Datalog, which are connected to functional and object-oriented programming paradigms.

Note: The followup paper Filardo and Eisner (2012) explains execution mechanisms for handling queries and updates.

Confusion network decoding for MT system combination.
Antti-Veikko Rosti, Eugene Matusov, Jason Smith, Necip Ayan, Jason
Eisner, Damianos Karakos, Sanjeev Khudanpur, Gregor Leusch, Zhifei Li, Spyros
Matsoukas, Hermann Ney, Richard Schwartz, B. Zhang, and J. Zheng (2011).
In Handbook of Natural Language Processing and Machine
Translation. [ paper | book | bib ]

Note: This chapter includes a speedup of Karakos et al. (2008) using an A^{*} heuristic. See p. 355.

Much recent work in natural language processing treats linguistic analysis as an inference problem over graphs. This development opens up useful connections between machine learning, graph theory, and linguistics.

The first part of this dissertation formulates syntactic dependency parsing as a dynamic Markov random field with the novel ingredient of global constraints. Global constraints are enforced by calling combinatorial optimization algorithms as subroutines during message-passing inference in the graphical model, and these global constraints greatly improve on the accuracy of collections of local constraints. In particular, combinatorial subroutines enforce the constraint that the parser's output must form a tree. This is the first application that uses efficient computation of marginals for combinatorial problems to improve the speed and accuracy of belief propagation. If the dependency tree is projective, the tree constraint exploits the inside-outside algorithm; if non-projective, with discontiguous constituents, it exploits the directed matrix-tree theorem, here newly applied to NLP problems. Even with second-order features or latent variables, which would make exact parsing asymptotically slower or NP-hard, approximate inference with belief propagation is as efficient as a simple edge-factored parser times a constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features increases the runtime additively rather than multiplicatively.
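As a rough illustration of the directed matrix-tree theorem mentioned above, the following sketch computes the total weight of all non-projective dependency trees as a determinant. It assumes node 0 is an artificial root that may take several children, and the names `determinant` and `tree_partition_function` are hypothetical. The dissertation uses such computations to obtain marginals inside belief propagation; this sketch stops at the normalizing constant.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def tree_partition_function(weight):
    """Sum over all spanning trees rooted at node 0 of the product of edge weights.

    weight[h][m] is the nonnegative weight of the edge from head h to modifier m.
    """
    n = len(weight)
    size = n - 1  # Laplacian with the root's row and column removed
    lap = [[0.0] * size for _ in range(size)]
    for m in range(1, n):
        for h in range(n):
            if h == m:
                continue
            lap[m - 1][m - 1] += weight[h][m]      # total incoming weight of m
            if h >= 1:
                lap[h - 1][m - 1] -= weight[h][m]  # off-diagonal entry
    return determinant(lap)


# With all weights equal to 1, the result counts the trees over 3 words plus a root.
uniform = [[1.0] * 4 for _ in range(4)]
num_trees = tree_partition_function(uniform)
```

The determinant costs cubic time in the sentence length, whereas enumerating the trees would take exponential time.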

The second part extends these models to capture correspondences among non-isomorphic structures. When bootstrapping a parser in a low-resource target language by exploiting a parser in a high-resource source language, models that score the alignment and the correspondence of divergent syntactic configurations in translational sentence pairs achieve higher accuracy in parsing the target language. These noisy (quasi-synchronous) mappings have further applications in adapting parsers across domains, in learning features of the syntax-semantics interface, and in question answering, paraphrasing, and information retrieval.

Note: Dr. Smith's dissertation advisor was Jason Eisner.

An unsupervised discriminative training procedure is proposed for estimating a language model (LM) for machine translation (MT). An English-to-English synchronous context-free grammar is derived from a baseline MT system to capture translation alternatives: pairs of words, phrases or other sentence fragments that potentially compete to be the translation of the same source-language fragment. Using this grammar, a set of impostor sentences is then created for each English sentence to simulate confusions that would arise if the system were to process an (unavailable) input whose correct English translation is that sentence. An LM is then trained to discriminate between the original sentences and the impostors. The procedure is applied to the IWSLT Chinese-to-English translation task, and promising improvements on a state-of-the-art MT system are demonstrated.

Note: Additional details are given in the dissertation of Li (2010).

In lexicalized phrase-structure or dependency parses, a word's modifiers tend to fall near it in the string. This fact can be exploited by parsers. We first show that a crude way to use dependency length as a parsing feature can substantially improve parsing speed and accuracy in English and Chinese, with more mixed results on German. We then show similar improvements by imposing hard bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. The approach with hard bounds, “vine grammar,” accepts only a regular language, even though it happily retains a context-free parameterization and defines meaningful parse trees. We show how to parse this language in O(n) time, using a novel chart parsing algorithm with a low grammar constant (rather than an impractically large finite-state recognizer with an exponential grammar constant). For a uniform hard bound of k on dependencies of all types, our algorithm's runtime is O(nk^{2}). We also extend our algorithm to parse weighted-FSA inputs such as lattices.

Note: This book chapter extends Eisner & Smith (2005) with considerable new material (e.g., lattice parsing).

A hypergraph or “packed forest” is a compact data structure that uses structure-sharing to represent exponentially many trees in polynomial space. A probabilistic/weighted hypergraph also defines a probability (or other weight) for each tree, and can be used to represent the hypothesis space considered (for a given input) by a monolingual parser or a tree-based translation system (e.g., tree to string, string to tree, tree to tree, or string to string with latent tree structures).

Given a weighted/probabilistic hypergraph, we might ask three questions. What atomic operations can we perform on the weighted hypergraph? How do we set the weights in the hypergraph? Which particular translation (among the possible translations encoded in a hypergraph) should we present to an end user? These correspond to three fundamental problems: inference, training, and decoding, for which this dissertation will present novel techniques.

The atomic inference operations we may want to perform include finding one-best, k-best, or expectations over the hypergraph. To perform each operation, we may implement a dedicated dynamic programming algorithm. However, a more general framework to specify these algorithms is semiring-weighted logic programming. Within this framework, we first extend the expectation semiring, which was originally proposed for finite-state automata, to a hypergraph. We then propose a novel second-order expectation semiring. These semirings can be used to compute a large number of expectations (e.g., entropy and its gradient) over the exponentially many trees represented in a hypergraph.

The weights used in a hypergraph are usually learnt by a discriminative training method. One common drawback of such methods is that they rely on the existence of high-quality supervised data (i.e., bilingual data), which may be expensive to obtain. We present two unsupervised discriminative training methods, minimum imputed-risk training and contrastive language model estimation, both of which can exploit monolingual English data to perform discriminative training. In minimum imputed-risk training, we first use a reverse translation model to impute the missing inputs, and then train a discriminative forward model by minimizing the expected loss of the forward translations of the missing inputs.

In contrast, the contrastive language model estimation does not use a reverse system. It first extracts a confusion grammar, then generates many alternative sentences (i.e., a contrastive set) for each English sentence using the confusion grammar, and finally trains a discriminative language model on the contrastive sets such that the model will prefer the original English sentences (over the sentences in the contrastive sets).

During decoding, we are interested in finding a translation that has a maximum posterior probability (i.e., MAP decoding). However, this is intractable due to spurious ambiguity, a situation where the probability of a translation string is split among many distinct derivations (e.g., trees or segmentations). Therefore, most systems use a simple Viterbi decoding that approximates the string probability with its most probable derivation's probability. Instead, we develop a variational approximation, which considers all the derivations but still allows tractable decoding. Our particular variational distributions are parameterized as n-gram models. We also analytically show that interpolating these n-gram models for different n is similar to lattice-based minimum-risk decoding for BLEU. Experiments show that our approach improves the state of the art.
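The gap between Viterbi decoding and MAP decoding under spurious ambiguity can be seen in a toy sketch. The derivation list and the function names `viterbi_string` and `map_string` are hypothetical. The explicit enumeration here is exactly what is intractable for real hypergraphs, and it is not the variational approximation developed in the dissertation.

```python
from collections import defaultdict

# Hypothetical toy derivations: (translation string, derivation probability).
DERIVATIONS = [
    ("the house", 0.40),  # a single derivation
    ("the home", 0.30),   # two distinct derivations of the same string
    ("the home", 0.30),
]


def viterbi_string(derivations):
    """Return the string of the single most probable derivation."""
    return max(derivations, key=lambda d: d[1])[0]


def map_string(derivations):
    """Return the string with the highest total probability over its derivations."""
    totals = defaultdict(float)
    for string, prob in derivations:
        totals[string] += prob
    return max(totals, key=totals.get)
```

On this toy input the two decoders disagree: the Viterbi choice follows the best single derivation (0.40), while the MAP choice follows the summed string probability (0.60).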

All the above methods have been implemented in an open-source machine translation toolkit, Joshua. In this dissertation, the methods have mainly been applied to a machine translation task, but we expect that they will also find applications in other areas of natural language processing (e.g., parsing and speech recognition).

Many statistical translation models can be regarded as weighted logical deduction. Under this paradigm, we use weights from the expectation semiring (Eisner, 2002) to compute first-order statistics (e.g., the expected hypothesis length or feature counts) over packed forests of translations (lattices or hypergraphs). We then introduce a novel second-order expectation semiring, which computes second-order statistics (e.g., the variance of the hypothesis length or the gradient of entropy). This second-order semiring is essential for many interesting training paradigms such as minimum risk, deterministic annealing, active learning, and semi-supervised learning, where gradient descent optimization requires computing the gradient of entropy or risk. We use these semirings in an open-source machine translation toolkit, Joshua, enabling minimum-risk training for a benefit of up to 1.0 BLEU point.
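A minimal sketch of the first-order expectation semiring may help make the idea concrete. It computes the expected hypothesis length over a toy acyclic lattice with a single forward pass. The class name `Expect`, the helper names, and the lattice itself are assumptions for illustration; the paper's hypergraph and second-order versions are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expect:
    """Element (p, r): probability mass p and probability-weighted value r."""
    p: float
    r: float

    def __add__(self, other):  # semiring "plus": combine alternative paths
        return Expect(self.p + other.p, self.r + other.r)

    def __mul__(self, other):  # semiring "times": extend a path by an arc
        return Expect(self.p * other.p, self.p * other.r + other.p * self.r)


ZERO = Expect(0.0, 0.0)
ONE = Expect(1.0, 0.0)


def arc_weight(prob, value):
    """Weight of one arc that has probability prob and contributes value."""
    return Expect(prob, prob * value)


def total_weight(edges, start, goal, n_nodes):
    """Forward pass; assumes every edge goes from a lower to a higher node id."""
    alpha = [ZERO] * n_nodes
    alpha[start] = ONE
    for src, dst, prob, value in sorted(edges, key=lambda e: e[0]):
        alpha[dst] = alpha[dst] + alpha[src] * arc_weight(prob, value)
    return alpha[goal]


# Toy lattice: (source, destination, probability, number of words on the arc).
EDGES = [
    (0, 1, 0.6, 1),
    (0, 2, 0.4, 1),
    (1, 3, 1.0, 1),
    (2, 3, 0.5, 2),
    (2, 3, 0.5, 1),
]
result = total_weight(EDGES, start=0, goal=3, n_nodes=4)
expected_length = result.r / result.p
```

The same forward pass that would sum path probabilities in the ordinary real semiring here also accumulates the expectation, so no separate pass over the paths is needed.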

Note: Additional introduction and details are given in the dissertation of Li (2010).

We study graphical modeling in the case of string-valued random variables. Whereas a weighted finite-state transducer can model the probabilistic relationship between two strings, we are interested in building up joint models of three or more strings. This is needed for inflectional paradigms in morphology, cognate modeling or language reconstruction, and multiple-string alignment. We propose a Markov Random Field in which each factor (potential function) is a weighted finite-state machine, typically a transducer that evaluates the relationship between just two of the strings. The full joint distribution is then a product of these factors. Though decoding is actually undecidable in general, we can still do efficient joint inference using approximate belief propagation; the necessary computations and messages are all finite-state. We demonstrate the methods by jointly predicting morphological forms.

Note: Additional details are given in the dissertation of Dreyer (2011), and on the associated slides. Cotterell and Eisner (2015) give an improved inference method for graphical models over strings. Cotterell et al. (2015) extended the morphological modeling approach to handle latent underlying morphs and derivational morphology.

We connect two scenarios in structured learning: adapting a parser trained on one corpus to another annotation style, and projecting syntactic annotations from one language to another. We propose quasi-synchronous grammar (QG) features for these structured learning tasks. That is, we score an aligned pair of source and target trees based on local features of the trees and the alignment. Our quasi-synchronous model assigns positive probability to any alignment of any trees, in contrast to a synchronous grammar, which would insist on some form of structural parallelism.

In monolingual dependency parser adaptation, we achieve high accuracy in translating among multiple annotation styles for the same sentence. On the more difficult problem of cross-lingual parser projection, we learn a dependency parser for a target language by using bilingual text, an English parser, and automatic word alignments. Our experiments show that unsupervised QG projection improves on parsers trained using only high-precision projected annotations and far outperforms, by more than 35% absolute dependency accuracy, learning an unsupervised parser from raw target-language text alone. When a few target-language parse trees are available, projection gives a boost equivalent to doubling the number of target-language trees.

Note: Additional details are given in the dissertation of Smith (2010).

We apply machine learning to the Linear Ordering Problem in order to learn sentence-specific reordering models for machine translation. We demonstrate that even when these models are used as a mere preprocessing step for German-English translation, they significantly outperform Moses' integrated lexicalized reordering model.

Our models are trained on automatically aligned bitext. Their form is simple but novel. They assess, based on features of the input sentence, how strongly each pair of input word tokens w_{i}, w_{j} would like to reverse their relative order. Combining all these pairwise preferences to find the best global reordering is NP-hard. However, we present a non-trivial O(n^{3}) algorithm, based on chart parsing, that at least finds the best reordering within a certain exponentially large neighborhood. We show how to iterate this reordering process within a local search algorithm, which we use in training.
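To make the objective concrete, here is a minimal sketch of scoring a permutation under a Linear Ordering Problem and solving a tiny instance by exhaustive search. The matrix `B` and both function names are hypothetical, with `B[i][j]` assumed to be the score for placing token i before token j. This brute-force search is not the O(n^{3}) neighborhood algorithm described above; it only shows what is being optimized.

```python
from itertools import permutations


def lop_score(order, B):
    """Sum B[i][j] over every pair where item i precedes item j in `order`."""
    return sum(B[order[a]][order[b]]
               for a in range(len(order))
               for b in range(a + 1, len(order)))


def best_order_brute_force(B):
    """Try all n! orders; feasible only for very small n."""
    n = len(B)
    return max(permutations(range(n)), key=lambda order: lop_score(order, B))


# Hypothetical pairwise preference matrix for a 3-token sentence.
B = [[0, 1, 5],
     [4, 0, 2],
     [0, 3, 0]]

best = best_order_brute_force(B)
best_score = lop_score(best, B)
```

Because the number of permutations grows factorially, exhaustive search stops being practical almost immediately, which is why searching a large neighborhood in polynomial time is attractive.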

Note: Additional details are given in the dissertation of Tromble (2009).

Statistical models in machine translation exhibit spurious ambiguity. That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations). In principle, the goodness of a string is measured by the total probability of its many derivations. However, finding the best string (e.g., during decoding) is then computationally intractable. Therefore, most systems use a simple Viterbi approximation that measures the goodness of a string using only its most probable derivation. Instead, we develop a variational approximation, which considers all the derivations but still allows tractable decoding. Our particular variational distributions are parameterized as n-gram models. We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). Experiments show that our approach improves the state of the art.

Note: Additional introduction and details are given in the dissertation of Li (2010).

The field of AI has become implementation-bound. We have plenty of ideas, but it is increasingly laborious to try them out, as our models become more ambitious and our datasets become larger, noisier, and more heterogeneous. The software engineering burden makes it hard to start new work; hard to reuse and combine existing ideas; and hard to educate our students.

In this talk, I'll propose to hide many common implementation details behind a new level of abstraction that we are developing. Dyna is a declarative programming language that combines logic programming with functional programming. It also supports modularity. It may be regarded as a kind of deductive database, theorem prover, truth maintenance system, or equation solver.

I will illustrate how Dyna makes it easy to specify the combinatorial structure of typical computations needed in natural language processing, machine learning, and elsewhere in AI. Then I will sketch implementation strategies and program transformations that can help to make these computations fast and memory-efficient. Finally, I will suggest that machine learning should be used to search for the right strategies for a program on a particular workload.

This dissertation is about ordering. The problem of arranging a set of n items in a desired order is quite common, as well as fundamental to computer science. Sorting is one instance, as is the Traveling Salesman Problem. Each problem instance can be thought of as optimization of a function that applies to the set of permutations.

The dissertation treats word reordering for machine translation as another instance of a combinatorial optimization problem. The approach introduced is to combine three different functions of permutations. The first function is based on finite-state automata, the second is an instance of the Linear Ordering Problem, and the third is an entirely new permutation problem related to the LOP.

The Linear Ordering Problem has the most attractive computational properties of the three, all of which are NP-hard optimization problems. The dissertation expends significant effort developing neighborhoods for local search on the LOP, and uses grammars and other tools from natural language parsing to introduce several new results, including a state-of-the-art local search procedure.

Combinatorial optimization problems such as the TSP or the LOP are usually given the function over permutations. In the machine translation setting, the weights are not given, only words. The dissertation applies machine learning techniques to derive a LOP from each given sentence using a corpus of sentences and their translations for training. It proposes a set of features for such learning and argues their propriety for translation based on an analogy to dependency parsing. It adapts a number of parameter optimization procedures to the novel setting of the LOP.

The signature result of the thesis is the application of a machine learned set of linear ordering problems to machine translation. Using reorderings found by search as a preprocessing step significantly improves translation of German to English, and significantly more than the lexicalized reordering model that is the default of the translation system.

In addition, the dissertation provides a number of new theoretical results, and lays out an ambitious program for potential future research. Both the reordering model and the optimization techniques have broad applicability, and the availability of machine learning makes even new problems without obvious structure approachable.

Note: Dr. Tromble's dissertation advisor was Jason Eisner.

Cross-document coreference resolution: A key technology for learning by reading. James Mayfield, David Alexander, Bonnie Dorr, Jason Eisner, Tamer Elsayed, Tim Finin, Clay Fink, Marjorie Freedman, Nikesh Garera, Paul McNamee, Saif Mohammad, Douglas Oard, Christine Piatko, Asad Sayeed, Zareen Syed, Ralph Weischedel, Tan Xu, and David Yarowsky (2009). In AAAI 2009 Spring Symposium on Learning by Reading and Learning to Read.

The Dyna programming language is intended to provide a declarative abstraction layer for building systems in ML and AI. It extends logic programming with weights in a way that resembles functional programming. The weights are often probabilities. Yet Dyna does not enforce a probabilistic semantics, since many AI and ML methods work with inexact probabilities (e.g., bounds) and other numeric and non-numeric quantities. Instead Dyna aims to provide a flexible abstraction layer that is “one level lower,” and whose efficient implementation will be able to serve as infrastructure for building a variety of toolkits, languages, and specific systems.

We review two novel methods for text categorization, based on a new framework that utilizes richer annotations that we call annotator rationales. A human annotator provides hints to a machine learner by highlighting contextual “rationales” in support of each of his or her annotations. We have collected such rationales, in the form of substrings, for an existing document sentiment classification dataset [1]. We have developed two methods, one discriminative [2] and one generative [3], that use these rationales during training to obtain significant accuracy improvements over two strong baselines. Our generative model in particular could be adapted to help learn other kinds of probabilistic classifiers for quite different tasks. Based on a small study of annotation speed, we posit that for some tasks, providing rationales can be a more fruitful use of an annotator's time than annotating more examples.

Note: This paper is just a synthesis of Zaidan et al. (2007) and Zaidan & Eisner (2008). (I hate doing multiple papers on the same work, but I wanted the ML community outside of NLP to see these results, so I asked the workshop organizers for permission to republish them here.)

We formulate dependency parsing as a graphical model with the novel ingredient of global constraints. We show how to apply loopy belief propagation (BP), a simple and effective tool for approximate learning and inference. As a parsing algorithm, BP is both asymptotically and empirically efficient. Even with second-order features or latent variables, which would make exact parsing considerably slower or NP-hard, BP needs only O(n^{3}) time with a small constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features would increase the runtime additively rather than multiplicatively.

Note: Additional details are given in the dissertation of Smith (2010).

A human annotator can provide hints to a machine learner by highlighting contextual “rationales” for each of his or her annotations (Zaidan et al., 2007). How can one exploit this side information to better learn the desired parameters θ? We present a generative model of how a given annotator, knowing the true θ, stochastically chooses rationales. Thus, observing the rationales helps us infer the true θ. We collect substring rationales for a sentiment classification task (Pang and Lee, 2004) and use them to obtain significant accuracy improvements for each annotator. Our new generative approach exploits the rationales more effectively than our previous “masking SVM” approach. It is also more principled, and could be adapted to help learn other kinds of probabilistic classifiers for quite different tasks.

String-to-string transduction is a central problem in computational linguistics and natural language processing. It occurs in tasks as diverse as name transliteration, spelling correction, pronunciation modeling and inflectional morphology. We present a conditional log-linear model for string-to-string transduction, which employs overlapping features over latent alignment sequences, and which learns latent classes and latent string pair regions from incomplete training data. We evaluate our approach on morphological tasks and demonstrate that latent variables can dramatically improve results, even when trained on small data sets. On the task of generating morphological forms, we outperform a baseline method, reducing the error rate by up to 48%. On a lemmatization task, we reduce the error rates in Wicentowski (2002) by 38-92%.

Note: Additional details are given in the dissertation of Dreyer (2011).

Just as programming is the traditional introduction to computer science, writing grammars by hand is an excellent introduction to many topics in computational linguistics. We present and justify a well-tested introductory activity in which teams of mixed background compete to write probabilistic context-free grammars of English. The exercise brings together symbolic, probabilistic, algorithmic, and experimental issues in a way that is accessible to novices and enjoyable.

One may need to build a statistical parser for a new language, using only a very small labeled treebank together with raw text. We argue that bootstrapping a parser is most promising when the model uses a rich set of redundant features, as in recent models for scoring dependency parses (McDonald et al., 2005). Drawing on Abney's (2004) analysis of the Yarowsky algorithm, we perform bootstrapping by entropy regularization: we maximize a linear combination of conditional likelihood on labeled data and confidence (negative Rényi entropy) on unlabeled data. In initial experiments, this surpassed EM for training a simple feature-poor generative model, and also improved the performance of a feature-rich, conditionally estimated model where EM could not easily have been applied. For our models and training sets, more peaked measures of confidence, measured by Rényi entropy, outperformed smoother ones. We discuss how our feature set could be extended with cross-lingual or cross-domain features, to incorporate knowledge from parallel or comparable corpora during bootstrapping.
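The entropy-regularized objective described above can be sketched as follows, using the standard definition of Rényi entropy, H_α(p) = log(Σ p_i^α) / (1 − α), which reduces to Shannon entropy as α approaches 1. The function names and the weight `gamma` are assumptions for illustration, and the posteriors are passed in as plain lists rather than computed by a parser.

```python
import math


def renyi_entropy(probs, alpha):
    """Rényi entropy of a discrete distribution; alpha == 1 gives Shannon entropy."""
    if alpha == 1.0:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


def objective(labeled_logliks, unlabeled_posteriors, alpha, gamma):
    """Conditional log-likelihood on labeled data plus gamma times confidence,
    where confidence is the negative Rényi entropy of each unlabeled posterior."""
    loglik = sum(labeled_logliks)
    confidence = -sum(renyi_entropy(p, alpha) for p in unlabeled_posteriors)
    return loglik + gamma * confidence
```

For example, `renyi_entropy([0.25] * 4, 2.0)` equals log 4 because the distribution is uniform, while a distribution that puts all its mass on one outcome has entropy 0 and therefore maximal confidence.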

We propose a new framework for supervised machine learning. Our goal is to learn from smaller amounts of supervised training data, by collecting a richer kind of training data: annotations with “rationales.” When annotating an example, the human teacher will also highlight evidence supporting this annotation—thereby teaching the machine learner why the example belongs to the category. We provide some rationale-annotated data and present a learning method that exploits the rationales during training to boost performance significantly on a sample task, namely sentiment classification of movie reviews. We hypothesize that in some situations, providing rationales is a more fruitful use of an annotator's time than annotating more examples.

In unsupervised learning, where no training takes place, one simply hopes that the unsupervised learner will work well on any unlabeled test collection. However, when the variability in the data is large, such hope may be unrealistic; a tuning of the unsupervised algorithm may then be necessary in order to perform well on new test collections. In this paper, we show how to perform such a tuning in the context of unsupervised document clustering, by (i) introducing a degree of freedom, α, into two leading information-theoretic clustering algorithms, through the use of generalized mutual information quantities; and (ii) selecting the value of α based on clusterings of similar, but supervised document collections (cross-instance tuning). One option is to perform a tuning that directly minimizes the error on the supervised data sets; another option is to use “strapping” (Eisner and Karakos, 2005), which builds a classifier that learns to distinguish good from bad clusterings, and then selects the α with the best predicted clustering on the test set. Experiments from the “20 Newsgroups” corpus show that, although both techniques improve the performance of the baseline algorithms, “strapping” is clearly a better choice for cross-instance tuning.

Iterative denoising trees were used by Karakos et al. [1] for unsupervised hierarchical clustering. The tree construction involves projecting the data onto low-dimensional spaces, as a means of smoothing their empirical distributions, as well as splitting each node based on an information-theoretic maximization objective. In this paper, we improve upon the work of [1] in two ways: (i) the amount of computation spent searching for a good projection at each node now adapts to the intrinsic dimensionality of the data observed at that node; (ii) the objective at each node is to find a split which maximizes a generalized form of mutual information, the Jensen-Rényi divergence; this is followed by an iterative Naive Bayes classification. The single parameter alpha of the Jensen-Rényi divergence is chosen based on the “strapping” methodology [2], which learns a meta-classifier on a related task. Compared with the sequential Information Bottleneck method [3], our procedure produces state-of-the-art results on an unsupervised categorization task of documents from the “20 Newsgroups” dataset.

Dynamic programming algorithms in statistical natural language processing can be easily described as weighted logic programs. We give a notation and semantics for such programs. We then describe several source-to-source transformations that affect a program's efficiency, primarily by rearranging computations for better reuse or by changing the search strategy. We present practical examples of using these transformations, mainly to optimize context-free parsing algorithms, and we formalize them for use with new weighted logic programs.

Specifically, we define weighted versions of the folding and unfolding transformations, whose unweighted versions are used in the logic programming and deductive database communities. We then present a novel transformation called speculation—a powerful generalization of folding that is motivated by gap-passing in categorial grammar. Finally, we give a simpler and more powerful formulation of the magic templates transformation.

Note: The speculation transform is really a form of lifted inference (a term that gained currency later).

This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood, and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias.

Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and supervised model selection, where the most accurate model on the development set (now with annotations) is selected. The latter is far superior, but surprisingly few annotated examples are required. The experimentation focuses on a single dependency grammar induction task, in depth. The aim is to give strong support for the usefulness of the new techniques in one scenario. It must be noted, however, that the task (as defined here and in prior work) is somewhat artificial, and improved performance on this particular task is not a direct contribution to the greater field of natural language processing. The real problem the task seeks to simulate—the induction of syntactic structure in natural language text—is certainly of interest to the community, but this thesis does not directly approach the problem of exploiting induced syntax in applications. We also do not attempt any realistic simulation of human language learning, as our newspaper text data do not resemble the data encountered by a child during language acquisition. Further, our iterative learning algorithms assume a fixed batch of data that can be repeatedly accessed, not a long stream of data observed over time in tandem with acquisition. (Of course, the cognitive criticisms apply to virtually all existing learning methods in natural language processing, not just the new ones presented here.) Nonetheless, the novel estimation methods presented are, we will argue, better suited to adaptation for real engineering tasks than the maximum likelihood baseline.

Our new methods are shown to achieve significant improvements over maximum likelihood estimation and maximum a posteriori estimation, using the EM algorithm, for a state-of-the-art probabilistic model used in dependency grammar induction (Klein and Manning, 2004). The task is to induce dependency trees from part-of-speech tag sequences; we follow standard practice and train and test on sequences of ten tags or fewer. Our results are the best published to date for six languages, with supervised model selection: English (improvement from 41.6% directed attachment accuracy to 66.7%, a 43% relative error rate reduction), German (54.4% -> 71.8%, a 38% error reduction), Bulgarian (45.6% -> 58.3%, a 23% error reduction), Mandarin (50.0% -> 58.0%, a 16% error reduction), Turkish (48.0% -> 62.4%, a 28% error reduction, but only 2% error reduction from a left-branching baseline, which gives 61.8%), and Portuguese (42.5% -> 71.8%, a 51% error reduction). We also demonstrate the success of contrastive estimation at learning to disambiguate part-of-speech tags (from unannotated English text): 78.0% to 88.7% tagging accuracy on a known-dictionary task (a 49% relative error rate reduction), and 66.5% to 78.4% on a more difficult task with less dictionary knowledge (a 35% error rate reduction).
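The relative error rate reductions quoted above follow from the accuracy figures by a simple calculation, sketched here. The function name is an assumption for this example; the accuracies are the ones reported in the abstract.

```python
def relative_error_reduction(old_acc, new_acc):
    """Percentage of the old error rate that was eliminated (accuracies in percent)."""
    old_err = 100.0 - old_acc
    new_err = 100.0 - new_acc
    return 100.0 * (old_err - new_err) / old_err


english = relative_error_reduction(41.6, 66.7)           # about 43
portuguese = relative_error_reduction(42.5, 71.8)        # about 51
turkish_vs_baseline = relative_error_reduction(61.8, 62.4)  # about 2
```

The Turkish case shows why the choice of baseline matters: the same final accuracy of 62.4% is a large reduction relative to 48.0% but a small one relative to the 61.8% left-branching baseline.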

The experiments presented in this thesis give one of the most thorough explorations to date of unsupervised parameter estimation for models of discrete structures. Two sides of the problem are considered in depth: the choice of objective function to be optimized during training, and the method of optimizing it. We find that both are important in unsupervised learning. Our best results on most of the six languages involve both improved objectives and improved search.

The methods presented in this thesis were originally presented in Smith and Eisner (2004, 2005a,b, 2006). The thesis gives a more thorough exposition, relating the methods to other work, presents more experimental results and error analysis, and directly compares the methods to each other.

Note: Dr. Smith's dissertation advisor was Jason Eisner.

We describe Dynasty, a system for browsing large (possibly infinite) directed graphs and hypergraphs. Only a small subgraph is visible at any given time. We sketch how we lay out the visible subgraph, and how we update the layout smoothly and dynamically in an asynchronous environment. We also sketch our user interface for browsing and annotating such graphs—in particular, how we try to make keyboard navigation usable.

While keystream reuse in stream ciphers and one-time pads has been a well known problem for several decades, the risk to real systems has been underappreciated. Previous techniques have relied on being able to accurately guess words and phrases that appear in one of the plaintext messages, making it far easier to claim that “an attacker would never be able to do that.” In this paper, we show how an adversary can automatically recover messages encrypted under the same keystream if only the type of each message is known (e.g. an HTML page in English). Our method, which is related to HMMs, recovers the most probable plaintext of this type by using a statistical language model and a dynamic programming algorithm. It produces up to 99% accuracy on realistic data and can process ciphertexts at 200ms per byte on a $2,000 PC. To further demonstrate the practical effectiveness of the method, we show that our tool can recover documents encrypted by Microsoft Word 2002.
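The weakness being exploited can be shown in a few lines: when two messages are encrypted with the same keystream, XORing the ciphertexts cancels the keystream and leaves the XOR of the two plaintexts. The keystream bytes and messages below are made up for illustration. This sketch covers only that first observation, not the language-model and dynamic-programming recovery step that the paper contributes.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


keystream = bytes([0x3A, 0x91, 0x5C, 0x07, 0xE2])  # hypothetical reused keystream
p1 = b"hello"
p2 = b"world"

c1 = xor_bytes(p1, keystream)
c2 = xor_bytes(p2, keystream)

# The attacker sees only c1 and c2, yet their XOR no longer depends on the keystream.
leaked = xor_bytes(c1, c2)
```

Here `leaked` equals `xor_bytes(p1, p2)`, so the remaining task is to split that value into two plausible plaintexts.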

We study unsupervised methods for learning refinements of the nonterminals in a treebank. Following Matsuzaki et al. (2005) and Prescher (2005), we may for example split NP without supervision into NP[0] and NP[1], which behave differently. We first propose to learn a PCFG that adds such features to nonterminals in such a way that they respect patterns of linguistic feature passing: each node's nonterminal features are either identical to, or independent of, those of its parent. This linguistic constraint reduces runtime and the number of parameters to be learned. However, it did not yield improvements when training on the Penn Treebank. An orthogonal strategy was more successful: to improve the performance of the EM learner by treebank preprocessing and by annealing methods that split nonterminals selectively. Using these methods, we can maintain high parsing accuracy while dramatically reducing the model size.

When training the parameters for a natural language system, one would prefer to minimize 1-best loss (error) on an evaluation set. Since the error surface for many natural language problems is piecewise constant and riddled with local minima, many systems instead optimize log-likelihood, which is conveniently differentiable and convex. We propose training instead to minimize the expected loss, or risk. We define this expectation using a probability distribution over hypotheses that we gradually sharpen (anneal) to focus on the 1-best hypothesis. Besides the linear loss functions used in previous work, we also describe techniques for optimizing nonlinear functions such as precision or the BLEU metric. We present experiments training log-linear combinations of models for dependency parsing and for machine translation. In machine translation, annealed minimum risk training achieves significant improvements in BLEU over standard minimum error training. We also show improvements in labeled dependency parsing.
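A minimal sketch of the annealed risk described above, assuming a finite list of hypotheses with model scores and losses: the distribution is proportional to exp(gamma * score), and raising gamma sharpens it toward the 1-best hypothesis. The function name, `gamma`, and the toy numbers are assumptions for this example.

```python
import math


def annealed_risk(scores, losses, gamma):
    """Expected loss under p(y) proportional to exp(gamma * score(y))."""
    shift = max(gamma * s for s in scores)  # subtract the max for numerical stability
    weights = [math.exp(gamma * s - shift) for s in scores]
    z = sum(weights)
    return sum(w / z * loss for w, loss in zip(weights, losses))


scores = [2.0, 1.0, 0.0]   # hypothetical model scores for three hypotheses
losses = [0.3, 0.1, 0.6]   # hypothetical losses for the same hypotheses

flat_risk = annealed_risk(scores, losses, gamma=0.0)    # uniform average of the losses
sharp_risk = annealed_risk(scores, losses, gamma=50.0)  # close to the 1-best loss, 0.3
```

Unlike the 1-best loss, this quantity varies smoothly with the scores for any finite gamma, which is what makes gradient-based training possible.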

We introduce a novel decoding procedure for statistical machine translation and other ordering tasks based on a family of Very Large-Scale Neighborhoods, some of which have previously been applied to other NP-hard permutation problems. We significantly generalize these problems by simultaneously considering three distinct sets of ordering costs. We discuss how these costs might apply to MT, and some possibilities for training them. We show how to search and sample from exponentially large neighborhoods using efficient dynamic programming algorithms that resemble statistical parsing. We also incorporate techniques from statistical parsing to improve the runtime of our search. Finally, we report results of preliminary experiments indicating that the approach holds promise.

Many syntactic models in machine translation are channels that transform one tree into another, or synchronous grammars that generate trees in parallel. We present a new model of the translation process: quasi-synchronous grammar (QG). Given a source-language parse tree T_{1}, a QG defines a monolingual grammar that generates translations of T_{1}. The trees T_{2} allowed by this monolingual grammar are inspired by pieces of substructure in T_{1} and aligned to T_{1} at those points. We describe experiments learning quasi-synchronous context-free grammars from bitext. As with other monolingual language models, we evaluate the cross-entropy of QGs on unseen text and show that a better fit to bilingual data is achieved by allowing greater syntactic divergence. When evaluated on a word alignment task, QG matches standard baselines.

We describe finite-state constraint relaxation, a method for applying global constraints, expressed as automata, to sequence model decoding. We present algorithms for both hard constraints and binary soft constraints. On the CoNLL-2004 semantic role labeling task, we report a speedup of at least 16x over a previous method that used integer linear programming.

To model a collection of documents, suppose that each document was generated by a different hidden Markov model or probabilistic finite-state automaton (PFSA). Further suppose that all these PFSAs are similar because they are drawn from a single (but unknown) prior distribution over PFSAs. We wish to infer the prior, obtain smoothed estimates of the individual PFSAs, and reconstruct the hidden paths by which the unknown PFSAs generated their documents.

As an initial application, particularly hard for our model because of its sparse data, we derive an FSA topology from WordNet. For each verb, we construct the “document” of all nouns that have appeared as its object. Our method then learns a better estimate of p(object | verb), as well as which paths in WordNet, and hence which senses of ambiguous objects, tend to be favored. Our method improves 14.6% over Witten-Bell smoothing on the conditional perplexity of objects given the verb, and 27.5% over random on detecting the most common senses of nouns in the SemCor corpus.

In lexicalized phrase-structure or dependency parses, a word's modifiers tend to fall near it in the string. We show that a crude way to use dependency length as a parsing feature can substantially improve parsing speed and accuracy in English and Chinese, with more mixed results on German. We then show similar improvements by imposing hard bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. This simple “vine grammar” formalism has only finite-state power, but a context-free parameterization with some extra parameters for stringing fragments together. We exhibit a linear-time chart parsing algorithm with a low grammar constant.

Note: Consider instead the 2010 book chapter that is an expanded version of this paper.

“Bootstrapping” methods for learning require a small amount of supervision to seed the learning process. We show that it is sometimes possible to eliminate this last bit of supervision, by trying many candidate seeds and selecting the one with the most plausible outcome. We discuss such “strapping” methods in general, and exhibit a particular method for strapping word-sense classifiers for ambiguous words. Our experiments on the Canadian Hansards show that our unsupervised technique is significantly more effective than picking seeds by hand (Yarowsky, 1995), which in turn is known to rival supervised methods.

Weighted deduction with aggregation is a powerful theoretical formalism that encompasses many NLP algorithms. This paper proposes a declarative specification language, Dyna; gives general agenda-based algorithms for computing weights and gradients; briefly discusses Dyna-to-Dyna program transformations; and shows that a first implementation of a Dyna-to-C++ compiler produces code that is efficient enough for real NLP research, though still several times slower than hand-crafted code.

We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization of the function maximized by the Expectation-Maximization algorithm [Dempster et al., 1977]. CE is a natural fit for log-linear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, log-linear dependency grammar models trained using CE can drastically outperform EM-trained generative models on the task of matching human linguistic annotations (the MatchLinguist task). The selection of an implicit negative evidence class—a “neighborhood”—appropriate to a given task has strong implications, but a good neighborhood can target the objective of grammar induction to a specific application.

Note: This version of the paper has minor corrections. Additional details are given in the dissertation of Smith (2006).

Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and named-entity extraction (McCallum and Li, 2003). CRFs are log-linear, allowing the incorporation of arbitrary features into the model. To train on unlabeled data, we require unsupervised estimation methods for log-linear models; few exist. We describe a novel approach, contrastive estimation. We show that the new technique can be intuitively understood as exploiting implicit negative evidence and is computationally efficient. Applied to a sequence labeling problem—POS tagging given a tagging dictionary and unlabeled text—contrastive estimation outperforms EM (with the same feature set), is more robust to degradations of the dictionary, and can largely recover by modeling additional features.
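The idea of implicit negative evidence can be sketched as follows: each observed sentence is compared against a neighborhood of perturbed variants, and training raises the observed sentence's share of the total score within that neighborhood. The neighborhood used here (swapping one adjacent pair of words) and the toy scoring function are assumptions chosen for illustration. In a real model, `total_score` would sum unnormalized log-linear scores over all hidden labelings of the sentence.

```python
import math


def transpose_neighborhood(words):
    """The sentence itself plus every variant with one adjacent pair swapped."""
    neighbors = [tuple(words)]
    for i in range(len(words) - 1):
        swapped = list(words)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        neighbors.append(tuple(swapped))
    return neighbors


def contrastive_log_prob(words, total_score):
    """Log of the observed sentence's score divided by its neighborhood's total score."""
    numerator = total_score(tuple(words))
    denominator = sum(total_score(n) for n in set(transpose_neighborhood(words)))
    return math.log(numerator / denominator)


def toy_score(words):
    """Hypothetical unnormalized score: rewards two hand-picked bigrams."""
    good_bigrams = {("the", "dog"), ("dog", "barks")}
    hits = sum((a, b) in good_bigrams for a, b in zip(words, words[1:]))
    return math.exp(hits)


value = contrastive_log_prob(["the", "dog", "barks"], toy_score)
```

The normalizer ranges only over the neighborhood rather than over all possible sentences, which is what keeps the objective cheap to compute for log-linear models.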

Note: Additional details are given in the dissertation of Smith (2006). In particular, here is the mapping from 45 fine-grained to 17 coarse-grained tags.

Weighted finite-state machines with n tapes describe n-ary rational string relations. The join n-ary relation is very important in applications. It is shown how to compute it via a simpler operation, the auto-intersection. Join and auto-intersection generally do not preserve rationality. We define a class of triples (A,i,j) such that the auto-intersection of the machine A on tapes i and j can be computed by a delay-based algorithm. We point out how to extend this class and hope that it is sufficient for many practical applications.

Note: Dreyer & Eisner (2009) and Paul & Eisner (2012) make some progress on the uncomputable problem of joining or auto-intersecting FSMs. The first paper gives an approximate algorithm for the real or tropical semiring; the second gives an algorithm for the tropical semiring that is exact if it terminates.

Integrated Sensing and Processing Decision Trees (ISPDTs) were introduced in [1] as a tool for supervised classification of high-dimensional data. In this paper, we consider the problem of unsupervised classification, through a recursive construction of ISPDTs, where at each internal node the data (i) are split into clusters, and (ii) are transformed independently of other clusters, guided by some optimization objective. We show that the maximization of information-theoretic quantities such as mutual information and alpha-divergences is theoretically justified for growing ISPDTs, assuming that each data point is generated by a finite-memory random process given the class label. Furthermore, we present heuristics that perform the maximization in a greedy manner, and we demonstrate their effectiveness with empirical results from multi-spectral imaging.

A finite-state machine with n tapes describes a rational (or regular) relation on n strings. It is more expressive than a relational database table with n columns, which can only describe a finite relation.

We describe some basic operations on n-ary rational relations and propose notation for them. (For generality we give the semiring-weighted case in which each tuple has a weight.) Unfortunately, the join operation is problematic: if two rational relations are joined on more than one tape, it can lead to non-rational relations with undecidable properties. We recast join in terms of “auto-intersection” and illustrate some cases in which difficulties arise. We close with the hope that partial or restricted algorithms may be found that are still powerful enough to have practical use.

Note: Dreyer & Eisner (2009) and Paul & Eisner (2012) make some progress on the uncomputable problem of joining or auto-intersecting FSMs. The first paper gives an approximate algorithm for the real or tropical semiring; the second gives an algorithm for the tropical semiring that is exact if it terminates.

We present the first version of a new declarative programming language. Dyna has many uses but was designed especially for rapid development of new statistical NLP systems. A Dyna program is a small set of equations, resembling Prolog inference rules, that specify the abstract structure of a dynamic programming algorithm. It compiles into efficient, portable, C++ classes that can be easily invoked from a larger application. By default, these classes run a generalization of agenda-based parsing, prioritizing the partial parses by some figure of merit. The classes can also perform an exact backward (outside) pass in the service of parameter training. The compiler already knows several implementation tricks, algorithmic transforms, and numerical optimization techniques. It will acquire more over time: we intend for it to generalize and encapsulate best practices, and serve as a testbed for new practices. Dyna is now being used for parsing, machine translation, morphological analysis, grammar induction, and finite-state modeling.
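
To give a flavor of agenda-based weighted deduction, here is a minimal Python sketch (not Dyna syntax, and not the compiler's output). It solves the two-rule program "path(source) min= 0; path(V) min= path(U) + edge(U,V)" in the min-plus semiring, popping items from the agenda in order of weight. The function name `agenda_min_plus` and the example graph are hypothetical, and the sketch assumes non-negative edge costs so that the first pop of an item is final.

```python
import heapq

def agenda_min_plus(edges, source):
    """Agenda-based solver for: path(source) min= 0; path(V) min= path(U) + edge(U,V).

    edges maps each item U to a list of (V, cost) pairs; costs must be >= 0.
    """
    chart = {}                     # item -> final (best) weight
    agenda = [(0.0, source)]       # priority queue of (weight, item) updates
    while agenda:
        weight, item = heapq.heappop(agenda)
        if item in chart:          # a better-or-equal weight was already finalized
            continue
        chart[item] = weight
        # Fire every rule instance that has the popped item as its antecedent.
        for consequent, cost in edges.get(item, []):
            if consequent not in chart:
                heapq.heappush(agenda, (weight + cost, consequent))
    return chart

# Hypothetical example graph.
edges = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)]}
chart = agenda_min_plus(edges, "a")
```

Other semirings and prioritization schemes need more care (for example, re-propagating changed values), which is the kind of bookkeeping a declarative language can hide from the programmer.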

Note: Consider the longer 2005 version of this paper instead.

Exploiting unannotated natural language data is hard largely because unsupervised parameter estimation is hard. We describe deterministic annealing (Rose et al., 1990) as an appealing alternative to the Expectation-Maximization algorithm (Dempster et al., 1977). Seeking to avoid search error, DA begins by globally maximizing an easy concave function and maintains a local maximum as it gradually morphs the function into the desired non-concave likelihood function. Applying DA to parsing and tagging models is shown to be straightforward; significant improvements over EM are shown on a part-of-speech tagging task. We describe a variant, skewed DA, which can incorporate a good initializer when it is available, and show significant improvements over EM on a grammar induction task.
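
The core of the annealing idea can be sketched in a few lines of Python. This assumes the annealed E-step raises each joint probability to a power beta and renormalizes, with beta increased gradually toward 1; the function name `annealed_posterior` and the numbers are hypothetical.

```python
def annealed_posterior(joint_probs, beta):
    """Posterior over hidden structures, flattened by exponent beta in (0, 1].

    joint_probs[k] is p(x, y_k) for the k-th candidate hidden structure y_k.
    beta near 0 gives a nearly uniform posterior; beta = 1 gives the EM posterior.
    """
    powered = [p ** beta for p in joint_probs]
    z = sum(powered)
    return [p / z for p in powered]

# Hypothetical joint probabilities for two candidate analyses of one sentence.
joint = [0.1, 0.3]
flat = annealed_posterior(joint, 0.0)    # start of the schedule
sharp = annealed_posterior(joint, 1.0)   # end of the schedule (ordinary E-step)
```

At small beta every analysis gets nearly equal credit, which corresponds to the easy concave starting point; as beta rises, the posterior sharpens toward the one EM would use.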

Note: Additional details are given in the dissertation of Smith (2006).

Language modeling, a technology found in many computerized speech recognition systems, can also be used in a text editor to implement an automated phrase completion feature that significantly reduces the number of keystrokes required to generate a radiology report, thereby increasing typing speed.

Radiology reports have especially low entropy, which allows prediction of multi-word phrases. Our system therefore chooses an optimal phrase length for each prediction, using Bellman-style dynamic programming to minimize the expected cost of typing the rest of the document. This computation considers what the user is likely to type in the future, and how many keystrokes it will take, considering the future effect of phrase completion as well.

Often one may wish to learn a tree-to-tree mapping, training it on unaligned pairs of trees, or on a mixture of trees and strings. Unlike previous statistical formalisms (limited to isomorphic trees), synchronous tree substitution grammar allows local distortion of the tree topology. We reformulate it to permit dependency trees, and sketch EM/Viterbi algorithms for alignment, training, and decoding.

Note: At a reviewer's request, the paper describes TSG more formally than in the previous literature, which might be helpful for some readers and implementers.

Previous work on minimizing weighted finite-state automata (including transducers) is limited to particular types of weights. We present efficient new minimization algorithms that apply much more generally, while being simpler and about as fast.

We also point out theoretical limits on minimization algorithms. We characterize the kind of “well-behaved” weight semirings where our methods work. Outside these semirings, minimization is not well-defined (in the sense of producing a unique minimal automaton), and even finding the minimum number of states is in general NP-complete and inapproximable.

Weighted finite-state transducers suffer from the lack of a training algorithm. Training is even harder for transducers that have been assembled via finite-state operations such as composition, minimization, union, concatenation, and closure, as this yields tricky parameter tying. We formulate a “parameterized FST” paradigm and give training algorithms for it, including a general bookkeeping trick (“expectation semirings”) that cleanly and efficiently computes expectations and gradients.
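
The bookkeeping trick can be illustrated with a small Python sketch of a first-order expectation semiring. Each weight is a pair (p, r), where p is a probability and r is p times the value accumulated so far. The class name `ExpWeight` and the helper `arc` are hypothetical and do not correspond to any library's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExpWeight:
    """Element (p, r) of a first-order expectation semiring."""
    p: float  # probability mass
    r: float  # probability-weighted total of the quantity of interest

    def __add__(self, other):
        # Semiring "plus": pool alternative paths.
        return ExpWeight(self.p + other.p, self.r + other.r)

    def __mul__(self, other):
        # Semiring "times": concatenate paths (product rule on r).
        return ExpWeight(self.p * other.p, self.p * other.r + other.p * self.r)

ONE = ExpWeight(1.0, 0.0)

def arc(prob, value):
    """Weight of an arc with probability prob that contributes value."""
    return ExpWeight(prob, prob * value)

# Two hypothetical paths through a machine.
path_a = ONE * arc(0.5, 1.0) * arc(0.4, 0.0)   # probability 0.2, value 1
path_b = ONE * arc(0.5, 0.0) * arc(0.6, 2.0)   # probability 0.3, value 2
total = path_a + path_b
expected = total.r / total.p                   # expected value given acceptance
```

Because the pair obeys the semiring laws, any generic path-sum algorithm for weighted automata can compute `total` without enumerating paths.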

Note: Expectation semirings are included in the excellent OpenFST library. I believe that Zhifei Li, Ariya Rastrow, Markus Dreyer, and Roy Tromble have all written implementations that work with OpenFST or other packages. Some of these support the higher-order expectation semirings of Li & Eisner (2009).

This paper ties up some loose ends in finite-state Optimality Theory. First, it discusses how to perform comprehension under Optimality Theory grammars consisting of finite-state constraints. Comprehension has not been much studied in OT; we show that unlike production, it does not always yield a regular set, making finite-state methods inapplicable. However, after giving a suitably flexible presentation of OT, we show carefully how to treat comprehension under recent variants of OT in which grammars can be compiled into finite-state transducers. We then unify these variants, showing that compilation is possible if all components of the grammar are regular relations, including the harmony ordering on scored candidates. A side benefit of our construction is a far simpler implementation of directional OT (Eisner, 2000).

Note: A related paper was published independently by Gerhard Jäger at the same time.

This paper offers a detailed lesson plan on the forward-backward algorithm. The lesson is taught from a live, commented spreadsheet that implements the algorithm and graphs its behavior on a whimsical toy example. By experimenting with different inputs, one can help students develop intuitions about HMMs in particular and Expectation Maximization in general. The spreadsheet and a coordinated follow-up assignment are available.
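
For readers who prefer code to a spreadsheet, here is a minimal Python sketch of the forward-backward computation for an HMM. It is not a transcription of the spreadsheet: the two states, the probabilities, and the observation sequence are made-up numbers, and there is no explicit stop probability.

```python
def forward_backward(obs, states, start, trans, emit):
    """Return (posterior, z): per-position state posteriors and p(obs)."""
    n = len(obs)
    # alpha[t][s] = p(obs[0..t], state at t is s)
    alpha = [{} for _ in range(n)]
    for s in states:
        alpha[0][s] = start[s] * emit[s][obs[0]]
    for t in range(1, n):
        for s in states:
            alpha[t][s] = emit[s][obs[t]] * sum(
                alpha[t - 1][r] * trans[r][s] for r in states)
    # beta[t][s] = p(obs[t+1..] | state at t is s)
    beta = [{} for _ in range(n)]
    for s in states:
        beta[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            beta[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r] for r in states)
    z = sum(alpha[n - 1][s] for s in states)
    posterior = [{s: alpha[t][s] * beta[t][s] / z for s in states}
                 for t in range(n)]
    return posterior, z

# Hypothetical two-state model with observations drawn from {1, 2, 3}.
states = ("H", "C")
start = {"H": 0.5, "C": 0.5}
trans = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.2, "C": 0.8}}
emit = {"H": {1: 0.1, 2: 0.2, 3: 0.7}, "C": {1: 0.7, 2: 0.2, 3: 0.1}}
obs = (3, 1, 3)
posterior, z = forward_backward(obs, states, start, trans, emit)
```

The posteriors are the expected counts that an EM step would use to re-estimate the model; changing `obs` and watching them shift is the kind of experiment the lesson plan has in mind.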

This paper proposes a novel class of PCFG parameterizations that support linguistically reasonable priors over PCFGs. To estimate the parameters is to discover a notion of relatedness among context-free rules such that related rules tend to have related probabilities. The prior favors grammars in which the relationships are simple to describe and have few major exceptions. A basic version that bases relatedness on weighted edit distance yields superior smoothing of grammars learned from the Penn Treebank (20% reduction of rule perplexity over the best previous method).

Note: See also the linguistic perspective in Eisner (2002). Additional details are given in my dissertation.

In the Bayesian framework, a language learner should seek a grammar that explains observed data well and is also a priori probable. This paper proposes such a measure of prior probability. Indeed it develops a full statistical framework for lexicalized syntax. The learner's job is to discover the system of probabilistic transformations (often called lexical redundancy rules) that underlies the patterns of regular and irregular syntactic constructions listed in the lexicon. Specifically, the learner discovers what transformations apply in the language, how often they apply, and in what contexts. It considers simpler systems of transformations to be more probable a priori. Experiments show that the learned transformations are more effective than previous statistical models at predicting the probabilities of lexical entries, especially those for which the learner had no direct evidence.

Note: See also the engineering perspective in Eisner (2002). Additional details are given in my dissertation.

This brief introduction, from the editor of the special section, reviews why and how statistical and linguistic approaches to language can help each other. It also asks how statistical modeling fits into the broader program of cognitive science.

This paper gives the first EM algorithm for general probabilistic finite-state transducers (with epsilon). Furthermore, the approach is powerful enough to fit machines' parameters even after the machines are combined by operations of the finite-state calculus, such as composition and minimization. This allows an expert to build a parameterized transducer in any way that is appropriate to the domain, and then fit the parameters automatically from data. Many standard algorithms are special cases, and there are many further applications. Yet the algorithm remains surprisingly simple because all the difficult work is subcontracted to existing algorithms for semiring-weighted automata. The trick is to use a novel semiring.

Note: Extended abstract. Consider the longer 2002 version instead.

Probabilistic parsing requires a lexicon that specifies each word's syntactic preferences in terms of probabilities. To estimate these probabilities for words that were poorly observed during training, this thesis assumes the existence of arbitrarily powerful transformations (also known to linguists as lexical redundancy rules or metarules) that can add, delete, retype or reorder the argument and adjunct positions specified by a lexical entry.

In a given language, some transformations apply frequently and others rarely. We describe how to estimate the rates of the transformations from a sample of lexical entries. More deeply, we learn which properties of a transformation increase or decrease its rate in the language. As a result, we can smooth the probabilities of lexical entries. Given enough direct evidence about a lexical entry's probability, our Bayesian approach trusts the evidence; but when less evidence or no evidence is available, it relies more on the transformations' rates to guess how often the entry will be derived from related entries.

Abstractly, the proposed “transformation models” are probability distributions that arise from graph random walks with a log-linear parameterization. A domain expert constructs the parameterized graph, and a vertex is likely according to whether random walks tend to halt at it. Transformation models are suited to any domain where “related” events (as defined by the graph) may have positively covarying probabilities. Such models admit a natural prior that favors simple regular relationships over stipulative exceptions. The model parameters can be locally optimized by gradient-based methods or by Expectation-Maximization. Exact algorithms (matrix inversion) and approximate ones (relaxation) are provided, with optimizations. Variations on the idea are also discussed.

We compare the new technique empirically to previous techniques from the probabilistic parsing literature, using comparable features, and obtain a 20% perplexity reduction (similar to doubling the amount of training data). Some of this reduction is shown to stem from the transformation model's ability to match observed probabilities, and some from its ability to generalize. Model averaging yields a final 24% perplexity reduction.

We consider the problem of ranking a set of OT constraints in a manner consistent with data. (1) We speed up Tesar and Smolensky's RCD algorithm to be linear in the number of constraints. This finds a ranking so each attested form x_{i} beats or ties a particular competitor y_{i}. (2) We also generalize RCD so each x_{i} beats or ties all possible competitors.

Alas, neither the more realistic version of ranking in (2), nor even generation, has any polynomial algorithm unless P=NP! That is, one cannot improve qualitatively upon brute force: (3) Merely checking that a single (given) ranking is consistent with given forms is coNP-complete if the surface forms are fully observed and Δ_{2}^{p}-complete if not. Indeed, OT generation is OptP-complete. (4) As for ranking, determining whether any consistent ranking exists is coNP-hard (but in Δ_{2}^{p}) if the forms are fully observed, and Σ_{2}^{p}-complete if not.

Finally, we show (5) that generation and ranking are easier in derivational theories: in P and NP-complete, respectively.
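
For orientation, the basic Recursive Constraint Demotion procedure that result (1) speeds up can be sketched in Python as follows. This is the plain textbook version, not the paper's linear-time variant, and it assumes each winner must strictly beat its competitor. The function name `rcd` and the data encoding (each pair lists the constraints preferring the winner and those preferring the loser) are hypothetical.

```python
def rcd(constraints, pairs):
    """Stratify constraints so that every winner beats its loser.

    pairs is a list of (prefers_winner, prefers_loser) sets of constraint names.
    Returns a list of strata (sets), highest-ranked first, or None if no
    consistent ranking exists.
    """
    remaining = set(constraints)
    active = list(pairs)           # pairs not yet explained by a ranked constraint
    strata = []
    while remaining:
        # A constraint may be ranked now only if it prefers no active loser.
        loser_preferring = set().union(*(losers for _, losers in active))
        stratum = remaining - loser_preferring
        if not stratum:
            return None            # every remaining constraint favors some loser
        strata.append(stratum)
        remaining -= stratum
        # Pairs decided in the winner's favor by this stratum are explained.
        active = [(w, l) for w, l in active if not (w & stratum)]
    return strata if not active else None

# Hypothetical data: A must outrank B, and B must outrank C.
ranking = rcd(["A", "B", "C"], [({"A"}, {"B"}), ({"B"}, {"C"})])
conflict = rcd(["A", "B"], [({"A"}, {"B"}), ({"B"}, {"A"})])
```

Each pass rescans all active pairs, so this sketch is quadratic in the worst case.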

Weighted finite-state constraints that can count unboundedly many violations make Optimality Theory more powerful than finite-state transduction (Frank and Satta, 1998). This result is empirically and computationally awkward. We propose replacing these unbounded constraints, as well as non-finite-state Generalized Alignment constraints, with a new class of finite-state directional constraints. We give linguistic applications, results on generative power, and algorithms to compile grammars into transducers.

Note: This paper makes the linguistic case for directional OT, and gives an interesting transducer construction. However, Eisner (2002) shows that the mathematical result can actually be obtained as a special case of a more general theorem that is based on a simpler algebraic construction.

This book review also sketches why OT is interesting to computational linguists, and how it relates to other approaches for combining non-orthogonal surface features, such as maximum-entropy modeling.

This paper introduces weighted bilexical grammars, a formalism in which individual lexical items, such as verbs and their arguments, can have idiosyncratic selectional influences on each other. Such “bilexicalism” has been a theme of much current work in parsing. The new formalism can be used to describe bilexical approaches to both dependency and phrase-structure grammars, and a slight modification yields link grammars. Its scoring approach is compatible with a wide variety of probability models.

The obvious parsing algorithm for bilexical grammars (used by most authors) takes time O(n^{5}). A more efficient O(n^{3}) method is exhibited. The new algorithm has been implemented and used in a large parsing experiment (Eisner 1996). We also give a useful extension to the case where the parser must undo a stochastic transduction that has altered the input.
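
The cubic-time idea can be illustrated with a Python sketch of the widely used span-based dynamic program for projective dependency parsing. This is a simplified special case, not the full weighted bilexical grammar algorithm: it assumes the score of a parse is the sum of independent arc scores, returns only the best score, and lets the ROOT token at position 0 take several dependents. The function name `eisner_best_score` and the score matrices are hypothetical.

```python
def eisner_best_score(score):
    """Best total score of a projective dependency tree rooted at token 0.

    score[h][m] is the score of an arc from head h to modifier m.
    Direction index: 0 = head at the right end, 1 = head at the left end.
    """
    n = len(score)
    neg = float("-inf")
    # comp: "complete" spans whose head has all its dependents on the inner side.
    comp = [[[0.0, 0.0] if i == j else [neg, neg] for j in range(n)]
            for i in range(n)]
    # inc: "incomplete" spans whose two endpoints are joined by an arc.
    inc = [[[neg, neg] for _ in range(n)] for _ in range(n)]
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # Join two complete half-spans that face each other, then add an arc.
            best = max(comp[i][k][1] + comp[k + 1][j][0] for k in range(i, j))
            inc[i][j][0] = best + score[j][i]   # arc j -> i
            inc[i][j][1] = best + score[i][j]   # arc i -> j
            # Extend an incomplete span with a complete one on the far side.
            comp[i][j][0] = max(comp[i][k][0] + inc[k][j][0] for k in range(i, j))
            comp[i][j][1] = max(inc[i][k][1] + comp[k][j][1]
                                for k in range(i + 1, j + 1))
    return comp[0][n - 1][1]

# Hypothetical scores for ROOT plus two words.
score = [[0, 1, 2],
         [0, 0, 5],
         [0, 3, 0]]
best = eisner_best_score(score)
```

Each chart cell is indexed by two positions and filled by a loop over one split point, which gives the O(n^{3}) running time. Tracking both heads of a span explicitly, as the obvious algorithm does, costs O(n^{5}).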

Note: This book chapter is an improved and extended version of Eisner (1997).

This paper points out some computational inefficiencies of standard TAG parsing algorithms when applied to LTAGs. We propose a novel algorithm with an asymptotic improvement, from O(n^{8}g^{2}t) to O(n^{6} max(n,g) gt), where n is the input length and g,t are grammar constants that are independent of vocabulary size.

Several recent stochastic parsers use bilexical grammars, where each word type idiosyncratically prefers particular complements with particular head words. We present O(n^{4}) parsing algorithms for two bilexical formalisms (see title), improving the previous upper bounds of O(n^{5}). Also, for a common special case that was known to allow O(n^{3}) parsing (Eisner, 1997), we present an O(n^{3}) algorithm with an improved grammar constant.

Note: The slides include experimental speed comparisons that were not in the paper.

A universal theory of human phonology should be clearly specified and falsifiable. To treat Optimality Theory (OT) as a real proposal, one must put some cards on the table: What kinds of constraints may an OT grammar state? And how can anyone tell what data this grammar predicts, without constructing infinite tableaux?

In this talk I'll motivate a restrictive formalization of OT that allows just two types of simple, local constraint. Gen freely proposes gestures and prosodic constituents; the constraints try to force these to coincide or not coincide temporally. An efficient algorithm exists to find the optimal candidate.

I will argue that despite its simplicity, primitive OT is expressive enough to describe and unify most of the work in OT phonology. However, it is provably more constrained: because it is unable to mimic deeply non-local mechanisms like Generalized Alignment, it forces a new and arguably better account of metrical stress typology.

Finally, I will sketch a more radical extension, directional evaluation, which changes how a constraint ranks candidates. This change brings back some of the descriptive convenience of Generalized Alignment, but it also constrains OT grammars to describe only regular relations, which is linguistically and computationally desirable.

Hayes (1995) gives a typology of the world's metrical stress systems, which is marked by several striking asymmetries (parametric gaps). Most work on metrical stress within Optimality Theory (OT) has adopted this typology without explaining the gaps. Moreover, the OT versions use uncomfortably non-local constraints (Align, FootForm, FtBin).

This paper presents a rather different and in some ways more explanatory typology of stress, couched in the framework of primitive Optimality Theory (OTP), which allows only primitive, radically local constraints. For example, Generalized Alignment is not allowed. The paper presents a single, coherent system of rerankable constraints that yields the basic facts about iambic and trochaic foot form, iambic lengthening, quantity sensitivity, unbounded feet, simple word-initial and word-final stress, directionality of footing, syllable (and foot) extrametricality, degenerate feet, and word-level stress.

The metrical part of the account rests on the following intuitions: (a) iambs are special because syllable structure allows them to lengthen their strong ends; (b) directionality of footing is really the result of local lapse avoidance; (c) any lapses are forced by a (localist) generalization of right extrametricality; (d) although degenerate feet are absolutely banned, primary stress does not require a foot in all languages. An interesting prediction of (b) and (c) is that left-to-right trochees should be incompatible with extrametricality. This prediction is robustly confirmed in Hayes.

This paper introduces weighted bilexical grammars, a formalism in which individual lexical items, such as verbs and their arguments, can have idiosyncratic selectional influences on each other. Such “bilexicalism” has been a theme of much current work in parsing. The new formalism can be used to describe bilexical approaches to both dependency and phrase-structure grammars, and a slight modification yields link grammars. Its scoring approach is compatible with a wide variety of probability models.

The obvious parsing algorithm for bilexical grammars (used by most authors) takes time O(n^{5}). A more efficient O(n^{3}) method is exhibited. The new algorithm has been implemented and used in a large parsing experiment (Eisner 1996). We also give a useful extension to the case where the parser must undo a stochastic transduction that has altered the input.

Note: Consider instead the 2000 book chapter that is an expanded version of this paper.

This paper introduces computational linguists to primitive Optimality Theory (OTP), a clean and linguistically motivated formalization of OT. OTP specifies the class of autosegmental representations, the universal generator Gen, and the two simple families of permissible constraints. It is therefore possible to study its computational generation, comprehension, and learning properties.

Some results on generation are presented. Unlike less restricted theories using Generalized Alignment, OTP grammars can derive optimal surface forms with finite-state methods adapted from Ellison (1994). Unfortunately these methods take time exponential in the size of the grammar. Indeed the generation problem is shown to be NP-complete in this sense. However, techniques are discussed for making Ellison's approach fast and practical in the typical case, including a simple trick that alone provides a 100-fold speedup on a grammar fragment of moderate size. One avenue for future improvements is a new finite-state notion, “factored automata,” where regular languages are represented compactly via formal intersections of FSAs.

The classic “easy” optimization problem is to find the minimum spanning tree (MST) of a connected, undirected graph. Good polynomial-time algorithms have been known since 1930. Over the last 10 years, however, the standard O(m log n) results of Kruskal and Prim have been improved to linear or near-linear time. The new methods use several tricks of general interest in order to reduce the number of edge weight comparisons and the amount of other work. This tutorial reviews those methods, building up strategies step by step so as to expose the insights behind the algorithms. Implementation details are clarified, and some generalizations are given.

Specifically, the paper attempts to shed light on the classical algorithms of Kruskal, Prim, and Boruvka; the improved approach of Gabow, Galil, and Spencer, which takes time only O(m log(log^{*} n - log^{*} m/n)); and the randomized O(m) algorithm of Karger, Klein, and Tarjan, which relies on an O(m) MST verification algorithm by King. It also considers Frederickson's method for maintaining an MST in time O(sqrt(m)) per change to the graph. An appendix explains Fibonacci heaps.
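
As a baseline for the faster methods, here is a minimal Python sketch of Kruskal's classical algorithm with a union-find structure. It is the standard textbook O(m log n) version, not any of the improved algorithms surveyed; the function name `kruskal` and the example graph are hypothetical.

```python
def kruskal(num_vertices, edges):
    """Return the MST edges of a connected undirected graph.

    edges is a list of (weight, u, v) tuples with vertices 0..num_vertices-1.
    """
    parent = list(range(num_vertices))
    rank = [0] * num_vertices

    def find(v):
        # Find the component representative, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(edges):      # sorting dominates: O(m log m)
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue                        # edge would close a cycle
        if rank[root_u] < rank[root_v]:     # union by rank
            root_u, root_v = root_v, root_u
        parent[root_v] = root_u
        if rank[root_u] == rank[root_v]:
            rank[root_u] += 1
        tree.append((weight, u, v))
    return tree

# Hypothetical 4-vertex graph.
edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
mst = kruskal(4, edges)
total_weight = sum(w for w, _, _ in mst)
```

The sort is the bottleneck here, which is why the faster algorithms concentrate on reducing the number of edge weight comparisons.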

This technical report is an appendix to Eisner (1996): it gives superior experimental results that were reported only in the talk version of that paper. Eisner (1996) trained three probability models on a small set of about 4,000 conjunction-free, dependency-grammar parses derived from the Wall Street Journal section of the Penn Treebank, and then evaluated the models on a held-out test set, using a novel O(n^{3}) parsing algorithm.

The present paper describes some details of the experiments and repeats them with a larger training set of 25,000 sentences. As reported at the talk, the more extensive training yields greatly improved performance. Nearly half the sentences are parsed with no misattachments; two-thirds are parsed with at most one misattachment.

Of the models described in the original written paper, the best score is still obtained with the generative (top-down) “model C.” However, slightly better models are also explored, in particular, two variants on the comprehension (bottom-up) “model B.” The better of these has an attachment accuracy of 90%, and (unlike model C) tags words more accurately than the comparable trigram tagger. Differences are statistically significant.

If tags are roughly known in advance, search error is all but eliminated and the new model attains an attachment accuracy of 93%. We find that the parser of Collins (1996), when combined with a highly-trained tagger, also achieves 93% when trained and tested on the same sentences. Similarities and differences are discussed.

After presenting a novel O(n^{3}) parsing algorithm for dependency grammar, we develop three contrasting ways to stochasticize it. We propose (a) a lexical affinity model where words struggle to modify each other, (b) a sense tagging model where words fluctuate randomly in their selectional preferences, and (c) a generative model where the speaker fleshes out each word's syntactic and conceptual structure without regard to the implications for the hearer. We also give preliminary empirical results from evaluating the three models' parsing performance on annotated Wall Street Journal training text (derived from the Penn Treebank). In these results, the generative (i.e., top-down) model performs significantly better than the others, and does about equally well at assigning part-of-speech tags.

Under categorial grammars that have powerful rules like composition, a simple n-word sentence can have exponentially many parses that are semantically equivalent. Generating all parses is inefficient and obscures whatever true semantic ambiguities are in the input. This paper addresses the problem for a fairly general form of Combinatory Categorial Grammar, by means of an efficient, correct, and easy to implement normal-form parsing technique. The parser is proved to find exactly one parse in each semantic equivalence class of allowable parses; that is, spurious ambiguity (as carefully defined) is shown to be both safely and completely eliminated.

Note: The example "intentionally knock twice," mentioned on page 4, should have been credited to Schabes & Shieber (1994).

English any is often treated as two unrelated or semi-related lexemes: a negative-polarity item, NPI any, and a universal quantifier, free-choice (FC) any. The latter is idiosyncratic in that it must appear in the scope of a licenser, but moves to take scope immediately over that licenser at LF. I give a semantic account of FC any as an “irrealis” quantifier. This explains some curious (new and old) facts about FC any's semantics and licensing environments. Furthermore, it predicts that negation and other NPI-licensing environments should license FC any, which would then have just the same meaning as NPI any (pace Ladusaw (1979), Carlson (1980)). Thus, we may unify the two any's as a single universal quantifier, as originally proposed by Lasnik (1972) and others. Such an account implies that NPI any moves over negation at LF—which is confirmed by scope tests. It also explains some well-known problems concerning NPI any in non-downward-entailing environments and under sorry vs. glad.

We describe an approach to training a statistical parser from a bracketed corpus, and demonstrate its use in a software testing application that translates English specifications into an automated testing language. A grammar is not explicitly specified; the rules and contextual probabilities of occurrence are automatically generated from the corpus. The parser is extremely successful at producing and identifying the correct parse, and nearly deterministic in the number of parses that it produces. To compensate for undertraining, the parser also uses general, linguistic subtheories which aid in guessing some types of novel structures.

We describe a general approach to the probabilistic parsing of context-free grammars. The method integrates context-sensitive statistical knowledge of various types (e.g., syntactic and semantic) and can be trained incrementally from a bracketed corpus. We introduce a variant of the GHR context-free recognition algorithm, and explain how to adapt it for efficient probabilistic parsing. On a real-world corpus of sentences from software testing documents, with 23 possible parses for a sentence of average length, the system accurately finds the correct parse in 99% of cases, while producing only 1.02 parses per sentence. Significantly, the success rate would be only 66% without the semantic statistics.

“Winner take all” electoral systems are not fully representative. Unfortunately, the ANC's proposed system of proportional representation is not much better. Because it ensconces party politics, it is only slightly more representative, and poses a serious threat to accountability.

Many modern students of democracy favor proportional representation through the Single Transferable Vote (STV). In countries with high illiteracy, however, this system may be unworkable.

This paper proposes a practical modification of STV. In the modified system, each citizen votes for only one candidate. Voters need not specify their second, third, and fourth choices. Instead, each candidate specifies his or her second, third, and fourth choices. The modified system is no more difficult for voters than current proposals—and it provides virtually all the benefits of STV, together with some new ones.

This talk for a general audience gives a sketch of what the field of cognitive science is about. In its latter half, it turns to the philosophical question of defining intelligence, and proposes a non-operational alternative to the Turing Test.

A broad approach is developed for training dynamical behaviors in connectionist networks. General recurrent networks are powerful computational devices, necessary for difficult tasks like constraint satisfaction and temporal processing. These tasks are discussed here in some detail. From both theoretical and empirical considerations, it is concluded that such tasks are best addressed by recurrent networks that operate continuously in time—and further, that effective learning rules for these continuous-time networks must be able to prescribe their dynamical properties. A general class of such learning rules is derived and tested on simple problems. Where existing learning algorithms for recurrent and non-recurrent networks only attempt to train a network's position in activation space, the models presented here can also explicitly and successfully prescribe the nature of its movement through activation space.

We present a method of using natural language processing (NLP) techniques to extract information from online news feeds and then using the information so extracted to predict changes in stock prices or volatilities. These predictions can be used to make profitable trading strategies. More specifically, company names can be recognized and simple templates describing company actions can be automatically filled using parsing or pattern matching on words in or near the sentence containing the company name. These templates can be clustered into groups which are statistically correlated with changes in the stock prices.

A communications method utilizes memory areas to buffer portions of the media streams. These buffer areas are shared by user applications, with the desirable consequence of reducing workload for the server system distributing media to the user (client) applications. The preferred method allows optimal balancing of buffering delays and server loads, as well as optimal choice of buffer contents for the shared memory buffers.

Secure data interchange.
Frederick S. M. Herz, Walter Paul Labys, David C. Parkes, Sampath Kannan, and Jason M. Eisner (2000).
U.S. Patent 7,630,986, issued 2009. [ patent | scholar | bib ]

A secure data interchange system enables information about bilateral and multilateral interactions between multiple persistent parties to be exchanged and leveraged within an environment that uses a combination of techniques to control access to information, release of information, and matching of information back to parties. Access to data records can be controlled using an associated price rule. A data owner can specify a price for different types and amounts of information access.

The system for the automatic determination of customized prices and promotions automatically constructs product offers tailored to individual shoppers, or types of shopper, in a way that attempts to maximize the vendor's profits. These offers are represented digitally. They are communicated either to the vendor, who may act on them as desired, or to an on-line computer shopping system that directly makes such offers to shoppers. Largely by tracking the behavior of shoppers, the system accumulates extensive profiles of the shoppers and the offers that they consider. The system can then select, present, price, and promote goods and services in ways that are tailored to an individual consumer. Likely shoppers can be identified, then enticed with the most effective visual and textual advertisements; deals can be offered to them, either on-line or off-line; detailed product information screens can be subtly rearranged from one type of shopper to the next. Furthermore, when a product can be tailored to a particular shopper, a general technique or expert system can offer each consumer an appropriately customized product.

An adaptive compression technique is described that improves on Lempel-Ziv (LZ) compression techniques, as applied both to reducing required storage space and to reducing the transmission time associated with transferring data from point to point. Pre-filled compression dictionaries address a problem with prior Lempel-Ziv techniques: the compression software starts with an empty compression dictionary, so little compression is achieved until the dictionary has been filled with sequences common in the data being compressed. In accordance with the invention, the compression dictionary is pre-filled, prior to the beginning of the data compression, with letter sequences, words, and/or phrases frequent in the domain from which the data being compressed is drawn. The letter sequences, words, and/or phrases used in the pre-filled compression dictionary may be determined by statistically sampling text data from the same genre of text. Multiple pre-filled dictionaries may be utilized by the compression software at the beginning of the compression process, where the most appropriate dictionary for maximum compression is identified and used to compress the current data. These modifications can be made to any of the known Lempel-Ziv compression techniques based on the variants detailed in the 1977 and 1978 articles by Ziv and Lempel.
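
To make the idea concrete, here is a minimal sketch of dictionary pre-filling on top of LZW, a well-known LZ78-style variant. It is not the patented implementation: the function names, the choice to seed the table with every prefix of each phrase (so that greedy matching can reach the full phrase), and the use of output code count as a rough proxy for compressed size are all assumptions made for illustration.

```python
def build_dictionary(phrases=()):
    """Start from all single bytes, then pre-fill with domain phrases.

    Every prefix of each phrase is added so the table stays prefix-closed,
    which LZW's greedy longest-match loop relies on.
    """
    table = {bytes([i]): i for i in range(256)}
    for phrase in phrases:
        for end in range(2, len(phrase) + 1):
            prefix = phrase[:end]
            if prefix not in table:
                table[prefix] = len(table)
    return table


def lzw_compress(data, phrases=()):
    """Return the list of LZW codes for `data` (bytes)."""
    table = build_dictionary(phrases)
    codes, current = [], b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = len(table)  # learn a new sequence
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes, phrases=()):
    """Invert lzw_compress; must be given the same pre-filled phrases."""
    inverse = {code: seq for seq, code in build_dictionary(phrases).items()}
    if not codes:
        return b""
    previous = inverse[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in inverse:
            entry = inverse[code]
        elif code == len(inverse):
            # The code refers to the entry being defined in this very step.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        inverse[len(inverse)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


def pick_best_dictionary(data, candidates):
    """candidates: {name: phrases}. Pick the one yielding the fewest codes."""
    return min(candidates,
               key=lambda name: len(lzw_compress(data, candidates[name])))


message = b"the stock price rose; the stock price fell"
finance = [b"the stock price "]
plain_codes = lzw_compress(message)
prefilled_codes = lzw_compress(message, finance)
```

On this short message the pre-filled table lets the compressor emit a single code for the very first occurrence of the phrase, whereas the empty-start version has to spell it out byte by byte before it can benefit. The decompressor must be seeded with the same phrases, so in practice both ends would need to agree on which dictionary was used.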

This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment. In particular, the system automatically constructs a “target profile” for each target object in the electronic media, based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles. It also constructs a “target profile interest summary” for each user, which describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized, rank-ordered listing of the target objects most likely to be of interest to each user. The user can then select from among these potentially relevant target objects, which the system selected automatically from the plethora of target objects profiled on the electronic media. Users' target profile interest summaries can also be used to efficiently organize the distribution of information in a large-scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically based pseudonym proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.
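
The profile-and-rank idea can be sketched as follows. This is an illustrative reading of the abstract rather than the patented system: word weights are taken to be an article's relative word frequency divided by the word's relative frequency across all articles, a user's interest summary is assumed to be the average profile of articles the user liked, and matching is assumed to use cosine similarity. All function names and the toy articles are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def target_profiles(articles):
    """articles: {article_id: text}. Returns {article_id: {word: weight}}.

    weight = (frequency of word in the article)
             / (frequency of word across all articles)
    """
    counts = {aid: Counter(tokenize(text)) for aid, text in articles.items()}
    corpus = Counter()
    for c in counts.values():
        corpus.update(c)
    corpus_total = sum(corpus.values())
    profiles = {}
    for aid, c in counts.items():
        total = sum(c.values())
        profiles[aid] = {
            w: (n / total) / (corpus[w] / corpus_total) for w, n in c.items()
        }
    return profiles


def interest_summary(profiles, liked_ids):
    """Average the profiles of liked articles (an assumed summary form)."""
    summary = Counter()
    for aid in liked_ids:
        for w, weight in profiles[aid].items():
            summary[w] += weight / len(liked_ids)
    return dict(summary)


def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_for_user(profiles, summary, exclude=()):
    """Rank-ordered list of article ids, most relevant first."""
    candidates = [aid for aid in profiles if aid not in exclude]
    return sorted(candidates,
                  key=lambda aid: cosine(profiles[aid], summary),
                  reverse=True)


articles = {
    "a1": "stock prices rose as markets rallied",
    "a2": "markets fell and stock prices dropped",
    "a3": "the team won the final match",
    "a4": "the coach praised the team after the match",
}
profiles = target_profiles(articles)
summary = interest_summary(profiles, ["a1"])
ranking = rank_for_user(profiles, summary, exclude=["a1"])
```

A user who liked the first finance article gets the other finance article ranked at the top, since the two sports articles share no words with the user's interest summary. The privacy layer described in the abstract (the pseudonym proxy server) is outside the scope of this sketch.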