**Limitations of autoregressive models and their alternatives**.

Chu-Cheng Lin, Aaron Jaech, Xin Li, Matt Gormley, and Jason Eisner
(2021).

In *NAACL*. [ bib ]

My students and I work broadly on computational approaches to human language. Our high-level agenda is here.

We're developing Dyna, a beautiful high-level programming language, to facilitate all of the above.

Don't know what to read? Try these selected papers.

Also listed here are supervised dissertations, some invited talks, talk videos, patents, edited volumes, and teaching papers.

Here are some recommended papers that give a good sense of what we've worked on over the years.

Natural language problems often demand new algorithms. The main challenges are

- a combinatorially large discrete space of linguistic structures

- a high-dimensional continuous space of statistical parameters

Our combinatorial algorithms are centered on dynamic programming, often combined with other traditions: local search, variational approximations, greedy methods, systematic search, and relaxation methods such as row generation and dual decomposition. Some of these are exact methods for NP-hard global optimization problems.

What are all these discrete algorithms doing? “Structured prediction” is the problem of modeling unknown variables that are themselves complex structures, such as vectors, strings, or trees. The number of values for such a variable may be astronomically large. Searching for the most likely structure is a *combinatorial optimization* problem. Other combinatorial problems compute the probability of a particular structure or sub-structure.
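To get a feel for how quickly such a space grows, the minimal sketch below counts only the unlabeled binary bracketings of an n-word sentence, which is the Catalan number C(n-1). The function names are illustrative, and adding nonterminal labels or non-binary branching would make the space larger still.

```python
from math import comb


def catalan(k: int) -> int:
    """k-th Catalan number: C_k = (2k choose k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)


def num_binary_bracketings(n_words: int) -> int:
    """Number of distinct unlabeled binary trees over a sentence of n_words words."""
    if n_words < 1:
        raise ValueError("need at least one word")
    return catalan(n_words - 1)


if __name__ == "__main__":
    # The count grows exponentially with sentence length.
    for n in (5, 10, 20, 40):
        print(n, num_binary_bracketings(n))
```

A 20-word sentence already has more than a billion bracketings, which is why exhaustive enumeration is replaced by dynamic programming or other search methods.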

Doesn't the machine learning community do structured prediction? Yes: graphical model inference must predict discrete vectors. But linguistics must predict *strings* and *trees*. Our systems must guess the syntactic structure of a sentence, the translation of a sentence, the grammar of a language, or the set of real-world facts that is consistent with a set of documents.

Dynamic programming is extremely useful for analyzing sequence data. The papers below introduce *novel* dynamic programming algorithms (primarily for parsing and machine translation).

Other cool papers, not listed below, show how dynamic programming algorithms can be *embedded as efficient subroutines* within variational inference (belief propagation), relaxation (dual decomposition, row generation), and large-neighborhood local search.

"Deep learning" usually refers to the use of multi-layer neural networks to model functions or probability distributions. The advantage of these highly parametric models is that they are expressive enough to fit a wide range of real-world phenomena. We have been particularly interested in combining deep learning with graphical models and other approaches to structured prediction, in order to marry the flexibility of deep learning with the insight of domain-specific modeling.

Deep architectures for NLP typically include parameters for vector embeddings of words, which is an important subtopic.

Machine learning often searches for parameters that maximize some non-convex and possibly expensive objective function. Our contributions include semiring computations of objective functions and their gradients, a special case of gradients by automatic differentiation (“back-propagation”); variational training objectives that are tractable to compute; and deterministic annealing methods that smooth a non-convex or non-differentiable objective function during early iterations of training.

My students and I often identify a pesky formal problem in statistical NLP or ML and try to give a general solution to it. The formal settings for our algorithms often involve finite-state machines, various kinds of grammars and synchronous grammars, and graphical models.

These objects are usually equipped with real-valued weights that define a structured prediction problem (see here). One can treat a wider variety of problems by allowing the weights to be elements of an arbitrary semiring.
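As a rough illustration of the semiring idea, the sketch below runs one and the same path-summing routine over a tiny acyclic graph under two different semirings. The `Semiring` class, the `path_weight` function, and the example graph are all hypothetical names chosen for this sketch, not an API from any of the papers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[float, float], float]   # combines alternative paths
    times: Callable[[float, float], float]  # extends a path by one edge
    zero: float                             # identity for plus
    one: float                              # identity for times


# Real semiring: total weight of all paths.
SUM_PRODUCT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
# Viterbi semiring: weight of the single best path (assumes weights >= 0).
MAX_PRODUCT = Semiring(max, lambda a, b: a * b, 0.0, 1.0)

Edge = Tuple[str, str, float]


def path_weight(edges: List[Edge], source: str, target: str, sr: Semiring) -> float:
    """Semiring-sum of path weights from source to target.

    Assumes the graph is acyclic and that edges are listed in topological
    order of their source node, so a node's total is final before it is used.
    """
    total: Dict[str, float] = {source: sr.one}
    for u, v, w in edges:
        if u in total:
            extended = sr.times(total[u], w)
            total[v] = sr.plus(total.get(v, sr.zero), extended)
    return total.get(target, sr.zero)


EDGES: List[Edge] = [
    ("s", "a", 0.5), ("s", "b", 0.5),
    ("a", "t", 0.5), ("b", "t", 0.25),
]

if __name__ == "__main__":
    print(path_weight(EDGES, "s", "t", SUM_PRODUCT))  # total over both paths
    print(path_weight(EDGES, "s", "t", MAX_PRODUCT))  # best single path
```

Swapping the semiring changes what the algorithm computes without changing its control flow, which is the design point: one dynamic program, many problems.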

Our work on weighted logic programming (see papers on the Dyna language) has led us to develop flexible algorithms for maintaining truth values in arithmetic circuits.

These papers classify problems into specific computational complexity classes.

My students and I have worked in several ML settings, some of them novel. I have a relatively well-developed statistical philosophy, leading to the design of novel training objectives. We've also offered techniques for optimizing such objectives. Once upon a time, I was into neural networks, and I have started to use deep learning again.

We have introduced new machine learning ideas in several settings, including in unsupervised, supervised and semi-supervised learning; domain adaptation; structure learning for graphical models; hybrids of probabilistic modeling with deep learning; cost-aware learning; reinforcement learning; and creative use of annotators.

Our *novel* contributions to unsupervised learning were primarily developed on grammar induction, including an approach for converting unsupervised learning to supervised learning on synthetically generated data. We have also proposed unsupervised bootstrapping and done a little work on clustering.

Other papers, not listed below, also do unsupervised learning, but using traditional approaches such as EM and MCEM. We develop such approaches for our transformation models and nonparametric models.

What does it mean to learn? What objective should a learning algorithm maximize?

Decision-making systems should be *trained* discriminatively even if they are *structured* generatively. Contrastive estimation guides an unsupervised learner to learn the “right” latent variables, by asking it to discriminate between positive data and certain implicit negative data. It is also popular for being fast. Structural annealing applies a domain-specific search bias during early learning. Finally, we have designed objectives for semi-supervised learning.

Bootstrapping is a general strategy for semi-supervised learning. Bootstrapping algorithms sometimes get confused and perform poorly. To cure this, our completely unsupervised “strapping” method is remarkably effective at selecting a successful run of bootstrapping.

We also relate bootstrapping to entropy regularization and apply it in a feature-rich setting of grammar induction.

Intelligent systems may be *structured* to do approximate probabilistic inference under some carefully crafted model. However, they should be *trained* discriminatively, so that their actual decisions or predictions maximize task-specific performance. This allows the parameters to compensate for any approximations in modeling, inference, and decoding.

My philosophy comes from Bayesian decision theory:

- The task of

The task of

*See also other papers on machine learning objectives.*

Do you have a realistic model of your domain? Then probabilistic inference will be intractable or slow. But do you really need exact inference? Often, you could confidently make a prediction without reasoning about *all* of the potentially relevant variables. Many variables have redundant or negligible influence on the final decision.

“Cost-aware learning” refers to learning policies for cheap but accurate inference. This is a form of discriminative training (discussed here) where the cost function includes terms for inference speed, data acquisition cost, model size, etc. Our papers are below.

We are particularly interested in learning inference policies that make dynamic decisions at runtime about where to spend computation (e.g., for parsing or arithmetic circuits). More simply, however, one can choose a static policy for each domain or each example.

A human annotator is an AI system hidden inside a skull. Fortunately, it is not a black-box system. We shouldn't just ask annotators to give us the right answers on training data—while they're at it, they can also mark *why* they chose those answers.

We showed below that annotator “rationales” are efficient to gather and can be exploited to improve classification accuracy. The idea has been adopted elsewhere in NLP and in computer vision. In addition, several papers on interpretable machine learning have asked artificial systems to produce their own “rationales” in the same form.

How do young children listen to their native language and figure out its typological properties and detailed linguistic structure? I'm dying to know, but I would settle for solving the related NLP problem of *grammar induction*. Given word sequences or part-of-speech sequences that were presumably generated by a (natural language) grammar, we seek to identify the grammar or the resulting tree structures.

Locally maximizing likelihood (the inside-outside algorithm) does quite poorly on this task for a variety of reasons. We have tried to get some insight into the problems and address them with a variety of search methods and modified objective functions. However, this area is far from solved and we are considering new angles.

I have also worked on learning “deep” grammar from “surface” grammar by modeling syntactic transformations. Beyond syntactic grammars, see also our work on inducing morphological and phonological grammars.

Below are papers that present technically innovative models of linguistic domains (not just algorithms for existing models).

Some are focused on modeling syntax, morphology, translation, or annotator behavior, or refining the non-statistical Optimality Theory formalism that is popular in linguistics. We have also modeled the comprehension of foreign-language learners.

Other papers are more general in nature—they develop generative models or finite-state models that could be applied in other domains. In general, my students and I like to build deep probabilistic models that are intuitively plausible as domain models, while retaining plenty of parametric flexibility to fit unexpected patterns in the data.

We have designed various classes of generative models. These models are of general interest although they were motivated by linguistic problems. They include transformation models and variations on topic models. Some of our models are nonparametric or have deep learning architectures. We have also extended Markov random fields to string-valued random variables.

In Bayesian modeling, one often uses a Dirichlet distribution or Dirichlet process as a prior for a discrete distribution. These priors have the *neutrality property*: if event *x* is observed, we raise our posterior estimate of *p*(*x*) and correspondingly scale down the estimate of *p*(*y*) for all other *y*.
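The neutrality property can be checked numerically. The sketch below assumes a symmetric Dirichlet prior with a hypothetical concentration of 1/2 per event and made-up counts, and uses the standard posterior-mean formula (count + alpha) / (N + K·alpha). Exact fractions are used so that the comparison is not blurred by rounding.

```python
from fractions import Fraction


def posterior_mean(counts, alpha):
    """Posterior mean of each event's probability under a symmetric Dirichlet(alpha) prior."""
    total = sum(counts.values()) + alpha * len(counts)
    return {event: (c + alpha) / total for event, c in counts.items()}


def observe(counts, event):
    """Return a copy of counts with one more observation of event."""
    updated = dict(counts)
    updated[event] += 1
    return updated


# Hypothetical counts and concentration, chosen only for illustration.
counts = {"x": 2, "y": 1, "z": 0}
alpha = Fraction(1, 2)

before = posterior_mean(counts, alpha)
after = posterior_mean(observe(counts, "x"), alpha)

# Every event other than x is scaled down by the same factor.
shrink = {e: after[e] / before[e] for e in ("y", "z")}

if __name__ == "__main__":
    print("p(x):", before["x"], "->", after["x"])
    print("scale factors for the other events:", shrink)
```

Observing *x* raises the estimate of *p*(*x*) while *y* and *z* shrink by an identical factor, so their ratio to each other is unchanged.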

However, what if *x* and *y* are “related” events? In that case, their probabilities should covary: observing one should raise the estimated probability of the other. A *transformation model* captures this by positing that some instances of *y* were derived by transformation from *x*. Indeed, *p*(*y*) is defined by summing over all transformation sequences that would terminate at *y*. We fit a feature-based model of the transformation probabilities, permitting generalization to new events.

I originally introduced this idea in order to model syntactic transformations, but we have subsequently explored it in other settings.
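As a toy sketch of the "sum over transformation sequences" idea, the code below treats events as nodes of a small acyclic graph. A sequence starts at a hypothetical `START` node, follows transformation arcs, and terminates at an event by taking that event's `HALT` arc. The graph and its arc probabilities are invented for illustration. In an actual transformation model the arc probabilities would come from a fitted feature-based model, and cyclic graphs would require a different solver than this simple recursion.

```python
from fractions import Fraction as F

START, HALT = "START", "HALT"

# Hypothetical arc probabilities; each node's outgoing arcs sum to 1.
ARCS = {
    START: {"x": F(3, 4), "y": F(1, 4)},
    "x": {HALT: F(2, 3), "y": F(1, 3)},  # x may halt or be transformed into y
    "y": {HALT: F(1)},
}


def reach(node):
    """Total probability of all arc sequences from START that arrive at node.

    Recursion over predecessors; assumes the graph is acyclic.
    """
    if node == START:
        return F(1)
    return sum(
        (reach(prev) * out[node] for prev, out in ARCS.items() if node in out),
        F(0),
    )


def prob(event):
    """p(event): sum over all transformation sequences that terminate at event."""
    return reach(event) * ARCS[event].get(HALT, F(0))


if __name__ == "__main__":
    for e in ("x", "y"):
        print(e, prob(e))
```

Because part of *y*'s probability flows through *x*, raising the weight of the arc into *x* raises the probability of both events, which is the covariance that a neutral prior cannot express.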

Each of the papers below uses finite-state machines to help model some linguistic domain. In most cases, the model combines multiple machines, or combines finite-state machines with deep learning. Many of these papers also present algorithmic methods.

Natural language data is very rich and can be analyzed at many levels. My students and I have happily worked on problems all over NLP:

Parsing, machine translation, multi-sentence alignment, and grammar induction all consider the tree structure of a sentence.

Sequence tagging exploits the sequential structure of a sentence (e.g., part-of-speech tagging and named entity tagging).

Morphology, phonology, and name clustering all consider the internal spelling structure of words and their relationships to other words.

Word sense disambiguation and selectional preferences start to address the meanings of those words.

Information extraction and coreference resolution start to consider sentence meaning. I have also done a bit of work on linguistic semantics.

Sentiment analysis, text categorization, document clustering, and topic analysis look for document-level “meaning.”

Finally, I have several early papers that focus on linguistic theory and data (independent of computation).

I've also hit a few miscellaneous topics: educational technology, predictive text entry, cryptanalysis, text compression, graph visualization, collaborative filtering, online commerce, content delivery, and voting systems.

Parsing a sentence is a key part of obtaining its meaning or translation. Leading systems for QA, IE, and MT now rely on parsing.

We have devised fundamental, widely-used exact parsing algorithms for dependency grammars, combinatory categorial grammars, context-free grammars and tree-adjoining grammars. We also showed that different parsing algorithms are often interrelated by formal transformations that appear widely applicable.

Beyond devising exact algorithms, we have developed several principled approximations for speeding up parsing, both for basic models and for enriched models where exact parsing would be impractical.

A number of our papers (not all shown below) try to improve the actual models of linguistic syntax that are used in parsing. For example, several of these algorithms aim to preserve speed for lexicalized models of grammar, which acknowledge that two different verbs (for example) may behave differently.

A parser is only as accurate as its linguistic model. Many existing grammar formalisms aim to capture different aspects of syntax (see parsing papers). We have tried to enrich these formalisms in appropriate ways, by explicitly modeling lexicalization, dependency length, non-local statistical interactions (beyond what the grammar formalism provides), and syntactic transformations.

*Remark:* The probabilities under lexicalized models can capture some crude semantic preferences along with syntax (i.e., selectional preferences). In fact, in our very early work, we actually conditioned probabilities on words according to their role in a semantic representation. I subsequently argued for bilexical parsing as an approximation to this, and gave the first generative model for dependency parsing (which was also the first non-edge-factored model).

A natural-language grammar will generally contain many related syntactic constructions for a given word (e.g., active and passive). Most grammar formalisms explain this redundancy by assuming some mechanism for generating new constructions systematically from old ones.

My dissertation work showed how to model these “syntactic transformations” statistically, learning how deeply related rules covary. It inferred the deep relationships from a sample of observed constructions, enabling it to generalize to unseen constructions (“transformational smoothing”). This work introduced the more general technique of transformation modeling.

The 2008, 2009, and 2011 papers below built up an elegant model of inflectional morphology, with each paper building on the previous one. The work is gathered together in Dreyer's dissertation. Further work beginning in 2015 extended the approach to use latent underlying morphs, allowing it to treat derivational morphology as well.

Most syntax-based models of translation assume that in training data, a sentence and its translation have isomorphic syntactic structure. The papers below work to weaken that assumption, which often fails in practice.

*See also other papers on machine translation.*

Below are all our papers on machine translation—an assortment of interesting techniques motivated by different search, learning, and modeling challenges in MT. If there is any consistent theme, it is that we usually work with a full probability distribution over possible translations, not just its mode.

Our work on foreign language education uses MT within educational technology.

I was in graduate school when Optimality Theory took over phonology. There was no computational treatment of OT yet. I provided key finite-state algorithms for generation and comprehension, and proposed plausible modifications to the formalism to keep it within finite-state power. I also analyzed the computational complexity of grammar learning.

In order to do this computational work, I first had to conjecture a universal set of legal constraints (i.e., universal grammar). Nearly all the constraints I found in hundreds of OT papers fit into my simple taxonomy of “primitive constraints.” For those that didn't fit, I exhibited an alternative analysis within my framework of the linguistic data. The “primitive OT” analysis of metrical stress is arguably superior because it predicted previously unexplained typological gaps.

More recently, my students and I have worked on recovering the phonological underlying forms of a language, jointly with a probabilistic phonology. We have also worked on probabilistically modeling the typology of vowel systems.

Learning French in high school was so slow and artificial compared to learning my native language, English. Why all these vocabulary lists and toy-data sentences? Why couldn't I pick up French words and constructions in context, by reading something interesting?

In high school, I wanted to write a novel that gradually morphed from English to French, starting with English words in French order, and gradually dropping in French function words and content words when they were clear from context. Now with machine translation, we're starting to create this kind of hybrid text automatically ...

The Dyna language is our bid to provide a unifying framework for data and algorithms across many settings.

Programming in Dyna is meant to be easy. A program is a short, high-level schematic description of the structure of a computation. It simply defines various values in terms of other values. The user can query and update values at runtime.

It is the system's job to choose efficient data structures and computations to support the possible queries and updates. This leads to interesting algorithmic problems, connected to query planning, deductive databases, and adaptive systems.

The forthcoming version of the language is described in Eisner & Filardo (2011), which illustrates its power on a wide range of problems in statistical AI and beyond. We released a prototype back in 2005, which was limited to semiring-weighted computations but has been used profitably in a number of NLP papers. The new working implementation under development is available here on github.

**Learning how to ask: Querying LMs with mixtures of soft prompts**.

Guanghui Qin and Jason Eisner (2021).

In *NAACL*. [ paper+supplement | arxiv | bib ]

Natural-language prompts have recently been used to coax pretrained language models into performing other AI tasks, using a fill-in-the-blank paradigm (Petroni et al., 2019) or a few-shot extrapolation paradigm (Brown et al., 2020). For example, language models retain factual knowledge from their training corpora that can be extracted by asking them to “fill in the blank” in a sentential prompt. However, where does this prompt come from? We explore the idea of learning prompts by gradient descent—either fine-tuning prompts taken from previous work, or starting from random initialization. Our prompts consist of “soft words,” i.e., continuous vectors that are not necessarily word type embeddings from the language model. Furthermore, for each task, we optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them. Across multiple English LMs and tasks, our approach hugely outperforms previous methods, showing that the implicit factual knowledge in language models was previously underestimated. Moreover, this knowledge is cheap to elicit: random initialization is nearly as good as informed initialization.

**Noise-contrastive estimation for multivariate point processes**.

Hongyuan Mei, Tom Wan, and Jason Eisner (2020).

In *NeurIPS*. [ paper+supplement | arxiv | slides | poster | bib ]

The log-likelihood of a generative model often involves both positive and negative terms. For a temporal multivariate point process, the negative term sums over all the possible event types at each time and also integrates over all the possible times. As a result, maximum likelihood estimation is expensive. We show how to instead apply a version of noise-contrastive estimation—a general parameter estimation method with a less expensive stochastic objective. Our specific instantiation of this general idea works out in an interestingly non-trivial way and has provable guarantees for its optimality, consistency and efficiency. On several synthetic and real-world datasets, our method shows benefits: for the model to achieve the same level of log-likelihood on held-out data, our method needs considerably fewer function evaluations and less wall-clock time.

Note: This is part of a series of papers on the neural Hawkes process, although the method also applies to other continuous temporal models.

Keywords: event streams, deep learning, generative modeling, training objectives, discriminative training

**Autoregressive modeling is misspecified for some sequence
distributions**.

Chu-Cheng Lin, Aaron Jaech, Xin Li, Matt Gormley, and Jason Eisner
(2020).

*CoRR*. [ paper+supplement | arxiv | bib ]

Should sequences be modeled autoregressively, one symbol at a time? How much computation is needed to predict the next symbol? While local normalization is cheap, this also limits its power. We point out that some probability distributions over discrete sequences cannot be well-approximated by *any* autoregressive model whose runtime and parameter size grow polynomially in the sequence length, even though their *unnormalized* sequence probabilities are efficient to compute exactly. Intuitively, the probability of the next symbol can be expensive to compute or approximate (even via randomized algorithms) when it marginalizes over exponentially many possible futures, which is in general NP-hard. Our result is conditional on the widely believed hypothesis that NP ⊈ P/poly (without which the polynomial hierarchy would collapse at the second level). This theoretical observation serves as a caution to the viewpoint that pumping up parameter size is a straightforward way to improve autoregressive models (e.g., in language modeling). It also suggests that globally normalized (energy-based) models may sometimes outperform locally normalized (autoregressive) models, as we demonstrate experimentally for language modeling.

Keywords: hardness, language modeling, generative modeling, deep learning
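To make the "marginalizing over exponentially many futures" step concrete, here is a brute-force sketch for a toy globally normalized model over binary sequences of a fixed length. The `score` function is a hypothetical stand-in, not the construction from the paper. This particular toy could be marginalized efficiently by dynamic programming, so it only illustrates the definitions and not the hardness result.

```python
from fractions import Fraction
from itertools import product

ALPHABET = (0, 1)


def score(seq):
    """Cheap unnormalized weight: halve the weight for each pair of equal neighbors."""
    w = Fraction(1)
    for a, b in zip(seq, seq[1:]):
        if a == b:
            w *= Fraction(1, 2)
    return w


def prefix_mass(prefix, n):
    """Total unnormalized weight of all length-n sequences starting with prefix.

    Enumerates every possible future, so the cost is exponential in n - len(prefix).
    """
    remaining = n - len(prefix)
    return sum(score(prefix + future) for future in product(ALPHABET, repeat=remaining))


def next_symbol_prob(prefix, symbol, n):
    """Locally normalized conditional p(symbol | prefix) induced by the global model."""
    return prefix_mass(prefix + (symbol,), n) / prefix_mass(prefix, n)


def chain_rule_prob(seq):
    """Probability of seq as a product of next-symbol conditionals."""
    n = len(seq)
    p = Fraction(1)
    for i, symbol in enumerate(seq):
        p *= next_symbol_prob(seq[:i], symbol, n)
    return p


if __name__ == "__main__":
    seq = (0, 1, 0)
    z = prefix_mass((), len(seq))  # partition function
    print(chain_rule_prob(seq), score(seq) / z)  # the two values agree
```

Scoring a whole sequence takes one linear pass, whereas each exact next-symbol conditional here requires summing over all continuations. The abstract's claim is that for some distributions no polynomial-size autoregressive model can avoid that cost.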

**Task-oriented dialogue as dataflow synthesis**.

Semantic Machines, Jacob Andreas, John Bufe, David Burkett, Charles
Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner,
Jason Eisner, Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill,
Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo
Lanman, Percy Liang, Christopher H. Lin, Ilya Lintsbakh, Andy McGovern,
Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth,
Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Snyder, Stephon
Striplin, Yu Su, Zachary Tellman, Sam Thomson, Andrei Vorobev, Izabela
Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, and Alexander Zotov (2020).

*TACL*. [ paper | official link | arxiv | video | extended video | data | twitter | blog | bib ]

We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph. A dialogue agent maps each user utterance to a program that extends this graph. Programs include metacomputation operators for reference and revision that reuse dataflow fragments from previous turns. Our graph-based state enables the expression and manipulation of complex user intents, and explicit metacomputation makes these intents easier for learned models to predict. We introduce a new dataset, SMCalFlow, featuring complex dialogues about events, weather, places, and people. Experiments show that dataflow graphs and metacomputation substantially improve representability and predictability in these natural dialogues. Additional experiments on the MultiWOZ dataset show that our dataflow representation enables an otherwise off-the-shelf sequence-to-sequence model to match the best existing task-specific state tracking model. The SMCalFlow dataset, code for replicating experiments, and a public leaderboard are available at https://www.microsoft.com/en-us/research/project/dataflow-based-dialogue-semantic-machines.

Keywords: dialog, selected papers

**Neural Datalog through time: Informed temporal modeling via logical
specification**.

Hongyuan Mei, Guanghui Qin, Minjie Xu, and Jason Eisner (2020).

In *ICML*. [ paper+supplement | arxiv | slides | PDF slides | video | code | blog | press | bib ]

Learning how to predict future events from patterns of past events is difficult when the set of possible event types is large. Training an unrestricted neural model might overfit to spurious patterns. To exploit domain-specific knowledge of how past events might affect an event's present probability, we propose using a *temporal deductive database* to track structured facts over time. Rules serve to prove facts from other facts and from past events. Each fact has a time-varying state: a vector computed by a neural net whose topology is determined by the fact's *provenance*, including its experience of past events. The possible event types at any time are given by special facts, whose *probabilities* are neurally modeled alongside their states. In both synthetic and real-world domains, we show that neural probabilistic models derived from concise Datalog programs improve prediction by encoding appropriate domain knowledge in their architecture.

Note: This is part of a series of papers on the neural Hawkes process for modeling irregular time series, but can also be used for discrete-time sequence modeling as in our work on neural finite-state methods. It is also related through Datalog to our work on Dyna.

Keywords: event streams, deep learning, generative modeling, Dyna, selected papers

**A corpus for large-scale phonetic typology**.

Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner,
Ryan Cotterell, Alan W. Black, and Jason Eisner (2020).

In *ACL*. [ paper+supplement | slides | PDF slides | video | code | data | bib ]

A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions. We present Vox Clamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants. Access to such data can greatly facilitate investigation of phonetic typology at a large scale and across many languages. However, it is non-trivial and computationally intensive to obtain such alignments for hundreds of languages, many of which have few to no resources presently available. We describe the methodology to create our corpus, discuss caveats with current methods and their impact on the utility of this data, and illustrate possible research directions through a series of case studies on the 48 highest-quality readings. Our corpus and scripts are publicly available for non-commercial use at https://voxclamantisproject.github.io/.

Keywords: phonetics, typology, corpora

**Evaluation of logic programs with built-ins and aggregation: A calculus
for bag relations**.

Matthew Francis-Landau, Tim Vieira, and Jason Eisner (2020).

In *WRLA*. [ paper | arxiv | slides | PDF slides | video | code | bib ]

We present a scheme for translating logic programs with built-ins and aggregation into algebraic expressions that denote bag relations over ground terms of the Herbrand universe. To evaluate queries against these relations, we develop an operational semantics based on term rewriting of the algebraic expressions. This approach can exploit arithmetic identities and recovers a range of useful strategies, including lazy strategies that defer work until it becomes both possible and necessary.

Keywords: Dyna, rewriting

**Specializing word embeddings (for parsing) by information bottleneck**.

Xiang Lisa Li and Jason Eisner (2019).

In *EMNLP-IJCNLP (Best Paper Award)*. [ paper | slides | PDF slides | PPT slides | unofficial video | bib ]

Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the *discrete* version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the *continuous* version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.

Keywords: variational training, word embeddings, deep learning, language modeling, dependency parsing

**A functionalist account of vowel system typology**.

Ryan Cotterell and Jason Eisner (2019).

In *EMNLP-IJCNLP*. [ bib ]

The typology of sound systems in spoken human languages should be explained in part by functional pressures on communication. Two competing pressures are transmission rate and ease of communication. More information may be transmitted per phoneme token if more phonemes are available—but fitting more phonemes into the system would require the system to use outlier sounds that are hard to pronounce or perceive, or else to split existing phonemes, requiring more speaker and hearer effort to keep them distinct. In contrast, a system with few phonemes has limited information per phoneme, but speakers can articulate the phonemes more easily and sloppily (perhaps allowing more phonemes per second). We encode these two competing pressures into a proposed universal prior for a generative probability model. We find that a model of vowel token formants is more predictive of held-out data if it is trained with the help of this prior (that is, by MAP rather than ML).

Keywords: phonology, typology, linguistics, deep learning

**Spelling-aware construction of macaronic texts for teaching
foreign-language vocabulary**.

Adithya Renduchintala, Philipp Koehn, and Jason Eisner (2019).

In *EMNLP-IJCNLP*. [ paper+supplement | poster | PDF poster | bib ]

We present a machine foreign-language teacher that modifies text in a student's native language (L1) by replacing selected word tokens with glosses in a foreign language (L2), in such a way that L2 vocabulary can be learned simply by reading the resulting *macaronic* text. The machine teacher uses no supervised data from human students. Instead, to guide the machine teacher's choices, we equip a cloze language model with a training procedure that can incrementally learn representations for novel words, and use this model as a proxy for the word guessing and learning ability of real human students. We use Mechanical Turk to evaluate two variants of the student model: (i) one that generates a representation for a novel word using only surrounding context and (ii) an extension that also uses the spelling of the novel word.

Note:This paper extends the workshop paper of Renduchintala et al. (2019a).

Keywords:edutech, language modeling

**Supervised Training on Synthetic Languages: A Novel Framework
for Unsupervised Parsing**.

Dingquan Wang (2019).

PhD thesis, Johns Hopkins University. [ bib ]

This thesis focuses on unsupervised dependency parsing: parsing sentences of a language into dependency trees without access to training data for that language. Unlike most prior work, which uses unsupervised learning to estimate the parsing parameters, we estimate the parameters by *supervised* training on synthetic languages. Our parsing framework has three major components: *synthetic language generation* gives a rich set of training languages by mix-and-match over the real languages; *surface-form feature extraction* maps an unparsed corpus of a language into a fixed-length vector as the syntactic signature of that language; and, finally, *language-agnostic parsing* incorporates the syntactic signature during parsing so that the decision on each word token depends on the general syntax of the target language.

The fundamental question we are trying to answer is whether some useful information about the syntax of a language could be inferred from its surface-form evidence (*unparsed* corpus). This is the same question that has been implicitly asked by previous papers on unsupervised parsing, which assume only that an unparsed corpus is available for the target language. We show that, indeed, useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of *other* languages, so it learns a feature extractor that works well.

This thesis contains several large-scale experiments requiring hundreds of thousands of CPU-hours. To our knowledge, this is the largest study of unsupervised parsing yet attempted. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work's interpretable typological features that require parsed corpora or expert categorization of the language.

Note: Dr. Wang's dissertation advisor was Jason Eisner.

Keywords: theses, synthetic data, grammar induction, unsupervised learning, deep learning, linguistics, typology, syntax, dependency parsing

**Simple construction of mixed-language texts for vocabulary learning**.

Adithya Renduchintala, Philipp Koehn, and Jason Eisner (2019).

In *BEA*. [ paper | bib ]

We present a machine foreign-language teacher that takes documents written in a student's native language and detects situations where it can replace words with their foreign glosses such that new foreign vocabulary can be learned simply through reading the resulting mixed-language text. We show that it is possible to design such a machine teacher without any supervised data from (human) students. We accomplish this by modifying a language model to incrementally learn new vocabulary items, and use this language model as a proxy for the word guessing ability of real students. Our machine foreign-language teacher consults this language model and creates mixed-language documents. We evaluate three variants of our student proxy language models through a study on Amazon Mechanical Turk. We find that Mechanical Turk “students” were able to guess the meanings of foreign words introduced by the machine teacher with high accuracy for both function words and content words in two out of the three word guessing models.

Note: The best of these models was further improved in our followup paper (Renduchintala et al., 2019b). The followup paper also evaluated on real languages instead of artificial ones.

Keywords: edutech, language modeling

**What kind of language is hard to language-model?**

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and
Jason Eisner (2019).

In *ACL*. [ paper+supplement | arxiv | slides | twitter | bib ]

How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that “translationese” is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.

Keywords: typology, morphology, language modeling

**Imputing missing events in continuous-time event streams**.

Hongyuan Mei, Guanghui Qin, and Jason Eisner (2019).

In *ICML*. [ paper+supplement | slides | PDF slides | poster | code | bib ]

Events in the world may be caused by other, *unobserved* events. We consider sequences of events in continuous time. Given a probability model of *complete* sequences, we propose particle smoothing (a form of sequential importance sampling) to impute the missing events in an *incomplete* sequence. We develop a trainable family of proposal distributions based on a type of bidirectional continuous-time LSTM. Bidirectionality lets the proposals condition on future observations, not just on the past as in particle filtering. Our method can sample an ensemble of possible complete sequences (particles), from which we form a single consensus prediction that has low Bayes risk under our chosen loss metric. We experiment in multiple synthetic and real domains, using different missingness mechanisms, and modeling the complete sequences in each domain with a neural Hawkes process (Mei & Eisner, 2017). On held-out incomplete sequences, our method is effective at inferring the ground-truth unobserved events, with particle smoothing consistently improving upon particle filtering.
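
The idea of forming a consensus from weighted particles can be illustrated with a minimal sketch. It is not the paper's procedure: it simply picks, among the sampled particles, the one with the lowest weighted average loss against all the others (a common minimum-Bayes-risk heuristic). The function names, the toy particles, and the set-difference loss are all hypothetical stand-ins.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def consensus(particles: Sequence[T],
              weights: Sequence[float],
              loss: Callable[[T, T], float]) -> T:
    """Return the particle with the lowest estimated Bayes risk.

    The risk of a candidate is its weighted average loss against every
    particle, treating the weighted particles as a sample-based
    approximation of the posterior. (Assumption: the consensus is
    restricted to be one of the particles.)
    """
    total = sum(weights)

    def risk(candidate: T) -> float:
        return sum(w * loss(candidate, p)
                   for p, w in zip(particles, weights)) / total

    return min(particles, key=risk)


def set_difference_loss(a: frozenset, b: frozenset) -> float:
    """Hypothetical loss: number of events present in one set but not the other."""
    return float(len(a ^ b))


# Toy particles: each is a set of imputed (event type, time) pairs.
particles = [
    frozenset({("a", 1)}),
    frozenset({("a", 1), ("b", 2)}),
    frozenset({("b", 2)}),
]
weights = [0.4, 0.4, 0.2]
best = consensus(particles, weights, set_difference_loss)
```

With these toy weights, the middle particle wins because it disagrees least, on average, with the rest of the ensemble.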

Note: This is part of a series of papers on the neural Hawkes process.

Keywords: event streams, deep learning, generative modeling, unsupervised learning, particle methods

**Neural finite-state transducers: Beyond rational relations**.

Chu-Cheng Lin, Hao Zhu, Matthew Gormley, and Jason Eisner (2019).

In *NAACL*. [ bib ]

We introduce neural finite state transducers (NFSTs), a family of string transduction models defining joint and conditional probability distributions over pairs of strings. The probability of a string pair is obtained by marginalizing over the scores of all its accepting paths in a finite state transducer. In contrast to ordinary weighted FSTs, however, each path is scored using a recurrent neural network, which breaks the usual conditional independence assumption (Markov property). NFSTs are more powerful than previous finite-state models with neural features (Rastogi et al., 2016). We present training and inference algorithms for locally and globally normalized variants of NFSTs. In experiments on different transduction tasks, they compete favorably against seq2seq models while offering interpretable paths that correspond to hard monotonic alignments.

Keywords: finite-state methods, deep learning, particle methods

**Contextualization of morphological inflection**.

Ekaterina Vylomova, Ryan Cotterell, Tim Baldwin, Trevor Cohn, and
Jason Eisner (2019).

In *NAACL*. [ paper | arxiv | slides | bib ]

Critical to natural language generation is the production of correctly inflected text. In this paper, we isolate the task of predicting a fully inflected sentence from its partially lemmatized version. Unlike traditional morphological inflection or surface realization, our task input does not provide “gold” tags that specify what morphological features to realize on each lemmatized word; rather, such features must be inferred from sentential context. We develop a neural hybrid graphical model that explicitly reconstructs morphological features before predicting the inflected forms, and compare this to a system that directly predicts the inflected forms without relying on any morphological annotation. We experiment on several typologically diverse languages from the Universal Dependencies treebanks, showing the utility of incorporating linguistically-motivated latent variables into NLP models.

Keywords: morphology, deep learning

**A generative model for punctuation in dependency trees**.

Xiang Lisa Li, Dingquan Wang, and Jason Eisner (2019).

*TACL*. [ official link | paper+supplement | arxiv | slides | PDF slides | PPT slides | unofficial video | bib ]

Treebanks traditionally treat punctuation marks as ordinary words, but linguists have suggested that a tree's “true” punctuation marks are not observed (Nunberg, 1990). These latent “underlying” marks serve to delimit or separate constituents in the syntax tree. When the tree's yield is rendered as a written sentence, a string rewriting mechanism transduces the underlying marks into “surface” marks, which are part of the observed (surface) string but should not be regarded as part of the tree. We formalize this idea in a generative model of punctuation that admits efficient dynamic programming. We train it without observing the underlying marks, by locally maximizing the incomplete data likelihood (similarly to EM). When we use the trained model to reconstruct the tree's underlying punctuation, the results appear plausible across 5 languages, and in particular are consistent with Nunberg's analysis of English. We show that our generative model can be used to beat baselines on punctuation restoration. Also, our reconstruction of a sentence's underlying punctuation lets us appropriately render the surface punctuation (via our trained underlying-to-surface mechanism) when we syntactically transform the sentence.

Keywords: punctuation, linguistics, dynamic programming, dependency parsing, finite-state methods, unsupervised learning, non-local syntax

**On the complexity and typology of inflectional morphological systems**.

Ryan Cotterell, Christo Kirov, Mans Hulden, and Jason Eisner
(2019).

*TACL*. [ paper | official link | arxiv | slides | bib ]

We quantify the linguistic complexity of different languages' morphological systems. We verify that there is a statistically significant empirical trade-off between paradigm size and irregularity: a language's inflectional paradigms may be either large in size or highly irregular, but never both. We define a new measure of paradigm irregularity based on the conditional entropy of the surface realization of a paradigm—how hard it is to jointly predict all the word forms in a paradigm from the lemma. We estimate irregularity by training a predictive model. Our measurements are taken on large morphological paradigms from 36 typologically diverse languages.

Keywords: linguistics, typology, deep learning, morphology

**Spell once, summon anywhere: A two-level open-vocabulary language
model**.

Sabrina J. Mielke and Jason Eisner (2019).

In *AAAI*. [ paper | arxiv | slides | poster | code | bib ]

We show how the spellings of known words can help us deal with unknown words in open-vocabulary NLP tasks. The method we propose can be used to extend any closed-vocabulary generative model, but in this paper we specifically consider the case of neural language modeling. Our Bayesian generative story combines a standard RNN language model (generating the word *tokens* in each sentence) with an RNN-based spelling model (generating the letters in each word *type*). These two RNNs respectively capture sentence structure and word structure, and are kept separate as in linguistics. By invoking the second RNN to generate spellings for novel words in context, we obtain an open-vocabulary language model. For known words, embeddings are naturally inferred by combining evidence from type spelling and token context. Comparing to baselines (including a novel strong baseline), we beat previous work and establish state-of-the-art results on multiple datasets.

Keywords: language modeling, word embeddings, deep learning, morphology

**Are all languages equally hard to language-model?**

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark
(2019).

*SCiL*. [ extended abstract | official link | bib ]

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both *n*-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.

Note: This was a refereed abstract and presentation of previously published work (Cotterell et al., 2018).

Keywords: typology, morphology, language modeling

**Surface statistics of an unknown language indicate how to parse it**.

Dingquan Wang and Jason Eisner (2018).

*TACL*. [ paper | official link | poster | bib ]

We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of *other* languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages (Wang and Eisner, 2016) in the training achieves further improvement. (3) Despite being computed from *unparsed* corpora, our learned task-specific features beat previous work's interpretable typological features that require *parsed* corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.65 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).

Note: This paper builds on the synthetic Galactic Dependencies treebanks developed by Wang and Eisner (2016), and extends their use from typology prediction (Wang and Eisner, 2017) to parsing of new languages.

Keywords: synthetic data, grammar induction, unsupervised learning, deep learning, linguistics, typology, syntax, dependency parsing, selected papers

**Synthetic data made to order: The case of parsing**.

Dingquan Wang and Jason Eisner (2018).

In *EMNLP*. [ paper+supplement | slides | PDF slides | video | code | bib ]

To approximately parse an unfamiliar language, it helps to have a treebank of a similar language. But what if the only available treebanks have the wrong word order? We show how to (stochastically) permute the constituents of an existing dependency treebank so that its surface part-of-speech statistics approximately match those of the target language. The parameters of the permutation model can be evaluated for quality by dynamic programming and tuned by gradient descent. This optimization procedure yields trees for a new artificial language that resembles the target language. We show that delexicalized parsers for the target language can be successfully trained using such “made to order” artificial languages.

Note: This method of creating synthetic treebanks is analogous to biological mutation. It differs from the method of Wang and Eisner (2016), which is analogous to sexual reproduction.

Keywords: synthetic data, grammar induction, unsupervised learning, linguistics, typology, syntax, dependency parsing, recorded talks

**Discrete latent variables in NLP: Good, bad, and indifferent**.

Jason Eisner (2018).

Invited talk at ACL Workshop on Relevance of Linguistic Structure in
Neural Architectures for NLP. [ bib ]

Keywords: invited talks, linguistics, generative modeling

**Unsupervised disambiguation of syncretism in inflected lexicons**.

Ryan Cotterell, Christo Kirov, Sabrina J. Mielke, and Jason Eisner
(2018).

In *NAACL*. [ paper | arxiv | poster | bib ]

Lexical ambiguity makes it difficult to compute various useful statistics of a corpus. A given word form might represent any of several morphological feature bundles. One can, however, use unsupervised learning (as in EM) to fit a model that probabilistically disambiguates word forms. We present such an approach, which employs a neural network to smoothly model a prior distribution over feature bundles (even rare ones). Although this basic model does not consider a token's context, that very property allows it to operate on a simple list of unigram type counts, partitioning each count among different analyses of that unigram. We discuss evaluation metrics for this novel task and report results on 5 languages.
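
To make the count-partitioning idea concrete, here is a toy EM sketch. It is only an illustration under strong simplifying assumptions, not the paper's model: the prior over feature bundles is a plain categorical distribution rather than a neural network, each form is assumed compatible with a known list of candidate bundles, and the lemma is assumed independent of the bundle. The function name `em_partition`, the bundle labels, and the counts are hypothetical.

```python
from collections import defaultdict


def em_partition(counts, analyses, iterations=50):
    """Split each form's count among its candidate feature bundles with EM.

    counts:   dict mapping word form -> unigram count
    analyses: dict mapping word form -> list of candidate feature bundles
    Returns (partition, prior): partition[form][bundle] is the expected
    share of the form's count; prior[bundle] is the fitted categorical
    prior (a stand-in for the neural prior described in the abstract).
    """
    bundles = sorted({b for cands in analyses.values() for b in cands})
    prior = {b: 1.0 / len(bundles) for b in bundles}
    partition = {}
    for _ in range(iterations):
        expected = defaultdict(float)
        # E-step: divide each count in proportion to the current prior,
        # renormalized over that form's candidate bundles.
        for form, n in counts.items():
            z = sum(prior[b] for b in analyses[form])
            partition[form] = {b: n * prior[b] / z for b in analyses[form]}
            for b, c in partition[form].items():
                expected[b] += c
        # M-step: re-estimate the prior from the expected counts.
        total = sum(expected.values())
        prior = {b: expected[b] / total for b in bundles}
    return partition, prior


# Hypothetical English-like data: "walked" is ambiguous between past
# tense and past participle, while "ate" and "eaten" are not.
counts = {"walked": 100, "ate": 30, "eaten": 10}
analyses = {"walked": ["PST", "PTCP"], "ate": ["PST"], "eaten": ["PTCP"]}
partition, prior = em_partition(counts, analyses)
```

In this toy data, the unambiguous forms favor PST three to one, so EM settles on assigning about 75 of the 100 occurrences of "walked" to PST and about 25 to PTCP.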

Keywords: morphology, unsupervised learning

**Are all languages equally hard to language-model?**

Ryan Cotterell, Sabrina J. Mielke, Jason Eisner, and Brian Roark
(2018).

In *NAACL*. [ paper | arxiv | poster | bib ]

For general modeling methods applied to diverse languages, a natural question is: how well should we expect our models to work on languages with differing typological profiles? In this work, we develop an evaluation framework for fair cross-linguistic comparison of language models, using translated text so that all models are asked to predict approximately the same information. We then conduct a study on 21 languages, demonstrating that in some languages, the textual expression of the information is harder to predict with both *n*-gram and LSTM language models. We show complex inflectional morphology to be a cause of performance differences among languages.

Note: Our followup paper with a more sophisticated analysis and weaker conclusions is Mielke et al. (2019).

Keywords: typology, morphology, language modeling

**Neural particle smoothing for sampling from conditional sequence
models**.

Chu-Cheng Lin and Jason Eisner (2018).

In *NAACL*. [ paper | arxiv | poster | bib ]

We introduce *neural particle smoothing*, a sequential Monte Carlo method for sampling annotations of an input string from a given probability model. In contrast to conventional particle filtering algorithms, we train a proposal distribution that looks ahead to the end of the input string by means of a right-to-left LSTM. We demonstrate that this innovation can improve the quality of the sample. To motivate our formal choices, we explain how neural transduction models and our sampler can be viewed as low-dimensional but nonlinear approximations to working with HMMs over very large state spaces.

Keywords: finite-state methods, deep learning, particle methods

**A deep generative model of vowel formant typology**.

Ryan Cotterell and Jason Eisner (2018).

In *NAACL*. [ paper | arxiv | slides | video | bib ]

What makes some types of languages more probable than others? For instance, we know that almost all spoken languages contain the vowel phoneme /i/; why should that be? The field of linguistic typology seeks to answer these questions and, thereby, divine the mechanisms that underlie human language. In our work, we tackle the problem of vowel system typology, i.e., we propose a generative probability model of which vowels a language contains. In contrast to previous work, we work directly with the acoustic information—the first two formant values—rather than modeling discrete sets of phonemic symbols (IPA). We develop a novel generative probability model and report results based on a corpus of 233 languages.

Note: This paper extends Cotterell et al. (2017). A further followup is Cotterell et al. (2019).

Keywords: phonology, typology, linguistics, deep learning, recorded talks

**UniMorph 2.0: Universal morphology**.

Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine
Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sabrina J. Mielke,
Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans
Hulden (2018).

In *LREC*. [ paper | official paper | arxiv | bib ]

The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology across the world’s languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a bundle of morphological features from our schema. Additional supporting data and tools are also released on a per-language basis when available. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland. This paper details advances made to the collection, annotation, and dissemination of project resources since the initial UniMorph release described at LREC 2016.

Keywords: linguistics, annotation, morphology

**On the diachronic stability of irregularity in inflectional morphology**.

Ryan Cotterell, Christo Kirov, Mans Hulden, and Jason Eisner
(2018).

*CoRR*. [ paper | arxiv | bib ]

Many languages' inflectional morphological systems are replete with irregulars, i.e., words that do not seem to follow standard inflectional rules. In this work, we quantitatively investigate the conditions under which irregulars can survive in a language over the course of time. Using recurrent neural networks to simulate language learners, we test the diachronic relation between frequency of words and their irregularity.

Note: Accepted to NAACL 2018, but withdrawn in order to add more thorough experiments before full publication.

Keywords: linguistics, typology, morphology

**Treating machine learning algorithms as declaratively specified
circuits**.

Jason Eisner and Nathaniel Wesley Filardo (2018).

In *SysML*. [ paper | official link | poster | PDF poster | bib ]

We overview the Dyna programming abstraction for specifying ML systems as computational circuits.

Note: This was a refereed abstract of previously published work.

Keywords: Dyna, circuits

**Recovering syntactic structure from surface features**.

Jason Eisner (2018).

Invited talk at Penn State University. [ slides | bib ]

We show how to predict the basic word-order facts of a novel language, and even obtain approximate syntactic parses, given only a corpus of part-of-speech (POS) sequences. We are motivated by the longstanding challenge of determining the structure of a language from its superficial features. While this is usually regarded as an unsupervised learning problem, there are good reasons that generic unsupervised learners are not up to the challenge. We do much better with a supervised approach where we train a system – a kind of language acquisition device – to predict how linguists will annotate a language. Our system uses a neural network to extract predictions from a large collection of numerical measurements. We train it on a mixture of real treebanks and synthetic treebanks obtained by systematically permuting the real trees, which we can motivate as sampling from an approximate prior over possible human languages.

Note: The work in this talk was mainly reported in Wang and Eisner (2016) and its followup papers.

Keywords: invited talks, grammar induction, unsupervised learning, linguistics, typology, syntax, dependency parsing

**Probabilistically modeling surface patterns using latent structure**.

Jason Eisner (2018).

Invited talk at the 1st Annual Meeting of the Society for Computation
in Linguistics (SCiL). [ slides | video with captions | bib ]

A language's lexicon of surface forms and constructions includes many systematic regularities, as well as semi-regular and irregular exceptions. Generative linguists often explain regularities using shared latent representations and regular derivational processes. A probabilistic model with those elements can naturally allow for deviations from regularity and model the fact that some deviations are improbable. The probability of a derivational change can be sensitive to subtle properties of the context. I will outline several probabilistic models of the morphophonological and syntactic lexicons, which can extrapolate predictions based on their reconstruction of latent structure: e.g., underlying forms, cyclic derivations, and input-output alignments.

Note: The work in this talk was mainly reported in Cotterell et al. (2015) and its followup papers.

Keywords: invited talks, recorded talks, linguistics, finite-state methods, morphology, phonology

**Predicting fine-grained syntactic typology from surface features**.

Dingquan Wang and Jason Eisner (2018). [ extended abstract | official link | Russian translation | poster | bib ]

We show how to predict the basic word-order facts of a novel language given only a corpus of its part-of-speech (POS) sequences. We predict how often direct objects follow their verbs, how often adjectives follow their nouns, and in general the directionalities of all dependency relations. Although recovering syntactic structure is usually regarded as unsupervised learning, we train our predictor on languages of known structure. It outperforms the state-of-the-art unsupervised learning by a large margin, especially when we augment the training data with many synthetic languages.

Note: This was a refereed abstract and presentation of previously published work (Wang and Eisner, 2017).

Keywords: synthetic data, grammar induction, unsupervised learning, linguistics, typology, syntax, dependency parsing

**Quantifying the trade-off between two types of morphological
complexity**.

Ryan Cotterell, Christo Kirov, Mans Hulden, and Jason Eisner (2018). [ extended abstract | official link | PDF slides | bib ]

Note: This was a refereed abstract and presentation of work that later appeared as Cotterell et al. (2019).

Keywords: linguistics, typology, morphology

**The neural Hawkes process: A neurally self-modulating multivariate point
process**.

Hongyuan Mei and Jason Eisner (2017).

In *NeurIPS*. [ official link | paper+supplement | arxiv | poster | teaser video | code | bib ]

Many events occur in the world. Some event types are stochastically excited or inhibited (in the sense of having their probabilities elevated or decreased) by patterns in the sequence of previous events. Discovering such patterns can help us predict *which type* of event will happen next and *when*. We model streams of discrete events in continuous time, by constructing a *neurally self-modulating multivariate point process* in which the intensities of multiple event types evolve according to a novel *continuous-time LSTM*. This generative model allows past events to influence the future in complex and realistic ways, by conditioning future event intensities on the hidden state of a recurrent neural network that has consumed the stream of past events. Our model has desirable qualitative properties. It achieves competitive likelihood and predictive accuracy on real and synthetic datasets, including under missing-data conditions.
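
For readers unfamiliar with point-process intensities, the following minimal sketch shows the classical univariate Hawkes process with an exponential kernel, which is the simpler model that the neural version generalizes. It is not the neural Hawkes process itself: it has a single event type, allows only excitation (no inhibition), and uses no LSTM. The function name and the parameter values `mu`, `alpha`, and `beta` are illustrative assumptions.

```python
import math


def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Intensity of a classical exponential-kernel Hawkes process at time t.

    mu is the constant base rate; each past event at time s < t adds an
    excitation alpha * exp(-beta * (t - s)) that decays as time passes.
    (Parameter values here are arbitrary illustrations.)
    """
    excitation = sum(alpha * math.exp(-beta * (t - s))
                     for s in history if s < t)
    return mu + excitation


# One event at time 1.0: the intensity jumps just after the event and
# then decays back toward the base rate.
just_after = hawkes_intensity(1.1, [1.0])
much_later = hawkes_intensity(3.0, [1.0])
no_history = hawkes_intensity(5.0, [])
```

In the classical model each past event can only raise the intensity, and its effect decays at a fixed rate; the abstract above describes a model in which a recurrent network's hidden state governs the intensities instead, so that events may also inhibit one another.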

Note: This is the first in a series of papers on the neural Hawkes process.

Keywords: event streams, deep learning, generative modeling, unsupervised learning, selected papers

**Dyna 2: Towards a General Weighted Logic Language**.

Nathaniel Wesley Filardo (2017).

PhD thesis, Johns Hopkins University. [ bib ]

We investigate the design of an expressive, purely-declarative, weighted logic programming language, Dyna. Dyna is a decade-plus effort in pushing the boundaries of declarative programming and “executable mathematics;” it instantiates an unusual point in the design space, as it is both Turing-complete (unlike Datalog) and devoid of a specified execution order (unlike Prolog). That is, it is designed to be, at once, both highly expressive and rich in opportunities for automated optimization. This thesis contains two major thrusts. We first consider both the denotational and operational aspects of Dyna. In particular, for operational semantics, we introduce and extend our EarthBound solver for finite circuits; the next chapter considers the generalization to logic programs proper. We then turn our attention to the static analysis of this language, considering mechanisms for reasoning both about abstract notions of well-formedness of programs as well as more mundane concerns of realizability of programs in actual computation. Along the way we endeavour to place our work in the context of the larger field of logic programming languages and present our current thoughts on future avenues of exploration.

Note: Dr. Filardo's dissertation advisor was Jason Eisner.

Keywords: theses, Dyna, circuits

**ACL policies and guidelines for submission, review and citation**.

Jason Eisner, Jennifer Foster, Iryna Gurevych, Marti Hearst, Heng Ji,
Lillian Lee, Christopher Manning, Paola Merlo, Yusuke Miyao, Joakim Nivre,
Amanda Stent, and Ming Zhou (2017).

Report available on the wiki of the Association for Computational
Linguistics. [ report | policies | bib ]

This is the report of a working group appointed by the ACL Executive Committee to review policies and guidelines for conference submissions. The group was chaired by ACL President Joakim Nivre. The policies were adopted by the Association for Computational Linguistics effective January 1, 2018.

Keywords: admin

**Knowledge tracing in sequential learning of inflected vocabulary**.

Adithya Renduchintala, Philipp Koehn, and Jason Eisner (2017).

In *CoNLL*. [ paper | slides | bib ]

We present a feature-rich knowledge tracing method that captures a student's acquisition and retention of knowledge during a foreign language phrase learning task. We model the student's behavior as making predictions under a log-linear model, and adopt a neural gating mechanism to model how the student updates their log-linear parameters in response to feedback. The gating mechanism allows the model to learn complex patterns of retention and acquisition for each feature, while the log-linear parameterization results in an interpretable knowledge state. We collect human data and evaluate several versions of the model.

Keywords: edutech, generative modeling

**Probabilistic typology: Deep generative models of vowel inventories**.

Ryan Cotterell and Jason Eisner (2017).

In *ACL (Best Long Paper Award)*. [ paper | arxiv | slides | video | podcast | bib ]

Linguistic typology studies the range of structures present in human language. The main goal of the field is to discover which sets of possible phenomena are universal, and which are merely frequent. For example, all languages have vowels, while most—but not all—languages have a /u/ sound. In this paper we present the first probabilistic treatment of a basic question in phonological typology: What makes a natural vowel inventory? We introduce a series of deep stochastic point processes, and contrast them with previous computational, simulation-based approaches. We provide a comprehensive suite of experiments on over 200 distinct languages.

Note: Followup papers were Cotterell et al. (2018) and Cotterell et al. (2019).

Keywords: phonology, typology, linguistics, deep learning, recorded talks, selected papers

**Bayesian modeling of lexical resources for low-resource settings**.

Nicholas Andrews, Mark Dredze, Benjamin Van Durme, and Jason Eisner
(2017).

In *ACL*. [ paper | slides | video | code | bib ]

Lexical resources such as dictionaries and gazetteers are often used as auxiliary data for tasks such as part-of-speech induction and named-entity recognition. However, discriminative training with lexical features requires annotated data to reliably estimate the lexical feature weights and may result in overfitting the lexical features at the expense of features which generalize better. In this paper, we investigate a more robust approach: we stipulate that the lexicon is the result of an assumed generative process. Practically, this means that we may treat the lexical resources as *observations* under the proposed generative model. The lexical resources provide training data for the generative model without requiring separate data to estimate lexical feature weights. We evaluate the proposed approach in two settings: part-of-speech induction and low-resource named-entity recognition.

Keywords: names, nonparametric models, recorded talks

**Dyna: Toward a self-optimizing declarative language for machine learning
applications**.

Tim Vieira, Matthew Francis-Landau, Nathaniel Wesley Filardo, Farzad
Khorasani, and Jason Eisner (2017).

In *Workshop on Machine Learning and Programming Languages
(MAPL)*. [ paper | slides | PDF slides | video | bib ]

Declarative programming is a paradigm that allows programmers to specify *what* they want to compute, leaving *how* to compute it to a solver. Our declarative programming language, Dyna, is designed to compactly specify computations like those that are frequently encountered in machine learning. As a declarative language, Dyna's solver has a large space of (correct) strategies available to it. We describe a reinforcement learning framework for *adaptively* choosing among these strategies to maximize efficiency for a given workload. Adaptivity in execution is especially important for software that will run under a variety of workloads, where no fixed policy works well. We hope that reinforcement learning will identify good policies reasonably quickly, offloading the burden of writing efficient code from human programmers.

Keywords: Dyna, reinforcement learning, circuits, parallel/distributed computation, recorded talks, selected papers

**Explaining and generalizing skip-gram through exponential family principal
component analysis**.

Ryan Cotterell, Adam Poliak, Benjamin Van Durme, and Jason Eisner
(2017).

In *EACL*. [ paper | bib ]

The popular skip-gram model induces word embeddings by exploiting the signal from word-context cooccurrence. We offer a new interpretation of skip-gram based on exponential family PCA, a form of matrix factorization, and use it to generalize the skip-gram model to *tensor* factorization. In turn, this lets us train embeddings through richer higher-order cooccurrences, e.g., triples that include positional information (to incorporate syntax) or morphological information (to share parameters across related words). We experiment on 40 languages and show our model improves upon skip-gram.

Keywords: word embeddings, deep learning, morphology

**Fine-grained prediction of syntactic typology: Discovering latent
structure with supervised learning**.

Dingquan Wang and Jason Eisner (2017).

We show how to predict the basic word-order facts of a novel language given only a corpus of part-of-speech (POS) sequences. We predict how often direct objects follow their verbs, how often adjectives follow their nouns, and in general the directionalities of all dependency relations. Such typological properties could be helpful in grammar induction. While such a problem is usually regarded as unsupervised learning, our innovation is to treat it as *supervised* learning, using a large collection of realistic synthetic languages as training data. The supervised learner must identify *surface* features of a language's POS sequence (hand-engineered or neural features) that correlate with the language's *deeper* structure (latent trees). In our experiments, we show: 1) Given a small set of real languages, it helps to add many synthetic languages to the training data. 2) Our system is robust even when the POS sequences include noise. 3) Our system on this task outperforms a grammar induction baseline by a large margin.

Note: This paper builds on the synthetic Galactic Dependencies treebanks developed by Wang and Eisner (2016). *Caveat lector:* Our experimental design evaluated on held-out treebanks. Readers should be aware that two of the 17 evaluation treebanks were for languages that were also represented in the training (albeit with different text and different annotation efforts). Withholding those treebanks from the evaluation does not qualitatively change the published results. We are preparing an addendum that gives the results for this second experimental design and compares the two designs.

Keywords: synthetic data, grammar induction, unsupervised learning, linguistics, typology, syntax, dependency parsing, recorded talks, selected papers

**Learning to prune: Exploring the frontier of fast and accurate parsing**.

Tim Vieira and Jason Eisner (2017).

*TACL*. [ paper | official link | slides | video | code | bib ]

Pruning hypotheses during dynamic programming is commonly used to speed up inference in settings such as parsing. Unlike prior work, we train a pruning policy under an objective that measures end-to-end performance: we search for a fast *and* accurate policy. This poses a difficult machine learning problem, which we tackle with the algorithm. We apply our approach to constituency parsing. Our experimental results show that accounting for performance of the end-to-end system leads to a better Pareto frontier, i.e., parsers which are more accurate for a given runtime.

Keywords: parsing approximations, reinforcement learning, recorded talks, selected papers

**Fine-grained parallelism in probabilistic parsing with Habanero
Java**.

Matthew Francis-Landau, Bing Xue, Jason Eisner, and Vivek Sarkar
(2016).

In *Proceedings of the Sixth Workshop on Irregular Applications:
Architectures and Algorithms (IA^3)*. [ paper | slides | bib ]

Structured prediction algorithms—used when applying machine learning to tasks like natural language parsing and image understanding—present some opportunities for fine-grained parallelism, but also have problem-specific serial dependencies. Most implementations exploit only simple opportunities such as parallel BLAS, or embarrassing parallelism over input examples. In this work we explore an orthogonal direction: using the fact that these algorithms can be described as specialized forward-chaining theorem provers (Pereira and Warren, 1983; Eisner et al., 2005), and implementing fine-grained parallelization of the forward-chaining mechanism. We study context-free parsing as a simple canonical example, but the approach is more general.

Keywords: Dyna, circuits, CFG parsing, parallel/distributed computation

**Inside-outside and forward-backward algorithms are just backprop**.

Jason Eisner (2016).

In *EMNLP Workshop on Structured Prediction for NLP*. [ paper | slides | bib ]

One commonly needs to obtain the expected counts of states, transitions, constituents, or rules under probabilistic or weighted grammars. This requires an algorithm such as inside-outside or forward-backward that is tailored to the grammar formalism. Conveniently, each such algorithm can be obtained by automatically differentiating an “inside” algorithm that merely computes the log-probability of the evidence. This mechanical procedure produces correct and efficient code. Just as for any instance of back-propagation, it can be carried out manually or by software. This pedagogical paper carefully spells out the construction and relates it to traditional and non-traditional views of these algorithms.
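The identity behind this abstract can be checked numerically on a toy model. The sketch below is not the paper's code: the two-state HMM, its log-weights, and all function names are assumptions made for illustration, and central finite differences stand in for automatic differentiation. It computes log Z with a forward ("inside"-style) pass, then compares the gradient of log Z with respect to each transition log-weight against the expected transition counts obtained by brute-force enumeration of all state paths.

```python
import math
from itertools import product

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_z(theta_init, theta_trans, theta_emit, obs):
    """Forward pass: log of the total weight of all state paths that emit obs."""
    n = len(theta_init)
    alpha = [theta_init[s] + theta_emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            log_sum_exp([alpha[r] + theta_trans[r][s] for r in range(n)])
            + theta_emit[s][o]
            for s in range(n)
        ]
    return log_sum_exp(alpha)

def expected_transition_counts(theta_init, theta_trans, theta_emit, obs):
    """Reference answer: enumerate every path and weight its counts by p(path | obs)."""
    n = len(theta_init)
    lz = log_z(theta_init, theta_trans, theta_emit, obs)
    counts = [[0.0] * n for _ in range(n)]
    for path in product(range(n), repeat=len(obs)):
        score = theta_init[path[0]] + theta_emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            score += theta_trans[path[t - 1]][path[t]] + theta_emit[path[t]][obs[t]]
        posterior = math.exp(score - lz)
        for t in range(1, len(obs)):
            counts[path[t - 1]][path[t]] += posterior
    return counts

def grad_log_z(theta_init, theta_trans, theta_emit, obs, eps=1e-5):
    """Central finite differences of log Z w.r.t. each transition log-weight."""
    n = len(theta_init)
    grad = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for s in range(n):
            plus = [row[:] for row in theta_trans]
            minus = [row[:] for row in theta_trans]
            plus[r][s] += eps
            minus[r][s] -= eps
            grad[r][s] = (
                log_z(theta_init, plus, theta_emit, obs)
                - log_z(theta_init, minus, theta_emit, obs)
            ) / (2 * eps)
    return grad

# Hypothetical toy model: 2 states, 2 output symbols, unnormalized log-weights.
THETA_INIT = [0.0, -0.5]
THETA_TRANS = [[0.2, -1.0], [-0.3, 0.1]]
THETA_EMIT = [[0.5, -0.2], [-0.7, 0.3]]
OBS = [0, 1, 1, 0]

COUNTS = expected_transition_counts(THETA_INIT, THETA_TRANS, THETA_EMIT, OBS)
GRAD = grad_log_z(THETA_INIT, THETA_TRANS, THETA_EMIT, OBS)
```

Here `GRAD` and `COUNTS` should agree entry by entry up to finite-difference error. In practice one would differentiate the forward pass with an automatic differentiation tool rather than by perturbation, which gives the same quantities at roughly the cost of the forward pass itself.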

Keywords: teaching, automatic differentiation, dynamic programming, CFG parsing

**Speed-accuracy tradeoffs in tagging with variable-order CRFs and
structured sparsity**.

Tim Vieira, Ryan Cotterell, and Jason Eisner (2016).

In *EMNLP*. [ paper | bib ]

We propose a method for learning the structure of variable-order CRFs, a more flexible variant of higher-order linear-chain CRFs. Variable-order CRFs achieve faster inference by including features for only some of the tag *n*-grams. Our learning method discovers the useful higher-order features at the same time as it trains their weights, by maximizing an objective that combines log-likelihood with a structured-sparsity regularizer. An active-set outer loop allows the feature set to grow as far as needed. On part-of-speech tagging in 5 randomly chosen languages from the Universal Dependencies dataset, our method of shrinking the model achieved a 2–6x speedup over a baseline, with no significant drop in accuracy.

Keywords: dynamic programming, finite-state methods, tagging

**The Galactic Dependencies treebanks: Getting more data by synthesizing
new languages**.

Dingquan Wang and Jason Eisner (2016).

*TACL*. [ paper | official link | arxiv | slides | video | data | bib ]

We release Galactic Dependencies 1.0—a large set of synthetic languages not found on Earth, but annotated in Universal Dependencies format. This new resource aims to provide training and development data for NLP methods that aim to adapt to unfamiliar languages. Each synthetic treebank is produced from a real treebank by stochastically permuting the dependents of nouns and/or verbs to match the word order of other real languages. We discuss the usefulness, realism, parsability, perplexity, and diversity of the synthetic languages. As a simple demonstration of the use of Galactic Dependencies, we consider single-source transfer, which attempts to parse a real target language using a parser trained on a “nearby” source language. We find that including synthetic source languages somewhat increases the diversity of the source pool, which significantly improves results for most target languages.

Note: In later papers, we successfully used the Galactic Dependencies treebanks to extrapolate fine-grained typology prediction (Wang and Eisner, 2018) and multilingual parsing (Wang and Eisner, 2018) to unseen languages. Some of the material in these 3 papers is synthesized into this overview talk. We also developed an alternative method for creating synthetic languages on demand (Wang and Eisner, 2018).

*Caveat lector:* Our experimental design evaluated on held-out treebanks. Readers should be aware that two of the 17 evaluation treebanks were for languages that were also represented in the training (albeit with different text and different annotation efforts). Withholding those treebanks from the evaluation does not qualitatively change the published results. We are preparing an addendum that gives the results for this second experimental design and compares the two designs.

Keywords: synthetic data, corpora, linguistics, syntax, dependency parsing, MT, recorded talks

**Analyzing learner understanding of novel L2 vocabulary**.

Rebecca Knowles, Adithya Renduchintala, Philipp Koehn, and Jason
Eisner (2016).

In *CoNLL*. [ paper | video | bib ]

In this work, we explore how learners can infer second-language noun meanings in the context of their native language. Motivated by an interest in building interactive tools for language learning, we collect data on three word-guessing tasks, analyze their difficulty, and explore the types of errors that novice learners make. We train a log-linear model for predicting our subjects' guesses of word meanings in varying kinds of contexts. The model's predictions correlate well with subject performance, and we provide quantitative and qualitative analyses of both human and model performance.

Note: This study is closely related to (Renduchintala et al., 2016), but it uses a more controlled experimental design that permits a simpler model, and it examines more features. Our companion paper (Renduchintala et al., 2016) describes an interactive user interface for reading “macaronic” text. See also this talk.

Keywords: edutech, MT, recorded talks

**Creating interactive macaronic interfaces for language learning**.

Adithya Renduchintala, Rebecca Knowles, Philipp Koehn, and Jason
Eisner (2016).

In *ACL Demo Session*. [ paper+supplement | poster | bib ]

We present a prototype of a novel technology for second language instruction. Our learn-by-reading approach lets a human learner acquire new words and constructions by encountering them in context. To facilitate reading comprehension, our technology presents mixed native language (L1) and second language (L2) sentences to a learner and allows them to interact with the sentences to make the sentences easier (more L1-like) or harder (more L2-like) to read. Eventually, our system should continuously track a learner's knowledge and learning style by modeling their interactions, including performance on a pop quiz feature. This will allow our system to generate personalized mixed-language texts for learners.

Note: Our companion papers (Renduchintala et al., 2016; Knowles et al., 2016) study humans' comprehension in these mixed-language (“macaronic”) contexts. See also this talk.

Keywords: edutech, MT, selected papers

**User modeling in language learning with macaronic texts**.

Adithya Renduchintala, Rebecca Knowles, Philipp Koehn, and Jason
Eisner (2016).

In *ACL*. [ paper+supplement | slides | video | bib ]

Foreign language learners can acquire new vocabulary by using cognate and context clues when reading. To measure such incidental comprehension, we devise an experimental framework that involves reading mixed-language “macaronic” sentences. Using data collected via Amazon Mechanical Turk, we train a graphical model to simulate a human subject's comprehension of foreign words, based on cognate clues (edit distance to an English word), context clues (pointwise mutual information), and prior exposure. Our model does a reasonable job at predicting which words a user will be able to understand, which should facilitate the automatic construction of comprehensible text for personalized foreign language education.
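The two clue types named in this abstract have standard textbook definitions, sketched below. This is only an illustration under assumed definitions (plain Levenshtein distance and count-based pointwise mutual information); the function names and example values are hypothetical, and the sketch does not reproduce the paper's graphical model or its exact feature computations.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information log[p(x, y) / (p(x) p(y))],
    estimated from raw counts over `total` observations."""
    return math.log((count_xy * total) / (count_x * count_y))

# Cognate clue: a foreign word spelled much like an English word is easier to guess.
cognate_clue = edit_distance("haus", "house")

# Context clue: how strongly a candidate meaning is associated with a context word.
context_clue = pmi(count_xy=10, count_x=100, count_y=50, total=1000)
```

A smaller edit distance suggests a stronger cognate clue, and a larger PMI suggests a more informative context.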

Note: Our companion paper (Renduchintala et al., 2016) describes an interactive user interface for reading “macaronic” text. See also our followup study (Knowles et al., 2016) and this overview talk.

Keywords: edutech, MT, recorded talks

**Morphological smoothing and extrapolation of word embeddings**.

Ryan Cotterell, Hinrich Schütze, and Jason Eisner (2016).

In *ACL*. [ paper+supplement | slides | video | bib ]

Languages with rich inflectional morphology exhibit lexical data sparsity, since the word used to express a given concept will vary with the syntactic context. For instance, each count noun in Czech has 12 forms (where English uses only singular and plural). Even in large corpora, we are unlikely to observe all inflections of a given lemma. This reduces the vocabulary coverage of methods that induce continuous representations for words from distributional corpus information. We solve this problem by exploiting existing morphological resources that can enumerate a word's component morphemes. We present a latent-variable Gaussian graphical model that allows us to extrapolate continuous representations for words not observed in the training corpus, as well as smoothing the representations provided for the observed words. The latent variables represent embeddings of morphemes, which combine to create embeddings of words. Over several languages and training sizes, our model improves the embeddings for words, when evaluated on an analogy task, skip-gram predictive accuracy, and word similarity.

Keywords: word embeddings, deep learning, morphology, recorded talks

**Rigid tree automata with isolation**.

Nathaniel Wesley Filardo and Jason Eisner (2016).

In *TTATT*. [ paper | slides | bib ]

Rigid Tree Automata (RTAs) are a strict super-class of Regular Tree Automata (TAs), additionally capable of recognizing certain nonlinear patterns such as {f〈x,x〉:x∈X}. RTAs were developed for use in tree-automata-based model checking; we hope to use them as part of a static analysis system for a logic programming language. In developing that system, we noted that RTAs are not closed under Kleene-star or pre-concatenation with a regular language. We now introduce a strict super-class of RTA, called Isolating Rigid Tree Automata, which can accept rigid structures with arbitrarily many *isolated* rigid substructures, such as “lists of equal pairs,” by allowing rigidity to be confined within subtrees. This class is Kleene-star and concatenation closed and retains many features of RTAs, including linear-time emptiness testing and NP-complete membership testing. However, it gives up closure under intersection.

Keywords: finite-state methods, Dyna

**Weighting finite-state transductions with neural context**.

Pushpendre Rastogi, Ryan Cotterell, and Jason Eisner (2016).

In *NAACL*. [ paper+supplement | slides | video | code | bib ]

How should one apply deep learning to tasks such as morphological reinflection, which stochastically edit one string to get another? A recent approach to such sequence-to-sequence tasks is to compress the input string into a vector that is then used to generate the output string, using recurrent neural networks. In contrast, we propose to keep the traditional architecture, which uses a finite-state transducer to score *all possible* output strings, but to augment the scoring function with the help of recurrent networks. A stack of bidirectional LSTMs reads the input string from left-to-right and right-to-left, in order to summarize the *input context* in which a transducer arc is applied. We combine these learned features with the transducer to define a probability distribution over *aligned* output strings, in the form of a weighted finite-state automaton. This reduces hand-engineering of features, allows learned features to examine unbounded context in the input string, and still permits exact inference through dynamic programming. We illustrate our method on the tasks of morphological reinflection and lemmatization.

Note: Jason's talk for this paper was based around the movie *Cowboys & Aliens*. The method developed in this paper should be called a BiLSTM-FST, by analogy with related architectures such as the BiLSTM-CRF and the BiLSTM-CFG. Aharoni and Goldberg (2017) refer to latent alignments as "hard monotonic attention."

Keywords: finite-state methods, deep learning, dynamic programming, morphology, selected papers, recorded talks

**Generative Non-Markov Models for Information Extraction**.

Nicholas Oliver Andrews (2016).

PhD thesis, Johns Hopkins University. [ bib ]

Learning from unlabeled data is a long-standing challenge in machine learning. A principled solution involves modeling the full joint distribution over inputs and the latent structure of interest, and imputing the missing data via marginalization. Unfortunately, such marginalization is expensive for most non-trivial problems, which places practical limits on the expressiveness of generative models. As a result, joint models often encode strict assumptions about the underlying process such as fixed-order Markovian assumptions and employ simple count-based features of the inputs. In contrast, conditional models, which do not directly model the observed data, are free to incorporate rich overlapping features of the input in order to predict the latent structure of interest. It would be desirable to develop expressive generative models that retain tractable inference. This is the topic of this thesis. In particular, we explore joint models which relax fixed-order Markov assumptions, and investigate the use of recurrent neural networks for automatic feature induction in the generative process.

We focus on two structured prediction problems: (1) imputing labeled segmentations of input character sequences, and (2) imputing directed spanning trees relating strings in text corpora. These problems arise in many applications of practical interest, but we are primarily concerned with named-entity recognition and cross-document coreference resolution in this work.

For named-entity recognition, we propose a generative model in which the observed characters originate from a latent non-Markov process over words, and where the characters are themselves produced via a non-Markov process: a recurrent neural network (RNN). We propose a sampler for the proposed model in which sequential Monte Carlo is used as a transition kernel for a Gibbs sampler. The kernel is amenable to a fast parallel implementation, and results in fast mixing in practice.

For cross-document coreference resolution, we move beyond sequence modeling to consider string-to-string transduction. We stipulate a generative process for a corpus of documents in which entity names arise from copying—and optionally transforming—previous names of the same entity. Our proposed model is sensitive to both the context in which the names occur as well as their spelling. The string-to-string transformations correspond to systematic linguistic processes such as abbreviation, typos, and nicknaming, and by analogy to biology, we think of them as mutations along the edges of a phylogeny. We propose a novel block Gibbs sampler for this problem that alternates between sampling an ordering of the mentions and a spanning tree relating all mentions in the corpus.

Note: Dr. Andrews's dissertation advisors were Jason Eisner and Mark Dredze.

Keywords: theses, transformation models, finite-state methods, coreference resolution

**Gradually learning to read a foreign language: Adaptive partial machine
translation**.

Jason Eisner (2016).

Keynote talk at biennial Science of Learning Symposium at Johns
Hopkins University. [ slides | video | press | bib ]

We propose that one should learn a foreign language by reading interesting prose. But how can one get started? We are building an intelligent user interface that *partially* translates text, leaning at first on the learner's native vocabulary but gradually introducing new foreign words and constructions in context. Faced with hybrid text of this sort, the learner can also use the mouse to translate or untranslate portions of a sentence; as a side benefit, this provides feedback about what the learner currently understands. We give an overview of the project, including pedagogical motivation, modeling of the learner, data collection, user interface design, linguistic issues, and our use of machine translation and reinforcement learning inside the system.

Keywords: edutech, invited talks, recorded talks

**Graphical models over string-valued random variables**.

Jason Eisner (2015).

Keynote talk at IEEE ASRU. [ slides | video | bib ]

Natural language processing must sometimes consider the internal structure of words, e.g., in order to understand or generate an unfamiliar word. Unfamiliar words are systematically related to familiar ones due to linguistic processes such as morphology, phonology, abbreviation, copying error, and historical change. We will show how to build joint probability models over many strings. These models are capable of predicting unobserved strings, or predicting the relationships among observed strings. However, computing the predictions of these models can be computationally hard. We outline approximate algorithms based on Markov chain Monte Carlo, expectation propagation, and dual decomposition. We give results on some NLP tasks.

Note: The work in this talk was mainly reported in Cotterell et al. (2015) and its followup papers.

Keywords: invited talks, recorded talks, finite-state methods, relaxation, variational inference, global optimization, morphology, phonology

**Graphical Models with Structured Factors, Neural Factors, and
Approximation-Aware Training**.

Matthew R. Gormley (2015).

PhD thesis, Johns Hopkins University. [ dissertation | bib ]

This thesis broadens the space of rich yet practical models for structured prediction. We introduce a general framework for modeling with four ingredients: (1) latent variables, (2) structural constraints, (3) learned (neural) feature representations of the inputs, and (4) training that takes the approximations made during inference into account. The thesis builds up to this framework through an empirical study of three NLP tasks: semantic role labeling, relation extraction, and dependency parsing, obtaining state-of-the-art results on the first two. We apply the resulting *graphical models with structured and neural factors, and approximation-aware learning* to jointly model part-of-speech tags, a syntactic dependency parse, and semantic roles in a low-resource setting where the syntax is unobserved. We also present an alternative view of these models as *neural networks* with a topology inspired by inference on graphical models that encode our intuitions about the data.

Note: Dr. Gormley's dissertation advisors were Jason Eisner and Mark Dredze.

Keywords: theses, dynamic programming, deep learning, dependency parsing, parsing approximations, variational inference, semantics

**Dual decomposition inference for graphical models over strings**.

Nanyun Peng, Ryan Cotterell, and Jason Eisner (2015).

In *EMNLP*. [ paper | slides | bib ]

We investigate dual decomposition for joint MAP inference of many strings. Given an arbitrary graphical model, we decompose it into small acyclic sub-models, whose MAP configurations can be found by finite-state composition and dynamic programming. We force the solutions of these subproblems to agree on overlapping variables, by tuning Lagrange multipliers for an adaptively expanding set of variable-length *n*-gram count features. This is the first inference method for arbitrary graphical models over strings that does not require approximations such as random sampling, message simplification, or a bound on string length.

*Provided that* the inference method terminates, it gives a certificate of global optimality (though MAP inference in our setting is undecidable in general). On our global phonological inference problems, it does indeed terminate, and achieves more accurate results than max-product and sum-product loopy belief propagation.

Note: This paper and Cotterell and Eisner (2015) provide improved inference methods for problems involving graphical models over strings, such as Dreyer et al. (2009, 2011) and Cotterell et al. (2015). See also this combined talk.

Keywords: finite-state methods, relaxation, global optimization, morphology, phonology

**Penalized expectation propagation for graphical models over strings**.

Ryan Cotterell and Jason Eisner (2015).

In *NAACL-HLT*. [ paper+supplement | slides | video | scholar | bib ]

We present penalized expectation propagation, a novel algorithm for approximate inference in graphical models. Expectation propagation is a well-known message-passing algorithm that prevents messages from growing in complexity by forcing them back into a given family of distributions. Our extension uses a structured-sparsity penalty to prefer simpler messages. In the case of string-valued random variables, this allows us to use a non-parametric message family, related to variable-order *n*-gram models. The method automatically calibrates the complexity of each message to balance speed and accuracy. We test the algorithm on phonological inference problems, showing substantial speedup over previous algorithms with no significant loss in accuracy.

Note: This paper and Peng et al. (2015) provide improved inference methods for problems involving graphical models over strings, such as Dreyer et al. (2009, 2011) and Cotterell et al. (2015). See also this combined talk.

Keywords: finite-state methods, variational inference, morphology, phonology, recorded talks, selected papers

**Approximation-aware dependency parsing by belief propagation**.

Matthew R. Gormley, Mark Dredze, and Jason Eisner (2015).

*TACL*. [ paper | official link | arxiv | slides | PDF slides | code | bib ]

We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n^3) runtime. It outputs the parse with maximum expected recall, but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find this gradient by back-propagation. That is, we treat the entire parser (approximations and all) as a differentiable circuit, as others have done for loopy CRFs (Domke, 2010; Stoyanov et al., 2011; Domke, 2011; Stoyanov and Eisner, 2012). The resulting parser obtains higher accuracy with fewer iterations of belief propagation than one trained by conditional log-likelihood.

Keywords: cost-aware learning, discriminative training, automatic differentiation, deep learning, variational inference, dynamic programming, dependency parsing, parsing approximations, deterministic annealing

**Modeling word forms using latent underlying morphs and phonology**.

Ryan Cotterell, Nanyun Peng, and Jason Eisner (2015).

*TACL*. [ paper | official link | slides | video | data | bib ]

The observed pronunciations or spellings of words are often explained as arising from the “underlying forms” of their morphemes. These forms are latent strings that linguists try to reconstruct by hand. We propose to reconstruct them automatically at scale, enabling generalization to new words. Given some surface word types of a concatenative language along with the abstract morpheme sequences that they express, we show how to recover consistent underlying forms for these morphemes, together with the (stochastic) phonology that maps each concatenation of underlying forms to a surface form. Our technique involves loopy belief propagation in a natural directed graphical model whose variables are unknown strings and whose conditional distributions are encoded as finite-state machines with trainable weights. We define training and evaluation paradigms for the task of surface word prediction, and report results on subsets of 7 languages.

Note: The PFSTs of Cotterell et al. (2014) appear as conditional distributions within this Bayesian network. The followup papers Cotterell and Eisner (2015) and Peng et al. (2015) develop improved inference methods and test them on this Bayesian network. See also this combined talk and this shorter one for linguists. Relatedly, Cotterell et al. (2016) use a graphical model with the same topology to reason about the embeddings of words and morphemes rather than their string form.

Keywords: finite-state methods, phonology, morphology, unsupervised learning, recorded talks, selected papers

**Learning to search in branch-and-bound algorithms**.

He He, Hal Daumé III, and Jason Eisner (2014).

In *NeurIPS*. [ paper | poster | scholar | bib ]

Branch-and-bound is a widely used method in combinatorial optimization, including mixed integer programming, structured prediction and MAP inference. While most work has been focused on developing problem-specific techniques, little is known about how to systematically design the node searching strategy on a branch-and-bound tree. We address the key challenge of learning an *adaptive* node searching order for any class of problem solvable by branch-and-bound. Our strategies are learned by imitation learning. We apply our algorithm to linear programming based branch-and-bound for solving mixed integer programs (MIP). We compare our method with one of the fastest open-source solvers, SCIP; and a very efficient commercial solver, Gurobi. We demonstrate that our approach achieves better solutions faster on four MIP libraries.
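To make the role of the node searching order concrete, here is a minimal branch-and-bound sketch for the 0/1 knapsack problem in which the order is a pluggable `priority` function. Everything in it is an assumption for illustration (the knapsack setting, the function names, and the two hand-written priorities); it is not the paper's method, which learns the ordering by imitation learning and targets mixed integer programs.

```python
import heapq

def fractional_bound(items, capacity, level, value, weight):
    """Upper bound from the LP relaxation: fill the remaining capacity greedily,
    allowing a fraction of the first item that does not fit.
    Assumes items are sorted by value/weight ratio, highest first."""
    bound, room = value, capacity - weight
    for v, w in items[level:]:
        if w <= room:
            bound += v
            room -= w
        else:
            bound += v * room / w
            break
    return bound

def branch_and_bound(items, capacity, priority):
    """Return (best value, number of nodes expanded).
    `priority(level, value, bound)` decides which open node is explored next
    (smaller is explored earlier). Item weights must be positive."""
    items = sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)
    best_value, expanded, counter = 0, 0, 0
    root_bound = fractional_bound(items, capacity, 0, 0, 0)
    frontier = [(priority(0, 0, root_bound), counter, 0, 0, 0, root_bound)]
    while frontier:
        _, _, level, value, weight, bound = heapq.heappop(frontier)
        if bound <= best_value or level == len(items):
            continue  # prune: this subtree cannot beat the incumbent
        expanded += 1
        v, w = items[level]
        for take in (True, False):
            new_value = value + v if take else value
            new_weight = weight + w if take else weight
            if new_weight > capacity:
                continue  # infeasible branch
            best_value = max(best_value, new_value)
            child_bound = fractional_bound(items, capacity, level + 1, new_value, new_weight)
            if child_bound > best_value:
                counter += 1  # unique tie-breaker so heap entries never compare further
                heapq.heappush(frontier, (priority(level + 1, new_value, child_bound),
                                          counter, level + 1, new_value, new_weight, child_bound))
    return best_value, expanded

# Two hand-written node orders; a learned policy would replace these.
def best_first(level, value, bound):
    return -bound   # explore the most promising bound first

def depth_first(level, value, bound):
    return -level   # explore the deepest open node first

ITEMS = [(60, 10), (100, 20), (120, 30)]  # (value, weight) pairs
BEST_FIRST_RESULT = branch_and_bound(ITEMS, 50, best_first)
DEPTH_FIRST_RESULT = branch_and_bound(ITEMS, 50, depth_first)
```

Any priority yields the same optimal value because pruning relies only on the bound. What changes is how many nodes are expanded before the search finishes, and that is the quantity an adaptive ordering tries to reduce.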

Keywords:reinforcement learning, systematic search, global optimization

**Deriving multi-headed projective dependency parses from link grammar
parses**.

Juneki Hong and Jason Eisner (2014).

In *TLT*. [ paper | supplement | slides | poster | code | scholar | bib ]

Under multi-headed dependency grammar, a parse is a connected DAG rather than a tree. Such formalisms can be more syntactically and semantically expressive. However, it is hard to train, test, or improve multi-headed parsers because few multi-headed corpora exist, particularly for the projective case. To help fill this gap, we observe that link grammar already produces *undirected* projective graphs. We use Integer Linear Programming to assign consistent directions to the labeled links in a corpus of several thousand parses produced by the Link Grammar Parser, which has broad-coverage hand-written grammars of English as well as Russian and other languages. We find that such directions can indeed be consistently assigned in a way that yields valid multi-headed dependency parses. The resulting parses in English appear reasonably linguistically plausible, though differing in style from CoNLL-style parses of the same sentences; we discuss the differences.

Keywords:dependency parsing, corpora, annotation

**Robust entity clustering via phylogenetic inference**.

Nicholas Andrews, Jason Eisner, and Mark Dredze (2014).

In *ACL*. [ paper | slides | video | code | scholar | bib ]

Entity clustering must determine when two named-entity mentions refer to the same entity. Typical approaches use a pipeline architecture that clusters the mentions using fixed or learned measures of name and context similarity. In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Clustering the mentions into entities depends on recovering this copying tree jointly with estimating models of the mutation process and parent selection process. We present a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets.

Keywords:names, transformation models, finite-state methods, coreference resolution, recorded talks, selected papers

**Stochastic contextual edit distance and probabilistic FSTs**.

Ryan Cotterell, Nanyun Peng, and Jason Eisner (2014).

In *ACL*. [ paper | poster | scholar | bib ]

String similarity is most often measured by weighted or unweighted edit distance d(x,y). Ristad and Yianilos (1998) defined stochastic edit distance—a probability distribution p(y|x) whose parameters can be trained from data. We generalize this so that the probability of choosing each edit operation can depend on contextual features. We show how to construct and train a probabilistic finite-state transducer that computes our stochastic contextual edit distance. To illustrate the improvement from conditioning on context, we model typos found in social media text.
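As a rough illustration of what a stochastic edit distance computes, the sketch below evaluates p(y|x) under a simple context-independent edit model by dynamic programming over alignments. This is the memoryless special case, not the contextual PFST of the paper, and the parameter values and the name `edit_prob` are assumptions made for the example.

```python
def edit_prob(x, y, alphabet, p_ins=0.1, p_del=0.1, p_sub=0.1):
    """p(y | x) under a memoryless stochastic edit model.

    While input remains, the model inserts a character (total mass p_ins),
    deletes the next input character (p_del), substitutes it (p_sub), or
    copies it (the remaining mass). At the end of the input it either
    inserts or stops. Assumes the alphabet has at least two symbols.
    """
    k = len(alphabet)
    p_copy = 1.0 - p_ins - p_del - p_sub
    p_stop = 1.0 - p_ins
    n, m = len(x), len(y)
    # alpha[i][j] = total probability of all edit sequences that consume
    # x[:i] and emit y[:j].
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if j > 0:  # insert y[j-1]
                total += alpha[i][j - 1] * p_ins / k
            if i > 0:  # delete x[i-1]
                total += alpha[i - 1][j] * p_del
            if i > 0 and j > 0:  # copy or substitute
                edit = p_copy if x[i - 1] == y[j - 1] else p_sub / (k - 1)
                total += alpha[i - 1][j - 1] * edit
            alpha[i][j] = total
    return alpha[n][m] * p_stop
```

Because the choices at each step sum to one, the values `edit_prob(x, y, alphabet)` sum to one over all strings `y`. A contextual model would let the four probabilities depend on neighboring characters instead of being constants.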

Note: The PFSTs developed in this paper are used within the Bayesian network of Cotterell et al. (2015).

Keywords:finite-state methods

**Dynamic feature selection for dependency parsing**.

He He, Hal Daumé III, and Jason Eisner (2013).

In *EMNLP*. [ paper | slides | video | scholar | bib ]

Feature computation and exhaustive search have significantly restricted the speed of graph-based dependency parsing. We propose a faster framework of *dynamic feature selection*, where features are added sequentially as needed, edges are pruned early, and decisions are made online for each sentence. We model this as a sequential decision-making problem and solve it by imitation learning techniques. Our dynamic parser can achieve accuracies comparable or even superior to parsers using a full set of features, while computing fewer than 30% of the feature templates.

Keywords:dependency parsing, parsing approximations, reinforcement learning, cost-aware learning, recorded talks

**A virtual manipulative for learning log-linear models**.

Francis Ferraro and Jason Eisner (2013).

In *ACL Workshop on Teaching NLP/CL*. [ paper | website | slides | code | scholar | bib ]

We present an open-source virtual manipulative for conditional log-linear models. This web-based interactive visualization lets the user tune the probabilities of various shapes—which grow and shrink accordingly—by dragging sliders that correspond to feature weights. The visualization displays a regularized training objective; it supports gradient ascent by optionally displaying gradients on the sliders and providing “Step” and “Solve” buttons. The user can sample parameters and datasets of different sizes and compare their own parameters to the truth. Our website, http://cs.jhu.edu/~jason/tutorials/loglin/, guides the user through a series of interactive lessons and provides auxiliary readings, explanations, practice problems and resources.
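The computation behind such sliders can be sketched in a few lines. The example below is a simplified, unconditional log-linear model over a handful of outcomes, with an L2-regularized log-likelihood and its gradient; it is not the code of the actual website, and the function names and the L2 form of the regularizer are assumptions.

```python
import math

def probs(theta, feats):
    """p(i) proportional to exp(theta . feats[i]) over a finite set of outcomes."""
    scores = [math.exp(sum(t * f for t, f in zip(theta, fv))) for fv in feats]
    z = sum(scores)
    return [s / z for s in scores]

def gradient(theta, feats, counts, l2=0.0):
    """Gradient of sum_i counts[i] * log p(i) - l2 * ||theta||^2.

    Each component is (observed feature count) - (expected feature count)
    minus the regularization term.
    """
    n = sum(counts)
    p = probs(theta, feats)
    grad = []
    for k in range(len(theta)):
        observed = sum(c * fv[k] for c, fv in zip(counts, feats))
        expected = n * sum(pi * fv[k] for pi, fv in zip(p, feats))
        grad.append(observed - expected - 2.0 * l2 * theta[k])
    return grad

def ascent_step(theta, feats, counts, lr=0.01, l2=0.0):
    """One gradient ascent step, analogous to pressing a 'Step' button."""
    g = gradient(theta, feats, counts, l2)
    return [t + lr * gk for t, gk in zip(theta, g)]
```

For instance, four shapes with hypothetical features `(is_circle, is_solid)` could be encoded as `feats = [(1, 1), (1, 0), (0, 1), (0, 0)]`; repeated calls to `ascent_step` then move the weights until the expected feature counts match the observed ones.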

Keywords:teaching, machine learning

**Introducing computational concepts in a linguistics olympiad**.

Patrick Littell, Lori Levin, Jason Eisner, and Dragomir Radev
(2013).

In *ACL Workshop on Teaching NLP/CL*. [ paper | scholar | bib ]

Linguistics Olympiads, now offered in more than 20 countries, provide secondary-school students a compelling introduction to an unfamiliar field. The North American Computational Linguistics Olympiad (NACLO) includes computational puzzles in addition to purely linguistic ones. This paper explores the computational subject matter we want to convey via NACLO, as well as some of the challenges that arise when adapting problems in computational linguistics to an audience that may have no background in computer science, linguistics, or advanced mathematics. We present a small library of reusable design patterns that have proven useful when composing puzzles appropriate for secondary-school students.

Keywords:teaching

**Nonconvex global optimization for latent-variable models**.

Matthew Gormley and Jason Eisner (2013).

In *ACL*. [ paper | slides | PDF slides | scholar | bib ]

Many models in NLP involve latent variables, such as unknown parses, tags, or alignments. Finding the optimal model parameters is then usually a difficult nonconvex optimization problem. The usual practice is to settle for *local* optimization methods such as EM or gradient ascent. We explore how one might instead search for a *global* optimum in parameter space, using branch-and-bound. Our method would eventually find the global maximum (up to a user-specified ε) if run for long enough, but at any point can return a suboptimal solution together with an upper bound on the global maximum. As an illustrative case, we study a generative model for dependency parsing. We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. We show how to formulate this as a mixed integer quadratic programming problem with nonlinear constraints. We use the Reformulation Linearization Technique to produce convex relaxations during branch-and-bound. Although these techniques do not yet provide a practical solution to our instance of this NP-hard problem, they sometimes find better solutions than Viterbi EM with random restarts, in the same time.

Keywords:parameter search, grammar induction, relaxation, systematic search, global optimization

**Prioritized asynchronous belief propagation**.

Jiarong Jiang, Taesun Moon, Hal Daumé III, and Jason Eisner
(2013).

In *ICML Workshop on Inferning: Interactions between Inference
and Learning*. [ paper | video | scholar | bib ]

Message scheduling is shown to be very effective in belief propagation (BP) algorithms. However, most existing scheduling algorithms use fixed heuristics regardless of the structure of the graphs or properties of the distribution. On the other hand, designing different scheduling heuristics for all graph structures is not feasible. In this paper, we propose a reinforcement learning based message scheduling framework (RLBP) to learn the heuristics automatically which generalizes to any graph structures and distributions. In the experiments, we show that the learned problem-specific heuristics largely outperform other baselines in speed.

Keywords:dynamic prioritization, reinforcement learning, graphical model algorithms, recorded talks

**Deep learning of recursive structure: Grammar induction**.

Jason Eisner (2013).

Keynote talk at International Conference on Learning Representations. [ slides | PDF slides | video | photo | bib ]

Grammar induction aims to induce the parameters of a probabilistic context-free grammar (PCFG). Crucially, the same parameters should be used not only at different positions in the sentence (as in convolutional networks for vision) but also at different levels in the tree. We consider how several ideas from deep learning can help construct a PCFG from the bottom up while resisting bad local optima. Our proposed architecture learns to model a sentence as a sequence of phrases, each generated by a PCFG. The root nonterminal of each phrase is predicted from the context of that phrase. Formally this is a kind of autoencoder that predicts the sentence from itself, but structured as a semi-Markov CRF whose emission distribution is a PCFG, and which predicts each phrase only from context. During learning, we “anneal” the search bias from generating a long sequence of 1-word phrases (so the method finds word embeddings based on context) to a single phrase that covers the whole sentence (at which point we have an ordinary PCFG with no dependence on context).

Each run of the learner derives features of the context from the grammars found on previous, more naive runs. This stacking of multiple runs is what makes the method deep. We also mention extensions that involve supervised fine-tuning or richer, vector-valued representations of words and nonterminals.

Keywords:grammar induction, word embeddings, deep learning, invited talks, recorded talks

**Learned prioritization for trading off accuracy and speed**.

Jiarong Jiang, Adam Teichert, Hal Daumé III, and Jason Eisner
(2012).

In *NeurIPS*. [ paper | poster | scholar | bib ]

Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs (for particular problems and datasets). We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing (Kay, 1986). Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is simply too large to explore naively. An attempt to counteract this by applying imitation learning algorithms also fails: the “teacher” is far too good to successfully imitate with our inexpensive features. Moreover, it is not specifically tuned for the *known* reward function. We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with only a few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.

Keywords:dynamic prioritization, parsing approximations, reinforcement learning

**Imitation learning by coaching**.

He He, Hal Daumé III, and Jason Eisner (2012).

In *NeurIPS*. [ paper | poster | scholar | bib ]

Imitation Learning has been shown to be successful in solving many challenging real-world problems. Some recent approaches give strong performance guarantees by training the policy iteratively. However, it is important to note that these guarantees depend on how well the policy we found can imitate the oracle on the training data. When there is a substantial difference between the oracle's ability and the learner's policy space, we may fail to find a policy that has low error on the training set. In such cases, we propose to use a *coach* that demonstrates easy-to-learn actions for the learner and gradually approaches the oracle. By a reduction of learning by demonstration to online learning, we prove that coaching can yield a lower regret bound than using the oracle. We apply our algorithm to a novel cost-sensitive dynamic feature selection, a hard decision problem that considers a user-specified accuracy-cost trade-off. Experimental results on UCI datasets show that our method outperforms state-of-the-art imitation learning methods in dynamic feature selection and two static feature selection methods.

Keywords:dynamic prioritization

**Easy-first coreference resolution**.

Veselin Stoyanov and Jason Eisner (2012).

In *COLING*. [ paper | slides | scholar | bib ]

We describe an approach to coreference resolution that relies on the intuition that easy decisions should be made early, while harder decisions should be left for later when more information is available. We are inspired by the recent success of the rule-based system of Raghunathan et al. (2010), which relies on the same intuition. Our system, however, automatically learns from training data what constitutes an easy decision. Thus, we can utilize more features, learn more precise weights, and adapt to any dataset for which training data is available. Experiments show that our system outperforms recent state-of-the-art coreference systems including Raghunathan et al.’s system as well as a competitive baseline that uses a pairwise classifier.

Keywords:coreference resolution, discriminative training, clustering, greedy algorithms

**A flexible solver for finite arithmetic circuits**.

Nathaniel Wesley Filardo and Jason Eisner (2012).

In *ICLP*. [ paper | slides with movies | scholar | bib ]

Arithmetic circuits arise in the context of weighted logic programming languages, such as Datalog with aggregation, or Dyna. A weighted logic program defines a generalized arithmetic circuit—the weighted version of a proof forest, with nodes having arbitrary rather than boolean values. In this paper, we focus on finite circuits. We present a flexible algorithm for efficiently *querying* node values as they change under *updates* to the circuit's inputs. Unlike traditional algorithms, ours is agnostic about which nodes are tabled (materialized), and can vary smoothly between the traditional strategies of forward and backward chaining. Our algorithm is designed to admit future generalizations, including cyclic and infinite circuits and propagation of delta updates.
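The query-and-update setting can be illustrated with a deliberately naive sketch: a finite acyclic circuit whose gate values are computed on demand (backward chaining), cached (tabled), and invalidated when an input changes. This is not the paper's algorithm, which is far more flexible about what gets tabled and how changes propagate; the `Circuit` class and its method names are hypothetical.

```python
import math
from collections import defaultdict

class Circuit:
    """A finite acyclic arithmetic circuit with lazy queries and cached values."""

    def __init__(self):
        self.inputs = {}                  # input name -> current value
        self.gates = {}                   # gate name -> (function, child names)
        self.parents = defaultdict(set)   # node name -> gates that read it
        self.memo = {}                    # cached (tabled) gate values

    def set_input(self, name, value):
        """Update an input and discard every cached value that depends on it."""
        self.inputs[name] = value
        self._invalidate(name)

    def add_gate(self, name, fn, children):
        """Define a gate whose value is fn(list of child values)."""
        self.gates[name] = (fn, list(children))
        for child in children:
            self.parents[child].add(name)
        self._invalidate(name)

    def _invalidate(self, name):
        # Walk upward through all ancestors; terminates because the circuit
        # is assumed to be finite and acyclic.
        stack = [name]
        while stack:
            node = stack.pop()
            self.memo.pop(node, None)
            stack.extend(self.parents[node])

    def query(self, name):
        """Return a node's value, computing and caching it only if needed."""
        if name in self.inputs:
            return self.inputs[name]
        if name not in self.memo:
            fn, children = self.gates[name]
            self.memo[name] = fn([self.query(c) for c in children])
        return self.memo[name]
```

Here every gate is tabled and updates only invalidate. Eagerly recomputing values on update would correspond to forward chaining, and never caching to pure backward chaining.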

Keywords:Dyna, circuits

**Name phylogeny: A generative model of string variation**.

Nicholas Andrews, Jason Eisner, and Mark Dredze (2012).

In *EMNLP-CoNLL*. [ paper | slides | different slides | poster | scholar | bib ]

Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated *ab initio*, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.

Note: The experiments in this paper are incomplete—we regrettably had to omit some other experimental results because of a bug in the code. You can find the proper results on the slides (and on the poster, from the October 2012 Mid-Atlantic Student Colloquium on Speech, Language and Learning).

Keywords:names, transformation models, finite-state methods

**Learning approximate inference policies for fast prediction**.

Jason Eisner (2012).

Keynote talk at ICML Workshop on Inferning: Interactions Between
Search and Learning. [ slides | video | bib ]

In many domains, our best models are computationally intractable. This problem will only get worse as we manage to build more richly detailed models of specific domains. Fortunately, the practical goal of artificial or natural intelligence is not to do perfect detailed inference, but rather to answer specific questions by reasoning from observed data. Thus, we should seek policies for fast approximate inference that will actually achieve low expected loss on our target task. The target task is a distribution not only over test-time data but also over which variables will be observed and queried. The loss function may explicitly penalize for runtime (or data acquisition). This story leaves open an engineering question: What space of policies should we search? I will review a range of options and point out past work for each. Among others, I will show our own recent successes using message-passing approximate inference policies for graphical models. The *form* of these policies is determined by the structure of our intractable and surely mismatched domain model, but we tune the *parameters* to minimize loss (Stoyanov & Eisner, 2012). One may search a space of sophisticated policies or a space of crude hacks, but crucially, one should tune the policy parameters to optimize expected error and runtime. This expectation can be taken over training data (*empirical risk minimization*), or over samples from a posterior belief about the target task (which I will call *imputed risk minimization*). The latter case requires some sort of prior model, but this is necessary when the empirical risk estimate suffers from sparse, non-independent, or out-of-domain training data.

Keywords:cost-aware learning, graphical model algorithms, invited talks, recorded talks

**Learned prioritization for trading off accuracy and speed**.

Jiarong Jiang, Adam Teichert, Hal Daumé III, and Jason Eisner
(2012).

In *ICML Workshop on Inferning: Interactions between Inference
and Learning*. [ paper | slides | poster | scholar | bib ]

Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs for particular problems or datasets. We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing (Kay, 1986). Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is too large to explore naively. We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.

Note: Consider instead the NeurIPS 2012 paper that is the "final version" of this workshop paper.

Keywords:dynamic prioritization, parsing approximations, reinforcement learning

**Cost-sensitive dynamic feature selection**.

He He, Hal Daumé III, and Jason Eisner (2012).

In *ICML Workshop on Inferning: Interactions between Inference
and Learning*. [ paper | slides | poster | scholar | bib ]

We present an instance-specific test-time dynamic feature selection algorithm. Our algorithm sequentially chooses features given previously selected features and their values. It stops the selection process to make a prediction according to a user-specified accuracy-cost trade-off. We cast the sequential decision-making problem as a Markov Decision Process and apply imitation learning techniques. We address the problem of learning and inference jointly in a simple multiclass classification setting. Experimental results on UCI datasets show that our approach achieves the same or higher accuracy using only a small fraction of features than static feature selection methods.

Keywords:dynamic prioritization, reinforcement learning

**Fast and accurate prediction via evidence-specific MRF structure**.

Veselin Stoyanov and Jason Eisner (2012).

In *ICML Workshop on Inferning: Interactions between Inference
and Learning*. [ paper | slides | poster | PDF poster | scholar | bib ]

We are interested in speeding up approximate inference in Markov Random Fields (MRFs). We present a new method that uses gates—binary random variables that determine which factors of the MRF to use. Which gates are open depends on the observed evidence; when many gates are closed, the MRF takes on a sparser and faster structure that omits “unnecessary” factors. We train parameters that control the gates, jointly with the ordinary MRF parameters, in order to locally minimize an objective that combines loss and runtime.

Keywords:cost-aware learning, graphical model algorithms, automatic differentiation

**Shared components topic models**.

Matthew R. Gormley, Mark Dredze, Benjamin Van Durme, and Jason
Eisner (2012).

In *NAACL-HLT*. [ paper | slides | scholar | bib ]

With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters.
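The phrase "normalized product of component distributions" can be made concrete with a tiny sketch. The function below forms one topic from a chosen subset of components over a shared vocabulary. It only illustrates the product-and-renormalize step, not the learning of components or of the subset structure, and the name `topic_from_components` is hypothetical.

```python
import math

def topic_from_components(components, subset):
    """Build a topic as the normalized elementwise product of components.

    components: list of distributions over the same vocabulary (lists of floats).
    subset: indices of the components that are switched on for this topic.
    Assumes at least one word has nonzero probability under every chosen component.
    """
    vocab_size = len(components[0])
    unnormalized = [
        math.prod(components[c][w] for c in subset) for w in range(vocab_size)
    ]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]
```

With `C` components, up to `2**C - 1` distinct topics can be expressed by choosing different non-empty subsets, which is the sense in which the representation is compact: the parameters are the `C` component distributions plus the subset choices, rather than one full multinomial per topic.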

Keywords:topic models, topics

**Implicitly intersecting weighted automata using dual decomposition**.

Michael Paul and Jason Eisner (2012).

In *NAACL-HLT*. [ paper | poster | scholar | bib ]

We propose an algorithm to find the best path through an intersection of arbitrarily many weighted automata, without actually performing the intersection. The algorithm is based on dual decomposition: the automata attempt to agree on a string by communicating about features of the string. We demonstrate the algorithm on the Steiner consensus string problem, both on synthetic data and on consensus decoding for speech recognition. This involves implicitly intersecting up to 100 automata.

Keywords:finite-state methods, relaxation, global optimization, selected papers

**Unsupervised learning on an approximate corpus**.

Jason Smith and Jason Eisner (2012).

In *NAACL-HLT*. [ paper | slides | PDF slides | scholar | bib ]

Unsupervised learning techniques can take advantage of large amounts of unannotated text, but the largest text corpus (the Web) is not easy to use in its full form. Instead, we have statistics about this corpus in the form of *n*-gram counts (Brants and Franz, 2006). While *n*-gram counts do not directly provide sentences, a distribution over sentences can be estimated from them in the same way that *n*-gram language models are estimated. We treat this distribution over sentences as an approximate corpus and show how unsupervised learning can be performed on such a corpus using variational inference. We compare hidden Markov model (HMM) training on exact and approximate corpora of various sizes, measuring speed and accuracy on unsupervised part-of-speech tagging.

Keywords:finite-state methods, variational training, dynamic programming, tagging

**Minimum-risk training of approximate CRF-based NLP systems**.

Veselin Stoyanov and Jason Eisner (2012).

In *NAACL-HLT*. [ paper | slides | scholar | bib ]

Conditional Random Fields (CRFs) are a popular formalism for structured prediction in NLP. It is well known how to train CRFs with certain topologies that admit exact inference, such as linear-chain CRFs. Some NLP phenomena, however, suggest CRFs with more complex topologies. Should such models be used, considering that they make exact inference intractable? Stoyanov et al. (2011) recently argued for training parameters to minimize the task-specific loss of whatever approximate inference and decoding methods will be used at test time. We apply their method to three NLP problems, showing that (i) using more complex CRFs leads to improved performance, and that (ii) minimum-risk training learns more accurate models.

Keywords:information extraction, tagging, text categorization, cost-aware learning, deep learning, discriminative training, deterministic annealing, automatic differentiation

**Learning multivariate distributions by competitive assembly of
marginals**.

Francisco Sánchez-Vega, Jason Eisner, Laurent Younes, and Donald
Geman (2012).

*TPAMI*. [ official link | paper+supplement | video | scholar | bib ]

We present a new framework for learning high-dimensional multivariate probability distributions from estimated marginals. The approach is motivated by compositional models and Bayesian networks, and designed to adapt to small sample sizes. We start with a large, overlapping set of elementary statistical building blocks, or “primitives,” which are low-dimensional marginal distributions learned from data. Each variable may appear in many primitives. Subsets of primitives are combined in a lego-like fashion to construct a probabilistic graphical model; only a small fraction of the primitives will participate in any valid construction. Since primitives can be precomputed, parameter estimation and structure search are separated. Model complexity is controlled by strong biases; we adapt the primitives to the amount of training data and impose rules which restrict the merging of them into allowable compositions. The likelihood of the data decomposes into a sum of local gains, one for each primitive in the final structure. We focus on a specific subclass of networks which are binary forests. Structure optimization corresponds to an integer linear program and the maximizing composition can be computed for reasonably large numbers of variables. Performance is evaluated using both synthetic data and real datasets from natural language processing and computational biology.

Keywords:structure learning, graphical model algorithms, recorded talks

**Learning speed-accuracy tradeoffs in nondeterministic inference
algorithms**.

Jason Eisner and Hal Daumé III (2011).

In *COST: NeurIPS Workshop on Computational Trade-offs in
Statistical Learning*. [ paper | scholar | bib ]

Could we explicitly train test-time inference heuristics to trade off accuracy and efficiency? We focus our discussion on agenda-based natural language parsing under a weighted context-free grammar. We frame the problem as reinforcement learning, discuss its special properties, and propose new strategies.

Keywords:dynamic prioritization, reinforcement learning, parsing approximations

**Learning cost-aware, loss-aware approximate inference policies for
probabilistic graphical models**.

Veselin Stoyanov and Jason Eisner (2011).

In *COST: NeurIPS Workshop on Computational Trade-offs in
Statistical Learning*. [ paper | scholar | bib ]

Probabilistic graphical models are typically trained to maximize the likelihood of the training data and evaluated on some measure of accuracy on the test data. However, we are also interested in learning to produce predictions quickly. For example, one can speed up loopy belief propagation by choosing sparser models and by stopping at some point before convergence. We manage the speed-accuracy tradeoff by explicitly optimizing for a linear combination of speed and accuracy. Although this objective is not differentiable, we can compute the gradient of a smoothed version.

Note: Stoyanov and Eisner (2012) successfully applied these ideas to several NLP problems. We had several other followup papers as well.

Keywords:graphical model algorithms, cost-aware learning, deterministic annealing, automatic differentiation

**Transformation process priors**.

Nicholas Andrews and Jason Eisner (2011).

In *NeurIPS Workshop on Bayesian Nonparametrics: Hope or Hype?* [ paper | scholar | bib ]

Because of the neutrality property, a Dirichlet (process) prior on a discrete distribution cannot capture correlations among the probabilities of “similar” events. We propose obtaining the discrete distribution instead from a random walk model or transformation model, in which each observed event has evolved via a latent sequence of transformations. We are exploring transformation models in which the conditional distributions have infinite support and the prior over them is nonparametric.

Note: Extended abstract.

Keywords:transformation models, nonparametric models

**Shared components topic models with application to selectional
preference**.

Matthew R. Gormley, Mark Dredze, Benjamin Van Durme, and Jason
Eisner (2011).

In *NeurIPS Workshop on Learning Semantics*. [ paper | scholar | bib ]

Latent Dirichlet Allocation (LDA) has been used to learn selectional preferences as soft *disjunctions* over flat semantic classes. Our model, the SCTM, also learns the structure of each class as a soft *conjunction* of high-level semantic features.

Note: Extended abstract.

Keywords:selectional preferences, topic models

**Discovering morphological paradigms from plain text using a Dirichlet
process mixture model**.

Markus Dreyer and Jason Eisner (2011).

In *EMNLP*. [ paper+supplement | slides | PDF slides | data | scholar | bib ]

We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50-100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.

Note: Additional details are given in the dissertation of Dreyer (2011), and on the associated slides.

Keywords:nonparametric models, morphology, selected papers

**Minimum imputed risk: Unsupervised discriminative training for machine
translation**.

Zhifei Li, Jason Eisner, Ziyuan Wang, Sanjeev Khudanpur, and Brian
Roark (2011).

In *EMNLP*. [ paper | scholar | bib ]

Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target-language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an *imputed empirical risk*. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.

Note: Additional details are given in the dissertation of Li (2010).

Keywords:MT, discriminative training

**A Non-Parametric Model for the Discovery of Inflectional Paradigms
from Plain Text Using Graphical Models over Strings**.

Markus Dreyer (2011).

PhD thesis, Johns Hopkins University. [ dissertation | slides | PDF slides | code | bib ]

The field of statistical natural language processing has been turning toward morphologically rich languages. These languages have vocabularies that are often orders of magnitude larger than that of English, since words may be inflected in various different ways. This leads to problems with data sparseness and calls for models that can deal with this abundance of related words: models that can learn, analyze, reduce and generate morphological inflections. But surprisingly, statistical approaches to morphology are still rare, which stands in contrast to the many recent advances of sophisticated models in parsing, grammar induction, translation and many other areas of natural language processing.

This thesis presents a novel, unified statistical approach to inflectional morphology, an approach that can decode and encode the inflectional system of a language. At the center of this approach stands the notion of inflectional paradigms. These paradigms cluster the large vocabulary of a language into structured chunks; inflections of the same word, like *break, broke, breaks, breaking*, ..., all belong in the same paradigm. And moreover, each of these inflections has an exact place within a paradigm, since each paradigm has designated slots for each possible inflection; for verbs, there is a slot for the *first person singular indicative present*, one for the *third person plural subjunctive past* and slots for all other possible forms. The main goal of this thesis is to build probability models over inflectional paradigms, and therefore to sort the large vocabulary of a morphologically rich language into structured clusters. These models can be learned with minimal supervision for any language that has inflectional morphology. As training data, some sample paradigms and a raw, unannotated text corpus can be used.

The models over morphological paradigms are developed in three main chapters that start with smaller components and build up to larger ones.

The first of these chapters (Chapter 2) presents novel probability models over strings and string pairs. These are applicable to lemmatization or to relate a past tense form to its associated present tense form, or for similar morphological tasks. It turns out they are general enough to tackle the popular task of transliteration very well, as well as other string-to-string tasks.

The second (Chapter 3) introduces the notion of a probability model over multiple strings, which is a novel variant of Markov Random Fields. These are used to relate the many inflections in an inflectional paradigm to one another, and they use the probability models from Chapter 2 as components. A novel version of belief propagation is presented, which propagates distributions over strings through a network of connected finite-state transducers, to perform inference in morphological paradigms (or other string fields).

Finally (Chapter 4), a non-parametric joint probability model over an unannotated text corpus and the morphological paradigms from Chapter 3 is presented. This model is based on a generative story for inflectional morphology that naturally incorporates common linguistic notions, such as lexemes, paradigms and inflections. Sampling algorithms are presented that perform inference over large text corpora and their implicit, hidden morphological paradigms. We show that they are able to discover the morphological paradigms that are implicit in the corpora. The model is based on finite-state operations and seamlessly handles concatenative and nonconcatenative morphology.

Note: Dr. Dreyer's dissertation advisor was Jason Eisner. Much of the work in this dissertation was originally reported in Dreyer et al. (2008), Dreyer and Eisner (2009), and Dreyer and Eisner (2011).

Keywords: theses, nonparametric models, morphology, finite-state methods, variational inference, graphical model algorithms, dynamic programming

**Empirical risk minimization of graphical model parameters given
approximate inference, decoding, and model structure**.

Veselin Stoyanov, Alexander Ropson, and Jason Eisner (2011).

In *AISTATS*. [ paper+supplement | poster | scholar | bib ]

Graphical models are often used “inappropriately,” with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic meta-descent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.

Note: Stoyanov and Eisner (2012a) subsequently applied this training method to three structured prediction problems in NLP, getting striking accuracy improvements. Gormley et al. (2015) applied it to dependency parsing by belief propagation. Stoyanov and Eisner (2011, 2012b) gave preliminary extensions to optimize *speed* jointly with accuracy.

Keywords: deep learning, discriminative training, graphical model algorithms, automatic differentiation, deterministic annealing, selected papers

**Dyna: Extending Datalog for modern AI**.

Jason Eisner and Nathaniel W. Filardo (2011).

In *Datalog Reloaded*. [ full paper | shortened paper | official link | code examples | website | scholar | bib ]

Modern statistical AI systems are quite large and complex; this interferes with research, development, and education. We point out that most of the computation involves database-like queries and updates on complex views of the data. Specifically, recursive *queries* look up and aggregate relevant or potentially relevant values. If the results of these queries are memoized for reuse, the memos may need to be *updated* through change propagation. We propose a declarative language, which generalizes Datalog, to support this work in a generic way. Through examples, we show that a broad spectrum of AI algorithms can be concisely captured by writing down systems of equations in our notation. Many strategies could be used to actually solve those systems. Our examples motivate certain extensions to Datalog, which are connected to functional and object-oriented programming paradigms.

Note: The followup paper Filardo and Eisner (2012) explains execution mechanisms for handling queries and updates.

Keywords: Dyna, selected papers

**Confusion network decoding for MT system combination**.

Antti-Veikko Rosti, Eugene Matusov, Jason Smith, Necip Ayan, Jason
Eisner, Damianos Karakos, Sanjeev Khudanpur, Gregor Leusch, Zhifei Li, Spyros
Matsoukas, Hermann Ney, Richard Schwartz, B. Zhang, and J. Zheng (2011).

In *Handbook of Natural Language Processing and Machine
Translation*. [ paper | book | bib ]

Note: This chapter includes a speedup of Karakos et al. (2008) using an A^{*} heuristic. See p. 355.

Keywords: MT, dynamic programming, synchronous grammar algorithms, alignment

**Efficient Inference for Trees and Alignments: Modeling Monolingual
and Bilingual Syntax with Hard and Soft Constraints and Latent Variables**.

David A. Smith (2010).

PhD thesis, Johns Hopkins University. [ dissertation | bib ]

Much recent work in natural language processing treats linguistic analysis as an inference problem over graphs. This development opens up useful connections between machine learning, graph theory, and linguistics.

The first part of this dissertation formulates syntactic dependency parsing as a dynamic Markov random field with the novel ingredient of global constraints. Global constraints are enforced by calling combinatorial optimization algorithms as subroutines during message-passing inference in the graphical model, and these global constraints greatly improve on the accuracy of collections of local constraints. In particular, combinatorial subroutines enforce the constraint that the parser's output must form a tree. This is the first application that uses efficient computation of marginals for combinatorial problems to improve the speed and accuracy of belief propagation. If the dependency tree is projective, the tree constraint exploits the inside-outside algorithm; if non-projective, with discontiguous constituents, it exploits the directed matrix-tree theorem, here newly applied to NLP problems. Even with second-order features or latent variables, which would make exact parsing asymptotically slower or NP-hard, approximate inference with belief propagation is as efficient as a simple edge-factored parser times a constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features increases the runtime additively rather than multiplicatively.

The second part extends these models to capture correspondences among non-isomorphic structures. When bootstrapping a parser in a low-resource target language by exploiting a parser in a high-resource source language, models that score the alignment and the correspondence of divergent syntactic configurations in translational sentence pairs achieve higher accuracy in parsing the target language. These noisy (quasi-synchronous) mappings have further applications in adapting parsers across domains, in learning features of the syntax-semantics interface, and in question answering, paraphrasing, and information retrieval.

Note: Dr. Smith's dissertation advisor was Jason Eisner.

Keywords: theses, dependency parsing, variational inference, dynamic programming, non-local syntax, translation, domain adaptation

**Unsupervised discriminative language model training for machine
translation using simulated confusion sets**.

Zhifei Li, Ziyuan Wang, Sanjeev Khudanpur, and Jason Eisner (2010).

In *COLING*. [ paper | scholar | bib ]

An *unsupervised* discriminative training procedure is proposed for estimating a language model (LM) for machine translation (MT). An English-to-English synchronous context-free grammar is derived from a baseline MT system to capture translation alternatives: pairs of words, phrases or other sentence fragments that potentially compete to be the translation of the same source-language fragment. Using this grammar, a set of impostor sentences is then created for each English sentence to *simulate* confusions that would arise if the system were to process an (unavailable) input whose correct English translation is that sentence. An LM is then trained to discriminate between the original sentences and the impostors. The procedure is applied to the IWSLT Chinese-to-English translation task, and promising improvements on a state-of-the-art MT system are demonstrated.

Note: Additional details are given in the dissertation of Li (2010).

Keywords: MT, discriminative training, training objectives

**Favor short dependencies: Parsing with soft and hard constraints on
dependency length**.

Jason Eisner and Noah A. Smith (2010).

In *Trends in Parsing Technology: Dependency Parsing, Domain
Adaptation, and Deep Parsing*, chapter 8. [ paper | official link | slides | 3rd party slides | scholar | bib ]

In lexicalized phrase-structure or dependency parses, a word's modifiers tend to fall near it in the string. This fact can be exploited by parsers. We first show that a crude way to use dependency length as a parsing feature can substantially improve parsing speed and accuracy in English and Chinese, with more mixed results on German. We then show similar improvements by imposing *hard* bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. The approach with hard bounds, “vine grammar,” accepts only a regular language, even though it happily retains a context-free parameterization and defines meaningful parse trees. We show how to parse this language in O(n) time, using a novel chart parsing algorithm with a low grammar constant (rather than an impractically large finite-state recognizer with an exponential grammar constant). For a uniform hard bound of *k* on dependencies of all types, our algorithm's runtime is O(nk^{2}). We also extend our algorithm to parse weighted-FSA inputs such as lattices.

Note: This book chapter extends Eisner & Smith (2005) with considerable new material (e.g., lattice parsing).

Keywords: dependency parsing, parsing approximations, dynamic programming, vine grammar

**Discriminative Training and Variational Decoding in Machine
Translation Via Novel Algorithms for Weighted Hypergraphs**.

Zhifei Li (2010).

PhD thesis, Johns Hopkins University. [ dissertation | slides | PDF slides | PPT slides | bib ]

A hypergraph or “packed forest” is a compact data structure that uses structure-sharing to represent exponentially many trees in polynomial space. A probabilistic/weighted hypergraph also defines a probability (or other weight) for each tree, and can be used to represent the hypothesis space considered (for a given input) by a monolingual parser or a tree-based translation system (e.g., tree to string, string to tree, tree to tree, or string to string with latent tree structures).

Given a weighted/probabilistic hypergraph, we might ask three questions. What atomic operations can we perform on the weighted hypergraph? How do we set the weights in the hypergraph? Which particular translation (among the possible translations encoded in a hypergraph) should we present to an end user? These correspond to three fundamental problems: inference, training, and decoding, for which this dissertation will present novel techniques.

The atomic inference operations we may want to perform include finding one-best, k-best, or expectations over the hypergraph. To perform each operation, we may implement a dedicated dynamic programming algorithm. However, a more general framework to specify these algorithms is semiring-weighted logic programming. Within this framework, we first extend the expectation semiring, which is originally proposed for a finite state automaton, to a hypergraph. We then propose a novel second-order expectation semiring. These semirings can be used to compute a large number of expectations (e.g., entropy and its gradient) over the exponentially many trees presented in a hypergraph.

The weights used in a hypergraph are usually learnt by a discriminative training method. One common drawback of such methods is that they rely on the existence of high-quality supervised data (i.e., bilingual data), which may be expensive to obtain. We present two unsupervised discriminative training methods, minimum imputed-risk training and contrastive language model estimation, both of which can exploit monolingual English data to perform discriminative training. In minimum imputed-risk training, we first use a reverse translation model to impute the missing inputs, and then train a discriminative forward model by minimizing the expected loss of the forward translations of the missing inputs.

In contrast, the contrastive language model estimation does not use a reverse system. It first extracts a confusion grammar, then generates many alternative sentences (i.e., a contrastive set) for each English sentence using the confusion grammar, and finally trains a discriminative language model on the contrastive sets such that the model will prefer the original English sentences (over the sentences in the contrastive sets).

During decoding, we are interested in finding a translation that has a maximum posterior probability (i.e., MAP decoding). However, this is intractable due to spurious ambiguity, a situation where the probability of a translation string is split among many distinct derivations (e.g., trees or segmentations). Therefore, most systems use a simple Viterbi decoding that approximates the string probability with its most probable derivation's probability. Instead, we develop a variational approximation, which considers all the derivations but still allows tractable decoding. Our particular variational distributions are parameterized as n-gram models. We also analytically show that interpolating these n-gram models for different n is similar to lattice-based minimum-risk decoding for BLEU. Experiments show that our approach improves the state of the art.

All the above methods have been implemented in an open-source machine translation toolkit Joshua. In this dissertation, the methods have mainly been applied to a machine translation task, but we expect that they will also find applications in other areas of natural language processing (e.g., parsing and speech recognition).

Note: Dr. Li's dissertation advisor and co-advisor were Sanjeev Khudanpur and Jason Eisner. Some of the work in this dissertation also appeared in Li and Eisner (2009), Li et al. (2009), Li et al. (2010), and Li et al. (2011).

Keywords: theses, MT, discriminative training, training objectives, finite-state methods, semirings, dynamic programming, automatic differentiation, variational inference

**First- and second-order expectation semirings with applications to
minimum-risk training on translation forests**.

Zhifei Li and Jason Eisner (2009).

In *EMNLP*. [ paper | slides | printable slides | scholar | bib ]

Many statistical translation models can be regarded as weighted logical deduction. Under this paradigm, we use weights from the expectation semiring (Eisner, 2002), to compute first-order statistics (e.g., the expected hypothesis length or feature counts) over packed forests of translations (lattices or hypergraphs). We then introduce a novel second-order expectation semiring, which computes second-order statistics (e.g., the variance of the hypothesis length or the gradient of entropy). This second-order semiring is essential for many interesting training paradigms such as minimum risk, deterministic annealing, active learning, and semi-supervised learning, where gradient descent optimization requires computing the gradient of entropy or risk. We use these semirings in an open-source machine translation toolkit, Joshua, enabling minimum-risk training for a benefit of up to 1.0 BLEU point.
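As a rough illustration of the first-order case, the sketch below represents each weight as a pair (p, r), where p is probability mass and r is the probability-weighted total of some quantity such as length. The class name `Exp`, the helper `arc`, and the toy two-stage lattice are assumptions made for this example; they are not taken from the paper or from Joshua.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exp:
    """Hypothetical first-order expectation semiring element (p, r)."""
    p: float  # total probability mass of the paths summarized so far
    r: float  # sum over those paths of (path probability * path value)

    def __add__(self, other):
        # Semiring "plus": union of alternative paths.
        return Exp(self.p + other.p, self.r + other.r)

    def __mul__(self, other):
        # Semiring "times": concatenation of path segments
        # (probabilities multiply, values add).
        return Exp(self.p * other.p, self.p * other.r + other.p * self.r)


def arc(prob, value):
    """Weight of a single arc with probability `prob` and value `value`."""
    return Exp(prob, prob * value)


# Toy lattice with two independent stages (made-up numbers).
# Stage 1: emit 1 word with prob 0.6, or 2 words with prob 0.4.
# Stage 2: emit 1 word with prob 0.5, or nothing with prob 0.5.
stage1 = arc(0.6, 1) + arc(0.4, 2)
stage2 = arc(0.5, 1) + arc(0.5, 0)

total = stage1 * stage2
expected_length = total.r / total.p
print(expected_length)
```

The same two operations can be applied along the edges of a larger lattice or hypergraph by a standard dynamic program, so the expectation is obtained without enumerating paths.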

Note: Additional introduction and details are given in the dissertation of Li (2010).

Keywords: MT, finite-state methods, semirings, dynamic programming, automatic differentiation, selected papers

**Graphical models over multiple strings**.

Markus Dreyer and Jason Eisner (2009).

In *EMNLP*. [ paper | slides | Quicktime slides | PDF slides | scholar | bib ]

We study graphical modeling in the case of *string-valued* random variables. Whereas a weighted finite-state transducer can model the probabilistic relationship between *two* strings, we are interested in building up joint models of *three or more* strings. This is needed for inflectional paradigms in morphology, cognate modeling or language reconstruction, and multiple-string alignment. We propose a Markov Random Field in which each factor (potential function) is a weighted finite-state machine, typically a transducer that evaluates the relationship between just two of the strings. The full joint distribution is then a product of these factors. Though decoding is actually undecidable in general, we can still do efficient joint inference using approximate belief propagation; the necessary computations and messages are all finite-state. We demonstrate the methods by jointly predicting morphological forms.

Note: Additional details are given in the dissertation of Dreyer (2011), and on the associated slides. Cotterell and Eisner (2015) give an improved inference method for graphical models over strings. Cotterell et al. (2015) extended the morphological modeling approach to handle latent underlying morphs and derivational morphology.

Keywords: graphical model algorithms, morphology, variational inference, finite-state methods, dynamic programming

**Parser adaptation and projection with quasi-synchronous grammar
features**.

David A. Smith and Jason Eisner (2009).

In *EMNLP*. [ paper | slides | scholar | bib ]

We connect two scenarios in structured learning: *adapting* a parser trained on one corpus to another annotation style, and *projecting* syntactic annotations from one language to another. We propose *quasi-synchronous grammar* (QG) features for these structured learning tasks. That is, we score an aligned pair of source and target trees based on local features of the trees and the alignment. Our quasi-synchronous model assigns positive probability to any alignment of any trees, in contrast to a synchronous grammar, which would insist on some form of structural parallelism.

In monolingual dependency parser adaptation, we achieve high accuracy in translating among multiple annotation styles for the same sentence. On the more difficult problem of cross-lingual parser projection, we learn a dependency parser for a target language by using bilingual text, an English parser, and automatic word alignments. Our experiments show that unsupervised QG projection improves on parses trained using only high-precision projected annotations and far outperforms, by more than 35% absolute dependency accuracy, learning an unsupervised parser from raw target-language text alone. When a few target-language parse trees are available, projection gives a boost equivalent to doubling the number of target-language trees.

Note: Additional details are given in the dissertation of Smith (2010).

Keywords: dependency parsing, synchronous grammar algorithms, alignment, domain adaptation

**Learning linear ordering problems for better translation**.

Roy Tromble and Jason Eisner (2009).

In *EMNLP*. [ paper | slides | PDF slides | scholar | bib ]

We apply machine learning to the Linear Ordering Problem in order to learn sentence-specific reordering models for machine translation. We demonstrate that even when these models are used as a mere preprocessing step for German-English translation, they significantly outperform Moses' integrated lexicalized reordering model.

Our models are trained on automatically aligned bitext. Their form is simple but novel. They assess, based on features of the input sentence, how strongly each pair of input word tokens w_{i}, w_{j} would like to reverse their relative order. Combining all these pairwise preferences to find the best global reordering is NP-hard. However, we present a non-trivial O(n^{3}) algorithm, based on chart parsing, that at least finds the best reordering within a certain exponentially large neighborhood. We show how to iterate this reordering process within a local search algorithm, which we use in training.
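To make the Linear Ordering Problem objective concrete, here is a minimal sketch that scores a permutation against a matrix of pairwise preferences and finds the best ordering by exhaustive search. The matrix `B` and both function names are hypothetical, and the brute-force search is only a stand-in that works for tiny inputs; it is not the chart-parsing neighborhood search described in the abstract.

```python
from itertools import permutations


def lop_score(order, B):
    """Sum B[i][j] over every pair where item i precedes item j in `order`."""
    return sum(
        B[order[a]][order[b]]
        for a in range(len(order))
        for b in range(a + 1, len(order))
    )


def best_order_brute_force(B):
    """Try all n! orderings; feasible only for very small n."""
    n = len(B)
    return max(permutations(range(n)), key=lambda order: lop_score(order, B))


# Made-up preferences: B[i][j] is how strongly word i wants to precede word j.
B = [
    [0, 1, 5],
    [4, 0, 2],
    [0, 3, 0],
]

identity = (0, 1, 2)
best = best_order_brute_force(B)
print(lop_score(identity, B), best, lop_score(best, B))
```

In the setting of the abstract, the entries of such a matrix would be computed from features of the input sentence rather than written by hand.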

Note: Additional details are given in the dissertation of Tromble (2009).

Keywords: MT, local search, dynamic programming, synchronous grammar algorithms, translation

**Variational decoding for statistical machine translation**.

Zhifei Li, Jason Eisner, and Sanjeev Khudanpur (2009).

In *ACL (nominated for Best Paper Award)*. [ paper | slides | printable slides | scholar | bib ]

Statistical models in machine translation exhibit *spurious ambiguity*. That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations). In principle, the goodness of a string is measured by the total probability of its many derivations. However, finding the best string (e.g., during decoding) is then computationally intractable. Therefore, most systems use a simple Viterbi approximation that measures the goodness of a string using only its most probable derivation. Instead, we develop a variational approximation, which considers all the derivations but still allows tractable decoding. Our particular variational distributions are parameterized as *n*-gram models. We also analytically show that interpolating these *n*-gram models for different *n* is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). Experiments show that our approach improves the state of the art.
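The gap between the Viterbi approximation and the true best string can be seen on a toy example. The sketch below uses an invented list of (string, probability) derivations and two hypothetical helper functions. It sums derivations by explicit enumeration, which is exactly what is infeasible at realistic scale, and it does not implement the variational approximation itself.

```python
from collections import defaultdict

# Invented derivations: several distinct derivations yield the same string.
derivations = [
    ("the cat", 0.30),
    ("a cat", 0.20),
    ("a cat", 0.20),
    ("a cat", 0.15),
    ("the cats", 0.15),
]


def viterbi_string(derivs):
    """String of the single most probable derivation."""
    return max(derivs, key=lambda d: d[1])[0]


def map_string(derivs):
    """String with the highest total probability, summed over its derivations."""
    totals = defaultdict(float)
    for string, prob in derivs:
        totals[string] += prob
    return max(totals, key=totals.get)


print(viterbi_string(derivations))  # string chosen by the Viterbi approximation
print(map_string(derivations))      # string chosen when derivations are summed
```

Here the two criteria disagree: one string owns the single best derivation, while another has more total mass spread over three derivations.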

Note: Additional introduction and details are given in the dissertation of Li (2010).

Keywords: MT, variational inference, dynamic programming, selected papers

**Weighted deduction as an abstraction level for AI**.

Jason Eisner (2009).

Invited talk at ILP+MLG+SRL. [ slides | video | bib ]

The field of AI has become implementation-bound. We have plenty of ideas, but it is increasingly laborious to try them out, as our models become more ambitious and our datasets become larger, noisier, and more heterogeneous. The software engineering burden makes it hard to start new work; hard to reuse and combine existing ideas; and hard to educate our students.

In this talk, I'll propose to hide many common implementation details behind a new level of abstraction that we are developing. Dyna is a declarative programming language that combines logic programming with functional programming. It also supports modularity. It may be regarded as a kind of deductive database, theorem prover, truth maintenance system, or equation solver.

I will illustrate how Dyna makes it easy to specify the combinatorial structure of typical computations needed in natural language processing, machine learning, and elsewhere in AI. Then I will sketch implementation strategies and program transformations that can help to make these computations fast and memory-efficient. Finally, I will suggest that machine learning should be used to search for the right strategies for a program on a particular workload.

Keywords: Dyna, invited talks, recorded talks

**Joint models with missing data for semi-supervised learning**.

Jason Eisner (2009).

Keynote talk at the NAACL HLT Workshop on Semi-supervised Learning
for Natural Language Processing. [ slides | bib ]

Note: The talk referred to Smith & Eisner (2008), Eisner & Karakos (2005), and Smith & Eisner (2009), among others.

Keywords: invited talks

**Search and Learning for the Linear Ordering Problem with an
Application to Machine Translation**.

Roy Tromble (2009).

PhD thesis, Johns Hopkins University. [ dissertation | bib ]

This dissertation is about ordering. The problem of arranging a set of *n* items in a desired order is quite common, as well as fundamental to computer science. Sorting is one instance, as is the Traveling Salesman Problem. Each problem instance can be thought of as optimization of a function that applies to the set of permutations.

The dissertation treats word reordering for machine translation as another instance of a combinatorial optimization problem. The approach introduced is to combine three different functions of permutations. The first function is based on finite-state automata, the second is an instance of the Linear Ordering Problem, and the third is an entirely new permutation problem related to the LOP.

The Linear Ordering Problem has the most attractive computational properties of the three, all of which are NP-hard optimization problems. The dissertation expends significant effort developing neighborhoods for local search on the LOP, and uses grammars and other tools from natural language parsing to introduce several new results, including a state-of-the-art local search procedure.

Combinatorial optimization problems such as the TSP or the LOP are usually given the function over permutations. In the machine translation setting, the weights are not given, only words. The dissertation applies machine learning techniques to derive a LOP from each given sentence using a corpus of sentences and their translations for training. It proposes a set of features for such learning and argues their propriety for translation based on an analogy to dependency parsing. It adapts a number of parameter optimization procedures to the novel setting of the LOP.

The signature result of the thesis is the application of a machine learned set of linear ordering problems to machine translation. Using reorderings found by search as a preprocessing step significantly improves translation of German to English, and significantly more than the lexicalized reordering model that is the default of the translation system.

In addition, the dissertation provides a number of new theoretical results, and lays out an ambitious program for potential future research. Both the reordering model and the optimization techniques have broad applicability, and the availability of machine learning makes even new problems without obvious structure approachable.

Note: Dr. Tromble's dissertation advisor was Jason Eisner.

Keywords: theses, MT, local search, dynamic programming, synchronous grammar algorithms, translation

**Cross-document coreference resolution: A key technology for learning by
reading**.

James Mayfield, David Alexander, Bonnie Dorr, Jason Eisner, Tamer
Elsayed, Tim Finin, Clay Fink, Marjorie Freedman, Nikesh Garera, Paul
McNamee, Saif Mohammad, Douglas Oard, Christine Piatko, Asad Sayeed, Zareen
Syed, Ralph Weischedel, Tan Xu, and David Yarowsky (2009).

In *AAAI 2009 Spring Symposium on Learning by Reading and
Learning to Read*. [ paper | scholar | bib ]

Keywords: coreference resolution

**Dyna: A non-probabilistic programming language for probabilistic
AI**.

Jason Eisner (2008).

Extended abstract for talk at the NeurIPS*2008 Workshop on Probabilistic Programming. [ paper | extended slides | bib ]

The Dyna programming language is intended to provide a declarative abstraction layer for building systems in ML and AI. It extends logic programming with weights in a way that resembles functional programming. The weights are often probabilities. Yet Dyna does not enforce a probabilistic semantics, since many AI and ML methods work with inexact probabilities (e.g., bounds) and other numeric and non-numeric quantities. Instead Dyna aims to provide a flexible abstraction layer that is “one level lower,” and whose efficient implementation will be able to serve as infrastructure for building a variety of toolkits, languages, and specific systems.

Keywords: Dyna

**Machine learning with annotator rationales to reduce annotation cost**.

Omar F. Zaidan, Jason Eisner, and Christine Piatko (2008).

In *NeurIPS\*2008 Workshop on Cost Sensitive Learning*. [ paper | slides | poster | PDF poster | scholar | bib ]

We review two novel methods for text categorization, based on a new framework that utilizes richer annotations that we call *annotator rationales*. A human annotator provides hints to a machine learner by highlighting contextual “rationales” in support of each of his or her annotations. We have collected such rationales, in the form of substrings, for an existing document sentiment classification dataset [1]. We have developed two methods, one discriminative [2] and one generative [3], that use these rationales during training to obtain significant accuracy improvements over two strong baselines. Our generative model in particular could be adapted to help learn other kinds of probabilistic classifiers for quite different tasks. Based on a small study of annotation speed, we posit that for some tasks, providing rationales can be a more fruitful use of an annotator's time than annotating more examples.

Note: This paper is just a synthesis of Zaidan et al. (2007) and Zaidan & Eisner (2008). (I hate doing multiple papers on the same work, but I wanted the ML community outside of NLP to see these results, so I asked the workshop organizers for permission to republish them here.)

Keywords: annotation, sentiment, generative modeling, selected papers

**Dependency parsing by belief propagation**.

David A. Smith and Jason Eisner (2008).

In *EMNLP*. [ paper | extended slides | scholar | bib ]

We formulate dependency parsing as a graphical model with the novel ingredient of global constraints. We show how to apply loopy belief propagation (BP), a simple and effective tool for *approximate* learning and inference. As a parsing algorithm, BP is both asymptotically and empirically efficient. Even with second-order features or latent variables, which would make exact parsing considerably slower or NP-hard, BP needs only O(n^{3}) time with a small constant factor. Furthermore, such features significantly improve parse accuracy over exact first-order methods. Incorporating additional features would increase the runtime additively rather than multiplicatively.

Note: Additional details are given in the dissertation of Smith (2010).

Keywords: dependency parsing, parsing approximations, variational inference, non-local syntax, selected papers

**Modeling annotators: A generative approach to learning from annotator
rationales**.

Omar F. Zaidan and Jason Eisner (2008).

In *EMNLP*. [ paper | slides | scholar | bib ]

A human annotator can provide hints to a machine learner by highlighting contextual “rationales” for each of his or her annotations (Zaidan et al., 2007). How can one exploit this side information to better learn the desired parameters θ? We present a generative model of how a given annotator, knowing the true θ, stochastically chooses rationales. Thus, observing the rationales helps us infer the true θ. We collect substring rationales for a sentiment classification task (Pang and Lee, 2004) and use them to obtain significant accuracy improvements for each annotator. Our new generative approach exploits the rationales more effectively than our previous “masking SVM” approach. It is also more principled, and could be adapted to help learn other kinds of probabilistic classifiers for quite different tasks.

Keywords: annotation, sentiment, generative modeling

**Latent-variable modeling of string transductions with finite-state
methods**.

Markus Dreyer, Jason R. Smith, and Jason Eisner (2008).

In *EMNLP*. [ paper | extended slides | code | scholar | bib ]

String-to-string transduction is a central problem in computational linguistics and natural language processing. It occurs in tasks as diverse as name transliteration, spelling correction, pronunciation modeling and inflectional morphology. We present a conditional log-linear model for string-to-string transduction, which employs overlapping features over latent alignment sequences, and which learns latent classes and latent string pair regions from incomplete training data. We evaluate our approach on morphological tasks and demonstrate that latent variables can dramatically improve results, even when trained on small data sets. On the task of generating morphological forms, we outperform a baseline method reducing the error rate by up to 48%. On a lemmatization task, we reduce the error rates in Wicentowski (2002) by 38-92%.

Note: Additional details are given in the dissertation of Dreyer (2011).

Keywords: finite-state methods, morphology

**Proceedings of the Tenth Meeting of the ACL Special Interest Group on
Computational Morphology and Phonology**.

Jason Eisner and Jeffrey Heinz, editors. [ volume | bib ]

Keywords: edited volumes

**Competitive grammar writing**.

Jason Eisner and Noah A. Smith (2008).

In *ACL Workshop on Teaching CL*. [ paper | slides | code | original code | scholar | bib ]

Just as programming is the traditional introduction to computer science, writing grammars by hand is an excellent introduction to many topics in computational linguistics. We present and justify a well-tested introductory activity in which teams of mixed background compete to write probabilistic context-free grammars of English. The exercise brings together symbolic, probabilistic, algorithmic, and experimental issues in a way that is accessible to novices and enjoyable.

Keywords: teaching, syntax

**Machine translation system combination using ITG-based alignments**.

Damianos Karakos, Jason Eisner, Sanjeev Khudanpur, and Markus Dreyer
(2008).

In *ACL*. [ paper | scholar | bib ]

Note: See Rosti et al. (2011, p. 355) for a speedup using an A* heuristic.

Keywords: MT, dynamic programming, synchronous grammar algorithms, alignment

**Proceedings of the 2007 Joint Conference on Empirical Methods in
Natural Language Processing and Computational Natural Language Learning
(EMNLP-CoNLL)**.

Jason Eisner, editor. [ volume | bib ]

Keywords: edited volumes

**Bootstrapping feature-rich dependency parsers with entropic priors**.

David A. Smith and Jason Eisner (2007).

In *EMNLP-CoNLL*. [ paper | slides | scholar | bib ]

One may need to build a statistical parser for a new language, using only a very small labeled treebank together with raw text. We argue that bootstrapping a parser is most promising when the model uses a rich set of redundant features, as in recent models for scoring dependency parses (McDonald et al., 2005). Drawing on Abney's (2004) analysis of the Yarowsky algorithm, we perform bootstrapping by entropy regularization: we maximize a linear combination of conditional likelihood on labeled data and confidence (negative Rényi entropy) on unlabeled data. In initial experiments, this surpassed EM for training a simple feature-poor generative model, and also improved the performance of a feature-rich, conditionally estimated model where EM could not easily have been applied. For our models and training sets, more peaked measures of confidence, measured by Rényi entropy, outperformed smoother ones. We discuss how our feature set could be extended with cross-lingual or cross-domain features, to incorporate knowledge from parallel or comparable corpora during bootstrapping.

Keywords: bootstrapping, grammar induction, training objectives
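As a rough illustration of the confidence measure mentioned in the abstract above, the sketch below computes the Rényi entropy H_α(p) = (1/(1−α)) log Σ_i p_i^α of a discrete distribution. The function name `renyi_entropy` and the toy distributions are assumptions made for illustration and are not taken from the paper; the α = 1 case is handled as the Shannon limit and α = ∞ as min-entropy.

```python
import math

def renyi_entropy(p, alpha):
    """Rényi entropy (in nats) of a discrete distribution p, for alpha >= 0."""
    probs = [x for x in p if x > 0]
    if math.isclose(alpha, 1.0):
        # Shannon entropy is the alpha -> 1 limit.
        return -sum(x * math.log(x) for x in probs)
    if math.isinf(alpha):
        # Min-entropy: depends only on the most probable outcome.
        return -math.log(max(probs))
    return math.log(sum(x ** alpha for x in probs)) / (1.0 - alpha)

# Hypothetical toy distributions over three candidate parses.
peaked = [0.9, 0.05, 0.05]
flat = [1 / 3, 1 / 3, 1 / 3]
for alpha in (0.5, 1.0, 2.0, math.inf):
    print(alpha, renyi_entropy(peaked, alpha), renyi_entropy(flat, alpha))
```

Negating this quantity gives a confidence score that is higher for peaked distributions. The sketch does not attempt to reproduce the paper's full training objective, which combines this term with conditional likelihood on labeled data.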

**Using “annotator rationales” to improve machine learning for text
categorization**.

Omar Zaidan, Jason Eisner, and Christine Piatko (2007).

In *NAACL-HLT*. [ paper | slides | PDF slides | data | scholar | bib ]

We propose a new framework for supervised machine learning. Our goal is to learn from smaller amounts of supervised training data, by collecting a richer kind of training data: annotations with “rationales.” When annotating an example, the human teacher will also highlight evidence supporting this annotation—thereby teaching the machine learner why the example belongs to the category. We provide some rationale-annotated data and present a learning method that exploits the rationales during training to boost performance significantly on a sample task, namely sentiment classification of movie reviews. We hypothesize that in some situations, providing rationales is a more fruitful use of an annotator's time than annotating more examples.

Keywords: annotation, sentiment

**Cross-instance tuning of unsupervised document clustering algorithms**.

Damianos Karakos, Jason Eisner, Sanjeev Khudanpur, and Carey E.
Priebe (2007).

In *NAACL-HLT*. [ paper | slides | scholar | bib ]

In unsupervised learning, where no training takes place, one simply hopes that the unsupervised learner will work well on any unlabeled test collection. However, when the variability in the data is large, such hope may be unrealistic; a tuning of the unsupervised algorithm may then be necessary in order to perform well on new test collections. In this paper, we show how to perform such a tuning in the context of unsupervised document clustering, by (i) introducing a degree of freedom, α, into two leading information-theoretic clustering algorithms, through the use of generalized mutual information quantities; and (ii) selecting the value of α based on clusterings of similar, but supervised document collections (cross-instance tuning). One option is to perform a tuning that directly minimizes the error on the supervised data sets; another option is to use “strapping” (Eisner and Karakos, 2005), which builds a classifier that learns to distinguish good from bad clusterings, and then selects the α with the best predicted clustering on the test set. Experiments from the “20 Newsgroups” corpus show that, although both techniques improve the performance of the baseline algorithms, “strapping” is clearly a better choice for cross-instance tuning.

Keywords: strapping, document clustering

**Iterative denoising using Jensen-Rényi divergences with an
application to unsupervised document categorization**.

Damianos Karakos, Sanjeev Khudanpur, Jason Eisner, and Carey E.
Priebe (2007).

In *ICASSP*. [ paper | scholar | bib ]

Iterative denoising trees were used by Karakos et al. [1] for unsupervised hierarchical clustering. The tree construction involves projecting the data onto low-dimensional spaces, as a means of smoothing their empirical distributions, as well as splitting each node based on an information-theoretic maximization objective. In this paper, we improve upon the work of [1] in two ways: (i) the amount of computation spent searching for a good projection at each node now adapts to the intrinsic dimensionality of the data observed at that node; (ii) the objective at each node is to find a split which maximizes a generalized form of mutual information, the Jensen-Rényi divergence; this is followed by an iterative Naive Bayes classification. The single parameter alpha of the Jensen-Rényi divergence is chosen based on the “strapping” methodology [2], which learns a meta-classifier on a related task. Compared with the sequential Information Bottleneck method [3], our procedure produces state-of-the-art results on an unsupervised categorization task of documents from the “20 Newsgroups” dataset.

Keywords: clustering, document clustering

**Program transformations for optimization of parsing algorithms and other
weighted logic programs**.

Jason Eisner and John Blatz (2007).

In *Conference on Formal Grammar*. [ paper | slides | bib ]

Dynamic programming algorithms in statistical natural language processing can be easily described as weighted logic programs. We give a notation and semantics for such programs. We then describe several source-to-source transformations that affect a program's efficiency, primarily by rearranging computations for better reuse or by changing the search strategy. We present practical examples of using these transformations, mainly to optimize context-free parsing algorithms, and we formalize them for use with new weighted logic programs.

Specifically, we define weighted versions of the folding and unfolding transformations, whose unweighted versions are used in the logic programming and deductive database communities. We then present a novel transformation called speculation—a powerful generalization of folding that is motivated by gap-passing in categorial grammar. Finally, we give a simpler and more powerful formulation of the magic templates transformation.

Note: The speculation transform is really a form of lifted inference (a term that gained currency later).

Keywords: Dyna, CFG parsing, dynamic programming, selected papers

**Novel Estimation Methods for Unsupervised Discovery of Latent
Structure in Natural Language Text**.

Noah A. Smith (2006).

PhD thesis, Johns Hopkins University. [ dissertation | slides | scholar | bib ]

This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood, and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias.

Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and supervised model selection, where the most accurate model on the development set (now with annotations) is selected. The latter is far superior, but surprisingly few annotated examples are required. The experimentation focuses on a single dependency grammar induction task, in depth. The aim is to give strong support for the usefulness of the new techniques in one scenario. It must be noted, however, that the task (as defined here and in prior work) is somewhat artificial, and improved performance on this particular task is not a direct contribution to the greater field of natural language processing. The real problem the task seeks to simulate—the induction of syntactic structure in natural language text—is certainly of interest to the community, but this thesis does not directly approach the problem of exploiting induced syntax in applications. We also do not attempt any realistic simulation of human language learning, as our newspaper text data do not resemble the data encountered by a child during language acquisition. Further, our iterative learning algorithms assume a fixed batch of data that can be repeatedly accessed, not a long stream of data observed over time in tandem with acquisition. (Of course, the cognitive criticisms apply to virtually all existing learning methods in natural language processing, not just the new ones presented here.) Nonetheless, the novel estimation methods presented are, we will argue, better suited to adaptation for real engineering tasks than the maximum likelihood baseline.

Our new methods are shown to achieve significant improvements over maximum likelihood estimation and maximum a posteriori estimation, using the EM algorithm, for a state-of-the-art probabilistic model used in dependency grammar induction (Klein and Manning, 2004). The task is to induce dependency trees from part-of-speech tag sequences; we follow standard practice and train and test on sequences of ten tags or fewer. Our results are the best published to date for six languages, with supervised model selection: English (improvement from 41.6% directed attachment accuracy to 66.7%, a 43% relative error rate reduction), German (54.4% -> 71.8%, a 38% error reduction), Bulgarian (45.6% -> 58.3%, a 23% error reduction), Mandarin (50.0% -> 58.0%, a 16% error reduction), Turkish (48.0% -> 62.4%, a 28% error reduction, but only 2% error reduction from a left-branching baseline, which gives 61.8%), and Portuguese (42.5% -> 71.8%, a 51% error reduction). We also demonstrate the success of contrastive estimation at learning to disambiguate part-of-speech tags (from unannotated English text): 78.0% to 88.7% tagging accuracy on a known-dictionary task (a 49% relative error rate reduction), and 66.5% to 78.4% on a more difficult task with less dictionary knowledge (a 35% error rate reduction).

The experiments presented in this thesis give one of the most thorough explorations to date of unsupervised parameter estimation for models of discrete structures. Two sides of the problem are considered in depth: the choice of objective function to be optimized during training, and the method of optimizing it. We find that both are important in unsupervised learning. Our best results on most of the six languages involve both improved objectives and improved search.

The methods presented in this thesis were originally presented in Smith and Eisner (2004, 2005a,b, 2006). The thesis gives a more thorough exposition, relating the methods to other work, presents more experimental results and error analysis, and directly compares the methods to each other.

Note: Dr. Smith's dissertation advisor was Jason Eisner.

Keywords: theses, grammar induction, deterministic annealing, contrastive estimation, structural annealing
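The "relative error rate reduction" figures quoted in the thesis abstract above can be checked with a little arithmetic: treat 100 minus accuracy as the error rate, and report the fraction of the baseline's errors that the new method removes. The helper name `relative_error_reduction` is an assumption for illustration; the accuracy pairs are copied from the abstract.

```python
def relative_error_reduction(old_acc, new_acc):
    """Fraction of the baseline's errors removed; accuracies are percentages."""
    old_err = 100.0 - old_acc
    new_err = 100.0 - new_acc
    return (old_err - new_err) / old_err

# (baseline accuracy, new accuracy) pairs quoted in the abstract.
results = {
    "English": (41.6, 66.7),
    "German": (54.4, 71.8),
    "Bulgarian": (45.6, 58.3),
    "Mandarin": (50.0, 58.0),
    "Turkish": (48.0, 62.4),
    "Portuguese": (42.5, 71.8),
}
for lang, (old, new) in results.items():
    print(f"{lang}: {100 * relative_error_reduction(old, new):.0f}%")
```

For English, for example, the error rate falls from 58.4% to 33.3%, and 25.1 / 58.4 is about 0.43, matching the 43% quoted. The same formula applied to the left-branching Turkish baseline (61.8% -> 62.4%) gives roughly 2%.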

**Visual navigation through large directed graphs and hypergraphs**.

Jason Eisner, Michael Kornbluh, Gordon Woodhull, Raymond Buse, Samuel
Huang, Constantinos Michael, and George Shafer (2006).

In *InfoVis*. [ paper | website | poster | PDF poster | scholar | bib ]

We describe Dynasty, a system for browsing large (possibly infinite) directed graphs and hypergraphs. Only a small subgraph is visible at any given time. We sketch how we lay out the visible subgraph, and how we update the layout smoothly and dynamically in an asynchronous environment. We also sketch our user interface for browsing and annotating such graphs—in particular, how we try to make keyboard navigation usable.

Keywords: Dyna, visualization

**A natural-language approach to automated cryptanalysis of two-time
pads**.

Joshua Mason, Kathryn Watkins, Jason Eisner, and Adam Stubblefield
(2006).

In *ACM CCS*. [ paper | slides | scholar | bib ]

While keystream reuse in stream ciphers and one-time pads has been a well-known problem for several decades, the risk to real systems has been underappreciated. Previous techniques have relied on being able to accurately guess words and phrases that appear in one of the plaintext messages, making it far easier to claim that “an attacker would never be able to do that.” In this paper, we show how an adversary can automatically recover messages encrypted under the same keystream if only the type of each message is known (e.g., an HTML page in English). Our method, which is related to HMMs, recovers the most probable plaintext of this type by using a statistical language model and a dynamic programming algorithm. It produces up to 99% accuracy on realistic data and can process ciphertexts at 200 ms per byte on a $2,000 PC. To further demonstrate the practical effectiveness of the method, we show that our tool can recover documents encrypted by Microsoft Word 2002.

Keywords: cryptanalysis
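To see why keystream reuse leaks information at all, the minimal sketch below encrypts two messages with the same keystream and shows that the keystream cancels when the ciphertexts are XORed. The messages and the helper `xor_bytes` are hypothetical. The sketch only demonstrates the underlying weakness and the older crib-guessing attack; it does not implement the paper's language-model decoder, which avoids the need to guess a plaintext.

```python
import os

def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

# Two hypothetical plaintexts, truncated to a common length.
p1 = b"<html><body>Hello, world</body></html>"
p2 = b"Dear Alice, the meeting moved to Friday at noon."
n = min(len(p1), len(p2))
p1, p2 = p1[:n], p2[:n]

keystream = os.urandom(n)  # reused for both messages: this is the mistake
c1 = xor_bytes(p1, keystream)
c2 = xor_bytes(p2, keystream)

# The keystream cancels, so an eavesdropper learns p1 XOR p2 without the key.
assert xor_bytes(c1, c2) == xor_bytes(p1, p2)

# Crib-based attack: a correct guess of one plaintext reveals the other.
recovered = xor_bytes(xor_bytes(c1, c2), p1)
print(recovered)
```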

**Better informed training of latent syntactic features**.

Markus Dreyer and Jason Eisner (2006).

In *EMNLP*. [ paper | poster | scholar | bib ]

We study unsupervised methods for learning refinements of the nonterminals in a treebank. Following Matsuzaki et al. (2005) and Prescher (2005), we may for example split NP without supervision into NP[0] and NP[1], which behave differently. We first propose to learn a PCFG that adds such features to nonterminals in such a way that they respect patterns of linguistic feature passing: each node's nonterminal features are either identical to, or independent of, those of its parent. This linguistic constraint reduces runtime and the number of parameters to be learned. However, it did not yield improvements when training on the Penn Treebank. An orthogonal strategy was more successful: to improve the performance of the EM learner by treebank preprocessing and by annealing methods that split nonterminals selectively. Using these methods, we can maintain high parsing accuracy while dramatically reducing the model size.

Keywords: non-local syntax, CFG parsing

**Minimum-risk annealing for training log-linear models**.

David A. Smith and Jason Eisner (2006).

In *COLING-ACL*. [ paper | scholar | bib ]

When training the parameters for a natural language system, one would prefer to minimize 1-best loss (error) on an evaluation set. Since the error surface for many natural language problems is piecewise constant and riddled with local minima, many systems instead optimize log-likelihood, which is conveniently differentiable and convex. We propose training instead to minimize the expected loss, or risk. We define this expectation using a probability distribution over hypotheses that we gradually sharpen (anneal) to focus on the 1-best hypothesis. Besides the linear loss functions used in previous work, we also describe techniques for optimizing nonlinear functions such as precision or the BLEU metric. We present experiments training log-linear combinations of models for dependency parsing and for machine translation. In machine translation, annealed minimum risk training achieves significant improvements in BLEU over standard minimum error training. We also show improvements in labeled dependency parsing.

Keywords: discriminative training, MT, deterministic annealing, training objectives
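The idea of an expected loss under a gradually sharpened distribution, as described in the abstract above, can be sketched on a toy k-best list. This sketch assumes the distribution has the form p_γ(y) ∝ exp(γ · score(y)), with γ playing the role of an inverse temperature. The function name `annealed_risk` and the scores and losses are made up for illustration, and the sketch omits the gradient-based optimization of model parameters.

```python
import math

def annealed_risk(scores, losses, gamma):
    """Expected loss under p_gamma(y) proportional to exp(gamma * score(y))."""
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(gamma * (s - m)) for s in scores]
    z = sum(weights)
    return sum(w / z * loss for w, loss in zip(weights, losses))

# Hypothetical model scores and losses for three hypotheses.
scores = [2.0, 1.5, 0.0]
losses = [0.3, 0.1, 0.8]
for gamma in (0.0, 1.0, 10.0, 1000.0):
    print(gamma, annealed_risk(scores, losses, gamma))
```

At γ = 0 the distribution is uniform and the risk is the plain average loss (0.4 here). As γ grows, the risk approaches the loss of the 1-best hypothesis (0.3 here), which is the piecewise-constant quantity that is hard to optimize directly.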

**Annealing structural bias in multilingual weighted grammar induction**.

Noah A. Smith and Jason Eisner (2006).

In *COLING-ACL*. [ paper | scholar | bib ]

We first show how a structural locality bias can improve the accuracy of state-of-the-art dependency grammar induction models trained by EM from unannotated examples (Klein and Manning, 2004). Next, by annealing the free parameter that controls this bias, we achieve further improvements. We then describe an alternative kind of structural bias, toward “broken” hypotheses consisting of partial structures over segmented sentences, and show a similar pattern of improvement. We relate this approach to contrastive estimation (Smith and Eisner, 2005), apply the latter to grammar induction in six languages, and show that our new approach improves accuracy by 1–17% (absolute) over CE (and 8–30% over EM), achieving to our knowledge the best results on this task to date. Our method, structural annealing, is a general technique with broad applicability to hidden-structure discovery problems.

Note: Additional details are given in the dissertation of Smith (2006).

Keywords: deterministic annealing, structural annealing, vine grammar, grammar induction, training objectives, selected papers

**Local search with very large-scale neighborhoods for optimal permutations
in machine translation**.

Jason Eisner and Roy W. Tromble (2006).

In *HLT-NAACL Workshop on Computationally Hard Problems and Joint
Inference in Speech and Language Processing*. [ paper | extended slides | slides | scholar | bib ]

We introduce a novel decoding procedure for statistical machine translation and other ordering tasks based on a family of Very Large-Scale Neighborhoods, some of which have previously been applied to other NP-hard permutation problems. We significantly generalize these problems by simultaneously considering three distinct sets of ordering costs. We discuss how these costs might apply to MT, and some possibilities for training them. We show how to search and sample from exponentially large neighborhoods using efficient dynamic programming algorithms that resemble statistical parsing. We also incorporate techniques from statistical parsing to improve the runtime of our search. Finally, we report results of preliminary experiments indicating that the approach holds promise.

Note: Further theorems and experiments appear in Tromble & Eisner (2009) and especially in the dissertation of Tromble (2009).

Keywords: MT, local search, dynamic programming, synchronous grammar algorithms, translation, selected papers

**Quasi-synchronous grammars: Alignment by soft projection of syntactic
dependencies**.

David A. Smith and Jason Eisner (2006).

In *HLT-NAACL Workshop on Statistical MT (nominated for 5-year
retrospective Best Paper award)*. [ paper | extended slides | slides | scholar | bib ]

Many syntactic models in machine translation are channels that transform one tree into another, or synchronous grammars that generate trees in parallel. We present a new model of the translation process: quasi-synchronous grammar (QG). Given a source-language parse tree T₁, a QG defines a monolingual grammar that generates translations of T₁. The trees T₂ allowed by this monolingual grammar are inspired by pieces of substructure in T₁ and aligned to T₁ at those points. We describe experiments learning quasi-synchronous context-free grammars from bitext. As with other monolingual language models, we evaluate the cross-entropy of QGs on unseen text and show that a better fit to bilingual data is achieved by allowing greater syntactic divergence. When evaluated on a word alignment task, QG matches standard baselines.

Keywords: syntax, translation, MT, alignment, synchronous grammar algorithms, dependency parsing

**A fast finite-state relaxation method for enforcing global constraints on
sequence decoding**.

Roy W. Tromble and Jason Eisner (2006).

In *HLT-NAACL*. [ paper | slides | scholar | bib ]

We describe finite-state constraint relaxation, a method for applying global constraints, expressed as automata, to sequence model decoding. We present algorithms for both hard constraints and binary soft constraints. On the CoNLL-2004 semantic role labeling task, we report a speedup of at least 16x over a previous method that used integer linear programming.

Keywords: finite-state methods, relaxation, global optimization

**Finite-state Dirichlet allocation: Learned priors on finite-state
models**.

Jia Cui and Jason Eisner (2006).

Technical report, Center for Language and Speech Processing, Johns
Hopkins University. [ paper | scholar | bib ]

To model a collection of documents, suppose that each document was generated by a different hidden Markov model or probabilistic finite-state automaton (PFSA). Further suppose that all these PFSAs are similar because they are drawn from a single (but unknown) prior distribution over PFSAs. We wish to infer the prior, obtain smoothed estimates of the individual PFSAs, and reconstruct the hidden paths by which the unknown PFSAs generated their documents.

As an initial application, particularly hard for our model because of its sparse data, we derive an FSA topology from WordNet. For each verb, we construct the “document” of all nouns that have appeared as its object. Our method then learns a better estimate of p(object | verb), as well as which paths in WordNet, and hence which senses of ambiguous objects, tend to be favored. Our method improves 14.6% over Witten-Bell smoothing on the conditional perplexity of objects given the verb, and 27.5% over random on detecting the most common senses of nouns in the SemCor corpus.

Keywords: topic models, selectional preferences, finite-state methods, transformation models, variational inference

**Parsing with soft and hard constraints on dependency length**.

Jason Eisner and Noah A. Smith (2005).

In *IWPT*. [ paper | slides | 3rd party slides | scholar | bib ]

In lexicalized phrase-structure or dependency parses, a word's modifiers tend to fall near it in the string. We show that a crude way to use dependency length as a parsing feature can substantially improve parsing speed and accuracy in English and Chinese, with more mixed results on German. We then show similar improvements by imposing hard bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. This simple “vine grammar” formalism has only finite-state power, but a context-free parameterization with some extra parameters for stringing fragments together. We exhibit a linear-time chart parsing algorithm with a low grammar constant.

Note: Consider instead the 2010 book chapter that is an expanded version of this paper.

Keywords: dependency parsing, parsing approximations, dynamic programming, vine grammar

**Bootstrapping without the boot**.

Jason Eisner and Damianos Karakos (2005).

In *HLT-EMNLP*. [ paper | extended slides | audio for extended slides | slides | PDF slides | scholar | bib ]

“Bootstrapping” methods for learning require a small amount of supervision to seed the learning process. We show that it is sometimes possible to eliminate this last bit of supervision, by trying many candidate seeds and selecting the one with the most plausible outcome. We discuss such “strapping” methods in general, and exhibit a particular method for strapping word-sense classifiers for ambiguous words. Our experiments on the Canadian Hansards show that our unsupervised technique is significantly more effective than picking seeds by hand (Yarowsky, 1995), which in turn is known to rival supervised methods.

Keywords: strapping, WSD, synthetic data, recorded talks, selected papers

**Compiling comp ling: Weighted dynamic programming and the Dyna
language**.

Jason Eisner, Eric Goldlust, and Noah A. Smith (2005).

In *HLT-EMNLP*. [ paper | proof | extended slides | website | slides | PDF slides | audio for slides | scholar | bib ]

Weighted deduction with aggregation is a powerful theoretical formalism that encompasses many NLP algorithms. This paper proposes a declarative specification language, Dyna; gives general agenda-based algorithms for computing weights and gradients; briefly discusses Dyna-to-Dyna program transformations; and shows that a first implementation of a Dyna-to-C++ compiler produces code that is efficient enough for real NLP research, though still several times slower than hand-crafted code.

Keywords: Dyna, semirings, automatic differentiation, recorded talks
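To give a feel for "weighted deduction with aggregation" as mentioned in the abstract above, here is a hand-written Python analogue of one such algorithm: the inside algorithm for a weighted grammar in Chomsky normal form. Each chart item's weight is aggregated (here by summation) over all ways of deducing it from smaller items. This is not Dyna code and not the paper's agenda-based algorithm; the function `inside`, the dictionary encoding of rules, and the toy grammar are all assumptions for illustration.

```python
from collections import defaultdict

def inside(words, unary, binary, root="S"):
    """Total weight of all parses of `words` under a weighted CNF grammar.

    unary:  {(X, word): weight} for rules X -> word
    binary: {(X, Y, Z): weight} for rules X -> Y Z
    """
    n = len(words)
    # chart[X, i, k] = summed weight of all derivations of X over words[i:k]
    chart = defaultdict(float)
    for i, w in enumerate(words):
        for (x, word), wt in unary.items():
            if word == w:
                chart[x, i, i + 1] += wt
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (x, y, z), wt in binary.items():
                    # Aggregate (+=) the product of the rule and sub-item weights.
                    chart[x, i, k] += wt * chart[y, i, j] * chart[z, j, k]
    return chart[root, 0, n]

# Hypothetical ambiguous grammar: S -> S S (0.5), S -> "a" (0.5).
unary = {("S", "a"): 0.5}
binary = {("S", "S", "S"): 0.5}
total = inside(["a", "a", "a"], unary, binary)
print(total)  # two parses, each of weight 0.03125, so 0.0625
```

Replacing `+=` with a max and keeping the product would give the Viterbi (best-parse) weight instead, which is the sense in which the choice of aggregation operator selects the algorithm.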

**Guiding unsupervised grammar induction using contrastive estimation**.

Noah A. Smith and Jason Eisner (2005).

In *IJCAI Workshop on Grammatical Inference Applications*. [ paper | scholar | bib ]

We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization of the function maximized by the Expectation-Maximization algorithm [Dempster et al., 1977]. CE is a natural fit for log-linear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, log-linear dependency grammar models trained using CE can drastically outperform EM-trained generative models on the task of matching human linguistic annotations (the MatchLinguist task). The selection of an implicit negative evidence class—a “neighborhood”—appropriate to a given task has strong implications, but a good neighborhood can target the objective of grammar induction to a specific application.

Note: This version of the paper has minor corrections. Additional details are given in the dissertation of Smith (2006).

Keywords: grammar induction, contrastive estimation

**Contrastive estimation: Training log-linear models on unlabeled data**.

Noah A. Smith and Jason Eisner (2005).

In *ACL (nominated for Best Paper Award)*. [ paper | slides | scholar | bib ]

Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and named-entity extraction (McCallum and Li, 2003). CRFs are log-linear, allowing the incorporation of arbitrary features into the model. To train on unlabeled data, we require unsupervised estimation methods for log-linear models; few exist. We describe a novel approach, contrastive estimation. We show that the new technique can be intuitively understood as exploiting implicit negative evidence and is computationally efficient. Applied to a sequence labeling problem—POS tagging given a tagging dictionary and unlabeled text—contrastive estimation outperforms EM (with the same feature set), is more robust to degradations of the dictionary, and can largely recover by modeling additional features.

Note: Additional details are given in the dissertation of Smith (2006). In particular, here is the mapping from 45 fine-grained to 17 coarse-grained tags.

Keywords: grammar induction, contrastive estimation, tagging, selected papers

**A class of rational n-WFSM auto-intersections**.

André Kempe, Jean-Marc Champarnaud, Jason Eisner, Franck Guingne, and Florent Nicart (2005).

In

Weighted finite-state machines with n tapes describe n-ary rational string relations. The join n-ary relation is very important in applications. It is shown how to compute it via a simpler operation, the auto-intersection. Join and auto-intersection generally do not preserve rationality. We define a class of triples (A, i, j) such that the auto-intersection of the machine A on tapes i and j can be computed by a delay-based algorithm. We point out how to extend this class and hope that it is sufficient for many practical applications.

Note: Dreyer & Eisner (2009) and Paul & Eisner (2012) make some progress on the uncomputable problem of joining or auto-intersecting FSMs. The first paper gives an approximate algorithm for the real or tropical semiring; the second gives an algorithm for the tropical semiring that is exact if it terminates.

Keywords: semirings, finite-state methods

**Unsupervised classification via decision trees: An information-theoretic
perspective**.

Damianos Karakos, Sanjeev Khudanpur, Jason Eisner, and Carey E.
Priebe (2005).

In *ICASSP*. [ paper | scholar | bib ]

Integrated Sensing and Processing Decision Trees (ISPDTs) were introduced in [1] as a tool for supervised classification of high-dimensional data. In this paper, we consider the problem of *unsupervised* classification, through a recursive construction of ISPDTs, where at each internal node the data (i) are split into clusters, and (ii) are transformed independently of other clusters, guided by some optimization objective. We show that the maximization of information-theoretic quantities such as mutual information and alpha-divergences is theoretically justified for growing ISPDTs, assuming that each data point is generated by a finite-memory random process given the class label. Furthermore, we present heuristics that perform the maximization in a greedy manner, and we demonstrate their effectiveness with empirical results from multi-spectral imaging.

Keywords: clustering, document clustering

**A note on join and auto-intersection of n-ary rational relations**.

André Kempe, Jean-Marc Champarnaud, and Jason Eisner (2004).

In

A finite-state machine with *n* tapes describes a rational (or regular) relation on *n* strings. It is more expressive than a relational database table with *n* columns, which can only describe a finite relation.

We describe some basic operations on *n*-ary rational relations and propose notation for them. (For generality we give the semiring-weighted case in which each tuple has a weight.) Unfortunately, the join operation is problematic: if two rational relations are joined on more than one tape, it can lead to non-rational relations with undecidable properties. We recast join in terms of “auto-intersection” and illustrate some cases in which difficulties arise. We close with the hope that partial or restricted algorithms may be found that are still powerful enough to have practical use.

Note: Dreyer & Eisner (2009) and Paul & Eisner (2012) make some progress on the uncomputable problem of joining or auto-intersecting FSMs. The first paper gives an approximate algorithm for the real or tropical semiring; the second gives an algorithm for the tropical semiring that is exact if it terminates.

Keywords: semirings, finite-state methods

**Dyna: A declarative language for implementing dynamic programs**.

Jason Eisner, Eric Goldlust, and Noah A. Smith (2004).

In *ACL*. [ paper | website | scholar | bib ]

We present the first version of a new declarative programming language. Dyna has many uses but was designed especially for rapid development of new statistical NLP systems. A Dyna program is a small set of equations, resembling Prolog inference rules, that specify the abstract structure of a dynamic programming algorithm. It compiles into efficient, portable C++ classes that can be easily invoked from a larger application. By default, these classes run a generalization of agenda-based parsing, prioritizing the partial parses by some figure of merit. The classes can also perform an exact backward (outside) pass in the service of parameter training. The compiler already knows several implementation tricks, algorithmic transforms, and numerical optimization techniques. It will acquire more over time: we intend for it to *generalize* and *encapsulate* best practices, and serve as a testbed for new practices. Dyna is now being used for parsing, machine translation, morphological analysis, grammar induction, and finite-state modeling.

Note: Consider the longer 2005 version of this paper instead.

Keywords: Dyna, semirings, automatic differentiation

**Annealing techniques for unsupervised statistical language learning**.

Noah A. Smith and Jason Eisner (2004).

In *ACL*. [ paper | scholar | bib ]

Exploiting unannotated natural language data is hard largely because unsupervised parameter estimation is hard. We describe deterministic annealing (Rose et al., 1990) as an appealing alternative to the Expectation-Maximization algorithm (Dempster et al., 1977). Seeking to avoid search error, DA begins by globally maximizing an easy concave function and maintains a local maximum as it gradually morphs the function into the desired non-concave likelihood function. Applying DA to parsing and tagging models is shown to be straightforward; significant improvements over EM are shown on a part-of-speech tagging task. We describe a variant, skewed DA, which can incorporate a good initializer when it is available, and show significant improvements over EM on a grammar induction task.

Note: Additional details are given in the dissertation of Smith (2006).

Keywords: grammar induction, deterministic annealing, training objectives

**Radiology report entry with automatic phrase completion driven by language
modeling**.

John Eng and Jason M. Eisner (2004).

*Radiographics*. [ paper | supplement | permalink | scholar | bib ]

Language modeling, a technology found in many computerized speech recognition systems, can also be used in a text editor to implement an automated phrase completion feature that significantly reduces the number of keystrokes required to generate a radiology report, therefore increasing typing speed.

Radiology reports have especially low entropy, which allows prediction of multi-word phrases. Our system therefore chooses an optimal phrase length for each prediction, using Bellman-style dynamic programming to minimize the *expected* cost of typing the *rest* of the document. This computation considers what the user is likely to type in the future, and how many keystrokes it will take, considering the future effect of phrase completion as well.

Keywords: text entry, dynamic programming

**Natural language generation in the context of machine translation**.

Jan Hajič, Martin Čmejrek, Bonnie Dorr, Yuan Ding, Jason
Eisner, Daniel Gildea, Terry Koo, Kristen Parton, Gerald Penn, Dragomir
Radev, and Owen Rambow (2004).

Technical report, Center for Language and Speech Processing, Johns
Hopkins University. [ paper | website | slides | scholar | bib ]

Final report from the team at the JHU CLSP 2002 summer workshop. See project description.

Keywords: MT

**Learning non-isomorphic tree mappings for machine translation**.

Jason Eisner (2003).

In *ACL*. [ paper | slides | PDF slides | erratum | scholar | bib ]

Often one may wish to learn a tree-to-tree mapping, training it on unaligned pairs of trees, or on a mixture of trees and strings. Unlike previous statistical formalisms (limited to isomorphic trees), *synchronous tree substitution grammar* allows local distortion of the tree topology. We reformulate it to permit dependency trees, and sketch EM/Viterbi algorithms for alignment, training, and decoding.

Note: At a reviewer's request, the paper describes TSG more formally than in the previous literature, which might be helpful for some readers and implementers.

Keywords: MT, synchronous grammar algorithms, translation, dynamic programming

**Simpler and more general minimization for weighted finite-state
automata**.

Jason Eisner (2003).

In *HLT-NAACL*. [ paper | slides | PDF slides | scholar | bib ]

Previous work on minimizing weighted finite-state automata (including transducers) is limited to particular types of weights. We present efficient new minimization algorithms that apply much more generally, while being simpler and about as fast.

We also point out theoretical limits on minimization algorithms. We characterize the kind of “well-behaved” weight semirings where our methods work. Outside these semirings, minimization is not well-defined (in the sense of producing a unique minimal automaton), and even finding the minimum number of states is in general NP-complete and inapproximable.

Keywords: finite-state methods, semirings, selected papers

**Parameter estimation for probabilistic finite-state transducers**.

Jason Eisner (2002).

In *ACL*. [ paper | extended slides | slides | PDF slides | scholar | bib ]

Weighted finite-state transducers suffer from the lack of a training algorithm. Training is even harder for transducers that have been assembled via finite-state operations such as composition, minimization, union, concatenation, and closure, as this yields tricky parameter tying. We formulate a “parameterized FST” paradigm and give training algorithms for it, including a general bookkeeping trick (“expectation semirings”) that cleanly and efficiently computes expectations and gradients.
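As a rough illustration of the “expectation semiring” bookkeeping trick, the sketch below pairs each weight with an accumulated expectation, so that ordinary semiring sums and products carry expected values along with probabilities. The class name `Expect`, the helper `arc`, and the explicit list of paths are assumptions made for this toy example; in practice the same operations would be run over an automaton by existing semiring-weighted algorithms rather than by enumerating paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expect:
    """A semiring element (p, r): p is a probability mass, r is the
    probability-weighted sum of some value over the same set of paths."""
    p: float
    r: float

    def __add__(self, other):
        # Union of two disjoint sets of paths: both components add.
        return Expect(self.p + other.p, self.r + other.r)

    def __mul__(self, other):
        # Concatenation of paths: probabilities multiply, values add,
        # which gives the product rule for the second component.
        return Expect(self.p * other.p, self.p * other.r + other.p * self.r)


ZERO = Expect(0.0, 0.0)  # additive identity
ONE = Expect(1.0, 0.0)   # multiplicative identity


def arc(prob, value):
    """Weight of a single arc with the given probability and value (hypothetical helper)."""
    return Expect(prob, prob * value)


def path_sum(paths):
    """Sum over paths of the product of arc weights (explicit enumeration for illustration)."""
    total = ZERO
    for path in paths:
        weight = ONE
        for a in path:
            weight = weight * a
        total = total + weight
    return total


# Two toy paths: one has probability 0.2 and total value 1, the other 0.3 and value 2.
paths = [
    [arc(0.5, 1.0), arc(0.4, 0.0)],
    [arc(0.5, 0.0), arc(0.6, 2.0)],
]
total = path_sum(paths)
expected_value = total.r / total.p  # expectation of the value given that some path is taken
```

Dividing the second component by the first normalizes the expectation over the paths that were summed.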

Note: Expectation semirings are included in the excellent OpenFST library. I believe that Zhifei Li, Ariya Rastrow, Markus Dreyer, and Roy Tromble have all written implementations that work with OpenFST or other packages. Some of these support the higher-order expectation semirings of Li & Eisner (2009).

Keywords: finite-state methods, semirings, dynamic programming, automatic differentiation

**Comprehension and compilation in Optimality Theory**.

Jason Eisner (2002).

In *ACL*. [ paper | slides | homework | homework code | scholar | bib ]

This paper ties up some loose ends in finite-state Optimality Theory. First, it discusses how to perform comprehension under Optimality Theory grammars consisting of finite-state constraints. Comprehension has not been much studied in OT; we show that unlike production, it does not always yield a regular set, making finite-state methods inapplicable. However, after giving a suitably flexible presentation of OT, we show carefully how to treat comprehension under recent variants of OT in which grammars can be compiled into finite-state transducers. We then unify these variants, showing that compilation is possible if all components of the grammar are regular relations, including the harmony ordering on scored candidates. A side benefit of our construction is a far simpler implementation of directional OT (Eisner, 2000).

Note: A related paper was published independently by Gerhard Jäger at the same time.

Keywords: phonology, finite-state methods, selected papers

**An interactive spreadsheet for teaching the forward-backward algorithm**.

Jason Eisner (2002).

In *ACL Workshop on Teaching NLP and CL*. [ paper | student reading | homework | video | spreadsheet | Viterbi spreadsheet | presentation tips | teaching slides | scholar | bib ]

This paper offers a detailed lesson plan on the forward-backward algorithm. The lesson is taught from a live, commented spreadsheet that implements the algorithm and graphs its behavior on a whimsical toy example. By experimenting with different inputs, one can help students develop intuitions about HMMs in particular and Expectation Maximization in general. The spreadsheet and a coordinated follow-up assignment are available.
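For readers who prefer code to spreadsheet cells, here is a minimal sketch of the forward-backward computation for a small HMM. The two states, the probabilities, and the observation sequence are made-up toy numbers, not the ones used in the spreadsheet, and the function name `forward_backward` is likewise just an illustrative choice.

```python
def forward_backward(obs, states, start, trans, emit):
    """Return (likelihood of obs, list of per-position posterior state distributions)."""
    n = len(obs)
    alpha = [dict() for _ in range(n)]  # alpha[t][s] = p(obs[0..t], state at t is s)
    beta = [dict() for _ in range(n)]   # beta[t][s]  = p(obs[t+1..] | state at t is s)

    # Forward pass.
    for s in states:
        alpha[0][s] = start[s] * emit[s][obs[0]]
    for t in range(1, n):
        for s in states:
            alpha[t][s] = emit[s][obs[t]] * sum(
                alpha[t - 1][r] * trans[r][s] for r in states
            )

    # Backward pass.
    for s in states:
        beta[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            beta[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r] for r in states
            )

    likelihood = sum(alpha[n - 1][s] for s in states)
    posteriors = [
        {s: alpha[t][s] * beta[t][s] / likelihood for s in states} for t in range(n)
    ]
    return likelihood, posteriors


# Toy model (assumed values): hidden state is Hot or Cold, observation is a count 1..3.
states = ["H", "C"]
start = {"H": 0.5, "C": 0.5}
trans = {"H": {"H": 0.8, "C": 0.2}, "C": {"H": 0.2, "C": 0.8}}
emit = {"H": {1: 0.1, 2: 0.2, 3: 0.7}, "C": {1: 0.7, 2: 0.2, 3: 0.1}}

likelihood, posteriors = forward_backward([2, 3, 3], states, start, trans, emit)
```

The posteriors are the expected state counts that the E step of EM would collect; re-estimating `trans` and `emit` from them and repeating gives the full training loop.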

Note: I also provide a clustering spreadsheet and animated Powerpoint examples of Earley's algorithm and fast bilexical parsing. If you like spreadsheets, I discovered some related work after publication. Spreadsheets in Education has a long bibliography, many links, and examples (including Fourier synthesis!), while Visualizations with Excel explains how to do algorithm animation with Excel macros (e.g., edit distance, Huffman coding). Here's a [deep neural net](https://towardsdatascience.com/building-a-deep-neural-net-in-google-sheets-49cdaf466da0) in Google Sheets.

Keywords: teaching, dynamic programming, tagging, machine learning, recorded talks

**Transformational priors over grammars**.

Jason Eisner (2002).

In *EMNLP (nominated for Best Paper Award)*. [ paper | slides | PDF slides+notes | PDF slides | scholar | bib ]

This paper proposes a novel class of PCFG parameterizations that support linguistically reasonable priors over PCFGs. To estimate the parameters is to discover a notion of relatedness among context-free rules such that related rules tend to have related probabilities. The prior favors grammars in which the relationships are simple to describe and have few major exceptions. A basic version that bases relatedness on weighted edit distance yields superior smoothing of grammars learned from the Penn Treebank (20% reduction of rule perplexity over the best previous method).

Note: See also the linguistic perspective in Eisner (2002). Additional details are given in my dissertation.

Keywords: syntactic transformations, selected papers

**Discovering syntactic deep structure via Bayesian statistics**.

Jason Eisner (2002).

*Cognitive Science*. [ paper | scholar | bib ]

In the Bayesian framework, a language learner should seek a grammar that explains observed data well and is also *a priori* probable. This paper proposes such a measure of prior probability. Indeed it develops a full statistical framework for lexicalized syntax. The learner's job is to discover the system of probabilistic transformations (often called lexical redundancy rules) that underlies the patterns of regular and irregular syntactic constructions listed in the lexicon. Specifically, the learner discovers what transformations apply in the language, how often they apply, and in what contexts. It considers simpler systems of transformations to be more probable *a priori*. Experiments show that the learned transformations are more effective than previous statistical models at predicting the probabilities of lexical entries, especially those for which the learner had no direct evidence.

Note: See also the engineering perspective in Eisner (2002). Additional details are given in my dissertation.

Keywords: syntactic transformations

**Introduction to the special section on linguistically apt statistical
methods**.

Jason Eisner (2002).

*Cognitive Science*. [ paper | scholar | bib ]

This brief introduction, from the editor of the special section, reviews why and how statistical and linguistic approaches to language can help each other. It also asks how statistical modeling fits into the broader program of cognitive science.

Keywords: edited volumes

**Expectation semirings: Flexible EM for finite-state transducers**.

Jason Eisner (2001).

In *FSMNLP*. [ paper | scholar | bib ]

This paper gives the first EM algorithm for general probabilistic finite-state transducers (with epsilon). Furthermore, the approach is powerful enough to fit machines' parameters even after the machines are combined by operations of the finite-state calculus, such as composition and minimization. This allows an expert to build a parameterized transducer in any way that is appropriate to the domain, and then fit the parameters automatically from data. Many standard algorithms are special cases, and there are many further applications. Yet the algorithm remains surprisingly simple because all the difficult work is subcontracted to existing algorithms for semiring-weighted automata. The trick is to use a novel semiring.

Note: Extended abstract. Consider the longer 2002 version instead.

Keywords: finite-state methods, semirings, dynamic programming, automatic differentiation

**Smoothing a Probabilistic Lexicon via Syntactic Transformations**.

Jason Eisner (2001).

PhD thesis, University of Pennsylvania. [ dissertation | chapter 1 | slides | scripts | scholar | bib ]

Probabilistic parsing requires a lexicon that specifies each word's syntactic preferences in terms of probabilities. To estimate these probabilities for words that were poorly observed during training, this thesis assumes the existence of arbitrarily powerful transformations (also known to linguists as lexical redundancy rules or metarules) that can add, delete, retype or reorder the argument and adjunct positions specified by a lexical entry.

In a given language, some transformations apply frequently and others rarely. We describe how to estimate the rates of the transformations from a sample of lexical entries. More deeply, we learn which properties of a transformation increase or decrease its rate in the language. As a result, we can smooth the probabilities of lexical entries. Given enough direct evidence about a lexical entry's probability, our Bayesian approach trusts the evidence; but when less evidence or no evidence is available, it relies more on the transformations' rates to guess how often the entry will be derived from related entries.

Abstractly, the proposed “transformation models” are probability distributions that arise from graph random walks with a log-linear parameterization. A domain expert constructs the parameterized graph, and a vertex is likely according to whether random walks tend to halt at it. Transformation models are suited to any domain where “related” events (as defined by the graph) may have positively covarying probabilities. Such models admit a natural prior that favors simple regular relationships over stipulative exceptions. The model parameters can be locally optimized by gradient-based methods or by Expectation-Maximization. Exact algorithms (matrix inversion) and approximate ones (relaxation) are provided, with optimizations. Variations on the idea are also discussed.

We compare the new technique empirically to previous techniques from the probabilistic parsing literature, using comparable features, and obtain a 20% perplexity reduction (similar to doubling the amount of training data). Some of this reduction is shown to stem from the transformation model's ability to match observed probabilities, and some from its ability to generalize. Model averaging yields a final 24% perplexity reduction.

Keywords: theses, syntactic transformations

**Finite-State Phonology: Proceedings of the 5th Workshop of the ACL
Special Interest Group in Computational Phonology (SIGPHON)**.

Jason Eisner, Lauri Karttunen, and Alain Thériault, editors. [ volume | bib ]

Keywords: edited volumes, finite-state methods, phonology

**Easy and hard constraint ranking in Optimality Theory: Algorithms and
complexity**.

Jason Eisner (2000).

In *SIGPHON*. [ paper | arxiv | slides | PDF slides | scholar | bib ]

We consider the problem of ranking a set of OT constraints in a manner consistent with data. (1) We speed up Tesar and Smolensky's RCD algorithm to be linear in the number of constraints. This finds a ranking so each attested form x_{i} beats or ties a particular competitor y_{i}. (2) We also generalize RCD so each x_{i} beats or ties *all* possible competitors.

Alas, neither the more realistic version of ranking in (2), nor even generation, has any polynomial algorithm unless P=NP! That is, one cannot improve qualitatively upon brute force: (3) Merely checking that a *single* (given) ranking is consistent with given forms is coNP-complete if the surface forms are fully observed and Δ_{2}^{p}-complete if not. Indeed, OT generation is OptP-complete. (4) As for ranking, determining whether *any* consistent ranking exists is coNP-hard (but in Δ_{2}^{p}) if the forms are fully observed, and Σ_{2}^{p}-complete if not.

Finally, we show (5) generation and ranking are easier in derivational theories: in P, and NP-complete.
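To make the ranking problem in (1) concrete, here is a minimal sketch of plain Recursive Constraint Demotion in its straightforward textbook form. It is not the sped-up linear-time variant the paper describes. The data encoding (each winner-loser pair is a dict marking which constraints prefer the winner, `"W"`, or the loser, `"L"`) and the function name `rcd` are assumptions for illustration.

```python
def rcd(constraints, pairs):
    """Return a list of strata (highest-ranked first), or None if no ranking is consistent.

    pairs: list of dicts mapping a constraint name to "W" (prefers the attested
    winner) or "L" (prefers the competitor); absent means no preference.
    """
    remaining = list(constraints)
    active = list(pairs)  # pairs not yet accounted for by a ranked constraint
    strata = []
    while remaining:
        # A constraint may be ranked next only if it prefers no loser
        # among the pairs that are still unexplained.
        stratum = [c for c in remaining if all(p.get(c) != "L" for p in active)]
        if not stratum:
            return None  # every unranked constraint prefers some loser
        strata.append(stratum)
        remaining = [c for c in remaining if c not in stratum]
        # A pair is explained once some newly ranked constraint prefers its winner.
        active = [p for p in active if not any(p.get(c) == "W" for c in stratum)]
    return strata


# Toy data (hypothetical constraint names A, B, C).
ranking = rcd(["A", "B", "C"], [{"A": "W", "B": "L"}, {"B": "W", "C": "L"}])
conflict = rcd(["A", "B"], [{"A": "W", "B": "L"}, {"A": "L", "B": "W"}])
```

Each pass rescans all remaining constraints and pairs, so this naive version is quadratic in the number of constraints in the worst case; pairs on which no constraint has a preference simply end as ties.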

Keywords: hardness, phonology, greedy algorithms, selected papers

**Directional constraint evaluation in Optimality Theory**.

Jason Eisner (2000).

In *COLING*. [ paper | slides | PDF slides+notes | PDF slides | scholar | bib ]

Weighted finite-state constraints that can *count* unboundedly many violations make Optimality Theory more powerful than finite-state transduction (Frank and Satta, 1998). This result is empirically and computationally awkward. We propose replacing these unbounded constraints, as well as non-finite-state Generalized Alignment constraints, with a new class of finite-state *directional constraints*. We give linguistic applications, results on generative power, and algorithms to compile grammars into transducers.

Note: This paper makes the *linguistic* case for directional OT, and gives an interesting transducer construction. However, Eisner (2002) shows that the *mathematical* result can actually be obtained as a special case of a more general theorem that is based on a simpler *algebraic* construction.

Keywords: phonology, finite-state methods, linguistics

**The science of language: Computational linguistics**.

Jason Eisner (2000).

*Imagine Magazine*. [ paper | official link | bib ]

Keywords: teaching

**Review of Optimality Theory by René Kager**.

Jason Eisner (2000).

This book review also sketches why OT is interesting to computational linguists, and how it relates to other approaches for combining non-orthogonal surface features, such as maximum-entropy modeling.

Keywords: phonology, linguistics

**Bilexical grammars and their cubic-time parsing algorithms**.

Jason Eisner (2000).

In *Advances in Probabilistic and Other Parsing Technologies*. [ paper | slides | review | PDF slides | scholar | bib ]

This paper introduces weighted bilexical grammars, a formalism in which individual lexical items, such as verbs and their arguments, can have idiosyncratic selectional influences on each other. Such “bilexicalism” has been a theme of much current work in parsing. The new formalism can be used to describe bilexical approaches to both dependency and phrase-structure grammars, and a slight modification yields link grammars. Its scoring approach is compatible with a wide variety of probability models.

The obvious parsing algorithm for bilexical grammars (used by most authors) takes time O(n^{5}). A more efficient O(n^{3}) method is exhibited. The new algorithm has been implemented and used in a large parsing experiment (Eisner 1996). We also give a useful extension to the case where the parser must undo a stochastic transduction that has altered the input.
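The cubic running time can be illustrated with a simplified, arc-factored version of the span-based dynamic program, as it is commonly presented in later dependency parsing literature. The sketch below is a sketch under that simplification, not the chapter's full bilexical grammar formalism: it assumes a table `score[h][m]` giving the score of making word `h` the head of word `m`, treats position 0 as an artificial root, and returns only the best total score of a projective tree.

```python
def eisner_best_score(score):
    """Best total arc score of a projective dependency tree rooted at position 0.

    score[h][m] is the score of an arc from head h to modifier m.
    Entries score[h][0] are never part of the final answer (the root has no head).
    """
    n = len(score)
    neg_inf = float("-inf")
    # Last index is the direction: 0 = head at the right end, 1 = head at the left end.
    # complete[i][j][d]: best half-tree over i..j whose head needs no more children inside.
    # incomplete[i][j][d]: best span over i..j with an arc between its two endpoints.
    complete = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    incomplete = [[[neg_inf, neg_inf] for _ in range(n)] for _ in range(n)]

    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            # Join two facing half-trees, then add an arc between the endpoints.
            joined = max(
                complete[i][k][1] + complete[k + 1][j][0] for k in range(i, j)
            )
            incomplete[i][j][0] = joined + score[j][i]  # arc j -> i
            incomplete[i][j][1] = joined + score[i][j]  # arc i -> j
            # Extend an arc-bearing span with a half-tree headed by the arc's modifier.
            complete[i][j][0] = max(
                complete[i][k][0] + incomplete[k][j][0] for k in range(i, j)
            )
            complete[i][j][1] = max(
                incomplete[i][k][1] + complete[k][j][1] for k in range(i + 1, j + 1)
            )
    return complete[0][n - 1][1]


# Toy scores for a two-word sentence plus root (assumed values).
toy = [
    [0, 2, 1],  # root -> word 1, root -> word 2
    [0, 0, 4],  # word 1 -> word 2
    [0, 3, 0],  # word 2 -> word 1
]
best = eisner_best_score(toy)
```

Every item records only its two endpoints and a direction, and each is built by maximizing over one split point, which gives O(n^{2}) items and O(n^{3}) total work. Note that this sketch lets the root take more than one child.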

Note: This book chapter is an improved and extended version of Eisner (1997).

Keywords: dependency parsing, dynamic programming, lexicalization

**A faster parsing algorithm for lexicalized tree-adjoining grammars**.

Jason Eisner and Giorgio Satta (2000).

In *TAG+5*. [ paper | bib ]

This paper points out some computational inefficiencies of standard TAG parsing algorithms when applied to LTAGs. We propose a novel algorithm with an asymptotic improvement, from O(n^{8} g^{2} t) to O(n^{6} max(n,g) g t), where *n* is the input length and *g*, *t* are grammar constants that are independent of vocabulary size.

Keywords: TAG parsing, dynamic programming, lexicalization

**Efficient parsing for bilexical context-free grammars and head-automaton
grammars**.

Jason Eisner and Giorgio Satta (1999).

In *ACL*. [ paper | slides | PDF slides+notes | PDF slides | scholar | bib ]

Several recent stochastic parsers use *bilexical* grammars, where each word type idiosyncratically prefers particular complements with particular head words. We present O(n^{4}) parsing algorithms for two bilexical formalisms (see title), improving the previous upper bounds of O(n^{5}). Also, for a common special case that was known to allow O(n^{3}) parsing (Eisner, 1997), we present an O(n^{3}) algorithm with an improved grammar constant.

Note: The slides include *experimental speed comparisons* that were *not* in the paper.

Keywords: dependency parsing, CFG parsing, dynamic programming, lexicalization, selected papers

**Doing OT in a straitjacket**.

Jason Eisner (1999).

Talk handout available online (27 pages), UCLA Linguistics Dept. [ paper | handout | scholar | bib ]

A universal theory of human phonology should be clearly specified and falsifiable. To treat Optimality Theory (OT) as a real proposal, one must put some cards on the table: What kinds of constraints may an OT grammar state? And how can anyone tell what data this grammar predicts, without constructing infinite tableaux?

In this talk I'll motivate a restrictive formalization of OT that allows just two types of simple, local constraint. Gen freely proposes gestures and prosodic constituents; the constraints try to force these to coincide or not coincide temporally. An efficient algorithm exists to find the optimal candidate.

I will argue that despite its simplicity, primitive OT is expressive enough to describe and unify most of the work in OT phonology. However, it is provably more constrained: because it is unable to mimic deeply non-local mechanisms like Generalized Alignment, it forces a new and arguably better account of metrical stress typology.

Finally, I will sketch a more radical extension, directional evaluation, which changes how a constraint ranks candidates. This change brings back some of the descriptive convenience of Generalized Alignment, but it also constrains OT grammars to describe only regular relations, which is linguistically and computationally desirable.

Note: This is an extended version of Eisner (1997).

Keywords: phonology, linguistics, invited talks, selected papers

**FootForm decomposed: Using primitive constraints in OT**.

Jason Eisner (1998).

In *SCIL*. [ paper | scholar | bib ]

Hayes (1995) gives a typology of the world's metrical stress systems, which is marked by several striking asymmetries (parametric gaps). Most work on metrical stress within Optimality Theory (OT) has adopted this typology without explaining the gaps. Moreover, the OT versions use uncomfortably non-local constraints (Align, FootForm, FtBin).

This paper presents a rather different and in some ways more explanatory typology of stress, couched in the framework of primitive Optimality Theory (OTP), which allows only primitive, radically local constraints. For example, Generalized Alignment is not allowed. The paper presents a single, coherent system of rerankable constraints that yields the basic facts about iambic and trochaic foot form, iambic lengthening, quantity sensitivity, unbounded feet, simple word-initial and word-final stress, directionality of footing, syllable (and foot) extrametricality, degenerate feet, and word-level stress.

The metrical part of the account rests on the following intuitions:

- (a) iambs are special because syllable structure allows them to lengthen their strong ends;
- (b) directionality of footing is really the result of local lapse avoidance;
- (c) any lapses are forced by a (localist) generalization of right extrametricality;
- (d) although degenerate feet are absolutely banned, primary stress does not require a foot in all languages.

An interesting prediction of (b) and (c) is that left-to-right trochees should be incompatible with extrametricality. This prediction is robustly confirmed in Hayes.

Keywords: phonology, linguistics

**Bilexical grammars and a cubic-time probabilistic parser**.

Jason Eisner (1997).

In *IWPT*. [ paper | slides | PDF slides | scholar | bib ]

This paper introduces weighted bilexical grammars, a formalism in which individual lexical items, such as verbs and their arguments, can have idiosyncratic selectional influences on each other. Such “bilexicalism” has been a theme of much current work in parsing. The new formalism can be used to describe bilexical approaches to both dependency and phrase-structure grammars, and a slight modification yields link grammars. Its scoring approach is compatible with a wide variety of probability models.

The obvious parsing algorithm for bilexical grammars (used by most authors) takes time O(n^{5}). A more efficient O(n^{3}) method is exhibited. The new algorithm has been implemented and used in a large parsing experiment (Eisner 1996). We also give a useful extension to the case where the parser must undo a stochastic transduction that has altered the input.

Note: Consider instead the 2000 book chapter that is an expanded version of this paper.

Keywords: dependency parsing, dynamic programming, lexicalization

**Efficient generation in primitive Optimality Theory**.

Jason Eisner (1997).

In *ACL*. [ paper | proof details | slides | PDF slides | scholar | bib ]

This paper introduces computational linguists to primitive Optimality Theory (OTP), a clean and linguistically motivated formalization of OT. OTP specifies the class of autosegmental representations, the universal generator Gen, and the two simple families of permissible constraints. It is therefore possible to study its computational generation, comprehension, and learning properties.

Some results on generation are presented. Unlike less restricted theories using Generalized Alignment, OTP grammars can derive optimal surface forms with finite-state methods adapted from Ellison (1994). Unfortunately these methods take time exponential in the size of the grammar. Indeed the generation problem is shown NP-complete in this sense. However, techniques are discussed for making Ellison's approach fast and practical in the typical case, including a simple trick that alone provides a 100-fold speedup on a grammar fragment of moderate size. One avenue for future improvements is a new finite-state notion, “factored automata,” where regular languages are represented compactly via formal intersections of FSAs.

Keywords: phonology, finite-state methods, hardness

**State-of-the-art algorithms for minimum spanning trees: A tutorial
discussion**.

Jason Eisner (1997).

Manuscript available online (78 pages), University of Pennsylvania. [ paper | scholar | bib ]

The classic “easy” optimization problem is to find the MST of a connected, undirected graph. Good polynomial-time algorithms have been known since 1930. Over the last 10 years, however, the standard O(m log n) results of Kruskal and Prim have been improved to linear or near-linear time. The new methods use several tricks of general interest in order to reduce the number of edge weight comparisons and the amount of other work. This tutorial reviews those methods, building up strategies step by step so as to expose the insights behind the algorithms. Implementation details are clarified, and some generalizations are given.

Specifically, the paper attempts to shed light on the classical algorithms of Kruskal, Prim, and Boruvka; the improved approach of Gabow, Galil, and Spencer, which takes time only O(m log (log^{*} n - log^{*} m/n)); and the randomized O(m) algorithm of Karger, Klein, and Tarjan, which relies on an O(m) MST verification algorithm by King. It also considers Frederickson's method for maintaining an MST in time O(sqrt(m)) per change to the graph. (An appendix explains Fibonacci heaps.)
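As a reference point for the classical O(m log n) baseline, here is a minimal sketch of Kruskal's algorithm with a simple union-find structure. It shows only the textbook method, not any of the faster algorithms the tutorial covers; the edge encoding `(weight, u, v)` and the function name `kruskal` are choices made for this example.

```python
def kruskal(num_vertices, edges):
    """Return the edges of a minimum spanning forest.

    edges: iterable of (weight, u, v) with vertices numbered 0..num_vertices-1.
    """
    parent = list(range(num_vertices))

    def find(v):
        # Follow parent pointers to the component representative,
        # shortening the path along the way (path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Consider edges from lightest to heaviest; sorting dominates the running time.
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two different components, so keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Toy graph on 4 vertices (assumed weights).
toy_edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3), (5, 2, 3)]
mst = kruskal(4, toy_edges)
total_weight = sum(w for w, _, _ in mst)
```

The greedy choice is safe because the lightest edge crossing between two components always belongs to some minimum spanning tree.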

Keywords: greedy algorithms, teaching

**What constraints should OT allow?**

Jason Eisner (1997).

Talk handout available online (22 pages), Linguistic Society of
America (LSA). [ paper | handout | earlier version | scholar | bib ]

Note: A more recent, extended version of this talk is Eisner (1999).

Keywords: phonology, linguistics

**An empirical comparison of probability models for dependency grammar**.

Jason Eisner (1996).

Technical report, Institute for Research in Cognitive Science, Univ.
of Pennsylvania. [ paper | arxiv | scripts | scholar | bib ]

This technical report is an appendix to Eisner (1996): it gives superior experimental results that were reported only in the talk version of that paper. Eisner (1996) trained three probability models on a small set of about 4,000 conjunction-free, dependency-grammar parses derived from the Wall Street Journal section of the Penn Treebank, and then evaluated the models on a held-out test set, using a novel O(n^{3}) parsing algorithm.

The present paper describes some details of the experiments and repeats them with a larger training set of 25,000 sentences. As reported at the talk, the more extensive training yields greatly improved performance. Nearly half the sentences are parsed with no misattachments; two-thirds are parsed with at most one misattachment.

Of the models described in the original written paper, the best score is still obtained with the generative (top-down) “model C.” However, slightly better models are also explored, in particular, two variants on the comprehension (bottom-up) “model B.” The better of these has an attachment accuracy of 90%, and (unlike model C) tags words more accurately than the comparable trigram tagger. Differences are statistically significant.

If tags are roughly known in advance, search error is all but eliminated and the new model attains an attachment accuracy of 93%. We find that the parser of Collins (1996), when combined with a highly-trained tagger, also achieves 93% when trained and tested on the same sentences. Similarities and differences are discussed.

Keywords: dependency parsing, lexicalization

**Three new probabilistic models for dependency parsing: An exploration**.

Jason Eisner (1996).

In *COLING*. [ paper | arxiv | more results | scholar | bib ]

After presenting a novel $O(n^{3})$ parsing algorithm for dependency grammar, we develop three contrasting ways to stochasticize it. We propose (a) a lexical affinity model where words struggle to modify each other, (b) a sense tagging model where words fluctuate randomly in their selectional preferences, and (c) a generative model where the speaker fleshes out each word's syntactic and conceptual structure without regard to the implications for the hearer. We also give preliminary empirical results from evaluating the three models' parsing performance on annotated Wall Street Journal training text (derived from the Penn Treebank). In these results, the generative (i.e., top-down) model performs significantly better than the others, and does about equally well at assigning part-of-speech tags.

Keywords: dependency parsing, lexicalization, dynamic programming

**Efficient normal-form parsing for combinatory categorial grammar**.

Jason Eisner (1996).

In *ACL*. [ paper | proof details | arxiv | slides | scholar | bib ]

Under categorial grammars that have powerful rules like composition, a simple n-word sentence can have exponentially many parses that are semantically equivalent. Generating all parses is inefficient and obscures whatever true semantic ambiguities are in the input. This paper addresses the problem for a fairly general form of Combinatory Categorial Grammar, by means of an efficient, correct, and easy to implement normal-form parsing technique. The parser is proved to find exactly one parse in each semantic equivalence class of allowable parses; that is, spurious ambiguity (as carefully defined) is shown to be both safely and completely eliminated.

Note: The example "intentionally knock twice," mentioned on page 4, should have been credited to Schabes & Shieber (1994).

Keywords: CCG parsing, dynamic programming, selected papers

**Description of the University of Pennsylvania entry in the MUC-6
competition**.

Breck Baldwin, Jeff Reynar, Mike Collins, Jason Eisner, Adwait
Ratnaparkhi, Joseph Rosenzweig, Anoop Sarkar, and Srinivas (1995).

In *MUC*. [ paper | scholar | bib ]

Note: A competitive system for coreference resolution, hacked together in our spare time.

Keywords: coreference resolution

**∀-less in Wonderland? Revisiting any**.

Jason Eisner (1995).

In

English *any* is often treated as two unrelated or semi-related lexemes: a negative-polarity item, *NPI any*, and a universal quantifier, *free-choice (FC) any*. The latter is idiosyncratic in that it must appear in the scope of a licenser, but moves to take scope immediately over that licenser at LF. I give a semantic account of FC *any* as an “irrealis” quantifier. This explains some curious (new and old) facts about FC *any*'s semantics and licensing environments. Furthermore, it predicts that negation and other NPI-licensing environments should license FC *any*, which would then have just the same meaning as NPI *any* (*pace* Ladusaw (1979), Carlson (1980)). Thus, we may unify the two *any*'s as a single universal quantifier, as originally proposed by Lasnik (1972) and others. Such an account implies that NPI *any* moves over negation at LF, which is confirmed by scope tests. It also explains some well-known problems concerning NPI *any* in non-downward-entailing environments and under *sorry* vs. *glad*.

Keywords: semantics, linguistics, selected papers

**A probabilistic parser applied to software testing documents**.

Mark A. Jones and Jason M. Eisner (1992).

In *AAAI*. [ paper | scholar | bib ]

We describe an approach to training a statistical parser from a bracketed corpus, and demonstrate its use in a software testing application that translates English specifications into an automated testing language. A grammar is not explicitly specified; the rules and contextual probabilities of occurrence are automatically generated from the corpus. The parser is extremely successful at producing and identifying the correct parse, and nearly deterministic in the number of parses that it produces. To compensate for undertraining, the parser also uses general, linguistic subtheories which aid in guessing some types of novel structures.

Keywords: CFG parsing, dynamic programming, lexicalization

**A probabilistic parser and its application**.

Mark A. Jones and Jason M. Eisner (1992).

In *Statistically-Based Natural Language Processing Techniques:
Papers from the 1992 Workshop*. [ paper | scholar | bib ]

We describe a general approach to the probabilistic parsing of context-free grammars. The method integrates context-sensitive statistical knowledge of various types (e.g., syntactic and semantic) and can be trained incrementally from a bracketed corpus. We introduce a variant of the GHR context-free recognition algorithm, and explain how to adapt it for efficient probabilistic parsing. On a real-world corpus of sentences from software testing documents, with 23 possible parses for a sentence of average length, the system accurately finds the correct parse in 99% of cases, while producing only 1.02 parses per sentence. Significantly, the success rate would be only 66% without the semantic statistics.

Keywords: CFG parsing, dynamic programming

**Indirect STV election: A voting system for South Africa**.

Jason Eisner (1991).

White paper, University of Cape Town. [ paper | bib ]

“Winner take all” electoral systems are not fully representative. Unfortunately, the ANC's proposed system of proportional representation is not much better. Because it ensconces party politics, it is only slightly more representative, and poses a serious threat to accountability. Many modern students of democracy favor proportional representation through the Single Transferable Vote (STV). In countries with high illiteracy, however, this system may be unworkable.

This paper proposes a practical modification of STV. In the modified system, each citizen votes for only one candidate. Voters need not specify their second, third, and fourth choices. Instead, each candidate specifies his or her second, third, and fourth choices. The modified system is no more difficult for voters than current proposals—and it provides virtually all the benefits of STV, together with some new ones.

Keywords: voting systems
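
The modification described in the abstract above (each voter picks one candidate, and that candidate's own ranking supplies the later choices) can be sketched in a few lines. This is only an illustration under assumed counting rules that are not taken from the paper: a Droop quota, fractional surplus transfers, elimination of the lowest candidate, and alphabetical tie-breaking. The function name `indirect_stv` and the sample data are hypothetical.

```python
from fractions import Fraction

def indirect_stv(votes, prefs, seats):
    """Count an election in which each ballot inherits its candidate's ranking.

    votes: dict mapping candidate -> number of voters who chose that candidate
    prefs: dict mapping candidate -> that candidate's ordered list of next choices
    seats: number of seats to fill
    Assumed rules (not from the paper): Droop quota, fractional surplus
    transfers, lowest-candidate elimination, alphabetical tie-breaks.
    """
    quota = sum(votes.values()) // (seats + 1) + 1
    # Every ballot for c carries the same ranking: c first, then c's own list.
    piles = [([c] + list(prefs.get(c, [])), Fraction(n)) for c, n in votes.items()]
    hopeful, elected = set(votes), []

    def top(ranking):
        # Highest-ranked candidate on this ranking who is still in the running.
        return next((c for c in ranking if c in hopeful), None)

    while len(elected) < seats and hopeful:
        tally = {c: Fraction(0) for c in hopeful}
        for ranking, weight in piles:
            if top(ranking) is not None:
                tally[top(ranking)] += weight
        if len(hopeful) <= seats - len(elected):
            # No more candidates than open seats: seat everyone who remains.
            elected += sorted(hopeful, key=lambda c: (-tally[c], c))
            break
        leader = min(hopeful, key=lambda c: (-tally[c], c))
        if tally[leader] >= quota:
            # Only the surplus fraction of the leader's ballots stays in play.
            keep = (tally[leader] - quota) / tally[leader]
            piles = [(r, w * keep if top(r) == leader else w) for r, w in piles]
            elected.append(leader)
            hopeful.remove(leader)
        else:
            # Nobody reaches the quota: drop the weakest candidate.
            loser = min(hopeful, key=lambda c: (tally[c], c))
            hopeful.remove(loser)
    return elected

# Hypothetical three-candidate election for two seats.
votes = {"A": 50, "B": 30, "C": 20}
prefs = {"A": ["C", "B"], "B": ["C", "A"], "C": ["B", "A"]}
winners = indirect_stv(votes, prefs, seats=2)
```

In this made-up example the quota is 34, so A is seated with a surplus of 16 votes. Because A ranked C next, that surplus moves to C, who reaches 36 and takes the second seat ahead of B, even though B received more first-choice votes.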

**Cognitive science and the search for intelligence**.

Jason Eisner (1991).

Invited paper presented to the Socratic Society, University of Cape
Town, South Africa. [ paper | scholar | bib ]

This talk for a general audience gives a sketch of what the field of cognitive science is about. In its latter half, it turns to the philosophical question of defining intelligence, and proposes a non-operational alternative to the Turing Test.

Keywords: teaching

**Dynamical-systems behavior in recurrent and non-recurrent connectionist
nets**.

Jason Eisner (1990).

Undergraduate honors thesis, Harvard University. [ thesis | bib ]

A broad approach is developed for training dynamical behaviors in connectionist networks. General recurrent networks are powerful computational devices, necessary for difficult tasks like constraint satisfaction and temporal processing. These tasks are discussed here in some detail. From both theoretical and empirical considerations, it is concluded that such tasks are best addressed by recurrent networks that operate continuously in time, and further, that effective learning rules for these continuous-time networks must be able to prescribe their *dynamical* properties. A general class of such learning rules is derived and tested on simple problems. Where existing learning algorithms for recurrent and non-recurrent networks only attempt to train a network's *position* in activation space, the models presented here can also explicitly and successfully prescribe the nature of its *movement through* activation space.

Keywords: theses, automatic differentiation, deep learning

**Stock market prediction using natural language processing**.

Frederick S.M. Herz, Lyle H. Ungar, Jason M. Eisner, and Walter Paul
Labys (2002).

U.S. Patent #8,285,619 issued 10/9/2012. [ patent | scholar | bib ]

We present a method of using natural language processing (NLP) techniques to extract information from online news feeds and then using the information so extracted to predict changes in stock prices or volatilities. These predictions can be used to make profitable trading strategies. More specifically, company names can be recognized and simple templates describing company actions can be automatically filled using parsing or pattern matching on words in or near the sentence containing the company name. These templates can be clustered into groups which are statistically correlated with changes in the stock prices.

Keywords: patents

**Method of combining shared buffers of continuous digital media data with
media delivery scheduling**.

Frederick S. M. Herz, Jonathan Smith, Paul Labys, and Jason Michael
Eisner (filed 2001).

Patent pending. [ patent | scholar | bib ]

A communications method utilizes memory areas to buffer portions of the media streams. These buffer areas are shared by user applications, with the desirable consequence of reducing workload for the server system distributing media to the user (client) applications. The preferred method allows optimal balancing of buffering delays and server loads, as well as optimal choice of buffer contents for the shared memory buffers.

Keywords: patents, content delivery

**Secure data interchange**.

Frederick S. M. Herz, Walter Paul Labys, David C. Parkes, Sampath
Kannan, and Jason M. Eisner (2000).

U.S. Patent 7,630,986, issued 2009. [ patent | scholar | bib ]

A secure data interchange system enables information about bilateral and multilateral interactions between multiple persistent parties to be exchanged and leveraged within an environment that uses a combination of techniques to control access to information, release of information, and matching of information back to parties. Access to data records can be controlled using an associated price rule. A data owner can specify a price for different types and amounts of information access.

Keywords: patents, online commerce

**System for the automatic determination of customized prices and
promotions**.

Frederick Herz, Jason Eisner, Lyle Ungar, Walter Paul Labys, Bernie
Roemmele, and Jon Hayward (filed 1998).

U.S. Patent Application 2001/0014868. [ patent | scholar | bib ]

The system for the automatic determination of customized prices and promotions automatically constructs product offers tailored to individual shoppers, or types of shopper, in a way that attempts to maximize the vendor's profits. These offers are represented digitally. They are communicated either to the vendor, who may act on them as desired, or to an on-line computer shopping system that directly makes such offers to shoppers. Largely by tracking the behavior of shoppers, the system accumulates extensive profiles of the shoppers and the offers that they consider. The system can then select, present, price, and promote goods and services in ways that are tailored to an individual consumer. Likely shoppers can be identified, then enticed with the most effective visual and textual advertisements; deals can be offered to them, either on-line or off-line; detailed product information screens can be subtly rearranged from one type of shopper to the next. Furthermore, when a product can be tailored to a particular shopper, a general technique or expert system can offer each consumer an appropriately customized product.

Keywords: patents, online commerce

**A Lempel-Ziv data compression technique utilizing a dictionary
pre-filled with frequent letter combinations, words and/or phrases**.

Jeffrey C. Reynar, Fred Herz, Jason Eisner, and Lyle Ungar (1996).

U.S. Patent #5,951,623 issued 9/14/1999. [ patent | scholar | bib ]

An adaptive compression technique which is an improvement to Lempel-Ziv (LZ) compression techniques, both as applied for purposes of reducing required storage space and for reducing the transmission time associated with transferring data from point to point. Pre-filled compression dictionaries are utilized to address the problem with prior Lempel-Ziv techniques in which the compression software starts with an empty compression dictionary, whereby little compression is achieved until the dictionary has been filled with sequences common in the data being compressed. In accordance with the invention, the compression dictionary is pre-filled, prior to the beginning of the data compression, with letter sequences, words and/or phrases frequent in the domain from which the data being compressed is drawn. The letter sequences, words, and/or phrases used in the pre-filled compression dictionary may be determined by statistically sampling text data from the same genre of text. Multiple pre-filled dictionaries may be utilized by the compression software at the beginning of the compression process, where the most appropriate dictionary for maximum compression is identified and used to compress the current data. These modifications are made to any of the known Lempel-Ziv compression techniques based on the variants detailed in 1977 and 1978 articles by Ziv and Lempel.

Keywords: patents, text compression
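
The effect of starting from a pre-filled dictionary, as described in the abstract above, can be observed with the preset-dictionary feature of DEFLATE (an LZ77-based format), which Python's standard `zlib` module exposes through the `zdict` argument. This is a rough sketch and not the patented method: zlib's preset dictionary is a related mechanism, and the `PRESET` phrases and the sample message are invented for illustration.

```python
import zlib

# Hypothetical domain phrases; in practice these would be sampled from text
# of the same genre as the data to be compressed.
PRESET = b"dependency parsing dynamic programming probabilistic parser "

def compress(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress with DEFLATE, optionally starting from a pre-filled dictionary."""
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(data) + comp.flush()

def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

message = b"a probabilistic parser using dynamic programming"
plain = compress(message)            # compressor starts with an empty window
primed = compress(message, PRESET)   # compressor starts with PRESET already seen

print(len(message), len(plain), len(primed))
```

On a short message like this one, the primed stream should come out noticeably smaller, because phrases such as "probabilistic parser" can be encoded as back-references into the dictionary instead of as literal bytes. The decompressor needs the identical dictionary to reconstruct the text.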

**System for generation of object profiles for a system for customized
electronic identification of desirable objects**.

Frederick S. M. Herz, Jason M. Eisner, and Lyle H. Ungar (1995).

U.S. Patent #5,835,087 issued 11/10/1998. [ patent | scholar | bib ]

Keywords: patents, collaborative filtering

**System for generation of user profiles for a system for customized
electronic identification of desirable objects**.

Frederick S. M. Herz, Jason M. Eisner, Lyle H. Ungar, and Mitchell P.
Marcus (1995).

U.S. Patent #5,754,939 issued 5/19/1998. [ patent | scholar | bib ]

Keywords: patents, collaborative filtering

**Pseudonymous server for system for customized electronic identification of
desirable objects**.

Frederick S. M. Herz, Jason Eisner, and Marcos Salganicoff (1995).

U.S. Patent #5,754,938 issued 5/19/1998. [ patent | scholar | bib ]

Keywords: patents, collaborative filtering

**System for customized electronic identification of desirable objects**.

Frederick S. M. Herz, Jason M. Eisner, Jonathan M. Smith, and
Steven L. Salzberg (filed 1995).

Patent pending. [ patent | bib ]

This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment, and in particular to a system that automatically constructs both a “target profile” for each target object in the electronic media based, for example, on the frequency with which each word appears in an article relative to its overall frequency of use in all articles, as well as a “target profile interest summary” for each user, which target profile interest summary describes the user's interest level in various types of target objects. The system then evaluates the target profiles against the users' target profile interest summaries to generate a user-customized rank ordered listing of target objects most likely to be of interest to each user so that the user can select from among these potentially relevant target objects, which were automatically selected by this system from the plethora of target objects that are profiled on the electronic media. Users' target profile interest summaries can be used to efficiently organize the distribution of information in a large scale system consisting of many users interconnected by means of a communication network. Additionally, a cryptographically-based pseudonym proxy server is provided to ensure the privacy of a user's target profile interest summary, by giving the user control over the ability of third parties to access this summary and to identify or contact the user.

Keywords: patents, collaborative filtering
