Further Resources on Log-Linear Models

by Jason Eisner and Frank Ferraro (2013)

This page points to some resources on log-linear modeling. They accompany the interactive visualization described in Ferraro & Eisner (2013), A virtual manipulative for learning log-linear models. Suggested additions are welcome.

Log-Linear Software

Here is some recommended open-source software for building your own log-linear models.

NLTK is a Python toolkit for natural language processing, intended particularly for pedagogical use. Chapter 6 of the NLTK book (Bird, Klein & Loper 2009) walks you through using machine learning classifiers on natural language data. They refer to log-linear models as "maximum entropy models" (section 6.6). The use of NLTK's maxent module for classification tasks is illustrated in section 7.3 of the book.
MegaM by Hal Daumé III is an efficient program for training log-linear models. You give it a file of training data, and it prints out the learned weights. Once you have the weights in a file, you can run the program again in a different mode, as a filter, where it prints the probability distribution p(y | x) for each test input x that it reads, one by one. NLTK provides an interface to MegaM so you can call it from Python.
Vowpal Wabbit (VW) is a super-fast program that can learn linear models with billions of features, using thousands of computers in parallel if you've got 'em. Use the argument --loss_function logistic to use the log-linear training objective. Add ℓ₁ or ℓ₂ regularization via --l1 1.0 or --l2 1.0 (for C = 1.0). VW focuses on prediction, so I am not sure whether it will print p(y | x) or just tell you the best y for each x.
LIBLINEAR is a fast and popular C++ library for learning various types of linear models. It provides a command-line interface via the train and predict programs, and interfaces are available for calling LIBLINEAR from several other languages. Unfortunately, it handles only the binary case (where y=±1), under the name "logistic regression," with ℓ₁ or ℓ₂ regularization. (If you have a larger space of y values, it will build a multi-class classifier for you by combining several binary classifiers, but this is not quite the same as a single multi-class log-linear model.)
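To make concrete what these tools compute, here is a minimal pure-Python sketch of a multi-class conditional log-linear model: it learns weights by gradient ascent on the ℓ₂-regularized conditional log-likelihood and then reports p(y | x). It is not how any of the packages above is implemented. The feature function, the toy data, and the names features, predict_proba, and train are all invented for illustration.

```python
import math

LABELS = ["pos", "neg"]

def features(x, y):
    # Hypothetical feature function f(x, y): one indicator feature per
    # (word, label) pair. Real systems use richer features.
    return {(word, y): 1.0 for word in x}

def score(weights, x, y):
    # Linear score theta . f(x, y).
    return sum(weights.get(k, 0.0) * v for k, v in features(x, y).items())

def predict_proba(weights, x, labels=LABELS):
    # p(y | x) = exp(score(x, y)) / Z(x); the max is subtracted for stability.
    scores = {y: score(weights, x, y) for y in labels}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(data, labels=LABELS, epochs=500, lr=0.1, l2=0.01):
    # Batch gradient ascent on the l2-regularized conditional log-likelihood:
    #   sum_i log p(y_i | x_i) - (l2 / 2) * ||theta||^2
    weights = {}
    for _ in range(epochs):
        grad = {}
        for x, y in data:
            p = predict_proba(weights, x, labels)
            for k, v in features(x, y).items():        # observed feature counts
                grad[k] = grad.get(k, 0.0) + v
            for y2 in labels:                           # expected feature counts
                for k, v in features(x, y2).items():
                    grad[k] = grad.get(k, 0.0) - p[y2] * v
        for k, g in grad.items():
            w = weights.get(k, 0.0)
            weights[k] = w + lr * (g - l2 * w)
    return weights

# Toy training data, invented for this example.
DATA = [
    (("good", "fun"), "pos"),
    (("bad", "boring"), "neg"),
    (("good", "boring"), "pos"),
    (("bad",), "neg"),
]

weights = train(DATA)
print(predict_proba(weights, ("good",)))
```

Note that a single normalizer Z(x) is shared by all labels. This is what distinguishes a multi-class log-linear model from a combination of separately trained binary classifiers.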

Pencil-and-Paper Exercises

[We will place some practice problems here from Jason's NLP class. We would also be happy to link to exercises from other NLP classes.]

Homework Projects

[We will link here to an assignment from Jason's NLP class. We would also be happy to link to projects from other NLP classes.]

Further Reading

Noah Smith's tutorial offers a more mathematical description of log-linear models, including how maximizing conditional log-likelihood (as in our visualization) arises as the dual problem of maximizing entropy. His book, Linguistic Structure Prediction, discusses log-linear models for structure prediction (see especially sections 3.4 and 3.5).
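As a brief reminder of why the two views coincide, the model and the gradient of its (unregularized) conditional log-likelihood can be written as follows. The notation (weights θ, feature functions f_k, training pairs (x_i, y_i)) is the conventional one and is assumed here; it is not taken from the tutorial itself.

```latex
p_\theta(y \mid x) = \frac{\exp\bigl(\theta \cdot f(x,y)\bigr)}{Z_\theta(x)},
\qquad
Z_\theta(x) = \sum_{y' \in \mathcal{Y}(x)} \exp\bigl(\theta \cdot f(x,y')\bigr)

\frac{\partial}{\partial \theta_k} \sum_{i} \log p_\theta(y_i \mid x_i)
  = \sum_{i} f_k(x_i, y_i)
  \;-\; \sum_{i} \sum_{y \in \mathcal{Y}(x_i)} p_\theta(y \mid x_i)\, f_k(x_i, y)
```

Setting this gradient to zero says that each feature's observed count equals its expected count under the model. These are the same constraints under which the maximum entropy problem is solved.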

Charles Elkan has very readable notes on log-linear models and related concepts, with a bibliography. His CIKM 2008 video tutorial comes with notes. Computational and optimization aspects are covered, and grounded in logistic regression examples and conditional random field (CRF) tagging. Hanna Wallach also offers an introduction to CRFs and efficient computation for linear chain CRFs.

Jason Eisner has teaching slides (pdf) on using conditional log-linear models for structured prediction problems like sequence tagging and parsing, where the number of output categories y is very large. These slides also introduce the structured perceptron, a related technique. They assume familiarity with the simpler cases covered in our visualization, as well as with dynamic programming algorithms for tagging and parsing.

For links into the research literature, we quote from section 8 of our paper (Ferraro & Eisner, 2013):

At the time of writing, 3266 papers in the ACL Anthology mention log-linear models, with 137 using “log-linear,” “maximum entropy” or “maxent” in the paper title. These cover a wide range of applications that can be considered in lectures or homework projects.

Early papers may cover the most fundamental applications and the clearest motivation. Conditional log-linear models were first popularized in computational linguistics by a group of researchers associated with the IBM speech and language group, who called them “maximum entropy models,” after a principle that can be used to motivate their form (Jaynes, 1957). They applied the method to various binary or multiclass classification problems in NLP, such as prepositional phrase attachment (Ratnaparkhi et al., 1994), text categorization (Nigam et al., 1999), and boundary prediction (Beeferman et al., 1999).

Log-linear models can also be used for structured prediction problems in NLP such as tagging, parsing, chunking, segmentation, and language modeling. A simple strategy is to reduce structured prediction to a sequence of multiclass predictions, which can be individually made with a conditional log-linear model (Ratnaparkhi, 1998). A more fully probabilistic approach, used in the original “maximum entropy” papers, is to use (1) to define the conditional probabilities of the steps in a generative process that gradually produces the structure (Rosenfeld, 1994; Berger et al., 1996). (Even predicting the single next word in a sentence can be broken down into a sequence of binary decisions in this way. This avoids normalizing over the large vocabulary (Mnih & Hinton, 2008).) This idea remains popular today and can be used to embed rich distributions into a variety of generative models (Berg-Kirkpatrick et al., 2010). For example, a PCFG that uses richly annotated nonterminals involves a large number of context-free rules. Rather than estimating their probabilities separately, or with traditional backoff smoothing, a better approach is to use (1) to model the probability of all rules given their left-hand sides, based on features that consider attributes of the nonterminals. (E.g., case, number, gender, tense, aspect, mood, lexical head. In the case of a terminal rule, the spelling or morphology of the terminal symbol can be considered.)

The most direct approach to structured prediction is to simply predict the structured output all at once, so that y is a large structured object with many features. This is conceptually natural but means that the normalizer Z(x) involves summing over a large space 𝒴(x). One can restrict 𝒴(x) before training (Johnson et al., 1999). More common is to sum efficiently by dynamic programming or sampling, as is typical in linear-chain conditional random fields (Lafferty et al., 2001), whole-sentence language modeling (Rosenfeld et al., 2001), and CRF CFGs (Finkel et al., 2008).
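The dynamic-programming option mentioned at the end of the quoted passage can be illustrated with a small sketch (this example is not from the paper). For a linear-chain model, log Z(x) can be computed with the forward algorithm in time linear in the sentence length, and checked against brute-force enumeration of all tag sequences. The score tables emit and trans are hypothetical stand-ins for the per-position and tag-transition parts of θ · f(x, y); the numbers are made up.

```python
import itertools
import math

def logsumexp(vals):
    # Numerically stable log(sum(exp(v) for v in vals)).
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def sequence_score(tags, emit, trans):
    # Unnormalized log-score of one tag sequence y for a fixed input x:
    # per-position scores plus scores for each adjacent tag pair.
    s = sum(emit[i][t] for i, t in enumerate(tags))
    s += sum(trans[a][b] for a, b in zip(tags, tags[1:]))
    return s

def log_z_brute_force(emit, trans):
    # Sum over every tag sequence: exponential in the sentence length.
    n, k = len(emit), len(trans)
    return logsumexp([sequence_score(tags, emit, trans)
                      for tags in itertools.product(range(k), repeat=n)])

def log_z_forward(emit, trans):
    # Forward algorithm: alpha[b] is the log of the total score of all
    # prefixes ending in tag b. Runs in O(n * k^2) time.
    k = len(trans)
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [emit[i][b] + logsumexp([alpha[a] + trans[a][b] for a in range(k)])
                 for b in range(k)]
    return logsumexp(alpha)

# Hypothetical scores for a 4-word sentence and 3 tags.
emit = [[0.5, -1.0, 0.2], [1.5, 0.0, -0.3], [-0.7, 0.9, 0.1], [0.0, 0.4, 2.0]]
trans = [[0.2, -0.5, 1.0], [0.0, 0.3, -1.2], [0.8, -0.1, 0.0]]

print(log_z_brute_force(emit, trans), log_z_forward(emit, trans))
```

The two functions compute the same quantity; the forward version simply reuses partial sums shared by sequences with a common prefix, which is what makes training over a large space 𝒴(x) feasible.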

This page online: http://cs.jhu.edu/~jason/tutorials/loglin/further

Jason Eisner - jason@cs.jhu.edu (suggestions welcome)
