This page points to some resources on log-linear modeling. They accompany the interactive visualization described in Ferraro & Eisner (2013), A virtual manipulative for learning log-linear models. Suggested additions are welcome.

NLTK is a Python toolkit for natural language processing, intended particularly for pedagogical use. Chapter 6 of the NLTK book (Bird, Klein & Loper 2009) walks you through using machine learning classifiers on natural language data. They refer to log-linear models as "maximum entropy models" (section 6.6). The use of NLTK's maxent module for classification tasks is illustrated in section 7.3 of the book.

MegaM by Hal Daumé III is an efficient program for training log-linear models. You give it a file of training data, and it prints out the learned weights. Once you have the weights in a file, you can run the program again in a different mode, as a filter, where it prints the probability distribution *p*(*y*|*x*) for each test input *x* that it reads, one by one. NLTK provides an interface to MegaM so you can call it from Python.

Vowpal Wabbit (VW) is a super-fast program that can learn linear models with billions of features, using thousands of computers in parallel if you've got 'em. Use the argument `--loss_function logistic` to use the log-linear training objective. Add ℓ₁ or ℓ₂ regularization via `--l1 1.0` or `--l2 1.0` (for *C* = 1.0). VW focuses on prediction, so I am not sure whether it will print *p*(*y*|*x*) or just tell you the *best* *y* for each *x*.

LIBLINEAR is a fast and popular C++ library for learning various types of linear models. It provides a command-line interface via the `train` and `predict` programs, and interfaces are available for calling LIBLINEAR from several other languages. Unfortunately, it handles only the binary case (where *y* = ±1), under the name "logistic regression," with ℓ₁ or ℓ₂ regularization. (If you have a larger space of *y* values, it will build a multi-class classifier for you by combining several binary classifiers, but this is not quite the same as a single multi-class log-linear model.)
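To make concrete what a trainer of this kind computes, here is a minimal Python sketch of a single multi-class log-linear model: it learns weights from training pairs and then reports the full distribution *p*(*y*|*x*) for a test input. It is only an illustration under stated assumptions, not the algorithm used by MegaM, VW, or LIBLINEAR. The feature function `features`, the toy data, and the parameter names `l2`, `rate`, and `epochs` are all hypothetical, and the `l2` penalty here may be scaled differently from VW's `--l2` or from *C* in the visualization.

```python
import math
from collections import defaultdict

def features(x, y):
    """Hypothetical feature function: one indicator per (attribute, label) pair."""
    return [(attr, y) for attr in x]

def predict_proba(weights, x, labels):
    """Return p(y | x) for every label y, using one shared normalizer Z(x)."""
    scores = {y: sum(weights.get(f, 0.0) for f in features(x, y)) for y in labels}
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(data, labels, l2=0.1, rate=0.1, epochs=200):
    """Plain batch gradient ascent on the l2-regularized conditional log-likelihood."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for x, y_obs in data:
            probs = predict_proba(weights, x, labels)
            for f in features(x, y_obs):
                grad[f] += 1.0           # observed feature counts
            for y in labels:
                for f in features(x, y):
                    grad[f] -= probs[y]  # expected feature counts under the model
        for f in grad:
            # The penalty l2 * ||weights||^2 contributes -2 * l2 * weight to the gradient.
            weights[f] += rate * (grad[f] - 2.0 * l2 * weights[f])
    return weights

# Toy training data (assumed for illustration): attribute tuples paired with a label.
data = [
    (("sunny", "warm"), "play"), (("sunny", "cool"), "play"),
    (("rainy", "warm"), "stay"), (("rainy", "cool"), "stay"),
    (("snowy", "warm"), "ski"),  (("snowy", "cool"), "ski"),
]
labels = ["play", "stay", "ski"]

weights = train(data, labels)
for x in [("sunny", "warm"), ("snowy", "cool")]:
    print(x, predict_proba(weights, x, labels))
```

Because all labels share the normalizer computed in `predict_proba`, the probabilities for each *x* sum to one by construction. That is the property a combination of independently trained binary classifiers does not automatically have.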

One good introduction is the handout that goes along with our visualization.

Noah Smith's tutorial offers a more mathematical description of log-linear models, including how maximizing conditional log-likelihood (as in our visualization) arises as the dual problem of maximizing entropy. His book, Linguistic Structure Prediction, discusses log-linear models for structure prediction (see especially sections 3.4 and 3.5).
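In outline, the duality mentioned here can be stated as follows. This is the standard formulation, written in notation chosen for this page (training pairs $(x_i, y_i)$, feature functions $f_k$, weights $\theta_k$) rather than copied from the tutorial, so the symbols may differ from those you find there.

```latex
\begin{align*}
\text{Maximum entropy:}\quad
  & \max_{p}\; -\sum_{i=1}^{N} \sum_{y} p(y \mid x_i)\,\log p(y \mid x_i) \\
  & \text{subject to}\quad
    \sum_{i=1}^{N} \sum_{y} p(y \mid x_i)\, f_k(x_i, y)
      = \sum_{i=1}^{N} f_k(x_i, y_i) \quad \text{for all } k, \\
  & \phantom{\text{subject to}\quad}
    \sum_{y} p(y \mid x_i) = 1 \quad \text{for all } i. \\[1ex]
\text{Maximum likelihood:}\quad
  & \max_{\theta}\; \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i),
  \qquad
  p_\theta(y \mid x) = \frac{\exp \sum_k \theta_k f_k(x, y)}{Z_\theta(x)}, \\
  & Z_\theta(x) = \sum_{y'} \exp \sum_k \theta_k f_k(x, y').
\end{align*}
```

The Lagrange multipliers of the expectation constraints become the weights $\theta_k$. The gradient of the log-likelihood with respect to $\theta_k$ is the observed count of $f_k$ minus its expected count under $p_\theta$, so it vanishes exactly when the constraints of the first problem are satisfied.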

Charles Elkan has very readable notes on log-linear models and related concepts, with a bibliography. His CIKM 2008 video tutorial comes with notes. Computational and optimization aspects are covered, and grounded in logistic regression examples and conditional random field (CRF) tagging. Hanna Wallach also offers an introduction to CRFs and efficient computation for linear chain CRFs.

Jason Eisner has teaching slides (pdf) on using conditional log-linear models for structured prediction problems like sequence tagging and parsing, where the number of output categories *y* is very large. These slides also introduce the structured perceptron, a related technique. They assume familiarity with the simpler cases covered in our visualization, as well as with dynamic programming algorithms for tagging and parsing.
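One way to see how the two techniques are related is to put a single training step of each side by side. The sketch below is not taken from the slides. It assumes a hypothetical indicator feature function `features` and a label set small enough to enumerate. In a genuinely structured problem the best output would be found by an algorithm such as Viterbi, and the expected feature counts by dynamic programming or sampling.

```python
import math

def features(x, y):
    """Hypothetical feature function: one indicator per (attribute, label) pair."""
    return [(attr, y) for attr in x]

def score(weights, x, y):
    """Linear score of output y for input x."""
    return sum(weights.get(f, 0.0) for f in features(x, y))

def perceptron_update(weights, x, y_obs, labels, rate=1.0):
    """Move toward the observed output's features and away from the single
    best-scoring output's features (no change if the prediction is right)."""
    y_hat = max(labels, key=lambda y: score(weights, x, y))
    if y_hat != y_obs:
        for f in features(x, y_obs):
            weights[f] = weights.get(f, 0.0) + rate
        for f in features(x, y_hat):
            weights[f] = weights.get(f, 0.0) - rate

def loglinear_update(weights, x, y_obs, labels, rate=1.0):
    """Stochastic gradient step on log p(y_obs | x): observed features minus
    the features of every output, weighted by its probability."""
    scores = {y: score(weights, x, y) for y in labels}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    probs = {y: e / z for y, e in exps.items()}  # computed before any weight changes
    for f in features(x, y_obs):
        weights[f] = weights.get(f, 0.0) + rate
    for y in labels:
        for f in features(x, y):
            weights[f] = weights.get(f, 0.0) - rate * probs[y]

labels = ["N", "V", "D"]
x, y_obs = ("dog",), "V"

w_perceptron, w_loglinear = {}, {}
perceptron_update(w_perceptron, x, y_obs, labels)
loglinear_update(w_loglinear, x, y_obs, labels)
print("perceptron:", w_perceptron)
print("log-linear:", w_loglinear)
```

The perceptron subtracts the features of one competing output, while the log-linear step subtracts a probability-weighted average over all outputs. The two updates nearly coincide when the model puts almost all of its probability on a single output.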

For links into the research literature, we quote from section 8 of our paper (Ferraro & Eisner, 2013):

At the time of writing, 3266 papers in the ACL Anthology mention log-linear models, with 137 using “log-linear,” “maximum entropy” or “maxent” in the paper title. These cover a wide range of applications that can be considered in lectures or homework projects.

Early papers may cover the most fundamental applications and the clearest motivation. Conditional log-linear models were first popularized in computational linguistics by a group of researchers associated with the IBM speech and language group, who called them “maximum entropy models,” after a principle that can be used to motivate their form (Jaynes, 1957). They applied the method to various binary or multiclass classification problems in NLP, such as prepositional phrase attachment (Ratnaparkhi et al., 1994), text categorization (Nigam et al., 1999), and boundary prediction (Beeferman et al., 1999).

Log-linear models can also be used for structured prediction problems in NLP such as tagging, parsing, chunking, segmentation, and language modeling. A simple strategy is to reduce structured prediction to a sequence of multiclass predictions, which can be individually made with a conditional log-linear model (Ratnaparkhi, 1998). A more fully probabilistic approach, used in the original “maximum entropy” papers, is to use (1) to define the conditional probabilities of the steps in a generative process that gradually produces the structure (Rosenfeld, 1994; Berger et al., 1996). (Even predicting the single next word in a sentence can be broken down into a sequence of binary decisions in this way. This avoids normalizing over the large vocabulary (Mnih & Hinton, 2008).) This idea remains popular today and can be used to embed rich distributions into a variety of generative models (Berg-Kirkpatrick et al., 2010). For example, a PCFG that uses richly annotated nonterminals involves a large number of context-free rules. Rather than estimating their probabilities separately, or with traditional backoff smoothing, a better approach is to use (1) to model the probability of all rules given their left-hand sides, based on features that consider attributes of the nonterminals. (E.g., case, number, gender, tense, aspect, mood, lexical head. In the case of a terminal rule, the spelling or morphology of the terminal symbol can be considered.)

The most direct approach to structured prediction is to simply predict the structured output all at once, so that *y* is a large structured object with many features. This is conceptually natural but means that the normalizer *Z*(*x*) involves summing over a large space 𝒴(*x*). One can restrict 𝒴(*x*) before training (Johnson et al., 1999). More common is to sum *efficiently* by dynamic programming or sampling, as is typical in linear-chain conditional random fields (Lafferty et al., 2001), whole-sentence language modeling (Rosenfeld et al., 2001), and CRF CFGs (Finkel et al., 2008).
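As an illustration of summing efficiently by dynamic programming (this is not part of the quoted text), the Python sketch below computes log *Z*(*x*) for a small linear-chain model in two ways: by enumerating every tag sequence, and by the forward algorithm. The tables `emit` and `trans` are hypothetical. They stand in for the summed feature weights at each position and for each pair of adjacent tags.

```python
import itertools
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def sequence_score(tags, emit, trans):
    """Unnormalized log-score of one complete tag sequence."""
    total = sum(emit[i][t] for i, t in enumerate(tags))
    total += sum(trans[a][b] for a, b in zip(tags, tags[1:]))
    return total

def log_z_brute_force(emit, trans):
    """Sum over all k**n tag sequences: exponential in the sentence length n."""
    n, k = len(emit), len(trans)
    return logsumexp([sequence_score(tags, emit, trans)
                      for tags in itertools.product(range(k), repeat=n)])

def log_z_forward(emit, trans):
    """Forward algorithm: the same sum in O(n * k**2) time."""
    k = len(trans)
    # alpha[b] = log of the total score of all prefixes that end in tag b
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [emit[i][b] + logsumexp([alpha[a] + trans[a][b] for a in range(k)])
                 for b in range(k)]
    return logsumexp(alpha)

# Hypothetical scores for a 4-word input with 3 possible tags.
emit = [[0.5, -1.0, 0.2],
        [1.5, 0.3, -0.7],
        [-0.2, 0.8, 0.1],
        [0.0, 0.4, 1.1]]
trans = [[0.1, 0.9, -0.5],
         [-0.3, 0.2, 0.6],
         [0.7, -0.8, 0.0]]

print(log_z_brute_force(emit, trans))
print(log_z_forward(emit, trans))
```

The two functions should print the same value up to rounding. Only the forward version stays practical as sentences get longer, since the brute-force sum has 3⁴ = 81 terms here and grows exponentially with the length of the input.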

This page online:

`http://cs.jhu.edu/~jason/tutorials/loglin/further`

Jason Eisner - jason@cs.jhu.edu (suggestions welcome)