Language modeling

All of you have seen a language model at work. And by knowing a language, you have developed your own language model.

You have probably seen a LM at work in predictive text:

  • a search engine predicts what you will type next
  • your phone predicts the next word
  • recently, Gmail also added a prediction feature

Language models also help filter the output of systems for tasks like:

  • speech recognition

    You speak a phrase into your phone, which has to convert it to text. How does it know if you said "recognize speech" or "wreck a nice beach"? (say them really fast, they sound quite similar)

  • machine translation

    You are translating the Chinese sentence "我在开车" into English. Your translation system gives you several choices:

    • I at open car
    • me at open car
    • I at drive
    • me at drive
    • I am driving
    • me am driving

    A language model tells you which translation sounds the most natural.

Activity: Wheel of Fortune Cookies

This puzzle is about language models and bigrams (groups of 2 words). The bigrams "ice cream" and "cream cheese" are very common, but "ice cream cheese" is not.

A 1-gram is called a unigram, a 2-gram a bigram, and a 3-gram a trigram; 4-grams, 5-grams, etc. are called just that.

Building Intuition

We have two sentences:

  • the cat chased the mouse
  • the tiger chased the mouse

Which sounds more natural? Which is more common?

  • The first one, obviously. Cats are more common than tigers, and you usually see "cat" and "mouse" in the same sentence.

But it's not obvious to a computer. How do we mathematically answer this question?

  • Count how many times the sentence appears in a corpus (a collection of documents). Wikipedia is a commonly used corpus; web-crawled data is also widely used.

What if the second sentence never appears in the corpus?

  • Break up the sentence into smaller parts, like words. So our sentences are now [the, cat, chased, the, mouse] and [the, tiger, chased, the, mouse]. If we count how many times each of these words appears, we see that the counts for the words in both sentences are the same, except for the counts for "cat" and "tiger".
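As a minimal sketch, the word counting above can be done with Python's `collections.Counter` (the two sentences are the ones from this example):

```python
from collections import Counter

# Split each sentence into words and count how often each word appears.
sent1 = "the cat chased the mouse".split()
sent2 = "the tiger chased the mouse".split()

counts1 = Counter(sent1)
counts2 = Counter(sent2)

print(counts1)  # Counter({'the': 2, 'cat': 1, 'chased': 1, 'mouse': 1})
print(counts2)  # Counter({'the': 2, 'tiger': 1, 'chased': 1, 'mouse': 1})
```

Removing "cat" from the first count and "tiger" from the second leaves two identical counts, which is exactly the observation above.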

But sentences are not just a collection of words.

  • Right! If we just look at the words (unigrams), then "the cat chased the mouse" is the same as "the the cat chased mouse". This is called the bag-of-words approach. These are clearly not the same sentence, but in practice, many NLP systems use this approach because it is effective and fast.
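A quick sketch of why bag of words cannot tell the two word orders apart: once we reduce each sequence to its word counts, the representations are identical.

```python
from collections import Counter

# Bag-of-words keeps only the word counts, so these two word sequences
# look identical even though the word order differs.
a = Counter("the cat chased the mouse".split())
b = Counter("the the cat chased mouse".split())

print(a == b)  # True
```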

Then use bigrams.

  • The bigrams of "the cat chased the mouse" are: the cat, cat chased, chased the, the mouse. This is better. In practice, 3-grams to 5-grams are common. If a 5-gram never appears, you can back off to 4-grams; if the 4-gram doesn't appear either, back off to 3-grams.
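Extracting the n-grams of a sentence is a one-liner; here is a small sketch (the `ngrams` helper is just illustrative, not from any particular library):

```python
def ngrams(words, n):
    """Return all n-grams (as tuples) of a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the cat chased the mouse".split()
print(ngrams(words, 2))
# [('the', 'cat'), ('cat', 'chased'), ('chased', 'the'), ('the', 'mouse')]
print(ngrams(words, 3))
# [('the', 'cat', 'chased'), ('cat', 'chased', 'the'), ('chased', 'the', 'mouse')]
```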

What if a word never appears, say "tiger" never occurs in Wikipedia?

  • One simple approach is add-one smoothing, where you give every unseen word a count of 1. You have to do a little extra math to make the probabilities still sum to one.

What about longer or shorter sentences?

  • We actually use probabilities, not just counts.


The chain rule of probability says

p(X_1 X_2 \cdots X_n) = p(X_1) p(X_2 \mid X_1) p(X_3 \mid X_1 X_2) p(X_4 \mid X_1 X_2 X_3) \cdots p(X_n \mid X_1 \cdots X_{n-1})

So the probability of "the cat chased the mouse" is

p(\text{the}) p(\text{cat} \mid \text{the}) p(\text{chased} \mid \text{the cat}) p(\text{the} \mid \text{the cat chased}) p(\text{mouse} \mid \text{the cat chased the})

How do we calculate a probability like p(\text{mouse} \mid \text{the cat chased the})? By counting:

p(\text{mouse} \mid \text{the cat chased the}) = \frac{ c(\text{the cat chased the mouse}) }{ c(\text{the cat chased the}) }

But these phrases are quite long, and the longer the phrase, the more likely it is to have a count of zero. We talked above about breaking it down into n-grams. In statistics, this is called the Markov assumption. For trigrams, we only look at the two words before:

p(\text{mouse} \mid \text{the cat chased the}) \approx p(\text{mouse} \mid \text{chased the})


p(\text{the cat chased the mouse}) = p(\text{the}) p(\text{cat} \mid \text{the}) p(\text{chased} \mid \text{the cat}) p(\text{the} \mid \text{cat chased}) p(\text{mouse} \mid \text{chased the})

which is more manageable.
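As a sketch, each trigram factor can be estimated by counting trigrams and bigrams in a corpus. The tiny corpus and the `trigram_prob` helper below are invented for illustration:

```python
from collections import Counter

# Toy corpus (invented for illustration).
corpus = ("the cat chased the mouse . "
          "the tiger chased the deer . "
          "the cat chased the ball .").split()

trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def trigram_prob(word, history):
    """Estimate p(word | history) as c(history word) / c(history),
    where history is the preceding two words."""
    h1, h2 = history
    return trigram_counts[(h1, h2, word)] / bigram_counts[(h1, h2)]

# "chased the" occurs 3 times; "chased the mouse" occurs once.
print(trigram_prob("mouse", ("chased", "the")))  # 1/3
```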

Text Generation

Let's get a trigram LM to generate some text. If we start with two words A and B, how do we generate the next one (C)? Pick the one that has the highest probability (or count) for p(C \mid A B). Then use B and C as the starting words, and repeat!
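Here is a minimal sketch of that generation loop, using greedy selection (always pick the highest-count continuation). The corpus and the `generate` function are just illustrative, not the code used in class:

```python
from collections import Counter

# Toy corpus (invented for illustration).
corpus = ("the cat chased the mouse and "
          "the mouse ran and the cat chased the mouse").split()
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))

def generate(a, b, n=5):
    """Greedily extend the pair (a, b) with the most frequent
    trigram continuation, up to n extra words."""
    out = [a, b]
    for _ in range(n):
        candidates = {w: c for (x, y, w), c in trigram_counts.items()
                      if (x, y) == (out[-2], out[-1])}
        if not candidates:
            break  # no trigram starts with the last two words
        out.append(max(candidates, key=candidates.get))
    return " ".join(out)

print(generate("the", "cat"))
```

Sampling the next word in proportion to its count, instead of always taking the maximum, makes the output vary between runs.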


Let's quickly write a (simple) language model to generate text.

  1. We're going to need a corpus. Let's download one from Project Gutenberg. In class, I used Pride and Prejudice. Make sure you download the "Plain Text" version. Save it to your computer.

  2. Write some code! The code I wrote in class can be found here along with Pride and Prejudice. Download and unzip it into the same folder.

  3. Some parts of the code you might want to change:

    • Line 4 contains the file for the book ("pp.txt"). Change it as appropriate.
    • Line 18 specifies trigrams (the number 3). Try other values.
  4. Open a terminal in the same folder and run the code with python.

Some things to think about:

  • How can we improve this?
  • Run it a couple times. Why does it produce different output?


  1. Download another book from Project Gutenberg that is not in English (preferably in a language you understand) and run the code on this book. This code is very simple, and it expects words to be separated by spaces, so languages like Chinese are not going to work as expected. We will deal with this issue next week!

    Examine the output. Do you notice anything interesting or unusual? Does it generate any funny sentences?

  2. Read this blog post about GPT-2, which is currently the state of the art in language modeling.