Homework 3

In this assignment you will first carry out exploratory data analysis on some data, and then you will build a classifier. This assignment is due at 11:59 pm on Monday, December 12. Everything is to be turned in electronically. Please tar and gzip your submission, and use PDF, PostScript, or plain text for the writeup.

Part 1

The dataset is here. You will need to run
  gzip -cd lingspam.tar.gz | tar -xf - 
This will unpack a directory called lingspam that contains two kinds of email messages (one per file): SPAM (file names starting with sp) and MAIL (file names starting with digits). The mail messages come from the Linguist email list.

Let a feature f be a function from a message to a discrete or continuous value. One example of a feature is the length (in bytes) of a message. Another feature is the number of words. Or the number of occurrences of the letter Q. You can come up with features very easily.
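
For concreteness, here is a minimal sketch (in Python, though any language is fine) of three such features, each computed from a single message file; the function names are just illustrative:

  # Three simple features, computed from one message file (illustrative only).
  import os

  def byte_length(path):
      return os.path.getsize(path)              # length in bytes

  def word_count(path):
      with open(path, errors="ignore") as f:
          return len(f.read().split())          # number of whitespace-separated words

  def q_count(path):
      with open(path, errors="ignore") as f:
          return f.read().count("Q")            # occurrences of the letter Q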

Your task, however, is to come up with good features that predict whether a message is spam or mail. To do this, you need to roll up your sleeves and do some EDA. (See Lecture 5.) For instance, plot the value of your feature for SPAM in one color and MAIL in another color, and see how well they are separated. Plot one feature on the x-axis and another on the y-axis. Try to come up with a convincing argument that helps you pick good features.
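
As one possible starting point, the sketch below (Python with matplotlib; it assumes the files sit under lingspam/ as unpacked above and that spam file names start with sp) overlays histograms of one feature for the two classes:

  # Sketch: compare the distribution of one feature (message length) for SPAM vs. MAIL.
  import glob, os
  import matplotlib.pyplot as plt

  paths = [p for p in glob.glob("lingspam/**/*", recursive=True) if os.path.isfile(p)]
  spam = [os.path.getsize(p) for p in paths if os.path.basename(p).startswith("sp")]
  mail = [os.path.getsize(p) for p in paths if not os.path.basename(p).startswith("sp")]

  plt.hist(spam, bins=30, alpha=0.5, label="SPAM")
  plt.hist(mail, bins=30, alpha=0.5, label="MAIL")
  plt.xlabel("message length (bytes)")
  plt.ylabel("count")
  plt.legend()
  plt.savefig("length_hist.png")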

The deliverable for part 1 is a written description, supported by graphics, of what you conclude. Normally EDA is not "written up" as a paper; it's a prelude to carrying out actual experiments or building empirical systems. So your prose can be casual, but your thought process should be clear.

Part 2

You will build a Naive-Bayes classifier from the data. Your program must run on the command line in the following way:

cat filename | ./spamornot > outputfile

The input file could contain anything and must be read on standard input. The program spamornot will write to standard output either the word "SPAM" (no quotes) or the word "MAIL" (no quotes) followed by a newline character. Five bytes. Got it?

Your program may read other files in the same directory.
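
For concreteness, the required I/O behavior looks roughly like the following sketch (Python; the score function is a placeholder for whatever model you build below):

  #!/usr/bin/env python3
  # Sketch of the required I/O contract only; the actual model is up to you.
  import sys

  message = sys.stdin.read()        # the whole message arrives on standard input

  def score(label, text):
      # placeholder -- replace with Pr(label) x Pr(text | label) from your model
      return 0.5

  label = "SPAM" if score("SPAM", message) > score("MAIL", message) else "MAIL"
  sys.stdout.write(label + "\n")    # exactly five bytes either way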

What is a Naive-Bayes classifier? See Lectures 6 and 7. Here's a recap. Any model will define Pr(SPAM | message) and Pr(MAIL | message). Your model will do this in a Bayesian manner. First we break these down:

 Pr(SPAM | message) = Pr(SPAM) x Pr(message | SPAM) / Pr(message)
 Pr(MAIL | message) = Pr(MAIL) x Pr(message | MAIL) / Pr(message) 

We can ignore the Pr(message) term: for a given message it is the same in both expressions, so it doesn't affect which one is larger. All you need are the proportional quantities (the ~ below means "proportional to"):

 Pr(SPAM | message) ~ Pr(SPAM) x Pr(message | SPAM) 
 Pr(MAIL | message) ~ Pr(MAIL) x Pr(message | MAIL) 

Assuming that you have those four quantities, your classifier will write out "SPAM" if Pr(SPAM | message) > Pr(MAIL | message) and "MAIL" otherwise. (This is the Bayesian decision criterion we covered before Thanksgiving.)

How to compute those quantities? Pr(SPAM) and Pr(MAIL) should be easy given the training data. The other two terms, Pr(message | SPAM) and Pr(message | MAIL), are where the "naive" part of Naive-Bayes comes in.
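
For example, the maximum-likelihood estimates of the priors are just relative frequencies in the training set. A sketch, assuming the file-name convention from Part 1:

  # Sketch: estimate Pr(SPAM) and Pr(MAIL) from the training file names.
  import glob, os

  paths = [p for p in glob.glob("lingspam/**/*", recursive=True) if os.path.isfile(p)]
  n_spam = sum(os.path.basename(p).startswith("sp") for p in paths)
  p_spam = n_spam / len(paths)      # Pr(SPAM)
  p_mail = 1.0 - p_spam             # Pr(MAIL)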

Suppose you have picked n features, f1, f2, ..., fn. Rather than model the whole message, the Naive-Bayes model represents it by just those n feature values and (naively) assumes they are independent of one another given the class - conditioning on SPAM and conditioning on MAIL. So we have:

 Pr(message | SPAM) = Pr(f1(message) | SPAM) x Pr(f2(message) | SPAM) x ... x Pr(fn(message) | SPAM)
 Pr(message | MAIL) = Pr(f1(message) | MAIL) x Pr(f2(message) | MAIL) x ... x Pr(fn(message) | MAIL) 
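
In practice, a product of many small probabilities can underflow, so implementations usually accumulate a sum of logs instead; this changes nothing about which class wins. A sketch, where feature_probs is a hypothetical table mapping (feature index, feature value, class) to an estimated probability:

  import math

  # Sketch: log Pr(message | label) under the naive independence assumption.
  # feature_values is the list [f1(message), ..., fn(message)];
  # feature_probs[(k, value, label)] is your estimate of Pr(fk = value | label).
  def log_likelihood(feature_values, label, feature_probs):
      return sum(math.log(feature_probs[(k, v, label)])
                 for k, v in enumerate(feature_values))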

These quantities should be estimated, of course, from the training data. Note that if any Pr(fk(message) | SPAM) is zero, then Pr(message | SPAM) will be zero - so you may need to smooth your probability estimates (i.e., assign some small, nonzero probability even to unseen feature/class combinations).
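
Add-one (Laplace) smoothing is one simple option; a sketch, where the argument names are placeholders for counts you tally from the training data:

  # Sketch: add-one (Laplace) smoothed estimate of Pr(fk = value | label).
  def smoothed_prob(n_value_and_label, n_label, n_possible_values):
      # n_value_and_label: training messages of this label where fk took this value
      # n_label: number of training messages with this label
      # n_possible_values: number of distinct values feature fk can take
      return (n_value_and_label + 1) / (n_label + n_possible_values)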

You will turn in your source code and, if your program requires compilation, a Makefile (we should be able to build your program by running "make" on the ugrad network). You should also describe in writing what your features are and all of the computed probabilities. (The probabilities will be built into your program, but we don't want to have to go looking for them in your code.) Feel free to use many more features than the ones you explored in Part 1.

How will we evaluate the classifiers? Not on the data we gave you, of course; we will use a small portion of unseen test data. Your classifier's performance will have a small effect on your grade.

As always, please make sure your assignment is clear and understandable - label quantities clearly and write your responses in understandable English.

The dataset we are using is due to:

I. Androutsopoulos et al., "An Evaluation of Naive Bayesian Anti-Spam Filtering". In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9-17, 2000.


Noah A. Smith and David A. Smith | Empirical Research Methods in Computer Science