Image Spam Dataset



This image spam/ham dataset was used in our paper:

Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach. Learning Fast Classifiers for Image Spam. In Proceedings of the Conference on Email and Anti-Spam (CEAS), 2007. [PDF]

If you use this dataset in any publication, please cite this paper as the reference for the data.

Please see the paper for details on how the dataset was collected and prepared. We collected and distributed this data primarily because no image spam/ham dataset is currently freely available. While there are several sources of image spam, there is no source of image ham, so ham is typically simulated from images on the web. In contrast, our data was collected from the mailboxes of two real users. Below are a few additional notes concerning the dataset (in no particular order).

1) Distributing ham data is not easy since it often contains personal information from someone's inbox. Our ham data contained pictures of friends, copies of documents, and other personal information. To make this data available, we went through the data by hand and removed all images we considered private. The result is a slightly smaller ham dataset that is mostly missing photographs. However, there should be plenty of good images left for testing.

2) The Spam Archive images were taken from the Spam Archive data provided by Giorgio Fumera's group and used in this paper:

Giorgio Fumera, Ignazio Pillai, Fabio Roli. Spam Filtering Based On The Analysis Of Text Information Embedded Into Images. Journal of Machine Learning Research, 7(Dec):2699--2720, 2006.

Converting the data into image files was difficult because of problems in the encoding scheme, but most of it was converted successfully.

3) You will see a mismatch between the total number of images reported in our paper and the number included in this data. There are several reasons for this. First, we removed some personal mail from the ham data. Second, we removed duplicates for the paper using a heuristic (see the paper), whereas the dataset contains all images, including those duplicates. Finally, not all images are well formed. Because of the wonders of spam and email attachments, many are corrupted in all sorts of interesting ways. Different image readers will have different amounts of success in reading each file. Our image readers (two Java image readers) were unable to read all the images, so our numbers do not count every image in the data (we ignored images we could not process).

4) Since our paper focused on image classification, we did not save the emails associated with the images. The SpamArchive emails are still available (contact me) if you really need the emails themselves.
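
To get a feel for the count mismatch described in note 3, a rough survey of an extracted directory can help. The sketch below is only an illustration and is not the method used in the paper: it groups files that are byte-for-byte identical (the paper's duplicate heuristic is different, see the paper) and flags files whose leading bytes do not match a few common image signatures. A signature check is much weaker than actually decoding the image, so a file that passes may still be corrupted. The names survey, sniff_format, and IMAGE_SIGNATURES are hypothetical and chosen for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Leading "magic" bytes for a few common image formats.
# Illustrative and incomplete; the dataset may contain other formats.
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\x89PNG\r\n\x1a\n": "png",
    b"BM": "bmp",
}


def sniff_format(data):
    """Return a format name if the data starts with a known signature, else None."""
    for magic, name in IMAGE_SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def survey(root):
    """Count files under root, group exact duplicates, and list unrecognized files.

    Assumption: exact byte equality stands in for "duplicate" here. This is
    NOT the heuristic from the paper and will miss near-duplicates.
    """
    by_hash = defaultdict(list)
    unrecognized = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        data = path.read_bytes()
        by_hash[hashlib.sha256(data).hexdigest()].append(path)
        if sniff_format(data) is None:
            unrecognized.append(path)
    return {
        "total": sum(len(paths) for paths in by_hash.values()),
        "unique": len(by_hash),
        "duplicate_groups": [p for p in by_hash.values() if len(p) > 1],
        "unrecognized": unrecognized,
    }
```

Calling survey on the directory where an archive was unpacked returns the total file count, the number of distinct files, the groups of identical files, and the files with no recognized signature. Expect these numbers to differ from the paper's, since the paper used a different duplicate heuristic and counted only images its readers could decode.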

There are always small details and I am sure that I omitted many of them. If you have a question after reading the paper and this page, please let me know.

Links to download the data:

SpamArchive Image Spam (137 MB) [spam_archive.tar.gz]

Personal Image Spam (29 MB) [personal_image_spam.tar.gz]

Personal Image Ham (107 MB) [personal_image_ham.tar.gz]
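
Once downloaded, the archives can be unpacked with any tar tool. As a minimal sketch, assuming the three files listed above have been saved to the current directory, the Python snippet below extracts the regular files from one archive and reports how many it wrote. The internal layout of the archives is not described on this page, so the code makes no assumption about directory names inside them. The function name extract_archive and the destination directory are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract regular files from a .tar.gz archive and return how many were written."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories, links, and other special entries.
            if not member.isfile():
                continue
            target = (dest / member.name).resolve()
            # Skip any entry whose path would land outside the destination.
            if dest not in target.parents:
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src:
                target.write_bytes(src.read())
            count += 1
    return count
```

For example, extract_archive("personal_image_ham.tar.gz", "data/ham") would unpack the ham images under data/ham, where the destination path is only an illustration.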

Other image spam datasets: (Email me if you'd like yours listed here)
Princeton Spam Image Benchmark
TREC 2005-2007 (Many have extracted images from the TREC corpora)

Last updated: July 25, 2007