Multi-Domain Sentiment Dataset (version 2.0)

This sentiment dataset supersedes the previous data (still available here).

Links to download the data:
[unprocessed.tar.gz] (1.5 G)
[processed_acl.tar.gz] (19 M)
[processed_stars.tar.gz] (33 M)

This sentiment dataset has been used in several papers:

John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association for Computational Linguistics (ACL), 2007. [PDF]

John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jenn Wortman. Learning Bounds for Domain Adaptation. Neural Information Processing Systems (NIPS), 2008. [PDF]

Mark Dredze, Koby Crammer, and Fernando Pereira. Confidence-Weighted Linear Classification. International Conference on Machine Learning (ICML), 2008. [PDF]

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain Adaptation with Multiple Sources. Neural Information Processing Systems (NIPS), 2009.

If you use this data for your research or a publication, please cite the first (ACL 2007) paper as the reference for the data. Also, please drop me a line so I know that you found the data useful.

The Multi-Domain Sentiment Dataset contains product reviews taken from many product types (domains). Some domains (books and DVDs) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed. This page gives a brief description of the data. If you have questions, please email Mark Dredze or John Blitzer.
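As an illustration of the star-to-binary conversion mentioned above, here is a minimal sketch. This page does not state a threshold, so the cutoff used here (above 3 stars is positive, below 3 is negative, exactly 3 is treated as ambiguous and dropped) is an assumption; check it against the paper you are comparing to. The function name stars_to_binary is hypothetical.

```python
def stars_to_binary(stars, neutral=3.0):
    """Map a 1-5 star rating to a binary sentiment label.

    Assumption: ratings above `neutral` are positive, ratings below it
    are negative, and ratings equal to it are ambiguous (None).
    """
    stars = float(stars)
    if stars > neutral:
        return "positive"
    if stars < neutral:
        return "negative"
    return None  # ambiguous rating, caller may choose to discard it


# Example: keep only the reviews that receive a binary label.
ratings = [5, 1, 3, 4.0, 2.0]
labels = [stars_to_binary(r) for r in ratings]
binary_only = [lab for lab in labels if lab is not None]
```

Returning None for the neutral rating leaves the decision about ambiguous reviews to the caller instead of silently assigning them to one class.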

A few notes regarding the data sets.

1) unprocessed.tar.gz contains the original data.
2) processed_acl.tar.gz contains the data pre-processed and balanced, in the format used by Blitzer et al. (ACL 2007).
3) processed.realvalued.tar.gz contains the data pre-processed and balanced, but labeled with the number of stars rather than just positive or negative. This is the format used by Mansour et al. (NIPS 2009).

The preprocessed data is one line per document, with each line in the format:

feature:<count> .... feature:<count> #label#:<label>

The label is always at the end of each line.
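The line format above can be read with a few lines of code. The following is a minimal sketch, not the authors' own tooling: the function name parse_line and the sample line are hypothetical, and it assumes that tokens are separated by whitespace and that counts are integers.

```python
LABEL_KEY = "#label#"


def parse_line(line):
    """Parse one preprocessed document into (feature_counts, label).

    Assumptions: tokens are whitespace-separated, each token looks like
    name:value, and feature counts are integers. The label is returned
    as a string so that both binary and star-valued labels work.
    """
    features = {}
    label = None
    for token in line.split():
        # Split on the last colon so a feature name containing ':' survives.
        name, sep, value = token.rpartition(":")
        if not sep:
            raise ValueError(f"token without a colon: {token!r}")
        if name == LABEL_KEY:
            label = value
        else:
            features[name] = features.get(name, 0) + int(value)
    return features, label


# Hypothetical sample line in the documented format.
sample = "great:2 not_good:1 waste_of:1 #label#:negative"
features, label = parse_line(sample)
```

For the star-valued files, the returned label string can be converted with float(label) by the caller.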

4) Each directory corresponds to a single domain. Each directory contains several files, which we briefly describe:
-- All reviews for this domain, in their original format
-- Positive reviews
-- Negative reviews
-- Unlabeled reviews
-- Preprocessed reviews (in the line format described above)
-- Preprocessed reviews, equally balanced between positive and negative

5) While the positive and negative files contain positive and negative reviews, these aren't necessarily the splits used in any of the cited papers. They are simply there as possible initial splits.

6) Each (unprocessed) file contains a pseudo-XML scheme for encoding the reviews. Most of the fields are self-explanatory. Each review has a unique ID field, but its values are not always unique. If a review has two unique ID fields, ignore the one containing only a number.
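The rule in note 6 for reviews with two unique ID fields can be sketched as follows. The sketch assumes the ID values have already been extracted from a review as strings; how they are extracted depends on the field names in the files, which are not listed on this page. The function name pick_unique_id and the sample values are hypothetical.

```python
def pick_unique_id(candidates):
    """Choose the ID to keep when a review carries more than one.

    Follows the rule in note 6: ignore a candidate that contains only
    a number. Falls back to the first candidate if all are numeric,
    and returns None if there are no candidates at all.
    """
    cleaned = [c.strip() for c in candidates if c.strip()]
    non_numeric = [c for c in cleaned if not c.isdigit()]
    if non_numeric:
        return non_numeric[0]
    return cleaned[0] if cleaned else None


# Hypothetical ID values taken from a single review.
chosen = pick_unique_id(["  12345 ", "B000EXAMPLE:some_title:some_reviewer"])
```

Because the IDs are not guaranteed to be unique across reviews, it may still be worth deduplicating on the chosen ID together with the review text.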

There are always small details that we might have omitted. If you have a question after reading the paper and this page, please let Mark Dredze or John Blitzer know.

Last updated: March 23, 2009