Movie Review Polarity Dataset Enriched with "Annotator Rationales"
-------------------------------------------------------------------

README file
Written/Maintained by: Omar F. Zaidan (( ozaidan@cs.jhu.edu ))
-- Last Updated: May 9, 2007 --

----------------------------------------------
NOTE: this README will be updated with more
information in the very near future!
----------------------------------------------

Hey, what is this?
------------------

This README describes the release of the Movie Review Polarity Dataset
Enriched with "Annotator Rationales."

Availability
------------

This dataset is available from my (Omar's) home page:

  http://cs.jhu.edu/~ozaidan/rationales

Citation
--------

This dataset was introduced and first used in:

  Zaidan, Omar, Jason Eisner, and Christine Piatko (2007). Using "Annotator
  Rationales" to Improve Machine Learning for Text Categorization. NAACL
  HLT 2007; Proceedings of the Main Conference, pp. 260-267.

  @InProceedings{zaidan-eisner-piatko:2007:MainConf,
    author    = {Zaidan, Omar and Eisner, Jason and Piatko, Christine},
    title     = {Using ``Annotator Rationales'' to Improve Machine Learning
                 for Text Categorization},
    booktitle = {NAACL HLT 2007; Proceedings of the Main Conference},
    month     = {April},
    year      = {2007},
    pages     = {260--267},
  }

Please cite the above paper if you use the dataset in your work. If you are
interested, a slideshow about our work is available at the URL mentioned
above (in addition to the paper itself).

What are "rationales"?
----------------------

This dataset is based on the movie review polarity dataset (v2.0) collected
and maintained by Bo Pang and Lillian Lee. Their dataset (we'll call it
PL2.0) consists of 1000 positive and 1000 negative movie reviews obtained
from the Internet Movie Database (IMDb) review archive.

The main contribution of this release is the enrichment of the documents
with "annotator rationales," a concept we describe in our NAACL HLT 2007
paper. Basically, "rationales" are segments of the text that support an
annotator's classification. Let's say we have a movie review that is
labeled as positive (i.e., the writer has a favorable opinion of the
movie). Then the rationales would be segments of the text that support the
claim (by an annotator) that the review is, indeed, positive.

Here are some examples of positive rationales (the segments enclosed by
double square brackets):

  • [[you will enjoy the hell out of]] American Pie.

  • fortunately, they [[managed to do it in an interesting and funny way]].

  • he is [[one of the most exciting martial artists on the big screen]],
    continuing to perform his own stunts and [[dazzling audiences]] with
    his flashy kicks and punches.

  • the romance was [[enchanting]].

And here are some examples of negative rationales:

  • A woman in peril. A confrontation. An explosion. The end.
    [[Yawn. Yawn. Yawn.]]

  • when a film makes watching Eddie Murphy [[a tedious experience, you
    know something is terribly wrong]].

  • the movie is [[so badly put together]] that even the most casual viewer
    may notice the [[miserable pacing and stray plot threads]].

  • [[don't go see]] this movie

Datasets
--------

This release includes two datasets, DS1 and DS2:

(DS1) A dataset of documents enriched with annotator rationales

  In their experiments, Pang and Lee divided the 2000 documents into 10
  folds of 200 documents each. In our experiments, we used the last fold as
  an evaluation set, and so did not annotate its documents with rationales.
  Therefore, we provide here annotator rationales for 1800 documents (900
  positive and 900 negative).

  Positive rationales in positive reviews are tagged with <POS> and </POS>,
  and negative rationales in negative reviews are tagged with <NEG> and
  </NEG>. No positive rationales were tagged in negative reviews, and no
  negative rationales were tagged in positive reviews.
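  For convenience, here is a small Python sketch of how one might pull the
  rationale spans out of an annotated DS1 document and recover the tag-free
  text (i.e., what DS2 contains). This is only an illustrative sketch, not
  part of the release, and it assumes the rationale tags are exactly
  <POS>...</POS> and <NEG>...</NEG> as described above:

    # Illustrative sketch only -- not part of the dataset release.
    # Assumes rationales are marked with <POS>...</POS> / <NEG>...</NEG>.
    import re

    TAG_RE = re.compile(r'<(POS|NEG)>(.*?)</\1>', re.DOTALL)

    def extract_rationales(annotated_text):
        # Return (polarity, span) pairs for every tagged rationale.
        return [(m.group(1), m.group(2).strip())
                for m in TAG_RE.finditer(annotated_text)]

    def strip_rationale_tags(annotated_text):
        # Return the document with rationale tags removed (DS2-style text).
        return TAG_RE.sub(lambda m: m.group(2), annotated_text)

    example = ("fortunately, they <POS>managed to do it in an "
               "interesting and funny way</POS>.")
    print(extract_rationales(example))
    # [('POS', 'managed to do it in an interesting and funny way')]
    print(strip_rationale_tags(example))
    # fortunately, they managed to do it in an interesting and funny way.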
(DS2) Documents without any rationale annotation

  Although PL2.0 can be obtained directly from Pang's home page, the
  dataset needed a bit of cleaning up (in my opinion). We provide here the
  dataset that we worked with. DS2 is simply DS1 but with the rationale
  tags removed.

  Many of the differences between our DS2 and PL2.0 are quite minor, but
  some documents were shortened significantly in the process. You can
  choose to just use PL2.0 in your work and accept the shortcomings in some
  documents as annotator-caused noise, but feel free to use the cleaned-up
  version we provide here (DS2).

  PS: Even if you do use DS2, I think you really should still cite Pang &
  Lee's work, because it _is_ their dataset after all. A URL to this
  dataset is appropriate. Clearly, if you use DS1, you should cite our
  work :)

  PPS: If you want to get a feel for how/why DS2 is "cleaner" than PL2.0,
  you can compare the following document pairs:

    (cv392_12238.txt / neg_392.txt)
    (cv456_20370.txt / neg_456.txt)
    (cv557_12237.txt / neg_557.txt)
    (cv440_15243.txt / pos_440.txt)
    (cv054_4101.txt  / neg_054.txt)
    (cv320_9693.txt  / neg_320.txt)
    (cv025_3108.txt  / pos_025.txt)

  PPPS: If you *really* want to know more, ask me. (I actually did keep
  documentation on how DS2 is cleaner than PL2.0, but I'm not including it
  in this README.)

The End (for now)