Movie Review Polarity Dataset
Enriched with "Annotator Rationales"
------------------------------------
README file
Written/Maintained by: Omar F. Zaidan
(( ozaidan@cs.jhu.edu ))
-- Last Updated: May 9, 2007 --
----------------------------------------------
NOTE: this README will be updated with more
information in the very near future!
----------------------------------------------
Hey, what is this?
------------------
This README describes the release of the Movie Review Polarity Dataset Enriched
with "Annotator Rationales."
Availability
------------
This dataset is available from my (Omar's) home page:
http://cs.jhu.edu/~ozaidan/rationales
Citation
--------
This dataset was introduced and first used in:
Zaidan, Omar, Jason Eisner, and Christine Piatko (2007). Using
"Annotator Rationales" to Improve Machine Learning for Text
Categorization. NAACL HLT 2007; Proceedings of the Main
Conference, pp. 260-267.
@InProceedings{zaidan-eisner-piatko:2007:MainConf,
author = {Zaidan, Omar and Eisner, Jason and Piatko, Christine},
title = {Using ``Annotator Rationales'' to Improve Machine Learning for
Text Categorization},
booktitle = {NAACL HLT 2007; Proceedings of the Main Conference},
month = {April},
year = {2007},
pages = {260--267},
}
Please cite the above paper if you use the dataset in your work. If you are
interested, a slideshow about our work is also available at the URL mentioned
above, in addition to the paper itself.
What are "rationales"?
----------------------
This dataset is based on the movie review polarity dataset (v2.0) collected and
maintained by Bo Pang and Lillian Lee. Their dataset (we'll call it PL2.0)
consists of 1000 positive and 1000 negative movie reviews obtained from the
Internet Movie Database (IMDb) review archive.
The main contribution of this release is the enrichment of the documents with
"annotator rationales," a concept we describe in our NAACL HLT 2007 paper.
Basically, "rationales" are segments of the text that support an annotator's
classification. Let's say we have a movie review that is labeled as positive
(i.e. the writer has a favorable opinion of the movie). Then the rationales
would be segments of the text that support the claim (by an annotator) that the
review is, indeed, positive.
Here are some examples of positive rationales (the segments enclosed by double
square brackets):
• [[you will enjoy the hell out of]] American Pie.
• fortunately, they [[managed to do it in an interesting and funny way]].
• he is [[one of the most exciting martial artists on the big screen]],
continuing to perform his own stunts and [[dazzling audiences]] with his
flashy kicks and punches.
• the romance was [[enchanting]].
And here are some examples of negative rationales:
• A woman in peril. A confrontation. An explosion. The end. [[Yawn. Yawn.
Yawn.]]
• when a film makes watching Eddie Murphy [[a tedious experience, you know
something is terribly wrong]].
• the movie is [[so badly put together]] that even the most casual viewer may
notice the [[miserable pacing and stray plot threads]].
• [[don't go see]] this movie
Datasets
--------
This release includes two datasets, DS1 and DS2:
(DS1) A dataset of documents enriched with annotator rationales
In their experiments, Pang and Lee divided the 2000 documents into 10 folds
of 200 documents each. In our experiments, we used the last fold as an
evaluation set, and so did not annotate its documents with rationales.
Therefore, we provide here annotator rationales for 1800 documents (900
positive and 900 negative).
Positive rationales in positive reviews are tagged with <POS> and </POS>, and
negative rationales in negative reviews are tagged with <NEG> and </NEG>.
No positive rationales were tagged in negative reviews, and no negative
rationales were tagged in positive reviews. (A small parsing sketch appears
after the DS2 description below.)
(DS2) Documents without any rationale annotation
Although PL2.0 can be obtained directly from Pang's home page, the dataset
needed a bit of cleaning up (in my opinion). We provide here the dataset
that we worked with. DS2 is simply DS1 but with the rationale tags removed.
Many of the differences between our DS2 and PL2.0 are quite minor, but some
documents were shortened significantly during the cleanup. You can choose to
just use PL2.0 in your work and accept the shortcomings in some documents as
annotator-caused noise, but feel free to use the cleaned-up version we
provide here (DS2).
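For illustration, here is a minimal Python sketch of how you might pull the
rationale segments out of a DS1 document, and how stripping the tags recovers
the corresponding DS2 text. It assumes the <POS>/<NEG> tag names described
above, and the file path is only a placeholder that depends on where you
unpack the data.

    import re

    # Minimal sketch: read one annotated review from DS1 and pull out its
    # rationale segments. The path is a placeholder; point it at wherever
    # you unpacked the dataset.
    with open("DS1/pos/pos_025.txt", encoding="utf-8") as f:
        annotated = f.read()

    # Rationales are marked inline with <POS>...</POS> (positive reviews)
    # or <NEG>...</NEG> (negative reviews).
    pos_rationales = re.findall(r"<POS>(.*?)</POS>", annotated, flags=re.DOTALL)
    neg_rationales = re.findall(r"<NEG>(.*?)</NEG>", annotated, flags=re.DOTALL)
    print(len(pos_rationales), "positive rationale segment(s)")
    print(len(neg_rationales), "negative rationale segment(s)")

    # Stripping the tags (but keeping the text inside them) recovers the
    # corresponding DS2 document.
    plain = re.sub(r"</?(?:POS|NEG)>", "", annotated)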
PS: Even if you do use DS2, I think you really should still cite Pang
& Lee's work because it _is_ their dataset after all. A URL to this
dataset is appropriate.
Clearly, if you use DS1, you should cite our work :)
PPS: If you want to get a feel for how/why DS2 is "cleaner" than PL2.0,
you can compare the following document pairs:
(cv392_12238.txt/neg_392.txt)
(cv456_20370.txt/neg_456.txt)
(cv557_12237.txt/neg_557.txt)
(cv440_15243.txt/pos_440.txt)
(cv054_4101.txt/neg_054.txt)
(cv320_9693.txt/neg_320.txt)
(cv025_3108.txt/pos_025.txt)
PPPS: If you *really* want to know more, ask me (I actually did keep
documentation on how DS2 is cleaner than PL2.0, but I'm not
including it in this README)
The End (for now)