Publications from 2016
-
Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation
Work on cross-document coreference resolution (CDCR) has primarily focused on news articles, with little to no work on social media. Yet social media may be particularly challenging, since short messages provide little context and informal names are pervasive. We introduce a new Twitter corpus, annotated with entity clusters, that supports CDCR. Our corpus draws from Twitter data surrounding the 2013 Grammy music awards ceremony, providing a large set of annotated tweets focusing on a single event. To establish a baseline, we evaluate two CDCR systems and consider the performance impact of each system component. Furthermore, we augment one system to include temporal information, which can be helpful when documents (such as tweets) arrive in a specific order. Finally, we include annotations linking the entities to a knowledge base to support entity linking. Our corpus is available at https://bitbucket.org/mdredze/tgx
Mark Dredze, Nicholas Andrews, Jay DeYoung
Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, 2016