Mark Dredze

     Johns Hopkins University
Research Scientist Human Language Technology Center of Excellence (HLTCOE)
Assistant Research Professor Department of Computer Science
Center for Language and Speech Processing (CLSP)
Machine Learning Group
Center for Population Health Information Technology (CPHIT), Bloomberg
Health Sciences Informatics, School of Medicine
Social Media and Health Research Group
Social Media for Public Health
Institute for Global Tobacco Control
Contact:   |  www.cs.jhu.edu/~mdredze   www.dredze.com  |  @mdredze
Office: Stieff 181    (410) 516-6786        ORCID: 0000-0002-0422-2474        Google Scholar Profile

I get a lot of emails asking me for data or code from one of my papers. If you are wondering, the answer is yes! I try to provide both data and code so that others can reproduce or compare against my results. Sadly, I don't always post data or code for lack of time, but I can usually make them available if you email me.

Datasets

TAC 2009 Entity Linking [Link]
A collection of manually linked training examples to supplement those provided in the TAC 2009 KBP task. These are described in my Coling 2010 paper on entity linking. If you use this data, please cite:
Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, Tim Finin. Entity Disambiguation for Knowledge Base Population. Conference on Computational Linguistics (Coling), 2010.

Image Spam Dataset [Link]
A collection of ham and spam images taken from real user email.

Multi-Domain Sentiment Dataset [Link]
Product reviews from several different product types taken from Amazon.com.

Attachment Prediction Email (Email for data)
Enron emails annotated with attachment information and cleaned of numerous artifacts inserted by email programs.

Twitter First Name, Last Name, and Location Clusters [Link]
A set of clusters extracted from Twitter that contains first names, last names, and locations. We used these clusters in our NAACL 2013 paper.

Twitter Hurricane Sandy Dataset [Link]
A collection of tweets from areas hit by Hurricane Sandy (2012) in the United States. This dataset is meant for research in social media disaster response.

Twitter Named Entity Recognition Dataset [Link]
A collection of tweets tagged for named entities. These were created as described in this paper. Thanks to Dirk Hovy for preparing the data as part of his LREC 2014 paper, which contains a larger collection of Twitter NER data.

Influenza Twitter Annotations [Link]
These annotations were created for the paper:
Alex Lamb, Michael J. Paul, Mark Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
The annotations label whether a tweet is related to influenza, whether it expresses awareness of the flu or an actual infection, and whether the tweet is about the author or someone else. The files include tweet IDs, which you can use to download the data.
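For example, here is a minimal hydration sketch using tweepy (this assumes tweepy 4.x, a Twitter API v2 bearer token, and a one-ID-per-line file; the token and file name are placeholders, not part of the release):

    # Hydrate tweet IDs into full tweets with tweepy (assumes tweepy 4.x and a v2 bearer token).
    import tweepy

    client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

    with open("flu_tweet_ids.txt") as f:           # placeholder file name, one tweet ID per line
        tweet_ids = [line.strip() for line in f if line.strip()]

    # The v2 lookup endpoint accepts at most 100 IDs per request.
    for start in range(0, len(tweet_ids), 100):
        batch = tweet_ids[start:start + 100]
        response = client.get_tweets(batch, tweet_fields=["created_at", "lang"])
        for tweet in response.data or []:          # deleted/protected tweets are simply missing
            print(tweet.id, tweet.created_at, tweet.text)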

Health Twitter Annotations [Link]
These annotations were created for the paper:
Michael J. Paul, Mark Dredze. A Model for Mining Public Health Topics from Twitter. Technical Report, Johns Hopkins University, 2011.
The annotations label tweets as they relate to health and are described on page 2 of the paper. The file includes tweet IDs, which you can use to download the data.

Twitter Health Keywords [Link]
These files contain the keywords we use to collect and identify health-related tweets.
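As a rough illustration of how such a keyword list can be applied (the file name and the simple whole-word matching below are assumptions for illustration, not our actual collection pipeline):

    # Toy keyword filter: flag tweets whose text contains any health keyword as a whole word.
    # The keyword file name and matching rule are illustrative assumptions.
    import re

    with open("health_keywords.txt") as f:
        keywords = [line.strip().lower() for line in f if line.strip()]

    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b")

    def is_health_related(tweet_text):
        return pattern.search(tweet_text.lower()) is not None

    print(is_health_related("I think I'm coming down with the flu"))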

Twitter Grammy XDoc Corpus: Entity Linking and Disambiguation [Link]
This corpus contains tweets about the Grammy Award ceremony annotated for entity linking and cross-document coreference resolution (entity disambiguation). The corpus is described in our 2016 EMNLP workshop paper at SocialNLP.

Named Entity Recognition and Entity Linking for Speech [Link]
This corpus contains broadcast news transcripts annotated for named entities and entity linking against the TAC KBP 2009 corpus. This was used in our NAACL 2015 paper "Entity Linking for Spoken Language" and in our 2011 Interspeech paper "OOV Sensitive Named-Entity Recognition in Speech".

Flu Vaccination Tweets [Link]
This dataset contains annotations for whether a tweet is relevant to the topic of flu vaccination, and whether the author intends to receive a flu vaccine. Analysis of this dataset was published in:
Xiaolei Huang, Michael C. Smith, Michael Paul, Dmytro Ryzhkov, Sandra Quinn, David Broniatowski, Mark Dredze. Examining Patterns of Influenza Vaccination in Social Media. AAAI Joint Workshop on Health Intelligence (W3PHIAI), 2017.

Vaccination Sentiment and Relevance Tweets [Link]
This dataset contains annotations for whether a tweet is relevant to the topic of vaccinations, and whether the author expresses a positive or negative view about vaccines. Analysis of this dataset was published in:
Michael Smith, David A. Broniatowski, Mark Dredze. Using Twitter to Examine Social Rationales for Vaccine Refusal. International Engineering Systems Symposium (CESUN), 2016.
Mark Dredze, David A. Broniatowski, Michael Smith, Karen M. Hilyard. Understanding Vaccine Refusal: Why We Need Social Media Now. American Journal of Preventive Medicine, 2015.

Zika Conspiracy Tweets [Link]
This dataset contains annotations for whether a tweet about Zika contains pseudo-scientific information. Analysis of this dataset was published in:
Mark Dredze, David A. Broniatowski, Karen M. Hilyard. Zika Vaccine Misconceptions: A Social Media Analysis. Vaccine, 2016.

Code

Automated Reviewer Assignment (used in multiple ACL-affiliated conferences) [Link] [Code]
I authored a process for automatically assigning reviewers to an area for ACL-affiliated conferences. The code is freely available for others to use. Let me know if you plan on using this system; I'm happy to answer questions.

SPRITE [Link]
A general purpose topic modeling package that implements SPRITE from our TACL 2015 paper. The package supports multi-threaded training and makes implementing new models easier.

PARMA [Link]
PARMA is our predicate argument aligner, published at ACL 2013.

Structured Learning at Penn [Link]
This is a collection of software developed by me and others in Fernando Pereira's research group at UPenn. It is designed for a range of machine learning tasks, such as dependency parsing, structured learning, gene prediction, and gene mention finding.

Confidence Weighted Learning Library (Email for code)
We have collected most of the core algorithms in the confidence weighted learning framework for release as a software library. Please email me for the code.
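For a flavor of the framework, here is a small sketch of one update in that family (an AROW-style update with a diagonal covariance, following Crammer, Kulesza, and Dredze 2009); it is an illustrative reimplementation, not the library's code:

    # Illustrative AROW-style confidence weighted update with a diagonal covariance.
    # This is a sketch for intuition, not the released library's implementation.
    import numpy as np

    def arow_update(mu, sigma, x, y, r=1.0):
        # mu: mean weight vector, sigma: per-feature variances,
        # x: feature vector, y: label in {-1, +1}, r: regularization parameter.
        margin = y * np.dot(mu, x)
        if margin >= 1.0:
            return mu, sigma                      # no hinge loss, no update
        confidence = np.dot(sigma * x, x)         # x^T diag(sigma) x
        beta = 1.0 / (confidence + r)
        alpha = (1.0 - margin) * beta
        mu = mu + alpha * y * (sigma * x)         # shift the mean toward a correct prediction
        sigma = sigma - beta * (sigma * x) ** 2   # shrink variance on the observed features
        return mu, sigma

    mu, sigma = np.zeros(4), np.ones(4)
    mu, sigma = arow_update(mu, sigma, np.array([1.0, 0.0, 1.0, 0.0]), +1)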

Carmen [Java, Python]
Carmen is a library for geolocating tweets. Given a tweet, Carmen will return Location objects that represent a physical location. Carmen uses both coordinates and other information in a tweet to make geolocation decisions. It's not perfect, but this greatly increases the number of geolocated tweets over what Twitter provides.

The Python and Java versions don't give exactly the same results due to differences in the dependencies. If you use Carmen, please cite:
Mark Dredze, Michael J Paul, Shane Bergsma, Hieu Tran. Carmen: A Twitter Geolocation System with Applications to Public Health. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013.
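A rough sketch of typical usage of the Python version (written from memory, so treat the exact function and attribute names as assumptions and check the package README):

    # Resolve tweet locations with Carmen (Python). Function and attribute names below
    # are from memory and may differ slightly from the current package.
    import json
    import carmen

    resolver = carmen.get_resolver()
    resolver.load_locations()

    with open("tweets.json") as f:              # one tweet JSON object per line (assumption)
        for line in f:
            tweet = json.loads(line)
            result = resolver.resolve_tweet(tweet)
            if result is not None:
                provisional, location = result  # Location describes country/state/county/city
                print(location.country, location.state, location.city)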

Twitter Stream Downloader [Link]
Code for downloading data using the Twitter streaming API.

Mingpipe [Link]
Code for Chinese name matching. Given two Chinese person names, it assigns a score reflecting how likely the two names are to refer to the same person.

csLDA [Link]
Cross language topic models based on code-switched documents. Documents can be in different languages and some "glue" documents contain multiple languages. csLDA learns topics for each language and aligns topics across languages.

Golden Horse [Link]
Code for named entity recognition using embeddings, focused on Chinese social media (Weibo). This code implements the methods in our 2015 EMNLP paper.

Multiview Representations of Twitter Users [Link]
Code and data for our 2016 ACL paper.

Demographer: Gender Identification for Social Media [Link]
Demographer is a Python package that identifies demographic characteristics based on a name. It is designed for Twitter: given a user's name, it returns information about their likely demographics.
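A sketch of how I recall the package being used (the process_tweet entry point is an assumption; check the README for the current API):

    # Run Demographer on a single tweet. The process_tweet entry point is an assumption
    # based on memory; consult the package README for the current API.
    import json
    from demographer import process_tweet

    with open("tweet.json") as f:               # a single tweet JSON object (assumption)
        tweet = json.load(f)

    # Returns a dictionary of inferred demographic attributes for the tweet's author.
    print(process_tweet(tweet))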