Mark Dredze

     Johns Hopkins University
Research Scientist, Human Language Technology Center of Excellence (HLTCOE)
Assistant Research Professor, Department of Computer Science
Center for Language and Speech Processing (CLSP)
Machine Learning Group
Center for Population Health Information Technology (CPHIT), Bloomberg
Health Sciences Informatics, School of Medicine
Social Media and Health Research Group
Social Media for Public Health
Institute for Global Tobacco Control
Contact: @mdredze
Office: Stieff 181    (410) 516-6786        ORCID: 0000-0002-0422-2474        Google Scholar Profile


     2017 (11 Publications)
     Jon-Patrick Allem, Eric C. Leas, Theodore L. Caputi, Mark Dredze, Benjamin M. Althouse, Seth M. Noar, John W. Ayers. The Charlie Sheen Effect on Rapid In-home Human Immunodeficiency Virus Test Sales. Prevention Science, 2017. [PDF] [Bibtex]
     Ning Gao, Douglas Oard, Mark Dredze. Support for Interactive Identification of Mentioned Entities in Conversational Speech. International Conference on Research and Development in Information Retrieval (SIGIR) (short paper), 2017. [Bibtex]
     Nicholas Andrews, Mark Dredze, Benjamin Van Durme, Jason Eisner. Bayesian Modeling of Lexical Resources for Low-Resource Settings. Association for Computational Linguistics (ACL), 2017. [Bibtex]
     Travis Wolfe, Mark Dredze, Benjamin Van Durme. Pocket Knowledge Base Population. Association for Computational Linguistics (ACL) (short paper), 2017. [Bibtex]
     Ning Gao, Mark Dredze, Douglas Oard. Person Entity Linking in Email with NIL Detection. Journal of the Association for Information Science and Technology (JASIST), 2017. [Bibtex]
     Ann Irvine, Mark Dredze. Harmonic Grammar, Optimality Theory, and Syntax Learnability: An Empirical Exploration of Czech Word Order. Unpublished Manuscript, 2017. [PDF] [Bibtex]
This work presents a systematic theoretical and empirical comparison of the major algorithms that have been proposed for learning Harmonic and Optimality Theory grammars (HG and OT, respectively). By comparing learning algorithms, we are also able to compare the closely related OT and HG frameworks themselves. Experimental results show that the additional expressivity of the HG framework over OT affords performance gains in the task of predicting the surface word order of Czech sentences. We compare the perceptron with the classic Gradual Learning Algorithm (GLA), which learns OT grammars, as well as the popular Maximum Entropy model. In addition to showing that the perceptron is theoretically appealing, our work shows that the performance of the HG model it learns approaches that of the upper bound in prediction accuracy on a held out test set and that it is capable of accurately modeling observed variation.
     Adrian Benton, Glen Coppersmith, Mark Dredze. Ethical Research Protocols for Social Media Health Research. EACL Workshop on Ethics in Natural Language Processing, 2017. [PDF] [Bibtex]
     John Ayers, Eric C Leas, Jon-Patrick Allem, Adrian Benton, Mark Dredze, Benjamin M Althouse, Tess B Cruz, Jennifer B Unger. Why Do People Use Electronic Nicotine Delivery Systems (Electronic Cigarettes)? A Content Analysis of Twitter, 2012-2015. PLoS One, 2017. [PDF] [Bibtex]
     Anthony Nastasi, Tyler Bryant, Joseph K. Canner, Mark Dredze, Melissa S. Camp, Neeraja Nagarajan. Breast Cancer Screening and Social Media: a Content Analysis of Evidence Use and Guideline Opinions on Twitter. Journal of Cancer Education, 2017. [PDF] [Bibtex]
There is ongoing debate regarding the best mammography screening practices. Twitter has become a powerful tool for disseminating medical news and fostering healthcare conversations; however, little work has been done examining these conversations in the context of how users are sharing evidence and discussing current guidelines for breast cancer screening. To characterize the Twitter conversation on mammography and assess the quality of evidence used as well as opinions regarding current screening guidelines, individual tweets using mammography-related hashtags were prospectively pulled from Twitter from 5 November 2015 to 11 December 2015. Content analysis was performed on the tweets by abstracting data related to user demographics, content, evidence use, and guideline opinions. Standard descriptive statistics were used to summarize the results. Comparisons were made by demographics, tweet type (testable claim, advice, personal experience, etc.), and user type (non-healthcare, physician, cancer specialist, etc.). The primary outcomes were how users are tweeting about breast cancer screening, the quality of evidence they are using, and their opinions regarding guidelines. The most frequent user type of the 1345 tweets was "non-healthcare" with 323 tweets (32.5%). Physicians had 1.87 times higher odds (95% CI, 0.69–5.07) of providing explicit support with a reference and 11.70 times higher odds (95% CI, 3.41–40.13) of posting a tweet likely to be supported by the scientific community compared to non-healthcare users. Only 2.9% of guideline tweets approved of the guidelines while 14.6% claimed to be confused by them. Non-healthcare users comprise a significant proportion of participants in mammography conversations, with tweets often containing claims that are false, not explicitly backed by scientific evidence, and in favor of alternative "natural" breast cancer prevention and treatment. Furthermore, users appear to have low approval of, and confusion regarding, screening guidelines. These findings suggest that more efforts are needed to educate and disseminate accurate information to the general public regarding breast cancer prevention modalities, emphasizing the safety of mammography and the harms of replacing conventional prevention and treatment modalities with unsubstantiated alternatives.
     Xiaolei Huang, Michael C. Smith, Michael Paul, Dmytro Ryzhkov, Sandra Quinn, David Broniatowski, Mark Dredze. Examining Patterns of Influenza Vaccination in Social Media. AAAI Joint Workshop on Health Intelligence (W3PHIAI), 2017. [PDF] [Bibtex] [Data]
Traditional data on influenza vaccination has several limitations: high cost, limited coverage of underrepresented groups, and low sensitivity to emerging public health issues. Social media, such as Twitter, provide an alternative way to understand a population's vaccination-related opinions and behaviors. In this study, we build and employ several natural language classifiers to examine and analyze behavioral patterns regarding influenza vaccination in Twitter across three dimensions: temporality (by week and month), geography (by US region), and demography (by gender). Our best results are highly correlated with official government data, with a correlation over 0.90, providing validation of our approach. We then suggest a number of directions for future work.
     Neeraja Nagarajan, Husain Alshaikh, Anthony Nastasi, Blair Smart, Zackary Berger, Eric B. Schneider, Mark Dredze, Joseph K. Canner, Nita Ahuja. The Utility of Twitter in Generating High-Quality Conversations about Surgical Care. Academic Surgical Congress, 2017. [Bibtex]
Introduction: There is growing interest among various stakeholders in using social media sites to discuss healthcare issues. However, little is known about how social media sites are used to discuss surgical care. There is also a lack of understanding of the types of content generated and the quality of the information shared in social media platforms about surgical care issues. We therefore sought to identify and summarize conversations on surgical care in Twitter, a popular microblogging website. Methods: A comprehensive list of surgery-related hashtags was used to pull individual tweets from 3/27-4/27/2015. Four independent reviewers blindly analyzed 25 tweets to develop themes for extraction from a larger sample. The themes were broadly divided further to obtain data at the levels of the user, the tweet, the content of the tweet and personal information shared (Figure I). Standard descriptive statistical analysis and simple logistic regression analysis were used. Results: In total, 17,783 tweets were pulled and 1000 from 615 unique users were randomly selected for analysis. Most users were from North America (62.4%) and non-healthcare related individuals (31.8%). Healthcare organizations generated 12.4%, and surgeons 9.5%, of tweets. Overall, 67.4% were original tweets and 79.0% contained a hyperlink (11% to healthcare and 8.7% to peer-reviewed sources). The common areas of surgery discussed were global surgery/health systems (18.4%), followed by general surgery (15.6%). Among personal tweets (n=236), 31.1% concerned surgery on family/friends and 24.4% on the user; 61.1% discussed procedures already performed and 58.0% used positive language about their personal experience with surgical care. Surgical news/opinion was present in 45% of tweets and 13.7% contained evidence-based information. Non-healthcare professionals were 53.5% (95% CI: 3.8%-77.5%, p=0.039) and 72.8% (95% CI: 21.1%-91.7%, p=0.017) less likely to generate a tweet that contained evidence-based information and to quote from a peer-reviewed journal, respectively, when compared to other users. Conclusion: Our study demonstrates that while healthcare professionals and organizations tend to share higher quality data on surgical care on social media, non-health care related individuals largely drive the conversation. Fewer than half of all surgery-related tweets included surgical news/opinion; only 14% included evidence-based information and just 9% linked to peer-reviewed sources. As social media outlets become important sources of actionable information, leaders in the surgical community should develop professional guidelines to maximize this versatile platform to disseminate accurate and high-quality content on surgical issues to a wide range of audiences.

     2016 (33 Publications)
     Anietie Andy, Satoshi Sekine, Mugizi Rwebangira, Mark Dredze. Name Variation in Community Question Answering Systems. COLING Workshop on Noisy User-generated Text, 2016. [PDF] [Bibtex] (Best Paper Award)
Community question answering systems are forums where users can ask and answer questions in various categories. Examples are Yahoo! Answers, Quora, and Stack Overflow. A common challenge with such systems is that a significant percentage of asked questions are left unanswered. In this paper, we propose an algorithm to reduce the number of unanswered questions in Yahoo! Answers by reusing the answer to the past resolved question on the site that is most similar to the unanswered question. Semantically similar questions could be worded differently, thereby making it difficult to find questions that have shared needs. For example, "Who is the best player for the Reds?" and "Who is currently the biggest star at Manchester United?" have a shared need but are worded differently; also, "Reds" and "Manchester United" are both used to refer to the soccer team Manchester United football club. In this research, we focus on question categories that contain a large number of named entities and entity name variations. We show that in these categories, entity linking can be used to identify relevant past resolved questions that share needs with a given question by disambiguating named entities and matching these questions based on the disambiguated entities, identified entities, and knowledge base information related to these entities. We evaluated our algorithm on a new dataset constructed from Yahoo! Answers. The dataset contains annotated question pairs, (Qgiven, [Qpast, Answer]). We carried out experiments on several question categories and show that an entity-based approach gives good performance when searching for similar questions in entity-rich categories.
     Travis Wolfe, Mark Dredze, Benjamin Van Durme. A Study of Imitation Learning Methods for Semantic Role Labeling. EMNLP Workshop on Structured Prediction for NLP, 2016. [PDF] [Bibtex]
Global features have proven effective in a wide range of structured prediction problems but come with high inference costs. Imitation learning is a common method for training models when exact inference isn't feasible. We study imitation learning for Semantic Role Labeling (SRL) and analyze the effectiveness of the Violation Fixing Perceptron (VFP) (Huang et al., 2012) and Locally Optimal Learning to Search (LOLS) (Chang et al., 2015) frameworks with respect to SRL global features. We describe problems in applying each framework to SRL and evaluate the effectiveness of some solutions. We also show that action ordering, including easy-first inference, has a large impact on the quality of greedy global models.
     Rebecca Knowles, Josh Carroll, Mark Dredze. Demographer: Extremely Simple Name Demographics. EMNLP Workshop on Natural Language Processing and Computational Social Science, 2016. [PDF] [Bibtex] [Code]
The lack of demographic information available when conducting passive analysis of social media content can make it difficult to compare results to traditional survey results. We present DEMOGRAPHER, a tool that predicts gender from names, using name lists and a classifier with simple character-level features. By relying only on a name, our tool can make predictions even without extensive user-authored content. We compare DEMOGRAPHER to other available tools and discuss differences in performance. In particular, we show that DEMOGRAPHER performs well on Twitter data, making it useful for simple and rapid social media demographic inference.
     John W Ayers, Eric C Leas, Mark Dredze, Jon Allem, Jurek G Grabowski, Linda Hill. Pokémon GO: A New Distraction for Drivers and Pedestrians. JAMA Internal Medicine, 2016. [PDF] [Bibtex] (Ranked in the top .02% of 6.5m research outputs by Altmetric)
Pokémon GO, an augmented reality game, has swept the nation. As players move, their avatar moves within the game, and players are then rewarded for collecting Pokémon placed in real-world locations. By rewarding movement, the game incentivizes physical activity. However, if players use their cars to search for Pokémon they negate any health benefit and incur serious risk. Motor vehicle crashes are the leading cause of death among 16- to 24-year-olds, whom the game targets. Moreover, according to the American Automobile Association, 59% of all crashes among young drivers involve distractions within 6 seconds of the accident. We report on an assessment of drivers and pedestrians distracted by Pokémon GO and crashes potentially caused by Pokémon GO by mining social and news media reports.
     Mark Dredze, Nicholas Andrews, Jay DeYoung. Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation. EMNLP Workshop on Natural Language Processing for Social Media, 2016. [PDF] [Bibtex] [Code] [Data]
Work on cross document coreference resolution (CDCR) has primarily focused on news articles, with little to no work for social media. Yet social media may be particularly challenging since short messages provide little context, and informal names are pervasive. We introduce a new Twitter corpus that contains entity annotations for entity clusters that supports CDCR. Our corpus draws from Twitter data surrounding the 2013 Grammy music awards ceremony, providing a large set of annotated tweets focusing on a single event. To establish a baseline we evaluate two CDCR systems and consider the performance impact of each system component. Furthermore, we augment one system to include temporal information, which can be helpful when documents (such as tweets) arrive in a specific order. Finally, we include annotations linking the entities to a knowledge base to support entity linking.
     John W Ayers, Benjamin M. Althouse, Eric C Leas, Ted Alcorn, Mark Dredze. Big Media Data Can Inform Gun Violence Prevention. Bloomberg Data for Good Exchange, 2016. [PDF] [Bibtex]
The scientific method drives improvements in public health, but a strategy of obstructionism has impeded scientists from gathering even a minimal amount of information to address America's gun violence epidemic. We argue that in spite of a lack of federal investment, large amounts of publicly available data offer scientists an opportunity to measure a range of firearm-related behaviors. Given the diversity of available data, including news coverage, social media, web forums, online advertisements, and Internet searches (to name a few), there are ample opportunities for scientists to study everything from trends in particular types of gun violence to gun-related behaviors (such as purchases and safety practices) to public understanding of and sentiment towards various gun violence reduction measures. Science has been sidelined in the gun violence debate for too long. Scientists must tap the big media datastream and help resolve this crisis.
     Adrian Benton, Braden Hancock, Glen Coppersmith, John W Ayers, Mark Dredze. After Sandy Hook Elementary: A Year in the Gun Control Debate on Twitter. Bloomberg Data for Good Exchange, 2016. [PDF] [Bibtex]
The mass shooting at Sandy Hook elementary school on December 14, 2012 catalyzed a year of active debate and legislation on gun control in the United States. Social media hosted an active public discussion where people expressed their support and opposition to a variety of issues surrounding gun legislation. In this paper, we show how a content based analysis of Twitter data can provide insights and understanding into this debate. We estimate the relative support and opposition to gun control measures, along with a topic analysis of each camp by analyzing over 70 million gun-related tweets from 2013. We focus on spikes in conversation surrounding major events related to guns throughout the year. Our general approach can be applied to other important public health and political issues to analyze the prevalence and nature of public opinion.
     Eric C Leas, Benjamin M Althouse, Mark Dredze, Nick Obradovich, James H Fowler, Seth M Noar, Jon-Patrick Allem, John W Ayers. Big data sensors of organic advocacy: The case of Leonardo DiCaprio and Climate Change. PLoS One, 2016. [PDF] [Bibtex]
The strategies that experts have used to share information about social causes have historically been top-down, meaning the most influential messages are believed to come from planned events and campaigns. However, more people are independently engaging with social causes today than ever before, in part because online platforms allow them to instantaneously seek, create, and share information. In some cases this "organic advocacy" may rival or even eclipse top-down strategies. Big data analytics make it possible to rapidly detect public engagement with social causes by analyzing the same platforms from which organic advocacy spreads. To demonstrate this claim we evaluated how Leonardo DiCaprio's 2016 Oscar acceptance speech citing climate change motivated global English language news (Bloomberg Terminal news archives), social media (Twitter postings) and information seeking (Google searches) about climate change. Despite an insignificant increase in traditional news coverage (54%; 95%CI: -144 to 247), tweets including the terms "climate change" or "global warming" reached record highs, increasing 636% (95%CI: 573–699) with more than 250,000 tweets the day DiCaprio spoke. In practical terms the "DiCaprio effect" surpassed the daily average effect of the 2015 Conference of the Parties (COP) and the Earth Day effect by a factor of 3.2 and 5.3, respectively. At the same time, Google searches for "climate change" or "global warming" increased 261% (95%CI, 186–335) and 210% (95%CI 149–272) the day DiCaprio spoke and remained higher for 4 more days, representing 104,190 and 216,490 searches. This increase was 3.8 and 4.3 times larger than the increases observed during COP's daily average or on Earth Day. Searches were closely linked to content from DiCaprio's speech (e.g., "hottest year"), as unmentioned content did not have search increases (e.g., "electric car"). Because these data are freely available in real time our analytical strategy provides substantial lead time for experts to detect and participate in organic advocacy while an issue is salient. Our study demonstrates new opportunities to detect and aid agents of change and advances our understanding of communication in the 21st century media landscape.
     Michael J. Paul, Margaret S. Chisolm, Matthew W. Johnson, Ryan G. Vandrey, Mark Dredze. Assessing the validity of online drug forums as a source for estimating demographic and temporal trends in drug use. Journal of Addiction Medicine, 2016. [PDF] [Bibtex]
Objectives: Addiction researchers have begun monitoring online forums to uncover self-reported details about use and effects of emerging drugs. The use of such online data sources has not been validated against data from large epidemiological surveys. This study aimed to characterize and compare the demographic and temporal trends associated with drug use as reported in online forums and in a large epidemiological survey. Methods: Data were collected from the website,, from January 2007 through August 2012 (143,416 messages posted by 8,087 members) and from the United States National Survey on Drug Use and Health (NSDUH) from 2007-2012. Measures of forum participation levels were compared with and validated against two measures from the NSDUH survey data: percentage of people using the drug in last 30 days and percentage using the drug more than 100 times in the past year. Results: For established drugs (e.g., cannabis), significant correlations were found across demographic groups between and the NSDUH survey data, while weaker, non-significant correlations were found with temporal trends. Emerging drugs (e.g., Salvia divinorum) were strongly associated with male users in the forum, in agreement with survey-derived data, and had temporal patterns that increased in synchrony with poison control reports. Conclusions: These results offer the first assessment of online drug forums as a valid source for estimating demographic and temporal trends in drug use. The analyses suggest that online forums are a reliable source for estimation of demographic associations and early identification of emerging drugs, but a less reliable source for measurement of long-term temporal trends.
     David A Broniatowski, Mark Dredze, Karen M Hilyard, Maeghan Dessecker, Sandra Crouse Quinn, Amelia Jamison, Michael J. Paul, Michael C. Smith. Both Mirror and Complement: A Comparison of Social Media Data and Survey Data about Flu Vaccination. American Public Health Association, 2016. [PDF] [Bibtex]
     Matthew Biggerstaff, David Alper, Mark Dredze, Spencer Fox, Isaac Chun-Hai Fung, Kyle S. Hickmann, Bryan Lewis, Roni Rosenfeld, Jeffrey Shaman, Ming-Hsiang Tsou, Paola Velardi, Alessandro Vespignani, Lyn Finelli. Results from the Centers for Disease Control and Prevention's Predict the 2013–2014 Influenza Season Challenge. BMC Infectious Diseases, 2016. [PDF] [Bibtex]
Background: Early insights into the timing of the start, peak, and intensity of the influenza season could be useful in planning influenza prevention and control activities. To encourage development and innovation in influenza forecasting, the Centers for Disease Control and Prevention (CDC) organized a challenge to predict the 2013–14 United States influenza season. Methods: Challenge contestants were asked to forecast the start, peak, and intensity of the 2013–2014 influenza season at the national level and at any or all Health and Human Services (HHS) region level(s). The challenge ran from December 1, 2013 to March 27, 2014; contestants were required to submit 9 biweekly forecasts at the national level to be eligible. The selection of the winner was based on expert evaluation of the methodology used to make the prediction and the accuracy of the prediction as judged against the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). Results: Nine teams submitted 13 forecasts for all required milestones. The first forecast was due on December 2, 2013; 3/13 forecasts received correctly predicted the start of the influenza season within one week, 1/13 predicted the peak within 1 week, 3/13 predicted the peak ILINet percentage within 1%, and 4/13 predicted the season duration within 1 week. For the prediction due on December 19, 2013, the number of forecasts that correctly forecasted the peak week increased to 2/13, the peak percentage to 6/13, and the duration of the season to 6/13. As the season progressed, the forecasts became more stable and were closer to the season milestones. Conclusion: Forecasting has become technically feasible, but further efforts are needed to improve forecast accuracy so that policy makers can reliably use these predictions. CDC and challenge contestants plan to build upon the methods developed during this contest to improve the accuracy of influenza forecasts.
     Mark Dredze, Manuel García-Herranz, Alex Rutherford, Gideon Mann. Twitter as a Source of Global Mobility Patterns for Social Good. ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, 2016. [PDF] [Bibtex]
Data on human spatial distribution and movement is essential for understanding and analyzing social systems. However, existing sources for this data are lacking in various ways: they are difficult to access, biased, have poor geographical or temporal resolution, or are significantly delayed. In this paper, we describe how geolocation data from Twitter can be used to estimate global mobility patterns and address these shortcomings. These findings will inform how this novel data source can be harnessed to address humanitarian and development efforts.
     Mark Dredze, David A Broniatowski, Karen M Hilyard. Zika Vaccine Misconceptions: A social media analysis. Vaccine, 2016. [PDF] [Bibtex] [Data]
     Mark Dredze, Prabhanjan Kambadur, Gary Kazantsev, Gideon Mann, Miles Osborne. How Twitter is Changing the Nature of Financial News Discovery. SIGMOD Workshop on Data Science for Macro-Modeling with Financial and Economic Datasets, 2016. [PDF] [Bibtex]
Access to the most relevant and current information is critical to financial analysis and decision making. Historically, financial news has been discovered through company press releases, required disclosures and news articles. More recently, social media has reshaped the financial news landscape, radically changing the dynamics of news dissemination. In this paper we discuss the ways in which Twitter, a leading social media platform, has contributed to changes in this landscape. We explain why today Twitter is a valuable source of material financial information and describe opportunities and challenges in using this novel news source for financial information discovery.
     Nanyun Peng, Mark Dredze. Learning Word Segmentation Representations to Improve Named Entity Recognition for Chinese Social Media. Association for Computational Linguistics (ACL) (short paper), 2016. [PDF] [Bibtex]
Named entity recognition, and other information extraction tasks, frequently use linguistic features such as part of speech tags or chunkings. For languages where word boundaries are not readily identified in text, word segmentation is a key first step to generating features for an NER system. While using word boundary tags as features is helpful, the signals that aid in identifying these boundaries may provide richer information for an NER system. New state-of-the-art word segmentation systems use neural models to learn representations for predicting word boundaries. We show that these same representations, jointly trained with an NER system, yield significant improvements in NER for Chinese social media. In our experiments, jointly training NER and word segmentation with an LSTM-CRF model yields nearly 5% absolute improvement over previously published results.
     Adrian Benton, Raman Arora, Mark Dredze. Learning Multiview Embeddings of Twitter Users. Association for Computational Linguistics (ACL) (short paper), 2016. [PDF] [Bibtex] [Code]
Low-dimensional vector representations are widely used as stand-ins for the text of words, sentences, and entire documents. These embeddings are used to identify similar words or make predictions about documents. In this work, we consider embeddings for social media users and demonstrate that these can be used to identify users who behave similarly or to predict attributes of users. In order to capture information from all aspects of a user's online life, we take a multiview approach, applying a weighted variant of Generalized Canonical Correlation Analysis (GCCA) to a collection of over 100,000 Twitter users. We demonstrate the utility of these multiview embeddings on three downstream tasks: user engagement, friend selection, and demographic attribute prediction.
     David Andre Broniatowski, Mark Dredze, Karen M Hilyard. Effective Vaccine Communication during the Disneyland Measles Outbreak. Vaccine, 2016. [PDF] [Bibtex]
Vaccine refusal rates have increased in recent years, highlighting the need for effective risk communication, especially over social media. Fuzzy-trace theory predicts that individuals encode bottom-line meaning ("gist") and statistical information ("verbatim") in parallel and those articles expressing a clear gist will be most compelling. We coded news articles (n = 4581) collected during the 2014–2015 Disneyland measles outbreak for content including statistics, stories, or bottom-line gists regarding vaccines and vaccine-preventable illnesses. We measured the extent to which articles were compelling by how frequently they were shared on Facebook. The most widely shared articles expressed bottom-line gists, although articles containing statistics were also more likely to be shared than articles lacking statistics. Stories had limited impact on Facebook shares. Results support Fuzzy Trace Theory's predictions regarding the distinct yet parallel impact of categorical gist and statistical verbatim information on public health communication.
     Ning Gao, Mark Dredze, Douglas Oard. Knowledge Base Population for Organization in Emails. NAACL Workshop on Automated Knowledge Base Construction (AKBC), 2016. [PDF] [Bibtex]
A prior study found that on average there are 6.3 named mentions of organizations found in email messages from the Enron collection, only about half of which could be linked to known entities in Wikipedia. That suggests a need for collection-specific approaches to entity linking, similar to those that have proven successful for person mentions. This paper describes a process for automatically constructing such a collection-specific knowledge base of organization entities for named mentions in Enron. A new public test collection for linking 130 mentions of organizations found in Enron email to either Wikipedia or to this new collection-specific knowledge base is also described. Together, Wikipedia entities plus the new collection-specific knowledge base cover 83% of the 130 organization mentions, a 14% (absolute) improvement over the 69% that could be linked to Wikipedia alone.
     Michael Smith, David A. Broniatowski, Mark Dredze. Using Twitter to Examine Social Rationales for Vaccine Refusal. International Engineering Systems Symposium (CESUN), 2016. [Bibtex] [Data]
     Mo Yu, Mark Dredze, Raman Arora, Matthew R. Gormley. Embedding Lexical Features via Low-rank Tensors. North American Chapter of the Association for Computational Linguistics (NAACL), 2016. [PDF] [Bibtex]
Modern NLP models rely heavily on engineered features, which often combine word and contextual information into complex lexical features. Such combination results in large numbers of features, which can lead to over-fitting. We present a new model that represents complex lexical features---comprised of parts for words, contextual information and labels---in a tensor that captures conjunction information among these parts. We apply low-rank tensor approximations to the corresponding parameter tensors to reduce the parameter space and improve prediction speed. Furthermore, we investigate two methods for handling features that include n-grams of mixed lengths. Our model achieves state-of-the-art results on tasks in relation extraction, PP-attachment, and preposition disambiguation.
     Mark Dredze, Miles Osborne, Prabhanjan Kambadur. Geolocation for Twitter: Timing Matters. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2016. [PDF] [Bibtex]
Automated geolocation of social media messages can benefit a variety of downstream applications. However, these geolocation systems are typically evaluated without attention to how changes in time impact geolocation. Since different people in different locations write messages at different times, these factors can significantly vary the performance of a geolocation system over time. We demonstrate cyclical temporal effects on geolocation accuracy in Twitter, as well as rapid drops as test data moves beyond the time period of training data. We show that temporal drift can effectively be countered with even modest online model updates.
     John W. Ayers, Benjamin M. Althouse, Mark Dredze, Eric C. Leas, Seth M. Noar. News and Internet Searches About Human Immunodeficiency Virus After Charlie Sheen's Disclosure. JAMA Internal Medicine, 2016. [PDF] [Bibtex] (Ranked in the top .03% of 4.8m research outputs by Altmetric)
Celebrity Charlie Sheen publicly disclosed his human immunodeficiency virus (HIV)-positive status on November 17, 2015. Could Sheen's disclosure, like similar announcements from celebrities, generate renewed attention to HIV? We provide an early answer by examining news trends to reveal discussion of HIV in the mass media and Internet searches to reveal engagement with HIV-related topics around the time of Sheen's disclosure.
     Neeraja Nagarajan, Blair Smart, Anthony Nastasi, Zoya J. Effendi, Sruthi Murali, Zackary Berger, Eric Schneider, Mark Dredze, Joseph Canner. An Analysis of Twitter Conversations on Global Surgical Care. Annual CUGH Global Health Conference, 2016. [Bibtex] (poster)
     John W. Ayers, J. Lee Westmaas, Eric C. Leas, Adrian Benton, Yunqi Chen, Mark Dredze, Benjamin Althouse. Leveraging Big Data to Improve Health Awareness Campaigns: A Novel Evaluation of the Great American Smokeout. JMIR Public Health and Surveillance, 2016. [PDF] [Bibtex]
Awareness campaigns are ubiquitous, but little is known about their potential effectiveness because traditional evaluations are often unfeasible. For 40 years, the "Great American Smokeout" (GASO) has encouraged media coverage and popular engagement with smoking cessation on the third Thursday of November as the nation's longest running awareness campaign. We proposed a novel evaluation framework for assessing awareness campaigns using the GASO as a case study by observing cessation-related news reports and Twitter postings, and cessation-related help seeking via Google, Wikipedia, and government-sponsored quitlines.
     Munmun De Choudhury, Emre Kiciman, Mark Dredze, Glen Coppersmith, Mrinal Kumar. Discovering Shifts to Suicidal Ideation from Mental Health Content in Social Media. Conference on Human Factors in Computing Systems (CHI), 2016. [PDF] [Bibtex] (Honorable Mention Award)
History of mental illness is a major factor behind suicide risk and ideation. However, research efforts toward characterizing and forecasting this risk are limited due to the paucity of information regarding suicide ideation, exacerbated by the stigma of mental illness. This paper fills gaps in the literature by developing a statistical methodology to infer which individuals could undergo transitions from mental health discourse to suicidal ideation. We utilize semi-anonymous support communities on Reddit as unobtrusive data sources to infer the likelihood of these shifts. We develop language and interactional measures for this purpose, as well as a propensity score matching based statistical approach. Our approach allows us to derive distinct markers of shifts to suicidal ideation. These markers can be modeled in a prediction framework to identify individuals likely to engage in suicidal ideation in the future. We discuss societal and ethical implications of this research.
     Animesh Koratana, Mark Dredze, Margaret Chisolm, Matthew Johnson, Michael J. Paul. Studying Anonymous Health Issues and Substance Use on College Campuses with Yik Yak. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2016. [PDF] [Bibtex]
This study investigates the public health intelligence utility of Yik Yak, a social media platform that allows users to anonymously post and view messages within precise geographic locations. Our dataset contains 122,179 "yaks" collected from 120 college campuses across the United States during 2015. We first present an exploratory analysis of the topics commonly discussed in Yik Yak, clarifying the health issues for which this may serve as a source of information. We then present an in-depth content analysis of data describing substance use, an important public health issue that is not often discussed in public social media, but commonly discussed on Yik Yak under the cloak of anonymity.
     Adrian Benton, Michael J. Paul, Braden Hancock, Mark Dredze. Collective Supervision of Topic Models for Predicting Surveys with Social Media. Association for the Advancement of Artificial Intelligence (AAAI), 2016. [PDF] [Bibtex]
This paper considers survey prediction from social media. We use topic models to correlate social media messages with survey outcomes and to provide an interpretable representation of the data. Rather than rely on fully unsupervised topic models, we use existing aggregated survey data to inform the inferred topics, a class of topic model supervision referred to as collective supervision. We introduce and explore a variety of topic model variants and provide an empirical analysis, with conclusions of the most effective models for this task.
     Michael Smith, David A. Broniatowski, Michael J. Paul, Mark Dredze. Towards Real-Time Measurement of Public Epidemic Awareness: Monitoring Influenza Awareness through Twitter. AAAI Spring Symposium on Observational Studies through Social Media and Other Human-Generated Content, 2016. [PDF] [Bibtex]
This study analyzes temporal trends in Twitter data pertaining to both influenza awareness and influenza infection during the 2012–13 influenza season in the US. We make use of classifiers to distinguish tweets that express a personal infection ("sick with the flu") versus a more general awareness ("worried about the flu"). While previous research has focused on estimating prevalence of influenza infection, little is known about trends in public awareness of the disease. Our analysis shows that infection and awareness have very different trends. In contrast to infection trends, awareness trends have little regional variation, and our experiments suggest that public awareness is primarily driven by news media.
     Blair J. Smart, Neeraja Nagarajan, Joe K. Canner, Mark Dredze, Eric B. Schneider, Minh Luu, Zack D. Berger, Jonathan A. Myers. The Use of Social Media in Surgical Education: An Analysis of Twitter. Annual Academic Surgical Congress, 2016. [Bibtex]
     Neeraja Nagarajan, Blair J. Smart, Mark Dredze, Joy L. Lee, James Taylor, Jonathan A. Myers, Eric B. Schneider, Zack D. Berger, Joe K. Canner. How do Surgical Providers use Social Media? A Mixed-Methods Analysis using Twitter. Annual Academic Surgical Congress, 2016. [Bibtex]
     John W. Ayers, Benjamin M. Althouse, Jon-Patrick Allem, Eric C. Leas, Mark Dredze, Rebecca Williams. Revisiting the Rise of Electronic Nicotine Delivery Systems Using Search Query Surveillance. American Journal of Preventive Medicine (AJPM), 2016. [PDF] [Bibtex]
Introduction: Public perceptions of electronic nicotine delivery systems (ENDS) remain poorly understood because surveys are too costly to regularly implement and, when implemented, there are long delays between data collection and dissemination. Search query surveillance has bridged some of these gaps. Herein, ENDS' popularity in the U.S. is reassessed using Google searches. Methods: ENDS searches originating in the U.S. from January 2009 through January 2015 were disaggregated by terms focused on e-cigarette (e.g., e-cig) versus vaping (e.g., vapers); their geolocation (e.g., state); the aggregate tobacco control measures corresponding to their geolocation (e.g., clean indoor air laws); and by terms that indicated the searcher's potential interest (e.g., buy e-cigs likely indicates shopping), all analyzed in 2015. Results: ENDS searches are rapidly increasing in the U.S., with 8,498,000 searches during 2014 alone. Increasingly, searches are shifting from e-cigarette- to vaping-focused terms, especially in coastal states and states where anti-smoking norms are stronger. For example, nationally, e-cigarette searches declined 9% (95% CI=1%, 16%) during 2014 compared with 2013, whereas vaping searches increased 136% (95% CI=97%, 186%), even surpassing e-cigarette searches. Additionally, the percentage of ENDS searches related to shopping (e.g., vape shop) nearly doubled in 2014, whereas searches related to health concerns (e.g., vaping risks) or cessation (e.g., quit smoking with e-cigs) were rare and declined in 2014. Conclusions: ENDS popularity is rapidly growing and evolving. These findings could inform survey questionnaire development for follow-up investigation and immediately guide policy debates about how the public perceives the health risks or cessation benefits of ENDS.
     Atul Nakhasi, Sarah G Bell, Ralph J Passarella, Michael J Paul, Mark Dredze, Peter J Pronovost. The Potential of Twitter as a Data Source for Patient Safety. Journal of Patient Safety, 2016. [PDF] [Bibtex]
Background: Error-reporting systems are widely regarded as critical components to improving patient safety, yet current systems do not effectively engage patients. We sought to assess Twitter as a source to gather patient perspective on errors in this feasibility study. Methods: We included publicly accessible tweets in English from any geography. To collect patient safety tweets, we consulted a patient safety expert and constructed a set of highly relevant phrases, such as "doctor screwed up." We used Twitter's search application program interface from January to August 2012 to identify tweets that matched the set of phrases. Two researchers used criteria to independently review tweets and choose those relevant to patient safety; a third reviewer resolved discrepancies. Variables included source and sex of tweeter, source and type of error, emotional response, and mention of litigation. Results: Of 1006 tweets analyzed, 839 (83%) identified the type of error: 26% of which were procedural errors, 23% were medication errors, 23% were diagnostic errors, and 14% were surgical errors. A total of 850 (84%) identified a tweet source, 90% of which were by the patient and 9% by a family member. A total of 519 (52%) identified an emotional response, 47% of which expressed anger or frustration, 21% expressed humor or sarcasm, and 14% expressed sadness or grief. Of the tweets, 6.3% mentioned an intent to pursue malpractice litigation. Conclusions: Twitter is a relevant data source to obtain the patient perspective on medical errors. Twitter may provide an opportunity for health systems and providers to identify and communicate with patients who have experienced a medical error. Further research is needed to assess the reliability of the data.
     Brad J. Bushman, Katherine Newman, Sandra L. Calvert, Geraldine Downey, Mark Dredze, Michael Gottfredson, Nina G. Jablonski, Ann S. Masten, Calvin Morrill, Daniel B. Neill, Daniel Romer, Daniel W. Webster. Youth Violence: What We Know and What We Need to Know. American Psychologist, 2016;71(1):17-39. [PDF] [Bibtex]
School shootings tear the fabric of society. In the wake of a school shooting, parents, pediatricians, policymakers, politicians, and the public search for "the" cause of the shooting. But there is no single cause. The causes of school shootings are extremely complex. After the Sandy Hook Elementary School rampage shooting in Newtown, Connecticut, we wrote a report for the National Science Foundation on what is known and not known about youth violence. This article summarizes and updates that report. After distinguishing violent behavior from aggressive behavior, we describe the prevalence of gun violence in the United States and age-related risks for violence. We delineate important differences between violence in the context of rare rampage school shootings, and much more common urban street violence. Acts of violence are influenced by multiple factors, often acting together. We summarize evidence on some major risk factors and protective factors for youth violence, highlighting individual and contextual factors, which often interact. We consider new quantitative "data mining" procedures that can be used to predict youth violence perpetrated by groups and individuals, recognizing critical issues of privacy and ethical concerns that arise in the prediction of violence. We also discuss implications of the current evidence for reducing youth violence, and we offer suggestions for future research. We conclude by arguing that the prevention of youth violence should be a national priority.

     2015 (28 Publications)
     Yu Wang, Eugene Agichtein, Tom Clark, Mark Dredze, Jeffrey Staton. Inferring latent user characteristics for analyzing political discussions in social media. Atlanta Computational Social Science Workshop, 2015. [Bibtex]
     Mark Dredze, David A. Broniatowski, Michael Smith, Karen M. Hilyard. Understanding Vaccine Refusal: Why We Need Social Media Now. American Journal of Preventive Medicine (AJPM), 2015. [PDF] [Bibtex] [Data]
     Mauricio Santillana, Andre T. Nguyen, Mark Dredze, Michael J. Paul, Elaine Nsoesie, John S. Brownstein. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLOS Computational Biology, 2015. [PDF] [Bibtex]
We present a machine learning-based methodology capable of providing real-time ("nowcast") and forecast estimates of influenza activity in the US by leveraging data from multiple data sources including: Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system. Our main contribution consists of combining multiple influenza-like illnesses (ILI) activity estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions for up to four weeks ahead of the release of CDC's ILI reports. We evaluate the predictive ability of our ensemble approach during the 2013–2014 (retrospective) and 2014–2015 (live) flu seasons for each of the four weekly time horizons. Our ensemble approach demonstrates several advantages: (1) our ensemble method's predictions outperform every prediction using each data source independently, (2) our methodology can produce predictions one week ahead of GFT's real-time estimates with comparable accuracy, and (3) our two and three week forecast estimates have comparable accuracy to real-time predictions using an autoregressive model. Moreover, our results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowd sourced data, into influenza predictions in all time horizons.
     Matthew Gormley, Mark Dredze, Jason Eisner. Approximation-Aware Dependency Parsing by Belief Propagation. Transactions of the Association for Computational Linguistics (TACL), 2015. [PDF] [Bibtex]
We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n^3) runtime. It outputs the parse with maximum expected recall, but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find this gradient by backpropagation. That is, we treat the entire parser (approximations and all) as a differentiable circuit, as others have done for loopy CRFs (Domke, 2010; Stoyanov et al., 2011; Domke, 2011; Stoyanov and Eisner, 2012). The resulting parser obtains higher accuracy with fewer iterations of belief propagation than one trained by conditional log-likelihood.
     David Broniatowski, Mark Dredze, Karen Hilyard. News Articles are More Likely to be Shared if they Combine Statistics with Explanation. Conference of the Society for Medical Decision Making, 2015. [Bibtex]
     Matthew R. Gormley, Mo Yu, Mark Dredze. Improved Relation Extraction with Feature-Rich Compositional Embedding Models. Empirical Methods in Natural Language Processing (EMNLP), 2015. [PDF] [Bibtex]
Compositional embedding models build a representation (or embedding) for a linguistic structure based on its component word embeddings. We propose a Feature-rich Compositional Embedding Model (FCM) for relation extraction that is expressive, generalizes to new domains, and is easy to implement. The key idea is to combine (unlexicalized) handcrafted features with learned word embeddings. The model is able to directly tackle the difficulties met by traditional compositional embedding models, such as handling arbitrary types of sentence annotations and utilizing global information for composition. We test the proposed model on two relation extraction tasks, and demonstrate that our model outperforms both previous compositional models and traditional feature rich models on the ACE 2005 relation extraction task, and the SemEval 2010 relation classification task. The combination of our model and a loglinear classifier with hand-crafted features gives state-of-the-art results. We made our implementation available for general use.
     Nanyun Peng, Mark Dredze. Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings. Empirical Methods in Natural Language Processing (EMNLP) (short paper), 2015. [PDF] [Bibtex] [Code]
We consider the task of named entity recognition for Chinese social media. The long line of work in Chinese NER has focused on formal domains, and NER for social media has been largely restricted to English. We present a new corpus of Weibo messages annotated for both name and nominal mentions. Additionally, we evaluate three types of neural embeddings for representing Chinese text. Finally, we propose a joint training objective for the embeddings that makes use of both (NER) labeled and unlabeled raw text. Our methods yield a 9% improvement over a state-of-the-art baseline.
     Matthew Biggerstaff, David Alper, Mark Dredze, Spencer Fox, Isaac Chun-Hai Fung, Kyle S. Hickmann, Bryan Lewis, Roni Rosenfeld, Jeffrey Shaman, Ming-Hsiang Tsou, Paola Velardi, Alessandro Vespignani, Lyn Finelli. Results from the Centers for Disease Control and Prevention's Predict the 2013–2014 Influenza Season Challenge. International Conference of Emerging Infectious Diseases Conference, 2015. [PDF] [Bibtex]
     Ellie Pavlick, Travis Wolfe, Pushpendre Rastogi, Chris Callison-Burch, Mark Dredze, Benjamin Van Durme. FrameNet+: Fast Paraphrastic Tripling of FrameNet. Association for Computational Linguistics (ACL) (short paper), 2015. [PDF] [Bibtex] [Data]
We increase the lexical coverage of FrameNet through automatic paraphrasing. We use crowdsourcing to manually filter out bad paraphrases in order to ensure a high-precision resource. Our expanded FrameNet contains an additional 22K lexical units, a 3-fold increase over the current FrameNet, and achieves 40% better coverage when evaluated in a practical setting on New York Times data.
     Nanyun Peng, Mo Yu, Mark Dredze. An Empirical Study of Chinese Name Matching and Applications. Association for Computational Linguistics (ACL) (short paper), 2015. [PDF] [Bibtex] [Code]
Methods for name matching, an important component to support downstream tasks such as entity linking and entity clustering, have focused on alphabetic languages, primarily English. In contrast, logogram languages such as Chinese remain untested. We evaluate methods for name matching in Chinese, including both string matching and learning approaches. Our approach, based on new representations for Chinese, improves both name matching and a downstream entity clustering task.
     Travis Wolfe, Mark Dredze, James Mayfield, Paul McNamee, Craig Harman, Tim Finin, Benjamin Van Durme. Interactive Knowledge Base Population. Unpublished Manuscript, 2015. [PDF] [Bibtex]
Most work on building knowledge bases has focused on collecting entities and facts from as large a collection of documents as possible. We argue for and describe a new paradigm where the focus is on a high-recall extraction over a small collection of documents under the supervision of a human expert, which we call Interactive Knowledge Base Population (IKBP).
     Mrinal Kumar, Mark Dredze, Glen Coppersmith, Munmun De Choudhury. Shifts in Suicidal Ideation Manifested in Social Media Following Celebrity Suicides. Conference on Hypertext and Social Media, 2015. [PDF] [Bibtex]
The Werther effect describes the increased rate of completed or attempted suicides following the depiction of an individual's suicide in the media, typically a celebrity. We present findings on the prevalence of this effect in an online platform: r/SuicideWatch on Reddit. We examine both the posting activity and post content following ten high-profile suicides. Posting activity increases following reports of celebrity suicides, and post content exhibits considerable changes that indicate increased suicidal ideation. Specifically, we observe that post-celebrity suicide content is more likely to be inward focused, to manifest decreased social concerns, and to be laden with greater anxiety, anger, and negative emotion. Topic model analysis further reveals content in this period to switch to a more derogatory tone that bears evidence of self-harm and suicidal tendencies. We discuss the implications of our findings in enabling better community support to psychologically vulnerable populations, and the potential of building suicide prevention interventions following high-profile suicides.
     Michael Smith, David Broniatowski, Michael Paul, Mark Dredze. Tracking Public Awareness of Influenza through Twitter. 3rd International Conference on Digital Disease Detection (DDD), 2015. [Bibtex] (rapid fire talk)
     Joanna E. Cohen, Rebecca Shillenn, Mark Dredze, John W. Ayers. Tobacco Watcher: Real-Time Global Tobacco Surveillance Using Online News Media. Annual Meeting of the Society for Research on Nicotine and Tobacco, 2015. [Bibtex]
     David Andre Broniatowski, Mark Dredze, Michael J Paul, Andrea Dugas. Using Social Media to Perform Local Influenza Surveillance in an Inner-City Hospital. JMIR Public Health and Surveillance, 2015. [PDF] [Bibtex]
Background: Public health officials and policy makers in the United States expend significant resources at the national, state, county, and city levels to measure the rate of influenza infection. These individuals rely on influenza infection rate information to make important decisions during the course of an influenza season driving vaccination campaigns, clinical guidelines, and medical staffing. Web and social media data sources have emerged as attractive alternatives to supplement existing practices. While traditional surveillance methods take 1-2 weeks, and significant labor, to produce an infection estimate in each locale, web and social media data are available in near real-time for a broad range of locations. Objective: The objective of this study was to analyze the efficacy of flu surveillance from combining data from the websites Google Flu Trends and HealthTweets at the local level. We considered both emergency department influenza-like illness cases and laboratory-confirmed influenza cases for a single hospital in the City of Baltimore. Methods: This was a retrospective observational study comparing estimates of influenza activity of Google Flu Trends and Twitter to actual counts of individuals with laboratory-confirmed influenza, and counts of individuals presenting to the emergency department with influenza-like illness cases. Data were collected from November 20, 2011 through March 16, 2014. Each parameter was evaluated on the municipal, regional, and national scale. We examined the utility of social media data for tracking actual influenza infection at the municipal, state, and national levels. Specifically, we compared the efficacy of Twitter and Google Flu Trends data. Results: We found that municipal-level Twitter data was more effective than regional and national data when tracking actual influenza infection rates in a Baltimore inner-city hospital. When combined, national-level Twitter and Google Flu Trends data outperformed each data source individually. In addition, influenza-like illness data at all levels of geographic granularity were best predicted by national Google Flu Trends data. Conclusions: In order to overcome sensitivity to transient events, such as the news cycle, the best-fitting Google Flu Trends model relies on a 4-week moving average, suggesting that it may also be sacrificing sensitivity to transient fluctuations in influenza infection to achieve predictive power. Implications for influenza forecasting are discussed in this report.
     J. Lee Westmaas, John W. Ayers, Mark Dredze, Benjamin M. Althouse. Evaluation of the Great American Smokeout by Digital Surveillance. Society of Behavioral Medicine, 2015. [Bibtex] (Citation Award, given to the best submissions)
     Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, Margaret Mitchell. CLPsych 2015 Shared Task: Depression and PTSD on Twitter. NAACL Workshop on Computational Linguistics and Clinical Psychology, 2015. [PDF] [Bibtex]
This paper presents a summary of the Computational Linguistics and Clinical Psychology (CLPsych) 2015 shared and unshared tasks. These tasks aimed to provide apples-to-apples comparisons of various approaches to modeling language relevant to mental health from social media. The data used for these tasks is from Twitter users who state a diagnosis of depression or post traumatic stress disorder (PTSD) and demographically-matched community controls. The unshared task was a hackathon held at Johns Hopkins University in November 2014 to explore the data, and the shared task was conducted remotely, with each participating team submitting scores for a held-back test set of users. The shared task consisted of three binary classification experiments: (1) depression versus control, (2) PTSD versus control, and (3) depression versus PTSD. Classifiers were compared primarily via their average precision, though a number of other metrics are used along with this to allow a more nuanced interpretation of the performance measures.
     Mo Yu, Mark Dredze. Learning Composition Models for Phrase Embeddings. Transactions of the Association for Computational Linguistics (TACL), 2015. [PDF] [Bibtex] [Code]
Lexical embeddings can serve as useful representations for words for a variety of NLP tasks, but learning embeddings for phrases can be challenging. While separate embeddings are learned for each word, this is infeasible for every phrase. We construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks with various phrase lengths. We make the implementation of our model and the datasets available for general use.
     Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead. From ADHD to SAD: analyzing the language of mental health on Twitter through self-reported diagnoses. NAACL Workshop on Computational Linguistics and Clinical Psychology, 2015. [PDF] [Bibtex]
Many significant challenges exist for the mental health field, but one in particular is a lack of data available to guide research. Language provides a natural lens for studying mental health: much existing work and therapy have strong linguistic components, so the creation of a large, varied, language-centric dataset could provide significant grist for the field of mental health research. We examine a broad range of mental health conditions in Twitter data by identifying self-reported statements of diagnosis. We systematically explore language differences between ten conditions with respect to the general population, and to each other. Our aim is to provide guidance and a roadmap for where deeper exploration is likely to be fruitful.
     Nanyun Peng, Francis Ferraro, Mo Yu, Nicholas Andrews, Jay DeYoung, Max Thomas, Matthew R. Gormley, Travis Wolfe, Craig Harman, Benjamin Van Durme, Mark Dredze. A Chinese Concrete NLP Pipeline. North American Chapter of the Association for Computational Linguistics (NAACL) (Demo Paper), 2015. [PDF] [Bibtex]
Natural language processing research increasingly relies on the output of a variety of syntactic and semantic analytics. Yet integrating output from multiple analytics into a single framework can be time consuming and slow research progress. We present a CONCRETE Chinese NLP Pipeline: an NLP stack built using a series of open source systems integrated based on the CONCRETE data schema. Our pipeline includes data ingest, word segmentation, part of speech tagging, parsing, named entity recognition, relation extraction and cross document coreference resolution. Additionally, we integrate a tool for visualizing these annotations as well as allowing for the manual annotation of new data. We release our pipeline to the research community to facilitate work on Chinese language tasks that require rich linguistic annotations.
     Adrian Benton, Mark Dredze. Entity Linking for Spoken Language. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2015. [PDF] [Bibtex] [Data]
Research on entity linking has considered a broad range of text, including newswire, blogs and web documents in multiple languages. However, the problem of entity linking for spoken language remains unexplored. Spoken language obtained from automatic speech recognition systems poses different types of challenges for entity linking; transcription errors can distort the context, and named entities tend to have high error rates. We propose features to mitigate these errors and evaluate the impact of ASR errors on entity linking using a new corpus of entity linked broadcast news transcripts.
     Travis Wolfe, Mark Dredze, Benjamin Van Durme. Predicate Argument Alignment using a Global Coherence Model. North American Chapter of the Association for Computational Linguistics (NAACL), 2015. [PDF] [Bibtex]
We present a joint model for predicate argument alignment. We leverage multiple sources of semantic information, including temporal ordering constraints between events. These are combined in a max-margin framework to find a globally consistent view of entities and events across multiple documents, which leads to improvements over a very strong local baseline.
     Mo Yu, Matthew R. Gormley, Mark Dredze. Combining Word Embeddings and Feature Embeddings for Fine-grained Relation Extraction. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2015. [PDF] [Bibtex]
Compositional embedding models build a representation for a linguistic structure based on its component word embeddings. While recent work has combined these word embeddings with hand crafted features for improved performance, it was restricted to a small number of features due to model complexity, thus limiting its applicability. We propose a new model that conjoins features and word embeddings while maintaining a small number of parameters by learning feature embeddings jointly with the parameters of a compositional model. The result is a method that can scale to more features and more labels, while avoiding overfitting. We demonstrate that our model attains state-of-the-art results on ACE and ERE fine-grained relation extraction.
     Michael J Paul, Mark Dredze. SPRITE: Generalizing Topic Models with Structured Priors. Transactions of the Association for Computational Linguistics (TACL), 2015. [PDF] [Bibtex] [Code]
We introduce SPRITE, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing SPRITE to be tailored to particular settings. We demonstrate this flexibility by constructing a SPRITE-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks.
     Shiliang Wang, Michael J Paul, Mark Dredze. Social Media as a Sensor of Air Quality and Public Response in China. Journal of Medical Internet Research (JMIR), 2015. [PDF] [Bibtex]
Background: Recent studies have demonstrated the utility of social media data sources for a wide range of public health goals including disease surveillance, mental health trends, and health perceptions and sentiment. Most such research has focused on English-language social media for the task of disease surveillance. Objective: We investigated the value of Chinese social media for monitoring air quality trends and related public perceptions and response. The goal was to determine if this data is suitable for learning actionable information about pollution levels and public response. Methods: We mined a collection of 93 million messages from Sina Weibo, China's largest microblogging service. We experimented with different filters to identify messages relevant to air quality, based on keyword matching and topic modeling. We evaluated the reliability of the data filters by comparing message volume per city to air particle pollution rates obtained from the Chinese government for 74 cities. Additionally, we performed a qualitative study of the content of pollution-related messages by coding a sample of 170 messages for relevance to air quality, and whether the message included details such as a reactive behavior or a health concern. Results: The volume of pollution-related messages is highly correlated with particle pollution levels, with Pearson correlation values up to .718 (74
     Haoyu Wang, Eduard Hovy, Mark Dredze. The Hurricane Sandy Twitter Corpus. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2015. [PDF] [Bibtex] [Data]
The growing use of social media has made it a critical component of disaster response and recovery efforts. Both in terms of preparedness and response, public health officials and first responders have turned to automated tools to assist with organizing and visualizing large streams of social media. In turn, this has spurred new research into algorithms for information extraction, event detection and organization, and information visualization. One challenge of these efforts has been the lack of a common corpus for disaster response on which researchers can compare and contrast their work. This paper describes the Hurricane Sandy Twitter Corpus: 6.5 million geotagged Twitter posts from the geographic area and time period of the 2012 Hurricane Sandy.
     Michael Paul, Mark Dredze, David Broniatowski, Nicholas Generous. Worldwide Influenza Surveillance through Twitter. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2015. [Bibtex]
     Joanna E Cohen, John W Ayers, Mark Dredze. Tobacco Watcher: Real-time Global Surveillance for Tobacco Control. World Conference on Tobacco or Health (WCTOH), 2015. [Bibtex]

     2014 (23 Publications)
     Ning Gao, Douglas Oard, Mark Dredze. A Test Collection for Email Entity Linking. NIPS Workshop on Automated Knowledge Base Construction, 2014. [PDF] [Bibtex]
Most prior work on entity linking has focused on linking name mentions found in third-person communication (e.g., news) to broad-coverage knowledge bases (e.g., Wikipedia). A restricted form of domain-specific entity linking has, however, been tried with email, linking mentions of people to specific email addresses. This paper introduces a new test collection for the task of linking mentions of people, organizations, and locations to Wikipedia. Annotation of 200 randomly selected entities of each type from the Enron email collection indicates that domain specific knowledge bases are indeed required to get good coverage of people and organizations, but that Wikipedia provides good (93%) coverage for the named mentions of locations in the Enron collection. Furthermore, experiments with an existing entity linking system indicate that the absence of a suitable referent in Wikipedia can easily be recognized by automated systems, with NIL precision (i.e., correct detection of the absence of a suitable referent) above 90% for all three entity types.
     Adrian Benton, Jay Deyoung, Adam Teichert, Mark Dredze, Benjamin Van Durme, Stephen Mayhew, Max Thomas. Faster (and Better) Entity Linking with Cascades. NIPS Workshop on Automated Knowledge Base Construction, 2014. [PDF] [Bibtex]
Entity linking requires ranking thousands of candidates for each query, a time consuming process and a challenge for large scale linking. Many systems rely on prediction cascades to efficiently rank candidates. However, the design of these cascades often requires manual decisions about pruning and feature use, limiting the effectiveness of cascades. We present Slinky, a modular, flexible, fast and accurate entity linker based on prediction cascades. We adapt the web-ranking prediction cascade learning algorithm, Cronus, in order to learn cascades that are both accurate and fast. We show that by balancing between accurate and fast linking, this algorithm can produce Slinky configurations that are significantly faster and more accurate than a baseline configuration and an alternate cascade learning method with a fixed introduction of features.
     Mo Yu, Matthew Gormley, Mark Dredze. Factor-based Compositional Embedding Models. NIPS Workshop on Learning Semantics, 2014. [PDF] [Bibtex] [Code]
     Rebecca Knowles, Mark Dredze, Kathleen Evans, Elyse Lasser, Tom Richards, Jonathan Weiner, Hadi Kharrazi. High Risk Pregnancy Prediction from Clinical Text. NIPS Workshop on Machine Learning for Clinical Data Analysis, 2014. [PDF] [Bibtex]
     Michael J Paul, Mark Dredze, David Broniatowski. Twitter Improves Influenza Forecasting. PLOS Currents Outbreaks, 2014. [PDF] [Bibtex]
Accurate disease forecasts are imperative when preparing for influenza epidemic outbreaks; nevertheless, these forecasts are often limited by the time required to collect new, accurate data. In this paper, we show that data from the microblogging community Twitter significantly improves influenza forecasting. Most prior influenza forecast models are tested against historical influenza-like illness (ILI) data from the U.S. Centers for Disease Control and Prevention (CDC). These data are released with a one-week lag and are often initially inaccurate until the CDC revises them weeks later. Since previous studies utilize the final, revised data in evaluation, their evaluations do not properly determine the effectiveness of forecasting. Our experiments using ILI data available at the time of the forecast show that models incorporating data derived from Twitter can reduce forecasting error by 17-30% over a baseline that only uses historical data. For a given level of accuracy, using Twitter data produces forecasts that are two to four weeks ahead of baseline models. Additionally, we find that models using Twitter data are, on average, better predictors of influenza prevalence than are models using data from Google Flu Trends, the leading web data source.
     Joy Lee, Matthew DeCamp, Mark Dredze, Margaret S. Chisolm, Zackary D Berger. What Are Health-related Users Tweeting? A Qualitative Content Analysis of Health-related Users and their Messages on Twitter. Journal of Medical Internet Research (JMIR), 2014. [PDF] [Bibtex]
     Michael J Paul, Mark Dredze. Discovering Health Topics in Social Media Using Topic Models. PLoS ONE, 2014. [PDF] [Bibtex] [Data]
By aggregating self-reported health statuses across millions of users, we seek to characterize the variety of health information discussed in Twitter. We describe a topic modeling framework for discovering health topics in Twitter, a social media website. This is an exploratory approach with the goal of understanding what health topics are commonly discussed in social media. This paper describes in detail a statistical topic model created for this purpose, the Ailment Topic Aspect Model (ATAM), as well as our system for filtering general Twitter data based on health keywords and supervised classification. We show how ATAM and other topic models can automatically infer health topics in 144 million Twitter messages from 2011 to 2013. ATAM discovered 13 coherent clusters of Twitter messages, some of which correlate with seasonal influenza (r = 0.689) and allergies (r = 0.810) temporal surveillance data, as well as exercise (r = .534) and obesity (r = −.631) related geographic survey data in the United States. These results demonstrate that it is possible to automatically discover topics that attain statistically significant correlations with ground truth data, despite using minimal human supervision and no historical data to train the model, in contrast to prior work. Additionally, these results demonstrate that a single general-purpose model can identify many different health topics in social media.
     David Broniatowski, Michael J. Paul, Mark Dredze. Twitter: Big Data Opportunities (Letter). Science, 2014;345(6193):148. [PDF] [Bibtex]
     Ahmed Abbasi, Donald Adjeroh, Mark Dredze, Michael J. Paul, Fatemeh Mariam Zahedi, Huimin Zhao, Nitin Walia, Hemant Jain, Patrick Sanvanson, Reza Shaker, Marco D. Huesch, Richard Beal, Wanhong Zheng, Marie Abate, Arun Ross. Social Media Analytics for Smart Health. IEEE Intelligent Systems, 2014;29(2):60--80. [PDF] [Bibtex]
     Byron C. Wallace, Michael J. Paul, Urmimala Sarkar, Thomas A. Trikalinos, Mark Dredze. A Large-Scale Quantitative Analysis of Latent Factors and Sentiment in Online Doctor Reviews. Journal of the American Medical Informatics Association (JAMIA), 2014. [PDF] [Bibtex]
     Mark Dredze, Renyuan Cheng, Michael Paul, David Broniatowski. A Platform for Public Health Surveillance using Twitter. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014. [PDF] [Bibtex] [Website]
We present, a new platform for sharing the latest research results on Twitter data with researchers and public officials. In this demo paper, we describe data collection, processing, and features of the site. The goal of this service is to transition results from research to practice.
     Michael Paul, Mark Dredze, David Broniatowski. Challenges in Influenza Forecasting and Opportunities for Social Media. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014. [Bibtex]
     Shiliang Wang, Michael Paul, Mark Dredze. Exploring Health Topics in Chinese Social Media: An Analysis of Sina Weibo. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014. [PDF] [Bibtex]
This paper seeks to identify and characterize health-related topics discussed on the Chinese microblogging website, Sina Weibo. We identified nearly 1 million messages containing health-related keywords, filtered from a dataset of 93 million messages spanning five years. We applied probabilistic topic models to this dataset and identified the prominent health topics. We show that a variety of health topics are discussed in Sina Weibo, and that four flu-related topics are correlated with monthly influenza case rates in China.
     Mo Yu, Mark Dredze. Improving Lexical Embeddings with Semantic Knowledge. Association for Computational Linguistics (ACL) (short paper), 2014. [PDF] [Bibtex] [Code]
Word embeddings learned on unlabeled data are a popular tool in semantics, but may not capture the desired semantics. We propose a new learning objective that incorporates both a neural language model objective and prior knowledge from semantic resources to learn improved lexical semantic embeddings. We demonstrate that our embeddings improve over those learned solely on raw text in three settings: language modeling, measuring semantic similarity, and predicting human judgements.
     Nanyun Peng, Yiming Wang, Mark Dredze. Learning Polylingual Topic Models from Code-Switched Social Media Documents. Association for Computational Linguistics (ACL) (short paper), 2014. [PDF] [Bibtex] [Code]
Code-switched documents are common in social media, providing evidence for polylingual topic models to infer aligned topics across languages. We present Code-Switched LDA (csLDA), which infers language specific topic distributions based on code-switched documents to facilitate multi-lingual corpus analysis. We experiment on two code-switching corpora (English-Spanish Twitter data and English-Chinese Weibo data) and show that csLDA improves perplexity over LDA, and learns semantically coherent aligned topics as judged by human annotators.
     Glen Coppersmith, Mark Dredze, Craig Harman. Quantifying Mental Health Signals in Twitter. ACL Workshop on Computational Linguistics and Clinical Psychology, 2014. [PDF] [Bibtex]
The ubiquity of social media provides a rich opportunity to enhance the data available to mental health clinicians and researchers, enabling a better-informed and better-equipped mental health field. We present analysis of mental health phenomena in publicly available Twitter data, demonstrating how rigorous application of simple natural language processing methods can yield insight into specific disorders as well as mental health writ large, along with evidence that as-of-yet undiscovered linguistic signals relevant to mental health exist in social media. We present a novel method for gathering data for a range of mental illnesses quickly and cheaply, then focus on analysis of four in particular: post-traumatic stress disorder (PTSD), major depressive disorder, bipolar disorder, and seasonal affective disorder. We intend for these proof-of-concept results to inform the necessary ethical discussion regarding the balance between the utility of such data and the privacy of mental health related information.
     Glen Coppersmith, Craig Harman, Mark Dredze. Measuring Post Traumatic Stress Disorder in Twitter. International Conference on Weblogs and Social Media (ICWSM), 2014. [PDF] [Bibtex]
Traditional mental health studies rely on information primarily collected and analyzed through personal contact with a health care professional. Recent work has shown the utility of social media data for studying depression, but there have been limited evaluations of other mental health conditions. We consider post traumatic stress disorder (PTSD), a serious condition that affects millions worldwide, with especially high rates in military veterans. We show how to obtain a PTSD classifier for social media using simple searches of available Twitter data, a significant reduction in training data cost compared to previous work on mental health. We demonstrate its utility by an examination of language use from PTSD individuals, and by detecting elevated rates of PTSD at and around US military bases using our classifiers.
     Miles Osborne, Mark Dredze. Facebook, Twitter and Google Plus for Breaking News: Is there a winner? International Conference on Weblogs and Social Media (ICWSM), 2014. [PDF] [Bibtex] [Supplement]
Twitter is widely seen as the go-to place for breaking news. Recently, however, competing social media have begun to carry news. Here we examine how Facebook, Google Plus and Twitter report on breaking news. We consider coverage (whether news events are reported) and latency (the time when they are reported). Using data drawn from three weeks in December 2013, we identify 29 major news events, ranging from celebrity deaths and plague outbreaks to sports events. We find that all media carry the same major events, but Twitter continues to be the preferred medium for breaking news, almost consistently leading Facebook or Google Plus. Facebook and Google Plus largely repost newswire stories and their main research value is that they conveniently package multiple sources of information together.
     Matthew R. Gormley, Margaret Mitchell, Benjamin Van Durme, Mark Dredze. Low-Resource Semantic Role Labeling. Association for Computational Linguistics (ACL), 2014. [PDF] [Bibtex] [Code]
We explore the extent to which high-resource manual annotations such as treebanks are necessary for the task of semantic role labeling (SRL). We examine how performance changes without syntactic supervision, comparing both joint and pipelined methods to induce latent syntax. This work highlights a new application of unsupervised grammar induction and demonstrates several approaches to SRL in the absence of supervised syntax. Our best models obtain competitive results in the high-resource setting and state-of-the-art results in the low-resource setting, reaching 72.48% F1 averaged across languages. We release our code for this work along with a larger toolkit for specifying arbitrary graphical structure.
     Nicholas Andrews, Jason Eisner, Mark Dredze. Robust Entity Clustering via Phylogenetic Inference. Association for Computational Linguistics (ACL), 2014. [PDF] [Bibtex] [Code]
Entity clustering must determine when two named-entity mentions refer to the same entity. Typical approaches use a pipeline architecture that clusters the mentions using fixed or learned measures of name and context similarity. In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Clustering the mentions into entities depends on recovering this copying tree jointly with estimating models of the mutation process and parent selection process. We present a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets. On a challenging Twitter corpus, our method outperforms the best baseline by 12.6 points of F1 score.
     John W. Ayers, Benjamin M. Althouse, Mark Dredze. Could Behavioral Medicine Lead the Web Data Revolution? Journal of the American Medical Association (JAMA), 2014. [PDF] [Bibtex]
     John W. Ayers, Benjamin M. Althouse, Morgan Johnson, Mark Dredze, Joanna E. Cohen. What's the Healthiest Day? Circaseptan (Weekly) Rhythms in Healthy Considerations. American Journal of Preventive Medicine (AJPM), 2014. [PDF] [Bibtex]
     Ben Althouse, Jon-Patrick Allem, Matt Childers, Mark Dredze, John W Ayers. Population Health Concerns During the United States' Great Recession. American Journal of Preventive Medicine (AJPM), 2014;46(2):166-170. [PDF] [Bibtex]

     2013 (13 Publications)
     Mark H Dredze, William N Schilit. Facet suggestion for search query augmentation. US Patent 8,433,705, 2013. [Bibtex]
     David Broniatowski, Michael J. Paul, Mark Dredze. National and Local Influenza Surveillance through Twitter: An Analysis of the 2012-2013 Influenza Epidemic. PLOS ONE, 2013. [PDF] [Bibtex]
Social media have been proposed as a data source for influenza surveillance because they have the potential to offer real-time access to millions of short, geographically localized messages containing information regarding personal well-being. However, accuracy of social media surveillance systems declines with media attention because media attention increases "chatter" (messages that are about influenza but that do not pertain to an actual infection), masking signs of true influenza prevalence. This paper summarizes our recently developed influenza infection detection algorithm that automatically distinguishes relevant tweets from other chatter, and we describe our current influenza surveillance system which was actively deployed during the full 2012-2013 influenza season. Our objective was to analyze the performance of this system during the most recent 2012-2013 influenza season and to analyze the performance at multiple levels of geographic granularity, unlike past studies that focused on national or regional surveillance. Our system's influenza prevalence estimates were strongly correlated with surveillance data from the Centers for Disease Control and Prevention for the United States (r = 0.93, p < 0.001) as well as surveillance data from the Department of Health and Mental Hygiene of New York City (r = 0.88, p < 0.001). Our system detected the weekly change in direction (increasing or decreasing) of influenza prevalence with 85% accuracy, a nearly twofold increase over a simpler model, demonstrating the utility of explicitly distinguishing infection tweets from other chatter.
     Travis Wolfe, Benjamin Van Durme, Mark Dredze, Nicholas Andrews, Charley Beller, Chris Callison-Burch, Jay DeYoung, Justin Snyder, Jonathan Weese, Tan Xu, Xuchen Yao. PARMA: A Predicate Argument Aligner. Association for Computational Linguistics (ACL) (short paper), 2013. [PDF] [Bibtex]
We introduce PARMA, a system for cross-document, semantic predicate and argument alignment. Our system integrates popular lexical semantic resources into a simple discriminative model. PARMA achieves state-of-the-art results. We suggest that existing efforts have focused on data that is too easy, and we provide a more difficult dataset based on MT translation references, which has a lower baseline that we beat by 17% absolute F1.
     Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow. Sub-Lexical and Contextual Modeling of Out-of-Vocabulary Words in Speech Recognition. Technical Report 10, Human Language Technology Center of Excellence, Johns Hopkins University, 2013. [PDF] [Bibtex]
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information-rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word-based systems; new words can then be represented by combinations of sub-word units. We present a novel probabilistic model to learn the sub-word lexicon optimized for a given task. We consider the task of out-of-vocabulary (OOV) word detection, which relies on output from a hybrid system. We combine the proposed hybrid system with confidence-based metrics to improve OOV detection performance. Previous work addresses OOV detection as a binary classification task, where each region is independently classified using local information. We propose to treat OOV detection as a sequence labeling problem, and we show that 1) jointly predicting out-of-vocabulary regions, 2) including contextual information from each region, and 3) learning sub-lexical units optimized for this task lead to substantial improvements with respect to the state of the art on an English Broadcast News and MIT Lectures task.
     Mark Dredze, Michael J Paul, Shane Bergsma, Hieu Tran. Carmen: A Twitter Geolocation System with Applications to Public Health. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013. [PDF] [Bibtex] [Code]
Public health applications using social media often require accurate, broad-coverage location information. However, the standard information provided by social media APIs, such as Twitter's, covers a limited number of messages. This paper presents Carmen, a geolocation system that can determine structured location information for messages provided by the Twitter API. Our system utilizes geocoding tools and a combination of automatic and manual alias resolution methods to infer location structures from GPS positions and user-provided profile data. We show that our system is accurate and covers many locations, and we demonstrate its utility for improving influenza surveillance.
     Michael J Paul, Byron Wallace, Mark Dredze. What Affects Patient (Dis)satisfaction? Analyzing Online Doctor Ratings with a Joint Topic-Sentiment Model. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013. [PDF] [Bibtex]
We analyze patient reviews of doctors using a novel probabilistic joint model of aspect and sentiment based on factorial LDA. We leverage this model to exploit a small set of previously annotated reviews to automatically analyze the topics and sentiment latent in over 50,000 online reviews of physicians (and we make this dataset publicly available). The proposed model outperforms baseline models for this task with respect to model perplexity and sentiment classification. We report the most representative words with respect to positive and negative sentiment along three clinical aspects, thus complementing existing qualitative work exploring patient reviews of physicians.
     Justin Snyder, Rebecca Knowles, Mark Dredze, Matthew R. Gormley, Travis Wolfe. Topic Models and Metadata for Visualizing Text Corpora. North American Chapter of the Association for Computational Linguistics (NAACL) (Demo Paper), 2013. [PDF] [Bibtex]
Effectively exploring and analyzing large text corpora requires visualizations that provide a high level summary. Past work has relied on faceted browsing of document metadata or on natural language processing of document text. In this paper, we present a new web-based tool that integrates topics learned from an unsupervised topic model in a faceted browsing experience. The user can manage topics, filter documents by topic and summarize views with metadata and topic graphs. We report a user study of the usefulness of topics in our tool.
     Damianos Karakos, Mark Dredze, Sanjeev Khudanpur. Estimating Confusions in the ASR Channel for Improved Topic-based Language Model Adaptation. Technical Report 8, Johns Hopkins University, 2013. [PDF] [Bibtex]
Human language is a combination of elemental languages/domains/styles that change across and sometimes within discourses. Language models, which play a crucial role in speech recognizers and machine translation systems, are particularly sensitive to such changes, unless some form of adaptation takes place. One approach to speech language model adaptation is self-training, in which a language model's parameters are tuned based on automatically transcribed audio. However, transcription errors can misguide self-training, particularly in challenging settings such as conversational speech. In this work, we propose a model that considers the confusions (errors) of the ASR channel. By modeling the likely confusions in the ASR output instead of using just the 1-best, we improve self-training efficacy by obtaining a more reliable reference transcription estimate. We demonstrate improved topic-based language modeling adaptation results over both 1-best and lattice self-training using our ASR channel confusion estimates on telephone conversations.
     Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, David Yarowsky. Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF] [Bibtex] [Data]
Hidden properties of social media users, such as their ethnicity, gender, and location, are often reflected in their observed attributes, such as their first and last names. Furthermore, users who communicate with each other often have similar hidden properties. We propose an algorithm that exploits these insights to cluster the observed attributes of hundreds of millions of Twitter users. Attributes such as user names are grouped together if users with those names communicate with other similar users. We separately cluster millions of unique first names, last names, and user-provided locations. The efficacy of these clusters is then evaluated on a diverse set of classification tasks that predict hidden user properties such as ethnicity, geographic location, gender, language, and race, using only profile names and locations when appropriate. Our readily-replicable approach and publicly released clusters are shown to be remarkably effective and versatile, substantially outperforming state-of-the-art approaches and human accuracy on each of the tasks studied.
     Mahesh Joshi, Mark Dredze, William W. Cohen, Carolyn P. Rose. What's in a Domain? Multi-Domain Learning for Multi-Attribute Data. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2013. [PDF] [Bibtex]
Multi-Domain learning assumes that a single metadata attribute is used in order to divide the data into so-called domains. However, real-world datasets often have multiple metadata attributes that can divide the data into domains. It is not always apparent which single attribute will lead to the best domains, and more than one attribute might impact classification. We propose extensions to two multi-domain learning techniques for our multi-attribute setting, enabling them to simultaneously learn from several metadata attributes. Experimentally, they outperform the multi-domain learning baseline, even when it selects the single "best" attribute.
     Alex Lamb, Michael J. Paul, Mark Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2013. [PDF] [Bibtex] [Data]
Twitter has been shown to be a fast and reliable method for disease surveillance of common illnesses like influenza. However, previous work has relied on simple content analysis, which conflates flu tweets that report infection with those that express concerned awareness of the flu. By discriminating these categories, as well as tweets about the authors versus about others, we demonstrate significant improvements on influenza surveillance using Twitter.
     Michael Paul, Mark Dredze. Drug Extraction from the Web: Summarizing Drug Experiences with Multi-Dimensional Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF] [Bibtex]
Multi-dimensional latent text models, such as factorial LDA (f-LDA), capture multiple factors of corpora, creating structured output for researchers to better understand the contents of a corpus. We consider such models for clinical research of new recreational drugs and trends, an important application for mining current information for healthcare workers. We use a "three-dimensional" f-LDA variant to jointly model combinations of drug (marijuana, salvia, etc.), aspect (effects, chemistry, etc.) and route of administration (smoking, oral, etc.). Since a purely unsupervised topic model is unlikely to discover these specific factors of interest, we develop a novel method of incorporating prior knowledge by leveraging user-generated tags as priors in our model. We demonstrate that this model can be used as an exploratory tool for learning about these drugs from the Web by applying it to the task of extractive summarization. In addition to providing useful output for this important public health task, our prior-enriched model provides a framework for the application of f-LDA to other tasks.
     Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Machine Learning, 2013;91(2):155-187. [PDF] [Bibtex]
We present AROW, an online learning algorithm for binary and multiclass problems that combines large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive mistake bounds for the binary and multiclass settings that are similar in form to the second order perceptron bound. Our bounds do not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques. Empirical evaluations show that AROW achieves state-of-the-art performance on a wide range of binary and multiclass tasks, as well as robustness in the face of non-separable data.

     2012 (17 Publications)
     Kristian Hammond, Jerome Budzik, Lawrence Birnbaum, Kevin Livingston, Mark Dredze. Request initiated collateral content offering. US Patent 8,260,874, 2012. [Bibtex]
     Mark Dredze. How Social Media Will Change Public Health. IEEE Intelligent Systems, 2012;27(4):81-84. [Bibtex] [Link]
Recent work in machine learning and natural language processing has studied the health content of tweets and demonstrated the potential for extracting useful public health information from their aggregation. This article examines the types of health topics discussed on Twitter, and how tweets can both augment existing public health capabilities and enable new ones. The author also discusses key challenges that researchers must address to deliver high-quality tools to the public health community.
     Michael J. Paul, Mark Dredze. Factorial LDA: Sparse Multi-Dimensional Text Models. Neural Information Processing Systems (NIPS), 2012. [PDF] [Bibtex]
Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors.
     Alex Lamb, Michael J. Paul, Mark Dredze. Investigating Twitter as a Source for Studying Behavioral Responses to Epidemics. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF] [Bibtex]
We present preliminary results for mining concerned awareness of influenza tweets. We describe our data set construction and experiments with binary classification of data into influenza versus general messages and classification into concerned awareness and existing infection.
     Atul Nakhasi, Ralph J Passarella, Sarah G Bell, Michael J Paul, Mark Dredze, Peter J Pronovost. Malpractice and Malcontent: Analyzing Medical Complaints in Twitter. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF] [Bibtex]
In this paper we report preliminary results from a study of Twitter to identify patient safety reports, which offer an immediate, untainted, and expansive patient perspective unlike any other mechanism to date for this topic. We identify patient safety related tweets and characterize them by which medical populations caused errors, who reported these errors, what types of errors occurred, and what emotional states were expressed in response. Our long-term goal is to improve the handling and reduction of errors by incorporating this patient input into the patient safety process.
     Michael J. Paul, Mark Dredze. Experimenting with Drugs (and Topic Models): Multi-Dimensional Exploration of Recreational Drug Discussions. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF] [Bibtex]
Clinical research of new recreational drugs and trends requires mining current information from non-traditional text sources. In this work we support such research through the use of a multi-dimensional latent text model -- factorial LDA -- that captures orthogonal factors of corpora, creating structured output for researchers to better understand the contents of a corpus. Since a purely unsupervised model is unlikely to discover specific factors of interest to clinical researchers, we modify the structure of factorial LDA to incorporate prior knowledge, including the use of observed variables, informative priors and background components. The resulting model learns factors that correspond to drug type, delivery method (smoking, injection, etc.), and aspect (chemistry, culture, effects, health, usage). We demonstrate that the improved model yields better quantitative and more interpretable results.
     Ralph Passarella, Atul Nakhasi, Sarah Bell, Michael J. Paul, Peter Pronovost, Mark Dredze. Twitter as a Source for Learning about Patient Safety Events. Annual Symposium of the American Medical Informatics Association (AMIA), 2012. [Bibtex]
     Damianos Karakos, Brian Roark, Izhak Shafran, Kenji Sagae, Maider Lehr, Emily Prud'hommeaux, Puyang Xu, Nathan Glenn, Sanjeev Khudanpur, Murat Saraclar, Dan Bikel, Mark Dredze, Chris Callison-Burch, Yuan Cao, Keith Hall, Eva Hasler, Philip Koehn, Adam Lopez, Matt Post, Darcey Riley. Deriving conversation-based features from unlabeled speech for discriminative language modeling. International Speech Communication Association (INTERSPEECH), 2012. [PDF] [Bibtex]
The perceptron algorithm was used in [1] to estimate discriminative language models which correct errors in the output of ASR systems. In its simplest version, the algorithm simply increases the weight of n-gram features which appear in the correct (oracle) hypothesis and decreases the weight of n-gram features which appear in the 1-best hypothesis. In this paper, we show that the perceptron algorithm can be successfully used in a semi-supervised learning (SSL) framework, where limited amounts of labeled data are available. Our framework has some similarities to graph-based label propagation [2] in the sense that a graph is built based on proximity of unlabeled conversations, and then it is used to propagate confidences (in the form of features) to the labeled data, based on which perceptron trains a discriminative model. The novelty of our approach lies in the fact that the confidence "flows" from the unlabeled data to the labeled data, and not vice-versa, as is done traditionally in SSL. Experiments conducted at the 2011 CLSP Summer Workshop on the conversational telephone speech corpora Dev04f and Eval04f demonstrate the effectiveness of the proposed approach.
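The simplest version of the perceptron update described above can be sketched as follows. This is only an illustration under stated assumptions: hypotheses are token lists, features are unigram and bigram counts, and the function names, the `n_max` cutoff, and the learning rate `lr` are hypothetical choices not taken from the paper. The semi-supervised, graph-based features the paper proposes are not modeled here.

```python
from collections import Counter


def ngram_counts(tokens, n_max=2):
    """Count all n-grams of length 1..n_max in a token sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(weights, tokens):
    """Linear model score: weighted sum of the hypothesis's n-gram counts."""
    return sum(weights.get(f, 0.0) * c for f, c in ngram_counts(tokens).items())


def perceptron_update(weights, oracle, one_best, lr=1.0):
    """Reward n-grams in the oracle hypothesis, penalize those in the 1-best."""
    if oracle == one_best:
        return  # the recognizer's top hypothesis is already the oracle
    for f, c in ngram_counts(oracle).items():
        weights[f] = weights.get(f, 0.0) + lr * c
    for f, c in ngram_counts(one_best).items():
        weights[f] = weights.get(f, 0.0) - lr * c
```

N-grams shared by both hypotheses receive equal and opposite adjustments, so only the n-grams that distinguish the oracle from the 1-best output change weight.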
     Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Efficient Structured Language Modeling for Speech Recognition. International Speech Communication Association (INTERSPEECH), 2012. [PDF] [Bibtex]
The structured language model (SLM) of [1] was one of the first to successfully integrate syntactic structure into language models. We extend the SLM framework in two new directions. First, we propose a new syntactic hierarchical interpolation that improves over previous approaches. Second, we develop a general information-theoretic algorithm for pruning the underlying Jelinek-Mercer interpolated LM used in [1], which substantially reduces the size of the LM, enabling us to train on large data. When combined with hill-climbing [2] the SLM is an accurate model, space-efficient and fast for rescoring large speech lattices. Experimental results on broadcast news demonstrate that the SLM outperforms a large 4-gram LM.
     Nicholas Andrews, Jason Eisner, Mark Dredze. Name Phylogeny: A Generative Model of String Variation. Empirical Methods in Natural Language Processing (EMNLP), 2012. [PDF] [Bibtex]
Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
     Mahesh Joshi, Mark Dredze, William Cohen, Carolyn Rose. Multi-Domain Learning: When Do Domains Matter?. Empirical Methods in Natural Language Processing (EMNLP), 2012. [PDF] [Bibtex]
We present a systematic analysis of existing multi-domain learning approaches with respect to two questions. First, many multi-domain learning algorithms resemble ensemble learning algorithms. (1) Are multi-domain learning improvements the result of ensemble learning effects? Second, these algorithms are traditionally evaluated in a balanced label setting, although in practice many multi-domain settings have domain-specific label biases. When multi-domain learning is applied to these settings, (2) are multi-domain methods improving because they capture domain-specific class biases? An understanding of these two issues presents a clearer idea about where the field has had success in multi-domain learning, and it suggests some important open questions for improving beyond the current state of the art.
     Ariya Rastrow, Sanjeev Khudanpur, Mark Dredze. Revisiting the Case for Explicit Syntactic Information in Language Models. NAACL Workshop on the Future of Language Modeling for HLT, 2012. [PDF] [Bibtex]
Statistical language models used in deployed systems for speech recognition, machine translation and other human language technologies are almost exclusively n-gram models. They are regarded as linguistically naive, but estimating them from any amount of text, large or small, is straightforward. Furthermore, they have doggedly matched or outperformed numerous competing proposals for syntactically well-motivated models. This unusual resilience of n-grams, as well as their weaknesses, are examined here. It is demonstrated that n-grams are good word-predictors, even linguistically speaking, in a large majority of word-positions, and it is suggested that to improve over n-grams, one must explore syntax-aware (or other) language models that focus on positions where n-grams are weak.
     Spence Green, Nicholas Andrews, Matthew Gormley, Mark Dredze, Christopher D. Manning. Entity Clustering Across Languages. North American Chapter of the Association for Computational Linguistics (NAACL), 2012. [PDF] [Bibtex]
Standard entity clustering systems commonly rely on mention (string) matching, syntactic features, and linguistic resources like English WordNet. When co-referent text mentions appear in different languages, these techniques cannot be easily applied. Consequently, we develop new methods for clustering text mentions across documents and languages simultaneously, producing cross-lingual entity clusters. Our approach extends standard clustering algorithms with cross-lingual mention and context similarity measures. Crucially, we do not assume a pre-existing entity list (knowledge base), so entity characteristics are unknown. On an Arabic-English corpus that contains seven different text genres, our best model yields a 24.3% F1 gain over the baseline.
     Matthew R. Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner. Shared Components Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2012. [PDF] [Bibtex]
With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters.
     Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining. Association for Computational Linguistics (ACL), 2012. [PDF] [Bibtex]
Long-span features, such as syntax, can improve language models for tasks such as speech recognition and machine translation. However, these language models can be difficult to use in practice because of the time required to generate features for rescoring a large hypothesis set. In this work, we propose substructure sharing, which saves duplicate work in processing hypothesis sets with redundant hypothesis structures. We apply substructure sharing to a dependency parser and part of speech tagger to obtain significant speedups, and further improve the accuracy of these tools through up-training. When using these improved tools in a language model for speech recognition, we obtain significant speed improvements with both N-best and hill climbing rescoring, and show that up-training leads to WER reduction.
     Koby Crammer, Alex Kulesza, Mark Dredze. New H-∞ Bounds for the Recursive Least Squares Algorithm Exploiting Input Structure. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012. [PDF] [Bibtex]
The well known recursive least squares (RLS) algorithm has been widely used for many years. Most analyses of RLS have assumed statistical properties of the data or noise process, but recent robust H∞ analyses have been used to bound the ratio of the performance of the algorithm to the total noise. In this paper, we provide the first additive analysis bounding the difference between performance and noise. Our analysis provides additional convergence guarantees in general, and particularly with structured input data. We illustrate the analysis using human speech and white noise.
     Koby Crammer, Mark Dredze, Fernando Pereira. Confidence-Weighted Linear Classification for Text Categorization. Journal of Machine Learning Research (JMLR), 2012;13(Jun):1891-1926. [PDF] [Bibtex]
Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of text-categorization tasks shows that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing.

     2011 (13 Publications)
     Joshua T. Vogelstein, William R. Gray, Jason G. Martin, Glen A. Coppersmith, Mark Dredze, J. Bogovic, J. L. Prince, S. M. Resnick, Carey E. Priebe, R. Jacob Vogelstein. Connectome Classification using Statistical Graph Theory and Machine Learning. Society for Neuroscience (Poster), 2011. [Bibtex]
     Spence Green, Nicholas Andrews, Matthew R. Gormley, Mark Dredze, Christopher Manning. Cross-lingual Coreference Resolution: A New Task for Multilingual Comparable Corpora. Technical Report 6, Human Language Technology Center of Excellence, Johns Hopkins University, 2011. [Bibtex]
We introduce cross-lingual coreference resolution, the task of grouping entity mentions with a common referent in a multilingual corpus. Information, especially on the web, is increasingly multilingual. We would like to track entity references across languages without machine translation, which is expensive and unavailable for many language pairs. Therefore, we develop a set of models that rely on decreasing levels of parallel resources: a bitext, a bilingual lexicon, and a parallel name list. We propose baselines, provide experimental results, and analyze sources of error. Across a range of metrics, we find that even our lowest resource model gives a 2.5% F1 absolute improvement over the strongest baseline. Our results present a positive outlook for cross-lingual coreference resolution even in low resource languages. We are releasing our cross-lingual annotations for the ACE2008 Arabic-English evaluation corpus.
     Matthew R. Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner. Shared Components Topic Models with Application to Selectional Preference. NIPS Workshop on Learning Semantics, 2011. [PDF] [Bibtex]
     Damianos Karakos, Mark Dredze, Kenneth Church, Aren Jansen, Sanjeev Khudanpur. Estimating Document Frequencies in a Speech Corpus. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF] [Bibtex]
Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition. Whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. There is an opportunity for the ensemble to reduce noise, assuming that the errors across language models are relatively uncorrelated. The opportunity for improvement is larger when WER is high. This paper considers a pairing task application that could benefit from improved estimates of df. The pairing task inputs conversational sides from the English Fisher corpus and outputs estimates of which sides were from the same conversation. Better estimates of df lead to better performance on this task.
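As a point of reference for the definition above, document frequency over plain text documents can be computed as in the sketch below. The function names are illustrative, and the `log(N / df)` form of IDF is one common convention assumed here; the abstract only states that IDF is defined in terms of df(w). The lattice-based and ensemble estimates for speech that the paper studies are not shown.

```python
import math


def document_frequency(docs):
    """df(w): the number of documents that mention w at least once."""
    df = {}
    for doc in docs:
        for w in set(doc):  # a set, so repeats within one document count once
            df[w] = df.get(w, 0) + 1
    return df


def inverse_document_frequency(docs):
    """IDF under the assumed convention idf(w) = log(N / df(w))."""
    n = len(docs)
    return {w: math.log(n / c) for w, c in document_frequency(docs).items()}
```

For spoken documents the word identities are uncertain, which is why the counts in `document_frequency` have to be estimated rather than read off directly.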
     Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Adapting N-Gram Maximum Entropy Language Models with Conditional Entropy Regularization. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF] [Bibtex]
Accurate estimates of language model parameters are critical for building quality text generation systems, such as automatic speech recognition. However, text training data for a domain of interest is often unavailable. Instead, we use semi-supervised model adaptation; parameters are estimated using both unlabeled in-domain data (raw speech audio) and labeled out-of-domain data (text). In this work, we present a new semi-supervised language model adaptation procedure for Maximum Entropy models with n-gram features. We augment the conventional maximum likelihood training criterion on out-of-domain text data with an additional term to minimize conditional entropy on in-domain audio. Additionally, we demonstrate how to compute conditional entropy efficiently on speech lattices using first- and second-order expectation semirings. We demonstrate improvements in terms of word error rate over other adaptation techniques when adapting a maximum entropy language model from broadcast news to MIT lectures.
     Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Efficient Discriminative Training of Long-Span Language Models. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF] [Bibtex]
Long-span language models, such as those involving syntactic dependencies, produce more coherent text than their n-gram counterparts. However, evaluating the large number of sentence-hypotheses in a packed representation such as an ASR lattice is intractable under such long-span models both during decoding and discriminative training. The accepted compromise is to rescore only the N-best hypotheses in the lattice using the long-span LM. We present discriminative hill climbing, an efficient and effective discriminative training procedure for long-span LMs based on a hill climbing rescoring algorithm. We empirically demonstrate significant computational savings as well as error-rate reduction over N-best training methods in a state-of-the-art ASR system for Broadcast News transcription.
     Delip Rao, Paul McNamee, Mark Dredze. Entity Linking: Finding Extracted Entities in a Knowledge Base. Multi-source, Multi-lingual Information Extraction and Summarization, 2011. [Bibtex]
In the menagerie of tasks for information extraction, entity linking is a new beast that has drawn a lot of attention from NLP practitioners and researchers recently. Entity Linking, also referred to as record linkage or entity resolution, involves aligning a textual mention of a named-entity to an appropriate entry in a knowledge base, which may or may not contain the entity. This has manifold applications ranging from linking patient health records to maintaining personal credit files, prevention of identity crimes, and supporting law enforcement. We discuss the key challenges present in this task and we present a high-performing system that links entities using max-margin ranking. We also summarize recent work in this area and describe several open research problems.
     Ann Irvine, Mark Dredze, Geraldine Legendre, Paul Smolensky. Optimality Theory Syntax Learnability: An Empirical Exploration of the Perceptron and GLA. CogSci Workshop on OT as a General Cognitive Architecture, 2011. [Bibtex]
     Carolina Parada, Mark Dredze, Frederick Jelinek. OOV Sensitive Named-Entity Recognition in Speech. International Speech Communication Association (INTERSPEECH), 2011. [PDF] [Bibtex] [Data]
Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named entities and always produce transcription errors. In this work, we improve speech NER by including features indicative of OOVs based on an OOV detector, allowing for the identification of regions of speech containing named entities, even if they are incorrectly transcribed. We construct a new speech NER data set and demonstrate significant improvements for this task.
     Michael J. Paul, Mark Dredze. A Model for Mining Public Health Topics from Twitter. Technical Report -, Johns Hopkins University, 2011. [PDF] [Bibtex] [Data]
We present the Ailment Topic Aspect Model (ATAM), a new topic model for Twitter that associates symptoms, treatments and general words with diseases (ailments). We train ATAM on a new collection of 1.6 million tweets discussing numerous health related topics. ATAM isolates more coherent ailments, such as influenza, infections, obesity, as compared to standard topic models. Furthermore, ATAM matches influenza tracking results produced by Google Flu Trends and previous influenza specialized Twitter models compared with government public health data.
     Michael J. Paul, Mark Dredze. You Are What You Tweet: Analyzing Twitter for Public Health. International Conference on Weblogs and Social Media (ICWSM), 2011. [PDF] [Bibtex]
Analyzing user messages in social media can measure different population characteristics, including public health measures. For example, recent work has correlated Twitter messages with influenza rates in the United States; but this has largely been the extent of mining Twitter for public health. In this work, we consider a broader range of public health applications for Twitter. We apply the recently introduced Ailment Topic Aspect Model to over one and a half million health related tweets and discover mentions of over a dozen ailments, including allergies, obesity and insomnia. We introduce extensions to incorporate prior knowledge into this model and apply it to several tasks: tracking illnesses over time (syndromic surveillance), measuring behavioral risk factors, localizing illnesses by geographic region, and analyzing symptoms and medication usage. We show quantitative correlations with public health data and qualitative evaluations of model output. Our results suggest that Twitter has broad applicability for public health research.
     Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow. Learning Sub-Word Units for Open Vocabulary Speech Recognition. Association for Computational Linguistics (ACL), 2011. [PDF] [Bibtex]
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of sub-word units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the sub-word lexicon optimized for a given task. We consider the task of out-of-vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on English Broadcast News and MIT Lectures tasks, respectively.
     Ariya Rastrow, Markus Dreyer, Abhinav Sethy, Sanjeev Khudanpur, Bhuvana Ramabhadran, Mark Dredze. Hill Climbing on Speech Lattices: A New Rescoring Framework. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011. [PDF] [Bibtex]
We describe a new approach for rescoring speech lattices - with long-span language models or wide-context acoustic models - that does not entail computationally intensive lattice expansion or limited rescoring of only an N-best list. We view the set of word-sequences in a lattice as a discrete space equipped with the edit-distance metric, and develop a hill climbing technique to start with, say, the 1-best hypothesis under the lattice-generating model(s) and iteratively search a local neighborhood for the highest-scoring hypothesis under the rescoring model(s); such neighborhoods are efficiently constructed via finite state techniques. We demonstrate empirically that to achieve the same reduction in error rate using a better estimated, higher order language model, our technique evaluates fewer utterance-length hypotheses than conventional N-best rescoring by two orders of magnitude. For the same number of hypotheses evaluated, our technique results in a significantly lower error rate.

     2010 (12 Publications)
     Mark Dredze, Aren Jansen, Glen Coppersmith, Kenneth Church. NLP on Spoken Documents without ASR. Empirical Methods in Natural Language Processing (EMNLP), 2010. [PDF] [Bibtex]
There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (~1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.
     Mark Dredze, Tim Oates, Christine Piatko. We're Not in Kansas Anymore: Detecting Domain Changes in Streams. Empirical Methods in Natural Language Processing (EMNLP), 2010. [PDF] [Bibtex]
Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention -- detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses A-distance, a metric for detecting shifts in data streams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.
     Carolina Parada, Abhinav Sethy, Mark Dredze, Fred Jelinek. A Spoken Term Detection Framework for Recovering Out-of-Vocabulary Words Using the Web. International Speech Communication Association (INTERSPEECH), 2010. [PDF] [Bibtex]
Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into system output, recovering up to 40% of OOVs and resulting in a reduction in system error.
     Delip Rao, Paul McNamee, Mark Dredze. Streaming Cross Document Entity Coreference Resolution. Conference on Computational Linguistics (Coling), 2010. [PDF] [Bibtex]
Previous research in cross-document entity coreference has generally been restricted to the offline scenario where the set of documents is provided in advance. As a consequence, the dominant approach is based on greedy agglomerative clustering techniques that utilize pairwise vector comparisons and thus require O(n^2) space and time. In this paper we explore identifying coreferent entity mentions across documents in high-volume streaming text, including methods for utilizing orthographic and contextual information. We test our methods using several corpora to quantitatively measure both the efficacy and scalability of our streaming approach. We show that our approach scales to at least an order of magnitude larger data than previously reported methods.
     Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, Tim Finin. Entity Disambiguation for Knowledge Base Population. Conference on Computational Linguistics (Coling), 2010. [PDF] [Bibtex]
The integration of facts derived from information extraction systems into existing knowledge bases requires a system to disambiguate entity mentions in the text. This is challenging due to issues such as non-uniform variations in entity names, mention ambiguity, and entities absent from a knowledge base. We present a state-of-the-art system for entity disambiguation that not only addresses these challenges but also scales to knowledge bases with several million entries using very few resources. Further, our approach achieves performance of up to 95% on entities mentioned from newswire and 80% on a public test set that was designed to include challenging queries.
     Chris Callison-Burch, Mark Dredze. Creating Speech and Language Data With Amazon's Mechanical Turk. NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk, 2010. [PDF] [Bibtex]
In this paper we give an introduction to using Amazon's Mechanical Turk crowdsourcing platform for the purpose of collecting data for human language technologies. We survey the papers published in the NAACL-2010 Workshop. 24 researchers participated in the workshop's shared task to create data for speech and language applications with $100.
     Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, Mark Dredze. Annotating named entities in Twitter data with crowdsourcing. NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk, 2010. [PDF] [Bibtex] [Data]
We describe our experience using both Amazon Mechanical Turk (MTurk) and CrowdFlower to collect simple named entity annotations for Twitter status updates. Unlike most genres that have traditionally been the focus of named entity experiments, Twitter is far more informal and abbreviated. The collected annotations and annotation techniques will provide a first step towards the full study of named entity recognition in domains like Facebook and Twitter. We also briefly describe how to use MTurk to collect judgements on the quality of "word clouds."
     Matthew R. Gormley, Adam Gerber, Mary Harper, Mark Dredze. Non-Expert Correction of Automatically Generated Relation Annotations. NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk, 2010. [PDF] [Bibtex]
We explore a new way to collect human annotated relations in text using Amazon Mechanical Turk. Given a knowledge base of relations and a corpus, we identify sentences which mention both an entity and an attribute that have some relation in the knowledge base. Each noisy sentence/relation pair is presented to multiple turkers, who are asked whether the sentence expresses the relation. We describe a design which encourages user efficiency and aids discovery of cheating. We also present results on inter-annotator agreement.
     Courtney Napoles, Mark Dredze. Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language. NAACL-HLT Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids, 2010. [PDF] [Bibtex]
Text simplification is the process of changing vocabulary and grammatical structure to create a more accessible version of the text while maintaining the underlying information and content. Automated tools for text simplification are a practical way to make large corpora of text accessible to a wider audience lacking high levels of fluency in the corpus language. In this work, we investigate the potential of Simple Wikipedia to assist automatic text simplification by building a statistical classification system that discriminates simple English from ordinary English. Most text simplification systems are based on hand-written rules (e.g., PEST and its module SYSTAR), and therefore face limitations scaling and transferring across domains. The potential for using Simple Wikipedia for text simplification is significant; it contains nearly 60,000 articles with revision histories and aligned articles to ordinary English Wikipedia. Using articles from Simple Wikipedia and ordinary Wikipedia, we evaluated different classifiers and feature sets to identify the most discriminative features of simple English for use across domains. These findings help further understanding of what makes text simple and can be applied as a tool to help writers craft simple text.
     Justin Ma, Alex Kulesza, Koby Crammer, Mark Dredze, Lawrence Saul, Fernando Pereira. Exploiting Feature Covariance in High-Dimensional Online Learning. AIStats, 2010. [PDF] [Bibtex]
Some online algorithms for linear classification model the uncertainty in their weights over the course of learning. Modeling the full covariance structure of the weights can provide a significant advantage for classification. However, for high-dimensional, large-scale data, even though there may be many second-order feature interactions, it is computationally infeasible to maintain this covariance structure. To extend second-order methods to high-dimensional data, we develop low-rank approximations of the covariance structure. We evaluate our approach on both synthetic and real-world data sets using the confidence-weighted online learning framework. We show improvements over diagonal covariance matrices for both low and high-dimensional data.
     Carolina Parada, Mark Dredze, Denis Filimonov, Fred Jelinek. Contextual Information Improves OOV Detection in Speech. North American Chapter of the Association for Computational Linguistics (NAACL), 2010. [PDF] [Bibtex]
Out-of-vocabulary (OOV) words represent an important source of error in large vocabulary continuous speech recognition (LVCSR) systems. These words cause recognition failures, which propagate through pipeline systems, impacting the performance of downstream applications. The detection of OOV regions in the output of an LVCSR system is typically addressed as a binary classification task, where each region is independently classified using local information. In this paper, we show that jointly predicting OOV regions, and including contextual information from each region, leads to substantial improvement in OOV detection. Compared to the state of the art, we reduce the missed OOV rate from 42.6% to 28.4% at a 10% false alarm rate.
     Mark Dredze, Alex Kulesza, Koby Crammer. Multi-Domain Learning by Confidence-Weighted Parameter Combination. Machine Learning, 2010;79(1-2):123-149. [PDF] [Bibtex] [Tech Report]
State-of-the-art statistical NLP systems for a variety of tasks learn from labeled training data that is often domain specific. However, there may be multiple domains or sources of interest on which the system must perform. For example, a spam filtering system must give high quality predictions for many users, each of whom receives emails from different sources and may make slightly different decisions about what is or is not spam. Rather than learning separate models for each domain, we explore systems that learn across multiple domains. We develop a new multi-domain online learning framework based on parameter combination from multiple classifiers. Our algorithms draw from multi-task learning and domain adaptation to adapt multiple source domain classifiers to a new target domain, learn across multiple similar domains, and learn across a large number of disparate domains. We evaluate our algorithms on two popular NLP domain adaptation tasks: sentiment classification and spam filtering.

     2009 (6 Publications)
       Mark Dredze. Intelligent Email: Aiding Users with AI. PhD Thesis, Computer and Information Science, University of Pennsylvania, 2009. [PDF] [Bibtex]
     Paul McNamee, Mark Dredze, Adam Gerber, Nikesh Garera, Tim Finin, James Mayfield, Christine Piatko, Delip Rao, David Yarowsky, Markus Dreyer. HLTCOE Approaches to Knowledge Base Population at TAC 2009. Text Analysis Conference (TAC), 2009. [Bibtex]
The HLTCOE participated in the entity linking and slot filling tasks at TAC 2009. A machine learning-based approach to entity linking, operating over a wide range of feature types, yielded good performance on the entity linking task. Slot-filling based on sentence selection, application of weak patterns and exploitation of redundancy was ineffective in the slot filling task.
     Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Advances in Neural Information Processing Systems (NIPS), 2009. [PDF] [Bibtex]
We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.
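As a rough illustration of the kind of update the abstract above describes, the sketch below follows the commonly cited form of the AROW step: update only when the margin is below 1, move the mean weights in proportion to the current per-feature variance, then shrink the variance of the features that were seen. The diagonal covariance, the function name arow_update, and the default r=1.0 are simplifying assumptions made here, not details taken from this page.

```python
def arow_update(w, sigma, x, y, r=1.0):
    """One AROW-style step with a diagonal covariance (an assumption).

    w     -- list of mean weights
    sigma -- list of per-feature variances (diagonal of the covariance)
    x     -- list of feature values
    y     -- label, either -1 or +1
    r     -- regularization constant, must be > 0 (hypothetical default)
    """
    margin = sum(wi * xi for wi, xi in zip(w, x))
    if y * margin >= 1.0:
        # Margin already large enough: leave the model unchanged.
        return list(w), list(sigma)

    # Confidence term x^T Sigma x for a diagonal Sigma.
    v = sum(si * xi * xi for si, xi in zip(sigma, x))
    beta = 1.0 / (v + r)
    alpha = (1.0 - y * margin) * beta

    # Mean update: low-confidence (high-variance) features move more.
    new_w = [wi + alpha * y * si * xi for wi, si, xi in zip(w, sigma, x)]
    # Variance update: variance shrinks for features present in x.
    new_sigma = [si - beta * (si * xi) ** 2 for si, xi in zip(sigma, x)]
    return new_w, new_sigma
```

Calling arow_update repeatedly on a stream of (x, y) pairs gives a single-pass online learner; a larger r makes each step more conservative.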
     Mark Dredze, Partha Pratim Talukdar, Koby Crammer. Sequence Learning from Data with Multiple Labels. ECML/PKDD Workshop on Learning from Multi-Label Data, 2009. [PDF] [Bibtex]
We present novel algorithms for learning structured predictors from instances with multiple labels in the presence of noise. The proposed algorithms improve performance on two standard NLP tasks when we have a small amount of training data (low quantity) and when the labels are noisy (low quality). In these settings, the methods improve performance over using a single label, in some cases exceeding performance using gold labels. Our methods could be used in a semi-supervised setting, where a limited amount of labeled data could be combined with a rule based automatic labeling of unlabeled data with multiple possible labels.
     Koby Crammer, Mark Dredze, Alex Kulesza. Multi-Class Confidence Weighted Algorithms. Empirical Methods in Natural Language Processing (EMNLP), 2009. [PDF] [Bibtex] [Data (Amazon 7)]
The recently introduced online confidence-weighted (CW) learning algorithm for binary classification performs well on many binary NLP tasks. However, for multi-class problems CW learning updates and inference cannot be computed analytically or solved as convex optimization problems as they are in the binary case. We derive learning algorithms for the multi-class CW setting and provide extensive evaluation using nine NLP datasets, including three derived from the recently released New York Times corpus. Our best algorithm outperforms state-of-the-art online and batch methods on eight of the nine tasks. We also show that the confidence information maintained during learning yields useful probabilistic information at test time.
     Mark Dredze, Bill Schilit, Peter Norvig. Suggesting Email View Filters for Triage and Search. International Joint Conference on Artificial Intelligence (IJCAI), 2009. [PDF] [Bibtex]
Growing email volumes cause flooded inboxes and swelled email archives, making search and new email processing difficult. While emails have rich metadata, such as recipients and folders, suitable for creating filtered views, it is often difficult to choose appropriate filters for new inbox messages without first examining messages. In this work, we consider a system that automatically suggests relevant view filters to the user for the currently viewed messages. We propose several ranking algorithms for suggesting useful filters. Our work suggests that such systems quickly filter groups of inbox messages and find messages more easily during search.

     2008 (12 Publications)
     Kevin Lerman, Ari Gilder, Mark Dredze, Fernando Pereira. Reading the Markets: Forecasting Public Opinion of Political Candidates by News Analysis. Conference on Computational Linguistics (Coling), 2008. [PDF] [Bibtex]
Media reporting shapes public opinion which can in turn influence events, particularly in political elections, in which candidates both respond to and shape public perception of their campaigns. We use computational linguistics to automatically predict the impact of news on public perception of political candidates. Our system uses daily newspaper articles to predict shifts in public opinion as reflected in prediction markets. We discuss various types of features designed for this problem. The news system improves market prediction over baseline market systems.
     Mark Dredze, Joel Wallenberg. Further Results and Analysis of Icelandic Part of Speech Tagging. Technical Report MS-CIS-08-13, University of Pennsylvania, Department of Computer and Information Science, 2008. [PDF] [Bibtex]
Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions. This paper presents further results and analysis to the original work.
     Mark Dredze, Joel Wallenberg. Icelandic Data-Driven Part of Speech Tagging. Association for Computational Linguistics (ACL) (short paper), 2008. [PDF] [Bibtex]
Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions.
     Kuzman Ganchev, Mark Dredze. Small Statistical Models by Random Feature Mixing. ACL Workshop on Mobile NLP, 2008. [PDF] [Bibtex]
The application of statistical NLP systems to resource constrained devices is limited by the need to maintain parameters for a large number of features and an alphabet mapping features to parameters. We introduce random feature mixing to eliminate alphabet storage and reduce the number of parameters without severely impacting model performance.
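The idea in the abstract above can be sketched in a few lines: instead of storing an alphabet that maps each feature name to its own parameter, hash the name directly into a fixed-size weight vector and let colliding features share a weight. This is a minimal sketch under stated assumptions; the use of zlib.crc32 as the hash, the perceptron update, and all function names are illustrative choices made here, not details from the paper.

```python
import zlib


def feature_index(name, num_params):
    """Map a feature name to a slot without storing an alphabet.

    crc32 is a stand-in hash (an assumption), used because it is
    deterministic across runs.
    """
    return zlib.crc32(name.encode("utf-8")) % num_params


def score(weights, features):
    """Linear score; features is a dict of feature name -> value."""
    n = len(weights)
    return sum(weights[feature_index(name, n)] * value
               for name, value in features.items())


def perceptron_update(weights, features, label, lr=1.0):
    """Mistake-driven update; label is -1 or +1.

    Features that hash to the same slot are "mixed": they share
    a single weight.
    """
    if label * score(weights, features) <= 0:
        n = len(weights)
        for name, value in features.items():
            weights[feature_index(name, n)] += lr * label * value


# Usage: the model is only the weight list; no name-to-index table is kept.
weights = [0.0] * 16
example = {"word=free": 1.0, "word=offer": 1.0}
perceptron_update(weights, example, +1)
```

The size of the weight list is the knob that trades memory for accuracy: fewer slots mean more collisions between features.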
     Mark Dredze, Hanna Wallach, Danny Puller, Fernando Pereira. Generating Summary Keywords for Emails Using Topics. Intelligent User Interfaces (IUI), 2008. [PDF] [Bibtex]
Email summary keywords, used to concisely represent the gist of an email, can help users manage and prioritize large numbers of messages. We develop an unsupervised learning framework for selecting summary keywords from emails using latent representations of the underlying topics in a user's mailbox. This approach selects words that describe each message in the context of existing topics rather than simply selecting keywords based on a single message in isolation. We present and compare four methods for selecting summary keywords based on two well-known models for inferring latent topics: latent semantic analysis and latent Dirichlet allocation. The quality of the summary keywords is assessed by generating summaries for emails from twelve users in the Enron corpus. The summary keywords are then used in place of entire messages in two proxy tasks: automated foldering and recipient prediction. We also evaluate the extent to which summary keywords enhance the information already available in a typical email user interface by repeating the same tasks using email subject lines.
     Mark Dredze, Hanna Wallach, Danny Puller, Tova Brooks, Josh Carroll, Joshua Magarick, John Blitzer, Fernando Pereira. Intelligent Email: Aiding Users with AI. American National Conference on Artificial Intelligence (AAAI) (Nectar), 2008. [PDF] [Bibtex]
Email occupies a central role in the modern workplace. This has led to a vast increase in the number of email messages that users are expected to handle daily. Furthermore, email is no longer simply a tool for asynchronous online communication - email is now used for task management, personal archiving, as well as both synchronous and asynchronous online communication. This explosion can lead to "email overload" - many users are overwhelmed by the large quantity of information in their mailboxes. In the human-computer interaction community, there has been much research on tackling email overload. Recently, similar efforts have emerged in the artificial intelligence (AI) and machine learning communities to form an area of research known as intelligent email. In this paper, we take a user-oriented approach to applying AI to email. We identify enhancements to email user interfaces and employ machine learning techniques to support these changes. We focus on three tasks - summary keyword generation, reply prediction and attachment prediction - and summarize recent work in these areas.
     Mark Dredze, Tova Brooks, Josh Carroll, Joshua Magarick, John Blitzer, Fernando Pereira. Intelligent Email: Reply and Attachment Prediction. Intelligent User Interfaces (IUI), 2008. [PDF] [Bibtex]
We present two prediction problems under the rubric of Intelligent Email that are designed to support enhanced email interfaces that relieve the stress of email overload. Reply prediction alerts users when an email requires a response and facilitates email response management. Attachment prediction alerts users when they are about to send an email missing an attachment or triggers a document recommendation system, which can catch missing attachment emails before they are sent. Both problems use the same underlying email classification system and task specific features. Each task is evaluated for both single-user and cross-user settings.
     Mark Dredze, Hanna Wallach. User Models for Email Activity Management. IUI Workshop on Ubiquitous User Modeling, 2008. [PDF] [Bibtex]
A single user activity, such as planning a conference trip, typically involves multiple actions. Although these actions may involve several applications, the central point of co-ordination for any particular activity is usually email. Previous work on email activity management has focused on clustering emails by activity. Dredze et al. accomplished this by combining supervised classifiers based on document similarity, authors and recipients, and thread information. In this paper, we take a different approach and present an unsupervised framework for email activity clustering. We use the same information sources as Dredze et al. (namely, document similarity, message recipients and authors, and thread information) but combine them to form an unsupervised, non-parametric Bayesian user model. This approach enables email activities to be inferred without any user input. Inferring activities from a user's mailbox adapts the model to that user. We next describe the statistical machinery that forms the basis of our user model, and explain how several email properties may be incorporated into the model. We evaluate this approach using the same data as Dredze et al., showing that our model does well at clustering emails by activity.
     Koby Crammer, Mark Dredze, Fernando Pereira. Exact Convex Confidence-Weighted Learning. Advances in Neural Information Processing Systems (NIPS), 2008. [PDF] [Bibtex]
Confidence-weighted (CW) learning, an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods.
     Mark Dredze, Koby Crammer, Fernando Pereira. Confidence-Weighted Linear Classification. International Conference on Machine Learning (ICML), 2008. [PDF] [Bibtex]
We introduce confidence-weighted linear classifiers, which add parameter confidence information to linear classifiers. Online learners in this setting update both classifier parameters and the estimate of their confidence. The particular online algorithms we study here maintain a Gaussian distribution over parameter vectors and update the mean and covariance of the distribution with each instance. Empirical evaluation on a range of NLP tasks shows that our algorithm improves over other state-of-the-art online and batch methods, learns faster in the online setting, and lends itself to better classifier combination after parallel training.
     Mark Dredze, Koby Crammer. Active Learning with Confidence. Association for Computational Linguistics (ACL) (short paper), 2008. [PDF] [Bibtex]
Active learning is a machine learning approach to achieving high accuracy with a small number of labels by letting the learning algorithm choose instances to be labeled. Most previous approaches based on discriminative learning use the margin for choosing instances. We present a method for incorporating confidence into the margin by using a newly introduced online learning algorithm and show empirically that confidence improves active learning.
     Mark Dredze, Koby Crammer. Online Methods for Multi-Domain Learning and Adaptation. Empirical Methods in Natural Language Processing (EMNLP), 2008. [PDF] [Bibtex]
NLP tasks are often domain specific, yet systems can learn behaviors across multiple domains. We develop a new multi-domain online learning framework based on parameter combination from multiple classifiers. Our algorithms draw from multi-task learning and domain adaptation to adapt multiple source domain classifiers to a new target domain, learn across multiple similar domains, and learn across a large number of disparate domains. We evaluate our algorithms on two popular NLP domain adaptation tasks: sentiment classification and spam filtering.

     2007 (11 Publications)
     John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. North East Student Colloquium on Artificial Intelligence (NESCAI), 2007. [Bibtex]
     Danny Puller, Hanna Wallach, Mark Dredze, Fernando Pereira. Generating Summary Keywords for Emails Using Topics. Women in Machine Learning Workshop (WiML) at Grace Hopper, 2007. [Bibtex]
     Koby Crammer, Mark Dredze, John Blitzer, Fernando Pereira. Batch Performance for an Online Price. NIPS Workshop on Efficient Machine Learning, 2007. [PDF] [Bibtex]
Batch learning techniques achieve good performance, but at the cost of many (sometimes even hundreds of) passes over the data. For many tasks, such as web-scale ranking of machine translation hypotheses, making many passes over the data is prohibitively expensive, even in parallel over thousands of machines. Online algorithms, which treat data as a stream of examples, are conceptually appealing for these large scale problems. In practice, however, online algorithms tend to underperform batch methods, unless they are themselves run in multiple passes over the data. In this work we explore a new type of online learning algorithm that incorporates a measure of confidence into the algorithm. The model maintains a confidence for each parameter, reflecting previously observed properties of the data. While this requires an additional parameter for each feature of the data, this is a minimal cost when compared to running the algorithm multiple times over the data. The resulting algorithm learns faster, requiring both fewer training instances and fewer passes over the training data, often approaching batch performance with only a single pass through the data.
     Mark Dredze, Krzysztof Czuba. Learning to Admit You're Wrong: Statistical Tools for Evaluating Web QA. NIPS Workshop on Machine Learning for Web Search, 2007. [PDF] [Bibtex]
Web search engines provide specialized results to specific queries, often relying on the output of a QA system. However, targeted answers, while helpful, are embarrassing when wrong. Automated techniques are required to avoid wrong answers and improve system performance. We present the Expected Answer System, a statistical data-driven framework that analyzes the performance of a QA system with the goal of improving system accuracy. Our system is used for wrong answer prediction, missing answer discovery, and question class analysis. An empirical study of a production QA system, one of the first such evaluations presented in the literature, motivates our approach.
     Kedar Bellare, Partha Pratim Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman, Andrew McCallum, Mark Dredze. Lightly-Supervised Attribute Extraction for Web Search. NIPS Workshop on Machine Learning for Web Search, 2007. [PDF] [Bibtex]
Web search engines can greatly benefit from knowledge about attributes of entities present in search queries. In this paper, we introduce lightly-supervised methods for extracting entity attributes from natural language text. Using these methods, we are able to extract large numbers of attributes of different entities at fairly high precision from a large natural language corpus. We compare our methods against a previously proposed pattern-based relation extractor, showing that the new methods give considerable improvements over that baseline. We also demonstrate that query expansion using extracted attributes improves retrieval performance on underspecified information-seeking queries.
     Neal Parikh, Mark Dredze. Graphical Models for Primarily Unsupervised Sequence Labeling. Technical Report MS-CIS-07-18, University of Pennsylvania, Department of Computer and Information Science, 2007. [PDF] [Bibtex]
Most models used in natural language processing must be trained on large corpora of labeled text. This tutorial explores a 'primarily unsupervised' approach (based on graphical models) that augments a corpus of unlabeled text with some form of prior domain knowledge, but does not require any fully labeled examples. We survey probabilistic graphical models for (supervised) classification and sequence labeling and then present the prototype-driven approach of Haghighi and Klein (2006) to sequence labeling in detail, including a discussion of the theory and implementation of both conditional random fields and prototype learning. We show experimental results for English part of speech tagging.
     Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach. Learning Fast Classifiers for Image Spam. Conference on Email and Anti-Spam (CEAS), 2007. [PDF] [Bibtex] [Data]
Recently, spammers have proliferated image spam, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they classify spam images with accuracy in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high accuracy features and a method to learn fast classifiers.
     Koby Crammer, Mark Dredze, Kuzman Ganchev, Partha Pratim Talukdar, Steven Carroll. Automatic Code Assignment to Medical Text. BioNLP Workshop at ACL, 2007. [PDF] [Bibtex]
Code assignment is important for handling large amounts of electronic medical data in the modern hospital. However, only expert annotators with extensive training can assign codes. We present a system for the assignment of ICD-9-CM clinical codes to free text radiology reports. Our system assigns a code configuration, predicting one or more codes for each document. We combine three coding systems into a single learning system for higher accuracy. We compare our system on a real world medical dataset with both human annotators and other automated systems, achieving nearly the maximum score on the Computational Medicine Center's challenge.
     Mark Dredze, Hanna M. Wallach. Email Keyword Summarization and Visualization with Topic Models. North East Student Colloquium on Artificial Intelligence (NESCAI), 2007. [Bibtex]
     John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association for Computational Linguistics (ACL), 2007. [PDF] [Bibtex] (Over 1000 citations)
Automatic sentiment classification has been extensively studied and applied in recent years. However, sentiment is expressed differently in different domains, and annotating corpora for every possible domain of interest is impractical. We investigate domain adaptation for sentiment classifiers, focusing on online reviews for different types of products. First, we extend to sentiment classification the recently-proposed structural correspondence learning (SCL) algorithm, reducing the relative error due to adaptation between domains by an average of 30% over the original SCL algorithm and 46% over a supervised baseline. Second, we identify a measure of domain similarity that correlates well with the potential for adaptation of a classifier from one domain to another. This measure could for instance be used to select a small set of domains to annotate whose trained classifiers would transfer well to many other domains.
     Mark Dredze, John Blitzer, Partha Pratim Talukdar, Kuzman Ganchev, Joao Graca, Fernando Pereira. Frustratingly Hard Domain Adaptation for Parsing. Shared Task - Conference on Natural Language Learning - CoNLL 2007 shared task, 2007. [PDF] [Bibtex]
We describe some challenges of adaptation in the 2007 CoNLL Shared Task on Domain Adaptation. Our error analysis for this task suggests that a primary source of error is differences in annotation guidelines between treebanks. Our suspicions are supported by the observation that no team was able to improve target domain performance substantially over a state of the art baseline.

     2006 (4 Publications)
     Mark Dredze, John Blitzer, Koby Crammer, Fernando Pereira. Feature Design for Transfer Learning. North East Student Colloquium on Artificial Intelligence (NESCAI), 2006. [PDF] [Bibtex]
     Mark Dredze, John Blitzer, Fernando Pereira. "Sorry, I Forgot the Attachment:" Email Attachment Prediction. Conference on Email and Anti-Spam (CEAS), 2006. [PDF] [Bibtex]
The missing attachment problem: a missing attachment generates a wave of emails from the recipients notifying the sender of the error. We present an attachment prediction system to reduce the volume of missing attachment mail. Our classifier could prompt an alert when an outgoing email is missing an attachment. Additionally, the system could activate an attachment recommendation system, whereby suggested documents are offered once the system determines the user is likely to include an attachment, effectively reminding the user to include the attachment. We present promising initial results and discuss implications of our work.
     Mark Dredze, Tessa Lau, Nicholas Kushmerick. Automatically classifying emails into activities. Intelligent User Interfaces (IUI), 2006. [PDF] [Bibtex]
Email-based activity management systems promise to give users better tools for managing increasing volumes of email, by organizing email according to a user's activities. Current activity management systems do not automatically classify incoming messages by the activity to which they belong, instead relying on simple heuristics (such as message threads), or asking the user to manually classify incoming messages as belonging to an activity. This paper presents several algorithms for automatically recognizing emails as part of an ongoing activity. Our baseline methods are the use of message reply-to threads to determine activity membership and a naive Bayes classifier. Our SimSubset and SimOverlap algorithms compare the people involved in an activity against the recipients of each incoming message. Our SimContent algorithm uses IRR (a variant of latent semantic indexing) to classify emails into activities using similarity based on message contents. An empirical evaluation shows that each of these methods provides a significant improvement over the baseline methods. In addition, we show that a combined approach that votes the predictions of the individual methods performs better than each individual method alone.
     Nicholas Kushmerick, Tessa Lau, Mark Dredze, Rinat Khoussainov. Activity-Centric Email: A Machine Learning Approach. American National Conference on Artificial Intelligence (AAAI) (Nectar), 2006. [PDF] [Bibtex]

     2005 (3 Publications)
     Rie Kubota Ando, Mark Dredze, Tong Zhang. TREC 2005 Genomics Track Experiments at IBM Watson. Text REtrieval Conference (TREC), 2005. [PDF] [Bibtex] (Group invited talk at TREC 2005, ranked 3rd and 4th out of 53 entries)
     Mark Dredze, John Blitzer, Fernando Pereira. Reply Expectation Prediction for Email Management. Conference on Email and Anti-Spam (CEAS), 2005. [PDF] [Bibtex]
We reduce email overload by addressing the problem of waiting for a reply to one's email. We predict whether sent and received emails necessitate a reply, enabling the user to both better manage his inbox and to track mail sent to others. We discuss the features used to discriminate emails, show promising initial results with a logistic regression model, and outline future directions for this work.
     Catalina Danis, Wendy Kellogg, Tessa Lau, Mark Dredze, Jeffrey Stylos, Nicholas Kushmerick. Managers Email: Beyond Tasks and To-Dos. Conference on Human Factors in Computing Systems (CHI) (Extended Abstracts), 2005. [PDF] [Bibtex]
In this paper, we describe preliminary findings that indicate that managers and non-managers think about their email differently. We asked three research managers and three research non-managers to sort about 250 of their own email messages into categories that "would help them to manage their work." Our analyses indicate that managers create more categories and a more differentiated category structure than non-managers. Our data also suggest that managers create "relationship-oriented" categories more often than non-managers. These results are relevant to research on "email overload" that has highlighted the use of email for activities beyond communication. In particular, our findings suggest that too strong a focus on task management may be incomplete, and that a user's organizational role has an impact on their conceptualization and likely use of email.

     2004 (1 Publication)
       Mark Dredze, Jeffrey Stylos, Tessa Lau, Wendy Kellogg, Catalina Danis, Nicholas Kushmerick. Taxie: Automatically identifying tasks in email. Unpublished Manuscript, 2004. [Bibtex]

     2003 (1 Publication)
     Kevin Livingston, Mark Dredze, Kristian Hammond, Larry Birnbaum. Beyond Broadcast. International Conference on Intelligent User Interfaces (IUI), 2003. [Bibtex]

     Master's Thesis
      My master's thesis in Jewish Studies at Yeshiva University is titled The Values of Traditional Judaism in Chicago. Please email me if you'd like a copy.