# Publications


 2019 (8 Publications) Silvio Amir, Mark Dredze, John W Ayers. Population Level Mental Health Surveillance over Social Media with Digital Cohorts. NAACL Workshop on Computational Linguistics and Clinical Psychology, 2019. [Bibtex] Alicia L Nobles, Mark Dredze, John W Ayers. "Repeal and replace": increased demand for intrauterine devices following the 2016 presidential election. Contraception, 2019. [PDF] [Bibtex] Tao Chen, Mark Dredze, Jonathan P Weiner, Leilani Hernandez, Joe Kimura, Hadi Kharrazi. Extraction of Geriatric Syndromes From Electronic Health Record Clinical Notes: Assessment of Statistical Natural Language Processing Methods. JMIR Medical Informatics, 2019. [PDF] [Bibtex] Elliot Schumacher, Mark Dredze. Discriminative Candidate Generation for Medical Concept Linking. Knowledge Base Construction (AKBC), 2019. [Bibtex] Ran Zhao, Yuntian Deng, Mark Dredze, Arun Verma, David Rosenberg, Amanda Stent. Visual Attention Model for Cross-sectional Stock Return Prediction and End-to-End Multimodal Market Representation Learning. The Florida Artificial Intelligence Research Society (FLAIRS), 2019. [PDF] [Bibtex] Technical and fundamental analysis are traditional tools used to analyze stocks; however, the finance literature has shown that the price movement of each individual stock is highly correlated with that of other stocks, especially those within the same sector. In this paper we propose a general-purpose market representation that incorporates fundamental and technical indicators and relationships between individual stocks. We treat the daily stock market as a "market image" where rows (grouped by market sector) represent individual stocks and columns represent indicators. We apply a convolutional neural network over this market image to build market features in a hierarchical way. We use a recurrent neural network, with an attention mechanism over the market feature maps, to model temporal dynamics in the market.
Our model outperforms strong baselines in both short-term and long-term stock return prediction tasks. We also show another use for our market image: to construct concise and dense market embeddings suitable for downstream prediction tasks. Joshua Dredze, Lisi Dredze, Mark Dredze. Measuring Online Information Seeking for Stimulants from Google Search Queries. American Psychological Association (APA), 2019. [Bibtex] John W Ayers, Alicia L Nobles, Mark Dredze. Media Trends for the Substance Abuse and Mental Health Services Administration 800-662-HELP Addiction Treatment Referral Services After a Celebrity Overdose. JAMA Internal Medicine, 2019. [PDF] [Bibtex] (Ranked in the top 1% of 12.5m research outputs by Altmetric) Xiaolei Huang, Michael C Smith, Amelia M Jamison, David A Broniatowski, Mark Dredze, Sandra Crouse Quinn, Justin Cai, Michael J Paul. Can online self-reports assist in real-time identification of influenza vaccination uptake? A cross-sectional study of influenza vaccine-related tweets in the USA, 2013–2017. BMJ Open, 2019. [PDF] [Bibtex]

 2018 (18 Publications) John W Ayers, Mark Dredze, Eric C Leas, Theodore L Caputi, Jon-Patrick Allem, Joanna E Cohen. Next generation media monitoring: Global coverage of electronic nicotine delivery systems (electronic cigarettes) on Bing, Google and Twitter, 2013-2018. PloS one, 2018;13(11):e0205822. [PDF] [Bibtex] News media monitoring is an important scientific tool. By treating news reporters as data collectors and their reports as qualitative accounts of a fast changing public health landscape, researchers can glean many valuable insights. Yet, there have been surprisingly few innovations in public health media monitoring, with nearly all studies relying on labor-intensive content analyses limited to a small number of media reports. We propose to advance this subfield by using scalable machine learning. In potentially the largest contemporary public health media monitoring study to date, we systematically characterize global news reports surrounding electronic cigarettes or electronic nicotine delivery systems (ENDS) using natural language processing techniques. News reports including ENDS terms (e.g., "electronic cigarettes") from over 100,000 sources (all sources archived on Google News or Bing News, as well as all news articles shared on Twitter) were monitored for 1 January 2013 through 31 July 2018. The geographic and subject (e.g., prevalence, bans, quitting, warnings, marketing, prices, age, flavor and industry) foci of news articles, their popularity among readers who share news on social media, and the sentiment behind news articles were assessed algorithmically. Globally there were 86,872 ENDS news reports with coverage increasing from 8 (standard deviation [SD] = 8) stories per day in 2013 to 75 (SD = 56) stories per day during 2018. The focus of ENDS news spanned 148 nations, with the plurality focusing on the United States (34% of all news). 
Potentially overlooked hotspots of ENDS media activity included China, Egypt, Russia, Ukraine, and Paraguay. The most common subject was warnings about ENDS (18%), followed by bans on using ENDS (13%) and ENDS prices (9%). Flavor and age restrictions were the least covered news subjects ( 1% each). Among different subject foci, reports on quitting cigarettes using ENDS had the highest probability of scoring in the top three deciles of popularity rankings. Moreover, ENDS news on quitting and prices had a more positive sentiment on average than news with other subject foci. Public health leaders can use these trends to stay abreast of how ENDS are portrayed in the media, and potentially how the public perceives ENDS. Because our analytical strategies are updated in near real time, we aim to make media monitoring part of standard practice to support evidence-based tobacco control in the future. Masoud Rouhizadeh, Elham Hatef, Mark Dredze, Christopher Chute, Hadi Kharrazi. Identifying Social Determinants of Health from Clinical Notes: A Rule-Based Approach. AMIA Natural Language Processing Working Group Pre-Symposium, 2018. [Bibtex] David A Broniatowski, Amelia M Jamison, SiHua Qi, Lulwah AlKulaib, Tao Chen, Adrian Benton, Sandra C Quinn, Mark Dredze. Weaponized Health Communication: Twitter Bots and Russian Trolls Amplify the Vaccine Debate. American Journal of Public Health (AJPH), 2018;108(10):1378-1384. [PDF] [Bibtex] (Ranked #35 of 12.8 million research outputs by Altmetric and in the top-20 for 2018.) [Data] Objectives. To understand how Twitter bots and trolls ("bots") promote online health content. Methods. We compared bots' to average users' rates of vaccine-relevant messages, which we collected online from July 2014 through September 2017. We estimated the likelihood that users were bots, comparing proportions of polarized and antivaccine tweets across user types. We conducted a content analysis of a Twitter hashtag associated with Russian troll activity. 
Results. Compared with average users, Russian trolls (χ²(1) = 102.0; P < .001), sophisticated bots (χ²(1) = 28.6; P < .001), and "content polluters" (χ²(1) = 7.0; P < .001) tweeted about vaccination at higher rates. Whereas content polluters posted more antivaccine content (χ²(1) = 11.18; P < .001), Russian trolls amplified both sides. Unidentifiable accounts were more polarized (χ²(1) = 12.1; P < .001) and antivaccine (χ²(1) = 35.9; P < .001). Analysis of the Russian troll hashtag showed that its messages were more political and divisive. Conclusions. Whereas bots that spread malware and unsolicited content disseminated antivaccine messages, Russian trolls promoted discord. Accounts masquerading as legitimate users create false equivalency, eroding public consensus on vaccination. Public Health Implications. Directly confronting vaccine skeptics enables bots to legitimize the vaccine debate. More research is needed to determine how best to combat bot-driven content. (Am J Public Health. Published online ahead of print August 23, 2018: e1–e7. doi:10.2105/AJPH.2018.304567) Adrian Benton, Mark Dredze. Using Author Embeddings to Improve Tweet Stance Classification. EMNLP Workshop on Noisy User-generated Text (W-NUT), 2018. [PDF] [Bibtex] Many social media classification tasks analyze the content of a message, but do not consider the context of the message. For example, in tweet stance classification (where a tweet is categorized according to a viewpoint it espouses), the expressed viewpoint depends on latent beliefs held by the user. In this paper we investigate whether incorporating knowledge about the author can improve tweet stance classification. Furthermore, since author information and embeddings are often unavailable for labeled training examples, we propose a semi-supervised pre-training method to predict user embeddings. 
Although the neural stance classifiers we learn are often outperformed by a baseline SVM, author embedding pre-training yields improvements over a non-pre-trained neural network on four out of five domains in the SemEval 2016 6A tweet stance classification task. In a tweet gun control stance classification dataset, improvements from pre-training are only apparent when training data is limited. Zachary Wood-Doughty, Nicholas Andrews, Mark Dredze. Convolutions Are All You Need (For Classifying Character Sequences). EMNLP Workshop on Noisy User-generated Text (W-NUT), 2018. [PDF] [Bibtex] While recurrent neural networks (RNNs) are widely used for text classification, they demonstrate poor performance and slow convergence when trained on long sequences. When text is modeled as characters instead of words, the longer sequences make RNNs a poor choice. Convolutional neural networks (CNNs), although somewhat less ubiquitous than RNNs, have an internal structure more appropriate for long-distance character dependencies. To better understand how CNNs and RNNs differ in handling long sequences, we use them for text classification tasks in several character-level social media datasets. The CNN models vastly outperform the RNN models in our experiments, suggesting that CNNs are superior to RNNs at learning to classify character-level data. Vedran Sekara, Alex Rutherford, Gideon Mann, Mark Dredze, Natalia Adler, Manuel García-Herranz. Trends in the Adoption of Corporate Child Labor Policies: An Analysis with Bloomberg Terminal ESG Data. Bloomberg Data for Good Exchange, 2018. [PDF] [Bibtex] Over 150 million children worldwide are estimated to be engaged in some form of child labor, with nearly one in every four children between the ages of 5 and 14 engaged in potentially harmful work in the world's poorest countries. Child labor compromises children's physical, mental, social and educational development. 
It also reinforces cycles of poverty, negatively affecting the ecosystem necessary for business to thrive in a sustainable manner. Against a backdrop of multiple international and national laws against child labor, corporations also adopt policies on child labor. However, new methods of globally dispersed production have made this commitment to sustainability issues across supply chains more challenging. In this work we examine, through the lens of Bloomberg's environmental, social and governance (ESG) and financial data, trends in corporate child labor policies and their relationship to classic economic variables as a first step in understanding sustainability issues across global supply networks. Zachary Wood-Doughty, Ilya Shpitser, Mark Dredze. Challenges of Using Text Classifiers for Causal Inference. Empirical Methods in Natural Language Processing (EMNLP), 2018. [PDF] [Bibtex] Causal understanding is essential for many kinds of decision-making, but causal inference from observational data has typically only been applied to structured, low-dimensional datasets. While text classifiers produce low-dimensional outputs, their use in causal inference has not previously been studied. To facilitate causal analyses based on language data, we consider the role that text classifiers can play in causal inference through established modeling mechanisms from the causality literature on missing data and measurement error. We demonstrate how to conduct causal analyses using text classifiers on simulated and Yelp data, and discuss the opportunities and challenges of future work that uses text data in causal inference. John W Ayers, Theodore L Caputi, Camille Nebeker, Mark Dredze. Don't quote me: reverse identification of research participants in social media studies. Nature Digital Medicine, 2018. 
[PDF] [Bibtex] (Ranked in the top 0.6% of 11.6 million research outputs by Altmetric) We investigated if participants in social media surveillance studies could be reverse identified by reviewing all articles published on PubMed in 2015 or 2016 with the words "Twitter" and either "read," "coded," or "content" in the title or abstract. Seventy-two percent (95% CI: 63–80) of articles quoted at least one participant's tweet and searching for the quoted content led to the participant 84% (95% CI: 74–91) of the time. Twenty-one percent (95% CI: 13–29) of articles disclosed a participant's Twitter username thereby making the participant immediately identifiable. Only one article reported obtaining consent to disclose identifying information and institutional review board (IRB) involvement was mentioned in only 40% (95% CI: 31–50) of articles, of which 17% (95% CI: 10–25) received IRB approval and 23% (95% CI: 16–32) were deemed exempt. Biomedical publications are routinely including identifiable information by quoting tweets or revealing usernames which, in turn, violates ICMJE ethical standards governing scientific ethics, even though said content is scientifically unnecessary. We propose that authors convey aggregate findings without revealing participants' identities, editors refuse to publish reports that reveal a participant's identity, and IRBs attend to these privacy issues when reviewing studies involving social media data. These strategies together will ensure participants are protected going forward. Yuki Lama, Tao Chen, Mark Dredze, Amelia M Jamison, Sandra C Quinn, David A Broniatowski. Discordance Between Human Papillomavirus Twitter Images and Disparities in Human Papillomavirus Risk and Disease in the United States: Mixed-Methods Analysis. Journal of Medical Internet Research (JMIR), 2018;20(9):e10244. 
[PDF] [Bibtex] Background: Racial and ethnic minorities are disproportionately affected by human papillomavirus (HPV)-related cancer, many of which could have been prevented with vaccination. Yet, the initiation and completion rates of HPV vaccination remain low among these populations. Given the importance of social media platforms for health communication, we examined US-based HPV images on Twitter. We explored inconsistencies between the demographics represented in HPV images and the populations that experience the greatest burden of HPV-related disease. Objective: The objective of our study was to observe whether HPV images on Twitter reflect the actual burden of disease by select demographics and determine to what extent Twitter accounts utilized images that reflect the burden of disease in their health communication messages. Methods: We identified 456 image tweets about HPV that contained faces posted by US users between November 11, 2014 and August 8, 2016. We identified images containing at least one human face and utilized Face++ software to automatically extract the gender, age, and race of each face. We manually annotated the source accounts of these tweets into 3 types as follows: government (38/298, 12.8%), organizations (161/298, 54.0%), and individual (99/298, 33.2%) and topics (news, health, and other) to examine how images varied by message source. Results: Findings reflected the racial demographics of the US population but not the disease burden (795/1219, 65.22% white faces; 140/1219, 11.48% black faces; 71/1219, 5.82% Asian faces; and 213/1219, 17.47% racially ambiguous faces). Gender disparities were evident in the image faces; 71.70% (874/1219) represented female faces, whereas only 27.89% (340/1219) represented male faces. 
Among the 11-26 years age group recommended to receive HPV vaccine, HPV images contained more female-only faces (214/616, 34.3%) than males (37/616, 6.0%); the remainder of images included both male and female faces (365/616, 59.3%). Gender and racial disparities were present across different image sources. Faces from government sources were more likely to depict females (n=44) compared with males (n=16). Of male faces, 80% (12/15) of youth and 100% (1/1) of adults were white. News organization sources depicted high proportions of white faces (28/38, 97% of female youth and 12/12, 100% of adult males). Face++ identified fewer faces compared with manual annotation because of limitations with detecting multiple, small, or blurry faces. Nonetheless, Face++ achieved a high degree of accuracy with respect to gender, race, and age compared with manual annotation. Conclusions: This study reveals critical differences between the demographics reflected in HPV images and the actual burden of disease. Racial minorities are less likely to appear in HPV images despite higher rates of HPV incidence. Health communication efforts need to represent populations at risk better if we seek to reduce disparities in HPV infection. Katherine Smith, Caitlin Weiger, Errol Fields, Joanna E Cohen, Meghan Moran, Mark Dredze. Conducting public health surveillance research on consumer product websites. American Public Health Association (APHA), 2018. [PDF] [Bibtex] Yuchen Zhou, Mark Dredze, David A Broniatowski, William Adler. Gab: The Alt-Right Social Media Platform. International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 2018. [PDF] [Bibtex] This study proposes the use of Gab as a vehicle for political science research regarding modern American politics and the Alt-Right population. We collect several million Gab messages posted on the Gab website from August 2016 to February 2018. 
We conduct a preliminary analysis of Gab platform related to site use, growth and topics, which shows that Gab is a reasonable resource for Alt-Right study. Travis Wolfe, Annabelle Carrell, Mark Dredze, Benjamin Van Durme. Summarizing Entities using Distantly Supervised Information Extractors. SIGIR Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding (KG4IR), 2018. [Bibtex] Alexis S Hammond, Michael J Paul, J Gregory Hobelmann, Animesh R Koratana, Mark Dredze, Margaret S Chisolm. Perceived Attitudes About Substance Use in Anonymous Social Media Posts Near College Campuses. Journal of Medical Internet Research Mental Health (JMIR MH), 2018;5(3):e52. [PDF] [Bibtex] Zachary Wood-Doughty, Praateek Mahajan, Mark Dredze. Johns Hopkins or johnny-hopkins: Classifying Individuals versus Organizations on Twitter. NAACL Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, 2018. [PDF] [Bibtex] Zachary Wood-Doughty, Nicholas Andrews, Rebecca Marvin, Mark Dredze. Predicting Twitter User Demographics from Names Alone. NAACL Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, 2018. [PDF] [Bibtex] Theodore L Caputi, Eric C Leas, Mark Dredze, John W Ayers. Online Sales of Marijuana: An Unrecognized Public Health Dilemma. American Journal of Preventive Medicine (AJPM), 2018;54(5):719-721. [PDF] [Bibtex] Adrian Benton, Mark Dredze. Deep Dirichlet Multinomial Regression. North American Chapter of the Association for Computational Linguistics (NAACL), 2018. [PDF] [Bibtex] Tao Chen, Mark Dredze. Vaccine Images on Twitter: What is Shared and Why. Journal of Medical Internet Research (JMIR), 2018;20(4):2018. [PDF] [Bibtex]

 2017 (23 Publications) Seth M Noar, Eric C Leas, Benjamin M Althouse, Mark Dredze, Dannielle Kelley, John W Ayers. Can a selfie promote public engagement with skin cancer? Preventive Medicine, 2017. [PDF] [Bibtex] Social media may provide new opportunities to promote skin cancer prevention, but research to understand this potential is needed. In April of 2015, Kentucky native Tawny Willoughby (TW) shared a graphic skin cancer selfie on Facebook that subsequently went viral. We examined the volume of comments and shares of her original Facebook post; news volume of skin cancer from Google News; and search volume for skin cancer Google queries. We compared these latter metrics after TW's announcement against expected volumes based on forecasts of historical trends. TW's skin cancer selfie went viral on May 11, 2015 after the social media post had been shared approximately 50,000 times. All search queries for skin cancer increased 162% (95% CI 102 to 320) and 155% (95% CI 107 to 353) on May 13th and 14th, when news about TW's skin cancer selfie was at its peak, and remained higher through May 17th. Google searches about skin cancer prevention and tanning were also significantly higher than expected volumes. In practical terms, searches reached near-record levels: May 13th, 14th and 15th were respectively the 6th, 8th, and 40th most searched days for skin cancer since January 1, 2004 when Google began tracking searches. We conclude that an ordinary person's social media post caught the public's imagination and led to significant increases in public engagement with skin cancer prevention. Digital surveillance methods can rapidly detect these events in near real time, allowing public health practitioners to engage and potentially elevate positive effects. Theodore L Caputi, Eric C Leas, Mark Dredze, Joanna E Cohen, John W Ayers. They're heating up: Internet search query trends reveal significant public interest in heat-not-burn tobacco products. PLoS ONE, 2017. 
[PDF] [Bibtex] Benjamin Van Durme, Tom Lippincott, Kevin Duh, Deana Burchfield, Adam Poliak, Cash Costello, Tim Finin, Scott Miller, James Mayfield, Philipp Koehn, Craig Harman, Dawn Lawrie, Chandler May, Max Thomas, Julianne Chaloux, Annabelle Carrell, Tongfei Chen, Alex Comerford, Mark Dredze, Benjamin Glass, Shudong Hao, Patrick Martin, Rashmi Sankepally, Pushpendre Rastogi, Travis Wolfe, Ying-Ying Tran, Ted Zhang. CADET: Computer Assisted Discovery Extraction and Translation. International Joint Conference on Natural Language Processing (IJCNLP) (Demonstration Track), 2017. [PDF] [Bibtex] Michael C Smith, Mark Dredze, Sandra C Quinn, David A Broniatowski. Monitoring Real-time Spatial Public Health Discussions in the Context of Vaccine Hesitancy. AMIA Workshop on Social Media Mining for Health Applications, 2017. [PDF] [Bibtex] Ning Gao, Mark Dredze, Douglas Oard. Enhancing Scientific Collaboration Through Knowledge Base Population and Linking for Meetings. Hawaii International Conference on System Sciences (HICSS), 2017. [PDF] [Bibtex] Michael J Paul, Mark Dredze. Social Monitoring for Public Health. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2017;9(5):1-183. [PDF] [Bibtex] [Preprint (free)] Ning Gao, Gregory Sell, Douglas Oard, Mark Dredze. Leveraging Side Information for Speaker Identification with the Enron Conversational Telephone Speech Collection. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017. [PDF] [Bibtex] John W Ayers, Benjamin M Althouse, Eric C Leas, Mark Dredze, Jon-Patrick Allem. Internet searches for suicide following the release of 13 Reasons Why. JAMA Internal Medicine, 2017;57(4):238-240. [PDF] [Bibtex] (Ranked in the top .02% of 8.2m research outputs by Altmetric, Read the JAMA IM Editorial, Read Netflix cast's response to criticism) Mark Dredze, Zachary Wood-Doughty, Sandra C Quinn, David A Broniatowski. 
Vaccine opponents' use of Twitter during the 2016 US presidential election: Implications for practice and policy. Vaccine, 2017;35(36):4670-4672. [PDF] [Bibtex] The recent inauguration of President Trump carries with it many public health policy implications. During the election, President Trump, like all political candidates, made policy commitments to various interest groups including vaccine skeptics. These groups celebrated the announcement that Robert Kennedy Jr., a noted proponent of a causal link between vaccines and autism, may chair a commission on vaccines. Furthermore, during the GOP primaries, Mr. Trump endorsed messages associated with vaccine refusal on Twitter, and met with prominent vaccine refusal advocates including Andrew Wakefield, who published the retracted and discredited 1998 Lancet article claiming to link autism to MMR vaccination. In this paper, we show that the new administration has mobilized vaccine refusal advocates, potentially enabling them to influence the national agenda in a manner that could lead to changes in existing vaccination policy. Anietie Andy, Mark Dredze, Mugizi Rwebangira, Chris Callison-Burch. Constructing an Alias List for Named Entities during an Event. EMNLP Workshop on Noisy User-generated Text (W-NUT), 2017. [PDF] [Bibtex] Nanyun Peng, Mark Dredze. Multi-task Domain Adaptation for Sequence Tagging. ACL Workshop on Representation Learning for NLP (RepL4NLP), 2017. [PDF] [Bibtex] Many domain adaptation approaches rely on learning cross domain shared representations to transfer the knowledge learned in one domain to other domains. Traditional domain adaptation only considers adapting for one task. In this paper, we explore multi-task representation learning under the domain adaptation scenario. We propose a neural network framework that supports domain adaptation for multiple tasks simultaneously, and learns shared representations that better generalize for domain adaptation. 
We apply the proposed framework to domain adaptation for sequence tagging problems considering two tasks: Chinese word segmentation and named entity recognition. Experiments show that multi-task domain adaptation works better than disjoint domain adaptation for each task, and achieves the state-of-the-art results for both tasks in the social media domain. Zachary Wood-Doughty, Michael C Smith, David A Broniatowski, Mark Dredze. How Does Twitter User Behavior Vary Across Demographic Groups? ACL Workshop on Natural Language Processing and Computational Social Science, 2017. [PDF] [Bibtex] Demographically-tagged social media messages are a common source of data for computational social science. While these messages can indicate differences in beliefs and behaviors between demographic groups, we do not have a clear understanding of how different demographic groups use platforms such as Twitter. This paper presents a preliminary analysis of how groups' differing behaviors may confound analyses of the groups themselves. We analyzed one million Twitter users by first inferring demographic attributes, and then measuring several indicators of Twitter behavior. We find differences in these indicators across demographic groups, suggesting that there may be underlying differences in how different demographic groups use Twitter. Jon-Patrick Allem, Eric C Leas, Theodore L Caputi, Mark Dredze, Benjamin M Althouse, Seth M Noar, John W Ayers. The Charlie Sheen Effect on Rapid In-home Human Immunodeficiency Virus Test Sales. Prevention Science, 2017;18(5):541--544. [PDF] [Bibtex] One in eight of the 1.2 million Americans living with human immunodeficiency virus (HIV) are unaware of their positive status, and untested individuals are responsible for most new infections. As a result, testing is the most cost-effective HIV prevention strategy and must be accelerated when opportunities are presented. Web searches for HIV spiked around actor Charlie Sheen's HIV-positive disclosure. 
However, it is unknown whether Sheen's disclosure impacted offline behaviors like HIV testing. The goal of this study was to determine if Sheen's HIV disclosure was a record-setting HIV prevention event and determine if Web searches presage increases in testing allowing for rapid detection and reaction in the future. Sales of OraQuick rapid in-home HIV test kits in the USA were monitored weekly from April 12, 2014, to April 16, 2016, alongside Web searches including the terms "test," "tests," or "testing" and "HIV" as accessed from Google Trends. Changes in OraQuick sales around Sheen's disclosure and prediction models using Web searches were assessed. OraQuick sales rose 95% (95% CI, 75–117; p < 0.001) the week of Sheen's disclosure and remained elevated for 4 more weeks (p < 0.05). In total, there were 8225 more sales than expected around Sheen's disclosure, surpassing World AIDS Day by a factor of about 7. Moreover, Web searches mirrored OraQuick sales trends (r = 0.79), demonstrating their ability to presage increases in testing. The "Charlie Sheen effect" represents an important opportunity for a public health response, and in the future, Web searches can be used to detect and act on more opportunities to foster prevention behaviors. Ning Gao, Douglas Oard, Mark Dredze. Support for Interactive Identification of Mentioned Entities in Conversational Speech. International Conference on Research and Development in Information Retrieval (SIGIR) (short paper), 2017. [PDF] [Bibtex] Searching conversational speech poses several new challenges, among which is how the searcher will make sense of what they find. This paper describes our initial experiments with a freely available collection of Enron telephone conversations. Our goal is to help the user make sense of search results by finding information about mentioned people, places and organizations. 
Because automated entity recognition is not yet sufficiently accurate on conversational telephone speech, we ask the user to transcribe just the name, and to indicate where in the recording it was heard. We then seek to link that mention to other mentions of the same entity in a variety of sources (in our experiments, in email and in Wikipedia). We cast this as an entity linking problem, and achieve promising results by utilizing social network features to help compensate for the limited accuracy of automatic transcription for this challenging content. Nicholas Andrews, Mark Dredze, Benjamin Van Durme, Jason Eisner. Bayesian Modeling of Lexical Resources for Low-Resource Settings. Association for Computational Linguistics (ACL), 2017. [PDF] [Bibtex] Lexical resources such as dictionaries and gazetteers are often used as auxiliary data for tasks such as part-of-speech induction and named-entity recognition. However, discriminative training with lexical features requires annotated data to reliably estimate the lexical feature weights and may result in overfitting the lexical features at the expense of features which generalize better. In this paper, we investigate a more robust approach: we stipulate that the lexicon is the result of an assumed generative process. Practically, this means that we may treat the lexical resources as observations under the proposed generative model. The lexical resources provide training data for the generative model without requiring separate data to estimate lexical feature weights. We evaluate the proposed approach in two settings: part-of-speech induction and low-resource named-entity recognition. Travis Wolfe, Mark Dredze, Benjamin Van Durme. Pocket Knowledge Base Population. Association for Computational Linguistics (ACL) (short paper), 2017. [PDF] [Bibtex] [Code] Existing Knowledge Base Population methods extract relations from a closed relational schema with limited coverage, leading to sparse KBs. 
We propose Pocket Knowledge Base Population (PKBP), the task of dynamically constructing a KB of entities related to a query and finding the best characterization of relationships between entities. We describe novel Open Information Extraction methods which leverage the PKB to find informative trigger words. We evaluate using existing KBP shared-task data as well as new annotations collected for this work. Our methods produce high quality KBs from just text with many more entities and relationships than existing KBP systems. Ning Gao, Mark Dredze, Douglas Oard. Person Entity Linking in Email with NIL Detection. Journal of the Association for Information Science and Technology (JASIST), 2017. [PDF] [Bibtex] For each specific mention of an entity found in a text, the goal of entity linking is to determine whether the referenced entity is present in an existing knowledge base, and if so to determine which KB entity is the correct referent. Entity linking has been well explored for dissemination-oriented sources such as news stories, blogs, and microblog posts, but the limited work to date on "conversational" sources such as email or text chat has not yet attempted to determine when the referent entity is not in the knowledge base (a task known as "NIL detection"). This article presents a supervised machine learning system for linking named mentions of people in email messages to a collection-specific knowledge base, and that is also capable of NIL detection. This system learns from manually annotated training examples to leverage a rich set of features. The entity linking accuracy for entities present in the knowledge base is substantially and significantly better than the best previously reported results on the Enron email collection, comparable accuracy is reported for the challenging NIL detection task, and these results are for the first time replicated on a second email collection from a different source with comparable results. Ann Irvine, Mark Dredze. 
Harmonic Grammar, Optimality Theory, and Syntax Learnability: An Empirical Exploration of Czech Word Order. Unpublished Manuscript, 2017. [PDF] [Bibtex] This work presents a systematic theoretical and empirical comparison of the major algorithms that have been proposed for learning Harmonic and Optimality Theory grammars (HG and OT, respectively). By comparing learning algorithms, we are also able to compare the closely related OT and HG frameworks themselves. Experimental results show that the additional expressivity of the HG framework over OT affords performance gains in the task of predicting the surface word order of Czech sentences. We compare the perceptron with the classic Gradual Learning Algorithm (GLA), which learns OT grammars, as well as the popular Maximum Entropy model. In addition to showing that the perceptron is theoretically appealing, our work shows that the performance of the HG model it learns approaches that of the upper bound in prediction accuracy on a held out test set and that it is capable of accurately modeling observed variation. Adrian Benton, Glen A Coppersmith, Mark Dredze. Ethical Research Protocols for Social Media Health Research. EACL Workshop on Ethics in Natural Language Processing, 2017. [PDF] [Bibtex] Social media have transformed data driven research in political science, the social sciences, health, and medicine. Since health research often touches on sensitive topics that relate to ethics of treatment and patient privacy, similar ethical considerations should be acknowledged when using social media data in health research. While much has been said regarding the ethical considerations of social media research, health research leads to an additional set of concerns. We provide practical suggestions in the form of guidelines for researchers working with social media data in health research. These guidelines can inform an IRB proposal for researchers new to social media health research. 
John W Ayers, Eric C Leas, Jon-Patrick Allem, Adrian Benton, Mark Dredze, Benjamin M Althouse, Tess B Cruz, Jennifer B Unger. Why Do People Use Electronic Nicotine Delivery Systems (Electronic Cigarettes)? A Content Analysis of Twitter, 2012-2015. PLoS One, 2017. [PDF] [Bibtex] The reasons for using electronic nicotine delivery systems (ENDS) are poorly understood and are primarily documented by expensive cross-sectional surveys that use preconceived close-ended response options rather than allowing respondents to use their own words. We passively identify the reasons for using ENDS longitudinally from a content analysis of public postings on Twitter. All English language public tweets including several ENDS terms (e.g., "e-cigarette" or "vape") were captured from the Twitter data stream during 2012 and 2015. After excluding spam, advertisements, and retweets, posts indicating a rationale for vaping were retained. The specific reasons for vaping were then inferred based on a supervised content analysis using annotators from Amazon's Mechanical Turk. During 2012 quitting combustibles was the most cited reason for using ENDS, with 43% (95%CI 39–48) of all reason-related tweets citing quitting combustibles, e.g., "I couldn't quit till I tried ecigs," eclipsing the second most cited reason by more than double. Other frequently cited reasons in 2012 included ENDS's social image (21%; 95%CI 18–25), use indoors (14%; 95%CI 11–17), flavors (14%; 95%CI 11–17), safety relative to combustibles (9%; 95%CI 7–11), cost (3%; 95%CI 2–5) and favorable odor (2%; 95%CI 1–3). By 2015 the reasons for using ENDS cited on Twitter had shifted. Both quitting combustibles and use indoors significantly declined in mentions to 29% (95%CI 24–33) and 12% (95%CI 9–16), respectively. At the same time, social image increased to 37% (95%CI 32–43) and lack of odor increased to 5% (95%CI 2–5), the former leading all cited reasons in 2015. 
Our data suggest the reasons people vape are shifting away from cessation and toward social image. The data also show how the ENDS market is responsive to a changing policy landscape. For instance, smoking indoors was less frequently cited in 2015 as indoor smoking restrictions became more common. Because the data and analytic approach are scalable, adoption of our strategies in the field can inform follow-up survey-based surveillance (so the right questions are asked), interventions, and policies for ENDS. Anthony Nastasi, Tyler Bryant, Joseph K Canner, Mark Dredze, Melissa S Camp, Neeraja Nagarajan. Breast Cancer Screening and Social Media: a Content Analysis of Evidence Use and Guideline Opinions on Twitter. Journal of Cancer Education, 2017. [PDF] [Bibtex] There is ongoing debate regarding the best mammography screening practices. Twitter has become a powerful tool for disseminating medical news and fostering healthcare conversations; however, little work has been done examining these conversations in the context of how users are sharing evidence and discussing current guidelines for breast cancer screening. To characterize the Twitter conversation on mammography and assess the quality of evidence used as well as opinions regarding current screening guidelines, individual tweets using mammography-related hashtags were prospectively pulled from Twitter from 5 November 2015 to 11 December 2015. Content analysis was performed on the tweets by abstracting data related to user demographics, content, evidence use, and guideline opinions. Standard descriptive statistics were used to summarize the results. Comparisons were made by demographics, tweet type (testable claim, advice, personal experience, etc.), and user type (non-healthcare, physician, cancer specialist, etc.). The primary outcomes were how users are tweeting about breast cancer screening, the quality of evidence they are using, and their opinions regarding guidelines. 
The most frequent user type of the 1345 tweets was "non-healthcare" with 323 tweets (32.5%). Physicians had 1.87 times higher odds (95% CI, 0.69–5.07) of providing explicit support with a reference and 11.70 times higher odds (95% CI, 3.41–40.13) of posting a tweet likely to be supported by the scientific community compared to non-healthcare users. Only 2.9% of guideline tweets approved of the guidelines while 14.6% claimed to be confused by them. Non-healthcare users comprise a significant proportion of participants in mammography conversations, with tweets often containing claims that are false, not explicitly backed by scientific evidence, and in favor of alternative "natural" breast cancer prevention and treatment. Furthermore, users appear to have low approval and confusion regarding screening guidelines. These findings suggest that more efforts are needed to educate and disseminate accurate information to the general public regarding breast cancer prevention modalities, emphasizing the safety of mammography and the harms of replacing conventional prevention and treatment modalities with unsubstantiated alternatives. Xiaolei Huang, Michael C Smith, Michael J Paul, Dmytro Ryzhkov, Sandra C Quinn, David A Broniatowski, Mark Dredze. Examining Patterns of Influenza Vaccination in Social Media. AAAI Joint Workshop on Health Intelligence (W3PHIAI), 2017. [PDF] [Bibtex] [Data] Traditional data on influenza vaccination has several limitations: high cost, limited coverage of underrepresented groups, and low sensitivity to emerging public health issues. Social media, such as Twitter, provide an alternative way to understand a population's vaccination-related opinions and behaviors. In this study, we build and employ several natural language classifiers to examine and analyze behavioral patterns regarding influenza vaccination in Twitter across three dimensions: temporality (by week and month), geography (by US region), and demography (by gender). 
Our best results are highly correlated with official government data, with a correlation over 0.90, providing validation of our approach. We then suggest a number of directions for future work. Neeraja Nagarajan, Husain Alshaikh, Anthony Nastasi, Blair J Smart, Zackary D Berger, Eric B Schneider, Mark Dredze, Joseph K Canner, Nita Ahuja. The Utility of Twitter in Generating High-Quality Conversations about Surgical Care. Academic Surgical Congress, 2017. [PDF] [Bibtex] Introduction: There is growing interest among various stakeholders in using social media sites to discuss healthcare issues. However, little is known about how social media sites are used to discuss surgical care. There is also a lack of understanding of the types of content generated and the quality of the information shared in social media platforms about surgical care issues. We therefore sought to identify and summarize conversations on surgical care in Twitter, a popular microblogging website. Methods: A comprehensive list of surgery-related hashtags was used to pull individual tweets from 3/27-4/27/2015. Four independent reviewers blindly analyzed 25 tweets to develop themes for extraction from a larger sample. The themes were broadly divided further to obtain data at the levels of the user, the tweet, the content of the tweet and personal information shared (Figure I). Standard descriptive statistical analysis and simple logistic regression analysis were used. Results: In total, 17,783 tweets were pulled and 1000 from 615 unique users were randomly selected for analysis. Most users were from North America (62.4%) and non-healthcare related individuals (31.8%). Healthcare organizations generated 12.4%, and surgeons 9.5%, of tweets. Overall, 67.4% were original tweets and 79.0% contained a hyperlink (11% to healthcare and 8.7% to peer-reviewed sources). The common areas of surgery discussed were global surgery/health systems (18.4%), followed by general surgery (15.6%). 
Among personal tweets (n=236), 31.1% concerned surgery on family/friends and 24.4% on the user; 61.1% discussed procedures already performed and 58.0% used positive language about their personal experience with surgical care. Surgical news/opinion was present in 45% of tweets and 13.7% contained evidence-based information. Non-healthcare professionals were 53.5% (95% CI: 3.8%-77.5%, p=0.039) and 72.8% (95% CI: 21.1%-91.7%, p=0.017) less likely to generate a tweet that contained evidence-based information and to quote from a peer-reviewed journal, respectively, when compared to other users. Conclusion: Our study demonstrates that while healthcare professionals and organizations tend to share higher quality data on surgical care on social media, non-health care related individuals largely drive the conversation. Fewer than half of all surgery-related tweets included surgical news/opinion; only 14% included evidence-based information and just 9% linked to peer-reviewed sources. As social media outlets become important sources of actionable information, leaders in the surgical community should develop professional guidelines to maximize this versatile platform to disseminate accurate and high-quality content on surgical issues to a wide range of audiences.

 2016 (34 Publications) Travis Wolfe, Mark Dredze, Benjamin Van Durme. Feature Generation for Robust Semantic Role Labeling. Unpublished Manuscript, 2016. [PDF] [Bibtex] Hand-engineered feature sets are a well understood method for creating robust NLP models, but they require a lot of expertise and effort to create. In this work we describe how to automatically generate rich feature sets from simple units called featlets, requiring less engineering. Using information gain to guide the generation process, we train models which rival the state of the art on two standard Semantic Role Labeling datasets with almost no task or linguistic insight. Anietie Andy, Satoshi Sekine, Mugizi Rwebangira, Mark Dredze. Name Variation in Community Question Answering Systems. COLING Workshop on Noisy User-generated Text, 2016. [PDF] [Bibtex] (Best Paper Award) Community question answering systems are forums where users can ask and answer questions in various categories. Examples are Yahoo! Answers, Quora, and Stack Overflow. A common challenge with such systems is that a significant percentage of asked questions are left unanswered. In this paper, we propose an algorithm to reduce the number of unanswered questions in Yahoo! Answers by reusing the answer to the past resolved question on the site that is most similar to the unanswered question. Semantically similar questions could be worded differently, thereby making it difficult to find questions that have shared needs. For example, "Who is the best player for the Reds?" and "Who is currently the biggest star at Manchester United?" have a shared need but are worded differently; also, "Reds" and "Manchester United" are used to refer to the soccer team Manchester United football club. In this research, we focus on question categories that contain a large number of named entities and entity name variations. 
We show that in these categories, entity linking can be used to identify relevant past resolved questions that share needs with a given question by disambiguating named entities and matching these questions based on the disambiguated entities, identified entities, and knowledge base information related to these entities. We evaluated our algorithm on a new dataset constructed from Yahoo! Answers. The dataset contains annotated question pairs, (Qgiven, [Qpast, Answer]). We carried out experiments on several question categories and show that an entity-based approach gives good performance when searching for similar questions in entity rich categories. Travis Wolfe, Mark Dredze, Benjamin Van Durme. A Study of Imitation Learning Methods for Semantic Role Labeling. EMNLP Workshop on Structured Prediction for NLP, 2016. [PDF] [Bibtex] Global features have proven effective in a wide range of structured prediction problems but come with high inference costs. Imitation learning is a common method for training models when exact inference isn't feasible. We study imitation learning for Semantic Role Labeling (SRL) and analyze the effectiveness of the Violation Fixing Perceptron (VFP) (Huang et al., 2012) and Locally Optimal Learning to Search (LOLS) (Chang et al., 2015) frameworks with respect to SRL global features. We describe problems in applying each framework to SRL and evaluate the effectiveness of some solutions. We also show that action ordering, including easy first inference, has a large impact on the quality of greedy global models. Rebecca Knowles, Josh Carroll, Mark Dredze. Demographer: Extremely Simple Name Demographics. EMNLP Workshop on Natural Language Processing and Computational Social Science, 2016. [PDF] [Bibtex] [Code] The lack of demographic information available when conducting passive analysis of social media content can make it difficult to compare results to traditional survey results. 
We present DEMOGRAPHER, a tool that predicts gender from names, using name lists and a classifier with simple character-level features. By relying only on a name, our tool can make predictions even without extensive user-authored content. We compare DEMOGRAPHER to other available tools and discuss differences in performance. In particular, we show that DEMOGRAPHER performs well on Twitter data, making it useful for simple and rapid social media demographic inference. John W Ayers, Eric C Leas, Mark Dredze, Jon-Patrick Allem, Jurek G Grabowski, Linda Hill. Pokémon go---a new distraction for drivers and pedestrians. JAMA Internal Medicine, 2016;176(12):1865-1866. [PDF] [Bibtex] (Ranked in the top .02% of 6.5m research outputs by Altmetric) Pokémon GO, an augmented reality game, has swept the nation. As players move, their avatar moves within the game, and players are then rewarded for collecting Pokémon placed in real-world locations. By rewarding movement, the game incentivizes physical activity. However, if players use their cars to search for Pokémon they negate any health benefit and incur serious risk. Motor vehicle crashes are the leading cause of death among 16- to 24-year-olds, whom the game targets. Moreover, according to the American Automobile Association, 59% of all crashes among young drivers involve distractions within 6 seconds of the accident. We report on an assessment of drivers and pedestrians distracted by Pokémon GO and crashes potentially caused by Pokémon GO by mining social and news media reports. Mark Dredze, Nicholas Andrews, Jay DeYoung. Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation. EMNLP Workshop on Natural Language Processing for Social Media, 2016. [PDF] [Bibtex] [Code] [Data] Work on cross document coreference resolution (CDCR) has primarily focused on news articles, with little to no work for social media. 
Yet social media may be particularly challenging since short messages provide little context, and informal names are pervasive. We introduce a new Twitter corpus that contains entity annotations for entity clusters that supports CDCR. Our corpus draws from Twitter data surrounding the 2013 Grammy music awards ceremony, providing a large set of annotated tweets focusing on a single event. To establish a baseline we evaluate two CDCR systems and consider the performance impact of each system component. Furthermore, we augment one system to include temporal information, which can be helpful when documents (such as tweets) arrive in a specific order. Finally, we include annotations linking the entities to a knowledge base to support entity linking. John W Ayers, Benjamin M Althouse, Eric C Leas, Ted Alcorn, Mark Dredze. Big Media Data Can Inform Gun Violence Prevention. Bloomberg Data for Good Exchange, 2016. [PDF] [Bibtex] The scientific method drives improvements in public health, but a strategy of obstructionism has impeded scientists from gathering even a minimal amount of information to address America's gun violence epidemic. We argue that in spite of a lack of federal investment, large amounts of publicly available data offer scientists an opportunity to measure a range of firearm-related behaviors. Given the diversity of available data -- including news coverage, social media, web forums, online advertisements, and Internet searches (to name a few) -- there are ample opportunities for scientists to study everything from trends in particular types of gun violence to gun related behaviors (such as purchases and safety practices) to public understanding of and sentiment towards various gun violence reduction measures. Science has been sidelined in the gun violence debate for too long. Scientists must tap the big media datastream and help resolve this crisis. Adrian Benton, Braden Hancock, Glen A Coppersmith, John W Ayers, Mark Dredze. 
After Sandy Hook Elementary: A Year in the Gun Control Debate on Twitter. Bloomberg Data for Good Exchange, 2016. [PDF] [Bibtex] The mass shooting at Sandy Hook elementary school on December 14, 2012 catalyzed a year of active debate and legislation on gun control in the United States. Social media hosted an active public discussion where people expressed their support and opposition to a variety of issues surrounding gun legislation. In this paper, we show how a content based analysis of Twitter data can provide insights and understanding into this debate. We estimate the relative support and opposition to gun control measures, along with a topic analysis of each camp by analyzing over 70 million gun-related tweets from 2013. We focus on spikes in conversation surrounding major events related to guns throughout the year. Our general approach can be applied to other important public health and political issues to analyze the prevalence and nature of public opinion. Eric C Leas, Benjamin M Althouse, Mark Dredze, Nick Obradovich, James H Fowler, Seth M Noar, Jon-Patrick Allem, John W Ayers. Big data sensors of organic advocacy: The case of Leonardo DiCaprio and Climate Change. PLoS One, 2016;11(8):e0159885. [PDF] [Bibtex] The strategies that experts have used to share information about social causes have historically been top-down, meaning the most influential messages are believed to come from planned events and campaigns. However, more people are independently engaging with social causes today than ever before, in part because online platforms allow them to instantaneously seek, create, and share information. In some cases this "organic advocacy" may rival or even eclipse top-down strategies. Big data analytics make it possible to rapidly detect public engagement with social causes by analyzing the same platforms from which organic advocacy spreads. 
To demonstrate this claim we evaluated how Leonardo DiCaprio's 2016 Oscar acceptance speech citing climate change motivated global English language news (Bloomberg Terminal news archives), social media (Twitter postings) and information seeking (Google searches) about climate change. Despite an insignificant increase in traditional news coverage (54%; 95%CI: -144 to 247), tweets including the terms "climate change" or "global warming" reached record highs, increasing 636% (95%CI: 573–699) with more than 250,000 tweets the day DiCaprio spoke. In practical terms the "DiCaprio effect" surpassed the daily average effect of the 2015 Conference of the Parties (COP) and the Earth Day effect by a factor of 3.2 and 5.3, respectively. At the same time, Google searches for "climate change" or "global warming" increased 261% (95%CI, 186–335) and 210% (95%CI 149–272) the day DiCaprio spoke and remained higher for 4 more days, representing 104,190 and 216,490 searches. This increase was 3.8 and 4.3 times larger than the increases observed during COP's daily average or on Earth Day. Searches were closely linked to content from DiCaprio's speech (e.g., "hottest year"), as unmentioned content did not have search increases (e.g., "electric car"). Because these data are freely available in real time our analytical strategy provides substantial lead time for experts to detect and participate in organic advocacy while an issue is salient. Our study demonstrates new opportunities to detect and aid agents of change and advances our understanding of communication in the 21st century media landscape. Michael J Paul, Margaret S Chisolm, Matthew W Johnson, Ryan G Vandrey, Mark Dredze. Assessing the validity of online drug forums as a source for estimating demographic and temporal trends in drug use. Journal of Addiction Medicine, 2016;10(5):324–330. 
[PDF] [Bibtex] Objectives: Addiction researchers have begun monitoring online forums to uncover self-reported details about use and effects of emerging drugs. The use of such online data sources has not been validated against data from large epidemiological surveys. This study aimed to characterize and compare the demographic and temporal trends associated with drug use as reported in online forums and in a large epidemiological survey. Methods: Data were collected from the website, drugs-forum.com, from January 2007 through August 2012 (143,416 messages posted by 8,087 members) and from the United States National Survey on Drug Use and Health (NSDUH) from 2007-2012. Measures of forum participation levels were compared with and validated against two measures from the NSDUH survey data: percentage of people using the drug in last 30 days and percentage using the drug more than 100 times in the past year. Results: For established drugs (e.g., cannabis), significant correlations were found across demographic groups between drugs-forum.com and the NSDUH survey data, while weaker, non-significant correlations were found with temporal trends. Emerging drugs (e.g., Salvia divinorum) were strongly associated with male users in the forum, in agreement with survey-derived data, and had temporal patterns that increased in synchrony with poison control reports. Conclusions: These results offer the first assessment of online drug forums as a valid source for estimating demographic and temporal trends in drug use. The analyses suggest that online forums are a reliable source for estimation of demographic associations and early identification of emerging drugs, but a less reliable source for measurement of long-term temporal trends. David A Broniatowski, Mark Dredze, Karen M Hilyard, Maeghan Dessecker, Sandra C Quinn, Amelia M Jamison, Michael J Paul, Michael C Smith. Both Mirror and Complement: A Comparison of Social Media Data and Survey Data about Flu Vaccination. 
American Public Health Association, 2016. [PDF] [Bibtex] Matthew Biggerstaff, David Alper, Mark Dredze, Spencer Fox, Isaac Chun-Hai Fung, Kyle S Hickmann, Bryan Lewis, Roni Rosenfeld, Jeffrey Shaman, Ming-Hsiang Tsou, Paola Velardi, Alessandro Vespignani, Lyn Finelli. Results from the Centers for Disease Control and Prevention's Predict the 2013–2014 Influenza Season Challenge. BMC Infectious Diseases, 2016;16(357):10.1186/s12879-016-1669-x. [PDF] [Bibtex] Background: Early insights into the timing of the start, peak, and intensity of the influenza season could be useful in planning influenza prevention and control activities. To encourage development and innovation in influenza forecasting, the Centers for Disease Control and Prevention (CDC) organized a challenge to predict the 2013–14 United States influenza season. Methods: Challenge contestants were asked to forecast the start, peak, and intensity of the 2013-2014 influenza season at the national level and at any or all Health and Human Services (HHS) region level(s). The challenge ran from December 1, 2013 to March 27, 2014; contestants were required to submit 9 biweekly forecasts at the national level to be eligible. The selection of the winner was based on expert evaluation of the methodology used to make the prediction and the accuracy of the prediction as judged against the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). Results: Nine teams submitted 13 forecasts for all required milestones. The first forecast was due on December 2, 2013; 3/13 forecasts received correctly predicted the start of the influenza season within one week, 1/13 predicted the peak within 1 week, 3/13 predicted the peak ILINet percentage within 1%, and 4/13 predicted the season duration within 1 week. For the prediction due on December 19, 2013, the number of forecasts that correctly forecasted the peak week increased to 2/13, the peak percentage to 6/13, and the duration of the season to 6/13. 
As the season progressed, the forecasts became more stable and were closer to the season milestones. Conclusion: Forecasting has become technically feasible, but further efforts are needed to improve forecast accuracy so that policy makers can reliably use these predictions. CDC and challenge contestants plan to build upon the methods developed during this contest to improve the accuracy of influenza forecasts. Mark Dredze, Manuel García-Herranz, Alex Rutherford, Gideon Mann. Twitter as a Source of Global Mobility Patterns for Social Good. ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, 2016. [PDF] [Bibtex] Data on human spatial distribution and movement is essential for understanding and analyzing social systems. However, existing sources for this data are lacking in various ways: they are difficult to access, are biased, have poor geographical or temporal resolution, or are significantly delayed. In this paper, we describe how geolocation data from Twitter can be used to estimate global mobility patterns and address these shortcomings. These findings will inform how this novel data source can be harnessed to address humanitarian and development efforts. Mark Dredze, David A Broniatowski, Karen M Hilyard. Zika Vaccine Misconceptions: A social media analysis. Vaccine, 2016;34(30):3441-3442. [PDF] [Bibtex] [Data] Mark Dredze, Prabhanjan Kambadur, Gary Kazantsev, Gideon Mann, Miles Osborne. How Twitter is Changing the Nature of Financial News Discovery. SIGMOD Workshop on Data Science for Macro-Modeling with Financial and Economic Datasets, 2016. [PDF] [Bibtex] Access to the most relevant and current information is critical to financial analysis and decision making. Historically, financial news has been discovered through company press releases, required disclosures and news articles. More recently, social media has reshaped the financial news landscape, radically changing the dynamics of news dissemination. 
In this paper we discuss the ways in which Twitter, a leading social media platform, has contributed to changes in this landscape. We explain why today Twitter is a valuable source of material financial information and describe opportunities and challenges in using this novel news source for financial information discovery. Nanyun Peng, Mark Dredze. Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning. Association for Computational Linguistics (ACL) (short paper), 2016. [PDF] [Bibtex] Named entity recognition, and other information extraction tasks, frequently use linguistic features such as part of speech tags or chunkings. For languages where word boundaries are not readily identified in text, word segmentation is a key first step to generating features for an NER system. While using word boundary tags as features is helpful, the signals that aid in identifying these boundaries may provide richer information for an NER system. New state-of-the-art word segmentation systems use neural models to learn representations for predicting word boundaries. We show that these same representations, jointly trained with an NER system, yield significant improvements in NER for Chinese social media. In our experiments, jointly training NER and word segmentation with an LSTM-CRF model yields nearly 5% absolute improvement over previously published results. Adrian Benton, Raman Arora, Mark Dredze. Learning Multiview Embeddings of Twitter Users. Association for Computational Linguistics (ACL) (short paper), 2016. [PDF] [Bibtex] [Code] Low-dimensional vector representations are widely used as stand-ins for the text of words, sentences, and entire documents. These embeddings are used to identify similar words or make predictions about documents. In this work, we consider embeddings for social media users and demonstrate that these can be used to identify users who behave similarly or to predict attributes of users. 
In order to capture information from all aspects of a user's online life, we take a multiview approach, applying a weighted variant of Generalized Canonical Correlation Analysis (GCCA) to a collection of over 100,000 Twitter users. We demonstrate the utility of these multiview embeddings on three downstream tasks: user engagement, friend selection, and demographic attribute prediction. David A Broniatowski, Mark Dredze, Karen M Hilyard. Effective Vaccine Communication during the Disneyland Measles Outbreak. Vaccine, 2016;34(28):3225-3228. [PDF] [Bibtex] Vaccine refusal rates have increased in recent years, highlighting the need for effective risk communication, especially over social media. Fuzzy-trace theory predicts that individuals encode bottom-line meaning ("gist") and statistical information ("verbatim") in parallel and those articles expressing a clear gist will be most compelling. We coded news articles (n = 4581) collected during the 2014–2015 Disneyland measles outbreak for content including statistics, stories, or bottom-line gists regarding vaccines and vaccine-preventable illnesses. We measured the extent to which articles were compelling by how frequently they were shared on Facebook. The most widely shared articles expressed bottom-line gists, although articles containing statistics were also more likely to be shared than articles lacking statistics. Stories had limited impact on Facebook shares. Results support Fuzzy Trace Theory's predictions regarding the distinct yet parallel impact of categorical gist and statistical verbatim information on public health communication. Ning Gao, Mark Dredze, Douglas Oard. Knowledge Base Population for Organization Mentions in Email. NAACL Workshop on Automated Knowledge Base Construction (AKBC), 2016. 
[PDF] [Bibtex] A prior study found that on average there are 6.3 named mentions of organizations found in email messages from the Enron collection, only about half of which could be linked to known entities in Wikipedia. That suggests a need for collection-specific approaches to entity linking, similar to those that have proven successful for person mentions. This paper describes a process for automatically constructing such a collection-specific knowledge base of organization entities for named mentions in Enron. A new public test collection for linking 130 mentions of organizations found in Enron email to either Wikipedia or to this new collection-specific knowledge base is also described. Together, Wikipedia entities plus the new collection-specific knowledge base cover 83% of the 130 organization mentions, a 14% (absolute) improvement over the 69% that could be linked to Wikipedia alone. Michael C Smith, David A Broniatowski, Mark Dredze. Using Twitter to Examine Social Rationales for Vaccine Refusal. International Engineering Systems Symposium (CESUN), 2016. [PDF] [Bibtex] [Data] [Poster] Mo Yu, Mark Dredze, Raman Arora, Matthew R Gormley. Embedding Lexical Features via Low-rank Tensors. North American Chapter of the Association for Computational Linguistics (NAACL), 2016. [PDF] [Bibtex] Modern NLP models rely heavily on engineered features, which often combine word and contextual information into complex lexical features. Such combination results in large numbers of features, which can lead to over-fitting. We present a new model that represents complex lexical features (comprised of parts for words, contextual information and labels) in a tensor that captures conjunction information among these parts. We apply low-rank tensor approximations to the corresponding parameter tensors to reduce the parameter space and improve prediction speed. Furthermore, we investigate two methods for handling features that include n-grams of mixed lengths. 
Our model achieves state-of-the-art results on tasks in relation extraction, PP-attachment, and preposition disambiguation. Mark Dredze, Miles Osborne, Prabhanjan Kambadur. Geolocation for Twitter: Timing Matters. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2016. [PDF] [Bibtex] Automated geolocation of social media messages can benefit a variety of downstream applications. However, these geolocation systems are typically evaluated without attention to how changes in time impact geolocation. Since different people in different locations write messages at different times, these factors can significantly vary the performance of a geolocation system over time. We demonstrate cyclical temporal effects on geolocation accuracy in Twitter, as well as rapid drops as test data moves beyond the time period of training data. We show that temporal drift can effectively be countered with even modest online model updates. John W Ayers, Benjamin M Althouse, Mark Dredze, Eric C Leas, Seth M Noar. News and Internet Searches About Human Immunodeficiency Virus After Charlie Sheen's Disclosure. JAMA Internal Medicine, 2016;176(4):552-554. [PDF] [Bibtex] (Ranked in the top 0.03% of 4.8m research outputs by Altmetric) Celebrity Charlie Sheen publicly disclosed his human immunodeficiency virus (HIV)-positive status on November 17, 2015. Could Sheen's disclosure, like similar announcements from celebrities, generate renewed attention to HIV? We provide an early answer by examining news trends to reveal discussion of HIV in the mass media and Internet searches to reveal engagement with HIV-related topics around the time of Sheen's disclosure. Neeraja Nagarajan, Blair J Smart, Anthony Nastasi, Zoya J Effendi, Sruthi Murali, Zackary D Berger, Eric B Schneider, Mark Dredze, Joseph K Canner. An Analysis of Twitter Conversations on Global Surgical Care. Annual CUGH Global Health Conference, 2016. 
[Bibtex] (poster) John W Ayers, J Lee Westmaas, Eric C Leas, Adrian Benton, Yunqi Chen, Mark Dredze, Benjamin M Althouse. Leveraging Big Data to Improve Health Awareness Campaigns: A Novel Evaluation of the Great American Smokeout. JMIR Public Health and Surveillance, 2016;2(1):e16. [PDF] [Bibtex] Awareness campaigns are ubiquitous, but little is known about their potential effectiveness because traditional evaluations are often unfeasible. For 40 years, the "Great American Smokeout" (GASO) has encouraged media coverage and popular engagement with smoking cessation on the third Thursday of November as the nation's longest running awareness campaign. We proposed a novel evaluation framework for assessing awareness campaigns using the GASO as a case study by observing cessation-related news reports and Twitter postings, and cessation-related help seeking via Google, Wikipedia, and government-sponsored quitlines. Munmun De Choudhury, Emre Kiciman, Mark Dredze, Glen A Coppersmith, Mrinal Kumar. Discovering Shifts to Suicidal Ideation from Mental Health Content in Social Media. Conference on Human Factors in Computing Systems (CHI), 2016. [PDF] [Bibtex] (Honorable Mention Award) History of mental illness is a major factor behind suicide risk and ideation. However, research efforts toward characterizing and forecasting this risk are limited due to the paucity of information regarding suicide ideation, exacerbated by the stigma of mental illness. This paper fills gaps in the literature by developing a statistical methodology to infer which individuals could undergo transitions from mental health discourse to suicidal ideation. We utilize semi-anonymous support communities on Reddit as unobtrusive data sources to infer the likelihood of these shifts. We develop language and interactional measures for this purpose, as well as a propensity score matching based statistical approach. Our approach allows us to derive distinct markers of shifts to suicidal ideation. 
These markers can be modeled in a prediction framework to identify individuals likely to engage in suicidal ideation in the future. We discuss societal and ethical implications of this research. Animesh R Koratana, Mark Dredze, Margaret S Chisolm, Matthew W Johnson, Michael J Paul. Studying Anonymous Health Issues and Substance Use on College Campuses with Yik Yak. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2016. [PDF] [Bibtex] This study investigates the public health intelligence utility of Yik Yak, a social media platform that allows users to anonymously post and view messages within precise geographic locations. Our dataset contains 122,179 "yaks" collected from 120 college campuses across the United States during 2015. We first present an exploratory analysis of the topics commonly discussed in Yik Yak, clarifying the health issues for which this may serve as a source of information. We then present an in-depth content analysis of data describing substance use, an important public health issue that is not often discussed in public social media, but commonly discussed on Yik Yak under the cloak of anonymity. Adrian Benton, Michael J Paul, Braden Hancock, Mark Dredze. Collective Supervision of Topic Models for Predicting Surveys with Social Media. Association for the Advancement of Artificial Intelligence (AAAI), 2016. [PDF] [Bibtex] This paper considers survey prediction from social media. We use topic models to correlate social media messages with survey outcomes and to provide an interpretable representation of the data. Rather than rely on fully unsupervised topic models, we use existing aggregated survey data to inform the inferred topics, a class of topic model supervision referred to as collective supervision. We introduce and explore a variety of topic model variants and provide an empirical analysis, with conclusions of the most effective models for this task. Michael C Smith, David A Broniatowski, Michael J Paul, Mark Dredze. 
Towards Real-Time Measurement of Public Epidemic Awareness: Monitoring Influenza Awareness through Twitter. AAAI Spring Symposium on Observational Studies through Social Media and Other Human-Generated Content, 2016. [PDF] [Bibtex] This study analyzes temporal trends in Twitter data pertaining to both influenza awareness and influenza infection during the 2012-13 influenza season in the US. We make use of classifiers to distinguish tweets that express a personal infection ("sick with the flu") versus a more general awareness ("worried about the flu"). While previous research has focused on estimating prevalence of influenza infection, little is known about trends in public awareness of the disease. Our analysis shows that infection and awareness have very different trends. In contrast to infection trends, awareness trends have little regional variation, and our experiments suggest that public awareness is primarily driven by news media. Blair J Smart, Neeraja Nagarajan, Joseph K Canner, Mark Dredze, Eric B Schneider, Minh Luu, Zackary D Berger, Jonathan A Myers. The Use of Social Media in Surgical Education: An Analysis of Twitter. Annual Academic Surgical Congress, 2016. [Bibtex] Neeraja Nagarajan, Blair J Smart, Mark Dredze, Joy L Lee, James Taylor, Jonathan A Myers, Eric B Schneider, Zackary D Berger, Joseph K Canner. How do Surgical Providers use Social Media? A Mixed-Methods Analysis using Twitter. Annual Academic Surgical Congress, 2016. [Bibtex] John W Ayers, Benjamin M Althouse, Jon-Patrick Allem, Eric C Leas, Mark Dredze, Rebecca Williams. Revisiting the Rise of Electronic Nicotine Delivery Systems Using Search Query Surveillance. American Journal of Preventive Medicine (AJPM), 2016;50(6):e173-e181. 
[PDF] [Bibtex] Introduction: Public perceptions of electronic nicotine delivery systems (ENDS) remain poorly understood because surveys are too costly to regularly implement and, when implemented, there are long delays between data collection and dissemination. Search query surveillance has bridged some of these gaps. Herein, ENDS' popularity in the U.S. is reassessed using Google searches. Methods: ENDS searches originating in the U.S. from January 2009 through January 2015 were disaggregated by terms focused on e-cigarette (e.g., e-cig) versus vaping (e.g., vapers); their geolocation (e.g., state); the aggregate tobacco control measures corresponding to their geolocation (e.g., clean indoor air laws); and by terms that indicated the searcher's potential interest (e.g., buy e-cigs likely indicates shopping). All were analyzed in 2015. Results: ENDS searches are rapidly increasing in the U.S., with 8,498,000 searches during 2014 alone. Increasingly, searches are shifting from e-cigarette- to vaping-focused terms, especially in coastal states and states where anti-smoking norms are stronger. For example, nationally, e-cigarette searches declined 9% (95% CI=1%, 16%) during 2014 compared with 2013, whereas vaping searches increased 136% (95% CI=97%, 186%), even surpassing e-cigarette searches. Additionally, the percentage of ENDS searches related to shopping (e.g., vape shop) nearly doubled in 2014, whereas searches related to health concerns (e.g., vaping risks) or cessation (e.g., quit smoking with e-cigs) were rare and declined in 2014. Conclusions: ENDS popularity is rapidly growing and evolving. These findings could inform survey questionnaire development for follow-up investigation and immediately guide policy debates about how the public perceives the health risks or cessation benefits of ENDS. Atul Nakhasi, Sarah G Bell, Ralph J Passarella, Michael J Paul, Mark Dredze, Peter J Pronovost. The Potential of Twitter as a Data Source for Patient Safety. 
Journal of Patient Safety, 2016. [PDF] [Bibtex] Background: Error-reporting systems are widely regarded as critical components to improving patient safety, yet current systems do not effectively engage patients. We sought to assess Twitter as a source to gather patient perspective on errors in this feasibility study. Methods: We included publicly accessible tweets in English from any geography. To collect patient safety tweets, we consulted a patient safety expert and constructed a set of highly relevant phrases, such as "doctor screwed up." We used Twitter's search application program interface from January to August 2012 to identify tweets that matched the set of phrases. Two researchers used criteria to independently review tweets and choose those relevant to patient safety; a third reviewer resolved discrepancies. Variables included source and sex of tweeter, source and type of error, emotional response, and mention of litigation. Results: Of 1006 tweets analyzed, 839 (83%) identified the type of error: 26% of which were procedural errors, 23% were medication errors, 23% were diagnostic errors, and 14% were surgical errors. A total of 850 (84%) identified a tweet source, 90% of which were by the patient and 9% by a family member. A total of 519 (52%) identified an emotional response, 47% of which expressed anger or frustration, 21% expressed humor or sarcasm, and 14% expressed sadness or grief. Of the tweets, 6.3% mentioned an intent to pursue malpractice litigation. Conclusions: Twitter is a relevant data source to obtain the patient perspective on medical errors. Twitter may provide an opportunity for health systems and providers to identify and communicate with patients who have experienced a medical error. Further research is needed to assess the reliability of the data. 
Brad J Bushman, Katherine Newman, Sandra L Calvert, Geraldine Downey, Mark Dredze, Michael Gottfredson, Nina G Jablonski, Ann S Masten, Calvin Morrill, Daniel B Neill, Daniel Romer, Daniel W Webster. Youth Violence: What We Know and What We Need to Know. American Psychologist, 2016;71(1):17-39. [PDF] [Bibtex] School shootings tear the fabric of society. In the wake of a school shooting, parents, pediatricians, policymakers, politicians, and the public search for "the" cause of the shooting. But there is no single cause. The causes of school shootings are extremely complex. After the Sandy Hook Elementary School rampage shooting in Newtown, Connecticut, we wrote a report for the National Science Foundation on what is known and not known about youth violence. This article summarizes and updates that report. After distinguishing violent behavior from aggressive behavior, we describe the prevalence of gun violence in the United States and age-related risks for violence. We delineate important differences between violence in the context of rare rampage school shootings, and much more common urban street violence. Acts of violence are influenced by multiple factors, often acting together. We summarize evidence on some major risk factors and protective factors for youth violence, highlighting individual and contextual factors, which often interact. We consider new quantitative "data mining" procedures that can be used to predict youth violence perpetrated by groups and individuals, recognizing critical issues of privacy and ethical concerns that arise in the prediction of violence. We also discuss implications of the current evidence for reducing youth violence, and we offer suggestions for future research. We conclude by arguing that the prevention of youth violence should be a national priority.

 2015 (28 Publications) Yu Wang, Eugene Agichtein, Tom Clark, Mark Dredze, Jeffrey Staton. Inferring latent user characteristics for analyzing political discussions in social media. Atlanta Computational Social Science Workshop, 2015. [Bibtex] Mark Dredze, David A Broniatowski, Michael C Smith, Karen M Hilyard. Understanding Vaccine Refusal: Why We Need Social Media Now. American Journal of Preventive Medicine (AJPM), 2015;50(4):550-552. [PDF] [Bibtex] [Data] Mauricio Santillana, Andre T Nguyen, Mark Dredze, Michael J Paul, Elaine Nsoesie, John S Brownstein. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLOS Computational Biology, 2015. [PDF] [Bibtex] We present a machine learning-based methodology capable of providing real-time ("nowcast") and forecast estimates of influenza activity in the US by leveraging data from multiple data sources including: Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system. Our main contribution consists of combining multiple influenza-like illnesses (ILI) activity estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions for up to four weeks ahead of the release of CDC's ILI reports. We evaluate the predictive ability of our ensemble approach during the 2013-2014 (retrospective) and 2014-2015 (live) flu seasons for each of the four weekly time horizons. 
Our ensemble approach demonstrates several advantages: (1) our ensemble method's predictions outperform every prediction using each data source independently, (2) our methodology can produce predictions one week ahead of GFT's real-time estimates with comparable accuracy, and (3) our two and three week forecast estimates have comparable accuracy to real-time predictions using an autoregressive model. Moreover, our results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowdsourced data, into influenza predictions in all time horizons. Matthew R Gormley, Mark Dredze, Jason Eisner. Approximation-Aware Dependency Parsing by Belief Propagation. Transactions of the Association for Computational Linguistics (TACL), 2015. [PDF] [Bibtex] We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n^3) runtime. It outputs the parse with maximum expected recall, but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find this gradient by backpropagation. That is, we treat the entire parser (approximations and all) as a differentiable circuit, as others have done for loopy CRFs (Domke, 2010; Stoyanov et al., 2011; Domke, 2011; Stoyanov and Eisner, 2012). The resulting parser obtains higher accuracy with fewer iterations of belief propagation than one trained by conditional log-likelihood. David A Broniatowski, Mark Dredze, Karen M Hilyard. News Articles are More Likely to be Shared if they Combine Statistics with Explanation. Conference of the Society for Medical Decision Making, 2015. 
[Bibtex] Matthew R Gormley, Mo Yu, Mark Dredze. Improved Relation Extraction with Feature-Rich Compositional Embedding Models. Empirical Methods in Natural Language Processing (EMNLP), 2015. [PDF] [Bibtex] Compositional embedding models build a representation (or embedding) for a linguistic structure based on its component word embeddings. We propose a Feature-rich Compositional Embedding Model (FCM) for relation extraction that is expressive, generalizes to new domains, and is easy to implement. The key idea is to combine (unlexicalized) handcrafted features with learned word embeddings. The model is able to directly tackle the difficulties met by traditional compositional embedding models, such as handling arbitrary types of sentence annotations and utilizing global information for composition. We test the proposed model on two relation extraction tasks, and demonstrate that our model outperforms both previous compositional models and traditional feature-rich models on the ACE 2005 relation extraction task, and the SemEval 2010 relation classification task. The combination of our model and a loglinear classifier with hand-crafted features gives state-of-the-art results. We made our implementation available for general use. Nanyun Peng, Mark Dredze. Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings. Empirical Methods in Natural Language Processing (EMNLP) (short paper), 2015. [PDF] [Bibtex] [Code] We consider the task of named entity recognition for Chinese social media. The long line of work in Chinese NER has focused on formal domains, and NER for social media has been largely restricted to English. We present a new corpus of Weibo messages annotated for both name and nominal mentions. Additionally, we evaluate three types of neural embeddings for representing Chinese text. Finally, we propose a joint training objective for the embeddings that makes use of both (NER) labeled and unlabeled raw text. 
Our methods yield a 9% improvement over a state-of-the-art baseline. Matthew Biggerstaff, David Alper, Mark Dredze, Spencer Fox, Isaac Chun-Hai Fung, Kyle S Hickmann, Bryan Lewis, Roni Rosenfeld, Jeffrey Shaman, Ming-Hsiang Tsou, Paola Velardi, Alessandro Vespignani, Lyn Finelli. Results from the Centers for Disease Control and Prevention's Predict the 2013-2014 Influenza Season Challenge. International Conference of Emerging Infectious Diseases Conference, 2015. [PDF] [Bibtex] Ellie Pavlick, Travis Wolfe, Pushpendre Rastogi, Chris Callison-Burch, Mark Dredze, Benjamin Van Durme. FrameNet+: Fast Paraphrastic Tripling of FrameNet. Association for Computational Linguistics (ACL) (short paper), 2015. [PDF] [Bibtex] [Data] We increase the lexical coverage of FrameNet through automatic paraphrasing. We use crowdsourcing to manually filter out bad paraphrases in order to ensure a high-precision resource. Our expanded FrameNet contains an additional 22K lexical units, a 3-fold increase over the current FrameNet, and achieves 40% better coverage when evaluated in a practical setting on New York Times data. Nanyun Peng, Mo Yu, Mark Dredze. An Empirical Study of Chinese Name Matching and Applications. Association for Computational Linguistics (ACL) (short paper), 2015. [PDF] [Bibtex] [Code] Methods for name matching, an important component to support downstream tasks such as entity linking and entity clustering, have focused on alphabetic languages, primarily English. In contrast, logogram languages such as Chinese remain untested. We evaluate methods for name matching in Chinese, including both string matching and learning approaches. Our approach, based on new representations for Chinese, improves both name matching and a downstream entity clustering task. Travis Wolfe, Mark Dredze, James Mayfield, Paul McNamee, Craig Harman, Tim Finin, Benjamin Van Durme. Interactive Knowledge Base Population. Unpublished Manuscript, 2015. 
[PDF] [Bibtex] Most work on building knowledge bases has focused on collecting entities and facts from as large a collection of documents as possible. We argue for and describe a new paradigm where the focus is on a high-recall extraction over a small collection of documents under the supervision of a human expert, that we call Interactive Knowledge Base Population (IKBP). Mrinal Kumar, Mark Dredze, Glen A Coppersmith, Munmun De Choudhury. Detecting Changes in Suicide Content Manifested in Social Media Following Celebrity Suicides. Conference on Hypertext and Social Media, 2015. [PDF] [Bibtex] The Werther effect describes the increased rate of completed or attempted suicides following the depiction of an individual's suicide in the media, typically a celebrity. We present findings on the prevalence of this effect in an online platform: r/SuicideWatch on Reddit. We examine both the posting activity and post content after the death of ten high-profile suicides. Posting activity increases following reports of celebrity suicides, and post content exhibits considerable changes that indicate increased suicidal ideation. Specifically, we observe that post-celebrity suicide content is more likely to be inward focused, manifest decreased social concerns, and laden with greater anxiety, anger, and negative emotion. Topic model analysis further reveals content in this period to switch to a more derogatory tone that bears evidence of self-harm and suicidal tendencies. We discuss the implications of our findings in enabling better community support to psychologically vulnerable populations, and the potential of building suicide prevention interventions following high-profile suicides. Michael C Smith, David A Broniatowski, Michael J Paul, Mark Dredze. Tracking Public Awareness of Influenza through Twitter. 3rd International Conference on Digital Disease Detection (DDD), 2015. [Bibtex] (rapid fire talk) Joanna E Cohen, Rebecca Shillenn, Mark Dredze, John W Ayers. 
Tobacco Watcher: Real-Time Global Tobacco Surveillance Using Online News Media. Annual Meeting of the Society for Research on Nicotine and Tobacco, 2015. [Bibtex] David A Broniatowski, Mark Dredze, Michael J Paul, Andrea Dugas. Using Social Media to Perform Local Influenza Surveillance in an Inner-City Hospital. JMIR Public Health and Surveillance, 2015. [PDF] [Bibtex] Background: Public health officials and policy makers in the United States expend significant resources at the national, state, county, and city levels to measure the rate of influenza infection. These individuals rely on influenza infection rate information to make important decisions during the course of an influenza season driving vaccination campaigns, clinical guidelines, and medical staffing. Web and social media data sources have emerged as attractive alternatives to supplement existing practices. While traditional surveillance methods take 1-2 weeks, and significant labor, to produce an infection estimate in each locale, web and social media data are available in near real-time for a broad range of locations. Objective: The objective of this study was to analyze the efficacy of flu surveillance from combining data from the websites Google Flu Trends and HealthTweets at the local level. We considered both emergency department influenza-like illness cases and laboratory-confirmed influenza cases for a single hospital in the City of Baltimore. Methods: This was a retrospective observational study comparing estimates of influenza activity of Google Flu Trends and Twitter to actual counts of individuals with laboratory-confirmed influenza, and counts of individuals presenting to the emergency department with influenza-like illness cases. Data were collected from November 20, 2011 through March 16, 2014. Each parameter was evaluated on the municipal, regional, and national scale. 
We examined the utility of social media data for tracking actual influenza infection at the municipal, state, and national levels. Specifically, we compared the efficacy of Twitter and Google Flu Trends data. Results: We found that municipal-level Twitter data was more effective than regional and national data when tracking actual influenza infection rates in a Baltimore inner-city hospital. When combined, national-level Twitter and Google Flu Trends data outperformed each data source individually. In addition, influenza-like illness data at all levels of geographic granularity were best predicted by national Google Flu Trends data. Conclusions: In order to overcome sensitivity to transient events, such as the news cycle, the best-fitting Google Flu Trends model relies on a 4-week moving average, suggesting that it may also be sacrificing sensitivity to transient fluctuations in influenza infection to achieve predictive power. Implications for influenza forecasting are discussed in this report. J Lee Westmaas, John W Ayers, Mark Dredze, Benjamin M Althouse. Evaluation of the Great American Smokeout by Digital Surveillance. Society of Behavioral Medicine, 2015. [Bibtex] (Citation Award, given to the best submissions) Glen A Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, Margaret Mitchell. CLPsych 2015 Shared Task: Depression and PTSD on Twitter. NAACL Workshop on Computational Linguistics and Clinical Psychology, 2015. [PDF] [Bibtex] This paper presents a summary of the Computational Linguistics and Clinical Psychology (CLPsych) 2015 shared and unshared tasks. These tasks aimed to provide apples-to-apples comparisons of various approaches to modeling language relevant to mental health from social media. The data used for these tasks is from Twitter users who state a diagnosis of depression or post traumatic stress disorder (PTSD) and demographically-matched community controls. 
The unshared task was a hackathon held at Johns Hopkins University in November 2014 to explore the data, and the shared task was conducted remotely, with each participating team submitting scores for a held-back test set of users. The shared task consisted of three binary classification experiments: (1) depression versus control, (2) PTSD versus control, and (3) depression versus PTSD. Classifiers were compared primarily via their average precision, though a number of other metrics are used along with this to allow a more nuanced interpretation of the performance measures. Mo Yu, Mark Dredze. Learning Composition Models for Phrase Embeddings. Transactions of the Association for Computational Linguistics (TACL), 2015. [PDF] [Bibtex] [Code] Lexical embeddings can serve as useful representations for words for a variety of NLP tasks, but learning embeddings for phrases can be challenging. While separate embeddings are learned for each word, this is infeasible for every phrase. We construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks with various phrase lengths. We make the implementation of our model and the datasets available for general use. Glen A Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead. From ADHD to SAD: analyzing the language of mental health on Twitter through self-reported diagnoses. NAACL Workshop on Computational Linguistics and Clinical Psychology, 2015. [PDF] [Bibtex] Many significant challenges exist for the mental health field, but one in particular is a lack of data available to guide research. 
Language provides a natural lens for studying mental health: much existing work and therapy have strong linguistic components, so the creation of a large, varied, language-centric dataset could provide significant grist for the field of mental health research. We examine a broad range of mental health conditions in Twitter data by identifying self-reported statements of diagnosis. We systematically explore language differences between ten conditions with respect to the general population, and to each other. Our aim is to provide guidance and a roadmap for where deeper exploration is likely to be fruitful. Nanyun Peng, Francis Ferraro, Mo Yu, Nicholas Andrews, Jay DeYoung, Max Thomas, Matthew R Gormley, Travis Wolfe, Craig Harman, Benjamin Van Durme, Mark Dredze. A Chinese Concrete NLP Pipeline. North American Chapter of the Association for Computational Linguistics (NAACL) (Demo Paper), 2015. [PDF] [Bibtex] Natural language processing research increasingly relies on the output of a variety of syntactic and semantic analytics. Yet integrating output from multiple analytics into a single framework can be time consuming and slow research progress. We present a CONCRETE Chinese NLP Pipeline: an NLP stack built using a series of open source systems integrated based on the CONCRETE data schema. Our pipeline includes data ingest, word segmentation, part of speech tagging, parsing, named entity recognition, relation extraction and cross document coreference resolution. Additionally, we integrate a tool for visualizing these annotations as well as allowing for the manual annotation of new data. We release our pipeline to the research community to facilitate work on Chinese language tasks that require rich linguistic annotations. Adrian Benton, Mark Dredze. Entity Linking for Spoken Language. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2015. 
[PDF] [Bibtex] [Data] Research on entity linking has considered a broad range of text, including newswire, blogs and web documents in multiple languages. However, the problem of entity linking for spoken language remains unexplored. Spoken language obtained from automatic speech recognition systems poses different types of challenges for entity linking; transcription errors can distort the context, and named entities tend to have high error rates. We propose features to mitigate these errors and evaluate the impact of ASR errors on entity linking using a new corpus of entity linked broadcast news transcripts. Travis Wolfe, Mark Dredze, Benjamin Van Durme. Predicate Argument Alignment using a Global Coherence Model. North American Chapter of the Association for Computational Linguistics (NAACL), 2015. [PDF] [Bibtex] We present a joint model for predicate argument alignment. We leverage multiple sources of semantic information, including temporal ordering constraints between events. These are combined in a max-margin framework to find a globally consistent view of entities and events across multiple documents, which leads to improvements over a very strong local baseline. Mo Yu, Matthew R Gormley, Mark Dredze. Combining Word Embeddings and Feature Embeddings for Fine-grained Relation Extraction. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2015. [PDF] [Bibtex] Compositional embedding models build a representation for a linguistic structure based on its component word embeddings. While recent work has combined these word embeddings with hand crafted features for improved performance, it was restricted to a small number of features due to model complexity, thus limiting its applicability. We propose a new model that conjoins features and word embeddings while maintaining a small number of parameters by learning feature embeddings jointly with the parameters of a compositional model. 
The result is a method that can scale to more features and more labels, while avoiding overfitting. We demonstrate that our model attains state-of-the-art results on ACE and ERE fine-grained relation extraction. Michael J Paul, Mark Dredze. SPRITE: Generalizing Topic Models with Structured Priors. Transactions of the Association for Computational Linguistics (TACL), 2015. [PDF] [Bibtex] [Code] We introduce SPRITE, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing SPRITE to be tailored to particular settings. We demonstrate this flexibility by constructing a SPRITE-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks. Shiliang Wang, Michael J Paul, Mark Dredze. Social Media as a Sensor of Air Quality and Public Response in China. Journal of Medical Internet Research (JMIR), 2015. [PDF] [Bibtex] Background: Recent studies have demonstrated the utility of social media data sources for a wide range of public health goals including disease surveillance, mental health trends, and health perceptions and sentiment. Most such research has focused on English-language social media for the task of disease surveillance. Objective: We investigated the value of Chinese social media for monitoring air quality trends and related public perceptions and response. The goal was to determine if this data is suitable for learning actionable information about pollution levels and public response. Methods: We mined a collection of 93 million messages from Sina Weibo, China's largest microblogging service. 
We experimented with different filters to identify messages relevant to air quality, based on keyword matching and topic modeling. We evaluated the reliability of the data filters by comparing message volume per city to air particle pollution rates obtained from the Chinese government for 74 cities. Additionally, we performed a qualitative study of the content of pollution-related messages by coding a sample of 170 messages for relevance to air quality, and whether the message included details such as a reactive behavior or a health concern. Results: The volume of pollution-related messages is highly correlated with particle pollution levels, with Pearson correlation values up to .718 (74 Haoyu Wang, Eduard Hovy, Mark Dredze. The Hurricane Sandy Twitter Corpus. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2015. [PDF] [Bibtex] [Data] The growing use of social media has made it a critical component of disaster response and recovery efforts. Both in terms of preparedness and response, public health officials and first responders have turned to automated tools to assist with organizing and visualizing large streams of social media. In turn, this has spurred new research into algorithms for information extraction, event detection and organization, and information visualization. One challenge of these efforts has been the lack of a common corpus for disaster response on which researchers can compare and contrast their work. This paper describes the Hurricane Sandy Twitter Corpus: 6.5 million geotagged Twitter posts from the geographic area and time period of the 2012 Hurricane Sandy. Michael J Paul, Mark Dredze, David A Broniatowski, Nicholas Generous. Worldwide Influenza Surveillance through Twitter. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2015. [Bibtex] Joanna E Cohen, John W Ayers, Mark Dredze. Tobacco Watcher: Real-time Global Surveillance for Tobacco Control. World Conference on Tobacco or Health (WCTOH), 2015. 
[Bibtex]

 2014 (23 Publications) Ning Gao, Douglas Oard, Mark Dredze. A Test Collection for Email Entity Linking. NIPS Workshop on Automated Knowledge Base Construction, 2014. [PDF] [Bibtex] Most prior work on entity linking has focused on linking name mentions found in third-person communication (e.g., news) to broad-coverage knowledge bases (e.g., Wikipedia). A restricted form of domain-specific entity linking has, however, been tried with email, linking mentions of people to specific email addresses. This paper introduces a new test collection for the task of linking mentions of people, organizations, and locations to Wikipedia. Annotation of 200 randomly selected entities of each type from the Enron email collection indicates that domain specific knowledge bases are indeed required to get good coverage of people and organizations, but that Wikipedia provides good (93%) coverage for the named mentions of locations in the Enron collection. Furthermore, experiments with an existing entity linking system indicate that the absence of a suitable referent in Wikipedia can easily be recognized by automated systems, with NIL precision (i.e., correct detection of the absence of a suitable referent) above 90% for all three entity types. Adrian Benton, Jay DeYoung, Adam Teichert, Mark Dredze, Benjamin Van Durme, Stephen Mayhew, Max Thomas. Faster (and Better) Entity Linking with Cascades. NIPS Workshop on Automated Knowledge Base Construction, 2014. [PDF] [Bibtex] Entity linking requires ranking thousands of candidates for each query, a time consuming process and a challenge for large scale linking. Many systems rely on prediction cascades to efficiently rank candidates. However, the design of these cascades often requires manual decisions about pruning and feature use, limiting the effectiveness of cascades. We present Slinky, a modular, flexible, fast and accurate entity linker based on prediction cascades. 
We adapt the web-ranking prediction cascade learning algorithm, Cronus, in order to learn cascades that are both accurate and fast. We show that by balancing between accurate and fast linking, this algorithm can produce Slinky configurations that are significantly faster and more accurate than a baseline configuration and an alternate cascade learning method with a fixed introduction of features. Mo Yu, Matthew R Gormley, Mark Dredze. Factor-based Compositional Embedding Models. NIPS Workshop on Learning Semantics, 2014. [PDF] [Bibtex] [Code] Rebecca Knowles, Mark Dredze, Kathleen Evans, Elyse Lasser, Tom Richards, Jonathan Weiner, Hadi Kharrazi. High Risk Pregnancy Prediction from Clinical Text. NIPS Workshop on Machine Learning for Clinical Data Analysis, 2014. [PDF] [Bibtex] Michael J Paul, Mark Dredze, David A Broniatowski. Twitter Improves Influenza Forecasting. PLOS Currents Outbreaks, 2014. [PDF] [Bibtex] Accurate disease forecasts are imperative when preparing for influenza epidemic outbreaks; nevertheless, these forecasts are often limited by the time required to collect new, accurate data. In this paper, we show that data from the microblogging community Twitter significantly improves influenza forecasting. Most prior influenza forecast models are tested against historical influenza-like illness (ILI) data from the U.S. Centers for Disease Control and Prevention (CDC). These data are released with a one-week lag and are often initially inaccurate until the CDC revises them weeks later. Since previous studies utilize the final, revised data in evaluation, their evaluations do not properly determine the effectiveness of forecasting. Our experiments using ILI data available at the time of the forecast show that models incorporating data derived from Twitter can reduce forecasting error by 17-30% over a baseline that only uses historical data. 
For a given level of accuracy, using Twitter data produces forecasts that are two to four weeks ahead of baseline models. Additionally, we find that models using Twitter data are, on average, better predictors of influenza prevalence than are models using data from Google Flu Trends, the leading web data source. Joy L Lee, Matthew DeCamp, Mark Dredze, Margaret S Chisolm, Zackary D Berger. What Are Health-related Users Tweeting? A Qualitative Content Analysis of Health-related Users and their Messages on Twitter. Journal of Medical Internet Research (JMIR), 2014. [PDF] [Bibtex] Michael J Paul, Mark Dredze. Discovering Health Topics in Social Media Using Topic Models. PLoS ONE, 2014. [PDF] [Bibtex] [Data] By aggregating self-reported health statuses across millions of users, we seek to characterize the variety of health information discussed in Twitter. We describe a topic modeling framework for discovering health topics in Twitter, a social media website. This is an exploratory approach with the goal of understanding what health topics are commonly discussed in social media. This paper describes in detail a statistical topic model created for this purpose, the Ailment Topic Aspect Model (ATAM), as well as our system for filtering general Twitter data based on health keywords and supervised classification. We show how ATAM and other topic models can automatically infer health topics in 144 million Twitter messages from 2011 to 2013. ATAM discovered 13 coherent clusters of Twitter messages, some of which correlate with seasonal influenza (r = 0.689) and allergies (r = 0.810) temporal surveillance data, as well as exercise (r = .534) and obesity (r = −.631) related geographic survey data in the United States. These results demonstrate that it is possible to automatically discover topics that attain statistically significant correlations with ground truth data, despite using minimal human supervision and no historical data to train the model, in contrast to prior work. 
Additionally, these results demonstrate that a single general-purpose model can identify many different health topics in social media. David A Broniatowski, Michael J Paul, Mark Dredze. Twitter: Big Data Opportunities (Letter) Science, 2014;345(6193):148. [PDF] [Bibtex] Ahmed Abbasi, Donald Adjeroh, Mark Dredze, Michael J Paul, Fatemeh Mariam Zahedi, Huimin Zhao, Nitin Walia, Hemant Jain, Patrick Sanvanson, Reza Shaker, Marco D Huesch, Richard Beal, Wanhong Zheng, Marie Abate, Arun Ross. Social Media Analytics for Smart Health. IEEE Intelligent Systems, 2014;29(2):60--80. [PDF] [Bibtex] Byron C Wallace, Michael J Paul, Urmimala Sarkar, Thomas A Trikalinos, Mark Dredze. A Large-Scale Quantitative Analysis of Latent Factors and Sentiment in Online Doctor Reviews. Journal of the American Medical Informatics Association (JAMIA), 2014;21(6):1098--1103. [PDF] [Bibtex] Mark Dredze, Renyuan Cheng, Michael J Paul, David A Broniatowski. HealthTweets.org: A Platform for Public Health Surveillance using Twitter. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014. [PDF] [Bibtex] [Website] We present HealthTweets.org, a new platform for sharing the latest research results on Twitter data with researchers and public officials. In this demo paper, we describe data collection, processing, and features of the site. The goal of this service is to transition results from research to practice. Michael J Paul, Mark Dredze, David A Broniatowski. Challenges in Influenza Forecasting and Opportunities for Social Media. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014. [Bibtex] Shiliang Wang, Michael J Paul, Mark Dredze. Exploring Health Topics in Chinese Social Media: An Analysis of Sina Weibo. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014. [PDF] [Bibtex] This paper seeks to identify and characterize health-related topics discussed on the Chinese microblogging website, Sina Weibo. 
We identified nearly 1 million messages containing health-related keywords, filtered from a dataset of 93 million messages spanning five years. We applied probabilistic topic models to this dataset and identified the prominent health topics. We show that a variety of health topics are discussed in Sina Weibo, and that four flu-related topics are correlated with monthly influenza case rates in China. Mo Yu, Mark Dredze. Improving Lexical Embeddings with Semantic Knowledge. Association for Computational Linguistics (ACL) (short paper), 2014. [PDF] [Bibtex] [Code] Word embeddings learned on unlabeled data are a popular tool in semantics, but may not capture the desired semantics. We propose a new learning objective that incorporates both a neural language model objective and prior knowledge from semantic resources to learn improved lexical semantic embeddings. We demonstrate that our embeddings improve over those learned solely on raw text in three settings: language modeling, measuring semantic similarity, and predicting human judgements. Nanyun Peng, Yiming Wang, Mark Dredze. Learning Polylingual Topic Models from Code-Switched Social Media Documents. Association for Computational Linguistics (ACL) (short paper), 2014. [PDF] [Bibtex] [Code] Code-switched documents are common in social media, providing evidence for polylingual topic models to infer aligned topics across languages. We present Code-Switched LDA (csLDA), which infers language specific topic distributions based on code-switched documents to facilitate multi-lingual corpus analysis. We experiment on two code-switching corpora (English-Spanish Twitter data and English-Chinese Weibo data) and show that csLDA improves perplexity over LDA, and learns semantically coherent aligned topics as judged by human annotators. Glen A Coppersmith, Mark Dredze, Craig Harman. Quantifying Mental Health Signals in Twitter. ACL Workshop on Computational Linguistics and Clinical Psychology, 2014. 
[PDF] [Bibtex] The ubiquity of social media provides a rich opportunity to enhance the data available to mental health clinicians and researchers, enabling a better-informed and better-equipped mental health field. We present analysis of mental health phenomena in publicly available Twitter data, demonstrating how rigorous application of simple natural language processing methods can yield insight into specific disorders as well as mental health writ large, along with evidence that as-of-yet undiscovered linguistic signals relevant to mental health exist in social media. We present a novel method for gathering data for a range of mental illnesses quickly and cheaply, then focus on analysis of four in particular: post-traumatic stress disorder (PTSD), major depressive disorder, bipolar disorder, and seasonal affective disorder. We intend for these proof-of-concept results to inform the necessary ethical discussion regarding the balance between the utility of such data and the privacy of mental health related information. Glen A Coppersmith, Craig Harman, Mark Dredze. Measuring Post Traumatic Stress Disorder in Twitter. International Conference on Weblogs and Social Media (ICWSM), 2014. [PDF] [Bibtex] Traditional mental health studies rely on information primarily collected and analyzed through personal contact with a health care professional. Recent work has shown the utility of social media data for studying depression, but there have been limited evaluations of other mental health conditions. We consider post traumatic stress disorder (PTSD), a serious condition that affects millions worldwide, with especially high rates in military veterans. We show how to obtain a PTSD classifier for social media using simple searches of available Twitter data, a significant reduction in training data cost compared to previous work on mental health. 
We demonstrate its utility by an examination of language use from PTSD individuals, and by detecting elevated rates of PTSD at and around US military bases using our classifiers. Miles Osborne, Mark Dredze. Facebook, Twitter and Google Plus for Breaking News: Is there a winner? International Conference on Weblogs and Social Media (ICWSM), 2014. [PDF] [Bibtex] [Supplement] Twitter is widely seen as being the go to place for breaking news. Recently however, competing Social Media have begun to carry news. Here we examine how Facebook, Google Plus and Twitter report on breaking news. We consider coverage (whether news events are reported) and latency (the time when they are reported). Using data drawn from three weeks in December 2013, we identify 29 major news events, ranging from celebrity deaths, plague outbreaks to sports events. We find that all media carry the same major events, but Twitter continues to be the preferred medium for breaking news, almost consistently leading Facebook or Google Plus. Facebook and Google Plus largely repost newswire stories and their main research value is that they conveniently package multiple sources of information together. Matthew R Gormley, Margaret Mitchell, Benjamin Van Durme, Mark Dredze. Low-Resource Semantic Role Labeling. Association for Computational Linguistics (ACL), 2014. [PDF] [Bibtex] [Code] We explore the extent to which high-resource manual annotations such as treebanks are necessary for the task of semantic role labeling (SRL). We examine how performance changes without syntactic supervision, comparing both joint and pipelined methods to induce latent syntax. This work highlights a new application of unsupervised grammar induction and demonstrates several approaches to SRL in the absence of supervised syntax. Our best models obtain competitive results in the high-resource setting and state-of-the-art results in the low resource setting, reaching 72.48% F1 averaged across languages. 
We release our code for this work along with a larger toolkit for specifying arbitrary graphical structure. Nicholas Andrews, Jason Eisner, Mark Dredze. Robust Entity Clustering via Phylogenetic Inference. Association for Computational Linguistics (ACL), 2014. [PDF] [Bibtex] [Code] Entity clustering must determine when two named-entity mentions refer to the same entity. Typical approaches use a pipeline architecture that clusters the mentions using fixed or learned measures of name and context similarity. In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Clustering the mentions into entities depends on recovering this copying tree jointly with estimating models of the mutation process and parent selection process. We present a block Gibbs sampler for posterior inference and an empirical evaluation on several datasets. On a challenging Twitter corpus, our method outperforms the best baseline by 12.6 points of F1 score. John W Ayers, Benjamin M Althouse, Mark Dredze. Could Behavioral Medicine Lead the Web Data Revolution? Journal of the American Medical Association (JAMA), 2014;311(14):1399--1400. [PDF] [Bibtex] John W Ayers, Benjamin M Althouse, Morgan Johnson, Mark Dredze, Joanna E Cohen. What's the Healthiest Day? Circaseptan (Weekly) Rhythms in Healthy Considerations. American Journal of Preventive Medicine (AJPM), 2014;47(1):73-76. [PDF] [Bibtex] Benjamin M Althouse, Jon-Patrick Allem, Matt Childers, Mark Dredze, John W Ayers. Population Health Concerns During the United States' Great Recession. American Journal of Preventive Medicine (AJPM), 2014;46(2):166-170. [PDF] [Bibtex]

 2013 (14 Publications) Mark Dredze, Bill Schilit. Facet suggestion for search query augmentation. US Patent 8,433,705, 2013. [Bibtex] David A Broniatowski, Michael J Paul, Mark Dredze. National and Local Influenza Surveillance through Twitter: An Analysis of the 2012-2013 Influenza Epidemic. PLOS ONE, 2013. [PDF] [Bibtex] Social media have been proposed as a data source for influenza surveillance because they have the potential to offer real-time access to millions of short, geographically localized messages containing information regarding personal well-being. However, accuracy of social media surveillance systems declines with media attention because media attention increases "chatter" -- messages that are about influenza but that do not pertain to an actual infection -- masking signs of true influenza prevalence. This paper summarizes our recently developed influenza infection detection algorithm that automatically distinguishes relevant tweets from other chatter, and we describe our current influenza surveillance system which was actively deployed during the full 2012-2013 influenza season. Our objective was to analyze the performance of this system during the most recent 2012--2013 influenza season and to analyze the performance at multiple levels of geographic granularity, unlike past studies that focused on national or regional surveillance. Our system's influenza prevalence estimates were strongly correlated with surveillance data from the Centers for Disease Control and Prevention for the United States (r = 0.93, p < 0.001) as well as surveillance data from the Department of Health and Mental Hygiene of New York City (r = 0.88, p < 0.001). Our system detected the weekly change in direction (increasing or decreasing) of influenza prevalence with 85% accuracy, a nearly twofold increase over a simpler model, demonstrating the utility of explicitly distinguishing infection tweets from other chatter. 
Travis Wolfe, Benjamin Van Durme, Mark Dredze, Nicholas Andrews, Charley Beller, Chris Callison-Burch, Jay DeYoung, Justin Snyder, Jonathan Weese, Tan Xu, Xuchen Yao. PARMA: A Predicate Argument Aligner. Association for Computational Linguistics (ACL) (short paper), 2013. [PDF] [Bibtex] We introduce PARMA, a system for cross-document, semantic predicate and argument alignment. Our system integrates popular lexical semantic resources into a simple discriminative model. PARMA achieves state of the art results. We suggest that existing efforts have focussed on data that is too easy, and we provide a more difficult dataset based on MT translation references which has a lower baseline which we beat by 17% absolute F1. Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow. Sub-Lexical and Contextual Modeling of Out-of-Vocabulary Words in Speech Recognition. Technical Report 10, Human Language Technology Center of Excellence, Johns Hopkins University, 2013. [PDF] [Bibtex] Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of sub-word units. We present a novel probabilistic model to learn the sub-word lexicon optimized for a given task. We consider the task of out-of-vocabulary (OOV) word detection, which relies on output from a hybrid system. We combine the proposed hybrid system with confidence based metrics to improve OOV detection performance. Previous work addresses OOV detection as a binary classification task, where each region is independently classified using local information. 
We propose to treat OOV detection as a sequence labeling problem, and we show that 1) jointly predicting out-of-vocabulary regions, 2) including contextual information from each region, and 3) learning sub-lexical units optimized for this task, leads to substantial improvements with respect to state-of-the-art on an English Broadcast News and MIT Lectures task. Mark Dredze, Michael J Paul, Shane Bergsma, Hieu Tran. Carmen: A Twitter Geolocation System with Applications to Public Health. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013. [PDF] [Bibtex] [Code] Public health applications using social media often require accurate, broad-coverage location information. However, the standard information provided by social media APIs, such as Twitter, covers a limited number of messages. This paper presents Carmen, a geolocation system that can determine structured location information for messages provided by the Twitter API. Our system utilizes geocoding tools and a combination of automatic and manual alias resolution methods to infer location structures from GPS positions and user-provided profile data. We show that our system is accurate and covers many locations, and we demonstrate its utility for improving influenza surveillance. Michael J Paul, Byron C Wallace, Mark Dredze. What Affects Patient (Dis)satisfaction? Analyzing Online Doctor Ratings with a Joint Topic-Sentiment Model. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013. [PDF] [Bibtex] We analyze patient reviews of doctors using a novel probabilistic joint model of aspect and sentiment based on factorial LDA. We leverage this model to exploit a small set of previously annotated reviews to automatically analyze the topics and sentiment latent in over 50,000 online reviews of physicians (and we make this dataset publicly available). 
The proposed model outperforms baseline models for this task with respect to model perplexity and sentiment classification. We report the most representative words with respect to positive and negative sentiment along three clinical aspects, thus complementing existing qualitative work exploring patient reviews of physicians. Justin Snyder, Rebecca Knowles, Mark Dredze, Matthew R Gormley, Travis Wolfe. Topic Models and Metadata for Visualizing Text Corpora. North American Chapter of the Association for Computational Linguistics (NAACL) (Demo Paper), 2013. [PDF] [Bibtex] Effectively exploring and analyzing large text corpora requires visualizations that provide a high level summary. Past work has relied on faceted browsing of document metadata or on natural language processing of document text. In this paper, we present a new web-based tool that integrates topics learned from an unsupervised topic model in a faceted browsing experience. The user can manage topics, filter documents by topic and summarize views with metadata and topic graphs. We report a user study of the usefulness of topics in our tool. Damianos Karakos, Mark Dredze, Sanjeev Khudanpur. Estimating Confusions in the ASR Channel for Improved Topic-based Language Model Adaptation. Technical Report 8, Johns Hopkins University, 2013. [PDF] [Bibtex] Human language is a combination of elemental languages/domains/styles that change across and sometimes within discourses. Language models, which play a crucial role in speech recognizers and machine translation systems, are particularly sensitive to such changes, unless some form of adaptation takes place. One approach to speech language model adaptation is self-training, in which a language model's parameters are tuned based on automatically transcribed audio. However, transcription errors can misguide self-training, particularly in challenging settings such as conversational speech. 
In this work, we propose a model that considers the confusions (errors) of the ASR channel. By modeling the likely confusions in the ASR output instead of using just the 1-best, we improve self-training efficacy by obtaining a more reliable reference transcription estimate. We demonstrate improved topic-based language modeling adaptation results over both 1-best and lattice self-training using our ASR channel confusion estimates on telephone conversations. Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, David Yarowsky. Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF] [Bibtex] [Data] Hidden properties of social media users, such as their ethnicity, gender, and location, are often reflected in their observed attributes, such as their first and last names. Furthermore, users who communicate with each other often have similar hidden properties. We propose an algorithm that exploits these insights to cluster the observed attributes of hundreds of millions of Twitter users. Attributes such as user names are grouped together if users with those names communicate with other similar users. We separately cluster millions of unique first names, last names, and user-provided locations. The efficacy of these clusters is then evaluated on a diverse set of classification tasks that predict hidden user properties such as ethnicity, geographic location, gender, language, and race, using only profile names and locations when appropriate. Our readily-replicable approach and publicly released clusters are shown to be remarkably effective and versatile, substantially outperforming state-of-the-art approaches and human accuracy on each of the tasks studied. Mahesh Joshi, Mark Dredze, William W Cohen, Carolyn P Rose. What's in a Domain? Multi-Domain Learning for Multi-Attribute Data. 
North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2013. [PDF] [Bibtex] Multi-Domain learning assumes that a single metadata attribute is used in order to divide the data into so-called domains. However, real-world datasets often have multiple metadata attributes that can divide the data into domains. It is not always apparent which single attribute will lead to the best domains, and more than one attribute might impact classification. We propose extensions to two multi-domain learning techniques for our multi-attribute setting, enabling them to simultaneously learn from several metadata attributes. Experimentally, they outperform the multi-domain learning baseline, even when it selects the single "best" attribute. Alex Lamb, Michael J Paul, Mark Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2013. [PDF] [Bibtex] [Data] Twitter has been shown to be a fast and reliable method for disease surveillance of common illnesses like influenza. However, previous work has relied on simple content analysis, which conflates flu tweets that report infection with those that express concerned awareness of the flu. By discriminating these categories, as well as tweets about the authors versus about others, we demonstrate significant improvements on influenza surveillance using Twitter. Michael J Paul, Mark Dredze. Drug Extraction from the Web: Summarizing Drug Experiences with Multi-Dimensional Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF] [Bibtex] Multi-dimensional latent text models, such as factorial LDA (f-LDA), capture multiple factors of corpora, creating structured output for researchers to better understand the contents of a corpus. 
We consider such models for clinical research of new recreational drugs and trends, an important application for mining current information for healthcare workers. We use a "three-dimensional" f-LDA variant to jointly model combinations of drug (marijuana, salvia, etc.), aspect (effects, chemistry, etc.) and route of administration (smoking, oral, etc.). Since a purely unsupervised topic model is unlikely to discover these specific factors of interest, we develop a novel method of incorporating prior knowledge by leveraging user generated tags as priors in our model. We demonstrate that this model can be used as an exploratory tool for learning about these drugs from the Web by applying it to the task of extractive summarization. In addition to providing useful output for this important public health task, our prior-enriched model provides a framework for the application of f-LDA to other tasks. Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Machine Learning, 2013;91(2):155-187. [PDF] [Bibtex] We present AROW, an online learning algorithm for binary and multiclass problems that combines large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive mistake bounds for the binary and multiclass settings that are similar in form to the second order perceptron bound. Our bounds do not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques. Empirical evaluations show that AROW achieves state-of-the-art performance on a wide range of binary and multiclass tasks, as well as robustness in the face of non-separable data. Delip Rao, Paul McNamee, Mark Dredze. Entity Linking: Finding Extracted Entities in a Knowledge Base. 
Multi-source, Multi-lingual Information Extraction and Summarization, 2013. [Bibtex] In the menagerie of tasks for information extraction, entity linking is a new beast that has drawn a lot of attention from NLP practitioners and researchers recently. Entity Linking, also referred to as record linkage or entity resolution, involves aligning a textual mention of a named-entity to an appropriate entry in a knowledge base, which may or may not contain the entity. This has manifold applications ranging from linking patient health records to maintaining personal credit files, prevention of identity crimes, and supporting law enforcement. We discuss the key challenges present in this task and we present a high-performing system that links entities using max-margin ranking. We also summarize recent work in this area and describe several open research problems.

 2012 (17 Publications) Kristian Hammond, Jerome Budzik, Lawrence Birnbaum, Kevin Livingston, Mark Dredze. Request initiated collateral content offering. US Patent 8,260,874, 2012. [Bibtex] Mark Dredze. How Social Media Will Change Public Health. IEEE Intelligent Systems, 2012;27(4):81-84. [Bibtex] [Link] Recent work in machine learning and natural language processing has studied the health content of tweets and demonstrated the potential for extracting useful public health information from their aggregation. This article examines the types of health topics discussed on Twitter, and how tweets can both augment existing public health capabilities and enable new ones. The author also discusses key challenges that researchers must address to deliver high-quality tools to the public health community. Michael J Paul, Mark Dredze. Factorial LDA: Sparse Multi-Dimensional Text Models. Neural Information Processing Systems (NIPS), 2012. [PDF] [Bibtex] Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors. Alex Lamb, Michael J Paul, Mark Dredze. Investigating Twitter as a Source for Studying Behavioral Responses to Epidemics. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF] [Bibtex] We present preliminary results for mining concerned awareness of influenza tweets. 
We describe our data set construction and experiments with binary classification of data into influenza versus general messages and classification into concerned awareness and existing infection. Atul Nakhasi, Ralph J Passarella, Sarah G Bell, Michael J Paul, Mark Dredze, Peter J Pronovost. Malpractice and Malcontent: Analyzing Medical Complaints in Twitter. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF] [Bibtex] In this paper we report preliminary results from a study of Twitter to identify patient safety reports, which offer an immediate, untainted, and expansive patient perspective unlike any other mechanism to date for this topic. We identify patient safety related tweets and characterize them by which medical populations caused errors, who reported these errors, what types of errors occurred, and what emotional states were expressed in response. Our long term goal is to improve the handling and reduction of errors by incorporating this patient input into the patient safety process. Michael J Paul, Mark Dredze. Experimenting with Drugs (and Topic Models): Multi-Dimensional Exploration of Recreational Drug Discussions. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF] [Bibtex] Clinical research of new recreational drugs and trends requires mining current information from non-traditional text sources. In this work we support such research through the use of a multi-dimensional latent text model -- factorial LDA -- that captures orthogonal factors of corpora, creating structured output for researchers to better understand the contents of a corpus. Since a purely unsupervised model is unlikely to discover specific factors of interest to clinical researchers, we modify the structure of factorial LDA to incorporate prior knowledge, including the use of observed variables, informative priors and background components. 
The resulting model learns factors that correspond to drug type, delivery method (smoking, injection, etc.), and aspect (chemistry, culture, effects, health, usage). We demonstrate that the improved model yields better quantitative and more interpretable results. Ralph J Passarella, Atul Nakhasi, Sarah G Bell, Michael J Paul, Peter J Pronovost, Mark Dredze. Twitter as a Source for Learning about Patient Safety Events. Annual Symposium of the American Medical Informatics Association (AMIA), 2012. [Bibtex] Damianos Karakos, Brian Roark, Izhak Shafran, Kenji Sagae, Maider Lehr, Emily Prud'hommeaux, Puyang Xu, Nathan Glenn, Sanjeev Khudanpur, Murat Saraclar, Dan Bikel, Mark Dredze, Chris Callison-Burch, Yuan Cao, Keith Hall, Eva Hasler, Philipp Koehn, Adam Lopez, Matt Post, Darcey Riley. Deriving conversation-based features from unlabeled speech for discriminative language modeling. International Speech Communication Association (INTERSPEECH), 2012. [PDF] [Bibtex] The perceptron algorithm was used in [1] to estimate discriminative language models which correct errors in the output of ASR systems. In its simplest version, the algorithm simply increases the weight of n-gram features which appear in the correct (oracle) hypothesis and decreases the weight of n-gram features which appear in the 1-best hypothesis. In this paper, we show that the perceptron algorithm can be successfully used in a semi-supervised learning (SSL) framework, where limited amounts of labeled data are available. Our framework has some similarities to graph-based label propagation [2] in the sense that a graph is built based on proximity of unlabeled conversations, and then it is used to propagate confidences (in the form of features) to the labeled data, based on which perceptron trains a discriminative model. The novelty of our approach lies in the fact that the confidence "flows" from the unlabeled data to the labeled data, and not vice-versa, as is done traditionally in SSL. 
Experiments conducted at the 2011 CLSP Summer Workshop on the conversational telephone speech corpora Dev04f and Eval04f demonstrate the effectiveness of the proposed approach. Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Efficient Structured Language Modeling for Speech Recognition. International Speech Communication Association (INTERSPEECH), 2012. [PDF] [Bibtex] The structured language model (SLM) of [1] was one of the first to successfully integrate syntactic structure into language models. We extend the SLM framework in two new directions. First, we propose a new syntactic hierarchical interpolation that improves over previous approaches. Second, we develop a general information-theoretic algorithm for pruning the underlying Jelinek-Mercer interpolated LM used in [1], which substantially reduces the size of the LM, enabling us to train on large data. When combined with hill-climbing [2] the SLM is an accurate model, space-efficient and fast for rescoring large speech lattices. Experimental results on broadcast news demonstrate that the SLM outperforms a large 4-gram LM. Nicholas Andrews, Jason Eisner, Mark Dredze. Name Phylogeny: A Generative Model of String Variation. Empirical Methods in Natural Language Processing (EMNLP), 2012. [PDF] [Bibtex] Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. 
We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance. Mahesh Joshi, Mark Dredze, William W Cohen, Carolyn P Rose. Multi-Domain Learning: When Do Domains Matter? Empirical Methods in Natural Language Processing (EMNLP), 2012. [PDF] [Bibtex] We present a systematic analysis of existing multi-domain learning approaches with respect to two questions. First, many multi-domain learning algorithms resemble ensemble learning algorithms. (1) Are multi-domain learning improvements the result of ensemble learning effects? Second, these algorithms are traditionally evaluated in a balanced label setting, although in practice many multi-domain settings have domain-specific label biases. When multi-domain learning is applied to these settings, (2) are multi-domain methods improving because they capture domain-specific class biases? An understanding of these two issues presents a clearer idea about where the field has had success in multi-domain learning, and it suggests some important open questions for improving beyond the current state of the art. Ariya Rastrow, Sanjeev Khudanpur, Mark Dredze. Revisiting the Case for Explicit Syntactic Information in Language Models. NAACL Workshop on the Future of Language Modeling for HLT, 2012. [PDF] [Bibtex] Statistical language models used in deployed systems for speech recognition, machine translation and other human language technologies are almost exclusively n-gram models. They are regarded as linguistically naive, but estimating them from any amount of text, large or small, is straightforward. Furthermore, they have doggedly matched or outperformed numerous competing proposals for syntactically well-motivated models. This unusual resilience of n-grams, as well as their weaknesses, are examined here. 
It is demonstrated that n-grams are good word-predictors, even linguistically speaking, in a large majority of word-positions, and it is suggested that to improve over n-grams, one must explore syntax-aware (or other) language models that focus on positions where n-grams are weak. Spence Green, Nicholas Andrews, Matthew R Gormley, Mark Dredze, Christopher D Manning. Entity Clustering Across Languages. North American Chapter of the Association for Computational Linguistics (NAACL), 2012. [PDF] [Bibtex] Standard entity clustering systems commonly rely on mention (string) matching, syntactic features, and linguistic resources like English WordNet. When co-referent text mentions appear in different languages, these techniques cannot be easily applied. Consequently, we develop new methods for clustering text mentions across documents and languages simultaneously, producing cross-lingual entity clusters. Our approach extends standard clustering algorithms with cross-lingual mention and context similarity measures. Crucially, we do not assume a pre-existing entity list (knowledge base), so entity characteristics are unknown. On an Arabic-English corpus that contains seven different text genres, our best model yields a 24.3% F1 gain over the baseline. Matthew R Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner. Shared Components Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2012. [PDF] [Bibtex] With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. 
In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters. Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining. Association for Computational Linguistics (ACL), 2012. [PDF] [Bibtex] Long-span features, such as syntax, can improve language models for tasks such as speech recognition and machine translation. However, these language models can be difficult to use in practice because of the time required to generate features for rescoring a large hypothesis set. In this work, we propose substructure sharing, which saves duplicate work in processing hypothesis sets with redundant hypothesis structures. We apply substructure sharing to a dependency parser and part of speech tagger to obtain significant speedups, and further improve the accuracy of these tools through up-training. When using these improved tools in a language model for speech recognition, we obtain significant speed improvements with both N-best and hill climbing rescoring, and show that up-training leads to WER reduction. Koby Crammer, Alex Kulesza, Mark Dredze. New H-∞ Bounds for the Recursive Least Squares Algorithm Exploiting Input Structure. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012. [PDF] [Bibtex] The well known recursive least squares (RLS) algorithm has been widely used for many years. Most analyses of RLS have assumed statistical properties of the data or noise process, but recent robust H∞ analyses have been used to bound the ratio of the performance of the algorithm to the total noise. 
In this paper, we provide the first additive analysis bounding the difference between performance and noise. Our analysis provides additional convergence guarantees in general, and particularly with structured input data. We illustrate the analysis using human speech and white noise. Koby Crammer, Mark Dredze, Fernando Pereira. Confidence-Weighted Linear Classification for Text Categorization. Journal of Machine Learning Research (JMLR), 2012;13(Jun):1891-1926. [PDF] [Bibtex] Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of text-categorization tasks shows that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing.

 2011 (12 Publications) Joshua T Vogelstein, William R Gray, Jason G Martin, Glen A Coppersmith, Mark Dredze, J Bogovic, J L Prince, S M Resnick, Carey E Priebe, R Jacob Vogelstein. Connectome Classification using Statistical Graph Theory and Machine Learning. Society for Neuroscience (Poster), 2011. [Bibtex] Spence Green, Nicholas Andrews, Matthew R Gormley, Mark Dredze, Christopher D Manning. Cross-lingual Coreference Resolution: A New Task for Multilingual Comparable Corpora. Technical Report 6, Human Language Technology Center of Excellence, Johns Hopkins University, 2011. [Bibtex] We introduce cross-lingual coreference resolution, the task of grouping entity mentions with a common referent in a multilingual corpus. Information, especially on the web, is increasingly multilingual. We would like to track entity references across languages without machine translation, which is expensive and unavailable for many language pairs. Therefore, we develop a set of models that rely on decreasing levels of parallel resources: a bitext, a bilingual lexicon, and a parallel name list. We propose baselines, provide experimental results, and analyze sources of error. Across a range of metrics, we find that even our lowest resource model gives a 2.5% F1 absolute improvement over the strongest baseline. Our results present a positive outlook for cross-lingual coreference resolution even in low resource languages. We are releasing our cross-lingual annotations for the ACE2008 Arabic-English evaluation corpus. Matthew R Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner. Shared Components Topic Models with Application to Selectional Preference. NIPS Workshop on Learning Semantics, 2011. [PDF] [Bibtex] Damianos Karakos, Mark Dredze, Kenneth Church, Aren Jansen, Sanjeev Khudanpur. Estimating Document Frequencies in a Speech Corpus. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. 
[PDF] [Bibtex] Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition. Whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. There is an opportunity for the ensemble to reduce noise, assuming that the errors across language models are relatively uncorrelated. The opportunity for improvement is larger when WER is high. This paper considers a pairing task application that could benefit from improved estimates of df. The pairing task inputs conversational sides from the English Fisher corpus and outputs estimates of which sides were from the same conversation. Better estimates of df lead to better performance on this task. Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Adapting N-Gram Maximum Entropy Language Models with Conditional Entropy Regularization. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF] [Bibtex] Accurate estimates of language model parameters are critical for building quality text generation systems, such as automatic speech recognition. However, text training data for a domain of interest is often unavailable. Instead, we use semi-supervised model adaptation; parameters are estimated using both unlabeled in-domain data (raw speech audio) and labeled out-of-domain data (text). 
In this work, we present a new semi-supervised language model adaptation procedure for Maximum Entropy models with n-gram features. We augment the conventional maximum likelihood training criterion on out-of-domain text data with an additional term to minimize conditional entropy on in-domain audio. Additionally, we demonstrate how to compute conditional entropy efficiently on speech lattices using first- and second-order expectation semirings. We demonstrate improvements in terms of word error rate over other adaptation techniques when adapting a maximum entropy language model from broadcast news to MIT lectures. Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Efficient Discriminative Training of Long-Span Language Models. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF] [Bibtex] Long-span language models, such as those involving syntactic dependencies, produce more coherent text than their n-gram counterparts. However, evaluating the large number of sentence-hypotheses in a packed representation such as an ASR lattice is intractable under such long-span models both during decoding and discriminative training. The accepted compromise is to rescore only the N-best hypotheses in the lattice using the long-span LM. We present discriminative hill climbing, an efficient and effective discriminative training procedure for long-span LMs based on a hill climbing rescoring algorithm. We empirically demonstrate significant computational savings as well as error-rate reduction over N-best training methods in a state of the art ASR system for Broadcast News transcription. Ann Irvine, Mark Dredze, Geraldine Legendre, Paul Smolensky. Optimality Theory Syntax Learnability: An Empirical Exploration of the Perceptron and GLA. CogSci Workshop on OT as a General Cognitive Architecture, 2011. [Bibtex] Carolina Parada, Mark Dredze, Frederick Jelinek. OOV Sensitive Named-Entity Recognition in Speech. 
International Speech Communication Association (INTERSPEECH), 2011. [PDF] [Bibtex] [Data] Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named entities and always produce transcription errors. In this work, we improve speech NER by including features indicative of OOVs based on an OOV detector, allowing for the identification of regions of speech containing named entities, even if they are incorrectly transcribed. We construct a new speech NER data set and demonstrate significant improvements for this task. Michael J Paul, Mark Dredze. A Model for Mining Public Health Topics from Twitter. Technical Report -, Johns Hopkins University, 2011. [PDF] [Bibtex] [Data] We present the Ailment Topic Aspect Model (ATAM), a new topic model for Twitter that associates symptoms, treatments and general words with diseases (ailments). We train ATAM on a new collection of 1.6 million tweets discussing numerous health related topics. ATAM isolates more coherent ailments, such as influenza, infections, obesity, as compared to standard topic models. Furthermore, ATAM matches influenza tracking results produced by Google Flu Trends and previous influenza specialized Twitter models compared with government public health data. Michael J Paul, Mark Dredze. You Are What You Tweet: Analyzing Twitter for Public Health. International Conference on Weblogs and Social Media (ICWSM), 2011. [PDF] [Bibtex] Analyzing user messages in social media can measure different population characteristics, including public health measures. 
For example, recent work has correlated Twitter messages with influenza rates in the United States; but this has largely been the extent of mining Twitter for public health. In this work, we consider a broader range of public health applications for Twitter. We apply the recently introduced Ailment Topic Aspect Model to over one and a half million health related tweets and discover mentions of over a dozen ailments, including allergies, obesity and insomnia. We introduce extensions to incorporate prior knowledge into this model and apply it to several tasks: tracking illnesses over time (syndromic surveillance), measuring behavioral risk factors, localizing illnesses by geographic region, and analyzing symptoms and medication usage. We show quantitative correlations with public health data and qualitative evaluations of model output. Our results suggest that Twitter has broad applicability for public health research. Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow. Learning Sub-Word Units for Open Vocabulary Speech Recognition. Association for Computational Linguistics (ACL), 2011. [PDF] [Bibtex] Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of sub-word units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the sub-word lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. 
A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively. Ariya Rastrow, Markus Dreyer, Abhinav Sethy, Sanjeev Khudanpur, Bhuvana Ramabhadran, Mark Dredze. Hill Climbing on Speech Lattices: A New Rescoring Framework. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011. [PDF] [Bibtex] We describe a new approach for rescoring speech lattices - with long-span language models or wide-context acoustic models - that does not entail computationally intensive lattice expansion or limited rescoring of only an N-best list. We view the set of word-sequences in a lattice as a discrete space equipped with the edit-distance metric, and develop a hill climbing technique to start with, say, the 1-best hypothesis under the lattice-generating model(s) and iteratively search a local neighborhood for the highest-scoring hypothesis under the rescoring model(s); such neighborhoods are efficiently constructed via finite state techniques. We demonstrate empirically that to achieve the same reduction in error rate using a better estimated, higher order language model, our technique evaluates fewer utterance-length hypotheses than conventional N-best rescoring by two orders of magnitude. For the same number of hypotheses evaluated, our technique results in a significantly lower error rate.

 2010 (12 Publications) Mark Dredze, Aren Jansen, Glen A Coppersmith, Kenneth Church. NLP on Spoken Documents without ASR. Empirical Methods in Natural Language Processing (EMNLP), 2010. [PDF] [Bibtex] There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long ( 1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions. Mark Dredze, Tim Oates, Christine Piatko. We're Not in Kansas Anymore: Detecting Domain Changes in Streams. Empirical Methods in Natural Language Processing (EMNLP), 2010. [PDF] [Bibtex] Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention -- detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses A-distance, a metric for detecting shifts in data streams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions. 
Carolina Parada, Abhinav Sethy, Mark Dredze, Frederick Jelinek. A Spoken Term Detection Framework for Recovering Out-of-Vocabulary Words Using the Web. International Speech Communication Association (INTERSPEECH), 2010. [PDF] [Bibtex] Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into system output, recovering up to 40% of OOVs and resulting in a reduction in system error. Delip Rao, Paul McNamee, Mark Dredze. Streaming Cross Document Entity Coreference Resolution. Conference on Computational Linguistics (Coling), 2010. [PDF] [Bibtex] Previous research in cross-document entity coreference has generally been restricted to the offline scenario where the set of documents is provided in advance. As a consequence, the dominant approach is based on greedy agglomerative clustering techniques that utilize pairwise vector comparisons and thus require O(n^2) space and time. In this paper we explore identifying coreferent entity mentions across documents in high-volume streaming text, including methods for utilizing orthographic and contextual information. We test our methods using several corpora to quantitatively measure both the efficacy and scalability of our streaming approach. We show that our approach scales to at least an order of magnitude larger data than previous reported methods. 
Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, Tim Finin. Entity Disambiguation for Knowledge Base Population. Conference on Computational Linguistics (Coling), 2010. [PDF] [Bibtex] The integration of facts derived from information extraction systems into existing knowledge bases requires a system to disambiguate entity mentions in the text. This is challenging due to issues such as non-uniform variations in entity names, mention ambiguity, and entities absent from a knowledge base. We present a state of the art system for entity disambiguation that not only addresses these challenges but also scales to knowledge bases with several million entries using very little resources. Further, our approach achieves performance of up to 95% on entities mentioned from newswire and 80% on a public test set that was designed to include challenging queries. Chris Callison-Burch, Mark Dredze. Creating Speech and Language Data With Amazon's Mechanical Turk. NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk, 2010. [PDF] [Bibtex] In this paper we give an introduction to using Amazon's Mechanical Turk crowdsourcing platform for the purpose of collecting data for human language technologies. We survey the papers published in the NAACL-2010 Workshop. 24 researchers participated in the workshop's shared task to create data for speech and language applications with $100. Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, Mark Dredze. Annotating named entities in Twitter data with crowdsourcing. NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk, 2010. [PDF] [Bibtex] [Data] We describe our experience using both Amazon Mechanical Turk (MTurk) and CrowdFlower to collect simple named entity annotations for Twitter status updates. Unlike most genres that have traditionally been the focus of named entity experiments Twitter is far more informal and abbreviated. 
The collected annotations and annotation techniques will provide a first step towards the full study of named entity recognition in domains like Facebook and Twitter. We also briefly describe how to use MTurk to collect judgements on the quality of "word clouds." Matthew R Gormley, Adam Gerber, Mary Harper, Mark Dredze. Non-Expert Correction of Automatically Generated Relation Annotations. NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk, 2010. [PDF] [Bibtex] We explore a new way to collect human annotated relations in text using Amazon Mechanical Turk. Given a knowledge base of relations and a corpus, we identify sentences which mention both an entity and an attribute that have some relation in the knowledge base. Each noisy sentence/relation pair is presented to multiple turkers, who are asked whether the sentence expresses the relation. We describe a design which encourages user efficiency and aids discovery of cheating. We also present results on inter-annotator agreement. Courtney Napoles, Mark Dredze. Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language. NAACL-HLT Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids, 2010. [PDF] [Bibtex] Text simplification is the process of changing vocabulary and grammatical structure to create a more accessible version of the text while maintaining the underlying information and content. Automated tools for text simplification are a practical way to make large corpora of text accessible to a wider audience lacking high levels of fluency in the corpus language. In this work, we investigate the potential of Simple Wikipedia to assist automatic text simplification by building a statistical classification system that discriminates simple English from ordinary English. Most text simplification systems are based on hand-written rules (e.g., PEST and its module SYSTAR), and therefore face limitations scaling and transferring across domains. 
The potential for using Simple Wikipedia for text simplification is significant; it contains nearly 60,000 articles with revision histories and aligned articles to ordinary English Wikipedia. Using articles from Simple Wikipedia and ordinary Wikipedia, we evaluated different classifiers and feature sets to identify the most discriminative features of simple English for use across domains. These findings help further understanding of what makes text simple and can be applied as a tool to help writers craft simple text. Justin Ma, Alex Kulesza, Koby Crammer, Mark Dredze, Lawrence Saul, Fernando Pereira. Exploiting Feature Covariance in High-Dimensional Online Learning. AIStats, 2010. [PDF] [Bibtex] Some online algorithms for linear classification model the uncertainty in their weights over the course of learning. Modeling the full covariance structure of the weights can provide a significant advantage for classification. However, for high-dimensional, large-scale data, even though there may be many second-order feature interactions, it is computationally infeasible to maintain this covariance structure. To extend second-order methods to high-dimensional data, we develop low-rank approximations of the covariance structure. We evaluate our approach on both synthetic and real-world data sets using the confidence-weighted online learning framework. We show improvements over diagonal covariance matrices for both low and high-dimensional data. Carolina Parada, Mark Dredze, Denis Filimonov, Frederick Jelinek. Contextual Information Improves OOV Detection in Speech. North American Chapter of the Association for Computational Linguistics (NAACL), 2010. [PDF] [Bibtex] Out-of-vocabulary (OOV) words represent an important source of error in large vocabulary continuous speech recognition (LVCSR) systems. These words cause recognition failures, which propagate through pipeline systems impacting the performance of downstream applications. 
The detection of OOV regions in the output of a LVCSR system is typically addressed as a binary classification task, where each region is independently classified using local information. In this paper, we show that jointly predicting OOV regions, and including contextual information from each region, leads to substantial improvement in OOV detection. Compared to the state-of-the-art, we reduce the missed OOV rate from 42.6% to 28.4% at 10% false alarm rate. Mark Dredze, Alex Kulesza, Koby Crammer. Multi-Domain Learning by Confidence-Weighted Parameter Combination. Machine Learning, 2010;79(1-2):123-149. [PDF] [Bibtex] [Tech Report] State-of-the-art statistical NLP systems for a variety of tasks learn from labeled training data that is often domain specific. However, there may be multiple domains or sources of interest on which the system must perform. For example, a spam filtering system must give high quality predictions for many users, each of whom receives emails from different sources and may make slightly different decisions about what is or is not spam. Rather than learning separate models for each domain, we explore systems that learn across multiple domains. We develop a new multi-domain online learning framework based on parameter combination from multiple classifiers. Our algorithms draw from multi-task learning and domain adaptation to adapt multiple source domain classifiers to a new target domain, learn across multiple similar domains, and learn across a large number of disparate domains. We evaluate our algorithms on two popular NLP domain adaptation tasks: sentiment classification and spam filtering.

 2009 (6 Publications) Mark Dredze. Intelligent Email: Aiding Users with AI. PhD Thesis, Computer and Information Science, University of Pennsylvania, 2009. [PDF] [Bibtex] Paul McNamee, Mark Dredze, Adam Gerber, Nikesh Garera, Tim Finin, James Mayfield, Christine Piatko, Delip Rao, David Yarowsky, Markus Dreyer. HLTCOE Approaches to Knowledge Base Population at TAC 2009. Text Analysis Conference (TAC), 2009. [Bibtex] The HLTCOE participated in the entity linking and slot filling tasks at TAC 2009. A machine learning-based approach to entity linking, operating over a wide range of feature types, yielded good performance on the entity linking task. Slot-filling based on sentence selection, application of weak patterns and exploitation of redundancy was ineffective in the slot filling task. Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Advances in Neural Information Processing Systems (NIPS), 2009. [PDF] [Bibtex] We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. Mark Dredze, Partha Pratim Talukdar, Koby Crammer. Sequence Learning from Data with Multiple Labels. ECML/PKDD Workshop on Learning from Multi-Label Data, 2009. [PDF] [Bibtex] We present novel algorithms for learning structured predictors from instances with multiple labels in the presence of noise. 
The proposed algorithms improve performance on two standard NLP tasks when we have a small amount of training data (low quantity) and when the labels are noisy (low quality). In these settings, the methods improve performance over using a single label, in some cases exceeding performance using gold labels. Our methods could be used in a semi-supervised setting, where a limited amount of labeled data could be combined with a rule based automatic labeling of unlabeled data with multiple possible labels. Koby Crammer, Mark Dredze, Alex Kulesza. Multi-Class Confidence Weighted Algorithms. Empirical Methods in Natural Language Processing (EMNLP), 2009. [PDF] [Bibtex] [Data (Amazon 7)] The recently introduced online confidence-weighted (CW) learning algorithm for binary classification performs well on many binary NLP tasks. However, for multi-class problems CW learning updates and inference cannot be computed analytically or solved as convex optimization problems as they are in the binary case. We derive learning algorithms for the multi-class CW setting and provide extensive evaluation using nine NLP datasets, including three derived from the recently released New York Times corpus. Our best algorithm outperforms state-of-the-art online and batch methods on eight of the nine tasks. We also show that the confidence information maintained during learning yields useful probabilistic information at test time. Mark Dredze, Bill Schilit, Peter Norvig. Suggesting Email View Filters for Triage and Search. International Joint Conference on Artificial Intelligence (IJCAI), 2009. [PDF] [Bibtex] Growing email volumes cause flooded inboxes and swelled email archives, making search and new email processing difficult. While emails have rich metadata, such as recipients and folders, suitable for creating filtered views, it is often difficult to choose appropriate filters for new inbox messages without first examining messages. 
In this work, we consider a system that automatically suggests relevant view filters to the user for the currently viewed messages. We propose several ranking algorithms for suggesting useful filters. Our work suggests that such systems quickly filter groups of inbox messages and find messages more easily during search.

 2008 (12 Publications) Kevin Lerman, Ari Gilder, Mark Dredze, Fernando Pereira. Reading the Markets: Forecasting Public Opinion of Political Candidates by News Analysis. Conference on Computational Linguistics (Coling), 2008. [PDF] [Bibtex] Media reporting shapes public opinion which can in turn influence events, particularly in political elections, in which candidates both respond to and shape public perception of their campaigns. We use computational linguistics to automatically predict the impact of news on public perception of political candidates. Our system uses daily newspaper articles to predict shifts in public opinion as reflected in prediction markets. We discuss various types of features designed for this problem. The news system improves market prediction over baseline market systems. Mark Dredze, Joel Wallenberg. Further Results and Analysis of Icelandic Part of Speech Tagging. Technical Report MS-CIS-08-13, University of Pennsylvania, Department of Computer and Information Science, 2008. [PDF] [Bibtex] Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions. This paper presents further results and analysis to the original work. Mark Dredze, Joel Wallenberg. Icelandic Data-Driven Part of Speech Tagging. Association for Computational Linguistics (ACL) (short paper), 2008. [PDF] [Bibtex] Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. 
Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions. Kuzman Ganchev, Mark Dredze. Small Statistical Models by Random Feature Mixing. ACL Workshop on Mobile NLP, 2008. [PDF] [Bibtex] The application of statistical NLP systems to resource constrained devices is limited by the need to maintain parameters for a large number of features and an alphabet mapping features to parameters. We introduce random feature mixing to eliminate alphabet storage and reduce the number of parameters without severely impacting model performance. Mark Dredze, Hanna Wallach, Danny Puller, Fernando Pereira. Generating Summary Keywords for Emails Using Topics. Intelligent User Interfaces (IUI), 2008. [PDF] [Bibtex] Email summary keywords, used to concisely represent the gist of an email, can help users manage and prioritize large numbers of messages. We develop an unsupervised learning framework for selecting summary keywords from emails using latent representations of the underlying topics in a user's mailbox. This approach selects words that describe each message in the context of existing topics rather than simply selecting keywords based on a single message in isolation. We present and compare four methods for selecting summary keywords based on two well-known models for inferring latent topics: latent semantic analysis and latent Dirichlet allocation. The quality of the summary keywords is assessed by generating summaries for emails from twelve users in the Enron corpus. The summary keywords are then used in place of entire messages in two proxy tasks: automated foldering and recipient prediction. We also evaluate the extent to which summary keywords enhance the information already available in a typical email user interface by repeating the same tasks using email subject lines. Mark Dredze, Hanna Wallach, Danny Puller, Tova Brooks, Josh Carroll, Joshua Magarick, John Blitzer, Fernando Pereira. 
Intelligent Email: Aiding Users with AI. American National Conference on Artificial Intelligence (AAAI) (Nectar), 2008. [PDF] [Bibtex] Email occupies a central role in the modern workplace. This has led to a vast increase in the number of email messages that users are expected to handle daily. Furthermore, email is no longer simply a tool for asynchronous online communication - email is now used for task management, personal archiving, as well as both synchronous and asynchronous online communication. This explosion can lead to "email overload" - many users are overwhelmed by the large quantity of information in their mailboxes. In the human-computer interaction community, there has been much research on tackling email overload. Recently, similar efforts have emerged in the artificial intelligence (AI) and machine learning communities to form an area of research known as intelligent email. In this paper, we take a user-oriented approach to applying AI to email. We identify enhancements to email user interfaces and employ machine learning techniques to support these changes. We focus on three tasks - summary keyword generation, reply prediction and attachment prediction - and summarize recent work in these areas. Mark Dredze, Tova Brooks, Josh Carroll, Joshua Magarick, John Blitzer, Fernando Pereira. Intelligent Email: Reply and Attachment Prediction. Intelligent User Interfaces (IUI), 2008. [PDF] [Bibtex] We present two prediction problems under the rubric of Intelligent Email that are designed to support enhanced email interfaces that relieve the stress of email overload. Reply prediction alerts users when an email requires a response and facilitates email response management. Attachment prediction alerts users when they are about to send an email missing an attachment or triggers a document recommendation system, which can catch missing attachment emails before they are sent. 
Both problems use the same underlying email classification system and task specific features. Each task is evaluated for both single-user and cross-user settings. Mark Dredze, Hanna Wallach. User Models for Email Activity Management. IUI Workshop on Ubiquitous User Modeling, 2008. [PDF] [Bibtex] A single user activity, such as planning a conference trip, typically involves multiple actions. Although these actions may involve several applications, the central point of co-ordination for any particular activity is usually email. Previous work on email activity management has focused on clustering emails by activity. Dredze et al. accomplished this by combining supervised classifiers based on document similarity, authors and recipients, and thread information. In this paper, we take a different approach and present an unsupervised framework for email activity clustering. We use the same information sources as Dredze et al. - namely, document similarity, message recipients and authors, and thread information - but combine them to form an unsupervised, non-parametric Bayesian user model. This approach enables email activities to be inferred without any user input. Inferring activities from a user's mailbox adapts the model to that user. We next describe the statistical machinery that forms the basis of our user model, and explain how several email properties may be incorporated into the model. We evaluate this approach using the same data as Dredze et al., showing that our model does well at clustering emails by activity. Koby Crammer, Mark Dredze, Fernando Pereira. Exact Convex Confidence-Weighted Learning. Advances in Neural Information Processing Systems (NIPS), 2008. [PDF] [Bibtex] Confidence-weighted (CW) learning, an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. 
Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. Mark Dredze, Koby Crammer, Fernando Pereira. Confidence-Weighted Linear Classification. International Conference on Machine Learning (ICML), 2008. [PDF] [Bibtex] We introduce confidence-weighted linear classifiers, which add parameter confidence information to linear classifiers. Online learners in this setting update both classifier parameters and the estimate of their confidence. The particular online algorithms we study here maintain a Gaussian distribution over parameter vectors and update the mean and covariance of the distribution with each instance. Empirical evaluation on a range of NLP tasks shows that our algorithm improves over other state of the art online and batch methods, learns faster in the online setting, and lends itself to better classifier combination after parallel training. Mark Dredze, Koby Crammer. Active Learning with Confidence. Association for Computational Linguistics (ACL) (short paper), 2008. [PDF] [Bibtex] Active learning is a machine learning approach to achieving high accuracy with a small number of labels by letting the learning algorithm choose instances to be labeled. Most previous approaches based on discriminative learning use the margin for choosing instances. We present a method for incorporating confidence into the margin by using a newly introduced online learning algorithm and show empirically that confidence improves active learning. Mark Dredze, Koby Crammer. Online Methods for Multi-Domain Learning and Adaptation. 
Empirical Methods in Natural Language Processing (EMNLP), 2008. [PDF] [Bibtex] NLP tasks are often domain specific, yet systems can learn behaviors across multiple domains. We develop a new multi-domain online learning framework based on parameter combination from multiple classifiers. Our algorithms draw from multi-task learning and domain adaptation to adapt multiple source domain classifiers to a new target domain, learn across multiple similar domains, and learn across a large number of disparate domains. We evaluate our algorithms on two popular NLP domain adaptation tasks: sentiment classification and spam filtering.

 2007 (11 Publications) John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. North East Student Colloquium on Artificial Intelligence (NESCAI), 2007. [Bibtex] Danny Puller, Hanna Wallach, Mark Dredze, Fernando Pereira. Generating Summary Keywords for Emails Using Topics. Women in Machine Learning Workshop (WiML) at Grace Hopper, 2007. [Bibtex] Koby Crammer, Mark Dredze, John Blitzer, Fernando Pereira. Batch Performance for an Online Price. NIPS Workshop on Efficient Machine Learning, 2007. [PDF] [Bibtex] Batch learning techniques achieve good performance, but at the cost of many (sometimes even hundreds) of passes over the data. For many tasks, such as web-scale ranking of machine translation hypotheses, making many passes over the data is prohibitively expensive, even in parallel over thousands of machines. Online algorithms, which treat data as a stream of examples, are conceptually appealing for these large scale problems. In practice, however, online algorithms tend to underperform batch methods, unless they are themselves run in multiple passes over the data.
In this work we explore a new type of online learning algorithm that incorporates a measure of confidence to the algorithm. The model maintains a confidence for each parameter, reflecting previously observed properties of the data. While this requires an additional parameter for each feature of the data, this is a minimal cost when compared to running the algorithm multiple times over the data. The resulting algorithm learns faster, requiring both fewer training instances and fewer passes over the training data, often approaching batch performance with only a single pass through the data. Mark Dredze, Krzysztof Czuba. Learning to Admit You're Wrong: Statistical Tools for Evaluating Web QA. NIPS Workshop on Machine Learning for Web Search, 2007. [PDF] [Bibtex] Web search engines provide specialized results to specific queries, often relying on the output of a QA system. However, targeted answers, while helpful, are embarrassing when wrong. Automated techniques are required to avoid wrong answers and improve system performance. We present the Expected Answer System, a statistical data-driven framework that analyzes the performance of a QA system with the goal of improving system accuracy. Our system is used for wrong answer prediction, missing answer discovery, and question class analysis. An empirical study of a production QA system, one of the first such evaluations presented in the literature, motivates our approach. Kedar Bellare, Partha Pratim Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman, Andrew McCallum, Mark Dredze. Lightly-Supervised Attribute Extraction for Web Search. NIPS Workshop on Machine Learning for Web Search, 2007. [PDF] [Bibtex] Web search engines can greatly benefit from knowledge about attributes of entities present in search queries. In this paper, we introduce lightly-supervised methods for extracting entity attributes from natural language text. 
Using these methods, we are able to extract large numbers of attributes of different entities at fairly high precision from a large natural language corpus. We compare our methods against a previously proposed pattern-based relation extractor, showing that the new methods give considerable improvements over that baseline. We also demonstrate that query expansion using extracted attributes improves retrieval performance on underspecified information-seeking queries. Neal Parikh, Mark Dredze. Graphical Models for Primarily Unsupervised Sequence Labeling. Technical Report MS-CIS-07-18, University of Pennsylvania, Department of Computer and Information Science, 2007. [PDF] [Bibtex] Most models used in natural language processing must be trained on large corpora of labeled text. This tutorial explores a 'primarily unsupervised' approach (based on graphical models) that augments a corpus of unlabeled text with some form of prior domain knowledge, but does not require any fully labeled examples. We survey probabilistic graphical models for (supervised) classification and sequence labeling and then present the prototype-driven approach of Haghighi and Klein (2006) to sequence labeling in detail, including a discussion of the theory and implementation of both conditional random fields and prototype learning. We show experimental results for English part of speech tagging. Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach. Learning Fast Classifiers for Image Spam. Conference on Email and Anti-Spam (CEAS), 2007. [PDF] [Bibtex] [Data] Recently, spammers have proliferated image spam, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. 
We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they accurately classify spam images in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes imagespam classification practical by providing both high accuracy features and a method to learn fast classifiers. Koby Crammer, Mark Dredze, Kuzman Ganchev, Partha Pratim Talukdar, Steven Carroll. Automatic Code Assignment to Medical Text. BioNLP Workshop at ACL, 2007. [PDF] [Bibtex] Code assignment is important for handling large amounts of electronic medical data in the modern hospital. However, only expert annotators with extensive training can assign codes. We present a system for the assignment of ICD-9-CM clinical codes to free text radiology reports. Our system assigns a code configuration, predicting one or more codes for each document. We combine three coding systems into a single learning system for higher accuracy. We compare our system on a real world medical dataset with both human annotators and other automated systems, achieving nearly the maximum score on the Computational Medicine Center's challenge. Mark Dredze, Hanna M Wallach. Email Keyword Summarization and Visualization with Topic Models. North East Student Colloquium on Artificial Intelligence (NESCAI), 2007. [Bibtex] John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. 
Association for Computational Linguistics (ACL), 2007. [PDF] [Bibtex] (Over 1000 citations) Automatic sentiment classification has been extensively studied and applied in recent years. However, sentiment is expressed differently in different domains, and annotating corpora for every possible domain of interest is impractical. We investigate domain adaptation for sentiment classifiers, focusing on online reviews for different types of products. First, we extend to sentiment classification the recently-proposed structural correspondence learning (SCL) algorithm, reducing the relative error due to adaptation between domains by an average of 30% over the original SCL algorithm and 46% over a supervised baseline. Second, we identify a measure of domain similarity that correlates well with the potential for adaptation of a classifier from one domain to another. This measure could for instance be used to select a small set of domains to annotate whose trained classifiers would transfer well to many other domains. Mark Dredze, John Blitzer, Partha Pratim Talukdar, Kuzman Ganchev, Joao Graca, Fernando Pereira. Frustratingly Hard Domain Adaptation for Dependency Parsing. Shared Task - Conference on Natural Language Learning - CoNLL 2007 shared task, 2007. [PDF] [Bibtex] We describe some challenges of adaptation in the 2007 CoNLL Shared Task on Domain Adaptation. Our error analysis for this task suggests that a primary source of error is differences in annotation guidelines between treebanks. Our suspicions are supported by the observation that no team was able to improve target domain performance substantially over a state of the art baseline.

 2006 (4 Publications) Mark Dredze, John Blitzer, Koby Crammer, Fernando Pereira. Feature Design for Transfer Learning. North East Student Colloquium on Artificial Intelligence (NESCAI), 2006. [PDF] [Bibtex] Mark Dredze, John Blitzer, Fernando Pereira. "Sorry, I Forgot the Attachment:" Email Attachment Prediction. Conference on Email and Anti-Spam (CEAS), 2006. [PDF] [Bibtex] The missing attachment problem: a missing attachment generates a wave of emails from the recipients notifying the sender of the error. We present an attachment prediction system to reduce the volume of missing attachment mail. Our classifier could prompt an alert when an outgoing email is missing an attachment. Additionally, the system could activate an attachment recommendation system, whereby suggested documents are offered once the system determines the user is likely to include an attachment, effectively reminding the user to include the attachment. We present promising initial results and discuss implications of our work. Mark Dredze, Tessa Lau, Nicholas Kushmerick. Automatically classifying emails into activities. Intelligent User Interfaces (IUI), 2006. [PDF] [Bibtex] Email-based activity management systems promise to give users better tools for managing increasing volumes of email, by organizing email according to a user's activities. Current activity management systems do not automatically classify incoming messages by the activity to which they belong, instead relying on simple heuristics (such as message threads), or asking the user to manually classify incoming messages as belonging to an activity. This paper presents several algorithms for automatically recognizing emails as part of an ongoing activity. Our baseline methods are the use of message reply-to threads to determine activity membership and a naive Bayes classifier. Our SimSubset and SimOverlap algorithms compare the people involved in an activity against the recipients of each incoming message. 
Our SimContent algorithm uses IRR (a variant of latent semantic indexing) to classify emails into activities using similarity based on message contents. An empirical evaluation shows that each of these methods provides a significant improvement over the baseline methods. In addition, we show that a combined approach that votes the predictions of the individual methods performs better than each individual method alone. Nicholas Kushmerick, Tessa Lau, Mark Dredze, Rinat Khoussainov. Activity-Centric Email: A Machine Learning Approach. American National Conference on Artificial Intelligence (AAAI) (Nectar), 2006. [PDF] [Bibtex]

 2005 (3 Publications) Rie Kubota Ando, Mark Dredze, Tong Zhang. TREC 2005 Genomics Track Experiments at IBM Watson. Text REtrieval Conference (TREC), 2005. [PDF] [Bibtex] (Group invited talk at TREC 2005, ranked 3rd and 4th out of 53 entries) Mark Dredze, John Blitzer, Fernando Pereira. Reply Expectation Prediction for Email Management. Conference on Email and Anti-Spam (CEAS), 2005. [PDF] [Bibtex] We reduce email overload by addressing the problem of waiting for a reply to one's email. We predict whether sent and received emails necessitate a reply, enabling the user to both better manage his inbox and to track mail sent to others. We discuss the features used to discriminate emails, show promising initial results with a logistic regression model, and outline future directions for this work. Catalina Danis, Wendy Kellogg, Tessa Lau, Mark Dredze, Jeffrey Stylos, Nicholas Kushmerick. Managers Email: Beyond Tasks and To-Dos. Conference on Human Factors in Computing Systems (CHI) (Extended Abstracts), 2005. [PDF] [Bibtex] In this paper, we describe preliminary findings that indicate that managers and non-managers think about their email differently. We asked three research managers and three research non-managers to sort about 250 of their own email messages into categories that "would help them to manage their work." Our analyses indicate that managers create more categories and a more differentiated category structure than non-managers. Our data also suggest that managers create "relationship-oriented" categories more often than non-managers. These results are relevant to research on "email overload" that has highlighted the use of email for activities beyond communication. In particular, our findings suggest that too strong a focus on task management may be incomplete, and that a user's organizational role has an impact on their conceptualization and likely use of email.

 2004 (1 Publications) Mark Dredze, Jeffrey Stylos, Tessa Lau, Wendy Kellogg, Catalina Danis, Nicholas Kushmerick. Taxie: Automatically identifying tasks in email. Unpublished Manuscript, 2004. [Bibtex]

 2003 (1 Publications) Kevin Livingston, Mark Dredze, Kristian Hammond, Larry Birnbaum. Beyond Broadcast. International Conference on Intelligent User Interfaces (IUI), 2003. [Bibtex]