Click to show abstract.

Conference (abstract)   Conference (proceedings)  Journal  Workshop  Patent  Book 

     2024 (6 Publications)
Expand Me   John W Ayers, Adam Poliak, Nikolas T Beros, Michael Paul, Mark Dredze, Michael Hogarth, Davey M Smith. A Digital Cohort Approach for Social Media Monitoring: A Cohort Study of People Who Vape E-Cigarettes. American Journal of Preventive Medicine (AJPM), 2024;67(1):147-154. [PDF] [Bibtex]
Introduction. The evidence hierarchy in public health emphasizes longitudinal studies whereas social media monitoring relies on aggregate analyses. Authors propose integrating longitudinal analyses into social media monitoring by creating a digital cohort of individual account holders, as demonstrated by a case study analysis of people who vape. Methods. All English language X posts mentioning vape or vaping were collected from January 1, 2017 through December 31, 2020. The digital cohort was composed of people who self-reported vaping and posted at least 10 times about vaping during the study period to determine the (1) prevalence, (2) success rate, and (3) timing of cessation behaviors. Results. There were 25,112 instances where an account shared at least 10 posts about vaping, with 619 (95%616
Expand Me   Eric C Leas, John W Ayers, Nimit Desai, Mark Dredze, Michael Hogarth, Davey M Smith. Using Large Language Models to Support Content Analysis: A Case Study of ChatGPT for Adverse Event Detection. Journal of Medical Internet Research (JMIR), 2024. [PDF] [Bibtex]
This study explores the potential of using large language models to assist content analysis by conducting a case study to identify adverse events (AEs) in social media posts. The case study compares ChatGPT's performance with human annotators' in detecting AEs associated with delta-8-tetrahydrocannabinol, a cannabis-derived product. Using the identical instructions given to human annotators, ChatGPT closely approximated human results, with a high degree of agreement noted: 94.4% (9436/10,000) for any AE detection (Fleiss κ=0.95) and 99.3% (9931/10,000) for serious AEs (κ=0.96). These findings suggest that ChatGPT has the potential to replicate human annotation accurately and efficiently. The study recognizes possible limitations, including concerns about the generalizability due to ChatGPT's training data, and prompts further research with different models, data sources, and content analysis tasks. The study highlights the promise of large language models for enhancing the efficiency of biomedical research.
Expand Me   David Mueller, Mark Dredze, Nicholas Andrews. Multi-Task Transfer Matters During Instruction-Tuning. Association for Computational Linguistics (ACL), 2024. [Bibtex]
Instruction-tuning trains a language model on hundreds of tasks jointly to improve a model's ability to learn in-context; however, the mechanisms that drive in-context learning are poorly understood and, as a result, the role of instruction-tuning on in-context generalization is poorly understood as well. In this work, we study the impact of instruction-tuning on multi-task transfer: how well a model's parameters adapt to an unseen task via fine-tuning. We find that instruction-tuning negatively impacts a model's transfer to unseen tasks, and that model transfer and in-context generalization are highly correlated, suggesting that this catastrophic forgetting may impact in-context learning. We study methods to improve model transfer, finding that multi-task training---how well the training tasks are optimized---can significantly impact ICL generalization; additionally, we find that continual training on unsupervised pre-training data can mitigate forgetting and improve ICL generalization as well. Finally, we demonstrate that, early into training, the impact of instruction-tuning on model transfer to tasks impacts in-context generalization on that task. Overall, we provide significant evidence that multi-task transfer is deeply connected to a model's ability to learn a task in-context.
Expand Me   Miriam Wanner, Seth Ebner, Zhengping Jiang, Mark Dredze, Benjamin Van Durme. A Closer Look at Claim Decomposition. Joint Conference on Lexical and Computational Semantics (*SEM 2023), 2024. [PDF] [Bibtex]
As generated text becomes more commonplace, it is increasingly important to evaluate how well-supported such text is by external knowledge sources. Many approaches for evaluating textual support rely on some method for decomposing text into its individual subclaims which are scored against a trusted reference. We investigate how various methods of claim decomposition -- especially LLM-based methods -- affect the result of an evaluation approach such as the recently proposed FActScore, finding that it is sensitive to the decomposition method used. This sensitivity arises because such metrics attribute overall textual support to the model that generated the text even though error can also come from the metric's decomposition step. To measure decomposition quality, we introduce an adaptation of FActScore, which we call DecompScore. We then propose an LLM-based approach to generating decompositions inspired by Bertrand Russell's theory of logical atomism and neo-Davidsonian semantics and demonstrate its improved decomposition quality over previous methods.
Expand Me   Matthew R Allen, Nimit Desai, Aiden Namazi, Eric Leas, Mark Dredze, Davey M Smith, John W Ayers. Characteristics of X (Formerly Twitter) Community Notes Addressing COVID-19 Vaccine Misinformation. JAMA, 2024. [PDF] [Bibtex] (Ranked in the top 0.6% of 26m research outputs by Altmetric)
Social media can magnify health misinformation, especially about vaccination. Platform countermeasures have included censoring, shadowbanning (limiting distribution without disclosure), and adding warning labels to problematic content. Yet, evaluating these countermeasures is challenging due to restrictive public disclosures about their inner workings.In late 2022, X (formerly Twitter) introduced Community Notes, a crowdsourced misinformation countermeasure. Anonymous volunteer contributors independently identify posts containing misinformation and propose corrections called ``notes.'' Notes labeled as helpful by contributors who disagreed on past notes (to rely on a diversity of perspectives) are shown alongside the original post. Because Community Notes is open source, we were able to evaluate the topics, accuracy, and credibility of notes addressing COVID-19 vaccination.
     Paiheng Xu, David A Broniatowski, Mark Dredze. Twitter social mobility data reveal demographic variations in social distancing practices during the COVID-19 pandemic. Scientific Reports, 2024. [PDF] [Bibtex]

     2023 (14 Publications)
Expand Me   Keith Harrigian, Tina Tang, Anthony Franco Gonzales, Cindy Cai, Mark Dredze. An Eye on Clinical BERT: Investigating Language Model Generalization for Diabetic Eye Disease Phenotyping. Machine Learning for Health (ML4H) (Findings), 2023. [PDF] [Bibtex]
Diabetic eye disease is a major cause of blindness worldwide. The ability to monitor relevant clinical trajectories and detect lapses in care is critical to managing the disease and preventing blindness. Alas, much of the information necessary to support these goals is found only in the free text of the electronic medical record. To fill this information gap, we introduce a system for extracting evidence from clinical text of 19 clinical concepts related to diabetic eye disease and inferring relevant attributes for each. In developing this ophthalmology phenotype system, we are also afforded a unique opportunity to evaluate the effectiveness of clinical language models at adapting to new clinical domains. Across multiple training paradigms, we find that BERT language models pretrained on out-of-distribution clinical data offer no significant improvement over BERT language models pretrained on non-clinical data for our domain. Our study calls into question recent work which suggests advances in clinical language modeling are driven by improvements in the transfer of generalizable clinical knowledge.
     Alexandra DeLucia, Mark Dredze, Anna L Buczak. A Multi-instance Learning Approach to Civil Unrest Event Detection on Twitter. RANLP Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE), 2023. [PDF] [Bibtex]
     John W Ayers, Zechariah Zhu, Keith Harrigian, Gwenyth P Wightman, Mark Dredze, Steffanie A Strathdee, Davey M Smith. Managing HIV During the COVID-19 Pandemic: A Study of Help-Seeking Behaviors on a Social Media Forum. AIDS and Behavior, 2023. [PDF] [Bibtex]
     Katherine Hoops, Paul S Nestadt, Mark Dredze. The case for social media standards on suicide. The Lancet Psychiatry, 2023. [PDF] [Bibtex]
Expand Me   John W Ayers, Zechariah Zhu, Adam Poliak, Eric C Leas, Mark Dredze, Michael Hogarth, Davey M Smith. Evaluating Artificial Intelligence Responses to Public Health Questions. JAMA Network Open, 2023. [PDF] [Bibtex] (Ranked in the top 0.15% of 23m research outputs by Altmetric)
Artificial intelligence (AI) assistants have the potential to transform public health by offering accurate and actionable information to the general public. Unlike web-based knowledge resources (eg, Google Search) that return numerous results and require the searcher to synthesize information, AI assistants are designed to receive complex questions and provide specific answers. However, AI assistants often fail to recognize and respond to basic health questions. ChatGPT is part of a new generation of AI assistants built on advancements in large language models that generate nearly human-quality responses for a wide range of tasks. Although studies3 have focused on using ChatGPT as a supporting resource for healthcare professionals, it is unclear how well ChatGPT handles general health inquiries from the lay public. In this cross-sectional study, we evaluated ChatGPT responses to public health questions.
Expand Me   Gwenyth Portillo Wightman, Alexandra DeLucia, Mark Dredze. Strength in Numbers: Estimating Confidence of Large Language Models by Prompt Agreement. ACL Workshop on Trustworthy Natural Language Processing (TrustNLP), 2023. [PDF] [Bibtex]
Large language models have achieved impressive few-shot performance on a wide variety of tasks. However, in many settings, users require confidence estimates for model predictions. While traditional classifiers produce scores for each label, language models instead produce scores for the generation which may not be well calibrated. We compare generations across diverse prompts and show that these can be used to create confidence scores. By utilizing more prompts we can get more precise confidence estimates and use response diversity as a proxy for confidence. We evaluate this approach across ten multiple-choice question-answering datasets using three models: T0, FLAN-T5, and GPT-3. In addition to analyzing multiple human written prompts, we automatically generate more prompts using a language model in order to produce finer-grained confidence estimates. Our method produces more calibrated confidence estimates compared to the log probability of the answer to a single prompt. These improvements could benefit users who rely on prediction confidence for integration into a larger system or in decision-making processes.
Expand Me   Elliot Schumacher, James Mayfield, Mark Dredze. On the Surprising Effectiveness of Name Matching Alone in Autoregressive Entity Linking. ACL Workshop on Matching, 2023. [PDF] [Bibtex]
Fifteen years of work on entity linking has established the importance of different information sources in making linking decisions: mention and entity name similarity, contextual relevance, and features of the knowledge base. Modern state-of-the-art systems build on these features, including through neural representations (Wu et al., 2020). In contrast to this trend, the autoregressive language model GENRE (De Cao et al., 2021) generates normalized entity names for mentions and beats many other entity linking systems, despite making no use of knowledge base (KB) information. How is this possible? We analyze the behavior of GENRE on several entity linking datasets and demonstrate that its performance stems from memorization of name patterns. In contrast, it fails in cases that might benefit from using the KB. We experiment with a modification to the model to enable it to utilize KB information, highlighting challenges to incorporating traditional entity linking information sources into autoregressive models.
Expand Me   Keith Harrigian, Ayah Zirikly, Brant Chee, Alya Ahmad, Anne Links, Som Saha, Mary Catherine Beach, Mark Dredze. Characterization of Stigmatizing Language in Medical Records. Association for Computational Linguistics (ACL), 2023. [PDF] [Bibtex] [Code and Data]
Widespread disparities in clinical outcomes exist between different demographic groups in the United States. A new line of work in medical sociology has demonstrated physicians often use stigmatizing language in electronic medical records within certain groups, such as black patients, which may exacerbate disparities. In this study, we characterize these instances at scale using a series of domain-informed NLP techniques. We highlight important differences between this task and analogous bias-related tasks studied within the NLP community (e.g., classifying microaggressions). Our study establishes a foundation for NLP researchers to contribute timely insights to a problem domain brought to the forefront by recent legislation regarding clinical documentation transparency. We release data, code, and models.
Expand Me   Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, David Rosenberg. MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies. Association for Computational Linguistics (ACL), 2023. [PDF] [Bibtex]
Autoregressive language models are trained by minimizing the cross-entropy of the model distribution Q relative to the data distribution P -- that is, minimizing the forward cross-entropy, which is equivalent to maximum likelihood estimation (MLE). We have observed that models trained in this way may ``over-generalize'', in the sense that they produce non-human-like text. Moreover, we believe that reverse cross-entropy, i.e., the cross-entropy of P relative to Q, is a better reflection of how a human would evaluate text generated by a model. Hence, we propose learning with MixCE, an objective that mixes the forward and reverse cross-entropies. We evaluate models trained with this objective on synthetic data settings (where P is known) and real data, and show that the resulting models yield better generated text without complex decoding strategies.
Expand Me   Elizabeth Spaulding, Gary Kazantsev, Mark Dredze. Joint End-to-end Semantic Proto-role Labeling. Association for Computational Linguistics (ACL), 2023. [PDF] [Bibtex]
Semantic proto-role labeling (SPRL) assigns properties to arguments based on a series of binary labels. While multiple studies have evaluated various approaches to SPRL, it has only been studied in-depth as a standalone task using gold predicate/argument pairs. How do SPRL systems perform as part of an information extraction pipeline? We model SPRL jointly with predicate-argument extraction using a deep transformer model. We find that proto-role labeling is surprisingly robust in this setting, with only a small decrease when using predicted arguments. We include a detailed analysis of each component of the joint system, and an error analysis to understand correlations in errors between system stages. Finally, we study the effects of annotation errors on SPRL.
Expand Me   Jingyu Zhang, Alexandra DeLucia, Chenyu Zhang, Mark Dredze. Geo-Seq2seq: Twitter User Geolocation on Noisy Data through Sequence to Sequence Learning. Association for Computational Linguistics (ACL), 2023. [PDF] [Bibtex]
Location information can support social media analyses by providing geographic context. Some of the most accurate and popular Twitter geolocation systems rely on rule-based methods that examine the user-provided profile location, which fail to handle informal or noisy location names. We propose Geo-Seq2seq, a sequence-to-sequence (seq2seq) model for Twitter user geolocation that rewrites noisy, multilingual user-provided location strings into structured English location names. We train our system on tens of millions of multilingual location string and geotagged-tweet pairs. Compared to leading methods, our model vastly increases coverage (i.e., the number of users we can geolocate) while achieving comparable or superior accuracy. Our error analysis reveals that constrained decoding helps the model produce valid locations according to a location database. Finally, we measure biases across language, country of origin, and time to evaluate fairness, and find that while our model can generalize well to unseen temporal data, performance does vary by language and country.
Expand Me   John W Ayers, Adam Poliak, Mark Dredze, Eric C Leas, Zechariah Zhu, Jessica B Kelley, Dennis J Faix, Aaron M Goodman, Christopher A Longhurst, Michael Hogarth, Davey M Smith. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine, 2023. [PDF] [Bibtex] (Ranked in #550 out of 25m (top 0.003%) research outputs by Altmetric; Most viewed JAMA IM article of 2023)
Importance The rapid expansion of virtual health care has caused a surge in patient messages concomitant with more work and burnout among health care professionals. Artificial intelligence (AI) assistants could potentially aid in creating answers to patient questions by drafting responses that could be reviewed by clinicians. Objective To evaluate the ability of an AI chatbot assistant (ChatGPT), released in November 2022, to provide quality and empathetic responses to patient questions. Design, Setting, and Participants In this cross-sectional study, a public and nonidentifiable database of questions from a public social media forum (Reddit's r/AskDocs) was used to randomly draw 195 exchanges from October 2022 where a verified physician responded to a public question. Chatbot responses were generated by entering the original question into a fresh session (without prior questions having been asked in the session) on December 22 and 23, 2022. The original question along with anonymized and randomly ordered physician and chatbot responses were evaluated in triplicate by a team of licensed health care professionals. Evaluators chose ``which response was better'' and judged both ``the quality of information provided'' (very poor, poor, acceptable, good, or very good) and ``the empathy or bedside manner provided'' (not empathetic, slightly empathetic, moderately empathetic, empathetic, and very empathetic). Mean outcomes were ordered on a 1 to 5 scale and compared between chatbot and physicians. Results Of the 195 questions and responses, evaluators preferred chatbot responses to physician responses in 78.6% (95% CI, 75.0%-81.8%) of the 585 evaluations. Mean (IQR) physician responses were significantly shorter than chatbot responses (52 [17-62] words vs 211 [168-245] words; t = 25.4; P < .001). Chatbot responses were rated of significantly higher quality than physician responses (t = 13.3; P < .001). The proportion of responses rated as good or very good quality (≥ 4), for instance, was higher for chatbot than physicians (chatbot: 78.5%, 95% CI, 72.3%-84.1%; physicians: 22.1%, 95% CI, 16.4%-28.2%;). This amounted to 3.6 times higher prevalence of good or very good quality responses for the chatbot. Chatbot responses were also rated significantly more empathetic than physician responses (t = 18.9; P < .001). The proportion of responses rated empathetic or very empathetic (≥4) was higher for chatbot than for physicians (physicians: 4.6%, 95% CI, 2.1%-7.7%; chatbot: 45.1%, 95% CI, 38.5%-51.8%; physicians: 4.6%, 95% CI, 2.1%-7.7%). This amounted to 9.8 times higher prevalence of empathetic or very empathetic responses for the chatbot. Conclusions In this cross-sectional study, a chatbot generated quality and empathetic responses to patient questions posed in an online forum. Further exploration of this technology is warranted in clinical settings, such as using chatbot to draft responses that physicians could then edit. Randomized trials could assess further if using AI assistants might improve responses, lower clinician burnout, and improve patient outcomes.
Expand Me   Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann. BloombergGPT: A Large Language Model for Finance. arXiv 2303.17564, 2023. [PDF] [Bibtex] (Ranked in #4033 out of 23m (top 0.02%) research outputs by Altmetric)
The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. As a next step, we plan to release training logs (Chronicles) detailing our experience in training BloombergGPT.
     Alexandra DeLucia, Adam Poliak, Zechariah Zhu, Stephanie R Pitts, Mario Navarro, Sharareh Shojaie, John W Ayers, Mark Dredze. Automated Discovery of Perceived Health-related Concerns about E-cigarettes from Reddit. Annual Meeting of the Society for Research on Nicotine and Tobacco, 2023. [PDF] [Bibtex]

     2022 (18 Publications)
Expand Me   Carlos Aguirre, Mark Dredze, Philip Resnik. Using Open-Ended Stressor Responses to Predict Depressive Symptoms across Demographics. Machine Learning for Health (ML4H) (Extended Abstract), 2022. [PDF] [Bibtex]
Stressors have shown to be related to depression, however, this relationship is complex. In this work, we study the relationship between open-ended text responses about stressors and depressive symptoms across gender and racial/ethnic groups. Open-ended text responses in survey instruments promise more nuanced insights compared to more traditional inquiries, e.g. multiple choice questions, which may be crucial in settings such as in mental health. However, they often require expensive analyzing, such as coding and annotations. Language Models offer solutions for automatically analyzing these, but with many possible risks, such as biases. We train language models using self-reported stressors to predict depressive symptoms, finding a relationship between stressors and depression. Further, we analyze stressors finding different trends across demographic groups. Finally, we find that these differences translate to downstream performance differences across demographic groups.
Expand Me   Moniba Keymanesh, Adrian Benton, Mark Dredze. What Makes Data-to-Text Generation Hard for Pretrained Language Models? EMNLP Workshop on Generation, Evaluation & Metrics (GEM), 2022. [PDF] [Bibtex]
Expressing natural language descriptions of structured facts or relations -- data-to-text generation (D2T) -- increases the accessibility of structured knowledge repositories. Previous work shows that pre-trained language models (PLMs) perform remarkably well on this task after fine-tuning on a significant amount of task-specific training data. On the other hand, while auto-regressive PLMs can generalize from a few task examples, their efficacy at D2T is largely unexplored. Furthermore, we have an incomplete understanding of the limits of PLMs on D2T. In this work, we conduct an empirical study of both fine-tuned and auto-regressive PLMs on the DART multi-domain D2T dataset. We consider their performance as a function of the amount of task-specific data and how the data is incorporated into the models: zero and few-shot learning, and fine-tuning of model weights. In addition, we probe the limits of PLMs by measuring performance on subsets of the evaluation data: novel predicates and abstractive test examples. To improve the performance on these subsets, we investigate two techniques: providing predicate descriptions in the context and re-ranking generated candidates by information reflected in the source. Finally, we conduct a human evaluation of model errors and show that D2T generation tasks would benefit from datasets with more careful manual curation.
Expand Me   Elliot Schumacher, James Mayfield, Mark Dredze. Zero-shot Cross-Language Transfer of Monolingual Entity Linking Models. EMNLP Workshop on Multilingual Representation Learning, 2022. [PDF] [Bibtex] (Best Paper Award Honorable Mention)
Most entity linking systems, whether mono or multilingual, link mentions to a single English knowledge base. Few have considered linking non-English text to a non-English KB, and therefore, transferring an English entity linking model to both a new document and KB language. We consider the task of zero-shot cross-language transfer of entity linking systems to a new language and KB. We find that a system trained with multilingual representations does reasonably well, and propose improvements to system training that lead to improved recall in most datasets, often matching the in-language performance. We further conduct a detailed evaluation to elucidate the challenges of this setting.
Expand Me   David Mueller, Mark Dredze, Nicholas Andrews. The Importance of Temperature in Multi-Task Optimization. NeurIPS Workshop on Optimization for Machine Learning (OPT), 2022. [PDF] [Bibtex]
The promise of multi-task learning is that optimizing a single model on multiple related tasks will lead to a better solution for all tasks than independently trained models. In practice, optimization difficulties, such as conflicting gradients, can result in negative transfer, where multi-task models which perform worse than single-task models. In this work, we identify the optimization temperature---the ratio of learning rate to batch size---as a key factor in negative transfer. Temperature controls the level of noise in each optimization step, which prior work has shown to have a strong correlation with generalization. We demonstrate that, in some multi-task settings, negative transfer may arise due to poorly set optimization temperature, rather than inherently high task conflict. The implication of this finding is that in some settings, SGD with a carefully controlled temperature achieves comparable, and in some cases superior, performance to that of specialized optimization procedures such as PCGrad, MGDA, and GradNorm. In particular, our results suggest that the significant additional computational burden of these specialized methods may not always be necessary. Finally, we observe a conflict between the optimal temperatures of different tasks in a multi-task objective, with different levels of noise promoting better generalization for different tasks. Our work suggests the need for novel multi-task optimization methods which consider individual task noise-levels, and their impact on generalization.
Expand Me   Zach Wood-Doughty, Ilya Shpitser, Mark Dredze. Generating Synthetic Text Data to Evaluate Causal Inference Methods. American Causal Inference Conference, 2022. [PDF] [Bibtex]
Drawing causal conclusions from observational data requires making assumptions about the true data-generating process. Causal inference research typically considers low-dimensional data, such as categorical or numerical fields in structured medical records. High-dimensional and unstructured data such as natural language complicates the evaluation of causal inference methods; such evaluations rely on synthetic datasets with known causal effects. Models for natural language generation have been widely studied and perform well empirically. However, existing methods not immediately applicable to producing synthetic datasets for causal evaluations, as they do not allow for quantifying a causal effect on the text itself. In this work, we develop a framework for adapting existing generation models to produce synthetic text datasets with known causal effects. We use this framework to perform an empirical comparison of four recently-proposed methods for estimating causal effects from text data. We release our code and synthetic datasets.
Expand Me   Alexandra DeLucia, Shijie Wu, Aaron Mueller, Carlos Aguirre, Philip Resnik, Mark Dredze. Bernice: A Multilingual Pre-trained Encoder for Twitter. Empirical Methods in Natural Language Processing (EMNLP), 2022. [PDF] [Bibtex]
The language of Twitter differs significantly from that of other domains commonly included in large language model training. While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.
Expand Me   David Mueller, Nicholas Andrews, Mark Dredze. Do Text-to-Text Multi-Task Learners Suffer from Task Conflict? Findings of the Empirical Methods in Natural Language Processing (EMNLP), 2022. [PDF] [Bibtex]
Traditional multi-task learning architectures train a single model across multiple tasks through a shared encoder followed by taskspecific decoders. Learning these models often requires specialized training algorithms that address task-conflict in the shared parameter updates, which otherwise can lead to negative transfer. A new type of multi-task learning within NLP homogenizes multi-task architectures as a shared encoder and language model decoder, which does surprisingly well across a range of diverse tasks (Raffel et al., 2020). Does this new architecture suffer from taskconflicts that require specialized training algorithms? We study how certain factors in the shift towards text-to-text models affects multitask conflict and negative transfer, finding that both directional conflict and transfer are surprisingly constant across architectures.
Expand Me   Jingyu Zhang, Alexandra DeLucia, Mark Dredze. Changes in Tweet Geolocation over Time: A Study with Carmen 2.0. EMNLP Workshop on Noisy User-generated Text (W-NUT), 2022. [PDF] [Bibtex]
Researchers across disciplines use Twitter geolocation tools to filter data for desired locations. These tools have largely been trained and tested on English tweets, often originating in the United States from almost a decade ago. Despite the importance of these tools for data curation, the impact of tweet language, country of origin, and creation date on tool performance remains largely unknown. We explore these issues with Carmen, a popular tool for Twitter geolocation. To support this study we introduce Carmen 2.0, a major update which includes the incorporation of GeoNames, a gazetteer that provides much broader coverage of locations. We evaluate using two new Twitter datasets, one for multilingual, multiyear geolocation evaluation, and another for usage trends over time. We found that language, country origin, and time does impact geolocation tool performance.
Expand Me   Keith Harrigian, Mark Dredze. Then and Now: Quantifying the Longitudinal Validity of Self-Disclosed Depression Diagnoses. NAACL Workshop on Computational Linguistics and Clinical Psychology (CLPsych), 2022. [PDF] [Bibtex]
Self-disclosed mental health diagnoses, which serve as ground truth annotations of mental health status in the absence of clinical measures, underpin the conclusions behind most computational studies of mental health language from the last decade. However, psychiatric conditions are dynamic; a prior depression diagnosis may no longer be indicative of an individual's mental health, either due to treatment or other mitigating factors. We ask: to what extent are self-disclosures of mental health diagnoses actually relevant over time? We analyze recent activity from individuals who disclosed a depression diagnosis on social media over five years ago and, in turn, acquire a new understanding of how presentations of mental health status on social media manifest longitudinally. We also provide expanded evidence for the presence of personality-related biases in datasets curated using self-disclosed diagnoses. Our findings motivate three practical recommendations for improving mental health datasets curated using self-disclosed diagnoses: 1) Annotate diagnosis dates and psychiatric comorbidities; 2) Sample control groups using propensity score matching; 3) Identify and remove spurious correlations introduced by selection bias.
Expand Me   Ayah Zirikly, Mark Dredze. Explaining Models of Mental Health via Clinically Grounded Auxiliary Tasks. NAACL Workshop on Computational Linguistics and Clinical Psychology (CLPsych), 2022. [PDF] [Bibtex]
Models of mental health based on natural language processing can uncover latent signals of mental health from language. Models that indicate whether an individual is depressed, or has other mental health conditions, can aid in diagnosis and treatment. A critical aspect of integration of these models into the clinical setting relies on explaining their behavior to domain experts. In the case of mental health diagnosis, clinicians already rely on an assessment framework to make these decisions; that framework can help a model generate meaningful explanations.In this work we propose to use PHQ-9 categories as an auxiliary task to explaining a social media based model of depression. We develop a multi-task learning framework that predicts both depression and PHQ-9 categories as auxiliary tasks. We compare the quality of explanations generated based on the depression task only, versus those that use the predicted PHQ-9 categories. We find that by relying on clinically meaningful auxiliary tasks, we produce more meaningful explanations.
Expand Me   Keith Harrigian, Mark Dredze. The Problem of Semantic Shift in Longitudinal Monitoring of Social Media: A Case Study on Mental Health During the COVID-19 Pandemic. Conference on Web Science (WebSci), 2022. [PDF] [Bibtex]
Social media allows researchers to track societal and cultural changes over time based on language analysis tools. Many of these tools rely on statistical algorithms which need to be tuned to specific types of language. Recent studies have shown the absence of appropriate tuning, specifically in the presence of semantic shift, can hinder robustness of the underlying methods. However, little is known about the practical effect this sensitivity may have on downstream longitudinal analyses. We explore this gap in the literature through a timely case study: understanding shifts in depression during the course of the COVID-19 pandemic. We find that inclusion of only a small number of semantically-unstable features can promote significant changes in longitudinal estimates of our target outcome. At the same time, we demonstrate that a recently-introduced method for measuring semantic shift may be used to proactively identify failure points of language-based models and, in turn, improve predictive generalization.
Expand Me   Zach Wood-Doughty, Isabel Cachola, Mark Dredze. Model Distillation for Faithful Explanations of Medical Code Predictions. ACL Workshop on Biomedical Natural Language Processing (BioNLP), 2022. [PDF] [Bibtex]
Machine learning models that offer excellent predictive performance often lack the interpretability necessary to support integrated human machine decision-making. In clinical medicine and other high-risk settings, domain experts may be unwilling to trust model predictions without explanations. Work in explainable AI must balance competing objectives along two different axes: 1) Models should ideally be both accurate and simple. 2) Explanations must balance faithfulness to the model's decision-making with their plausibility to a domain expert. We propose to use knowledge distillation, or training a student model that mimics the behavior of a trained teacher model, as a technique to generate faithful and plausible explanations. We evaluate our approach on the task of assigning ICD codes to clinical notes to demonstrate that the student model is faithful to the teacher model's behavior and produces quality natural language explanations.
Expand Me   Shijie Wu, Benjamin Van Durme, Mark Dredze. Zero-shot Cross-lingual Transfer is Under-specified Optimization. ACL Workshop on Representation Learning for NLP (RepL4NLP), 2022. [PDF] [Bibtex]
Pretrained multilingual encoders enable zero-shot cross-lingual transfer, but often produce unreliable models that exhibit high performance variance on the target language. We postulate that this high variance results from zero-shot cross-lingual transfer solving an under-specified optimization problem. We show that any linear-interpolated model between the source language monolingual model and source + target bilingual model has equally low source language generalization error, yet the target language generalization error reduces smoothly and linearly as we move from the monolingual to bilingual model, suggesting that the model struggles to identify good solutions for both source and target languages using the source language alone. Additionally, we show that zero-shot solution lies in non-flat region of target language error generalization surface, causing the high variance.
Expand Me   Xiaolei Huang, Franck Dernoncourt, Mark Dredze. Enriching Unsupervised User Embedding via Medical Concepts. Proceedings of the Conference on Health, Inference, and Learning, 2022. [PDF] [Bibtex]
Clinical notes in Electronic Health Records (EHR) present rich documented information of patients to inference phenotype for disease diagnosis and study patient characteristics for cohort selection. Unsupervised user embedding aims to encode patients into fixed-length vectors without human supervisions. Medical concepts extracted from the clinical notes contain rich connections between patients and their clinical categories. However, existing \textitunsupervised approaches of user embeddings from clinical notes do not explicitly incorporate medical concepts. In this study, we propose a concept-aware unsupervised user embedding that jointly leverages text documents and medical concepts from two clinical corpora, MIMIC-III and Diabetes. We evaluate user embeddings on both extrinsic and intrinsic tasks, including phenotype classification, in-hospital mortality prediction, patient retrieval, and patient relatedness. Experiments on the two clinical corpora show our approach exceeds unsupervised baselines, and incorporating medical concepts can significantly improve the baseline performance.
     Joshua Dredze, Lisi Dredze, Mark Dredze. Interest in Teletherapy during the COVID-19 Pandemic as Measured by Google Search Trends. Association for Psychological Science Annual Convention (APS), 2022. [Bibtex]
Expand Me   Sheena Panthaplackel, Adrian Benton, Mark Dredze. Updated Headline Generation: Creating Updated Summaries for Evolving News Stories. Association for Computational Linguistics (ACL), 2022. [PDF] [Bibtex]
We propose the task of updated headline generation, in which a system generates a headline for an updated article, considering both the previous article and headline. The system must identify the novel information in the article update, and modify the existing headline to reflect this. We introduce a new dataset for this task by automatically identifying contiguous article versions that are likely to require a substantive headline update from the NewsEdits corpus (Spangher and May, 2021). We find that models conditioned on the prior headline and body revisions produce headlines judged by humans to be as factual as gold headlines while making fewer unnecessary edits compared to a standard headline generation model. Our experiments establish benchmarks for this new contextual summarization task.
     Adam Poliak, Paiheng Xu, Eric Leas, Mario Navarro, Stephanie Pitts, Andie Malterud, John W Ayers, Mark Dredze. A Machine Learning Approach For Discovering Tobacco Brands, Products, and Manufacturers in the United States. Annual Meeting of the Society for Research on Nicotine and Tobacco, 2022. [Bibtex]
Expand Me   David A Broniatowski, Daniel Kerchne, Fouzia Farooq, Xiaolei Huang, Amelia M Jamison, Mark Dredze, Sandra Crouse Quinn. Twitter and Facebook posts about COVID-19 are less likely to spread misinformation compared to other health topics. PLoS ONE, 2022. [PDF] [Bibtex]
The COVID-19 pandemic brought widespread attention to an ``infodemic'' of potential health misinformation. This claim has not been assessed based on evidence. We evaluated if health misinformation became more common during the pandemic. We gathered about 325 million posts sharing URLs from Twitter and Facebook during the beginning of the pandemic (March 8-May 1, 2020) compared to the same period in 2019. We relied on source credibility as an accepted proxy for misinformation across this database. Human annotators also coded a subsample of 3000 posts with URLs for misinformation. Posts about COVID-19 were 0.37 times as likely to link to ``not credible'' sources and 1.13 times more likely to link to ``more credible'' sources than prior to the pandemic. Posts linking to ``not credible'' sources were 3.67 times more likely to include misinformation compared to posts from ``more credible'' sources. Thus, during the earliest stages of the pandemic, when claims of an infodemic emerged, social media contained proportionally less misinformation than expected based on the prior year. Our results suggest that widespread health misinformation is not unique to COVID-19. Rather, it is a systemic feature of online health communication that can adversely impact public health behaviors and must therefore be addressed.

     2021 (17 Publications)
Expand Me   Anna L Buczak, Benjamin D Baugher, Christine S Martin, Meg W Keiley-Listermann, James Howard II, Nathan H Parrish, Anton Q Stalick, Daniel S Berman, Mark H Dredze. Crystal Cube: Forecasting Disruptive Events. Applied Artificial Intelligence, 2021;0(0):1-24. [PDF] [Bibtex]
Disruptive events within a country can have global repercussions, creating a need for the anticipation and planning of these events. Crystal Cube (CC) is a novel approach to forecasting disruptive political events at least one month into the future. The system uses a recurrent neural network and a novel measure of event similarity between past and current events. We also introduce the innovative Thermometer of Irregular Leadership Change (ILC). We present an evaluation of CC in predicting ILC for 167 countries and show promising results in forecasting events one to twelve months in advance. We compare CC results with results using a random forest as well as previous work.
Expand Me   Zach Wood-Doughty, Isabel Cachola, Mark Dredze. Proxy Model Explanations for Time Series RNNs. IEEE International Conference on Machine Learning and Applications (ICMLA), 2021. [PDF] [Bibtex]
While machine learning models can produce accurate predictions of complex real-world phenomena, domain experts may be unwilling to trust such a prediction without an explanation of the model's behavior. This concern has motivated widespread research and produced many methods for interpreting black-box models. Many such methods explain predictions one-by-one, which can be slow and inconsistent across a large dataset, and ill-suited for time series applications. We introduce a proxy model approach that is fast to train, faithful to the original model, and globally consistent in its explanations. We compare our approach to several previous methods and find both that methods disagree with one another and that our approach improves over existing methods in an application to political event forecasting.
Expand Me   Abhinav Chinta, Jingyu Zhang, Alexandra DeLucia, Mark Dredze, Anna L Buczak. Study of Manifestation of Civil Unrest on Twitter. EMNLP Workshop on Noisy User-generated Text (W-NUT), 2021. [PDF] [Bibtex]
Twitter is commonly used for civil unrest detection and forecasting tasks, but there is a lack of work in evaluating how civil unrest manifests on Twitter across countries and events. We present two in-depth case studies for two specific large-scale events, one in a country with high (English) Twitter usage (Johannesburg riots in South Africa) and one in a country with low Twitter usage (Burayu massacre protests in Ethiopia). We show that while there is event signal during the events, there is little signal leading up to the events. In addition to the case studies, we train Ngram-based models on a larger set of Twitter civil unrest data across time, events, and countries and use machine learning explainability tools (SHAP) to identify important features. The models were able to find words indicative of civil unrest that generalized across countries. The 42 countries span Africa, Middle East, and Southeast Asia and the events range occur between 2014 and 2019.
Expand Me   Mahsa Yarmohammadi, Shijie Wu, Marc Marone, Haoran Xu, Seth Ebner, Guanghui Qin, Yunmo Chen, Jialiang Guo, Craig Harman, Kenton Murray, Aaron Steven White, Mark Dredze, Benjamin Van Durme. Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction. Empirical Methods in Natural Language Processing (EMNLP), 2021. [PDF] [Bibtex]
Zero-shot cross-lingual information extraction (IE) describes the construction of an IE model for some target language, given existing annotations exclusively in some other language, typically English. While the advance of pretrained multilingual encoders suggests an easy optimism of ``train on English, run on any language'', we find through a thorough exploration and extension of techniques that a combination of approaches, both new and old, leads to better performance than any one cross-lingual strategy in particular. We explore techniques including data projection and self-training, and how different pretrained encoders impact them. We use English-to-Arabic IE as our initial example, demonstrating strong performance in this setting for event extraction, named entity recognition, part-of-speech tagging, and dependency parsing. We then apply data projection and self-training to three tasks across eight target languages. Because no single set of techniques performs the best across all tasks, we encourage practitioners to explore various configurations of the techniques described in this work when seeking to improve on zero-shot training.
Expand Me   John W Ayers, Brian Chu, Zechariah Zhu, Eric C Leas, Davey M Smith, Mark Dredze, David A Broniatowski. Spread of Misinformation About Face Masks and COVID-19 by Automated Software on Facebook. JAMA Internal Medicine, 2021. [PDF] [Bibtex] (Ranked in the top 0.1% of 18m research outputs by Altmetric)
The dangers of misinformation spreading on social media during the COVID-19 pandemic are known. However, software that allows individuals to generate automated content and share it via counterfeit accounts (or ``bots'') to amplify misinformation has been overlooked, including how automated software can be used to disseminate original research while undermining scientific communication. We analyzed conversations on public Facebook groups, a platform known to be susceptible to automated misinformation, concerning the publication of the Danish Study to Assess Face Masks for the Protection Against COVID-19 Infection (DANMASK-19) to explore automated misinformation. We selected DANMASK-19 because it was widely discussed (it was the fifth most shared research article of all time as of March 2021 according to Altmetric5) and demonstrated that masks are an important public health measure to control the pandemic.
Expand Me   Elliot Schumacher, James Mayfield, Mark Dredze. Cross-Lingual Transfer in Zero-Shot Cross-Language Entity Linking. Findings of the Association for Computational Linguistics (ACL), 2021. [PDF] [Bibtex]
Cross-language entity linking grounds mentions in multiple languages to a single-language knowledge base. We propose a neural ranking architecture for this task that uses multilingual BERT representations of the mention and the context in a neural network. We find that the multilingual ability of BERT leads to robust performance in monolingual and multilingual settings. Furthermore, we explore zero-shot language transfer and find surprisingly robust performance. We investigate the zero-shot degradation and find that it can be partially mitigated by a proposed auxiliary training objective, but that the remaining error can best be attributed to domain shift rather than language transfer.
Expand Me   David A Broniatowski, Mark Dredze, John W Ayers. ``First Do No Harm'': Effective Communication About COVID-19 Vaccines. American Journal of Public Health (AJPH), 2021;111(6):1055-1057. [PDF] [Bibtex]
With effective COVID-19 vaccines in hand, we must now address the spread of information on social media that might encourage vaccine hesitancy. Although misinformation comes in many forms, including false claims, disinformation (e.g., deliberately false information), and rumors (e.g., unverified information), social media companies now seek to interdict this objectionable content---for the first time in their history---by removing content explicitly containing conspiracy theories and false or debunked claims about vaccines. Concurrently, social media users routinely disparage ``anti-vaxxers'' online, conflating a large group of vaccine-hesitant individuals who may be using social media to seek information about vaccination with a potentially much smaller group of ``vaccine refusers.'' Both strategies could cause more harm than good, necessitating a change in strategy informed by a large body of scientific evidence for making online communications about COVID-19 vaccines more effective.
Expand Me   John W Ayers, Eric C Leas, Mark Dredze, Theodore L Caputi, Shu-Hong Zhu, Joanna E Cohen. Did Philip Morris International use the e-cigarette, or vaping, product use associated lung injury (EVALI) outbreak to market IQOS heated tobacco? Tobacco Control, 2021. [PDF] [Bibtex]
25 July 2021 will mark the second anniversary of the e-cigarette, or vaping, product use associated lung injury (EVALI) outbreak. The concerns raised and news media attention focused on EVALI created a fertile environment for the tobacco industry to promote their e-cigarette alternatives, but this has not been studied. One such product is Philip Morris International's (PMI) heated tobacco product: `IQOS'. To assess how PMI promoted IQOS in the news during EVALI, we used `Tobacco Watcher' (, a fully automated and publicly available tobacco media analysis engine that warehouses news from more than 500 000+ sources. We plotted trends in news stories mentioning `IQOS' finding the largest number of stories mentioning IQOS occurred on 25 September 2019, with 261 articles, more than double the next highest day previously recorded. While investigating this anomaly we discovered an official PMI press release entitled `Lung illnesses associated with use of vaping products in the US' was published the same day. In the release (see online supplemental material), PMI recounted the EVALI outbreak beginning: `Skepticism and fear around vaping has emerged following the cases of respiratory illness and deaths in the US associated with the use of e-cigarettes.' PMI then contrasted this against their IQOS heated tobacco product, writing `on April 30 2019, the FDA authorized IQOS for sale in the US, finding that marketing of the product would be `appropriate for the protection of public health'' (quotes used in the original release).
Expand Me   Keith Harrigian, Carlos Aguirre, Mark Dredze. On the State of Social Media Data for Mental Health Research. NAACL Workshop on Computational Linguistics and Clinical Psychology (CLPsych), 2021. [PDF] [Bibtex]
Data-driven methods for mental health treatment and surveillance have become a major focus in computational science research in the last decade. However, progress in the domain remains bounded by the availability of adequate data. Prior systematic reviews have not necessarily made it possible to measure the degree to which data-related challenges have affected research progress. In this paper, we offer an analysis specifically on the state of social media data that exists for conducting mental health research. We do so by introducing an open-source directory of mental health datasets, annotated using a standardized schema to facilitate meta-analysis.
Expand Me   Carlos Aguirre, Mark Dredze. Qualitative Analysis of Depression Models by Demographics. NAACL Workshop on Computational Linguistics and Clinical Psychology (CLPsych), 2021. [PDF] [Bibtex]
Models for identifying depression using social media text exhibit biases towards different gender and racial/ethnic groups. Factors like representation and balance of groups within the dataset are contributory factors, but difference in content and social media use may further explain these biases. We present an analysis of the content of social media posts from different demographic groups. Our analysis shows that there are content differences between depression and control subgroups across demographic groups, and that temporal topics and demographic-specific topics are correlated with downstream depression model error. We discuss the implications of our work on creating future datasets, as well as designing and training models for mental health.
Expand Me   Eli Sherman, Keith Harrigian, Carlos Aguirre, Mark Dredze. Towards Understanding the Role of Demographics in Deploying Social Media-Based Mental Health Surveillance Models. NAACL Workshop on Computational Linguistics and Clinical Psychology (CLPsych), 2021. [PDF] [Bibtex]
Spurred by advances in machine learning and natural language processing, developing social media-based mental health surveillance models has received substantial recent attention. For these models to be maximally useful, it is necessary to understand how they perform on various subgroups, especially those defined in terms of protected characteristics. In this paper we study the relationship between user demographics -- focusing on gender -- and depression. Considering a population of Reddit users with known genders and depression statuses, we analyze the degree to which depression predictions are subject to biases along gender lines using domaininformed classifiers. We then study our models' parameters to gain qualitative insight into the differences in posting behavior across genders.
Expand Me   Zach Wood-Doughty, Paiheng Xu, Xiao Liu, Mark Dredze. Using Noisy Self-Reports to Predict Twitter User Demographics. NAACL Workshop on Natural Language Processing for Social Media (SocialNLP), 2021. [PDF] [Bibtex] [Data]
Computational social science studies often contextualize content analysis within standard demographics. Since demographics are unavailable on many social media platforms (e.g. Twitter), numerous studies have inferred demographics automatically. Despite many studies presenting proof-of-concept inference of race and ethnicity, training of practical systems remains elusive since there are few annotated datasets. Existing datasets are small, inaccurate, or fail to cover the four most common racial and ethnic groups in the United States. We present a method to identify self-reports of race and ethnicity from Twitter profile descriptions. Despite the noise of automated supervision, our self-report datasets enable improvements in classification performance on gold standard self-report survey data. The result is a reproducible method for creating large-scale training resources for race and ethnicity.
Expand Me   Aaron Mueller, Mark Dredze. Fine-tuning Encoders for Improved Monolingual and Zero-shot Polylingual Neural Topic Modeling. North American Chapter of the Association for Computational Linguistics (NAACL), 2021. [PDF] [Bibtex] [Code]
Neural topic models can augment or replace bag-of-words inputs with the learned representations of deep pre-trained transformer-based word prediction models. One added benefit when using representations from multilingual models is that they facilitate zero-shot polylingual topic modeling. However, while it has been widely observed that pre-trained embeddings should be fine-tuned to a given task, it is not immediately clear what supervision should look like for an unsupervised task such as topic modeling. Thus, we propose several methods for fine-tuning encoders to improve both monolingual and zero-shot polylingual neural topic modeling. We consider fine-tuning on auxiliary tasks, constructing a new topic classification task, integrating the topic classification objective directly into topic model training, and continued pre-training. We find that fine-tuning encoder representations on topic classification and integrating the topic classification task directly into topic modeling improves topic quality, and that fine-tuning encoder representations on any task is the most important factor for facilitating cross-lingual transfer.
Expand Me   Xiaolei Huang, Michael J Paul, Franck Dernoncourt, Robin Burke, Mark Dredze. User Factor Adaptation for User Embedding via Multitask Learning. EACL Workshop on Domain Adaptation for NLP (Adapt-NLP), 2021. [PDF] [Bibtex]
Language varies across users and their interested fields in social media data: words authored by a user across his/her interests may have different meanings (e.g., cool) or sentiments (e.g., fast). However, most of the existing methods to train user embeddings ignore the variations across user interests, such as product and movie categories (e.g., drama vs. action). In this study, we treat the user interest as domains and empirically examine how the user language can vary across the user factor in three English social media datasets. We then propose a user embedding model to account for the language variability of user interests via a multitask learning framework. The model learns user language and its variations without human supervision. While existing work mainly evaluated the user embedding by extrinsic tasks, we propose an intrinsic evaluation via clustering and evaluate user embeddings by an extrinsic task, text classification. The experiments on the three English-language social media datasets show that our proposed approach can generally outperform baselines via adapting the user factor.
Expand Me   John W Ayers, Adam Poliak, Derek C Johnson, Eric C Leas, Mark Dredze, Theodore Caputi, Alicia L Nobles. Suicide-Related Internet Searches During the Early Stages of the COVID-19 Pandemic in the US. JAMA Network Open, 2021;4(1):e2034261. [PDF] [Bibtex]
Experts anticipate that the societal fallout associated with the coronavirus disease 2019 (COVID-19) pandemic will increase suicidal behavior, and strategies to address this anticipated increase have been woven into policy decision-making without contemporaneous data.For instance, President Trump cited increased suicides as an argument against COVID-19 control measures during the first presidential debate on September 29, 2020. Given the time delays inherent in traditional population mental health surveillance, it is important for decision-makers to seek other contemporaneous data to evaluate potential associations. To assess the value that free and public internet search query trends can provide to rapidly identify associations, we monitored suicide-related internet search rates during the early stages of the COVID-19 pandemic in the US.
Expand Me   Carlos Aguirre, Keith Harrigian, Mark Dredze. Gender and Racial Fairness in Depression Research using Social Media. Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021. [PDF] [Bibtex]
Multiple studies have demonstrated that behaviors expressed on online social media platforms can indicate the mental health state of an individual. The widespread availability of such data has spurred interest in mental health research, using several datasets where individuals are labeled with mental health conditions. While previous research has raised concerns about possible biases in models produced from this data, no study has investigated how these biases manifest themselves with regards to demographic groups in data, such as gender and racial/ethnic groups. Here, we analyze the fairness of depression classifiers trained on Twitter data with respect to gender and racial demographic groups. We find that model performance differs for underrepresented groups, and we investigate sources of these biases beyond data representation. Our study results in recommendations on how to avoid these biases in future research.
Expand Me   Aaron Mueller, Zach Wood-Doughty, Silvio Amir, Mark Dredze, Alicia L Nobles. Demographic Representation and Collective Storytelling in the Me Too Twitter Hashtag Activism Movement. Computer-supported Cooperative Work (CSCW), 2021. [PDF] [Bibtex]
The #MeToo movement on Twitter has drawn attention to the pervasive nature of sexual harassment and violence. While #MeToo has been praised for providing support for self-disclosures of harassment or violence and shifting societal response, it has also been criticized for exemplifying how women of color have been discounted for their historical contributions to and excluded from feminist movements. Through an analysis of over 600,000 tweets from over 256,000 unique users, we examine online #MeToo conversations across gender and racial/ethnic identities and the topics that each demographic emphasized. We found that tweets authored by white women were overrepresented in the movement compared to other demographics, aligning with criticism of unequal representation. We found that intersected identities contributed differing narratives to frame the movement, co-opted the movement to raise visibility in parallel ongoing movements, employed the same hashtags both critically and supportively, and revived and created new hashtags in response to pivotal moments. Notably, tweets authored by black women often expressed emotional support and were critical about differential treatment in the justice system and by police. In comparison, tweets authored by white women and men often highlighted sexual harassment and violence by public figures and weaved in more general political discussions. We discuss the implications of this work for digital activism research and design, including suggestions to raise visibility by those who were under-represented in this hashtag activism movement. Content warning: this article discusses issues of sexual harassment and violence.

     2020 (28 Publications)
Expand Me   Caitlin Weiger, Katherine C Smith, Joanna E Cohen, Mark Dredze, Meghan Bridgid Moran. How Internet Contracts Impact Research: Content Analysis of Terms of Service on Consumer Product Websites. Journal of Medical Internet Research Public Health Surveillance, 2020;6(4):e23579. [PDF] [Bibtex]
Background: Companies use brand websites as a promotional tool to engage consumers on the web, which can increase product use. Given that some products are harmful to the health of consumers, it is important for marketing associated with these products to be subject to public health surveillance. However, terms of service (TOS) governing the use of brand website content may impede such important research. Objective: The aim of this study is to explore the TOS for brand websites with public health significance to assess possible legal and ethical challenges for conducting research on consumer product websites. Methods: Using Statista, we purposefully constructed a sample of 15 leading American tobacco, alcohol, psychiatric pharmaceutical, fast-food, and gun brands that have associated websites. We developed and implemented a structured coding system for the TOS on these websites and coded for the presence versus absence of different types of restriction that might impact the ability to conduct research. Results: All TOS stated that by accessing the website, users agreed to abide by the TOS (15/15, 100%). A total of 11 out of 15 (73%) websites had age restrictions in their TOS. All alcohol brand websites (5/15, 33%) required users to enter their age or date of birth before viewing website content. Both websites for tobacco brands (2/15, 13%) further required that users register and verify their age and identity to access any website content and agree that they use tobacco products. Only one website (1/15, 7%) allowed users to display, download, copy, distribute, and translate the website content as long as it was for personal and not commercial use. A total of 33% (5/15) of TOS unconditionally prohibited or put substantial restrictions on all of these activities and/or failed to specify if they were allowed or prohibited. Moreover, 87% (13/15) of TOS indicated that website access could be restricted at any time. A total of 73% (11/15) of websites specified that violating TOS could result in deleting user content from the website, revoking access by having the user's Internet Protocol address blocked, terminating log-in credentials, or enforcing legal action resulting in civil or criminal penalties. Conclusions: TOS create complications for public health surveillance related to e-marketing on brand websites. Recent court opinions have reduced the risk of federal criminal charges for violating TOS on public websites, but this risk remains unclear for private websites. The public health community needs to establish standards to guide and protect researchers from the possibility of legal repercussions related to such efforts.
Expand Me     Rachel Dorn, Alicia L Nobles, Masoud Rouhizadeh, Mark Dredze. Examining the Feasibility of Off-the-Shelf Algorithms for Masking Directly Identifiable Information in Social Media Data. arXiv:2011.08324, 2020. [PDF] [Bibtex]
The identification and removal/replacement of protected information from social media data is an understudied problem, despite being desirable from an ethical and legal perspective. This paper identifies types of potentially directly identifiable information (inspired by protected health information in clinical texts) contained in tweets that may be readily removed using off-the-shelf algorithms, introduces an English dataset of tweets annotated for identifiable information, and compiles these off-the-shelf algorithms into a tool (Nightjar) to evaluate the feasibility of using Nightjar to remove directly identifiable information from the tweets. Nightjar as well as the annotated data can be retrieved from this https URL.
Expand Me   Zach Wood-Doughty, Ilya Shpitser, Mark Dredze. Sensitivity Analyses for Incorporating Machine Learning Predictions into Causal Estimates. NeurIPS Workshop on Causal Discovery & Causality-Inspired Machine Learning, 2020. [PDF] [Bibtex]
Causal inference methods can yield insights into causation from observational datasets. When some necessary variables are unavailable for a causal analysis, machine learning systems may be able to infer those variables based on unstructured data such as images and text. However, if these inferred variables are to be incorporated into causal analyses, the error rate of the underlying classifier should affect the uncertainty in the causal conclusions. Past work has framed classifier accuracy as measurement error to incorporate predictions into consistently-estimated causal effects. However, this estimator is sensitive to small errors in its estimation of the classifier's error rate, leading to erratic outputs that are uninterpretable for any given analysis. In this paper we introduce three sensitivity analyses that capture the uncertainty from using a machine learning classifier in causal estimation, and show that our methods enable more robust analyses.
     Paiheng Xu, David Broniatowski, Mark Dredze. Twitter Detects Who is Social Distancing During COVID-19. NeurIPS Workshop on Machine Learning in Public Health, 2020. [Bibtex]
Expand Me   Eric C Leas, Alicia L Nobles, Theodore L Caputi, Mark Dredze, Shu-Hong Zhu, Joanna E Cohen, John W Ayers. News coverage of the E-cigarette, or Vaping, product use Associated Lung Injury (EVALI) outbreak and internet searches for vaping cessation. Tobacco Control, 2020. [PDF] [Bibtex]
Background In the latter half of 2019, an outbreak of pulmonary disease in the USA resulted in 2807 hospitalisations and 68 deaths, as of 18 February 2020. Given the severity of the outbreak, we assessed whether articles during the outbreak era more frequently warned about the dangers of vaping and whether internet searches for vaping cessation increased. Methods Using Tobacco Watcher, a media monitoring platform that automatically identifies and categorises news articles from sources across the globe, we obtained all articles that (a) discussed the outbreak and (b) primarily warned about the dangers of vaping. We obtained internet search trends originating from the USA that mentioned `quit' or `stop' and `e cig(s),' `ecig(s),' `e-cig(s),' `e cigarette(s),' `e-cigarette(s),' `electronic cigarette(s),' `vape(s),' `vaping' or `vaper(s)' from Google Trends (eg, `how do I quit vaping?'). All data were obtained from 1 January 2014 to 18 February 2020 and ARIMA models were used with historical trends to forecast the ratio of observed to expected search volumes during the outbreak era. Results News of the vaping-induced pulmonary disease outbreak was first reported on 25 July 2019 with 195 articles, culminating in 44 512 articles by 18 February 2020. On average, news articles warning about the dangers of vaping were 130% (95% prediction interval (PI): −15 to 417) and searches for vaping cessation were 76% (95% PI: 28 to 182) higher than expected levels for the days during the period when the sources of the outbreak were unknown (25 July to 27 September 2019). News and searches stabilised just after the US Centers for Disease Control and Prevention reported that a primary source of the outbreak was an additive used in marijuana vapes on 27 September 2019. In sum, there were 12 286 articles archived in Tobacco Watcher primarily warning about the dangers of vaping and 1 025 000 cessation searches following the outbreak. Conclusion The vaping-induced pulmonary disease outbreak spawned increased coverage about the dangers of vaping and internet searches for vaping cessation. Resources and strategies that respond to this elevated interest should become a priority among public health leaders.
Expand Me   John W Ayers, Benjamin M Althouse, Adam Poliak, Eric C Leas, Alicia L Nobles, Mark Dredze, Davey Smith. Quantifying Public Interest in Police Reforms by Mining Internet Search Data Following George Floyd's Death. Journal of Medical Internet Research (JMIR), 2020;22(10):e22574. [PDF] [Bibtex] (Ranked in the top 3% of 16m research outputs by Altmetric)
Background: The death of George Floyd while in police custody has resurfaced serious questions about police conduct that result in the deaths of unarmed persons. Objective: Data-driven strategies that identify and prioritize the public's needs may engender a public health response to improve policing. We assessed how internet searches indicative of interest in police reform changed after Mr Floyd's death. Methods: We monitored daily Google searches (per 10 million total searches) that included the terms ``police'' and ``reform(s)'' (eg, ``reform the police,'' ``best police reforms,'' etc) originating from the United States between January 1, 2010, through July 5, 2020. We also monitored searches containing the term ``police'' with ``training,'' ``union(s),'' ``militarization,'' or ``immunity'' as markers of interest in the corresponding reform topics. Results: The 41 days following Mr Floyd's death corresponded with the greatest number of police ``reform(s)'' searches ever recorded, with 1,350,000 total searches nationally. Searches increased significantly in all 50 states and Washington DC. By reform topic, nationally there were 1,220,000 total searches for ``police'' and ``union(s)''; 820,000 for ``training''; 360,000 for ``immunity''; and 72,000 for ``militarization.'' In terms of searches for all policy topics by state, 33 states searched the most for ``training,'' 16 for ``union(s),'' and 2 for ``immunity.'' States typically in the southeast had fewer queries related to any police reform topic than other states. States that had a greater percentage of votes for President Donald Trump during the 2016 election searched more often for police ``union(s)'' while states favoring Secretary Hillary Clinton searched more for police ``training.'' Conclusions: The United States is at a historical juncture, with record interest in topics related to police reform with variability in search terms across states. Policy makers can respond to searches by considering the policies their constituencies are searching for online, notably police training and unions. Public health leaders can respond by engaging in the subject of policing and advocating for evidence-based policy reforms.
Expand Me   Eric C Leas, Erik M Hendrickson, Alicia L Nobles, Rory Todd, Davey M Smith, Mark Dredze, John W Ayers. Self-reported Cannabidiol (CBD) Use for Conditions With Proven Therapies. JAMA Network Open, 2020;3(10):e2020977. [PDF] [Bibtex] (Ranked in the top 1% of 16m research outputs by Altmetric)
Question Is the public using cannabidiol (CBD) to treat diagnosable conditions that have evidence-based therapies? Findings In this case series of 376 posts on a CBD forum on Reddit, most users reported taking CBD as a therapeutic for diagnosable conditions, including mental health, cardiological, dermatological, gastroenterological, ophthalmological, oral health, and sexual health conditions, many of which have other evidence-based treatment regimens. Meaning The findings suggest a need for interventions that address the use of CBD for unproven applications, including regulating therapeutic claims about CBD and redirecting patients to proven therapies in lieu of CBD.
Expand Me   Paiheng Xu, Mark Dredze, David A Broniatowski. The Twitter Social Mobility Index: Measuring Social Distancing Practices from Geolocated Tweets. Journal of Medical Internet Research (JMIR), 2020. [PDF] [Bibtex]
Background: Social distancing is an important component of the response to the novel Coronavirus (COVID-19) pandemic. Minimizing social interactions and travel reduces the rate at which the infection spreads, and "flattens the curve" such that the medical system can better treat infected individuals. However, it remains unclear how the public will respond to these policies as the pandemic continues. Objective: We present the Twitter Social Mobility Index, a measure of social distancing and travel derived from Twitter data. We use public geolocated Twitter data to measure how much a user travels in a given week. Methods: We collect 469,669,925 geotagged tweets from January 1, 2019 to April 27, 2020 in the United States. We analyze the aggregated mobility variance of a total of 3,768,959 Twitter users at the city and state level since the start of the COVID-19 pandemic. Results: We find a large reduction in travel in the United States after the implementation of social distancing policies, with larger reductions in states that were early adopters and smaller changes in states without policies. Our findings are presented on and we will continue to update our analysis during the pandemic. Conclusions: Geolocated tweets are an effective way to track social distancing practices from a public resource, suggesting that it can be used as part of ongoing pandemic response planning.
Expand Me   Amelia Jamison, David A Broniatowski, Michael C Smith, Kajal S Parikh, Adeena Malik, Mark Dredze, Sandra C Quinn. Adapting and Extending a Typology to Identify Vaccine Misinformation on Twitter. American Journal of Public Health (AJPH), 2020;110(S3):S331--S339. [PDF] [Bibtex]
Objectives. To adapt and extend an existing typology of vaccine misinformation to classify the major topics of discussion across the total vaccine discourse on Twitter. Methods. Using 1.8 million vaccine-relevant tweets compiled from 2014 to 2017, we adapted an existing typology to Twitter data, first in a manual content analysis and then using latent Dirichlet allocation (LDA) topic modeling to extract 100 topics from the data set. Results. Manual annotation identified 22% of the data set as antivaccine, of which safety concerns and conspiracies were the most common themes. Seventeen percent of content was identified as provaccine, with roughly equal proportions of vaccine promotion, criticizing antivaccine beliefs, and vaccine safety and effectiveness. Of the 100 LDA topics, 48 contained provaccine sentiment and 28 contained antivaccine sentiment, with 9 containing both. Conclusions. Our updated typology successfully combines manual annotation with machine-learning methods to estimate the distribution of vaccine arguments, with greater detail on the most distinctive topics of discussion. With this information, communication efforts can be developed to better promote vaccines and avoid amplifying antivaccine rhetoric on Twitter.
Expand Me   David A Broniatowski, Amelia M Jamison, Neil F Johnson, Nicol&#225;s Velasquez, Rhys Leahy, Nicholas Johnson Restrepo, Mark Dredze, Sandra C Quinn. Facebook Pages, the ``Disneyland'' Measles Outbreak, and Promotion of Vaccine Refusal as a Civil Right, 2009--2019. American Journal of Public Health (AJPH), 2020;110(S3):S312--S318. [PDF] [Bibtex] (Ranked in the top 0.03% of 16m research outputs by Altmetric)
Objectives. To understand changes in how Facebook pages frame vaccine opposition. Methods. We categorized 204 Facebook pages expressing vaccine opposition, extracting public posts through November 20, 2019. We analyzed posts from October 2009 through October 2019 to examine if pages' content was coalescing. Results. Activity in pages promoting vaccine choice as a civil liberty increased in January 2015, April 2016, and January 2019 (t[76] = 11.33 [P < .001]; t[46] = 7.88 [P < .001]; and t[41] = 17.27 [P < .001], respectively). The 2019 increase was strongest in pages mentioning US states (t[41] = 19.06; P < .001). Discussion about vaccine safety decreased (rs[119] = −0.61; P < .001) while discussion about civil liberties increased (rs[119] = 0.33; Py < .001]). Page categories increasingly resembled one another (civil liberties: rs[119] = −0.50 [P < .001]; alternative medicine: rs[84] = −0.77 [P < .001]; conspiracy theories: rs[119] = −0.46 [P < .001]; morality: rs[106] = −0.65 [P < .001]; safety and efficacy: rs[119] = −0.46 [P < .001]). Conclusions. The ``Disneyland'' measles outbreak drew vaccine opposition into the political mainstream, followed by promotional campaigns conducted in pages framing vaccine refusal as a civil right. Political mobilization in state-focused pages followed in 2019. Public Health Implications. Policymakers should expect increasing attempts to alter state legislation associated with vaccine exemptions, potentially accompanied by fiercer lobbying from specific celebrities.
Expand Me   Justin Sech, Alexandra DeLucia, Anna L Buczak, Mark Dredze. Civil Unrest on Twitter (CUT): A Dataset of Tweets to Support Research on Civil Unrest. EMNLP Workshop on Noisy User-generated Text (W-NUT), 2020. [PDF] [Bibtex] [Data]
We present CUT, a dataset for studying Civil Unrest on Twitter. Our dataset includes 4,381 tweets related to civil unrest, hand-annotated with information related to the study of civil unrest discussion and events. Our dataset is drawn from 42 countries from 2014 to 2019. We present baseline systems trained on this data for the identification of tweets related to civil unrest. We include a discussion of ethical issues related to research on this topic
Expand Me   Shijie Wu, Mark Dredze. Do explicit alignments robustly improve multilingual encoders? Empirical Methods in Natural Language Processing (EMNLP), 2020. [PDF] [Bibtex]
Multilingual BERT (Devlin et al., 2019, mBERT), XLM-RoBERTa (Conneau et al., 2019, XLMR) and other unsupervised multilingual encoders can effectively learn crosslingual representation. Explicit alignment objectives based on bitexts like Europarl or MultiUN have been shown to further improve these representations. However, word-level alignments are often suboptimal and such bitexts are unavailable for many languages. In this paper, we propose a new contrastive alignment objective that can better utilize such signal, and examine whether these previous alignment methods can be adapted to noisier sources of aligned data: a randomly sampled 1 million pair subset of the OPUS collection. Additionally, rather than report results on a single dataset with a single model run, we report the mean and standard derivation of multiple runs with different seeds, on four datasets and tasks. Our more extensive analysis finds that, while our new objective outperforms previous work, overall these methods do not improve performance with a more robust evaluation framework. Furthermore, the gains from using a better underlying model eclipse any benefits from alignment training. These negative results dictate more care in evaluating these methods and suggest limitations in applying explicit alignment objectives.
Expand Me   Keith Harrigian, Carlos Aguirre, Mark Dredze. Do Models of Mental Health Based on Social Media Data Generalize? Findings of the Empirical Methods in Natural Language Processing (EMNLP), 2020. [PDF] [Bibtex] [Code]
Proxy-based methods for annotating mental health status in social media have grown popular in computational research due to their ability to gather large training samples. However, an emerging body of literature has raised new concerns regarding the validity of these types of methods for use in clinical applications. To further understand the robustness of distantly supervised mental health models, we explore the generalization ability of machine learning classifiers trained to detect depression in individuals across multiple social media platforms. Our experiments not only reveal that substantial loss occurs when transferring between platforms, but also that there exist several unreliable confounding factors that may enable researchers to overestimate classification performance. Based on these results, we enumerate recommendations for future mental health dataset construction.
Expand Me   Amelia M Jamison, David A Broniatowski, Mark Dredze, Anu Sangraula, Michael C Smith, Sandra C Quinn. Not just conspiracy theories: vaccine opponents and pro-ponents add to the COVID-19 `infodemic' on Twitter. The Harvard Kennedy School (HKS) Misinformation Review, 2020. [PDF] [Bibtex]
In February 2020, the World Health Organization announced an `infodemic' --- a deluge of both accurate and inaccurate health information --- that accompanied the global pandemic of COVID-19 as a major challenge to effective health communication. We assessed content from the most active vaccine accounts on Twitter to understand how existing online communities contributed to the `infodemic' during the early stages of the pandemic. While we expected vaccine opponents to share misleading information about COVID-19, we also found vaccine proponents were not immune to spreading less reliable claims. In both groups, the single largest topic of discussion consisted of narratives comparing COVID-19 to other diseases like seasonal influenza, often downplaying the severity of the novel coronavirus. When considering the scope of the `infodemic,' researchers and health communicators must move beyond focusing on known bad actors and the most egregious types of misinformation to scrutinize the full spectrum of information --- from both reliable and unreliable sources --- that the public is likely to encounter online.
Expand Me   John W Ayers, Eric C Leas, Derek C Johnson, Adam Poliak, Benjamin M Althouse, Mark Dredze, Alicia L Nobles. Internet Searches for Acute Anxiety During the Early Stages of the COVID-19 Pandemic. JAMA Internal Medicine, 2020. [PDF] [Bibtex] (Ranked in the top 0.1% of 15m research outputs by Altmetric)
There is widespread concern that the coronavirus disease 2019 (COVID-19) pandemic may harm population mental health, chiefly owing to anxiety about the disease and its societal fallout. But traditional population mental health surveillance (eg, telephone surveys, medical records) is time consuming, expensive, and may miss persons who do not participate or seek care. To evaluate the association of COVID-19 with anxiety on a population basis, we examined internet searches indicative of acute anxiety during the early stages of the COVID-19 pandemic.
Expand Me   Alicia L Nobles, Eric C Leas, Mark Dredze, Christopher A Longhurst, Davey Smith, John W Ayers. Crowd-Diagnosis: When Patients Turn to Social Media to Obtain Clinical Diagnoses. Annual Symposium of the American Medical Informatics Association (AMIA), 2020. [PDF] [Bibtex]
In a 2019 study published in the Journal of the American Medical Association, we coined the term ``crowd-diagnosis'' to describe when people turn to public social media to obtain a diagnosis. Using a case study of sexually transmitted infections, we found thousands requesting crowd-diagnoses, commonly posting pictures to aid in diagnosis and sometimes seeking diagnoses to overrule a doctor's diagnosis.1 Our goal is to extend this work to a more general setting by focusing on a popular social media forum dedicated to obtaining feedback on medical conditions and answering (RQ1) who requests crowd-diagnoses, (RQ2) for what health issues are crowd-diagnoses more frequently sought, and (RQ3) what crowd-diagnosis requests are most likely to receive a response?
Expand Me   Alicia L Nobles, Eric C Leas, Seth Noar, Mark Dredze, Carl A Latkin, Steffanie A Strathdee, John W Ayers. Automated image analysis of instagram posts: Implications for risk perception and communication in public health using a case study of #HIV. PLoS One, 2020. [PDF] [Bibtex]
People's perceptions about health risks, including their risk of acquiring HIV, are impacted in part by who they see portrayed as at risk in the media. Viewers in these cases are asking themselves ``do those portrayed as at risk look like me?'' An accurate perception of risk is critical for high-risk populations, who already suffer from a range of health disparities. Yet, to date no study has evaluated the demographic representation of health-related content from social media. The objective of this case study was to apply automated image recognition software to examine the demographic profile of faces in Instagram posts containing the hashtag #HIV (obtained from January 2017 through July 2018) and compare this to the demographic breakdown of those most at risk of a new HIV diagnosis (estimates of incidence of new HIV diagnoses from the 2017 US Centers for Disease Control HIV Surveillance Report). We discovered 26,766 Instagram posts containing #HIV authored in American English with 10,036 (37.5%) containing a detectable human face with a total of 18,227 faces (mean = 1.8, standard deviation [SD] = 1.7). Faces skewed older (47% vs. 11% were 35--39 years old), more female (41% vs. 19%), more white (43% vs. 26%), less black (31% vs 44%), and less Hispanic (13% vs 25%) on Instagram than for new HIV diagnoses. The results were similarly skewed among the subset of #HIV posts mentioning pre-exposure prophylaxis (PrEP). This disparity might lead Instagram users to potentially misjudge their own HIV risk and delay prophylactic behaviors. Social media managers and organic advocates should be encouraged to share images that better reflect at-risk populations so as not to further marginalize these populations and to reduce disparity in risk perception. Replication of our methods for additional diseases, such as cancer, is warranted to discover and address other misrepresentations.
Expand Me   Theodore L Caputi, John W Ayers, Mark Dredze, Nicholas Suplina, Sarah Burd-Sharps. Collateral Crises of Gun Preparation and the COVID-19 Pandemic: Infodemiology Study. JMIR Public Health and Surveillance, 2020;6(2):e19369. [PDF] [Bibtex]
Background: In the past, national emergencies in the United States have resulted in increased gun preparation (ie, purchasing new guns or removing guns from storage); in turn, these gun actions have effected increases in firearm injuries and deaths. Objective: The aim of this paper was to assess the extent to which interest in gun preparation has increased amid the coronavirus disease (COVID-19) pandemic using data from Google searches related to purchasing and cleaning guns. Methods: We fit an Autoregressive Integrated Moving Average (ARIMA) model over Google search data from January 2004 up to the week that US President Donald Trump declared COVID-19 a national emergency. We used this model to forecast Google search volumes, creating a counterfactual of the number of gun preparation searches we would expect if the COVID-19 pandemic had not occurred, and reported observed deviations from this counterfactual. Results: Google searches related to preparing guns have surged to unprecedented levels, approximately 40% higher than previously reported spikes following the Sandy Hook, CT and Parkland, FL shootings and 158% (95% CI 73-270) greater than would be expected if the COVID-19 pandemic had not occurred. In absolute terms, approximately 2.1 million searches related to gun preparation were performed over just 34 days. States severely affected by COVID-19 appear to have some of the greatest increases in the number of searches. Conclusions: Our results corroborate media reports that gun purchases are increasing amid the COVID-19 pandemic and provide more precise geographic and temporal trends. Policy makers should invest in disseminating evidence-based educational tools about gun risks and safety procedures to avert a collateral public health crisis.
Expand Me   Shijie Wu, Mark Dredze. Are All Languages Created Equal in Multilingual BERT? ACL Workshop on Representation Learning for NLP (RepL4NLP), 2020. [PDF] [Bibtex] (Best Long Paper Award)
Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer with high-resource languages, covering only a third of the languages covered by mBERT. We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance. We consider three tasks: Named Entity Recognition (99 languages), Part-of-speech Tagging and Dependency Parsing (54 languages each). mBERT does better than or comparable to baselines on high resource languages but does much worse for low resource languages. Furthermore, monolingual BERT models for these languages do even worse. Paired with similar languages, the performance gap between monolingual BERT and mBERT can be narrowed. We find that better models for low resource languages require more efficient pretraining techniques or more data.
Expand Me   Michael Liu, Theodore L Caputi, Mark Dredze, Aaron S Kesselheim, John W Ayers. Internet Searches for Unproven COVID-19 Therapies in the United States. JAMA Internal Medicine, 2020. [PDF] [Bibtex] (Ranked in the top 0.2% of 15m research outputs by Altmetric)
There are no highly effective prescription drug therapies supported by any reliable evidence for the ongoing coronavirus disease 2019 (COVID-19) pandemic of severe acute respiratory syndrome coronavirus 2. However, fears among the public can lead to searches for unproven therapies. Therefore, when several high-profile figures, including entrepreneur Elon Musk and President Donald Trump, endorsed the use of chloroquine, a malarial prophylaxis drug, and hydroxychloroquine (with the antibiotic azithromycin), a lupus and rheumatoid arthritis treatment, to treat COVID-19, it drew massive public attention that could shape individual decision-making. This attention is especially troublesome because chloroquine and hydroxychloroquine (1) are thus far only known to inhibit severe acute respiratory syndrome coronavirus 2 in vitro,1 (2) have potential cardiovascular toxic effects,2 and (3) can be confused with commercially available chloroquine-containing products, such as aquarium cleaner. Poisonings, including 1 fatality, attributed to persons taking chloroquine to prevent or treat COVID-19 without the supervision of a licensed physician have already been reported.3 To better understand the scope of demand for these drugs, we examined internet searches indicative of shopping for chloroquine and hydroxychloroquine.
Expand Me   Dian Hu, Christine Martin, Mark Dredze, David A Broniatowski. Chinese social media suggest decreased vaccine acceptance in China: An observational study on Weibo following the 2018 Changchun Changsheng vaccine incident. Vaccine, 2020;38(13):2764-2770. [PDF] [Bibtex]
China is home to the world's largest population with the potential for disease outbreaks to affect billions. However, knowledge of Chinese vaccine acceptance trends is limited. In this work we use Chinese social media to track responses to the recent Changchun Changsheng Biotechnology vaccine scandal, which led to extensive discussion regarding vaccine safety and regulation in China. We analyzed messages from the popular Chinese microblogging platform Sina Weibo in July 2018 (11
     David A Broniatowski, Sandra C Quinn, Mark Dredze, Amelia M Jamison. Vaccine Communication as Weaponized Identity Politics. American Journal of Public Health (AJPH), 2020;110(5):617--618. [PDF] [Bibtex]
Expand Me   Elliot Schumacher, Andriy Mulyar, Mark Dredze. Clinical Concept Linking with Contextualized Neural Representations. Association for Computational Linguistics (ACL), 2020. [PDF] [Bibtex]
In traditional approaches to entity linking, linking decisions are based on three sources of information -- the similarity of the mention string to an entity's name, the similarity of the context of the document to the entity, and broader information about the knowledge base (KB). In some domains, there is little contextual information present in the KB and thus we rely more heavily on mention string similarity. We consider one example of this, concept linking, which seeks to link mentions of medical concepts to a medical concept ontology. We propose an approach to concept linking that leverages recent work in contextualized neural models, such as ELMo (Peters et al. 2018), which create a token representation that integrates the surrounding context of the mention and concept name. We find a neural ranking approach paired with contextualized embeddings provides gains over a competitive baseline (Leaman et al. 2013). Additionally, we find that a pre-training step using synonyms from the ontology offers a useful initialization for the ranker.
Expand Me   David Mueller, Nicholas Andrews, Mark Dredze. Sources of Transfer in Multilingual Named Entity Recognition. Association for Computational Linguistics (ACL), 2020. [PDF] [Bibtex]
Named-entities are inherently multilingual, and annotations in any given language may be limited. This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice: naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomena, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters.
Expand Me   Manya Wadhwa, Silvio Amir, Mark Dredze. Aligning Public Feedback To Requests For Comments On International Conference on Web and Social Media (ICWSM), 2020. [PDF] [Bibtex]
In an effort to democratize the regulatory process, the United States Federal government created, a portal through which federal agencies can share proposed regulations and solicit feedback from the public. A proposed regulation will contain several requests for feedback on specific topics, and the public can then submit comments in response. While this reduces barriers to soliciting feedback, it still leaves regulators with a challenge: how to produce a summary and incorporate feedback from the sometimes tens of thousands of submitted comments. We propose an information retrieval system by which comments are aligned to specific regulatory requests. We evaluate several measures of semantic similarity for matching comments to information requests. We evaluate our proposed system over a dataset containing several regulations proposed for electronic cigarettes, an issue that energized tens of thousands of comments in response.
Expand Me   Alicia L Nobles, Eric C Leas, Mark Dredze, John W Ayers. Examining Peer-to-Peer and Patient-Provider Interactions on a Social Media Community Facilitating Ask the Doctor Services. International Conference on Web and Social Media (ICWSM), 2020. [PDF] [Bibtex]
Ask the Doctor (AtD) services provide patients the opportunity to seek medical advice using online platforms. While these services represent a new mode of healthcare delivery, study of these online health communities and how they are used is limited. In particular, it is unknown if these platforms replicate existing barriers and biases in traditional healthcare delivery across demographic groups. We present an analy- sis of AskDocs, a subreddit that functions as a public AtD platform on social media. We examine the demographics of users, the health topics discussed, if biases present in offline healthcare settings exist on this platform, and how empathy is expressed in interactions between users and physicians. Our findings suggest a number of implications to enhance and support peer-to-peer and patient-provider interactions on online platforms.
     Joshua Dredze, Mark Dredze. Theoretical Orientations: Measuring Online Information Seeking from Google Search Queries. Association for Psychological Science Annual Convention (APS) (Conference canceled), 2020. [PDF] [Bibtex]
Expand Me   Alicia L Nobles, Eric C Leas, Carl A Latkin, Mark Dredze, Steffanie A Strathdee, John W Ayers. #HIV: Alignment of HIV-Related Visual Content on Instagram with Public Health Priorities in the US. AIDS and Behavior, 2020. [PDF] [Bibtex]
Instagram, with more than 1 billion monthly users, is the go-to social media platform to chronicle one's life via images, but how are people using the platform to present visual content about HIV? We analyzed public Instagram posts containing the hashtag ``#HIV'' (because they are self-tagged as related to HIV) between January 2017 and July 2018. We described the prevalence of co-occurring hashtags and explored thematic concepts in the images using automated image recognition and topic modeling. Twenty-eight percent of all #HIV posts included hashtags focused on awareness, followed by LGBTQ (24.5%) and living with HIV (17.9%). However, specific strategies were rarely cited, including testing (10.8%), treatment (10.3%), PrEP (6.2%) and condoms (4.1%). Image analyses revealed 44.5% of posts included infographics followed by people (21.3%) thereby humanizing HIV and stigmatized populations and promoting community mobilization. Novel content such as the handwriting image-theme (3.8%) where posters shared their HIV test results appeared. We discuss how this visual content aligns with public health priorities to reduce HIV in the US and the novel, organic messages that public health could help amplify.

     2019 (20 Publications)
Expand Me   Benjamin M Althouse, Daniel M Weinberger, Samuel V Scarpino, Virginia E Pitzer, John W Ayers, Edward Wenger, Isaac Chun-Hai Fung, Mark Dredze, Hao Hu. Google searches accurately forecast RSV hospitalizations. Chest Infections, 2019. [PDF] [Bibtex]
Hospitalization of children with respiratory syncytial virus (RSV) is common and costly. Traditional sources of hospitalization data, useful for public health decision-makers and physicians to make decisions, are themselves costly to acquire and are subject to delays from gathering to publication. Here we use Google searches for RSV as a proxy for RSV hospitalizations.
Expand Me   Amelia M Jamison, David A Broniatowski, Mark Dredze, Zach Wood-Doughty, Dure-Aden Khan, Sandra Crouse-Quinn. Vaccine-related advertising in the Facebook Ad Archive. Vaccine, 2019. [PDF] [Bibtex] (Ranked in the top 0.1% of 14.1m research outputs by Altmetric)
Background. In 2018, Facebook introduced Ad Archive as a platform to improve transparency in advertisements related to politics and ``issues of national importance.'' Vaccine-related Facebook advertising is publicly available for the first time. After measles outbreaks in the US brought renewed attention to the possible role of Facebook advertising in the spread of vaccine-related misinformation, Facebook announced steps to limit vaccine-related misinformation. This study serves as a baseline of advertising before new policies went into effect. Methods. Using the keyword `vaccine', we searched Ad Archive on December 13, 2018 and again on February 22, 2019. We exported data for 505 advertisements. A team of annotators sorted advertisements by content: pro-vaccine, anti-vaccine, not relevant. We also conducted a thematic analysis of major advertising themes. We ran Mann-Whitney U tests to compare ad performance metrics. Results. 309 advertisements were included in analysis with 163 (53%) pro-vaccine advertisements and 145 (47%) anti-vaccine advertisements. Despite a similar number of advertisements, the median number of ads per buyer was significantly higher for anti-vaccine ads. First time buyers are less likely to complete disclosure information and risk ad removal. Thematically, anti-vaccine advertising messages are relatively uniform and emphasize vaccine harms (55%). In contrast, pro-vaccine advertisements come from a diverse set of buyers (83 unique) with varied goals including promoting vaccination (49%), vaccine related philanthropy (15%), and vaccine related policy (14%). Conclusions. A small set of anti-vaccine advertisement buyers have leveraged Facebook advertisements to reach targeted audiences. By deeming all vaccine-related content an issue of ``national importance,'' Facebook has further the politicized vaccines. The implementation of a blanket disclosure policy also limits which ads can successfully run on Facebook. Improving transparency and limiting misinformation should not be separate goals. Public health communication efforts should consider the potential impact on Facebook users' vaccine attitudes and behaviors.
Expand Me   Elliot Schumacher, Mark Dredze. Learning unsupervised contextual representations for medical synonym discovery. JAMIA Open, 2019. [PDF] [Bibtex]
Objectives. An important component of processing medical texts is the identification of synonymous words or phrases. Synonyms can inform learned representations of patients or improve linking mentioned concepts to medical ontologies. However, medical synonyms can be lexically similar (``dilated RA'' and ``dilated RV'') or dissimilar (``cerebrovascular accident'' and ``stroke''); contextual information can determine if 2 strings are synonymous. Medical professionals utilize extensive variation of medical terminology, often not evidenced in structured medical resources. Therefore, the ability to discover synonyms, especially without reliance on training data, is an important component in processing training notes. The ability to discover synonyms from models trained on large amounts of unannotated data removes the need to rely on annotated pairs of similar words. Models relying solely on non-annotated data can be trained on a wider variety of texts without the cost of annotation, and thus may capture a broader variety of language. Materials and Methods. Recent contextualized deep learning representation models, such as ELMo (Peters et al., 2019) and BERT, (Devlin et al. 2019) have shown strong improvements over previous approaches in a broad variety of tasks. We leverage these contextualized deep learning models to build representations of synonyms, which integrate the context of surrounding sentence and use character-level models to alleviate out-of-vocabulary issues. Using these models, we perform unsupervised discovery of likely synonym matches, which reduces the reliance on expensive training data. Results. We use the ShARe/CLEF eHealth Evaluation Lab 2013 Task 1b data to evaluate our synonym discovery method. Comparing our proposed contextualized deep learning representations to previous non-neural representations, we find that the contextualized representations show consistent improvement over non-contextualized models in all metrics. Conclusions. Our results show that contextualized models produce effective representations for synonym discovery. We expect that the use of these representations in other tasks would produce similar gains in performance.
Expand Me   Alicia L Nobles, Eric C Leas, Benjamin M Althouse, Mark Dredze, Christopher A Longhurst, Davey M Smith, John W Ayers. Requests for Diagnoses of Sexually Transmitted Diseases on a Social Media Platform. Journal of the American Medical Association (JAMA), 2019;322(17):1712-1713. [PDF] [Bibtex] (Ranked in the top 0.05% of 13.9m research outputs by Altmetric)
Although many studies document the use of social media for sharing and requesting information on specific health conditions,1,2 whether individuals obtain diagnoses on social media platforms has not been investigated.3,4 The occurrence of requests for a diagnosis on social media (crowd-diagnosis) and determination as to whether the requested diagnosis was for a second opinion after seeing a health care professional were evaluated in a case study.
Expand Me   Eric C Leas, Alicia L Nobles, Theodore L Caputi, Mark Dredze, Davey M Smith, John W Ayers. Trends in Internet Searches for Cannabidiol (CBD) in the United States. JAMA Network Open, 2019;2(10):e1913853. [PDF] [Bibtex] (Ranked in the top 0.5% of 13.6m research outputs by Altmetric)
Cannabidiol (CBD) is widely promoted as a panacea. For example, the cannabis brand MedMen claims CBD treats acne, anxiety, opioid addiction, pain, and menstrual problems.1 However, the US Food and Drug Administration has only approved highly purified CBD (Epidiolex) for treating epilepsy. To our knowledge, there is currently no population-focused surveillance of public interest in CBD. Consequently, many question whether CBD should be prioritized by public health leaders and regulators. This article describes public interest in CBD within the United States.
Expand Me   Eric C Leas, Mark Dredze, John W Ayers. Ignoring Data Delays Our Reaction to Emerging Public Health Tragedies Like 13 Reasons Why. JAMA Psychiatry, 2019. [PDF] [Bibtex]
We applaud Niederkrotenthaler and colleagues1 for adding another layer of evidence that 13 Reasons Why is harming the public by pushing some individuals toward suicide. However, their dismissal of some of the earliest evidence on this subject deserves a revision not because it undermines their central claim but because it makes it even stronger and can make psychiatric epidemiology more actionable in the future.
Expand Me   Andriy Mulyar, Elliot Schumacher, Masoud Rouhizadeh, Mark Dredze. Phenotyping of Clinical Notes with Improved Document Classification Models Using Contextualized Neural Language Models. NeurIPS Workshop on Machine Learning for Health (ML4H), 2019. [PDF] [Bibtex]
Clinical notes contain an extensive record of a patient's health status, such as smoking status or the presence of heart conditions. However, this detail is not replicated within the structured data of electronic health systems. Phenotyping, the extraction of patient conditions from free clinical text, is a critical task which supports a variety of downstream applications such as decision support and secondary use of medical records. Previous work has resulted in systems which are high performing but require hand engineering, often of rules. Recent work in pretrained contextualized language models have enabled advances in representing text for a variety of tasks. We therefore explore several architectures for modeling phenotyping that rely solely on BERT representations of the clinical note, removing the need for manual engineering. We find these architectures are competitive with or outperform existing state of the art methods on two phenotyping tasks.
Expand Me   Zachary Wood-Doughty, David Broniatowski, Mark Dredze. Machine Learning Classifiers for Socio-Demographics of Social Media Users: Limitations and Possibilities. American Public Health Association (APHA), 2019. [PDF] [Bibtex]
Background: Social media analyses of health behaviors, such as vaccination, have shown significant promise for surveillance. However, existing research, both quantitative and qualitiative, suggests that health behaviors varies significantly with socio-demographic factors, particularly race, ethnicity, and gender. Often, these factors are not explicitly disclosed on social media platforms and must instead be inferred, raising the possibility of methodological bias. This session examines existing tools and methodologies used to infer socio-demographics of social media users. Methods: We survey several socio-demographic classifiers that work with Twitter and Reddit data. These classifiers use features including users' language patterns, follower behaviors, and choice of names. These classifiers predict labels including users' gender, race and ethnicity, or filter out social media accounts run by organizations. Results: We explain how the data for these classifiers is collected, how the classification models are trained, and how they could be applied to public health research. We in particular discuss the limitations that these classifiers have, including possible methodological bias introduced by the challenges of large-scale data collection of social media users' demographic information. Discussion: Health behaviors vary with socio-demographic factors, which are challenging to measure on social media platforms. Machine learning classification of socio-demographics is possible, but requires interdisciplinary considerations.
Expand Me   Yuchen Zhou, Mark Dredze, David A Broniatowski, William D Adler. Elites and foreign actors among the alt-right: The Gab social media platform. First Monday, 2019. [PDF] [Bibtex] (Ranked in the top 3% of 13.6m research outputs by Altmetric)
Content regulation and censorship of social media platforms is increasingly discussed by governments and the platforms themselves. To date, there has been little data-driven analysis of the effects of regulated content deemed inappropriate on online user behavior. We therefore compared Twitter --- a popular social media platform that occasionally removes content in violation of its Terms of Service --- to Gab --- a platform that markets itself as completely unregulated. Launched in mid-2016, Gab is, in practice, dominated by individuals who associate with the ``alt-right'' political movement in the United States. Despite its billing as ``The Free Speech Social Network,'' Gab users display more extreme social hierarchy and elitism when compared to Twitter. Although the framing of the site welcomes all people, Gab users' content is more homogeneous, preferentially sharing material from sites traditionally associated with the extremes of American political discourse, especially the far right. Furthermore, many of these sites are associated with state-sponsored propaganda from foreign governments. Finally, we discovered a significant presence of German language posts on Gab, with several topics focusing on German domestic politics, yet sharing significant amounts of content from U.S. and Russian sources. These results indicate possible emergent linkages between domestic politics in European and American far right political movements. Implications for regulation of social media platforms are discussed.
Expand Me   Michelle R Kaufman, Debangan Dey, Ciprian Crainiceanu, Mark Dredze. #MeToo and Google Inquiries Into Sexual Violence: A Hashtag Campaign Can Sustain Information Seeking. Journal of Interpersonal Violence, 2019. [PDF] [Bibtex]
The #MeToo Movement has brought new attention to sexual harassment and assault. While the movement originates with activist Tarana Burke, actor Alyssa Milano used the phrase on Twitter in October 2017 in response to multiple sexual harassment allegations against Hollywood producer Harvey Weinstein. Within 24 hours, 53,000 people tweeted comments and/or shared personal experiences of sexual violence. The study objective was to measure how information seeking via Google searches for sexual harassment and assault changed following Milano's tweet and whether this change was sustained in spite of celebrity scandals. Weekly Google search inquiries in the United States were downloaded for the terms metoo, sexual assault, sexual harassment, sexual abuse, and rape for January 1, 2017 to July 15, 2018. Seven related news events about perpetrator accusations were considered. Results showed that searches for metoo increased dramatically after the Weinstein accusation and stayed high during subsequent accusations. A small decrease in searches followed, but the number remained very high relative to baseline (the period before the Weinstein accusation). Searches for sexual assault and sexual harassment increased substantially immediately following the Weinstein accusation, stayed high during subsequent accusations, and saw a decline after the accusation of Matt Lauer (talk show host; last event considered). We estimated a 40% to 70% reduction in searches 6 months after the Lauer accusation, though the increase in searches relative to baseline remained statistically significant. For sexual abuse and rape, the number of searches returned close to baseline by 6 months. It appears that the #MeToo movement sparked greater information seeking that was sustained beyond the associated events. Given its recent ubiquitous use in the media and public life, hashtag activism such as #MeToo can be used to draw further attention to the next steps in addressing sexual assault and harassment, moving public web inquiries from information seeking to action.
Expand Me   Shijie Wu, Mark Dredze. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. Empirical Methods in Natural Language Processing (EMNLP), 2019. [PDF] [Bibtex]
Pretrained contextual representation models (Peters et al.,2018; Devlin et al.,2019) have pushed forward the state-of-the-art on many NLP tasks. A new release of BERT (Devlin,2018) includes a model simultaneously pre-trained on 104 languages with impressive performance for zero-shot cross-lingual transfer on a natural language inference task. This paper explores the broader cross-lingual potential of mBERT (multilingual) as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification,NER, POS tagging, and dependency parsing. We compare mBERT with the best-published methods for zero-shot cross-lingual transfer and find mBERT competitive on each task. Additionally, we investigate the most effective strategy for utilizing mBERT in this manner, determine to what extent mBERT generalizes away from language-specific features, and measure factors that influence cross-lingual transfer.
Expand Me   Tao Chen, Mark Dredze, Jonathan P Weiner, Hadi Kharrazi. Identifying vulnerable older adult populations by contextualizing geriatric syndrome information in clinical notes of electronic health records. Journal of the American Medical Informatics Association (JAMIA), 2019. [PDF] [Bibtex]
Objective Geriatric syndromes such as functional disability and lack of social support are often not encoded in electronic health records (EHRs), thus obscuring the identification of vulnerable older adults in need of additional medical and social services. In this study, we automatically identify vulnerable older adult patients with geriatric syndrome based on clinical notes extracted from an EHR system, and demonstrate how contextual information can improve the process. Materials and Methods We propose a novel end-to-end neural architecture to identify sentences that contain geriatric syndromes. Our model learns a representation of the sentence and augments it with contextual information: surrounding sentences, the entire clinical document, and the diagnosis codes associated with the document. We trained our system on annotated notes from 85 patients, tuned the model on another 50 patients, and evaluated its performance on the rest, 50 patients. Results Contextual information improved classification, with the most effective context coming from the surrounding sentences. At sentence level, our best performing model achieved a micro-F1 of 0.605, significantly outperforming context-free baselines. At patient level, our best model achieved a micro-F1 of 0.843. Discussion Our solution can be used to expand the identification of vulnerable older adults with geriatric syndromes. Since functional and social factors are often not captured by diagnosis codes in EHRs, the automatic identification of the geriatric syndrome can reduce disparities by ensuring consistent care across the older adult population. Conclusion EHR free-text can be used to identify vulnerable older adults with a range of geriatric syndromes.
Expand Me   Silvio Amir, Mark Dredze, John W Ayers. Population Level Mental Health Surveillance over Social Media with Digital Cohorts. NAACL Workshop on Computational Linguistics and Clinical Psychology, 2019. [PDF] [Bibtex]
The ability to track mental health conditions via social media opened the doors for largescale, automated, mental health surveillance. However, inferring accurate population-level trends requires representative samples of the underlying population, which can be challenging given the biases inherent in social media data. While previous work has adjusted samples based on demographic estimates, the populations were selected based on specific outcomes, e.g. specific mental health conditions. We depart from these methods, by conducting analyses over demographically representative digital cohorts of social media users. To validated this approach, we constructed a cohort of US based Twitter users to measure the prevalence of depression and PTSD, and investigate how these illnesses manifest across demographic subpopulations. The analysis demonstrates that cohort-based studies can help control for sampling biases, contextualize outcomes, and provide deeper insights into the data.
Expand Me   Alicia L Nobles, Mark Dredze, John W Ayers. Repeal and replace": increased demand for intrauterine devices following the 2016 presidential election. Contraception, 2019. [PDF] [Bibtex]
Objective To evaluate public's interest in contraceptive options following heightened focus on a repeal of the Affordable Care Act (ACA) since the 2016 United States presidential election. Study design We monitored the fraction of Google searches emerging from the United States for the three most popular reversible contraceptive methods --- oral contraceptives, intrauterine devices (IUDs) and condoms --- from January 1, 2004, through October 31, 2017 (1 year after the presidential election). Results IUD searches were cumulatively 15% (95% CI: 10 to 20) higher than expected the year following the 2016 election, reflecting 10 to 21 million excess searches. IUD searches were statistically significantly higher in all states, except NV, and were consistent across states won by Trump or Clinton (Welch t test=0.60, p=.548). Conversely, searches for oral contraceptives and condoms remained stable (0%; 95% CI: −2 to 1) or declined (−4%; 95% CI: −5 to −2), respectively, following the election. Conclusions The etiology of increased searches for IUDs is likely multifaceted. However, it may largely be because IUDs will confer continued protection even after an ACA repeal, thereby providing a medical hedge against a possible repeal. Regardless, these data suggest the heightened focus on an ACA repeal is a concern to the record number of Americans seeking out information about IUDs.
Expand Me   Tao Chen, Mark Dredze, Jonathan P Weiner, Leilani Hernandez, Joe Kimura, Hadi Kharrazi. Extraction of Geriatric Syndromes From Electronic Health Record Clinical Notes: Assessment of Statistical Natural Language Processing Methods. JMIR Medical Informatics, 2019. [PDF] [Bibtex]
Background: Geriatric syndromes in older adults are associated with adverse outcomes. However, despite being reported in clinical notes, these syndromes are often poorly captured by diagnostic codes in the structured fields of electronic health records (EHRs) or administrative records. Objective: We aim to automatically determine if a patient has any geriatric syndromes by mining the free text of associated EHR clinical notes. We assessed which statistical natural language processing (NLP) techniques are most effective. Methods: We applied conditional random fields (CRFs), a widely used machine learning algorithm, to identify each of 10 geriatric syndrome constructs in a clinical note. We assessed three sets of features and attributes for CRF operations: a base set, enhanced token, and contextual features. We trained the CRF on 3901 manually annotated notes from 85 patients, tuned the CRF on a validation set of 50 patients, and evaluated it on 50 held-out test patients. These notes were from a group of US Medicare patients over 65 years of age enrolled in a Medicare Advantage Health Maintenance Organization and cared for by a large group practice in Massachusetts. Results: A final feature set was formed through comprehensive feature ablation experiments. The final CRF model performed well at patient-level determination (macroaverage F1=0.834, microaverage F1=0.851); however, performance varied by construct. For example, at phrase-partial evaluation, the CRF model worked well on constructs such as absence of fecal control (F1=0.857) and vision impairment (F1=0.798) but poorly on malnutrition (F1=0.155), weight loss (F1=0.394), and severe urinary control issues (F1=0.532). Errors were primarily due to previously unobserved words (ie, out-of-vocabulary) and a lack of context. Conclusions: This study shows that statistical NLP can be used to identify geriatric syndromes from EHR-extracted clinical notes. This creates new opportunities to identify patients with geriatric syndromes and study their health outcomes.
Expand Me   Elliot Schumacher, Mark Dredze. Discriminative Candidate Generation for Medical Concept Linking. Knowledge Base Construction (AKBC), 2019. [PDF] [Bibtex]
Linking mentions of medical concepts in a clinical note to a concept in an ontology enables a variety of tasks that rely on understanding the content of a medical record, such as identifying patient populations and decision support. Medical concept linking can be formulated as a two-step task; 1) candidate generator, which selects likely candidates from the ontology for the given mention, and 2) a ranker, which orders the candidates based on a set of features to find the best one.In this paper, we propose a candidate generation system based on the DiscK framework [Chen andVan Durme, 2017]. Our system produces a candidate list with both high coverage and a rankingthat is a useful starting point for the second step of the linking process. we integrate our candidate selection process into a current linking system, DNorm [Leaman et al., 2013]. The resulting system achieves similar accuracy paired with with a gain in efficiency due to a large reduction in the number of potential candidates considered.
Expand Me   Ran Zhao, Yuntian Deng, Mark Dredze, Arun Verma, David Rosenberg, Amanda Stent. Visual Attention Model for Cross-sectional Stock Return Prediction and End-to-End Multimodal Market Representation Learning. The Florida Artificial Intelligence Research Society (FLAIRS), 2019. [PDF] [Bibtex]
Technical and fundamental analysis are traditional tools used to analyze stocks; however, the finance literature has shown that the price movement of each individual stock is highly correlated with that of other stocks, especially those within the same sector. In this paper we propose a general- purpose market representation that incorporates fundamental and technical indicators and relationships between individual stocks. We treat the daily stock market as a `market image' where rows (grouped by market sector) represent individual stocks and columns represent indicators. We apply a convo- lutional neural network over this market image to build market features in a hierarchical way. We use a recurrent neural network, with an attention mechanism over the market fea- ture maps, to model temporal dynamics in the market. Our model outperforms strong baselines in both short-term and long-term stock return prediction tasks. We also show another use for our market image: to construct concise and dense mar- ket embeddings suitable for downstream prediction tasks.
     Joshua Dredze, Lisi Dredze, Mark Dredze. Measuring Online Information Seeking for Stimulants from Google Search Queries. American Psychological Association (APA), 2019. [PDF] [Bibtex]
Expand Me   John W Ayers, Alicia L Nobles, Mark Dredze. Media Trends for the Substance Abuse and Mental Health Services Administration 800-662-HELP Addiction Treatment Referral Services After a Celebrity Overdose. JAMA Internal Medicine, 2019. [PDF] [Bibtex] (Ranked in the top 1% of 12.5m research outputs by Altmetric)
Despite a substantial investment in evidence-based addiction resources, only 10% of US individuals who need treatment for drug addiction receive it.1 The Substance Abuse and Mental Health Services Administration (SAMHSA) national helpline (800-662-HELP) is the only free, federally managed and endorsed US addiction treatment referral service, helping callers find the best local services that match their needs. We sought to determine awareness of the avialability of this free resource.
Expand Me   Xiaolei Huang, Michael C Smith, Amelia M Jamison, David A Broniatowski, Mark Dredze, Sandra Crouse Quinn, Justin Cai, Michael J Paul. Can online self-reports assist in real-time identification of influenza vaccination uptake? A cross-sectional study of influenza vaccine-related tweets in the USA, 2013--2017. BMJ Open, 2019. [PDF] [Bibtex]
Introduction The Centers for Disease Control and Prevention (CDC) spend significant time and resources to track influenza vaccination coverage each influenza season using national surveys. Emerging data from social media provide an alternative solution to surveillance at both national and local levels of influenza vaccination coverage in near real time. Objectives This study aimed to characterise and analyse the vaccinated population from temporal, demographical and geographical perspectives using automatic classification of vaccination-related Twitter data. Methods In this cross-sectional study, we continuously collected tweets containing both influenza-related terms and vaccine-related terms covering four consecutive influenza seasons from 2013 to 2017. We created a machine learning classifier to identify relevant tweets, then evaluated the approach by comparing to data from the CDC's FluVaxView. We limited our analysis to tweets geolocated within the USA. Results We assessed 1 124 839 tweets. We found strong correlations of 0.799 between monthly Twitter estimates and CDC, with correlations as high as 0.950 in individual influenza seasons. We also found that our approach obtained geographical correlations of 0.387 at the US state level and 0.467 at the regional level. Finally, we found a higher level of influenza vaccine tweets among female users than male users, also consistent with the results of CDC surveys on vaccine uptake. Conclusion Significant correlations between Twitter data and CDC data show the potential of using social media for vaccination surveillance. Temporal variability is captured better than geographical and demographical variability. We discuss potential paths forward for leveraging this approach.

     2018 (18 Publications)
Expand Me   John W Ayers, Mark Dredze, Eric C Leas, Theodore L Caputi, Jon-Patrick Allem, Joanna E Cohen. Next generation media monitoring: Global coverage of electronic nicotine delivery systems (electronic cigarettes) on Bing, Google and Twitter, 2013-2018. PloS one, 2018;13(11):e0205822. [PDF] [Bibtex]
News media monitoring is an important scientific tool. By treating news reporters as data collectors and their reports as qualitative accounts of a fast changing public health landscape, researchers can glean many valuable insights. Yet, there have been surprisingly few innovations in public health media monitoring, with nearly all studies relying on labor-intensive content analyses limited to a small number of media reports. We propose to advance this subfield by using scalable machine learning. In potentially the largest contemporary public health media monitoring study to date, we systematically characterize global news reports surrounding electronic cigarettes or electronic nicotine delivery systems (ENDS) using natural language processing techniques. News reports including ENDS terms (e.g., ``electronic cigarettes'') from over 100,000 sources (all sources archived on Google News or Bing News, as well as all news articles shared on Twitter) were monitored for 1 January 2013 through 31 July 2018. The geographic and subject (e.g., prevalence, bans, quitting, warnings, marketing, prices, age, flavor and industry) foci of news articles, their popularity among readers who share news on social media, and the sentiment behind news articles were assessed algorithmically. Globally there were 86,872 ENDS news reports with coverage increasing from 8 (standard deviation [SD] = 8) stories per day in 2013 to 75 (SD = 56) stories per day during 2018. The focus of ENDS news spanned 148 nations, with the plurality focusing on the United States (34% of all news). Potentially overlooked hotspots of ENDS media activity included China, Egypt, Russia, Ukraine, and Paraguay. The most common subject was warnings about ENDS (18%), followed by bans on using ENDS (13%) and ENDS prices (9%). Flavor and age restrictions were the least covered news subjects (&#160;1% each). Among different subject foci, reports on quitting cigarettes using ENDS had the highest probability of scoring in the top three deciles of popularity rankings. Moreover, ENDS news on quitting and prices had a more positive sentiment on average than news with other subject foci. Public health leaders can use these trends to stay abreast of how ENDS are portrayed in the media, and potentially how the public perceives ENDS. Because our analytical strategies are updated in near real time, we aim to make media monitoring part of standard practice to support evidence-based tobacco control in the future.
     Masoud Rouhizadeh, Elham Hatef, Mark Dredze, Christopher Chute, Hadi Kharrazi. Identifying Social Determinants of Health from Clinical Notes: A Rule-Based Approach. AMIA Natural Language Processing Working Group Pre-Symposium, 2018. [Bibtex]
Expand Me   David A Broniatowski, Amelia M Jamison, SiHua Qi, Lulwah AlKulaib, Tao Chen, Adrian Benton, Sandra C Quinn, Mark Dredze. Weaponized Health Communication: Twitter Bots and Russian Trolls Amplify the Vaccine Debate. American Journal of Public Health (AJPH), 2018;108(10):1378-1384. [PDF] [Bibtex] (Ranked #31 of 13.6 million research outputs by Altmetric and in the top-20 for 2018.) [Data]
Objectives. To understand how Twitter bots and trolls (``bots'') promote online health content. Methods. We compared bots' to average users' rates of vaccine-relevant messages, which we collected online from July 2014 through September 2017. We estimated the likelihood that users were bots, comparing proportions of polarized and antivaccine tweets across user types. We conducted a content analysis of a Twitter hashtag associated with Russian troll activity. Results. Compared with average users, Russian trolls (χ2(1) = 102.0; P < .001), sophisticated bots (χ2(1) = 28.6; P < .001), and ``content polluters'' (χ2(1) = 7.0; P < .001) tweeted about vaccination at higher rates. Whereas content polluters posted more antivaccine content (χ2(1) = 11.18; P < .001), Russian trolls amplified both sides. Unidentifiable accounts were more polarized (χ2(1) = 12.1; P < .001) and antivaccine (χ2(1) = 35.9; P < .001). Analysis of the Russian troll hashtag showed that its messages were more political and divisive. Conclusions. Whereas bots that spread malware and unsolicited content disseminated antivaccine messages, Russian trolls promoted discord. Accounts masquerading as legitimate users create false equivalency, eroding public consensus on vaccination. Public Health Implications. Directly confronting vaccine skeptics enables bots to legitimize the vaccine debate. More research is needed to determine how best to combat bot-driven content. (Am J Public Health. Published online ahead of print August 23, 2018: e1--e7. doi:10.2105/AJPH.2018.304567)
Expand Me   Adrian Benton, Mark Dredze. Using Author Embeddings to Improve Tweet Stance Classification. EMNLP Workshop on Noisy User-generated Text (W-NUT), 2018. [PDF] [Bibtex]
Many social media classification tasks analyze the content of a message, but do not consider the context of the message. For example, in tweet stance classification -- where a tweet is categorized according to a view-point it espouses -- the expressed viewpoint depends on latent beliefs held by the user. In this paper we investigate whether incorporating knowledge about the author can improve tweet stance classification. Furthermore, since author information and embeddings are often unavailable for labeled training examples, we propose a semi-supervised pre-training method to predict user embeddings. Although the neural stance classifiers we learn are often outperformed by a baseline SVM, author embedding pre-training yields improvements over a non-pre-trained neural network on four out of five domains in the SemEval 2016 6A tweet stance classification task. In a tweet gun control stance classification dataset, improvements from pre-training are only apparent when training data is limited.
Expand Me   Zachary Wood-Doughty, Nicholas Andrews, Mark Dredze. Convolutions Are All You Need (For Classifying Character Sequences) EMNLP Workshop on Noisy User-generated Text (W-NUT), 2018. [PDF] [Bibtex]
While recurrent neural networks (RNNs) are widely used for text classification, they demonstrate poor performance and slow convergence when trained on long sequences. When text is modeled as characters instead of words, the longer sequences make RNNs a poor choice. Convolutional neural networks (CNNs), although somewhat less ubiquitous than RNNs, have an internal structure more appropriate for long-distance character dependencies. To better understand how CNNs and RNNs differ in handling long sequences, we use them for text classification tasks in several character-level social media datasets. The CNN models vastly outperform the RNN models in our experiments, suggesting that CNNs are superior to RNNs at learning to classify character-level data.
Expand Me   Vedran Sekara, Alex Rutherford, Gideon Mann, Mark Dredze, Natalia Adler, Manuel García-Herranz. Trends in the Adoption of Corporate Child Labor Policies: An Analysis with Bloomberg Terminal ESG Data. Bloomberg Data for Good Exchange, 2018. [PDF] [Bibtex]
Over 150 million children worldwide are estimated to be engaged in some form of child labor, with nearly one in every four children between the ages of 5 and 14 engaged in potentially harmful work in the world's poorest countries. Child labor compromises children's physical, mental, social and educational development. It also reinforces cycles of poverty, negatively affecting the ecosystem necessary for business to thrive in a sustainable manner. Against a backdrop of multiple international and national laws against child labor, corporations also adopt policies on child labor. However, new methods of globally dispersed production have made this commitment to sustainability issues across supply chains more challenging. In this work we examine, through the lens of Bloomberg's environmental, social and governance (ESG) and financial data, trends in corporate child labor policies and their relationship to classic economic variables as a first step in understanding sustainability issues across global supply networks.
Expand Me   Zachary Wood-Doughty, Ilya Shpitser, Mark Dredze. Challenges of Using Text Classifiers for Causal Inference. Empirical Methods in Natural Language Processing (EMNLP), 2018. [PDF] [Bibtex]
Causal understanding is essential for many kinds of decision-making, but causal inference from observational data has typically only been applied to structured, low-dimensional datasets. While text classifiers produce low-dimensional outputs, their use in causal inference has not previously been studied. To facilitate causal analyses based on language data, we consider the role that text classifiers can play in causal inference through established modeling mechanisms from the causality literature on missing data and measurement error. We demonstrate how to conduct causal analyses using text classifiers on simulated and Yelp data, and discuss the opportunities and challenges of future work that uses text data in causal inference.
Expand Me   John W Ayers, Theodore L Caputi, Camille Nebeker, Mark Dredze. Don't quote me: reverse identification of research participants in social media studies. Nature Digital Medicine, 2018. [PDF] [Bibtex] (Ranked in the top 0.6% of 11.6m million research outputs by Altmetric)
We investigated if participants in social media surveillance studies could be reverse identified by reviewing all articles published on PubMed in 2015 or 2016 with the words ``Twitter'' and either ``read,'' ``coded,'' or ``content'' in the title or abstract. Seventy-two percent (95% CI: 63--80) of articles quoted at least one participant's tweet and searching for the quoted content led to the participant 84% (95% CI: 74--91) of the time. Twenty-one percent (95% CI: 13--29) of articles disclosed a participant's Twitter username thereby making the participant immediately identifiable. Only one article reported obtaining consent to disclose identifying information and institutional review board (IRB) involvement was mentioned in only 40% (95% CI: 31--50) of articles, of which 17% (95% CI: 10--25) received IRB-approval and 23% (95% CI:16--32) were deemed exempt. Biomedical publications are routinely including identifiable information by quoting tweets or revealing usernames which, in turn, violates ICMJE ethical standards governing scientific ethics, even though said content is scientifically unnecessary. We propose that authors convey aggregate findings without revealing participants' identities, editors refuse to publish reports that reveal a participant's identity, and IRBs attend to these privacy issues when reviewing studies involving social media data. These strategies together will ensure participants are protected going forward.
Expand Me   Yuki Lama, Tao Chen, Mark Dredze, Amelia M Jamison, Sandra C Quinn, David A Broniatowski. Discordance Between Human Papillomavirus Twitter Images and Disparities in Human Papillomavirus Risk and Disease in the United States: Mixed-Methods Analysis. Journal of Medical Internet Research (JMIR), 2018;20(9):e10244. [PDF] [Bibtex]
Background: Racial and ethnic minorities are disproportionately affected by human papillomavirus (HPV)-related cancer, many of which could have been prevented with vaccination. Yet, the initiation and completion rates of HPV vaccination remain low among these populations. Given the importance of social media platforms for health communication, we examined US-based HPV images on Twitter. We explored inconsistencies between the demographics represented in HPV images and the populations that experience the greatest burden of HPV-related disease. Objective: The objective of our study was to observe whether HPV images on Twitter reflect the actual burden of disease by select demographics and determine to what extent Twitter accounts utilized images that reflect the burden of disease in their health communication messages. Methods: We identified 456 image tweets about HPV that contained faces posted by US users between November 11, 2014 and August 8, 2016. We identified images containing at least one human face and utilized Face++ software to automatically extract the gender, age, and race of each face. We manually annotated the source accounts of these tweets into 3 types as follows: government (38/298, 12.8%), organizations (161/298, 54.0%), and individual (99/298, 33.2%) and topics (news, health, and other) to examine how images varied by message source. Results: Findings reflected the racial demographics of the US population but not the disease burden (795/1219, 65.22% white faces; 140/1219, 11.48% black faces; 71/1219, 5.82% Asian faces; and 213/1219, 17.47% racially ambiguous faces). Gender disparities were evident in the image faces; 71.70% (874/1219) represented female faces, whereas only 27.89% (340/1219) represented male faces. Among the 11-26 years age group recommended to receive HPV vaccine, HPV images contained more female-only faces (214/616, 34.3%) than males (37/616, 6.0%); the remainder of images included both male and female faces (365/616, 59.3%). Gender and racial disparities were present across different image sources. Faces from government sources were more likely to depict females (n=44) compared with males (n=16). Of male faces, 80% (12/15) of youth and 100% (1/1) of adults were white. News organization sources depicted high proportions of white faces (28/38, 97% of female youth and 12/12, 100% of adult males). Face++ identified fewer faces compared with manual annotation because of limitations with detecting multiple, small, or blurry faces. Nonetheless, Face++ achieved a high degree of accuracy with respect to gender, race, and age compared with manual annotation. Conclusions: This study reveals critical differences between the demographics reflected in HPV images and the actual burden of disease. Racial minorities are less likely to appear in HPV images despite higher rates of HPV incidence. Health communication efforts need to represent populations at risk better if we seek to reduce disparities in HPV infection.
     Katherine Smith, Caitlin Weiger, Errol Fields, Joanna E Cohen, Meghan Moran, Mark Dredze. Conducting public health surveillance research on consumer product websites. American Public Health Association (APHA), 2018. [PDF] [Bibtex]
Expand Me   Yuchen Zhou, Mark Dredze, David A Broniatowski, William Adler. Gab: The Alt-Right Social Media Platform. International Conference on Social Computing, Behavioral-Cultural Modeling & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 2018. [PDF] [Bibtex]
This study proposes the use of Gab as a vehicle for political science research regarding modern American politics and the Alt-Right population. We collect several million Gab messages posted on Gab web- site from August 2016 to February 2018. We conduct a preliminary analysis of Gab platform related to site use, growth and topics, which shows that Gab is a reasonable resource for Alt-Right study.
     Travis Wolfe, Annabelle Carrell, Mark Dredze, Benjamin Van Durme. Summarizing Entities using Distantly Supervised Information Extractors. SIGIR Workshop on Knowledge Graphs and Semantics for Text Retrieval, Analysis, and Understanding (KG4IR), 2018. [Bibtex]
     Alexis S Hammond, Michael J Paul, J Gregory Hobelmann, Animesh R Koratana, Mark Dredze, Margaret S Chisolm. Perceived Attitudes About Substance Use in Anonymous Social Media Posts Near College Campuses. Journal of Medical Internet Research Mental Health (JMIR MH), 2018;5(3):e52. [PDF] [Bibtex]
     Zachary Wood-Doughty, Praateek Mahajan, Mark Dredze. Johns Hopkins or johnny-hopkins: Classifying Individuals versus Organizations on Twitter. NAACL Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, 2018. [PDF] [Bibtex]
     Zachary Wood-Doughty, Nicholas Andrews, Rebecca Marvin, Mark Dredze. Predicting Twitter User Demographics from Names Alone. NAACL Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, 2018. [PDF] [Bibtex]
     Theodore L Caputi, Eric C Leas, Mark Dredze, John W Ayers. Online Sales of Marijuana: An Unrecognized Public Health Dilemma. American Journal of Preventive Medicine (AJPM), 2018;54(5):719-721. [PDF] [Bibtex]
     Adrian Benton, Mark Dredze. Deep Dirichlet Multinomial Regression. North American Chapter of the Association for Computational Linguistics (NAACL), 2018. [PDF] [Bibtex]
     Tao Chen, Mark Dredze. Vaccine Images on Twitter: What is Shared and Why. Journal of Medical Internet Research (JMIR), 2018;20(4):2018. [PDF] [Bibtex]

     2017 (23 Publications)
Expand Me   Seth M Noar, Eric C Leas, Benjamin M Althouse, Mark Dredze, Dannielle Kelley, John W Ayers. Can a selfie promote public engagement with skin cancer? Preventive Medicine, 2017. [PDF] [Bibtex]
Social media may provide new opportunities to promote skin cancer prevention, but research to understand this potential is needed. In April of 2015, Kentucky native Tawny Willoughby (TW) shared a graphic skin cancer selfie on Facebook that subsequently went viral. We examined the volume of comments and shares of her original Facebook post; news volume of skin cancer from Google News; and search volume for skin cancer Google queries. We compared these latter metrics after TWs announcement against expected volumes based on forecasts of historical trends. TW's skin cancer selfie went viral on May 11, 2015 after the social media post had been shared approximately 50,000 times. All search queries for skin cancer increased 162% (95% CI 102 to 320) and 155% (95% CI 107 to 353) on May 13th and 14th, when news about TW's skin cancer selfie was at its peak, and remained higher through May 17th. Google searches about skin cancer prevention and tanning were also significantly higher than expected volumes. In practical terms, searches reached near-record levels - i.e., May 13th, 14th and 15th were respectively the 6th, 8th, and 40th most searched days for skin cancer since January 1, 2004 when Google began tracking searches. We conclude that an ordinary person's social media post caught the public's imagination and led to significant increases in public engagement with skin cancer prevention. Digital surveillance methods can rapidly detect these events in near real time, allowing public health practitioners to engage and potentially elevate positive effects.
     Theodore L Caputi, Eric C Leas, Mark Dredze, Joanna E Cohen, John W Ayers. They're heating up: Internet search query trends reveal significant public interest in heat-not-burn tobacco products. PLoS ONE, 2017. [PDF] [Bibtex]
     Benjamin Van Durme, Tom Lippincott, Kevin Duh, Deana Burchfield, Adam Poliak, Cash Costello, Tim Finin, Scott Miller, James Mayfield, Philipp Koehn, Craig Harman, Dawn Lawrie, Chandler May, Max Thomas, Julianne Chaloux, Annabelle Carrell, Tongfei Chen, Alex Comerford, Mark Dredze, Benjamin Glass, Shudong Hao, Patrick Martin, Rashmi Sankepally, Pushpendre Rastogi, Travis Wolfe, Ying-Ying Tran, Ted Zhang. CADET: Computer Assisted Discovery Extraction and Translation. International Joint Conference on Natural Language Processing (IJCNLP) (Demonstration Track), 2017. [PDF] [Bibtex]
     Michael C Smith, Mark Dredze, Sandra C Quinn, David A Broniatowski. Monitoring Real-time Spatial Public Health Discussions in the Context of Vaccine Hesitancy. AMIA Workshop on Social Media Mining for Health Applications, 2017. [PDF] [Bibtex]
     Ning Gao, Mark Dredze, Douglas Oard. Enhancing Scientific Collaboration Through Knowledge Base Population and Linking for Meetings. Hawaii International Conference on System Sciences (HICSS), 2017. [PDF] [Bibtex]
     Michael J Paul, Mark Dredze. Social Monitoring for Public Health. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2017;9(5):1-183. [PDF] [Bibtex] [Preprint (free)]
     Ning Gao, Gregory Sell, Douglas Oard, Mark Dredze. Leveraging Side Information for Speaker Identification with the Enron Conversational Telephone Speech Collection. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017. [PDF] [Bibtex]
     John W Ayers, Benjamin M Althouse, Eric C Leas, Mark Dredze, Jon-Patrick Allem. Internet searches for suicide following the release of 13 Reasons Why. JAMA Internal Medicine, 2017;57(4):238-240. [PDF] [Bibtex] (Ranked in the top .02% of 8.2m research outputs by Altmetric, Read the JAMA IM Editorial, Read Netflix cast's response to criticism)
Expand Me   Mark Dredze, Zachary Wood-Doughty, Sandra C Quinn, David A Broniatowski. Vaccine opponents' use of Twitter during the 2016 US presidential election: Implications for practice and policy. Vaccine, 2017;35(36):4670-4672. [PDF] [Bibtex]
The recent inauguration of President Trump carries with it many public health policy implications. During the election, President Trump, like all political candidates, made policy commitments to various interest groups including vaccine skeptics. These groups celebrated the announcement that Robert Kennedy Jr., a noted proponent of a causal link between vaccines and autism, may chair a commission on vaccines. Furthermore, during the GOP primaries, Mr. Trump endorsed messages associated with vaccine refusal on Twitter, and met with prominent vaccine refusal advocates including Andrew Wakefield, who published the retracted and discredited 1998 Lancet article claiming to link autism to MMR vaccination. In this paper, we show that the new administration has mobilized vaccine refusal advocates, potentially enabling them to influence the national agenda in a manner that could lead to changes in existing vaccination policy.
     Anietie Andy, Mark Dredze, Mugizi Rwebangira, Chris Callison-Burch. Constructing an Alias List for Named Entities during an Event. EMNLP Workshop on Noisy User-generated Text (W-NUT), 2017. [PDF] [Bibtex]
Expand Me   Nanyun Peng, Mark Dredze. Multi-task Domain Adaptation for Sequence Tagging. ACL Workshop on Representation Learning for NLP (RepL4NLP), 2017. [PDF] [Bibtex]
Many domain adaptation approaches rely on learning cross domain shared representations to transfer the knowledge learned in one domain to other domains. Traditional domain adaptation only considers adapting for one task. In this paper, we explore multi-task representation learning under the domain adaptation scenario. We propose a neural network framework that supports domain adaptation for multiple tasks simultaneously, and learns shared representations that better generalize for domain adaptation. We apply the proposed framework to domain adaptation for sequence tagging problems considering two tasks: Chinese word segmentation and named entity recognition. Experiments show that multi-task domain adaptation works better than disjoint domain adaptation for each task, and achieves the state-of-the-art results for both tasks in the social media domain.
Expand Me   Zachary Wood-Doughty, Michael C Smith, David A Broniatowski, Mark Dredze. How Does Twitter User Behavior Vary Across Demographic Groups? ACL Workshop on Natural Language Processing and Computational Social Science, 2017. [PDF] [Bibtex]
Demographically-tagged social media messages are a common source of data for computational social science. While these messages can indicate differences in beliefs and behaviors between demographic groups, we do not have a clear understanding of how different demographic groups use platforms such as Twitter. This paper presents a preliminary analysis of how groups' differing behaviors may confound analyses of the groups themselves. We analyzed one million Twitter users by first inferring demographic attributes, and then measuring several indicators of Twitter behavior. We find differences in these indicators across demographic groups, suggesting that there may be underlying differences in how different demographic groups use Twitter.
Expand Me   Jon-Patrick Allem, Eric C Leas, Theodore L Caputi, Mark Dredze, Benjamin M Althouse, Seth M Noar, John W Ayers. The Charlie Sheen Effect on Rapid In-home Human Immunodeficiency Virus Test Sales. Prevention Science, 2017;18(5):541--544. [PDF] [Bibtex]
One in eight of the 1.2 million Americans living with human immunodeficiency virus (HIV) are unaware of their positive status, and untested individuals are responsible for most new infections. As a result, testing is the most cost-effective HIV prevention strategy and must be accelerated when opportunities are presented. Web searches for HIV spiked around actor Charlie Sheen's HIV-positive disclosure. However, it is unknown whether Sheen's disclosure impacted offline behaviors like HIV testing. The goal of this study was to determine if Sheen's HIV disclosure was a record-setting HIV prevention event and determine if Web searches presage increases in testing allowing for rapid detection and reaction in the future. Sales of OraQuick rapid in-home HIV test kits in the USA were monitored weekly from April 12, 2014, to April 16, 2016, alongside Web searches including the terms ``test,'' ``tests,'' or ``testing'' and ``HIV'' as accessed from Google Trends. Changes in OraQuick sales around Sheen's disclosure and prediction models using Web searches were assessed. OraQuick sales rose 95% (95% CI, 75--117; p < 0.001) of the week of Sheen's disclosure and remained elevated for 4 more weeks (p < 0.05). In total, there were 8225 more sales than expected around Sheen's disclosure, surpassing World AIDS Day by a factor of about 7. Moreover, Web searches mirrored OraQuick sales trends (r = 0.79), demonstrating their ability to presage increases in testing. The ``Charlie Sheen effect'' represents an important opportunity for a public health response, and in the future, Web searches can be used to detect and act on more opportunities to foster prevention behaviors.
Expand Me   Ning Gao, Douglas Oard, Mark Dredze. Support for Interactive Identification of Mentioned Entities in Conversational Speech. International Conference on Research and Development in Information Retrieval (SIGIR) (short paper), 2017. [PDF] [Bibtex]
Searching conversational speech poses several new challenges, among which is how the searcher will make sense of what they find. This paper describes our initial experiments with a freely available collection of Enron telephone conversations. Our goal is to help the user make sense of search results by finding information about mentioned people, places and organizations. Because automated entity recognition is not yet sufficiently accurate on conversational telephone speech, we ask the user to transcribe just the name, and to indicate where in the recording it was heard. We then seek to link that mention to other mentions of the same entity in a variety of sources (in our experiments, in email and in Wikipedia). We cast this as an entity linking problem, and achieve promising results by utilizing social network features to help compensate for the limited accuracy of automatic transcription for this challenging content.
Expand Me   Nicholas Andrews, Mark Dredze, Benjamin Van Durme, Jason Eisner. Bayesian Modeling of Lexical Resources for Low-Resource Settings. Association for Computational Linguistics (ACL), 2017. [PDF] [Bibtex]
Lexical resources such as dictionaries and gazetteers are often used as auxiliary data for tasks such as part-of-speech induction and named-entity recognition. However, discriminative training with lexical features requires annotated data to reliably estimate the lexical feature weights and may result in overfitting the lexical features at the expense of features which generalize better. In this paper, we investigate a more robust approach: we stipulate that the lexicon is the result of an assumed generative process. Practically, this means that we may treat the lexical resources as observations under the proposed generative model. The lexical resources provide training data for the generative model without requiring separate data to estimate lexical feature weights. We evaluate the proposed approach in two settings: part-of-speech induction and low-resource named-entity recognition.
Expand Me   Travis Wolfe, Mark Dredze, Benjamin Van Durme. Pocket Knowledge Base Population. Association for Computational Linguistics (ACL) (short paper), 2017. [PDF] [Bibtex] [Code]
Existing Knowledge Base Population methods extract relations from a closed relational schema with limited coverage, leading to sparse KBs. We propose Pocket Knowledge Base Population (PKBP), the task of dynamically constructing a KB of entities related to a query and finding the best characterization of relationships between entities. We describe novel Open Information Extraction methods which leverage the PKB to find informative trigger words. We evaluate using existing KBP shared-task data as well as new annotations collected for this work. Our methods produce high quality KBs from just text with many more entities and relationships than existing KBP systems.
Expand Me   Ning Gao, Mark Dredze, Douglas Oard. Person Entity Linking in Email with NIL Detection. Journal of the Association for Information Science and Technology (JAIST), 2017. [PDF] [Bibtex]
For each specific mention of an entity found in a text, the goal of entity linking is to determine whether the referenced entity is present in an existing knowledge base, and if so to determine which KB entity is the correct referent. Entity linking has been well explored for dissemination-oriented sources such as news stories, blogs, and microblog posts, but the limited work to date on ``conversational'' sources such as email or text chat has not yet attempted to determine when the referent entity is not in the knowledge base (a task known as ``NIL detection''). This article presents a supervised machine learning system for linking named mentions of people in email messages to a collection-specific knowledge base, and that is also capable of NIL detection. This system learns from manually annotated training examples to leverage a rich set of features. The entity linking accuracy for entities present in the knowledge base is substantially and significantly better than the best previously reported results on the Enron email collection, comparable accuracy is reported for the challenging NIL detection task, and these results are for the first time replicated on a second email collection from a different source with comparable results.
Expand Me     Ann Irvine, Mark Dredze. Harmonic Grammar, Optimality Theory, and Syntax Learnability: An Empirical Exploration of Czech Word Order. Unpublished Manuscript, 2017. [PDF] [Bibtex]
This work presents a systematic theoretical and empirical comparison of the major algorithms that have been proposed for learning Harmonic and Optimality Theory grammars (HG and OT, respectively). By comparing learning algorithms, we are also able to compare the closely related OT and HG frameworks themselves. Experimental results show that the additional expressivity of the HG framework over OT affords performance gains in the task of predicting the surface word order of Czech sentences. We compare the perceptron with the classic Gradual Learning Algorithm (GLA), which learns OT grammars, as well as the popular Maximum Entropy model. In addition to showing that the perceptron is theoretically appealing, our work shows that the performance of the HG model it learns approaches that of the upper bound in prediction accuracy on a held out test set and that it is capable of accurately modeling observed variation.
Expand Me   Adrian Benton, Glen A Coppersmith, Mark Dredze. Ethical Research Protocols for Social Media Health Research. EACL Workshop on Ethics in Natural Language Processing, 2017. [PDF] [Bibtex]
Social media have transformed data driven research in political science, the social sciences, health, and medicine. Since health research often touches on sensitive topics that relate to ethics of treatment and patient privacy, similar ethical considerations should be acknowledged when using social media data in health research. While much has been said regarding the ethical considerations of social media research, health research leads to an additional set of concerns. We provide practical suggestions in the form of guidelines for researchers working with social media data in health research. These guidelines can inform an IRB proposal for researchers new to social media health research.
Expand Me   John W Ayers, Eric C Leas, Jon-Patrick Allem, Adrian Benton, Mark Dredze, Benjamin M Althouse, Tess B Cruz, Jennifer B Unger. Why Do People Use Electronic Nicotine Delivery Systems (Electronic Cigarettes)? A Content Analysis of Twitter, 2012-2015. PLoS One, 2017. [PDF] [Bibtex]
The reasons for using electronic nicotine delivery systems (ENDS) are poorly understood and are primarily documented by expensive cross-sectional surveys that use preconceived close-ended response options rather than allowing respondents to use their own words. We passively identify the reasons for using ENDS longitudinally from a content analysis of public postings on Twitter. All English language public tweets including several ENDS terms (e.g., ``e-cigarette'' or ``vape'') were captured from the Twitter data stream during 2012 and 2015. After excluding spam, advertisements, and retweets, posts indicating a rationale for vaping were retained. The specific reasons for vaping were then inferred based on a supervised content analysis using annotators from Amazon's Mechanical Turk. During 2012 quitting combustibles was the most cited reason for using ENDS with 43% (95%CI 39--48) of all reason-related tweets cited quitting combustibles, e.g., ``I couldn't quit till I tried ecigs,'' eclipsing the second most cited reason by more than double. Other frequently cited reasons in 2012 included ENDS's social image (21%; 95%CI 18--25), use indoors (14%; 95%CI 11--17), flavors (14%; 95%CI 11--17), safety relative to combustibles (9%; 95%CI 7--11), cost (3%; 95%CI 2--5) and favorable odor (2%; 95%CI 1--3). By 2015 the reasons for using ENDS cited on Twitter had shifted. Both quitting combustibles and use indoors significantly declined in mentions to 29% (95%CI 24--33) and 12% (95%CI 9--16), respectively. At the same time, social image increased to 37% (95%CI 32--43) and lack of odor increased to 5% (95%CI 2--5), the former leading all cited reasons in 2015. Our data suggest the reasons people vape are shifting away from cessation and toward social image. The data also show how the ENDS market is responsive to a changing policy landscape. For instance, smoking indoors was less frequently cited in 2015 as indoor smoking restrictions became more common. Because the data and analytic approach are scalable, adoption of our strategies in the field can inform follow-up survey-based surveillance (so the right questions are asked), interventions, and policies for ENDS.
Expand Me   Anthony Nastasi, Tyler Bryant, Joseph K Canner, Mark Dredze, Melissa S Camp, Neeraja Nagarajan. Breast Cancer Screening and Social Media: a Content Analysis of Evidence Use and Guideline Opinions on Twitter. Journal of Cancer Education, 2017. [PDF] [Bibtex]
There is ongoing debate regarding the best mammography screening practices. Twitter has become a powerful tool for disseminating medical news and fostering healthcare conversations; however, little work has been done examining these conversations in the context of how users are sharing evidence and discussing current guidelines for breast cancer screening. To characterize the Twitter conversation on mammography and assess the quality of evidence used as well as opinions regarding current screening guidelines, individual tweets using mammography-related hashtags were prospectively pulled from Twitter from 5 November 2015 to 11 December 2015. Content analysis was performed on the tweets by abstracting data related to user demographics, content, evidence use, and guideline opinions. Standard descriptive statistics were used to summarize the results. Comparisons were made by demographics, tweet type (testable claim, advice, personal experience, etc.), and user type (non-healthcare, physician, cancer specialist, etc.). The primary outcomes were how users are tweeting about breast cancer screening, the quality of evidence they are using, and their opinions regarding guidelines. The most frequent user type of the 1345 tweets was ``non-healthcare'' with 323 tweets (32.5%). Physicians had 1.87 times higher odds (95% CI, 0.69--5.07) of providing explicit support with a reference and 11.70 times higher odds (95% CI, 3.41--40.13) of posting a tweet likely to be supported by the scientific community compared to non-healthcare users. Only 2.9% of guideline tweets approved of the guidelines while 14.6% claimed to be confused by them. Non-healthcare users comprise a significant proportion of participants in mammography conversations, with tweets often containing claims that are false, not explicitly backed by scientific evidence, and in favor of alternative ``natural'' breast cancer prevention and treatment. Furthermore, users appear to have low approval and confusion regarding screening guidelines. These findings suggest that more efforts are needed to educate and disseminate accurate information to the general public regarding breast cancer prevention modalities, emphasizing the safety of mammography and the harms of replacing conventional prevention and treatment modalities with unsubstantiated alternatives.
Expand Me   Xiaolei Huang, Michael C Smith, Michael J Paul, Dmytro Ryzhkov, Sandra C Quinn, David A Broniatowski, Mark Dredze. Examining Patterns of Influenza Vaccination in Social Media. AAAI Joint Workshop on Health Intelligence (W3PHIAI), 2017. [PDF] [Bibtex] [Data]
Traditional data on influenza vaccination has several limitations: high cost, limited coverage of underrepresented groups, and low sensitivity to emerging public health issues. Social media, such as Twitter, provide an alternative way to understand a population's vaccination-related opinions and behaviors. In this study, we build and employ several natural language classifiers to examine and analyze behavioral patterns regarding influenza vaccination in Twitter across three dimensions: temporality (by week and month), geography (by US region), and demography (by gender). Our best results are highly correlated official government data, with a correlation over 0.90, providing validation of our approach. We then suggest a number of directions for future work.
Expand Me   Neeraja Nagarajan, Husain Alshaikh, Anthony Nastasi, Blair J Smart, Zackary D Berger, Eric B Schneider, Mark Dredze, Joseph K Canner, Nita Ahuja. The Utility of Twitter in Generating High-Quality Conversations about Surgical Care. Academic Surgical Congress, 2017. [PDF] [Bibtex]
Introduction: There is growing interest among various stakeholders in using social media sites to discuss healthcare issues. However, little is known about how social media sites are used to discuss surgical care. There is also a lack of understanding of the types of content generated and the quality of the information shared in social media platforms about surgical care issues. We therefore sought to identify and summarize conversations on surgical care in Twitter, a popular microblogging website. Methods: A comprehensive list of surgery-related hashtags was used to pull individual tweets from 3/27-4/27/2015. Four independent reviewers blindly analyzed 25 tweets to develop themes for extraction from a larger sample. The themes were broadly divided further to obtain data at the levels of the user, the tweet, the content of the tweet and personal information shared (Figure I). Standard descriptive statistical analysis and simple logistic regression analysis was used. Results: In total, 17,783 tweets were pulled and 1000 from 615 unique users were randomly selected for analysis. Most users were from North America (62.4%) and non-healthcare related individuals (31.8%). Healthcare organizations generated 12.4%, and surgeons 9.5%, of tweets. Overall, 67.4% were original tweets and 79.0% contained a hyperlink (11% to healthcare and 8.7% to peer-reviewed sources). The common areas of surgery discussed were global surgery/health systems (18.4%), followed by general surgery (15.6%). Among personal tweets (n=236), 31.1% concerned surgery on family/friends and 24.4% on the user; 61.1% discussed procedures already performed and 58.0% used positive language about their personal experience with surgical care. Surgical news/opinion was present in 45% of tweets and 13.7% contained evidence-based information. Non-healthcare professionals were 53.5% (95% CI: 3.8%-77.5%, p=0.039) and 72.8% (95% CI: 21.1%-91.7%, p=0.017) less likely to generate a tweet that contained evidence-based information and to quote from a peer-reviewed journal, respectively, when compared to other users. Conclusion: Our study demonstrates that while healthcare professionals and organizations tend to share higher quality data on surgical care on social media, non-health care related individuals largely drive the conversation. Fewer than half of all surgery-related tweets included surgical news/opinion; only 14% included evidence-based information and just 9% linked to peer-reviewed sources. As social media outlets become important sources of actionable information, leaders in the surgical community should develop professional guidelines to maximize this versatile platform to disseminate accurate and high-quality content on surgical issues to a wide range of audiences.

     2016 (34 Publications)
Expand Me     Travis Wolfe, Mark Dredze, Benjamin Van Durme. Feature Generation for Robust Semantic Role Labeling. Unpublished Manuscript, 2016. [PDF] [Bibtex]
Hand-engineered feature sets are a well understood method for creating robust NLP models, but they require a lot of expertise and effort to create. In this work we describe how to automatically generate rich feature sets from simple units called featlets, requiring less engineering. Using information gain to guide the generation process, we train models which rival the state of the art on two standard Semantic Role Labeling datasets with almost no task or linguistic insight.
Expand Me   Anietie Andy, Satoshi Sekine, Mugizi Rwebangira, Mark Dredze. Name Variation in Community Question Answering Systems. COLING Workshop on Noisy User-generated Text, 2016. [PDF] [Bibtex] (Best Paper Award)
Community question answering systems are forums where users can ask and answer questions in various categories. Examples are Yahoo! Answers, Quora, and Stack Overflow. A common challenge with such systems is that a significant percentage of asked questions are left unanswered. In this paper, we propose an algorithm to reduce the number of unanswered questions in Yahoo! Answers by reusing the answer to the most similar past resolved question to the unanswered question, from the site. Semantically similar questions could be worded differently, thereby making it difficult to find questions that have shared needs. For example, Who is the best player for the Reds? and Who is currently the biggest star at Manchester United? have a shared need but are worded differently; also, Reds and Manchester United are used to refer to the soccer team Manchester United football club. In this research, we focus on question categories that contain a large number of named entities and entity name variations. We show that in these categories, entity linking can be used to identify relevant past resolved questions with shared needs as a given question by disambiguating named entities and matching these questions based on the disambiguated entities, identified entities, and knowledge base information related to these entities. We evaluated our algorithm on a new dataset constructed from Yahoo! Answers. The dataset contains annotated question pairs, (Qgiven, [Qpast, Answer]). We carried out experiments on several question categories and show that an entity-based approach gives good performance when searching for similar questions in entity rich categories.
Expand Me   Travis Wolfe, Mark Dredze, Benjamin Van Durme. A Study of Imitation Learning Methods for Semantic Role Labeling. EMNLP Workshop on Structured Prediction for NLP, 2016. [PDF] [Bibtex]
Global features have proven effective in a wide range of structured prediction problems but come with high inference costs. Imitation learning is a common method for training models when exact inference isn't feasible. We study imitation learning for Semantic Role Labeling (SRL) and analyze the effectiveness of the Violation Fixing Perceptron (VFP) (Huang et al., 2012) and Locally Optimal Learning to Search (LOLS) (Chang et al.,2015) frameworks with respect to SRL global features. We describe problems in applying each framework to SRL and evaluate the effectiveness of some solutions. We also show that action ordering, including easy first inference, has a large impact on the quality of greedy global models.
Expand Me   Rebecca Knowles, Josh Carroll, Mark Dredze. Demographer: Extremely Simple Name Demographics. EMNLP Workshop on Natural Language Processing and Computational Social Science, 2016. [PDF] [Bibtex] [Code]
The lack of demographic information available when conducting passive analysis of social media content can make it difficult to compare results to traditional survey results. We present DEMOGRAPHER, a tool that predicts gender from names, using name lists and a classifier with simple character-level features. By relying only on a name, our tool can make predictions even without extensive user-authored content. We compare DEMOGRAPHER to other available tools and discuss differences in performance. In particular, we show that DEMOGRAPHER performs well on Twitter data, making it useful for simple and rapid social media demographic inference.
Expand Me   John W Ayers, Eric C Leas, Mark Dredze, Jon-Patrick Allem, Jurek G Grabowski, Linda Hill. Pok\'emon go---a new distraction for drivers and pedestrians. JAMA Internal Medicine, 2016;176(12):1865-1866. [PDF] [Bibtex] (Ranked in the top .02% of 6.5m research outputs by Altmetric)
Pok\'emon GO, an augmented reality game, has swept the nation. As players move, their avatar moves within the game, and players are then rewarded for collecting Pok\'emon placed in real-world locations. By rewarding movement, the game incentivizes physical activity. However, if players use their cars to search for Pok\'emon they negate any health benefit and incur serious risk. Motor vehicle crashes are the leading cause of death among 16- to 24-year-olds, whom the game targets. Moreover, according to the American Automobile Association, 59% of all crashes among young drivers involve distractions within 6 seconds of the accident. We report on an assessment of drivers and pedestrians distracted by Pok\'emon GO and crashes potentially caused by Pok\'emon GO by mining social and news media reports.
Expand Me   Mark Dredze, Nicholas Andrews, Jay DeYoung. Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation. EMNLP Workshop on Natural Language Processing for Social Media, 2016. [PDF] [Bibtex] [Code], [Data]
Work on cross document coreference resolution (CDCR) has primarily focused on news articles, with little to no work for social media. Yet social media may be particularly challenging since short messages provide little context, and informal names are pervasive. We introduce a new Twitter corpus that contains entity annotations for entity clusters that supports CDCR. Our corpus draws from Twitter data surrounding the 2013 Grammy music awards ceremony, providing a large set of annotated tweets focusing on a single event. To establish a baseline we evaluate two CDCR systems and consider the performance impact of each system component. Furthermore, we augment one system to include temporal information, which can be helpful when documents (such as tweets) arrive in a specific order. Finally, we include annotations linking the entities to a knowledge base to support entity linking.
Expand Me   John W Ayers, Benjamin M Althouse, Eric C Leas, Ted Alcorn, Mark Dredze. Big Media Data Can Inform Gun Violence Prevention. Bloomberg Data for Good Exchange, 2016. [PDF] [Bibtex]
The scientific method drives improvements in public health, but a strategy of obstructionism has impeded scientists from gathering even a minimal amount of information to address America's gun violence epidemic. We argue that in spite of a lack of federal investment, large amounts of publicly available data offer scientists an opportunity to measure a range of firearm-related behaviors. Given the diversity of available data -- including news coverage, social media, web forums, online advertisements, and Internet searches (to name a few) -- there are ample opportunities for scientists to study everything from trends in particular types of gun violence to gun related behaviors (such as purchases and safety practices) to public understanding of and sentiment towards various gun violence reduction measures. Science has been sidelined in the gun violence debate for too long. Scientists must tap the big media datastream and help resolve this crisis.
Expand Me   Adrian Benton, Braden Hancock, Glen A Coppersmith, John W Ayers, Mark Dredze. After Sandy Hook Elementary: A Year in the Gun Control Debate on Twitter. Bloomberg Data for Good Exchange, 2016. [PDF] [Bibtex]
The mass shooting at Sandy Hook elementary school on December 14, 2012 catalyzed a year of active debate and legislation on gun control in the United States. Social media hosted an active public discussion where people expressed their support and opposition to a variety of issues surrounding gun legislation. In this paper, we show how a content based analysis of Twitter data can provide insights and understanding into this debate. We estimate the relative support and opposition to gun control measures, along with a topic analysis of each camp by analyzing over 70 million gun-related tweets from 2013. We focus on spikes in conversation surrounding major events related to guns throughout the year. Our general approach can be applied to other important public health and political issues to analyze the prevalence and nature of public opinion.
Expand Me   Eric C Leas, Benjamin M Althouse, Mark Dredze, Nick Obradovich, James H Fowler, Seth M Noar, Jon-Patrick Allem, John W Ayers. Big data sensors of organic advocacy: The case of Leonardo DiCaprio and Climate Change. PLoS One, 2016;11(8):e0159885. [PDF] [Bibtex]
The strategies that experts have used to share information about social causes have historically been top-down, meaning the most influential messages are believed to come from planned events and campaigns. However, more people are independently engaging with social causes today than ever before, in part because online platforms allow them to instantaneously seek, create, and share information. In some cases this ``organic advocacy'' may rival or even eclipse top-down strategies. Big data analytics make it possible to rapidly detect public engagement with social causes by analyzing the same platforms from which organic advocacy spreads. To demonstrate this claim we evaluated how Leonardo DiCaprio's 2016 Oscar acceptance speech citing climate change motivated global English language news (Bloomberg Terminal news archives), social media (Twitter postings) and information seeking (Google searches) about climate change. Despite an insignificant increase in traditional news coverage (54%; 95%CI: -144 to 247), tweets including the terms ``climate change'' or ``global warming'' reached record highs, increasing 636% (95%CI: 573--699) with more than 250,000 tweets the day DiCaprio spoke. In practical terms the ``DiCaprio effect'' surpassed the daily average effect of the 2015 Conference of the Parties (COP) and the Earth Day effect by a factor of 3.2 and 5.3, respectively. At the same time, Google searches for ``climate change'' or ``global warming'' increased 261% (95%CI, 186--335) and 210% (95%CI 149--272) the day DiCaprio spoke and remained higher for 4 more days, representing 104,190 and 216,490 searches. This increase was 3.8 and 4.3 times larger than the increases observed during COP's daily average or on Earth Day. Searches were closely linked to content from Dicaprio's speech (e.g., ``hottest year''), as unmentioned content did not have search increases (e.g., ``electric car''). Because these data are freely available in real time our analytical strategy provides substantial lead time for experts to detect and participate in organic advocacy while an issue is salient. Our study demonstrates new opportunities to detect and aid agents of change and advances our understanding of communication in the 21st century media landscape.
Expand Me   Michael J Paul, Margaret S Chisolm, Matthew W Johnson, Ryan G Vandrey, Mark Dredze. Assessing the validity of online drug forums as a source for estimating demographic and temporal trends in drug use. Journal of Addiction Medicine, 2016;10(5):324--330. [PDF] [Bibtex]
Objectives: Addiction researchers have begun monitoring online forums to uncover self-reported details about use and effects of emerging drugs. The use of such online data sources has not been validated against data from large epidemiological surveys. This study aimed to characterize and compare the demographic and temporal trends associated with drug use as reported in online forums and in a large epidemiological survey. Methods: Data were collected from the website,, from January 2007 through August 2012 (143,416 messages posted by 8,087 members) and from the United States National Survey on Drug Use and Health (NSDUH) from 2007-2012. Measures of forum participation levels were compared with and validated against two measures from the NSDUH survey data: percentage of people using the drug in last 30 days and percentage using the drug more than 100 times in the past year. Results: For established drugs (e.g., cannabis), significant correlations were found across demographic groups between and the NSDUH survey data, while weaker, non-significant correlations were found with temporal trends. Emerging drugs (e.g., Salvia divinorum) were strongly associated with male users in the forum, in agreement with survey-derived data, and had temporal patterns that increased in synchrony with poison control reports. Conclusions: These results offer the first assessment of online drug forums as a valid source for estimating demographic and temporal trends in drug use. The analyses suggest that online forums are a reliable source for estimation of demographic associations and early identification of emerging drugs, but a less reliable source for measurement of long-term temporal trends.
     David A Broniatowski, Mark Dredze, Karen M Hilyard, Maeghan Dessecker, Sandra C Quinn, Amelia M Jamison, Michael J Paul, Michael C Smith. Both Mirror and Complement: A Comparison of Social Media Data and Survey Data about Flu Vaccination. American Public Health Association, 2016. [PDF] [Bibtex]
Expand Me   Matthew Biggerstaff, David Alper, Mark Dredze, Spencer Fox, Isaac Chun-Hai Fung, Kyle S Hickmann, Bryan Lewis, Roni Rosenfeld, Jeffrey Shaman, Ming-Hsiang Tsou, Paola Velardi, Alessandro Vespignani, Lyn Finelli. Results from the Centers for Disease Control and Prevention's Predict the 2013--2014 Influenza Season Challenge. BMC Infectious Diseases, 2016;16(357):10.1186/s12879-016-1669-x. [PDF] [Bibtex]
Background: Early insights into the timing of the start, peak, and intensity of the influenza season could be useful in planning influenza prevention and control activities. To encourage development and innovation in influenza forecasting, the Centers for Disease Control and Prevention (CDC) organized a challenge to predict the 2013--14 Unites States influenza season. Methods: Challenge contestants were asked to forecast the start, peak, and intensity of the 2013-2014 influenza season at the national level and at any or all Health and Human Services (HHS) region level(s). The challenge ran from December 1, 2013--March 27, 2014; contestants were required to submit 9 biweekly forecasts at the national level to be eligible. The selection of the winner was based on expert evaluation of the methodology used to make the prediction and the accuracy of the prediction as judged against the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet). Results: Nine teams submitted 13 forecasts for all required milestones. The first forecast was due on December 2, 2013; 3/13 forecasts received correctly predicted the start of the influenza season within one week, 1/13 predicted the peak within 1 week, 3/13 predicted the peak ILINet percentage within 1%, and 4/13 predicted the season duration within 1 week. For the prediction due on December 19, 2013, the number of forecasts that correctly forecasted the peak week increased to 2/13, the peak percentage to 6/13, and the duration of the season to 6/13. As the season progressed, the forecasts became more stable and were closer to the season milestones. Conclusion: Forecasting has become technically feasible, but further efforts are needed to improve forecast accuracy so that policy makers can reliably use these predictions. CDC and challenge contestants plan to build upon the methods developed during this contest to improve the accuracy of influenza forecasts.
Expand Me   Mark Dredze, Manuel García-Herranz, Alex Rutherford, Gideon Mann. Twitter as a Source of Global Mobility Patterns for Social Good. ICML Workshop on #Data4Good: Machine Learning in Social Good Applications, 2016. [PDF] [Bibtex]
Data on human spatial distribution and movement is essential for understanding and analyzing social systems. However existing sources for this data are lacking in various ways; difficult to access, biased, have poor geographical or temporal resolution, or are significantly delayed. In this paper, we describe how geolocation data from Twitter can be used to estimate global mobility patterns and address these shortcomings. These findings will inform how this novel data source can be harnessed to address humanitarian and development efforts.
     Mark Dredze, David A Broniatowski, Karen M Hilyard. Zika Vaccine Misconceptions: A social media analysis. Vaccine, 2016;34(30):3441-3442. [PDF] [Bibtex] [Data]
Expand Me   Mark Dredze, Prabhanjan Kambadur, Gary Kazantsev, Gideon Mann, Miles Osborne. How Twitter is Changing the Nature of Financial News Discovery. SIGMOD Workshop on Data Science for Macro-Modeling with Financial and Economic Datasets, 2016. [PDF] [Bibtex]
Access to the most relevant and current information is critical to financial analysis and decision making.Historically, financial news has been discovered through company press releases, required disclosures and news articles. More recently, social media has reshaped the financial news landscape, radically changing the dynamics of news dissemination. In this paper we discuss the ways in which Twitter, a leading social media platform, has contributed to changes in this landscape. We explain why today Twitter is a valuable source of material financial information and describe opportunities and challenges in using this novel news source for financial information discovery.
Expand Me   Nanyun Peng, Mark Dredze. Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning. Association for Computational Linguistics (ACL) (short paper), 2016. [PDF] [Bibtex]
Named entity recognition, and other information extraction tasks, frequently use linguistic features such as part of speech tags or chunkings. For languages where word boundaries are not readily identified in text, word segmentation is a key first step to generating features for an NER system. While using word boundary tags as features are helpful, the signals that aid in identifying these boundaries may provide richer information for an NER system. New state-of-the-art word segmentation systems use neural models to learn representations for predicting word boundaries. We show that these same representations, jointly trained with an NER system, yield significant improvements in NER for Chinese social media. In our experiments, jointly training NER and word segmentation with an LSTM-CRF model yields nearly 5% absolute improvement over previously published results.
Expand Me   Adrian Benton, Raman Arora, Mark Dredze. Learning Multiview Embeddings of Twitter Users. Association for Computational Linguistics (ACL) (short paper), 2016. [PDF] [Bibtex] [Code]
Low-dimensional vector representations are widely used as stand-ins for the text of words, sentences, and entire documents. These embeddings are used to identify similar words or make predictions about documents. In this work, we consider embeddings for social media users and demonstrate that these can be used to identify users who behave similarly or to predict attributes of users. In order to capture information from all aspects of a user's online life, we take a multiview approach, applying a weighted variant of Generalized Canonical Correlation Analysis (GCCA) to a collection of over 100,000 Twitter users. We demonstrate the utility of these multiview embeddings on three downstream tasks: user engagement, friend selection, and demographic attribute prediction.
Expand Me   David A Broniatowski, Mark Dredze, Karen M Hilyard. Effective Vaccine Communication during the Disneyland Measles Outbreak. Vaccine, 2016;34(28):3225-3228. [PDF] [Bibtex]
Vaccine refusal rates have increased in recent years, highlighting the need for effective risk communication, especially over social media. Fuzzy-trace theory predicts that individuals encode bottom-line meaning (''gist'') and statistical information (''verbatim'') in parallel and those articles expressing a clear gist will be most compelling. We coded news articles (n = 4581) collected during the 2014−2015 Disneyland measles for content including statistics, stories, or bottom-line gists regarding vaccines and vaccine-preventable illnesses. We measured the extent to which articles were compelling by how frequently they were shared on Facebook. The most widely shared articles expressed bottom-line gists, although articles containing statistics were also more likely to be shared than articles lacking statistics. Stories had limited impact on Facebook shares. Results support Fuzzy Trace Theory's predictions regarding the distinct yet parallel impact of categorical gist and statistical verbatim information on public health communication.
Expand Me   Ning Gao, Mark Dredze, Douglas Oard. Knowledge Base Population for Organization Mentions in Email. NAACL Workshop on Automated Knowledge Base Construction (AKBC), 2016. [PDF] [Bibtex]
A prior study found that on average there are 6.3 named mentions of organizations found in email messages from the Enron collection, only about half of which could be linked to known entities in Wikipedia. That suggests a need for collection-specific approaches to entity linking, similar to those have proven successful for person mentions. This paper describes a process for automatically constructing such a collection-specific knowledge base of organization entities for named mentions in Enron. A new public test collection for linking 130 mentions of organizations found in Enron email to either Wikipedia or to this new collection-specific knowledge base is also described. Together, Wikipedia entities plus the new collection-specific knowledge base cover 83% of the 130 organization mentions, a 14% (absolute) improvement over the 69% that could be linked to Wikipedia alone.
     Michael C Smith, David A Broniatowski, Mark Dredze. Using Twitter to Examine Social Rationales for Vaccine Refusal. International Engineering Systems Symposium (CESUN), 2016. [PDF] [Bibtex] [Data] [Poster]
Expand Me   Mo Yu, Mark Dredze, Raman Arora, Matthew R Gormley. Embedding Lexical Features via Low-rank Tensors. North American Chapter of the Association for Computational Linguistics (NAACL), 2016. [PDF] [Bibtex]
Modern NLP models rely heavily on engineered features, which often combine word and contextual information into complex lexical features. Such combination results in large numbers of features, which can lead to over-fitting. We present a new model that represents complex lexical features---comprised of parts for words, contextual information and labels---in a tensor that captures conjunction information among these parts. We apply low-rank tensor approximations to the corresponding parameter tensors to reduce the parameter space and improve prediction speed. Furthermore, we investigate two methods for handling features that include n-grams of mixed lengths. Our model achieves state-of-the-art results on tasks in relation extraction, PP-attachment, and preposition disambiguation.
Expand Me   Mark Dredze, Miles Osborne, Prabhanjan Kambadur. Geolocation for Twitter: Timing Matters. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2016. [PDF] [Bibtex]
Automated geolocation of social media messages can benefit a variety of downstream applications. However, these geolocation systems are typically evaluated without attention to how changes in time impact geolocation. Since different people, in different locations write messages at different times, these factors can significantly vary the performance of a geolocation system over time. We demonstrate cyclical temporal effects on geolocation accuracy in Twitter, as well as rapid drops as test data moves beyond the time period of training data. We show that temporal drift can effectively be countered with even modest online model updates.
Expand Me   John W Ayers, Benjamin M Althouse, Mark Dredze, Eric C Leas, Seth M Noar. News and Internet Searches About Human Immunodeficiency Virus After Charlie Sheen's Disclosure. JAMA Internal Medicine, 2016;176(4):552-554. [PDF] [Bibtex] (Ranked in the top .03% of 4.8m research outputs by Altmetric)
Celebrity Charlie Sheen publicly disclosed his human immunodeficiency virus (HIV)--positive status on November 17, 2015. Could Sheen's disclosure, like similar announcements from celebrities, generate renewed attention to HIV? We provide an early answer by examining news trends to reveal discussion of HIV in the mass media and Internet searches to reveal engagement with HIV-related topics around the time of Sheen's disclosure.
     Neeraja Nagarajan, Blair J Smart, Anthony Nastasi, Zoya J Effendi, Sruthi Murali, Zackary D Berger, Eric B Schneider, Mark Dredze, Joseph K Canner. An Analysis of Twitter Conversations on Global Surgical Care. Annual CUGH Global Health Conference, 2016. [Bibtex] (poster)
Expand Me   John W Ayers, J Lee Westmaas, Eric C Leas, Adrian Benton, Yunqi Chen, Mark Dredze, Benjamin M Althouse. Leveraging Big Data to Improve Health Awareness Campaigns: A Novel Evaluation of the Great American Smokeout. JMIR Public Health and Surveillance, 2016;2(1):e16. [PDF] [Bibtex]
Awareness campaigns are ubiquitous, but little is known about their potential effectiveness because traditional evaluations are often unfeasible. For 40 years, the ``Great American Smokeout'' (GASO) has encouraged media coverage and popular engagement with smoking cessation on the third Thursday of November as the nation's longest running awareness campaign. We proposed a novel evaluation framework for assessing awareness campaigns using the GASO as a case study by observing cessation-related news reports and Twitter postings, and cessation-related help seeking via Google, Wikipedia, and government-sponsored quitlines.
Expand Me   Munmun De Choudhury, Emre Kiciman, Mark Dredze, Glen A Coppersmith, Mrinal Kumar. Discovering Shifts to Suicidal Ideation from Mental Health Content in Social Media. Conference on Human Factors in Computing Systems (CHI), 2016. [PDF] [Bibtex] (Honorable Mention Award)
History of mental illness is a major factor behind suicide risk and ideation. However research efforts toward characterizing and forecasting this risk is limited due to the paucity of information regarding suicide ideation, exacerbated by the stigma of mental illness. This paper fills gaps in the literature by developing a statistical methodology to infer which individuals could undergo transitions from mental health discourse to suicidal ideation. We utilize semi-anonymous support communities on Reddit as unobtrusive data sources to infer the likelihood of these shifts. We develop language and interactional measures for this purpose, as well as a propensity score matching based statistical approach. Our approach allows us to derive distinct markers of shifts to suicidal ideation. These markers can be modeled in a prediction framework to identify individuals likely to engage in suicidal ideation in the future. We discuss societal and ethical implications of this research.
Expand Me   Animesh R Koratana, Mark Dredze, Margaret S Chisolm, Matthew W Johnson, Michael J Paul. Studying Anonymous Health Issues and Substance Use on College Campuses with Yik Yak. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2016. [PDF] [Bibtex]
This study investigates the public health intelligence utility of Yik Yak, a social media platform that allows users to anonymously post and view messages within precise geographic locations. Our dataset contains 122,179 "yaks" collected from 120 college campuses across the United States during 2015. We first present an exploratory analysis of the topics commonly discussed in Yik Yak, clarifying the health issues for which this may serve as a source of information. We then present an in-depth content analysis of data describing substance use, an important public health issue that is not often discussed in public social media, but commonly discussed on Yik Yak under the cloak of anonymity.
Expand Me   Adrian Benton, Michael J Paul, Braden Hancock, Mark Dredze. Collective Supervision of Topic Models for Predicting Surveys with Social Media. Association for the Advancement of Artificial Intelligence (AAAI), 2016. [PDF] [Bibtex]
This paper considers survey prediction from social media. We use topic models to correlate social media messages with survey outcomes and to provide an interpretable representation of the data. Rather than rely on fully unsupervised topic models, we use existing aggregated survey data to inform the inferred topics, a class of topic model supervision referred to as collective supervision. We introduce and explore a variety of topic model variants and provide an empirical analysis, with conclusions of the most effective models for this task.
Expand Me   Michael C Smith, David A Broniatowski, Michael J Paul, Mark Dredze. Towards Real-Time Measurement of Public Epidemic Awareness: Monitoring Influenza Awareness through Twitter. AAAI Spring Symposium on Observational Studies through Social Media and Other Human-Generated Content, 2016. [PDF] [Bibtex]
This study analyzes temporal trends in Twitter data pertaining to both influenza awareness and influenza infection during the 2012--13 influenza season in the US. We make use of classifiers to distinguish tweets that express a personal infection (``sick with the flu'') versus a more general awareness (``worried about the flu''). While previous research has focused on estimating prevalence of influenza infection, little is known about trends in public awareness of the disease. Our analysis shows that infection and awareness have very different trends. In contrast to infection trends, awareness trends have little regional variation, and our experiments suggest that public awareness is primarily driven by news media.
     Blair J Smart, Neeraja Nagarajan, Joseph K Canner, Mark Dredze, Eric B Schneider, Minh Luu, Zackary D Berger, Jonathan A Myers. The Use of Social Media in Surgical Education: An Analysis of Twitter. Annual Academic Surgical Congress, 2016. [Bibtex]
     Neeraja Nagarajan, Blair J Smart, Mark Dredze, Joy L Lee, James Taylor, Jonathan A Myers, Eric B Schneider, Zackary D Berger, Joseph K Canner. How do Surgical Providers use Social Media? A Mixed-Methods Analysis using Twitter. Annual Academic Surgical Congress, 2016. [Bibtex]
Expand Me   John W Ayers, Benjamin M Althouse, Jon-Patrick Allem, Eric C Leas, Mark Dredze, Rebecca Williams. Revisiting the Rise of Electronic Nicotine Delivery Systems Using Search Query Surveillance. American Journal of Preventive Medicine (AJPM), 2016;50(6):e173-e181. [PDF] [Bibtex]
Introduction: Public perceptions of electronic nicotine delivery systems (ENDS) remain poorly understood because surveys are too costly to regularly implement and, when implemented, there are long delays between data collection and dissemination. Search query surveillance has bridged some of these gaps. Herein, ENDS' popularity in the U.S. is reassessed using Google searches. Methods: ENDS searches originating in the U.S. from January 2009 through January 2015 were disaggregated by terms focused on e-cigarette (e.g., e-cig) versus vaping (e.g., vapers); their geolocation (e.g., state); the aggregate tobacco control measures corresponding to their geolocation (e.g., clean indoor air laws); and by terms that indicated the searcher's potential interest (e.g., buy e-cigs likely indicates shopping)---all analyzed in 2015. Results: ENDS searches are rapidly increasing in the U.S., with 8,498,000 searches during 2014 alone. Increasingly, searches are shifting from e-cigarette- to vaping-focused terms, especially in coastal states and states where anti-smoking norms are stronger. For example, nationally, e-cigarette searches declined 9% (95% CI=1%, 16%) during 2014 compared with 2013, whereas vaping searches increased 136% (95% CI=97%, 186%), even surpassing e-cigarette searches. Additionally, the percentage of ENDS searches related to shopping (e.g., vape shop) nearly doubled in 2014, whereas searches related to health concerns (e.g., vaping risks) or cessation (e.g., quit smoking with e-cigs) were rare and declined in 2014. Conclusions: ENDS popularity is rapidly growing and evolving. These findings could inform survey questionnaire development for follow-up investigation and immediately guide policy debates about how the public perceives the health risks or cessation benefits of ENDS.
Expand Me   Atul Nakhasi, Sarah G Bell, Ralph J Passarella, Michael J Paul, Mark Dredze, Peter J Pronovost. The Potential of Twitter as a Data Source for Patient Safety. Journal of Patient Safety, 2016. [PDF] [Bibtex]
Background: Error-reporting systems are widely regarded as critical components to improving patient safety, yet current systems do not effectively engage patients. We sought to assess Twitter as a source to gather patient perspective on errors in this feasibility study. Methods: We included publicly accessible tweets in English from any geography. To collect patient safety tweets, we consulted a patient safety expert and constructed a set of highly relevant phrases, such as "doctor screwed up." We used Twitter's search application program interface from January to August 2012 to identify tweets that matched the set of phrases. Two researchers used criteria to independently review tweets and choose those relevant to patient safety; a third reviewer resolved discrepancies. Variables included source and sex of tweeter, source and type of error, emotional response, and mention of litigation. Results: Of 1006 tweets analyzed, 839 (83%) identified the type of error: 26% of which were procedural errors, 23% were medication errors, 23% were diagnostic errors, and 14% were surgical errors. A total of 850 (84%) identified a tweet source, 90% of which were by the patient and 9% by a family member. A total of 519 (52%) identified an emotional response, 47% of which expressed anger or frustration, 21% expressed humor or sarcasm, and 14% expressed sadness or grief. Of the tweets, 6.3% mentioned an intent to pursue malpractice litigation. Conclusions: Twitter is a relevant data source to obtain the patient perspective on medical errors. Twitter may provide an opportunity for health systems and providers to identify and communicate with patients who have experienced a medical error. Further research is needed to assess the reliability of the data.
Expand Me   Brad J Bushman, Katherine Newman, Sandra L Calvert, Geraldine Downey, Mark Dredze, Michael Gottfredson, Nina G Jablonski, Ann S Masten, Calvin Morrill, Daniel B Neill, Daniel Romer, Daniel W Webster. Youth Violence: What We Know and What We Need to Know. American Psychologist, 2016;71(1):17-39. [PDF] [Bibtex]
School shootings tear the fabric of society. In the wake of a school shooting, parents, pediatricians, policymakers, politicians, and the public search for ``the'' cause of the shooting. But there is no single cause. The causes of school shootings are extremely complex. After the Sandy Hook Elementary School rampage shooting in Newtown, Connecticut, we wrote a report for the National Science Foundation on what is known and not known about youth violence. This article summarizes and updates that report. After distinguishing violent behavior from aggressive behavior, we describe the prevalence of gun violence in the United States and age-related risks for violence. We delineate important differences between violence in the context of rare rampage school shootings, and much more common urban street violence. Acts of violence are influenced by multiple factors, often acting together. We summarize evidence on some major risk factors and protective factors for youth violence, highlighting individual and contextual factors, which often interact. We consider new quantitative ``data mining'' procedures that can be used to predict youth violence perpetrated by groups and individuals, recognizing critical issues of privacy and ethical concerns that arise in the prediction of violence. We also discuss implications of the current evidence for reducing youth violence, and we offer suggestions for future research. We conclude by arguing that the prevention of youth violence should be a national priority.

     2015 (28 Publications)
     Yu Wang, Eugene Agichtein, Tom Clark, Mark Dredze, Jeffrey Staton. Inferring latent user characteristics for analyzing political discussions in social media. Atlanta Computational Social Science Workshop, 2015. [Bibtex]
     Mark Dredze, David A Broniatowski, Michael C Smith, Karen M Hilyard. Understanding Vaccine Refusal: Why We Need Social Media Now. American Journal of Preventive Medicine (AJPM), 2015;50(4):550-552. [PDF] [Bibtex] [Data]
Expand Me   Mauricio Santillana, Andre T Nguyen, Mark Dredze, Michael J Paul, Elaine Nsoesie, John S Brownstein. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLOS Computational Biology, 2015. [PDF] [Bibtex]
We present a machine learning-based methodology capable of providing real-time (``nowcast'') and forecast estimates of influenza activity in the US by leveraging data from multiple data sources including: Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system. Our main contribution consists of combining multiple influenza-like illnesses (ILI) activity estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions for up to four weeks ahead of the release of CDC's ILI reports. We evaluate the predictive ability of our ensemble approach during the 2013--2014 (retrospective) and 2014--2015 (live) flu seasons for each of the four weekly time horizons. Our ensemble approach demonstrates several advantages: (1) our ensemble method's predictions outperform every prediction using each data source independently, (2) our methodology can produce predictions one week ahead of GFT's real-time estimates with comparable accuracy, and (3) our two and three week forecast estimates have comparable accuracy to real-time predictions using an autoregressive model. Moreover, our results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowd sourced data, into influenza predictions in all time horizons.
Expand Me   Matthew R Gormley, Mark Dredze, Jason Eisner. Approximation-Aware Dependency Parsing by Belief Propagation. Transactions of the Association for Computational Linguistics (TACL), 2015. [PDF] [Bibtex]
We show how to train the fast dependency parser of Smith and Eisner (2008) for improved accuracy. This parser can consider higher-order interactions among edges while retaining O(n^3) runtime. It outputs the parse with maximum expected recall---but for speed, this expectation is taken under a posterior distribution that is constructed only approximately, using loopy belief propagation through structured factors. We show how to adjust the model parameters to compensate for the errors introduced by this approximation, by following the gradient of the actual loss on training data. We find this gradient by backpropagation. That is, we treat the entire parser (approximations and all) as a differentiable circuit, as others have done for loopy CRFs (Domke, 2010; Stoyanov et al., 2011; Domke, 2011; Stoyanov and Eisner, 2012). The resulting parser obtains higher accuracy with fewer iterations of belief propagation than one trained by conditional log-likelihood.
     David A Broniatowski, Mark Dredze, Karen M Hilyard. News Articles are More Likely to be Shared if they Combine Statistics with Explanation. Conference of the Society for Medical Decision Making, 2015. [Bibtex]
Expand Me   Matthew R Gormley, Mo Yu, Mark Dredze. Improved Relation Extraction with Feature-Rich Compositional Embedding Models. Empirical Methods in Natural Language Processing (EMNLP), 2015. [PDF] [Bibtex]
Compositional embedding models build a representation (or embedding) for a linguistic structure based on its component word embeddings. We propose a Feature-rich Compositional Embedding Model (FCM) for relation extraction that is expressive, generalizes to new domains, and is easy-to-implement. The key idea is to combine both (unlexicalized) handcrafted features with learned word embeddings. The model is able to directly tackle the difficulties met by traditional compositional embeddings models, such as handling arbitrary types of sentence annotations and utilizing global information for composition. We test the proposed model on two relation extraction tasks, and demonstrate that our model outperforms both previous compositional models and traditional feature rich models on the ACE 2005 relation extraction task, and the SemEval 2010 relation classification task. The combination of our model and a loglinear classifier with hand-crafted features gives state-of-the-art results. We made our implementation available for general use.
Expand Me   Nanyun Peng, Mark Dredze. Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings. Empirical Methods in Natural Language Processing (EMNLP) (short paper), 2015. [PDF] [Bibtex] [Code]
We consider the task of named entity recognition for Chinese social media. The long line of work in Chinese NER has focused on formal domains, and NER for social media has been largely restricted to English. We present a new corpus of Weibo messages annotated for both name and nominal mentions. Additionally, we evaluate three types of neural embeddings for representing Chinese text. Finally, we propose a joint training objective for the embeddings that makes use of both (NER) labeled and unlabeled raw text. Our methods yield a 9% improvement over a state-of-the-art baseline.
     Matthew Biggerstaff, David Alper, Mark Dredze, Spencer Fox, Isaac Chun-Hai Fung, Kyle S Hickmann, Bryan Lewis, Roni Rosenfeld, Jeffrey Shaman, Ming-Hsiang Tsou, Paola Velardi, Alessandro Vespignani, Lyn Finelli. Results from the Centers for Disease Control and Prevention's Predict the 2013--2014 Influenza Season Challenge. International Conference of Emerging Infectious Diseases Conference, 2015. [PDF] [Bibtex]
Expand Me   Ellie Pavlick, Travis Wolfe, Pushpendre Rastogi, Chris Callison-Burch, Mark Dredze, Benjamin Van Durme. FrameNet+: Fast Paraphrastic Tripling of FrameNet. Association for Computational Linguistics (ACL) (short paper), 2015. [PDF] [Bibtex] [Data]
We increase the lexical coverage of FrameNet through automatic paraphrasing. We use crowdsourcing to manually filter out bad paraphrases in order to ensure a high-precision resource. Our expanded FrameNet contains an additional 22K lexical units, a 3-fold increase over the current FrameNet, and achieves 40% better coverage when evaluated in a practical setting on New York Times data.
Expand Me   Nanyun Peng, Mo Yu, Mark Dredze. An Empirical Study of Chinese Name Matching and Applications. Association for Computational Linguistics (ACL) (short paper), 2015. [PDF] [Bibtex] [Code]
Methods for name matching, an important component to support downstream tasks such as entity linking and entity clustering, have focused on alphabetic languages, primarily English. In contrast, logogram languages such as Chinese remain untested. We evaluate methods for name matching in Chinese, including both string matching and learning approaches. Our approach, based on new representations for Chinese, improves both name matching and a downstream entity clustering task.
Expand Me     Travis Wolfe, Mark Dredze, James Mayfield, Paul McNamee, Craig Harman, Tim Finin, Benjamin Van Durme. Interactive Knowledge Base Population. Unpublished Manuscript, 2015. [PDF] [Bibtex]
Most work on building knowledge bases has focused on collecting entities and facts from as large a collection of documents as possible. We argue for and describe a new paradigm where the focus is on a high-recall extraction over a small collection of documents under the supervision of a human expert, that we call Interactive Knowledge Base Population (IKBP).
Expand Me   Mrinal Kumar, Mark Dredze, Glen A Coppersmith, Munmun De Choudhury. Detecting Changes in Suicide Content Manifested in Social Media Following Celebrity Suicides. Conference on Hypertext and Social Media, 2015. [PDF] [Bibtex]
The Werther effect describes the increased rate of completed or attempted suicides following the depiction of an individual's suicide in the media, typically a celebrity. We present findings on the prevalence of this effect in an online platform: r/SuicideWatch on Reddit. We examine both the posting activity and post content after the death of ten high-profile suicides. Posting activity increases following reports of celebrity suicides, and post content exhibits considerable changes that indicate increased suicidal ideation. Specifically, we observe that post-celebrity suicide content is more likely to be inward focused, manifest decreased social concerns, and laden with greater anxiety, anger, and negative emotion. Topic model analysis further reveals content in this period to switch to a more derogatory tone that bears evidence of self-harm and suicidal tendencies. We discuss the implications of our findings in enabling better community support to psychologically vulnerable populations, and the potential of building suicide prevention interventions following high-profile suicides.
     Michael C Smith, David A Broniatowski, Michael J Paul, Mark Dredze. Tracking Public Awareness of Influenza through Twitter. 3rd International Conference on Digital Disease Detection (DDD), 2015. [Bibtex] (rapid fire talk)
     Joanna E Cohen, Rebecca Shillenn, Mark Dredze, John W Ayers. Tobacco Watcher: Real-Time Global Tobacco Surveillance Using Online News Media. Annual Meeting of the Society for Research on Nicotine and Tobacco, 2015. [Bibtex]
Expand Me   David A Broniatowski, Mark Dredze, Michael J Paul, Andrea Dugas. Using Social Media to Perform Local Influenza Surveillance in an Inner-City Hospital. JMIR Public Health and Surveillance, 2015. [PDF] [Bibtex]
Background: Public health officials and policy makers in the United States expend significant resources at the national, state, county, and city levels to measure the rate of influenza infection. These individuals rely on influenza infection rate information to make important decisions during the course of an influenza season driving vaccination campaigns, clinical guidelines, and medical staffing. Web and social media data sources have emerged as attractive alternatives to supplement existing practices. While traditional surveillance methods take 1-2 weeks, and significant labor, to produce an infection estimate in each locale, web and social media data are available in near real-time for a broad range of locations. Objective: The objective of this study was to analyze the efficacy of flu surveillance from combining data from the websites Google Flu Trends and HealthTweets at the local level. We considered both emergency department influenza-like illness cases and laboratory-confirmed influenza cases for a single hospital in the City of Baltimore. Methods: This was a retrospective observational study comparing estimates of influenza activity of Google Flu Trends and Twitter to actual counts of individuals with laboratory-confirmed influenza, and counts of individuals presenting to the emergency department with influenza-like illness cases. Data were collected from November 20, 2011 through March 16, 2014. Each parameter was evaluated on the municipal, regional, and national scale. We examined the utility of social media data for tracking actual influenza infection at the municipal, state, and national levels. Specifically, we compared the efficacy of Twitter and Google Flu Trends data. Results: We found that municipal-level Twitter data was more effective than regional and national data when tracking actual influenza infection rates in a Baltimore inner-city hospital. When combined, national-level Twitter and Google Flu Trends data outperformed each data source individually. In addition, influenza-like illness data at all levels of geographic granularity were best predicted by national Google Flu Trends data. Conclusions: In order to overcome sensitivity to transient events, such as the news cycle, the best-fitting Google Flu Trends model relies on a 4-week moving average, suggesting that it may also be sacrificing sensitivity to transient fluctuations in influenza infection to achieve predictive power. Implications for influenza forecasting are discussed in this report.
     J Lee Westmaas, John W Ayers, Mark Dredze, Benjamin M Althouse. Evaluation of the Great American Smokeout by Digital Surveillance. Society of Behavioral Medicine, 2015. [Bibtex] (Citation Award, given to the best submissions)
Expand Me   Glen A Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, Margaret Mitchell. CLPsych 2015 Shared Task: Depression and PTSD on Twitter. NAACL Workshop on Computational Linguistics and Clinical Psychology, 2015. [PDF] [Bibtex]
This paper presents a summary of the Computational Linguistics and Clinical Psychology (CLPsych) 2015 shared and unshared tasks. These tasks aimed to provide apples-to-apples comparisons of various approaches to modeling language relevant to mental health from social media. The data used for these tasks is from Twitter users who state a diagnosis of depression or post traumatic stress disorder (PTSD) and demographically-matched community controls. The unshared task was a hackathon held at Johns Hopkins University in November 2014 to explore the data, and the shared task was conducted remotely, with each participating team submitted scores for a held-back test set of users. The shared task consisted of three binary classification experiments: (1) depression versus control, (2) PTSD versus control, and (3) depression versus PTSD. Classifiers were compared primarily via their average precision, though a number of other metrics are used along with this to allow a more nuanced interpretation of the performance measures.
Expand Me   Mo Yu, Mark Dredze. Learning Composition Models for Phrase Embeddings. Transactions of the Association for Computational Linguistics (TACL), 2015. [PDF] [Bibtex] [Code]
Lexical embeddings can serve as useful representations for words for a variety of NLP tasks, but learning embeddings for phrases can be challenging. While separate embeddings are learned for each word, this is infeasible for every phrase. We construct phrase embeddings by learning how to compose word embeddings using features that capture phrase structure and context. We propose efficient unsupervised and task-specific learning objectives that scale our model to large datasets. We demonstrate improvements on both language modeling and several phrase semantic similarity tasks with various phrase lengths. We make the implementation of our model and the datasets available for general use.
Expand Me   Glen A Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead. From ADHD to SAD: analyzing the language of mental health on Twitter through self-reported diagnoses. NAACL Workshop on Computational Linguistics and Clinical Psychology, 2015. [PDF] [Bibtex]
Many significant challenges exist for the mental health field, but one in particular is a lack of data available to guide research. Language provides a natural lens for studying mental health -- much existing work and therapy have strong linguistic components, so the creation of a large, varied, language-centric dataset could provide significant grist for the field of mental health research. We examine a broad range of mental health conditions in Twitter data by identifying self-reported statements of diagnosis. We systematically explore language differences between ten conditions with respect to the general population, and to each other. Our aim is to provide guidance and a roadmap for where deeper exploration is likely to be fruitful.
Expand Me   Nanyun Peng, Francis Ferraro, Mo Yu, Nicholas Andrews, Jay DeYoung, Max Thomas, Matthew R Gormley, Travis Wolfe, Craig Harman, Benjamin Van Durme, Mark Dredze. A Chinese Concrete NLP Pipeline. North American Chapter of the Association for Computational Linguistics (NAACL) (Demo Paper), 2015. [PDF] [Bibtex]
Natural language processing research increasingly relies on the output of a variety of syntactic and semantic analytics. Yet integrating output from multiple analytics into a single framework can be time consuming and slow research progress. We present a CONCRETE Chinese NLP Pipeline: an NLP stack built using a series of open source systems integrated based on the CONCRETE data schema. Our pipeline includes data ingest, word segmentation, part of speech tagging, parsing, named entity recognition, relation extraction and cross document coreference resolution. Additionally, we integrate a tool for visualizing these annotations as well as allowing for the manual annotation of new data. We release our pipeline to the research community to facilitate work on Chinese language tasks that require rich linguistic annotations.
Expand Me   Adrian Benton, Mark Dredze. Entity Linking for Spoken Language. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2015. [PDF] [Bibtex] [Data]
Research on entity linking has considered a broad range of text, including newswire, blogs and web documents in multiple languages. However, the problem of entity linking for spoken language remains unexplored. Spoken language obtained from automatic speech recognition systems poses different types of challenges for entity linking; transcription errors can distort the context, and named entities tend to have high error rates. We propose features to mitigate these errors and evaluate the impact of ASR errors on entity linking using a new corpus of entity linked broadcast news transcripts.
Expand Me   Travis Wolfe, Mark Dredze, Benjamin Van Durme. Predicate Argument Alignment using a Global Coherence Model. North American Chapter of the Association for Computational Linguistics (NAACL), 2015. [PDF] [Bibtex]
We present a joint model for predicate argument alignment. We leverage multiple sources of semantic information, including temporal ordering constraints between events. These are combined in a max-margin framework to find a globally consistent view of entities and events across multiple documents, which leads to improvements over a very strong local baseline.
Expand Me   Mo Yu, Matthew R Gormley, Mark Dredze. Combining Word Embeddings and Feature Embeddings for Fine-grained Relation Extraction. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2015. [PDF] [Bibtex]
Compositional embedding models build a representation for a linguistic structure based on its component word embeddings. While recent work has combined these word embeddings with hand crafted features for improved performance, it was restricted to a small number of features due to model complexity, thus limiting its applicability. We propose a new model that conjoins features and word embeddings while maintaining a small number of parameters by learning feature embeddings jointly with the parameters of a compositional model. The result is a method that can scale to more features and more labels, while avoiding overfitting. We demonstrate that our model attains state-of-the-art results on ACE and ERE fine-grained relation extraction.
Expand Me   Michael J Paul, Mark Dredze. SPRITE: Generalizing Topic Models with Structured Priors. Transactions of the Association for Computational Linguistics (TACL), 2015. [PDF] [Bibtex] [Code]
We introduce SPRITE, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing SPRITE to be tailored to particular settings. We demonstrate this flexibility by constructing a SPRITE-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks.
Expand Me   Shiliang Wang, Michael J Paul, Mark Dredze. Social Media as a Sensor of Air Quality and Public Response in China. Journal of Medical Internet Research (JMIR), 2015. [PDF] [Bibtex]
Background: Recent studies have demonstrated the utility of social media data sources for a wide range of public health goals including disease surveillance, mental health trends, and health perceptions and sentiment. Most such research has focused on English-language social media for the task of disease surveillance. Objective: We investigated the value of Chinese social media for monitoring air quality trends and related public perceptions and response. The goal was to determine if this data is suitable for learning actionable information about pollution levels and public response. Methods: We mined a collection of 93 million messages from Sina Weibo, China's largest microblogging service. We experimented with different filters to identify messages relevant to air quality, based on keyword matching and topic modeling. We evaluated the reliability of the data filters by comparing message volume per city to air particle pollution rates obtained from the Chinese government for 74 cities. Additionally, we performed a qualitative study of the content of pollution-related messages by coding a sample of 170 messages for relevance to air quality, and whether the message included details such as a reactive behavior or a health concern. Results: The volume of pollution-related messages is highly correlated with particle pollution levels, with Pearson correlation values up to .718 (74
Expand Me   Haoyu Wang, Eduard Hovy, Mark Dredze. The Hurricane Sandy Twitter Corpus. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2015. [PDF] [Bibtex] [Data]
The growing use of social media has made it a critical component of disaster response and recovery efforts. Both in terms of preparedness and response, public health officials and first responders have turned to automated tools to assist with organizing and visualizing large streams of social media. In turn, this has spurred new research into algorithms for information extraction, event detection and organization, and information visualization. One challenge of these efforts has been the lack of a common corpus for disaster response on which researchers can compare and contrast their work. This paper describes the Hurricane Sandy Twitter Corpus: 6.5 million geotagged Twitter posts from the geographic area and time period of the 2012 Hurricane Sandy.
     Michael J Paul, Mark Dredze, David A Broniatowski, Nicholas Generous. Worldwide Influenza Surveillance through Twitter. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2015. [Bibtex]
     Joanna E Cohen, John W Ayers, Mark Dredze. Tobacco Watcher: Real-time Global Surveillance for Tobacco Control. World Conference on Tobacco or Health (WCTOH), 2015. [Bibtex]

     2014 (23 Publications)
Expand Me   Ning Gao, Douglas Oard, Mark Dredze. A Test Collection for Email Entity Linking. NIPS Workshop on Automated Knowledge Base Construction, 2014. [PDF] [Bibtex]
Most prior work on entity linking has focused on linking name mentions found in third-person communication (e.g., news) to broad-coverage knowledge bases (e.g., Wikipedia). A restricted form of domain-specific entity linking has, however, been tried with email, linking mentions of people to specific email addresses. This paper introduces a new test collection for the task of linking mentions of people, organizations, and locations to Wikipedia. Annotation of 200 randomly selected entities of each type from the Enron email collection indicates that domain specific knowledge bases are indeed required to get good coverage of people and organizations, but that Wikipedia provides good (93%) coverage for the named mentions of locations in the Enron collection. Furthermore, experiments with an existing entity linking system indicate that the absence of a suitable referent in Wikipedia can easily be recognized by automated systems, with NIL precision (i.e., correct detection of the absence of a suitable referent) above 90% for all three entity types.
Expand Me   Adrian Benton, Jay DeYoung, Adam Teichert, Mark Dredze, Benjamin Van Durme, Stephen Mayhew, Max Thomas. Faster (and Better) Entity Linking with Cascades. NIPS Workshop on Automated Knowledge Base Construction, 2014. [PDF] [Bibtex]
Entity linking requires ranking thousands of candidates for each query, a time consuming process and a challenge for large scale linking. Many systems rely on prediction cascades to efficiently rank candidates. However, the design of these cascades often requires manual decisions about pruning and feature use, limiting the effectiveness of cascades. We present Slinky, a modular, flexible, fast and accurate entity linker based on prediction cascades. We adapt the web-ranking prediction cascade learning algorithm, Cronus, in order to learn cascades that are both accurate and fast. We show that by balancing between accurate and fast linking, this algorithm can produce Slinky configurations that are significantly faster and more accurate than a baseline configuration and an alternate cascade learning method with a fixed introduction of features.
     Mo Yu, Matthew R Gormley, Mark Dredze. Factor-based Compositional Embedding Models. NIPS Workshop on Learning Semantics, 2014. [PDF] [Bibtex] [Code]
     Rebecca Knowles, Mark Dredze, Kathleen Evans, Elyse Lasser, Tom Richards, Jonathan Weiner, Hadi Kharrazi. High Risk Pregnancy Prediction from Clinical Text. NIPS Workshop on Machine Learning for Clinical Data Analysis, 2014. [PDF] [Bibtex]
Expand Me   Michael J Paul, Mark Dredze, David A Broniatowski. Twitter Improves Influenza Forecasting. PLOS Currents Outbreaks, 2014. [PDF] [Bibtex]
Accurate disease forecasts are imperative when preparing for influenza epidemic outbreaks; nevertheless, these forecasts are often limited by the time required to collect new, accurate data. In this paper, we show that data from the microblogging community Twitter significantly improves influenza forecasting. Most prior influenza forecast models are tested against historical influenza-like illness (ILI) data from the U.S. Centers for Disease Control and Prevention (CDC). These data are released with a one-week lag and are often initially inaccurate until the CDC revises them weeks later. Since previous studies utilize the final, revised data in evaluation, their evaluations do not properly determine the effectiveness of forecasting. Our experiments using ILI data available at the time of the forecast show that models incorporating data derived from Twitter can reduce forecasting error by 17-30% over a baseline that only uses historical data. For a given level of accuracy, using Twitter data produces forecasts that are two to four weeks ahead of baseline models. Additionally, we find that models using Twitter data are, on average, better predictors of influenza prevalence than are models using data from Google Flu Trends, the leading web data source.
     Joy L Lee, Matthew DeCamp, Mark Dredze, Margaret S Chisolm, Zackary D Berger. What Are Health-related Users Tweeting? A Qualitative Content Analysis of Health-related Users and their Messages on Twitter. Journal of Medical Internet Research (JMIR), 2014. [PDF] [Bibtex]
Expand Me   Michael J Paul, Mark Dredze. Discovering Health Topics in Social Media Using Topic Models. PLoS ONE, 2014. [PDF] [Bibtex] [Data]
By aggregating self-reported health statuses across millions of users, we seek to characterize the variety of health information discussed in Twitter. We describe a topic modeling framework for discovering health topics in Twitter, a social media website. This is an exploratory approach with the goal of understanding what health topics are commonly discussed in social media. This paper describes in detail a statistical topic model created for this purpose, the Ailment Topic Aspect Model (ATAM), as well as our system for filtering general Twitter data based on health keywords and supervised classification. We show how ATAM and other topic models can automatically infer health topics in 144 million Twitter messages from 2011 to 2013. ATAM discovered 13 coherent clusters of Twitter messages, some of which correlate with seasonal influenza (r = 0.689) and allergies (r = 0.810) temporal surveillance data, as well as exercise (r = .534) and obesity (r = −.631) related geographic survey data in the United States. These results demonstrate that it is possible to automatically discover topics that attain statistically significant correlations with ground truth data, despite using minimal human supervision and no historical data to train the model, in contrast to prior work. Additionally, these results demonstrate that a single general-purpose model can identify many different health topics in social media.
     David A Broniatowski, Michael J Paul, Mark Dredze. Twitter: Big Data Opportunities (Letter) Science, 2014;345(6193):148. [PDF] [Bibtex]
     Ahmed Abbasi, Donald Adjeroh, Mark Dredze, Michael J Paul, Fatemeh Mariam Zahedi, Huimin Zhao, Nitin Walia, Hemant Jain, Patrick Sanvanson, Reza Shaker, Marco D Huesch, Richard Beal, Wanhong Zheng, Marie Abate, Arun Ross. Social Media Analytics for Smart Health. IEEE Intelligent Systems, 2014;29(2):60--80. [PDF] [Bibtex]
     Byron C Wallace, Michael J Paul, Urmimala Sarkar, Thomas A Trikalinos, Mark Dredze. A Large-Scale Quantitative Analysis of Latent Factors and Sentiment in Online Doctor Reviews. Journal of the American Medical Informatics Association (JAMIA), 2014;21(6):1098--1103. [PDF] [Bibtex]
Expand Me   Mark Dredze, Renyuan Cheng, Michael J Paul, David A Broniatowski. A Platform for Public Health Surveillance using Twitter. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014. [PDF] [Bibtex] [Website]
We present, a new platform for sharing the latest research results on Twitter data with researchers and public officials. In this demo paper, we describe data collection, processing, and features of the site. The goal of this service is to transition results from research to practice.
     Michael J Paul, Mark Dredze, David A Broniatowski. Challenges in Influenza Forecasting and Opportunities for Social Media. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014. [Bibtex]
Expand Me   Shiliang Wang, Michael J Paul, Mark Dredze. Exploring Health Topics in Chinese Social Media: An Analysis of Sina Weibo. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014. [PDF] [Bibtex]
This paper seeks to identify and characterize health-related topics discussed on the Chinese microblogging website, Sina Weibo. We identified nearly 1 million messages containing health-related keywords, filtered from a dataset of 93 million messages spanning five years. We applied probabilistic topic models to this dataset and identified the prominent health topics. We show that a variety of health topics are discussed in Sina Weibo, and that four flu-related topics are correlated with monthly influenza case rates in China.
Expand Me   Mo Yu, Mark Dredze. Improving Lexical Embeddings with Semantic Knowledge. Association for Computational Linguistics (ACL) (short paper), 2014. [PDF] [Bibtex] [Code]
Word embeddings learned on unlabeled data are a popular tool in semantics, but may not capture the desired semantics. We propose a new learning objective that incorporates both a neural language model objective and prior knowledge from semantic resources to learn improved lexical semantic embeddings. We demonstrate that our embeddings improve over those learned solely on raw text in three settings: language modeling, measuring semantic similarity, and predicting human judgements.
Expand Me   Nanyun Peng, Yiming Wang, Mark Dredze. Learning Polylingual Topic Models from Code-Switched Social Media Documents. Association for Computational Linguistics (ACL) (short paper), 2014. [PDF] [Bibtex] [Code]
Code-switched documents are common in social media, providing evidence for polylingual topic models to infer aligned topics across languages. We present Code-Switched LDA (csLDA), which infers language specific topic distributions based on code-switched documents to facilitate multi-lingual corpus analysis. We experiment on two code-switching corpora (English-Spanish Twitter data and English-Chinese Weibo data) and show that csLDA improves perplexity over LDA, and learns semantically coherent aligned topics as judged by human annotators.
Expand Me   Glen A Coppersmith, Mark Dredze, Craig Harman. Quantifying Mental Health Signals in Twitter. ACL Workshop on Computational Linguistics and Clinical Psychology, 2014. [PDF] [Bibtex]
The ubiquity of social media provides a rich opportunity to enhance the data available to mental health clinicians and researchers, enabling a better-informed and better-equipped mental health field. We present analysis of mental health phenomena in publicly available Twitter data, demonstrating how rigorous application of simple natural language processing methods can yield insight into specific disorders as well as mental health writ large, along with evidence that as-of-yet undiscovered linguistic signals relevant to mental health exist in social media. We present a novel method for gathering data for a range of mental illnesses quickly and cheaply, then focus on analysis of four in particular: post-traumatic stress disorder (PTSD), major depressive disorder, bipolar disorder, and seasonal affective disorder. We intend for these proof-of-concept results to inform the necessary ethical discussion regarding the balance between the utility of such data and the privacy of mental health related information.
Expand Me   Glen A Coppersmith, Craig Harman, Mark Dredze. Measuring Post Traumatic Stress Disorder in Twitter. International Conference on Weblogs and Social Media (ICWSM), 2014. [PDF] [Bibtex]
Traditional mental health studies rely on information primarily collected and analyzed through personal contact with a health care professional. Recent work has shown the utility of social media data for studying depression, but there have been limited evaluations of other mental health conditions. We consider post traumatic stress disorder (PTSD), a serious condition that affects millions worldwide, with especially high rates in military veterans. We show how to obtain a PTSD classifier for social media using simple searches of available Twitter data, a significant reduction in training data cost compared to previous work on mental health. We demonstrate its utility by an examination of language use from PTSD individuals, and by detecting elevated rates of PTSD at and around US military bases using our classifiers.
Expand Me   Miles Osborne, Mark Dredze. Facebook, Twitter and Google Plus for Breaking News: Is there a winner? International Conference on Weblogs and Social Media (ICWSM), 2014. [PDF] [Bibtex] [Supplement]
Twitter is widely seen as being the go to place for breaking news. Recently however, competing Social Media have begun to carry news. Here we examine how Facebook, Google Plus and Twitter report on breaking news. We consider coverage (whether news events are reported) and latency (the time when they are reported). Using data drawn from three weeks in December 2013, we identify 29 major news events, ranging from celebrity deaths, plague outbreaks to sports events. We find that all media carry the same major events, but Twitter continues to be the preferred medium for breaking news, almost consistently leading Facebook or Google Plus. Facebook and Google Plus largely repost newswire stories and their main research value is that they conveniently package multitple sources of information together.
Expand Me   Matthew R Gormley, Margaret Mitchell, Benjamin Van Durme, Mark Dredze. Low-Resource Semantic Role Labeling. Association for Computational Linguistics (ACL), 2014. [PDF] [Bibtex] [Code]
We explore the extent to which high-resource manual annotations such as treebanks are necessary for the task of semantic role labeling (SRL). We examine how performance changes without syntactic supervision, comparing both joint and pipelined methods to induce latent syntax. This work highlights a new application of unsupervised grammar induction and demonstrates several approaches to SRL in the absence of supervised syntax. Our best models obtain competitive results in the high-resource setting and state-of-the-art results in the low resource setting, reaching 72.48% F1 averaged across languages. We release our code for this work along with a larger toolkit for specifying arbitrary graphical structure.
Expand Me   Nicholas Andrews, Jason Eisner, Mark Dredze. Robust Entity Clustering via Phylogenetic Inference. Association for Computational Linguistics (ACL), 2014. [PDF] [Bibtex] [Code]
Entity clustering must determine when two named-entity mentions refer to the same entity. Typical approaches use a pipeline architecture that clusters the mentions using fixed or learned measures of name and context similarity. In this paper, we propose a model for cross-document coreference resolution that achieves robustness by learning similarity from unlabeled data. The generative process assumes that each entity mention arises from copying and optionally mutating an earlier name from a similar context. Clustering the mentions into entities depends on recovering this copying tree jointly with estimating models of the mutation process and parent selection process. We present a block Gibbs sampler for posterior inference and an empirical evalution on several datasets. On a challenging Twitter corpus, our method outperforms the best baseline by 12.6 points of F1 score.
     John W Ayers, Benjamin M Althouse, Mark Dredze. Could Behavioral Medicine Lead the Web Data Revolution? Journal of the American Medical Association (JAMA), 2014;311(14):1399--1400. [PDF] [Bibtex]
     John W Ayers, Benjamin M Althouse, Morgan Johnson, Mark Dredze, Joanna E Cohen. What's the Healthiest Day? Circaseptan (Weekly) Rhythms in Healthy Considerations. American Journal of Preventive Medicine (AJPM), 2014;47(1):73-76. [PDF] [Bibtex]
     Benjamin M Althouse, Jon-Patrick Allem, Matt Childers, Mark Dredze, John W Ayers. Population Health Concerns During the United States' Great Recession. American Journal of Preventive Medicine (AJPM), 2014;46(2):166-170. [PDF] [Bibtex]

     2013 (14 Publications)
     Mark Dredze, Bill Schilit. Facet suggestion for search query augmentation. US Patent 8,433,705, 2013. [Bibtex]
Expand Me   David A Broniatowski, Michael J Paul, Mark Dredze. National and Local Influenza Surveillance through Twitter: An Analysis of the 2012-2013 Influenza Epidemic. PLOS ONE, 2013. [PDF] [Bibtex]
Social media have been proposed as a data source for influenza surveillance because they have the potential to offer real-time access to millions of short, geographically localized messages containing information regarding personal well-being. However, accuracy of social media surveillance systems declines with media attention because media attention increases ``chatter'' -- messages that are about influenza but that do not pertain to an actual infection -- masking signs of true influenza prevalence. This paper summarizes our recently developed influenza infection detection algorithm that automatically distinguishes relevant tweets from other chatter, and we describe our current influenza surveillance system which was actively deployed during the full 2012-2013 influenza season. Our objective was to analyze the performance of this system during the most recent 2012--2013 influenza season and to analyze the performance at multiple levels of geographic granularity, unlike past studies that focused on national or regional surveillance. Our system's influenza prevalence estimates were strongly correlated with surveillance data from the Centers for Disease Control and Prevention for the United States (r = 0.93, p < 0.001) as well as surveillance data from the Department of Health and Mental Hygiene of New York City (r = 0.88, p < 0.001). Our system detected the weekly change in direction (increasing or decreasing) of influenza prevalence with 85% accuracy, a nearly twofold increase over a simpler model, demonstrating the utility of explicitly distinguishing infection tweets from other chatter.
Expand Me   Travis Wolfe, Benjamin Van Durme, Mark Dredze, Nicholas Andrews, Charley Beller, Chris Callison-Burch, Jay DeYoung, Justin Snyder, Jonathan Weese, Tan Xu, Xuchen Yao. PARMA: A Predicate Argument Aligner. Association for Computational Linguistics (ACL) (short paper), 2013. [PDF] [Bibtex]
We introduce PARMA, a system for cross-document, semantic predicate and argument alignment. Our system integrates popular lexical semantic resources into a simple discriminative model. PARMA achieves state of the art results. We suggest that existing efforts have focussed on data that is too easy, and we provide a more difficult dataset based on MT translation references which has a lower baseline which we beat by 17% absolute F1.
Expand Me     Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow. Sub-Lexical and Contextual Modeling of Out-of-Vocabulary Words in Speech Recognition. Technical Report 10, Human Language Technology Center of Excellence, Johns Hopkins University, 2013. [PDF] [Bibtex]
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of sub-word units. We present a novel probabilistic model to learn the sub-word lexicon optimized for a given task. We consider the task of Out Of vocabulary (OOV) word detection, which relies on output from a hybrid system. We combine the proposed hybrid system with confidence based metrics to improve OOV detection performance. Previous work address OOV detection as a binary classification task, where each region is independently classified using local information. We propose to treat OOV detection as a sequence labeling problem, and we show that 1) jointly predicting out-of-vocabulary regions, 2) including contextual information from each region, and 3) learning sub-lexical units optimized for this task, leads to substantial improvements with respect to state-of-the-art on an English Broadcast News and MIT Lectures task.
Expand Me   Mark Dredze, Michael J Paul, Shane Bergsma, Hieu Tran. Carmen: A Twitter Geolocation System with Applications to Public Health. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013. [PDF] [Bibtex] [Code]
Public health applications using social media often require accurate, broad-coverage location information. However, the standard information provided by social media APIs, such as Twitter, cover a limited number of messages. This paper presents Carmen, a geolocation system that can determine structured location information for messages provided by the Twitter API. Our system utilizes geocoding tools and a combination of automatic and manual alias resolution methods to infer location structures from GPS positions and user-provided profile data. We show that our system is accurate and covers many locations, and we demonstrate its utility for improving influenza surveillance.
Expand Me   Michael J Paul, Byron C Wallace, Mark Dredze. What Affects Patient (Dis)satisfaction? Analyzing Online Doctor Ratings with a Joint Topic-Sentiment Model. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013. [PDF] [Bibtex]
We analyze patient reviews of doctors using a novel probabilistic joint model of aspect and sentiment based on factorial LDA. We leverage this model to exploit a small set of previously annotated reviews to automatically analyze the topics and sentiment latent in over 50,000 online reviews of physicians (and we make this dataset publicly available). The proposed model outperforms baseline models for this task with respect to model perplexity and sentiment classification. We report the most representative words with respect to positive and negative sentiment along three clinical aspects, thus complementing existing qualitative work exploring patient reviews of physicians.
Expand Me   Justin Snyder, Rebecca Knowles, Mark Dredze, Matthew R Gormley, Travis Wolfe. Topic Models and Metadata for Visualizing Text Corpora. North American Chapter of the Association for Computational Linguistics (NAACL) (Demo Paper), 2013. [PDF] [Bibtex]
Effectively exploring and analyzing large text corpora requires visualizations that provide a high level summary. Past work has relied on faceted browsing of document metadata or on natural language processing of document text. In this paper, we present a new web-based tool that integrates topics learned from an unsupervised topic model in a faceted browsing experience. The user can manage topics, filter documents by topic and summarize views with metadata and topic graphs. We report a user study of the usefulness of topics in our tool.
Expand Me     Damianos Karakos, Mark Dredze, Sanjeev Khudanpur. Estimating Confusions in the ASR Channel for Improved Topic-based Language Model Adaptation. Technical Report 8, Johns Hopkins University, 2013. [PDF] [Bibtex]
Human language is a combination of elemental languages/domains/styles that change across and sometimes within discourses. Language models, which play a crucial role in speech recognizers and machine translation systems, are particularly sensitive to such changes, unless some form of adaptation takes place. One approach to speech language model adaptation is self-training, in which a language model's parameters are tuned based on automatically transcribed audio. However, transcription errors can misguide self-training, particularly in challenging settings such as conversational speech. In this work, we propose a model that considers the confusions (errors) of the ASR channel. By modeling the likely confusions in the ASR output instead of using just the 1-best, we improve self-training efficacy by obtaining a more reliable reference transcription estimate. We demonstrate improved topic-based language modeling adaptation results over both 1-best and lattice self-training using our ASR channel confusion estimates on telephone conversations.
Expand Me   Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, David Yarowsky. Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF] [Bibtex] [Data]
Hidden properties of social media users, such as their ethnicity, gender, and location, are often reflected in their observed attributes, such as their first and last names. Furthermore, users who communicate with each other often have similar hidden properties. We propose an algorithm that exploits these insights to cluster the observed attributes of hundreds of millions of Twitter users. Attributes such as user names are grouped together if users with those names communicate with other similar users. We separately cluster millions of unique first names, last names, and userprovided locations. The efficacy of these clusters is then evaluated on a diverse set of classification tasks that predict hidden users properties such as ethnicity, geographic location, gender, language, and race, using only profile names and locations when appropriate. Our readily-replicable approach and publicly released clusters are shown to be remarkably effective and versatile, substantially outperforming state-of-the-art approaches and human accuracy on each of the tasks studied.
Expand Me   Mahesh Joshi, Mark Dredze, William W Cohen, Carolyn P Rose. What's in a Domain? Multi-Domain Learning for Multi-Attribute Data. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2013. [PDF] [Bibtex]
Multi-Domain learning assumes that a single metadata attribute is used in order to divide the data into so-called domains. However, real-world datasets often have multiple metadata attributes that can divide the data into domains. It is not always apparent which single attribute will lead to the best domains, and more than one attribute might impact classification. We propose extensions to two multi-domain learning techniques for our multi-attribute setting, enabling them to simultaneously learn from several metadata attributes. Experimentally, they outperform the multi-domain learning baseline, even when it selects the single "best" attribute.
Expand Me   Alex Lamb, Michael J Paul, Mark Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL) (short paper), 2013. [PDF] [Bibtex] [Data]
Twitter has been shown to be a fast and reliable method for disease surveillance of common illnesses like influenza. However, previous work has relied on simple content analysis, which conflates flu tweets that report infection with those that express concerned awareness of the flu. By discriminating these categories, as well as tweets about the authors versus about others, we demonstrate significant improvements on influenza surveillance using Twitter.
Expand Me   Michael J Paul, Mark Dredze. Drug Extraction from the Web: Summarizing Drug Experiences with Multi-Dimensional Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF] [Bibtex]
Multi-dimensional latent text models, such as factorial LDA (f-LDA), capture multiple factors of corpora, creating structured output for researchers to better understand the contents of a corpus. We consider such models for clinical research of new recreational drugs and trends, an important application for mining current information for healthcare workers. We use a "three-dimensional" f-LDA variant to jointly model combinations of drug (marijuana, salvia, etc.), aspect (effects, chemistry, etc.) and route of administration (smoking, oral, etc.) Since a purely unsupervised topic model is unlikely to discover these specific factors of interest, we develop a novel method of incorporating prior knowledge by leveraging user generated tags as priors in our model. We demonstrate that this model can be used as an exploratory tool for learning about these drugs from the Web by applying it to the task of extractive summarization. In addition to providing useful output for this important public health task, our prior-enriched model provides a framework for the application of f-LDA to other tasks
Expand Me   Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Machine Learning, 2013;91(2):155-187. [PDF] [Bibtex]
We present AROW, an online learning algorithm for binary and multiclass problems that combines large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive mistake bounds for the binary and multiclass settings that are similar in form to the second order perceptron bound. Our bounds do not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques. Empirical evaluations show that AROW achieves state-of-the-art performance on a wide range of binary and multiclass tasks, as well as robustness in the face of non-separable data.
Expand Me     Delip Rao, Paul McNamee, Mark Dredze. Entity Linking: Finding Extracted Entities in a Knowledge Base. Multi-source, Multi-lingual Information Extraction and Summarization, 2013. [Bibtex]
In the menagerie of tasks for information extraction, entity linking is a new beast that has drawn a lot of attention from NLP practitioners and researchers recently. Entity Linking, also referred to as record linkage or entity resolution, involves aligning a textual mention of a named-entity to an appropriate entry in a knowledge base, which may or may not contain the entity. This has manifold applications ranging from linking patient health records to maintaining personal credit files, prevention of identity crimes, and supporting law enforcement. We discuss the key challenges present in this task and we present a high-performing system that links entities using max-margin ranking. We also summarize recent work in this area and describe several open research problems.

     2012 (17 Publications)
     Kristian Hammond, Jerome Budzik, Lawrence Birnbaum, Kevin Livingston, Mark Dredze. Request initiated collateral content offering. US Patent 8,260,874, 2012. [Bibtex]
Expand Me   Mark Dredze. How Social Media Will Change Public Health. IEEE Intelligent Systems, 2012;27(4):81-84. [Bibtex] [Link]
Recent work in machine learning and natural language processing has studied the health content of tweets and demonstrated the potential for extracting useful public health information from their aggregation. This article examines the types of health topics discussed on Twitter, and how tweets can both augment existing public health capabilities and enable new ones. The author also discusses key challenges that researchers must address to deliver high-quality tools to the public health community.
Expand Me   Michael J Paul, Mark Dredze. Factorial LDA: Sparse Multi-Dimensional Text Models. Neural Information Processing Systems (NIPS), 2012. [PDF] [Bibtex]
Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors.
Expand Me   Alex Lamb, Michael J Paul, Mark Dredze. Investigating Twitter as a Source for Studying Behavioral Responses to Epidemics. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF] [Bibtex]
We present preliminary results for mining concerned awareness of influenza tweets. We describe our data set construction and experiments with binary classification of data into influenza versus general messages and classification into concerned awareness and existing infection.
Expand Me   Atul Nakhasi, Ralph J Passarella, Sarah G Bell, Michael J Paul, Mark Dredze, Peter J Pronovost. Malpractice and Malcontent: Analyzing Medical Complaints in Twitter. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF] [Bibtex]
In this paper we report preliminary results from a study of Twitter to identify patient safety reports, which offer an immediate, untainted, and expansive patient perspective un- like any other mechanism to date for this topic. We identify patient safety related tweets and characterize them by which medical populations caused errors, who reported these er- rors, what types of errors occurred, and what emotional states were expressed in response. Our long term goal is to improve the handling and reduction of errors by incorpo- rating this patient input into the patient safety process.
Expand Me   Michael J Paul, Mark Dredze. Experimenting with Drugs (and Topic Models): Multi-Dimensional Exploration of Recreational Drug Discussions. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF] [Bibtex]
Clinical research of new recreational drugs and trends requires mining current information from non-traditional text sources. In this work we support such research through the use of a multi-dimensional latent text model -- factorial LDA -- that captures orthogonal factors of corpora, creating structured output for researchers to better understand the contents of a corpus. Since a purely unsupervised model is unlikely to discover specific factors of interest to clinical researchers, we modify the structure of factorial LDA to incorporate prior knowledge, including the use of of observed variables, informative priors and background components. The resulting model learns factors that correspond to drug type, delivery method (smoking, injection, etc.), and aspect (chemistry, culture, effects, health, usage). We demonstrate that the improved model yields better quantitative and more interpretable results.
     Ralph J Passarella, Atul Nakhasi, Sarah G Bell, Michael J Paul, Peter J Pronovost, Mark Dredze. Twitter as a Source for Learning about Patient Safety Events. Annual Symposium of the American Medical Informatics Association (AMIA), 2012. [Bibtex]
Expand Me   Damianos Karakos, Brian Roark, Izhak Shafran, Kenji Sagae, Maider Lehr, Emily Prud'hommeaux, Puyang Xu, Nathan Glenn, Sanjeev Khudanpur, Murat Saraclar, Dan Bikel, Mark Dredze, Chris Callison-Burch, Yuan Cao, Keith Hall, Eva Hasler, Philipp Koehn, Adam Lopez, Matt Post, Darcey Riley. Deriving conversation-based features from unlabeled speech for discriminative language modeling. International Speech Communication Association (INTERSPEECH), 2012. [PDF] [Bibtex]
The perceptron algorithm was used in [1] to estimate discriminative language models which correct errors in the output of ASR systems. In its simplest version, the algorithm simply increases the weight of n-gram features which appear in the correct (oracle) hypothesis and decreases the weight of n-gram features which appear in the 1-best hypothesis. In this paper, we show that the perceptron algorithm can be successfully used in a semi-supervised learning (SSL) framework, where limited amounts of labeled data are available. Our framework has some similarities to graph-based label propagation [2] in the sense that a graph is built based on proximity of unlabeled conversations, and then it is used to propagate confidences (in the form of features) to the labeled data, based on which perceptron trains a discriminative model. The novelty of our approach lies in the fact that the confidence "flows" from the unlabeled data to the labeled data, and not vice-versa, as is done traditionally in SSL. Experiments conducted at the 2011 CLSP Summer Workshop on the conversational telephone speech corpora Dev04f and Eval04f demonstrate the effectiveness of the proposed approach.
Expand Me   Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Efficient Structured Language Modeling for Speech Recognition. International Speech Communication Association (INTERSPEECH), 2012. [PDF] [Bibtex]
The structured language model (SLM) of [1] was one of the first to successfully integrate syntactic structure into language models. We extend the SLM framework in two new directions. First, we propose a new syntactic hierarchical interpolation that improves over previous approaches. Second, we develop a general information-theoretic algorithm for pruning the underlying Jelinek-Mercer interpolated LM used in [1], which substantially reduces the size of the LM, enabling us to train on large data. When combined with hill-climbing [2] the SLM is an accurate model, space-efficient and fast for rescoring large speech lattices. Experimental results on broadcast news demonstrate that the SLM outperforms a large 4-gram LM.
Expand Me   Nicholas Andrews, Jason Eisner, Mark Dredze. Name Phylogeny: A Generative Model of String Variation. Empirical Methods in Natural Language Processing (EMNLP), 2012. [PDF] [Bibtex]
Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
Expand Me   Mahesh Joshi, Mark Dredze, William W Cohen, Carolyn P Rose. Multi-Domain Learning: When Do Domains Matter? Empirical Methods in Natural Language Processing (EMNLP), 2012. [PDF] [Bibtex]
We present a systematic analysis of existing multi-domain learning approaches with respect to two questions. First, many multi-domain learning algorithms resemble ensemble learning algorithms. (1) Are multi-domain learning improvements the result of ensemble learning effects? Second, these algorithms are traditionally evaluated in a balanced label setting, although in practice many multi-domain settings have domain-specific label biases. When multi-domain learning is applied to these settings, (2) are multi-domain methods improving because they capture domain-specific class biases? An understanding of these two issues presents a clearer idea about where the field has had success in multi-domain learning, and it suggests some important open questions for improving beyond the current state of the art.
Expand Me   Ariya Rastrow, Sanjeev Khudanpur, Mark Dredze. Revisiting the Case for Explicit Syntactic Information in Language Models. NAACL Workshop on the Future of Language Modeling for HLT, 2012. [PDF] [Bibtex]
Statistical language models used in deployed systems for speech recognition, machine translation and other human language technologies are almost exclusively n-gram models. They are regarded as linguistically naive, but estimating them from any amount of text, large or small, is straightforward. Furthermore, they have doggedly matched or outperformed numerous competing proposals for syntactically well-motivated models. This unusual resilience of n-grams, as well as their weaknesses, are examined here. It is demonstrated that n-grams are good word-predictors, even linguistically speaking, in a large majority of word-positions, and it is suggested that to improve over n-grams, one must explore syntax-aware (or other) language models that focus on positions where n-grams are weak.
Expand Me   Spence Green, Nicholas Andrews, Matthew R Gormley, Mark Dredze, Christopher D Manning. Entity Clustering Across Languages. North American Chapter of the Association for Computational Linguistics (NAACL), 2012. [PDF] [Bibtex]
Standard entity clustering systems commonly rely on mention (string) matching, syntactic features, and linguistic resources like English WordNet. When co-referent text mentions appear in different languages, these techniques cannot be easily applied. Consequently, we develop new methods for clustering text mentions across documents and languages simultaneously, producing cross-lingual entity clusters. Our approach extends standard clustering algorithms with cross-lingual mention and context similarity measures. Crucially, we do not assume a pre-existing entity list (knowledge base), so entity characteristics are unknown. On an Arabic-English corpus that contains seven different text genres, our best model yields a 24.3% F1 gain over the baseline.
Expand Me   Matthew R Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner. Shared Components Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2012. [PDF] [Bibtex]
With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters.
Expand Me   Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining. Association for Computational Linguistics (ACL), 2012. [PDF] [Bibtex]
Long-span features, such as syntax, can improve language models for tasks such as speech recognition and machine translation. However, these language models can be difficult to use in practice because of the time required to generate features for rescoring a large hypothesis set. In this work, we propose substructure sharing, which saves duplicate work in processing hypothesis sets with redundant hypothesis structures. We apply substructure sharing to a dependency parser and part of speech tagger to obtain significant speedups, and further improve the accuracy of these tools through up-training. When using these improved tools in a language model for speech recognition, we obtain significant speed improvements with both N-best and hill climbing rescoring, and show that up-training leads to WER reduction.
Expand Me   Koby Crammer, Alex Kulesza, Mark Dredze. New H-∞ Bounds for the Recursive Least Squares Algorithm Exploiting Input Structure. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012. [PDF] [Bibtex]
The well known recursive least squares (RLS) algorithm has been widely used for many years. Most analyses of RLS have assumed statistical properties of the data or noise process, but recent robust H∞ analyses have been used to bound the ratio of the performance of the algorithm to the total noise. In this paper, we provide the first additive analysis bounding the difference between performance and noise. Our analysis provides additional convergence guarantees in general, and particularly with structured input data. We illustrate the analysis using human speech and white noise.
Expand Me   Koby Crammer, Mark Dredze, Fernando Pereira. Confidence-Weighted Linear Classification for Text Categorization. Journal of Machine Learning Research (JMLR), 2012;13(Jun):1891-1926. [PDF] [Bibtex]
Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of text-categorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing.

     2011 (12 Publications)
     Joshua T Vogelstein, William R Gray, Jason G Martin, Glen A Coppersmith, Mark Dredze, J Bogovic, J L Prince, S M Resnick, Carey E Priebe, R Jacob Vogelstein. Connectome Classification using Statistical Graph Theory and Machine Learning. Society for Neuroscience (Poster), 2011. [Bibtex]
Expand Me     Spence Green, Nicholas Andrews, Matthew R Gormley, Mark Dredze, Christopher D Manning. Cross-lingual Coreference Resolution: A New Task for Multilingual Comparable Corpora. Technical Report 6, Human Language Technology Center of Excellence, Johns Hopkins University, 2011. [Bibtex]
We introduce cross-lingual coreference resolution, the task of grouping entity mentions with a common referent in a multilingual corpus. Information, especially on the web, is increasingly multilingual. We would like to track entity references across languages without machine translation, which is expensive and unavailable for many language pairs. Therefore, we develop a set of models that rely on decreasing levels of parallel resources: a bitext, a bilingual lexicon, and a parallel name list. We propose baselines, provide experimental results, and analyze sources of error. Across a range of metrics, we find that even our lowest resource model gives a 2.5% F1 absolute improvement over the strongest baseline. Our results present a positive outlook for crosslingual coreference resolution even in low resource languages. We are releasing our crosslingual annotations for the ACE2008 ArabicEnglish evaluation corpus.
     Matthew R Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner. Shared Components Topic Models with Application to Selectional Preference. NIPS Workshop on Learning Semantics, 2011. [PDF] [Bibtex]
Expand Me   Damianos Karakos, Mark Dredze, Kenneth Church, Aren Jansen, Sanjeev Khudanpur. Estimating Document Frequencies in a Speech Corpus. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF] [Bibtex]
Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition. Whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. There is an opportunity for the ensemble to reduce noise, assuming that the errors across language models are relatively uncorrelated. The opportunity for improvement is larger when WER is high. This paper considers a pairing task application that could benefit from improved estimates of df. The pairing task inputs conversational sides from the English Fisher corpus and outputs estimates of which sides were from the same conversation. Better estimates of df lead to better performance on this task.
Expand Me   Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Adapting N-Gram Maximum Entropy Language Models with Conditional Entropy Regularization. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF] [Bibtex]
Accurate estimates of language model parameters are critical for building quality text generation systems, such as automatic speech recognition. However, text training data for a domain of interest is often unavailable. Instead, we use semi-supervised model adaptation; parameters are estimated using both unlabeled in-domain data (raw speech audio) and labeled out of domain data (text.) In this work, we present a new semi-supervised language model adaptation procedure for Maximum Entropy models with n-gram features. We augment the conventional maximum likelihood training criterion on out-of- domain text data with an additional term to minimize conditional entropy on in-domain audio. Additionally, we demonstrate how to compute conditional entropy efficiently on speech lattices using first- and second-order expectation semirings. We demonstrate improvements in terms of word error rate over other adaptation techniques when adapting a maximum entropy language model from broadcast news to MIT lectures.
Expand Me   Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Efficient Discrimnative Training of Long-Span Language Models. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF] [Bibtex]
Long-span language models, such as those involving syntactic dependencies, produce more coherent text than their n-gram counterparts. However, evaluating the large number of sentence-hypotheses in a packed representation such as an ASR lattice is intractable under such long-span models both during decoding and discriminative training. The accepted compromise is to rescore only the N-best hypotheses in the lattice using the long-span LM. We present discriminative hill climbing, an efficient and effective discriminative training procedure for long- span LMs based on a hill climbing rescoring algorithm. We empirically demonstrate significant computational savings as well as error-rate reduction over N-best training methods in a state of the art ASR system for Broadcast News transcription.
     Ann Irvine, Mark Dredze, Geraldine Legendre, Paul Smolensky. Optimality Theory Syntax Learnability: An Empirical Exploration of the Perceptron and GLA. CogSci Workshop on OT as a General Cognitive Architecture, 2011. [Bibtex]
Expand Me   Carolina Parada, Mark Dredze, Frederick Jelinek. OOV Sensitive Named-Entity Recognition in Speech. International Speech Communication Association (INTERSPEECH), 2011. [PDF] [Bibtex] [Data]
Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named entities and always produce transcription errors. In this work, we improve speech NER by including features indicative of OOVs based on a OOV detector, allowing for the identification of regions of speech containing named entities, even if they are incorrectly transcribed. We construct a new speech NER data set and demonstrate significant improvements for this task.
Expand Me     Michael J Paul, Mark Dredze. A Model for Mining Public Health Topics from Twitter. Technical Report -, Johns Hopkins University, 2011. [PDF] [Bibtex] [Data]
We present the Ailment Topic Aspect Model (ATAM), a new topic model for Twitter that associates symptoms, treatments and general words with diseases (ailments). We train ATAM on a new collection of 1.6 million tweets discussing numerous health related topics. ATAM isolates more coherent ailments, such as influenza, infections, obesity, as compared to standard topic models. Furthermore, ATAM matches influenza tracking results produced by Google Flu Trends and previous influenza specialized Twitter models compared with government public health data.
Expand Me   Michael J Paul, Mark Dredze. You Are What You Tweet: Analyzing Twitter for Public Health. International Conference on Weblogs and Social Media (ICWSM), 2011. [PDF] [Bibtex]
Analyzing user messages in social media can mea- sure different population haracteristics, including public health measures. For example, recent work has correlated Twitter messages with influenza rates in the United States; but this has largely been the extent of mining Twitter for public health. In this work, we consider a broader range of public health applications for Twitter. We apply the recently introduced Ailment Topic Aspect Model to over one and a half million health related tweets and discover mentions of over a dozen ailments, including allergies, obesity and in- somnia. We introduce extensions to incorporate prior knowledge into this model and apply it to several tasks: tracking illnesses over times (syndromic surveillance), measuring behavioral risk factors, localizing illnesses by geographic region, and analyzing symptoms and medication usage. We show quantitative correlations with public health data and qualitative evaluations of model output. Our results suggest that Twitter has broad applicability for public health research.
Expand Me   Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow. Learning Sub-Word Units for Open Vocabulary Speech Recognition. Association for Computational Linguistics (ACL), 2011. [PDF] [Bibtex]
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of sub-word units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to \em learn the sub-word lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. %We present results on a Broadcast News and MIT Lectures data sets. A hybrid model with our learned sub-word lexicon reduces error by 6.3\% and 7.6\% (absolute) at a 5\% false alarm rate on an English Broadcast News and MIT Lectures task respectively.
Expand Me   Ariya Rastrow, Markus Dreyer, Abhinav Sethy, Sanjeev Khudanpur, Bhuvana Ramabhadran, Mark Dredze. Hill Climbing on Speech Lattices: A New Rescoring Framework. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011. [PDF] [Bibtex]
We describe a new approach for rescoring speech lattices - with long-span language models or wide-context acoustic models - that does not entail computationally intensive lattice expansion or limited rescoring of only an N-best list. We view the set of word-sequences in a lattice as a discrete space equipped with the edit-distance metric, and develop a hill climbing technique to start with, say, the 1-best hypothesis under the lattice-generating model(s) and iteratively search a local neighborhood for the highest-scoring hypothesis under the rescoring model(s); such neighborhoods are efficiently constructed via finite state techniques. We demonstrate empirically that to achieve the same reduction in error rate using a better estimated, higher order language model, our technique evaluates fewer utterance-length hypotheses than conventional N-best rescoring by two orders of magnitude. For the same number of hypotheses evaluated, our technique results in a significantly lower error rate.

     2010 (12 Publications)
Expand Me   Mark Dredze, Aren Jansen, Glen A Coppersmith, Kenneth Church. NLP on Spoken Documents without ASR. Empirical Methods in Natural Language Processing (EMNLP), 2010. [PDF] [Bibtex]
There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (&#160;1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.
Expand Me   Mark Dredze, Tim Oates, Christine Piatko. We're Not in Kansas Anymore: Detecting Domain Changes in Streams. Empirical Methods in Natural Language Processing (EMNLP), 2010. [PDF] [Bibtex]
Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention -- detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses A-distance, a metric for detecting shifts in data streams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.
Expand Me   Carolina Parada, Abhinav Sethy, Mark Dredze, Frederick Jelinek. A Spoken Term Detection Framework for Recovering Out-of-Vocabulary Words Using the Web. International Speech Communication Association (INTERSPEECH), 2010. [PDF] [Bibtex]
Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into system output, recovering up to 40% of OOVs and resulting in a reduction in system error.
Expand Me   Delip Rao, Paul McNamee, Mark Dredze. Streaming Cross Document Entity Coreference Resolution. Conference on Computational Linguistics (Coling), 2010. [PDF] [Bibtex]
Previous research in cross-document entity coreference has generally been restricted to the offline scenario where the set of documents is provided in advance. As a consequence, the dominant approach is based on greedy agglomerative clustering techniques that utilize pairwise vector comparisons and thus require O(n^2) space and time. In this paper we explore identifying coreferent entity mentions across documents in high-volume streaming text, including methods for utilizing orthographic and contextual information. We test our methods using several corpora to quantitatively measure both the efficacy and scalability of our streaming approach. We show that our approach scales to at least an order of magnitude larger data than previous reported methods.
Expand Me   Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, Tim Finin. Entity Disambiguation for Knowledge Base Population. Conference on Computational Linguistics (Coling), 2010. [PDF] [Bibtex]
The integration of facts derived from information extraction systems into existing knowledge bases requires a system to disambiguate entity mentions in the text. This is challenging due to issues such as non-uniform variations in entity names, mention ambiguity, and entities absent from a knowledge base. We present a state of the art system for entity disambiguation that not only addresses these challenges but also scales to knowledge bases with several million entries using very little resources. Further, our approach achieves performance of up to 95% on entities mentioned from newswire and 80% on a public test set that was designed to include challenging queries.
Expand Me   Chris Callison-Burch, Mark Dredze. Creating Speech and Language Data With Amazon's Mechanical Turk. NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk, 2010. [PDF] [Bibtex]
In this paper we give an introduction to using Amazon\'s Mechanical Turk crowdsourcing platform for the purpose of collecting data for human language technologies. We survey the papers published in the NAACL-2010 Workshop. 24 researchers participated in the workshop\'s shared task to create data for speech and language applications with $100.
Expand Me   Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, Mark Dredze. Annotating named entities in Twitter data with crowdsourcing. NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk, 2010. [PDF] [Bibtex] [Data]
We describe our experience using both Amazon Mechanical Turk (MTurk) and CrowdFlower to collect simple named entity annotations for Twitter status updates. Unlike most genres that have traditionally been the focus of named entity experiments Twitter is far more informal and abbreviated. The collected annotations and annotation techniques will provide a first step towards the full study of named entity recognition in domains like Facebook and Twitter. We also briefly describe how to use MTurk to collect judgements on the quality of "word clouds."
Expand Me   Matthew R Gormley, Adam Gerber, Mary Harper, Mark Dredze. Non-Expert Correction of Automatically Generated Relation Annotations. NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk, 2010. [PDF] [Bibtex]
We explore a new way to collect human annotated relations in text using Amazon Mechanical Turk. Given a knowledge base of relations and a corpus, we identify sentences which mention both an entity and an attribute that have some relation in the knowledge base. Each noisy sentence/relation pair is presented to multiple turkers, who are asked whether the sentence expresses the relation. We describe a design which encourages user efficiency and aids discovery of cheating. We also present results on inter-annotator agreement.
Expand Me   Courtney Napoles, Mark Dredze. Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language. NAACL-HLT Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids, 2010. [PDF] [Bibtex]
Text simplification is the process of changing vocabulary and grammatical structure to create a more accessible version of the text while maintaining the underlying information and content. Automated tools for text simplification are a practical way to make large corpora of text accessible to a wider audience lacking high levels of fluency in the corpus language. In this work, we investigate the potential of Simple Wikipedia to assist automatic text simplification by building a statistical classification system that discriminates simple English from ordinary English. Most text simplification systems are based on hand-written rules (e.g., PEST and its module SYSTAR), and therefore face limitations scaling and transferring across domains. The potential for using Simple Wikipedia for text simplification is significant; it contains nearly 60,000 articles with revision histories and aligned articles to ordinary English Wikipedia. Using articles from Simple Wikipedia and ordinary Wikipedia, we evaluated different classifiers and feature sets to identify the most discriminative features of simple English for use across domains. These findings help further understanding of what makes text simple and can be applied as a tool to help writers craft simple text.
Expand Me   Justin Ma, Alex Kulesza, Koby Crammer, Mark Dredze, Lawrence Saul, Fernando Pereira. Exploiting Feature Covariance in High-Dimensional Online Learning. AIStats, 2010. [PDF] [Bibtex]
Some online algorithms for linear classification model the uncertainty in their weights over the course of learning. Modeling the full covariance structure of the weights can provide a significant advantage for classification. However, for high-dimensional, large-scale data, even though there may be many second-order feature interactions, it is computationally infeasible to maintain this covariance structure. To extend second-order methods to high-dimensional data, we develop low-rank approximations of the covariance structure. We evaluate our approach on both synthetic and real-world data sets using the confidence-weighted online learning framework. We show improvements over diagonal covariance matrices for both low and high-dimensional data.
Expand Me   Carolina Parada, Mark Dredze, Denis Filimonov, Frederick Jelinek. Contextual Information Improves OOV Detection in Speech. North American Chapter of the Association for Computational Linguistics (NAACL), 2010. [PDF] [Bibtex]
Out-of-vocabulary (OOV) words represent an important source of error in large vocabulary continuous speech recognition (LVCSR) systems. These words cause recognition failures, which propagate through pipeline systems impacting the performance of downstream applications. The detection of OOV regions in the output of a LVCSR system is typically addressed as a binary classification task, where each region is independently classified using local information. In this paper, we show that jointly predicting OOV regions, and including contextual information from each region, leads to substantial improvement in OOV detection. Compared to the state-of-the-art, we reduce the missed OOV rate from 42.6% to 28.4% at 10% false alarm rate.
Expand Me   Mark Dredze, Alex Kulesza, Koby Crammer. Multi-Domain Learning by Confidence-Weighted Parameter Combination. Machine Learning, 2010;79(1-2):123-149. [PDF] [Bibtex] [Tech Report]
State-of-the-art statistical NLP systems for a variety of tasks learn from labeled training data that is often domain specific. However, there may be multiple domains or sources of interest on which the system must perform. For example, a spam filtering system must give high quality predictions for many users, each of whom receives emails from different sources and may make slightly different decisions about what is or is not spam. Rather than learning separate models for each domain, we explore systems that learn across multiple domains. We develop a new multi-domain online learning framework based on parameter combination from multiple classifiers. Our algorithms draw from multi-task learning and domain adaptation to adapt multiple source domain classifiers to a new target domain, learn across multiple similar domains, and learn across a large number of disparate domains. We evaluate our algorithms on two popular NLP domain adaptation tasks: sentiment classification and spam filtering.

     2009 (6 Publications)
       Mark Dredze. Intelligent Email: Aiding Users with AI. PhD Thesis, Computer and Information Science, University of Pennsylvania, 2009. [PDF] [Bibtex]
Expand Me   Paul McNamee, Mark Dredze, Adam Gerber, Nikesh Garera, Tim Finin, James Mayfield, Christine Piatko, Delip Rao, David Yarowsky, Markus Dreyer. HLTCOE Approaches to Knowledge Base Population at TAC 2009. Text Analysis Conference (TAC), 2009. [Bibtex]
The HLTCOE participated in the entity linking and slot filling tasks at TAC 2009. A machine learning-based approach to entity linking, operating over a wide range of feature types, yielded good performance on the entity linking task. Slot-filling based on sentence selection, application of weak patterns and exploitation of redundancy was ineffective in the slot filling task.
Expand Me   Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Advances in Neural Information Processing Systems (NIPS), 2009. [PDF] [Bibtex]
We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.
Expand Me   Mark Dredze, Partha Pratim Talukdar, Koby Crammer. Sequence Learning from Data with Multiple Labels. ECML/PKDD Workshop on Learning from Multi-Label Data, 2009. [PDF] [Bibtex]
We present novel algorithms for learning structured predictors from instances with multiple labels in the presence of noise. The proposed algorithms improve performance on two standard NLP tasks when we have a small amount of training data (low quantity) and when the labels are noisy (low quality). In these settings, the methods improve performance over using a single label, in some cases exceeding performance using gold labels. Our methods could be used in a semi-supervised setting, where a limited amount of labeled data could be combined with a rule based automatic labeling of unlabeled data with multiple possible labels.
Expand Me   Koby Crammer, Mark Dredze, Alex Kulesza. Multi-Class Confidence Weighted Algorithms. Empirical Methods in Natural Language Processing (EMNLP), 2009. [PDF] [Bibtex] [Data (Amazon 7)]
The recently introduced online confidence-weighted (CW) learning algorithm for binary classification performs well on many binary NLP tasks. However, for multi-class problems CW learning updates and inference cannot be computed analytically or solved as convex optimization problems as they are in the binary case. We derive learning algorithms for the multi-class CW setting and provide extensive evaluation using nine NLP datasets, including three derived from the recently released New York Times corpus. Our best algorithm outperforms state-of-the-art online and batch methods on eight of the nine tasks. We also show that the confidence information maintained during learning yields useful probabilistic information at test time.
Expand Me   Mark Dredze, Bill Schilit, Peter Norvig. Suggesting Email View Filters for Triage and Search. International Joint Conference on Artificial Intelligence (IJCAI), 2009. [PDF] [Bibtex]
Growing email volumes cause flooded inboxes and swelled email archives, making search and new email processing difficult. While emails have rich metadata, such as recipients and folders, suitable for creating filtered views, it is often difficult to choose appropriate filters for new inbox messages without first examining messages. In this work, we consider a system that automatically suggests relevant view filters to the user for the currently viewed messages. We propose several ranking algorithms for suggesting useful filters. Our work suggests that such systems quickly filter groups of inbox messages and find messages more easily during search.

     2008 (12 Publications)
Expand Me   Kevin Lerman, Ari Gilder, Mark Dredze, Fernando Pereira. Reading the Markets: Forecasting Public Opinion of Political Candidates by News Analysis. Conference on Computational Linguistics (Coling), 2008. [PDF] [Bibtex]
Media reporting shapes public opinion which can in turn influence events, particularly in political elections, in which candidates both respond to and shape public perception of their campaigns. We use computational linguistics to automatically predict the impact of news on public perception of political candidates. Our system uses daily newspaper articles to predict shifts in public opinion as reflected in prediction markets. We discuss various types of features designed for this problem. The news system improves market prediction over baseline market systems.
Expand Me     Mark Dredze, Joel Wallenberg. Further Results and Analysis of Icelandic Part of Speech Tagging. Technical Report MS-CIS-08-13, University of Pennsylvania, Department of Computer and Information Science, 2008. [PDF] [Bibtex]
Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions. This paper presents further results and analysis to the original work.
Expand Me   Mark Dredze, Joel Wallenberg. Icelandic Data-Driven Part of Speech Tagging. Association for Computational Linguistics (ACL) (short paper), 2008. [PDF] [Bibtex]
Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions.
Expand Me   Kuzman Ganchev, Mark Dredze. Small Statistical Models by Random Feature Mixing. ACL Workshop on Mobile NLP, 2008. [PDF] [Bibtex]
The application of statistical NLP systems to resource constrained devices is limited by the need to maintain parameters for a large number of features and an alphabet mapping features to parameters. We introduce random feature mixing to eliminate alphabet storage and reduce the number of parameters without severely impacting model performance.
Expand Me   Mark Dredze, Hanna Wallach, Danny Puller, Fernando Pereira. Generating Summary Keywords for Emails Using Topics. Intelligent User Interfaces (IUI), 2008. [PDF] [Bibtex]
Email summary keywords, used to concisely represent the gist of an email, can help users manage and prioritize large numbers of messages. We develop an unsupervised learning framework for selecting summary keywords from emails using latent representations of the underlying topics in a user's mailbox. This approach selects words that describe each message in the context of existing topics rather than simply selecting keywords based on a single message in isolation. We present and compare four methods for selecting summary keywords based on two well-known models for inferring latent topics: latent semantic analysis and latent Dirichlet allocation. The quality of the summary keywords is assessed by generating summaries for emails from twelve users in the Enron corpus. The summary keywords are then used in place of entire messages in two proxy tasks: automated foldering and recipient prediction. We also evaluate the extent to which summary keywords enhance the information already available in a typical email user interface by repeating the same tasks using email subject lines.
Expand Me   Mark Dredze, Hanna Wallach, Danny Puller, Tova Brooks, Josh Carroll, Joshua Magarick, John Blitzer, Fernando Pereira. Intelligent Email: Aiding Users with AI. American National Conference on Artificial Intelligence (AAAI) (Nectar), 2008. [PDF] [Bibtex]
Email occupies a central role in the modern workplace. This has led to a vast increase in the number of email messages that users are expected to handle daily. Furthermore, email is no longer simply a tool for asynchronous online communication - email is now used for task management, personal archiving, as well both synchronous and asynchronous online communication. This explosion can lead to "email overload" - many users are overwhelmed by the large quantity of information in their mailboxes. In the human--computer interaction community, there has been much research on tackling email overload. Recently, similar efforts have emerged in the artificial intelligence (AI) and machine learning communities to form an area of research known as intelligent email.\nIn this paper, we take a user-oriented approach to applying AI to email. We identify enhancements to email user interfaces and employ machine learning techniques to support these changes. We focus on three tasks - summary keyword generation, reply prediction and attachment prediction - and summarize recent work in these areas.
Expand Me   Mark Dredze, Tova Brooks, Josh Carroll, Joshua Magarick, John Blitzer, Fernando Pereira. Intelligent Email: Reply and Attachment Prediction. Intelligent User Interfaces (IUI), 2008. [PDF] [Bibtex]
We present two prediction problems under the rubric of Intelligent Email that are designed to support enhanced email interfaces that relieve the stress of email overload. Reply prediction alerts users when an email requires a response and facilitates email response management. Attachment prediction alerts users when they are about to send an email missing an attachment or triggers a document recommendation system, which can catch missing attachment emails before they are sent. Both problems use the same underlying email classification system and task specific features. Each task is evaluated for both single-user and cross-user settings.
Expand Me   Mark Dredze, Hanna Wallach. User Models for Email Activity Management. IUI Workshop on Ubiquitous User Modeling,