Research | Alexandra DeLucia

Research Themes

2025

MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification

Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, and 1 more author

Oct 2025

Health AI Evaluation

While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing them into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically only evaluated on objective, entity-centric, formulaic texts such as biographies and historical topics. Medical answers, by contrast, are condition-dependent, conversational, hypothetical, subjective, and diverse in sentence structure, which makes decomposition into valid facts challenging. We propose MedScore, a new pipeline to decompose medical answers into condition-aware valid facts and verify them against in-domain corpora. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references and retaining condition-dependency in facts. The resulting factuality score varies substantially by decomposition method, verification corpus, and backbone LLM, highlighting the importance of customizing each step for reliable factuality evaluation with our generalizable and modular pipeline for domain adaptation.

MedExpert: An Expert-Annotated Dataset for Medical Chatbot Evaluation

Mahsa Yarmohammadi, Alexandra DeLucia, Lillian C Chen, and 8 more authors

In Machine Learning for Health 2025, 2025

Health AI Annotation Dataset

2025

Can One Size Fit All?: Measuring Failure in Multi-Document Summarization Domain Transfer

Alexandra DeLucia and Mark Dredze

Jul 2025

LLMs Evaluation

Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, from news articles to conversations with multiple speakers. The training approaches for current MDS models fall into four groups: end-to-end with special pre-training ("direct"), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality) to analyze how and why models trained on one domain can fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer "failure" as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out-of-the-box.

2024

Using Natural Language Inference to Improve Persona Extraction from Dialogue in a New Domain

Alexandra DeLucia, Mengjie Zhao, Yoshinori Maeda, and 3 more authors

arXiv preprint arXiv:2401.06742, 2024

LLMs Decoding

While valuable datasets such as PersonaChat provide a foundation for training persona-grounded dialogue agents, they lack diversity in conversational and narrative settings, primarily existing in the "real" world. To develop dialogue agents with unique personas, models are trained to converse given a specific persona, but hand-crafting these personas can be time-consuming, so methods exist to automatically extract persona information from existing character-specific dialogue. However, these persona-extraction models are also trained on datasets derived from PersonaChat and struggle to provide high-quality persona information from conversational settings that do not take place in the real world, such as the fantasy-focused dataset, LIGHT. Creating new data to train models on a specific setting is human-intensive and thus prohibitively expensive. To address both of these issues, we introduce a natural language inference method for post-hoc adapting a trained persona extraction model to a new setting. We draw inspiration from the literature of dialog natural language inference (NLI), and devise NLI-reranking methods to extract structured persona information from dialogue. Compared to existing persona extraction models, our method returns higher-quality extracted personas and requires less human annotation.

Anti-LM Decoding for Zero-shot In-context Machine Translation

Suzanna Sia, Alexandra DeLucia, and Kevin Duh

In Findings of the Association for Computational Linguistics: NAACL 2024, Jun 2024

LLMs

Zero-shot in-context learning is the phenomenon where models can perform a task given only the instructions. However, pre-trained large language models are known to be poorly calibrated for zero-shot tasks. One of the most effective approaches to handling this bias is to adopt a contrastive decoding objective, which accounts for the prior probability of generating the next token by conditioning on a context. This work introduces an Anti-Language Model objective with a decay factor designed to address the weaknesses of in-context machine translation. We conduct our experiments across 3 model types and sizes, 3 language directions, and both greedy decoding and beam search. The proposed method outperforms other state-of-the-art decoding objectives, with an improvement of up to 20 BLEU points over the default objective in some settings.

2023

Common Law Annotations: Investigating the Stability of Dialog System Output Annotations

Seunggun Lee, Alexandra DeLucia, Nikita Nangia, and 9 more authors

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

LLMs Annotation Evaluation

Metrics for Inter-Annotator Agreement (IAA), like Cohen’s Kappa, are crucial for validating annotated datasets. Although high agreement is often used to show the reliability of annotation procedures, it is insufficient to ensure validity or reproducibility. While researchers are encouraged to increase annotator agreement, this can lead to specific and tailored annotation guidelines. We hypothesize that this may result in diverging annotations from different groups. To study this, we first propose the Lee et al. Protocol (LEAP), a standardized and codified annotation protocol. LEAP strictly enforces transparency in the annotation process, which ensures reproducibility of annotation guidelines. Using LEAP to annotate a dialog dataset, we empirically show that while research groups may create reliable guidelines by raising agreement, this can cause divergent annotations across different research groups, thus questioning the validity of the annotations. Therefore, we caution NLP researchers against using reliability as a proxy for reproducibility and validity.
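
For readers unfamiliar with the agreement metric named in this abstract, the following is a minimal sketch of Cohen's kappa for two annotators. It is a textbook illustration and not code from the paper; the function name `cohens_kappa` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical, for illustration only).
annotator_1 = ["good", "good", "bad", "bad", "good", "bad"]
annotator_2 = ["good", "bad", "bad", "bad", "good", "good"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 4 of 6 items while chance agreement is 0.5, so kappa is about 0.33. A high kappa within one group says nothing about whether a second group following its own guidelines would produce the same labels, which is the gap the abstract points to.
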
Strength in Numbers: Estimating Confidence of Large Language Models by Prompt Agreement

Gwenyth Portillo Wightman, Alexandra DeLucia, and Mark Dredze

In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Jul 2023

LLMs

Large language models have achieved impressive few-shot performance on a wide variety of tasks. However, in many settings, users require confidence estimates for model predictions. While traditional classifiers produce scores for each label, language models instead produce scores for the generation, which may not be well calibrated. We compare generations across diverse prompts and show that these can be used to create confidence scores. By utilizing more prompts we can get more precise confidence estimates and use response diversity as a proxy for confidence. We evaluate this approach across ten multiple-choice question-answering datasets using three models: T0, FLAN-T5, and GPT-3. In addition to analyzing multiple human-written prompts, we automatically generate more prompts using a language model in order to produce finer-grained confidence estimates. Our method produces more calibrated confidence estimates compared to the log probability of the answer to a single prompt. These improvements could benefit users who rely on prediction confidence for integration into a larger system or in decision-making processes.
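
The idea of using agreement across prompts as a confidence signal can be sketched as below. This is a simplified majority-vote reading of the abstract, not the paper's implementation; the function name `agreement_confidence` and the example answers are hypothetical, and the paper's exact scoring may differ.

```python
from collections import Counter


def agreement_confidence(answers):
    """Return the majority answer and the fraction of prompts that produced it."""
    if not answers:
        raise ValueError("need at least one answer")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


# Answers to one multiple-choice question, each from a different prompt
# (hypothetical values for illustration).
answers_from_prompts = ["B", "B", "A", "B", "C"]
prediction, confidence = agreement_confidence(answers_from_prompts)
```

With five prompts the confidence can only take values in steps of 0.2, which illustrates why adding more prompts yields finer-grained estimates.
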

2021

Decoding Methods for Neural Narrative Generation

Alexandra DeLucia, Aaron Mueller, Xiang Lisa Li, and 1 more author

In Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), Aug 2021

LLMs Evaluation Decoding Annotation

Narrative generation is an open-ended NLP task in which a model generates a story given a prompt. The task is similar to neural response generation for chatbots; however, innovations in response generation are often not applied to narrative generation, despite the similarity between these tasks. We aim to bridge this gap by applying and evaluating advances in decoding methods for neural response generation to neural narrative generation. In particular, we employ GPT-2 and perform ablations across nucleus sampling thresholds and diverse decoding hyperparameters—specifically, maximum mutual information—analyzing results over multiple criteria with automatic and human evaluation. We find that (1) nucleus sampling is generally best with thresholds between 0.7 and 0.9; (2) a maximum mutual information objective can improve the quality of generated stories; and (3) established automatic metrics do not correlate well with human judgments of narrative quality on any qualitative metric.
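
Nucleus (top-p) sampling, the decoding method whose thresholds are studied here, can be sketched as follows. This is a generic standalone illustration rather than the paper's GPT-2 code; the function names and the toy next-token distribution are assumptions.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose total mass reaches top_p."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: p / mass for token, p in kept}


def sample_token(probs, top_p, rng):
    """Draw one token from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (hypothetical values).
next_token_probs = {"the": 0.5, "a": 0.3, "dragon": 0.15, "zebra": 0.05}
nucleus = nucleus_filter(next_token_probs, top_p=0.9)
token = sample_token(next_token_probs, top_p=0.9, rng=random.Random(0))
```

At `top_p=0.9` the low-probability token "zebra" is excluded from sampling. Lower thresholds shrink the nucleus toward greedy decoding, and higher ones admit more of the tail.
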

2023

A Multi-instance Learning Approach to Civil Unrest Event Detection on Twitter

Alexandra DeLucia, Mark Dredze, and Anna L. Buczak

In Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text, Sep 2023

Crisis Informatics Social Media

Social media has become an established platform for people to organize and take offline actions, often in the form of civil unrest. Understanding these events can help support pro-democratic movements. The primary method to detect these events on Twitter relies on aggregating many tweets, but this includes many that are not relevant to the task. We propose a multi-instance learning (MIL) approach, which jointly identifies relevant tweets and detects civil unrest events. We demonstrate that MIL improves civil unrest detection over methods based on simple aggregation. Our best model achieves an F1 of 0.73 on the Global Civil Unrest on Twitter (G-CUT) dataset.

Geo-Seq2seq: Twitter User Geolocation on Noisy Data through Sequence to Sequence Learning

Jingyu Zhang, Alexandra DeLucia, Chenyu Zhang, and 1 more author

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

Crisis Informatics Social Media Decoding

Location information can support social media analyses by providing geographic context. Some of the most accurate and popular Twitter geolocation systems rely on rule-based methods that examine the user-provided profile location, which fail to handle informal or noisy location names. We propose Geo-Seq2seq, a sequence-to-sequence (seq2seq) model for Twitter user geolocation that rewrites noisy, multilingual user-provided location strings into structured English location names. We train our system on tens of millions of multilingual location string and geotagged-tweet pairs. Compared to leading methods, our model vastly increases coverage (i.e., the number of users we can geolocate) while achieving comparable or superior accuracy. Our error analysis reveals that constrained decoding helps the model produce valid locations according to a location database. Finally, we measure biases across language, country of origin, and time to evaluate fairness, and find that while our model can generalize well to unseen temporal data, performance does vary by language and country.

2022

Changes in Tweet Geolocation over Time: A Study with Carmen 2.0

Jingyu Zhang, Alexandra DeLucia, and Mark Dredze

In Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022), Oct 2022

Crisis Informatics Social Media

Researchers across disciplines use Twitter geolocation tools to filter data for desired locations. These tools have largely been trained and tested on English tweets, often originating in the United States from almost a decade ago. Despite the importance of these tools for data curation, the impact of tweet language, country of origin, and creation date on tool performance remains largely unknown. We explore these issues with Carmen, a popular tool for Twitter geolocation. To support this study we introduce Carmen 2.0, a major update which includes the incorporation of GeoNames, a gazetteer that provides much broader coverage of locations. We evaluate using two new Twitter datasets, one for multilingual, multiyear geolocation evaluation, and another for usage trends over time. We found that language, country of origin, and time do impact geolocation tool performance.

2021

Study of Manifestation of Civil Unrest on Twitter

Abhinav Chinta, Jingyu Zhang, Alexandra DeLucia, and 2 more authors

In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), Nov 2021

Crisis Informatics Social Media Dataset

Twitter is commonly used for civil unrest detection and forecasting tasks, but there is a lack of work in evaluating how civil unrest manifests on Twitter across countries and events. We present two in-depth case studies for two specific large-scale events, one in a country with high (English) Twitter usage (Johannesburg riots in South Africa) and one in a country with low Twitter usage (Burayu massacre protests in Ethiopia). We show that while there is event signal during the events, there is little signal leading up to the events. In addition to the case studies, we train n-gram-based models on a larger set of Twitter civil unrest data across time, events, and countries and use machine learning explainability tools (SHAP) to identify important features. The models were able to find words indicative of civil unrest that generalized across countries. The 42 countries span Africa, the Middle East, and Southeast Asia, and the events occurred between 2014 and 2019.

2025

Histories of Daily Cannabis Use Patterns and Reported Negative Effects: A Content Analysis of Three Cannabis-Centric Communities on Reddit

Savannah Brenneke, Eloise Estrada, Epifania Ortiz, and 4 more authors

Drug and Alcohol Dependence, 2025

Public Health Reddit Social Media

This qualitative study sought to describe Redditor histories of daily cannabis use and patterns of use, while exploring themes around reported negative effects associated with daily use.

Leveraging Cannabis-Centric Reddit Communities to Explore Patterns of Cannabis Product, Modes and Frequency of Use between 2010-2019

Savannah Brenneke, Alexandra DeLucia, Gwenyth Portillo Wightman, and 3 more authors

Drug and Alcohol Dependence, 2025

Public Health Reddit Social Media

2024

First-Hand Accounts of Structural Stigma toward People Who Use Opioids on Reddit

Evan L Eschliman, Karen Choe, Alexandra DeLucia, and 6 more authors

Social Science & Medicine, 2024

Public Health Annotation Reddit Social Media

2023

R/AskAComputerScientist: Processing Reddit Data for the Social Sciences

Savannah Brenneke, Meredith Meacham, Amanda Bunting, and 2 more authors

In 85th Annual Scientific Meeting of College on Problems of Drug Dependence, Jul 2023

Public Health Reddit Social Media

Automated Discovery of Perceived Health-related Concerns about E-cigarettes from Reddit

Alexandra DeLucia, Adam Poliak, Zechariah Zu, and 5 more authors

In 29th Annual Meeting of the Society for Research on Nicotine and Tobacco, Mar 2023

Public Health Reddit Social Media

Significance: Public health communications concerning the risks of electronic nicotine delivery systems (ENDS) must address the public’s perceived health-related concerns. Identifying concerns of consumers can lead to more targeted messaging campaigns from health organizations. Current survey methods that rely on participant responses to specific questions may miss important unsolicited health concerns. Our analyses focus on machine learning methods utilizing crowdsourced conversations from the social media platform Reddit to discover naturally emerging ENDS health-related perceptions and outcomes. Methods: We obtained a sample of Reddit posts discussing ENDS-related health concerns. We collected all posts from the Reddit subcommunity, or “subreddit”, “r/electronic_cigarette” from its inception on September 17, 2008 through April 1, 2022 (N=10,403,433 posts) and identified posts containing questions about health outcomes, e.g. “does vaping cause” or “does ejuice flavor cause”. We collected replies (N=1,438) to these posts explicitly discussing health concerns. To form a larger dataset of posts discussing health concerns, we used a machine learning-based semantic search model to identify 10,905 posts from the subreddit with the most similar content to the collected replies. We compared these posts discussing health concerns to a random non-overlapping sample of posts (N=10,905) from the same subreddit. For every word in the 21,810 posts, we computed the conditional probability that the word was used in a post about health concerns compared to being from the random sample. All words with at least 0.8 conditional probability (N=367) were annotated with an open coding scheme for exclusive health categories. Three coders labeled the words with 100% agreement. Results: Of the 367 unique words, 121 were annotated as a health concern and grouped into 14 categories. The most cited categories of concerns were respiratory (3,983), addiction (1,147), allergy (643), oral health (589), and cardiovascular (278). Others included: mental health (191), gastrointestinal (141), oncologic (227), inflammation (64), neurological (207), dermatological (31), orthopedic (28), and sleep concerns (79). Health-related words with no clear category were grouped as non-specific. Conclusions: Machine learning models can identify potential consumer beliefs and perceptions regarding ENDS-related health topics found in social media platforms such as Reddit, which can inform campaign and health messaging and public education opportunities.
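
The word-selection step in the Methods can be illustrated with a short sketch. It assumes whitespace tokenization and counts each word once per post, neither of which is stated in the abstract; the function names and the toy posts are hypothetical and are not data from the study.

```python
from collections import Counter


def document_frequency(posts):
    """Count, for each word, the number of posts that contain it."""
    counts = Counter()
    for post in posts:
        counts.update(set(post.lower().split()))
    return counts


def health_word_probability(health_posts, random_posts):
    """Estimate P(post is from the health sample | post contains the word)."""
    health_df = document_frequency(health_posts)
    random_df = document_frequency(random_posts)
    vocab = set(health_df) | set(random_df)
    return {w: health_df[w] / (health_df[w] + random_df[w]) for w in vocab}


# Toy, equal-sized samples (invented for illustration).
health_posts = ["vaping makes me cough", "cough after new coil", "my throat hurts"]
random_posts = ["new coil build today", "best mod for clouds", "my juice leaked"]

probs = health_word_probability(health_posts, random_posts)
# Keep words at or above the 0.8 threshold mentioned in the abstract.
selected = sorted(w for w, p in probs.items() if p >= 0.8)
```

Because the two samples are the same size, a word that appears equally often in both (such as "coil" here) gets a probability of 0.5 and is filtered out, while words concentrated in the health sample pass the threshold and go on to manual coding.
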

2020

Analyzing HPC Support Tickets: Experience and Recommendations

Alexandra DeLucia and Elisabeth Moore

arXiv preprint arXiv:2010.04321, 2020

Systems

High performance computing (HPC) user support teams are the first line of defense against large-scale problems, as they are often the first to learn of problems reported by users. Developing tools to better assist support teams in solving user problems and tracking issue trends is critical for maintaining system health. Our work examines the Los Alamos National Laboratory HPC Consult Team’s user support ticketing system and develops proof of concept tools to automate tasks such as category assignment and similar ticket recommendation. We also generate new categories for reporting and discuss ideas to improve future ticketing systems.

2018

Modeling High Performance Computing System Log Messages for Early Prediction of Job Outcome

Alexandra DeLucia and Elisabeth Moore

2018

Systems

Work in Progress: Topic Modeling for HPC Job State Prediction

Alexandra DeLucia and Elisabeth Baseman

In Proceedings of the First Workshop on Machine Learning for Computing Systems, 2018

Systems

2017

High Performance Computing Job Outcome Prediction by Mining System Logs

Alexandra DeLucia and Elisabeth Baseman

2017

Systems

Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs

Abida Haque, Alexandra DeLucia, and Elisabeth Baseman

In Proceedings of the Fourth International Workshop on HPC User Support Tools, Nov 2017

Systems

As high performance computing approaches the exascale era, analyzing the massive amount of monitoring data generated by supercomputers is quickly becoming intractable for human analysts. In particular, system logs, which are a crucial source of information regarding machine health and root cause analysis of problems and failures, are becoming far too large for a human to review by hand. We take a step toward mitigating this problem through mathematical modeling of textual system log data in order to automatically capture normal behavior and identify anomalous and potentially interesting log messages. We learn a Markov chain model from average case system logs and use it to generate synthetic system log data. We present a variety of evaluation metrics for scoring similarity between the synthetic logs and the real logs, thus defining and quantifying normal behavior. Then, we explore the abilities of this learned model to identify anomalous behavior by evaluating its ability to catch inserted and missing log messages. We evaluate our model and its performance on the anomaly detection task using a large set of system log files from two institutional computing clusters at Los Alamos National Laboratory. We find that while our model seems to pick up on key features of normal behavior, its ability to detect anomalies varies greatly by anomaly type and the training and test data used. Overall, we find mathematical modeling of system logs to be a promising area for further work, particularly with the goal of aiding human operators in troubleshooting tasks.
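
As a rough illustration of the approach described in this abstract, the sketch below fits a first-order Markov chain to sequences of log message types and scores new sequences by average transition log-likelihood. How raw log lines map to states, the probability floor for unseen transitions, and all names and toy sequences are assumptions made for this example, not details from the paper.

```python
import math
from collections import Counter, defaultdict


def fit_markov_chain(sequences):
    """Estimate first-order transition probabilities between message types."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for prev, nexts in counts.items()
    }


def avg_log_likelihood(model, seq, floor=1e-6):
    """Mean log-probability per transition; unseen transitions get `floor`."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        raise ValueError("sequence needs at least two messages")
    total = sum(math.log(model.get(p, {}).get(n, floor)) for p, n in pairs)
    return total / len(pairs)


# Toy "normal" logs, already reduced to message types (hypothetical).
normal_logs = [
    ["boot", "mount", "job_start", "job_end"],
    ["boot", "mount", "job_start", "job_end"],
    ["boot", "mount", "job_start", "job_start", "job_end"],
]
model = fit_markov_chain(normal_logs)

typical = avg_log_likelihood(model, ["boot", "mount", "job_start", "job_end"])
odd = avg_log_likelihood(model, ["boot", "job_end", "mount", "job_start"])
```

The reordered sequence scores far lower than the typical one because it contains transitions never seen in training. A threshold on this score is one simple way to flag inserted or missing messages, though the abstract notes that detection quality varies by anomaly type and by the data used.
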

All Publications

2025

Histories of Daily Cannabis Use Patterns and Reported Negative Effects: A Content Analysis of Three Cannabis-Centric Communities on Reddit

Savannah Brenneke, Eloise Estrada, Epifania Ortiz, and 4 more authors

Drug and Alcohol Dependence, 2025

Public Health Reddit Social Media

This qualitative study sought to describe Redditor histories of daily cannabis use and patterns of use, while exploring themes around reported negative effects associated with daily use.

Leveraging Cannabis-Centric Reddit Communities to Explore Patterns of Cannabis Product, Modes and Frequency of Use between 2010-2019

Savannah Brenneke, Alexandra DeLucia, Gwenyth Portillo Wightman, and 3 more authors

Drug and Alcohol Dependence, 2025

Public Health Reddit Social Media

Can One Size Fit All?: Measuring Failure in Multi-Document Summarization Domain Transfer

Alexandra DeLucia and Mark Dredze

Jul 2025

LLMs Evaluation

Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, from news articles to conversations with multiple speakers. The training approaches for current MDS models fall into four groups: end-to-end with special pre-training ("direct"), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality) to analyze how and why models trained on one domain can fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer "failure" as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out-of-the-box.

MedScore: Generalizable Factuality Evaluation of Free-Form Medical Answers by Domain-adapted Claim Decomposition and Verification

Heyuan Huang, Alexandra DeLucia, Vijay Murari Tiyyala, and 1 more author

Oct 2025

Health AI Evaluation

While Large Language Models (LLMs) can generate fluent and convincing responses, they are not necessarily correct. This is especially apparent in the popular decompose-then-verify factuality evaluation pipeline, where LLMs evaluate generations by decomposing them into individual, valid claims. Factuality evaluation is especially important for medical answers, since incorrect medical information could seriously harm the patient. However, existing factuality systems are a poor match for the medical domain, as they are typically only evaluated on objective, entity-centric, formulaic texts such as biographies and historical topics. Medical answers, by contrast, are condition-dependent, conversational, hypothetical, subjective, and diverse in sentence structure, which makes decomposition into valid facts challenging. We propose MedScore, a new pipeline to decompose medical answers into condition-aware valid facts and verify them against in-domain corpora. Our method extracts up to three times more valid facts than existing methods, reducing hallucination and vague references and retaining condition-dependency in facts. The resulting factuality score varies substantially by decomposition method, verification corpus, and backbone LLM, highlighting the importance of customizing each step for reliable factuality evaluation with our generalizable and modular pipeline for domain adaptation.

MedExpert: An Expert-Annotated Dataset for Medical Chatbot Evaluation

Mahsa Yarmohammadi, Alexandra DeLucia, Lillian C Chen, and 8 more authors

In Machine Learning for Health 2025, 2025

Health AI Annotation Dataset

2024

Using Natural Language Inference to Improve Persona Extraction from Dialogue in a New Domain

Alexandra DeLucia, Mengjie Zhao, Yoshinori Maeda, and 3 more authors

arXiv preprint arXiv:2401.06742, 2024

LLMs Decoding

While valuable datasets such as PersonaChat provide a foundation for training persona-grounded dialogue agents, they lack diversity in conversational and narrative settings, primarily existing in the "real" world. To develop dialogue agents with unique personas, models are trained to converse given a specific persona, but hand-crafting these personas can be time-consuming, so methods exist to automatically extract persona information from existing character-specific dialogue. However, these persona-extraction models are also trained on datasets derived from PersonaChat and struggle to provide high-quality persona information from conversational settings that do not take place in the real world, such as the fantasy-focused dataset, LIGHT. Creating new data to train models on a specific setting is human-intensive and thus prohibitively expensive. To address both of these issues, we introduce a natural language inference method for post-hoc adapting a trained persona extraction model to a new setting. We draw inspiration from the literature of dialog natural language inference (NLI), and devise NLI-reranking methods to extract structured persona information from dialogue. Compared to existing persona extraction models, our method returns higher-quality extracted personas and requires less human annotation.

First-Hand Accounts of Structural Stigma toward People Who Use Opioids on Reddit

Evan L Eschliman, Karen Choe, Alexandra DeLucia, and 6 more authors

Social Science & Medicine, 2024

Public Health Annotation Reddit Social Media

Anti-LM Decoding for Zero-shot In-context Machine Translation

Suzanna Sia, Alexandra DeLucia, and Kevin Duh

In Findings of the Association for Computational Linguistics: NAACL 2024, Jun 2024

LLMs

Zero-shot in-context learning is the phenomenon where models can perform a task given only the instructions. However, pre-trained large language models are known to be poorly calibrated for zero-shot tasks. One of the most effective approaches to handling this bias is to adopt a contrastive decoding objective, which accounts for the prior probability of generating the next token by conditioning on a context. This work introduces an Anti-Language Model objective with a decay factor designed to address the weaknesses of in-context machine translation. We conduct our experiments across 3 model types and sizes, 3 language directions, and both greedy decoding and beam search. The proposed method outperforms other state-of-the-art decoding objectives, with an improvement of up to 20 BLEU points over the default objective in some settings.

2023

R/AskAComputerScientist: Processing Reddit Data for the Social Sciences

Savannah Brenneke, Meredith Meacham, Amanda Bunting, and 2 more authors

In 85th Annual Scientific Meeting of College on Problems of Drug Dependence, Jul 2023

Public Health Reddit Social Media

A Multi-instance Learning Approach to Civil Unrest Event Detection on Twitter

Alexandra DeLucia, Mark Dredze, and Anna L. Buczak

In Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text, Sep 2023

Crisis Informatics Social Media

Social media has become an established platform for people to organize and take offline actions, often in the form of civil unrest. Understanding these events can help support pro-democratic movements. The primary method to detect these events on Twitter relies on aggregating many tweets, but this includes many that are not relevant to the task. We propose a multi-instance learning (MIL) approach, which jointly identifies relevant tweets and detects civil unrest events. We demonstrate that MIL improves civil unrest detection over methods based on simple aggregation. Our best model achieves an F1 of 0.73 on the Global Civil Unrest on Twitter (G-CUT) dataset.

Automated Discovery of Perceived Health-related Concerns about E-cigarettes from Reddit

Alexandra DeLucia, Adam Poliak, Zechariah Zu, and 5 more authors

In 29th Annual Meeting of the Society for Research on Nicotine and Tobacco, Mar 2023

Public Health Reddit Social Media

Significance: Public health communications concerning the risks of ENDS must address the public’s perceived health-related concerns. Identifying concerns of consumers can lead to more targeted messaging campaigns from health organizations. Current survey methods that rely on participant responses to specific questions may miss important unsolicited health concerns. Our analyses focus on machine learning methods utilizing crowdsourced conversations from the social media platform Reddit to discover naturally emerging ENDS health-related perceptions and outcomes. Methods: We obtained a sample of Reddit posts discussing ENDS-related health concerns. We collected all posts from the Reddit subcommunity, or “subreddit”, “r/electronic_cigarette” from its inception on September 17, 2008 through April 1, 2022 (N=10,403,433 posts) and identified posts containing questions about health outcomes, e.g., “does vaping cause” or “does ejuice flavor cause”. We collected replies (N=1,438) to these posts explicitly discussing health concerns. To form a larger dataset of posts discussing health concerns, we used a machine learning-based semantic search model to identify 10,905 posts from the subreddit with the most similar content to the collected replies. We compared these posts discussing health concerns to a random non-overlapping sample of posts (N=10,905) from the same subreddit. For every word in the 21,810 posts, we computed the conditional probability that the word was used in a post about health concerns compared to being from the random sample. All words with at least 0.8 conditional probability (N=367) were annotated with an open coding scheme for exclusive health categories. Three coders labeled the words with 100% agreement. Results: Of the 367 unique words, 121 were annotated as a health concern and grouped into 14 categories. The most cited categories of concerns were respiratory (3,983), addiction (1,147), allergy (643), oral health (589), and cardiovascular (278). Others included: mental health (191), gastrointestinal (141), oncologic (227), inflammation (64), neurological (207), dermatological (31), orthopedic (28), and sleep concerns (79). Health-related words with no clear category were grouped as non-specific. Conclusions: Machine learning models can identify potential consumer beliefs and perceptions regarding ENDS-related health topics found in social media platforms such as Reddit, which can inform campaign and health messaging and public education opportunities.
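
The word-level conditional probability described in the Methods could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function name `health_word_probabilities`, the whitespace tokenization, the per-post (document-level) counting, and the toy posts are all assumptions. With equal-sized samples, as in the abstract, the ratio below estimates the probability that a post containing a word comes from the health-concern set.

```python
from collections import Counter


def health_word_probabilities(health_posts, random_posts):
    """Estimate P(post is from the health set | post contains word) per word.

    Assumption: each word is counted at most once per post, and tokens are
    produced by lowercasing and splitting on whitespace.
    """
    health_counts = Counter()
    random_counts = Counter()
    for post in health_posts:
        health_counts.update(set(post.lower().split()))
    for post in random_posts:
        random_counts.update(set(post.lower().split()))
    vocab = set(health_counts) | set(random_counts)
    return {
        word: health_counts[word] / (health_counts[word] + random_counts[word])
        for word in vocab
    }


# Toy, made-up posts (hypothetical data, equal sample sizes).
health = ["vaping gives me a cough", "my cough got worse", "dry throat after vaping"]
random_sample = ["new coil for my mod", "vaping with a new tank", "best juice flavors"]

probs = health_word_probabilities(health, random_sample)
# Keep words at or above the 0.8 threshold mentioned in the abstract.
flagged = sorted(word for word, p in probs.items() if p >= 0.8)
```

In this toy example, "cough" appears only in the health posts and is flagged, while "vaping" appears in both samples and falls below the threshold.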
Common Law Annotations: Investigating the Stability of Dialog System Output Annotations

Seunggun Lee, Alexandra DeLucia, Nikita Nangia, and 9 more authors

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

LLMs Annotation Evaluation

Abs DOI PDF

Metrics for Inter-Annotator Agreement (IAA), like Cohen’s Kappa, are crucial for validating annotated datasets. Although high agreement is often used to show the reliability of annotation procedures, it is insufficient to ensure validity or reproducibility. While researchers are encouraged to increase annotator agreement, this can lead to specific and tailored annotation guidelines. We hypothesize that this may result in diverging annotations from different groups. To study this, we first propose the Lee et al. Protocol (LEAP), a standardized and codified annotation protocol. LEAP strictly enforces transparency in the annotation process, which ensures reproducibility of annotation guidelines. Using LEAP to annotate a dialog dataset, we empirically show that while research groups may create reliable guidelines by raising agreement, this can cause divergent annotations across different research groups, thus questioning the validity of the annotations. Therefore, we caution NLP researchers against using reliability as a proxy for reproducibility and validity.
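
For readers unfamiliar with the agreement metric named above, Cohen's Kappa for two annotators can be computed as in the following minimal sketch. The function name and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four dialog responses.
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items (0.75 observed) against 0.5 expected by chance, giving a Kappa of 0.5.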
The SIGMORPHON 2022 Shared Task on Cross-lingual and Low-Resource Grapheme-to-Phoneme Conversion

Arya D. McCarthy, Jackson L. Lee, Alexandra DeLucia, and 7 more authors

In Proceedings of the 20th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Jul 2023

Abs DOI

Grapheme-to-phoneme conversion is an important component in many speech technologies, but until recently there were no multilingual benchmarks for this task. The third iteration of the SIGMORPHON shared task on multilingual grapheme-to-phoneme conversion features many improvements from the previous year’s task (Ashby et al., 2021), including additional languages, three subtasks varying the amount of available resources, extensive quality assurance procedures, and automated error analyses. Three teams submitted a total of fifteen systems, at best achieving relative reductions of word error rate of 14% in the cross-lingual subtask and 14% in the very-low-resource subtask. The generally consistent result is that cross-lingual transfer substantially helps grapheme-to-phoneme modeling, but not to the same degree as in-language examples.
Strength in Numbers: Estimating Confidence of Large Language Models by Prompt Agreement

Gwenyth Portillo Wightman, Alexandra DeLucia, and Mark Dredze

In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), Jul 2023

LLMs

Abs DOI PDF Code

Large language models have achieved impressive few-shot performance on a wide variety of tasks. However, in many settings, users require confidence estimates for model predictions. While traditional classifiers produce scores for each label, language models instead produce scores for the generation which may not be well calibrated. We compare generations across diverse prompts and show that these can be used to create confidence scores. By utilizing more prompts we can get more precise confidence estimates and use response diversity as a proxy for confidence. We evaluate this approach across ten multiple-choice question-answering datasets using three models: T0, FLAN-T5, and GPT-3. In addition to analyzing multiple human written prompts, we automatically generate more prompts using a language model in order to produce finer-grained confidence estimates. Our method produces more calibrated confidence estimates compared to the log probability of the answer to a single prompt. These improvements could benefit users who rely on prediction confidence for integration into a larger system or in decision-making processes.
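
The idea of using agreement across prompts as a confidence signal can be illustrated with a simple majority-vote sketch. This is an assumption-laden illustration, not the paper's implementation: the function name and the example answers are hypothetical, and the paper's actual scoring may differ.

```python
from collections import Counter


def prompt_agreement_confidence(answers):
    """Return the most common answer and the fraction of prompts that produced it.

    `answers` holds one model answer per prompt for the same question.
    """
    if not answers:
        raise ValueError("At least one answer is required.")
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical answers to one multiple-choice question under five prompts.
answers = ["B", "B", "A", "B", "C"]
prediction, confidence = prompt_agreement_confidence(answers)
```

With more prompts, the confidence value can take more distinct levels, which matches the abstract's point about finer-grained estimates.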
Geo-Seq2seq: Twitter User Geolocation on Noisy Data through Sequence to Sequence Learning

Jingyu Zhang, Alexandra DeLucia, Chenyu Zhang, and 1 more author

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

Crisis Informatics Social Media Decoding

Abs DOI PDF Code

Location information can support social media analyses by providing geographic context. Some of the most accurate and popular Twitter geolocation systems rely on rule-based methods that examine the user-provided profile location, which fail to handle informal or noisy location names. We propose Geo-Seq2seq, a sequence-to-sequence (seq2seq) model for Twitter user geolocation that rewrites noisy, multilingual user-provided location strings into structured English location names. We train our system on tens of millions of multilingual location string and geotagged-tweet pairs. Compared to leading methods, our model vastly increases coverage (i.e., the number of users we can geolocate) while achieving comparable or superior accuracy. Our error analysis reveals that constrained decoding helps the model produce valid locations according to a location database. Finally, we measure biases across language, country of origin, and time to evaluate fairness, and find that while our model can generalize well to unseen temporal data, performance does vary by language and country.

2022

Bernice: A Multilingual Pre-trained Encoder for Twitter

Alexandra DeLucia, Shijie Wu, Aaron Mueller, and 3 more authors

In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Dec 2022

Social Media

Abs DOI PDF Code Data Model

The language of Twitter differs significantly from that of other domains commonly included in large language model training. While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.
Changes in Tweet Geolocation over Time: A Study with Carmen 2.0

Jingyu Zhang, Alexandra DeLucia, and Mark Dredze

In Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022), Oct 2022

Crisis Informatics Social Media

Abs PDF Code

Researchers across disciplines use Twitter geolocation tools to filter data for desired locations. These tools have largely been trained and tested on English tweets, often originating in the United States from almost a decade ago. Despite the importance of these tools for data curation, the impact of tweet language, country of origin, and creation date on tool performance remains largely unknown. We explore these issues with Carmen, a popular tool for Twitter geolocation. To support this study we introduce Carmen 2.0, a major update which includes the incorporation of GeoNames, a gazetteer that provides much broader coverage of locations. We evaluate using two new Twitter datasets, one for multilingual, multiyear geolocation evaluation, and another for usage trends over time. We found that language, country of origin, and time do impact geolocation tool performance.

2021

Study of Manifestation of Civil Unrest on Twitter

Abhinav Chinta, Jingyu Zhang, Alexandra DeLucia, and 2 more authors

In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), Nov 2021

Crisis Informatics Social Media Dataset

Abs DOI PDF Code Data

Twitter is commonly used for civil unrest detection and forecasting tasks, but there is a lack of work in evaluating how civil unrest manifests on Twitter across countries and events. We present two in-depth case studies for two specific large-scale events, one in a country with high (English) Twitter usage (Johannesburg riots in South Africa) and one in a country with low Twitter usage (Burayu massacre protests in Ethiopia). We show that while there is event signal during the events, there is little signal leading up to the events. In addition to the case studies, we train Ngram-based models on a larger set of Twitter civil unrest data across time, events, and countries and use machine learning explainability tools (SHAP) to identify important features. The models were able to find words indicative of civil unrest that generalized across countries. The 42 countries span Africa, the Middle East, and Southeast Asia, and the events occurred between 2014 and 2019.
Decoding Methods for Neural Narrative Generation

Alexandra DeLucia, Aaron Mueller, Xiang Lisa Li, and 1 more author

In Proceedings of the First Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), Aug 2021

LLMs Evaluation Decoding Annotation

Abs DOI PDF Code Model

Narrative generation is an open-ended NLP task in which a model generates a story given a prompt. The task is similar to neural response generation for chatbots; however, innovations in response generation are often not applied to narrative generation, despite the similarity between these tasks. We aim to bridge this gap by applying and evaluating advances in decoding methods for neural response generation to neural narrative generation. In particular, we employ GPT-2 and perform ablations across nucleus sampling thresholds and diverse decoding hyperparameters—specifically, maximum mutual information—analyzing results over multiple criteria with automatic and human evaluation. We find that (1) nucleus sampling is generally best with thresholds between 0.7 and 0.9; (2) a maximum mutual information objective can improve the quality of generated stories; and (3) established automatic metrics do not correlate well with human judgments of narrative quality on any qualitative metric.
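
Nucleus (top-p) sampling, the decoding method whose threshold is varied above, can be sketched in plain Python as follows. This is a minimal illustration over a toy next-token distribution, not the paper's GPT-2 setup; the function names and the probabilities are assumptions.

```python
import random


def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p,
    then renormalize the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    return {token: prob / total for token, prob in kept}


def sample_nucleus(probs, p, rng=random):
    """Sample one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, p)
    return rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]


# Toy next-token distribution (hypothetical values).
next_token_probs = {"the": 0.5, "a": 0.25, "dragon": 0.125, "xylophone": 0.125}
filtered = nucleus_filter(next_token_probs, p=0.75)
token = sample_nucleus(next_token_probs, p=0.75, rng=random.Random(0))
```

With a threshold of 0.75, only "the" and "a" survive the filter, so low-probability tokens such as "xylophone" can never be sampled. Raising the threshold admits more of the tail.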

2020

Analyzing HPC Support Tickets: Experience and Recommendations

Alexandra DeLucia and Elisabeth Moore

arXiv preprint arXiv:2010.04321, 2020

Systems

Abs PDF

High performance computing (HPC) user support teams are the first line of defense against large-scale problems, as they are often the first to learn of problems reported by users. Developing tools to better assist support teams in solving user problems and tracking issue trends is critical for maintaining system health. Our work examines the Los Alamos National Laboratory HPC Consult Team’s user support ticketing system and develops proof of concept tools to automate tasks such as category assignment and similar ticket recommendation. We also generate new categories for reporting and discuss ideas to improve future ticketing systems.
Civil Unrest on Twitter (CUT): A Dataset of Tweets to Support Research on Civil Unrest

Justin Sech, Alexandra DeLucia, Anna L. Buczak, and 1 more author

In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), Nov 2020

Social Media Dataset Annotation

Abs DOI PDF Code

We present CUT, a dataset for studying Civil Unrest on Twitter. Our dataset includes 4,381 tweets related to civil unrest, hand-annotated with information related to the study of civil unrest discussion and events. Our dataset is drawn from 42 countries from 2014 to 2019. We present baseline systems trained on this data for the identification of tweets related to civil unrest. We include a discussion of ethical issues related to research on this topic.

2018

Modeling High Performance Computing System Log Messages for Early Prediction of Job Outcome

Alexandra DeLucia and Elisabeth Moore

2018

Systems
Work in Progress: Topic Modeling for HPC Job State Prediction

Alexandra DeLucia and Elisabeth Baseman

In Proceedings of the First Workshop on Machine Learning for Computing Systems, 2018

Systems

2017

High Performance Computing Job Outcome Prediction by Mining System Logs

Alexandra DeLucia and Elisabeth Baseman

2017

Systems
Markov Chain Modeling for Anomaly Detection in High Performance Computing System Logs

Abida Haque, Alexandra DeLucia, and Elisabeth Baseman

In Proceedings of the Fourth International Workshop on HPC User Support Tools, Nov 2017

Systems

Abs DOI PDF

As high performance computing approaches the exascale era, analyzing the massive amount of monitoring data generated by supercomputers is quickly becoming intractable for human analysts. In particular, system logs, which are a crucial source of information regarding machine health and root cause analysis of problems and failures, are becoming far too large for a human to review by hand. We take a step toward mitigating this problem through mathematical modeling of textual system log data in order to automatically capture normal behavior and identify anomalous and potentially interesting log messages. We learn a Markov chain model from average case system logs and use it to generate synthetic system log data. We present a variety of evaluation metrics for scoring similarity between the synthetic logs and the real logs, thus defining and quantifying normal behavior. Then, we explore the abilities of this learned model to identify anomalous behavior by evaluating its ability to catch inserted and missing log messages. We evaluate our model and its performance on the anomaly detection task using a large set of system log files from two institutional computing clusters at Los Alamos National Laboratory. We find that while our model seems to pick up on key features of normal behavior, its ability to detect anomalies varies greatly by anomaly type and the training and test data used. Overall, we find mathematical modeling of system logs to be a promising area for further work, particularly with the goal of aiding human operators in troubleshooting tasks.
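
A first-order Markov chain over log message types, of the general kind described above, might look like the following sketch. The paper's actual log representation and evaluation metrics are not given here, so treating each message as a discrete type, the function names, and the toy sequences are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def fit_markov_chain(sequences):
    """Estimate transition probabilities between consecutive message types."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for state, ctr in counts.items()
    }


def generate(chain, start, length, rng):
    """Generate a synthetic sequence by walking the chain from `start`."""
    state, out = start, [start]
    while len(out) < length and state in chain:
        options = chain[state]
        state = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        out.append(state)
    return out


def unseen_transitions(chain, seq):
    """Flag transitions never observed in training as potential anomalies."""
    return [(a, b) for a, b in zip(seq, seq[1:]) if chain.get(a, {}).get(b, 0.0) == 0.0]


# Hypothetical message-type sequences standing in for normal system logs.
logs = [
    ["boot", "mount", "login", "job_start", "job_end"],
    ["boot", "mount", "login", "job_start", "job_end"],
    ["boot", "mount", "job_start", "job_end"],
]
chain = fit_markov_chain(logs)
synthetic = generate(chain, "boot", 5, random.Random(0))
# A log with the "mount" message missing.
anomalies = unseen_transitions(chain, ["boot", "login", "job_start", "job_end"])
```

In this toy case, the missing "mount" message surfaces as the unseen transition from "boot" to "login".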

2015

Self-Driven Service Learning: Community-Student-Faculty Collaboratives Outside of the Classroom

Verónica A. Segarra, Alexandra A. DeLucia, Alyssa A. DeLucia, and 16 more authors

Journal of Microbiology & Biology Education, Dec 2015

Abs DOI PDF

Service learning is a community engagement pedagogy often used in the context of the undergraduate classroom to synergize course-learning objectives with community needs. We find that an effective way to catalyze student engagement in service learning is for student participation to occur outside the context of a graded course, driven by students’ own interests and initiative. In this paper, we describe the creation and implementation of a self-driven service learning program and discuss its benefits from the community, student, and faculty points of view. This experience allows students to explore careers in the sciences as well as identify skill strengths and weaknesses in an environment where mentoring is available but where student initiative and self-motivation are the driving forces behind the project’s success. Self-driven service learning introduces young scientists to the idea that their careers serve a larger community that benefits not only from their discoveries but also from effective communication about how these discoveries are relevant to everyday life.

Research Themes

Health AI: Evaluation of AI for medical question answering.

LLM Evaluation & Generation: Training, fine-tuning, and evaluating language models for text generation, with a focus on decoding strategies, latency, and persona extraction.

Computational Social Science & Crisis Informatics: Using social media data to model populations and predict civil unrest events such as riots and protests.

Public Health: Joint work with the FDA and JHU Bloomberg School of Public Health on social media data collection, processing, and extraction to support public health research on nicotine and drug dependence.

Systems & High Performance Computing: Applying NLP to high performance computing monitoring and help-desk ticketing support for efficient maintenance of supercomputers.
