Publications tagged: #benchmark

  • Hell or High Water: Can Language Model Agents Formulate Backup Plans?

    As language model agents are applied to real-world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? To answer this question, we devise a benchmark where each problem has at least two ways of solving it via distinct combinations of function calls. The agent interacts with this environment by searching for relevant functions from a set of over four thousand possibilities. When we disable a function the agent is calling and communicate an error to the agent via natural language, we expect it to find a backup solution through trial and error. Overall, we find that language agents struggle to formulate and execute backup plans in response to environment feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size, as well as the benefits of scaling model size in our setting. Our analysis identifies key challenges for current-generation models as well as promising directions for future work.

    Andrew Wang, Sophia Hager, Adi Asija, Daniel Khashabi, Nicholas Andrews

    Second Conference on Language Modeling (COLM), 2025

    PDF BibTeX

    #agents #benchmark #language_grounding
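
    The interaction pattern described in this abstract lends itself to a compact illustration. Below is a minimal sketch of such an agent-environment loop; the `ToolEnvironment` interface, the greedy fallback agent, and the tool names are hypothetical stand-ins, not the benchmark's actual API.

    ```python
    # Hypothetical illustration of an agent retrying with a backup plan -- not the paper's API.

    class ToolEnvironment:
        """Toy environment: a task is solvable via distinct combinations of tool calls."""

        def __init__(self, solutions, disabled):
            self.solutions = solutions      # e.g. [("book_flight",), ("book_train",)]
            self.disabled = set(disabled)   # tools knocked out at test time
            self.completed = set()

        def call(self, tool_name):
            if tool_name in self.disabled:
                return f"Error: `{tool_name}` is currently unavailable."
            self.completed.add(tool_name)
            return f"`{tool_name}` succeeded."

        def solved(self):
            return any(set(plan) <= self.completed for plan in self.solutions)


    def run_agent(env, candidate_plans, max_steps=10):
        """Greedy agent: attempt each plan in order, falling back on error feedback."""
        steps = 0
        for plan in candidate_plans:
            for tool in plan:
                if steps >= max_steps:
                    return False
                feedback = env.call(tool)
                steps += 1
                print(feedback)
                if feedback.startswith("Error"):
                    break               # abandon this plan and try a backup
            if env.solved():
                return True
        return False


    env = ToolEnvironment(
        solutions=[("book_flight",), ("book_train",)],
        disabled={"book_flight"},       # the primary route is disabled
    )
    print("solved:", run_agent(env, [("book_flight",), ("book_train",)]))
    ```

    In the benchmark itself the agent must also retrieve candidate functions from the pool of over four thousand possibilities; that retrieval step is omitted from this sketch.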

  • ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts

    The problem of synthetic speech detection has enjoyed considerable attention, with recent methods achieving low error rates across several established benchmarks. However, to what extent do low error rates on academic benchmarks translate to more realistic conditions? In practice, while the training set is fixed at one point in time, test-time conditions may exhibit distribution shifts relative to the training conditions, such as changes in speaker characteristics, emotional expressiveness, language and acoustic conditions, and the emergence of novel synthesis methods. Although some existing datasets target subsets of these distribution shifts, systematic analysis remains difficult due to inconsistencies between source data and synthesis systems across datasets. This difficulty is further exacerbated by the rapid development of new text-to-speech (TTS) and vocoder systems, which continually expand the diversity of synthetic speech. To enable systematic benchmarking of model performance under distribution shifts, we introduce ShiftySpeech, a large-scale benchmark comprising over 3,000 hours of synthetic speech across 7 source domains, 6 TTS systems, 12 vocoders, and 3 languages. ShiftySpeech is specifically designed to evaluate model generalization under controlled distribution shifts while ensuring broad coverage of modern synthetic speech generation techniques. It fills a key gap in current benchmarks by supporting fine-grained, controlled analysis of generalization robustness. All tested distribution shifts significantly degrade the performance of state-of-the-art detection approaches based on self-supervised features. Overall, our findings suggest that reliance on synthetic speech detection methods in production environments should be carefully evaluated based on anticipated distribution shifts.

    Ashi Garg, Zexin Cai, Lin Zhang, Henry Li Xinyuan, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews

    arXiv preprint arXiv:2502.05674, 2025

    PDF BibTeX

    #speech #deepfake_detection #benchmark #preprint
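
    As a rough illustration of the shift-conditioned evaluation this abstract describes, the sketch below scores a detector separately for each distribution-shift condition using an approximate equal error rate (EER). The scores, labels, and condition names are invented, and the EER routine is a simple threshold sweep rather than any particular toolkit's implementation.

    ```python
    # Hedged sketch: per-condition EER bookkeeping with made-up detector scores.
    from collections import defaultdict

    def equal_error_rate(scores, labels):
        """Approximate EER by sweeping thresholds over the observed scores.
        labels: 1 = synthetic (positive), 0 = bona fide."""
        best = 1.0
        for t in sorted(set(scores)):
            fa = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
            miss = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
            far = fa / max(1, labels.count(0))
            frr = miss / max(1, labels.count(1))
            best = min(best, max(far, frr))   # point where FAR and FRR roughly meet
        return best

    # (score, label, shift_condition) triples -- purely illustrative values
    trials = [
        (0.9, 1, "vocoder=hifigan"), (0.2, 0, "vocoder=hifigan"),
        (0.6, 1, "language=zh"),     (0.4, 0, "language=zh"),
        (0.55, 1, "language=zh"),    (0.3, 0, "vocoder=hifigan"),
    ]

    by_condition = defaultdict(lambda: ([], []))
    for score, label, cond in trials:
        by_condition[cond][0].append(score)
        by_condition[cond][1].append(label)

    for cond, (scores, labels) in by_condition.items():
        print(cond, "EER ~", round(equal_error_rate(scores, labels), 3))
    ```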

  • AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies

    Humans regularly engage in analogical thinking, relating personal experiences to current situations (X is analogous to Y because of Z). Analogical thinking allows humans to solve problems in creative ways, grasp difficult concepts, and articulate ideas more effectively. Can language models (LMs) do the same? To answer this question, we propose AnaloBench, a benchmark to evaluate analogical reasoning ability in LMs. Our benchmarking approach focuses on aspects of this ability that are common among humans: (i) recalling related experiences from a large amount of information, and (ii) applying analogical reasoning to complex and lengthy scenarios. We collect a set of 340 high-quality, human-written analogies for use in our benchmark, which constitutes the largest such collection to date. We then test a broad collection of models consisting of 12 open-source and 3 proprietary models of various sizes and architectures. As in prior results, scaling up LMs yields some performance boosts. Surprisingly, scale offers minimal gains when (i) analogies involve lengthy scenarios, or (ii) the task requires recalling relevant scenarios from a large pool of information, a process analogous to finding a needle in a haystack. We hope these observations encourage further research in this field.

    Xiao Ye, Andrew Wang, Jacob Choi, Yining Lu, Shreya Sharma, Lingfeng Shen, Vijay Murari Tiyyala, Nicholas Andrews, Daniel Khashabi

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

    PDF BibTeX

    #benchmark #llm
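
    One common way to run this kind of benchmark is as a multiple-choice task. The sketch below is a hedged illustration of that setup; `query_model` is a placeholder for whatever LM API is used, and the example stories are invented rather than drawn from AnaloBench.

    ```python
    # Hedged sketch of a multiple-choice analogy-identification evaluation.
    import string

    def build_prompt(story, candidates):
        options = "\n".join(
            f"({letter}) {text}"
            for letter, text in zip(string.ascii_uppercase, candidates)
        )
        return (
            "Which of the following stories is most analogous to the story below?\n\n"
            f"Story: {story}\n\n{options}\n\nAnswer with a single letter."
        )

    def query_model(prompt):
        # Placeholder: swap in a real LM call here.
        return "A"

    def evaluate(items):
        correct = 0
        for item in items:
            prompt = build_prompt(item["story"], item["candidates"])
            prediction = query_model(prompt).strip().upper()[:1]
            gold = string.ascii_uppercase[item["answer_index"]]
            correct += int(prediction == gold)
        return correct / len(items)

    items = [{
        "story": "A gardener prunes branches so the tree grows stronger.",
        "candidates": [
            "An editor cuts paragraphs so the essay reads better.",
            "A chef adds salt until the soup tastes right.",
        ],
        "answer_index": 0,
    }]
    print("accuracy:", evaluate(items))
    ```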

  • Forecasting Earnings Surprises from Conference Call Transcripts

    There is a multitude of textual data relevant to the financial markets, spanning genres such as financial news, earnings conference calls, and social media posts. Earnings conference calls are among the most important to information flow, as they represent direct communication between company executives, financial analysts, and large shareholders. Since these calls contain forward-looking content, they can be used to forecast the future performance of the company relative to market expectations. However, they typically contain over 5,000 words of text and large amounts of industry jargon. This length and domain-specific language present problems for many generic pretrained language models. In this work, we introduce a novel task of predicting earnings surprises from earnings call transcripts and contribute a new long-document dataset that tests financial understanding with complex signals. We explore a variety of approaches for this long-document classification task and establish some strong baselines. Furthermore, we demonstrate that it is possible to predict companies’ future earnings surprises solely from the text of their conference calls with reasonable accuracy. Finally, we probe the models through different interpretability methods and reveal some intuitive explanations of the linguistic features captured that go beyond traditional sentiment analysis.

    Ross Koval, Nicholas Andrews, Xifeng Yan

    Findings of the Association for Computational Linguistics: ACL 2023, 2023

    PDF BibTeX

    #finance #benchmark #language_grounding
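
    A natural baseline for documents of this length is to split each transcript into overlapping chunks that fit a model's context window, score the chunks, and pool the results. The sketch below illustrates only that pattern; `score_chunk` is a keyword-counting placeholder rather than the paper's model, and the threshold is arbitrary.

    ```python
    # Hedged sketch of chunk-and-pool classification for long transcripts.

    def chunk_transcript(tokens, window=512, stride=384):
        """Overlapping windows so no span is dropped at chunk boundaries."""
        chunks = []
        for start in range(0, max(1, len(tokens) - window + stride), stride):
            chunks.append(tokens[start:start + window])
        return chunks

    def score_chunk(chunk):
        # Placeholder scorer: a real system would run a pretrained encoder here.
        positive_cues = {"beat", "exceeded", "strong", "raised"}
        hits = sum(1 for tok in chunk if tok.lower() in positive_cues)
        return hits / max(1, len(chunk))

    def predict_surprise(transcript_text, threshold=0.05):
        tokens = transcript_text.split()
        chunk_scores = [score_chunk(c) for c in chunk_transcript(tokens)]
        doc_score = sum(chunk_scores) / len(chunk_scores)   # mean-pool over chunks
        return "positive_surprise" if doc_score >= threshold else "other"

    example = "Revenue exceeded guidance and we raised our full-year outlook " * 200
    print(predict_surprise(example))
    ```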

  • Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation

    Work on cross-document coreference resolution (CDCR) has primarily focused on news articles, with little to no work on social media. Yet social media may be particularly challenging since short messages provide little context, and informal names are pervasive. We introduce a new Twitter corpus with entity annotations grouped into clusters, supporting CDCR. Our corpus draws from Twitter data surrounding the 2013 Grammy music awards ceremony, providing a large set of annotated tweets focused on a single event. To establish a baseline, we evaluate two CDCR systems and consider the performance impact of each system component. Furthermore, we augment one system to include temporal information, which can be helpful when documents (such as tweets) arrive in a specific order. Finally, we include annotations linking the entities to a knowledge base to support entity linking. Our corpus is available: https://bitbucket.org/mdredze/tgx

    Mark Dredze, Nicholas Andrews, Jay DeYoung

    Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, 2016

    PDF BibTeX

    #social_media #benchmark
