Publications tagged: #agents

  • Feedback Friction: LLMs Struggle to Fully Incorporate External Feedback

    Recent studies have shown LLMs possess some ability to improve their responses when given external feedback. However, it remains unclear how effectively and thoroughly these models can incorporate extrinsic feedback. In an ideal scenario, if LLMs receive near-perfect and complete feedback, we would expect them to fully integrate the feedback and change their incorrect answers to correct ones. In this paper, we systematically investigate LLMs' ability to incorporate feedback by designing a controlled experimental environment. For each problem, a solver model attempts a solution, then a feedback generator with access to near-complete ground-truth answers produces targeted feedback, after which the solver tries again. We evaluate this pipeline across a diverse range of tasks, including math reasoning, knowledge reasoning, scientific reasoning, and general multi-domain evaluations with state-of-the-art language models including Claude 3.7 (with and without extended thinking). Surprisingly, even under these near-ideal conditions, solver models consistently show resistance to feedback, a limitation that we term FEEDBACK FRICTION. To mitigate this limitation, we experiment with sampling-based strategies like progressive temperature increases and explicit rejection of previously attempted incorrect answers, which yield improvements but still fail to help models achieve target performance. We also perform a rigorous exploration of potential causes of FEEDBACK FRICTION, ruling out factors such as model overconfidence and data familiarity. We hope that highlighting this issue in LLMs and ruling out several apparent causes will help future research in self-improvement.

    Dongwei Jiang, Alvin Zhang, Andrew Wang, Nicholas Andrews, Daniel Khashabi

    39th Conference on Neural Information Processing Systems (NeurIPS), 2025

    PDF BibTeX

    #llm #agents
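
    The abstract's controlled pipeline (solver attempts, feedback generator with near-ground-truth access responds, solver retries) can be sketched in a few lines. This is a toy simulation with assumed function names (`feedback_loop`, `stubborn_solver_factory`), not the paper's actual models or prompts; the "stubborn" solver stands in for the feedback-resistance behavior the paper reports.

    ```python
    def feedback_loop(problem, gold, solver, feedback_gen, max_rounds=5):
        """Iterated solve-feedback-retry pipeline (toy stand-in).

        Returns (final_answer, rounds_used).
        """
        answer = solver(problem, feedback=None)
        for round_num in range(1, max_rounds + 1):
            if answer == gold:
                return answer, round_num
            feedback = feedback_gen(problem, answer, gold)
            answer = solver(problem, feedback=feedback)
        return answer, max_rounds

    def stubborn_solver_factory(resist=2):
        """A solver that ignores the first `resist` rounds of feedback,
        mimicking feedback friction."""
        state = {"feedback_seen": 0}
        def solver(problem, feedback=None):
            if feedback is None:
                return "wrong"
            state["feedback_seen"] += 1
            return feedback["correct"] if state["feedback_seen"] > resist else "wrong"
        return solver

    def oracle_feedback(problem, answer, gold):
        """Near-complete feedback: reveals the correct answer directly."""
        return {"correct": gold}

    answer, rounds = feedback_loop("2+2?", "4", stubborn_solver_factory(2), oracle_feedback)
    # Even with oracle feedback, the stubborn solver needs several rounds.
    ```

    In the paper's setting the interesting case is when, unlike this toy, the solver never converges even under near-ideal feedback.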

  • Hell or High Water: Can Language Model Agents Formulate Backup Plans?

    As language model agents are applied to real world problems of increasing complexity, they will be expected to formulate plans across large search spaces. If those plans fail for reasons beyond their control, how well do language agents search for alternative ways to achieve their goals? To answer this question, we devise a benchmark where each problem has at least two ways of solving it via distinct combinations of function calls. The agent interacts with this environment by searching for relevant functions from a set of over four thousand possibilities. When we disable a function the agent is calling and communicate an error to that agent via natural language, we expect it to find a backup solution through trial and error. Overall, we find that language agents struggle to formulate and execute backup plans in response to environment feedback. While state-of-the-art models are often able to identify the correct function to use in the right context, they struggle to adapt to feedback from the environment and often fail to pursue alternate courses of action, even when the search space is artificially restricted. We provide a systematic analysis of the failures of both open-source and commercial models, examining the effects of search space size, as well as the benefits of scaling model size in our setting. Our analysis identifies key challenges for current generation models as well as promising directions for future work.

    Andrew Wang, Sophia Hager, Adi Asija, Daniel Khashabi, Nicholas Andrews

    Second Conference on Language Modeling (COLM), 2025

    PDF BibTeX

    #agents #benchmark #language_grounding
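
    The benchmark setup described above (an environment of callable functions, some disabled with a natural-language error, and an agent expected to fall back to an alternative call sequence) can be illustrated with a minimal sketch. The class and function names here (`ToolEnv`, `agent_with_backup`) and the toy tools are assumptions for illustration, not the paper's actual environment or agent.

    ```python
    class ToolEnv:
        """Toy tool-calling environment: disabled functions return a
        natural-language error instead of a result."""
        def __init__(self, tools, disabled=()):
            self.tools = dict(tools)       # name -> callable
            self.disabled = set(disabled)  # names that fail on call

        def call(self, name, *args):
            if name in self.disabled:
                return {"ok": False, "error": f"Function '{name}' is unavailable."}
            if name not in self.tools:
                return {"ok": False, "error": f"Unknown function '{name}'."}
            return {"ok": True, "result": self.tools[name](*args)}

    def agent_with_backup(env, candidate_plans):
        """Try each candidate plan (a sequence of function names) in order;
        on any error, abandon the plan and fall back to the next one."""
        for plan in candidate_plans:
            results = []
            for name in plan:
                out = env.call(name)
                if not out["ok"]:
                    break  # plan failed; try the next candidate
                results.append(out["result"])
            else:
                return plan, results  # every call in the plan succeeded
        return None, []

    # Two distinct ways to produce a report; the primary route is disabled.
    env = ToolEnv(
        tools={"render_pdf": lambda: "report.pdf", "export_html": lambda: "report.html"},
        disabled={"render_pdf"},
    )
    plan, results = agent_with_backup(env, [["render_pdf"], ["export_html"]])
    ```

    The hard-coded fallback here is exactly what the paper finds current agents often fail to do on their own when the error arrives as free-form environment feedback.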
