Refreshments are available starting at 10:30 a.m. The seminar will begin at 10:45 a.m.
Abstract
AI evaluations inform critical decisions, from the valuations of trillion-dollar companies to policies on regulating AI. Yet evaluation methods have failed to keep pace with deployment, creating an evaluation crisis in which performance in the lab fails to predict real-world utility. In this talk, Sayash Kapoor will discuss the evaluation crisis in a high-stakes domain: AI-based science. Across dozens of fields, from medicine to political science, Kapoor finds that flawed evaluation practices have led to overoptimistic claims about AI’s accuracy, affecting hundreds of published papers. To address these evaluation failures, he presents a consensus-based checklist that identifies common pitfalls and consolidates best practices for researchers adopting AI, as well as a benchmark to foster the development of AI agents that can verify scientific reproducibility. These evaluation failures extend beyond science: Kapoor examines how AI agent benchmarks miss many failure modes and presents systems to identify these errors. He also examines inference scaling, a recent technique for improving AI capabilities, and shows that claims of improvement fail to hold under realistic conditions. Finally, Kapoor discusses how better AI evaluation can inform policymaking, drawing on his work on evaluating the risks of open foundation models and his engagement with state and federal agencies. Why does the evaluation crisis persist? Kapoor argues that the AI community has poured enormous resources into building evaluations for models, but not into investigating how models affect the world. To address the crisis, he calls for a systematic science of AI evaluation that bridges the gap between benchmark performance and real-world impact.
Speaker Biography
Sayash Kapoor is a computer science PhD candidate and a Porter Ogden Jacobus Fellow at Princeton University, as well as a senior fellow at Mozilla. He is a co-author of AI Snake Oil, one of Nature’s ten best books of 2024. Kapoor’s newsletter is read by over 65,000 AI enthusiasts, researchers, policymakers, and journalists. His work has been published in leading scientific journals such as Science and Nature Human Behaviour, as well as at conferences such as the Conference on Neural Information Processing Systems and the International Conference on Machine Learning. Kapoor has written for mainstream outlets including The Wall Street Journal and Wired, and his work has been featured by The New York Times, The Atlantic, The Washington Post, Bloomberg News, and many other outlets. He has been recognized with various awards, including a Best Paper Award at the ACM Conference on Fairness, Accountability, and Transparency; an Impact Recognition Award at the ACM Conference on Computer-Supported Cooperative Work and Social Computing; and inclusion in TIME’s inaugural list of the 100 Most Influential People in AI.