Published:
Author: Jaimie Patterson
Category:
Three locks, two flat and unlocked, one upright and locked.
Photo Credit: FlyD / Unsplash

As new large language models (LLMs) are rapidly developed and deployed, existing methods for evaluating their safety and discovering potential vulnerabilities quickly become outdated.

To identify safety issues before they impact critical applications, Johns Hopkins researchers have developed a renewable and sustainable framework for evaluating LLMs that simplifies different types of attacks into high-quality, easily updatable safety tests—all while requiring minimal human effort to run.

Their work, “Jailbreak Distillation: Renewable Safety Benchmarking” was published in the Findings of the 2025 Conference on Empirical Methods in Natural Language Processing.

In LLM jailbreaking, seed queries are initial, often benign, prompts that may have the end goal of eliciting harmful behaviors from an LLM—such as asking ChatGPT how to build a bomb—but don’t succeed due to their obvious adversarial nature. Instead, they’re used to explore the safety guardrails of a particular LLM and inform an attack algorithm, which transforms and refines them into more targeted and complicated prompts that can successfully bypass the LLM’s guardrails and achieve the desired harmful behaviors.

To automate this process for safety testing, the researchers took existing adversarial algorithms proven to work well and ran them against the latest developmental LLMs to generate a diverse pool of these attack prompts.

“After constructing this pool, we used prompt selection algorithms to choose an effective subset of these generated attack prompts and develop an efficient safety benchmark,” explains Jingyu “Jack” Zhang, a PhD candidate in the Department of Computer Science and the first author of the study, which he conducted as part of an internship at Microsoft.

Zhang and his fellow researchers argue that a good LLM safety benchmark can elicit a wide range of harmful behaviors from many different models with high success and reliability, while also providing valuable metrics about the safety of each model tested. They report that their method, Jailbreak Distillation (JBDistill), achieves all these requirements when tested on additional, unseen evaluation models.

Using the same set of evaluation prompts for all LLMs helps ensure that JBDistill produces fair, reproducible comparisons. Previous methods, however, used different attack prompts for different models and had inconsistent compute budgets, which meant that even small changes in attack setups could lead to wide variability in success.

JBDistill’s consistency also makes it easy for the researchers to generate new benchmarks by adding new developmental LLMs and attacks as they appear, or even by automatically rerunning the pipeline with different randomizations—thus achieving “renewable” safety benchmarking with minimal human effort, according to the research team.

“While previous work mainly focused on generating more transferable attack prompts, we demonstrate that over-generating attacks prompts and then selecting a highly effective subset of them is a simple and effective method for enhancing attack transferability,” says Zhang, who is advised by co-authors Benjamin Van Durme, an associate professor of computer and cognitive science, and Daniel Khashabi, an assistant professor of computer science.

The researchers tested various LLMs on JBDistill’s dynamic benchmarks and compared the results with those from other commonly used, static benchmarks and traditional “red-team” attack tests, in which another party attempts to jailbreak an LLM on purpose. JBDistill’s benchmarks achieved up to 81.8% effectiveness and could generalize to 13 different evaluation models, including newer, larger, proprietary, specialized, and reasoning LLMs—significantly outperforming traditional testing methods, the researchers report.

“Plus, the more models and attacks we use, the stronger the resulting benchmark becomes, suggesting that our approach is highly scalable,” Zhang notes.

Although their benchmarking method currently is limited to English text, the researchers plan to expand its capabilities to include images, speech, and video, which will enhance overall LLM safety. And while their method is not a substitute for red-teaming tests, it offers benefits that complement traditional testing practices, they say.

“As LLMs are deployed on a global scale, they pose a significant risk if their safety isn’t thoroughly assessed and managed,” Zhang says. “Reliable safety benchmarking methods are crucial for simulating risks before deployment and identifying failure modes to prevent harm. Our framework provides an effective, sustainable, and adaptable solution for streamlining this kind of LLM evaluation.”

Learn more about the team’s project here.

Additional authors of this work include Ahmed Elgohary, Xiawei Wang, A.S.M. Iftekhar, Ahmed Magooda, and Kyle Jackson of Microsoft Responsible AI Research.