Headshot of Yinzhi Cao.
Yinzhi Cao

Recently, a hacked version of ChatGPT called “Godmode” was released. The rogue large language model, which provided users with instructions on cooking napalm using ingredients found at home and hot-wiring a car, only lasted a few hours before being shut down by OpenAI.

Yinzhi Cao, an assistant professor of computer science at and the technical director of the Johns Hopkins University Information Security Institute, explains what happens when a bot is “jailbroken” and how LLMs can be built to be less susceptible to such hacks.

What does “jailbroken” mean in this context? How does someone hack a chatbot?

It means that the chatbot will answer questions that it is not supposed to be able to answer. For example, the chatbot may come up with a plan to destroy human beings or tell the user how to perform dangerous tasks.

Usually, chatbots are hacked via carefully crafted prompts. Hackers may replace letters: For example, replacing “e” with “3” can bypass built-in safety filters. This Futurism article explains how hackers can get around some LLMs’ built-in safety guardrails.

Do you expect that attacks like this will happen more often, considering that many more people are using LLMs?

Yes—it’s like a shield versus a sword. Whenever we make our shield stronger, people try to produce a sharper sword to penetrate our shield. This is the nature of cybersecurity. I think we also need to educate LLM users so that they are aware of such risks and can identify when inaccurate information is generated.

How can companies that create LLMs prevent hacks, and what should users be aware of when receiving information from LLMs?

Companies should adopt strong safety filters to protect LLMs from being hacked or jailbroken. Another approach is to have the companies that make LLMs subject them to a diverse set of hypothetical scenarios, prompts, and situations to explore different topics and risks; in other words, have the chatbot “role-play” a range of scenarios and prompts to ensure that, when confronted with a real user, it does not generate untruthful content or harmful results.

I am a co-author of a paper in which we introduce TrustLLM, a comprehensive study of LLMs’ trustworthiness. Among our findings are that proprietary models like GPT-3 and Claude tend to be a little more trustworthy than open-source models, though some of those came close in terms of trustworthiness. We also propose principles for ensuring trustworthiness in LLMs across multiple dimensions, including truthfulness, safety, fairness, robustness, privacy, and machine ethics. Our goal was to establish benchmarks for measuring and understanding the benefits and risks when trying to balance an LLM’s trustworthiness with its utility.

This article originally appeared on the ISI website >>