The success of current and future AI models depends on foundational algorithm development and advances in computational learning theory—the kind of work that Raman Arora, an associate professor of computer science affiliated with the Mathematical Institute for Data Science, focuses on in his lab.
“We’re deeply engaged in the computational and algorithmic bases of machine learning,” explains Arora, who is also a member of the Center for Language and Speech Processing, the Institute for Data-Intensive Engineering and Science, and the Data Science and AI Institute. “All of our projects involve both designing new algorithms and proving formal guarantees about their performance.”
Broadly, his group asks: How much computation, memory, and data are needed to train an intelligent system under real-world constraints like privacy, robustness, or limited feedback?
“We aim to design algorithms that are not only practically useful, but also provably optimal in terms of their statistical and computational efficiency,” he explains.
Approaching core problems through the lens of computational learning theory, Arora and his students publish regularly in top computer science and machine learning venues, including the 42nd International Conference on Machine Learning (ICML), where they presented new work aimed at improving the safety and utility of modern AI systems. Learn more about their recent advances below:
Battling backdoor attacks
In work published at ICML, Arora and Yunjuan Wang, Engr ’25 (PhD), showed how bad actors can sneak hidden behavior into AI systems like large language models by subtly manipulating how models decide which words, or “tokens,” to focus on. This process, called the “attention mechanism,” is at the heart of how modern AI understands language.
“Backdoor attacks are particularly insidious: If an attacker adds just a few carefully crafted ‘poisoned triggers’—such as rare words or irrelevant phrases—during a model’s training period, then that model can appear to work normally in everyday situations, but behave very differently when specific hidden triggers are present,” explains Arora. “For example, the model might consistently respond in a harmful or incorrect way if a certain rare phrase is included in a prompt, while acting perfectly fine otherwise.”
This dual behavior makes such attacks difficult to detect and allows bad actors to exploit models selectively while preserving their apparent usefulness.
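The mechanism can be illustrated with a toy, hypothetical example (this is not the paper's construction, just a sketch of the general idea): a single softmax attention step in which a poisoned trigger token's key vector has been trained into alignment with the query, so the trigger captures nearly all of the attention mass whenever it appears.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension

def attention_weights(query, keys):
    """Softmax attention of one query over a sequence of key vectors."""
    scores = keys @ query / np.sqrt(len(query))
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

query = rng.normal(size=d)
query /= np.linalg.norm(query)          # unit-norm query for a clean comparison
content_keys = rng.normal(size=(5, d))  # ordinary content tokens
trigger_key = 20.0 * query              # poisoned token: key aligned with query

clean = attention_weights(query, content_keys)
poisoned = attention_weights(query, np.vstack([content_keys, trigger_key]))

# Without the trigger, attention is spread across the content tokens; with
# it, almost all attention mass collapses onto the attacker's token, so the
# output is driven by the hidden signal rather than the actual content.
print("clean weights:", clean.round(3))
print("trigger's share of attention:", poisoned[-1].round(3))
```

On clean inputs the weights stay spread across content tokens, which is why the model "appears to work normally" until the trigger shows up.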
While prior work has focused on empirically designing and evaluating backdoor attacks, Arora and Wang—a former Amazon Fellow at Johns Hopkins, now a research scientist at Meta—wanted to understand exactly when and why these kinds of attacks succeed. In their analysis, they characterized how adversaries manipulate token selection to alter outputs and identified the theoretical conditions under which these attacks succeed.
“We didn’t just demonstrate this with examples—we actually proved when and why this kind of attack works,” Arora says. “Using mathematical analysis, we showed how poisoned examples can quietly influence an AI’s attention mechanism, leading it to make decisions based on the attacker’s hidden signal rather than the actual content it’s given.”
The pair’s work proves that even well-performing models can be tricked into behaving maliciously and offers the first rigorous explanation of how this happens through the attention mechanism, Arora says.
Action vs. reaction
Just as it’s critical to account for malicious training-time attacks on AI models, it’s equally important to remember that in the real world, these systems don’t operate in an isolated bubble. When they interact with other agents, such as people, other AIs, or strategic opponents, their decisions don’t just affect interaction outcomes; they also influence how others will respond in the future.
According to Arora, the theoretical and algorithmic foundations of decision-making in this kind of shared environment are underdeveloped—especially in scenarios where an AI learner faces an adaptive opponent that adjusts its strategies in response to the learner’s behavior.
“So naturally, we asked: Can we design AI agents that account for these reactions when planning their own actions?” Arora says. “And more specifically, how can they learn to make decisions that perform well not just in hindsight, but under the feedback loops their actions create?”
Arora and former postdoctoral fellow Thanh Nguyen-Tang—now an assistant professor at the New Jersey Institute of Technology—propose a new learning framework that teaches AI agents to reason about “what would have happened” if they had behaved differently, not just in isolation, but also considering how others would have responded.
This approach, called policy regret minimization, uses tools from game theory and machine learning to support counterfactual, “what-if” reasoning, guiding the AI agent toward smarter, more stable behavior. In this vein, Arora and Nguyen-Tang developed the first algorithms that reason effectively even in complex environments, such as games or simulations with many states and actions, by using modern deep learning techniques.
A key technical insight in their algorithm design is a batching mechanism that enables an AI learner to commit to its policies over multiple rounds of decision-making, which the researchers say is critical for learning in the presence of reactive adversaries.
“We show that it’s possible to combine strategic reasoning with scalable learning,” Arora says. “Previous methods for reasoning about others’ reactions either made unrealistic assumptions or couldn’t scale to large problems. But by assuming opponents react smoothly to changes in behavior, we can design AI agents that learn to act thoughtfully, anticipating how their choices might ripple through future interactions.”
Now, the researchers are exploring how their framework can be extended to cooperative settings—where multiple AI agents need to coordinate and build trust—as well as more adversarial ones, where robustness and safety are critical. They are also testing their theories in realistic simulations and developing tools to help human developers understand how their AI agents might behave under strategic pressure.
“It’s important to explore these kinds of ideas because they push us closer to building AI systems that play well with others,” Arora explains. “Whether it’s a chatbot negotiating with a customer, a self-driving car interacting with human drivers, or an algorithm allocating resources in a shared network, we need AI to anticipate how others will adapt to its behavior and to act responsibly in light of that. Our work builds the mathematical and algorithmic foundation for that kind of adaptive, forward-thinking AI.”
These advances are just a sample of what computational learning theory has to offer real-world applications.
“Ultimately, we hope our research can contribute to the broader goal of AI alignment, making sure that intelligent systems are secure and behave in ways that are compatible with human values and intentions,” says Arora.
This research was supported in part by DARPA GARD Award HR00112020004 and NSF CAREER Award IIS-1943251.