Author: Jaimie Patterson

We’re all familiar with saying something like, “Hey Siri, add my 10 a.m. dentist appointment tomorrow to my schedule.” But what happens on the back end of your smartphone to translate your spoken request into the code that adds the correct appointment to your calendar app?

This process is called semantic parsing—taking natural language and turning it into executable code. And while it may seem easy enough to get right, the consequences of an error can be devastating in safety-critical domains like health care, transportation, and robotics. (Remember how well Tony Stark’s “protect the planet” command went over in Avengers: Age of Ultron?)

To ensure the reliability of semantic parsing systems, Johns Hopkins computer scientists assessed current systems and, based on their findings, devised a system that allows humans to remain in the loop while expending little effort, resulting in improved parsing performance.

In a paper published recently in Transactions of the Association for Computational Linguistics, Elias Stengel-Eskin, Engr ’23 (PhD), and Benjamin Van Durme, an associate professor of computer science affiliated with the Center for Language and Speech Processing and the Human Language Technology Center of Excellence, assessed the calibration of current semantic parsing models: how well a model’s certainty that it’s correct correlates with how often it actually is correct.

“You can think of this as analogous to the probability you see on a weather forecast,” explains Stengel-Eskin. “If a forecast system is well-calibrated, then when it predicts a 20% chance of rain, it should on average rain 20% of the time. A poorly calibrated forecast might predict rain with a 90% chance when it really only rains 20% of the time.”

The team found that many models are well calibrated at high confidence (they’re right and they know they’re right), so users can generally trust their predictions. But they also discovered a trade-off between calibration and performance: Better-performing models tend to have worse calibration. (In the weather forecast analogy, this means that although they predicted the right type of weather, rainy or sunny, their chance percentage was further off.)
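To make the forecast analogy concrete, here is a minimal Python sketch of expected calibration error, one common way to measure the gap between stated confidence and observed accuracy. It is an illustration only and not necessarily the metric used in the paper; the function name, the bin count, and the toy numbers are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average gap between confidence and accuracy per bin.

    Illustrative sketch: `confidences` are model probabilities in [0, 1],
    `correct` are booleans saying whether each prediction was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Place each prediction in an equal-width confidence bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data (assumed): it rains on 2 of 10 days in both cases.
rained = [True] * 2 + [False] * 8
good_forecast = expected_calibration_error([0.2] * 10, rained)  # says 20%
poor_forecast = expected_calibration_error([0.9] * 10, rained)  # says 90%
```

In this toy case the forecast that says 20% scores close to zero, while the one that says 90% scores about 0.7, mirroring the poorly calibrated forecast in the quote above.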

But what should a semantic parser do with a low-confidence program? If these commands are simply thrown away, users will get pretty annoyed: Imagine that each time you asked Siri to do something, it did nothing rather than risk making a mistake.

“Safety considerations need to be balanced with usability of the system: an unplugged agent would be very safe but unusable,” the team writes.

Another option is to try to recover the user’s original meaning, which the team explored in new work presented at the 2023 Conference on Empirical Methods in Natural Language Processing in December.

Attempting to balance concerns of usability and safety, the computer scientists found that by setting a confidence threshold—or only executing actions when the model is relatively sure they’re correct—a model can achieve better safety. Then they explored reviving some of the model’s usability through simple human interactions: yes-no questions.

In a user study, the researchers asked participants to confirm whether a model’s semantic parse matched the original meaning of a user request. If the participant answered “yes,” the model would execute the program; if not, the team’s DidYouMean confidence system would delete the erroneous program and start again. The system obtained a 36% improvement in usability while maintaining a 58% reduction in the number of incorrect programs executed. And for the user, providing simple confirmation takes much less effort than rephrasing their original query.
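The threshold-plus-confirmation idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the researchers’ DidYouMean implementation: the function name, the `ask_user` callback, and the 0.6 threshold are all hypothetical, and the sketch assumes that only low-confidence parses are sent to the user.

```python
from typing import Callable, Optional


def run_with_confirmation(
    program: str,
    confidence: float,
    threshold: float,
    ask_user: Callable[[str], bool],
) -> Optional[str]:
    """Return the program to execute, or None if it should be discarded.

    Hypothetical helper: `ask_user` poses a yes-no question and returns
    True for "yes".
    """
    if confidence >= threshold:
        return program  # confident enough: execute without asking
    if ask_user(f"Did you mean: {program}?"):
        return program  # user confirmed the low-confidence parse
    return None  # user said no: drop the program so the caller can start again


# Example with an assumed threshold of 0.6 and a user who always says yes.
result = run_with_confirmation(
    "add_event('dentist', '10:00')", 0.4, 0.6, lambda question: True
)
```

Raising the threshold makes the system ask more often, which favors safety; lowering it favors usability.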

“Right now, semantic parsing systems are being used all over the place in user-facing applications. With the advent of large language models, this is only becoming more common, with models having more and more autonomy in taking actions,” Stengel-Eskin says. “When these actions are irreversible and have real-world consequences, we should be particularly concerned about safety—but at the same time, a system that’s too conservative will not be usable, so we need ways of balancing the two concerns. I think that the way to do this is to bring humans into the loop with low-cost interactions.”

The researchers’ next steps are to determine the correct safety-usability balance for specific tasks. For example, Alexa misunderstanding your request to know the temperature outside has only minor consequences. But if you ask a robotic kitchen assistant to hand you your last egg and it cracks it instead, you’re going to be pretty annoyed when you have to run to the store because it incorrectly interpreted your request without asking for confirmation.

“Based on the action the agent is going to take, the system should adaptively decide what the correct balance should be and act accordingly,” Stengel-Eskin says.