Author: Jaimie Patterson
Image: A blank, blue humanoid head overlaid with code and connective lines and dots.

Research in natural language processing is seeing greater success and popularity than ever before as NLP applications become more integrated into people’s day-to-day lives, whether in the form of just-for-fun ChatGPT conversations or through automated content analyses of webpages and news articles.

But advancements in the field aren’t necessarily taking place in the areas where we need them the most. In high-stakes scenarios that involve critical information, like medicine or law, the gap between what current NLP technology can do and what people actually need from the technology—termed the “socio-technical gap”—can cause harm to users by reinforcing biases, introducing misinformation, or outputting misleading data. For example, in the social sciences, the numerical outputs of a large language model may oversimplify the “answer” to a given research question, leading investigators to accept the “easy” answer over one that is more grounded in context and explained through interpretation.

Echoing Hopkins computer scientists’ concerns surrounding the use of artificial intelligence, new commentaries by Johns Hopkins researchers highlight potential solutions to bridge the divide between real people and the NLP technologies built to serve them.

Human-Centered Evaluation

In a recent position paper, Ziang Xiao, an assistant professor of computer science, and Q. Vera Liao, a principal researcher at Microsoft Research Montréal, make the case for language model evaluation methods that take into account the divide between what is possible to achieve with NLP technology and what people actually need from its use. Their paper, “Rethinking Model Evaluation as Narrowing the Socio-Technical Gap,” received an honorable mention at the 2023 International Conference on Machine Learning’s Artificial Intelligence & Human-Computer Interaction Workshop.

“If the socio-technical gap is not understood and attended to, we may produce dominant and widespread technologies that are not useful or are even harmful to stakeholders of downstream applications,” warns Xiao.

Xiao and Liao propose that NLP researchers evaluate large language models and their new applications not simply by their “accuracy” scores on existing datasets, but rather based on how well they perform in real use cases; that way, the success of a particular model or method directly relates to how it is going to be used by the people who may eventually come to rely on its output. To this end, the team encourages the development of new evaluation methods based on human-centric criteria, the pursuit of better-defined protocols to obtain human ratings, and the validation of language models from the “outside in”—that is, from user needs back to the model’s functions, rather than the other way around.

These suggestions don’t come without downsides, such as additional costs associated with expert evaluations and issues of realism in obtaining useful human input, but the team proposes solutions for these potential pitfalls, as well. They urge researchers to define and justify the use of “less costly” evaluation methods by taking into account different types of costs (like computational cost vs. environmental impact), weighing them against the potential benefits of more human-centered approaches, and considering factors like the current stage of a technology’s development and the extent to which it will have an effect when released.

“For example, while it is acceptable for an initial academic work on a new algorithm to adopt low-cost evaluations, it would not be responsible for an organization that is making a model or a system widely available to only engage in low-cost evaluation,” Xiao explains.

In other words, a grad student may get away with using analytics in place of real human feedback for a project in its initial stage, but OpenAI has a responsibility to use human-centric evaluation methods wherever possible.

The team urges the NLP research community to embrace methods used by the social sciences and the fields of human-computer interaction and explainable AI to develop more effective evaluation practices that address real-world requirements.

“By doing so, we aim to build LLMs that are more effective, responsible, and fair in real-life contexts,” says Xiao.

The Importance of Context

In an interdisciplinary effort between the Whiting School of Engineering and the Krieger School of Arts and Sciences, Arya McCarthy, a doctoral candidate in the Center for Language and Speech Processing, and Giovanna Maria Dora Dore, an associate teaching professor and the associate director of the East Asian Studies Program, give complementary recommendations to the NLP community. Their position paper, which earned an honorable mention at ACL 2023, emphasizes the importance of posing research questions that lead to integrative findings—those that can be applied to and within a broader context—rather than just descriptive ones, which have no intentional connection to a larger theory.

An example of a descriptive finding is that an apple falls, or that it falls faster when pushed than dropped, the authors explain. Newton’s theoretical analysis of the same phenomenon is that a fundamental force acts upon the apple, and that this same force governs the motion of the heavens. The theoretical analysis links, or integrates, this finding to a broader body of knowledge and context.

Inspired by a series of roundtables they conducted with colleagues in the computational and social sciences, McCarthy and Dore analyzed 60 papers submitted to various venues of ACL, the premier international society for researchers in natural language processing and computational linguistics. They found that the majority of computational text analysis research merely catalogued observations without offering explanations for its findings that were sufficiently grounded in theory.

While this kind of research is still useful and has some predictive power, it’s not enough to know the ‘what’ without getting to the ‘why,’ the authors say; for example, characterizing the language of extreme groups on social media doesn’t explain why polarization happens in the first place.

Computing can still be an instrumental approach for modeling and understanding social complexity, write McCarthy and Dore. But they stress that this does not mean that other approaches, such as the historical, ethnographic, or mathematical, become irrelevant. In fact, the authors claim the contrary: that computational methods rely on these approaches to add value in terms of improving explanation and understanding.

As part of a movement to return to theory-grounded research, McCarthy and Dore urge computer scientists—especially within the field of natural language processing—to engage with subject matter experts in whichever area they plan to conduct research to better understand the context around and implications of their work.

The majority of NLP research inherently involves a human element, whether it’s analyzing text written by academics or developing the next chatbot for user interaction. For this kind of research to achieve maximum impact and usefulness, Hopkins researchers agree that it must be conducted with consideration for human needs and contexts from start to finish.