Author: Jaimie Patterson

Language models power tools like ChatGPT and serve as the foundation for machine translation and speech recognition systems. To date, most have been trained on—and generate—English text, but computer scientists are increasingly developing multilingual language models, which are trained in one language and then apply what they learn to another. Called cross-lingual transfer, this capability can be invaluable when developing applications for an international audience.

Cross-lingual transfer is typically achieved through “few-shot” machine learning, in which models are trained on a small amount of human-annotated data from multiple languages, enabling accurate predictions and pattern recognition across additional languages. However, this approach relies on human annotation that may be scarce for underrepresented languages, making cross-lingual transfer difficult to achieve in those languages and potentially compounding digital inequity.

In contrast, “zero-shot” learning enables a multilingual language model to leverage annotations from one benchmark language, such as English, and apply what it learns to an unannotated (and underrepresented) language, such as Haitian Creole. While zero-shot learning performs well on tasks such as part-of-speech tagging and sentence classification, it struggles with generative tasks like question answering or storytelling in a different language, occasionally generating text in the wrong language entirely.

Why does zero-shot cross-lingual transfer succeed at some tasks but struggle with others?

A team of Johns Hopkins researchers says they have not only found the answer to this question, but have also discovered a solution that will make zero-shot text generation more accurate and facilitate the development of AI applications in new languages.

Kenton Murray, a research scientist at the Human Language Technology Center of Excellence and a member of the Whiting School’s Center for Language and Speech Processing, worked with his CLSP colleague, Tianjian Li, a master’s student in the Department of Computer Science, to analyze the cosine similarity of parallel sentences in different languages—or how similar two sentences are in terms of meaning and content.
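The similarity measurement the team used can be illustrated with a minimal sketch. The vectors below are toy stand-ins for sentence embeddings—a real analysis would encode the parallel sentences with a multilingual model’s encoder—but the cosine-similarity computation itself is standard:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors:
    1.0 means identical direction, 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the embeddings of a parallel sentence pair;
# in practice these would come from a multilingual encoder.
src = np.array([0.9, 0.1, 0.3])   # e.g. an English sentence
tgt = np.array([0.8, 0.2, 0.35])  # e.g. its translation

score = cosine_similarity(src, tgt)
print(f"{score:.3f}")
```

A score near 1.0 indicates that the model represents the two sentences almost identically—which, per the team’s finding, is exactly when zero-shot generation becomes harder.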

“We compared sentences in a model’s source language to correctly translated sentences in its target language and found something interesting,” says Murray. “The more similar the two sentences are, the more challenging it is for a zero-shot model to perform any sort of cross-lingual generation.”

“Two languages’ similarity with regard to grammar and word order also increases the difficulty in achieving cross-lingual generation,” Li adds. “But the converse is true, too: A model is more likely to succeed when asked to generate text in a completely different language—think of the difference between English and, say, Japanese.”

Based on their findings, the researchers propose a solution to improve multilingual language models’ performance on generation tasks: Instead of training a model on one fully annotated language, use two. The addition of this second source language regularizes the similarity between the source and target languages, allowing for better text generation, the team says.
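The data-mixing part of this recipe can be sketched in a few lines. The example datasets and pair format below are hypothetical placeholders; the actual setup would fine-tune a multilingual model on annotated (input, output) pairs drawn from two real source-language corpora:

```python
import random

# Hypothetical annotated examples: (input, target) pairs in each
# source language. Real training data would come from actual corpora.
english_examples = [("question 1 (en)", "answer 1 (en)"),
                    ("question 2 (en)", "answer 2 (en)")]
japanese_examples = [("question 1 (ja)", "answer 1 (ja)"),
                     ("question 2 (ja)", "answer 2 (ja)")]

def mix_sources(first, second, seed=0):
    """Interleave annotated data from two source languages so the
    model sees both during fine-tuning, rather than only one."""
    mixed = list(first) + list(second)
    random.Random(seed).shuffle(mixed)
    return mixed

training_set = mix_sources(english_examples, japanese_examples)
print(len(training_set))
```

Per the researchers’ observation below, the sketch deliberately pairs two dissimilar languages; swapping in a second source language close to the first (say, Portuguese alongside Spanish) would undercut the benefit.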

Additionally, when the second source language is vastly different from the first, such as English and Japanese with their completely distinct scripts and grammars, the model is more capable of generating text in a third target language.

“Using very similar languages as source languages, like Spanish and Portuguese, wouldn’t be good,” Li says. “The model would only learn to map its pre-training data to Iberian Romance languages and then wouldn’t be able to generate in Russian, for example.”

The researchers say that this simple solution of adding another annotated source language to a model’s training data lays the foundation for improved zero-shot cross-lingual transfer in many areas. This means that target languages don’t necessarily need to be annotated, allowing low-resource languages like Haitian Creole and other underrepresented dialects to benefit from current computational linguistics research and applications.

Li presented the team’s findings at the 61st Annual Meeting of the Association for Computational Linguistics (ACL) in Toronto, Canada, last week. The ACL is the top international scientific and professional society for researchers in natural language processing and computation.