Abstract
Pre-trained language models have brought revolutionary progress to information seeking in the English-speaking world. While this advance is exciting, transferring it to non-English, and especially lower-resource, languages presents new challenges that require new resources and methodologies. In this talk, Xinyu “Crystina” Zhang will present her research on building effective information-seeking systems for non-English speakers. She will begin by introducing the benchmarks and datasets developed to support the evaluation and training of multilingual search systems. These resources have since been widely adopted within the community and have enabled the development of effective multilingual embedding models. The next part of her talk will share the best training practices identified in developing such models, including strategies for enhancing backbone models and surprising transfer effects across languages. Building on these foundations, Zhang’s work has expanded to understanding how language models process multilingual text and to facilitating knowledge transfer across languages. Her talk will conclude with a vision for the future of multilingual language model development, with the goal of adapting these models to unseen languages with minimal data and resource requirements, thereby bridging the gap for underrepresented linguistic communities.
Speaker Biography
Xinyu “Crystina” Zhang is a PhD candidate at the University of Waterloo, where she is advised by Professor Jimmy Lin. Zhang’s research focuses on enhancing search systems in multilingual scenarios, with work featured in top natural language processing and information retrieval conferences and journals, including Transactions of the Association for Computational Linguistics, ACM Transactions on Information Systems, the Annual Meeting of the Association for Computational Linguistics, and the International ACM SIGIR Conference on Research and Development in Information Retrieval. Zhang has hosted multilingual retrieval competitions at the 2022 ACM International Conference on Web Search and Data Mining Cup and the 2023 ACM Forum for Information Retrieval Evaluation, and she received Outstanding Paper Awards at the 2024 Conference on Empirical Methods in Natural Language Processing and a Best Paper Award nomination at SIGIR 2024. She has interned at Google DeepMind, Cohere, the Max Planck Institute for Informatics, and NAVER. Prior to graduate school, she received her bachelor’s degree in computer science from the Hong Kong University of Science and Technology in 2020.