When: Mar 25 2024 @ 12:00 PM
Where: Hackerman B-17
Categories: Computer Science Seminar Series

The seminar will begin at 12:00 p.m. Refreshments will be available starting at 1:15 p.m.

Abstract

High-quality datasets are crucial for improving the capabilities and training efficiency of large language models. However, current datasets are typically prepared in an ad hoc, heuristic way. In this talk, Sang Michael Xie will present principled approaches to improving and understanding language models centered on the pre-training data distribution. First, he will describe how to improve the efficiency of training multipurpose language models by optimizing the mixture of data sources with robust optimization. Second, he will discuss an efficient importance resampling method for selecting relevant data from trillion-token-scale web datasets for training a specialized model. Finally, he will introduce a first theoretical analysis of in-context learning, the ability of language models to learn from examples given in a textual prompt, which traces this capability back to modeling coherence structure in the pre-training data.
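
To make the importance-resampling idea mentioned in the abstract concrete, the sketch below shows the general recipe in Python: fit simple models of the target and raw data, weight each raw example by the likelihood ratio, and resample. This is an illustrative sketch only, not the speaker's implementation; the hashed unigram features, smoothing, function names, and bucket count are all assumptions made for the example.

```python
import hashlib
from collections import Counter

import numpy as np

# Sketch of importance resampling for data selection (illustrative only):
# weight each raw example by p_target(x) / p_raw(x) under simple hashed
# bag-of-words models, then resample in proportion to those weights.

N_BUCKETS = 10_000  # hashed feature dimension (assumed for illustration)

def hashed_counts(text: str) -> Counter:
    """Map whitespace tokens to hash buckets and count them."""
    counts = Counter()
    for tok in text.lower().split():
        bucket = int(hashlib.md5(tok.encode()).hexdigest(), 16) % N_BUCKETS
        counts[bucket] += 1
    return counts

def fit_unigram(texts) -> np.ndarray:
    """Estimate a smoothed unigram distribution over hash buckets."""
    totals = np.ones(N_BUCKETS)  # add-one smoothing
    for text in texts:
        for bucket, n in hashed_counts(text).items():
            totals[bucket] += n
    return totals / totals.sum()

def log_prob(text: str, dist: np.ndarray) -> float:
    """Log-likelihood of a text under a bucket unigram model."""
    return sum(n * np.log(dist[b]) for b, n in hashed_counts(text).items())

def select(raw_texts, target_texts, k, seed=0):
    """Resample k raw examples with probability proportional to importance weights."""
    p_target = fit_unigram(target_texts)
    p_raw = fit_unigram(raw_texts)
    log_w = np.array([log_prob(t, p_target) - log_prob(t, p_raw) for t in raw_texts])
    weights = np.exp(log_w - log_w.max())   # stabilize before normalizing
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(raw_texts), size=k, replace=False, p=probs)
    return [raw_texts[i] for i in idx]
```

In this toy setup, examples that look more like the target corpus than the raw web data receive higher weights and are more likely to be kept for training the specialized model.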

Speaker Biography

Sang Michael Xie is a computer science PhD student at Stanford University advised by Percy Liang and Tengyu Ma. His research focuses on data-centric machine learning for language models, understanding pre-training and adaptation, and pre-training and self-training methods for robust machine learning. Xie was awarded an NDSEG Fellowship and was previously a student researcher at Google Brain. His work has been recognized as one of Scientific American's World-Changing Ideas, published in flagship venues such as Science, and covered by media outlets including The New York Times, The Washington Post, Reuters, BBC News, IEEE Spectrum, and The Verge.
