When: Mar 03 2025 @ 12:00 PM
Where: B-17 Hackerman Hall
Categories:
Computer Science & CLSP Seminar Series

Abstract

The paradigm of training large-scale foundation models has driven significant advances in multimodal AI. However, pursuing further performance gains solely through model scaling is becoming impractical due to rising computational costs and resource limitations. Moreover, the reasoning and generation processes of these models remain largely uninterpretable and uncontrollable, often leading to unfaithful outputs. In this talk, Jaemin Cho will discuss his efforts to make multimodal generative models more controllable and trustworthy without increasing their size. First, he will introduce faithful reasoning frameworks, in which the multimodal generation process mirrors how humans reason about and create content such as images and videos. Concretely, in these frameworks, models create a detailed plan that decomposes a complex generation task into simpler steps and retrieve relevant information from multimodal knowledge bases before generating the final outputs. Next, Cho will describe fine-grained evaluation methods that assess model capabilities across multiple dimensions, such as object counting and spatial relation understanding, thereby providing a detailed understanding of the models’ strengths and weaknesses. These evaluations, in turn, enable targeted model improvements that address identified weaknesses through test-time guidance or updated training environments. Together, these directions offer a pathway toward more intelligent, reliable, and efficient multimodal AI models.

Speaker Biography

Jaemin Cho is a PhD candidate in the Department of Computer Science at the University of North Carolina at Chapel Hill. His research focuses on improving reasoning capabilities in multimodal generation. His work has been featured at top conferences in computer vision (e.g., the Conference on Computer Vision and Pattern Recognition, the International Conference on Computer Vision, the European Conference on Computer Vision), natural language processing (e.g., Empirical Methods in Natural Language Processing, the Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, the Conference on Language Modeling), and machine learning (e.g., the Conference on Neural Information Processing Systems, the International Conference on Machine Learning, the International Conference on Learning Representations, the AAAI Conference on Artificial Intelligence). His research has been recognized with multiple oral/spotlight presentations, a Best Reviewer Award at NeurIPS, a Bloomberg Data Science PhD Fellowship, and media coverage in MIT Technology Review, IEEE Spectrum, and WIRED. He has also co-organized the T4V: Transformers for Vision workshop at CVPR 2023 and 2024.

Zoom link >>