Scalable, Accurate, Robust Binary Analysis with Transfer Learning Trace Modeling

Kexin Pei, Columbia University
Host: Johns Hopkins Department of Computer Science

Binary program analysis is a fundamental building block for a broad spectrum of security tasks, including vulnerability detection, reverse engineering, malware analysis, patching, security retrofitting, and forensics. Essentially, binary analysis encompasses a diverse set of tasks that aim to understand how a binary program runs, that is, its operational semantics. Unfortunately, existing approaches often tackle each analysis task independently and rely heavily on ad-hoc heuristics as shortcuts. These heuristics are often spurious and brittle because they do not capture the real program semantics (behavior). While ML-based approaches have shown early promise, they too tend to learn spurious features and overfit to specific tasks without understanding the underlying program semantics.

In this talk, I will describe two of our recent projects that learn program operational semantics for various binary analysis tasks. Our key observation is that by designing pretraining tasks that learn how binary programs execute, we can drastically boost the performance of binary analysis tasks. Our pretraining tasks are fully self-supervised: they do not need expensive labeling effort. Our pretrained models can therefore learn from diverse binaries and generalize across different architectures, operating systems, compilers, and optimizations/obfuscations. Extensive experiments show that our approach substantially improves performance on tasks such as matching semantically similar binary functions and binary type inference.

Speaker Biography

Kexin Pei is a fifth-year Ph.D. student in the Department of Computer Science at Columbia University. He is co-advised by Suman Jana and Junfeng Yang, and works closely with Baishakhi Ray. He is broadly interested in Security, Systems, and Machine Learning, with a current focus on developing ML architectures that understand program semantics and using them for program analysis and security.