Towards Understanding and Proactively Dealing with Failures in Modern Systems

Ryan Huang, University of California San Diego
Host: Mike Dinitz

Many of the services we use every day now run in data centers or on mobile devices. However, building systems on these modern platforms to provide reliable services is difficult. This is evidenced by the fact that, despite the large amount of work put into system quality assurance, modern systems continue to experience million-dollar outages and frustrating anomalies like battery drain.

In this talk, I will describe my research efforts to better understand and proactively tackle the reliability challenges in modern systems. First, I will discuss work that looks into failures in cloud services. Instead of focusing on conventional root-cause analysis, this work takes a different angle: it examines the fault-tolerance mechanisms in cloud systems and analyzes why they did not prevent the service failures. I will summarize several challenges (and opportunities) for reducing these failures in the future. One such challenge is system configuration: existing fault-tolerance techniques often cannot tolerate (or, worse, are nullified by) configuration errors, and misconfiguration has become a major source of cloud outages. I will then present work that enables cloud practitioners to proactively prevent configuration errors by using a systematic validation framework. The framework consists of a declarative language for developers and operators to express configuration specifications, a service that continuously checks whether configuration obeys its specification, and a tool that automatically infers basic specifications. I will also touch on the challenge of app misbehavior in the mobile ecosystem and on preventing it proactively at runtime by making the mobile OS defensive.

Speaker Biography

Peng (Ryan) Huang is a Ph.D. candidate at UC San Diego advised by Professor Yuanyuan Zhou. His research interests intersect systems, software engineering, and programming languages. He is particularly interested in understanding emerging problems in real-world systems and reflecting that understanding in new techniques to improve system dependability. His work has been applied in industry, including at Microsoft and Teradata, and deployed to many real users. He is currently a part-time contractor with Facebook doing research on configuration management. Peng received his MS from UC San Diego in 2013, and his BS in computer science and BA in economics from Peking University in 2010.