SUBJECT: &NAME Research External Talk: &NAME &NAME: Automated Application-level Checkpointing of &NAME Programs

&NAME RESEARCH LECTURE
This is a PUBLIC lecture

SPEAKER: &NAME &NAME
INSTITUTION: &NAME University
HOST: &NAME &NAME
DATE: Monday, October &NUM, &NUM
TIME: &NUM:&NUM - &NUM:&NUM
PLACE: &NAME Theatre, &NAME Research Ltd, &NUM &CHAR &CHAR &NAME Avenue (Off &NAME Road), &NAME

TITLE: Automated Application-level Checkpointing of &NAME Programs

Recent trends in supercomputing are leading us towards &NUM new types of supercomputing platforms: large arrays of commodity parts and grid computing. While both platforms promise to be more cost-effective than their monolithic predecessors, they suffer from increased failure rates (as high as one failed processor a day) due to their size and the fact that they are not manufactured by a single entity. At the same time, many computational tasks, such as protein folding and materials science simulations, are becoming larger, taking weeks, months or even a year to compute. In short, program runtimes are greatly exceeding the mean time to failure of their platforms.

Our work deals with automatic methods for enabling applications to survive hardware faults. We have developed a way of recording program state at the application level by using a preprocessor to get the program to save its own state. Furthermore, we have developed and implemented a distributed coordination protocol that arranges checkpoints from individual processors into a consistent global recovery line. When any processor fails, the others can roll back to their most recent checkpoints and resume the computation, with very little work lost.
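
For readers unfamiliar with the idea, the following is a minimal single-process sketch of application-level checkpointing, in which the program itself periodically saves the state it needs and resumes from the last saved state after a fault. It is an illustration only, not the speaker's system: it does not cover the preprocessor or the distributed coordination protocol described above, and every name in it (save_checkpoint, load_checkpoint, run, SimulatedFault, the ckpt.pkl file) is hypothetical.

```python
import os
import pickle
import tempfile


class SimulatedFault(Exception):
    """Stands in for a hardware failure (hypothetical, for illustration)."""


def save_checkpoint(path, state):
    # Write to a temporary file first, then rename it into place, so a
    # crash during the write never leaves a half-written checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    # Return the saved state, or a fresh initial state if none exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, every, fail_at=None):
    # The application decides what its state is: here, a step counter
    # and a running total. It resumes from whatever was last saved.
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise SimulatedFault(f"fault at step {fail_at}")
        state["total"] += state["step"] ** 2  # one unit of "work"
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pkl")
        try:
            # First attempt fails partway through the computation.
            run(path, n_steps=100, every=10, fail_at=57)
        except SimulatedFault:
            pass
        resumed_from = load_checkpoint(path)["step"]
        # Second attempt picks up from the most recent checkpoint.
        total = run(path, n_steps=100, every=10)
    return resumed_from, total
```

In this sketch the fault at step 57 costs only the work done since the checkpoint at step 50, and the resumed run produces the same total as an uninterrupted one. The checkpoint interval (the every parameter) trades checkpoint overhead against the amount of work lost per fault.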