Spring 2010

February 2, 2010

As increasing amounts of valuable information are produced and persist digitally, the ability to determine the origin of data becomes important. In science, medicine, commerce, and government, data provenance tracking is essential for rights protection, regulatory compliance, management of intelligence and medical data, and authentication of information as it flows through workplace tasks. While significant research has been conducted in this area, the associated security and privacy issues have not been explored, leaving provenance information vulnerable to illicit alteration as it passes through untrusted environments.

In this talk, we show how to provide strong integrity and confidentiality assurances for data provenance information in an untrusted distributed environment. We describe our provenance-aware system prototype that implements provenance tracking of data writes at the application layer, which makes it extremely easy to deploy. We present empirical results showing that, for typical real-life workloads, the run-time overhead of our approach to recording provenance with confidentiality and integrity guarantees ranges from 1% to 13%. For more details, please refer to http://ragibhasan.com/research/provenance.html

Speaker Biography: Ragib Hasan is an NSF/CRA Computing Innovation Fellow and Assistant Research Scientist at the Department of Computer Science, Johns Hopkins University. He is a member of the Hopkins Storage Systems Lab, and works with Professor Randal Burns. He received his PhD and MS in Computer Science from the University of Illinois at Urbana-Champaign in October 2009 and December 2005, respectively, under the supervision of Professor Marianne Winslett of UIUC. Before that, he received a B.Sc. in Computer Science and Engineering and graduated summa cum laude from Bangladesh University of Engineering and Technology in 2003. Ragib Hasan's research interests fall in the general area of data security, with emphasis on trustworthy data history and provenance for cloud computing, file systems, and databases. He is also interested in authorization and access control models for distributed systems and building automation systems. Hasan is the recipient of a 2009 NSF Computing Innovation Fellowship and the 2003 Chancellor Award and Gold Medal from Bangladesh University of Engineering and Technology. He is also an active Wikipedia editor, and an administrator on both the English and Bengali language Wikipedias.

Slides >>

February 16, 2010

Since the late 1990s, Wireless Sensor Networks (WSNs) have been an object of much attention, interest, and hype in several research communities, including embedded systems, networking, and security. Envisaged applications involve WSNs where nodes (sensors) collectively monitor/measure certain physical phenomena. Sensed data is then propagated to a centralized collection point, referred to as a "sink," that usually also performs network management and control functions. The sink's constant presence and availability form a key part of WSN operation.

This talk will show that some emerging WSN scenarios preclude the sink's constant presence. Anticipated application domains include military, law enforcement, and critical infrastructure protection. In such settings, nodes must accumulate sensed data until it can be off-loaded to an itinerant sink. Unattended Wireless Sensor Networks (UWSNs) pose a number of new research issues. In particular, security challenges arise if the deployment environment is hostile and sensors are subject to compromise. Notably, the UWSN setting motivates a new stealthy mobile adversary model. The absence of an on-line trusted sink, coupled with the power of this new adversary, makes prior security techniques ineffective in UWSN settings. This talk will overview a number of potential threats posed by the UWSN adversary and sketch out some solutions that involve collaborative self-healing techniques.

Time permitting, this talk will also touch upon a few other research directions in security and applied cryptography.

Speaker Biography: Gene Tsudik is a Professor in the Department of Computer Science at the University of California, Irvine (UCI). He obtained his PhD in Computer Science from USC in 1991 for research on firewalls and Internet access control. Before coming to UCI in 2000, he was a Project Leader at IBM Zurich Research Laboratory (1991-1996) and the USC Information Sciences Institute (1996-2000). Over the years, his research interests have included routing, firewalls, authentication, mobile networks, secure e-commerce, anonymity and privacy, group communication, digital signatures, key management, mobile ad hoc networks, as well as database privacy and secure storage. He is currently serving as the Director of the Secure Computing and Networking Center (SCONCE) at UCI and the Vice-Chair of the Computer Science Department. In 2007, he was on sabbatical at the University of Rome as a Fulbright Senior Scholar. Since 2009, he has been the Editor-in-Chief of ACM Transactions on Information and System Security (TISSEC).

February 23, 2010

Despite the high dimensionality and complexity of many modern datasets, some problems have hidden structure that makes efficient statistical inference feasible. Examples of these hidden structures include: additivity, sparsity, low-dimensional manifold structure, smoothness, copula structure, and conditional independence relations.

In this talk, I will describe efficient nonparametric learning algorithms that exploit such hidden structures to overcome the curse of dimensionality. These algorithms have strong theoretical guarantees and provide practical methods for many fundamentally important learning problems, ranging from unsupervised exploratory data analysis to supervised predictive modeling.

I will use the two examples of high-dimensional graph estimation and multi-task regression to illustrate the principles of developing high-dimensional nonparametric methods. The theoretical results are presented in terms of risk consistency, estimation consistency, and model selection consistency. The practical performance of the algorithms is illustrated on genomics and cognitive neuroscience examples and compared to state-of-the-art parametric competitors.

This work is joint with John Lafferty and Larry Wasserman.

Speaker Biography: Han Liu is a fifth-year PhD student in the Machine Learning Department within the School of Computer Science at Carnegie Mellon University. He is in the joint PhD program in Machine Learning and Statistics. His dissertation, directed by John Lafferty and Larry Wasserman, is entitled "High Dimensional Nonparametric Learning and Massive-Data Analysis". This study investigates fundamental theory and methods for high dimensional nonparametric inference and demonstrates their applicability to areas such as computational biology and cognitive neuroscience. Over the past two years, Han Liu has won the Google PhD Fellowship in Statistics, the best student paper award at the 26th International Conference on Machine Learning (ICML 2009), and the 2010 best paper award in the ASA (American Statistical Association) student paper competition in Statistical Computing and Graphics. Han Liu obtained his MSc in Statistics and Machine Learning at Carnegie Mellon University in 2007 and another MSc in Computer Science at the University of Toronto in 2005.

February 25, 2010

Mobile networks of robots, vehicles and phones will form the next-generation large-scale, distributed, and reconfigurable sensing infrastructure. Effective coordination to maximize sensing coverage is the common, and key, concern. The differences lie in their capabilities and the degree of control possible. In this talk, I will describe our approaches and tools that harness mobility in a range of sensing networks. First, I will consider robots that allow precise control, and show local, geometric conditions on robot positions that guarantee global network connectivity. When combined with distributed controllers for mobile robots, these conditions can maximize sensing coverage while maintaining connectivity. The key idea is the introduction of a new construct, the Neighbor-Every-Theta (NET) graph, in which each node has at least one neighbor in every theta angular sector of its communication range. We prove that for theta < pi, NET graphs are guaranteed to have an edge-connectivity of at least floor(2*pi/theta), even with an irregular communication range. The NET conditions are integrated into a virtual potential field-based controller for distributed deployment. In the second part of the talk, I will consider mobile phones that allow little or no position control, and describe ongoing work on participatory sensing of air visibility.
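
To make the NET condition concrete, here is a minimal sketch (an illustration for this writeup, not the speaker's implementation) of the per-node check: a node satisfies the condition exactly when the largest angular gap between consecutive neighbors is smaller than theta. The function name and the 2-D coordinate representation are assumptions of this example.

```python
import math

def satisfies_net(node, neighbors, theta):
    """Check the Neighbor-Every-Theta condition at one node: every
    angular sector of width theta around the node must contain at least
    one neighbor. Equivalently, the largest angular gap between
    consecutive neighbors must be smaller than theta."""
    if not neighbors:
        return False
    angles = sorted(math.atan2(ny - node[1], nx - node[0])
                    for nx, ny in neighbors)
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(2 * math.pi - (angles[-1] - angles[0]))  # wrap-around gap
    return max(gaps) < theta

# A node at the origin with three well-spread neighbors; when every node
# passes this check with theta < pi, the paper's result guarantees
# edge-connectivity of at least floor(2*pi/theta).
print(satisfies_net((0, 0), [(1, 0), (0, 1), (-1, -1)], theta=2.5))  # True
```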

Speaker Biography: Dr. Sameera Poduri is a postdoctoral research associate at the University of Southern California (USC) where she conducts research on mobile sensing and teaches a course on intelligent embedded systems. She received a Ph.D. in Computer Science from USC in 2008, M.Tech in Computer Aided Design and Automation and B.Tech in Mechanical Engineering from the Indian Institute of Technology Bombay, India, in 2002. Her research work focuses on areas of distributed control and coordination algorithms for robot networks and large-scale participatory sensing.

Slides >>

March 4, 2010

Information extraction (IE) programs automatically extract structured data embedded in text corpora. These extracted structured data can then be exploited by all kinds of applications, such as structured queries and data mining. As such, IE is becoming increasingly important in managing text corpora. Most current IE solutions have considered only static text corpora, over which we typically apply IE programs only once. However, many real-world text corpora such as Wikipedia are evolving: documents can be added, deleted, and modified. Therefore, to keep extracted information up to date, we often must apply IE programs repeatedly, to every corpus snapshot. How can we execute such repeated IE efficiently?

In this talk, I will present efficient solutions for IE over evolving text. The underlying idea of these solutions is to recycle previous IE results, given that consecutive corpus snapshots often contain much overlapping text. I will first discuss Cyclex, a system that recycles results for a single IE program. Cyclex models a small set of important properties shared by many practical IE programs to guarantee the correctness of recycling. Furthermore, it can efficiently recycle over large-scale text corpora by exploiting several database techniques, such as cost-based optimization and a join-like recycling algorithm. Then I will talk about Delex, which builds on Cyclex and recycles results for complex IE workflows that consist of multiple IE programs. Finally, I will conclude with future research directions in deploying and managing IE systems over large-scale text corpora and developing user-friendly, robust, and scalable analysis tools for scientific researchers.
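
As a rough illustration of the recycling idea (a sketch under assumptions of my own, not the Cyclex algorithm itself): if an extractor's output depends only on a bounded window of surrounding context, then extractions falling inside unchanged regions of the new snapshot can simply be re-mapped, and the extractor need only re-run near the edits. Here `scope` stands in for that bounded-context property, and `extractor` is a hypothetical function returning (start, end, value) tuples.

```python
import difflib

def changed_regions(old_text, new_text, scope=50):
    """Regions of new_text (widened by `scope` characters) that differ
    from old_text; `scope` bounds how much context an extraction may
    depend on, a Cyclex-style property of the IE program."""
    sm = difflib.SequenceMatcher(None, old_text, new_text)
    return [(max(0, j1 - scope), min(len(new_text), j2 + scope))
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != 'equal']

def recycle_extract(old_text, new_text, old_results, extractor, scope=50):
    """Copy old extractions whose spans (plus context) survive unchanged,
    and re-run the extractor only inside the changed regions."""
    sm = difflib.SequenceMatcher(None, old_text, new_text)
    results = []
    for a, b, size in sm.get_matching_blocks():
        for start, end, value in old_results:
            # Span and its context window lie inside an unchanged block.
            if a + scope <= start and end + scope <= a + size:
                results.append((start - a + b, end - a + b, value))
    for lo, hi in changed_regions(old_text, new_text, scope):
        for start, end, value in extractor(new_text[lo:hi]):
            results.append((lo + start, lo + end, value))
    return results
```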

Speaker Biography: Fei Chen is a doctoral candidate in the Database group at the University of Wisconsin-Madison. Her dissertation develops solutions for efficient information extraction over large-scale evolving text. Her research interests include managing large-scale text corpora, distributed computing and mining biology sequences.

March 9, 2010

In this presentation, we describe the design of the Dietary Intake Monitoring Application (DIMA [1]), a mobile, electronic food diary for low-literacy patients with stage 5 Chronic Kidney Disease (CKD). CKD patients do not have functioning kidneys, requiring them to undergo hemodialysis three times a week. Because excess fluids and toxins normally removed continuously by the kidneys are only removed every other day with dialysis, CKD patients have an extremely restricted prescribed diet. For example, a typical patient must limit their fluid intake to 1 liter a day and their sodium to 2 g. Failure to adhere to the diet can lead to a host of complications, including exacerbated hypertension, pulmonary edema, and even death. However, this population often lacks the computational and memory skills necessary to track their fluid and nutrient intake on their own, with as many as 80% of patients not restricting their fluid and 67% not limiting their nutrients. Further, this patient group is particularly difficult to design for, as they have varying literacy skills, prohibiting text-based input and output. In this presentation, we describe our approach to designing for a chronically ill patient population that is not tech-savvy and has educational barriers to using technology.

[1] Funded by the National Institute of Biomedical Imaging and Bioengineering (NIBIB): Award #1 R21 EB007083-01A1, titled Self-Monitoring of Dietary and Fluid Intake Using a PDA.

Speaker Biography: Dr. Kay Connelly is an Associate Professor in the School of Informatics at Indiana University. Her research interests are in the intersection of mobile and pervasive computing and healthcare. In particular, she is interested in issues that influence user acceptance of health technologies, such as privacy, integration into one’s lifestyle, convenience, and utility. Dr. Connelly works with a variety of patient groups, including very sick populations who need help in managing their disease, healthy populations interested in preventative care, and senior citizens looking to remain in their homes for as long as possible. Dr. Connelly is the Senior Associate Director for the Center for Applied Cybersecurity Research, and has recently taken the challenge to start a new Health Informatics program at Indiana University. Dr. Connelly received a BS in Computer Science and Mathematics from Indiana University (1995), and an MS (1999) and Ph.D. (2003) in Computer Science from the University of Illinois.

Slides >>

March 11, 2010

Designing autonomous agents capable of coping with the complexity of the real world is a tremendous engineering challenge. Such agents must often deal with rich observations (such as images), unknown dynamics, and rich structure, perhaps consisting of objects, their properties/types, and their dynamical interactions. An ability to learn from experience and generalize radically to new situations is essential; at the same time, the agent may bring substantial prior knowledge to bear on the environment it finds itself in.

In this talk, I will present recent work on the combination of reinforcement learning and nonparametric Bayesian modeling. Hierarchical Bayes provides a principled framework for incorporating prior knowledge and dealing explicitly with uncertainty, while reinforcement learning provides a framework for making sequential decisions under uncertainty. I will discuss how nonparametric Bayesian models can help answer two questions: 1) how can an agent learn a representation of state space in a structured domain? and 2) how can an agent learn how to search for good control laws in hard-to-search spaces?
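
To give a flavor of the first question, here is a minimal sketch of the Chinese restaurant process, the combinatorial prior behind many nonparametric Bayesian models of latent state: each observation joins an existing state in proportion to that state's popularity, or opens a new one with probability proportional to alpha, so the number of states is not fixed in advance. (This illustrates the general tool, not the talk's specific models; a real model would also weight assignments by the likelihood of the observation under each state.)

```python
import random

def crp_partition(observations, alpha=1.0, seed=0):
    """Chinese restaurant process sketch: each observation joins an
    existing cluster ("latent state") with probability proportional to
    the cluster's size, or opens a new cluster with probability
    proportional to alpha, so the number of states grows with the data."""
    random.seed(seed)
    clusters = []
    for obs in observations:
        weights = [len(c) for c in clusters] + [alpha]
        r = random.uniform(0, sum(weights))
        for i, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if i == len(clusters):
            clusters.append([obs])    # open a new state
        else:
            clusters[i].append(obs)   # join an existing state
    return clusters

print([len(c) for c in crp_partition(range(100), alpha=2.0)])
```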

I will illustrate the concepts on applications including modeling neural spike train data, causal sound source separation and optimal control in high-dimensional, simulated robotic environments.

Speaker Biography: David Wingate received a B.S. and M.S. in Computer Science from Brigham Young University in 2002 and 2004, and a Ph.D. in Computer Science from University of Michigan in 2008. He is currently a postdoctoral research associate in the Computational Cognitive Science group at MIT. David’s research interests lie at the intersection of perception, control and cognition. His research spans diverse topics in reinforcement learning, Bayesian unsupervised learning, information theory, manifold learning, kernel methods, massively parallel processing, visual perception, and optimal control.

March 23, 2010

Genome-wide association studies have recently become popular as a tool for identifying the genetic loci responsible for increased disease susceptibility by examining genetic and phenotypic variation across a large number of individuals. The cause of many complex disease syndromes involves the complex interplay of a large number of genomic variations that perturb disease-related genes in the context of a regulatory network. As patient cohorts are routinely surveyed for a large number of traits, such as hundreds of clinical phenotypes and genome-wide profiles of thousands of gene expressions, new computational challenges arise in identifying genetic variations associated simultaneously with multiple correlated traits. In this talk, I will present algorithms that go beyond the traditional approach of examining the correlation between a single genetic marker and a single trait. Our algorithms build on sparse regression methods from statistics, and are able to discover genetic variants that perturb modules of correlated molecular and clinical phenotypes during genome-phenome association mapping. Our approach is significantly better at detecting associations when genetic markers synergistically influence a group of traits.
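
The flavor of such sparse multi-trait regression can be sketched as follows (a minimal illustration of one standard formulation, not the speaker's exact estimator): an L1/L2 penalty on the rows of the coefficient matrix encourages each marker to be either relevant to a whole group of correlated traits or dropped entirely. The function name, defaults, and solver choice here are illustrative.

```python
import numpy as np

def multi_task_lasso(X, Y, lam=0.1, iters=500):
    """Sketch of multi-task sparse regression by proximal gradient:
    X holds markers (n x p), Y holds many traits (n x k). The L1/L2
    penalty shrinks each marker's entire row of coefficients as a unit,
    so a marker is either relevant to a group of correlated traits or
    dropped for all of them."""
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    step = n / np.linalg.norm(X, 2) ** 2              # 1/L for the smooth part
    for _ in range(iters):
        B -= step * (X.T @ (X @ B - Y) / n)           # gradient step
        norms = np.linalg.norm(B, axis=1, keepdims=True)
        B *= np.clip(1 - step * lam / (norms + 1e-12), 0.0, None)  # group shrink
    return B
```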

Speaker Biography: Seyoung Kim is currently a project scientist in the Machine Learning Department at Carnegie Mellon University. Her work as a postdoctoral fellow and project scientist at Carnegie Mellon University has included developing machine-learning algorithms for disease association mapping. She received her Ph.D. in computer science from the University of California, Irvine, in 2007. During her Ph.D., she worked on statistical machine learning methods for problems in biomedical domain.

March 25, 2010

Graphical models such as Markov random fields have been successfully applied to a wide variety of fields, from computer vision and natural language processing to computational biology. Exact probabilistic inference is generally intractable in complex models having many dependencies between the variables. In this talk, I will discuss recent work on using linear programming relaxations to perform approximate inference. By solving the LP relaxations in the dual, we obtain efficient message-passing algorithms that, when the relaxations are tight, provably find the most likely (MAP) configuration. Our algorithms succeed at finding the MAP configuration in protein side-chain placement, protein design, and stereo vision problems. More broadly, this talk will highlight emerging connections between machine learning, polyhedral combinatorics, and combinatorial optimization.
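
For intuition, here is a minimal sketch of the case where the relaxation is tight and message passing provably recovers the MAP: max-sum dynamic programming on a chain-structured MRF. (This illustrates the easy, tree-structured case only; it is not the speaker's algorithm for general graphs.)

```python
import numpy as np

def chain_map(unary, pairwise):
    """Max-sum dynamic programming (Viterbi) on a chain MRF.
    unary: list of length-k arrays of node scores; pairwise: list of
    (k x k) edge score matrices. On trees the LP relaxation is tight,
    and this message passing provably recovers the MAP assignment."""
    n = len(unary)
    msgs, back = [unary[0]], [None]
    for i in range(1, n):
        # score[prev, cur] = best score so far + edge score
        score = msgs[i - 1][:, None] + pairwise[i - 1]
        back.append(np.argmax(score, axis=0))
        msgs.append(unary[i] + np.max(score, axis=0))
    assign = [int(np.argmax(msgs[-1]))]       # backtrack the best path
    for i in range(n - 1, 0, -1):
        assign.append(int(back[i][assign[-1]]))
    return assign[::-1], float(np.max(msgs[-1]))

unary = [np.array([0.0, 1.0]), np.array([0.5, 0.0])]
pairwise = [np.array([[0.0, -2.0], [-2.0, 0.0]])]
print(chain_map(unary, pairwise))  # ([1, 1], 1.0)
```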

Speaker Biography: David Sontag is a Ph.D. candidate in Computer Science at MIT. He received his Bachelor's degree in Computer Science from the University of California, Berkeley in 2005. His research focuses on theory and practical algorithms for learning and probabilistic inference in large statistical models. His work received an outstanding student paper award at NIPS in 2007 and a best paper award at UAI in 2008. He currently holds a Google Fellowship in Machine Learning.

March 30, 2010

A great variety of sensors are being built into mobile phones today. This opens up a new research frontier, where mobile devices have the potential to significantly impact many aspects of our everyday life: from health care, to sustainability, safety, entertainment, and business. However, the broad impact of this vision will be jeopardized without advances in the computational models that turn raw sensor data into inferences (ranging from recognizing physical activities to tracking community-wide social interaction patterns). My group focuses on developing mobile sensing and machine learning techniques for analyzing and interpreting the behavior of individuals and social groups, including their context, activities, and social networks.
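
As a small concrete example of turning raw sensor data into inference inputs (an illustration of a common pipeline, not the speaker's system): windowed summary statistics of a phone's accelerometer stream are the typical features fed to an activity classifier.

```python
import numpy as np

def accel_features(window):
    """Summary features for one window of 3-axis accelerometer samples
    (shape: samples x 3): per-axis mean and standard deviation, plus
    overall jitter. A standard classifier trained on labeled windows
    then maps such feature vectors to activities (walking, sitting, ...)."""
    return np.concatenate([window.mean(axis=0),
                           window.std(axis=0),
                           [np.abs(np.diff(window, axis=0)).mean()]])

window = np.random.randn(50, 3)       # e.g. one second of 50 Hz samples
print(accel_features(window).shape)   # (7,)
```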

Although solutions to this problem using standard machine learning techniques are possible, how to apply them efficiently, without requiring significant effort from the end-users, is still an open problem. In this talk I will describe the research we have done to bridge this gap and advance the state of the art of people-aware computing, by developing novel learning algorithms and evaluating the resulting systems through a series of real-world deployments. I will conclude by providing some specific examples of how the methods we have developed can be applied to better understand and enhance the lives of people.

Speaker Biography: Tanzeem Choudhury is an assistant professor in the computer science department at Dartmouth. She joined Dartmouth in 2008 after four years at Intel Research Seattle. She received her PhD from the Media Laboratory at MIT. Tanzeem develops systems that can reason about human activities, interactions, and social networks in everyday environments. Tanzeem’s doctoral thesis demonstrated for the first time the feasibility of using wearable sensors to capture and model social networks automatically, on the basis of face-to-face conversations. MIT Technology Review recognized her as one of the top 35 innovators under the age of 35 (2008 TR35) for her work in this area. Tanzeem has also been selected as a TED Fellow and is a recipient of the NSF CAREER award.

Student Seminar

March 31, 2010

The advent of computer-integrated surgery (CIS) technologies has renewed interest in improving the current learning and evaluation paradigms for surgeons. The ability of CIS systems to record quantitative motion and video data of the surgical workspace opens up the possibility of creating descriptive mathematical models to represent and analyze surgical training and performance. These models can then form the basis for evaluating and training surgeons, producing quantitative measures of surgical proficiency, automatically annotating surgical recordings, and providing data for a variety of other applications in medical informatics.

In developing mathematical models to recognize and evaluate surgical dexterity, we must first investigate the underlying structure in surgical motion. We hypothesize that motion during a surgery is not a random set of gestures but a deliberate sequence of gestures, each with its own surgical intent. In this talk, I present highlights of our investigation into the existence of structure in surgical motion. During our research, we made no assumptions about the construct of the structure. We borrowed techniques and ideas from computer vision, image processing, speech processing, language theory, machine learning, and statistical analysis to help in our investigation. Our focus was on the analysis of fundamental surgical tasks, such as suturing, knot tying, and needle hoop passing, across varying surgical skill levels.
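
One simple way to probe for such structure, in the spirit of the vector-quantization techniques borrowed from speech processing (a sketch of the general approach, not the specific models in the talk): slice the kinematic stream into windows and cluster them, so that recurring clusters become candidate gesture units. The window and cluster counts below are arbitrary.

```python
import numpy as np

def motion_codebook(kinematics, window=30, k=8, iters=50, seed=0):
    """Slice a kinematic stream (samples x channels, e.g. tool positions
    and velocities) into half-overlapping windows and cluster them with
    k-means; each window's cluster label is a candidate gesture unit.
    Assumes enough windows exist to seed k clusters."""
    rng = np.random.default_rng(seed)
    X = np.array([kinematics[i:i + window].ravel()
                  for i in range(0, len(kinematics) - window, window // 2)])
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels  # one candidate gesture label per window
```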

April 1, 2010

The dilemma of either suffering high ownership costs or facing the limited scale-out of commercial databases has started a trend in which many communities develop their own nimble, lightweight data management tools, as seen with MapReduce and key-value stores. Update-intensive applications, such as algorithmic trading on order books, compute cloud management, and personal status feeds (e.g., Facebook, Twitter), are clear cases in point of this trend, since to this day databases have a notoriously poor reputation for handling updates efficiently. Current techniques such as incremental view maintenance and stream processing either involve significant repetition of work or apply only in limited settings.

I introduce DBToaster, a novel SQL compilation framework that reconsiders the foundations and program structure of state-of-the-art query processors to generate lightweight, high-performance query engines. These engines achieve orders-of-magnitude efficiency gains by fully exploiting queries for incremental processing. DBToaster engines use map data structures instead of highly optimized relational operators, resulting in very simple, efficient query processing programs. I will present DBToaster's novel recursive compilation technique, which determines the maps to maintain by repeatedly simplifying queries. I will also discuss ongoing work on Cumulus, a massive-scale online query processor based on DBToaster's extremely simple intermediate language of map maintenance, a language that is embarrassingly parallel and reflects the goal of achieving scalability through simplicity.
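
The core idea of map-based incremental processing can be made concrete with a toy example (a sketch of the general technique, not DBToaster's generated code): to maintain q = SUM(r.a * s.b) over R joined with S on key k, keep per-key partial sums so that each insertion updates the result in constant time rather than re-running the join.

```python
from collections import defaultdict

class IncrementalSumJoin:
    """Maintain q = SUM(r.a * s.b) over R join S on key k. Rather than
    re-running the join per update, keep per-key partial sums, so each
    insertion is O(1)."""

    def __init__(self):
        self.sum_a = defaultdict(float)  # key -> SUM(a) over R
        self.sum_b = defaultdict(float)  # key -> SUM(b) over S
        self.q = 0.0                     # the maintained query result

    def insert_r(self, k, a):
        self.q += a * self.sum_b[k]      # delta contributed by the new R row
        self.sum_a[k] += a

    def insert_s(self, k, b):
        self.q += b * self.sum_a[k]      # delta contributed by the new S row
        self.sum_b[k] += b

v = IncrementalSumJoin()
v.insert_r('x', 2.0); v.insert_s('x', 3.0); v.insert_r('x', 1.0)
print(v.q)  # 9.0 = (2 + 1) * 3
```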

Speaker Biography: Yanif Ahmad is a postdoctoral associate in the Database Group at Cornell University with Prof. Christoph Koch, having received his Ph.D. from Brown University in January 2009 under the supervision of Prof. Ugur Cetintemel. His research focuses on data stream processing and distributed data management. Yanif is the recipient of an IBM Ph.D. fellowship, a Best Research Paper award at the ICDE 2008 conference, and a Best Demonstration Award at SIGMOD 2005, and has interned at both IBM Almaden and Microsoft Research.

Student Seminar

April 5, 2010

A hypergraph or “packed forest” is a compact data structure that can represent exponentially many trees generated (for a given input sentence) by a tree-based machine translation system. This talk presents novel inference, training, and decoding methods that work with hypergraphs.

The hypergraph is weighted, with the weights (which are often negative log-probabilities) learned by a discriminative training method. One common drawback of such methods is that they rely on the existence of high-quality supervised data (i.e., bilingual data), which may be expensive to obtain. We present two unsupervised discriminative training methods, minimum imputed-risk training and contrastive language model estimation, both of which can exploit monolingual English data to perform discriminative training.

Decoding in this context refers to finding a translation that has a maximum posterior probability (i.e., MAP decoding) in the hypergraph. However, this is intractable due to spurious ambiguity, a situation where the probability of a translation string is split among many distinct derivations (e.g., trees or segmentations). We present a novel technique, variational decoding, which effectively approximates the intractable MAP decoding based on variational inference.

Finally, many atomic inference operations on hypergraphs, such as finding the one-best or k-best trees or computing expectations, have wider applicability beyond machine translation. A general framework for achieving these operations is semiring-weighted logic programming. Within this framework, we first extend the expectation semiring, originally proposed for finite state automata, to hypergraphs. We then propose a novel second-order expectation semiring. These semirings can be used to efficiently compute a variety of frequently used expectations (e.g., entropy and its gradient) over the exponentially many trees in a hypergraph.
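
The appeal of the semiring framework is that one generic traversal serves many computations. Below is a minimal sketch of a semiring-weighted inside pass over a hypergraph (an illustration of the framework, not the talk's expectation semirings); swapping in a different plus/times pair changes what is computed, e.g. (max, +) yields the best derivation score. The hypergraph encoding is an assumption of this example.

```python
def inside(hypergraph, plus, times, one):
    """Semiring-weighted inside algorithm over a hypergraph. Each entry
    maps a node to its incoming hyperedges (weight, [tail nodes]),
    listed in bottom-up (topological) order. Nodes with no incoming
    edges are leaves and receive the semiring's multiplicative identity."""
    value = {}
    for node, in_edges in hypergraph:
        total = None
        for weight, tails in in_edges:
            score = weight
            for t in tails:
                score = times(score, value[t])
            total = score if total is None else plus(total, score)
        value[node] = one if total is None else total
    return value

# Viterbi semiring over log-weights: best derivation score per node.
best = lambda hg: inside(hg, plus=max, times=lambda a, b: a + b, one=0.0)

hg = [("a", []), ("b", []),
      ("goal", [(-0.5, ["a", "b"]), (-1.2, ["a"])])]
print(best(hg)["goal"])  # -0.5: the higher-scoring derivation wins
```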

The modeling, decoding and algorithmic advances are presented in the context of a machine translation task, but we believe that they will also find applications in other areas of natural language processing (e.g., parsing and speech recognition) and more generally in pattern recognition.

April 8, 2010

In this talk, we address two algorithmic challenges in machine learning.

First, with the increase in electronic record-keeping, many of the datasets that learning algorithms work with relate to sensitive information about individuals. Thus the problem of privacy-preserving learning (how to design learning algorithms that operate on the sensitive data of individuals while still guaranteeing the privacy of the individuals in the training set) has taken on great practical importance. In this talk, we address the problem of privacy-preserving classification, and we present an efficient classifier which is private in the differential privacy model of Dwork et al. Our classifier works in the ERM (empirical risk minimization) framework, and includes privacy-preserving logistic regression and privacy-preserving support vector machines. We show that our classifier is private, provide analytical bounds on the sample requirement of our classifier, and evaluate it on real data.
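
As a concrete sketch of one route to such a classifier (output perturbation in the style of Chaudhuri and Monteleoni's private logistic regression; the training loop, defaults, and scaling conventions below are illustrative simplifications): train the regularized model, then add noise calibrated to the sensitivity of its minimizer.

```python
import numpy as np

def private_logreg(X, y, eps=1.0, lam=0.1, iters=200, lr=0.5, seed=0):
    """Output perturbation sketch: fit L2-regularized logistic
    regression, then add noise calibrated to the 2/(n*lam) sensitivity
    of the minimizer. Assumes rows of X scaled to norm <= 1, y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):  # plain gradient descent on the ERM objective
        margins = y * (X @ w)
        grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + lam * w
        w -= lr * grad
    # Noise density proportional to exp(-eps * ||b|| * n * lam / 2):
    # uniform direction, Gamma-distributed norm.
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    norm = rng.gamma(shape=d, scale=2.0 / (n * lam * eps))
    return w + norm * direction
```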

Second, we address the problem of clustering when data is available from multiple domains or views. For example, when clustering a document corpus such as Wikipedia, we have access to both the contents of the documents and their link structure. In this talk, we address this problem of multiview clustering, and show how to use information from multiple views to improve clustering performance. We present an algorithm for multiview clustering, provide analytical bounds on its performance under certain statistical assumptions, and finally evaluate it on real data.
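
One classical way to pool two views, sketched below for intuition (a standard CCA-based recipe, not necessarily the algorithm in the talk): project both views onto directions where they are maximally correlated, then cluster in that shared space, where view-specific noise tends to cancel.

```python
import numpy as np

def cca_multiview_embed(X1, X2, dim=2, reg=1e-3):
    """Embed view 1 into the shared CCA space of two views X1, X2
    (both n x d_i, one row per object). Feed the returned embedding
    to any standard clustering algorithm, e.g. k-means."""
    X1 = X1 - X1.mean(0); X2 = X2 - X2.mean(0)
    n = len(X1)
    C11 = X1.T @ X1 / n + reg * np.eye(X1.shape[1])
    C22 = X2.T @ X2 / n + reg * np.eye(X2.shape[1])
    C12 = X1.T @ X2 / n
    # Whiten each view, then take the SVD of the cross-covariance.
    W1 = np.linalg.inv(np.linalg.cholesky(C11)).T
    W2 = np.linalg.inv(np.linalg.cholesky(C22)).T
    U, _, Vt = np.linalg.svd(W1.T @ C12 @ W2)
    return X1 @ W1 @ U[:, :dim]   # shared-space coordinates of view 1
```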

Based on joint work with Sham Kakade (UPenn), Karen Livescu (TTI Chicago), Claire Monteleoni (CCLS Columbia), Anand Sarwate (ITA UCSD), and Karthik Sridharan (TTI Chicago).

Speaker Biography: Kamalika Chaudhuri received a Bachelor of Technology degree in Computer Science and Engineering in 2002 from the Indian Institute of Technology, Kanpur, and a PhD in Computer Science from UC Berkeley in 2007. She is currently a postdoctoral researcher in the Computer Science and Engineering Department at UC San Diego. Kamalika's research is on the design and analysis of machine learning algorithms and their applications. In particular, her interests lie in clustering, online learning, privacy-preserving machine learning, and applications of machine learning and algorithms to practical problems in other areas.

April 13, 2010

As networked systems grow and traffic patterns evolve, management applications are increasing in complexity and functionality. To address these demands, equipment vendors and administrators today depend on incremental solutions that increase the complexity of network elements and deployment costs for operators. Despite this increased complexity and cost, there is still a fundamental disconnect between the policy objectives of system administrators and today’s mechanisms.

I argue that much of this disconnect arises from the narrow device-centric view of current solutions. Such piecemeal solutions are inefficient: network elements duplicate tasks and some locations become overloaded. Worse still, administrators struggle to retrofit their high-level goals within device-centric configurations. I describe a clean-slate system-wide approach for resource management in large-scale networked systems based on three high-level principles: (1) systematic selection and placement of device-level primitives, (2) lightweight coordination mechanisms that enable different network elements to operate in synchrony, and (3) practical optimization models that capture operating constraints and policy objectives. In this talk, I will highlight the benefits of such a system-wide approach in the context of meeting fine-grained coverage requirements in traffic monitoring and implementing a redundancy elimination service to improve network performance. Further, these principles can be more broadly applied in other contexts such as intrusion detection and prevention, managing cloud services, and multi-hop wireless and sensor networks.
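
To make principle (3) concrete, here is a toy version of one such optimization, loosely modeled on the coverage requirements mentioned above (a greedy set-cover sketch of my own; real formulations add load constraints and fractional per-flow coverage): choose a budgeted set of routers so that as many traffic paths as possible cross at least one of them.

```python
def place_monitors(paths, budget):
    """Greedy monitor placement: repeatedly pick the router that covers
    the most still-uncovered paths. `paths` is a list of router sets,
    one per traffic path; returns the chosen router set."""
    chosen, uncovered = set(), [set(p) for p in paths]
    for _ in range(budget):
        routers = set().union(*uncovered) if uncovered else set()
        if not routers:
            break
        best = max(routers, key=lambda r: sum(r in p for p in uncovered))
        chosen.add(best)
        uncovered = [p for p in uncovered if best not in p]
    return chosen

paths = [{"r1", "r2"}, {"r2", "r3"}, {"r4"}]
print(place_monitors(paths, budget=2))  # {'r2', 'r4'}
```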

Speaker Biography: Vyas Sekar is a final year PhD student in the Computer Science Department at Carnegie Mellon University, co-advised by Michael Reiter and Hui Zhang. His research interests are in systems, networking, and security. Before CMU, he earned his bachelor’s degree from the Indian Institute of Technology Madras, where he was awarded the President of India Gold Medal.

Distinguished Lecturer

April 15, 2010

Attribute-Based Access Control (ABAC) provides a strategy for setting up access rules by exploiting attributes of principals and objects from an enterprise information system or digital credentials. ABAC can replace or complement other approaches like Access Control Lists (ACLs) and Role-Based Access Control (RBAC). In recent years, there has been a growth of other attribute-based systems, including Attribute-Based Encryption (ABE) and Attribute-Based Messaging (ABM). In ABM, email messages use addresses that describe recipient attributes rather than an explicit list of the recipients. Such addressing makes messages more efficient, exclusive, and intensional, but raises challenges for security and privacy. This talk will discuss attribute-based security systems in general and the use of ABAC and ABE to solve security problems faced by ABM.
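
A toy example of the addressing idea (the directory, attributes, and helper below are hypothetical): the sender names required attributes, and the recipient list is resolved against a directory at delivery time; a deployed system additionally enforces ABAC policy over who may use which attributes in an address.

```python
def abm_recipients(directory, required):
    """Resolve an attribute-based address: return every user whose
    directory attributes match all required attribute/value pairs."""
    return [user for user, attrs in directory.items()
            if all(attrs.get(k) == v for k, v in required.items())]

staff = {
    "alice": {"role": "nurse", "dept": "oncology"},
    "bob":   {"role": "physician", "dept": "oncology"},
    "carol": {"role": "nurse", "dept": "radiology"},
}
print(abm_recipients(staff, {"role": "nurse", "dept": "oncology"}))  # ['alice']
```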

We describe requirements for ABM and a practical architecture that addresses them. We have built a prototype and collected performance results that show its feasibility for at least mid-size organizations. We end with some speculation on other ways to exploit attribute-based security techniques for goals like adding protection to databases and multi-tier web systems.

Speaker Biography: Dr. Gunter is a professor in the Computer Science Department and Director of the Illinois Security Lab. He does research and teaches in his areas of technical expertise: security, networks, programming languages, and software engineering. His work includes contributions to the semantics and design of programming and policy languages, models and analysis techniques for networks and security, and applications of formal logic in computer science. His current projects focus on security for networked sensors, attribute-based security systems, models and counter-measures for Denial of Service (DoS), and applications of these technologies in electric power systems and health care. He is the author of more than 80 scientific research publications and patents and a textbook on semantics of programming languages published by MIT Press. He is a founder of Probaris Technologies, a company that provides identity management technologies, and has served as a consultant to research labs and companies and as an expert witness on legal cases concerning fraud, contract, copyright, and patent infringement.

April 20, 2010

The information revolution has produced huge quantities of digitized knowledge. Information users, such as web searchers, business analysts, and medical professionals, are overwhelmed by vast quantities of information. As new information sources move online, information overload will worsen and the need for intelligent information systems will grow. The recent focus on statistical methods for information processing has produced numerous high-quality tools for processing language, including for knowledge extraction, organization, and analysis. With more data and better statistical methods, the state of the art advances. However, these statistical methods can have difficulty scaling up to huge quantities of diverse data.

This talk will present techniques designed for processing large data collections, with a particular focus on the sparse representations common to many domains with a large number of features. I will present Confidence Weighted Learning, an online (streaming) machine learning algorithm designed for these types of data distributions. Confidence weighted learning maintains a distribution over linear classifiers and updates the distribution after each example. I will show how this framework can be extended to multi-class and structured prediction problems, as well as to modeling second-order feature interactions and noisy data.
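
To illustrate the family of updates involved, here is a sketch of the AROW variant with a diagonal covariance, one closed-form member of the confidence-weighted family (not necessarily the exact update presented in the talk): the learner keeps a mean and a per-feature variance, so rarely seen features retain high variance and move aggressively when they finally appear, which suits sparse data.

```python
import numpy as np

class DiagonalAROW:
    """Confidence-weighted-family online learner (AROW variant, diagonal
    covariance): a Gaussian over weight vectors, updated per example."""

    def __init__(self, dim, r=1.0):
        self.mu = np.zeros(dim)     # mean weight vector
        self.sigma = np.ones(dim)   # per-feature variance (diagonal)
        self.r = r                  # regularizer for each update

    def update(self, x, y):         # x: feature vector, y: +1 or -1
        margin = y * (self.mu @ x)
        conf = (self.sigma * x) @ x              # x^T Sigma x
        beta = 1.0 / (conf + self.r)
        alpha = max(0.0, 1.0 - margin) * beta    # hinge-loss-scaled step
        self.mu += alpha * y * self.sigma * x
        self.sigma -= beta * (self.sigma * x) ** 2  # shrink seen features

    def predict(self, x):
        return 1 if self.mu @ x >= 0 else -1

clf = DiagonalAROW(dim=3)
clf.update(np.array([1.0, 0.0, 1.0]), +1)
print(clf.predict(np.array([1.0, 0.0, 0.0])))  # 1
```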

Speaker Biography: Mark Dredze is an Assistant Research Professor in the Department of Computer Science and a Senior Research Scientist at the Human Language Technology Center of Excellence at The Johns Hopkins University. His research interests include machine learning, natural language processing, and intelligent user interfaces. His focus is on novel applications of machine learning to solve language processing challenges, as well as applications of machine learning and natural language processing to support intelligent user interfaces for information management. He earned his PhD from the University of Pennsylvania and has worked at Google, IBM, and Microsoft.

Distinguished Lecturer

April 22, 2010

Computing at the limits of technology calls for numerous engineering decisions and tradeoffs. General-purpose solutions do not work at the extremes. Traditional HPC has been analyzed for decades, resulting in specialized architectures. Life-critical systems, systems for large enterprises, and systems for tiny devices also present their own special requirements.

The area of data-intensive computing is newer, and the computing models are less established. Supporting large numbers (millions) of users doing similar but different computations, who expect access to enormous amounts of information (petabytes, not gigabytes), prompt responses, and global access, calls for different compromises. Different applications present their own requirements and difficulties.

This talk will address some of those needs: different models of storage and data management appropriate for different types of application, networking demands for parallelism and global access, and management of large numbers of fallible processors and storage devices. Support for such computing also calls for different approaches to software methodology, system management, and deployment. But massive data also opens new ways to approach science and to get remarkable results, ranging from fast advertising (yes, it's very hard) to language translation to (of course) search.

Speaker Biography: Stuart Feldman is Vice President of Engineering at Google, responsible for engineering activities at Google's offices in the eastern part of the Americas, with projects affecting most of the company's focus areas. He is also responsible for several important Google products. Stu did his academic work in astrophysics and mathematics and earned his AB at Princeton and his PhD at MIT. He is Past President of ACM (Association for Computing Machinery) and received the 2003 ACM Software System Award. He is also a Fellow of the IEEE, a Fellow of the ACM, a Fellow of the AAAS, and a member of the Board of Directors of the AACSB (Association to Advance Collegiate Schools of Business, International). He serves on a number of government advisory committees. After graduating, Stu was a computer science researcher at Bell Labs and a research manager at Bellcore. In addition, he was the creator of Make as well as the architect for a large new line of software products at Bellcore. He then worked at IBM. Most recently, he was Vice President for Computer Science in IBM Research, where he drove the long-term and exploratory worldwide science strategy in computer science and related fields, led programs for open collaborative research with universities, and influenced national and global computer science policy. Prior to that, Stu served as Vice President for Internet Technology and was responsible for IBM strategies, standards, and policies relating to the future of the Internet, and managed a department that created experimental Internet-based applications. Earlier, he was the founding Director of IBM's Institute for Advanced Commerce, which was dedicated to creating intellectual leadership in e-commerce.

Student Seminar

April 28, 2010

Random Forest regression is a non-parametric regression technique that performs competitively with other commonly used techniques. Random Forests arrive at regression estimates by training an ensemble of randomized regression trees on bootstrap samples drawn from the training set. We describe a novel interpretation of these individual regression-tree estimates which implies asymptotically normally distributed regression errors, and which suggests parameter estimators for the error distribution of each new test object, independent of other test objects. We demonstrate this technique on several data sets, and we offer a theoretical motivation for why this interpretation, in some form, should apply to data of arbitrary underlying distribution.
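
The per-object flavor of this interpretation can be sketched as follows (a paraphrase using scikit-learn for brevity, not the speaker's estimators): collect each tree's prediction for a test point and summarize the spread of those predictions as that point's error-distribution parameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_predictive_params(forest, X):
    """For each test row, treat the individual tree predictions as
    draws whose spread calibrates that row's (approximately normal)
    error distribution; return per-row mean and standard deviation."""
    per_tree = np.stack([t.predict(X) for t in forest.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)

# Hypothetical usage, given training and test arrays:
# rf = RandomForestRegressor(n_estimators=200).fit(X_train, y_train)
# mean, std = rf_predictive_params(rf, X_test)
```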

April 29, 2010

Modern society is increasingly dependent on, and fearful of, the availability of electronic information. There are numerous realistic scenarios where sensitive data must be (sometimes reluctantly or suspiciously) shared between two or more entities without mutual trust. As often happens, the research community has foreseen the need for mechanisms to enable limited (privacy-preserving) sharing of sensitive information and a number of effective (if not always efficient) solutions have been proposed. Among them, Private Set Intersection techniques are particularly appealing whenever two parties wish to compute an intersection of their respective sets of items without revealing to each other any other information.

This talk motivates the need for Private Set Intersection (PSI) techniques with various features and privacy properties, and illustrates several concrete PSI protocols that offer appreciably better efficiency than prior work. We also demonstrate their practicality via experimental results obtained from a prototype implementation, and discuss a number of systems issues encountered in developing a toolkit that provides various flavors of PSI.
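
For intuition, here is a sketch of the classic Diffie-Hellman-style PSI construction that this line of work improves upon (both roles run in one process purely for illustration, and a real deployment would use a standard prime-order group): because exponentiation with the two parties' secret keys commutes, doubly blinded hashes match exactly for items in the intersection.

```python
import hashlib, secrets

P = 2**127 - 1   # a Mersenne prime; real deployments use a standard DH group

def h(item):     # hash an item into the multiplicative group mod P
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), 'big') % P

def psi_sketch(set_a, set_b):
    """Each party exponentiates hashed items with its own secret key;
    since exponentiation commutes, the doubly blinded values coincide
    exactly for items in the intersection, revealing nothing else
    beyond set sizes (under standard assumptions, with a proper group)."""
    items_b = list(set_b)
    ka = secrets.randbelow(P - 2) + 1                   # A's secret key
    kb = secrets.randbelow(P - 2) + 1                   # B's secret key
    blinded_a = [pow(h(x), ka, P) for x in set_a]       # A sends to B
    blinded_b = [pow(h(y), kb, P) for y in items_b]     # B sends to A
    double_a = {pow(v, kb, P) for v in blinded_a}       # B re-blinds A's items
    double_b = [pow(v, ka, P) for v in blinded_b]       # A re-blinds B's items
    return [y for y, v in zip(items_b, double_b) if v in double_a]

print(psi_sketch({"alice", "bob"}, {"bob", "carol"}))   # ['bob']
```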

Speaker Biography: Gene Tsudik is a Professor in the Department of Computer Science at the University of California, Irvine (UCI). He obtained his PhD in Computer Science from USC in 1991 for research on firewalls and Internet access control. Before coming to UCI in 2000, he was a Project Leader at IBM Zurich Research Laboratory (1991-1996) and the USC Information Sciences Institute (1996-2000). Over the years, his research interests have included routing, firewalls, authentication, mobile networks, secure e-commerce, anonymity and privacy, group communication, digital signatures, key management, mobile ad hoc networks, as well as database privacy and secure storage. He is currently serving as the Director of the Secure Computing and Networking Center (SCONCE) at UCI and the Vice-Chair of the Computer Science Department. In 2007, he was on sabbatical at the University of Rome as a Fulbright Senior Scholar. Since 2009, he has been the Editor-in-Chief of ACM Transactions on Information and System Security (TISSEC).

May 6, 2010

High-frequency trading (HFT) involves fully automated trading systems that attempt to profit from short-term pricing inefficiencies or market liquidity imbalances. Such profit opportunities can last from microseconds to minutes (or even hours). Typically, HFT groups apply strategies in the areas of automated market making, short-term statistical or volatility arbitrage, and liquidity detection.

This type of trading activity is inherently dependent on advanced computer systems to process large datasets, cope with message rates above 1 million events a second, and execute algorithms at the fastest possible speed, all while of course being correct and reliable in the face of inevitable failures.
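
As a concrete anchor for the kind of state such systems maintain under that event load (a toy sketch of my own, far from a production design): a per-symbol limit order book whose top-of-book feeds quoting and arbitrage decisions; at millions of events per second, every operation here must be allocation-light and predictable.

```python
import heapq

class OrderBook:
    """Toy per-symbol limit order book, as maintained by a feed handler.
    This sketch only accumulates orders; real books also handle
    cancels, executions, and level aggregation."""

    def __init__(self):
        self.bids = []   # max-heap via negated prices: (-price, qty)
        self.asks = []   # min-heap: (price, qty)

    def add(self, side, price, qty):
        if side == 'buy':
            heapq.heappush(self.bids, (-price, qty))
        else:
            heapq.heappush(self.asks, (price, qty))

    def top(self):
        """Best bid/ask: the inputs to a market-making quote decision."""
        bid = -self.bids[0][0] if self.bids else None
        ask = self.asks[0][0] if self.asks else None
        return bid, ask

book = OrderBook()
book.add('buy', 100.25, 10)
book.add('sell', 100.27, 5)
print(book.top())   # (100.25, 100.27)
```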

Moreover, rapid improvements in technology, from smarter algorithms to low-level code optimizations and beyond, lead directly to increased profits. On the other hand, unreliable, poorly performing, unscalable, or unpredictably jittery systems are one of the easier ways for an HFT group to put itself out of business.

After a brief introduction to HFT, this talk will explore a prototypical HFT architecture to illustrate the many systems engineering and research questions we tackle as we meet these challenges. In particular, I will focus on the programmability of these and future systems while meeting stringent performance, reliability, and efficiency goals.