The information revolution has produced huge quantities of digitized knowledge. Information users, such as web searchers, business analysts, and medical professionals, are overwhelmed by the volume. As new information sources move online, information overload will worsen and the need for intelligent information systems will grow. The recent focus on statistical methods in information processing has produced numerous high-quality tools for processing language, including tools for knowledge extraction, organization, and analysis. With more data and better statistical methods, the state of the art advances. However, these statistical methods can have difficulty scaling up to huge quantities of diverse data.
This talk will present techniques designed for processing large data collections, with a particular focus on the sparse representations common to many domains with a large number of features. I will present Confidence Weighted Learning, an online (streaming) machine learning algorithm designed for these types of data distributions. Confidence Weighted Learning maintains a distribution over linear classifiers and updates the distribution after each example. I’ll show how this framework can be extended to multi-class and structured prediction problems, as well as to modeling second-order feature interactions and handling noisy data.
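To make the idea of "a distribution over linear classifiers, updated after each example" concrete, here is a minimal sketch in Python. It is not the exact Confidence Weighted update from the talk: it uses the simpler closed-form update of AROW (adaptive regularization of weight vectors), a related algorithm from the same family, with a diagonal covariance. The class name `SparseDiagonalAROW`, the parameter `r`, and the toy feature names are all assumptions made for illustration.

```python
class SparseDiagonalAROW:
    """Gaussian over linear classifiers with a diagonal covariance.

    Illustrative sketch only: uses the AROW update, not the exact
    Confidence Weighted update. Features are sparse dicts {name: value}.
    """

    def __init__(self, r=1.0):
        self.mu = {}      # mean weight per feature (implicit default 0.0)
        self.sigma = {}   # variance per feature (implicit default 1.0)
        self.r = r        # regularization strength (hypothetical default)

    def score(self, x):
        # Margin under the mean classifier; touches only active features.
        return sum(self.mu.get(f, 0.0) * v for f, v in x.items())

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def update(self, x, y):
        """Update the distribution on one example; y must be +1 or -1."""
        margin = y * self.score(x)
        if margin >= 1.0:
            return  # no hinge loss, so the distribution is left unchanged

        # Sigma * x, restricted to the features present in this example.
        sigma_x = {f: self.sigma.get(f, 1.0) * v for f, v in x.items()}
        # Variance of the margin: x^T Sigma x.
        variance = sum(v * sigma_x[f] for f, v in x.items())

        beta = 1.0 / (variance + self.r)
        alpha = (1.0 - margin) * beta

        for f, sv in sigma_x.items():
            # Uncertain (high-variance) features receive larger mean updates.
            self.mu[f] = self.mu.get(f, 0.0) + alpha * y * sv
            # Variance shrinks for every feature seen in this example.
            self.sigma[f] = self.sigma.get(f, 1.0) - beta * sv * sv


# Toy usage on a stream of two examples.
model = SparseDiagonalAROW()
stream = [({"good": 1.0}, 1), ({"bad": 1.0}, -1)]
for features, label in stream:
    model.update(features, label)
```

Because each update touches only the features active in the current example, the cost per example scales with the number of nonzero features rather than with the full feature space, which is what makes this style of learner a natural fit for sparse data.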
Mark Dredze is an Assistant Research Professor in the Department of Computer Science and a Senior Research Scientist at the Human Language Technology Center of Excellence at The Johns Hopkins University. His research interests include machine learning, natural language processing, and intelligent user interfaces. His focus is on novel applications of machine learning to solve language processing challenges, as well as applications of machine learning and natural language processing to support intelligent user interfaces for information management. He earned his PhD from the University of Pennsylvania and has worked at Google, IBM, and Microsoft.