Mark Dredze

     Johns Hopkins University
Research Scientist Human Language Technology Center of Excellence (HLTCOE)
Assistant Research Professor Department of Computer Science
Center for Language and Speech Processing (CLSP)
Machine Learning Group
Center for Population Health Information Technology (CPHIT), Bloomberg School of Public Health
Health Sciences Informatics, School of Medicine
Contact:   |  www.cs.jhu.edu/~mdredze   www.dredze.com
Office: Stieff 181    (410) 516-6786

About Me

I am an assistant research professor of computer science at Johns Hopkins University. I work primarily in the Human Language Technology Center of Excellence (HLTCOE). I am affiliated with the Center for Language and Speech Processing (CLSP) and am part of the Machine Learning Group.

Research Interests
My research interests span machine learning, natural language processing, speech, intelligent user interfaces and health informatics. The work I find most exciting combines these areas.

How to be a Successful PhD Student
Hanna Wallach and I wrote a guide on How to be a Successful PhD Student. It's geared toward PhD students in general, but focuses on CS, machine learning and NLP students. We welcome any suggestions you have for improvements.

"I want to work with you."
Great! Please contact me if you are a JHU student. It is probably helpful if you've taken natural language processing, machine learning or artificial intelligence. If you are not a JHU student, do not contact me directly asking about open positions. If you do, I will ignore your email. Instead, please see the admissions process at the CLSP and the CS department.

"What do you do?"
In addition to listing my research interests above, I like to describe the evolution of my research.

I began my research as an undergraduate at Northwestern University working on intelligent user interfaces in the InfoLab, specifically an interface for bringing contextual information to television news viewers. The work required various learning components: generating queries, segmenting stories, classification, etc. Realizing that I knew little about these technologies, I decided to learn more about the components that would support good UIs. At the IBM T.J. Watson Research Center, I began work on a natural medium for these applications: email. After working on email activity management, I began my PhD at the University of Pennsylvania with Fernando Pereira working on the CALO project. I branched off into other email tasks, including reply prediction and attachment prediction, both designed to improve the email experience. This taught me the importance of building learning models specific to each user, since different users operate in different environments.

Subsequently, I expanded my interests to other applications and developments of these learning technologies, such as semi-supervised learning and online learning. Observations about differences in user behaviors led to work in domain/user adaptation, an important problem for natural language processing.

Over the years I have worked at a variety of industrial research labs, including Google, IBM and Microsoft. This work has given me an appreciation for the real-world deployment challenges of intelligent systems. In general, I am interested in these challenges and in how we can develop more robust learning systems.

News
3/19/2013 — I spoke at the Big Data in Public Health conference.
3/4/2013 — Our work was featured in the Washington Post.
2/28/2013 — I spoke at the Social Media And Response Management Interface Event (SMARMIE). My talk and subsequent panel are online (starts at 3:39). I am also in the final panel.
2/23/2013 — David Broniatowski, my colleague in CAM in Emergency Medicine, was on Sound Medicine talking about our work.
2/22/2013 — I was on Midday with Dan Rodricks. My segment starts around 25:00.
2/20/2013 — I was on the WYPR health minute [Here] and [Here].
2/14/2013 — I co-authored the report Youth Violence: What we Need to Know as part of the Subcommittee on Youth Violence of the Advisory Committee to the Social, Behavioral and Economic Sciences Directorate, National Science Foundation.
2/13/2013 — Our work is covered in a Nature News article.
2/7/2013 — Talk at Columbia in the Biomedical Informatics Department.
1/30/2013 — Our work on tracking influenza with Twitter received a lot of press, including CNN, MarketWatch TV and Michigan Radio.
1/18/2013 — My student Michael Paul has been awarded a Microsoft Research PhD Fellowship.
11/13/2012 — Talk at Georgia Tech in the Robotics and Intelligent Machines Center (RIM) Seminar Series.
10/20/2012 — I will be the NAACL 2013 area chair for machine learning with Phil Blunsom.
10/19/2012 — I am helping to organize the second annual Mid-Atlantic Student Colloquium on Speech, Language and Learning (MASC 2012).
10/11/2012 — I attended the CCC Computing and Healthcare Symposium.
5/3/2012 — Talk at the University of Pennsylvania.
4/30/2012 — Talk at the University of Delaware.
4/1/2012 — I am co-organizing an AAAI 2012 Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text.
3/29/2012 — Talk at the Applied Physics Lab.
3/12/2012 — Work featured on Twitter Stories. Excellent Video.
3/11/2012 — I am a panelist at SXSW. Notes from an audience member.
3/9/2012 — Talk at the University of Texas at Dallas.
2/16/2012 — I am a panelist at the Conference of Digital Disease Detection. A video of my talk is available online.
2/2/2012 — We hosted NACLO.
2/1/2012 — I am teaching Current Topics in Machine Learning (600.775) in the Spring of 2012. [Class site]
12/7/2011 — Talk at JHU's BioStats Seminar.
11/15/2011 — Talk at Carnegie Mellon University.
10/28/2011 — Talk at Mass General Center for Experimental Drugs and Diagnostics.
9/23/2011 — We are hosting 140 people at the first Mid-Atlantic Student Colloquium on Speech, Language and Learning.
9/12/2011 — My student Ariya Rastrow was awarded the inaugural Frederick Jelinek Fellowship.
9/1/2011 — A feature on our Twitter+Health work in The Johns Hopkins Magazine.
9/1/2011 — I will be the ACL 2012 area chair for machine learning with Trevor Cohn.
8/15/2011 — I am teaching Machine Learning (600.475) in the Fall of 2011. [Class site]
8/10/2011 — I was on NPR. What fun! Midday with Dan Rodricks [Audio]. Also, see the WYPR health minute [Here] and [Here].
7/16/2011 — Our health mining research received some press coverage: great articles by The Atlantic and NPR; also see CBS News, Wall Street Journal (blog), Huffington Post, BBC, GigaOM. We've also been on TV! WJLA and PressHereTV.
6/30/2011 — Talk at the CLSP summer school. [Video]
6/2/2011 — Congratulations to Carolina Parada, who successfully defended her thesis! She is off to Google Research.
5/23/2011 — Giving a talk at the Applied Physics Lab on entity disambiguation.
4/29/2011 — Giving a talk at Health Informatics Seminar. [Video]
2/3/2011 — We served as one of the largest NACLO host sites.
1/30/2011 — I am teaching a new seminar on Machine Learning.
7/1/2010 — I organized a workshop on Amazon Mechanical Turk at NAACL 2010. All of the data is available: Check it out. We received some press coverage on the workshop.
6/15/2010 — My student Carolina Parada was awarded the 2010 Google Graduate Fellowship in Speech.


Publications


     2013 (7 Publications)
      Justin Snyder, Rebecca Knowles, Mark Dredze, Matthew R. Gormley, Travis Wolfe. Topic Models and Metadata for Visualizing Text Corpora. North American Chapter of the Association for Computational Linguistics (NAACL) (Demo Paper), 2013.
Effectively exploring and analyzing large text corpora requires visualizations that provide a high level summary. Past work has relied on faceted browsing of document metadata or on natural language processing of document text. In this paper, we present a new web-based tool that integrates topics learned from an unsupervised topic model in a faceted browsing experience. The user can manage topics, filter documents by topic and summarize views with metadata and topic graphs. We report a user study of the usefulness of topics in our tool.
 
      Damianos Karakos, Mark Dredze, Sanjeev Khudanpur. Estimating Confusions in the ASR Channel for Improved Topic-based Language Model Adaptation. Technical Report 8, Johns Hopkins University, 2013. [PDF]
 
      Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, David Yarowsky. Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
 
      Mahesh Joshi, Mark Dredze, William W. Cohen, Carolyn P. Rose. What's in a Domain? Multi-Domain Learning for Multi-Attribute Data. North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
 
      Alex Lamb, Michael Paul, Mark Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
 
      Michael Paul, Mark Dredze. Drug Extraction from the Web: Summarizing Drug Experiences with Multi-Dimensional Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2013.
 
      Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Machine Learning, 2013. [PDF]
 

     2012 (16 Publications)
      Mark Dredze. How Social Media Will Change Public Health. IEEE Intelligent Systems, 2012. [Link]
 
      Michael J. Paul, Mark Dredze. Factorial LDA: Sparse Multi-Dimensional Text Models. Neural Information Processing Systems (NIPS), 2012.
 
      Alex Lamb, Michael J. Paul, Mark Dredze. Investigating Twitter as a Source for Studying Behavioral Responses to Epidemics. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF]
We present preliminary results for mining concerned awareness of influenza tweets. We describe our data set construction and experiments with binary classification of data into influenza versus general messages and classification into concerned awareness and existing infection.
 
      Atul Nakhasi, Ralph J Passarella, Sarah G Bell, Michael J Paul, Mark Dredze, Peter J Pronovost. Malpractice and Malcontent: Analyzing Medical Complaints in Twitter. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF]
In this paper we report preliminary results from a study of Twitter to identify patient safety reports, which offer an immediate, untainted, and expansive patient perspective unlike any other mechanism to date for this topic. We identify patient safety related tweets and characterize them by which medical populations caused errors, who reported these errors, what types of errors occurred, and what emotional states were expressed in response. Our long term goal is to improve the handling and reduction of errors by incorporating this patient input into the patient safety process.
 
      Michael J. Paul, Mark Dredze. Experimenting with Drugs (and Topic Models): Multi-Dimensional Exploration of Recreational Drug Discussions. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF]
Clinical research of new recreational drugs and trends requires mining current information from non-traditional text sources. In this work we support such research through the use of a multi-dimensional latent text model -- factorial LDA -- that captures orthogonal factors of corpora, creating structured output for researchers to better understand the contents of a corpus. Since a purely unsupervised model is unlikely to discover specific factors of interest to clinical researchers, we modify the structure of factorial LDA to incorporate prior knowledge, including the use of observed variables, informative priors and background components. The resulting model learns factors that correspond to drug type, delivery method (smoking, injection, etc.), and aspect (chemistry, culture, effects, health, usage). We demonstrate that the improved model yields better quantitative and more interpretable results.
 
      Ralph Passarella, Atul Nakhasi, Sarah Bell, Michael J. Paul, Peter Pronovost, Mark Dredze. Twitter as a Source for Learning about Patient Safety Events. Annual Symposium of the American Medical Informatics Association (AMIA), 2012.
 
      Damianos Karakos, Brian Roark, Izhak Shafran, Kenji Sagae, Maider Lehr, Emily Prud'hommeaux, Puyang Xu, Nathan Glenn, Sanjeev Khudanpur, Murat Saraclar, Dan Bikel, Mark Dredze, Chris Callison-Burch, Yuan Cao, Keith Hall, Eva Hasler, Philipp Koehn, Adam Lopez, Matt Post, Darcey Riley. Deriving conversation-based features from unlabeled speech for discriminative language modeling. International Speech Communication Association (INTERSPEECH), 2012. [PDF]
The perceptron algorithm was used in [1] to estimate discriminative language models which correct errors in the output of ASR systems. In its simplest version, the algorithm simply increases the weight of n-gram features which appear in the correct (oracle) hypothesis and decreases the weight of n-gram features which appear in the 1-best hypothesis. In this paper, we show that the perceptron algorithm can be successfully used in a semi-supervised learning (SSL) framework, where limited amounts of labeled data are available. Our framework has some similarities to graph-based label propagation [2] in the sense that a graph is built based on proximity of unlabeled conversations, and then it is used to propagate confidences (in the form of features) to the labeled data, based on which perceptron trains a discriminative model. The novelty of our approach lies in the fact that the confidence "flows" from the unlabeled data to the labeled data, and not vice-versa, as is done traditionally in SSL. Experiments conducted at the 2011 CLSP Summer Workshop on the conversational telephone speech corpora Dev04f and Eval04f demonstrate the effectiveness of the proposed approach.
 
      Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Efficient Structured Language Modeling for Speech Recognition. International Speech Communication Association (INTERSPEECH), 2012. [PDF]
The structured language model (SLM) of [1] was one of the first to successfully integrate syntactic structure into language models. We extend the SLM framework in two new directions. First, we propose a new syntactic hierarchical interpolation that improves over previous approaches. Second, we develop a general information-theoretic algorithm for pruning the underlying Jelinek-Mercer interpolated LM used in [1], which substantially reduces the size of the LM, enabling us to train on large data. When combined with hill-climbing [2] the SLM is an accurate model, space-efficient and fast for rescoring large speech lattices. Experimental results on broadcast news demonstrate that the SLM outperforms a large 4-gram LM.
 
      Nicholas Andrews, Jason Eisner, Mark Dredze. Name Phylogeny: A Generative Model of String Variation. Empirical Methods in Natural Language Processing (EMNLP), 2012. [PDF]
Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
 
      Mahesh Joshi, Mark Dredze, William Cohen, Carolyn Rose. Multi-Domain Learning: When Do Domains Matter?. Empirical Methods in Natural Language Processing (EMNLP), 2012. [PDF]
We present a systematic analysis of existing multi-domain learning approaches with respect to two questions. First, many multi-domain learning algorithms resemble ensemble learning algorithms. (1) Are multi-domain learning improvements the result of ensemble learning effects? Second, these algorithms are traditionally evaluated in a balanced label setting, although in practice many multi-domain settings have domain-specific label biases. When multi-domain learning is applied to these settings, (2) are multi-domain methods improving because they capture domain-specific class biases? An understanding of these two issues presents a clearer idea about where the field has had success in multi-domain learning, and it suggests some important open questions for improving beyond the current state of the art.
 
      Ariya Rastrow, Sanjeev Khudanpur, Mark Dredze. Revisiting the Case for Explicit Syntactic Information in Language Models. NAACL Workshop on the Future of Language Modeling for HLT, 2012. [PDF]
Statistical language models used in deployed systems for speech recognition, machine translation and other human language technologies are almost exclusively n-gram models. They are regarded as linguistically naive, but estimating them from any amount of text, large or small, is straightforward. Furthermore, they have doggedly matched or outperformed numerous competing proposals for syntactically well-motivated models. This unusual resilience of n-grams, as well as their weaknesses, are examined here. It is demonstrated that n-grams are good word-predictors, even linguistically speaking, in a large majority of word-positions, and it is suggested that to improve over n-grams, one must explore syntax-aware (or other) language models that focus on positions where n-grams are weak.
 
      Spence Green, Nicholas Andrews, Matthew Gormley, Mark Dredze, Christopher D. Manning. Entity Clustering Across Languages. North American Chapter of the Association for Computational Linguistics (NAACL), 2012. [PDF]
Standard entity clustering systems commonly rely on mention (string) matching, syntactic features, and linguistic resources like English WordNet. When co-referent text mentions appear in different languages, these techniques cannot be easily applied. Consequently, we develop new methods for clustering text mentions across documents and languages simultaneously, producing cross-lingual entity clusters. Our approach extends standard clustering algorithms with cross-lingual mention and context similarity measures. Crucially, we do not assume a pre-existing entity list (knowledge base), so entity characteristics are unknown. On an Arabic-English corpus that contains seven different text genres, our best model yields a 24.3% F1 gain over the baseline.
 
      Matthew R. Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner. Shared Components Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2012. [PDF]
With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters.
 
      Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining. Association for Computational Linguistics (ACL), 2012. [PDF]
Long-span features, such as syntax, can improve language models for tasks such as speech recognition and machine translation. However, these language models can be difficult to use in practice because of the time required to generate features for rescoring a large hypothesis set. In this work, we propose substructure sharing, which saves duplicate work in processing hypothesis sets with redundant hypothesis structures. We apply substructure sharing to a dependency parser and part of speech tagger to obtain significant speedups, and further improve the accuracy of these tools through up-training. When using these improved tools in a language model for speech recognition, we obtain significant speed improvements with both N-best and hill climbing rescoring, and show that up-training leads to WER reduction.
 
      Koby Crammer, Alex Kulesza, Mark Dredze. New H-∞ Bounds for the Recursive Least Squares Algorithm Exploiting Input Structure. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012. [PDF]
The well known recursive least squares (RLS) algorithm has been widely used for many years. Most analyses of RLS have assumed statistical properties of the data or noise process, but recent robust H∞ analyses have been used to bound the ratio of the performance of the algorithm to the total noise. In this paper, we provide the first additive analysis bounding the difference between performance and noise. Our analysis provides additional convergence guarantees in general, and particularly with structured input data. We illustrate the analysis using human speech and white noise.
 
      Koby Crammer, Mark Dredze, Fernando Pereira. Confidence-Weighted Linear Classification for Text Categorization. Journal of Machine Learning Research (JMLR), 2012.
Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of text-categorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing.
 

     2011 (12 Publications)
      Spence Green, Nicholas Andrews, Matthew R. Gormley, Mark Dredze, Christopher Manning. Cross-lingual Coreference Resolution: A New Task for Multilingual Comparable Corpora. Technical Report 6, Human Language Technology Center of Excellence, Johns Hopkins University, 2011.
 
      Matthew R. Gormley, Mark Dredze, Benjamin Van Durme, Jason Eisner. Shared Components Topic Models with Application to Selectional Preference. NIPS Workshop on Learning Semantics, 2011.
 
      Damianos Karakos, Mark Dredze, Kenneth Church, Aren Jansen, Sanjeev Khudanpur. Estimating Document Frequencies in a Speech Corpus. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF]
Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition. Whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. There is an opportunity for the ensemble to reduce noise, assuming that the errors across language models are relatively uncorrelated. The opportunity for improvement is larger when WER is high. This paper considers a pairing task application that could benefit from improved estimates of df. The pairing task inputs conversational sides from the English Fisher corpus and outputs estimates of which sides were from the same conversation. Better estimates of df lead to better performance on this task.
 
      Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Adapting N-Gram Maximum Entropy Language Models with Conditional Entropy Regularization. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF]
Accurate estimates of language model parameters are critical for building quality text generation systems, such as automatic speech recognition. However, text training data for a domain of interest is often unavailable. Instead, we use semi-supervised model adaptation; parameters are estimated using both unlabeled in-domain data (raw speech audio) and labeled out-of-domain data (text). In this work, we present a new semi-supervised language model adaptation procedure for Maximum Entropy models with n-gram features. We augment the conventional maximum likelihood training criterion on out-of-domain text data with an additional term to minimize conditional entropy on in-domain audio. Additionally, we demonstrate how to compute conditional entropy efficiently on speech lattices using first- and second-order expectation semirings. We demonstrate improvements in terms of word error rate over other adaptation techniques when adapting a maximum entropy language model from broadcast news to MIT lectures.
 
      Ariya Rastrow, Mark Dredze, Sanjeev Khudanpur. Efficient Discriminative Training of Long-Span Language Models. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011. [PDF]
Long-span language models, such as those involving syntactic dependencies, produce more coherent text than their n-gram counterparts. However, evaluating the large number of sentence-hypotheses in a packed representation such as an ASR lattice is intractable under such long-span models both during decoding and discriminative training. The accepted compromise is to rescore only the N-best hypotheses in the lattice using the long-span LM. We present discriminative hill climbing, an efficient and effective discriminative training procedure for long-span LMs based on a hill climbing rescoring algorithm. We empirically demonstrate significant computational savings as well as error-rate reduction over N-best training methods in a state of the art ASR system for Broadcast News transcription.
 
      Delip Rao, Paul McNamee, Mark Dredze. Entity Linking: Finding Extracted Entities in a Knowledge Base. Multi-source, Multi-lingual Information Extraction and Summarization, 2011.
In the menagerie of tasks for information extraction, entity linking is a new beast that has drawn a lot of attention from NLP practitioners and researchers recently. Entity Linking, also referred to as record linkage or entity resolution, involves aligning a textual mention of a named-entity to an appropriate entry in a knowledge base, which may or may not contain the entity. This has manifold applications ranging from linking patient health records to maintaining personal credit files, prevention of identity crimes, and supporting law enforcement. We discuss the key challenges present in this task and we present a high-performing system that links entities using max-margin ranking. We also summarize recent work in this area and describe several open research problems.
 
      Ann Irvine, Mark Dredze, Geraldine Legendre, Paul Smolensky. Optimality Theory Syntax Learnability: An Empirical Exploration of the Perceptron and GLA. CogSci Workshop on OT as a General Cognitive Architecture, 2011.
 
      Carolina Parada, Mark Dredze, Frederick Jelinek. OOV Sensitive Named-Entity Recognition in Speech. International Speech Communication Association (INTERSPEECH), 2011.
Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named entities and always produce transcription errors. In this work, we improve speech NER by including features indicative of OOVs based on an OOV detector, allowing for the identification of regions of speech containing named entities, even if they are incorrectly transcribed. We construct a new speech NER data set and demonstrate significant improvements for this task.
 
      Michael J. Paul, Mark Dredze. A Model for Mining Public Health Topics from Twitter. Technical Report -, Johns Hopkins University, 2011. [PDF]
We present the Ailment Topic Aspect Model (ATAM), a new topic model for Twitter that associates symptoms, treatments and general words with diseases (ailments). We train ATAM on a new collection of 1.6 million tweets discussing numerous health related topics. ATAM isolates more coherent ailments, such as influenza, infections, obesity, as compared to standard topic models. Furthermore, ATAM matches influenza tracking results produced by Google Flu Trends and previous influenza specialized Twitter models compared with government public health data.
 
      Michael J. Paul, Mark Dredze. You Are What You Tweet: Analyzing Twitter for Public Health. International Conference on Weblogs and Social Media (ICWSM), 2011. [PDF]
Analyzing user messages in social media can measure different population characteristics, including public health measures. For example, recent work has correlated Twitter messages with influenza rates in the United States; but this has largely been the extent of mining Twitter for public health. In this work, we consider a broader range of public health applications for Twitter. We apply the recently introduced Ailment Topic Aspect Model to over one and a half million health related tweets and discover mentions of over a dozen ailments, including allergies, obesity and insomnia. We introduce extensions to incorporate prior knowledge into this model and apply it to several tasks: tracking illnesses over time (syndromic surveillance), measuring behavioral risk factors, localizing illnesses by geographic region, and analyzing symptoms and medication usage. We show quantitative correlations with public health data and qualitative evaluations of model output. Our results suggest that Twitter has broad applicability for public health research.
 
      Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow. Learning Sub-Word Units for Open Vocabulary Speech Recognition. Association for Computational Linguistics (ACL), 2011. [PDF]
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of sub-word units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the sub-word lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on English Broadcast News and MIT Lectures tasks, respectively.
 
Ariya Rastrow, Markus Dreyer, Abhinav Sethy, Sanjeev Khudanpur, Bhuvana Ramabhadran, Mark Dredze. Hill Climbing on Speech Lattices: A New Rescoring Framework. International Conference on Acoustics, Speech and Signal Processing, 2011. [PDF]
We describe a new approach for rescoring speech lattices - with long-span language models or wide-context acoustic models - that does not entail computationally intensive lattice expansion or limited rescoring of only an N-best list. We view the set of word-sequences in a lattice as a discrete space equipped with the edit-distance metric, and develop a hill climbing technique to start with, say, the 1-best hypothesis under the lattice-generating model(s) and iteratively search a local neighborhood for the highest-scoring hypothesis under the rescoring model(s); such neighborhoods are efficiently constructed via finite state techniques. We demonstrate empirically that to achieve the same reduction in error rate using a better estimated, higher order language model, our technique evaluates fewer utterance-length hypotheses than conventional N-best rescoring by two orders of magnitude. For the same number of hypotheses evaluated, our technique results in a significantly lower error rate.
 

     2010 (12 Publications)
Mark Dredze, Aren Jansen, Glen Coppersmith, Kenneth Church. NLP on Spoken Documents without ASR. Empirical Methods in Natural Language Processing (EMNLP), 2010. [PDF]
There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (~1 sec) repetitions in speech, and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.
 
Mark Dredze, Tim Oates, Christine Piatko. We're Not in Kansas Anymore: Detecting Domain Changes in Streams. Empirical Methods in Natural Language Processing (EMNLP), 2010. [PDF]
Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention -- detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses A-distance, a metric for detecting shifts in data streams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.
 
Carolina Parada, Abhinav Sethy, Mark Dredze, Fred Jelinek. A Spoken Term Detection Framework for Recovering Out-of-Vocabulary Words Using the Web. International Speech Communication Association (INTERSPEECH), 2010. [PDF]
Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into system output, recovering up to 40% of OOVs and resulting in a reduction in system error.
 
Delip Rao, Paul McNamee, Mark Dredze. Streaming Cross Document Entity Coreference Resolution. Conference on Computational Linguistics (Coling), 2010. [PDF]
Previous research in cross-document entity coreference has generally been restricted to the offline scenario where the set of documents is provided in advance. As a consequence, the dominant approach is based on greedy agglomerative clustering techniques that utilize pairwise vector comparisons and thus require O(n^2) space and time. In this paper we explore identifying coreferent entity mentions across documents in high-volume streaming text, including methods for utilizing orthographic and contextual information. We test our methods using several corpora to quantitatively measure both the efficacy and scalability of our streaming approach. We show that our approach scales to at least an order of magnitude larger data than previous reported methods.
 
Mark Dredze, Paul McNamee, Delip Rao, Adam Gerber, Tim Finin. Entity Disambiguation for Knowledge Base Population. Conference on Computational Linguistics (Coling), 2010. [PDF]
The integration of facts derived from information extraction systems into existing knowledge bases requires a system to disambiguate entity mentions in the text. This is challenging due to issues such as non-uniform variations in entity names, mention ambiguity, and entities absent from a knowledge base. We present a state of the art system for entity disambiguation that not only addresses these challenges but also scales to knowledge bases with several million entries using very little resources. Further, our approach achieves performance of up to 95% on entities mentioned from newswire and 80% on a public test set that was designed to include challenging queries.
 
Chris Callison-Burch, Mark Dredze. Creating Speech and Language Data With Amazon's Mechanical Turk. Workshop on Creating Speech and Language Data With Mechanical Turk at NAACL-HLT, 2010. [PDF]
In this paper we give an introduction to using Amazon's Mechanical Turk crowdsourcing platform for the purpose of collecting data for human language technologies. We survey the papers published in the NAACL-2010 Workshop. 24 researchers participated in the workshop's shared task to create data for speech and language applications with $100.
 
Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, Mark Dredze. Annotating named entities in Twitter data with crowdsourcing. Workshop on Creating Speech and Language Data With Mechanical Turk at NAACL-HLT, 2010. [PDF]
We describe our experience using both Amazon Mechanical Turk (MTurk) and CrowdFlower to collect simple named entity annotations for Twitter status updates. Unlike most genres that have traditionally been the focus of named entity experiments, Twitter is far more informal and abbreviated. The collected annotations and annotation techniques will provide a first step towards the full study of named entity recognition in domains like Facebook and Twitter. We also briefly describe how to use MTurk to collect judgements on the quality of "word clouds."
 
Matthew R. Gormley, Adam Gerber, Mary Harper, Mark Dredze. Non-Expert Correction of Automatically Generated Relation Annotations. Workshop on Creating Speech and Language Data With Mechanical Turk at NAACL-HLT, 2010. [PDF]
We explore a new way to collect human annotated relations in text using Amazon Mechanical Turk. Given a knowledge base of relations and a corpus, we identify sentences which mention both an entity and an attribute that have some relation in the knowledge base. Each noisy sentence/relation pair is presented to multiple turkers, who are asked whether the sentence expresses the relation. We describe a design which encourages user efficiency and aids discovery of cheating. We also present results on inter-annotator agreement.
 
Courtney Napoles, Mark Dredze. Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language. Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids at NAACL-HLT 2010, 2010. [PDF]
Text simplification is the process of changing vocabulary and grammatical structure to create a more accessible version of the text while maintaining the underlying information and content. Automated tools for text simplification are a practical way to make large corpora of text accessible to a wider audience lacking high levels of fluency in the corpus language. In this work, we investigate the potential of Simple Wikipedia to assist automatic text simplification by building a statistical classification system that discriminates simple English from ordinary English. Most text simplification systems are based on hand-written rules (e.g., PEST and its module SYSTAR), and therefore face limitations scaling and transferring across domains. The potential for using Simple Wikipedia for text simplification is significant; it contains nearly 60,000 articles with revision histories and aligned articles to ordinary English Wikipedia. Using articles from Simple Wikipedia and ordinary Wikipedia, we evaluated different classifiers and feature sets to identify the most discriminative features of simple English for use across domains. These findings help further understanding of what makes text simple and can be applied as a tool to help writers craft simple text.
 
Justin Ma, Alex Kulesza, Koby Crammer, Mark Dredze, Lawrence Saul, Fernando Pereira. Exploiting Feature Covariance in High-Dimensional Online Learning. AIStats, 2010. [PDF]
Some online algorithms for linear classification model the uncertainty in their weights over the course of learning. Modeling the full covariance structure of the weights can provide a significant advantage for classification. However, for high-dimensional, large-scale data, even though there may be many second-order feature interactions, it is computationally infeasible to maintain this covariance structure. To extend second-order methods to high-dimensional data, we develop low-rank approximations of the covariance structure. We evaluate our approach on both synthetic and real-world data sets using the confidence-weighted online learning framework. We show improvements over diagonal covariance matrices for both low and high-dimensional data.
 
Carolina Parada, Mark Dredze, Denis Filimonov, Fred Jelinek. Contextual Information Improves OOV Detection in Speech. North American Chapter of the Association for Computational Linguistics (NAACL), 2010. [PDF]
Out-of-vocabulary (OOV) words represent an important source of error in large vocabulary continuous speech recognition (LVCSR) systems. These words cause recognition failures, which propagate through pipeline systems impacting the performance of downstream applications. The detection of OOV regions in the output of a LVCSR system is typically addressed as a binary classification task, where each region is independently classified using local information. In this paper, we show that jointly predicting OOV regions, and including contextual information from each region, leads to substantial improvement in OOV detection. Compared to the state-of-the-art, we reduce the missed OOV rate from 42.6% to 28.4% at 10% false alarm rate.
 
Mark Dredze, Alex Kulesza, Koby Crammer. Multi-Domain Learning by Confidence-Weighted Parameter Combination. Machine Learning, 2010. [PDF] [Tech Report]
State-of-the-art statistical NLP systems for a variety of tasks learn from labeled training data that is often domain specific. However, there may be multiple domains or sources of interest on which the system must perform. For example, a spam filtering system must give high quality predictions for many users, each of whom receives emails from different sources and may make slightly different decisions about what is or is not spam. Rather than learning separate models for each domain, we explore systems that learn across multiple domains. We develop a new multi-domain online learning framework based on parameter combination from multiple classifiers. Our algorithms draw from multi-task learning and domain adaptation to adapt multiple source domain classifiers to a new target domain, learn across multiple similar domains, and learn across a large number of disparate domains. We evaluate our algorithms on two popular NLP domain adaptation tasks: sentiment classification and spam filtering.
 

     2009 (6 Publications)
      Mark Dredze. Intelligent Email: Aiding Users with AI. PhD Thesis, Computer and Information Science, University of Pennsylvania, 2009. [PDF]
 
      Paul McNamee, Mark Dredze, Adam Gerber, Nikesh Garera, Tim Finin, James Mayfield, Christine Piatko, Delip Rao, David Yarowsky, Markus Dreyer. HLTCOE Approaches to Knowledge Base Population at TAC 2009. Text Analysis Conference (TAC), 2009.
 
Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Advances in Neural Information Processing Systems (NIPS), 2009. [PDF]
We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.
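The AROW update described in the abstract maintains a Gaussian over weight vectors (mean and covariance) and adapts its regularization per example. The sketch below is a minimal binary-classification reading of that update; the variable names and the identity-initialized covariance are our illustrative choices, not code from the paper.

```python
import numpy as np

class AROW:
    """Minimal sketch of AROW for binary labels y in {-1, +1}.

    Maintains a Gaussian over weights: mean `mu` and covariance `sigma`.
    `r` is the regularization parameter controlling update aggressiveness.
    """

    def __init__(self, n_features, r=1.0):
        self.mu = np.zeros(n_features)   # mean weight vector
        self.sigma = np.eye(n_features)  # covariance (per-weight confidence)
        self.r = r

    def update(self, x, y):
        """One online update on example (x, y)."""
        margin = y * self.mu.dot(x)      # signed margin under the mean
        if margin >= 1.0:                # confidently correct: no update
            return
        sx = self.sigma.dot(x)
        v = x.dot(sx)                    # variance of the margin
        beta = 1.0 / (v + self.r)
        alpha = (1.0 - margin) * beta    # step scaled by hinge loss
        self.mu += alpha * y * sx        # shift mean toward correct label
        self.sigma -= beta * np.outer(sx, sx)  # shrink variance along x

    def predict(self, x):
        return 1 if self.mu.dot(x) >= 0 else -1
```

Because the covariance shrinks along directions that have been seen, frequently observed features receive smaller, more conservative updates, which is what makes the algorithm robust to label noise.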
 
Mark Dredze, Partha Pratim Talukdar, Koby Crammer. Sequence Learning from Data with Multiple Labels. ECML/PKDD Workshop on Learning from Multi-Label Data, 2009. [PDF]
We present novel algorithms for learning structured predictors from instances with multiple labels in the presence of noise. The proposed algorithms improve performance on two standard NLP tasks when we have a small amount of training data (low quantity) and when the labels are noisy (low quality). In these settings, the methods improve performance over using a single label, in some cases exceeding performance using gold labels. Our methods could be used in a semi-supervised setting, where a limited amount of labeled data could be combined with a rule based automatic labeling of unlabeled data with multiple possible labels.
 
Koby Crammer, Mark Dredze, Alex Kulesza. Multi-Class Confidence Weighted Algorithms. Empirical Methods in Natural Language Processing (EMNLP), 2009. [PDF]
The recently introduced online confidence-weighted (CW) learning algorithm for binary classification performs well on many binary NLP tasks. However, for multi-class problems CW learning updates and inference cannot be computed analytically or solved as convex optimization problems as they are in the binary case. We derive learning algorithms for the multi-class CW setting and provide extensive evaluation using nine NLP datasets, including three derived from the recently released New York Times corpus. Our best algorithm outperforms state-of-the-art online and batch methods on eight of the nine tasks. We also show that the confidence information maintained during learning yields useful probabilistic information at test time.
 
Mark Dredze, Bill Schilit, Peter Norvig. Suggesting Email View Filters for Triage and Search. International Joint Conference on Artificial Intelligence (IJCAI), 2009. [PDF]
Growing email volumes cause flooded inboxes and swelled email archives, making search and new email processing difficult. While emails have rich metadata, such as recipients and folders, suitable for creating filtered views, it is often difficult to choose appropriate filters for new inbox messages without first examining messages. In this work, we consider a system that automatically suggests relevant view filters to the user for the currently viewed messages. We propose several ranking algorithms for suggesting useful filters. Our work suggests that such systems quickly filter groups of inbox messages and find messages more easily during search.
 

     2008 (15 Publications)
Kevin Lerman, Ari Gilder, Mark Dredze, Fernando Pereira. Reading the Markets: Forecasting Public Opinion of Political Candidates by News Analysis. Conference on Computational Linguistics (Coling), 2008. [PDF]
Media reporting shapes public opinion which can in turn influence events, particularly in political elections, in which candidates both respond to and shape public perception of their campaigns. We use computational linguistics to automatically predict the impact of news on public perception of political candidates. Our system uses daily newspaper articles to predict shifts in public opinion as reflected in prediction markets. We discuss various types of features designed for this problem. The news system improves market prediction over baseline market systems.
 
Mark Dredze, Joel Wallenberg. Further Results and Analysis of Icelandic Part of Speech Tagging. Technical Report MS-CIS-08-13, University of Pennsylvania, Department of Computer and Information Science, 2008. [PDF]
Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions. This paper presents further results and analysis to the original work.
 
Mark Dredze, Joel Wallenberg. Icelandic Data-Driven Part of Speech Tagging. Association for Computational Linguistics (ACL), 2008. [PDF]
Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions.
 
Kuzman Ganchev, Mark Dredze. Small Statistical Models by Random Feature Mixing. Workshop on Mobile NLP at ACL, 2008. [PDF]
The application of statistical NLP systems to resource constrained devices is limited by the need to maintain parameters for a large number of features and an alphabet mapping features to parameters. We introduce random feature mixing to eliminate alphabet storage and reduce the number of parameters without severely impacting model performance.
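The core idea of random feature mixing is to replace the stored feature-to-index alphabet with a hash function, so feature strings map directly to a fixed number of parameter buckets. A minimal sketch is below; the bucket count and choice of md5 are illustrative assumptions, not the paper's exact configuration.

```python
import hashlib
from collections import defaultdict

def hash_features(features, n_buckets=2**18):
    """Map string-keyed features to a fixed bucket space.

    `features` is a dict of feature name -> value. No alphabet is stored;
    collisions mix features into the same parameter, which (per the paper)
    costs little accuracy while greatly shrinking the model.
    """
    vec = defaultdict(float)
    for feat, value in features.items():
        # Deterministic hash of the feature string into a bucket index.
        h = int(hashlib.md5(feat.encode("utf-8")).hexdigest(), 16)
        vec[h % n_buckets] += value
    return dict(vec)
```

For example, `hash_features({"w=the": 2.0, "w=cat": 1.0})` returns a small dict of bucket indices to summed values that a linear model can consume directly, with memory bounded by `n_buckets` regardless of vocabulary size.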
 
Koby Crammer, Mark Dredze, John Blitzer, Fernando Pereira. Batch Performance for an Online Price. The NIPS 2007 Workshop on Efficient Machine Learning, 2008. [PDF]
Batch learning techniques achieve good performance, but at the cost of many (sometimes even hundreds) of passes over the data. For many tasks, such as web-scale ranking of machine translation hypotheses, making many passes over the data is prohibitively expensive, even in parallel over thousands of machines. Online algorithms, which treat data as a stream of examples, are conceptually appealing for these large scale problems. In practice, however, online algorithms tend to underperform batch methods, unless they are themselves run in multiple passes over the data.

In this work we explore a new type of online learning algorithm that incorporates a measure of confidence to the algorithm. The model maintains a confidence for each parameter, reflecting previously observed properties of the data. While this requires an additional parameter for each feature of the data, this is a minimal cost when compared to running the algorithm multiple times over the data. The resulting algorithm learns faster, requiring both fewer training instances and fewer passes over the training data, often approaching batch performance with only a single pass through the data.
 
Mark Dredze, Krzysztof Czuba. Learning to Admit You're Wrong: Statistical Tools for Evaluating Web QA. The NIPS 2007 Workshop on Machine Learning for Web Search, 2008. [PDF]
Web search engines provide specialized results to specific queries, often relying on the output of a QA system. However, targeted answers, while helpful, are embarrassing when wrong. Automated techniques are required to avoid wrong answers and improve system performance. We present the Expected Answer System, a statistical data-driven framework that analyzes the performance of a QA system with the goal of improving system accuracy. Our system is used for wrong answer prediction, missing answer discovery, and question class analysis. An empirical study of a production QA system, one of the first such evaluations presented in the literature, motivates our approach.
 
Kedar Bellare, Partha Pratim Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman, Andrew McCallum, Mark Dredze. Lightly-Supervised Attribute Extraction for Web Search. The NIPS 2007 Workshop on Machine Learning for Web Search, 2008. [PDF]
Web search engines can greatly benefit from knowledge about attributes of entities present in search queries. In this paper, we introduce lightly-supervised methods for extracting entity attributes from natural language text. Using these methods, we are able to extract large numbers of attributes of different entities at fairly high precision from a large natural language corpus. We compare our methods against a previously proposed pattern-based relation extractor, showing that the new methods give considerable improvements over that baseline. We also demonstrate that query expansion using extracted attributes improves retrieval performance on underspecified information-seeking queries.
 
Mark Dredze, Hanna Wallach, Danny Puller, Fernando Pereira. Generating Summary Keywords for Emails Using Topics. Intelligent User Interfaces (IUI), 2008. [PDF]
Email summary keywords, used to concisely represent the gist of an email, can help users manage and prioritize large numbers of messages. We develop an unsupervised learning framework for selecting summary keywords from emails using latent representations of the underlying topics in a user's mailbox. This approach selects words that describe each message in the context of existing topics rather than simply selecting keywords based on a single message in isolation. We present and compare four methods for selecting summary keywords based on two well-known models for inferring latent topics: latent semantic analysis and latent Dirichlet allocation. The quality of the summary keywords is assessed by generating summaries for emails from twelve users in the Enron corpus. The summary keywords are then used in place of entire messages in two proxy tasks: automated foldering and recipient prediction. We also evaluate the extent to which summary keywords enhance the information already available in a typical email user interface by repeating the same tasks using email subject lines.
 
Mark Dredze, Hanna Wallach, Danny Puller, Tova Brooks, Josh Carroll, Joshua Magarick, John Blitzer, Fernando Pereira. Intelligent Email: Aiding Users with AI. American National Conference on Artificial Intelligence (AAAI), NECTAR Paper, 2008. [PDF]
Email occupies a central role in the modern workplace. This has led to a vast increase in the number of email messages that users are expected to handle daily. Furthermore, email is no longer simply a tool for asynchronous online communication - email is now used for task management and personal archiving, as well as both synchronous and asynchronous online communication. This explosion can lead to "email overload" - many users are overwhelmed by the large quantity of information in their mailboxes. In the human-computer interaction community, there has been much research on tackling email overload. Recently, similar efforts have emerged in the artificial intelligence (AI) and machine learning communities to form an area of research known as intelligent email.

In this paper, we take a user-oriented approach to applying AI to email. We identify enhancements to email user interfaces and employ machine learning techniques to support these changes. We focus on three tasks - summary keyword generation, reply prediction and attachment prediction - and summarize recent work in these areas.
 
Mark Dredze, Tova Brooks, Josh Carroll, Joshua Magarick, John Blitzer, Fernando Pereira. Intelligent Email: Reply and Attachment Prediction. Intelligent User Interfaces (IUI), 2008. [PDF]
We present two prediction problems under the rubric of Intelligent Email that are designed to support enhanced email interfaces that relieve the stress of email overload. Reply prediction alerts users when an email requires a response and facilitates email response management. Attachment prediction alerts users when they are about to send an email missing an attachment or triggers a document recommendation system, which can catch missing attachment emails before they are sent. Both problems use the same underlying email classification system and task specific features. Each task is evaluated for both single-user and cross-user settings.
 
Mark Dredze, Hanna Wallach. User Models for Email Activity Management. IUI Workshop on Ubiquitous User Modeling, 2008. [PDF]
A single user activity, such as planning a conference trip, typically involves multiple actions. Although these actions may involve several applications, the central point of co-ordination for any particular activity is usually email. Previous work on email activity management has focused on clustering emails by activity. Dredze et al. accomplished this by combining supervised classifiers based on document similarity, authors and recipients, and thread information. In this paper, we take a different approach and present an unsupervised framework for email activity clustering. We use the same information sources as Dredze et al. - namely, document similarity, message recipients and authors, and thread information - but combine them to form an unsupervised, non-parametric Bayesian user model. This approach enables email activities to be inferred without any user input. Inferring activities from a user's mailbox adapts the model to that user. We next describe the statistical machinery that forms the basis of our user model, and explain how several email properties may be incorporated into the model. We evaluate this approach using the same data as Dredze et al., showing that our model does well at clustering emails by activity.
 
Koby Crammer, Mark Dredze, Fernando Pereira. Exact Convex Confidence-Weighted Learning. Advances in Neural Information Processing Systems (NIPS), 2008. [PDF]
Confidence-weighted (CW) learning, an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods.
 
Mark Dredze, Koby Crammer, Fernando Pereira. Confidence-Weighted Linear Classification. International Conference on Machine Learning (ICML), 2008. [PDF]
We introduce confidence-weighted linear classifiers, which add parameter confidence information to linear classifiers. Online learners in this setting update both classifier parameters and the estimate of their confidence. The particular online algorithms we study here maintain a Gaussian distribution over parameter vectors and update the mean and covariance of the distribution with each instance. Empirical evaluation on a range of NLP tasks show that our algorithm improves over other state of the art online and batch methods, learns faster in the online setting, and lends itself to better classifier combination after parallel training.
 
Mark Dredze, Koby Crammer. Active Learning with Confidence. Association for Computational Linguistics, 2008. [PDF]
Active learning is a machine learning approach to achieving high accuracy with a small amount of labels by letting the learning algorithm choose instances to be labeled. Most previous approaches based on discriminative learning use the margin for choosing instances. We present a method for incorporating confidence into the margin by using a newly introduced online learning algorithm and show empirically that confidence improves active learning.
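The margin-based selection the abstract builds on can be sketched in a few lines: rank the unlabeled pool by how close each example falls to the current decision boundary and query the closest ones. This shows only the plain margin baseline, not the paper's confidence-weighted refinement; the function name and `k` parameter are ours.

```python
import numpy as np

def select_queries(weights, pool, k=5):
    """Margin-based active learning selection (the baseline criterion).

    weights: current linear model's weight vector, shape (d,)
    pool:    unlabeled candidate examples, shape (n, d)
    Returns the indices of the k examples with the smallest |w.x|,
    i.e. those the model is least certain about.
    """
    margins = np.abs(pool.dot(weights))  # distance-like score per candidate
    order = np.argsort(margins)          # least confident first
    return order[:k]
```

The paper's contribution replaces this raw margin with a confidence-adjusted margin from a CW-style learner, so that examples touching poorly-estimated features are also prioritized.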
 
Mark Dredze, Koby Crammer. Online Methods for Multi-Domain Learning and Adaptation. Empirical Methods in Natural Language Processing (EMNLP), 2008. [PDF]
NLP tasks are often domain specific, yet systems can learn behaviors across multiple domains. We develop a new multi-domain online learning framework based on parameter combination from multiple classifiers. Our algorithms draw from multi-task learning and domain adaptation to adapt multiple source domain classifiers to a new target domain, learn across multiple similar domains, and learn across a large number of disparate domains. We evaluate our algorithms on two popular NLP domain adaptation tasks: sentiment classification and spam filtering.
 

     2007 (8 Publications)
      John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. North East Student Colloquium on Artificial Intelligence (NESCAI), 2007.
 
      Danny Puller, Hanna Wallach, Mark Dredze, Fernando Pereira. Generating Summary Keywords for Emails Using Topics. Women in Machine Learning Workshop (WiML) at Grace Hopper, 2007.
 
Neal Parikh, Mark Dredze. Graphical Models for Primarily Unsupervised Sequence Labeling. Technical Report MS-CIS-07-18, University of Pennsylvania, Department of Computer and Information Science, 2007. [PDF]
Most models used in natural language processing must be trained on large corpora of labeled text. This tutorial explores a 'primarily unsupervised' approach (based on graphical models) that augments a corpus of unlabeled text with some form of prior domain knowledge, but does not require any fully labeled examples. We survey probabilistic graphical models for (supervised) classification and sequence labeling and then present the prototype-driven approach of Haghighi and Klein (2006) to sequence labeling in detail, including a discussion of the theory and implementation of both conditional random fields and prototype learning. We show experimental results for English part of speech tagging.
 
Mark Dredze, Reuven Gevaryahu, Ari Elias-Bachrach. Learning Fast Classifiers for Image Spam. Conference on Email and Anti-Spam (CEAS), 2007. [PDF] [Data]
Recently, spammers have proliferated image spam, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they accurately classify spam images in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high accuracy features and a method to learn fast classifiers.
 
      Koby Crammer, Mark Dredze, Kuzman Ganchev, Partha Pratim Talukdar, Steven Carroll. Automatic Code Assignment to Medical Text. BioNLP Workshop at ACL, 2007. [PDF]
Code assignment is important for handling large amounts of electronic medical data in the modern hospital. However, only expert annotators with extensive training can assign codes. We present a system for the assignment of ICD-9-CM clinical codes to free text radiology reports. Our system assigns a code configuration, predicting one or more codes for each document. We combine three coding systems into a single learning system for higher accuracy. We compare our system on a real world medical dataset with both human annotators and other automated systems, achieving nearly the maximum score on the Computational Medicine Center's challenge.
 
      Mark Dredze, Hanna M. Wallach. Email Keyword Summarization and Visualization with Topic Models. North East Student Colloquium on Artificial Intelligence (NESCAI), 2007.
 
      John Blitzer, Mark Dredze, Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. Association for Computational Linguistics (ACL), 2007. [PDF]
Automatic sentiment classification has been extensively studied and applied in recent years. However, sentiment is expressed differently in different domains, and annotating corpora for every possible domain of interest is impractical. We investigate domain adaptation for sentiment classifiers, focusing on online reviews for different types of products. First, we extend to sentiment classification the recently-proposed structural correspondence learning (SCL) algorithm, reducing the relative error due to adaptation between domains by an average of 30% over the original SCL algorithm and 46% over a supervised baseline. Second, we identify a measure of domain similarity that correlates well with the potential for adaptation of a classifier from one domain to another. This measure could for instance be used to select a small set of domains to annotate whose trained classifiers would transfer well to many other domains.
 
      Mark Dredze, John Blitzer, Partha Pratim Talukdar, Kuzman Ganchev, João Graça, Fernando Pereira. Frustratingly Hard Domain Adaptation for Parsing. CoNLL 2007 Shared Task, Conference on Computational Natural Language Learning (CoNLL), 2007. [PDF]
We describe some challenges of adaptation in the 2007 CoNLL Shared Task on Domain Adaptation. Our error analysis for this task suggests that a primary source of error is differences in annotation guidelines between treebanks. Our suspicions are supported by the observation that no team was able to improve target domain performance substantially over a state of the art baseline.
 

     2006 (4 Publications)
      Mark Dredze, John Blitzer, Koby Crammer, Fernando Pereira. Feature Design for Transfer Learning. North East Student Colloquium on Artificial Intelligence (NESCAI), 2006. [PDF]
 
      Mark Dredze, John Blitzer, Fernando Pereira. "Sorry, I Forgot the Attachment:" Email Attachment Prediction. Conference on Email and Anti-Spam (CEAS), 2006. [PDF]
The missing attachment problem is a familiar one: a missing attachment generates a wave of emails from the recipients notifying the sender of the error. We present an attachment prediction system to reduce the volume of missing-attachment mail. Our classifier could prompt an alert when an outgoing email is missing an attachment. Additionally, the system could activate an attachment recommendation system, whereby suggested documents are offered once the system determines the user is likely to include an attachment, effectively reminding the user to include the attachment. We present promising initial results and discuss implications of our work.
 
      Mark Dredze, Tessa Lau, Nicholas Kushmerick. Automatically classifying emails into activities. Intelligent User Interfaces (IUI), 2006. [PDF]
Email-based activity management systems promise to give users better tools for managing increasing volumes of email, by organizing email according to a user's activities. Current activity management systems do not automatically classify incoming messages by the activity to which they belong, instead relying on simple heuristics (such as message threads), or asking the user to manually classify incoming messages as belonging to an activity. This paper presents several algorithms for automatically recognizing emails as part of an ongoing activity. Our baseline methods are the use of message reply-to threads to determine activity membership and a naive Bayes classifier. Our SimSubset and SimOverlap algorithms compare the people involved in an activity against the recipients of each incoming message. Our SimContent algorithm uses IRR (a variant of latent semantic indexing) to classify emails into activities using similarity based on message contents. An empirical evaluation shows that each of these methods provides a significant improvement over the baseline methods. In addition, we show that a combined approach that votes the predictions of the individual methods performs better than each individual method alone.
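The people-comparison idea can be sketched with a simple set-overlap score in the spirit of SimOverlap; this is a minimal hypothetical illustration, not the paper's implementation, and the function names and example data are invented:

```python
# Illustrative sketch: score an incoming message against known activities
# by the overlap between each activity's participants and the message's
# sender/recipients, then pick the best-scoring activity.

def overlap_score(activity_people, message_people):
    """Jaccard overlap between an activity's participants and a message's
    sender plus recipients."""
    a, m = set(activity_people), set(message_people)
    if not a or not m:
        return 0.0
    return len(a & m) / len(a | m)

def classify(activities, message_people):
    """activities: dict of activity name -> set of participants.
    Returns the activity whose participants best overlap the message's."""
    return max(activities,
               key=lambda name: overlap_score(activities[name], message_people))

# Example: a message among alice, bob, and eve is assigned to "hiring".
activities = {"hiring": {"alice", "bob"}, "paper": {"carol", "dave"}}
print(classify(activities, {"alice", "bob", "eve"}))  # → hiring
```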
 
      Nicholas Kushmerick, Tessa Lau, Mark Dredze, Rinat Khoussainov. Activity-Centric Email: A Machine Learning Approach. National Conference on Artificial Intelligence (AAAI), 2006. [PDF]
 

     2005 (3 Publications)
      Rie Kubota Ando, Mark Dredze, Tong Zhang. TREC 2005 Genomics Track Experiments at IBM Watson. Text REtrieval Conference (TREC), 2005. [PDF] (Group invited talk at TREC 2005, ranked 3rd and 4th out of 53 entries)
 
      Mark Dredze, John Blitzer, Fernando Pereira. Reply Expectation Prediction for Email Management. Conference on Email and Anti-Spam (CEAS), 2005. [PDF]
We reduce email overload by addressing the problem of waiting for a reply to one's email. We predict whether sent and received emails necessitate a reply, enabling the user to both better manage his inbox and to track mail sent to others. We discuss the features used to discriminate emails, show promising initial results with a logistic regression model, and outline future directions for this work.
 
      Catalina Danis, Wendy Kellogg, Tessa Lau, Mark Dredze, Jeffrey Stylos, Nicholas Kushmerick. Managers' Email: Beyond Tasks and To-Dos. Conference on Human Factors in Computing Systems (CHI), 2005. [PDF]
In this paper, we describe preliminary findings that indicate that managers and non-managers think about their email differently. We asked three research managers and three research non-managers to sort about 250 of their own email messages into categories that "would help them to manage their work." Our analyses indicate that managers create more categories and a more differentiated category structure than non-managers. Our data also suggest that managers create "relationship-oriented" categories more often than non-managers. These results are relevant to research on "email overload" that has highlighted the use of email for activities beyond communication. In particular, our findings suggest that too strong a focus on task management may be incomplete, and that a user's organizational role has an impact on their conceptualization and likely use of email.
 

     2004 (1 Publication)
      Mark Dredze, Jeffrey Stylos, Tessa Lau, Wendy Kellogg, Catalina Danis, Nicholas Kushmerick. Taxie: Automatically identifying tasks in email. Unpublished Manuscript, 2004.
 

     2003 (1 Publication)
      Kevin Livingston, Mark Dredze, Kristian Hammond, Larry Birnbaum. Beyond Broadcast. Proceedings of the 2003 International Conference on Intelligent User Interfaces, 2003.
 


     Masters Thesis
      For my master's thesis in Jewish Studies at Yeshiva University, I wrote The Values of Traditional Judaism in Chicago. Please email me if you'd like a copy.
 

Students

Current Students

Nicholas Andrews (Co-advised with Jason Eisner)
Matt Gormley [www] (Co-advised with Jason Eisner)
Michael Paul [www] (Co-advised with Jason Eisner)
Travis Wolfe [www]
Violet (Nanyun) Peng

Former Students
Ariya Rastrow [www] (Co-advised with Sanjeev Khudanpur). ECE PhD, 2012. First Job: Amazon.
Carolina Parada [www] (Co-advised with Hynek Hermansky). ECE PhD, 2011. First Job: Google Research.


Undergraduate Projects
Project Student Year
Information Extraction from Biomedical Text Leah Hanson 2011


Previous UPenn Undergraduate Projects
Project Student
Email keyword summarization Danny Puller UPenn Summer Provost Fellowship
Sentiment classification Ian Cohen
Email Attachment Prediction Josh Magarick
Prototype Driven Learning and Graphical Models Neal Parikh
Machine Learning in Prediction Markets Ari Gilder and Kevin Lerman
Winner, Best CS Senior Design Project; Honorable Mention, Best Engineering Design Project
User Adaptation in Email Reply Prediction Tova Brooks and Josh Carroll
Formal and Informal Meeting Extraction from Email Lauren Paone

Teaching

Fall 2012: CS 600.475 Machine Learning [Class site]
Spring 2012: CS 600.775 Current Topics in Machine Learning [Class site]
Fall 2011: CS 600.475 Machine Learning [Class site]
Spring 2011: CS 600.775 Current Topics in Machine Learning [Class site]
Fall 2010: CS 600.475 Machine Learning [Class site]
Fall 2009: CS 600.475 Machine Learning [Class site]

Data/Code

I get a lot of emails asking for data or code from one of my papers. If you are wondering, the answer is yes! I try to provide both data and code so that others can reproduce or compare against my results. Sadly, lack of time means I don't always post them online, but I usually make them available if you email me.

Datasets
TAC 2009 Entity Linking (Email for data)
A collection of manually linked training examples to supplement those provided in the TAC 2009 KBP task. These are described in my COLING 2010 paper on entity linking.

Image Spam Dataset [Link]
A collection of ham and spam images taken from real user email.

Multi-Domain Sentiment Dataset [Link]
Product reviews from several different product types taken from Amazon.com.

Attachment Prediction Email (Email for data)
Enron emails annotated with attachment information and cleaned of numerous artifacts inserted by email programs.

Code
Structured Learning at Penn [Link]
This is a collection of software developed by me and others in Fernando Pereira's research group at UPenn. It is designed for a range of machine learning tasks, such as dependency parsing, structured learning, gene prediction and gene mention finding.

Confidence Weighted Learning Library (Email for code)
We have collected most of the core algorithms in the confidence weighted learning framework for release as a software library. Please email me for the code.

Colleagues

I have worked with a lot of amazing people on a wide variety of projects. Here are a few of them:

Kedar Bellare
Axel Bernal
Larry Birnbaum
John Blitzer
Koby Crammer
Krzysztof Czuba
Kris Hammond
Ryan Gabbard
Kuzman Ganchev
João Graça
David Johnson
Rie Johnson (Ando)
Alex Kulesza
Nicholas Kushmerick
Tessa Lau
Kevin Lerman
Qian Liu
Ryan McDonald
David Mimno
Peter Norvig
Fernando Pereira
Jeff Reynar
Doug Riecken
Sam Roweis
Bill Schilit
Partha Pratim Talukdar
Hanna M. Wallach
Joel Wallenberg
Casey Whitelaw
Tong Zhang

Links

AAAI 2008 Workshop on Enhanced Messaging. Email, IM, AI, HCI, and all that. Check it out.

Email Research Website