# Mark Dredze

Research Scientist, Human Language Technology Center of Excellence (HLTCOE)
Assistant Research Professor, Department of Computer Science
Center for Language and Speech Processing (CLSP)
Machine Learning Group
Center for Population Health Information Technology (CPHIT), Bloomberg Health Sciences Informatics, School of Medicine

Contact: www.cs.jhu.edu/~mdredze, www.dredze.com
Office: Stieff 181, (410) 516-6786

## Publications


2013 (9 Publications)

Mark Dredze, Michael J Paul, Shane Bergsma, Hieu Tran. Carmen: A Twitter Geolocation System with Applications to Public Health. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013. [Code]

Michael J Paul, Byron Wallace, Mark Dredze. What Affects Patient (Dis)satisfaction? Analyzing Online Doctor Ratings with a Joint Topic-Sentiment Model. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013.

Justin Snyder, Rebecca Knowles, Mark Dredze, Matthew R. Gormley, Travis Wolfe. Topic Models and Metadata for Visualizing Text Corpora. North American Chapter of the Association for Computational Linguistics (NAACL) (Demo Paper), 2013. Effectively exploring and analyzing large text corpora requires visualizations that provide a high level summary. Past work has relied on faceted browsing of document metadata or on natural language processing of document text. In this paper, we present a new web-based tool that integrates topics learned from an unsupervised topic model in a faceted browsing experience. The user can manage topics, filter documents by topic and summarize views with metadata and topic graphs. We report a user study of the usefulness of topics in our tool.

Damianos Karakos, Mark Dredze, Sanjeev Khudanpur. Estimating Confusions in the ASR Channel for Improved Topic-based Language Model Adaptation. Technical Report 8, Johns Hopkins University, 2013. [PDF]

Shane Bergsma, Mark Dredze, Benjamin Van Durme, Theresa Wilson, David Yarowsky. Broadly Improving User Classification via Communication-Based Name and Location Clustering on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013.

Mahesh Joshi, Mark Dredze, William W. Cohen, Carolyn P. Rose. What's in a Domain? Multi-Domain Learning for Multi-Attribute Data. North American Chapter of the Association for Computational Linguistics (NAACL), 2013.

Alex Lamb, Michael J. Paul, Mark Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF]

Michael Paul, Mark Dredze. Drug Extraction from the Web: Summarizing Drug Experiences with Multi-Dimensional Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF]

Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Machine Learning, 2013. [PDF]

2009 (6 Publications)

Mark Dredze. Intelligent Email: Aiding Users with AI. PhD Thesis, Computer and Information Science, University of Pennsylvania, 2009. [PDF]

Paul McNamee, Mark Dredze, Adam Gerber, Nikesh Garera, Tim Finin, James Mayfield, Christine Piatko, Delip Rao, David Yarowsky, Markus Dreyer. HLTCOE Approaches to Knowledge Base Population at TAC 2009. Text Analysis Conference (TAC), 2009.

Koby Crammer, Alex Kulesza, Mark Dredze. Adaptive Regularization of Weight Vectors. Advances in Neural Information Processing Systems (NIPS), 2009. [PDF] We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.

Mark Dredze, Partha Pratim Talukdar, Koby Crammer. Sequence Learning from Data with Multiple Labels. ECML/PKDD Workshop on Learning from Multi-Label Data, 2009. [PDF] We present novel algorithms for learning structured predictors from instances with multiple labels in the presence of noise. The proposed algorithms improve performance on two standard NLP tasks when we have a small amount of training data (low quantity) and when the labels are noisy (low quality). In these settings, the methods improve performance over using a single label, in some cases exceeding performance using gold labels. Our methods could be used in a semi-supervised setting, where a limited amount of labeled data could be combined with a rule based automatic labeling of unlabeled data with multiple possible labels.

Koby Crammer, Mark Dredze, Alex Kulesza. Multi-Class Confidence Weighted Algorithms. Empirical Methods in Natural Language Processing (EMNLP), 2009. [PDF] The recently introduced online confidence-weighted (CW) learning algorithm for binary classification performs well on many binary NLP tasks. However, for multi-class problems CW learning updates and inference cannot be computed analytically or solved as convex optimization problems as they are in the binary case. We derive learning algorithms for the multi-class CW setting and provide extensive evaluation using nine NLP datasets, including three derived from the recently released New York Times corpus. Our best algorithm outperforms state-of-the-art online and batch methods on eight of the nine tasks. We also show that the confidence information maintained during learning yields useful probabilistic information at test time.

Mark Dredze, Bill Schilit, Peter Norvig. Suggesting Email View Filters for Triage and Search. International Joint Conference on Artificial Intelligence (IJCAI), 2009. [PDF] Growing email volumes cause flooded inboxes and swelled email archives, making search and new email processing difficult. While emails have rich metadata, such as recipients and folders, suitable for creating filtered views, it is often difficult to choose appropriate filters for new inbox messages without first examining messages. In this work, we consider a system that automatically suggests relevant view filters to the user for the currently viewed messages. We propose several ranking algorithms for suggesting useful filters. Our work suggests that such systems quickly filter groups of inbox messages and find messages more easily during search.
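
The AROW abstract in the 2009 list above describes adapting the prediction function on each new instance while tracking confidence in the weights. A minimal sketch of the binary update follows, written from the published description rather than taken from any released code. The class name, method names, the default `r=1.0`, and the toy data are all illustrative assumptions, and the full covariance matrix is kept in plain Python lists for readability rather than speed.

```python
class AROW:
    """Binary AROW sketch; labels must be -1 or +1."""

    def __init__(self, dim, r=1.0):
        self.mu = [0.0] * dim  # mean weight vector
        # Covariance over weights, initialized to the identity matrix.
        self.sigma = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.r = r  # regularization parameter (assumed default of 1.0)

    def score(self, x):
        return sum(m * v for m, v in zip(self.mu, x))

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def update(self, x, y):
        """Process one example; return True if the weights changed."""
        margin = y * self.score(x)
        if margin >= 1.0:
            return False  # hinge loss is zero, so no update
        sigma_x = [sum(s * v for s, v in zip(row, x)) for row in self.sigma]
        confidence = sum(v * sx for v, sx in zip(x, sigma_x))  # x^T Sigma x
        beta = 1.0 / (confidence + self.r)
        alpha = (1.0 - margin) * beta
        # Move the mean toward classifying x correctly, scaled by Sigma.
        self.mu = [m + alpha * y * sx for m, sx in zip(self.mu, sigma_x)]
        # Shrink the covariance along the direction of x.
        for i, sx_i in enumerate(sigma_x):
            for j, sx_j in enumerate(sigma_x):
                self.sigma[i][j] -= beta * sx_i * sx_j
        return True


# Toy, linearly separable data (hypothetical, for illustration only).
data = [([1.0, 2.0], 1), ([2.0, 1.0], 1),
        ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
model = AROW(dim=2)
for _ in range(5):
    for x, y in data:
        model.update(x, y)
print([model.predict(x) for x, _ in data])
```

Because the update fires whenever the hinge loss is positive, not only on mistakes, the sketch keeps adjusting the weights until every training margin reaches 1. Larger values of `r` make each step more conservative.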

2008 (15 Publications)

Kevin Lerman, Ari Gilder, Mark Dredze, Fernando Pereira. Reading the Markets: Forecasting Public Opinion of Political Candidates by News Analysis. Conference on Computational Linguistics (Coling), 2008. [PDF] Media reporting shapes public opinion, which can in turn influence events, particularly in political elections, in which candidates both respond to and shape public perception of their campaigns. We use computational linguistics to automatically predict the impact of news on public perception of political candidates. Our system uses daily newspaper articles to predict shifts in public opinion as reflected in prediction markets. We discuss various types of features designed for this problem. The news system improves market prediction over baseline market systems.

Mark Dredze, Joel Wallenberg. Further Results and Analysis of Icelandic Part of Speech Tagging. Technical Report MS-CIS-08-13, University of Pennsylvania, Department of Computer and Information Science, 2008. [PDF] Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions. This paper presents further results and analysis to the original work.

Mark Dredze, Joel Wallenberg. Icelandic Data-Driven Part of Speech Tagging. Association for Computational Linguistics (ACL), 2008. [PDF] Data driven POS tagging has achieved good performance for English, but can still lag behind linguistic rule based taggers for morphologically complex languages, such as Icelandic. We extend a statistical tagger to handle fine grained tagsets and improve over the best Icelandic POS tagger. Additionally, we develop a case tagger for non-local case and gender decisions. An error analysis of our system suggests future directions.

Kuzman Ganchev, Mark Dredze. Small Statistical Models by Random Feature Mixing. Workshop on Mobile NLP at ACL, 2008. [PDF] The application of statistical NLP systems to resource constrained devices is limited by the need to maintain parameters for a large number of features and an alphabet mapping features to parameters. We introduce random feature mixing to eliminate alphabet storage and reduce the number of parameters without severely impacting model performance.

Koby Crammer, Mark Dredze, John Blitzer, Fernando Pereira. Batch Performance for an Online Price. The NIPS 2007 Workshop on Efficient Machine Learning, 2008. [PDF] Batch learning techniques achieve good performance, but at the cost of many (sometimes even hundreds of) passes over the data. For many tasks, such as web-scale ranking of machine translation hypotheses, making many passes over the data is prohibitively expensive, even in parallel over thousands of machines. Online algorithms, which treat data as a stream of examples, are conceptually appealing for these large scale problems. In practice, however, online algorithms tend to underperform batch methods, unless they are themselves run in multiple passes over the data.
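
The random feature mixing abstract above (Ganchev and Dredze) describes removing the alphabet that maps features to parameters. One way to illustrate the idea is to hash each feature string directly to a slot in a fixed-size weight array and let colliding features share a parameter. The sketch below assumes CRC32 as the hash and a simple perceptron as the learner; both are illustrative choices, and the class name, bucket count, and feature strings are hypothetical rather than taken from the paper.

```python
import zlib


class HashedLinearModel:
    """Linear model whose features are hashed into a fixed weight array."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        # No feature-to-index dictionary is stored, only the weights.
        self.weights = [0.0] * num_buckets

    def index(self, feature):
        # Deterministic hash of the feature name; collisions are allowed,
        # so distinct features may share (mix into) one parameter.
        return zlib.crc32(feature.encode("utf-8")) % self.num_buckets

    def score(self, features):
        return sum(self.weights[self.index(name)] * value
                   for name, value in features.items())

    def perceptron_update(self, features, label):
        """Update on a mistake (label is -1 or +1); return True if updated."""
        if label * self.score(features) > 0:
            return False
        for name, value in features.items():
            self.weights[self.index(name)] += label * value
        return True


model = HashedLinearModel(num_buckets=1024)
example = {"word=great": 1.0, "word=service": 1.0, "bias": 1.0}
model.perceptron_update(example, 1)
print(model.score(example) > 0)
```

Memory use depends only on `num_buckets`, not on how many distinct feature strings appear, which is the trade the abstract describes: fewer stored parameters in exchange for some collisions.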

2006 (4 Publications)

Mark Dredze, John Blitzer, Koby Crammer, Fernando Pereira. Feature Design for Transfer Learning. North East Student Colloquium on Artificial Intelligence (NESCAI), 2006. [PDF]

Mark Dredze, John Blitzer, Fernando Pereira. "Sorry, I Forgot the Attachment:" Email Attachment Prediction. Conference on Email and Anti-Spam (CEAS), 2006. [PDF] The missing attachment problem: a missing attachment generates a wave of emails from the recipients notifying the sender of the error. We present an attachment prediction system to reduce the volume of missing attachment mail. Our classifier could prompt an alert when an outgoing email is missing an attachment. Additionally, the system could activate an attachment recommendation system, whereby suggested documents are offered once the system determines the user is likely to include an attachment, effectively reminding the user to include the attachment. We present promising initial results and discuss implications of our work.

Mark Dredze, Tessa Lau, Nicholas Kushmerick. Automatically classifying emails into activities. Intelligent User Interfaces (IUI), 2006. [PDF] Email-based activity management systems promise to give users better tools for managing increasing volumes of email, by organizing email according to a user's activities. Current activity management systems do not automatically classify incoming messages by the activity to which they belong, instead relying on simple heuristics (such as message threads), or asking the user to manually classify incoming messages as belonging to an activity. This paper presents several algorithms for automatically recognizing emails as part of an ongoing activity. Our baseline methods are the use of message reply-to threads to determine activity membership and a naive Bayes classifier. Our SimSubset and SimOverlap algorithms compare the people involved in an activity against the recipients of each incoming message. Our SimContent algorithm uses IRR (a variant of latent semantic indexing) to classify emails into activities using similarity based on message contents. An empirical evaluation shows that each of these methods provides a significant improvement over the baseline methods. In addition, we show that a combined approach that votes the predictions of the individual methods performs better than each individual method alone.

Nicholas Kushmerick, Tessa Lau, Mark Dredze, Rinat Khoussainov. Activity-Centric Email: A Machine Learning Approach. American National Conference on Artificial Intelligence (AAAI), 2006. [PDF]

2004 (1 Publication)

Mark Dredze, Jeffrey Stylos, Tessa Lau, Wendy Kellogg, Catalina Danis, Nicholas Kushmerick. Taxie: Automatically identifying tasks in email. Unpublished Manuscript, 2004.

2003 (1 Publication)

Kevin Livingston, Mark Dredze, Kristian Hammond, Larry Birnbaum. Beyond Broadcast. Proceedings of the 2003 International Conference on Intelligent User Interfaces, 2003.

Master's Thesis

For my master's degree in Jewish Studies at Yeshiva University, I completed a thesis titled "The Values of Traditional Judaism in Chicago." Please email me if you'd like a copy.

## Students

Current Students

Nicholas Andrews (Co-advised with Jason Eisner)
Matt Gormley [www] (Co-advised with Jason Eisner)
Michael Paul [www] (Co-advised with Jason Eisner)
Travis Wolfe [www]
Violet (Nanyun) Peng

Former Students

Ariya Rastrow [www] (Co-advised with Sanjeev Khudanpur). ECE PhD, 2012. First Job: Amazon.
Carolina Parada [www] (Co-advised with Hynek Hermansky). ECE PhD, 2011. First Job: Google Research.

| Project | Student | Year |
| --- | --- | --- |
| Information Extraction from Biomedical Text | Leah Hanson | 2011 |

| Project | Student |
| --- | --- |
| Email keyword summarization | Danny Puller (UPenn Summer Provost Fellowship) |
| Sentiment classification | Ian Cohen |
| Email Attachment Prediction | Josh Magarick |
| Prototype Driven Learning and Graphical Models | Neal Parikh |
| Machine Learning in Prediction Markets | Ari Gilder, Kevin Lerman (Winner, Best CS Senior Design Project; Honorable Mention, Best Engineering Design Project) |
| User Adaptation in Email Reply Prediction | Tova Brooks, Josh Carroll |
| Formal and Informal Meeting Extraction from Email | Lauren Paone |

## Teaching

Fall 2012: CS 600.475 Machine Learning [Class site]
Spring 2012: CS 600.775 Current Topics in Machine Learning [Class site]
Fall 2011: CS 600.475 Machine Learning [Class site]
Spring 2011: CS 600.775 Current Topics in Machine Learning [Class site]
Fall 2010: CS 600.475 Machine Learning [Class site]
Fall 2009: CS 600.475 Machine Learning [Class site]

## Data/Code

I get a lot of emails asking for data or code from one of my papers. If you are wondering, the answer is yes! I try to provide both data and code so that others can reproduce or compare against my results. Sadly, for lack of time I don't always post data or code, but I usually make them available if you email me.

Datasets
TAC 2009 Entity Linking (Email for data)
A collection of manually linked training examples to supplement those provided in the TAC 2009 KBP task. These are described in my Coling 2010 paper on entity linking.

A collection of ham and spam images taken from real user email.

Product reviews from several different product types taken from Amazon.com.

Attachment Prediction Email (Email for data)
Enron emails annotated with attachment information and cleaned of numerous artifacts inserted by email programs.

Code
This is a collection of software developed by me and others in Fernando Pereira's research group at UPenn. It is designed for a range of machine learning tasks, such as dependency parsing, structured learning, gene prediction and gene mention finding.

Confidence Weighted Learning Library (Email for code)
We have collected most of the core algorithms in the confidence weighted learning framework for release as a software library. Please email me for the code.

Carmen is a library for geolocating tweets. Given a tweet, Carmen will return Location objects that represent a physical location. Carmen uses both coordinates and other information in a tweet to make geolocation decisions. It's not perfect, but this greatly increases the number of geolocated tweets over what Twitter provides.

## Colleagues

I have worked with a lot of amazing people on a wide variety of projects. Here are a few of them:

Kedar Bellare, Axel Bernal, Larry Birnbaum, John Blitzer, Koby Crammer, Krzysztof Czuba, Kris Hammond, Ryan Gabbard, Kuzman Ganchev, João Graça, David Johnson, Rie Johnson (Ando), Alex Kulesza, Nicholas Kushmerick, Tessa Lau, Kevin Lerman, Qian Liu, Ryan McDonald, David Mimno, Peter Norvig, Fernando Pereira, Jeff Reynar, Doug Riecken, Sam Roweis, Bill Schilit, Partha Pratim Talukdar, Hanna M. Wallach, Joel Wallenberg, Casey Whitelaw, Tong Zhang