Department of Computer Science, Johns Hopkins University
spacerHomeAbout UsWhy Join UsPeopleAcademicsResearchEventsServices
Department of Computer Science, Johns Hopkins Universityspacer

February 24, 2011 - Panagiotis Ipeirotis

Title: Get Another Label? Improving Data Quality and Machine Learning using Multiple, Noisy Labelers

Abstract:
I will discuss the repeated acquisition of "labels" for data items when the labeling is imperfect. Labels are values provided by humans for specified variables on data items, such as "PG-13" for "Adult Content Rating on this Web Page." With the increasing popularity of micro-outsourcing systems, such as Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. We present repeated-labeling strategies of increasing complexity, and show several main results: (i) Repeated-labeling can improve label quality and model quality (per unit data-acquisition cost), but not always. (ii) Simple strategies can give considerable advantage, and carefully selecting a chosen set of points for labeling does even better (we present and evaluate several techniques). (iii) Labeler (worker) quality can be estimated on the fly (e.g., to determine compensation, control quality or eliminate Mechanical Turk spammers) and systematic biases can be corrected. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data modelers should have in their repertoire. I illustrate the results with a real-life application from on-line advertising: using Mechanical Turk to help classify web pages as being objectionable to advertisers.

This is joint work with Foster Provost, Victor S. Sheng, and Jing Wang.
An earlier version of the work received the Best Paper Award Runner-up at the ACM SIGKDD Conference.














































spacerSearchContact UsIntegrity CodeAcademics FAQLibrary ResourcesJob Center