Demographics Race/Ethnicity Training Data

Task: Train machine learning classifiers to predict the race/ethnicity of Twitter users.

Authors: Zach Wood-Doughty, Paiheng Xu, Xiao Liu, Mark Dredze

Last Updated: April 23, 2020

Our paper, “Using Noisy Self-Reports to Predict Twitter User Demographics,” produced a dataset of Twitter users whose profile descriptions may self-report their race or ethnicity. We used this dataset to train classifiers for these demographic labels, and showed that models trained on the collected data perform better on gold-standard survey data than models trained only on crowd-sourced data.

This page provides instructions on how to obtain the data collected for this paper, as well as links to associated resources.

Important Notice (especially for those outside the United States)

As described below, you must obtain approval from an IRB or equivalent ethics board that has the same standards for review of human subjects research. Many individuals, especially from outside the United States, sign the agreement and obtain a letter of approval, only to find out that they cannot have the data because of a lack of IRB review. This is a non-negotiable requirement. Requests for data that ignore this requirement will in turn be ignored.

How can I get the data and models?

You will need to complete the following tasks to obtain the data and models from our paper.

  1. Notify Mark Dredze that you intend to request the data.
  2. Obtain a letter from your Institutional Review Board (IRB), or equivalent ethics board, stating that it has approved your proposed project and use of the data. Your IRB may rule this an exempt study, or require a review of the research protocols. Either way, you must produce a letter from the IRB approving the project.
  3. Complete the Data Use and Confidentiality Agreement.

What is an IRB?

An institutional review board (IRB) is a committee that applies research ethics by reviewing the methods proposed for research to ensure that they are ethical. IRB approval is (typically) required for human subjects research in the United States. See the Wikipedia page for more information.

If you are outside the United States, your institution typically has an equivalent ethics board. See HHS Office for Human Research Protections International Guidelines for more information.

How do I get started with an IRB application?

Your university will have an IRB coordinator or administrator. Start by talking to this person.

Do you have advice for how to write the IRB application?

If this is your first IRB application, you should discuss the proposed project with your IRB contact or administrator. You may also want to ask a colleague for an example IRB application.

For issues specific to social media data and health research, we suggest:

Adrian Benton, Glen Coppersmith, Mark Dredze. Ethical Research Protocols for Social Media Health Research. EACL Workshop on Ethics in Natural Language Processing, 2017.


Where can I find a description of the data and trained models?

See our paper:

Zach Wood-Doughty*, Paiheng Xu*, Xiao Liu, Mark Dredze. Using Noisy Self-reports to Predict Twitter User Demographics. arXiv, 2020.

PDF: forthcoming

Please cite this paper as the reference for the data or models.

Is there any software available for working with this data?

The trained models can be run by loading them into the Demographer package, which can be installed with pip install demographer. Instructions for where to store the trained models are given in the Demographer README.
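Before following the README, it can help to confirm that the package is actually visible to your Python environment. The snippet below is a minimal sketch using only the standard library. It assumes the pip package is imported under the module name demographer; the helper is_installed is a hypothetical name introduced here for illustration and is not part of Demographer.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the pip package "demographer" is imported under the same name.
    if is_installed("demographer"):
        print("demographer is available")
    else:
        print("demographer not found; try: pip install demographer")
```

This check only looks up the module without importing it, so it does not load any models or require the trained model files to be in place.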

Can I use the data or models commercially?

In short, no. The data usage agreement prohibits the use of this data for:

commercial purposes of any kind, including but not limited to algorithm development or evaluation, model development or evaluation, evaluation of features, feature engineering, reports, or visualizations used for any for-profit purpose, where for-profit purposes include but are not limited to prototyping, product development, marketing, public relations, or pursuit of funding.

Please contact us with questions about commercial use of the data.