The Computational Linguistics and Clinical Psychology (CLPsych) workshop has hosted shared and unshared tasks for several years.
In 2015, the shared task used data from Twitter users who state a diagnosis of depression or post-traumatic stress disorder (PTSD), along with demographically matched community controls. The shared task provided an apples-to-apples comparison of various approaches to modeling language relevant to mental health from social media. It consisted of three binary classification experiments: (1) depression versus control, (2) PTSD versus control, and (3) depression versus PTSD.
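As a rough illustration of how the three experiments relate to one another, the sketch below derives each binary task from a single mapping of users to conditions. The label strings ("depression", "ptsd", "control"), the task names, and the helper `build_task` are assumptions made for this example; they are not taken from the released data or the shared task code.

```python
from typing import Dict, List, Tuple

# Hypothetical label strings and task names; the released data may encode
# conditions differently.
TASKS: Dict[str, Tuple[str, str]] = {
    "depression_vs_control": ("depression", "control"),
    "ptsd_vs_control": ("ptsd", "control"),
    "depression_vs_ptsd": ("depression", "ptsd"),
}


def build_task(labels: Dict[str, str], positive: str, negative: str) -> List[Tuple[str, int]]:
    """Keep only users in the two classes; map positive to 1 and negative to 0."""
    pairs = []
    for user, label in sorted(labels.items()):
        if label == positive:
            pairs.append((user, 1))
        elif label == negative:
            pairs.append((user, 0))
        # Users with any other label are excluded from this experiment.
    return pairs
```

Each experiment uses only the users belonging to its two classes, so a user with a PTSD label, for example, does not appear in the depression versus control task.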
This site provides instructions on how to obtain the data used in the shared task, as well as links to associated resources.
You will need to complete the following tasks to obtain the shared task data.
An institutional review board (IRB) is a committee that applies research ethics by reviewing the methods proposed for research to ensure that they are ethical. IRB approval is (typically) required for human subjects research in the United States. See the Wikipedia page for more information.
If you are outside the United States, your institution will typically have an equivalent ethics board. See HHS Office for Human Research Protections International Guidelines for more information.
Your university will have an IRB coordinator or administrator. Start by talking to this person.
If this is your first IRB application, you should discuss the proposed project with your IRB contact or administrator. You may also want to ask a colleague for an example IRB application.
For issues specific to social media data and health research, we suggest:
Adrian Benton, Glen Coppersmith, Mark Dredze. Ethical Research Protocols for Social Media Health Research. EACL Workshop on Ethics in Natural Language Processing, 2017.
See the shared task overview paper:
Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, Margaret Mitchell. CLPsych 2015 Shared Task: Depression and PTSD on Twitter. NAACL Workshop on Computational Linguistics and Clinical Psychology, 2015.
Please cite this paper as the reference for the data.
This GitHub project contains code for working with the data and running evaluations: https://github.com/clpsych/shared_task
The teams who participated in the original shared task submitted papers, which are available in the official ACL proceedings. They are listed here:
The data usage agreement prohibits the use of this data for:
commercial purposes of any kind, including but not limited to algorithm development or evaluation, model development or evaluation, evaluation of features, feature engineering, reports, or visualizations used for any for-profit purpose, where for-profit purposes include but are not limited to prototyping, product development, marketing, public relations, or pursuit of funding.
Please contact us with questions about how the data can be used commercially.
Please be advised: the dataset contains 1711 users in total (training and test sets combined), but the anonymized_user_info_by_chunk file lists 1989 users, a difference of 278. Screen names that belong to the missing chunks (51-59 and 90) are also missing from the dataset. This is a known issue, and we are currently unable to fix it.
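If you need to reconcile the user info file with the data you actually received, one option is to separate out users assigned to the missing chunks. The sketch below assumes you have already parsed the file into a mapping from screen name to chunk index; that layout and the helper `split_by_availability` are hypothetical, since the file format is not described here.

```python
from typing import Dict, Set, Tuple

# Chunks reported as missing from the dataset: 51 through 59, and 90.
MISSING_CHUNKS: Set[int] = set(range(51, 60)) | {90}


def split_by_availability(user_chunks: Dict[str, int]) -> Tuple[Set[str], Set[str]]:
    """Split users into those in available chunks and those in missing chunks.

    user_chunks maps a screen name to its chunk index (assumed layout).
    """
    available = {user for user, chunk in user_chunks.items() if chunk not in MISSING_CHUNKS}
    missing = set(user_chunks) - available
    return available, missing
```

Users returned in the second set should not be expected to have tweets in the released data.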