Learning Multiview Representations of Twitter Users
Input views used to learn multiview Twitter user embeddings
Twitter's terms of service prevent sharing of large-scale Twitter corpora. Instead, we share the
1000-dimensional PCA vectors produced for each user's tweet and network views. These embeddings can be used in place of the user data to reproduce our methods and to compare new methods against our work.
File: user_6views_tfidf_pcaEmbeddings_userTweets+networks.tsv.gz (1.4 GB)
The data file contains these fields in tab-separated format:
- One row per user, tab-delimited
- The first field is the Twitter user ID
- The next 6 fields are indicator features for whether each view contains data for this user
- The final 6 fields are the views, each containing a 1000-dimensional space-delimited vector
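As a rough illustration of the layout described above, a single row could be parsed along the following lines. This is a minimal sketch, not code from the release: the names `UserRow`, `parse_row`, and `iter_rows` are hypothetical, and it assumes the indicator fields are written as numeric 0/1 values, which the field description does not state explicitly.

```python
import gzip
from typing import Iterator, List, NamedTuple

NUM_VIEWS = 6  # number of views per user, per the field description


class UserRow(NamedTuple):
    user_id: str
    has_view: List[bool]       # one indicator per view
    views: List[List[float]]   # one PCA vector per view


def parse_row(line: str) -> UserRow:
    """Split one tab-delimited row into ID, indicators, and view vectors."""
    fields = line.rstrip("\n").split("\t")
    expected = 1 + 2 * NUM_VIEWS
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    user_id = fields[0]
    # Assumption: indicators are numeric, with nonzero meaning "view present".
    has_view = [float(f) != 0.0 for f in fields[1:1 + NUM_VIEWS]]
    # Each view field holds a space-delimited vector.
    views = [[float(x) for x in f.split()] for f in fields[1 + NUM_VIEWS:]]
    return UserRow(user_id, has_view, views)


def iter_rows(path: str) -> Iterator[UserRow]:
    """Stream rows from the gzipped TSV without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_row(line)
```

Because the file is 1.4 GB compressed, streaming it row by row as in `iter_rows` is likely more practical than reading it whole.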
Vector dimensions are sorted in order of decreasing variance, so evaluating
a 50-dimensional PCA vector means just using the first 50 values in each view.
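Since the dimensions are already ordered by decreasing variance, obtaining a lower-dimensional embedding is just a prefix slice. The sketch below shows this for a list of view vectors; `truncate_views` is a hypothetical helper name, and the vectors are assumed to be already parsed into lists of floats.

```python
from typing import List, Sequence


def truncate_views(views: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """Keep the first k (highest-variance) dimensions of every view."""
    truncated = []
    for vector in views:
        if not 0 < k <= len(vector):
            raise ValueError(f"k={k} is out of range for a {len(vector)}-dim vector")
        truncated.append(list(vector[:k]))
    return truncated
```

For example, `truncate_views(views, 50)` would yield the 50-dimensional PCA vectors mentioned above.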
User IDs for user engagement and friend prediction tasks
The user IDs for the user engagement and friend prediction tasks can be found here:
Each row in a file corresponds to a single hashtag or celebrity. The first field is the hashtag that users posted or the celebrity that they follow. All following entries are the user IDs of everyone who engaged. The first 10 user IDs were used to compute the query embedding; all other user IDs are then ranked by cosine similarity to it. Hashtags are split into development and test sets, as used in the paper.
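The ranking step described above can be sketched as follows. This is only an illustration under stated assumptions: the query embedding is taken to be the mean of the first 10 users' embeddings (the paper should be consulted for the exact construction), the rows are assumed to be tab-delimited, and `embeddings` is a hypothetical dictionary mapping user IDs to vectors. All function names are likewise hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

NUM_QUERY_USERS = 10  # the first 10 user IDs form the query


def parse_engagement_row(line: str, sep: str = "\t") -> Tuple[str, List[str]]:
    """Return (hashtag or celebrity, user IDs). The delimiter is an assumption."""
    fields = line.rstrip("\n").split(sep)
    return fields[0], fields[1:]


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rank_candidates(
    query_ids: Sequence[str],
    candidate_ids: Sequence[str],
    embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Rank candidates by cosine similarity to the averaged query embedding."""
    # Assumption: the query embedding is the mean of the query users' vectors.
    query = mean_vector([embeddings[u] for u in query_ids])
    scored = [(u, cosine(query, embeddings[u])) for u in candidate_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Given a parsed row `(label, user_ids)`, the query would be `user_ids[:NUM_QUERY_USERS]`, and the candidates would be all other user IDs being ranked.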
Code for learning weighted GCCA embeddings can be found at https://github.com/abenton/wgcca
For details on how the data was generated, or to reference it in your work, use:
Adrian Benton, Raman Arora, and Mark Dredze. Learning Multiview Representations of Twitter Users. Association for Computational Linguistics (ACL), 2016.
Direct your questions or comments to:
adrian dot author1_surname at gmail dot com