Learning Multiview Representations of Twitter Users

Input views used to learn multiview Twitter user embeddings

Twitter's terms of service prevent sharing large-scale Twitter corpora. Instead, we share the 1000-dimensional PCA vectors produced for each user's tweet and network views. These embeddings can be used in place of the raw user data to reproduce our methods and to compare new methods against our work.

File: user_6views_tfidf_pcaEmbeddings_userTweets+networks.tsv.gz (1.4 GB)

One row per user, tab-delimited
First field is Twitter user ID
Next 6 fields are indicator features for whether each view contains data for this specific user
The final 6 fields are views, each containing a 1000-dimensional space-delimited vector

Format
The data file contains these fields in tab-separated format:

UserID
EgoTweets
MentionTweets
FriendTweets
FollowerTweets
FriendNetwork
FollowerNetwork
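
As an illustration of the layout described above, here is a minimal Python sketch for reading one row. It rests on assumptions not stated in this file: that the 6 indicator fields and the 6 view fields both follow the view order listed above, and that each indicator parses as a number where a positive value means the view has data. The names VIEW_NAMES, parse_row, and iter_users are hypothetical helpers, not part of the released code.

```python
import gzip

import numpy as np

# Assumed order of both the indicator fields and the view fields.
VIEW_NAMES = [
    "EgoTweets", "MentionTweets", "FriendTweets",
    "FollowerTweets", "FriendNetwork", "FollowerNetwork",
]


def parse_row(line):
    """Split one tab-delimited row into (user_id, present, views)."""
    fields = line.rstrip("\n").split("\t")
    expected = 1 + 2 * len(VIEW_NAMES)  # ID + 6 indicators + 6 views
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")

    user_id = fields[0]
    # Assumption: an indicator is numeric and positive when data exists.
    present = {
        name: float(flag) > 0
        for name, flag in zip(VIEW_NAMES, fields[1:7])
    }
    # Each view is a space-delimited vector of floats.
    views = {
        name: np.array([float(x) for x in vec.split()])
        for name, vec in zip(VIEW_NAMES, fields[7:13])
    }
    return user_id, present, views


def iter_users(path):
    """Stream rows from the gzipped file without loading it all at once."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield parse_row(line)
```

Streaming with iter_users("user_6views_tfidf_pcaEmbeddings_userTweets+networks.tsv.gz") avoids holding the 1.4 GB file in memory; check the indicator for a view before using its vector.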

Vector dimensions are sorted in order of decreasing variance, so evaluating a 50-dimensional PCA vector means just using the first 50 values in each view.
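
Because dimensions are ordered by decreasing variance, a lower-dimensional vector is just a prefix of the full one. A minimal sketch, where truncate_view is a hypothetical helper name and the vector is assumed to be already parsed into numeric values:

```python
import numpy as np


def truncate_view(vector, k):
    """Return the first k dimensions of a PCA vector from one view."""
    vector = np.asarray(vector, dtype=float)
    if not 0 < k <= vector.shape[0]:
        raise ValueError(f"k must be in 1..{vector.shape[0]}, got {k}")
    return vector[:k]


# Example: a stand-in 1000-dimensional vector reduced to 50 dimensions.
full = np.linspace(1.0, 0.0, 1000)
reduced = truncate_view(full, 50)
```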

User IDs for user engagement and friend prediction tasks

The user IDs for the user engagement and friend prediction tasks can be found here:
friend_and_hashtag_prediction_userids.zip

Each row in a file corresponds to a single hashtag or celebrity. The first field is the hashtag that users posted or the celebrity they follow. All following entries are the user IDs of everyone who engaged. The first 10 user IDs were used to compute the query embedding; all other user IDs are then ranked by cosine similarity to that query. Hashtags are split into development and test, as used in the paper.
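
The ranking step can be sketched as follows. This is an illustration under assumptions, not the evaluation code from the paper: it assumes rows are whitespace-delimited, and that the query embedding is the mean of the query users' embeddings (this file does not say how the 10 embeddings are combined). The names parse_engagement_row and rank_by_cosine are hypothetical, and embeddings is assumed to be a dict mapping user ID to a vector.

```python
import numpy as np


def parse_engagement_row(line):
    """Return (label, user_ids) for one row; assumes whitespace delimiters."""
    fields = line.split()
    return fields[0], fields[1:]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0


def rank_by_cosine(embeddings, query_ids, candidate_ids):
    """Rank candidates by similarity to the mean query embedding.

    Assumption: the query is the mean of the query users' vectors.
    """
    query = np.mean(
        [np.asarray(embeddings[u], dtype=float) for u in query_ids], axis=0
    )
    scored = [
        (u, cosine(query, np.asarray(embeddings[u], dtype=float)))
        for u in candidate_ids
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For a parsed row, the first 10 user IDs would be passed as query_ids and every other user ID in the evaluation set as candidate_ids.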

Code

Code for learning weighted GCCA embeddings can be found at https://github.com/abenton/wgcca

For details on how the data was generated, or to cite it in your work, use:

Adrian Benton, Raman Arora, and Mark Dredze. Learning Multiview Representations of Twitter Users. Association for Computational Linguistics (ACL), 2016.

Direct your questions or comments to:

adrian dot author1_surname at gmail dot com