Learning Multiview Representations of Twitter Users

Input views used to learn multiview Twitter user embeddings

Twitter's terms of service prevent sharing of large-scale Twitter corpora. Instead, we share the 1000-dimensional PCA vectors produced for each user's tweet and network views. These embeddings can be used in place of the raw user data to reproduce our methods and to compare new methods against our work.

File: user_6views_tfidf_pcaEmbeddings_userTweets+networks.tsv.gz (1.4 GB)

Format
The data file contains these fields in tab-separated format:

Vector dimensions are sorted in order of decreasing variance, so a 50-dimensional PCA vector for a view is obtained by taking the first 50 values of that view.
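
As a rough illustration of the truncation described above, the Python sketch below streams the gzipped file and keeps the first k values of a view. The helper names iter_rows and truncate_view are hypothetical, and the sketch assumes that each view is serialized as whitespace-separated floats within its tab-separated field; check this against the actual file before relying on it. Which column holds which view is not assumed here.

```python
import gzip
from typing import Iterator, List


def iter_rows(path: str) -> Iterator[List[str]]:
    """Yield each line of a gzipped TSV file as a list of raw string fields."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


def truncate_view(view_field: str, k: int = 50) -> List[float]:
    """Parse one view and keep its first k (highest-variance) dimensions.

    Assumption: the view is stored as whitespace-separated floats.
    """
    values = view_field.split()
    return [float(v) for v in values[:k]]


if __name__ == "__main__":
    # Synthetic 4-dimensional view, truncated to its first 2 dimensions.
    print(truncate_view("0.5 -1.25 3.0 0.1", k=2))
```

Because the dimensions are already ordered by variance, no re-fitting of PCA is needed; slicing each view is enough. Streaming the file row by row also avoids loading all 1.4 GB into memory at once.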

Code

Code for learning weighted GCCA embeddings can be found at https://github.com/abenton/wgcca

For details on how the data was generated, or to reference it in your work, cite:

Adrian Benton, Raman Arora, and Mark Dredze. Learning Multiview Representations of Twitter Users. Association for Computational Linguistics (ACL), 2016.

Direct your questions or comments to:

adrian dot author1_surname at gmail dot com