Publications
-
Learning universal authorship representations
Determining whether two documents were composed by the same author, also known as authorship verification, has traditionally been tackled using statistical methods. Recently, authorship representations learned using neural networks have been found to outperform alternatives, particularly in large-scale settings involving hundreds of thousands of authors. But do such representations learned in a particular domain transfer to other domains? Or are these representations inherently entangled with domain-specific features? To study these questions, we conduct the first large-scale study of cross-domain transfer for authorship verification considering zero-shot transfers involving three disparate domains: Amazon reviews, fanfiction short stories, and Reddit comments. We find that although a surprising degree of transfer is possible between certain domains, it is not so successful between others. We examine properties of these domains that influence generalization and propose simple but effective methods to improve transfer.
Rafael A Rivera-Soto, Olivia Elizabeth Miano, Juanita Ordonez, Barry Y Chen, Aleem Khan, Marcus Bishop, Nicholas Andrews
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021
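In practice, verification with learned representations reduces to thresholding a similarity score between document embeddings, and zero-shot transfer simply applies an encoder trained on one domain to documents from another. The sketch below illustrates this framing; `embed_doc` and the threshold value are hypothetical placeholders, not the paper's model.

```python
# Hedged sketch of embedding-based authorship verification: two documents
# are attributed to the same author when the cosine similarity of their
# embeddings clears a threshold tuned on held-out pairs. `embed_doc` is a
# hypothetical encoder, not the model studied in the paper.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def same_author(doc_a: str, doc_b: str, embed_doc, threshold: float = 0.8) -> bool:
    return cosine(embed_doc(doc_a), embed_doc(doc_b)) >= threshold
```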
-
A Deep Metric Learning Approach to Account Linking
We consider the task of linking social media accounts that belong to the same author in an automated fashion on the basis of the content and metadata of their corresponding document streams. We focus on learning an embedding that maps variable-sized samples of user activity -- ranging from single posts to entire months of activity -- to a vector space, where samples by the same author map to nearby points. The approach does not require human-annotated data for training purposes, which allows us to leverage large amounts of social media content. The proposed model outperforms several competitive baselines under a novel evaluation framework modeled after established recognition benchmarks in other domains. Our method achieves high linking accuracy, even with small samples from accounts not seen at training time, a prerequisite for practical applications of the proposed linking framework.
Aleem Khan, Elizabeth Fleming, Noah Schofield, Marcus Bishop, Nicholas Andrews
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
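A minimal sketch, assuming a PyTorch setting, of the metric learning idea described above: a hypothetical encoder pools a variable-sized sample of per-post features into a single unit-length vector, and a triplet objective pulls samples from the same account together. Because same-account pairs come for free, no human annotation is needed. The architecture, dimensions, and mean-pooling are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AccountEncoder(nn.Module):
    """Illustrative encoder: mean-pools per-post features, then projects."""
    def __init__(self, post_dim=256, embed_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(post_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, posts):                   # posts: (num_posts, post_dim)
        pooled = posts.mean(dim=0)              # handles variable-sized samples
        return F.normalize(self.proj(pooled), dim=-1)

encoder = AccountEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.2)
# Anchor and positive are samples from the same account, negative from another.
anchor = encoder(torch.randn(10, 256)).unsqueeze(0)
positive = encoder(torch.randn(7, 256)).unsqueeze(0)
negative = encoder(torch.randn(12, 256)).unsqueeze(0)
loss = loss_fn(anchor, positive, negative)
```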
-
Learning Invariant Representations of Social Media Users
The evolution of social media users’ behavior over time complicates user-level comparison tasks such as verification, classification, clustering, and ranking. As a result, naive approaches may fail to generalize to new users or even to future observations of previously known users. In this paper, we propose a novel procedure to learn a mapping from short episodes of user activity on social media to a vector space in which the distance between points captures the similarity of the corresponding users’ invariant features. We fit the model by optimizing a surrogate metric learning objective over a large corpus of unlabeled social media content. Once learned, the mapping may be applied to users not seen at training time and enables efficient comparisons of users in the resulting vector space. We present a comprehensive evaluation to validate the benefits of the proposed approach using data from Reddit, Twitter, and Wikipedia.
Nicholas Andrews, Marcus Bishop
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
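Once such a mapping is fit, the user-level tasks named above reduce to vector arithmetic in the embedding space. The snippet below is illustrative only: it ranks candidate users by cosine similarity to the embedding of a query episode, with random arrays standing in for the learned mapping's output.

```python
import numpy as np

def rank_candidates(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Return candidate row indices sorted by descending cosine similarity."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(c @ q))

query = np.random.randn(128)             # embedding of a new user episode
candidates = np.random.randn(1000, 128)  # embeddings of known users
best_match = rank_candidates(query, candidates)[0]
```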
-
Convolutions Are All You Need (For Classifying Character Sequences)
While recurrent neural networks (RNNs) are widely used for text classification, they demonstrate poor performance and slow convergence when trained on long sequences. When text is modeled as characters instead of words, the longer sequences make RNNs a poor choice. Convolutional neural networks (CNNs), although somewhat less ubiquitous than RNNs, have an internal structure more appropriate for long-distance character dependencies. To better understand how CNNs and RNNs differ in handling long sequences, we use them for text classification tasks in several character-level social media datasets. The CNN models vastly outperform the RNN models in our experiments, suggesting that CNNs are superior to RNNs at learning to classify character-level data.
Zach Wood-Doughty, Nicholas Andrews, Mark Dredze
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, 2018
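A minimal character-level CNN in the spirit of this paper: embed characters, apply 1-D convolutions of several widths, max-pool over time, and classify. All hyperparameters below are illustrative assumptions, not the configurations evaluated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCNN(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=32,
                 num_filters=64, widths=(3, 5, 7), num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, w) for w in widths]
        )
        self.out = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, chars):                    # chars: (batch, seq_len) ids
        x = self.embed(chars).transpose(1, 2)    # (batch, embed_dim, seq_len)
        # Max-pooling over time lets wide filters capture long-distance
        # character dependencies without recurrence.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(feats, dim=1))

model = CharCNN()
logits = model(torch.randint(0, 128, (8, 500)))  # 8 sequences of 500 chars
```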
-
Predicting Twitter User Demographics from Names Alone
Social media analysis frequently requires tools that can automatically infer demographics to contextualize trends. These tools often require hundreds of user-authored messages for each user, which may be prohibitive to obtain when analyzing millions of users. We explore character-level neural models that learn a representation of a user's name and screen name to predict gender and ethnicity, allowing for demographic inference with minimal data. We release trained models, which may enable new demographic analyses that would otherwise require enormous amounts of data collection.
Zach Wood-Doughty, Nicholas Andrews, Rebecca Marvin, Mark Dredze
Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, 2018
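A hedged sketch of name-only inference: the characters of a name and screen name are fed to a small recurrent network that predicts a demographic label. The architecture, field separator, and binary label set are assumptions for illustration, not the released models.

```python
import torch
import torch.nn as nn

class NameClassifier(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=32, hidden=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, chars):                   # chars: (batch, seq_len) ids
        _, h = self.gru(self.embed(chars))      # h: (1, batch, hidden)
        return self.out(h.squeeze(0))           # (batch, num_classes)

# '|' separating name and screen name is an assumed convention for the sketch.
text = "Jane Doe|jdoe123"
ids = torch.tensor([[min(ord(c), 127) for c in text]])
logits = NameClassifier()(ids)                  # e.g., gender logits
```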
-
Twitter at the Grammys: A Social Media Corpus for Entity Linking and Disambiguation
Work on cross-document coreference resolution (CDCR) has primarily focused on news articles, with little to no work addressing social media. Yet social media may be particularly challenging, since short messages provide little context and informal names are pervasive. We introduce a new Twitter corpus that contains entity annotations for entity clusters to support CDCR. Our corpus draws from Twitter data surrounding the 2013 Grammy music awards ceremony, providing a large set of annotated tweets focusing on a single event. To establish a baseline, we evaluate two CDCR systems and consider the performance impact of each system component. Furthermore, we augment one system to include temporal information, which can be helpful when documents (such as tweets) arrive in a specific order. Finally, we include annotations linking the entities to a knowledge base to support entity linking. Our corpus is available: https://bitbucket.org/mdredze/tgx
Mark Dredze, Nicholas Andrews, Jay DeYoung
Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, 2016
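To make the CDCR setting concrete, here is a hedged sketch of a greedy clustering baseline that also uses the temporal signal mentioned above: each mention, processed in arrival order, joins the most similar existing cluster, with recency breaking ties. The string-similarity measure and threshold are assumptions for the sketch, not components of either evaluated system.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_mentions(mentions, threshold=0.8):
    """mentions: list of (timestamp, surface_string) in arrival order."""
    clusters = []  # each cluster: {"name", "members", "last_seen"}
    for ts, name in mentions:
        # Prefer the most similar cluster; recency breaks ties, exploiting
        # the fact that tweets about one event arrive in bursts.
        best = max(clusters,
                   key=lambda c: (similarity(c["name"], name), c["last_seen"]),
                   default=None)
        if best and similarity(best["name"], name) >= threshold:
            best["members"].append((ts, name))
            best["last_seen"] = ts
        else:
            clusters.append({"name": name, "members": [(ts, name)], "last_seen": ts})
    return clusters

clusters = cluster_mentions([(1, "Taylor Swift"), (2, "taylor swift"), (3, "Adele")])
```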