Introducing Little Bird - A Python Package for Tweet Processing

As a researcher working with NLP, I’ve spent a lot of time wrangling Twitter data. While it’s a rich source of information, the raw data can be messy and requires a fair amount of pre-processing before it’s ready for any real analysis. I found myself writing the same boilerplate code over and over again to read, clean, and tokenize tweets.

To streamline my own workflow, I decided to bundle all of my most-used functions into a simple, reusable Python package. The result is littlebird, a collection of utilities designed to make processing tweets just a little bit easier.

What can it do?

littlebird provides a few key components to help you get from raw tweet files to clean, tokenized text.

Reading and Writing Tweet Files

The first hurdle is always just reading the data. Tweets often come in large, gzipped JSON Lines files. The TweetReader and TweetWriter classes handle this for you automatically.

You can easily iterate through a tweet file, and the reader can even be configured to automatically skip deleted tweets, withheld statuses, and retweets.
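Under the hood, this kind of filtering mostly comes down to checking a few well-known fields in Twitter's JSON records. Here's a rough sketch of the logic (an illustration of the idea, not littlebird's actual implementation):

```python
def keep_tweet(tweet: dict) -> bool:
    """Return True if a record looks like a normal, original tweet.

    Illustrative sketch; littlebird's actual checks may differ.
    """
    if "delete" in tweet:  # deletion notice from the streaming API
        return False
    if "withheld_in_countries" in tweet:  # withheld status
        return False
    if "retweeted_status" in tweet:  # retweet of another tweet
        return False
    return True

records = [
    {"delete": {"status": {"id": 1}}},
    {"id": 2, "text": "RT @user hello", "retweeted_status": {"id": 1}},
    {"id": 3, "text": "an original tweet"},
]
print([r.get("id") for r in records if keep_tweet(r)])  # [3]
```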

Here’s a quick example of how you can read a file, filter for tweets that contain hashtags, and write them to a new file:

from littlebird import TweetReader, TweetWriter

# Automatically handles .gz files
reader = TweetReader("my_tweets.json.gz")
writer = TweetWriter("filtered_tweets.json")

# Filter for tweets with at least one hashtag
hashtag_tweets = []
for tweet in reader.read_tweets():
    # Guard against records that lack an "entities" field
    if tweet.get("entities", {}).get("hashtags"):
        hashtag_tweets.append(tweet)

# Write the filtered tweets to a new file
writer.write(hashtag_tweets)

Flexible Tweet Tokenization

The real core of the package is the TweetTokenizer. Cleaning and tokenizing tweets is notoriously tricky. You have to deal with usernames, URLs, hashtags, emojis, and all sorts of slang and non-standard language.

The TweetTokenizer is designed to be flexible. You can customize the tokenization pattern, provide your own list of stopwords, choose whether to remove hashtags or lowercase the text, and more.
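To give a feel for what those options look like in practice, here's a minimal sketch of a configurable tweet tokenizer. The class and option names here are illustrative assumptions, not littlebird's exact API; see the package documentation for the real interface.

```python
import re


class SimpleTweetTokenizer:
    """Illustrative sketch of a configurable tweet tokenizer.

    Option names are hypothetical; littlebird's TweetTokenizer
    exposes its own (similar) set of options.
    """

    def __init__(self, token_pattern=r"\b\w+\b", stopwords=None,
                 remove_hashtags=False, lowercase=True):
        self.token_pattern = re.compile(token_pattern)
        self.stopwords = set(stopwords or [])
        self.remove_hashtags = remove_hashtags
        self.lowercase = lowercase

    def tokenize(self, text):
        if self.remove_hashtags:
            # Drop the whole hashtag, marker and word together
            text = re.sub(r"#\w+", " ", text)
        if self.lowercase:
            text = text.lower()
        return [t for t in self.token_pattern.findall(text)
                if t not in self.stopwords]


tokenizer = SimpleTweetTokenizer(stopwords={"is", "a"}, remove_hashtags=True)
print(tokenizer.tokenize("Python is a great language #NLP"))
# ['python', 'great', 'language']
```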

For researchers using specific pre-trained models, I’ve also included:

  • GloVeTweetTokenizer: A tokenizer that replicates the pre-processing used for the popular GloVe Twitter embeddings.
  • BERTweetTokenizer: A tokenizer that normalizes tweets according to the pre-processing steps used for the BERTweet model.

Here’s how you might use the GloVeTweetTokenizer:

from littlebird import GloVeTweetTokenizer

tokenizer = GloVeTweetTokenizer()

tweet_text = "This is soooooo cool!!! #NLP #Python"
tokens = tokenizer.tokenize(tweet_text)

# Output:
# ['this', 'is', 'so', '<elong>', 'cool', '!', '<repeat>', '<hashtag>', 'nlp', '<hashtag>', 'python']
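The special tokens in that output come from GloVe-style normalization rules. A simplified sketch of those rules, using only the standard library, looks something like this (an illustration of the idea, not littlebird's actual code):

```python
import re


def glove_style_tokenize(text: str) -> list:
    """Simplified GloVe-style tweet normalization (illustrative sketch)."""
    # Collapse elongated words: "soooooo" -> "so <elong>"
    text = re.sub(r"\b(\S*?)(\w)\2{2,}\b", r"\1\2 <elong>", text)
    # Mark repeated punctuation: "!!!" -> "! <repeat>"
    text = re.sub(r"([!?.]){2,}", r"\1 <repeat>", text)
    # Mark hashtags: "#NLP" -> "<hashtag> NLP"
    text = re.sub(r"#(\S+)", r"<hashtag> \1", text)
    # Lowercase, then split into words, special tokens, and punctuation
    return re.findall(r"<\w+>|\w+|[^\w\s]", text.lower())


print(glove_style_tokenize("This is soooooo cool!!! #NLP #Python"))
# ['this', 'is', 'so', '<elong>', 'cool', '!', '<repeat>',
#  '<hashtag>', 'nlp', '<hashtag>', 'python']
```

The real GloVe preprocessing script handles more cases (URLs, user mentions, all-caps markers, emoticons), but the pattern-substitution approach is the same.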

Get Started

You can install littlebird directly from PyPI:

pip install littlebird-twitter-utils

Or you can install it from the source by cloning the GitHub repository:

git clone https://github.com/AADeLucia/littlebird.git
cd littlebird
pip install -e .

This package has been a huge help in my own research, and I hope it can be useful to others in the NLP community as well. It’s still a work in progress, and I welcome any feedback or contributions. Feel free to open an issue or pull request on GitHub!



