This is an old corpus used for email spam detection. Some messages are known to be unwanted mass emails ("spam"); the rest are known to be genuine ("gen").

The GenSpam corpus was prepared by Medlock (2006):
https://www.researchgate.net/publication/220271836_An_Adaptive_Semi-Structured_Language_Model_Approach_to_Spam_Filtering_on_a_New_Corpus

Here is a quote from that paper:

    We source the Adaptation and Test sets from the contents of two
    users inboxes, collected over a number of months (Nov 2002–June
    2003), retaining both spam and genuine messages. We take this
    approach rather than simply extracting a test set from the corpus
    as a whole, so that the test set represents a real-world spam
    filtering instance. The 600 messages making up the adaptation
    [development] set are randomly extracted from the same source as
    the test set, facilitating experimentation into the behaviour of
    the classifier given a small set of highly relevant samples and a
    large background corpus.

For this homework, we have provided training sets of different sizes, e.g.:

    gen           spam
    gen-times2    spam-times2
    gen-times4    spam-times4
    gen-times8    spam-times8

Each of these training files contains many messages, one message per line. gen-times8 and spam-times8 are the complete sets of genuine and spam training messages, but it may be convenient to start with smaller datasets such as gen and spam.
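
Since each training file holds one message per line, a quick way to check which files you have and how large each one is can be sketched in Python as follows. This is only an illustration, not part of the homework code: the helper names read_messages and corpus_sizes are hypothetical, and the sketch assumes the files are readable as UTF-8 text (undecodable bytes are replaced) and that blank lines are not messages.

```python
from pathlib import Path


def read_messages(path):
    """Return the messages in a training file, one per non-empty line.

    Assumption: the file is text, readable as UTF-8; bytes that cannot
    be decoded are replaced rather than raising an error.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        # Assumption: blank lines are separators or noise, not messages.
        return [line.rstrip("\n") for line in f if line.strip()]


def corpus_sizes(data_dir, suffixes=("", "-times2", "-times4", "-times8")):
    """Map each training file found in data_dir to its message count."""
    sizes = {}
    for suffix in suffixes:
        for label in ("gen", "spam"):
            path = Path(data_dir) / f"{label}{suffix}"
            if path.exists():  # skip files that were not provided
                sizes[path.name] = len(read_messages(path))
    return sizes
```

For example, calling corpus_sizes on the directory that holds the training files returns a dictionary with one entry per file present, so you can see how the gen and spam sets grow from the smallest files up to gen-times8 and spam-times8.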