Cross-Cultural Blog and Forum Dataset

This dataset includes English-language text from two social media sources pertaining to three different countries: India, Singapore, and the U.K. It was introduced and described in the paper below.

Because the text is all in the same language, direct comparisons can be made between the data for the three countries. Furthermore, one set represents text written by authors from the countries, whereas the other set represents text written about the countries by travelers, offering two different perspectives on these countries.

Note that the dataset includes only the processed text extracted from the sources. The original web structure and formatting are no longer intact.

(349 MB) [link]

  • Michael Paul and Roxana Girju. Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), pages 1408-1417, Singapore, August 2009.