----------------------------------------------------
  README for Arabic Online Commentary Dataset v1.0
----------------------------------------------------

The dataset was created by crawling the websites of three Arabic newspapers and extracting readers' comments from their online articles.  The crawl covered roughly a 6-month period, from early April 2010 to early October 2010.  The three newspapers are:

  1) Al-Ghad (الغد), a Jordanian newspaper (www.alghad.com)
  2) Al-Riyadh (الرياض), a Saudi newspaper (www.alriyadh.com)
  3) Al-Youm Al-Sabe' (اليوم السابع), an Egyptian newspaper (www.youm7.com)

The extracted comments were split into segments based on hard returns entered by the author, as indicated by the HTML <br> tag.  No further segmentation (e.g. punctuation-based) was performed, though it is perfectly reasonable for you to do so.
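
As an illustration only (this is not the code used to build the dataset), the following Python sketch shows one way to split a raw comment on <br> tags and then, optionally, to split a segment further on sentence-final punctuation.  The function names, the accepted <br> variants, and the choice of punctuation marks are all assumptions made for this example.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case (assumed variants).
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)

# Whitespace that follows sentence-final punctuation: the Latin marks
# . ! ? and the Arabic question mark (U+061F). This set is an assumption.
SENTENCE_END = re.compile(r"(?<=[.!?\u061F])\s+")


def split_on_br(raw_comment):
    """Split a raw comment into segments at hard returns (<br> tags)."""
    parts = BR_TAG.split(raw_comment)
    return [p.strip() for p in parts if p.strip()]


def split_on_punctuation(segment):
    """Optionally split one segment further at sentence-final punctuation."""
    parts = SENTENCE_END.split(segment)
    return [p.strip() for p in parts if p.strip()]
```

The segments in the XML files are already split on <br>, so only something like split_on_punctuation would be needed for finer-grained segmentation.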

The dataset contains three XML files, one per newspaper.  The XML files contain the segments themselves, in addition to some other relevant information for each comment:

  (*) The URL of the newspaper article.
  (*) The date on which the comment was posted, formatted dd/mm/yyyy.
  (*) The time at which the comment was posted, formatted hh:mm (or hh:mm:ss for Al-Ghad), following a 24-hour format (i.e. hh is between 0 and 23).
  (*) The author "ID" associated with that comment, as entered by the author.
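
The date and time fields can be combined into a single timestamp.  The Python sketch below is a minimal example that assumes the two values have already been read from the XML as plain strings; parse_timestamp is a hypothetical helper name, not part of the dataset.

```python
from datetime import datetime


def parse_timestamp(date_str, time_str):
    """Combine a dd/mm/yyyy date and an hh:mm or hh:mm:ss time (24-hour)."""
    # Al-Ghad times include seconds; the other two sources do not.
    time_format = "%H:%M:%S" if time_str.count(":") == 2 else "%H:%M"
    return datetime.strptime(
        f"{date_str} {time_str}", f"%d/%m/%Y {time_format}"
    )
```

For example, parse_timestamp("05/04/2010", "23:07") yields April 5, 2010 at 23:07, since the day comes before the month.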

For comments extracted from youm7.com, there is an additional "subtitle" field, which is a header entered by the author for their comment.  Comments extracted from alghad.com also have a subtitle field, plus an author e-mail field and an author location field.  (All these fields are entered by the author.)  Comments extracted from alriyadh.com have only the four fields listed above.

In terms of size, the data consists of 3.1M segments, corresponding to 52.1M words (a word is a maximal sequence of non-space characters).  Here is the breakdown across the three sources (best viewed in a fixed-width font):

----------------------------------------------------------------------------------------------
      Source        |   Al-Ghad      Al-Riyadh   Al-Youm Al-Sabe'      ALL
----------------------------------------------------------------------------------------------
  File size         |     19.7 MB       340.0 MB       446.3 MB      806.0 MB (195 MB zipped)
  # news articles   |      6,299         34,163         45,667        86.1K articles
  # comments        |     26,648        804,968        564,853         1.4M comments
  # segments        |     63,304      1,685,533      1,383,952         3.1M segments
  # words           |  1,235,300     18,782,395     32,132,157        52.1M words
  # characters      |  6,878,512    104,231,502    177,604,767       288.7M characters
----------------------------------------------------------------------------------------------
  comments/article  |      4.23          23.56          12.37         16.21
  segments/comment  |      2.38           2.09           2.45          2.24
  words/segment     |     19.51          11.14          23.22         16.65
  characters/word   |      5.57           5.55           5.53          5.54
----------------------------------------------------------------------------------------------
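
The word definition given above and the derived rows in the lower half of the table can be reproduced with a few lines of Python.  This is a sketch rather than the script used to produce the table: count_words treats any Unicode whitespace as a space (an assumption), and the AL_GHAD dictionary simply copies the Al-Ghad column from the table.

```python
def count_words(segment):
    """Count words, i.e. maximal runs of non-whitespace characters."""
    # str.split() with no arguments splits on any run of whitespace.
    return len(segment.split())


# Raw counts copied from the Al-Ghad column of the table.
AL_GHAD = {
    "articles": 6299,
    "comments": 26648,
    "segments": 63304,
    "words": 1235300,
    "characters": 6878512,
}


def derived_rows(counts):
    """Recompute the four ratio rows, rounded to two decimals."""
    return {
        "comments/article": round(counts["comments"] / counts["articles"], 2),
        "segments/comment": round(counts["segments"] / counts["comments"], 2),
        "words/segment": round(counts["words"] / counts["segments"], 2),
        "characters/word": round(counts["characters"] / counts["words"], 2),
    }
```

Calling derived_rows(AL_GHAD) gives 4.23, 2.38, 19.51 and 5.57, matching the Al-Ghad column.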



If you have any questions about the dataset, please contact Omar F. Zaidan (ozaidan@cs.jhu.edu).

November 1, 2010
