These data were extracted from a number of international government
documents that were issued in both English and Spanish.

The documents can be found in RawData/English-corpus and
RawData/Spanish-corpus if you're interested.

RawData also contains extracts of varying lengths from these
documents.  These extracts were then tokenized by letter to produce
the training, development, and test sets.
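
As a rough illustration of what "tokenized by letter" can look like,
here is a minimal Python sketch.  It is not the script that produced
these files: the function name tokenize_by_letter and the "_" marker
for word boundaries are assumptions made for the example, and this
README does not say how spaces or punctuation were actually handled.

```python
def tokenize_by_letter(text, space_token="_"):
    """Split text into one token per character.

    Runs of whitespace become a single space_token (a hypothetical
    convention chosen for this sketch only).
    """
    tokens = []
    for word in text.split():
        if tokens:
            # Mark the boundary between the previous word and this one.
            tokens.append(space_token)
        # Extending a list with a string appends each character.
        tokens.extend(word)
    return tokens


if __name__ == "__main__":
    print(" ".join(tokenize_by_letter("el gato")))
```

With tokens like these, a model sees individual letters rather than
whole words, which is what makes letter n-gram statistics relevant.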

We've removed accent marks from letters so that it is a little harder
to distinguish English from Spanish by letter unigrams alone.
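
One common way to strip accent marks is sketched below in Python.
This is an assumption about the general approach, not a record of the
exact procedure used for this data set; the helper name strip_accents
is made up for the example.  It decomposes each character with
Unicode NFD normalization and drops the combining marks, so that
"á" becomes "a" and "ñ" becomes "n".

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is nonzero for combining marks; skip those.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("señor está aquí"))
```

Without this step, a single accented letter would often be enough to
mark a passage as Spanish.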