This is the dataset used for the experiments in the following paper:

   Markus Dreyer and Jason Eisner (2011).  Discovering morphological
   paradigms from plain text using a Dirichlet process mixture model.
   Proceedings of the Conference on Empirical Methods in Natural
   Language Processing (EMNLP), pp. 616-627.  Supplementary material
   (9 pages) also available.

   http://cs.jhu.edu/~jason/papers/#dreyer-eisner-2011

The dataset is derived from the CELEX2 database.  Unfortunately, that
database is under copyright, so we cannot legally redistribute it or
any substantial portion of it.  You must license your own copy through
the Linguistic Data Consortium: https://catalog.ldc.upenn.edu/LDC96L14 .

Once you have licensed a copy of CELEX2, you can run our script, which
automatically extracts our experimental dataset from it.  This
script is not independently useful and is not under copyright.

You may run the script as follows:
    ./extract.sh dreyer+eisner.emnlp11.data.zip

For simplicity and portability, our script does not do any text
processing on your copy of CELEX2.  Instead, it uses your copy of
CELEX2 to derive a decryption key that is used to decrypt and
uncompress files from a zip archive.  If you do not have access to
CELEX2, then the zip archive is just a useless collection of bytes
and you will not be able to obtain our dataset.

The script should run on most standard Linux or Unix systems.  If you
are unable to run it on your system, you can still extract our dataset
directly from the zip file, using any zip utility that supports
password-protected archives.  The password is the SHA-256 hash of the
largest CELEX2 file, namely german/gpw/gpw.cd (a 28982785-byte Unix
text file).  There are several free utilities and websites that can
compute a file's SHA-256 hash for you.
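
The manual route described above could be sketched roughly as follows.
This is only an illustration, not the contents of extract.sh.  The
helper name sha256_of and the CELEX2 path are hypothetical, and we
assume here that the password is the hash written as a lowercase
hexadecimal string, as printed by common tools; if that does not work,
try another encoding of the same hash.  Because the hash depends on the
exact bytes, first check that your gpw.cd is 28982785 bytes long (a
copy converted to DOS line endings would give a different hash).

```shell
#!/bin/sh
# Print the SHA-256 hex digest of the file given as $1.
# Tries sha256sum (GNU coreutils), then shasum (common on macOS),
# then falls back to the openssl command-line tool.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{print $1}'
    elif command -v shasum >/dev/null 2>&1; then
        shasum -a 256 "$1" | awk '{print $1}'
    else
        # openssl prints "...(file)= <hex>"; keep only the last field
        openssl dgst -sha256 "$1" | awk '{print $NF}'
    fi
}

# Usage (the CELEX2 path is an assumption; adjust it to your copy):
#   wc -c < /path/to/celex2/german/gpw/gpw.cd     # expect 28982785
#   password=$(sha256_of /path/to/celex2/german/gpw/gpw.cd)
#   unzip -P "$password" dreyer+eisner.emnlp11.data.zip
```

If unzip rejects the archive's encryption method, another zip utility
that accepts a password may still work.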
