Morphological Inflection

UniMorph is a project that JHU is spearheading to develop a universal morphological annotation schema. We have cleaned data for many languages, which we'll be using in this homework.

We will be dealing with the task of morphological inflection. Given a lemma and a feature set, the goal is to generate the inflected form of the word. For example, in Spanish:

Lemma: abajar
Features: V;IND;PRS;1;PL
Inflected form: abajamos

This has been a SIGMORPHON shared task for the last few years. Take a look at their page for more info.

We will treat this task as a machine translation problem, where each "word" is a character or feature designation.

Source: a b a j a r # V IND PRS 1 PL
Target: a b a j a m o s
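
To make the conversion concrete, here is a rough sketch of it in Julia, assuming the usual UniMorph format of one tab-separated lemma, inflected form, and feature bundle per entry. (The provided makebitext.jl script, described below, does this for you, along with the train/dev/test splitting.)

    # One UniMorph entry: lemma, inflected form, and feature bundle, tab-separated.
    lemma, form, feats = split("abajar\tabajamos\tV;IND;PRS;1;PL", '\t')
    # Source: the lemma's characters, a "#" separator, then the individual features.
    src = join(collect(lemma), " ") * " # " * join(split(feats, ';'), " ")
    # Target: the inflected form's characters.
    tgt = join(collect(form), " ")
    println(src)   # a b a j a r # V IND PRS 1 PL
    println(tgt)   # a b a j a m o s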

Download these processing and evaluation scripts. Then, pick a language and download some data from this GitHub repo.

Use the provided makebitext.jl script to preprocess this data and generate the data splits. You can run it with

julia makebitext.jl eng.trn eng.dev eng.tst

where eng.trn etc. are the files from UniMorph. The script creates a folder called data with six bitext files: train, dev, and test splits, each with a source (.src) and target (.tgt) file.
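
As a quick, optional sanity check (the file names here are the same ones used in the commands below), you can confirm that each split was written and peek at the first source/target pair:

    # Count lines in each of the six generated files and print the first training pair.
    for name in ("train", "dev", "test"), ext in ("src", "tgt")
        path = joinpath("data", "$name.$ext")
        println(path, ": ", countlines(path), " lines")
    end
    println(readline("data/train.src"))   # first source line: lemma characters, "#", features
    println(readline("data/train.tgt"))   # first target line: inflected form characters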

We'll use OpenNMT as our neural MT toolkit. It is easy to set up and use: just run conda install pytorch and pip install OpenNMT-py.

Following the OpenNMT instructions, here are the steps we need to do:

  1. Preprocess the data.

    onmt_preprocess -train_src data/train.src -train_tgt data/train.tgt -valid_src data/dev.src -valid_tgt data/dev.tgt -save_data run/data
    
  2. Train the model. We're only using a CPU, so this will take several minutes.

    onmt_train -data run/data -save_model run/model -encoder_type rnn -rnn_type lstm -rnn_size 128 -layers 1 -word_vec_size 128 -save_checkpoint_steps 200 -valid_steps 200 -early_stopping 2 > train.log
    
  3. Translate the dev data and evaluate your results (replace step_2200 with the step number of your best checkpoint, as reported during training). The evaluation script measures accuracy (did you get the word right?) and character edit distance (how many characters were you off?); a rough sketch of both metrics appears after this list.

    onmt_translate -model run/model_step_2200.pt -src data/dev.src -output dev.hyp -replace_unk -verbose
    julia evaluate.jl data/dev.tgt dev.hyp > dev.eval
    
  4. Try tweaking the model. Take a look at the list of training arguments for more info. The above training command specifies a smaller RNN size and embedding size than usual to reduce the number of parameters and speed up training. You can try changing these values to see if performance improves. The baseline for English is around 95-96%.

    One modification I demonstrated in class was copy attention, which basically lets the model copy tokens directly from the source to the target. Since the lemma and inflected form often share characters, a model with copy attention performs particularly well on this task. To do this, specify -copy_attn in the train command. You'll also have to preprocess with -dynamic_dict.

  5. Translate and evaluate the test data.

    onmt_translate -model run/model_step_2200.pt -src data/test.src -output test.hyp -replace_unk -verbose
    julia evaluate.jl data/test.tgt test.hyp > test.eval
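
For reference, here is a rough sketch of the two metrics that evaluate.jl reports: word-level accuracy and character edit distance (Levenshtein distance over the space-separated characters). This is only meant to illustrate how the numbers are computed; the exact aggregation in the provided script may differ (e.g. total vs. average distance), so use evaluate.jl for your reported results.

    # Levenshtein distance between two token sequences (here, sequences of characters).
    function edit_distance(a, b)
        m, n = length(a), length(b)
        d = [i + j for i in 0:m, j in 0:n]   # first row/column: pure insertions/deletions
        for i in 1:m, j in 1:n
            d[i+1, j+1] = min(d[i, j+1] + 1,                      # deletion
                              d[i+1, j] + 1,                      # insertion
                              d[i, j] + (a[i] == b[j] ? 0 : 1))   # substitution / match
        end
        return d[m+1, n+1]
    end

    refs = [split(line) for line in eachline("data/dev.tgt")]
    hyps = [split(line) for line in eachline("dev.hyp")]
    acc  = count(r == h for (r, h) in zip(refs, hyps)) / length(refs)
    dist = sum(edit_distance(r, h) for (r, h) in zip(refs, hyps)) / length(refs)
    println("accuracy: ", acc, "   average edit distance: ", dist)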