{"cells": [{"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["#!/usr/bin/env python3"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This file illustrates how you might experiment with the HMM interface.\n", "You can paste these commands in at the Python prompt, or execute `test_en.py` directly.\n", "A notebook interface is nicer than the plain Python prompt, so we provide\n", "a notebook version of this file as `test_en.ipynb`, which you can open with\n", "`jupyter` or with Visual Studio `code` (run it with the `nlp-class` kernel)."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import logging\n", "import math\n", "import os\n", "from pathlib import Path"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from corpus import TaggedCorpus\n", "from eval import eval_tagging, model_cross_entropy, viterbi_error_rate\n", "from hmm import HiddenMarkovModel\n", "from crf import ConditionalRandomField"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Set up logging."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["logging.root.setLevel(level=logging.INFO)\n", "log = logging.getLogger(\"test_en\") # For usage, see findsim.py in earlier assignment.\n", "logging.basicConfig(format=\"%(levelname)s : %(message)s\", level=logging.INFO) # could change INFO to DEBUG"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Switch working directory to the directory where the data live. You may need to edit this line."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["os.chdir(\"../data\")"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["entrain = TaggedCorpus(Path(\"ensup\"), Path(\"enraw\")) # all training\n", "ensup = TaggedCorpus(Path(\"ensup\"), tagset=entrain.tagset, vocab=entrain.vocab) # supervised training\n", "endev = TaggedCorpus(Path(\"endev\"), tagset=entrain.tagset, vocab=entrain.vocab) # evaluation\n", "print(f\"{len(entrain)=} {len(ensup)=} {len(endev)=}\")"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["known_vocab = TaggedCorpus(Path(\"ensup\")).vocab # words seen with supervised tags; used in evaluation\n", "log.info(f\"Tagset: f{list(entrain.tagset)}\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Make an HMM. Let's do some pre-training to approximately maximize the\n", "regularized log-likelihood on supervised training data. In other words, the\n", "probabilities at the M step will just be supervised count ratios.\n", "\n", "On each epoch, you will see two progress bars: first it collects counts from\n", "all the sentences (E step), and then after the M step, it evaluates the loss\n", "function, which is the (unregularized) cross-entropy on the training set.\n", "\n", "The parameters don't actually matter during the E step because there are no\n", "hidden tags to impute. The first M step will jump right to the optimal\n", "solution. The code will try a second epoch with the revised parameters, but\n", "the result will be identical, so it will detect convergence and stop.\n", "\n", "We arbitrarily choose \u03bb=1 for our add-\u03bb smoothing at the M step, but it would\n", "be better to search for the best value of this hyperparameter."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Hidden Markov Model (HMM)\")\n", "hmm = HiddenMarkovModel(entrain.tagset, entrain.vocab) # randomly initialized parameters \n", "loss_sup = lambda model: model_cross_entropy(model, eval_corpus=ensup)\n", "hmm.train(corpus=ensup, loss=loss_sup, \u03bb=1.0,\n", " save_path=\"ensup_hmm.pkl\") "]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now let's throw in the unsupervised training data as well, and continue\n", "training as before, in order to increase the regularized log-likelihood on\n", "this larger, semi-supervised training set. It's now the *incomplete-data*\n", "log-likelihood.\n", "\n", "This time, we'll use a different evaluation loss function: we'll stop when the\n", "*tagging error rate* on a held-out dev set stops getting better. Also, the\n", "implementation of this loss function (`viterbi_error_rate`) includes a helpful\n", "side effect: it logs the *cross-entropy* on the held-out dataset as well, just\n", "for your information.\n", "\n", "We hope that held-out tagging accuracy will go up for a little bit before it\n", "goes down again (see Merialdo 1994). (Log-likelihood on training data will\n", "continue to improve, and that improvement may generalize to held-out\n", "cross-entropy. But getting accuracy to increase is harder.)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["hmm = HiddenMarkovModel.load(\"ensup_hmm.pkl\") # reset to supervised model (in case you're re-executing this bit)\n", "loss_dev = lambda model: viterbi_error_rate(model, eval_corpus=endev, \n", " known_vocab=known_vocab)\n", "hmm.train(corpus=entrain, loss=loss_dev, \u03bb=1.0,\n", " save_path=\"entrain_hmm.pkl\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["You can also retry the above workflow where you start with a worse supervised\n", "model (like Merialdo). Does EM help more in that case? It's easiest to rerun\n", "exactly the code above, but first make the `ensup` file smaller by copying\n", "`ensup-tiny` over it. `ensup-tiny` is only 25 sentences (that happen to cover\n", "all tags in `endev`). Back up your old `ensup` and your old `*.pkl` models\n", "before you do this."]}, {"cell_type": "markdown", "metadata": {}, "source": ["More detailed look at the first 10 sentences in the held-out corpus,\n", "including Viterbi tagging."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["def look_at_your_data(model, dev, N):\n", " for m, sentence in enumerate(dev):\n", " if m >= N: break\n", " viterbi = model.viterbi_tagging(sentence.desupervise(), endev)\n", " counts = eval_tagging(predicted=viterbi, gold=sentence, \n", " known_vocab=known_vocab)\n", " num = counts['NUM', 'ALL']\n", " denom = counts['DENOM', 'ALL']\n", " \n", " log.info(f\"Gold: {sentence}\")\n", " log.info(f\"Viterbi: {viterbi}\")\n", " log.info(f\"Loss: {denom - num}/{denom}\")\n", " xent = -model.logprob(sentence, endev) / len(sentence) # measured in nats\n", " log.info(f\"Cross-entropy: {xent/math.log(2)} nats (= perplexity {math.exp(xent)})\\n---\")"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["look_at_your_data(hmm, endev, 10)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now let's try supervised training of a CRF (this doesn't use the unsupervised\n", "part of the data, so it is comparable to the supervised pre-training we did\n", "for the HMM). We will use SGD to approximately maximize the regularized\n", "log-likelihood. \n", "\n", "As with the semi-supervised HMM training, we'll periodically evaluate the\n", "tagging accuracy (and also print the cross-entropy) on a held-out dev set.\n", "We use the default `eval_interval` and `tolerance`. If you want to stop\n", "sooner, then you could increase the `tolerance` so the training method decides\n", "sooner that it has converged.\n", "\n", "We arbitrarily choose reg = 1.0 for L2 regularization, learning rate = 0.05,\n", "and a minibatch size of 10, but it would be better to search for the best\n", "value of these hyperparameters.\n", "\n", "Note that the logger reports the CRF's *conditional* cross-entropy, log p(tags\n", "| words) / n. This is much lower than the HMM's *joint* cross-entropy log\n", "p(tags, words) / n, but that doesn't mean the CRF is worse at tagging. The\n", "CRF is just predicting less information."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Conditional Random Field (CRF)\\n\")\n", "crf = ConditionalRandomField(entrain.tagset, entrain.vocab) # randomly initialized parameters \n", "crf.train(corpus=ensup, loss=loss_dev, reg=1.0, lr=0.05, minibatch_size=10,\n", " save_path=\"ensup_crf.pkl\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's examine how the CRF does on individual sentences. \n", "(Do you see any error patterns here that would inspire additional CRF features?)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["look_at_your_data(crf, endev, 10)"]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4"}}, "nbformat": 4, "nbformat_minor": 2}