{"cells": [{"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["#!/usr/bin/env python3"]}, {"cell_type": "markdown", "metadata": {}, "source": ["This file illustrates how you might experiment with the HMM interface.\n", "You can paste these commands in at the Python prompt, or execute `test_ic.py` directly.\n", "A notebook interface is nicer than the plain Python prompt, so we provide\n", "a notebook version of this file as `test_ic.ipynb`, which you can open with\n", "`jupyter` or with Visual Studio `code` (run it with the `nlp-class` kernel)."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import logging, math, os\n", "from pathlib import Path"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["import torch\n", "from torch import tensor"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["from corpus import TaggedCorpus\n", "from eval import model_cross_entropy, write_tagging\n", "from hmm import HiddenMarkovModel\n", "from crf import ConditionalRandomField"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Set up logging."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log = logging.getLogger(\"test_ic\")       # For usage, see findsim.py in earlier assignment.\n", "logging.root.setLevel(level=logging.INFO)\n", "logging.basicConfig(level=logging.INFO)  # could change INFO to DEBUG\n", "# torch.autograd.set_detect_anomaly(True)    # uncomment to improve error messages from .backward(), but slows down"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Switch working directory to the directory where the data live.  
You may want to edit this line."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["os.chdir(\"../data\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Get vocabulary and tagset from a supervised corpus."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["icsup = TaggedCorpus(Path(\"icsup\"), add_oov=False)\n", "log.info(f\"Ice cream vocabulary: {list(icsup.vocab)}\")\n", "log.info(f\"Ice cream tagset: {list(icsup.tagset)}\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Two ways to look at the corpus ..."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["os.system(\"cat icsup\")   # call the shell to look at the file directly"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(icsup)          # print the TaggedCorpus python object we constructed from it"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Make an HMM."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Hidden Markov Model (HMM) test\\n\")\n", "hmm = HiddenMarkovModel(icsup.tagset, icsup.vocab)\n", "# Change the transition/emission initial probabilities to match the ice cream spreadsheet,\n", "# and test your implementation of the Viterbi algorithm.  
Note that the spreadsheet \n", "# uses transposed versions of these matrices.\n", "hmm.B = tensor([[0.7000, 0.2000, 0.1000],    # emission probabilities\n", "                [0.1000, 0.2000, 0.7000],\n", "                [0.0000, 0.0000, 0.0000],\n", "                [0.0000, 0.0000, 0.0000]])\n", "hmm.A = tensor([[0.8000, 0.1000, 0.1000, 0.0000],   # transition probabilities\n", "                [0.1000, 0.8000, 0.1000, 0.0000],\n", "                [0.0000, 0.0000, 0.0000, 0.0000],\n", "                [0.5000, 0.5000, 0.0000, 0.0000]])\n", "log.info(\"*** Current A, B matrices (using initializations from the ice cream spreadsheet)\")\n", "hmm.printAB()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Try it out on the raw data from the spreadsheet, available in `icraw`."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Viterbi results on icraw with hard-coded parameters\")\n", "icraw = TaggedCorpus(Path(\"icraw\"), tagset=icsup.tagset, vocab=icsup.vocab)\n", "write_tagging(hmm, icraw, Path(\"icraw_hmm.output\"))  # calls hmm.viterbi_tagging on each sentence\n", "os.system(\"cat icraw_hmm.output\")   # print the file we just created"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Did the parameters that we guessed above get the \"correct\" answer, \n", "as revealed in `icdev`?"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["icdev = TaggedCorpus(Path(\"icdev\"), tagset=icsup.tagset, vocab=icsup.vocab)\n", "log.info(f\"*** Compare to icdev corpus:\\n{icdev}\")\n", "from eval import viterbi_error_rate\n", "viterbi_error_rate(hmm, icdev, show_cross_entropy=False)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now let's try your training code, running it on supervised data.\n", "To test this, we'll restart from a random initialization.\n", "(You could also try creating this new model with `unigram=True`, \n", "which will affect the 
rest of the notebook.)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["hmm = HiddenMarkovModel(icsup.tagset, icsup.vocab)\n", "log.info(\"*** A, B matrices as randomly initialized close to uniform\")\n", "hmm.printAB()"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Supervised HMM training on icsup\")\n", "cross_entropy_loss = lambda model: model_cross_entropy(model, icsup)\n", "hmm.train(corpus=icsup, loss=cross_entropy_loss, tolerance=0.0001)\n", "log.info(\"*** A, B matrices after training on icsup (should \"\n", "         \"match initial params on spreadsheet [transposed])\")\n", "hmm.printAB()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now that we've reached the spreadsheet's starting guess, let's again tag\n", "the spreadsheet \"sentence\" (that is, the sequence of ice creams) using the\n", "Viterbi algorithm."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Viterbi results on icraw\")\n", "icraw = TaggedCorpus(Path(\"icraw\"), tagset=icsup.tagset, vocab=icsup.vocab)\n", "write_tagging(hmm, icraw, Path(\"icraw_hmm.output\"))  # calls hmm.viterbi_tagging on each sentence\n", "os.system(\"cat icraw_hmm.output\")   # print the file we just created"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Next let's use the forward algorithm to see what the model thinks about \n", "the probability of the spreadsheet \"sentence.\""]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Forward algorithm on icraw (should approximately match iteration 0 \"\n", "             \"on spreadsheet)\")\n", "for sentence in icraw:\n", "    prob = math.exp(hmm.logprob(sentence, icraw))\n", "    log.info(f\"{prob} = p({sentence})\")"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Finally, let's reestimate on the icraw data, as the 
spreadsheet does.\n", "We'll evaluate as we go along on the *training* perplexity, and stop\n", "when that has more or less converged."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Reestimating on icraw (perplexity should improve on every iteration)\")\n", "negative_log_likelihood = lambda model: model_cross_entropy(model, icraw)  # evaluate on icraw itself\n", "hmm.train(corpus=icraw, loss=negative_log_likelihood, tolerance=0.0001)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** A, B matrices after reestimation on icraw \"\n", "         \"(should match final params on spreadsheet [transposed])\")\n", "hmm.printAB()"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Viterbi results on icraw after reestimation on icraw\")\n", "icraw = TaggedCorpus(Path(\"icraw\"), tagset=icsup.tagset, vocab=icsup.vocab)\n", "write_tagging(hmm, icraw, Path(\"icraw_hmm.output\"))  # calls hmm.viterbi_tagging on each sentence\n", "os.system(\"cat icraw_hmm.output\")   # print the file we just created"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now let's try out a randomly initialized CRF on the ice cream data. Notice how\n", "the initialized A and B matrices now hold non-negative potentials,\n", "rather than probabilities that sum to 1."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Conditional Random Field (CRF) test\\n\")\n", "crf = ConditionalRandomField(icsup.tagset, icsup.vocab)\n", "log.info(\"*** Current A, B matrices (potentials from small random parameters)\")\n", "crf.printAB()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now let's try your training code, running it on supervised data. To test this,\n", "we'll restart from a random initialization. 
\n", "\n", "Note that the logger reports the CRF's *conditional* cross-entropy, \n", "-log p(tags | words) / n.  This is much lower than the HMM's *joint* \n", "cross-entropy -log p(tags, words) / n, but that doesn't mean the CRF\n", "is better at tagging.  The CRF is just predicting less information."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Supervised CRF training on icsup\")\n", "cross_entropy_loss = lambda model: model_cross_entropy(model, icsup)\n", "crf.train(corpus=icsup, loss=cross_entropy_loss, lr=0.1, tolerance=0.0001)\n", "log.info(\"*** A, B matrices after training on icsup\")\n", "crf.printAB()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Let's again tag the spreadsheet \"sentence\" (that is, the sequence of ice\n", "creams) using the Viterbi algorithm.  The trained CRF might get a different\n", "answer than the trained HMM.  (Try comparing the two `icraw_*.output` files.)"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["log.info(\"*** Viterbi results on icraw with trained parameters\")\n", "icraw = TaggedCorpus(Path(\"icraw\"), tagset=icsup.tagset, vocab=icsup.vocab)\n", "write_tagging(crf, icraw, Path(\"icraw_crf.output\"))  # calls crf.viterbi_tagging on each sentence\n", "os.system(\"cat icraw_crf.output\")   # print the file we just created"]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4"}}, "nbformat": 4, "nbformat_minor": 2}