# 600.465 Introduction to NLP (Fall 2000)

# Assignment #3

# Due: Nov 29 2pm

# Tagging

Instructor: Jan Hajic
Email: hajic@cs.jhu.edu

TA: Gideon S. Mann
Email: gsm@cs.jhu.edu


## Requirements

For all parts of this homework, work either alone or in a group of at
most two people (in that case, the same grade will be assigned to both
of you; thus please make sure you understand what your colleague is
doing, and that s/he is doing it right!). On top of the
results/requirements specific to a certain part of the homework, turn
in all of your **code**, **commented** in such a way that it is
possible to determine what, how, and why you did what you did solely
from the comments, and a **discussion of/comments on** the results
(in a plain-text or HTML file).
Technically, follow the usual pattern (see the Syllabus):
For this whole homework, use the data found in

```
barley:~hajic/cs465/texten2.ptg
barley:~hajic/cs465/textcz2.ptg
```

In the following, "the data" refers to both English and Czech, as usual.

Split the data in the following way: use the last 40,000 words for
testing (data S), and from the remaining data, use the last 20,000
words for smoothing (data H, if any). Call the rest "data T" (training).
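
As a sanity check, the split can be sketched as follows (a minimal
Python sketch; reading and tokenizing the `.ptg` files into a list of
word/tag items is assumed to be done elsewhere):

```python
def split_data(items):
    """Split a list of word/tag items into T (train), H (held-out), S (test).

    Last 40,000 items -> S; previous 20,000 -> H; everything before -> T.
    """
    S = items[-40000:]
    H = items[-60000:-40000]
    T = items[:-60000]
    return T, H, S
```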

## 1. Brill's Tagger & Tagger Evaluation

Download Eric Brill's supervised tagger,
either from his home page still at www.cs.jhu.edu/~brill, or directly
from
ftp://ftp.cs.jhu.edu/pub/brill/Programs/RULE_BASED_TAGGER_V.1.14.tar.Z.
Install it (i.e., uncompress, untar, and run `make`). If
you work on `hops`, you will need to make the following changes
in the package's `Makefile` before running `make`:

1. Add the following line:

`CC = gcc`

right after the first comment line in it, i.e. after the line beginning

`# Makefile for Transformation...`

2. Change all references to the `cc` compiler to `$(CC)`
(there are 7 such references).

These changes switch the build to the GNU C compiler, which is
properly installed on `hops`, instead of the standard Sun `cc`
compiler, which would give you messages such as `/usr/ucb/cc:
language optional software package not installed`.
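
After both changes, the top of the `Makefile` should look roughly like
this (the comment text, object files, and link line below are
abbreviated illustrations, not the exact contents of Brill's file):

```makefile
# Makefile for Transformation-Based Error-Driven Tagger ...
CC = gcc

tagger: tagger.o
	$(CC) -o tagger tagger.o
```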

After installation, get the data, train the tagger
on as much data from T as time allows (the package contains extensive
documentation on how to train it on new data), and evaluate it on data
S. Tabulate the results.
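
For the evaluation, the usual metric is per-token accuracy; assuming
gold and predicted tag sequences of equal length, it can be computed
as:

```python
def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```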

Do cross-validation of the results: split the data into S', [H',] T'
such that S' is the first 40,000 words, H' is the next 20,000 words
(if needed), and T' is all of the remaining data. Train Eric Brill's
tagger on T' (again, use as much data as time allows) and evaluate it
on S'. Again, tabulate the results.

Do three more splits of your data (using the same formula: 40k/20k/the
rest), making them as different from one another as possible, and get
another three sets of results. Compute the mean (average) accuracy and
the standard deviation of the accuracy. Tabulate all results.
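
The mean and standard deviation over the five splits can be computed
as follows (a small sketch; this uses the population standard
deviation over the accuracy figures):

```python
import math

def mean_and_std(accuracies):
    """Mean and (population) standard deviation of a list of accuracies."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    var = sum((a - mean) ** 2 for a in accuracies) / n
    return mean, math.sqrt(var)
```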

## 2. Unsupervised Learning: HMM Tagging

Use the datasets T, H, and S. Estimate the parameters of an HMM tagger
by supervised learning from the T data (trigram and lower-order models
for tags). Smooth both the trigram tag model and the lexical model in
the same way as in Homework No. 1 (use data H). Evaluate
your tagger on S, using the Viterbi
algorithm.
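
The Viterbi decoding step can be sketched as follows. This is a
minimal Python sketch for the bigram case (the assignment asks for a
trigram tag model; extending the sketch means conditioning on the two
previous tags), with hypothetical smoothed probability tables `trans`
and `emit` that are not part of the assignment:

```python
import math

def viterbi(words, tags, trans, emit, start="<s>"):
    """Bigram Viterbi decoding in log space.

    trans[(t_prev, t)] and emit[(t, w)] hold smoothed probabilities
    (the table names and the 1e-10 floor are assumptions).
    """
    V = {start: 0.0}     # best log-probability of a path ending in each tag
    back = []            # one backpointer table per position
    for w in words:
        V2, bp = {}, {}
        for t in tags:
            best_prev, best_score = None, float("-inf")
            for t_prev, score in V.items():
                s = (score
                     + math.log(trans.get((t_prev, t), 1e-10))
                     + math.log(emit.get((t, w), 1e-10)))
                if s > best_score:
                    best_prev, best_score = t_prev, s
            V2[t], bp[t] = best_score, best_prev
        V = V2
        back.append(bp)
    # Recover the best tag sequence by following the backpointers.
    t = max(V, key=V.get)
    out = [t]
    for bp in reversed(back[1:]):
        t = bp[t]
        out.append(t)
    return list(reversed(out))
```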
Now use only the first 10,000 words of T to estimate the initial (raw)
parameters of the HMM tagging model. Strip the tags off the
remaining data T. Use the Baum-Welch
algorithm to improve on the initial parameters. Smooth as
usual. Evaluate your unsupervised HMM tagger and compare the results
to those of the supervised HMM tagger.
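
The core of Baum-Welch is the forward-backward computation of expected
counts, which then replace the observed counts in re-estimation. A
minimal sketch of one E-step for the bigram case (the assignment uses
a trigram tag model; `trans` and `emit` are hypothetical probability
tables, as above):

```python
from collections import defaultdict

def expected_transition_counts(words, tags, trans, emit, start="<s>"):
    """One E-step of Baum-Welch for a bigram HMM (sketch).

    Returns (expected tag-bigram counts, sentence likelihood).
    In practice, run this in log space or with scaling to avoid underflow.
    """
    n = len(words)
    # Forward probabilities: alpha[i][t] = P(words[0..i], tag_i = t).
    alpha = [dict() for _ in range(n)]
    for t in tags:
        alpha[0][t] = trans.get((start, t), 0.0) * emit.get((t, words[0]), 0.0)
    for i in range(1, n):
        for t in tags:
            alpha[i][t] = emit.get((t, words[i]), 0.0) * sum(
                alpha[i - 1][tp] * trans.get((tp, t), 0.0) for tp in tags)
    # Backward probabilities: beta[i][t] = P(words[i+1..n-1] | tag_i = t).
    beta = [dict() for _ in range(n)]
    for t in tags:
        beta[n - 1][t] = 1.0
    for i in range(n - 2, -1, -1):
        for t in tags:
            beta[i][t] = sum(trans.get((t, tn), 0.0)
                             * emit.get((tn, words[i + 1]), 0.0)
                             * beta[i + 1][tn] for tn in tags)
    Z = sum(alpha[n - 1][t] for t in tags)   # likelihood of the word sequence
    # Expected counts of tag-to-tag transitions, summed over positions.
    counts = defaultdict(float)
    for i in range(n - 1):
        for tp in tags:
            for t in tags:
                counts[(tp, t)] += (alpha[i][tp] * trans.get((tp, t), 0.0)
                                    * emit.get((t, words[i + 1]), 0.0)
                                    * beta[i + 1][t]) / Z
    return counts, Z
```

Normalizing these expected counts (per conditioning tag) gives the
re-estimated transition probabilities for the next EM iteration.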

Tabulate and compare the results of the HMM taggers vs. Brill's tagger.