We assume that you’ve made a local copy of http://www.cs.jhu.edu/~jason/465/hw-lm/ (for example, by
downloading and unpacking the zipfile there) and that you’re currently
in the code/ subdirectory.
You can activate the same environment you created for Homework 2.
conda activate nlp-class
You may want to look again at the PyTorch tutorial materials in the Homework 2 INSTRUCTIONS, this time paying more attention to the documentation on automatic differentiation.
The command lines below use wildcard notations such as
../data/gen_spam/dev/gen/* and
../data/gen_spam/train/{gen,spam}. These are supposed to
expand automatically into lists of matching files.
This will work fine if you are running a shell like
bash, which is standard on Linux and MacOS. If you’re using
Windows, we recommend that you install bash
there, but you could alternatively force
Windows PowerShell to expand wildcards.
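If all else fails, you can expand a pattern yourself from inside Python using the standard glob module. Here is a minimal sketch (the pattern is just the example from above):

# Expanding a wildcard pattern manually with Python's standard glob module
import glob
files = sorted(glob.glob("../data/gen_spam/dev/gen/*"))   # list of matching file paths
print(len(files), "matching files")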
We provide a script ./build_vocab.py for you to build a
vocabulary from some corpus. Type ./build_vocab.py --help
to see documentation. Once you’ve familiarized yourself with the
arguments, try running it like this:
./build_vocab.py ../data/gen_spam/train/{gen,spam} --threshold 3 --output vocab-genspam.txt
This creates vocab-genspam.txt, which you can look at:
it’s just a set of word types including OOV and
EOS.
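For example, you could load it from Python and check its size. This is just a sketch that assumes one word type per line; the exact file format and the spellings of the OOV and EOS symbols come from build_vocab.py:

# Peek at the vocab file, assuming one word type per line
with open("vocab-genspam.txt") as f:
    vocab = {line.strip() for line in f if line.strip()}
print(len(vocab), "word types")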
Once you’ve built a vocab file, you can use it to build one or more smoothed language models. If you are comparing two models, both models should use the same vocab file, to make the probabilities comparable (as explained in the homework handout).
We also provide a script ./train_lm.py for you to build
a smoothed language model from a vocab file and a corpus. (The code for
actually training and using models is in the probs.py
module, which you will extend later.)
Type ./train_lm.py --help to see documentation. Once
you’ve familiarized yourself with the arguments, try running it like
this:
./train_lm.py vocab-genspam.txt add_lambda --lambda 1.0 ../data/gen_spam/train/gen
Here add_lambda is the type of smoothing, and
--lambda specifies the hyperparameter λ=1.0. While the
documentation mentions additional hyperparameters like
--l2_regularization, they are not used by the
add_lambda smoothing technique, so specifying them will
have no effect on it.
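As a refresher, add-λ smoothing estimates each conditional probability from counts. Here is a sketch of the standard estimate (see the reading handout for the official definition), where V is the vocabulary size, including OOV and EOS:

# Sketch of the standard add-lambda estimate (the handout gives the official definition)
def add_lambda_prob(c_xyz: int, c_xy: int, lam: float, V: int) -> float:
    """p(z | xy) = (c(xyz) + lambda) / (c(xy) + lambda * V)"""
    return (c_xyz + lam) / (c_xy + lam * V)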
Since the above command line doesn’t specify an --output
file to save the model in, the script just makes up a long filename
(ending in .model) that mentions the choice of
hyperparameters. You may sometimes want to use shorter filenames, or
specific filenames that are required by the submission instructions that
we’ll post on Piazza.
The file
corpus=gen~vocab=vocab-genspam.txt~smoother=add_lambda~lambda=1.0.model
now contains a pickled copy of
a trained Python LanguageModel object. The object contains
everything you need to use the language model, including the
type of language model, the trained parameters, and a copy of the
vocabulary. Other scripts can just load the model object from the file
and query it to get information like \(p(z
\mid xy)\) by calling its methods. They don’t need to know how
the model works internally or how it was trained.
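For example, another script could do something along these lines; the exact names of the loading and probability methods are whatever probs.py defines, so treat this as a sketch:

# Sketch: load a saved model and query it (check probs.py for the actual method names)
from probs import LanguageModel
lm = LanguageModel.load("corpus=gen~vocab=vocab-genspam.txt~smoother=add_lambda~lambda=1.0.model")
print(len(lm.vocab))                 # the model carries its own copy of the vocabulary
print(lm.prob("i", "love", "spam"))  # p(z | xy) for one trigram, if prob() is the method name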
You can now use your trained models to assign probabilities to new
corpora using ./fileprob.py. Type
./fileprob.py --help to see documentation. Once you’ve
familiarized yourself with the arguments, try running the script like
this:
./fileprob.py [mymodel] ../data/gen_spam/dev/gen/*
where [mymodel] refers to the long filename above. (You
may not have to type it all: try typing the start and hitting Tab, or
type *.model if it’s the only model matching that
pattern.)
Note: It may be convenient to use symbolic links (shortcuts) to avoid typing long filenames or directory names. For example,
ln -sr corpus=gen~vocab=vocab-genspam.txt~smoother=add_lambda~lambda=1.0.model gen.model
will make gen.model be a shortcut for the long model
filename, and
ln -sr ../data/speech/train sptrain
will make sptrain be a shortcut to that directory, so
that sptrain/switchboard is now a shortcut to
../data/speech/train/switchboard.
Copy fileprob.py to textcat.py.
Modify textcat.py so that it does text categorization.
textcat.py should have almost the same command-line API as
./fileprob.py, except that it should take two models
instead of just one, plus the prior probability of the first category (0.7 in the example below).
You could train your language models with lines like
./train_lm.py vocab-genspam.txt add_lambda --lambda 1.0 gen --output gen.model
./train_lm.py vocab-genspam.txt add_lambda --lambda 1.0 spam --output spam.model
which save the trained models to the specified files but print no output. You should then be able to categorize the development corpus files in question 3 like this:
./textcat.py gen.model spam.model 0.7 ../data/gen_spam/dev/{gen,spam}/*
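Internally, textcat.py is just applying Bayes' theorem to each file. A minimal sketch of the decision rule, where file_log_prob is a hypothetical helper that sums a model's log-probabilities over a file's trigrams (as fileprob.py does) and 0.7 is the prior on gen from the command line above:

# Sketch of the Bayes decision rule for one file (file_log_prob is a hypothetical helper)
import math

def classify(path, gen_lm, spam_lm, prior_gen=0.7):
    score_gen  = math.log(prior_gen)     + file_log_prob(path, gen_lm)   # log prior + log likelihood
    score_spam = math.log(1 - prior_gen) + file_log_prob(path, spam_lm)
    return "gen" if score_gen >= score_spam else "spam"                  # higher log posterior wins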
Note that LanguageModel objects have a
vocab attribute. You should do a sanity check in
textcat.py that both language models loaded for text
categorization have the same vocabulary. If not, raise an
exception (ValueError is appropriate for illegal or
inconsistent arguments), or if you don’t like throwing uncaught
exceptions back to the user, print an error message
(log.critical) and halt (sys.exit(1)).
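A minimal version of that check might look like this, where gen_lm and spam_lm stand for the two loaded models and vocab is the attribute mentioned above:

# Sanity check in textcat.py: the two models must share a vocabulary
if gen_lm.vocab != spam_lm.vocab:
    raise ValueError("textcat.py: the two models were trained with different vocabularies")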
(It’s generally wise to include sanity checks in your code that will
immediately catch problems, so that you don’t have to track down
mysterious behavior. The assert statement is used to check
statements that should be correct if your code is internally
correct. Once your code is correct, these assertions should
never fail. Some people even turn assertion-checking off in the
final version, for speed. But even correct code may encounter conditions
beyond its control; for those cases, you should raise an
exception to warn the caller that the code couldn’t do what it was asked
to do, typically because the arguments were bad or the required
resources were unavailable.)
You want to support the add_lambda_backoff argument to
train_lm.py. This makes use of the
BackoffAddLambdaLanguageModel class in
probs.py. You will have to implement the
prob() method in that class.
Make sure that for any bigram \(xy\), you have \(\sum_z p(z \mid xy) = 1\), where \(z\) ranges over the whole vocabulary including OOV and EOS.
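One quick way to convince yourself of this is a brute-force check in a scratch script or notebook, where lm is your trained backoff model. This sketch assumes a prob(x, y, z) method, so adjust it to your class's actual interface:

# Brute-force normalization check for one arbitrary context (x, y)
x, y = "the", "of"          # any two context words (use the OOV symbol if they're not in your vocab)
total = sum(lm.prob(x, y, z) for z in lm.vocab)
assert abs(total - 1.0) < 1e-6, f"probabilities sum to {total}, not 1"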
As you are only adding a new model, the behavior of your old models
such as AddLambdaLanguageModel should not change.
Now add the sample() method to probs.py.
Did your good object-oriented programming principles suggest the best
place to do this?
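Wherever you put it, sampling just repeatedly draws the next word from p(z | xy) and slides the context window. A rough sketch, assuming a prob(x, y, z) method and the BOS/EOS spellings used by the starter code:

# Sketch of trigram sampling (prob() signature and BOS/EOS spellings per the starter code)
import random

def sample_sentence(lm, max_length=20):
    x, y = "BOS", "BOS"                  # start-of-sentence context
    words = []
    while len(words) < max_length:
        zs = list(lm.vocab)
        weights = [lm.prob(x, y, z) for z in zs]
        z = random.choices(zs, weights=weights)[0]
        if z == "EOS":                   # stop when the model ends the sentence
            break
        words.append(z)
        x, y = y, z                      # slide the trigram window
    return words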
To make trigram_randsent.py, start by copying
fileprob.py. As the handout indicates, the graders should
be able to call the script like this:
./trigram_randsent.py [mymodel] 10 --max_length 20
to get 10 samples of length at most 20.
You want to support the log_linear argument to
train_lm.py. This makes use of
EmbeddingLogLinearLanguageModel in probs.py.
Complete that class.
For part (b), you’ll need to complete the train() method
in that class.
For part (d), you want to support log_linear_improved.
This makes use of ImprovedLogLinearLanguageModel, which you
should complete as you see fit. It is a subclass of the LOGLIN model, so
you can inherit or override methods as you like.
As you are only adding new models, the behavior of your old models should not change.
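For orientation, the core of train() (part (b)) is a stochastic-gradient loop over the training trigrams. Here is a bare-bones, non-vectorized sketch; log_prob is a placeholder for however your class computes log p(z | xy) as a differentiable tensor, read_trigrams is the iterator provided in probs.py (check its actual signature), and the handout specifies the real objective, including regularization:

# Bare-bones SGD sketch for train() (log_prob is a placeholder; see the handout for the real objective)
import torch
from probs import read_trigrams

def train_sketch(model, corpus, eta0=0.01, epochs=1):
    optimizer = torch.optim.SGD(model.parameters(), lr=eta0)
    for _ in range(epochs):
        for (x, y, z) in read_trigrams(corpus, model.vocab):
            optimizer.zero_grad()
            loss = -model.log_prob(x, y, z)   # plus any regularization term from the handout
            loss.backward()                   # autograd computes the gradient
            optimizer.step()                  # one SGD update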
Training the log-linear model on en.1K can be done with
simple “for” loops and a 2D array representation of matrices. However,
you’re encouraged to use PyTorch’s tensor operations, as discussed in
the handout. This will reduce training time and might simplify your
code.
TA’s note: “My original implementation took 22 hours per epoch. Careful vectorization of certain operations, leveraging PyTorch, brought that runtime down to 13 minutes per epoch.”
Make sure to use the torch.logsumexp method for
computing the log-denominator in the log-probability.
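Concretely, you can compute one unnormalized score per vocabulary word and normalize with logsumexp. A sketch assuming the bilinear scores from the handout's basic log-linear model (swap in whatever features your implementation actually uses); x_emb and y_emb are the context-word embeddings, E is the vocab-by-d embedding matrix, and X, Y are the learned d-by-d parameter matrices:

# Sketch: vectorized log p(z | xy) with torch.logsumexp (variable names are placeholders)
import torch

def log_prob_sketch(x_emb, y_emb, z_index, E, X, Y):
    logits = x_emb @ X @ E.t() + y_emb @ Y @ E.t()           # one unnormalized score per word type
    return logits[z_index] - torch.logsumexp(logits, dim=0)  # subtract the log-denominator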
The reading handout has a section with this title.
To recover Algorithm 1 (convergent SGD), you can use a modified
optimizer that we provide for you in SGD_convergent.py:
from SGD_convergent import ConvergentSGD
optimizer = ConvergentSGD(self.parameters(), eta0=eta0, lambda_=2*C/N)
To break the epoch model as suggested in the “Shuffling” subsection,
check out the method draw_trigrams_forever in
probs.py.
For mini-batching, you could modify either read_trigrams
or draw_trigrams_forever.
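For instance, a mini-batch generator can just slice the endless trigram stream. A sketch using itertools.islice (the exact signature of draw_trigrams_forever is whatever probs.py defines):

# Sketch: grouping the endless trigram stream into mini-batches
from itertools import islice
from probs import draw_trigrams_forever

def minibatches(corpus, vocab, batch_size=64):
    stream = draw_trigrams_forever(corpus, vocab, randomize=True)   # check the actual signature in probs.py
    while True:
        yield list(islice(stream, batch_size))   # the next batch_size trigrams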
In the starter code for this class, we have generally tried to follow
good Python practice by annotating
the type of function arguments, function return values, and some
variables. This serves as documentation. It also allows a type checker
like mypy
or pylance to report likely bugs – places in your code
where it can’t prove that your function is being applied on arguments of
the declared type, or that your variable is being assigned a value of
the declared type. You can run the type checker manually or configure
your IDE to run it continually.
Ordinarily Python doesn’t check types at runtime, as this would slow
down your code. However, the typeguard module does allow
you to request runtime checking for particular functions.
Runtime checking is especially helpful for tensors. All PyTorch
tensors have the same type – namely torch.Tensor – but that
doesn’t mean that they’re interchangeable. For example, trying to
multiply a 4 x 7 matrix by a 10 x 7 matrix will raise a runtime
exception, which will alert you that one of your matrices was wrong.
Perhaps you meant to transpose the 10 x 7 matrix into a 7 x 10 matrix.
Unfortunately, this exception only happens once that line of code is
actually executed, and only because 10 happened to be different from 7;
if the second matrix were 7 x 7, then the dimensions would
coincidentally match and there would be no exception, just a wrong
answer.
So it helps to use stronger typing than in standard Python. The jaxtyping
module enables you to add finer-grained type annotations for a tensor’s
shape (including named dimensions to distinguish the two 7’s
above), dtype, etc. (This package is already in the
nlp-class.yml environment.)
With typeguard, your tensors can be checked at runtime
to ensure that their actual types match the declared types. Without
typeguard, jaxtyping is just for documentation
purposes.
# EXAMPLE
import torch
from jaxtyping import Float32
from typeguard import typechecked

@typechecked
def func(x: Float32[torch.Tensor, "batch num_features"],
         y: Float32[torch.Tensor, "num_features batch"]) -> Float32[torch.Tensor, "batch batch"]:
    return torch.matmul(x, y)

func(torch.rand((10,8)), torch.rand((8,10)))  # passes test
func(torch.rand((10,8)), torch.rand((10,8)))  # complains that dimension named "batch" has inconsistent sizes
You can use the same language models as before, without changing
probs.py or train_lm.py.
In this question, however, you’re back to using only one language
model as in fileprob (not two as in textcat).
So, initialize speechrec.py to a copy of
fileprob.py, and then edit it.
Modify speechrec.py so that, instead of evaluating the
prior probability of the entire test file, it separately evaluates the
prior probability of each candidate transcription in the file. It can
then select the transcription with the highest posterior
probability and report its error rate, as required.
The read_trigrams function in probs.py is
no longer useful, since a speech dev or test file has a special format.
You don’t want to iterate over all the trigrams in such a file. You may
want to make an “outer loop” utility function that iterates over the
candidate transcriptions in a given speech dev or test file, along with
an “inner loop” utility function that iterates over the trigrams in a
given candidate transcription.
(The outer loop is specialized to the speechrec format, so it
probably belongs in speechrec.py. The inner loop is similar
to read_trigrams and might be more generally useful, so it
probably belongs in probs.py.)
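Putting the two loops together, the selection step then looks roughly like this. Here read_candidates, candidate_trigrams, the candidate fields, and lm.log_prob are all hypothetical stand-ins for whatever you write and for the file format described in the handout; also make sure the acoustic score from the file and your language-model score use the same log base:

# Sketch of speechrec.py's selection step (helper names and fields are hypothetical)
def best_transcription(speech_file, lm):
    best_score, best_cand = None, None
    for cand in read_candidates(speech_file):                 # outer loop: candidate transcriptions
        log_prior = sum(lm.log_prob(x, y, z)                  # inner loop: trigrams of this candidate
                        for (x, y, z) in candidate_trigrams(cand.words, lm.vocab))
        score = cand.acoustic_logprob + log_prior             # log posterior, up to a constant
        if best_score is None or score > best_score:
            best_score, best_cand = score, cand
    return best_cand                                          # report its error rate as required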
This section is optional but is a good way to speed up your experiments, as discussed in the reading handout.
(If you have your own GPU, you don’t need Kaggle; just include the
command-line option --device cuda when running the starter
code. Or if you have a Mac, you may be able to use
--device mps.)
To get a Kaggle account (if you don’t already have one), follow the instructions at https://www.kaggle.com/account/login. Kaggle allows one account per email address.
Visit https://www.kaggle.com/code and click on the “New Notebook” button.
Check the sharing settings of your notebook by opening the “File” menu and choosing “Share”. You can add your homework partner there as a collaborator. Academic honesty is required in all work that you and others submit to be graded, so please do not make your homework public.
You can click on the notebook’s title to rename it. Now click on “Add Data” and add the dataset https://www.kaggle.com/datasets/jhunlpclass/hw-lm-data, which has the data for this homework.
How do you work the interface? If you hover over a cell, some useful
buttons will appear nearby. Others will appear if you click on the cell
to select it, or right-click on it (Command-click on a Mac) to bring up
a context menu. You can also drag a cell. There are more buttons and
menus to explore at the top of the notebook. Finally, there are lots of
keyboard
shortcuts – starting with very useful shortcuts like
Shift+Enter for running the code you’ve just edited in the
current cell.
Try adding and running some Code cells, like
import math
x=2
math.log(x)
Try editing cells and re-running them.
It’s also wise to create Markdown cells where you can keep notes on your computations. That’s why it’s called a Notebook. Try it out!
Also try other operations on cells: cut/paste, collapse/expand, delete, etc.
Try some shell commands. Their effects on the filesystem will persist
as long as the kernel does. However, each ! line is run in
its own temporary bash shell, so the working directory and environment
variables will be forgotten at the end of the line.
!cd /; ls # forgotten after this line
!pwd # show name of current dir
!ls -lF /kaggle/input/hw-lm-data # look at the dataset you added
If you want to change the working directory persistently, use Python commands, since the Python session lasts for as long as the kernel does:
import os
os.chdir('/kaggle/input/hw-lm-data/lexicons')
!ls
This is the first method sketched in the reading handout, and is preferred if you’re comfortable with git.
Make a private github repository hw-lm-code for your
homework code. You’d like to clone it from your Notebook like this:
!git clone https://github.com/<username>/hw-lm-code.git
Unfortunately, your repo is private and there’s no way to enter your password from the Notebook (well, maybe there is).
You might think you could include your password somewhere on the above command line, but github now disallows this since they don’t want you to expose your password. Instead, github requires you to create a personal access token. Click on “Generate new token” here to get a fine-grained personal access token. Configure your token as follows so that it gives access to what is necessary, and no more:
In particular, give it access to just the hw-lm-code repository, with permission to read and write that repository's contents (you'll also push to it later).
You should now be able to put the following in a Notebook cell and run it:
!git clone https://oauth2:<token>@github.com/<github username>/hw-lm-code.git
where <token> is your token (about 100 chars) and
<github username> is your github account username. If
the Notebook reports that it can’t find github.com, then
try again after turning “Internet on” in the “Notebook options” section
of the “Notebook” pane. (The “Notebook” pane appears to the right of the
Notebook; you might have to close the “Add Data” pane first in order to
see it.)
Congratulations! You now have a directory
/kaggle/working/hw-lm-code. You can now run your code from
your Notebook. For example, here’s a shell command:
!hw-lm-code/build_vocab.py ../input/hw-lm-data/data/gen_spam/train/{gen,spam} --threshold 3 --output vocab-genspam.txt
If you want to use your modules interactively, try
pip install jaxtyping # needed dependency (we're not using nlp-class environment)
import sys
sys.path.append('/kaggle/working/hw-lm-code') # tells Python where to look for your modules
import probs # import your own module
tokens = probs.read_tokens("/kaggle/input/hw-lm-data/data/gen_spam/train/spam")
list(tokens)[:20]
Whenever you push committed changes from your local machine to github, you can pull them down again into the Notebook environment by running this:
!cd hw-lm-code; git pull
This is the second method sketched in the reading handout. You’ll
upload your code files to Kaggle as a new private Dataset; call
it hw-lm-code.
One option is to do this manually:
Go to the Datasets page of the Kaggle web interface, click on the “New Dataset” button,
and upload your code files (individually or as a .zip
file).
If you later change your code, go to your Dataset
https://www.kaggle.com/datasets/<kaggle username>/hw-lm-code
and click “New Version” to replace files.
Another option is to upload from your local machine’s command line. This makes it easier to update your code, but it will require a little setup.
Go to the “Account” tab of your Kaggle profile
(https://www.kaggle.com/<kaggle username>/account)
and select “Create API Token”. A file kaggle.json will be
downloaded.
Move this file to $HOME/.kaggle/kaggle.json if you’re on
MacOS/Linux, or to
C:\Users\<windows username>\.kaggle\kaggle.json if
you’re on Windows.
You should make the file private, e.g.,
chmod 600 ~/.kaggle/kaggle.json.
Generate a metadata file in your code directory:
kaggle datasets init -p <code directory>
In the generated metadata file (named dataset-metadata.json by recent versions of the kaggle tool, datapackage.json by older ones), edit the
title to your liking and set id to
<kaggle username>/hw-lm-code.
You don’t have to do pip install kaggle because you
already activated the nlp-class conda environment, which
includes that package.
Now you can create the Dataset with
kaggle datasets create -r zip -p <code directory>
If you later change your code, update your Dataset to a new version with
kaggle datasets version -r zip -p <code directory> -m 'your comment here'
Once you’ve created the hw-lm-code Dataset (via either
option), use your Notebook’s “Add Data” button again to add it to your
Notebook as a second Dataset. Congratulations! Your Notebook can now see
your code under /kaggle/input/hw-lm-code.
Let’s create a symlink from /kaggle/working/hw-lm-code,
so we can find the code at the same location as with the
git clone method:
!ln -s /kaggle/input/hw-lm-code .
Unfortunately, everything under /kaggle/input/ is
read-only, since Kaggle thinks it’s data. So you can’t make the
.py files executable (or modify them in any other way). But
you can still execute the Python interpreter on them:
!python hw-lm-code/build_vocab.py ../input/hw-lm-data/data/gen_spam/train/{gen,spam} --threshold 3 --output vocab-genspam.txt
And you can still use the modules interactively just as before:
pip install jaxtyping # needed dependency (we're not using nlp-class environment)
import sys
sys.path.append('/kaggle/working/hw-lm-code') # tells Python where to look for your modules (symlink)
import probs # import your own module
tokens = probs.read_tokens("/kaggle/input/hw-lm-data/data/gen_spam/train/spam")
list(tokens)[:20]
Whenever you update your Kaggle Dataset, you’ll have to tell the
Notebook to reload /kaggle/input/hw-lm-code. Find your
dataset in the “Notebook” pane (under “Datasets”) and select “Check for
updates” from its “⋮” menu. (The “Notebook” pane appears to the right of
the Notebook; you might have to close the “Add Data” pane first in order
to see it.)
A dashboard in the upper right of the notebook lets you see how long the associated kernel (background Python session) has been running and its CPU, GPU, RAM, and disk usage.
Your kernel will be terminated if you close or reload the webpage, or if you’ve been inactive for over 40 minutes. Your Notebook will still be visible in the webpage. However, if you resume using it, a new kernel will be started. You’ll have to re-run the Notebook’s cells (see the “Run All” command on the “Run” menu) to restore the state of your Python session and the contents of your working directory. (Alternatively, they will be automatically restored at the start of the new session if you turned on the “Persistence” features under “Notebook options” in the “Notebook” pane.)
Fortunately, there is a way to run your Notebook in the background for up to 12 hours. Click “Save Version” in the upper-right corner of the Notebook webpage. Make sure the version type is “Save and Run All (Commit)”. This will take a snapshot of the Notebook, start a new kernel, and run all of the Notebook’s Code cells from top to bottom.
The resulting “committed” version of the Notebook will include the
output, meaning the output of each cell and the contents of
/kaggle/working. To go look at it while it is still
running, use the “View
Active Events” button in the lower left corner of the Kaggle
webpage. This gives a list of running kernels. Find the one you want,
and select “Open in Viewer” or “Open Logs in Viewer” from its “…”
menu.
“Save Version” can also do a “Quick Save” that saves the current notebook without re-running anything. You can choose whether or not to save the output.
To see all of your saved versions – both quick and committed – click on the version number next to the “Save Version” button.
If your code creates /kaggle/working/whatever.model, how
can you then download that file to your local machine so that you can
submit it to Gradescope?
Your interactively running Notebook has a “Notebook” pane at the right. (To see it, you might have to close another pane such as “Add Data”.) In the “Output” section of that pane, browse to your file and select “Download” from its “⋮” menu.
Each saved version of your Notebook has an Output tab. You
can click around there to download individual files from
/kaggle/working. That tab also provides a command you can
run on your local machine to download all the output, which will work if
you set up a kaggle.json file with an API token as described earlier.
Another option: If you cloned a git repo, the end of your Notebook could push the changes back up to github. For example:
# give git enough info for its commit log (this modifies hw-lm-code/.git/config)
!cd hw-lm-code; git config user.email '<username>@jh.edu'
!cd hw-lm-code; git config user.name '<Your Name> via Kaggle'
# copy your models into your local working tree
!cp *.model hw-lm-code
# add the model files to the repo, commit them, and push
!cd hw-lm-code; git add *.model; git commit -a -m'newly trained models'; git push
You can now git pull the changes down to your local
machine.
Now, why again did we bother to get all of that working? Oh right! It’s time to get our speedup.
Include the option --device cuda when running
train_lm.py (and perhaps also when running
fileprob.py or textcat.py). In our starter
code, that flag tells PyTorch to create all tensors on the GPU by
default. (Every tensor has a torch.device
attribute that specifies the device where it will be computed.)
However, creating a tensor will now throw a runtime exception if your kernel doesn’t have access to a GPU. To give it access, choose “Accelerator” from the “⋮” menu in the upper right of the Notebook, and change “None” to “GPU P100”.
When you recompute your notebook in the background with “Save and Run All (commit)”, ordinarily it uses whatever GPU acceleration is turned on for the Notebook. However, you can change this under “Advanced options” when saving.
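You can verify from a Code cell that the GPU is actually visible before launching a long run; torch.cuda.is_available() is the standard check:

# Quick check that the kernel can see a GPU before you rely on --device cuda
import torch
print(torch.cuda.is_available())     # True once an accelerator such as the P100 is attached
x = torch.zeros(3, device="cuda" if torch.cuda.is_available() else "cpu")
print(x.device)                      # where this tensor lives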
Kaggle’s GPU usage is subject to limitations: the weekly quota of 30 GPU hours rolls over every Friday night at 8pm EDT / 7pm EST.
Important: Some tips about conserving GPU usage are here. You may want to turn on persistence so that you don’t have to rerun training every time you reload your Notebook, and do Quick Saves rather than commits so that you don’t rerun training every time you save your Notebook.
Be warned that when an interactive session with acceleration is active, it counts towards your 30 hours, even when no process is running. To conserve your hours when you’re not actually using the GPU, change the Accelerator back to None, or shut down the kernel altogether by clicking the power icon in the upper right or closing the webpage.
A version of this Python port for an earlier version of this assignment was kindly provided by Eric Perlman, a previous student in the NLP class. Thanks to Prof. Jason Baldridge at U. of Texas for updating the code and these instructions when he borrowed that assignment. They were subsequently modified for later versions of the assignment by Xuchen Yao, Mozhi Zhang, Chu-Cheng Lin, Arya McCarthy, Brian Lu, and Jason Eisner.
The Kaggle instructions were written by Camden Shultz and Jason Eisner. Thanks to David Chiang for suggesting the classroom use of Kaggle and the idea of uploading code files as a Dataset.