Homework 1

In the first part of this assignment, you will build a computer program whose performance can be measured in a variety of ways. The second part will be posted on 10/19, after lecture 2, and will consist of a small problem set.

Part 1 of 2

You will write two programs, one to compress (or encode) data, and one to decompress (or decode) a compressed file. You are to write it on your own, and you may not use anyone else's code or any existing, nonstandard libraries (you definitely may not use libz, so don't ask). Your programs must run on the ugrad Linux machines on the command line. If you aren't yet comfortable with Unix/Linux, use google or email the instructors.

The input to your program could be any sequence of bytes. The output (to standard output) will be a (single) sequence of bytes. Your program is not to generate any auxiliary files.

We can evaluate your compression/decompression program pair in three ways:

  1. How fast does it run (on a given file, on a certain type of file, on a large set of files, etc.; also compression and decompression speed)?
  2. What is the rate of compression (how much smaller is the output of encode than the input)?
  3. What is the lossiness (if we run cat foo | ./encode | ./decode > foo2 how similar are foo and foo2)?

Note that the cat program is fast (1) and lossless (3) but does no actual compression. Note that reading and writing nothing out at all (ignoring the input) has extremely good compression and speed but is very lossy! You don't have to be smart to write a tool that meets the specs; you may have to do a bit of googling to learn a bit about compression to write tools that do "reasonably" well on all three criteria.

Note that you can measure these things, too. You should start out by getting a set of data files (from your home directory, the Web, or wherever you'd like - you probably want some variety in the set), and writing a script that can report speed, compression, and loss. To report runtime, run:
time ./encode < foo > foo.compressed
To list byte differences in two files, run:
cmp -l foo foo2
This might help you track your progress over time and figure out what kinds of inputs your program has problems with. You can also compare with standard compression programs like gzip. You are welcome to read anything you like about compression. A place to start might be http://en.wikipedia.org/wiki/Data_compression. Remember that you are to work alone on this assignment.

Deliverables

Everything will be submitted electronically to nasmith@gmail.com. You will submit all source code, a makefile, and an ASCII plaintext or pdf file describing, in less than one page, how your program works. If you had a particular strategy (e.g., your program is guaranteed not to be lossy) then you may point this out. If you used standard algorithms, remember to cite work appropriately. Your program will eventually be compared with the other students' programs. We will give a bit of extra credit to the best-performing program (where performance is a combination of speed, compression, and losslessness). However, most of your grade be based on successful, effortless compilation and execution (no crashes!) and on a concise, understandable summary of how your program works. We will not read your code unless we think it will be interesting (and definitely not if it doesn't compile), and we won't give partial credit for beautiful code that doesn't work. To make things easier for us, you must: To get you moving, we are requiring that you submit twice. The first version of your program (which must meet all of the above criteria) must be submitted by 8:00 pm, Tuesday, 10/18. The second version is due by 11:59 pm, Tuesday, 10/25 (not 8:00 pm as previously noted). Please keep a copy of each submission (separately) for the future.

Part 2 of 2

This part of the assignment can be extremely easy if you use the right tools. It only takes a few lines of R to generate what we are asking for. It probably only takes a few lines of Matlab. You can do it in Microsoft Excel. It would also be a very short Perl script (or even C program). We don't care what tool you use. We do not want your code, just your answers to the questions. This part of the assignment is due in hardcopy in class on Wednesday, 10/26.

Download the file hw1.data. It contains two (labeled) columns, "foo" and "bar." There are 200 entries in each column, each a numerical value.

  1. Plot a histogram* for foo and for bar. Do you think foo has anything to do with bar? (You may want to look at more than just the histograms to answer this question.)
  2. Compute the mean and median of foo and of bar and the correlation between them.
  3. Compute the standard error of the sample means (using the closed form).
  4. For B = {10, 50, 100, 200}, use the bootstrap to compute the standard error of these (five) statistics.
  5. For each of the five statistics in the B = 200 case, draw a histogram of the values of the bootstrap replicate statistic. Which of these look normal, and which don't?
*For any of the histograms, a stem-and-leaf or ASCII representation of a histogram is okay.
Noah A. Smith | Empirical Research Methods in Computer Science