Homework 2

In this homework, you will carry out several experiments with your own compression programs and with gzip.

  1. Collect some text files and some image files from the web or from your own computer. Determine whether the mean compression rate of your program is higher for text or for images, or whether there is no difference. Describe carefully the statistical tests and reasoning you use. A clearly written discussion is much more informative than the smallest p-value. What is the null hypothesis? How much data are you basing your conclusions on? Compression rate is simply the size in bytes of the compressed file divided by the size in bytes of the original file.

  2. With the same data, characterize the differences, if any, in the variance of your program's compression rate between text and images.

  3. How does the mean compression rate of text and images differ with gzip? Make sure you use the same options for all files.

  4. Repeat the variance analysis from question 2 for gzip's compression rate.

  5. Did your program's compression rate change significantly from its first to its second version?

  6. Did your program's speed (bytes compressed divided by the total time for compressing and decompressing) change significantly between versions? We used the Unix system call gettimeofday (available in C, Perl, etc.) to measure time. Describe the method you use.
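
One possible way to compare two groups of measurements (text vs. images, or version 1 vs. version 2) is a two-sided permutation test. The sketch below is an illustration only, not the test the assignment prescribes: the function name permutation_test, its parameters, and the choice of 10000 resamples are all assumptions. The null hypothesis it encodes is that the group labels are exchangeable, so shuffling them should produce differences as large as the observed one reasonably often.

```python
import random
from statistics import mean


def permutation_test(a, b, statistic=mean, n_resamples=10000, seed=0):
    """Two-sided permutation test for a difference in `statistic`.

    `a` and `b` are sequences of per-file measurements (for example,
    compression rates). All names here are hypothetical, chosen for
    illustration. Returns an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(statistic(a) - statistic(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign files to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(statistic(pooled[:n_a]) - statistic(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (extreme + 1) / (n_resamples + 1)
```

Passing statistics.variance as the statistic gives a comparable sketch for the variance questions. Whatever test you choose, state its assumptions and the number of files behind it in your write-up.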

Just to be clear, questions 1-5 deal with compression rate:

  rate = compressed bytes / original bytes

Question 6 deals with compression speed:

  speed = original bytes / (compression time + decompression time)

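
The two definitions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name measure is hypothetical, Python's gzip module stands in for whichever compressor you are testing, and time.perf_counter is used in place of gettimeofday.

```python
import gzip
import time


def measure(data, compress=gzip.compress, decompress=gzip.decompress):
    """Return (rate, speed) for one input, given as bytes.

    rate  = compressed bytes / original bytes
    speed = original bytes / (compression time + decompression time)

    `measure` is a hypothetical helper; swap in your own compress and
    decompress functions to test your program instead of gzip.
    """
    if not data:
        raise ValueError("cannot measure an empty input")
    start = time.perf_counter()
    compressed = compress(data)
    restored = decompress(compressed)
    elapsed = time.perf_counter() - start
    # A compressor that does not round-trip should not be scored.
    if restored != data:
        raise ValueError("decompressed output differs from the input")
    rate = len(compressed) / len(data)
    # Guard against a zero reading from the clock on very small inputs.
    speed = len(data) / elapsed if elapsed > 0 else float("inf")
    return rate, speed
```

A single timing of a small file is noisy, so it is probably worth repeating each measurement several times and reporting how you combined the runs.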
We are looking for a convincing, readable, understandable story in English. You may submit in PDF, PostScript, plain text, or hard copy. The assignment is due Friday, November 11, at 11:59 pm.
Noah A. Smith and David A. Smith | Empirical Research Methods in Computer Science