Homework 1
In the first part of this assignment, you will build a computer program whose
performance can be measured in a variety of ways. The second part will be posted on 10/19, after lecture 2, and will consist of a small problem set.
Part 1 of 2
You will write two programs, one to compress (or encode) data, and
one to decompress (or decode) a compressed file. You are to write
it on your own, and you may not use anyone else's code or any
existing, nonstandard libraries (you definitely may not use libz, so don't ask). Your programs
must run on the ugrad Linux machines on the command line. If you aren't yet comfortable with Unix/Linux, use google or email the instructors.
The input to your program could be any sequence of bytes.
The output (to standard output) will be a (single) sequence of bytes. Your program is not to generate any auxiliary files.
We can evaluate your compression/decompression program pair in three ways:
- How fast does it run (on a given file, on a certain type of
file, on a large set of files, etc.; also compression and decompression speed)?
- What is the rate of compression (how much smaller is the output
of encode than the input)?
- What is the lossiness (if we run
cat foo | ./encode |
./decode > foo2 how similar are foo and foo2)?
Note that the cat program is fast (1) and lossless (3) but
does no actual compression. Note that reading and writing nothing out at all (ignoring
the input) has extremely good compression and speed but is
very lossy! You don't have to be smart to write a tool that meets the specs; you may have
to do a bit of googling to learn a bit about compression to write tools that do "reasonably" well
on all three criteria.
Note that you can measure these things, too. You should
start out by getting a set of data files (from your home directory,
the Web, or wherever you'd like - you probably want some variety in
the set), and writing a script that can report speed, compression, and
loss. To report runtime, run:
time ./encode < foo >
foo.compressed
To list byte differences in two files, run:
cmp -l foo foo2
This might help you track your
progress over time and figure out what kinds of inputs your program
has problems with. You can also compare with standard compression
programs like gzip.
You are welcome to read anything you like about compression. A
place to start might be http://en.wikipedia.org/wiki/Data_compression.
Remember that you are to work alone on this assignment.
Deliverables
Everything will be submitted electronically to nasmith@gmail.com.
You will submit all source code, a makefile, and an ASCII plaintext
or pdf file describing, in less than one page, how your program
works. If you had a particular strategy (e.g., your program is
guaranteed not to be lossy) then you may point this out. If you
used standard algorithms, remember to cite work appropriately.
Your program will eventually be compared with the other students' programs.
We will give a bit of extra credit to the best-performing program (where performance is a combination of speed, compression, and losslessness).
However, most of your grade be based on
successful, effortless compilation and execution (no crashes!) and
on a concise, understandable summary of how your program works. We
will not read your code unless we think it will be interesting (and
definitely not if it doesn't compile), and we won't give partial
credit for beautiful code that doesn't work.
To make things easier for us, you must:
- Name your programs encode and decode.
- Put all source code, your makefile, and your program description into one directory.
Inside that directory, run
tar -cv * | gzip
-c > yourfirstname.yourlastname.tar.gz
If you then run
gzip -cd
yourfirstname.yourlastname.tar.gz | tar
-t
You should see a listing of the files without any
preceding path. You must turn in your source code, and it must compile for us.
You may not use any nonstandard libraries (ask if you're not
sure) or anyone else's code. Your code must compile on the
linux ugrad machines using a compiler that's already installed
there. We recommend you use C; talk to us if you don't plan
to use C, C++, Java, Perl, or Python. Email the file yourfirstname.yourlastname.tar.gz to nasmith@gmail.com.
- Once we unpack your code, we should be able to run simply:
make The executables (named encode and
decode) should now be present in the directory. We don't want to have to even know what language your programs are written in!
- Your programs should run on the command line, reading (only)
from standard input and writing (only) to standard output. So
we should be able to run:
./encode < foo >
foo.compressed
./decode < foo.compressed > foo2
Of course, the files foo and foo2 might be different if your compression is lossy.
- Your program may generate diagnostic output to standard error. We would prefer you make this optional (e.g.,
./encode -v prints diagnostic
but ./encode is silent).
To get you moving, we are requiring that you submit twice. The first
version of your program (which must meet all of the above criteria)
must be submitted by 8:00 pm, Tuesday, 10/18. The second version
is due by 11:59 pm, Tuesday, 10/25 (not 8:00 pm as previously noted). Please keep a copy of each submission (separately) for the future.
Part 2 of 2
This part of the assignment can be extremely easy if you
use the right tools. It only takes a few lines of R to generate
what we are asking for. It probably only takes a few lines of
Matlab. You can do it in Microsoft Excel. It would also be a
very short Perl script (or even C program). We don't care what
tool you use. We do not want
your code, just your answers to the questions. This part of the
assignment is due in hardcopy in class on Wednesday, 10/26.
Download the file hw1.data. It contains
two (labeled) columns, "foo" and "bar." There are 200 entries in
each column, each a numerical value.
- Plot a histogram* for foo and for bar. Do you think foo has anything to do with bar? (You may want to look at more than just the histograms to answer this question.)
- Compute the mean and median of foo and of bar and the correlation between them.
- Compute the standard error of the sample means (using the closed form).
- For B = {10, 50, 100, 200}, use the bootstrap to compute the standard error of these (five) statistics.
- For each of the five statistics in the B = 200 case, draw a histogram of the values of the bootstrap replicate statistic. Which of these look normal, and which don't?
*For any of the histograms, a stem-and-leaf or ASCII representation of a histogram is okay.
Noah A. Smith | Empirical Research
Methods in Computer Science