600.226: Data Structures

Fall Semester 2005: September 8, 2005 - December 12, 2005

Assignment 8: Hashed Speech

Out on: October 27, 2005
Due by: November 2, 2005 by 5:59 pm for full credit (11:59 pm for 10% off, hard deadline)
Collaboration: Pairs
Grading: Packaging 10%, Style 10%, Performance 30%, Design 10%, Functionality 40%

Overview

The eighth assignment for 600.226: Data Structures deals mostly with maps of one sort or another. There are no "written" problems as such, but you can still rant a lot in your README file.

Note that each pair hands in one assignment only. Be sure to include the relevant information (who is in the pair!) in your README file! Both of you are getting the same score for the assignment.

Here are the necessary interfaces and exception classes: maps.tar.gz. As usual, you are not allowed to change the interfaces in any way! If you think something is wrong, email the discussion list and check for replies there.

Problem 1: Simple Maps

Your first task is to write a "simple" Map implementation for reference. You are free to use arrays, your own lists, or Java's ArrayList class as you wish. Your SimpleMap implementation could be sorted or unsorted, use a variation of "move-to-front" or not, etc. The only thing it cannot be is a hash table. :-) The details are up to you, but please discuss your choices in the README file.

As usual, please provide a toString() method to return a String representation of the map. Also, you should either provide a suitable main() method for testing, or you should adapt the code you wrote for comparing Set implementations last week (see Problem 2 below).
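To make the "move-to-front" idea concrete, here is a minimal sketch of an unsorted, list-based map that moves every key it finds to the head of its list. It is only an illustration: the class name MoveToFrontSketch and the methods put, get, and size are assumptions made up for this example, and they are not the Map interface from maps.tar.gz, which your SimpleMap must implement instead.

```java
// Illustrative sketch only: the class and method names are assumptions,
// NOT the course's Map interface from maps.tar.gz.
class MoveToFrontSketch<K, V> {
    // One node of a singly linked list of key/value pairs.
    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private Node<K, V> head;
    private int size;

    // Linear search; a node that is found is moved to the front of the list.
    private Node<K, V> find(K key) {
        Node<K, V> prev = null;
        for (Node<K, V> cur = head; cur != null; prev = cur, cur = cur.next) {
            if (cur.key.equals(key)) {
                if (prev != null) {        // not already at the front
                    prev.next = cur.next;  // unlink
                    cur.next = head;       // relink at the head
                    head = cur;
                }
                return cur;
            }
        }
        return null;
    }

    // Inserts or updates a mapping; returns the old value or null.
    V put(K key, V value) {
        Node<K, V> node = find(key);
        if (node != null) {
            V old = node.value;
            node.value = value;
            return old;
        }
        head = new Node<>(key, value, head);
        size++;
        return null;
    }

    // Returns the value for key, or null if the key is absent.
    V get(K key) {
        Node<K, V> node = find(key);
        return node == null ? null : node.value;
    }

    int size() {
        return size;
    }

    // Lists the mappings in their current list order.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("{");
        for (Node<K, V> cur = head; cur != null; cur = cur.next) {
            sb.append(cur.key).append('=').append(cur.value);
            if (cur.next != null) {
                sb.append(", ");
            }
        }
        return sb.append('}').toString();
    }
}
```

Every operation is still linear in the worst case, but keys that are looked up often drift to the front, which can help a lot on skewed data such as word counts. Whether that beats a sorted array with binary search is the kind of choice to discuss in your README.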

Problem 2: Hashed Maps

Your second task is to write an implementation of Map based on hash tables, called HashedMap for obvious reasons. Once again you are free to use arrays, your own lists, or Java's ArrayList class; you could even use your SimpleMap as part of this, depending on how you choose to handle collisions (see below). There are a number of options for your implementation, and you may want to consider them a bit before committing to one or the other; you might even decide to change things after you see how your first prototype performs. Here we go:

Once again, the exact choices are up to you, so you can determine the amount of work you do pretty flexibly. However, please keep in mind that we emphasize performance more than in previous assignments (check the rubric above) so it is in your best interest to make choices that will result in the best possible performance under the widest variety of conditions.
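As one example of the kind of choice involved, here is a rough sketch of a hash table that handles collisions by separate chaining and doubles its bucket array once the load factor passes 3/4. Treat everything in it as an assumption for illustration: the name ChainedMapSketch, the methods put, get, and size, the initial capacity of 16, and the 3/4 threshold are not prescribed by the assignment, and the real HashedMap has to implement the Map interface from maps.tar.gz. Open addressing or a different growth policy may well perform better.

```java
// Illustrative sketch only: separate chaining is ONE possible collision
// strategy; names, capacity, and load factor are assumptions.
class ChainedMapSketch<K, V> {
    // One entry in a bucket's singly linked chain.
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private Entry<K, V>[] buckets = newTable(16);
    private int size;

    // Generic arrays cannot be created directly, hence the cast.
    @SuppressWarnings({"unchecked", "rawtypes"})
    private static <K, V> Entry<K, V>[] newTable(int capacity) {
        return (Entry<K, V>[]) new Entry[capacity];
    }

    // Clears the sign bit so the index is never negative.
    private static int indexFor(Object key, int capacity) {
        return (key.hashCode() & 0x7fffffff) % capacity;
    }

    // Inserts or updates a mapping; returns the old value or null.
    V put(K key, V value) {
        int i = indexFor(key, buckets.length);
        for (Entry<K, V> e = buckets[i]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        buckets[i] = new Entry<>(key, value, buckets[i]);
        size++;
        if (size > buckets.length * 3 / 4) {
            resize();
        }
        return null;
    }

    // Returns the value for key, or null if the key is absent.
    V get(K key) {
        int i = indexFor(key, buckets.length);
        for (Entry<K, V> e = buckets[i]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    int size() {
        return size;
    }

    // Doubles the table and relinks every entry into its new bucket.
    private void resize() {
        Entry<K, V>[] old = buckets;
        buckets = newTable(old.length * 2);
        for (Entry<K, V> chain : old) {
            Entry<K, V> e = chain;
            while (e != null) {
                Entry<K, V> next = e.next;
                int i = indexFor(e.key, buckets.length);
                e.next = buckets[i];
                buckets[i] = e;
                e = next;
            }
        }
    }
}
```

Resizing keeps the chains short, so put and get stay close to constant time on average as long as the hash function spreads the keys reasonably well.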

As usual, please provide a toString() method to return a String representation of the map. Also, you should either provide a suitable main() method for testing, or you should adapt the code you wrote for comparing Set implementations last week. If you choose to do the latter, you can write one program TestMaps instead of two main() methods for Problems 1 and 2. If you plan to do some performance comparisons, I recommend the TestMaps approach. If you happen to write multiple versions of the hash table code, I also recommend that you keep "old" versions around for reference; your "best" version should be in HashedMap, and the other ones can be named whatever you want.

Please describe the test method you chose, the alternate implementations you provide (if any), and the results of your performance comparison (if any) in your README file.

Problem 3: Analyzing Speech

A simple way to get an idea of what a certain text is about is to count how often certain words appear in it. Your final task for this assignment is to write a program WordFreq (based on your Map implementations) that performs this kind of analysis.

Your program should accept input text (in plain ASCII format) from standard input and produce a list of the 32 most frequently occurring words on the standard output; for each word, the number of times it appeared should be given as well. Consider the command java WordFreq <in.txt >out.txt. If in.txt contains

Bla bla, balla bla. Balla balla bla
bla bla blue balla bla balla bla.

bla!!

          -- Bla balla blue bla.

then out.txt should contain

bla 11
balla 6
blue 2

and nothing else. As you can see, you should ignore capitalization, punctuation, and white space. In order to get some "meaningful" data out of this, you also have to ignore very frequent words such as "a" or "to" or "or" or "the" or... You get the idea. This site has various lists of "noise words" but I am not sure how good they are; for now I suggest we use their "27 words" as reproduced here:

the, and, a, to, of, in, i, is, that, it, on, you, this,
for, but, with, are, have, be, at, or, as, was, so, if,
out, not
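The rules above (ignore case, punctuation, and white space, then skip the noise words) can be sketched as follows. This is only a rough illustration under several assumptions: it uses java.util.HashMap as a stand-in where your own Map implementation belongs, it treats every run of characters other than a-z as a separator (so "don't" becomes "don" and "t", which is one possible reading of "ignore punctuation"), and it breaks ties between equally frequent words alphabetically, which the specification does not require. The names WordFreqSketch and topWords are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; java.util.HashMap stands in for your own Map.
class WordFreqSketch {
    // The "27 words" noise list from the assignment text.
    private static final Set<String> NOISE = new HashSet<>(Arrays.asList(
        "the", "and", "a", "to", "of", "in", "i", "is", "that", "it", "on",
        "you", "this", "for", "but", "with", "are", "have", "be", "at", "or",
        "as", "was", "so", "if", "out", "not"));

    // Returns up to limit (word, count) pairs, most frequent first.
    static List<Map.Entry<String, Integer>> topWords(String text, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: anything that is not a-z separates words.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty() || NOISE.contains(word)) {
                continue;
            }
            Integer old = counts.get(word);
            counts.put(word, old == null ? 1 : old + 1);
        }
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(counts.entrySet());
        // Sort by count, descending; ties alphabetically (an assumption).
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    // Reads all of standard input and prints one "word count" pair per line.
    public static void main(String[] args) throws IOException {
        StringBuilder text = new StringBuilder();
        BufferedReader in =
            new BufferedReader(new InputStreamReader(System.in));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            text.append(line).append('\n');
        }
        for (Map.Entry<String, Integer> e : topWords(text.toString(), 32)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Reading the whole input into one string keeps the sketch short; for the larger test texts, counting line by line avoids holding everything in memory. Sorting all distinct words is also more work than needed when only 32 of them are printed.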

Project Gutenberg is a good source for test data. I suggest Einstein, Kafka, and Marx as simple test cases. Religious texts are more voluminous and thus provide more challenging test cases, for example The Bible or The Koran. Feel free to test on whatever you want, but we'll pick our test cases from Project Gutenberg as well.

Deliverables

Please turn in a gzip compressed tarball of your assignment (the extension should be .tar.gz). The tarball should uncompress into a directory cs226-assignment-8-login1-login2 with login1 and login2 replaced by your Unix login names; uncompressing should not create any other files in the current directory. The tarball should contain no derived files whatsoever (i.e. no .class files, no .html files, etc.), but allow building all derived files. Include a README file that briefly explains what your programs do and contains any other notes you want us to check out before grading (and of course your answers to "written" problems).

Grading

For reference, here is a short explanation of the grading criteria. Packaging refers to the proper organization of the stuff you hand in, following the guidelines for Deliverables above. Style refers to Java programming style, including things like consistent indentation, appropriate identifiers, useful comments, suitable javadoc documentation, etc. Simple, clean, readable code is what you should be aiming for. Performance refers to how fast your program can produce the required results compared to other submissions. Design refers to proper modularization and the proper choice of algorithms and data structures. Functionality refers to your programs being able to do what they should according to the specification given above; if the specification is ambiguous and you had to make a certain choice, defend that choice in your README file.

If your programs cannot be built you will get no points whatsoever. If your programs cannot be built without warnings using javac -Xlint we will take off 10% (except if you document a very good reason). If your programs fail miserably even once, i.e. terminate with an exception of any kind, we will take off 10%.

Bonus Problem

It is quite interesting to compare the various hash functions people have developed over the years, both in terms of how efficiently you can compute them and how "collision free" they are. For the bonus problem, write a program that compares various hash functions for strings (the most common kind of key out there). Once again you can base this on the SortingAlgorithm framework to some extent: Just define an interface HashFunction instead and feed in text (similar to Problem 3 above); keep track of the hash values you get in a Bag and you can tell how frequently collisions occur; time how long it takes to compute all the hashes and divide either by number of words or number of characters to get a performance measure. This site has lots of hash functions to choose from. You could also try to confirm or disprove the properties claimed for certain hash functions in our textbook. Note that we won't give you extra points for this, but we'll give you extra kudos. :-)
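A possible starting point for such a comparison is sketched below. Only the interface name HashFunction comes from the text above; its single method hash, the class HashCompareSketch, and the two sample functions are assumptions for illustration. The sketch also uses two java.util.HashSet instances where the text suggests a Bag, so it only counts how many distinct words share a hash value with an earlier word, and its timing loop is crude: it ignores JIT warm-up, so measure repeatedly before trusting any numbers.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// The method name hash is an assumption; only the interface name
// HashFunction is suggested by the assignment text.
interface HashFunction {
    int hash(String key);
}

// Illustrative sketch only.
class HashCompareSketch {
    // Polynomial hash with multiplier 31, the formula documented
    // for java.lang.String.hashCode().
    static final HashFunction POLY31 = key -> {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    };

    // Deliberately weak: the sum of the character codes, so all
    // anagrams of a word collide.
    static final HashFunction CHAR_SUM = key -> {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h += key.charAt(i);
        }
        return h;
    };

    // Number of distinct words minus number of distinct hash values.
    static int collisions(HashFunction f, Iterable<String> words) {
        Set<String> seenWords = new HashSet<>();
        Set<Integer> seenHashes = new HashSet<>();
        for (String w : words) {
            if (seenWords.add(w)) {          // hash each distinct word once
                seenHashes.add(f.hash(w));
            }
        }
        return seenWords.size() - seenHashes.size();
    }

    // Crude average cost per word in nanoseconds (no JIT warm-up).
    static double nanosPerWord(HashFunction f, List<String> words) {
        if (words.isEmpty()) {
            return 0.0;
        }
        long sink = 0;
        long start = System.nanoTime();
        for (String w : words) {
            sink += f.hash(w);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == Long.MIN_VALUE) {        // keeps the loop from being optimized away
            System.out.println(sink);
        }
        return (double) elapsed / words.size();
    }
}
```

Dividing the elapsed time by the total number of characters instead of the number of words gives the other performance measure mentioned above.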

Updated: $Id: assignment-8.html 206 2005-10-28 16:45:15Z phf $
Copyright © 2005 Peter H. Fröhlich. All rights reserved.