Spring Semester 2006: January 30, 2006 - May 5, 2006
Out on:
April 7, 2006
Due by:
April 14, 2006 by 5:59 pm for full credit (11:59 pm for 10% off, hard deadline)
Collaboration:
Pairs
Grading:
Packaging 10%, Style 10%, Performance 30%, Design 10%, Functionality 40%
The ninth assignment for
600.226: Data Structures
deals mostly with ordered maps of one sort or another.
There are some "written" problems as well, to be answered in the
README file.
Note that each pair hands in one assignment!
Decide early on who is going to be responsible for submitting the
assignment and when.
Make sure to include all the relevant information (who is in the
pair?) in your README file!
Both of you will get the same score for the assignment.
Here are the necessary interfaces and exception classes: omaps.tar.gz. As usual, you are not allowed to change the code we provide in any way! Warning: This is a new version of the assignment and there may be serious bugs in these interfaces. If you think you have found a bug, please email the course staff about it immediately. Thanks!
Your first task is to write a class
SimpleOrderedMap<K,V>
that implements the
OrderedMap<K,V>
interface we provided above.
This class should use basic binary
search trees to implement the
OrderedMap<K,V>
operations, so no fancy "balancing acts" are allowed.
Please don't use existing Java data
structures for this; write the tree code from scratch
and implement it in terms of a "linked" representation
with separate Node objects that hold
Entries consisting of keys and values.
Your Node and Entry classes
should be nested inside your
SimpleOrderedMap<K,V>
class.
Try to make your code as simple as possible:
it's more important to have a correct implementation
for reference than to have the fastest possible implementation;
performance is addressed in the second problem below.
As usual, please provide a toString()
method to return a String representation
of the map, and a main() method that
performs basic unit testing for your implementation.
A new map into which the pairs
("Peter", 35),
("Hans", 70),
and
("Toni", 67)
were inserted should print as
{Hans: 70, Peter: 35, Toni: 67};
the order has to follow the order defined for the
keys you are using.
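To illustrate the output format, here is a minimal sketch of an unbalanced binary search tree with nested Node and Entry classes. It does not implement the OrderedMap<K,V> interface from omaps.tar.gz; the class name BstSketch and the methods put and get are assumptions made for illustration only, so adapt them to whatever the provided interface actually requires.

```java
import java.util.StringJoiner;

/** Hypothetical sketch; NOT the OrderedMap interface from omaps.tar.gz. */
class BstSketch<K extends Comparable<K>, V> {
    // An Entry pairs a key with its value.
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    // A Node links one Entry into the tree.
    private static class Node<K, V> {
        final Entry<K, V> entry;
        Node<K, V> left, right;
        Node(K key, V value) { this.entry = new Entry<>(key, value); }
    }

    private Node<K, V> root;

    /** Inserts the pair, or overwrites the value if the key already exists. */
    void put(K key, V value) {
        root = put(root, key, value);
    }

    // Plain recursive insertion without any rebalancing.
    private Node<K, V> put(Node<K, V> n, K key, V value) {
        if (n == null) return new Node<>(key, value);
        int c = key.compareTo(n.entry.key);
        if (c < 0) n.left = put(n.left, key, value);
        else if (c > 0) n.right = put(n.right, key, value);
        else n.entry.value = value;
        return n;
    }

    /** Returns the value for key, or null if the key is absent. */
    V get(K key) {
        Node<K, V> n = root;
        while (n != null) {
            int c = key.compareTo(n.entry.key);
            if (c == 0) return n.entry.value;
            n = (c < 0) ? n.left : n.right;
        }
        return null;
    }

    /** An in-order traversal visits the keys in sorted order. */
    @Override
    public String toString() {
        StringJoiner sj = new StringJoiner(", ", "{", "}");
        inorder(root, sj);
        return sj.toString();
    }

    private void inorder(Node<K, V> n, StringJoiner sj) {
        if (n == null) return;
        inorder(n.left, sj);
        sj.add(n.entry.key + ": " + n.entry.value);
        inorder(n.right, sj);
    }

    public static void main(String[] args) {
        BstSketch<String, Integer> m = new BstSketch<>();
        m.put("Peter", 35);
        m.put("Hans", 70);
        m.put("Toni", 67);
        System.out.println(m);
    }
}
```

Note that the recursive methods can overflow the stack when keys arrive in sorted order, since the tree then degenerates into a linked list; an iterative version avoids this.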
Your second task is to write an efficient
implementation of the
OrderedMap<K,V>
interface
using some kind of balanced search tree.
The exact data structure used for your
BalancedOrderedMap<K,V>
is up to you, as long as you pick from the list below.
There are a number of options, and you may want to consider them
a bit before committing to one or the other; you might even decide
to change things after you see how your first prototype performs.
Here are your options:
Once again, the exact choices are up to you, so you can determine the amount of work you do pretty flexibly. However, please keep in mind that we emphasize performance more than in previous assignments (check the rubric above) so it is in your best interest to make choices that will result in the best possible performance under the widest variety of conditions.
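Many balanced search trees restore balance with rotations, so it may help to get those right in isolation first. The following is a minimal sketch under that assumption; the class RotationSketch and its int-keyed Node are hypothetical, and your chosen data structure may need extra bookkeeping (heights, colors, priorities) or may not use rotations at all.

```java
/** Hypothetical sketch of the two basic tree rotations. */
class RotationSketch {
    static class Node {
        final int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    /** Rotates the subtree rooted at y to the right; returns the new root. */
    static Node rotateRight(Node y) {
        Node x = y.left;   // x moves up
        y.left = x.right;  // x's right subtree becomes y's left subtree
        x.right = y;       // y moves down to the right of x
        return x;
    }

    /** Rotates the subtree rooted at x to the left; returns the new root. */
    static Node rotateLeft(Node x) {
        Node y = x.right;  // y moves up
        x.right = y.left;  // y's left subtree becomes x's right subtree
        y.left = x;        // x moves down to the left of y
        return y;
    }
}
```

Both rotations preserve the in-order sequence of keys, which is why they can be applied freely without breaking the search tree property.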
As usual, please provide a toString() method
to return a String representation of the map
(see Problem 1 for the format).
In fact, given that balanced search trees are pretty
complicated beasts, you may want to add another operation
to generate DOT code; this will allow you to "visually
debug" your data structure fairly easily.
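One possible shape for such a DOT generator is sketched below. The class DotSketch, its String-keyed Node, and the method toDot are hypothetical names; the sketch also assumes that keys contain no double quotes, since those would need escaping in DOT.

```java
/** Hypothetical sketch: emit Graphviz DOT code for a binary tree. */
class DotSketch {
    static class Node {
        final String key;
        Node left, right;
        Node(String key) { this.key = key; }
    }

    /** Returns a DOT digraph with one edge per parent-child link. */
    static String toDot(Node root) {
        StringBuilder sb = new StringBuilder("digraph tree {\n");
        if (root != null) {
            // Declare the root explicitly so a one-node tree still shows up.
            sb.append("  \"").append(root.key).append("\";\n");
            emit(root, sb);
        }
        sb.append("}\n");
        return sb.toString();
    }

    private static void emit(Node n, StringBuilder sb) {
        edge(n, n.left, "L", sb);
        edge(n, n.right, "R", sb);
    }

    // Labels each edge L or R so left and right children can be told apart.
    private static void edge(Node parent, Node child, String label, StringBuilder sb) {
        if (child == null) return;
        sb.append("  \"").append(parent.key).append("\" -> \"")
          .append(child.key).append("\" [label=\"").append(label).append("\"];\n");
        emit(child, sb);
    }
}
```

Saving the result to a file such as tree.dot and running dot -Tpng tree.dot -o tree.png should render the tree, provided Graphviz is installed.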
Also, you should again provide a suitable main()
method for testing.
Please describe the test method you chose, the alternate
implementations you provide (if any), and the results of
your performance comparison (if any) in your
README file.
This problem is identical to last week's Problem 3, except for the changes in bold toward the end.
A simple way to get an idea of what a certain text is about
is to count how often certain words appear in it.
Your final task for this assignment is to write a program
WordFreq
(based on your Map implementations) that
performs this kind of analysis.
Your program should accept input text (in plain ASCII format)
from standard input and produce a list of the 32 most frequently
occurring words on the standard output; for each word, the number
of times it appeared should be given as well.
Consider the command java WordFreq <in.txt >out.txt.
If in.txt contains
Bla bla, balla bla. Balla balla bla
bla bla blue balla bla balla bla.
bla!!
-- Bla balla blue bla.
then out.txt should contain
bla 11 balla 6 blue 2
and nothing else. As you can see, you should ignore capitalization, punctuation, and white space. In order to get some "meaningful" data out of this, you also have to ignore very frequent words such as "a" or "to" or "or" or "the" or... You get the idea. This site has various lists of "noise words" but I am not sure how good they are; for now I suggest we use their "27 words" as reproduced here:
the, and, a, to, of, in, i, is, that, it, on, you, this, for, but, with, are, have, be, at, or, as, was, so, if, out, not
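The counting step might be sketched as follows. This is only an illustration under several assumptions: it uses java.util.HashMap as a stand-in for your own Map implementations, treats every character other than a-z as a separator (so "don't" splits into "don" and "t"), breaks ties between equally frequent words alphabetically, and prints one word per line. None of these choices are prescribed by the specification; the names WordFreqSketch, addLine, and top are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; java.util.HashMap stands in for your own maps. */
class WordFreqSketch {
    // The 27 noise words listed above.
    static final Set<String> NOISE = new HashSet<>(Arrays.asList(
        "the", "and", "a", "to", "of", "in", "i", "is", "that", "it",
        "on", "you", "this", "for", "but", "with", "are", "have", "be",
        "at", "or", "as", "was", "so", "if", "out", "not"));

    /** Lower-cases the line, splits on non-letters, counts non-noise words. */
    static void addLine(Map<String, Integer> counts, String line) {
        for (String w : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (w.isEmpty() || NOISE.contains(w)) continue;
            counts.merge(w, 1, Integer::sum);
        }
    }

    /** Returns up to n entries, most frequent first (ties: alphabetical). */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            addLine(counts, line);
        }
        for (Map.Entry<String, Integer> e : top(counts, 32)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Sorting all entries is the simplest way to find the 32 most frequent words; for very large vocabularies a bounded priority queue would avoid sorting words that never make the cut.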
Project Gutenberg is a good source for test data. I suggest Einstein, Kafka, and Marx as simple test cases. Religious texts are more voluminous and thus provide more challenging test cases, for example The Bible or The Koran. Feel free to test on whatever you want; we'll pick our test cases from Project Gutenberg as well.
In addition to this, you should compare the performance of
your two binary search tree implementations and your hash
table implementation from last week.
Measure the running time of your WordFreq program
for the two hash table implementations you have
access to (from last week) and the two binary search
tree implementations from this week.
Use texts of different sizes and different word distributions
and explain the behavior of these various data structures in
your README file.
Compare your experimental measurements to the theoretical
properties these data structures should have as well, and
discuss whether your data is in line with theoretical
predictions or not. If not, try to explain why not.
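As a reminder of the theory: a hash table offers expected constant time per operation, a balanced search tree guarantees O(log n), and a basic binary search tree degrades to O(n) per operation when keys arrive in sorted order. A simple timing harness along the following lines can help with the measurements. It is a sketch only: java.util.HashMap and java.util.TreeMap stand in for your own implementations, and the names TimingSketch and bestNanos are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;
import java.util.function.Supplier;

/** Hypothetical sketch; the java.util maps stand in for your own classes. */
class TimingSketch {
    /** Counts all words in a fresh map per run; returns the best time in ns. */
    static long bestNanos(Supplier<Map<String, Integer>> factory,
                          List<String> words, int runs) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < runs; r++) {
            Map<String, Integer> m = factory.get();
            long start = System.nanoTime();
            for (String w : words) {
                m.merge(w, 1, Integer::sum);
            }
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        // Synthetic input: 200,000 pseudo-random "words" from a fixed seed.
        Random rnd = new Random(42);
        List<String> words = new ArrayList<>();
        for (int i = 0; i < 200_000; i++) {
            words.add("w" + rnd.nextInt(20_000));
        }
        System.out.println("hash: " + bestNanos(HashMap::new, words, 5) + " ns");
        System.out.println("tree: " + bestNanos(TreeMap::new, words, 5) + " ns");
    }
}
```

Taking the best of several runs reduces noise from JIT compilation and garbage collection; also try sorted input, which is the worst case for an unbalanced tree.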
Please turn in a
gzip
compressed
tarball
of your assignment;
the filename should be
cs226-assign-9-login1-login2.tar.gz
with login1 and login2
replaced by your Unix login names on ugradx.cs.jhu.edu.
The tarball should contain no derived files whatsoever
(i.e. no .class files, no .html files, etc.),
but allow building all derived files.
Include a README file that briefly explains what your
programs do and contains any other notes you want us to check out
before grading; don't forget to include your answers to "written"
problems as well.
For reference, here is a short explanation of the grading criteria.
Packaging refers to the proper organization of the
stuff you hand in, following the guidelines for Deliverables above.
Style refers to Java programming style, including
things like consistent indentation, appropriate identifiers,
useful comments, suitable javadoc documentation, etc.
Simple, clean, readable code is what you should be aiming for.
Performance refers to how fast your program can
produce the required results compared to other submissions.
Design refers to proper modularization and the
proper choice of algorithms and data structures.
Functionality refers to your programs being
able to do what they should according to the specification
given above; if the specification is ambiguous and you had
to make a certain choice, defend that choice in your
README file.
If your programs cannot be built you will get no points whatsoever.
If your programs cannot be built without warnings using
javac -Xlint
we will take off 10% (except if you document a very good reason).
If your programs fail miserably even once,
i.e. terminate with an exception of any kind,
we will take off 10%.
No bonus problem this week; hacking good balanced search tree code should give you enough problems already. And if you really think you need more work, you can make up your own bonus problem by now... :-)