Fall Semester 2005: September 8, 2005 - December 12, 2005
Out on:
November 3, 2005
Due by:
November 9, 2005 by 5:59 pm for full credit (11:59 pm for 10% off, hard deadline)
Collaboration:
Pairs
Grading:
Packaging 10%, Style 10%, Performance 30%, Design 10%, Functionality 40%
The ninth assignment for
600.226: Data Structures
once again deals with maps.
We focus on ordered maps implemented as search trees this time.
There are no "written" problems as such, but you can still rant a lot
in your README file.
Note that each pair hands in one assignment only.
Be sure to include the relevant information (who is in the pair!)
in your README file!
Both of you are getting the same score for the assignment.
Here are the necessary interfaces and exception classes: omaps.tar.gz. As usual, you are not allowed to change the interfaces in any way! If you think something is wrong, email the discussion list and check there.
Your first task is to write a simple
OrderedMap implementation using a
basic binary search tree for reference.
Your SimpleOrderedMap implementation should be
linked, i.e. your representation should
consist of Node objects with left, right,
and parent references (and the Node class
should be nested inside the SimpleOrderedMap
class of course).
Try to make your code as simple as possible: it's more important
to have a correct implementation for reference
than to have the fastest possible implementation; performance is
the concern of the second problem below.
As usual, please provide a toString() method
to return a String representation of the map;
as we are dealing with ordered maps, the string representation
should be sorted by key.
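To make the linked representation and the sorted toString() concrete, here is a minimal sketch. It does not implement the real OrderedMap interface from omaps.tar.gz: the class name SimpleBstSketch and the put/get/size signatures are assumptions made for illustration, and the real interface may well signal a missing key with one of the provided exception classes instead of returning null.

```java
// Hypothetical sketch; names and signatures are NOT taken from omaps.tar.gz.
class SimpleBstSketch<K extends Comparable<K>, V> {
    // Nested node class with left, right, and parent references.
    private static class Node<K, V> {
        K key;
        V value;
        Node<K, V> left, right, parent;

        Node(K key, V value, Node<K, V> parent) {
            this.key = key;
            this.value = value;
            this.parent = parent;
        }
    }

    private Node<K, V> root;
    private int size;

    // Inserts a new mapping or overwrites the value of an existing key.
    void put(K key, V value) {
        if (root == null) {
            root = new Node<>(key, value, null);
            size++;
            return;
        }
        Node<K, V> cur = root;
        while (true) {
            int c = key.compareTo(cur.key);
            if (c == 0) {
                cur.value = value;
                return;
            }
            Node<K, V> next = (c < 0) ? cur.left : cur.right;
            if (next == null) {
                Node<K, V> n = new Node<>(key, value, cur);
                if (c < 0) cur.left = n; else cur.right = n;
                size++;
                return;
            }
            cur = next;
        }
    }

    // Returns the value stored for key, or null if the key is absent
    // (an assumption; the real interface may throw instead).
    V get(K key) {
        Node<K, V> cur = root;
        while (cur != null) {
            int c = key.compareTo(cur.key);
            if (c == 0) return cur.value;
            cur = (c < 0) ? cur.left : cur.right;
        }
        return null;
    }

    int size() { return size; }

    // An in-order traversal visits the keys in ascending order.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("{");
        inOrder(root, sb);
        return sb.append("}").toString();
    }

    private void inOrder(Node<K, V> n, StringBuilder sb) {
        if (n == null) return;
        inOrder(n.left, sb);
        if (sb.length() > 1) sb.append(", ");
        sb.append(n.key).append("=").append(n.value);
        inOrder(n.right, sb);
    }
}
```

Note that the recursive traversal goes as deep as the tree is high, so on a degenerate (list-like) tree the recursion depth equals the number of keys.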
Also, you should either provide a suitable main()
method for testing, or you should adapt the code you wrote
for comparing Map
implementations last week (see Problem 2 below).
Your second task is to write an efficient
implementation of the OrderedMap interface
using some kind of balanced search tree.
The exact
data structure for your BalancedOrderedMap
implementation is up to you.
Here are your options:
You can choose among these data structures however you want, but of course your goal should be the most efficient map data structure possible. As in the last assignment, performance carries more weight than it did in earlier assignments (check the rubric). Of course your implementation should be well-encapsulated (using nested classes, etc.), but you're used to that by now, right?
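If the structure you pick rebalances by rotation (AVL trees and red-black trees do, for example), the fiddly part is keeping the parent references consistent. The following is only a sketch under that assumption: RotNode, Rotations, and rotateLeft are hypothetical names, the keys are plain ints to keep it short, and nothing here decides when to rotate.

```java
// Hypothetical sketch of a single left rotation with parent references.
class RotNode {
    int key;
    RotNode left, right, parent;

    RotNode(int key) { this.key = key; }

    // Attach a child and fix its parent reference in one step.
    void setLeft(RotNode c)  { left = c;  if (c != null) c.parent = this; }
    void setRight(RotNode c) { right = c; if (c != null) c.parent = this; }
}

class Rotations {
    // Rotates left around x and returns the new root of that subtree.
    // Requires x.right != null. If x was the root of the whole tree,
    // the caller has to store the returned node as the new root.
    static RotNode rotateLeft(RotNode x) {
        RotNode y = x.right;
        RotNode p = x.parent;
        x.setRight(y.left);   // y's left subtree moves under x
        y.setLeft(x);         // x becomes y's left child
        y.parent = p;         // y takes x's old place under p
        if (p != null) {
            if (p.left == x) p.left = y; else p.right = y;
        }
        return y;
    }

    // In-order key listing, handy for checking that a rotation
    // did not change the sorted order.
    static void inOrder(RotNode n, StringBuilder sb) {
        if (n == null) return;
        inOrder(n.left, sb);
        sb.append(n.key).append(' ');
        inOrder(n.right, sb);
    }
}
```

The right rotation is the mirror image (swap left and right everywhere).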
Please provide a toString() method
to return a String representation of the map.
In fact, given that balanced search trees are pretty
complicated beasts, you may want to add another operation
to generate DOT code; this will allow you to "visually
debug" your data structure fairly easily.
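A DOT dump can be as small as the sketch below. The names DotSketch, Node, and toDot are hypothetical, and the node class is a stand-in for whatever nested class your tree really uses. Keys are printed verbatim inside double quotes, so this assumes they contain no double quotes or backslashes themselves.

```java
// Hypothetical sketch: emits Graphviz DOT code for a binary tree.
class DotSketch {
    static class Node {
        final String key;
        final Node left, right;

        Node(String key, Node left, Node right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }
    }

    // Returns a complete "digraph" description of the tree.
    static String toDot(Node root) {
        StringBuilder sb = new StringBuilder("digraph tree {\n");
        emit(root, sb);
        return sb.append("}\n").toString();
    }

    // Pre-order walk: declare the node, then its edges, then recurse.
    private static void emit(Node n, StringBuilder sb) {
        if (n == null) return;
        sb.append("  \"").append(n.key).append("\";\n");
        edge(n, n.left, "L", sb);
        edge(n, n.right, "R", sb);
        emit(n.left, sb);
        emit(n.right, sb);
    }

    // Edges are labeled L or R so a single child is still unambiguous.
    private static void edge(Node from, Node to, String label, StringBuilder sb) {
        if (to == null) return;
        sb.append("  \"").append(from.key).append("\" -> \"").append(to.key)
          .append("\" [label=\"").append(label).append("\"];\n");
    }
}
```

Write the string to a file such as tree.dot and render it with Graphviz, e.g. dot -Tpng tree.dot -o tree.png.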
Also, you should either provide a suitable main()
method for testing, or you should adapt the code you wrote
for comparing Map implementations last week.
If you choose to do the latter, you can write one program
TestOrderedMaps instead of two main()
methods for Problems 1 and 2.
If you plan to do some performance comparisons I recommend
the TestOrderedMaps approach; and if you happen to write
multiple versions of the balanced tree code, I also recommend
that you keep "old" versions around for reference; your
"best" version should be in BalancedOrderedMap, the other
ones could be named whatever you want.
Please describe the test method you chose, the alternate
implementations you provide (if any), and the results of
your performance comparison (if any) in your
README file.
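For the performance comparison, a rough timing loop along the following lines may be enough. This is a sketch with assumptions: it is written against java.util.Map with java.util.TreeMap as a stand-in, because the OrderedMap interface is not reproduced here, and TimingSketch, shuffledKeys, and timeInsertAndLookup are made-up names. In TestOrderedMaps you would pass your own implementations instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch of a simple timing harness.
class TimingSketch {
    // Returns the keys 0..n-1 in a fixed pseudo-random order, so every
    // implementation is measured on exactly the same input.
    static List<Integer> shuffledKeys(int n, long seed) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < n; i++) keys.add(i);
        Collections.shuffle(keys, new Random(seed));
        return keys;
    }

    // Inserts all keys, then looks each one up again.
    // Returns the elapsed time in nanoseconds.
    static long timeInsertAndLookup(Map<Integer, Integer> map, List<Integer> keys) {
        long start = System.nanoTime();
        for (int k : keys) map.put(k, k);
        for (int k : keys) {
            if (map.get(k) == null) {
                throw new IllegalStateException("lost key " + k);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        List<Integer> keys = shuffledKeys(100_000, 42L);
        long ns = timeInsertAndLookup(new TreeMap<Integer, Integer>(), keys);
        System.out.println("TreeMap: " + ns / 1_000_000 + " ms");
    }
}
```

A single run like this is noisy (JIT warm-up, garbage collection), so repeat each measurement a few times and also try sorted keys, which is the worst case for an unbalanced tree.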
This problem is identical to last week's Problem 3; it's
reproduced here for reference.
A simple way to get an idea of what a certain text is about
is to count how often certain words appear in it.
Your final task for this assignment is to write a program
WordFreq
(based on your Map implementations) that
performs this kind of analysis.
Your program should accept input text (in plain ASCII format)
from standard input and produce a list of the 32 most frequently
occurring words on the standard output; for each word, the number
of times it appeared should be given as well.
Consider the command java WordFreq <in.txt >out.txt.
If in.txt contains
Bla bla, balla bla. Balla balla bla
bla bla blue balla bla balla bla.
bla!!
-- Bla balla blue bla.
then out.txt should contain
bla 11
balla 6
blue 2
and nothing else. As you can see, you should ignore capitalization, punctuation, and white space. In order to get some "meaningful" data out of this, you also have to ignore very frequent words such as "a" or "to" or "or" or "the" or... You get the idea. This site has various lists of "noise words" but I am not sure how good they are; for now I suggest we use their "27 words" as reproduced here:
the, and, a, to, of, in, i, is, that, it, on, you, this, for, but, with, are, have, be, at, or, as, was, so, if, out, not
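The counting step itself can be sketched as follows. Several things here are assumptions and not part of the specification: java.util.HashMap stands in for your own Map implementation, every character other than a-z is treated as a separator (so "don't" becomes "don" and "t"), and ties in frequency are broken alphabetically. The names WordFreqSketch, count, and top are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the word-frequency analysis.
class WordFreqSketch {
    // The 27 noise words listed above.
    static final Set<String> NOISE = new HashSet<>(Arrays.asList(
        "the", "and", "a", "to", "of", "in", "i", "is", "that", "it", "on",
        "you", "this", "for", "but", "with", "are", "have", "be", "at", "or",
        "as", "was", "so", "if", "out", "not"));

    // Lowercases the text, splits on anything that is not a-z,
    // and counts every word that is not a noise word.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (w.isEmpty() || NOISE.contains(w)) continue;
            Integer old = counts.get(w);
            counts.put(w, old == null ? 1 : old + 1);
        }
        return counts;
    }

    // Returns up to limit lines of the form "word count", most frequent
    // first; equal counts are ordered alphabetically (an assumption).
    static List<String> top(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int c = b.getValue().compareTo(a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < limit; i++) {
            lines.add(entries.get(i).getKey() + " " + entries.get(i).getValue());
        }
        return lines;
    }
}
```

A real WordFreq would read standard input, call count on the text, and print top(counts, 32) one entry per line.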
Project Gutenberg is a good source for test data. I suggest Einstein, Kafka, and Marx as simple test cases. Religious texts are more voluminous and thus provide more challenging test cases, for example The Bible or The Koran. Feel free to test on whatever you want, but we'll pick our test cases from Project Gutenberg as well.
Please turn in a
gzip
compressed
tarball
of your assignment (the extension should be .tar.gz).
The tarball should uncompress into a directory
cs226-assignment-9-login1-login2
with login1 and login2
replaced by your Unix login names;
uncompressing should not create any other files
in the current directory.
The tarball should contain no derived files whatsoever
(i.e. no .class files, no .html files, etc.),
but allow building all derived files.
Include a README file that briefly explains what your
programs do and contains any other notes you want us to check out
before grading (and of course your answers to "written" problems).
For reference, here is a short explanation of the grading criteria.
Packaging refers to the proper organization of the
stuff you hand in, following the guidelines for Deliverables above.
Style refers to Java programming style, including
things like consistent indentation, appropriate identifiers,
useful comments, suitable javadoc documentation, etc.
Simple, clean, readable code is what you should be aiming for.
Performance refers to how fast your program can
produce the required results compared to other submissions.
Design refers to proper modularization and the
proper choice of algorithms and data structures.
Functionality refers to your programs being
able to do what they should according to the specification
given above; if the specification is ambiguous and you had
to make a certain choice, defend that choice in your
README file.
If your programs cannot be built you will get no points whatsoever.
If your programs cannot be built without warnings using
javac -Xlint
we will take off 10% (except if you document a very good reason).
If your programs fail miserably even once,
i.e. terminate with an exception of any kind,
we will take off 10%.
No bonus problem this week; hacking good balanced search tree code should give you enough problems already. :-)