Assignment 7: Hashing Out JHUgle

Out on: Tue, 7/28
Due by: Sat 7/27 by 11:59pm
Collaboration: Pairs!
Grading: [100 points total]
- Hashing functionality [30 pts]
- JHUgle functionality [25 pts]
- Testing [5 pts]
- Performance [20 pts]
- Discussion [10 pts]
- Style [10 pts]

Overview

Important - Special Late Day Policy: Because there is a gradescope outage scheduled for Sunday night, using a late day will result in a 48 hour extension instead of a 24 hour extension. This is because we want to make sure you have access to the autograder throughout your extra day (in this case, days).

This assignment is a little more open-ended than what you’re used to for this course. We are not breaking things down into separate problems below; the entire assignment is the problem now. This also means that the “rubric” for the assignment is the grading breakdown above.

Your “one” task for this assignment is to write a simplified search engine that must be called JHUgle.java, subject to the specifications and constraints detailed below. Deciding what data structures and implementations to use, and implementing them as efficiently as possible, will be critical in earning maximum performance points. You are expected to do this by (once again) implementing the Map interface, this time using various hash table techniques (your choice, see below).

Important - Performance: The 20 points for Performance will be awarded by evaluating your submissions according to both time and space required by your search engine. Less time is better, less space is better, and both are considered separately for 10 points each. Performance points are only awarded for programs that produce correct results! You will not be competing against each other, but rather evaluated against our own baseline implementations.

Package Setup

Similar to the last assignment, we are not giving you all the files you need to edit for this assignment. In fact, we’re not giving you any code. You should copy the Map.java interface from your assignment 6, changing the package to hw7. Similarly, you’ll want to start with a combination of the tests from both partners’ MapTest.java files from hw6, again changing the package declaration to hw7.tests. The only things we’ve provided in the zip for this assignment are a bunch of data files for you to use on the JHUgle program, and a Python script for you to create more if you’d like.

Below is the expected package hierarchy and files as the autograder will use them. Files you will be creating and editing are marked with an asterisk (*).

hw7/
    HashMap.java *
    JHUgle.java *
    Map.java   (from hw6, change package to hw7)
    tests/
        HashMapTest.java *
        MapTest.java *  (yours from hw6, change package to hw7.tests)
---
hw7-student.zip contains datasets for JHUgle input
    joanne.txt  (tiny)
    jhu.txt  (tiny)
    apache.txt  (large-ish)
    newegg.txt  (large-ish)
    random164.txt  (big!)

Partners

You are expected to work in pairs for this assignment (expected means required unless there is a compelling reason for solo work pre-approved by Joanne). Both partners will receive the same grade. When submitting on Gradescope you must make a group and include your partner. The most recent submission by either of you will be the one that gets graded. Make sure that both names are in the README as well.

Also, late days must be available to both students if the pair submits late. We suggest finding a partner in class or on Piazza, but make sure that you both have comparable late days left in case they are needed. Otherwise the assignment must be submitted within the late allowance of whichever partner has fewer late days remaining.

Pair programming is an excellent technique for working with a partner: you write the code together, taking turns acting as driver (typing) and navigator (reading, watching, guiding). This is way more effective than trying to split up the work and coding separately. Also, you will want to spend some time discussing the project and coming up with a game plan before you begin. Your discussion, analysis and approach decisions must be documented in the README file you submit. You must also document how the code was developed and who contributed to which parts.

JHUgle Search Engine

The “good news” is that you get to follow in some very famous footsteps and write an entire search engine! (Well, at least a basic one…) This will require applying the knowledge and techniques you’ve been learning all semester!

Your JHUgle.java program must have a single command-line argument which is the name of a file containing a list of websites and the keywords that are contained on each site. You’ll want to use this file to construct an index that maps each keyword to a collection of URLs in which it appears. Since reading in this file and constructing the index will take significant time, your program should output “Index Created” followed by a newline once that operation completes. (Hint: you might want to time this operation.)

The file will be structured to have the URL for each site on one line, followed by a single line containing words found on that site, separated by whitespace. Note, this could be more than one space character, or a tab! There are plenty of methods in Java’s Scanner that will help you parse these lines. There will be multiple URLs in the file, alternating on each line between the URL and the word list. These files could be many many many lines long, but a very short example named urls.txt might look as follows:

http://www.cars.com/
car red truck wheel blue fast brown silver gray
https://en.wikipedia.org/wiki/Cat
cat feline small fur red brown fast gray
http://www.foobar.com/baz
foo bar baz lorem ipsum red blue yellow silver
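
The file layout above can be read with a loop like the following. This is only a minimal sketch of the parsing step, not a required design: the class names PageEntry and IndexFileSketch and the method readEntries are made up for illustration, and skipping blank URL lines is an assumption. It deliberately stops short of building the keyword index, since that is where your own Map implementation comes in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Hypothetical holder: one URL paired with the keywords listed for it. */
final class PageEntry {
    final String url;
    final List<String> keywords;

    PageEntry(String url, List<String> keywords) {
        this.url = url;
        this.keywords = keywords;
    }
}

final class IndexFileSketch {
    /** Reads alternating URL / keyword lines from any Scanner source. */
    static List<PageEntry> readEntries(Scanner in) {
        List<PageEntry> entries = new ArrayList<>();
        while (in.hasNextLine()) {
            String url = in.nextLine().trim();
            if (url.isEmpty()) {
                continue; // assumption: tolerate stray blank lines
            }
            List<String> words = new ArrayList<>();
            if (in.hasNextLine()) {
                // Splitting on \s+ treats runs of spaces and tabs alike.
                for (String w : in.nextLine().trim().split("\\s+")) {
                    if (!w.isEmpty()) {
                        words.add(w);
                    }
                }
            }
            entries.add(new PageEntry(url, words));
        }
        return entries;
    }

    public static void main(String[] args) {
        String sample = "http://www.cars.com/\ncar red  truck\twheel\n"
            + "https://en.wikipedia.org/wiki/Cat\ncat feline small\n";
        for (PageEntry e : readEntries(new Scanner(sample))) {
            System.out.println(e.url + " -> " + e.keywords);
        }
    }
}
```

In the real program the Scanner would wrap the file named on the command line, and each (keyword, URL) pair would be added to your index as it is read.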

The end-goal of this program is to provide a way for the user to ask for the set of URLs that contain some set of words. The sets of words will be specified as a logical expression. In the example above, if I were to ask for the pages containing “red AND fast”, JHUgle should return the first two URLs, but not the third one, as it does not contain “fast” even though it does contain “red”. Similarly, if I were to ask for “car OR baz”, the second URL would be omitted, as it contains neither of these words. Note that queries may contain many operands (words) and operators (AND, OR).

To make the input simpler, we will be specifying these queries in postfix notation similar to Assignment 4. We will also use symbols for the operations rather than words so that the search engine can differentiate words in queries from the operations. The queries above would be given as red fast && and car baz || respectively. The main JHUgle program will read queries as one word or logical operation at a time from standard input, in response to a > prompt. However, remember that the input is incrementally providing one long query. Therefore, each time you process a binary operation, not only should the result be printed, but the result should replace the operands that produced it for future operations.

There are two more operations your JHUgle implementation will need to handle. The first is ?, which requires your program to print the URLs corresponding to the expression at the top of the query stack, one per line with no additional formatting. Your program will loop until the quit operation is given, an !. When the user quits, simply exit the program without producing any additional output.
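
One way to picture the query stack is sketched below. It is an illustration under several assumptions, not the expected solution: the class name QueryStackSketch is invented, the keyword index is stood in for by a plain lookup function, result sets are simple lists with quadratic-time intersection and union, and output is produced only for ?, which is how the sample run in this handout behaves. Invalid operations (such as && with fewer than two operands) are silently ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

final class QueryStackSketch {
    private final Deque<List<String>> stack = new ArrayDeque<>();
    private final Function<String, List<String>> lookup; // stand-in for the index

    QueryStackSketch(Function<String, List<String>> lookup) {
        this.lookup = lookup;
    }

    /** Handles one token; returns the lines to print (non-empty only for "?"). */
    List<String> process(String token) {
        if (token.equals("?")) {
            if (stack.isEmpty()) {
                return Collections.<String>emptyList();
            }
            return stack.peek();
        }
        if (token.equals("&&") || token.equals("||")) {
            if (stack.size() >= 2) { // otherwise invalid: ignored
                List<String> right = stack.pop();
                List<String> left = stack.pop();
                // The result replaces the two operands on the stack.
                stack.push(token.equals("&&")
                    ? intersect(left, right) : union(left, right));
            }
            return Collections.<String>emptyList();
        }
        stack.push(lookup.apply(token)); // any other token is a keyword
        return Collections.<String>emptyList();
    }

    /** URLs present in both lists, in the order they appear in a. */
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        for (String s : a) {
            if (b.contains(s)) {
                out.add(s);
            }
        }
        return out;
    }

    /** URLs present in either list, without duplicates. */
    static List<String> union(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        for (String s : b) {
            if (!out.contains(s)) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hard-coded lookup standing in for a real keyword index.
        QueryStackSketch q = new QueryStackSketch(word -> {
            if (word.equals("baz")) {
                return Arrays.asList("http://www.foobar.com/baz");
            }
            if (word.equals("red")) {
                return Arrays.asList("http://www.cars.com/",
                    "https://en.wikipedia.org/wiki/Cat", "http://www.foobar.com/baz");
            }
            return Collections.<String>emptyList();
        });
        for (String token : new String[] {"?", "baz", "red", "?", "&&", "?", "!"}) {
            if (token.equals("!")) {
                break; // quit without further output
            }
            for (String url : q.process(token)) {
                System.out.println(url);
            }
        }
    }
}
```

The list-based intersect and union are the obvious place to look for performance gains once the basic behavior is correct.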

Here is a simple sample run that demonstrates all aspects of the program operation, with results based on the short sample input file above. In this snippet, $ is the command line prompt, but > is the prompt your program must print for the user.

$ java JHUgle urls.txt
Index Created
> ?
> baz
> red
> ?
http://www.cars.com/
https://en.wikipedia.org/wiki/Cat
http://www.foobar.com/baz
> &&
> ?
http://www.foobar.com/baz
> !
$

Similar to assignment 4, if the user gives an invalid command (such as && for the first command), you may print an error to standard error if you want, but otherwise must simply ignore invalid requests. Your good output must not contain any error messages.

One way to see how quickly your JHUgle program works is using the time Unix command. To do this you’ll want to make a file that contains words and commands in response to the prompts that can be used to run the program with standard input redirection. For example, if you put the above input sequence (? baz red ? && ? !) in a plain text file called ops.txt and have a dataset called urls.txt, you can time and run the program using:

$ time java JHUgle urls.txt <ops.txt >output.txt

The results will go to output.txt instead of the screen in this case; however the prompts will still appear on the screen. This is all to be expected. We have provided small and large URL input files, but will also test and measure your solutions with other input files as well. You might want to create a first iteration of this program using one of the Map implementations that we previously provided, or that you have from former assignments. This would also provide good baseline performance data.
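
Besides the Unix time command, you can take rough measurements from inside Java. The sketch below is one possible approach, with invented names (BenchmarkSketch, timeMillis, usedHeapBytes); the heap figure depends on when the garbage collector last ran, so treat it as an estimate and compare runs on the same input.

```java
final class BenchmarkSketch {
    /** Runs the task once and returns elapsed wall-clock time in milliseconds. */
    static double timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    /** Approximate heap currently in use, in bytes. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        double ms = timeMillis(() -> {
            // Placeholder workload; replace with index construction.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        long after = usedHeapBytes();
        // Report on standard error so regular program output stays clean.
        System.err.println("elapsed ms: " + ms);
        System.err.println("heap delta bytes (approx.): " + (after - before));
    }
}
```

Writing measurements to standard error keeps them out of the output that is compared against expected results.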

The Hash Table

You are expected to achieve excellent performance in JHUgle.java by developing and comparing various hashing techniques. You must name your best version that is used in the JHUgle program HashMap.java. Obviously your HashMap.java must implement the Map interface, but beyond that you have quite a few options for how to proceed:

- Your hash table can use any of the following techniques (but no others!): separate chaining or open addressing.
- For open addressing you can resolve collisions by linear probing, quadratic probing, double hashing, or cuckoo hashing. (Make sure you understand how the various strategies relate to the size of the bucket array.)
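
To see why the size of the bucket array matters for open addressing, the sketch below counts how many distinct slots the first m probes reach for three of the strategies listed above. The class and method names are illustrative only, and the probe formulas are the common textbook forms, which may differ in detail from what was shown in lecture. The general point: quadratic probing reaches only about half the slots even with a prime table size (and fewer with a power of two), and double hashing reaches every slot only when the step shares no common factor with the table size.

```java
import java.util.function.IntUnaryOperator;

final class ProbeSequenceSketch {
    /** Linear probing: slot = (h + i) mod m. */
    static int linear(int h, int i, int m) {
        return Math.floorMod(h + i, m);
    }

    /** Quadratic probing: slot = (h + i*i) mod m. */
    static int quadratic(int h, int i, int m) {
        return Math.floorMod(h + i * i, m);
    }

    /** Double hashing: slot = (h + i*step) mod m, step derived from a second hash. */
    static int doubleHash(int h, int h2, int i, int m) {
        int step = 1 + Math.floorMod(h2, m - 1); // step is never zero
        return Math.floorMod(h + i * step, m);
    }

    /** Counts the distinct slots reached by probes 0 .. m-1. */
    static int distinctSlots(IntUnaryOperator probe, int m) {
        boolean[] seen = new boolean[m];
        int count = 0;
        for (int i = 0; i < m; i++) {
            int slot = probe.applyAsInt(i);
            if (!seen[slot]) {
                seen[slot] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("linear, m=8:      " + distinctSlots(i -> linear(3, i, 8), 8));
        System.out.println("quadratic, m=8:   " + distinctSlots(i -> quadratic(0, i, 8), 8));
        System.out.println("quadratic, m=7:   " + distinctSlots(i -> quadratic(0, i, 7), 7));
        System.out.println("double hash, m=8: " + distinctSlots(i -> doubleHash(0, 1, i, 8), 8));
        System.out.println("double hash, m=7: " + distinctSlots(i -> doubleHash(0, 1, i, 7), 7));
    }
}
```

A probe sequence that cannot reach every slot can fail to find a free slot even when the table is not full, which is why resizing policy and table size need to be chosen together with the probing strategy.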

Make sure that you include extensive comments at the start of your HashMap implementation to clarify what type of collision resolution strategy it implements.

You must have a concrete JUnit testing HashMapTest class. You should be able to reuse the MapTest test cases from the previous assignment; however, you had better make sure that they are complete and cover all operations (including the iterator() and toString() methods) and exception conditions.

And that’s it. Yes, you’re really on your own for figuring out what kind of hash tables you should implement, and which one to use in the end and how to effectively do so in order to implement the search engine.

All critical map operations, except insert, must run in O(1) expected time (or better); insert can run in O(1) amortized time in case you have to grow the bucket array table to keep the load factor down.
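
The amortized bound for insert comes from doubling the bucket array whenever the load factor would exceed a threshold, so that the occasional O(n) rehash is paid for by the many cheap inserts before it. Below is a minimal separate-chaining sketch of that idea. It is not an implementation of the course Map interface: the class name, the method names put, get, size and capacity, the initial capacity of 8, and the 0.75 threshold are all assumptions made for illustration, and null keys are not handled.

```java
final class ChainedTableSketch<K, V> {
    private static final double MAX_LOAD = 0.75; // assumed threshold

    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private Node<K, V>[] buckets = newArray(8);
    private int size;

    private static <K, V> Node<K, V>[] newArray(int capacity) {
        // Cast to a generic array type: produces one unchecked warning.
        return (Node<K, V>[]) new Node[capacity];
    }

    private int indexFor(K key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity); // never negative
    }

    private Node<K, V> find(K key) {
        Node<K, V> n = buckets[indexFor(key, buckets.length)];
        while (n != null && !n.key.equals(key)) {
            n = n.next;
        }
        return n;
    }

    /** Returns the value for key, or null if the key is absent. */
    V get(K key) {
        Node<K, V> n = find(key);
        return n == null ? null : n.value;
    }

    /** Inserts a new key or overwrites the value of an existing one. */
    void put(K key, V value) {
        Node<K, V> existing = find(key);
        if (existing != null) {
            existing.value = value;
            return;
        }
        if (size + 1 > MAX_LOAD * buckets.length) {
            grow(); // occasional O(n) step, amortized over prior inserts
        }
        int i = indexFor(key, buckets.length);
        buckets[i] = new Node<>(key, value, buckets[i]);
        size++;
    }

    /** Doubles the bucket array and relinks every node into its new chain. */
    private void grow() {
        Node<K, V>[] old = buckets;
        buckets = newArray(old.length * 2);
        for (Node<K, V> head : old) {
            while (head != null) {
                Node<K, V> next = head.next;
                int i = indexFor(head.key, buckets.length);
                head.next = buckets[i];
                buckets[i] = head;
                head = next;
            }
        }
    }

    int size() {
        return size;
    }

    int capacity() {
        return buckets.length;
    }

    public static void main(String[] args) {
        ChainedTableSketch<String, Integer> t = new ChainedTableSketch<>();
        for (int i = 0; i < 100; i++) {
            t.put("key" + i, i);
        }
        System.out.println(t.size() + " entries in " + t.capacity() + " buckets");
    }
}
```

Your real HashMap must follow the method names and exception behavior of the Map interface from hw6 instead of the ones used here.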

Got Extra Hash Tables?

Depending on just how serious you are about those Performance points, you may well end up writing multiple different hash tables over the course of this assignment. However, you have to pick one of those to use in JHUgle.java, named HashMap.java. You are welcome to submit other implementations as well, each named accordingly to indicate what type of hash table technique it uses. Include all the benchmarking data, results and analysis that contributed to your final decision on which implementation to use for the search engine in your README file.

Discussion

You should use your README file to explain how you approached this assignment, what you did in what order, how it worked out, how your plans changed, etc. Try to summarize all the different ways you developed, evaluated, and improved your JHUgle application and various HashMaps over time. If you don’t have a story to tell here, you probably didn’t do enough…

Deliverables

Go to the Assignment 7 page for Gradescope and click submit. Note that you can resubmit any time up until the deadline. You will be prompted to upload your files at which point you will upload all of the necessary source files. You must build your solution with our Map.java interface as provided for assignment 7. We will use our own version, so no changes are permitted and you don’t need to submit it. Your submission must include the following files you are developing:

  README
  HashMap.java
  HashMapTest.java
  JHUgle.java
  MapTest.java

You need to submit all of these files to the autograder along with a README. You can upload them individually or in a zip file. If you upload them in a zip file, make sure they are all at the top level; you cannot have any extra directories or else the autograder won’t be able to find them. This even applies to the test file - do not submit it in a tests subdirectory!

Make sure the code you hand in does not produce any extraneous debugging output. If you have commented out lines of code that no longer serve any purpose you should remove them.

Iterative Development

You cannot know how fast your HashMap is until it’s actually written. You cannot improve your HashMap until you can tell how fast it is. So the worst mistake you can make is to “think about it” for days without writing any code. (Thinking ahead is good in principle, thinking ahead for too long is the problem here.)

We recommend you start right now by writing the simplest HashMap you can think of and making that work. For example you could write one based on separate chaining but with a fixed array size.

You want your test cases and benchmarks in place before you keep going. Make sure that your test cases are complete and that your benchmarks tell you how well the various Map operations work for that first version of HashMap. You should probably save a backup (or even submit early!) as soon as you get done with the first round.

From then on, it’s “try to improve things” followed by “see if the tests still pass” followed by “benchmark to see if things actually got better” followed by either “Whoops, that was a bad idea, let’s undo that” or “Yay, I made progress, let’s save a backup of the new version” and so on and so forth. We predict that there will be a correlation between how well you do and how often you “went around” this iterative development cycle.

What Classes Are Allowed?

You may not use java.util.HashMap or java.util.LinkedHashMap to implement your HashMaps and you may not use those classes or java.util.HashSet in your JHUgle solution either. You also may not use other map implementations (yours or Java’s) as part of your own HashMap classes. However, in order to write the JHUgle.java application you may (and are expected to) reuse other interfaces and classes that have been provided, that you developed for other assignments, or from the Java library.

If in doubt about what is permitted, better to ask on Piazza first! You don’t want to find out minutes before the deadline that you used something that’s not okay…

Random Hints

- You have at least two more Map implementations from the last assignment. Use those to set some goals for how fast your new hash tables should be. (You should certainly be able to beat AvlTreeMap; beating TreapMap is harder, but not impossible.)
- We’ll try to give some guidelines on performance expectations for particular JHUgle inputs, but no promises.
- All the benchmarking techniques you’ve tried for prior assignments should be tried on this one.
- You should probably keep track of how the performance changed over various versions of your implementations. If you don’t do that, you may end up with changes that made things worse but you never noticed since you didn’t have the data to check.
- Hash tables are implemented with arrays. You may be tempted to use the ArrayList class from the Java API, but will likely find it easier (and faster?) in the long run to work directly with standard arrays, resizing manually when necessary. It is okay to have a compiler warning about casting to a generic array type as a result; no other warnings are allowed however.

README

You must hand in the source code and a README file. The README file can be plain text (README with no extension), or markdown (README.md). In your README be sure to answer the discussion questions posed in this description. You should discuss your solution as a whole and let the staff know anything important. If you are going to be using late days on an assignment, we ask that you note it in your README.

If you want to learn markdown formatting, here is a good starting point.

Submitting to Gradescope

Once you are ready to submit your files, go to the assignment 7 page for Gradescope and click submit. Note that you can resubmit any time up until the deadline. Only your most recent submission will be graded. Please refer to course policies as far as policies regarding late days and penalties.

After you submit, the autograder will run and you will get feedback on your functionality and how you performed on our test cases. Some test cases are “hidden” from you so you won’t actually know your final score on the test cases until after grades are released. We also include your checkstyle score as a test case.

If you see the “Autograder Failed to Execute” message, then either your submission did not compile at all or there was a packaging error. Please see the Gradescope Submission Notes in Piazza Resources for help debugging why your submission is not working.

You do not need to fully implement each file before you submit, but you’ll probably fail the test cases for the parts of the assignment you haven’t done yet. Also note that only the files with // TODO items in them will be used. You cannot modify any of the provided interface files as the autograder will overwrite any changes you made with the original provided file.

Grading

For reference, here is a short explanation of the grading criteria; some of the criteria don’t apply to all problems, and not all of the criteria are used on all assignments.

Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments.

Style refers to Java programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable javadoc documentation, etc. Many aspects of this are enforced automatically by Checkstyle when run with the provided configuration file.

You will lose 1 point for each type of checkstyle violation; excessive instances of the same type of violation might be docked extra.
Style also includes proper modularization of your code (into interfaces, classes, methods, using public, protected, and private appropriately, etc.). Simple, clean, readable code is what you should be aiming for.

Testing refers to proper unit tests for all of the data structure classes you developed for this assignment, using the JUnit 4 framework as introduced in lecture. Make sure you test all parts of the implementation that you can think of and all exception conditions that are relevant.

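Performance refers to how fast and with how little memory your program can produce the required results. For this assignment, that is measured against our own baseline implementations rather than against other submissions (see the Performance note in the Overview).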

Functionality refers to your programs being able to do what they should according to the specification given above; if the specification is ambiguous and you had to make a certain choice, defend that choice in your README file.

If your submission does not compile, you will not receive any of the autograded points for that assignment. It is always better to submit code that at least compiles. You will get freebie points just for compiling.

If your programs have unnecessary warnings when using javac -Xlint:all you will be penalized 10% functionality per failed part. (You are also unable to use the @SuppressWarnings annotation - we use it just to filter our accepted warnings from yours.)

If your programs fail because of an unexpected exception, you will be penalized 10% functionality per failed part. (You are not allowed to just wrap your whole program into a universal try-catch.)