Project 3: Six Degrees of Awesome

Projects are designed to test your mastery of course material as well as your programming skills; think of them as “take-home exams” and don’t communicate with anyone about possible solutions. This project focuses on the implementation, testing, and application of graphs.

Note that there has been a change to the submission process, specifically to the naming convention for your tarball. Read all about it below!

Problem 1: Graph Implementation (70%)

Your first task for this project is to develop an “efficient” implementation of the Graph<V, E> interface for directed graphs. You’ll follow the basic design for a linked representation in the form of incidence lists that we outlined in lecture. Your class implementing the Graph<V, E> interface must be called SparseGraph<V, E>. Note that you must not change the provided Graph interface in any way or you risk losing all points for this problem.

We put “efficient” in quotes above because while the ideal is that all graph operations take O(1) time, the reality is that at least some operations will take more. For example, when inserting an edge you need to check for duplicate edges which will take O(d) where d is the maximum indegree / outdegree of the vertices in question. Similarly, when removing an edge, you need to find the edge you’re removing in the incoming / outgoing edge lists of the vertices in question, again something that will take O(d) time. Finally, depending on what you use internally to represent the various lists, you may have to make copies as part of operations like vertices or outgoing as discussed in lecture (and in a big comment in Graph.java as well), and those will not be O(1) operations either. We will eventually see an implementation of Graph<V, E> that gets O(1) expected time for at least the first two, but that’s a few lectures down the road.

As part of your SparseGraph<V, E> implementation, you will need two private nested or inner classes, one to represent vertices, the other to represent edges. Sadly some things are easier with nested classes, other things are easier with inner classes. Making this decision can be challenging, so carefully think through the implications of each choice before you go too far down a particular implementation path. (Under no circumstances should you make one a nested class and the other an inner class, that would just be horrible.)

Be prepared for some growing pains as you navigate the weird world of nested/inner classes with several type parameters! In particular the convert methods are bound to drive you insane as, for example, casting from a Vertex<V> interface to your private class will most likely lead to an “unchecked cast” warning. As a special exemption from the usual rule regarding warnings, you’re allowed to have two of these “unchecked cast” warnings for free on this problem! This is not because we like them, but because the way to make them disappear (short of using @SuppressWarnings which you’re not allowed to do) is shrouded in deep mystery.

Note that you are allowed (in principle anyway) to use the Java collection classes java.util.List, java.util.ArrayList, and java.util.LinkedList. However, as discussed in lecture, using these will cause iterator complications that you’ll have to deal with. (See also the long comment regarding iterators in the Graph<V, E> interface.) The other option of course is to continue using our own home-grown interfaces and implementations.

Testing

Your SparseGraph<V, E> class should come with a full suite of JUnit 4 test cases in a file SparseGraphTest.java. You should follow the approach we showed you for testing stacks, namely using a GraphTestBase.java class to hold the test cases applicable to all graphs and using SparseGraphTest.java only to create the correct object to test. Be sure to test all methods and be sure to test all exceptions for error situations as well.

Printing Graphs

Finally, your SparseGraph<V, E> class has to implement a special toString() method as well. For a simple Graph<String, Integer> into which the vertices holding data A and B as well as an edge holding data 7 from A to B have been inserted, the string returned by toString() should have the following format:

digraph {
  "A";
  "B";
  "A" -> "B" [label="7"];
}

Note that it is very important that you follow this format exactly as given, curly braces and quote marks and all. In this format, all the vertices are listed first, one per line, then all the edges, one per line. A semi-colon ends each line. The “label” is particular to the format being used, but not to be confused with the label methods in our Graph interface. Rather the word “label” here is used to designate the data value stored in an edge. (Interface-related labels stored in edges or vertices during algorithm execution are only temporary and not part of the toString output in any way.) If you want you can stop reading this problem now; but if you are curious why we picked this format, read on.

Drawing Graphs with GraphViz

The strange format for “printing” the graph is useful because various tools in the GraphViz package can be used to turn those strings into “actual” graphs:

This particular image was made by running the command

$ dot -Tpng graph.dot >graph.png

with graph.dot containing the string shown above. Note that you’re not required to do anything in particular with GraphViz for this problem, just follow the format as described and you’ll be fine. (But again, if you’re curious to try, you can install graphviz easily in Lubuntu using the Synaptic Package Manager.)

Documentation

Don’t forget to add proper javadoc comments for your SparseGraph class. You should not repeat documentation from the Graph interface unless your implementation of a method deviates from what is stated there.

Problem 2: Six Degrees of Angelina Jolie (30%)

Your second task for this project is to write a program that will help you play the Six Degrees of Kevin Bacon game. This game is played on a graph of movies and actors and actresses who starred in them. You can read all about the original game at the link above, but we’ll be playing a variant called “Six Degrees of Angelina Jolie” instead. It’s only fair.

The Jolie Number is defined as follows: Angelina Jolie has a Jolie Number of 0; people who co-starred in a movie with Angelina Jolie have a Jolie Number of 1; people who co-starred in a movie with someone who has a Jolie Number of 1 in turn have a Jolie Number of 2; and so on.

Given any actor or actress, the goal of the game is to establish their Jolie Number by a sequence of movies and co-stars. For example, using the “action” data set posted on Piazza, we can show that Peter Sellers has a Jolie Number of 2 as follows:

Sellers, Peter
Casino Royale (1967)
Bisset, Jacqueline
Mr. & Mrs. Smith (2005)
Jolie, Angelina

Angelina Jolie (0) starred in “Mr. & Mrs. Smith” with Jacqueline Bisset (1) who starred in “Casino Royale” with Peter Sellers (2). Or let’s try Meryl Streep instead:

Streep, Meryl
River Wild, The (1994)
Bratt, Benjamin (I)
Great Raid, The (2005)
Macintosh, Laird
Alexander (2004)
Jolie, Angelina

Angelina Jolie (0) starred in “Alexander” with Laird Macintosh (1) who starred in “The Great Raid” with Benjamin Bratt (2) who starred in “The River Wild” with Meryl Streep (3). Interestingly this works way into the past as well:

De Rosselli, Rex
Elmo, the Mighty (1919)
Lincoln, Elmo
Story of Dr. Wassell, The (1944)
Roosevelt, Franklin Delano
Great Raid, The (2005)
Macintosh, Laird
Alexander (2004)
Jolie, Angelina

Turns out that Rex De Roselli died in 1941 and yet he has a Jolie Number of just 4! Actually Rex really has a Jolie Number of 3 but we need to use a much bigger data set to show this:

De Rosselli, Rex
Lazy Lightning (1926)
Gaynor, Janet (I)
Century of Cinema, A (1994)
Hopkins, Anthony
Alexander (2004)
Jolie, Angelina

We have posted two data sets suitable for the program on Piazza (the data sets are courtesy of Robert Sedgewick):

The format of these data sets is rather simple: Each line is a movie, and each movie consists of several fields separated by the “/” character. The first field is the name of the movie itself, all the following fields are the names of actors and actresses starring in the movie. Reading this data is not complicated (although we did have to clean up the data sets to work around Java’s weird issues with UTF-8 files), but we hand you the parsing code anyway so you can focus on the search algorithm instead.

In order to find the smallest Jolie Number for an actor we proceed as follows: First we identify the vertices for both Angelina Jolie and the actor in question (we already give you that code). Then we start a breadth-first search at the vertex for Angelina; as we do this we keep track of the “previous vertex” that got us to the one we’re currently investigating. Once we find the vertex for the other actor, we are done: We just have to print out the path that got us here. This implementation of BFS is the only thing you have to write for this problem! Please put your code into the Angelina.java file we posted on Piazza for you. Make sure that the user can run it as follows from the command line:

$ java Angelina action06.txt "Oldman, Gary"
Oldman, Gary
Batman Begins (2005)
Bedella, David
Alexander (2004)
Jolie, Angelina

In other words, the first argument to the program is the database to read and the second argument to the program is the actor for whom we want to determine the Jolie Number. You don’t have to worry too much about error handling for this problem, as long as correct command line invocations (a database file that’s actually present and has the correct format) work you’ll be fine. Of course you still shouldn’t throw exceptions…

As a special exemption from the usual rule regarding warnings, you’re allowed to have one “unchecked cast” warning on this problem!

Note: You may run into Java heap memory issues when running this program in our Lubuntu environment with the very large all06.txt file. If so, you can manually specify the heap size for the Java Virtual Machine when you run the program as follows. For example, if your VM machine has 1024 MB of internal memory, you can run the program this way:

$ java -Xmx1024m Angelina all06.txt "Oldman, Gary"

The -Xmx option allows you to specify a particular heap size for the Java VM.

General Hints

Deliverables

You must turn in a gzip-compressed tarball of your project; the filename should be cs226-assignment-number-jhed.tar.gz with number replaced by the number of this project (see above) and jhed replaced by your JHED identifier. (For example, Peter would use cs226-assignment-3-pfroehl2.tar.gz for his submission of Project 3.) The tarball should contain no derived files whatsoever (i.e. no .class files, no .html files, etc.), but should allow building all derived files. Include a plain text README file (not README.txt or README.docx or whatnot) that briefly explains what your programs do and contains any other notes you want us to check out before grading; your answers to written problems should be in this file as well. Finally, make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!

Grading

For reference, here is a short explanation of the grading criteria; some of the criteria don’t apply to all problems, and not all of the criteria are used on all projects.

Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for projects.

Style refers to Java programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable javadoc documentation, etc. Many aspects of this are enforced automatically by Checkstyle when run with the configuration file available on Piazza. Style also includes proper modularization of your code (into interfaces, classes, methods, using public, protected, and private appropriately, etc.). Simple, clean, readable code is what you should be aiming for.

Testing refers to proper unit tests for all of the data structure classes you developed for this project, using the JUnit 4 framework as introduced in lecture. Make sure you test all (implied) axioms that you can think of and all exception conditions that are relevant.

Performance refers to how fast/with how little memory your program can produce the required results compared to other submissions.

Functionality refers to your programs being able to do what they should according to the specification given above; if the specification is ambiguous and you had to make a certain choice, defend that choice in your README file.

If your programs cannot be built you will get no points whatsoever. If your programs cannot be built without warnings using javac -Xlint:all we will take off 10% (except if you document a very good reason; no, you cannot use the @SuppressWarnings annotation either). If your programs fail miserably even once, i.e. terminate with an exception of any kind, we will take off 10% (however we’ll also take those 10% off if you’re trying to be “excessively smart” by wrapping your whole program into a universal try-catch).