Project 2: A Sorta Kinda Web Cache

Overview

The first project was mostly a warmup; this second project has a lot more meat on its proverbial bones. With that in mind, here’s a friendly piece of advice: Start as early as possible! If not on the actual programming, then at least on trying to understand all the pieces…

Your job is to build something sorta kinda like a web cache. It’s going to be pretty far from an actual web cache, but at least some of the pieces are the same. The basic job of your web cache is to read a URL from standard input, check if it has a cached copy of that URL, and if so write the cached copy to standard output. Simple in principle, but all the dirty details are below.

There are two parts to this project: one in which you build a persistent key-value store, and one in which you build the actual caching program. The two parts are connected through the interface defined by the sdbm.h header file posted on Piazza. You’ll want to carefully read through that file first and ask plenty of questions about it: If you don’t fully understand the API, you won’t be able to complete either part.

Note that you’re expected to also develop (and submit) a useful testing infrastructure for your project. This will involve shell scripting as before to test the web cache as a whole, but it will also involve writing unit tests for your sdbm.c implementation of the persistent key-value store.

The biggest challenge for this project will be coming up with a good conceptual model for how your key-value store will work. There are a number of “moving parts”, all of which have to work together properly for a decent solution. To complicate things further, you have to invent most of them yourself. Try to go for the simplest possible approach that won’t embarrass you later!

The Web Cache Program

Let’s start with an example:

$ ./webber samplecache
http://www.cs.jhu.edu/
404 Not Found
http://fake.url.com/
200 OK
<html>Fake HTML document</html>
http://cs220.rocks/
200 OK
<html>Fake HTML document</html>
$

The webber program is started with the name of a “database” that holds the content of the web cache, samplecache in this case. Once the program is started, it waits for a line of input spelling out a URL to look up. If the program doesn’t have a copy of that URL in its database, it prints 404 Not Found; if the program does have a copy of the URL in its database, it prints 200 OK followed by the cached content (the two fake HTML documents in the example above). In either case, it then waits for another URL to look up. As per usual, CTRL-D ends the program.

Note that all the output above was written to stdout! Even though you might think of 404 Not Found as an “error message” at first, it’s not in fact an error message from webber itself: Nothing went wrong in the program, it simply didn’t have the requested information. So this is not an “error message” in the same sense that “can’t open database” would be. Here’s what a “real” error message (printed to stderr) would look like:

$ ./webber somewrongname
error: can't open database
$

All your error messages have to start on a new line with the string error: but the rest of the error message is up to you; the only other constraint is that an error message can only be a single line. If a “real” error occurs, the program is supposed to set EXIT_FAILURE as an exit status eventually, otherwise it’s supposed to set EXIT_SUCCESS as per usual.

Note that we provide a program mksample.c for you that will use your key-value store to create an example cache database. You’ll probably want to write a similar program to create a larger, more complex cache eventually, for testing purposes. (Keep in mind that we’re grading for performance this time, so you will definitely want a larger cache database to see how well your program does.)

The Key-Value Store

You’ll implement another key-value store for this project. However, in contrast to the kvs homework assignment, this key-value store will be persistent. Say program A creates a database X, inserts a bunch of key-value pairs into X, and then exits. Later another program B is started, opens the database X, and prints all the key-value pairs in it: program B will output the stuff program A inserted earlier. For this to work, the information has to be stored on disk in the form of one (or more) files. (Note that in the context of this project you can think of mksample.c as program A and of webber.c as program B.)

On Piazza you’ll find the sdbm.h header file that describes the interface to the key-value store. Do not change sdbm.h under any circumstances! Make sure that you read the interface carefully and understand what each operation is supposed to do before you start hacking! Your job is to write sdbm.c to implement all the operations from sdbm.h. You’ll have to decide how to store the data, which will in turn determine the code you’ll have to write to access/update it. (We’ll discuss one implementation option in lecture, but it’s neither the best one nor the simplest one. Thinking for yourself is highly encouraged.)

Testing

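You should have a testing infrastructure in place for your project. Since there are two distinct parts, the key-value store and the web cache, there will be two distinct approaches to testing: system tests, written as shell scripts that exercise the web cache as a whole, and unit tests for your sdbm.c implementation of the key-value store.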

We’ll discuss two approaches to unit testing in lecture, one based on “hand written” test drivers that use C’s assert macro and one based on the ct unit testing framework. It’ll be up to you to decide which approach you want to use: The former is easier to get started with, the latter provides better overall support for unit testing.

Another choice that will be up to you is whether you want to keep using valgrind to guard against memory-related bugs in your code or if you prefer to use the more general “sanitizers” available in gcc that we covered in lecture.

Finally, you’ll have to perform coverage analysis for your test cases. That is, you’ll have to use gcov to determine what parts of your code base (in sdbm.c and webber.c) your test cases actually test. Your goal is to get 100% line coverage for both files, but it may not be realistic to achieve that; try to get as close as possible and defend any missing coverage in your README file.

Hints

Deliverables

All your core C code for the web cache should be in webber.c and sdbm.c. Please don’t write additional modules that complicate how your program must be linked!

Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no executable files), but allows building all required derived files. Also, be sure to include a Makefile that sets the appropriate compiler flags and builds all programs by default. The Makefile should also have clean and test targets as per usual; the test target should run both system and unit tests; ideally it also runs coverage analysis for you. Finally, make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!

Grading

For reference, here is a short explanation of the grading criteria; some of the criteria don’t apply to all problems, and not all of the criteria are used on all assignments.

Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments on Piazza.

Style refers to C programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for. Make sure you follow the style guide posted on Piazza!

Design refers to proper modularization (functions, modules, etc.) and an appropriate choice of algorithms and data structures.

Performance refers to how fast/with how little memory your programs can produce the required results compared to other submissions.

Functionality refers to your programs being able to do what they should according to the specification given above; if the specification is ambiguous, ask for clarification! (It also refers to you simply doing the required work, which may not be programming alone.)

If your programs cannot be built you will get no points whatsoever. If your programs cannot be built without warnings using the required compiler options given on Piazza we will take off 10% (except if you document a very good reason). If your programs cannot be built using make we will take off 10%. If your programs fail miserably even once, i.e. terminate with an exception of any kind or dump core, we will take off 10% (for each such case).