Spring Semester 2008

January 28, 2008 – May 2, 2008

Assignment 6: Revenge of the Data Base

Out on: March 3, 2008
Due by: March 12, 2008, 3:00 pm (before lecture)
Collaboration: Pairs
Grading: Packaging 10%, Design 10%, Style 10%, Performance 30%, Functionality 40%

Overview

The sixth assignment asks you to collectively improve the simple database system each one of you helped implement for the last assignment. Note that the grading criteria have changed to include performance as well! Since you are working as a pair again, it's important that you coordinate your schedules and stay in touch during the week! Working together in lab or at a laptop in a cafe somewhere is highly encouraged! After the assignment is over, each of you will be asked to evaluate your own and your partner's contributions to the assignment.

Interface & Implementations

This is where you can get the latest version of the interface you need. Please watch the mailing list carefully for changes to this!

Also, you need to grab this archive, which contains object files for all implementations that were submitted for the last assignment. The object files were generated on ugradx.cs.jhu.edu and should probably be used there (or on another Intel Linux box). See Problem 1 for details.

Problem 0: Code Review (5%)

Your first task is another code review! You have a new partner, and each of you (hopefully) has an implementation from last week. Somehow you must decide whose code you're going to work on for this assignment, and what better way is there than a thrilling code review? You don't have to write down any of the details about the code review this time, you simply have to write down the outcome of the process: In your README file, explain whose code you are picking as your starting point for this assignment, and why!

I strongly recommend being done with the code review by Thursday at the latest! You'll need time to actually work on the remaining problems!

Problem 1: Evaluating Implementations (45%)

Your second task is to evaluate a number of implementations of sdbm.h against each other. We supply object files for the implementations, renamed and stripped of certain details (see the archive above). You will need to write test programs that show how well the various implementations work.

There are two major criteria for your evaluation: Functionality and performance. Functionality asks whether an implementation works or not. You want to write nasty code that violates the interface and manages to crash the implementation. You want to write code that doesn't violate any interface restrictions but still fails for some reason. You want to write code that shows inconsistencies, e.g. that an implementation allows the same key to be inserted twice.
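Since some of these tests are meant to crash the implementation, it can help to run each one in its own process so a crash is recorded instead of killing the whole test run. The following is only a sketch of that idea using POSIX fork() and waitpid(); the names test_fn, run_isolated and the three sample tests are made up for illustration and do not come from sdbm.h.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes fork() and waitpid() under -std=c99 */
#endif

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical test signature: return 0 on success, nonzero on failure. */
typedef int (*test_fn)(void);

enum { TEST_ERROR = -1, TEST_PASSED = 0, TEST_FAILED = 1, TEST_CRASHED = 2 };

/* Runs one test in a child process and classifies how it ended. */
int run_isolated(test_fn test)
{
    pid_t pid;
    int status;

    assert(test != NULL);
    fflush(NULL);                     /* keep buffered output from being duplicated */
    pid = fork();
    if (pid < 0)
        return TEST_ERROR;            /* could not create the child */
    if (pid == 0)
        _exit(test() == 0 ? 0 : 1);   /* child: report the result as exit status */

    if (waitpid(pid, &status, 0) < 0)
        return TEST_ERROR;
    if (WIFSIGNALED(status))
        return TEST_CRASHED;          /* killed by a signal, e.g. SIGSEGV or SIGABRT */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return TEST_PASSED;
    return TEST_FAILED;
}

/* Stand-ins for real tests, which would call the functions declared in sdbm.h. */
int sample_pass(void)  { return 0; }
int sample_fail(void)  { return 1; }
int sample_crash(void) { abort(); }
```

A test driver could loop over an array of such test functions and print one line per result; because the child is a separate process, a test that dumps core does not affect the tests that follow it.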

Performance asks how efficient an implementation is in terms of time as well as space (including runtime memory demands as well as disk memory demands). You want to write code that simulates different "loads" on the database, for example lots of insertions with few gets, lots of gets all over the database, lots of gets that repeat a lot, etc. Try to write code that allows you to distinguish all of the implementations as clearly as possible in terms of performance.
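One possible way to simulate such loads repeatably is to generate key indices from a seeded generator and to time a batch of operations. The sketch below assumes nothing about sdbm.h: the workload struct, workload_next_key, time_workload and count_operation are hypothetical names, and the operation being timed is passed in as a function pointer so you can plug in your own insert or get wrapper.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical description of a load; the seed makes every run repeatable. */
typedef struct {
    uint32_t state;        /* generator state (the seed); must be nonzero */
    uint32_t nkeys;        /* size of the whole key space */
    uint32_t hot;          /* number of frequently repeated keys (0 for none) */
    uint32_t hot_percent;  /* 0..100: share of requests aimed at the hot keys */
} workload;

/* xorshift32 generator, so results do not depend on the platform's rand(). */
static uint32_t next_random(workload *w)
{
    uint32_t x = w->state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    w->state = x;
    return x;
}

/* Returns the index of the next key to use, in the range 0..nkeys-1. */
uint32_t workload_next_key(workload *w)
{
    assert(w->state != 0 && w->nkeys > 0 && w->hot <= w->nkeys);
    if (w->hot > 0 && next_random(w) % 100u < w->hot_percent)
        return next_random(w) % w->hot;   /* repeat one of the hot keys */
    return next_random(w) % w->nkeys;     /* otherwise spread over all keys */
}

/* Hypothetical callback: performs one database operation for a key index. */
typedef void (*operation)(uint32_t key_index, void *context);

/* Runs count operations and returns the CPU time they took, in seconds. */
double time_workload(workload *w, unsigned long count, operation op, void *context)
{
    unsigned long i;
    clock_t start = clock();

    for (i = 0; i < count; i++)
        op(workload_next_key(w), context);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Stand-in operation that only counts calls; a real one would do a put or get. */
void count_operation(uint32_t key_index, void *context)
{
    (void)key_index;
    ++*(unsigned long *)context;
}
```

Setting hot_percent to 90 with a small hot value approximates "lots of gets that repeat a lot", while hot set to 0 spreads requests all over the key space. Note that clock() reports CPU time, so for loads dominated by disk access you may want to measure wall-clock time as well.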

The goal for this problem is to write up a short report in your README file that compares the various implementations. The report should state what the shortcomings of each implementation are and how you found them. The report should conclude with a recommendation for the two best implementations; make sure you state why you chose those two.

Aside from that report, we expect you to turn in all the test code you wrote, including any shell scripts or scripts in other languages. You should give us all the pieces that allow us to independently verify your results and conclusions. It is a very bad idea to run your tests by hand, because then they are not repeatable; try to automate as much of this as you can. If you do, this will come in handy for Problem 2 as well, since you can use the same tests/tools to evaluate your own implementation.

Ideally we can simply say make testall and all test programs are compiled, linked, run, and an automated report is generated. :-) This is not what we require, but it is what we would like to get from you (in addition to the report itself of course).

Problem 2: Database Implementation (50%)

Your third and final task is to improve your database implementation in two main dimensions: Functionality and performance! Both of these require that you use the various tools we have discussed in the last few weeks to your fullest advantage. It helps a lot that you will have a good idea how everybody else is doing with their implementations from Problem 1. :-)

Functionality. First you should investigate if you're getting all you can out of gcc; you may want to spend an hour or so reading over the various options in man gcc to see if there are any you could add to find additional problems with your code. Second you should use splint to get even more information about your code; this is a little harder than dealing with gcc warnings, and some things that splint dislikes probably are not really an issue, but you should go through all the complaints it has at least once and decide if there's anything to fix. Third you should use gcov to see how much of your code you are actually testing; your goal should be 100% line coverage: if you can't get 100% you are certainly not doing enough testing and you should add further test cases until you are sure all your code actually gets a chance to fail. Finally you may want to go through your code and place assert() checks everywhere you make an assumption about the value of a variable or the result of a function but you're not actually checking to make sure yet; it might be even better to add the actual check, but an assert() is at least going to make you aware of an issue. Of course you're free to do even more: If you know about other helpful tools or techniques you could leverage to improve the quality of your code, use them!
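To illustrate the kind of assert() checks meant here, consider a helper that copies a key into a fixed-size slot. This is just a sketch: store_key and KEY_CAPACITY are invented for the example and are not part of sdbm.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KEY_CAPACITY 32  /* hypothetical limit, not taken from sdbm.h */

/* Copies key, including its terminator, into a slot of KEY_CAPACITY bytes.
 * Returns the length of the key. */
size_t store_key(char *slot, const char *key)
{
    size_t len;

    assert(slot != NULL);        /* assumption: the caller passes a real buffer */
    assert(key != NULL);         /* assumption: the key is never a null pointer */
    len = strlen(key);
    assert(len < KEY_CAPACITY);  /* assumption: key and terminator fit the slot */

    memcpy(slot, key, len + 1);
    return len;
}
```

Keep in mind that assert() compiles to nothing when NDEBUG is defined, so anything a caller of your interface could get wrong still deserves an actual check with a proper error return.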

Performance. First you should use gcov and gprof to find out where the bottlenecks (in terms of speed) in your implementation are. You should already have tests for evaluating performance from Problem 1, but now you can use them to actually investigate where in your code those tests spend most of their time; those "hotspots" are where you should focus your attempts at performance tuning initially. However, it is certainly possible that you actually have to change your implementation more dramatically to achieve better performance. For example, if you are using linear search in a linked list right now, it will simply be impossible to beat someone who is using binary search on a sorted array the right way. So don't spend too much time "fine tuning" your code; if you can't make decent progress in a day or so, step back and think of a different way to organize the data that would allow you to gain more substantial amounts of performance. And don't focus on speed exclusively: make sure you also use valgrind and memusage to find memory problems in your implementation; you should try to strike a balance between speed and memory usage, and that can be even harder than just making things fast. Don't forget to evaluate disk space as well: it's not just about how much main memory you use during database access, it's also important to store the data on disk in the most compact way (without actually compressing it, of course :-).
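As a reminder of what the binary search alternative can look like, here is a minimal sketch built on qsort() and bsearch() from the standard library. The record type and the names sort_records and find_value are made up for the example; how your database actually lays out its records is up to you.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory record; not the layout any sdbm implementation must use. */
typedef struct {
    const char *key;
    const char *value;
} record;

/* Orders records by key; shared by qsort() and bsearch(). */
static int compare_records(const void *a, const void *b)
{
    const record *ra = a;
    const record *rb = b;
    return strcmp(ra->key, rb->key);
}

/* Sorts the array by key so that find_value() can use binary search. */
void sort_records(record *recs, size_t n)
{
    qsort(recs, n, sizeof *recs, compare_records);
}

/* Looks up key in a sorted array; returns its value or NULL if absent. */
const char *find_value(const record *recs, size_t n, const char *key)
{
    record probe;
    const record *hit;

    assert(key != NULL);
    probe.key = key;
    probe.value = NULL;
    hit = bsearch(&probe, recs, n, sizeof *recs, compare_records);
    return hit != NULL ? hit->value : NULL;
}
```

Lookups take O(log n) comparisons instead of O(n), but keeping the array sorted makes each insertion O(n) in the worst case, so whether this wins depends on the mix of insertions and gets in your workloads.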

Report. Finally, you not only have to do all this stuff, you also have to document what you did to some extent in your README file. Don't overdo it: You do not have to write pages and pages! But you should tell us briefly what tools you used, how you used them, what information they provided to you, what kinds of decisions you made based on that information, etc. If you can't achieve 100% line coverage, you should tell us why that's the case. If you can't fix all warnings or all the feedback from splint, you may want to talk about that as well. In short: We want to know what you did, why, and how, and which tools helped you on the way. We want to see that you're real engineers. :-)

As before, please put your implementation into a file sdbm.c and make sure that sdbm.o builds by itself so it can be linked to the separately compiled test programs. Make sure to submit all your testing code as well, including shell scripts or scripts in other languages.

Hints

Deliverables

Please turn in a gzip compressed tarball of your assignment; the filename should be cs120-assign-6-login1-login2.tar.gz with login1 and login2 replaced by your Unix login names on ugradx.cs.jhu.edu. The tarball should contain no derived files whatsoever (i.e. no object files or executables), but allow building all derived files. Include a README file that briefly explains what your programs do and contains any other notes you want us to check out before grading. Include a Makefile to build your project.

Grading

For reference, here is a short explanation of the grading criteria. Packaging refers to the proper organization of the stuff you hand in, following the guidelines for Deliverables above. Style refers to C programming style, including things like consistent indentation, appropriate identifiers, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for. Performance refers to how effectively your program can produce the required results compared to other submissions. Design refers to proper modularization and the proper choice of algorithms and data structures. Functionality refers to your programs being able to do what they should according to the specification given above; if the specification is ambiguous and you had to make a certain choice, defend that choice in your README file.

If your programs cannot be built you will get no points whatsoever. If your programs cannot be built without warnings using gcc -ansi -pedantic -Wall -Wextra -std=c99 -O we will take off 10% (except if you document a very good reason). If your programs cannot be built using make we will take off 10%. If your programs fail miserably even once, i.e. terminate with an exception of any kind or dump core, we will take off 10%. Finally, make sure to include your name and email address in every file you turn in!