Assignment 6: Revenge of the Data Base
Out on: March 3, 2008
Due by: March 12, 2008, 3:00 pm (before lecture)
Collaboration: Pairs
Grading: Packaging 10%, Design 10%, Style 10%, Performance 30%, Functionality 40%
Overview
The sixth assignment asks you to collectively improve the simple database system each one of you helped implement for the last assignment. Note that the grading criteria have changed to include performance as well! Since you are working as a pair again, it's important that you coordinate your schedules and stay in touch during the week! Working together in lab or at a laptop in a cafe somewhere is highly encouraged! After the assignment is over, each of you will be asked to evaluate the contributions that you and your partner made to the assignment.
Interface & Implementations
This is where you can get the latest version of the interface you need. Please watch the mailing list carefully for changes to this!
- Latest version: sdbm.h with md5sum 69801a25ec93880f8cef3107e014d777
Also, you need to grab this
archive which contains object files for all implementations
that were submitted for the last assignment. The object files
were generated on ugradx.cs.jhu.edu and should
probably be used there (or another Intel Linux box).
See Problem 1 for details.
Problem 0: Code Review (5%)
Your first task is another code review!
You have a new partner, and each of you (hopefully) has an
implementation from last week.
Somehow you must decide whose code you're going to work on
for this assignment, and what better way
is there than a thrilling code review?
You don't have to write down any of the details about the
code review this time, you simply have to write down the
outcome of the process:
In your README file, explain whose
code you are picking as your starting point for this assignment,
and why!
I strongly recommend being done with the code review by Thursday at the latest! You'll need time to actually work on the remaining problems!
Problem 1: Evaluating Implementations (45%)
Your second task is to evaluate a number of
implementations of sdbm.h against each other.
We supply object files for the implementations, renamed and
stripped of certain details (see the archive above).
You will need to write test programs that show how well the
various implementations work.
There are two major criteria for your evaluation: Functionality and performance. Functionality asks whether an implementation works or not. You want to write nasty code that violates the interface and manages to crash the implementation. You want to write code that doesn't violate any interface restrictions but still fails for some reason. You want to write code that shows inconsistencies, e.g. that an implementation allows the same key to be inserted twice.
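The duplicate-key inconsistency mentioned above could be probed with a small check like the following sketch. The `insert_fn` signature and both `toy_insert_*` functions are hypothetical stand-ins invented for illustration; they are not taken from sdbm.h, so you would wrap the real insertion call to match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* HYPOTHETICAL signature (not from sdbm.h): returns nonzero on success. */
typedef int (*insert_fn)(const char *key, const char *value);

/* Returns 1 if a second insert of the same key is refused,
 * 0 if it is accepted (an inconsistency), and
 * -1 if even the first insert fails. */
int rejects_duplicate_keys(insert_fn insert)
{
    assert(insert != NULL);
    if (!insert("dup-key", "first"))
        return -1;
    return insert("dup-key", "second") ? 0 : 1;
}

/* Toy stand-ins so the check can be exercised without a real database. */
static char toy_key[32];
static int toy_used = 0;

/* Remembers one key and refuses to insert it again. */
int toy_insert_strict(const char *key, const char *value)
{
    (void)value;
    if (toy_used && strcmp(toy_key, key) == 0)
        return 0;
    if (strlen(key) >= sizeof toy_key)
        return 0;
    strcpy(toy_key, key);
    toy_used = 1;
    return 1;
}

/* Accepts everything, including duplicates. */
int toy_insert_sloppy(const char *key, const char *value)
{
    (void)key;
    (void)value;
    return 1;
}
```

Passing the implementation under test as a function pointer means the same check can be linked against each supplied object file in turn.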
Performance asks how efficient an implementation is in terms of time as well as space (including runtime memory demands as well as disk memory demands). You want to write code that simulates different "loads" on the database, for example lots of insertions with few gets, lots of gets all over the database, lots of gets that repeat a lot, etc. Try to write code that allows you to distinguish all of the implementations as clearly as possible in terms of performance.
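One possible way to time a "load" is sketched below. It measures CPU time with the standard `clock()` function; the `workload_fn` type and the `dummy_load` workload are assumptions made up for this example and would be replaced by code that actually calls the database.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* A workload is any function that performs n operations on some state. */
typedef void (*workload_fn)(void *state, size_t n);

/* Returns the CPU seconds spent in work(state, n),
 * or -1.0 if clock() is unavailable. */
double time_workload(workload_fn work, void *state, size_t n)
{
    clock_t start, end;

    assert(work != NULL);
    start = clock();
    if (start == (clock_t)-1)
        return -1.0;
    work(state, n);
    end = clock();
    if (end == (clock_t)-1)
        return -1.0;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Stand-in workload (hypothetical): bumps a counter n times.
 * A real test would do insertions or gets here instead. */
void dummy_load(void *state, size_t n)
{
    size_t *counter = state;
    size_t i;

    for (i = 0; i < n; i++)
        (*counter)++;
}
```

Keep in mind that `clock()` reports CPU time only, so time spent waiting on the disk does not show up; for I/O-heavy loads you may want a wall-clock measurement as well.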
The goal for this problem is to write up a short report in
your README file that compares the various
implementations.
The report should state what the shortcomings of each
implementation are and how you found them.
The report should conclude with a recommendation for the
two best implementations; make sure you
state why you chose those two.
Aside from that report, we expect you to turn in all the test code you wrote, including any shell scripts or scripts in other languages. You should give us all the pieces that allow us to independently verify your results and conclusions. It is a very bad idea to run your tests by hand, because then they are not repeatable; try to automate as much of this as you can. If you do, this will come in handy for Problem 2 as well, since you can use the same tests and tools to evaluate your own implementation.
Ideally we can simply say
make testall
and all test programs are compiled, linked, run, and an
automated report is generated. :-)
This is not what we require, but it is what we would
like to get from you (in addition to the
report itself of course).
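As one way to get an automated report out of your test programs, a tiny table-driven runner might look like the sketch below. The `test_case` structure, `run_all`, and the two sample tests are hypothetical names chosen for this example, not part of the assignment's interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One named test; run() returns 1 on pass and 0 on failure. */
struct test_case {
    const char *name;
    int (*run)(void);
};

/* Runs every test, writes one PASS/FAIL line per test plus a summary
 * to report, and returns the number of failed tests. */
int run_all(const struct test_case *tests, size_t n, FILE *report)
{
    int failures = 0;
    size_t i;

    assert(tests != NULL || n == 0);
    assert(report != NULL);
    for (i = 0; i < n; i++) {
        int ok = tests[i].run();
        fprintf(report, "%-4s %s\n", ok ? "PASS" : "FAIL", tests[i].name);
        if (!ok)
            failures++;
    }
    fprintf(report, "%lu tests, %d failed\n", (unsigned long)n, failures);
    return failures;
}

/* Placeholder tests; real ones would exercise the database. */
int always_passes(void) { return 1; }
int always_fails(void) { return 0; }
```

A test program's `main` could return the result of `run_all` as its exit status, which lets a Makefile target notice failures.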
Problem 2: Database Implementation (50%)
Your third and final task is to improve your database implementation in two main dimensions: Functionality and performance! Both of these require that you use the various tools we have discussed in the last few weeks to your fullest advantage. It helps a lot that you will have a good idea how everybody else is doing with their implementations from Problem 1. :-)
Functionality.
First you should investigate if you're getting all you can
out of gcc; you may want to spend an hour or
so reading over the various options in man gcc
to see if there are any you could add to find additional
problems with your code.
Second you should use splint to get even more
information about your code; this is a little harder than
dealing with gcc warnings, and some things
that splint dislikes probably are not really
an issue, but you should go through all the complaints it
has at least once and decide if there's anything to fix.
Third you should use gcov to see how much of
your code you are actually testing; your goal should be
100% line coverage: if you can't get 100% you are certainly
not doing enough testing and you should add further test
cases until you are sure all your code actually gets a
chance to fail.
Finally you may want to go through your code and place
assert() checks everywhere you make an
assumption about the value of a variable or the result
of a function but you're not actually checking to make
sure yet; it might be even better to add the actual
check, but an assert() is at least going
to make you aware of an issue.
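The difference between an `assert()` and an actual check can be sketched as follows. The function `store_key` and the limit `MAX_KEY` are hypothetical and only serve as an illustration: assumptions about the caller are asserted, while bad input is rejected with a real check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_KEY 64 /* hypothetical limit, including the terminating '\0' */

/* Copies key into slot. Returns 1 on success, 0 if the key is rejected. */
int store_key(char slot[MAX_KEY], const char *key)
{
    size_t len;

    /* Assumptions about the caller: a NULL here is a programming error. */
    assert(slot != NULL);
    assert(key != NULL);

    len = strlen(key);

    /* Bad input is not a programming error, so it gets an actual check. */
    if (len == 0 || len >= MAX_KEY)
        return 0;

    memcpy(slot, key, len + 1);

    /* Assumption about our own result: the copy is terminated. */
    assert(slot[len] == '\0');
    return 1;
}
```

Remember that compiling with `-DNDEBUG` removes every `assert()`, so anything that must hold in a release build needs the actual check.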
Of course you're free to do even more: If you know about
other helpful tools or techniques you could leverage to improve the
quality of your code, use them!
Performance.
First you should use gcov and gprof
to find out where the bottlenecks (in terms of speed) in your
implementation are.
You should already have tests for evaluating performance from
Problem 1, but now you can use them to actually investigate
where in your code those tests spend most of
their time; those "hotspots" are where you should focus your
attempts at performance tuning initially.
However, it is certainly possible that you actually have to
change your implementation more dramatically to achieve better
performance.
For example, if you are using linear search in a linked list
right now, it will simply be impossible to beat someone who
is using binary search on a sorted array the right way.
So don't spend too much time "fine tuning" your code; if you
can't make decent progress in a day or so, step back and think
of a different way to organize the data that would allow you
to gain more substantial amounts of performance.
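To illustrate the binary search alternative mentioned above, here is a minimal in-memory sketch built on the standard `qsort()` and `bsearch()` functions. The `struct entry` layout and the function names are assumptions for this example; a real implementation also has to deal with insertions and with data stored on disk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory record: a key and its value. */
struct entry {
    const char *key;
    const char *value;
};

/* Orders entries by key; used for both sorting and searching. */
static int cmp_entry(const void *a, const void *b)
{
    const struct entry *ea = a;
    const struct entry *eb = b;

    return strcmp(ea->key, eb->key);
}

/* Sorts the array once so that lookups can use binary search. */
void sort_entries(struct entry *entries, size_t n)
{
    assert(entries != NULL || n == 0);
    qsort(entries, n, sizeof entries[0], cmp_entry);
}

/* O(log n) lookup in a sorted array; returns NULL if key is absent. */
const char *find_value(const struct entry *entries, size_t n, const char *key)
{
    struct entry probe;
    const struct entry *hit;

    assert(key != NULL);
    probe.key = key;
    probe.value = NULL;
    hit = bsearch(&probe, entries, n, sizeof entries[0], cmp_entry);
    return hit != NULL ? hit->value : NULL;
}
```

The catch is that the array must stay sorted, so every insertion either shifts elements or triggers a re-sort; whether that pays off depends on the mix of insertions and gets in your loads.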
And don't focus exclusively on speed: make sure you also use
valgrind and memusage
to find memory problems in your implementation; you should
try to strike a balance between speed and memory usage, and
that can be even harder than just making things fast.
Don't forget to evaluate disk space as well: it's not just
about how much main memory you use during database access,
it's also important to store the data on disk in the most
compact way (without actually compressing it of course :-).
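Disk usage can be measured from a test program with the POSIX `stat()` call, as in the sketch below. The helper names are made up for this example, and the path of the database file is whatever your implementation creates.

```c
/* Ask for POSIX declarations even when compiling in strict ISO C mode. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Returns the size of the file at path in bytes,
 * or -1 if the file cannot be examined. */
long file_size_bytes(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1L;
    return (long)st.st_size;
}

/* Writes a one-line size report for path to out. */
void print_size_report(FILE *out, const char *path)
{
    long size = file_size_bytes(path);

    if (size < 0)
        fprintf(out, "%s: cannot stat\n", path);
    else
        fprintf(out, "%s: %ld bytes\n", path, size);
}
```

Calling this after each load lets you relate the file size to the number of records stored, which makes implementations easier to compare.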
Report.
Finally, you not only have to do all this
stuff, you also have to document what you
did to some extent in your README file.
Don't overdo it: You do not have to write pages and pages!
But you should tell us briefly what tools you used, how you
used them, what information they provided to you, what kinds
of decisions you made based on that information, etc.
If you can't achieve 100% line coverage, you should tell us
why that's the case. If you can't fix all warnings or all
the feedback from splint you may want to talk
about that as well.
In short: We want to know what you did, why, and how, and which
tools helped you on the way. We want to see that you're real
engineers. :-)
As before, please put your implementation into a file sdbm.c
and make sure that sdbm.o builds by itself so it
can be linked to the separately compiled test programs.
Make sure to submit all your testing code as well,
including shell scripts or scripts in other languages.
Hints
- Just in case you have trouble seeing the wood for the trees (or whatever that saying is): Pick an implementation to work on (Problem 0), break everyone else's implementation and see how fast they are (Problem 1), make yours the fastest and most stable implementation around (Problem 2).
- Feel free to bring up changes to the interface just like last week. We can discuss them on the mailing list and see what everybody thinks.
- Technically you have to worry about what happens when two programs access the database concurrently. However, this turns out to be quite involved to get right, so you can simply assume that it won't happen for now.
Deliverables
Please turn in a gzip compressed tarball of your assignment;
the filename should be cs120-assign-6-login1-login2.tar.gz
with login1 and login2 replaced by your Unix login names on ugradx.cs.jhu.edu.
The tarball should contain no derived files whatsoever
(i.e. no executable files),
but allow building all derived files.
Include a README file that briefly explains what your
programs do and contains any other notes you want us to check out
before grading.
Include a Makefile to build your project.
Grading
For reference, here is a short explanation of the grading criteria.
Packaging refers to the proper organization of the
stuff you hand in, following the guidelines for Deliverables above.
Style refers to C programming style, including
things like consistent indentation, appropriate identifiers,
useful comments, suitable documentation, etc.
Simple, clean, readable code is what you should be aiming for.
Performance refers to how effectively your program
can produce the required results compared to other submissions.
Design refers to proper modularization and the
proper choice of algorithms and data structures.
Functionality refers to your programs being
able to do what they should according to the specification
given above; if the specification is ambiguous and you had
to make a certain choice, defend that choice in your
README file.
If your programs cannot be built you will get no points whatsoever.
If your programs cannot be built without warnings using
gcc -ansi -pedantic -Wall -Wextra -std=c99 -O
we will take off 10% (except if you document a very good reason).
If your programs cannot be built using make we will
take off 10%.
If your programs fail miserably even once,
i.e. terminate with an exception of any kind or dump core,
we will take off 10%.
Finally, make sure to include your name and email address in
every file you turn in!