Spring Semester 2008

January 28, 2008 – May 2, 2008

Assignment 2: Up and Running?

Out on: February 4, 2008
Due by: February 11, 2008, 3:00 pm (before lecture)
Collaboration: None
Grading: Packaging 10%, Design 10%, Style 10%, Functionality 70%

Overview

The second assignment shows you how to build more complex Unix applications from tarballs and asks you to do some more complex programming on your own.

Problem 1: Building from Tarballs and Reading Code (40%)

You are going to explore the Bayesian spam filter bmf as an example of a larger application written in C. Grab the tarball for version 0.9.4 (using wget, for example) and extract it using tar and gzip. On a Linux system, you should be able to run ./configure followed by make all and watch the program get built from its separate source files. The result is an executable called bmf which, when run, reads an email from standard input and tries to tag it as either spam or ham. How it does that in detail is not that important, but of course you're free to read around in the program. Actually, you must read it to do this problem. :-)

Look through all the source code files and write a short summary of the purpose of each one. You do not need to understand every single line of code; concentrate instead on getting the "big picture" of how the program is structured. An average of two to three sentences per file is probably enough; some files will only require one sentence, some will require four or five. Finally, write a short paragraph that summarizes your general impression of the code: Did you like reading it? Was it horrible, and if so, why? What needs to be changed?

Make sure you use tools such as ctags and cscope to your advantage; they make it much easier to navigate unfamiliar code.

Problem 2: Formatting Text (60%)

You are to implement a highly simplified version of the Unix command fmt that breaks text from standard input into lines of a certain width. Consider, for example, the following input text:

Here is some text for you. Use it
wisely
to find out how fmt works and so on and so forth.

There are a few things to keep in mind.

  For example about indented text and what it means

  For example about more indented text and what it means

And so on, and so forth.

Feel free to try out what the real fmt program does with this, but it's not important for this problem. Your version of fmt should produce the following output given the above input:

Here is some text for you. Use
it wisely to find out how fmt
works and so on and so forth.
There are a few things to keep
in mind. For example about
indented text and what it
means For example about more
indented text and what it
means And so on, and so forth.

In other words, you break the input text (from standard input) into lines of at most 30 columns and write the result back out (to standard output). You ignore white space in the input except for using it to decide where words start and end; you never break words apart, ever. Words that are longer than 30 columns by themselves are copied to the output on a line by themselves, but in their complete length.
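The rule described above amounts to a greedy line fill: keep appending words to the current line and start a new line only when the next word would not fit. The following is a minimal sketch of that idea, not a reference solution. The function names read_word and format_stream are hypothetical, as is the choice to grow the word buffer with realloc so that over-long words are never split.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read the next whitespace-delimited word from `in`
 * into a heap buffer that grows as needed. Returns NULL at end of input
 * (or if memory runs out); the caller must free the result. */
static char *read_word(FILE *in)
{
    int c;
    size_t len = 0, cap = 16;
    char *word;

    /* Skip any white space in front of the word. */
    while ((c = getc(in)) != EOF && isspace(c))
        ;
    if (c == EOF)
        return NULL;

    word = malloc(cap);
    if (word == NULL)
        return NULL;

    do {
        /* Keep room for this character plus the terminating '\0'. */
        if (len + 1 >= cap) {
            char *bigger = realloc(word, cap * 2);
            if (bigger == NULL) {
                free(word);
                return NULL;
            }
            word = bigger;
            cap *= 2;
        }
        word[len++] = (char)c;
    } while ((c = getc(in)) != EOF && !isspace(c));

    word[len] = '\0';
    return word;
}

/* Hypothetical driver: copy words from `in` to `out`, starting a new line
 * whenever the next word would push the current line past `width`. */
static void format_stream(FILE *in, FILE *out, size_t width)
{
    size_t column = 0;  /* characters already written on the current line */
    char *word;

    assert(width > 0);
    while ((word = read_word(in)) != NULL) {
        size_t len = strlen(word);

        /* Break only if the line is non-empty and the word does not fit;
         * an over-long word therefore lands on a line by itself. */
        if (column > 0 && column + 1 + len > width) {
            putc('\n', out);
            column = 0;
        }
        if (column > 0) {
            putc(' ', out);
            column++;
        }
        fputs(word, out);
        column += len;
        free(word);
    }
    if (column > 0)
        putc('\n', out);
}
```

A main function would only need to call format_stream(stdin, stdout, 30). Splitting the work into "get the next word" and "decide where it goes" is one possible modularization; other decompositions are just as defensible.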

Just for reference, here are two more examples. From

As you   can  see we  don't     preserve    any spacing   either.

we get

As you can see we don't
preserve any spacing either.

and from

sdjafh
adlkfjhsadfkj asdfkj asddkjfh asdkdhf kasdhf kjasdhf lasdf
asdfjkhasdfjkasd dfas dfalsd fkasjhdf jashdf ksadhf asfl
sdhfjkahsdf
jasdhfkj asfkj askfhsdk fjhaksljfh aksjdhf jasdhf ka.
sadkjfhasdkljfhasdkjfhaskldhfkjasdhfkljasdhfklajsdhf

we get

sdjafh adlkfjhsadfkj asdfkj
asddkjfh asdkdhf kasdhf
kjasdhf lasdf asdfjkhasdfjkasd
dfas dfalsd fkasjhdf jashdf
ksadhf asfl sdhfjkahsdf
jasdhfkj asfkj askfhsdk
fjhaksljfh aksjdhf jasdhf ka.
sadkjfhasdkljfhasdkjfhaskldhfkjasdhfkljasdhfklajsdhf

Make sure you modularize your program into sensible functions! Each function should be relatively small and should perform a single, cohesive part of the overall program. Enjoy! :-)

Deliverables

Please turn in a gzip-compressed tarball of your assignment; the filename should be cs120-assign-2-login.tar.gz with login replaced by your Unix login name on ugradx.cs.jhu.edu (so I would use cs120-assign-2-phf.tar.gz). The tarball should contain no derived files whatsoever (e.g. no object files or executables), but it must allow all derived files to be built. Include a README file that briefly explains what your programs do and contains any other notes you want us to read before grading.

Grading

For reference, here is a short explanation of the grading criteria. Packaging refers to the proper organization of the stuff you hand in, following the guidelines for Deliverables above. Style refers to C programming style, including things like consistent indentation, appropriate identifiers, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for. Performance refers to how fast your program can produce the required results compared to other submissions (it carries no weight in the breakdown for this assignment). Design refers to proper modularization and the proper choice of algorithms and data structures. Functionality refers to your programs being able to do what they should according to the specification given above; if the specification is ambiguous and you had to make a certain choice, defend that choice in your README file.

If your programs cannot be built you will get no points whatsoever. If your programs cannot be built without warnings using gcc -ansi -pedantic -Wall -Wextra -std=c99 -O we will take off 10% (except if you document a very good reason). If your programs cannot be built using make we will take off 10%. If your programs fail miserably even once, i.e. terminate with an exception of any kind or dump core, we will take off 10%. Finally, make sure to include your name and email address in every file you turn in!