Spring Semester 2008

January 28, 2008 – May 2, 2008

Assignment 3: On Ciphers and How to Break Some

Out on: February 11, 2008
Due by: February 18, 2008, 3:00 pm (before lecture)
Collaboration: None
Grading: Packaging 10%, Design 10%, Style 20%, Functionality 60%

Overview

The third assignment asks you to do a small code review (really small!), to extend a program from this week's lab, and to break Caesar's Cipher in a completely automated fashion. The latter requires some independent compilation magic, but you have plenty of that, right? Note that we increased the weight of style in the grading criteria above!

This is the first time we use this assignment, so there may be bugs we're not aware of yet. Please email the discussion list if you suspect a bug in the assignment! We just may be able to fix it... :-)

Problem 1: Code Review (20%)

You are going to review code that Rebecca wrote in lab last week. Yes, you get to criticize the stuff the staff does! Woohoo! Grab her version of grep.c and start reading it; read it twice, maybe even three times. Check her style for consistency and taste. Check her use (or misuse) of types. Try to think up input that will make her program go crazy. For example, is there a situation in which the program will access arrays outside their bounds? Maybe you want to run splint over her code as well, just in case there are some things gcc didn't catch? If you do get some warnings from splint, you may have to do some research to figure out what they mean; sounds like work, but you could learn a lot from that... :-)

Your goal for this problem is to hand in two things: First, in your README file, you should give an actual review of Rebecca's code; it's pretty much like a movie review or a book review or what not, just for code. Point out the good as well as the bad, don't be too one-sided! Second, you should hand in a new and improved version of Rebecca's code; don't rewrite it from scratch, make the minimum number of changes that will make an already good program into a true marvel of modern software engineering!

Note that compared to the reading assignment last week, you must understand everything about grep.c, not a single character in there should remain a mystery to you! Be sure you ask us for help if any mystery remains!

Problem 2: Vigenere Cipher (30%)

In lab this week, you developed a program that encrypts and decrypts messages using the infamous Caesar Cipher of the ancient world. To encrypt, the cipher replaces each letter in the plain text with the letter k places down in the alphabet; to decrypt, it replaces each letter in the cipher text with the letter k places back up. Obviously this has to "wrap around" at the ends of the alphabet to provide a complete mapping for all letters.
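
As a reminder of what that wrap-around looks like, here is a minimal sketch of the per-character shift. The name shift_letter is made up for illustration and is not taken from the lab's caesar.c; the sketch also assumes a character set in which the letters are contiguous (such as ASCII).

```c
#include <ctype.h>

/* Shift one character k places down the alphabet, wrapping around.
 * Non-letters pass through unchanged. A negative k shifts back up,
 * so the same function can decode. Hypothetical helper name. */
int shift_letter(int c, int k)
{
    k = ((k % 26) + 26) % 26;  /* normalize k into the range 0..25 */
    if (islower(c)) {
        return 'a' + (c - 'a' + k) % 26;
    }
    if (isupper(c)) {
        return 'A' + (c - 'A' + k) % 26;
    }
    return c;
}
```

With key 12, shift_letter('h', 12) gives 't', which matches the first letter of the Jabberwocky example in this problem.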

Your job for this assignment is to take the code from lab, caesar.c, and extend it to implement the much better Vigenere Cipher of the 19th century (well, really the 16th century, but what are 300 years among friends). The Vigenere Cipher uses n integers instead of one (which together are "the" key for the cipher) so the program has to be able to take any number of integer key arguments. These individual keys are applied in turn, one key per letter of the plain text, so the first n letters of the cipher text are each encoded with a (potentially) different key value. Once n letters have been encoded, we "wrap around" in the keys and encode letter n+1 with the first key again, n+2 with the second key, and so on. Here is a verse from Lewis Carroll's Jabberwocky for example:

he took his vorpal sword in hand:
  long time the manxome foe he sought --
so rested he by the tumtum tree,
  and stood awhile in thought.

If we encode it using the Caesar Cipher with key 12, we get the following:

tq faaw tue hadbmx eiadp uz tmzp:
  xazs fuyq ftq ymzjayq raq tq eagstf --
ea dqefqp tq nk ftq fgyfgy fdqq,
  mzp efaap mituxq uz ftagstf.

But if we use the Vigenere Cipher with keys 12, 7, and 23, we get the following instead:

tl qavh tpp hvobhi edldk fz oxzk:
  iaud fpjq aeq txzelyl cal eq zlgnef --
zl dlpfla tl yk aeq aryary aoql,
  xzk pfvlp httpiq pk folgnef.

As you can see, the 1st, 4th, 7th, 10th, ... letters (not counting spaces and punctuation, which do not use up a key) are the same as in the Caesar Cipher with key 12, but the other letters are different because the keys 7 and 23 were used for them. It also follows that the Vigenere Cipher with one key is exactly the same as the Caesar Cipher.
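
The key cycling can be sketched roughly as follows. This is only an illustration under stated assumptions: vigenere_encode is a hypothetical name, the sketch works on a string in memory rather than on an input stream, and it advances the key index on letters only, which is what the example above suggests.

```c
#include <ctype.h>
#include <stddef.h>

/* Encode buf in place using nkeys shift values (nkeys must be >= 1).
 * Only letters consume a key; everything else is left alone.
 * Hypothetical helper, not part of caesar.c. */
void vigenere_encode(char *buf, const int *keys, size_t nkeys)
{
    size_t next = 0;  /* index of the key to use for the next letter */

    for (; *buf != '\0'; buf++) {
        unsigned char c = (unsigned char)*buf;
        if (!isalpha(c)) {
            continue;
        }
        int base = islower(c) ? 'a' : 'A';
        int k = ((keys[next] % 26) + 26) % 26;  /* normalize to 0..25 */
        *buf = (char)(base + (c - base + k) % 26);
        next = (next + 1) % nkeys;              /* wrap around in the keys */
    }
}
```

Called with the keys 12, 7, and 23 on "he took", this produces "tl qavh", as in the verse above. Decoding can reuse the same loop with each key negated.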

The program you turn in for this problem should be called vigenere.c. It should be based on the caesar.c program from lab. The program must be able to encode and decode Vigenere Ciphers of any length.

Problem 3: Breaking Caesar (50%)

Say you have intercepted a secret message encrypted with the Caesar Cipher from lab. How can you find out what it is actually saying? Of course you could try all 26 keys and see which one works, but that's a pain if the message is long since you have to process it (potentially) 25 times.

There is actually a simple attack on the Caesar Cipher that allows you to break it "automatically" with (almost) certainty, at least if you know in what language the original plain text was written. The idea is to count the frequency of each letter in "normal" text in that language, and to compare how often each letter occurs in the encrypted message with those statistics. For example, in "average" English, the letter "e" occurs with approximately 10.3% likelihood, while the letter "q" occurs with approximately 0.09% likelihood. If in the encrypted message, the letter "k" occurs with about 10% likelihood, it's very probable that "k" stands for "e" in the original message, and that therefore the key that was used for encryption was 6.
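
The arithmetic behind that last step is just a distance in the alphabet, taken modulo 26. A tiny sketch, where implied_key is a made-up name and both arguments are assumed to be lowercase letters:

```c
/* Key implied by the guess that cipher letter `cipher` stands for
 * plain letter `plain`. Hypothetical helper for illustration. */
int implied_key(char plain, char cipher)
{
    return ((cipher - plain) % 26 + 26) % 26;
}
```

For the guess above, implied_key('e', 'k') is 6, since "k" is six places down the alphabet from "e".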

Of course you don't do this comparison for just one letter, but for all 26 letters. In other words, you first build a table of how often each letter occurs in some (large) reference text of "normal" English. Then you build another table that counts how often each letter occurs in the encrypted message. Then, for each possible key, you compute how similar the two distributions are, for example using the chi-square test from statistics. The key for which you get the smallest value out of the chi-square test is, with high probability, the key you're looking for. Note that you only need the distributions, not the actual cipher text anymore, so for a large encrypted message, you can still be pretty fast in finding the key.
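
To make the comparison step concrete, here is a rough sketch of a chi-square comparison over all 26 shifts. Everything about it is an assumption for illustration: the names chi_square_shifted and best_shift, the array types, and the choice to pass the reference as proportions are all made up, and the interfaces you actually have to implement are the ones given in histo.h and stats.h. The sketch also simply skips letters whose expected count is zero, which a more careful version would treat differently.

```c
#define LETTERS 26

/* Chi-square statistic comparing cipher text counts with reference
 * proportions, assuming the cipher text was shifted by `shift`.
 * observed[j] is how often letter j occurs in the cipher text;
 * expected[i] is the proportion of letter i in the reference text.
 * Hypothetical signature, not the one from stats.h. */
double chi_square_shifted(const unsigned long observed[LETTERS],
                          const double expected[LETTERS], int shift)
{
    unsigned long total = 0;
    double chi = 0.0;

    for (int i = 0; i < LETTERS; i++) {
        total += observed[i];
    }
    for (int i = 0; i < LETTERS; i++) {
        /* plain letter i would show up as cipher letter i + shift */
        double o = (double)observed[(i + shift) % LETTERS];
        double e = expected[i] * (double)total;
        if (e > 0.0) {  /* simplification: ignore letters never expected */
            chi += (o - e) * (o - e) / e;
        }
    }
    return chi;
}

/* Try all 26 shifts and return the one with the smallest statistic. */
int best_shift(const unsigned long observed[LETTERS],
               const double expected[LETTERS])
{
    int best = 0;
    double best_chi = chi_square_shifted(observed, expected, 0);

    for (int k = 1; k < LETTERS; k++) {
        double chi = chi_square_shifted(observed, expected, k);
        if (chi < best_chi) {
            best_chi = chi;
            best = k;
        }
    }
    return best;
}
```

Note that both functions only ever touch the two 26-entry tables, never the cipher text itself, which is why the search stays fast even for a very long message.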

Your program for this problem should consist of at least three modules: the main program, a module to build histograms of letter frequencies from an input stream, and a module that can compare two distributions for similarity using the chi-square test. Your main program should be in main.c, your histogram module should be in histo.c, and your similarity module should be in stats.c. Both histo.c and stats.c export only one function each, but they could internally use more. (Make sure internal functions are defined static so they stay hidden!) The interfaces for these two modules are histo.h and stats.h. You are not allowed to change the interfaces, you must implement them as specified! Your main program must use them as specified as well!

The name of the compiled program should be crack and it will be used as follows: The sole command line argument is the name of the reference text file used to learn the letter distributions in the language that the cipher text is presumably written in. You produce a histogram from that file, your reference histogram for the language. The cipher text will be fed into standard input, and you compute a histogram for it as well. Then, for all 26 possible shifts in the Caesar Cipher, you compute the similarity of the (shifted) cipher text histogram to the reference histogram. The lowest chi-square value (that is, the closest match) will occur for the shift that corresponds to the key used to encrypt the message. The only output your program makes is the value of that shift as an integer.

To help you with testing, here are four encrypted files (all in English) that you can decrypt using your crack program (or can you? :-). You can also use the Jabberwocky verse from above of course, or any other example you construct yourself using the caesar.c program.

Deliverables

Please turn in a gzip compressed tarball of your assignment; the filename should be cs120-assign-3-login.tar.gz with login replaced by your Unix login name on ugradx.cs.jhu.edu (so I would use cs120-assign-3-phf.tar.gz). The tarball should contain no derived files whatsoever (i.e. no executable files), but allow building all derived files. Include a README file that briefly explains what your programs do and contains any other notes you want us to check out before grading.

Grading

For reference, here is a short explanation of the grading criteria. Packaging refers to the proper organization of the stuff you hand in, following the guidelines for Deliverables above. Style refers to C programming style, including things like consistent indentation, appropriate identifiers, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for. Performance refers to how fast your program can produce the required results compared to other submissions. Design refers to proper modularization and the proper choice of algorithms and data structures. Functionality refers to your programs being able to do what they should according to the specification given above; if the specification is ambiguous and you had to make a certain choice, defend that choice in your README file.

If your programs cannot be built you will get no points whatsoever. If your programs cannot be built without warnings using gcc -ansi -pedantic -Wall -Wextra -std=c99 -O we will take off 10% (except if you document a very good reason). If your programs cannot be built using make we will take off 10%. If your programs fail miserably even once, i.e. terminate with an exception of any kind or dump core, we will take off 10%. Finally, make sure to include your name and email address in every file you turn in!