Assignment 3: On Ciphers and How to Break Some
Out on: February 11, 2007
Due by: February 18, 2007, 3:00 pm (before lecture)
Collaboration: None
Grading: Packaging 10%, Design 10%, Style 20%, Functionality 60%
Overview
The third assignment asks you to do a small code review (really small!), to extend a program from this week's lab, and to break Caesar's Cipher in a completely automated fashion. The last of these requires some independent compilation magic, but you have plenty of that, right? Note that we increased the weight of style in the grading criteria above!
This is the first time we use this assignment, so there may be bugs we're not aware of yet. Please email the discussion list if you suspect a bug in the assignment! We just may be able to fix it... :-)
Problem 1: Code Review (20%)
You are going to review code that Rebecca wrote
in lab last week. Yes, you get to criticize the stuff the staff does!
Woohoo!
Grab her version of grep.c and start reading it; read it twice, maybe even three times.
Check her style for consistency and taste.
Check her use (or misuse) of types.
Try to think up input that will make her program go crazy.
For example, is there a situation in which the program will
access arrays outside their bounds?
Maybe you want to run splint over her code as well,
just in case there are some things gcc didn't
catch?
If you do get some warnings from splint, you may
have to do some research to figure out what they mean; sounds
like work, but you could learn a lot from that... :-)
Your goal for this problem is to hand in two things:
First, in your README file, you should give an
actual review of Rebecca's code; it's pretty
much like a movie review or a book review or what not, just for
code.
Point out the good as well as the bad, don't
be too one-sided!
Second, you should hand in a new and improved
version of Rebecca's code; don't rewrite it from scratch,
make the minimum number of changes that will
make an already good program into a true marvel of modern
software engineering!
Note that compared to the reading assignment last week, you
must understand everything
about grep.c, not a single character in there
should remain a mystery to you! Be sure you ask us for help
if any mystery remains!
Problem 2: Vigenere Cipher (30%)
In lab this week, you developed a program that encrypts and decrypts messages using the infamous Caesar Cipher of the ancient world. The cipher encrypts by replacing each letter in the plain text with the letter k places down in the alphabet; decryption reverses this by replacing each letter in the cipher text with the letter k places back up. Obviously the shift has to "wrap around" at the end of the alphabet to provide a complete mapping for all letters.
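To make the "wrap around" concrete, here is a minimal sketch of the shift for a single character. The function name caesar_shift is made up for this illustration and is not necessarily what your caesar.c from lab looks like; leaving non-letters unchanged and keeping the case of letters are assumptions as well.

```c
#include <ctype.h>

/* Shift one character k places down the alphabet, wrapping around at 'z'.
 * Hypothetical helper for illustration only. Assumption: characters that
 * are not letters pass through unchanged, and letter case is preserved. */
int caesar_shift(int c, int k)
{
    /* Normalize k into 0..25 so that a negative k shifts back up. */
    k = ((k % 26) + 26) % 26;

    if (islower(c)) {
        return 'a' + (c - 'a' + k) % 26;
    }
    if (isupper(c)) {
        return 'A' + (c - 'A' + k) % 26;
    }
    return c;
}
```

With this sketch, caesar_shift('x', 3) wraps around to 'a', and calling it with -k undoes a shift by k.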
Your job for this assignment is to take the code from lab, caesar.c, and
extend it to implement the much better Vigenere Cipher of the 19th century
(well, really the 16th century, but what are 300 years among friends).
The Vigenere Cipher uses n integers instead of one (which together
are "the" key for the cipher) so the program has to be able
to take any number of integer key arguments.
These individual keys are applied one after the other to each
letter in the plain text, so the first n letters of the cipher
text are each encoded with a (potentially) different key value.
Once n characters have been encoded, we "wrap around" in the
keys and encode character n+1 with the first key again, n+2
with the second key, and so on. Here is a verse from Lewis
Carroll's Jabberwocky, for example:
he took his vorpal sword in hand: long time the manxome foe he sought -- so rested he by the tumtum tree, and stood awhile in thought.
If we encode it using the Caesar Cipher with key 12, we get the following:
tq faaw tue hadbmx eiadp uz tmzp: xazs fuyq ftq ymzjayq raq tq eagstf -- ea dqefqp tq nk ftq fgyfgy fdqq, mzp efaap mituxq uz ftagstf.
But if we use the Vigenere Cipher with keys 12, 7, and 23, we get the following instead:
tl qavh tpp hvobhi edldk fz oxzk: iaud fpjq aeq txzelyl cal eq zlgnef -- zl dlpfla tl yk aeq aryary aoql, xzk pfvlp httpiq pk folgnef.
As you can see, the 1st, 4th, 7th, 10th, ... letters are the same as in the Caesar Cipher with key 12, but the other letters are different because the keys 7 and 23 were used for them. Note that in this example spaces and punctuation pass through unchanged and do not use up a key. As a special case, the Vigenere Cipher with one key is exactly the same as the Caesar Cipher.
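The key rotation described above can be sketched roughly as follows. This is only an illustration, not the required design: the name vigenere_encode and its in-place, string-based signature are made up here (your program will presumably work on streams like caesar.c does), and the sketch assumes at least one key and key values between 0 and 25.

```c
#include <ctype.h>
#include <stddef.h>

/* Encode the NUL-terminated string text in place with nkeys shift values.
 * Hypothetical helper for illustration; assumes nkeys > 0 and that every
 * key lies in 0..25. */
void vigenere_encode(char *text, const int *keys, size_t nkeys)
{
    size_t next = 0; /* index of the key to use for the next letter */

    for (char *p = text; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;

        if (islower(c)) {
            *p = (char)('a' + (c - 'a' + keys[next]) % 26);
        } else if (isupper(c)) {
            *p = (char)('A' + (c - 'A' + keys[next]) % 26);
        } else {
            continue; /* non-letters are copied as-is and use up no key */
        }
        next = (next + 1) % nkeys; /* wrap around after the last key */
    }
}
```

Running this sketch on "he took his vorpal sword in hand" with keys 12, 7, and 23 gives "tl qavh tpp hvobhi edldk fz oxzk", which matches the start of the example above. Decoding is the same walk over the text with each shift reversed.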
The program you turn in for this problem should be called
vigenere.c. It should be based on the
caesar.c program from lab. The program must
be able to encode and decode Vigenere Ciphers of any
length.
Problem 3: Breaking Caesar (50%)
Say you have intercepted a secret message encrypted with the Caesar Cipher from lab. How can you find out what it is actually saying? Of course you could try all 26 keys and see which one works, but that's a pain if the message is long since you have to process it (potentially) 25 times.
There is actually a simple attack on the Caesar Cipher that allows you to break it "automatically" with (almost) certainty, at least if you know in what language the original plain text was written. The idea is to count the frequency of each letter in "normal" text in that language, and to compare how often each letter occurs in the encrypted message with those statistics. For example, in "average" English, the letter "e" occurs with approximately 10.3% likelihood, while the letter "q" occurs with approximately 0.09% likelihood. If in the encrypted message, the letter "k" occurs with about 10% likelihood, it's very probable that "k" stands for "e" in the original message, and that therefore the key that was used for encryption was 6.
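The arithmetic behind "the key was 6" is just the distance from "e" to "k" in the alphabet, taken modulo 26. A tiny sketch, with the made-up name key_from_peak and the assumption that both arguments are lowercase letters:

```c
/* Guess the Caesar key from one pair of letters: the most frequent letter
 * in the cipher text and the letter it presumably stands for.
 * Hypothetical helper; assumes both arguments are in 'a'..'z'. */
int key_from_peak(char cipher_letter, char plain_letter)
{
    /* Adding 26 before the modulo keeps the result in 0..25. */
    return (cipher_letter - plain_letter + 26) % 26;
}
```

For the example above, key_from_peak('k', 'e') is 6. A single letter is a shaky basis for a guess, though, which is why the real attack looks at all 26 letters.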
Of course you don't do this comparison for just one letter, but for all 26 letters. In other words, you first build a table of how often each letter occurs in some (large) reference text of "normal" English. Then you build another table that counts how often each letter occurs in the encrypted message. Then, for each possible key, you compute how similar the two distributions are, for example using the chi-square test from statistics. The key for which you get the smallest value out of the chi-square test is, with high probability, the key you're looking for. Note that you only need the distributions, not the actual cipher text anymore, so for a large encrypted message, you can still be pretty fast in finding the key.
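To give a feel for the comparison, here is one possible sketch of it. Everything in it is an assumption made for illustration: the names chi_square_for_key and best_key, the use of plain count arrays of 26 entries, and the choice to skip letters whose expected count is zero. In particular, this is not the interface from histo.h or stats.h, which you must follow for the actual assignment.

```c
#define LETTERS 26

/* Chi-square value for one candidate key: compare the cipher counts,
 * shifted back by key, against counts expected from the reference text.
 * Hypothetical helper; both arrays hold one count per letter 'a'..'z'. */
double chi_square_for_key(const unsigned long cipher[LETTERS],
                          const unsigned long reference[LETTERS],
                          int key)
{
    unsigned long cipher_total = 0;
    unsigned long ref_total = 0;
    double chi = 0.0;

    for (int i = 0; i < LETTERS; i++) {
        cipher_total += cipher[i];
        ref_total += reference[i];
    }
    if (cipher_total == 0 || ref_total == 0) {
        return 0.0; /* nothing to compare */
    }

    for (int i = 0; i < LETTERS; i++) {
        /* Under this key, plain letter i shows up as letter (i + key) % 26. */
        double observed = (double)cipher[(i + key) % LETTERS];
        double expected = (double)reference[i] * (double)cipher_total
                          / (double)ref_total;

        if (expected > 0.0) { /* assumption: skip letters never expected */
            double diff = observed - expected;
            chi += diff * diff / expected;
        }
    }
    return chi;
}

/* Try all 26 keys and return the one with the smallest chi-square value. */
int best_key(const unsigned long cipher[LETTERS],
             const unsigned long reference[LETTERS])
{
    int best = 0;
    double best_chi = chi_square_for_key(cipher, reference, 0);

    for (int key = 1; key < LETTERS; key++) {
        double chi = chi_square_for_key(cipher, reference, key);
        if (chi < best_chi) {
            best_chi = chi;
            best = key;
        }
    }
    return best;
}
```

Note how the loop over keys only touches the two 26-entry tables, never the cipher text itself, so its cost does not depend on the length of the message.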
Your program for this problem should consist of at least three
modules: the main program, a module to build histograms of letter
frequencies from an input stream, and a module that can compare
two distributions for similarity using the chi-square test.
Your main program should be in main.c, your histogram
module should be in histo.c, and your similarity
module should be in stats.c. Both
histo.c and stats.c only export one
function, but they could internally use more. (Make sure internal
functions are defined static to be hidden!)
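In case the static business is unclear, here is a small sketch of a module with one exported function and one hidden helper. The names count_letters and letter_index are invented for this example and are not the function declared in histo.h; treating upper and lower case letters alike is an assumption too.

```c
#include <ctype.h>
#include <stdio.h>

/* Internal helper: static gives it internal linkage, so no other
 * module can call it or clash with its name.
 * Returns 0..25 for a letter (ignoring case) and -1 for anything else. */
static int letter_index(int c)
{
    if (isalpha(c)) {
        return tolower(c) - 'a';
    }
    return -1;
}

/* The one exported function of this hypothetical module: count how
 * often each letter occurs on the stream in, up to end of file. */
void count_letters(FILE *in, unsigned long counts[26])
{
    int c;

    for (int i = 0; i < 26; i++) {
        counts[i] = 0;
    }
    while ((c = getc(in)) != EOF) {
        int idx = letter_index(c);
        if (idx >= 0) {
            counts[idx]++;
        }
    }
}
```

Only count_letters would get a prototype in the module's header file; letter_index stays invisible outside the .c file that defines it.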
The interfaces for these modules are histo.h and stats.h.
You are not allowed to change the interfaces,
you must implement them as specified! Your main
program must use them as specified as well!
The name of the compiled program should be crack and
it will be used as follows:
The sole command line argument is the name of the reference text
file used to learn the letter distributions in the language that
the cipher text is presumably written in. You produce a histogram
from that file, your reference histogram for the language.
The cipher text will be fed into standard input, and you compute
a histogram for it as well.
Then, for all 26 possible shifts in the Caesar Cipher, you compute
the similarity of the (shifted) cipher text histogram to the
reference histogram.
The lowest chi-square value (that is, the closest match) will occur
for the shift that corresponds to the key used to encrypt the message.
The only output your program makes is the value of that shift as an integer.
To help you with testing, here are four encrypted files (all in
English) that you can decrypt using your crack
program (or can you? :-).
You can also use the Jabberwocky verse from above
of course, or any other example you construct yourself using
the caesar.c program.
Deliverables
Please turn in a gzip compressed tarball of your assignment;
the filename should be cs120-assign-3-login.tar.gz
with login replaced by your Unix login name on ugradx.cs.jhu.edu
(so I would use cs120-assign-3-phf.tar.gz).
The tarball should contain no derived files whatsoever
(i.e. no executable files),
but allow building all derived files.
Include a README file that briefly explains what your
programs do and contains any other notes you want us to check out
before grading.
Grading
For reference, here is a short explanation of the grading criteria.
Packaging refers to the proper organization of the
stuff you hand in, following the guidelines for Deliverables above.
Style refers to C programming style, including
things like consistent indentation, appropriate identifiers,
useful comments, suitable documentation, etc.
Simple, clean, readable code is what you should be aiming for.
Performance refers to how fast your program can
produce the required results compared to other submissions.
Design refers to proper modularization and the
proper choice of algorithms and data structures.
Functionality refers to your programs being
able to do what they should according to the specification
given above; if the specification is ambiguous and you had
to make a certain choice, defend that choice in your
README file.
If your programs cannot be built you will get no points whatsoever.
If your programs cannot be built without warnings using
gcc -ansi -pedantic -Wall -Wextra -std=c99 -O
we will take off 10% (except if you document a very good reason).
If your programs cannot be built using make we will
take off 10%.
If your programs fail miserably even once,
i.e. terminate with an exception of any kind or dump core,
we will take off 10%.
Finally, make sure to include your name and email address in
every file you turn in!