600.226 Data Structures: Class Challenge (Spring 2000)


Back to syllabus.

New on this page:


Let me repeat: this Class Challenge is NOT part of your assignments. You really do NOT have to participate.

The Challenge

The Task

Write a program that minimizes a given FSA (Finite State Automaton) with labeled edges. Write a companion program which takes an FSA (minimized or not) as input and dumps all the paths found in it (assuming the FSA is acyclic) to the standard output. The input and output are SGML-coded (for the format, see below). Obviously, you are supposed to use Java.
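To make the companion task concrete, here is a minimal sketch of dumping all paths of an acyclic FSA by depth-first search. It works on an in-memory graph and does not parse the SGML format at all. The class name PathDumper, the integer state numbering, and the reading of "path" as a route from the initial state to a final state are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: enumerate every accepted label sequence of an acyclic FSA by DFS. */
class PathDumper {
    // edges.get(q) holds the outgoing edges of state q, sorted by label.
    // This layout is an assumption, not the format used by the course data.
    final List<TreeMap<String, Integer>> edges = new ArrayList<>();
    final List<Boolean> isFinal = new ArrayList<>();

    /** Adds a state and returns its number. */
    int addState(boolean fin) {
        edges.add(new TreeMap<>());
        isFinal.add(fin);
        return edges.size() - 1;
    }

    void addEdge(int from, String label, int to) {
        edges.get(from).put(label, to);
    }

    /** Collects the label sequences of all paths from start to a final state. */
    List<String> paths(int start) {
        List<String> out = new ArrayList<>();
        walk(start, new StringBuilder(), out);
        return out;
    }

    private void walk(int q, StringBuilder prefix, List<String> out) {
        if (isFinal.get(q)) {
            out.add(prefix.toString());
        }
        for (Map.Entry<String, Integer> e : edges.get(q).entrySet()) {
            int mark = prefix.length();
            prefix.append(e.getKey());
            walk(e.getValue(), prefix, out); // terminates only because the FSA is acyclic
            prefix.setLength(mark);          // backtrack to the shared prefix
        }
    }
}
```

For really large inputs one would probably print each path as soon as it is found instead of collecting the paths in a list, and replace the recursion with an explicit stack if paths can get long.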

You think it's too easy? Think again; here is the catch: it should work for really huge data, and you will be evaluated on an Evaluation Data FSA (see below), which is quite large. As an additional constraint, your memory usage must never exceed 250MB, and minimizing the Evaluation Data FSA must take less than 1 hour of CPU time (user+system, on a 300MHz, single CPU or slower).

Since the data is an acyclic FSA (as you have undoubtedly found by now...), you have a wider choice of algorithms. It's up to you to choose the right one for the type of data you will be evaluated upon.

The Details

Additional requirements

The following requirements must be met, since your code will be evaluated automatically:

Deadline

The deadline is Monday, April 3, 2000, at 11:00am.

Submission method

You are entitled to submit only one solution. In case you send more, only the one with the latest arrival time before the deadline will be considered.

Pack all of your code into a single file, and send it by e-mail to hajic@comma.cs.jhu.edu.

The packed file must have the following format:

You might proceed like this when packing the code:

cd <your directory where you develop the code>
mkdir 123-45-6789
cp -p * 123-45-6789
tar -czvf 123-45-6789.tgz 123-45-6789/*

assuming your SSN is 123-45-6789. Mail the resulting file (123-45-6789.tgz) as an attachment to hajic@comma.cs.jhu.edu. By submitting the code you also state and agree that your code has been written by you and only you, and that it can be made publicly available.

Data & Sources You Need

Minimization of an FSA

A deterministic FSA (Finite State Automaton) is a device defined as a 5-tuple (T, Q, S, F, d), where

An FSA accepts a given input string I = [i1,...,in], if the successive application of the transition function on all the symbols from I, starting in the initial state S, leads to (some) final state qn:
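The acceptance condition can be sketched in a few lines of Java. This is only an illustration under stated assumptions: states are numbered by ints, symbols are single chars, and the names TinyFsa, addEdge, accepts and example are made up for this sketch. A missing transition is treated as rejection.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a deterministic FSA; int states and char labels are assumptions. */
class TinyFsa {
    final int start;            // the initial state S
    final boolean[] isFinal;    // isFinal[q] is true iff q is in F
    // delta.get(q) maps a symbol to the next state; no entry means no transition.
    final Map<Integer, Map<Character, Integer>> delta = new HashMap<>();

    TinyFsa(int numStates, int start) {
        this.start = start;
        this.isFinal = new boolean[numStates];
    }

    void addEdge(int from, char label, int to) {
        delta.computeIfAbsent(from, k -> new HashMap<>()).put(label, to);
    }

    /** Applies d symbol by symbol from S; accepts iff the state reached is final. */
    boolean accepts(String input) {
        int q = start;
        for (char c : input.toCharArray()) {
            Map<Character, Integer> out = delta.get(q);
            if (out == null || !out.containsKey(c)) {
                return false;   // d is undefined here, so the string is rejected
            }
            q = out.get(c);
        }
        return isFinal[q];
    }

    /** A three-state automaton accepting exactly A, AE, C and CE (hypothetical numbering). */
    static TinyFsa example() {
        TinyFsa f = new TinyFsa(3, 0);
        f.addEdge(0, 'A', 1);
        f.addEdge(0, 'C', 1);
        f.addEdge(1, 'E', 2);
        f.isFinal[1] = true;
        f.isFinal[2] = true;
        return f;
    }
}
```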

Certainly it is imaginable that two FSAs accept exactly the same set of input strings, and that one of them has a smaller number of states. The FSA which is the smallest of all those accepting the same set of strings as some original FSA is called the minimized FSA of the original FSA. Here is an example:

This is a deterministic FSA:

And this is its minimized version:

since both FSAs accept exactly these four strings: A, AE, CE and C.
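For acyclic automata, one well-known way to find mergeable states is to work bottom-up: two states are equivalent if they agree on finality and have the same labeled edges leading to equivalent states. The sketch below implements that idea with a hash "register" of state signatures. It is one possible technique, not necessarily the intended or the fastest one, and everything in it (the class name AcyclicMinimizer, int states, char labels, the signature encoding) is an assumption for illustration. It ignores the SGML format and the memory limit entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Sketch: bottom-up merging of equivalent states in an acyclic deterministic FSA. */
class AcyclicMinimizer {
    final List<TreeMap<Character, Integer>> edges = new ArrayList<>();
    final List<Boolean> isFinal = new ArrayList<>();

    int addState(boolean fin) {
        edges.add(new TreeMap<>());
        isFinal.add(fin);
        return edges.size() - 1;
    }

    void addEdge(int from, char label, int to) {
        edges.get(from).put(label, to);
    }

    /** canon[q] is the representative of q's class, or -1 if q is unreachable from start. */
    int[] canonicalStates(int start) {
        int[] canon = new int[edges.size()];
        Arrays.fill(canon, -1);
        visit(start, canon, new HashMap<String, Integer>());
        return canon;
    }

    private void visit(int q, int[] canon, Map<String, Integer> register) {
        if (canon[q] != -1) {
            return; // already classified
        }
        // Signature: finality, then (label, representative of target) in label order.
        // Each entry is '|', exactly one label char, '>', then digits, so it parses uniquely.
        StringBuilder sig = new StringBuilder(isFinal.get(q) ? "F" : "N");
        for (Map.Entry<Character, Integer> e : edges.get(q).entrySet()) {
            int target = e.getValue();
            visit(target, canon, register); // children first; safe because the FSA is acyclic
            sig.append('|').append(e.getKey().charValue()).append('>').append(canon[target]);
        }
        Integer seen = register.putIfAbsent(sig.toString(), q);
        canon[q] = (seen == null) ? q : seen;
    }

    /** Number of states left after merging all equivalent reachable states. */
    int minimizedSize(int start) {
        Set<Integer> reps = new HashSet<>();
        for (int c : canonicalStates(start)) {
            if (c != -1) {
                reps.add(c);
            }
        }
        return reps.size();
    }
}
```

On a hypothetical five-state automaton for A, AE, C and CE (one branch per first letter), this merges the two branches and leaves three states. String signatures are convenient but wasteful; under a tight memory budget a more compact encoding would likely be needed.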

Test Data

These are available at /usr/local/data/cs226, on machines like hops, barley, etc.

Evaluation Data

This is really big data on which your code will be evaluated. It is stored in the same directory as the test data: /usr/local/data/cs226/big.fsa. Given the setup of the undergrad network, you will not be able to copy this data anywhere; you can only read it. Also, be aware that the undergrad machines barely have the 250MB of memory (64MB real + 256MB swap) you will probably need, unless you can come up with a much leaner method. You will also be unable to store the results; thus, the only reason you have access to this file is to get a feel for the size and time needed to process the Real Thing. So PLEASE do not try to use the evaluation data unless you have a reasonable chance of succeeding, and your code is working reasonably fast and with low memory requirements on the small.fsa data.

The Rewards

For all who fulfill the requirements (i.e., working code, technical and other requirements met, max. 250MB of memory, under 1 hour, etc.):

For the one who fulfills the requirements, plus gets the fastest algorithm of all those fulfilling the requirements:

In case of a close tie (under 1 second user + system CPU time difference as measured by the UNIX time command), both (or more) participants are entitled to the higher reward.

In case nobody can fulfill all the requirements, and the only requirement which is broken is the 1 hour CPU time limit, the three fastest solutions (as measured by the CPU time) will be entitled to the lesser reward anyway.

Needless to say, no "cooperation" is allowed. See the homeworks page for the paragraph about plagiarism. Obviously, no submission which is too similar to another one will be accepted (and this applies to both of them!).

The Consequences

None, even if you withdraw after you start, or even if you submit something totally stupid :-).
