600.226 Data Structures: Class Challenge (Spring 2000)
Back to syllabus.
New on this page:
- 03/13/00 Class Challenge: based on
the feedback I got at the last meeting (Thu 3/9), I am extending the deadline
for submitting the Challenge project to Monday, 4/3, 11am.
- 02/27/00 Two algorithms for FSA minimization have been
published.
- 02/25/00 To make your life
easier, and to let you concentrate on the algorithmic
issues rather than the I/O technicalities, the states in the input
file (tags <f>, <t> as well as <S> and <F>) are now
numbered continuously, starting with 1, which leaves
zero free for whatever "null" you may need. The requirement that you
handle general strings as state names no longer applies. [I realized
you would probably not be able to fit within the memory constraint if you
had to keep all the strings which represent the node names.] All
examples have been updated to reflect this change. The files
tiny.*,
small.fsa,
and big.fsa
have been uploaded with numbered nodes only (at the same time making
the big file a little smaller). A hint related to this issue (to
make your life even easier :-)): if you need to do so, you are allowed
to use the number of states, number of final states, and/or the number
of arcs as constants in your code (you can find those numbers by using
the standard Unix utilities on the big.fsa
file).
- 02/24/00 List of accepted strings in Class Challenge
FSA example corrected (thanks to Eric Yang). The extra <N> tags
have been deleted, too (in the SGML
format example).
Let me repeat: this Class Challenge is NOT part of your
assignments. You really do NOT have to participate.
The Challenge
The Task
Write a program that minimizes a given FSA (Finite State Automaton)
with labeled edges. Write a companion program which takes an FSA
(minimized or not) as an input and dumps all the paths found in the
input (assuming the FSA is acyclic) to the standard output. The input
and output are SGML-coded (for format, see below). Obviously, you are
supposed to use Java.
You think it's too easy? Think again; here is the catch: it should
work for really huge data, and you will be evaluated on an Evaluation
Data FSA (see below), which is quite large. As an additional
constraint, your memory usage must not exceed 250MB at any point,
and it should take less than 1 hour of CPU time
(user+system, on a 300MHz, single CPU or slower) to minimize the
Evaluation Data FSA.
Since the data is an acyclic FSA (as you have undoubtedly found by
now...), you have a wider choice of algorithms. It's up to you to choose the right one
for the type of data you will be evaluated upon.
The Details
Additional requirements
The following requirements must be met, since your code will be
evaluated automatically:
-
Name your main class for the minimization code
"MinimizeFSA", and the main class for
printing all paths "PrintAllPathsFSA".
-
Read your data from standard input, and write your output to the
standard output.
-
Comment your code.
- Data format (both input and output) for MinimizeFSA:
The input and output formats are exactly the same.
- The I/O format is a simple SGML format using 6 different tags:
- <S> Start state of the FSA. End tag required.
- <F> Final state of the FSA. End tag required.
- <e> Denotes an edge. End tag
required. Within <e>, each of the following three tags
must be present exactly once, in any order.
- <f> Edge "from" node. Positive integer,
continuous (dense) numbering. End tag should not be used.
- <i> Edge "value" node. Any string of
letters, numbers and symbols allowed, except for <. (Don't be
misled by the test data, which only contain a single character here -
*any* string should be handled.) End tag should not be used.
- <t> Edge "to" node. Positive integer,
continuous (dense) numbering. End tag should not be used.
-
On the output, the sequence of the tags within an edge tag
(<e>) should be <f>, <i>,
<t>.
-
There should be exactly one of <S>, <F>,
or <e> (with contents) on every given line.
- No spaces are allowed anywhere except within the
<i> tag, where they represent themselves (i.e., are not
to be stripped).
- The lines are not in any particular order; therefore, you also do
not need to output them in any particular order.
- You may use any names for output nodes. E.g., they might be
numbered, and the numbering does not even have to be continuous (dense).
- Example of the I/O format:
<S>1</S>
<F>10</F>
<e><f>1<i>j<t>11</e>
<e><f>1<i>sk<t>2</e>
<e><f>11<i>a<t>12</e>
<e><f>12<i>r<t>13</e>
<e><f>13<i>oo<t>10</e>
etc.
- Data format for PrintAllPathsFSA:
- Input: same as above for MinimizeFSA.
- Output: One path per line, with no separators between edge
values.
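To make the format rules above concrete, here is a minimal sketch of a line parser. The class name FsaLine and its fields are hypothetical (they are not part of the required interface), and the code assumes well-formed input with exactly one item per line. On input it accepts the three edge tags in any order; on output it always writes <f>, <i>, <t>.

```java
/** Hypothetical helper: parses one line of the SGML-like FSA format. */
final class FsaLine {
    enum Kind { START, FINAL, EDGE }

    final Kind kind;
    final int from;      // the state number for START/FINAL, the source state for EDGE
    final String label;  // the edge value (<i>), null for START/FINAL
    final int to;        // the target state for EDGE, 0 ("null") otherwise

    private FsaLine(Kind kind, int from, String label, int to) {
        this.kind = kind;
        this.from = from;
        this.label = label;
        this.to = to;
    }

    static FsaLine parse(String line) {
        if (line.startsWith("<S>") && line.endsWith("</S>")) {
            int state = Integer.parseInt(line.substring(3, line.length() - 4));
            return new FsaLine(Kind.START, state, null, 0);
        }
        if (line.startsWith("<F>") && line.endsWith("</F>")) {
            int state = Integer.parseInt(line.substring(3, line.length() - 4));
            return new FsaLine(Kind.FINAL, state, null, 0);
        }
        if (line.startsWith("<e>") && line.endsWith("</e>")) {
            String body = line.substring(3, line.length() - 4);
            int from = 0;
            int to = 0;
            String label = null;
            int pos = 0;
            // Each piece looks like <x>value; a value never contains '<',
            // so the next '<' always starts the next tag.
            while (pos < body.length()) {
                char tag = body.charAt(pos + 1);
                int next = body.indexOf('<', pos + 3);
                if (next < 0) {
                    next = body.length();
                }
                String value = body.substring(pos + 3, next);
                switch (tag) {
                    case 'f': from = Integer.parseInt(value); break;
                    case 'i': label = value; break;  // spaces are kept as they are
                    case 't': to = Integer.parseInt(value); break;
                    default: throw new IllegalArgumentException("Unknown tag in: " + line);
                }
                pos = next;
            }
            return new FsaLine(Kind.EDGE, from, label, to);
        }
        throw new IllegalArgumentException("Unrecognized line: " + line);
    }

    /** Writes the line back, with the edge tags in the required output order. */
    String format() {
        switch (kind) {
            case START: return "<S>" + from + "</S>";
            case FINAL: return "<F>" + from + "</F>";
            default:    return "<e><f>" + from + "<i>" + label + "<t>" + to + "</e>";
        }
    }

    public static void main(String[] args) {
        // Tags in a different order on input, normalized on output.
        System.out.println(parse("<e><t>10<i>oo<f>13</e>").format());
        // prints <e><f>13<i>oo<t>10</e>
    }
}
```

This sketch creates one object per line, which is convenient for debugging on tiny.fsa but is probably too wasteful for the big file; there you may prefer to parse straight into arrays.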
Deadline
The deadline is Monday, April 3, 2000, at 11:00am.
Submission method
You are entitled to submit only one solution. In case you send more,
only the one with the latest arrival time before the deadline will be
considered.
Pack all of your code into a single file, and send it by e-mail to hajic@comma.cs.jhu.edu.
The packed file must have the following format:
-
it must be a gzipped tar file with the suffix "tgz", named after your SSN:
e.g., 123-45-6789.tgz
-
When unpacked, everything must go to a relative subdirectory which is
also named after your SSN.
You might proceed like this when packing the code:
cd <your directory where you develop the code>
mkdir 123-45-6789
cp -p * 123-45-6789
tar -czvf 123-45-6789.tgz 123-45-6789/*
assuming your SSN is 123-45-6789. Mail the resulting file
(123-45-6789.tgz) as an attachment to hajic@comma.cs.jhu.edu. By
submitting the code you also state and agree that your code has been
written by you and only you, and that it can be made publicly
available.
Data & Sources You Need
Minimization of an FSA
A deterministic FSA (Finite State Automaton) is a device
defined as a 5-tuple (T, Q, S, F, d), where
- T is a set of input symbols,
- Q is a set of states,
- S (an element of Q) is the single initial state,
- F (a subset of Q) is a set of final states,
- d is a transition function: i.e., a mapping from Q x T into Q.
An FSA accepts a given input string I =
[i_1, ..., i_n] if the successive application of
the transition function to all the symbols of I, starting in the
initial state S, leads to (some) final state q_n:
q_1 = d(S, i_1),
q_2 = d(q_1, i_2),
etc., up to:
q_n = d(q_{n-1}, i_n).
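The acceptance rule can be sketched in a few lines of Java. The class SimpleFsa and its method names are made up for illustration, and the transition function d is stored as a nested map, which is an assumption of this sketch rather than a recommended representation. A missing transition is treated as a rejection.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory deterministic FSA, for illustration only. */
final class SimpleFsa {
    private final int start;                                 // S
    private final Set<Integer> finals = new HashSet<>();     // F
    // d, stored as: state -> (symbol -> next state)
    private final Map<Integer, Map<String, Integer>> d = new HashMap<>();

    SimpleFsa(int start) {
        this.start = start;
    }

    void addFinal(int q) {
        finals.add(q);
    }

    void addEdge(int from, String symbol, int to) {
        d.computeIfAbsent(from, k -> new HashMap<>()).put(symbol, to);
    }

    /** Applies d to each symbol in turn, starting in S, and checks that we end in F. */
    boolean accepts(List<String> input) {
        int q = start;
        for (String symbol : input) {
            Map<String, Integer> out = d.get(q);
            Integer next = (out == null) ? null : out.get(symbol);
            if (next == null) {
                return false;  // d is undefined for (q, symbol): reject
            }
            q = next;
        }
        return finals.contains(q);
    }

    public static void main(String[] args) {
        SimpleFsa fsa = new SimpleFsa(1);
        fsa.addFinal(10);
        fsa.addEdge(1, "j", 11);
        fsa.addEdge(11, "a", 12);
        fsa.addEdge(12, "r", 13);
        fsa.addEdge(13, "oo", 10);
        System.out.println(fsa.accepts(Arrays.asList("j", "a", "r", "oo")));  // true
        System.out.println(fsa.accepts(Arrays.asList("j", "a")));             // false
    }
}
```

Note that the input is a list of symbols, not a list of characters: an edge value such as "oo" counts as a single symbol.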
Certainly it is imaginable that two FSAs accept exactly the same set
of input strings; one of them might have a smaller number of
states. Such an FSA, which is the smallest of all those accepting the
same set of strings as some original FSA, is called the minimized FSA
of the original FSA.
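For an acyclic deterministic FSA, one possible way to get there (not necessarily either of the published algorithms) is to merge states bottom-up: two states are equivalent if they agree on being final and have the same labeled edges leading to already-merged targets. The sketch below makes several assumptions: the class AcyclicMinimizer and its members are hypothetical names, every state is assumed to lie on some path to a final state, and the string signatures are far too memory-hungry for the Evaluation Data. It is meant to show the idea only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative only: merges equivalent states of an acyclic deterministic FSA. */
final class AcyclicMinimizer {
    private final Set<Integer> finals;
    private final Map<Integer, TreeMap<String, Integer>> edges;
    private final Map<Integer, Integer> canon = new HashMap<>();    // old state -> new state
    private final Map<String, Integer> register = new HashMap<>();  // signature -> new state
    final Set<Integer> minFinals = new HashSet<>();
    final Map<Integer, TreeMap<String, Integer>> minEdges = new HashMap<>();

    AcyclicMinimizer(Set<Integer> finals, Map<Integer, TreeMap<String, Integer>> edges) {
        this.finals = finals;
        this.edges = edges;
    }

    static void addEdge(Map<Integer, TreeMap<String, Integer>> edges,
                        int from, String label, int to) {
        edges.computeIfAbsent(from, k -> new TreeMap<>()).put(label, to);
    }

    /** Returns the new state that old state q is merged into (memoized). */
    int visit(int q) {
        Integer done = canon.get(q);
        if (done != null) {
            return done;
        }
        TreeMap<String, Integer> out = new TreeMap<>();
        // '<' cannot occur in a label, so it is a safe separator in the signature.
        StringBuilder sig = new StringBuilder(finals.contains(q) ? "F" : "N");
        for (Map.Entry<String, Integer> e : edges.getOrDefault(q, new TreeMap<>()).entrySet()) {
            int target = visit(e.getValue());  // targets are merged before q itself
            out.put(e.getKey(), target);
            sig.append('<').append(e.getKey()).append('<').append(target);
        }
        Integer rep = register.get(sig.toString());
        if (rep == null) {
            rep = register.size() + 1;  // new states are numbered from 1
            register.put(sig.toString(), rep);
            minEdges.put(rep, out);
            if (finals.contains(q)) {
                minFinals.add(rep);
            }
        }
        canon.put(q, rep);
        return rep;
    }

    int stateCount() {
        return register.size();
    }

    public static void main(String[] args) {
        // A made-up FSA accepting [a, b] and [c, b], with 5 states.
        Map<Integer, TreeMap<String, Integer>> edges = new HashMap<>();
        addEdge(edges, 1, "a", 2);
        addEdge(edges, 2, "b", 4);
        addEdge(edges, 1, "c", 3);
        addEdge(edges, 3, "b", 5);
        Set<Integer> finals = new HashSet<>(Arrays.asList(4, 5));
        AcyclicMinimizer m = new AcyclicMinimizer(finals, edges);
        int newStart = m.visit(1);
        System.out.println(m.stateCount() + " states, start " + newStart);
        // prints "3 states, start 3"
    }
}
```

The recursion goes as deep as the longest path in the FSA, so a long automaton may need an explicit stack instead. Whether this approach fits the time and memory limits is exactly the kind of thing you are supposed to find out.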
Here is an example:
This is a deterministic FSA:
And this is its minimized version:
since both FSAs accept exactly these four strings: A, AE, CE and C.
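A quick way to convince yourself that two FSAs accept the same strings is to list all paths from the start state to a final state and compare the sorted results. The sketch below does that for two small automata that accept A, AE, C and CE. Their state numbers and edges are invented for this illustration and need not match the pictures above, and the class PathLister is a hypothetical name. The code assumes the FSA is acyclic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: lists the concatenated edge values of all accepting paths. */
final class PathLister {
    static void addEdge(Map<Integer, Map<String, Integer>> edges,
                        int from, String label, int to) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(label, to);
    }

    /** Returns one string per path from start to a final state, sorted. */
    static List<String> allPaths(int start, Set<Integer> finals,
                                 Map<Integer, Map<String, Integer>> edges) {
        List<String> result = new ArrayList<>();
        walk(start, new StringBuilder(), finals, edges, result);
        Collections.sort(result);
        return result;
    }

    // Depth-first walk; terminates only because the FSA is assumed to be acyclic.
    private static void walk(int q, StringBuilder prefix, Set<Integer> finals,
                             Map<Integer, Map<String, Integer>> edges, List<String> result) {
        if (finals.contains(q)) {
            result.add(prefix.toString());  // a path may end here and also continue
        }
        Map<String, Integer> out = edges.get(q);
        if (out == null) {
            return;
        }
        for (Map.Entry<String, Integer> e : out.entrySet()) {
            int mark = prefix.length();
            prefix.append(e.getKey());      // no separator between edge values
            walk(e.getValue(), prefix, finals, edges, result);
            prefix.setLength(mark);         // undo before trying the next edge
        }
    }

    public static void main(String[] args) {
        // Unminimized: 5 states, finals 2, 3, 4, 5.
        Map<Integer, Map<String, Integer>> big = new HashMap<>();
        addEdge(big, 1, "A", 2);
        addEdge(big, 2, "E", 4);
        addEdge(big, 1, "C", 3);
        addEdge(big, 3, "E", 5);
        Set<Integer> bigFinals = new HashSet<>(Arrays.asList(2, 3, 4, 5));

        // Minimized: 3 states, finals 2, 3.
        Map<Integer, Map<String, Integer>> min = new HashMap<>();
        addEdge(min, 1, "A", 2);
        addEdge(min, 1, "C", 2);
        addEdge(min, 2, "E", 3);
        Set<Integer> minFinals = new HashSet<>(Arrays.asList(2, 3));

        System.out.println(allPaths(1, bigFinals, big));  // [A, AE, C, CE]
        System.out.println(allPaths(1, minFinals, min));  // [A, AE, C, CE]
    }
}
```

Collecting and sorting everything in memory is fine for tiny.fsa; for anything larger you would rather print each path as it is found and sort the output with the standard Unix utilities.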
Test Data
These are available at /usr/local/data/cs226, on
machines like hops, barley, etc.
- tiny.fsa: really tiny FSA for
debugging purposes. One of the possible minimizations is available
in the same directory as
tiny.min.fsa. Another file,
tiny.printout contains the (sorted) output you
should get from the PrintAllPathsFSA accompanying program
(the one printing the concatenated values of all paths from a given
FSA) when run on either tiny.fsa or
tiny.min.fsa. You might want to use these
files for comparison with the results you obtain.
- small.fsa: a small but already
reasonably sized FSA. You should still be able to copy it to your own
directory, keep (some) intermediate results etc.
Evaluation Data
This is really big data which your code will be evaluated upon. It is
stored in the same directory as the test data:
/usr/local/data/cs226/big.fsa. Given the
setup of the undergrad network, you will not be able to copy this data
anywhere; you can only read it. Also, be aware that the undergrad
machines barely have the 250MB of memory (64MB real + 256MB
swap) you will probably need, unless you can come up with a much
leaner method. You will also be unable to store the results; thus, the
only reason you have access to this file is to let you get a feel
for the size and time needed to process the Real Thing. So PLEASE do
not try to use the evaluation data unless you have a reasonable chance
of succeeding, and your code is working reasonably fast and with low
memory requirements on the small.fsa data.
The Rewards
For all who fulfill the requirements (i.e., working code, technical and other
requirements met, max. 250MB of memory, under 1 hour, etc.):
- 30 extra points for "class participation";
- up to 100 points to compensate for not-so-perfect homework(s)
and/or project(s) of your choice (e.g., you can completely ignore
homework 4, and still get 100 points for it)
For the one who fulfills the requirements and has the fastest algorithm of all those fulfilling the requirements:
- 100 extra points for "class participation";
- up to 200 points to compensate for not-so-perfect homework(s)
and/or project(s) of your choice (e.g., you can completely ignore
homeworks 3 & 4, and still get 100 points for each)
In case of a close tie (under 1 second user + system CPU time
difference as measured by the UNIX time command), both (or more)
participants are entitled to the higher reward.
In case nobody can fulfill all the requirements, and the only
requirement which is broken is the 1 hour CPU time limit, the first
three solutions (as measured by the CPU time) will be entitled to the
lesser reward anyway.
Needless to say, no "cooperation" is allowed. See the homeworks page
for the paragraph about plagiarism. Obviously, no submission
which is too similar to some other (and vice versa!) will be accepted.
The Consequences
None, even if you withdraw after you start, or even if you submit
something totally stupid :-).