In the first part of the compiler project, you are going to develop a scanner to perform lexical analysis of SIMPLE programs. You are also going to hack the basic driver program which you will use for the rest of the semester to integrate more and more components of your compiler. You can get the complete concrete grammar for the SIMPLE programming language here.
The final compiler will consist of a number of modules and classes
working together to translate programs written in SIMPLE into equivalent
programs written in assembly language. While these “bits and pieces” are
spread out over the entire semester, you can already implement the basic
driver program that will orchestrate their work. The driver will be
called sc and is invoked from the shell as follows:
Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] [filename] .
This describes the syntax of the command line in EBNF. After the
command name itself, the user can supply one option (introduced by
“-”) to tell the driver which parts of the compiler to run and what
kind of output to produce. For this assignment, the option “-s”, which
runs the scanner by itself and produces a list of recognized tokens,
is the only relevant option.
If another option, or no option at all, is given, you should abort with
an error message (see below). Eventually “no option” will mean “generate
code” as it does for “real” compilers.
If a second argument is given, it is assumed to be the file name of a SIMPLE program to process. If no filename is given, you should read the program from standard input instead. Eventually this will also determine whether the output goes to standard output or to a file, but for now all your output goes to standard output.
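The command-line handling described above could be sketched as follows. This is a minimal sketch in Python (the assignment lets you pick any language); the helper name parse_args is my own, not part of the assignment.

```python
import sys

def parse_args(argv):
    """Parse the sc command line: one required option, an optional filename.

    Returns (option, filename); filename is None when input should come
    from standard input. Raises ValueError (reported as an "error: ..."
    message by the caller) for unsupported invocations.
    """
    args = argv[1:]
    option = None
    if args and args[0].startswith("-"):
        option = args.pop(0)
    # For now only "-s" is supported; "no option" will eventually mean
    # "generate code", but today it is an error.
    if option != "-s":
        raise ValueError("only the -s option is supported so far")
    if len(args) > 1:
        raise ValueError("too many arguments")
    filename = args[0] if args else None
    return option, filename
```

A driver would then open the named file, or fall back to sys.stdin when filename is None.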
If you detect an error, whether in this or future assignments, you should output (to standard error of course) an error message in the following form:
error: some helpful description
You must output the string “error:” on a new line, followed by
one blank, followed by whatever text makes sense for the error in
question. Our automated grading suite relies on this format and you’ll
get penalized if you do something else. Note that if an error is
detected in this assignment, the (partial) list of tokens should
still be output.
I suggest using exceptions for error handling throughout the
project, if your language of choice supports them. This way, you can
simply put a big try into the main program, a catch at the end, and
just output “error:” followed by some string obtained from the
exception object. As your compiler grows and more components can signal
errors, this will be simpler than handling them one by one.
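That centralized try/catch could look like the sketch below (Python; the exception class name and the demonstration error are my own invention, not mandated by the assignment).

```python
import sys

class CompilerError(Exception):
    """Base class for errors signalled by any compiler component."""

def main():
    try:
        # ... run the scanner (and, later, more phases) here ...
        # A placeholder error, just to demonstrate the reporting path:
        raise CompilerError("illegal character '@' at position 7")
    except CompilerError as e:
        # The single, uniform place for reporting: "error:" plus one
        # blank plus a description, on standard error.
        sys.stderr.write("error: {}\n".format(e))
        return 1
    return 0
```

As more phases are added, each one just raises CompilerError (or a subclass) and the reporting stays in one place.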
The scanner for SIMPLE reads the source program as a sequence of characters and recognizes “larger” textual units called tokens. Consider the following source program (additional blanks inserted to separate the characters, underscores represent the actual blanks):
I F _ c = 1 0 _ T H E N
The scanner would produce the following sequence of tokens in return (blanks now separate tokens):
IF c = 10 THEN
That is, a token for the keyword IF, a token for the identifier c, a
token for the equals sign, a token for the integer 10, a token for the
keyword THEN, and so on and so forth. Note that whereas whitespace is
sometimes needed to separate tokens (for example between IF and c,
which would otherwise be read as the identifier IFc), it does not
itself constitute tokens.
Aside from regular spaces, your scanner should also “filter out” tabs,
line feeds, form feeds, carriage returns, and comments. In SIMPLE,
comments start with the symbol “(*” and end with the symbol “*)”.
Unlike comments in C, comments in SIMPLE can be nested; this
requires a bit of extra hackery in the scanner; much like whitespace,
comments are not part of the actual grammar,
but they still have to be handled/filtered properly by the scanner.
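The whitespace-and-comment filtering could be sketched like this (Python; assuming, as above, that comments are delimited by “(*” and “*)”). The nesting depth counter is the "extra hackery" the text mentions.

```python
def skip_blanks_and_comments(text, pos):
    """Advance pos past whitespace and (possibly nested) (* ... *) comments.

    Returns the position of the next significant character. Raises
    ValueError when a comment is never closed.
    """
    while pos < len(text):
        if text[pos] in " \t\n\f\r":      # blanks, tabs, feeds, returns
            pos += 1
        elif text.startswith("(*", pos):
            depth = 1                      # current comment nesting level
            pos += 2
            while depth > 0:
                if text.startswith("(*", pos):
                    depth += 1
                    pos += 2
                elif text.startswith("*)", pos):
                    depth -= 1
                    pos += 2
                elif pos < len(text):
                    pos += 1
                else:
                    raise ValueError("comment not terminated")
        else:
            break
    return pos
```

A plain C-style scanner could skip to the first “*)”; here the depth counter makes the inner “(* b *)” below part of the outer comment.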
Before you start hacking the scanner you should study the grammar carefully and identify the vocabulary of the language. Try to classify the individual tokens in a useful way (keywords, single character symbols, multi character symbols, etc.). Also, you want to note the characters that are “legal” in SIMPLE programs. If you ever encounter a character that does not have any purpose, and if that character occurs outside of a comment, then you should throw an exception (display an error, see above) that notifies the user of the problem. (This is essentially the only error message your compiler can generate so far.)
Once you know the vocabulary, you should write a class Token.
Instances of this class will represent the tokens your scanner
recognizes and passes on to the next phase. The class needs to be able
to store a variety of information depending on the kind of token
we’re dealing with. Some tokens are keywords such as WHILE, which
don’t have any additional “semantic” information attached. Others are
identifiers such as ics142, for which you need to store (a) the fact
that they are identifiers, and (b) the actual string. For numbers such
as 64738 you should store their value.
Each token should also store the position it occurs at in the source
text for use in error messages. The first character you read is position
0, the next is position 1, and so on. The position of a token consists
of the positions of the first and last characters that constitute that
token. (Note that processing whitespace, while not yielding tokens, will
still increase the position in the file.) Your
Token class should also
be able to return a string (in the format described below) that explains
what token it is, what value is attached (if any), and where it occurred
in the source. You might want to test the
Token class before you go
on. (BTW, you could use a base class and separate subclasses for each
token kind, but I’d discourage it in this case. Avoid over-design!)
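Following the "no over-design" advice, a single Token class could look like this sketch (Python; the attribute names are mine, but the string format matches the one shown in the example session below).

```python
class Token:
    """One lexical token: a kind, an optional value, and its position.

    kind  -- e.g. "IF", ":", "identifier", "integer", "eof"
    value -- identifier string or integer value; None for keywords/symbols
    start, end -- positions of the first and last character of the token
    """

    def __init__(self, kind, value, start, end):
        self.kind = kind
        self.value = value
        self.start = start
        self.end = end

    def __str__(self):
        # Assignment format: kind, optional <value>, then @(start, end).
        value = "<{}>".format(self.value) if self.value is not None else ""
        return "{}{}@({}, {})".format(self.kind, value, self.start, self.end)
```

Testing it against the sample output is cheap: str(Token("identifier", "ics142", 4, 9)) should reproduce the line from the example session.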
Now you are ready to implement the actual scanner. I suggest
implementing it as a class
Scanner which accepts a string
(representing the source text) in its constructor. (Using a class for
this is not 100% optimal from a principled design perspective. However,
just like using exceptions for error handling, it is convenient.)
Your class should offer two public functions: a function next that
returns the next token only, and a function all that returns
all tokens in the source text as a list. You can call the first
function repeatedly to implement the second one, and you can use a
standard container class for the list. Note that once next was called
from another part of the compiler, you should not allow it to be
followed by a call to all. If next is called after the whole source
text has been processed, you could throw an exception. However, it
will be far more useful to introduce another token kind—let’s call it
eof for “end of file”—which gets returned instead of real tokens once
the actual source is exhausted.
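The public interface just described can be sketched as follows (Python; the token recognition itself is elided, and Token is stood in for by a namedtuple).

```python
from collections import namedtuple

# Minimal stand-in for the Token class described earlier.
Token = namedtuple("Token", "kind value start end")

class Scanner:
    """Turns a SIMPLE source string into tokens; sketch of the API only."""

    def __init__(self, source):
        self.source = source
        self.position = 0      # index of the next unread character

    def next(self):
        """Return the next token; yields eof tokens forever once done."""
        # A real implementation would first skip blanks and comments,
        # then dispatch on the current character; only the end-of-input
        # case is shown here.
        if self.position >= len(self.source):
            return Token("eof", None, self.position, self.position)
        raise NotImplementedError("token recognition goes here")

    def all(self):
        """Return every token up to and including the final eof token."""
        tokens = []
        while True:
            token = self.next()
            tokens.append(token)
            if token.kind == "eof":
                return tokens
```

Note how all is implemented on top of next, as the text suggests, and how the eof token replaces an exception as the end-of-input signal.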
To recognize the tokens themselves, you should write functions that tell
you whether a given character belongs to a certain “class of characters”
such as “letter” or “digit” as well as functions that handle a sequence
of these (once recognized) and instantiate the appropriate token object.
You also need to be able to check whether a certain sequence of
“letters” is a keyword or not; if not, it is an identifier. In other
words: Decompose the task of lexical analysis further into private
functions, do not do everything in the
next function directly.
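The character-class helpers and the keyword check could look like this sketch (Python; the keyword set below lists only the keywords visible in this handout's examples; the full list comes from the SIMPLE grammar).

```python
# Keywords seen in this assignment's examples; consult the grammar for
# the complete set.
KEYWORDS = {"IF", "THEN", "WHILE", "VAR", "ARRAY", "OF"}

def is_letter(ch):
    return "a" <= ch <= "z" or "A" <= ch <= "Z"

def is_digit(ch):
    return "0" <= ch <= "9"

def scan_word(text, pos):
    """Consume a run of letters and digits starting at a letter.

    Returns (kind, value, start, end, next_pos): a keyword token when
    the spelling is in KEYWORDS, an identifier token otherwise.
    """
    start = pos
    while pos < len(text) and (is_letter(text[pos]) or is_digit(text[pos])):
        pos += 1
    word = text[start:pos]
    if word in KEYWORDS:
        return (word, None, start, pos - 1, pos)
    return ("identifier", word, start, pos - 1, pos)
```

This also illustrates the point about whitespace above: scanning "IFc" yields one identifier, while "IF c" yields the keyword IF and then the identifier c.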
Let’s look at an example for how this first iteration of the SIMPLE compiler would be used from the command line:
$ ./sc -s
VAR ics142: ARRAY 5 OF INTEGER;
VAR@(0, 2)
identifier<ics142>@(4, 9)
:@(10, 10)
ARRAY@(12, 16)
integer<5>@(18, 18)
OF@(20, 21)
identifier<INTEGER>@(23, 29)
;@(30, 30)
eof@(32, 32)
The first line shows the shell prompt and the user starting your driver program without a file name, so input will come from standard input. The next line shows what the user is typing as input; what is not shown is that this time the user needs to end the input with an “end of file” character from the terminal, not just by hitting the “return” key. The following lines are the individual tokens recognized by the scanner, one on each line. Each token starts with the “kind” of token it is; this is optionally followed by its “semantic value” in angle brackets; this is followed by the position of the token in the format shown. (Yes, this is indeed the format you should use to print your tokens.)
Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no executable files), but allows building all required derived files. Also make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!
Regardless of your programming language of choice, we expect to build
your project using
make (if it needs building at all) and we expect to
run your project using
./sc (which stands for “SIMPLE compiler”).
You are free to use the standard library for your language of choice,
except for modules/classes that allow you to avoid writing large
parts of the code for an assignment; so no regular expressions and no
parsing libraries.
Depending on your language of choice, compliance with certain tools
(e.g. valgrind), compiler flags, or additional style
guides may also be required; see Piazza for details.
For reference, here is a short explanation of the grading criteria; not all of the criteria apply to all problems on a given assignment, and not all of the assignments even use all of the criteria.
Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments on Piazza.
Style refers to programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for.
Design refers to proper modularization (into functions, classes, modules, etc.) and the proper choice of algorithms and data structures.
Performance refers to how fast/with how little memory your project can produce the required results compared to other submissions; in this course this can mean your actual compiler or interpreter as well as the code generated by it.
Functionality refers to your programs being able to do what they
should according to the specification given above.
(It also refers to you simply doing the required work, which may not be covered by any of the other criteria.)
If the specification is ambiguous, ask for clarification!
If no clarification is forthcoming, defend the choices you have made.
If your project cannot be built, or if it is otherwise obvious that you
never tested it, you will get no points whatsoever.
If your project cannot be built without warnings using the required
compiler options we will take off 10%.
If your programs cannot be built using
make we will take off 10%.
If valgrind detects memory errors in your programs, we will take off 10%.
If your project fails miserably even once, i.e. terminates with an
exception of any kind or dumps core, we will take off 10%.
Presumably you see the pattern here?