In the second part of the compiler project, you are going to develop a parser to perform syntactic analysis of SIMPLE programs. You are also going to extend your driver program with options to display the concrete syntax tree of a SIMPLE program. You can get the complete EBNF grammar for the SIMPLE programming language here.
The final compiler will consist of a number of modules and classes
working together to translate programs written in SIMPLE into equivalent
programs written in assembly language. While these “bits and pieces” are
spread out over the entire semester, you can already implement the basic
driver program that will orchestrate their work. The driver is called
sc and is invoked from the shell as follows:
Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] [filename] .
This describes the syntax of the command line in EBNF. After ./sc
itself, the user can supply one option (introduced by “-”) to tell
the driver which parts of the compiler to run and what kind of output to
produce.
With this assignment the option -c is allowed on the command line for
sc. The remaining options, including “no options at all,” should still
result in errors, except for -s of course, which is unchanged from
the previous assignment.
For this assignment, the option
-c is supposed to run the parser and
to display the concrete syntax tree (aka parse tree) for the
given input. If
-c is given but an error is detected, the (partial)
parse tree should not be output.
If a second argument is given, it is assumed to be the file name of a SIMPLE program to process. If no filename is given, you should read the program from standard input instead. Eventually this will also determine whether the output goes to standard output or to a file, but for now all your output goes to standard output.
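The command-line rules above could be checked along the following lines. This is only a sketch: the names Options and parseArguments are made up for illustration, C++17 is assumed for std::optional, and the error texts are placeholders.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type: which phase to run and where to read from.
struct Options {
    char phase;                           // 's' or 'c' for this assignment
    std::optional<std::string> filename;  // empty means "read standard input"
};

// args holds the command-line arguments after the program name.
// Throws std::runtime_error for anything this assignment does not allow.
Options parseArguments(const std::vector<std::string>& args) {
    if (args.empty() || args.size() > 2) {
        throw std::runtime_error("expected one option and at most one filename");
    }
    const std::string& option = args[0];
    if (option.size() != 2 || option[0] != '-') {
        throw std::runtime_error("expected an option such as -c, got '" + option + "'");
    }
    if (option[1] != 's' && option[1] != 'c') {
        throw std::runtime_error("option '" + option + "' is not supported yet");
    }
    Options result{option[1], std::nullopt};
    if (args.size() == 2) {
        result.filename = args[1];  // second argument is the SIMPLE source file
    }
    return result;
}
```

The driver would build args from argv + 1 and choose between a file stream and standard input depending on whether filename is set.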
Of course the parser will need to access the scanner, so you need to
decide how the two will communicate. The two main options are (a) using
all() from your driver and passing the list of tokens to the parser’s
constructor, or (b) passing the scanner object to the parser’s
constructor and having the parser call next() whenever it needs a new
token.
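To make the two options concrete, here is a small sketch. The Scanner below is only a stand-in for your real scanner, and the "eof" token kind, the Token fields, and the Lookahead class are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type, reduced to what the sketch needs.
struct Token {
    std::string kind;
    std::string text;
};

// Stand-in for the real scanner; next() yields an "eof" token when input is exhausted.
class Scanner {
public:
    explicit Scanner(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

    Token next() {
        if (pos_ < tokens_.size()) return tokens_[pos_++];
        return Token{"eof", ""};
    }

    // Option (a): hand over all remaining tokens at once (without the "eof" marker).
    std::vector<Token> all() {
        std::vector<Token> result;
        for (Token t = next(); t.kind != "eof"; t = next()) result.push_back(t);
        return result;
    }

private:
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};

// Option (b): the parser keeps one token of lookahead and pulls more on demand.
class Lookahead {
public:
    explicit Lookahead(Scanner& scanner) : scanner_(scanner), current_(scanner_.next()) {}
    const Token& current() const { return current_; }
    void advance() { current_ = scanner_.next(); }

private:
    Scanner& scanner_;  // not owned; must outlive the parser
    Token current_;
};
```

Option (a) keeps the parser independent of the scanner class, while option (b) avoids holding the whole token list in memory; either works for this assignment.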
The parser for SIMPLE reads the source program as a sequence of tokens and recognizes its structure in the form of a parse tree. Note that this tree is not constructed explicitly! You should use the method of recursive descent to implement the parser, and thus the tree will only exist in the form of the call stack, i.e. which parser method called which other parser method.
Take the If production for example, and consider the following
source program fragment (in terms of tokens):
... IF c = 10 THEN ...
The parser should call the method to match an
If, which will then
match the token
IF, call the method to match a
Condition (which will
call others in turn and so on), match the token
THEN, and so on.
Implicitly, the parser thus “built” the following parse tree by matching
terminals and calling methods to match non-terminals:
...
If
  IF@(43,44)
  Condition
    Expression
      Term
        Factor
          Designator
            identifier<c>@(46,46)
            Selector
    =@(48,48)
    Expression
      Term
        Factor
          integer<10>@(50,51)
  THEN@(53,56)
...
This is actually the “first taste” of what kind of output you need to generate for this assignment. Before you start hacking the parser, you should study the grammar carefully and identify the non-terminals of the language. These are the productions you need to recognize, and for each of them you will write a method in the parser class. Also, you might want to make sure that the grammar can be parsed using recursive descent before you start the implementation, i.e. check whether it meets the conditions for an LL(1) grammar.
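The If example above can be sketched as a tiny recursive descent parser. It covers only the fragment "IF Condition THEN" with a single "=" comparison, and the Token layout, the string-valued token kinds, and the class name FragmentParser are assumptions made for illustration; the real grammar has more alternatives in every method.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type; a real scanner would likely use an enum for kinds.
struct Token {
    std::string kind;  // "IF", "THEN", "=", "identifier", "integer", ...
    std::string text;  // value for identifiers and integers, empty otherwise
    int start;
    int end;
};

class FragmentParser {
public:
    explicit FragmentParser(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

    // Returns the textual tree; nesting mirrors the depth of the call stack.
    std::string parseIf() {
        If();
        return out_;
    }

private:
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
    int depth_ = 0;
    std::string out_;

    void line(const std::string& s) { out_ += std::string(2 * depth_, ' ') + s + "\n"; }

    const Token& current() const {
        if (pos_ >= tokens_.size()) throw std::runtime_error("unexpected end of input");
        return tokens_[pos_];
    }

    // Consume the current token if it has the expected kind, otherwise fail.
    void match(const std::string& kind) {
        const Token& t = current();
        if (t.kind != kind) throw std::runtime_error("expected " + kind + ", got " + t.kind);
        std::string label = t.text.empty() ? t.kind : t.kind + "<" + t.text + ">";
        line(label + "@(" + std::to_string(t.start) + "," + std::to_string(t.end) + ")");
        ++pos_;
    }

    void enter(const std::string& name) { line(name); ++depth_; }
    void leave() { --depth_; }

    // One method per non-terminal.
    void If()         { enter("If"); match("IF"); Condition(); match("THEN"); leave(); }
    void Condition()  { enter("Condition"); Expression(); match("="); Expression(); leave(); }
    void Expression() { enter("Expression"); Term(); leave(); }
    void Term()       { enter("Term"); Factor(); leave(); }
    void Factor() {
        enter("Factor");
        if (current().kind == "integer") match("integer"); else Designator();
        leave();
    }
    void Designator() { enter("Designator"); match("identifier"); Selector(); leave(); }
    void Selector()   { enter("Selector"); leave(); }  // no selectors in this fragment
};
```

Feeding it the five tokens of "IF c = 10 THEN" with the positions shown above yields the same indented lines, starting with "If" and ending with "  THEN@(53,56)".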
I suggest implementing a class
Parser which should offer only one
public function: The function
parse which processes a complete SIMPLE
program before it returns. Note that the function should not return
any value: In later assignments, the parser will be extended to build a
number of data structures (the intermediate representation) instead. All
the actual parsing methods should be private (or at least protected),
and the first one you should call from parse is of course the one for
the start symbol, Program.
You might also want to implement some helper methods, for example the
match() I described in the lecture.
Aside from implementing the actual parser, an important concern is how
we get the textual representation of the parse tree out of the parser
and to the user. There are a number of options. The simplest would be to
just hardcode a bunch of prints all over the parser. However, this is
not a good design if we want to reuse the parser in a variety of
contexts. Another option would be to build a big string in a member
variable during parsing, and to add a method returning that string to
the Parser class. The driver could then ask for this string after the
parse method returned and print it. There’s also the option of passing
the actual stream (e.g.
std::cout) from the driver to the parser as a
parameter, and to have the parser write its output there. However, the
best option is probably to use the Observer pattern as described
in lecture to separate output of the parse tree almost entirely from the
actual parsing process. (Kudos to Michael Kurth (Spring 2004) who came
up with this use of the Observer pattern.) Whatever you decide to do,
please be aware that it must be possible to switch the output of the
parse tree on and off without recompiling the code; in later
assignments, you can’t simply remove the output code, since it is still needed
when the user gives the -c option.
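One way the Observer idea could look is sketched below. The interface and all names (ParseObserver, TextObserver, Notifier) are assumptions for illustration and not necessarily the design presented in lecture; Notifier stands in for the part of the parser that reports events.

```cpp
#include <string>

// Hypothetical observer interface: the parser reports events, observers decide what to do.
class ParseObserver {
public:
    virtual ~ParseObserver() = default;
    virtual void beginNonTerminal(const std::string& name) = 0;
    virtual void endNonTerminal() = 0;
    virtual void terminal(const std::string& tokenText) = 0;
};

// Collects the indented textual tree so the driver can print it after a successful parse.
class TextObserver : public ParseObserver {
public:
    void beginNonTerminal(const std::string& name) override { add(name); ++depth_; }
    void endNonTerminal() override { --depth_; }
    void terminal(const std::string& tokenText) override { add(tokenText); }
    const std::string& text() const { return text_; }

private:
    void add(const std::string& s) { text_ += std::string(2 * depth_, ' ') + s + "\n"; }
    std::string text_;
    int depth_ = 0;
};

// Stand-in for the parser side: it only knows the interface and works without any observer.
class Notifier {
public:
    void attach(ParseObserver* observer) { observer_ = observer; }
    void begin(const std::string& name) { if (observer_) observer_->beginNonTerminal(name); }
    void end() { if (observer_) observer_->endNonTerminal(); }
    void token(const std::string& text) { if (observer_) observer_->terminal(text); }

private:
    ParseObserver* observer_ = nullptr;  // not owned; nullptr means output is switched off
};
```

With this split the driver attaches a TextObserver only when -c is given, so output is switched on and off at run time, and buffering the text makes it easy to discard a partial tree after an error.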
Here is a sample interaction with the SIMPLE compiler:
$ ./sc -c
PROGRAM X; VAR i: INTEGER; END X.
Program
  PROGRAM@(0, 6)
  identifier<X>@(8, 8)
  ;@(9, 9)
  Declarations
    VarDecl
      VAR@(11, 13)
      IdentifierList
        identifier<i>@(15, 15)
      :@(16, 16)
      Type
        identifier<INTEGER>@(18, 24)
      ;@(25, 25)
  END@(27, 29)
  identifier<X>@(31, 31)
  .@(32, 32)
As before, the first line shows the shell prompt and the user starting your driver program. The next line shows what the user is typing as input, terminated by a newline and “end of file” from the terminal. The following lines are the parse tree for this particular program; note how we indent lines that represent children of a particular “goal” during parsing by two spaces.
If you detect an error, whether in this or future assignments, you should output (to standard error of course) an error message in the following form:
error: some helpful description
You must output the string “error:” on a new line, followed by
one blank, followed by whatever text makes sense for the error in
question. Our automated grading suite relies on this format and you’ll
get penalized if you do something else.
The advice from Assignment 1 about using exceptions for error handling is still in effect, as is the required format of such messages as given above.
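A minimal sketch of how exceptions and the required message format might fit together follows. CompileError, formatError, and runAndReport are hypothetical names, and returning a nonzero exit status on error is an assumption, not something the assignment specifies.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical exception type shared by scanner and parser.
class CompileError : public std::runtime_error {
public:
    using std::runtime_error::runtime_error;
};

// Formats a message as "error:", one blank, then the description.
std::string formatError(const std::string& description) {
    return "error: " + description;
}

// Runs one compiler phase and reports a failure on the given stream
// (standard error by default). Returns the exit status for the driver.
template <typename Phase>
int runAndReport(Phase phase, std::ostream& err = std::cerr) {
    try {
        phase();
        return 0;
    } catch (const CompileError& e) {
        err << formatError(e.what()) << '\n';
        return 1;  // assumption: nonzero status signals an error
    }
}
```

Keeping the formatting in one place makes it harder to deviate from the format the grading suite expects.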
For this assignment, a large number of errors will be detected in the
match function: When the token actually read does not correspond to
the expected token, you can raise an exception and thus produce an
error. However, sometimes you expect one out of a number of possible
tokens, for example inside the
Condition production. Instead of doing
these tests “by hand” you should consider writing a “smart” match
function that takes a list of possible tokens and not only matches them,
but also returns the token actually encountered (as we will need
that token for later assignments).
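Such a function could look roughly like the sketch below. TokenStream and the Token fields are hypothetical, token kinds are plain strings for brevity, and the wording of the error text is only a placeholder.

```cpp
#include <cstddef>
#include <initializer_list>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type with a string-valued kind.
struct Token {
    std::string kind;
    std::string text;
    int start;
    int end;
};

class TokenStream {
public:
    explicit TokenStream(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

    // Accepts the current token if its kind is one of `expected` and returns it;
    // otherwise throws with a message listing what was expected.
    Token match(std::initializer_list<std::string> expected) {
        if (pos_ >= tokens_.size()) throw std::runtime_error("unexpected end of input");
        const Token& t = tokens_[pos_];
        for (const std::string& kind : expected) {
            if (t.kind == kind) {
                ++pos_;
                return t;  // copy of the token actually encountered
            }
        }
        std::string wanted;
        for (const std::string& kind : expected) wanted += " " + kind;
        throw std::runtime_error("expected one of" + wanted + ", found " + t.kind);
    }

private:
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};
```

Inside the method for Condition a call might then read `Token op = stream.match({"=", "#", "<", ">", "<=", ">="});`, where the list of operator kinds is an assumption to be checked against the grammar.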
Consider the following example program:
PROGRAM As3;
CONST x = -47;
TYPE T = RECORD f: INTEGER; END;
VAR a: ARRAY 12 OF T;
BEGIN
  a[7].f := -x
END As3.
The parse tree for this program, in its textual form as required by this assignment, is as follows:
Program
  PROGRAM@(0, 6)
  identifier<As3>@(8, 10)
  ;@(11, 11)
  Declarations
    ConstDecl
      CONST@(13, 17)
      identifier<x>@(19, 19)
      =@(21, 21)
      Expression
        -@(23, 23)
        Term
          Factor
            integer<47>@(24, 25)
      ;@(26, 26)
    TypeDecl
      TYPE@(28, 31)
      identifier<T>@(33, 33)
      =@(35, 35)
      Type
        RECORD@(37, 42)
        IdentifierList
          identifier<f>@(44, 44)
        :@(45, 45)
        Type
          identifier<INTEGER>@(47, 53)
        ;@(54, 54)
        END@(56, 58)
      ;@(59, 59)
    VarDecl
      VAR@(61, 63)
      IdentifierList
        identifier<a>@(65, 65)
      :@(66, 66)
      Type
        ARRAY@(68, 72)
        Expression
          Term
            Factor
              integer<12>@(74, 75)
        OF@(77, 78)
        Type
          identifier<T>@(80, 80)
      ;@(81, 81)
  BEGIN@(83, 87)
  Instructions
    Instruction
      Assign
        Designator
          identifier<a>@(91, 91)
          Selector
            [@(92, 92)
            ExpressionList
              Expression
                Term
                  Factor
                    integer<7>@(93, 93)
            ]@(94, 94)
            .@(95, 95)
            identifier<f>@(96, 96)
        :=@(98, 99)
        Expression
          -@(101, 101)
          Term
            Factor
              Designator
                identifier<x>@(102, 102)
                Selector
  END@(104, 106)
  identifier<As3>@(108, 110)
  .@(111, 111)
While it is possible to recognize the structure of the parse tree in this format, it is much easier to see if we actually draw the tree.
This drawing was rendered using DOT, a language and tool that takes a textual description of a graph and generates cute PostScript or PNG figures. See http://www.graphviz.org/ for more information on DOT and related tools.
If you’re using the Observer pattern to produce output already, it
is not very difficult to produce these nice diagrams as well! First extend
the driver program with an option -g to indicate that sc should
produce graphical output instead of textual output. So the modified
command line would be as follows:
Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] ["-g"] [filename] .
For now it is an error to supply -g by itself or with any option other
than -c; however, later assignments will allow -g with several other
options.
Now implement a second Observer class that contains the code to
output DOT syntax instead of the earlier textual form. Integration with
the driver is easy: If only
-c is given, your driver creates an
instance of the basic Observer and connects it to the parser before
calling parse() on it; if
-g is given as well, your driver creates
an instance of the “DOT Observer” instead and connects that to the
parser. Note that you can simply write DOT output to standard output and
redirect into a file to render it. Please follow the same format we used
above: diamonds for terminals and rectangles for non-terminals.
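A “DOT Observer” might be sketched as follows. The class and method names are assumptions chosen to mirror a begin/end/terminal style of notification, and the node naming scheme (n0, n1, ...) is arbitrary.

```cpp
#include <string>
#include <vector>

class DotObserver {
public:
    // A non-terminal becomes a rectangle and the parent of everything until its end.
    void beginNonTerminal(const std::string& name) {
        stack_.push_back(addNode(name, "box"));
    }
    void endNonTerminal() { stack_.pop_back(); }

    // A terminal becomes a diamond below the current non-terminal.
    void terminal(const std::string& label) { addNode(label, "diamond"); }

    // The complete graph in DOT syntax.
    std::string text() const { return "digraph ParseTree {\n" + body_ + "}\n"; }

private:
    // Emits a node, links it to the enclosing non-terminal (if any), and returns its id.
    int addNode(const std::string& label, const std::string& shape) {
        int id = next_++;
        body_ += "  n" + std::to_string(id) + " [label=\"" + escape(label) +
                 "\", shape=" + shape + "];\n";
        if (!stack_.empty()) {
            body_ += "  n" + std::to_string(stack_.back()) + " -> n" +
                     std::to_string(id) + ";\n";
        }
        return id;
    }

    // Quotes and backslashes must be escaped inside a quoted DOT label.
    static std::string escape(const std::string& s) {
        std::string r;
        for (char c : s) {
            if (c == '"' || c == '\\') r += '\\';
            r += c;
        }
        return r;
    }

    std::string body_;
    std::vector<int> stack_;  // ids of the non-terminals currently being parsed
    int next_ = 0;
};
```

After a successful parse the driver prints text() to standard output; redirecting that into a file and running something like `dot -Tpng tree.dot -o tree.png` renders the picture.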
If you are taking this course at the graduate level, there are additional requirements when it comes to error handling.
First, your error messages must include accurate position information. For now the simplest way to do this is to print the erroneous tokens in such a way that positions are included (similar to what you did in the scanner driver). Later it may be necessary to “infer” compound positions, but for now a single position should be fine.
Second, you must implement the error handling technique described in
lecture: After detecting a syntax error, you will suppress further error
messages until at least four additional tokens have been processed.
You will also handle weak and strong symbols as described in
lecture. If a weak symbol (e.g. a closing parenthesis) is
missing, the parser flags an error but then assumes that the symbol was
actually present. If a syntax error not involving a weak symbol occurs,
the parser flags an error and then skips tokens until it finds an
appropriate strong symbol (e.g.
IF). Note that the latter
requires rolling back the parser’s state far enough for the strong
symbol to actually resynchronize the parser with the token stream (which
is why using exceptions for errors is such a good idea). As a result of
these techniques, you will be able to diagnose multiple syntax
errors instead of just the first one.
Note that the rule of not producing a parse tree after a syntax error still applies.
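The message suppression rule can be sketched with a simple counter. This covers only the "four additional tokens" part, not weak and strong symbols; ErrorReporter is a hypothetical name, and restarting the count on suppressed errors as well is an assumption, so check it against the technique from lecture.

```cpp
#include <string>
#include <vector>

class ErrorReporter {
public:
    // Call once for every token the parser processes.
    void tokenProcessed() {
        if (sinceLastError_ < kQuietTokens) ++sinceLastError_;
    }

    // Records the message unless an earlier error is still being suppressed.
    // Returns true if the message was recorded.
    bool report(const std::string& message) {
        bool quiet = sinceLastError_ < kQuietTokens;
        sinceLastError_ = 0;  // assumption: every detected error restarts the count
        errorSeen_ = true;
        if (quiet) return false;
        messages_.push_back("error: " + message);
        return true;
    }

    // True after any error, suppressed or not; the driver must not print a tree then.
    bool hadErrors() const { return errorSeen_; }
    const std::vector<std::string>& messages() const { return messages_; }

private:
    static constexpr int kQuietTokens = 4;
    int sinceLastError_ = kQuietTokens;  // start "loud" so the first error is reported
    bool errorSeen_ = false;
    std::vector<std::string> messages_;
};
```

Collecting the messages instead of printing them immediately is just one choice; writing each recorded message to standard error right away would work equally well.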
Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no executable files), but allows building all required derived files. Also make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!
Regardless of your programming language of choice, we expect to build
your project using
make (if it needs building at all) and we expect to
run your project using
./sc (which stands for “SIMPLE compiler”).
You are free to use the standard library for your language of choice,
except for modules/classes that allow you to avoid writing large
parts of the code for an assignment; so no regular expressions, no parsing
Depending on your language of choice, compliance with certain tools
(e.g. valgrind), compiler flags, or additional style
guides may also be required; see Piazza for details.
For reference, here is a short explanation of the grading criteria; not all of the criteria apply to all problems on a given assignment, and not all of the assignments even use all of the criteria.
Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments on Piazza.
Style refers to programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for.
Design refers to proper modularization (into functions, classes, modules, etc.) and the proper choice of algorithms and data structures.
Performance refers to how fast/with how little memory your project can produce the required results compared to other submissions; in this course this can mean your actual compiler or interpreter as well as the code generated by it.
Functionality refers to your programs being able to do what they
should according to the specification given above.
(It also refers to you simply doing the required work, which may not be
If the specification is ambiguous, ask for clarification!
If no clarification is forthcoming, defend the choices you have made
If your project cannot be built, or if it is otherwise obvious that you
never tested it, you will get no points whatsoever.
If your project cannot be built without warnings using the required
compiler options we will take off 10%.
If your programs cannot be built using
make we will take off 10%.
If valgrind detects memory errors in your programs, we will take off 10%.
If your project fails miserably even once, i.e. terminates with an
exception of any kind or dumps core, we will take off 10%.
Presumably you see the pattern here?