In the fourth part of the compiler project you will extend your parser to build an abstract syntax tree that encodes the expressions, conditions, and instructions in a SIMPLE program. You will also enforce several context conditions that have not been checked so far. These tasks are part of our semantic analysis phase for SIMPLE. You are also going to extend your driver program with options to display the abstract syntax tree for a SIMPLE program. You can get the concrete grammar for the SIMPLE programming language here. You can get the abstract grammar for the SIMPLE programming language here. The context conditions for this assignment are given here.
The final compiler will consist of a number of modules and classes
working together to translate programs written in SIMPLE into equivalent
programs written in assembly language. While these “bits and pieces” are
spread out over the entire semester, you can already implement the basic
driver program that will orchestrate their work. The driver will be
called sc and is invoked from the shell as follows:
Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] [filename] .
This describes the syntax of the command line in EBNF. After the command
sc itself, the user can supply one option (introduced by “-”) to tell
the driver which parts of the compiler to run and what kind of output to
produce.
With this assignment the option -a is allowed on the command line for
sc. The remaining options, including “no options at all,” should still
result in errors, except for -s, -c, and -t, which are
unchanged from Assignments 1, 2, and 3.
For this assignment, the option -a is supposed to build and display
the abstract syntax tree for the given input program. If -a is
given but an error is detected, the (partial) abstract syntax tree
should not be displayed.
If a second argument is given, it is assumed to be the file name of a SIMPLE program to process. If no filename is given, you should read the program from standard input instead. Eventually this will also determine whether the output goes to standard output or to a file, but for now all your output goes to standard output.
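The command-line rules above can be sketched roughly as follows. This is only an illustration in Python; the names DriverError, parse_command_line, and read_source are hypothetical, and the set of options treated as implemented is an assumption based on the text above.

```python
import sys

VALID_OPTIONS = {"-s", "-c", "-t", "-a", "-i"}  # taken from the EBNF above
IMPLEMENTED = {"-s", "-c", "-t", "-a"}          # assumption: supported so far


class DriverError(Exception):
    """Hypothetical exception type for command-line problems."""


def parse_command_line(argv):
    """Return (option, filename); filename is None for standard input."""
    args = list(argv[1:])
    option = None
    if args and args[0].startswith("-"):
        option = args.pop(0)
    if option not in VALID_OPTIONS:
        # Also covers "no options at all", which is still an error.
        raise DriverError("unknown or missing option")
    if option not in IMPLEMENTED:
        raise DriverError(f"option {option} is not implemented yet")
    if len(args) > 1:
        raise DriverError("too many arguments")
    return option, (args[0] if args else None)


def read_source(filename):
    """Read the SIMPLE program from a file, or from standard input."""
    if filename is None:
        return sys.stdin.read()
    with open(filename) as source:
        return source.read()
```

A driver built this way would call parse_command_line(sys.argv), catch DriverError, and print the message in the required error format.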
The semantic actions to build the AST and to enforce context conditions will be inside the relevant parser methods for expressions, conditions, and instructions. Also you will need to decide how the driver or other parts of the compiler will access the abstract syntax tree once it has been built.
The abstract syntax tree will keep track of the instructions, expressions, and conditions in a SIMPLE program. In fact, the AST will be the primary data structure for the interpreter and code-generator(s) you will develop in the following assignments.
First you need to decide how you will represent the instructions,
expressions, and conditions in a SIMPLE program. The abstract grammar
for SIMPLE describes the structure of the AST, but it doesn’t define
the necessary details. I suggest you introduce a base class Node for
all kinds of nodes in the AST; remember that you might want to introduce
several abstract methods here later on. Derived from Node you should
have the classes Instruction, Expression, and Condition to model
the “big three” categories.
Consider the abstract grammar for conditions for a moment. Obviously you can already make a concrete class in this case, storing two expression pointers (left and right) and the actual relation being checked. This is the “pattern” you will follow for the other concrete classes as well: each class will store what it is supposed to store according to its production in the abstract grammar.
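As a minimal sketch of this class layout (in Python, which is only one possible implementation language), the base classes and the concrete Condition class might look like this. The attribute names and the constructor check are assumptions made for illustration, not requirements of the assignment.

```python
class Node:
    """Base class for every kind of AST node."""

    def __str__(self):
        # Debugging aid only; the required output is produced elsewhere.
        return type(self).__name__


class Instruction(Node):
    """Base class for instructions."""


class Expression(Node):
    """Base class for expressions; each one carries its type."""

    def __init__(self, type_=None):
        self.type = type_


class Condition(Node):
    """Concrete node: two expressions and the relation between them."""

    def __init__(self, left, relation, right):
        # Reject invalid trees as early as possible.
        if not isinstance(left, Expression) or not isinstance(right, Expression):
            raise TypeError("both sides of a condition must be expressions")
        self.left = left
        self.relation = relation
        self.right = right

    def __str__(self):
        return f"Condition ({self.relation})"
```

In a statically typed language the check in the constructor would be done by the compiler instead.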
Now consider the abstract grammar for instructions. There are five
different instructions that can occur in a SIMPLE program (after the
WHILE has been transformed away as discussed in lecture), so you will
derive five concrete classes from the base class Instruction; note that
instructions need a pointer to the next instruction since you have to
encode the order in which they occur in the program.
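One way to encode the order of instructions is a singly linked list. The sketch below shows only two of the instruction classes, with attribute names borrowed from the labels in the sample output further down; the helpers chain and walk are hypothetical.

```python
class Instruction:
    """Base class; every instruction knows its successor."""

    def __init__(self):
        self.next = None  # following instruction, or None at the end


class Assign(Instruction):
    def __init__(self, location, expression):
        super().__init__()
        self.location = location
        self.expression = expression


class Repeat(Instruction):
    def __init__(self, condition, instructions):
        super().__init__()
        self.condition = condition
        self.instructions = instructions  # first instruction of the body


def chain(instructions):
    """Link the given instructions in order and return the first one."""
    for current, following in zip(instructions, instructions[1:]):
        current.next = following
    return instructions[0] if instructions else None


def walk(first):
    """Yield the instructions of a chain in program order."""
    while first is not None:
        yield first
        first = first.next
```

The strings and integers used as operands in a quick test of this sketch would of course be proper Location and Expression nodes in the real AST.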
For expressions you can proceed in a similar fashion. For locations, however, things get a little tricky. I don’t want to give you too many hints, but I should at least give one slightly bigger example. Consider the designator in the following assignment instruction:
a.x.y := -20
Here is the corresponding AST:
[Figure: AST for the assignment instruction above]
Obviously encoding all those selectors can get a little complicated.
Notice that pointers are labeled in the AST to make it a little easier
to understand: the label ST indicates that an AST node points to an
object of one of the
Entry classes from the symbol table assignment
(although not all those objects are actually part of the symbol table);
the labels location and expression indicate whether we are
interested in an address or a value as it were (note that
expression pointers can refer to locations, but location pointers cannot
refer to expressions).
For purposes of type-checking you should give AST nodes derived from
Expression a member variable that stores the type associated with
the node as well. (These type pointers are not shown in the AST
above, but they must be there.) The type of a must be an array of some
sort, otherwise indexing would not make sense. The type of its elements
must be a record of some sort in which a field x is declared. The type
of that field x must again be an array, and the type of its elements
must again be a record, this time with a field y declared. The type of
the entire designator is the type of field y. (If we take the
assignment into account as well, the type of y must actually be
INTEGER.)
Besides the basic methods to create objects of these classes, you should
make sure that all of them have a
toString method of some sort,
similar to what you did for tokens and symbol table entries before.
However, please note that these are mostly for debugging; you should
create the actual output differently (see below). While you hack all
these classes, you should also test them in isolation using simple
unit tests; all the bugs you find here will not distract you later on,
which is a good thing. (You can save yourself a lot of trouble if
you use the type system of your implementation language to ensure
that certain invalid ASTs cannot be built in the first place.)
The next step is to extend the parser methods for instructions, expressions, and conditions to create the proper AST nodes and subtrees. Each method should return the “top node” of the subtree it creates.
For example, in the method Condition you first call Expression, which
will parse the required number of tokens and return a pointer to an
Expression node representing what it just recognized. Then you match
one of the allowed relations and remember it. Then you call Expression
again, getting a pointer to another Expression node, this time for the
right-hand side of the comparison. Now you have all the “ingredients” to
create a Condition node, filling in the left and right subtrees as
well as the relation being checked. The method Condition then returns
a pointer to the node it just created, to be used by whatever method
called Condition in the first place (the parser method for an
instruction that contains a condition, for example). Thus the AST is
built “bottom-up” as you parse the program
text top-down and left to right.
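The steps for Condition can be sketched as a small recursive-descent method. This is a simplified illustration: tokens are plain strings, the expression method is a stand-in that accepts a single operand, and the set of relations as well as the names Parser, Operand, and ParseError are assumptions.

```python
RELATIONS = {"=", "#", "<", ">", "<=", ">="}  # assumed set of relations


class ParseError(Exception):
    pass


class Operand:
    """Stand-in for a real Expression node."""

    def __init__(self, text):
        self.text = text


class Condition:
    def __init__(self, left, relation, right):
        self.left = left
        self.relation = relation
        self.right = right


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def _next(self):
        if self.pos >= len(self.tokens):
            raise ParseError("unexpected end of input")
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expression(self):
        # Stand-in for the real Expression method: one operand only.
        return Operand(self._next())

    def condition(self):
        left = self.expression()           # left-hand side subtree
        relation = self._next()            # remember the relation
        if relation not in RELATIONS:
            raise ParseError(f"expected a relation, found {relation!r}")
        right = self.expression()          # right-hand side subtree
        return Condition(left, relation, right)  # "top node" for the caller
```

For instance, Parser(["i", ">=", "sz"]).condition() returns a Condition whose relation is ">=" and whose subtrees hold the two operands.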
In this way, each of these methods returns the subtree it recognized and
its caller can “hook” that subtree into a larger one it in turn passes
back. The tree for the complete program will be returned by the call to
Program. Note that in the case of Selector you
will not just return a tree, but you also pass a tree as a
parameter; otherwise Selector would not know what it is being applied
to, and you could not enforce the necessary context conditions. You
should test your extensions with a number of simple (!) SIMPLE (!!)
programs before you move on.
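To illustrate why Selector needs the tree it is applied to, here is a rough sketch that wraps a base location in Index and Field nodes while checking the types along the way. Everything here is simplified for illustration: tokens are plain strings, index expressions are single tokens, and the class names ArrayType, RecordType, and SemanticError are hypothetical.

```python
class SemanticError(Exception):
    pass


class ArrayType:
    def __init__(self, element, length):
        self.element, self.length = element, length


class RecordType:
    def __init__(self, fields):
        self.fields = fields  # maps field name to field type


class Variable:
    def __init__(self, name, type_):
        self.name, self.type = name, type_


class Index:
    def __init__(self, location, expression):
        self.location, self.expression = location, expression
        self.type = location.type.element  # type of one array element


class Field:
    def __init__(self, location, name):
        self.location, self.name = location, name
        self.type = location.type.fields[name]  # type of the selected field


def selector(tokens, base):
    """Apply the selectors in tokens to base; return the outermost node."""
    node, pos = base, 0
    while pos < len(tokens):
        if tokens[pos] == "[" and pos + 2 < len(tokens) and tokens[pos + 2] == "]":
            if not isinstance(node.type, ArrayType):
                raise SemanticError("only arrays can be indexed")
            node = Index(node, tokens[pos + 1])
            pos += 3
        elif tokens[pos] == "." and pos + 1 < len(tokens):
            name = tokens[pos + 1]
            if not isinstance(node.type, RecordType) or name not in node.type.fields:
                raise SemanticError(f"no such record field: {name}")
            node = Field(node, name)
            pos += 2
        else:
            raise SemanticError(f"malformed selector at token {pos}")
    return node
```

Because each new node is built around the node passed in, the type of the result can be derived step by step from the type of the base.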
In the methods parsing expressions you should now add constant
folding as described in lecture. Obviously only literal numbers
and identifiers that refer to symbol table constants are indeed
constant. In both of these cases, you can return a Number node with a
pointer to a constant entry. Before you produce Binary nodes, however,
you should check whether
both sides are constant. If they are, you perform the operation
directly and return a Number node with a pointer to a new constant entry
filled with the result of the operation. Thus an expression with
only constant parts will in fact not lead to a convoluted tree,
but just to a single Number node with the final value already
computed.
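A minimal sketch of this folding step, under some assumptions: Constant stands in for the symbol table entry a Number node points to, make_binary is a hypothetical helper the expression methods would call instead of constructing Binary directly, and only +, -, and * are folded here because the treatment of division depends on the language definition.

```python
import operator

FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}


class Constant:
    """Stand-in for a constant entry from the symbol table assignment."""

    def __init__(self, value):
        self.value = value


class Number:
    def __init__(self, constant):
        self.constant = constant


class Variable:
    def __init__(self, name):
        self.name = name


class Binary:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right


def make_binary(op, left, right):
    """Fold if both sides are constant, otherwise build a Binary node."""
    if op in FOLDABLE and isinstance(left, Number) and isinstance(right, Number):
        result = FOLDABLE[op](left.constant.value, right.constant.value)
        return Number(Constant(result))
    return Binary(op, left, right)
```

Since only the two immediate operands are inspected, constants separated by a variable are left unfolded, for example when building (1 + a) + 3 from the inside out.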
Once constant folding works, you can “hook up” the AST with the ST from
the previous assignment. Where you assumed the value five before, for
example in Type when parsing an ARRAY constructor, you can now
actually use the result of Expression. For array types, you must
ensure that the expression you get back is indeed a Number node and that
its value is greater than zero (see context
conditions!), and then fill in that value in your array type.
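The check on array lengths could look roughly like this. The helper array_length and the exception SemanticError are hypothetical names, and Constant again stands in for a constant entry from the symbol table.

```python
class SemanticError(Exception):
    pass


class Constant:
    def __init__(self, value):
        self.value = value


class Number:
    def __init__(self, constant):
        self.constant = constant


class Variable:
    def __init__(self, name):
        self.name = name


def array_length(expression):
    """Return the length for an ARRAY type, or raise a SemanticError."""
    if not isinstance(expression, Number):
        raise SemanticError("array length is not constant")
    length = expression.constant.value
    if length <= 0:
        raise SemanticError("array length must be greater than zero")
    return length
```

The method for Type would call this with the result of Expression and store the returned value in the array type it creates.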
Note that this approach to constant folding is far from perfect. For
example, an expression in which literal numbers adding up to 4 are
separated by a variable a will not be
folded into a+4 as you might wish. We only fold adjacent
nodes, not entire expressions; however, that’s good enough for our
purposes here: ensuring that expressions that must be constant are
indeed constant.
The advice from earlier assignments about using exceptions for error handling is still in effect, as is the required format for your error messages:
error: some helpful description
Enforcing context conditions for SIMPLE programs will lead to a number of “new” errors, for example when the types on both sides of an assignment instruction do not agree. If you followed the advice for error handling on previous assignments, you should have little trouble handling those new errors.
Since all that matters for automated grading (for undergraduates anyway) is that you actually detect errors, the error messages themselves would not have to make any sense. However, you might want to try to make your error messages as informative as possible, for example by including details about where an error has occurred. As the context conditions we enforce get more complex, you are really doing yourself a favor if you have decent error messages. For example, if you detect a missing record field, instead of saying
error: no such record field
you may want to at least say
error: no such record field @ (138, 138)
or even better
error: the designator "a.x" @ (128, 136) does not refer to a record with a field "y" @ (138, 138)
Again, for undergraduates these “fancy” error messages are not required, but they will help you quite a bit as you debug your compiler.
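One possible way to get messages in this shape is a single exception class that formats itself. The class name CompilerError and its attributes are assumptions for illustration; the positions are simply whatever your scanner records for each token.

```python
class CompilerError(Exception):
    """Carries a description and, optionally, a source position range."""

    def __init__(self, description, start=None, end=None):
        super().__init__(description)
        self.description = description
        self.start = start
        self.end = end

    def __str__(self):
        text = f"error: {self.description}"
        if self.start is not None and self.end is not None:
            text += f" @ ({self.start}, {self.end})"
        return text
```

The driver can then catch CompilerError in one place and print str(error), so every message follows the required format.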
Once you have successfully built the abstract syntax tree for an input program, you must produce output that illustrates its structure. Your starting point for the output should be the root of the AST you built for the program itself, which will output the entire program. Here is an example program:
PROGRAM X;
  CONST sz = 47;
  VAR a: ARRAY sz OF INTEGER;
      i: INTEGER;
BEGIN
  i := 0;
  REPEAT
    a[i] := 64738
  UNTIL i >= sz END
END X.
The textual output for this program should be as follows:
instructions =>
  Assign:
  location =>
    Variable:
    variable =>
      VAR BEGIN
        type:
          INTEGER
      END VAR
  expression =>
    Number:
    value =>
      CONST BEGIN
        type:
          INTEGER
        value:
          0
      END CONST
  Repeat:
  condition =>
    Condition (>=):
    left =>
      Variable:
      variable =>
        VAR BEGIN
          type:
            INTEGER
        END VAR
    right =>
      Number:
      value =>
        CONST BEGIN
          type:
            INTEGER
          value:
            47
        END CONST
  instructions =>
    Assign:
    location =>
      Index:
      location =>
        Variable:
        variable =>
          VAR BEGIN
            type:
              ARRAY BEGIN
                type:
                  INTEGER
                length:
                  47
              END ARRAY
          END VAR
      expression =>
        Variable:
        variable =>
          VAR BEGIN
            type:
              INTEGER
          END VAR
    expression =>
      Number:
      value =>
        CONST BEGIN
          type:
            INTEGER
          value:
            64738
        END CONST
As you can see, at certain places you will have to include output for pieces of the symbol table as well; hopefully your solution for the previous assignment is modular enough to handle this. This form of output is slightly inaccurate, but at least it’s fairly easy to implement. I suggest that you use the Visitor design pattern to traverse the AST once it is built. Inside your visitor you can maintain an indentation count as well as a string buffer in which you assemble the output. You can then print from your driver after a successful parse. (If you don’t want to use the Visitor pattern you can achieve the same results by using a few mutually recursive functions instead.) You must have textual output in this form to get points for the assignment.
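A visitor with an indentation count and a buffer might be sketched like this. It is deliberately incomplete: only three node classes are covered, and the symbol table output for variables and constants is replaced by a one-line stand-in, so the text it produces is not the exact required format. The names Printer, emit, and child are hypothetical.

```python
class Number:
    def __init__(self, value):
        self.value = value

    def accept(self, visitor):
        visitor.visit_number(self)


class Variable:
    def __init__(self, name):
        self.name = name

    def accept(self, visitor):
        visitor.visit_variable(self)


class Assign:
    def __init__(self, location, expression):
        self.location, self.expression = location, expression

    def accept(self, visitor):
        visitor.visit_assign(self)


class Printer:
    """Visitor that assembles indented output in a list of lines."""

    def __init__(self):
        self.depth = 0
        self.lines = []

    def emit(self, text):
        self.lines.append("  " * self.depth + text)

    def child(self, label, node):
        self.emit(label + " =>")
        self.depth += 1
        node.accept(self)
        self.depth -= 1

    def visit_assign(self, node):
        self.emit("Assign:")
        self.child("location", node.location)
        self.child("expression", node.expression)

    def visit_variable(self, node):
        self.emit("Variable:")
        self.emit("variable => " + node.name)  # stand-in for the entry

    def visit_number(self, node):
        self.emit("Number:")
        self.emit("value => " + str(node.value))  # stand-in for the entry

    def text(self):
        return "\n".join(self.lines)
```

The driver would create one Printer, pass it to the root of the AST via accept, and print the result of text() only if parsing succeeded.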
While it is possible to recognize the structure of the abstract syntax tree in the textual format described above, it is much easier to see if we actually draw the AST:
[Figure: graphical rendering of the AST for the example program]
As an added benefit, the graphical representation is also 100% accurate: It gives an (almost) exact representation of what the data structures should look like in memory.
This drawing was again rendered using DOT, a tool that takes a textual description of a graph and generates cute PostScript or PNG figures. See http://www.graphviz.org/ for more information on DOT and related tools. Note that you simply write the DOT format to the standard output; we will take care of actually rendering the graphics if we want to look at your abstract syntax tree while grading.
You have to extend the driver program with an option -g to indicate
that -a should produce graphical output instead of textual output.
So the modified command line would be as follows:
Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] ["-g"] [filename] .
For now it is an error to supply -g by itself or with any option other
than -a. Please follow the same format we used above:
nodes in the AST should be rectangles; also please do not
include the complete symbol table output, things would get too
complicated; instead just indicate variables and constants as shown, do
not follow type pointers any further.
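Producing DOT is mostly a matter of giving every node a unique name and writing one line per node and per pointer. The sketch below uses a hypothetical DotWriter class and represents the tree as nested (label, children) tuples to stay short; in your compiler a visitor over the real AST classes would play the role of emit. Labels containing double quotes are not escaped here.

```python
class DotWriter:
    """Collects DOT statements for nodes and labeled edges."""

    def __init__(self):
        self.lines = []
        self.count = 0

    def node(self, label, shape="box"):
        # "box" is the Graphviz shape for a rectangle.
        name = f"n{self.count}"
        self.count += 1
        self.lines.append(f'  {name} [label="{label}", shape={shape}];')
        return name

    def edge(self, source, target, label):
        self.lines.append(f'  {source} -> {target} [label="{label}"];')

    def text(self):
        return "digraph AST {\n" + "\n".join(self.lines) + "\n}"


def emit(writer, tree):
    """Write tree, a (label, [(edge_label, subtree), ...]) tuple."""
    label, children = tree
    name = writer.node(label)
    for edge_label, child in children:
        writer.edge(name, emit(writer, child), edge_label)
    return name
```

Printing writer.text() to standard output gives a graph description that the dot tool can render.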
If you are taking this course at the graduate level, the additional error-handling requirements from Assignments 2 and 3 are still in effect. You are, however, once again faced with new problems now that we’re building an AST and using it to check all (static) context conditions of the language. I won’t belabor the details here as I did on previous assignments; the point is that you must be able to keep parsing even after a semantic error such as a missing record field in a designator or mismatched types in an assignment. Needless to say, you need accurate position information for all your error messages, and ideally your error messages show where the actual problem is by including relevant pieces of source text.
The good news is that after this assignment, you are mostly done with additional requirements regarding error handling. Of course there will be new graduate-level requirements to compensate.
Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no executable files), but allows building all required derived files. Also make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!
Regardless of your programming language of choice, we expect to build
your project using
make (if it needs building at all) and we expect to
run your project using
./sc (which stands for “SIMPLE compiler”).
You are free to use the standard library for your language of choice,
except for modules/classes that allow you to avoid writing large
parts of the code for an assignment; so no regular expressions, no parsing
libraries, and the like.
Depending on your language of choice, compliance with certain tools
(such as valgrind), compiler flags, or additional style
guides may also be required; see Piazza for details.
For reference, here is a short explanation of the grading criteria; not all of the criteria apply to all problems on a given assignment, and not all of the assignments even use all of the criteria.
Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments on Piazza.
Style refers to programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for.
Design refers to proper modularization (into functions, classes, modules, etc.) and the proper choice of algorithms and data structures.
Performance refers to how fast/with how little memory your project can produce the required results compared to other submissions; in this course this can mean your actual compiler or interpreter as well as the code generated by it.
Functionality refers to your programs being able to do what they
should according to the specification given above.
(It also refers to you simply doing the required work, which may not be
If the specification is ambiguous, ask for clarification!
If no clarification is forthcoming, defend the choices you have made.
If your project cannot be built, or if it is otherwise obvious that you
never tested it, you will get no points whatsoever.
If your project cannot be built without warnings using the required
compiler options we will take off 10%.
If your programs cannot be built using
make we will take off 10%.
If valgrind detects memory errors in your programs, we will take off 10%.
If your project fails miserably even once, i.e. terminates with an
exception of any kind or dumps core, we will take off 10%.
Presumably you see the pattern here?