In the third part of the compiler project you will extend your parser to build a symbol table that encodes the declarations in a SIMPLE program. You will also enforce several context conditions that have not been checked so far. These tasks are part of our semantic analysis phase for SIMPLE. You are also going to extend your driver program with options to display the symbol table for a SIMPLE program. You can get the complete EBNF grammar for the SIMPLE programming language here. The context conditions for this assignment are given here.
The final compiler will consist of a number of modules and classes
working together to translate programs written in SIMPLE into equivalent
programs written in assembly language. While these “bits and pieces” are
spread out over the entire semester, you can already implement the basic
driver program that will orchestrate their work. The driver will be called
sc and is invoked from the shell as follows:
Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] [filename] .
This describes the syntax of the command line in EBNF. After
./sc itself, the user can supply one option (introduced by “-”) to tell
the driver which parts of the compiler to run and what kind of output to
produce.
With this assignment the option
-t is allowed on the command line for
sc. The remaining options, including “no options at all,” should still
result in errors, except for
-s and -c of course, which are
unchanged from Assignments 1 and 2.
For this assignment, the option
-t is supposed to build and display
the symbol table for the given input program. If
-t is given but
an error is detected, the (partial) symbol table should not be displayed.
If a second argument is given, it is assumed to be the file name of a SIMPLE program to process. If no filename is given, you should read the program from standard input instead. Eventually this will also determine whether the output goes to standard output or to a file, but for now all your output goes to standard output.
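The command-line rules for this assignment can be sketched in a few lines of Python. This is only an illustration: the names DriverError, parse_command_line, and read_source are made up here, and the sketch assumes that -s and -c stay valid from the earlier assignments while every other option, and a missing option, is rejected.

```python
import sys


class DriverError(Exception):
    """Raised for an invalid command line (hypothetical name)."""


def parse_command_line(argv):
    """Return (option, filename) for the arguments that follow ./sc.

    filename is None when the program should be read from standard input.
    """
    args = list(argv)
    option = None
    if args and args[0].startswith("-"):
        option = args.pop(0)
    if option not in ("-s", "-c", "-t"):
        # Covers "no option at all" as well as -a, -i, and unknown options.
        raise DriverError("error: option not supported: %r" % option)
    if len(args) > 1:
        raise DriverError("error: too many arguments")
    filename = args[0] if args else None
    return option, filename


def read_source(filename):
    """Read the SIMPLE program from a file, or from standard input."""
    if filename is None:
        return sys.stdin.read()
    with open(filename) as handle:
        return handle.read()
```

A driver would call parse_command_line(sys.argv[1:]), print the message of any DriverError, and otherwise hand the text from read_source to the parser.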
The semantic actions to build the symbol table and to enforce context conditions will be inside the relevant parser methods for declarations. Also you will need to decide how the driver or other parts of the compiler will access the symbol table once it has been built.
The symbol table is a shared data structure that all the remaining phases of the compiler will need to access in various ways. It is in your interest to make your design and implementation for this data structure particularly simple and clean. We have plenty of suggestions regarding design and implementation below, but you are of course free to ignore them as long as your implementation is able to produce the required output.
The symbol table will keep track of the declarations made in a SIMPLE program. For now there are three kinds of declarations: constant declarations (CONST) introduce a name for a constant value (integers for now), variable declarations (VAR) introduce a name for mutable data (of a certain type), and type declarations (TYPE) introduce a name for a type. Type declarations, especially for array types (ARRAY) and record types (RECORD), are the most complex declarations to handle.
First you need to decide how you will represent the declarations in a SIMPLE program. I suggest you introduce a base class Entry for all entries you are going to make into the symbol table and derived classes Constant, Variable, and Type to represent the kind of entry.
For constants you should store (a) the pointer to their type object
and (b) their actual value. I recommend storing the type explicitly
despite the fact that for now it will always be
INTEGER in our
SIMPLE programs: by storing the type of a constant it will be easier to
add more types later.
For variables you only have to store a pointer to their type object. We will add more later (for example the actual address that a variable will be stored at) but for now we need nothing but the type.
For types, you should define separate derived classes
Integer, Array, and Record. For the integer type you don’t have to store
anything; but you must make sure that you never create more than
one instance of the
Integer class! (You may want to look at the
Singleton design pattern for a nice way of achieving this; if you
don’t apply the Singleton pattern, you’ll just have to be extra-careful
to not create multiple instances by accident.)
For array types you have to store (a) a pointer to their element
type object and (b) the length of the array. Note that you can
either store the length by pointing to the appropriate constant
object, or you can just embed a plain integer into the
Array object; the former makes things slightly more consistent, the latter makes the
code slightly shorter.
For record types you have to store a pointer to a scope object (see
below) containing the fields of the record; each field is essentially a
variable. (Note that you could introduce a separate class
Field derived from Entry to distinguish “global” variables from
“record-field” variables; this could make a few things a little easier
later. If you do add a
Field class, make sure that it generates the
same output a
Variable would for this assignment.)
Besides the basic methods to create objects of these classes, you should
make sure that all of them have a
toString method of some sort,
similar to what you did for tokens before. However, please note that
these are mostly for debugging; you should create the actual output
differently (see below). While you hack all these classes, you should
also test them in isolation using unit tests; all the bugs you find
here will not distract you later on, which is a good thing.
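One possible shape for these classes is sketched below in Python. It is only an illustration of the suggested hierarchy, not a required design: the attribute names type, length, value, and scope follow the field names used in this handout, the array length is stored as a plain integer, and the Singleton behaviour of Integer is implemented with __new__, which is just one of several ways to get a single shared instance.

```python
class Entry:
    """Base class for everything stored in the symbol table."""

    def __str__(self):
        return self.__class__.__name__


class Type(Entry):
    """Base class for all type entries."""


class Integer(Type):
    """The INTEGER type; __new__ always hands out the same instance."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


class Array(Type):
    def __init__(self, element_type, length):
        self.type = element_type   # pointer to the element type object
        self.length = length       # plain integer variant

    def __str__(self):
        return "Array(%s, %d)" % (self.type, self.length)


class Record(Type):
    def __init__(self, scope):
        self.scope = scope         # scope object holding the fields


class Constant(Entry):
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

    def __str__(self):
        return "Constant(%s, %d)" % (self.type, self.value)


class Variable(Entry):
    def __init__(self, type_):
        self.type = type_

    def __str__(self):
        return "Variable(%s)" % self.type
```

Because Integer() always returns the same object, two entries have the same integer type exactly when their type attributes are the same object, which keeps later type comparisons simple.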
Now that you have the things that go into the symbol table, it is time to implement the symbol table itself. As mentioned in lecture, you can use just about any data structure available in the language of your choice, as long as it supports the typical “dictionary” operations of inserting and finding data using a string key. (In Java and C++ use one of the existing “map” classes from the standard library; in Python or Go use the built-in “dictionary” or “map” types; in C you’re on your own, but please write the simplest data structure you can think of!)
Whatever data structure you pick, I recommend wrapping it inside your own
Scope class that offers the following operations:
insert, which puts a (name, value) pair into the scope;
find, which finds the value associated with a given name in the scope or any “outer” scopes it may be attached to; and
local, which returns true if the given name is in this scope but false if it doesn’t exist or is in an outer scope.
Speaking of “outer” scopes: You can just pass an outer scope in the constructor and
keep track of it; if a find fails in the current scope, just recurse
on the outer scope (unless it’s NULL because you are already in the
universe scope).
Again you should add a
toString method for debugging purposes, but
most likely you do not want to follow the
outer pointer here. You
should test your code for scopes before you move on.
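A minimal Python sketch of such a Scope class follows. It wraps a built-in dictionary and uses None where the text says NULL; the attribute name table is an assumption made for this illustration, and find returns None for an unknown name rather than raising an error.

```python
class Scope:
    """A dictionary of declarations with an optional outer scope."""

    def __init__(self, outer=None):
        self.outer = outer
        self.table = {}

    def insert(self, name, value):
        """Put a (name, value) pair into this scope."""
        self.table[name] = value

    def find(self, name):
        """Look the name up here, then in the outer scopes."""
        if name in self.table:
            return self.table[name]
        if self.outer is not None:
            return self.outer.find(name)
        return None  # not declared anywhere

    def local(self, name):
        """True only if the name is declared in this very scope."""
        return name in self.table

    def __str__(self):
        # For debugging only; deliberately does not follow self.outer.
        return "Scope(%s)" % ", ".join(sorted(self.table))
```

For example, after universe = Scope() and program = Scope(universe), a name inserted into universe is found through program.find but program.local reports False for it.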
Your parser should create the universe scope before you start
parsing, and it should insert the singleton instance of the
Integer class under the name
INTEGER to set things up correctly. At the start
of Program you should create the actual program scope with the
universe as its “outer” scope; the program scope starts out empty.
When parsing a declaration, you should first collect the identifier you
will need to make an entry for. Consider the declaration
CONST con = 47; for example. You would first remember the name “con”
and then call
Expression to parse the number. But since you will
not write the semantic actions for
Expression in this assignment,
you should just assume that you read the number five regardless
of what you actually parsed as an expression! Now you have both the name
(really!) and the value (assumed!), which is all you need to put the
declaration into the symbol table. You create a
Constant object, put a
pointer to the
Integer instance in its
type field (because for now
all constants are integers), and put the number 5 in its
value field. Then you call
insert on the current scope with “con” as the name and the
Constant you created as the value. For variables and record
fields you can proceed in a similar fashion.
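The steps for CONST con = 47 can be sketched as follows in Python. The helper names make_universe and const_decl and the constant ASSUMED_VALUE are hypothetical; the classes are stripped down to what this example needs, and the duplicate-declaration check is left out.

```python
class Scope:
    def __init__(self, outer=None):
        self.outer = outer
        self.table = {}

    def insert(self, name, value):
        self.table[name] = value

    def find(self, name):
        if name in self.table:
            return self.table[name]
        return self.outer.find(name) if self.outer is not None else None


class Integer:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


class Constant:
    def __init__(self, type_, value):
        self.type = type_
        self.value = value


ASSUMED_VALUE = 5  # stand-in until Expression gets semantic actions


def make_universe():
    """Create the universe scope holding only INTEGER."""
    universe = Scope()
    universe.insert("INTEGER", Integer())
    return universe


def const_decl(scope, name):
    """Semantic action for 'CONST name = Expression;'."""
    constant = Constant(Integer(), ASSUMED_VALUE)
    scope.insert(name, constant)  # insert only after the declaration is parsed
    return constant
```

The parser would call make_universe() once before parsing, create Scope(universe) at the start of Program, and call const_decl(program_scope, "con") after Expression has returned.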
The Type method of your parser will actually be the most complicated:
You should extend it to return a
Type object representing whatever
it just parsed. There are three cases to consider according to the
grammar: the “lonely”
identifier case, the
ARRAY case, and the
RECORD case.
If Type reads an identifier, it looks the identifier up in the current scope and returns the associated Type object (see context conditions).
If Type reads ARRAY, it first parses the Expression for the length of the array (and you will of course assume 5 again for this assignment); then it calls itself recursively to parse the element type of the array. Once that recursive call returns, you have all you need: the length as well as a pointer to the element type. You create an Array instance, put the element type into its type field and the length into its length field. Then you return the Array object you just created.
If Type reads RECORD, it first creates a new Scope object and makes it the current scope; the previously current scope temporarily becomes the new scope’s outer scope; it then parses the record fields and inserts them into the current scope. At the END of the record type, it creates the actual Record type object and sets its scope field to the current scope; it then makes the outer scope the current scope again and cuts the outer pointer of the record scope before returning the Record object.
In the end there’s not all that much code to write, but it’s a bit more subtle than the code we had to write before for the scanner and the parser.
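The three cases can be sketched in Python as shown below. This is a simplified illustration under several assumptions: tokens are plain strings, Expression is a single token whose value is ignored in favor of 5, the record syntax is taken from the example program in this handout, and duplicate-field checks are left out. The names Parser, type_, ParseError, and SemanticError are hypothetical.

```python
class ParseError(Exception):
    pass


class SemanticError(Exception):
    pass


class Scope:
    def __init__(self, outer=None):
        self.outer = outer
        self.table = {}

    def insert(self, name, value):
        self.table[name] = value

    def find(self, name):
        if name in self.table:
            return self.table[name]
        return self.outer.find(name) if self.outer is not None else None


class Integer:
    pass


class Array:
    def __init__(self, type_, length):
        self.type, self.length = type_, length


class Record:
    def __init__(self, scope):
        self.scope = scope


class Variable:
    def __init__(self, type_):
        self.type = type_


class Parser:
    def __init__(self, tokens, scope):
        self.tokens, self.position, self.current = tokens, 0, scope

    def peek(self):
        return self.tokens[self.position]

    def next(self):
        self.position += 1
        return self.tokens[self.position - 1]

    def expect(self, text):
        if self.next() != text:
            raise ParseError("error: expected %s" % text)

    def type_(self):
        if self.peek() == "ARRAY":
            self.next()
            self.next()                       # Expression: value assumed to be 5
            self.expect("OF")
            return Array(self.type_(), 5)
        if self.peek() == "RECORD":
            self.next()
            outer = self.current
            self.current = Scope(outer)       # fields can still see outer types
            while self.peek() != "END":
                names = [self.next()]
                while self.peek() == ",":
                    self.next()
                    names.append(self.next())
                self.expect(":")
                field_type = self.type_()
                self.expect(";")
                for name in names:
                    self.current.insert(name, Variable(field_type))
            self.expect("END")
            record = Record(self.current)
            self.current = outer              # restore the enclosing scope
            record.scope.outer = None         # sever the record scope
            return record
        name = self.next()                    # the "lonely" identifier case
        entry = self.current.find(name)
        if not isinstance(entry, (Integer, Array, Record)):
            raise SemanticError("error: %s does not denote a type" % name)
        return entry
```

Note that the two fields declared together end up pointing at the same Array object, which is the sharing that the graphical output later makes visible.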
Why are we first setting up the outer scope pointer for a record scope
only to sever it when we’re done? While we are parsing the fields, we
want to be able to look up types all the way to the universe scope, and
since all lookups start at the current scope we need to be part of that
hierarchy. However, after we are done parsing the fields, we do not want
the outer scope attached anymore: For the rest of the program, the
record scope should contain only the identifiers that actually
denote record fields, not all identifiers reachable through the outer pointer.
One last note about
IdentifierList: This method should return a list
(std::vector or whatever you want) with all the identifier tokens
it recognized. In methods that need such a list,
VarDecl for example,
you can then run over the list in a loop and create a bunch of entries
in the symbol table as needed. And don’t forget to create a new
Variable object for each of the identifiers!
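As a small illustration of this note, the Python sketch below collects identifiers and then creates one entry per name. It makes simplifying assumptions: tokens are plain strings, the scope is a bare dictionary, and the function names identifier_list and declare_variables are hypothetical.

```python
class Variable:
    def __init__(self, type_):
        self.type = type_


def identifier_list(tokens, position):
    """Parse 'identifier {"," identifier}' starting at position.

    Returns the list of identifiers and the position after the list.
    """
    names = [tokens[position]]
    position += 1
    while position < len(tokens) and tokens[position] == ",":
        names.append(tokens[position + 1])
        position += 2
    return names, position


def declare_variables(scope, names, type_):
    """Create a separate Variable entry for every identifier."""
    for name in names:
        scope[name] = Variable(type_)  # new object each time, shared type
```

Each identifier gets its own Variable object, but all of them point to the one type object that was parsed for the declaration.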
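There are two fundamental context conditions to be enforced while building the symbol table:
Every identifier must be declared before it is used.
No identifier may be declared more than once in the same scope.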
The first condition rules out recursive declarations. (The “before” is interpreted as “textually preceding” here.) To make sure things work out correctly, you should only make an entry in the symbol table when you are completely done parsing it.
For the second condition you should remember that there are exactly
three kinds of scopes in SIMPLE so far: the universe scope containing
INTEGER, the program scope for “actual” declarations, and
record scopes (one for each record type, containing its fields). Of
these, only record scopes can be nested arbitrarily deep, i.e. you can
have a RECORD within a
RECORD within a
RECORD. Two more context
conditions you need to enforce are given here.
The advice from earlier assignments about using exceptions for error handling is still in effect, as is the required format for your error messages:
error: some helpful description
Enforcing context conditions for SIMPLE programs will lead to a number of “new” errors, for example for identifiers that are used but never declared or for identifiers that are declared more than once in a given scope. If you followed the advice for error handling on previous assignments, you should have little trouble handling those new errors.
Since all that matters for automated grading (for undergraduates anyway) is that you actually detect errors, the error messages themselves would not have to make any sense. However, you might want to try to make your error messages as informative as possible, for example by including details about where an error has occurred. As the context conditions we enforce get more complex, you are really doing yourself a favor if you have decent error messages. For example, if you detect a duplicate declaration, instead of saying
error: duplicate declaration
you may want to at least say
error: duplicate declaration at (128, 128)
or even better
error: duplicate declaration of "x" at (128, 128) conflicts with "x" at (34, 34)
Again, for undergraduates these “fancy” error messages are not required, but they will help you quite a bit as you debug your compiler.
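Checks for duplicate and undeclared identifiers, with messages in the style shown above, could be sketched as follows in Python. The names SemanticError, declare, and lookup are hypothetical, and storing a (start, end) pair in each entry is an assumption about how positions are represented.

```python
class SemanticError(Exception):
    """Carries a message in the required 'error: ...' format."""


class Entry:
    def __init__(self, position):
        self.position = position  # (start, end) tuple; assumed representation


class Scope:
    def __init__(self, outer=None):
        self.outer = outer
        self.table = {}

    def insert(self, name, value):
        self.table[name] = value

    def find(self, name):
        if name in self.table:
            return self.table[name]
        return self.outer.find(name) if self.outer is not None else None

    def local(self, name):
        return name in self.table


def declare(scope, name, entry):
    """Insert the entry unless the name already exists in this scope."""
    if scope.local(name):
        previous = scope.find(name)
        raise SemanticError(
            'error: duplicate declaration of "%s" at (%d, %d) '
            'conflicts with "%s" at (%d, %d)'
            % ((name,) + entry.position + (name,) + previous.position))
    scope.insert(name, entry)


def lookup(scope, name, position):
    """Find a declaration or report a use of an undeclared identifier."""
    entry = scope.find(name)
    if entry is None:
        raise SemanticError(
            'error: undeclared identifier "%s" at (%d, %d)'
            % ((name,) + position))
    return entry
```

Only local is consulted for the duplicate check, so redeclaring a name in a nested record scope is accepted while redeclaring it in the same scope is not.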
Once you have successfully built the symbol table for an input program, you must produce output that illustrates its structure. Your starting point for the output should be the program scope which contains all the declarations actually made by the input program. Please do not include the universe scope itself in your output! Here is an example program:
PROGRAM X;
  CONST a = 47;
  VAR i: INTEGER;
  TYPE X = RECORD
    a, b: ARRAY 7 OF INTEGER;
  END;
END X.
The textual output for this program should be as follows except that you will have 5 all over the place:
SCOPE BEGIN
  X =>
    RECORD BEGIN
      SCOPE BEGIN
        a =>
          VAR BEGIN
            type:
              ARRAY BEGIN
                type:
                  INTEGER
                length:
                  7
              END ARRAY
          END VAR
        b =>
          VAR BEGIN
            type:
              ARRAY BEGIN
                type:
                  INTEGER
                length:
                  7
              END ARRAY
          END VAR
      END SCOPE
    END RECORD
  a =>
    CONST BEGIN
      type:
        INTEGER
      value:
        47
    END CONST
  i =>
    VAR BEGIN
      type:
        INTEGER
    END VAR
END SCOPE
Note that the identifiers in a given scope are printed in sorted
order (yes, upper-case letters sort before lower-case letters) and that
each level of indentation is two spaces! This form of output is
slightly inaccurate, i.e. it seems that the types of fields a and b
might be different when in fact they are identical. The good news is
that the textual output is fairly easy to implement. I suggest that you
use the Visitor design pattern to traverse the symbol table once it
is built. Inside your visitor you can maintain an indentation count as
well as a string buffer in which you assemble the output. You can then
print from your driver after a successful parse. (If you don’t want to
use the Visitor pattern you can achieve the same results by using a few
mutually recursive functions instead.) You must have textual output
in this form to get points for the assignment.
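The mutually recursive variant could look roughly like the Python sketch below. It assumes a layout in which every keyword, name, and value sits on its own line and each nesting level adds two spaces; check that assumption against the reference output before relying on it. The function names show_scope and show_entry and the bare-bones classes are made up for this illustration.

```python
class Scope:
    def __init__(self):
        self.table = {}


class Integer:
    pass


class Array:
    def __init__(self, type_, length):
        self.type, self.length = type_, length


class Record:
    def __init__(self, scope):
        self.scope = scope


class Constant:
    def __init__(self, type_, value):
        self.type, self.value = type_, value


class Variable:
    def __init__(self, type_):
        self.type = type_


def show_scope(scope, indent, out):
    """Append the lines for a scope, names in sorted order."""
    out.append("  " * indent + "SCOPE BEGIN")
    for name in sorted(scope.table):  # upper case sorts before lower case
        out.append("  " * (indent + 1) + name + " =>")
        show_entry(scope.table[name], indent + 2, out)
    out.append("  " * indent + "END SCOPE")


def show_entry(entry, indent, out):
    """Append the lines for one entry at the given indentation level."""
    pad = "  " * indent
    if isinstance(entry, Integer):
        out.append(pad + "INTEGER")
    elif isinstance(entry, Array):
        out.append(pad + "ARRAY BEGIN")
        out.append(pad + "  type:")
        show_entry(entry.type, indent + 2, out)
        out.append(pad + "  length:")
        out.append(pad + "    %d" % entry.length)
        out.append(pad + "END ARRAY")
    elif isinstance(entry, Record):
        out.append(pad + "RECORD BEGIN")
        show_scope(entry.scope, indent + 1, out)
        out.append(pad + "END RECORD")
    elif isinstance(entry, Constant):
        out.append(pad + "CONST BEGIN")
        out.append(pad + "  type:")
        show_entry(entry.type, indent + 2, out)
        out.append(pad + "  value:")
        out.append(pad + "    %d" % entry.value)
        out.append(pad + "END CONST")
    elif isinstance(entry, Variable):
        out.append(pad + "VAR BEGIN")
        out.append(pad + "  type:")
        show_entry(entry.type, indent + 2, out)
        out.append(pad + "END VAR")
```

The driver would create an empty list, call show_scope(program_scope, 0, lines) after a successful parse, and print "\n".join(lines).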
While it is possible to recognize the structure of the symbol table in the textual format described above, it is much easier to see if we actually draw the symbol table (of course you should have 5 all over):
As an added benefit, the graphical representation is also 100% accurate:
It gives an (almost) exact representation of what the data structures
should look like in memory. In particular, it is now clear that the
fields a and b have identical types.
This drawing was again rendered using DOT, a tool that takes a textual description of a graph and generates cute PostScript or PNG figures. See http://www.graphviz.org/ for more information on DOT and related tools. Note that you simply write the DOT format to the standard output; we will take care of actually rendering the graphics if we want to look at your symbol table while grading.
You have to extend the driver program with an option
-g to indicate that
-t should produce graphical output instead of textual output.
So the modified command line would be as follows:
Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] ["-g"] [filename] .
For now it is an error to supply
-g by itself or with any option other than
-t. Please follow the same format we used above:
diamonds for constants, rectangles for scopes, circles for
variables (and record fields), and rounded rectangles for types.
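A rough Python sketch of a DOT generator is given below. It only illustrates the shape conventions listed above using standard Graphviz attributes (shape=diamond, shape=box, shape=circle, and shape=box with style=rounded); the node labels, the edge labels, and the way a scope is drawn as a single box are assumptions and will not match the reference drawing exactly. The function name to_dot is hypothetical.

```python
class Scope:
    def __init__(self):
        self.table = {}


class Integer:
    pass


class Array:
    def __init__(self, type_, length):
        self.type, self.length = type_, length


class Record:
    def __init__(self, scope):
        self.scope = scope


class Constant:
    def __init__(self, type_, value):
        self.type, self.value = type_, value


class Variable:
    def __init__(self, type_):
        self.type = type_


def to_dot(scope):
    """Return DOT text for everything reachable from the given scope."""
    lines = ["digraph symbols {"]
    seen = set()

    def name_of(obj):
        return "n%d" % id(obj)  # object identity keeps shared types shared

    def node(obj, attributes):
        lines.append("  %s [%s];" % (name_of(obj), attributes))

    def edge(source, target, label=""):
        lines.append('  %s -> %s [label="%s"];'
                     % (name_of(source), name_of(target), label))

    def visit(obj):
        if id(obj) in seen:
            return  # already emitted, for example a type used by two fields
        seen.add(id(obj))
        if isinstance(obj, Scope):
            node(obj, 'shape=box, label="scope"')
            for name in sorted(obj.table):
                visit(obj.table[name])
                edge(obj, obj.table[name], name)
        elif isinstance(obj, Constant):
            node(obj, 'shape=diamond, label="%d"' % obj.value)
            visit(obj.type)
            edge(obj, obj.type)
        elif isinstance(obj, Variable):
            node(obj, 'shape=circle, label=""')
            visit(obj.type)
            edge(obj, obj.type)
        elif isinstance(obj, Array):
            node(obj, 'shape=box, style=rounded, label="Array (length %d)"'
                 % obj.length)
            visit(obj.type)
            edge(obj, obj.type)
        elif isinstance(obj, Record):
            node(obj, 'shape=box, style=rounded, label="Record"')
            visit(obj.scope)
            edge(obj, obj.scope)
        else:
            node(obj, 'shape=box, style=rounded, label="Integer"')

    visit(scope)
    lines.append("}")
    return "\n".join(lines)
```

Because nodes are keyed by object identity, an array type shared by two record fields appears once with two incoming edges, which is exactly the information the textual output cannot show.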
If you are taking this course at the graduate level, the additional error-handling requirements from Assignment 2 are still in effect. You are, however, faced with a new problem now that we’re building a symbol table and use it to check the consistency of declarations.
First you have to keep going not only for syntactic errors as before
but also for semantic errors inside declarations. Consider for
example the declaration
TYPE A = ARRAY size OF INTEGER where size is
not defined before
A. You will have to flag an error because you
don’t know what
size means, but you also have to put some kind of
type into the symbol table under the name
A as the rest of the program
is likely to use
A again. One way to handle this and similar
situations is to introduce a special InvalidType class and to have
the entry for
A point to an instance of it. However, this could lead
to problems later, for example if after the declaration
VAR a: A you
run across an instruction like
a := 10. Ideally you would still
like to know that
a is of some kind of array type and therefore it’s
okay to apply the indexing brackets. So you probably want to cook up a
slightly more flexible mechanism that allows you to retain some
information about a broken declaration but also ensures that you’ll be
able to tell “good” ones from “bad” ones.
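One way such a mechanism might look is sketched below in Python: every type carries a valid flag, and a broken array is still an Array, just with an unknown length. This is only one possible design under assumed names (valid, Invalid, is_array); it is not the required solution.

```python
class Type:
    valid = True  # class-level default; broken types override it


class Integer(Type):
    pass


class Invalid(Type):
    """Used when nothing at all is known about a broken type."""

    valid = False


class Array(Type):
    def __init__(self, element_type, length):
        self.type = element_type
        self.length = length  # None when the length could not be determined
        # An array is only valid if its length and element type are valid.
        self.valid = length is not None and element_type.valid


def is_array(type_):
    """Indexing can still be checked against broken array types."""
    return isinstance(type_, Array)
```

With this approach the entry for A in the example above would be Array(Integer(), None): later code can still see that A is an array, while the valid flag records that the declaration was broken so that follow-up errors can be suppressed.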
Another way to end up with an invalid declaration is if there is a
syntax error inside the declaration and the error is severe enough
so you cannot recover from it while still building a “correct”
description of it in the symbol table. Consider for example the declaration
TYPE R = RECORD a, b, INTEGER END where either an
identifier (and a colon) are missing or we have an excess comma (and a
missing colon). Depending on how good your error recovery is, you may
not be able to salvage more than “R is a record type” but you don’t know
the fields, or “R is a record type with fields a and b” but you don’t
know the type of those fields. This is really the same case as above,
but here it is caused by a syntax error and not by a semantic error.
Finally consider position information: what is the position of the declaration
TYPE A = ARRAY size OF INTEGER? This becomes important
if later in the program you find another declaration
VAR A: INTEGER
and you want to generate an error message that refers back to the
first declaration of
A. It may be useful to give each entry in the
symbol table a position that starts at the first token that was
relevant for the declaration and ends at the last such token:
that way you can clearly identify the “stretch” of source text that the
declaration originally came from.
Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no executable files), but allows building all required derived files. Also make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!
Regardless of your programming language of choice, we expect to build
your project using
make (if it needs building at all) and we expect to
run your project using
./sc (which stands for “SIMPLE compiler”).
You are free to use the standard library for your language of choice,
except for modules/classes that allow you to avoid writing large
parts of the code for an assignment; so no regular expressions, no parsing tools, and so on.
Depending on your language of choice, compliance with certain tools
(such as valgrind), compiler flags, or additional style
guides may also be required; see Piazza for details.
For reference, here is a short explanation of the grading criteria; not all of the criteria apply to all problems on a given assignment, and not all of the assignments even use all of the criteria.
Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments on Piazza.
Style refers to programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for.
Design refers to proper modularization (into functions, classes, modules, etc.) and the proper choice of algorithms and data structures.
Performance refers to how fast/with how little memory your project can produce the required results compared to other submissions; in this course this can mean your actual compiler or interpreter as well as the code generated by it.
Functionality refers to your programs being able to do what they
should according to the specification given above.
(It also refers to you simply doing the required work.)
If the specification is ambiguous, ask for clarification!
If no clarification is forthcoming, defend the choices you have made
If your project cannot be built, or if it is otherwise obvious that you
never tested it, you will get no points whatsoever.
If your project cannot be built without warnings using the required
compiler options we will take off 10%.
If your programs cannot be built using
make we will take off 10%.
If valgrind detects memory errors in your programs, we will take off 10%.
If your project fails miserably even once, i.e. terminates with an
exception of any kind or dumps core, we will take off 10%.
Presumably you see the pattern here?