Assignment 3: Semantic Analysis / Symbol Table

Overview

In the third part of the compiler project you will extend your parser to build a symbol table that encodes the declarations in a SIMPLE program. You will also enforce several context conditions that have not been checked so far. These tasks are part of our semantic analysis phase for SIMPLE. You are also going to extend your driver program with options to display the symbol table for a SIMPLE program. You can get the complete EBNF grammar for the SIMPLE programming language here. The context conditions for this assignment are given here.

Problem 1: Driver Program (10%)

The final compiler will consist of a number of modules and classes working together to translate programs written in SIMPLE into equivalent programs written in assembly language. While these “bits and pieces” are spread out over the entire semester, you can already implement the basic driver program that will orchestrate their work. The driver will be called sc and is invoked from the shell as follows:

Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] [filename] .

This describes the syntax of the command line in EBNF. After ./sc itself, the user can supply one option (introduced by “-”) to tell the driver which parts of the compiler to run and what kind of output to produce.

Arguments and Options

With this assignment the option -t is allowed on the command line for sc. The remaining options, including “no options at all,” should still result in errors; the exceptions are -s and -c, which are unchanged from Assignments 1 and 2.

For this assignment, the option -t is supposed to build and display the symbol table for the given input program. If -t is given but an error is detected, the (partial) symbol table should not be displayed.

If a second argument is given, it is assumed to be the file name of a SIMPLE program to process. If no filename is given, you should read the program from standard input instead. Eventually this will also determine whether the output goes to standard output or to a file, but for now all your output goes to standard output.
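As a rough illustration of the rules above, here is a minimal Python sketch of the argument handling. The names parse_args, read_source, and DriverError are hypothetical, and the wording of the messages is an assumption; only the accepted options and the "standard input if no filename" rule come from the text.

```python
import sys

ALLOWED = {"-s", "-c", "-t"}        # options accepted as of this assignment
KNOWN = ALLOWED | {"-a", "-i"}      # in the grammar, but still errors for now


class DriverError(Exception):
    """Raised for any problem with the command line (hypothetical name)."""


def parse_args(argv):
    """Return (option, filename); filename is None for standard input."""
    args = list(argv[1:])
    if not args or not args[0].startswith("-"):
        raise DriverError("an option is required for now")
    option = args.pop(0)
    if option not in KNOWN:
        raise DriverError("unknown option " + option)
    if option not in ALLOWED:
        raise DriverError("option " + option + " is not supported yet")
    if len(args) > 1:
        raise DriverError("too many arguments")
    filename = args[0] if args else None
    return option, filename


def read_source(filename):
    """Read the whole program from the file, or from stdin if none is given."""
    if filename is None:
        return sys.stdin.read()
    with open(filename) as handle:
        return handle.read()
```

The driver would call parse_args(sys.argv), catch DriverError, and print the message in the required error format.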

The semantic actions to build the symbol table and to enforce context conditions will be inside the relevant parser methods for declarations. Also you will need to decide how the driver or other parts of the compiler will access the symbol table once it has been built.

The symbol table is a shared data structure that all the remaining phases of the compiler will need to access in various ways. It is in your interest to make your design and implementation for this data structure particularly simple and clean. We have plenty of suggestions regarding design and implementation below, but you are of course free to ignore them as long as your implementation is able to produce the required output.

Problem 2: Symbol Table (80%)

The symbol table will keep track of the declarations made in a SIMPLE program. For now there are three kinds of declarations: constant declarations (CONST) introduce a name for a constant value (integers for now); variable declarations (VAR) introduce a name for mutable data (of a certain type); and type declarations (TYPE) introduce a name for a type. Type declarations, especially for array types (ARRAY) and record types (RECORD), are the most complex declarations to handle.

Symbol Table Entries

First you need to decide how you will represent the declarations in a SIMPLE program. I suggest you introduce a base class Entry for all entries you are going to make into the symbol table and derived classes Constant, Variable, and Type to represent the kind of declaration.

For constants you should store (a) the pointer to their type object and (b) their actual value. I recommend storing the type explicitly despite the fact that for now it will always be INTEGER in our SIMPLE programs: by storing the type of a constant it will be easier to add more types (maybe BOOLEAN or REAL?) later.

For variables you only have to store a pointer to their type object. We will add more later (for example the actual address that a variable will be stored at) but for now we need nothing but the type.

For types, you should define separate derived classes Integer, Array, and Record. For the integer type you don’t have to store anything; but you must make sure that you never create more than one instance of the Integer class! (You may want to look at the Singleton design pattern for a nice way of achieving this; if you don’t apply the Singleton pattern, you’ll just have to be extra-careful to not create multiple instances by accident.)

For array types you have to store (a) a pointer to their element type object and (b) the length of the array. Note that you can either store the length by pointing to the appropriate constant object, or you can just embed a plain integer into the Array class; the former makes things slightly more consistent, the latter makes the code slightly shorter.

For record types you have to store a pointer to a scope object (see below) containing the fields of the record; each field is essentially a variable. (Note that you could introduce a separate class Field derived from Entry to distinguish “global” variables from “record-field” variables; this could make a few things a little easier later. If you do add a Field class, make sure that it generates the same output a Variable would for this assignment.)

Besides the basic methods to create objects of these classes, you should make sure that all of them have a toString method of some sort, similar to what you did for tokens before. However, please note that these are mostly for debugging; you should create the actual output differently (see below). While you hack all these classes, you should also test them in isolation using unit tests; all the bugs you find here will not distract you later on, which is a good thing.
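To make the suggested class hierarchy concrete, here is one possible sketch in Python. The class names follow the suggestions above; the attribute names (element_type, length, scope, type, value), the debugging strings, and the particular way the single Integer instance is enforced are assumptions, not requirements.

```python
class Entry:
    """Base class for everything stored in the symbol table."""


class Type(Entry):
    """Base class for all type entries."""


class Integer(Type):
    """The INTEGER type; constructing it always yields the same object."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __str__(self):
        return "INTEGER"


class Array(Type):
    def __init__(self, element_type, length):
        self.element_type = element_type
        self.length = length        # plain int here; a Constant would also work

    def __str__(self):
        return "ARRAY %d OF %s" % (self.length, self.element_type)


class Record(Type):
    def __init__(self, scope):
        self.scope = scope          # scope object holding the fields

    def __str__(self):
        return "RECORD"


class Constant(Entry):
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

    def __str__(self):
        return "CONST %s = %d" % (self.type, self.value)


class Variable(Entry):
    def __init__(self, type_):
        self.type = type_

    def __str__(self):
        return "VAR %s" % self.type
```

A unit test for this sketch could check that Integer() is Integer() holds and that the strings look the way you expect.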

Symbol Table Scopes

Now that you have the things that go into the symbol table, it is time to implement the symbol table itself. As mentioned in lecture, you can use just about any data structure available in the language of your choice, as long as it supports the typical “dictionary” operations of inserting and finding data using a string key. (In Java and C++ use one of the existing “map” classes from the standard library; in Python or Go use the built-in “dictionary” or “map” types; in C you’re on your own, but please write the simplest data structure you can think of!)

Whatever data structure you pick, I recommend wrapping it inside your own Scope class that offers the following operations: insert a (name, value) pair into the scope; find the value associated with a given name in the scope or any “outer” scopes it may be attached to; and local which returns true if the given name is in this scope but false if it doesn’t exist or is in an outer scope. Speaking of “outer” scopes: You can just pass an outer scope in the constructor and keep track of it; if a find fails in the current scope, just recurse on the outer scope (unless it’s NULL because you are already in the universe scope).

Again you should add a toString method for debugging purposes, but most likely you do not want to follow the outer pointer here. You should test your code for scopes before you move on.
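A Scope wrapper along these lines could look as follows in Python. This is only a sketch: the attribute names outer and table are assumptions, and returning None for a failed find is one choice among several (raising an exception would be another).

```python
class Scope:
    def __init__(self, outer=None):
        self.outer = outer      # None for the universe scope
        self.table = {}         # name -> entry

    def insert(self, name, value):
        self.table[name] = value

    def find(self, name):
        """Look in this scope first, then recurse on the outer scopes."""
        if name in self.table:
            return self.table[name]
        if self.outer is not None:
            return self.outer.find(name)
        return None

    def local(self, name):
        """True only if the name is declared in this very scope."""
        return name in self.table

    def __str__(self):
        # For debugging only; deliberately does not follow the outer pointer.
        return "Scope(" + ", ".join(sorted(self.table)) + ")"
```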

Building the Symbol Table

Your parser should create the universe scope before you start parsing, and it should insert the singleton instance of the Integer class under the name INTEGER to set things up correctly. At the start of Program you should create the actual program scope with the universe as its “outer” scope; the program scope starts out empty.

When parsing a declaration, you should first collect the identifier you will need to make an entry for. Consider the declaration CONST con = 47; for example. You would first remember the name “con” and then call Expression to parse the number. But since you will not write the semantic actions for Expression in this assignment, you should just assume that you read the number five regardless of what you actually parsed as an expression! Now you have both the name (really!) and the value (assumed!), which is all you need to put the declaration into the symbol table. You create a Constant object, put a pointer to the Integer instance in its type field (because for now all constants are integers), and put the number 5 in its value field. Then you call insert on the current scope with “con” as the name and the Constant you created as the value. For variables and record fields you can proceed in a similar fashion.

The Type method of your parser will actually be the most complicated: You should extend it to return a Type object representing whatever it just parsed. There are three cases to consider according to the grammar: the “lonely” identifier case, the ARRAY case, and the RECORD case:

In the end there’s not all that much code to write, but it’s a bit more subtle than the code we had to write before for the scanner and the parser.

Why are we first setting up the outer scope pointer for a record scope only to sever it when we’re done? While we are parsing the fields, we want to be able to look up types all the way to the universe scope, and since all lookups start at the current scope we need to be part of that hierarchy. However, after we are done parsing the fields, we do not want the outer scope attached anymore: For the rest of the program, the record scope should contain only the identifiers that actually denote record fields, not all identifiers reachable through the outer pointer.
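The attach-then-sever idea can be sketched as follows. Everything here is a simplified stand-in: Scope is a bare-bones version with only find, the fields argument replaces the actual parsing of the field list, and fields are stored as plain tuples instead of Variable objects.

```python
class Scope:
    """Tiny stand-in for a scope: a dict plus an outer pointer."""

    def __init__(self, outer=None):
        self.outer = outer
        self.table = {}

    def find(self, name):
        scope = self
        while scope is not None:
            if name in scope.table:
                return scope.table[name]
            scope = scope.outer
        return None


def build_record_scope(current, fields):
    """Build the scope for a RECORD type.

    `fields` stands in for the parsed field list: pairs of
    (list of field names, name of the field type).
    """
    record_scope = Scope(outer=current)     # attach: lookups reach the universe
    for names, type_name in fields:
        field_type = record_scope.find(type_name)
        if field_type is None:
            raise Exception("undeclared type " + type_name)
        for name in names:
            record_scope.table[name] = ("VAR", field_type)
    record_scope.outer = None               # sever: only the fields stay visible
    return record_scope
```

After the call, a find on the record scope succeeds for the field names but no longer for INTEGER, which is exactly the behavior described above.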

One last note about IdentifierList: This method should return a list (or std::vector or whatever you want) with all the identifier tokens it recognized. In methods that need such a list, VarDecl for example, you can then run over the list in a loop and create a bunch of entries in the symbol table as needed. And don’t forget to create a new Variable object for each of the identifiers!
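A small sketch of that loop, assuming for brevity that the identifier list holds plain strings rather than tokens and that the scope is a plain dictionary; declare_variables is a hypothetical helper name.

```python
class Variable:
    def __init__(self, type_):
        self.type = type_


def declare_variables(scope, identifiers, type_):
    """Insert one fresh Variable per identifier into `scope` (a dict here)."""
    for name in identifiers:
        scope[name] = Variable(type_)   # a new object each time, never shared


INTEGER = object()                      # stand-in for the Integer instance
program_scope = {}
declare_variables(program_scope, ["i", "j"], INTEGER)
```

The two variables are distinct objects, but both point to the same type object.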

Context Conditions

There are two fundamental context conditions to be enforced while building the symbol table:

The first condition rules out recursive declarations. (The “before” is interpreted as “textually preceding” here.) To make sure things work out correctly, you should only make an entry in the symbol table when you are completely done parsing it.

For the second condition you should remember that there are exactly three kinds of scopes in SIMPLE so far: the universe scope containing INTEGER, the program scope for “actual” declarations, and record scopes (one for each record type, containing its fields). Of these, only record scopes can be nested arbitrarily deep, i.e. you can have a RECORD within a RECORD within a RECORD. Two more context conditions you need to enforce are given here.
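The sketch below shows how the two checks might look, assuming the two conditions are "declared before use" and "declared at most once per scope" (the two new errors mentioned under Error Handling). Scopes are plain dictionaries here, and declare, lookup, and SemanticError are hypothetical names.

```python
class SemanticError(Exception):
    pass


def declare(scope, name, entry):
    """Insert into the current scope, rejecting duplicates in that scope."""
    if name in scope:                   # the "local" test, not a full find
        raise SemanticError("duplicate declaration of " + name)
    scope[name] = entry


def lookup(scopes, name):
    """Search innermost to outermost; `scopes` lists outermost first."""
    for scope in reversed(scopes):
        if name in scope:
            return scope[name]
    raise SemanticError("undeclared identifier " + name)


universe = {"INTEGER": "INTEGER"}
program = {}

# TYPE T = ARRAY 5 OF T: the element type is looked up while T is still
# being parsed, so T is not in the table yet and the lookup fails.
try:
    element = lookup([universe, program], "T")
    declare(program, "T", ("ARRAY", 5, element))
    recursive_rejected = False
except SemanticError:
    recursive_rejected = True
```

Because the entry for T is only made after its right-hand side has been handled, the recursive declaration is rejected without any special-case code.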

Error Handling

The advice from earlier assignments about using exceptions for error handling is still in effect, as is the required format for your error messages:

error: some helpful description

Enforcing context conditions for SIMPLE programs will lead to a number of “new” errors, for example for identifiers that are used but never declared or for identifiers that are declared more than once in a given scope. If you followed the advice for error handling on previous assignments, you should have little trouble handling those new errors.

Since all that matters for automated grading (for undergraduates anyway) is that you actually detect errors, the error messages themselves would not have to make any sense. However, you might want to try to make your error messages as informative as possible, for example by including details about where an error has occurred. As the context conditions we enforce get more complex, you are really doing yourself a favor if you have decent error messages. For example, if you detect a duplicate declaration, instead of saying

error: duplicate declaration

you may want to at least say

error: duplicate declaration at (128, 128)

or even better

error: duplicate declaration of "x" at (128, 128) conflicts with "x" at (34, 34)

Again, for undergraduates these “fancy” error messages are not required, but they will help you quite a bit as you debug your compiler.
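One way to produce the most detailed variant is sketched below. The names CompilerError, duplicate_message, and report are hypothetical, and positions are assumed to be (start, end) pairs as in the examples above.

```python
class CompilerError(Exception):
    """Carries a description; the driver adds the required prefix."""


def duplicate_message(name, new_pos, old_pos):
    """Describe a duplicate declaration, naming both positions."""
    return 'duplicate declaration of "%s" at (%d, %d) conflicts with "%s" at (%d, %d)' % (
        name, new_pos[0], new_pos[1], name, old_pos[0], old_pos[1])


def report(error):
    """Format an exception the way the assignment requires."""
    return "error: " + str(error)
```

The parser would raise CompilerError(duplicate_message(...)) and the driver would print report(e) for whatever exception e it catches.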

Symbol Table Output

Once you have successfully built the symbol table for an input program, you must produce output that illustrates its structure. Your starting point for the output should be the program scope which contains all the declarations actually made by the input program. Please do not include the universe scope itself in your output! Here is an example program:

PROGRAM X;
  CONST a = 47;
  VAR i: INTEGER;
  TYPE X = RECORD
    a, b: ARRAY 7 OF INTEGER;
  END;
END X.

The textual output for this program should be as follows, except that your output will show 5 wherever a number appears (47 and 7 here), since every expression is assumed to evaluate to 5 for now:

SCOPE BEGIN
  X =>
    RECORD BEGIN
      SCOPE BEGIN
        a =>
          VAR BEGIN
            type:
              ARRAY BEGIN
                type:
                  INTEGER
                length:
                  7
              END ARRAY
          END VAR
        b =>
          VAR BEGIN
            type:
              ARRAY BEGIN
                type:
                  INTEGER
                length:
                  7
              END ARRAY
          END VAR
      END SCOPE
    END RECORD
  a =>
    CONST BEGIN
      type:
        INTEGER
      value:
        47
    END CONST
  i =>
    VAR BEGIN
      type:
        INTEGER
    END VAR
END SCOPE

Note that the identifiers in a given scope are printed in sorted order (yes, upper-case letters sort before lower-case letters) and that each level of indentation is two spaces! This form of output is slightly inaccurate, i.e. it seems that the types of fields a and b might be different when in fact they are identical. The good news is that the textual output is fairly easy to implement. I suggest that you use the Visitor design pattern to traverse the symbol table once it is built. Inside your visitor you can maintain an indentation count as well as a string buffer in which you assemble the output. You can then print from your driver after a successful parse. (If you don’t want to use the Visitor pattern you can achieve the same results by using a few mutually recursive functions instead.) You must have textual output in this form to get points for the assignment.
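Here is a sketch of such a visitor in Python. Instead of accept methods it dispatches on the class name, which is a common Python variation on the Visitor pattern; the entry classes are bare-bones stand-ins, and all method and attribute names are assumptions.

```python
class Integer:
    pass

class Array:
    def __init__(self, element_type, length):
        self.element_type = element_type
        self.length = length

class Record:
    def __init__(self, scope):
        self.scope = scope

class Constant:
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

class Variable:
    def __init__(self, type_):
        self.type = type_

class Scope:
    def __init__(self):
        self.table = {}


class TextVisitor:
    def __init__(self):
        self.depth = 0      # current indentation level
        self.lines = []     # output buffer, one entry per line

    def visit(self, node):
        # Dispatch on the class name, e.g. Array -> visit_array.
        getattr(self, "visit_" + type(node).__name__.lower())(node)

    def emit(self, text):
        self.lines.append("  " * self.depth + text)

    def field(self, label, value):
        """Emit `label` one level deeper and `value` two levels deeper."""
        self.depth += 1
        self.emit(label)
        self.depth += 1
        if isinstance(value, int):
            self.emit(str(value))
        else:
            self.visit(value)
        self.depth -= 2

    def visit_scope(self, scope):
        self.emit("SCOPE BEGIN")
        for name in sorted(scope.table):    # upper case sorts before lower case
            self.field(name + " =>", scope.table[name])
        self.emit("END SCOPE")

    def visit_constant(self, constant):
        self.emit("CONST BEGIN")
        self.field("type:", constant.type)
        self.field("value:", constant.value)
        self.emit("END CONST")

    def visit_variable(self, variable):
        self.emit("VAR BEGIN")
        self.field("type:", variable.type)
        self.emit("END VAR")

    def visit_integer(self, integer):
        self.emit("INTEGER")

    def visit_array(self, array):
        self.emit("ARRAY BEGIN")
        self.field("type:", array.element_type)
        self.field("length:", array.length)
        self.emit("END ARRAY")

    def visit_record(self, record):
        self.emit("RECORD BEGIN")
        self.depth += 1
        self.visit(record.scope)
        self.depth -= 1
        self.emit("END RECORD")


def render(scope):
    """Return the textual form of the given scope as one string."""
    visitor = TextVisitor()
    visitor.visit(scope)
    return "\n".join(visitor.lines)
```

The driver would call render on the program scope (not the universe) and print the result only after a successful parse.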

Problem 3: Graphical Symbol Table (10%)

While it is possible to recognize the structure of the symbol table in the textual format described above, it is much easier to see if we actually draw the symbol table (of course you should have 5 all over):

As an added benefit, the graphical representation is also 100% accurate: It gives an (almost) exact representation of what the data structures should look like in memory. In particular, it is now clear that the fields a and b have identical types.

This drawing was again rendered using DOT, a tool that takes a textual description of a graph and generates cute PostScript or PNG figures. See http://www.graphviz.org/ for more information on DOT and related tools. Note that you simply write the DOT format to the standard output; we will take care of actually rendering the graphics if we want to look at your symbol table while grading.

You have to extend the driver program with an option -g to indicate that -t should produce graphical output instead of textual output. So the modified command line would be as follows:

Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] ["-g"] [filename] .

For now it is an error to supply -g by itself or with any option other than -c or -t. Please follow the same format we used above: diamonds for constants, rectangles for scopes, circles for variables (and record fields), and rounded rectangles for types.
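The following sketch shows how the four shapes can be expressed with standard DOT node attributes. The graph name, node identifiers, labels, and edge labels are all assumptions made for illustration; only the mapping from kinds of objects to shapes comes from the text.

```python
SHAPES = {
    "const": "shape=diamond",
    "scope": "shape=box",
    "var": "shape=circle",
    "type": "shape=box, style=rounded",
}


def to_dot(nodes, edges):
    """nodes: (id, kind, label) triples; edges: (source, target, label) triples."""
    lines = ["digraph symtab {"]
    for ident, kind, label in nodes:
        lines.append('  %s [label="%s", %s];' % (ident, label, SHAPES[kind]))
    for source, target, label in edges:
        lines.append('  %s -> %s [label="%s"];' % (source, target, label))
    lines.append("}")
    return "\n".join(lines)


# Two record fields that share one array type node.
example = to_dot(
    [("fields", "scope", "fields"), ("va", "var", "a"), ("vb", "var", "b"),
     ("arr", "type", "Array"), ("integer", "type", "Integer")],
    [("fields", "va", "a"), ("fields", "vb", "b"),
     ("va", "arr", "type"), ("vb", "arr", "type"), ("arr", "integer", "type")],
)
```

Emitting each object exactly once (for example keyed by object identity) is what lets the drawing show that a and b point to the same type.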

Graduate Level Requirements

If you are taking this course at the graduate level, the additional error-handling requirements from Assignment 2 are still in effect. You are, however, faced with a new problem now that we’re building a symbol table and use it to check the consistency of declarations.

First you have to keep going not only for syntactic errors as before but also for semantic errors inside declarations. Consider for example the declaration TYPE A = ARRAY size OF INTEGER where size is not defined before A. You will have to flag an error because you don’t know what size means, but you also have to put some kind of type into the symbol table under the name A as the rest of the program is likely to use A again. One way to handle this and similar situations is to introduce a special InvalidType class and to have the entry for A point to an instance of it. However, this could lead to problems later, for example if after the declaration VAR a: A you run across an instruction like a[4] := 10. Ideally you would still like to know that a is of some kind of array type and therefore it’s okay to apply the indexing brackets. So you probably want to cook up a slightly more flexible mechanism that allows you to retain some information about a broken declaration but also ensures that you’ll be able to tell “good” ones from “bad” ones.
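One possible shape for such a mechanism, sketched in Python, is a validity flag on every type object. This is just one idea among many: the valid attribute, the use of None for an unknown length, and the helper can_index are all assumptions.

```python
class Type:
    def __init__(self, valid=True):
        self.valid = valid


class Integer(Type):
    pass


class Array(Type):
    def __init__(self, element_type, length, valid=True):
        # An array built on a broken element type is itself broken.
        super().__init__(valid and element_type.valid)
        self.element_type = element_type
        self.length = length    # None when the length could not be determined


def can_index(type_):
    """Indexing is plausible for any array type, even a broken one."""
    return isinstance(type_, Array)


# TYPE A = ARRAY size OF INTEGER with size undeclared:
broken = Array(Integer(), None, valid=False)
```

With this, a[4] := 10 can still be accepted as an indexing operation on an array, while later phases can check broken.valid to avoid trusting the missing length.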

Another way to end up with an invalid declaration is if there is a syntax error inside the declaration and the error is severe enough that you cannot recover from it while still building a “correct” description of it in the symbol table. Consider for example the declaration TYPE R = RECORD a, b, INTEGER END where either an identifier (and a colon) is missing or we have an excess comma (and a missing colon). Depending on how good your error recovery is, you may not be able to salvage more than “R is a record type” but you don’t know the fields, or “R is a record type with fields a and b” but you don’t know the type of those fields. This is really the same case as above, but here it is caused by a syntax error and not by a semantic error.

Finally consider position information: what is the position of something like TYPE A = ARRAY size OF INTEGER? This becomes important if later in the program you find another declaration VAR A: INTEGER and you want to generate an error message that refers back to the first declaration of A. It may be useful to give each entry in the symbol table a position that starts at the first token that was relevant for the declaration and ends at the last such token: that way you can clearly identify the “stretch” of source text that the declaration originally came from.

Deliverables

Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no executable files), but allows building all required derived files. Also make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!

Regardless of your programming language of choice, we expect to build your project using make (if it needs building at all) and we expect to run your project using ./sc (which stands for “SIMPLE compiler”). You are free to use the standard library for your language of choice, except for modules/classes that allow you to avoid writing large parts of the code for an assignment; so no regular expressions, no parsing combinators, etc. Depending on your language of choice, compliance with certain tools (e.g. checkstyle or valgrind), compiler flags, or additional style guides may also be required; see Piazza for details.

Grading

For reference, here is a short explanation of the grading criteria; not all of the criteria apply to all problems on a given assignment, and not all of the assignments even use all of the criteria.

Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments on Piazza.

Style refers to programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for.

Design refers to proper modularization (into functions, classes, modules, etc.) and the proper choice of algorithms and data structures.

Performance refers to how fast/with how little memory your project can produce the required results compared to other submissions; in this course this can mean your actual compiler or interpreter as well as the code generated by it.

Functionality refers to your programs being able to do what they should according to the specification given above. (It also refers to you simply doing the required work, which may not be programming alone.) If the specification is ambiguous, ask for clarification! If no clarification is forthcoming, defend the choices you have made in your README file.

If your project cannot be built, or if it is otherwise obvious that you never tested it, you will get no points whatsoever. If your project cannot be built without warnings using the required compiler options we will take off 10%. If your programs cannot be built using make we will take off 10%. If valgrind detects memory errors in your programs, we will take off 10%. If your project fails miserably even once, i.e. terminates with an exception of any kind or dumps core, we will take off 10%. Presumably you see the pattern here?