Assignment 4: Semantic Analysis / Abstract Syntax Tree

Overview

In the fourth part of the compiler project you will extend your parser to build an abstract syntax tree that encodes the expressions, conditions, and instructions in a SIMPLE program. You will also enforce several context conditions that have not been checked so far. These tasks are part of our semantic analysis phase for SIMPLE. You are also going to extend your driver program with options to display the abstract syntax tree for a SIMPLE program. You can get the concrete grammar for the SIMPLE programming language here. You can get the abstract grammar for the SIMPLE programming language here. The context conditions for this assignment are given here.

Problem 1: Driver Program (10%)

The final compiler will consist of a number of modules and classes working together to translate programs written in SIMPLE into equivalent programs written in assembly language. While these “bits and pieces” are spread out over the entire semester, you can already implement the basic driver program that will orchestrate their work. The driver will be called sc and is invoked from the shell as follows:

Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] [filename] .

This describes the syntax of the command line in EBNF. After ./sc itself, the user can supply one option (introduced by “-”) to tell the driver which parts of the compiler to run and what kind of output to produce.

Arguments and Options

With this assignment the option -a is allowed on the command line for sc. The remaining options, including “no options at all,” should still result in errors, except for -s, -c, and -t which are unchanged from Assignments 1, 2, and 3.

For this assignment, the option -a is supposed to build and display the abstract syntax tree for the given input program. If -a is given but an error is detected, the (partial) abstract syntax tree should not be displayed.

If a second argument is given, it is assumed to be the file name of a SIMPLE program to process. If no filename is given, you should read the program from standard input instead. Eventually this will also determine whether the output goes to standard output or to a file, but for now all your output goes to standard output.
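The rules above can be sketched as a small argument checker. This is only an illustration in Python, not a required design: the names parse_command_line and DriverError are hypothetical, and the sketch covers only the single-option form used in Problems 1 and 2.

```python
# Options accepted for this assignment; -i exists in the grammar but is
# still an error at this stage.
ALLOWED_OPTIONS = {"s", "c", "t", "a"}


class DriverError(Exception):
    """Hypothetical exception type for command-line problems."""


def parse_command_line(argv):
    """Check the arguments that follow ./sc and return (option, filename).

    filename is None when the program should be read from standard input.
    """
    args = list(argv)
    if not args or not args[0].startswith("-"):
        # "No options at all" is still an error for now.
        raise DriverError("exactly one option is required")
    option = args.pop(0)[1:]
    if option not in ALLOWED_OPTIONS:
        raise DriverError("unsupported option -" + option)
    if len(args) > 1:
        raise DriverError("too many arguments")
    filename = args[0] if args else None
    return option, filename
```

A driver built along these lines would catch DriverError at the top level and print the message in the required "error: ..." format.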

The semantic actions to build the AST and to enforce context conditions will be inside the relevant parser methods for expressions, conditions, and instructions. Also you will need to decide how the driver or other parts of the compiler will access the abstract syntax tree once it has been built.

Problem 2: Abstract Syntax Tree (80%)

The abstract syntax tree will keep track of the instructions, expressions, and conditions in a SIMPLE program. In fact, the AST will be the primary data structure for the interpreter and code-generator(s) you will develop in the following assignments.

Abstract Syntax Tree Nodes

First you need to decide how you will represent the instructions, expressions, and conditions in a SIMPLE program. The abstract grammar for SIMPLE describes the structure of the AST, but it doesn’t define the necessary details. I suggest you introduce a base class Node for all kinds of nodes in the AST; remember that you might want to introduce several abstract methods here later on. Derived from Node you should introduce classes Instruction, Expression, and Condition to model the “big three” categories.

Consider the abstract grammar for conditions for a moment. Obviously you can already make a concrete class in this case, storing two expression pointers (left and right) and the actual relation being checked. This is the “pattern” you will follow for the other concrete classes as well: each class will store what it is supposed to store according to its production in the abstract grammar.

Now consider the abstract grammar for instructions. There are five different instructions that can occur in a SIMPLE program (after the WHILE has been transformed away as discussed in lecture), so you will define classes Assign, If, Repeat, Read, and Write derived from the base class Instruction; note that instructions need a next pointer since you have to encode the order in which they occur in the source text.
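One possible shape for these classes, sketched in Python. The class names follow the text above; the attribute names (in particular those of If) and the helper chain are assumptions made for illustration, since the abstract grammar itself defines what each node must store.

```python
class Node:
    """Base class for all AST nodes; abstract methods can be added later."""


class Expression(Node):
    """Category for everything that yields a value."""


class Condition(Node):
    """A comparison of two expressions."""

    def __init__(self, left, relation, right):
        self.left = left
        self.relation = relation
        self.right = right


class Instruction(Node):
    """Category for instructions; next encodes source order."""

    def __init__(self):
        self.next = None


class Assign(Instruction):
    def __init__(self, location, expression):
        super().__init__()
        self.location = location
        self.expression = expression


class If(Instruction):
    def __init__(self, condition, instructions_true, instructions_false=None):
        super().__init__()
        self.condition = condition
        self.instructions_true = instructions_true    # assumed name
        self.instructions_false = instructions_false  # assumed name


class Repeat(Instruction):
    def __init__(self, condition, instructions):
        super().__init__()
        self.condition = condition
        self.instructions = instructions


class Read(Instruction):
    def __init__(self, location):
        super().__init__()
        self.location = location


class Write(Instruction):
    def __init__(self, expression):
        super().__init__()
        self.expression = expression


def chain(instructions):
    """Link a non-empty list of instructions through next; return the first."""
    for current, following in zip(instructions, instructions[1:]):
        current.next = following
    return instructions[0]
```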

For expressions you can proceed in a similar fashion. For locations, however, things get a little tricky. I don’t want to give you too many hints, but I should at least give one slightly bigger example. Consider the designator in the following assignment instruction:

a[1].x[2].y := -20

Here is the corresponding AST:

Obviously encoding all those selectors can get a little complicated. Notice that pointers are labeled in the AST to make it a little easier to understand: the label ST indicates that an AST node points to an object of one of the Entry classes from the symbol table assignment (although not all those objects are actually part of the symbol table); the labels location and expression indicate whether we are interested in an address or a value as it were (note that expression pointers can refer to locations, but location pointers cannot refer to expressions).

For purposes of type-checking you should give AST nodes derived from Expression a member variable that stores the type associated with the node as well. (These type pointers are not shown in the AST above, but they must be there). The type of a must be an array of some sort, otherwise indexing would not make sense. The type of a[1] must be a record of some sort in which a field x is declared. The type of a[1].x must again be an array, and the type of a[1].x[2] must again be a record, this time with a field y declared. The type of the entire designator a[1].x[2].y is the type of field y. (If we take the assignment into account as well, the type of y must actually be integer.)
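The chain of type checks for a[1].x[2].y can be sketched as follows. This is a simplified Python illustration: the classes Integer, Array, and Record stand in for your symbol table type classes, and apply_index, apply_field, and DesignatorError are hypothetical names. The index expressions themselves are not checked here.

```python
class Integer:
    pass


class Array:
    def __init__(self, element_type, length):
        self.element_type = element_type
        self.length = length


class Record:
    def __init__(self, fields):
        self.fields = fields  # maps field name to field type


class DesignatorError(Exception):
    """Hypothetical error for selectors that do not fit the type."""


def apply_index(base_type):
    """Type of base[...]: only valid if base is an array."""
    if not isinstance(base_type, Array):
        raise DesignatorError("indexing requires an array")
    return base_type.element_type


def apply_field(base_type, name):
    """Type of base.name: only valid if base is a record declaring name."""
    if not isinstance(base_type, Record) or name not in base_type.fields:
        raise DesignatorError("no such record field")
    return base_type.fields[name]


# Types matching the designator a[1].x[2].y from the example above.
INTEGER = Integer()
inner = Record({"y": INTEGER})
outer = Record({"x": Array(inner, 3)})
a_type = Array(outer, 2)

t = apply_index(a_type)    # type of a[1]: a record with field x
t = apply_field(t, "x")    # type of a[1].x: an array
t = apply_index(t)         # type of a[1].x[2]: a record with field y
t = apply_field(t, "y")    # type of a[1].x[2].y: the type of field y
```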

Besides the basic methods to create objects of these classes, you should make sure that all of them have a toString method of some sort, similar to what you did for tokens and symbol table entries before. Note, however, that these methods are mostly for debugging; you should create the actual output differently (see below). While you hack all these classes, you should also test them in isolation using simple unit tests; all the bugs you find here will not distract you later on, which is a good thing. (You can save yourself a lot of trouble if you use the type system of your implementation language to ensure that certain invalid ASTs cannot be built in the first place.)

Building the Abstract Syntax Tree

The next step is to extend the parser methods for instructions, expressions, and conditions to create the proper AST nodes and subtrees. Each method should return the “top node” of the subtree it creates.

For example, in the method Condition you first call Expression. This will parse the required number of tokens and return a pointer to an Expression node representing what it just recognized. Then you match one of the allowed relations and remember it. Then you call Expression again, getting a pointer to another Expression node, this time for the right-hand side of the comparison. Now you have all the “ingredients” to create a Condition node, filling in the left and right subtrees as well as the relation being checked. The method Condition then returns a pointer to the node it just created, to be used by whatever method called Condition in the first place (the parser method Repeat for example). Thus the AST is built “bottom-up” as you parse the program text top-down and left to right.
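The steps just described might look roughly like this in Python. The sketch is deliberately tiny: tokens are plain strings, expression accepts a single token and wraps it in a stand-in Leaf node, and the tuple RELATIONS is an assumption (check the concrete grammar for the actual set of relations). Only the shape of the condition method is the point.

```python
RELATIONS = ("=", "#", "<", ">", "<=", ">=")  # assumed set of relations


class ParseError(Exception):
    pass


class Leaf:
    """Stand-in for a real Expression node."""

    def __init__(self, text):
        self.text = text


class Condition:
    def __init__(self, left, relation, right):
        self.left = left
        self.relation = relation
        self.right = right


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def current(self):
        """Return the current token, or None at end of input."""
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return None

    def expression(self):
        """Simplified: consume one token and return a node for it."""
        token = self.current()
        if token is None or token in RELATIONS:
            raise ParseError("expression expected")
        self.pos += 1
        return Leaf(token)

    def condition(self):
        left = self.expression()       # subtree for the left-hand side
        relation = self.current()      # remember the relation
        if relation not in RELATIONS:
            raise ParseError("relation expected")
        self.pos += 1
        right = self.expression()      # subtree for the right-hand side
        return Condition(left, relation, right)  # top node of this subtree


node = Parser(["i", ">=", "sz"]).condition()
```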

In this way, each of these methods returns the subtree it recognized and its caller can “hook” that subtree into a larger one it in turn passes back. The tree for the complete program will be returned by the call to Instructions within Program. Note that in the case of Selector you will not just return a tree, but you also pass a tree as a parameter: Otherwise Selector would not know what it is being applied to, and you could not enforce the necessary context conditions. You should test your extensions with a number of simple (!) SIMPLE (!!) programs before you move on.

Constant Folding

In the methods parsing expressions you should now add constant folding as described in lecture. Obviously only literal numbers and identifiers that refer to symbol table constants are indeed constant. In both of these cases, you can return a Number node with a pointer to a Constant object.

Before you produce Binary nodes, however, you should check whether both sides are constant. If they are indeed, you perform the operation directly and return a Number node with a pointer to a Constant filled with the result of the operation. Thus an expression with only constant parts will in fact not lead to a convoluted tree, but just to a single Number node with the final value already computed.
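A minimal sketch of this check in Python. The helper make_binary is a hypothetical name, and only +, -, and * are handled, since the exact rules for the remaining operators (division by zero, for example) come from the context conditions rather than from this sketch.

```python
class Constant:
    """Stand-in for the symbol table's Constant entry."""

    def __init__(self, value):
        self.value = value


class Number:
    def __init__(self, constant):
        self.constant = constant


class Variable:
    def __init__(self, name):
        self.name = name


class Binary:
    def __init__(self, operator, left, right):
        self.operator = operator
        self.left = left
        self.right = right


OPERATIONS = {
    "+": lambda x, y: x + y,
    "-": lambda x, y: x - y,
    "*": lambda x, y: x * y,
}


def make_binary(operator, left, right):
    """Return a folded Number if both sides are constant, else a Binary."""
    if isinstance(left, Number) and isinstance(right, Number):
        value = OPERATIONS[operator](left.constant.value, right.constant.value)
        return Number(Constant(value))
    return Binary(operator, left, right)


# (2 + 3) * 4 folds step by step into a single Number node.
inner = make_binary("+", Number(Constant(2)), Number(Constant(3)))
folded = make_binary("*", inner, Number(Constant(4)))
```

The parser methods for expressions would call something like make_binary wherever they previously would have created a Binary node directly.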

Once constant folding works, you can “hook up” the AST with the ST from the previous assignment. Where you assumed the value five before, for example in Type when parsing an ARRAY constructor, you can now actually use the result of Expression. For array types, you must ensure that the expression you get back is indeed a Number node, that its value is greater than zero (see context conditions!), and then fill in that value in your Array object.
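The array-length check could be sketched like this in Python. The function array_length and the exception ContextError are hypothetical names, and the Number and Constant classes are minimal stand-ins for your own.

```python
class Constant:
    def __init__(self, value):
        self.value = value


class Number:
    def __init__(self, constant):
        self.constant = constant


class Variable:
    def __init__(self, name):
        self.name = name


class ContextError(Exception):
    """Hypothetical error for violated context conditions."""


def array_length(expression):
    """Return the length for an ARRAY type, or raise ContextError."""
    if not isinstance(expression, Number):
        raise ContextError("array length must be constant")
    length = expression.constant.value
    if length <= 0:
        raise ContextError("array length must be greater than zero")
    return length
```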

Note that this approach to constant folding is far from perfect. For example, the expression 1+a+3 where a is a variable will not be transformed into a+4 as you might wish. We only fold adjacent nodes, not entire expressions; however, that’s good enough for our purposes here: ensuring that expressions that must be constant are actually constant.

Error Handling

The advice from earlier assignments about using exceptions for error handling is still in effect, as is the required format for your error messages:

error: some helpful description

Enforcing context conditions for SIMPLE programs will lead to a number of “new” errors, for example when the types on both sides of an assignment instruction do not agree. If you followed the advice for error handling on previous assignments, you should have little trouble handling those new errors.

Since all that matters for automated grading (for undergraduates anyway) is that you actually detect errors, the error messages themselves would not have to make any sense. However, you might want to try to make your error messages as informative as possible, for example by including details about where an error has occurred. As the context conditions we enforce get more complex, you are really doing yourself a favor if you have decent error messages. For example, if you detect a missing record field, instead of saying

error: no such record field

you may want to at least say

error: no such record field @ (138, 138)

or even better

error: the designator "a[1].x[2]" @ (128, 136) does not refer to a record with a field "y" @ (138, 138)

Again, for undergraduates these “fancy” error messages are not required, but they will help you quite a bit as you debug your compiler.
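One way to get both the required prefix and optional position information is a single exception class that formats itself. The following Python sketch is only an illustration; the class name CompilerError and its attributes are assumptions.

```python
class CompilerError(Exception):
    """Hypothetical exception carrying a description and a source range."""

    def __init__(self, description, start=None, end=None):
        super().__init__(description)
        self.description = description
        self.start = start
        self.end = end

    def __str__(self):
        text = "error: " + self.description
        if self.start is not None:
            # Append the position in the "@ (start, end)" style shown above.
            text += " @ ({}, {})".format(self.start, self.end)
        return text


plain = str(CompilerError("no such record field"))
located = str(CompilerError("no such record field", 138, 138))
```

The driver can then catch CompilerError in one place and print str(error) unchanged.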

Abstract Syntax Tree Output

Once you have successfully built the abstract syntax tree for an input program, you must produce output that illustrates its structure. Your starting point for the output should be the root of the AST you built for the program itself, which will output the entire program. Here is an example program:

PROGRAM X;
CONST
  sz = 47;
VAR
  a: ARRAY sz OF INTEGER;
  i: INTEGER;
BEGIN
  i := 0;
  REPEAT
    a[i] := 64738
  UNTIL i >= sz END
END X.

The textual output for this program should be as follows:

instructions =>
  Assign:
  location =>
    Variable:
    variable =>
      VAR BEGIN
        type:
          INTEGER
      END VAR
  expression =>
    Number:
    value =>
      CONST BEGIN
        type:
          INTEGER
        value:
          0
      END CONST
  Repeat:
  condition =>
    Condition (>=):
    left =>
      Variable:
      variable =>
        VAR BEGIN
          type:
            INTEGER
        END VAR
    right =>
      Number:
      value =>
        CONST BEGIN
          type:
            INTEGER
          value:
            47
        END CONST
  instructions =>
    Assign:
    location =>
      Index:
      location =>
        Variable:
        variable =>
          VAR BEGIN
            type:
              ARRAY BEGIN
                type:
                  INTEGER
                length:
                  47
              END ARRAY
          END VAR
      expression =>
        Variable:
        variable =>
          VAR BEGIN
            type:
              INTEGER
          END VAR
    expression =>
      Number:
      value =>
        CONST BEGIN
          type:
            INTEGER
          value:
            64738
        END CONST

As you can see, at certain places you will have to include output for pieces of the symbol table as well; hopefully your solution for the previous assignment is modular enough to handle this. This form of output is slightly inaccurate, but at least it’s fairly easy to implement. I suggest that you use the Visitor design pattern to traverse the AST once it is built. Inside your visitor you can maintain an indentation count as well as a string buffer in which you assemble the output. You can then print from your driver after a successful parse. (If you don’t want to use the Visitor pattern you can achieve the same results by using a few mutually recursive functions instead.) You must have textual output in this form to get points for the assignment.
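A reduced sketch of such a visitor in Python, covering only Assign, Variable, and Number and assuming every variable and constant has type INTEGER. In a real solution the VAR and CONST blocks would come from your symbol table output code rather than being hard-wired; the class and method names here are hypothetical.

```python
class Variable:
    def accept(self, visitor):
        visitor.visit_variable(self)


class Number:
    def __init__(self, value):
        self.value = value

    def accept(self, visitor):
        visitor.visit_number(self)


class Assign:
    def __init__(self, location, expression):
        self.location = location
        self.expression = expression
        self.next = None

    def accept(self, visitor):
        visitor.visit_assign(self)


class TextVisitor:
    def __init__(self):
        self.lines = []  # string buffer, one entry per output line
        self.depth = 0   # current indentation level (two spaces each)

    def emit(self, text):
        self.lines.append("  " * self.depth + text)

    def child(self, label, node):
        self.emit(label + " =>")
        self.depth += 1
        node.accept(self)
        self.depth -= 1

    def visit_instructions(self, first):
        self.emit("instructions =>")
        self.depth += 1
        node = first
        while node is not None:  # follow the next pointers
            node.accept(self)
            node = node.next
        self.depth -= 1

    def visit_assign(self, node):
        self.emit("Assign:")
        self.child("location", node.location)
        self.child("expression", node.expression)

    def visit_variable(self, node):
        self.emit("Variable:")
        self.emit("variable =>")
        self.depth += 1
        for line in ("VAR BEGIN", "  type:", "    INTEGER", "END VAR"):
            self.emit(line)
        self.depth -= 1

    def visit_number(self, node):
        self.emit("Number:")
        self.emit("value =>")
        self.depth += 1
        for line in ("CONST BEGIN", "  type:", "    INTEGER",
                     "  value:", "    " + str(node.value), "END CONST"):
            self.emit(line)
        self.depth -= 1

    def result(self):
        return "\n".join(self.lines)


# The instruction i := 0 from the example program.
visitor = TextVisitor()
visitor.visit_instructions(Assign(Variable(), Number(0)))
output = visitor.result()
```

For this single instruction, output reproduces the first 18 lines of the expected text shown above.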

Problem 3: Graphical Abstract Syntax Tree (10%)

While it is possible to recognize the structure of the abstract syntax tree in the textual format described above, it is much easier to see if we actually draw the AST:

As an added benefit, the graphical representation is also 100% accurate: It gives an (almost) exact representation of what the data structures should look like in memory.

This drawing was again rendered using DOT, a tool that takes a textual description of a graph and generates cute PostScript or PNG figures. See http://www.graphviz.org/ for more information on DOT and related tools. Note that you simply write the DOT format to the standard output; we will take care of actually rendering the graphics if we want to look at your abstract syntax tree while grading.

You have to extend the driver program with an option -g to indicate that -a should produce graphical output instead of textual output. So the modified command line would be as follows:

Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] ["-g"] [filename] .

For now it is an error to supply -g by itself or with any option other than -c, -t, or -a. Please follow the same format we used above: nodes in the AST should be rectangles. Please do not include the complete symbol table output, since things would get too complicated; instead just indicate variables and constants as shown, and do not follow type pointers any further.
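A minimal sketch of how DOT text with rectangular nodes could be assembled, again in Python. The class DotWriter, the node names n0, n1, and so on, and the graph name AST are all assumptions; the required layout is the one shown in the drawing, which this sketch does not attempt to reproduce exactly.

```python
class DotWriter:
    """Collects DOT statements for a directed graph with box-shaped nodes."""

    def __init__(self):
        self.lines = ["digraph AST {", "  node [shape=box];"]
        self.count = 0

    def node(self, label):
        """Declare a new node and return its generated name."""
        name = "n{}".format(self.count)
        self.count += 1
        # Labels containing double quotes would need escaping first.
        self.lines.append('  {} [label="{}"];'.format(name, label))
        return name

    def edge(self, source, target, label):
        """Declare a labeled edge between two previously declared nodes."""
        self.lines.append(
            '  {} -> {} [label="{}"];'.format(source, target, label))

    def finish(self):
        return "\n".join(self.lines + ["}"])


# A tiny tree: an Assign node with its location and expression children.
writer = DotWriter()
assign = writer.node("Assign")
variable = writer.node("Variable")
number = writer.node("Number")
writer.edge(assign, variable, "location")
writer.edge(assign, number, "expression")
dot_text = writer.finish()
```

Printing dot_text to standard output yields a description that the dot tool can render.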

Graduate Level Requirements

If you are taking this course at the graduate level, the additional error-handling requirements from Assignments 2 and 3 are still in effect. You are, however, once again faced with new problems now that we’re building an AST and using it to check all (static) context conditions of the language. I won’t belabor the details here as I did on previous assignments; the point is that you must be able to keep parsing even after a semantic error such as a missing record field in a designator or mismatched types in an assignment. Needless to say, you need accurate position information for all your error messages, and ideally your error messages show where the actual problem is by including relevant pieces of source text.

The good news is that after this assignment, you are mostly done with additional requirements regarding error handling. Of course there will be new graduate-level requirements to compensate.

Deliverables

Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no executable files), but allows building all required derived files. Also make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!

Regardless of your programming language of choice, we expect to build your project using make (if it needs building at all) and we expect to run your project using ./sc (which stands for “SIMPLE compiler”). You are free to use the standard library for your language of choice, except for modules/classes that allow you to avoid writing large parts of the code for an assignment; so no regular expressions, no parsing combinators, etc. Depending on your language of choice, compliance with certain tools (e.g. checkstyle or valgrind), compiler flags, or additional style guides may also be required; see Piazza for details.

Grading

For reference, here is a short explanation of the grading criteria; not all of the criteria apply to all problems on a given assignment, and not all of the assignments even use all of the criteria.

Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments on Piazza.

Style refers to programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for.

Design refers to proper modularization (into functions, classes, modules, etc.) and the proper choice of algorithms and data structures.

Performance refers to how fast/with how little memory your project can produce the required results compared to other submissions; in this course this can mean your actual compiler or interpreter as well as the code generated by it.

Functionality refers to your programs being able to do what they should according to the specification given above. (It also refers to you simply doing the required work, which may not be programming alone.) If the specification is ambiguous, ask for clarification! If no clarification is forthcoming, defend the choices you have made in your README file.

If your project cannot be built, or if it is otherwise obvious that you never tested it, you will get no points whatsoever. If your project cannot be built without warnings using the required compiler options we will take off 10%. If your programs cannot be built using make we will take off 10%. If valgrind detects memory errors in your programs, we will take off 10%. If your project fails miserably even once, i.e. terminates with an exception of any kind or dumps core, we will take off 10%. Presumably you see the pattern here?