Assignment 2: Syntactic Analysis

Overview

In the second part of the compiler project, you are going to develop a parser to perform syntactic analysis of SIMPLE programs. You are also going to extend your driver program with options to display the concrete syntax tree of a SIMPLE program. You can get the complete EBNF grammar for the SIMPLE programming language here.

Problem 1: Driver Program (10%)

The final compiler will consist of a number of modules and classes working together to translate programs written in SIMPLE into equivalent programs written in assembly language. While these “bits and pieces” are spread out over the entire semester, you can already implement the basic driver program that will orchestrate their work. The driver will be called sc and is invoked from the shell as follows:

Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] [filename] .

This describes the syntax of the command line in EBNF. After ./sc itself, the user can supply one option (introduced by “-”) to tell the driver which parts of the compiler to run and what kind of output to produce.

Arguments and Options

With this assignment, the option -c is allowed on the command line for sc. The option -s is unchanged from Assignment 1. All remaining options, including “no options at all,” should still result in errors.

For this assignment, the option -c is supposed to run the parser and to display the concrete syntax tree (aka parse tree) for the given input. If -c is given but an error is detected, the (partial) parse tree should not be output.

If a second argument is given, it is assumed to be the file name of a SIMPLE program to process. If no filename is given, you should read the program from standard input instead. Eventually this will also determine whether the output goes to standard output or to a file, but for now all your output goes to standard output.
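
As a rough illustration of the rules above, the following sketch interprets the command line for this assignment. The names parse_args, read_source, and UsageError are hypothetical, and the exact wording of the messages is up to you; the -g option from Problem 3 is left out here.

```python
import sys

ALLOWED = {"-s", "-c"}                  # options accepted as of this assignment
KNOWN = {"-s", "-c", "-t", "-a", "-i"}  # all options named by the EBNF


class UsageError(Exception):
    """Hypothetical exception type for command line problems."""


def parse_args(argv):
    """Split argv (without the program name) into (option, filename).

    filename is None when the program should be read from standard input.
    """
    args = list(argv)
    option = None
    if args and args[0].startswith("-"):
        option = args.pop(0)
    if option is None:
        raise UsageError("no option given")
    if option not in KNOWN:
        raise UsageError("unknown option " + option)
    if option not in ALLOWED:
        raise UsageError("option " + option + " is not supported yet")
    if len(args) > 1:
        raise UsageError("too many arguments")
    filename = args[0] if args else None
    return option, filename


def read_source(filename):
    """Read the whole program from the named file, or from standard input."""
    if filename is None:
        return sys.stdin.read()
    with open(filename) as handle:
        return handle.read()
```

A driver would call parse_args(sys.argv[1:]) and turn a UsageError into an error message in the required format.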

Of course the parser will need to access the scanner, so you need to decide how the two will communicate. The two main options are (a) using all() from your driver and passing the list of tokens to the parser’s constructor, or (b) passing the scanner object to the parser’s constructor and having the parser call next() whenever it needs a new token.

Problem 2: Parser (80%)

The parser for SIMPLE reads the source program as a sequence of tokens and recognizes its structure in the form of a parse tree. Note that this tree is not constructed explicitly! You should use the method of recursive descent to implement the parser, and thus the tree will only exist in the form of the call stack, i.e. in the record of which parser method called which other parser method.

Consider the If production for example, and consider the following source program fragment (in terms of tokens):

... IF c = 10 THEN ...

The parser should call the method to match an If, which will then match the token IF, call the method to match a Condition (which will call others in turn and so on), match the token THEN, and so on. Implicitly, the parser thus “built” the following parse tree by matching terminals and calling methods to match non-terminals:

...
If
  IF@(43,44)
  Condition
    Expression
      Term
        Factor
          Designator
            identifier<c>@(46,46)
            Selector
    =@(48,48)
    Expression
      Term
        Factor
          integer<10>@(50,51)
  THEN@(53,56)
...

This is actually the “first taste” of what kind of output you need to generate for this assignment. Before you start hacking the parser, you should study the grammar carefully and identify the non-terminals of the language. These are the productions you need to recognize, and for each of them you will write a method in the parser class. Also, you might want to make sure that the grammar can be parsed using recursive descent before you start the implementation, i.e. check whether it meets the conditions for an LL(1) grammar.

I suggest implementing a class Parser which should offer only one public function: The function parse which processes a complete SIMPLE program before it returns. Note that the function should not return any value: In later assignments, the parser will be extended to build a number of data structures (the intermediate representation) instead. All the actual parsing methods should be private (or at least protected), and the first one you should call from parse is of course Program. You might also want to implement some helper methods, for example the method match() I described in the lecture.
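
To make the suggested shape concrete, here is a minimal sketch of such a class. It assumes option (a), a token list that ends in an "eof" token, and a hypothetical Token tuple. The productions cover only a simplified fragment modeled on the sample programs in this handout; they are not the real SIMPLE grammar, which you should take from the EBNF.

```python
from collections import namedtuple

# Hypothetical token shape; your scanner from Assignment 1 defines the real one.
Token = namedtuple("Token", "kind value start end")


class ParserError(Exception):
    """Raised when the token read is not the token expected."""


class Parser:
    def __init__(self, tokens):
        self._tokens = tokens   # assumed to end with a token of kind "eof"
        self._pos = 0

    def parse(self):
        """Process a complete program; deliberately returns nothing."""
        self._program()
        self._match("eof")

    def _current(self):
        return self._tokens[self._pos]

    def _match(self, kind):
        """Consume the current token if it has the expected kind."""
        token = self._current()
        if token.kind != kind:
            raise ParserError("expected " + kind + " but found " + token.kind)
        if kind != "eof":
            self._pos += 1
        return token

    # One private method per non-terminal. The bodies below are a
    # simplified fragment for illustration only.
    def _program(self):
        self._match("PROGRAM")
        self._match("identifier")
        self._match(";")
        self._declarations()
        self._match("END")
        self._match("identifier")
        self._match(".")

    def _declarations(self):
        while self._current().kind == "VAR":
            self._var_decl()

    def _var_decl(self):
        self._match("VAR")
        self._match("identifier")
        self._match(":")
        self._match("identifier")   # stands in for the Type production
        self._match(";")
```

Note how the call chain parse, _program, _declarations, _var_decl mirrors the nesting of the parse tree without any tree object being built.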

Aside from implementing the actual parser, an important concern is how we get the textual representation of the parse tree out of the parser and to the user. There are a number of options. The simplest would be to just hardcode a bunch of prints all over the parser. However, this is not a good design if we want to reuse the parser in a variety of contexts. Another option would be to build a big string in a member variable during parsing, and to add a method returning that string to the Parser class. The driver could then ask for this string after the parse method returned and print it. There’s also the option of passing the actual stream (e.g. std::cout) from the driver to the parser as a parameter, and to have the parser write its output there. However, the best option is probably to use the Observer pattern as described in lecture to separate output of the parse tree almost entirely from the actual parsing process. (Kudos to Michael Kurth (Spring 2004) who came up with this use of the Observer pattern.) Whatever you decide to do, please be aware that it must be possible to switch the output of the parse tree on and off without recompiling the code; in later assignments, you can’t simply remove the output code, it is still needed when the user gives the -c option!
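
The Observer idea can be sketched roughly as follows. The callback names begin, end, and terminal, as well as the Notifier class standing in for the relevant part of the parser, are assumptions made for this illustration and may differ from what was shown in lecture.

```python
class TextObserver:
    """Collects the indented textual parse tree, two spaces per level."""

    def __init__(self):
        self.lines = []
        self._depth = 0

    def begin(self, nonterminal):
        self.lines.append("  " * self._depth + nonterminal)
        self._depth += 1

    def end(self, nonterminal):
        self._depth -= 1

    def terminal(self, text):
        self.lines.append("  " * self._depth + text)


class Notifier:
    """The part of a parser that talks to observers (hypothetical names)."""

    def __init__(self):
        self._observers = []

    def attach(self, observer):
        self._observers.append(observer)

    def notify(self, event, argument):
        for observer in self._observers:
            getattr(observer, event)(argument)


# What a parser might report while recognizing "VAR i: INTEGER;" (abridged).
parser = Notifier()
text = TextObserver()
parser.attach(text)             # the driver does this only when -c is given
parser.notify("begin", "VarDecl")
parser.notify("terminal", "VAR@(11, 13)")
parser.notify("begin", "IdentifierList")
parser.notify("terminal", "identifier<i>@(15, 15)")
parser.notify("end", "IdentifierList")
parser.notify("terminal", ":@(16, 16)")
parser.notify("end", "VarDecl")
print("\n".join(text.lines))
```

With no observer attached, the same notify calls do nothing, which gives you the required on/off switch without recompiling. Collecting the lines first and printing them only after parse returns also makes it easy to withhold a partial tree when an error is detected.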

Here is a sample interaction with the SIMPLE compiler:

$ ./sc -c
PROGRAM X; VAR i: INTEGER; END X.
Program
  PROGRAM@(0, 6)
  identifier<X>@(8, 8)
  ;@(9, 9)
  Declarations
    VarDecl
      VAR@(11, 13)
      IdentifierList
        identifier<i>@(15, 15)
      :@(16, 16)
      Type
        identifier<INTEGER>@(18, 24)
      ;@(25, 25)
  END@(27, 29)
  identifier<X>@(31, 31)
  .@(32, 32)

As before, the first line shows the shell prompt and the user starting your driver program. The next line shows what the user is typing as input, terminated by a newline and “end of file” from the terminal. The following lines are the parse tree for this particular program; note how we indent lines that represent children of a particular “goal” during parsing by two spaces.

Error Handling

If you detect an error, whether in this or future assignments, you should output (to standard error of course) an error message in the following form:

error: some helpful description

You must output the string “error:” on a new line, followed by one blank, followed by whatever text makes sense for the error in question. Our automated grading suite relies on this format and you’ll get penalized if you do something else.

The advice from Assignment 1 about using exceptions for error handling is still in effect, as is the required format of such messages as given above.
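
One possible way to combine exceptions with the required message format is to catch them in a single place in the driver. In this sketch, CompilerError and run are hypothetical names, and the return value of 1 on failure is an assumption, since this handout does not prescribe an exit status.

```python
import sys


class CompilerError(Exception):
    """Hypothetical base class for all errors the compiler reports."""


def run(action, stream=None):
    """Call action(); on failure print one formatted line and return 1."""
    stream = sys.stderr if stream is None else stream
    try:
        action()
    except CompilerError as problem:
        # "error:" at the start of a line, one blank, then the description.
        stream.write("error: " + str(problem) + "\n")
        return 1
    return 0
```

The scanner and parser then only raise exceptions; neither of them needs to know how or where messages are printed.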

For this assignment, a large number of errors will be detected in the match function: When the token actually read does not correspond to the expected token, you can raise an exception and thus produce an error. However, sometimes you expect one out of a number of possible tokens, for example inside the Condition production. Instead of doing these tests “by hand” you should consider writing a “smart” match function that takes a list of possible tokens and not only matches them, but also returns the token actually encountered (as we will need that token for later assignments).
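
A “smart” match along these lines might look like the sketch below. The Token tuple and the TokenStream class are hypothetical, and the relational operators in the usage example are an illustrative subset; take the real set from the Condition production in the grammar.

```python
from collections import namedtuple

# Hypothetical token shape; your scanner defines the real one.
Token = namedtuple("Token", "kind value start end")


class ParserError(Exception):
    """Raised when the token read is none of the expected tokens."""


class TokenStream:
    def __init__(self, tokens):
        self._tokens = tokens   # assumed to end with a token of kind "eof"
        self._pos = 0

    def match(self, *kinds):
        """Accept any one of the given kinds and return the token found."""
        token = self._tokens[self._pos]
        if token.kind not in kinds:
            expected = " or ".join(kinds)
            raise ParserError("expected " + expected + " but found " + token.kind)
        if token.kind != "eof":
            self._pos += 1
        return token


stream = TokenStream([
    Token("<", None, 48, 48),
    Token("integer", 10, 50, 51),
    Token("eof", None, 52, 52),
])
relation = stream.match("=", "<", ">")   # illustrative subset of operators
print(relation.kind)
```

Returning the token lets the caller inspect which alternative was actually seen, which later assignments will need.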

Problem 3: Graphical Parse Trees (10%)

Consider the following example program:

PROGRAM As3;
CONST x = -47;
TYPE T = RECORD f: INTEGER; END;
VAR a: ARRAY 12 OF T;
BEGIN
  a[7].f := -x
END As3.

The parse tree for this program, in its textual form as required by this assignment, is as follows:

Program
  PROGRAM@(0, 6)
  identifier<As3>@(8, 10)
  ;@(11, 11)
  Declarations
    ConstDecl
      CONST@(13, 17)
      identifier<x>@(19, 19)
      =@(21, 21)
      Expression
        -@(23, 23)
        Term
          Factor
            integer<47>@(24, 25)
      ;@(26, 26)
    TypeDecl
      TYPE@(28, 31)
      identifier<T>@(33, 33)
      =@(35, 35)
      Type
        RECORD@(37, 42)
        IdentifierList
          identifier<f>@(44, 44)
        :@(45, 45)
        Type
          identifier<INTEGER>@(47, 53)
        ;@(54, 54)
        END@(56, 58)
      ;@(59, 59)
    VarDecl
      VAR@(61, 63)
      IdentifierList
        identifier<a>@(65, 65)
      :@(66, 66)
      Type
        ARRAY@(68, 72)
        Expression
          Term
            Factor
              integer<12>@(74, 75)
        OF@(77, 78)
        Type
          identifier<T>@(80, 80)
      ;@(81, 81)
  BEGIN@(83, 87)
  Instructions
    Instruction
      Assign
        Designator
          identifier<a>@(91, 91)
          Selector
            [@(92, 92)
            ExpressionList
              Expression
                Term
                  Factor
                    integer<7>@(93, 93)
            ]@(94, 94)
            .@(95, 95)
            identifier<f>@(96, 96)
        :=@(98, 99)
        Expression
          -@(101, 101)
          Term
            Factor
              Designator
                identifier<x>@(102, 102)
                Selector
  END@(104, 106)
  identifier<As3>@(108, 110)
  .@(111, 111)

While it is possible to recognize the structure of the parse tree in this format, it is much easier to see if we actually draw the tree.

This drawing was rendered using DOT, a language and tool that takes a textual description of a graph and generates cute PostScript or PNG figures. See http://www.graphviz.org/ for more information on DOT and related tools.

If you’re using the Observer pattern to produce output already, it is not very difficult to produce these nice diagrams as well! First extend the driver program with an option -g to indicate that -c should produce graphical output instead of textual output. So the modified command line would be as follows:

Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] ["-g"] [filename] .

For now it is an error to supply -g by itself or with any option other than -c; however, later assignments will allow -g with several other options.

Now implement a second Observer class that contains the code to output DOT syntax instead of the earlier textual form. Integration with the driver is easy: If only -c is given, your driver creates an instance of the basic Observer and connects it to the parser before calling parse() on it; if -g is given as well, your driver creates an instance of the “DOT Observer” instead and connects that to the parser. Note that you can simply write DOT output to standard output and redirect into a file to render it. Please follow the same format we used above: diamonds for terminals and rectangles for non-terminals.
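
A “DOT Observer” could be sketched as follows, assuming an observer interface with hypothetical begin, end, and terminal callbacks. The node names n0, n1, and so on, the graph name ParseTree, and the result method are choices made for this illustration.

```python
class DotObserver:
    """Collects a DOT description: boxes for non-terminals, diamonds for terminals."""

    def __init__(self):
        self._lines = []
        self._stack = []      # node names of the non-terminals currently open
        self._count = 0

    def _node(self, label, shape):
        name = "n" + str(self._count)
        self._count += 1
        # Backslashes and double quotes must be escaped inside a quoted label.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        self._lines.append('  %s [label="%s", shape=%s];' % (name, escaped, shape))
        if self._stack:
            self._lines.append("  %s -> %s;" % (self._stack[-1], name))
        return name

    def begin(self, nonterminal):
        self._stack.append(self._node(nonterminal, "box"))

    def end(self, nonterminal):
        self._stack.pop()

    def terminal(self, text):
        self._node(text, "diamond")

    def result(self):
        return "digraph ParseTree {\n" + "\n".join(self._lines) + "\n}\n"


dot = DotObserver()
dot.begin("Type")
dot.terminal("identifier<T>@(80, 80)")
dot.end("Type")
print(dot.result(), end="")
```

If the output is redirected into a file, say tree.dot, then running dot -Tpng tree.dot -o tree.png from Graphviz renders it as a PNG image.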

Graduate Level Requirements

If you are taking this course at the graduate level, there are additional requirements when it comes to error handling.

First, your error messages must include accurate position information. For now the simplest way to do this is to print the erroneous tokens in such a way that positions are included (similar to what you did in the scanner driver). Later it may be necessary to “infer” compound positions, but for now a single position should be fine.

Second, you must implement the error handling technique described in lecture: After detecting a syntax error, you will suppress further error messages until at least four additional tokens have been processed. You will also handle weak and strong symbols as described in lecture. If a weak symbol (e.g. a closing parenthesis or END) is missing, the parser flags an error but then assumes that the symbol was actually present. If a syntax error not involving a weak symbol occurs, the parser flags an error and then skips tokens until it finds an appropriate strong symbol (e.g. CONST or IF). Note that the latter requires rolling back the parser’s state far enough for the strong symbol to actually resynchronize the parser with the token stream (which is why using exceptions for errors is such a good idea). As a result of these techniques, you will be able to diagnose multiple syntax errors instead of just the first one.
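
The bookkeeping behind these rules can be sketched as below. This is only an approximation under several assumptions: the Recovery class and its method names are hypothetical, tokens are reduced to their kinds, skipped tokens count as “processed,” and every detected error (reported or not) resets the distance counter. The rollback via exceptions is not shown; follow the lecture where it differs.

```python
class Recovery:
    MIN_DISTANCE = 4   # tokens that must be processed before reporting again

    def __init__(self, kinds):
        self._kinds = list(kinds) + ["eof"]
        self._pos = 0
        self._distance = self.MIN_DISTANCE   # so the first error is reported
        self.messages = []

    def _flag(self, message):
        """Record the message unless a recent error is still being suppressed."""
        if self._distance >= self.MIN_DISTANCE:
            self.messages.append("error: " + message)
        self._distance = 0

    def _advance(self):
        self._pos += 1
        self._distance += 1

    def match_weak(self, kind):
        """A missing weak symbol is flagged, then treated as if present."""
        if self._kinds[self._pos] == kind:
            self._advance()
        else:
            self._flag("missing " + kind)

    def skip_to(self, strong):
        """Skip tokens until a strong symbol (or end of input) is current."""
        while self._kinds[self._pos] not in strong and self._kinds[self._pos] != "eof":
            self._advance()


r = Recovery(["a", "b", "c", "d", "IF"])
r.match_weak(")")        # reported: first error
r.match_weak("END")      # suppressed: no tokens processed since the last error
r.skip_to({"IF"})        # four tokens skipped, IF is now current
r.match_weak("THEN")     # reported again: four tokens have been processed
print(r.messages)
```

In a full parser, skip_to would be called from the handler that catches the syntax error exception, at a level of the call stack where the strong symbol can actually start a production.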

Note that the rule of not producing a parse tree after a syntax error still applies.

Deliverables

Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no executable files), but allows building all required derived files. Also make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!

Regardless of your programming language of choice, we expect to build your project using make (if it needs building at all) and we expect to run your project using ./sc (which stands for “SIMPLE compiler”). You are free to use the standard library for your language of choice, except for modules/classes that allow you to avoid writing large parts of the code for an assignment; so no regular expressions, no parsing combinators, etc. Depending on your language of choice, compliance with certain tools (e.g. checkstyle or valgrind), compiler flags, or additional style guides may also be required; see Piazza for details.

Grading

For reference, here is a short explanation of the grading criteria; not all of the criteria apply to all problems on a given assignment, and not all of the assignments even use all of the criteria.

Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments on Piazza.

Style refers to programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for.

Design refers to proper modularization (into functions, classes, modules, etc.) and the proper choice of algorithms and data structures.

Performance refers to how fast/with how little memory your project can produce the required results compared to other submissions; in this course this can mean your actual compiler or interpreter as well as the code generated by it.

Functionality refers to your programs being able to do what they should according to the specification given above. (It also refers to you simply doing the required work, which may not be programming alone.) If the specification is ambiguous, ask for clarification! If no clarification is forthcoming, defend the choices you have made in your README file.

If your project cannot be built, or if it is otherwise obvious that you never tested it, you will get no points whatsoever. If your project cannot be built without warnings using the required compiler options, we will take off 10%. If your programs cannot be built using make, we will take off 10%. If valgrind detects memory errors in your programs, we will take off 10%. If your project fails miserably even once, i.e. terminates with an exception of any kind or dumps core, we will take off 10%. Presumably you see the pattern here?