Assignment 1: Lexical Analysis

Overview

In the first part of the compiler project, you are going to develop a scanner to perform lexical analysis of SIMPLE programs. You are also going to hack the basic driver program which you will use for the rest of the semester to integrate more and more components of your compiler. You can get the complete concrete grammar for the SIMPLE programming language here.

Problem 1: Driver Program (20%)

The final compiler will consist of a number of modules and classes working together to translate programs written in SIMPLE into equivalent programs written in assembly language. While these “bits and pieces” are spread out over the entire semester, you can already implement the basic driver program that will orchestrate their work. The driver will be called sc and is invoked from the shell as follows:

Invocation = "./sc" ["-" ("s"|"c"|"t"|"a"|"i")] [filename] .

This describes the syntax of the command line in EBNF. After ./sc itself, the user can supply one option (introduced by “-”) to tell the driver which parts of the compiler to run and what kind of output to produce.

Arguments and Options

For this assignment, the only relevant option is “-s”, which runs the scanner by itself and produces a list of recognized tokens. If another option, or no option at all, is given, you should abort with an error message (see below). Eventually “no option” will mean “generate code” as it does for “real” compilers.

If a second argument is given, it is assumed to be the file name of a SIMPLE program to process. If no filename is given, you should read the program from standard input instead. Eventually this will also determine whether the output goes to standard output or to a file, but for now all your output goes to standard output.
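As a rough illustration of the rules above, here is one way the argument handling could look in Python. The names parse_command_line, read_source, and DriverError are made up for this sketch, and rejecting extra arguments is an assumption that the description above does not spell out.

```python
import sys


class DriverError(Exception):
    """Hypothetical exception type for command-line problems."""


def parse_command_line(arguments):
    """Return (option, filename) for the arguments that follow ./sc.

    filename is None when the program should be read from standard input.
    """
    remaining = list(arguments)
    option = None
    if remaining and remaining[0].startswith("-"):
        option = remaining.pop(0)
    if option is None:
        raise DriverError("no option given; only -s is supported so far")
    if option != "-s":
        raise DriverError("unsupported option " + option)
    if len(remaining) > 1:
        # Assumption: anything after the file name is treated as an error.
        raise DriverError("too many arguments")
    filename = remaining[0] if remaining else None
    return option, filename


def read_source(filename):
    """Read the whole program from the named file, or from standard input."""
    if filename is None:
        return sys.stdin.read()
    with open(filename) as handle:
        return handle.read()
```

A driver would call parse_command_line(sys.argv[1:]) and pass the resulting file name to read_source.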

Error Handling

If you detect an error, whether in this or future assignments, you should output (to standard error of course) an error message in the following form:

error: some helpful description

You must output the string “error:” on a new line, followed by one blank, followed by whatever text makes sense for the error in question. Our automated grading suite relies on this format and you’ll get penalized if you do something else. Note that if an error is detected in this assignment, the (partial) list of tokens should still be output.

I suggest using exceptions for error handling throughout the project, if your language of choice supports them. This way, you can simply put a big try into the main program, catch at the end, and just output “error:” followed by some string obtained from the exception object. As your compiler grows and more components can signal errors, this will be simpler than handling them one by one.
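The “big try” idea can be sketched in a few lines of Python. The function run is only a stand-in that always fails so the error path is visible, and returning a nonzero status after an error is an assumption, not something this assignment requires.

```python
import sys


def run(arguments):
    """Stand-in for the real compiler phases; it always signals an error."""
    raise ValueError("illegal character '?' at position 7")


def main(arguments, err=sys.stderr):
    """Run the compiler and report any exception in the required format."""
    try:
        run(arguments)
    except Exception as problem:
        # "error:" followed by one blank and the text from the exception.
        print("error: " + str(problem), file=err)
        return 1  # Assumption: a nonzero status signals failure.
    return 0
```

Because every component reports problems by raising an exception, new phases can be added later without touching this reporting code.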

Problem 2: Scanner (80%)

The scanner for SIMPLE reads the source program as a sequence of characters and recognizes “larger” textual units called tokens. Consider the following source program (additional blanks inserted to separate the characters, underscores represent the actual blanks):

I F _ c = 1 0 _ T H E N

The scanner would produce the following sequence of tokens in return (blanks now separate tokens):

IF c = 10 THEN

That is, a token for the keyword IF, a token for the identifier c, a token for the equals sign, a token for the integer 10, a token for the keyword THEN, and so on. Note that whereas whitespace is sometimes needed to separate tokens (for example between IF and c, which would otherwise be read as the identifier IFc), it does not itself constitute a token.

Aside from regular spaces, your scanner should also “filter out” tabs, line feeds, form feeds, carriage returns, and comments. In SIMPLE, comments start with the symbol “(*” and end with the symbol “*)”. Unlike comments in C, comments in SIMPLE can be nested, which requires a bit of extra hackery in the scanner. Much like whitespace, comments are not part of the actual grammar, but they still have to be handled/filtered properly by the scanner.
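One possible way to handle nesting is to keep a depth counter while skipping over the comment. The following Python sketch uses a hypothetical helper skip_comment; treating an unterminated comment as an error is an assumption, since the text above does not say what should happen in that case.

```python
def skip_comment(source, start):
    """Return the index just past the comment that opens at index start.

    Assumes source[start:start + 2] == "(*".
    """
    depth = 0
    position = start
    while position < len(source):
        pair = source[position:position + 2]
        if pair == "(*":
            depth += 1  # entering a (possibly nested) comment
            position += 2
        elif pair == "*)":
            depth -= 1  # leaving the innermost open comment
            position += 2
            if depth == 0:
                return position
        else:
            position += 1
    # Assumption: running out of input inside a comment is an error.
    raise ValueError("unterminated comment starting at position %d" % start)
```

For the input “(* a (* b *) c *) x” the function returns the index of the blank before x, so the inner “*)” does not end the outer comment.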

Tokens

Before you start hacking the scanner you should study the grammar carefully and identify the vocabulary of the language. Try to classify the individual tokens in a useful way (keywords, single character symbols, multi character symbols, etc.). Also, you want to note the characters that are “legal” in SIMPLE programs. If you ever encounter a character that does not have any purpose, and if that character occurs outside of a comment, then you should throw an exception (display an error, see above) that notifies the user of the problem. (This is essentially the only error message your compiler can generate so far.)

Once you know the vocabulary, you should write a class Token. Instances of this class will represent the tokens your scanner recognizes and passes on to the next phase. The class needs to be able to store a variety of information depending on the kind of token we’re dealing with. One kind of token is a keyword such as IF or WHILE, which doesn’t have any additional “semantic” information attached. Another kind of token is an identifier such as super or label1, for which you need to store (a) the fact that it is an identifier, and (b) the actual string. For numbers such as 64738 you should store their integer value.

Each token should also store the position it occurs at in the source text for use in error messages. The first character you read is position 0, the next is position 1, and so on. The position of a token consists of the positions of the first and last characters that constitute that token. (Note that processing whitespace, while not yielding tokens, will still increase the position in the file.) Your Token class should also be able to return a string (in the format described below) that explains what token it is, what value is attached (if any), and where it occurred in the source. You might want to test the Token class before you go on. (BTW, you could use a base class and separate subclasses for each token kind, but I’d discourage it in this case. Avoid over-design!)
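A single class along these lines might look like the following Python sketch. The attribute names kind, start, end, and value are illustrative choices, and the string format follows the sample session shown later in this assignment.

```python
class Token:
    """One token: its kind, an optional value, and its source positions."""

    def __init__(self, kind, start, end, value=None):
        self.kind = kind      # e.g. "IF", "identifier", "integer", "eof"
        self.start = start    # position of the token's first character
        self.end = end        # position of the token's last character
        self.value = value    # string for identifiers, int for numbers

    def __str__(self):
        text = self.kind
        if self.value is not None:
            text += "<" + str(self.value) + ">"
        return text + "@(" + str(self.start) + ", " + str(self.end) + ")"
```

For example, str(Token("identifier", 4, 9, "ics142")) yields identifier<ics142>@(4, 9), and str(Token("VAR", 0, 2)) yields VAR@(0, 2).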

Scanner

Now you are ready to implement the actual scanner. I suggest implementing it as a class Scanner which accepts a string (representing the source text) in its constructor. (Using a class for this is not 100% optimal from a principled design perspective. However, just like using exceptions for error handling, it is convenient.)

Your class should offer two public functions: a function next that returns the next token only, and a function all that returns all tokens in the source text as a list. You can call the first function repeatedly to implement the second one, and you can use a standard container class for the list. Note that once next has been called from another part of the compiler, you should not allow it to be followed by a call to all. If next is called after the whole source text has been processed, you could throw an exception. However, it will be far more useful to introduce another token kind (let’s call it eof for “end of file”) which gets returned instead of real tokens once the actual source is exhausted.
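The interplay between next, all, and the eof token can be sketched as follows in Python. To keep the example short, the private helper _scan is only a placeholder that recognizes whitespace-separated words, and tokens are plain (kind, first, last) tuples rather than Token objects; both simplifications are assumptions made for illustration, not the intended design of the recognizer.

```python
WHITESPACE = " \t\n\f\r"


class Scanner:
    def __init__(self, source):
        self.source = source
        self.position = 0
        self.next_called = False

    def next(self):
        """Return the next token; keeps returning eof once input is used up."""
        self.next_called = True
        return self._scan()

    def all(self):
        """Return every token up to and including eof as a list."""
        if self.next_called:
            raise RuntimeError("all() called after next()")
        tokens = []
        while True:
            token = self.next()
            tokens.append(token)
            if token[0] == "eof":
                return tokens

    def _scan(self):
        # Placeholder recognizer: skip whitespace, then read one "word".
        source = self.source
        while self.position < len(source) and source[self.position] in WHITESPACE:
            self.position += 1
        if self.position >= len(source):
            return ("eof", self.position, self.position)
        start = self.position
        while self.position < len(source) and source[self.position] not in WHITESPACE:
            self.position += 1
        return ("word", start, self.position - 1)
```

Since all is built on next, it also sets the flag, so in this sketch all can only be used once per Scanner object.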

To recognize the tokens themselves, you should write functions that tell you whether a given character belongs to a certain “class of characters” such as “letter” or “digit” as well as functions that handle a sequence of these (once recognized) and instantiate the appropriate token object. You also need to be able to check whether a certain sequence of “letters” is a keyword or not; if not, it is an identifier. In other words: Decompose the task of lexical analysis further into private functions, do not do everything in the next function directly.
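The decomposition described above could start out like this Python sketch. The keyword set only contains the keywords mentioned in this text and is certainly incomplete; restricting letters to ASCII and allowing digits after the first letter of an identifier are assumptions to be checked against the grammar. The helper names and the (kind, value, first, last) tuples are illustrative only.

```python
# Only the keywords that appear in this assignment text; the grammar has more.
KEYWORDS = {"IF", "THEN", "WHILE", "VAR", "ARRAY", "OF"}


def is_letter(character):
    # Assumption: only ASCII letters count as letters.
    return "a" <= character <= "z" or "A" <= character <= "Z"


def is_digit(character):
    return "0" <= character <= "9"


def scan_word(source, start):
    """Scan a keyword or identifier whose first letter is at index start."""
    end = start
    while end + 1 < len(source) and (
        is_letter(source[end + 1]) or is_digit(source[end + 1])
    ):
        end += 1
    text = source[start:end + 1]
    if text in KEYWORDS:
        return (text, None, start, end)
    return ("identifier", text, start, end)


def scan_number(source, start):
    """Scan an integer whose first digit is at index start."""
    end = start
    while end + 1 < len(source) and is_digit(source[end + 1]):
        end += 1
    return ("integer", int(source[start:end + 1]), start, end)
```

Note that no regular expressions are involved; the character tests are simple comparisons, which keeps the sketch within the rules stated under Deliverables.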

Let’s look at an example for how this first iteration of the SIMPLE compiler would be used from the command line:

$ ./sc -s
VAR ics142: ARRAY 5 OF INTEGER;
VAR@(0, 2)
identifier<ics142>@(4, 9)
:@(10, 10)
ARRAY@(12, 16)
integer<5>@(18, 18)
OF@(20, 21)
identifier<INTEGER>@(23, 29)
;@(30, 30)
eof@(32, 32)

The first line shows the shell prompt and the user starting your driver program without a file name, so input will come from standard input. The next line shows what the user is typing as input; what is not shown is that this time the user needs to end the input with an “end of file” character from the terminal, not just by hitting the “return” key. The following lines are the individual tokens recognized by the scanner, one on each line. Each token starts with the “kind” of token it is; this is optionally followed by its “semantic value” in angle brackets; this is followed by the position of the token in the format shown. (Yes, this is indeed the format you should use to print your tokens.)

Deliverables

Please follow the submission instructions as detailed on Piazza. Make sure that your tarball contains no derived files whatsoever (i.e. no executable files), but allows building all required derived files. Also make sure to include your name and email address in every file you turn in (well, in every file for which it makes sense to do so anyway)!

Regardless of your programming language of choice, we expect to build your project using make (if it needs building at all) and we expect to run your project using ./sc (which stands for “SIMPLE compiler”). You are free to use the standard library for your language of choice, except for modules/classes that allow you to avoid writing large parts of the code for an assignment; so no regular expressions, no parsing combinators, etc. Depending on your language of choice, compliance with certain tools (e.g. checkstyle or valgrind), compiler flags, or additional style guides may also be required; see Piazza for details.

Grading

For reference, here is a short explanation of the grading criteria; not all of the criteria apply to all problems on a given assignment, and not all of the assignments even use all of the criteria.

Packaging refers to the proper organization of the stuff you hand in, following both the guidelines for Deliverables above as well as the general submission instructions for assignments on Piazza.

Style refers to programming style, including things like consistent indentation, appropriate identifier names, useful comments, suitable documentation, etc. Simple, clean, readable code is what you should be aiming for.

Design refers to proper modularization (into functions, classes, modules, etc.) and the proper choice of algorithms and data structures.

Performance refers to how fast/with how little memory your project can produce the required results compared to other submissions; in this course this can mean your actual compiler or interpreter as well as the code generated by it.

Functionality refers to your programs being able to do what they should according to the specification given above. (It also refers to you simply doing the required work, which may not be programming alone.) If the specification is ambiguous, ask for clarification! If no clarification is forthcoming, defend the choices you have made in your README file.

If your project cannot be built, or if it is otherwise obvious that you never tested it, you will get no points whatsoever. If your project cannot be built without warnings using the required compiler options, we will take off 10%. If your programs cannot be built using make, we will take off 10%. If valgrind detects memory errors in your programs, we will take off 10%. If your project fails miserably even once, i.e. terminates with an exception of any kind or dumps core, we will take off 10%. Presumably you see the pattern here?