Scanning and Parsing

Up to now we have had to deal with competing "board" and "SML datatype" notations for the same D programs. We want to get around this problem by writing a parser for DSR, a function from the board-representation strings to term datatype members.

Goal: given a DSR program as a string (in the board representation), produce the corresponding member of the DSR term datatype.

We are going to use a simple parser for an arithmetic language as an example to convey the basic concepts.

Pre-processing the string: scanning

Parsing always includes a pre-processing phase, scanning or lexical analysis.
Group the string into quantum chunks, tokens.
Why? Makes parsing easier and faster.
structure Lex = struct
  datatype ide = IDE of string
  datatype token
    = INT of int
    | BOOL of bool
    | VARIDE of ide
    | LPAREN | RPAREN
    | PLUS | TIMES
    | EOF
end
Definition: a scanner (lexical analyzer) is a function lex : string -> Lex.token list where Lex.token is a datatype of atomic symbols of the language.

Terminology: strings representing atoms are lexemes ("+", "=>", "THEN", ...); each lexeme has a compact token representation as a member of the token datatype.
A few tokens are a bit hairy because they are parametric and have many possible lexemes ("3434", "aVariable").

- lex "(2 + x * 5 + 4)";
val it = [LPAREN,INT 2,PLUS,VARIDE (IDE "x"),TIMES,INT 5,PLUS,INT 4,RPAREN]
  : token list

The parser then uses the scanner as a subroutine to return a token whenever it needs one.
- parseString "(2 + x * 5 + 4)";
val it = PLUS (PLUS (INT 2,TIMES (VAR (IDE "x"),INT 5)),INT 4)
  : term

Writing a Scanner

This is not an overly difficult programming problem. Example code assumes the characters will be read off of a stream (which can be either a file or a string) and a stream of tokens produced.
type 'a stream = {
  next : unit -> unit, (* advance past the first element *)
  peek : unit -> 'a,   (* look at the first element without consuming it *)
  close : unit -> unit 
}
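
To make the stream type concrete, here is a minimal sketch of a character stream backed by a string. The name stringStream is hypothetical, and reporting end of input as the character #"\000" is an assumption made for this sketch, not something the type itself requires.

```sml
(* The stream type from the text, repeated so this sketch stands alone. *)
type 'a stream = {
  next : unit -> unit,
  peek : unit -> 'a,
  close : unit -> unit
}

(* Hypothetical helper: a char stream over a string.
   Assumption: #"\000" is returned once the input is exhausted. *)
fun stringStream (s : string) : char stream =
  let
    val pos = ref 0            (* index of the first unread character *)
    val len = String.size s
  in
    { next  = fn () => if !pos < len then pos := !pos + 1 else (),
      peek  = fn () => if !pos < len then String.sub (s, !pos) else #"\000",
      close = fn () => pos := len }
  end

(* Bind the three operations so they can be called as in the text. *)
val {next, peek, close} = stringStream "a+b"
```

Keeping the position in a ref cell hidden inside the closures means callers see only next, peek and close, so a file-backed stream could be swapped in without changing the scanner.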
The main function, getToken(), gets the next token from the stream.
Iterating this produces a stream (or list) of tokens.
Here is a sketch of getToken:
fun getToken () =
  let val currChar = peek () in
    next ();
    case currChar of
        #"\000" => EOF
      | #"(" => LPAREN
      | #")" => RPAREN
      | #"+" => PLUS
      | #"*" => TIMES
      | ch =>
        if isSpace ch then getToken () (* junk; continue *)
        else if isDigit ch then INT (getNum (ord ch - ord #"0")) (* build number *)
        else if isAlphaNum ch then
          (case getAlpha ch   (* get next word from stream *)
            of "TRUE" => BOOL true
             | "FALSE" => BOOL false
             | s => VARIDE (IDE s))
        else raise Error
  end
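
For comparison, here is a self-contained sketch of the whole scanner, lex : string -> Lex.token list, written over a character list rather than a stream. It is one possible way to fill in the details, not the course's actual code: the helper span and the exception Error are assumptions introduced for the sketch, and the Lex structure is repeated so the block runs on its own.

```sml
structure Lex = struct
  datatype ide = IDE of string
  datatype token
    = INT of int
    | BOOL of bool
    | VARIDE of ide
    | LPAREN | RPAREN
    | PLUS | TIMES
    | EOF
end

exception Error  (* raised on a character that starts no lexeme *)

fun lex (s : string) : Lex.token list =
  let
    (* split off the longest prefix of characters satisfying p *)
    fun span p cs =
      let fun go (acc, c :: rest) = if p c then go (c :: acc, rest)
                                    else (rev acc, c :: rest)
            | go (acc, []) = (rev acc, [])
      in go ([], cs) end

    fun scan [] = []
      | scan (#"(" :: cs) = Lex.LPAREN :: scan cs
      | scan (#")" :: cs) = Lex.RPAREN :: scan cs
      | scan (#"+" :: cs) = Lex.PLUS :: scan cs
      | scan (#"*" :: cs) = Lex.TIMES :: scan cs
      | scan (c :: cs) =
          if Char.isSpace c then scan cs              (* junk; continue *)
          else if Char.isDigit c then                 (* build number *)
            let val (ds, rest) = span Char.isDigit (c :: cs)
            in Lex.INT (valOf (Int.fromString (implode ds))) :: scan rest end
          else if Char.isAlpha c then                 (* keyword or identifier *)
            let val (ws, rest) = span Char.isAlphaNum (c :: cs)
            in (case implode ws
                 of "TRUE" => Lex.BOOL true
                  | "FALSE" => Lex.BOOL false
                  | w => Lex.VARIDE (Lex.IDE w)) :: scan rest
            end
          else raise Error
  in scan (explode s) end

val tokens = lex "(2 + x * 5 + 4)"
```

Like the session shown earlier, this version does not append an EOF token; the end of the list plays that role.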

Lex

lex is a tool for automatically building a scanner from a list of regular expressions for the lexemes. It produces C code as output. There is also a version of lex for SML, ML-Lex.

Grammars

Underlying the theory of parsing is the concept of a grammar.

A grammar

  1. defines a nondeterministic machine which, with no input, produces a string as output;
  2. defines a language of strings, L(G): the set of strings the grammar G can produce;
  3. produces not only strings as output but also a grammar derivation for each string, which is a tree.
This is very useful for parsing programs, because
  1. The token strings representing legal programs may be defined as being generated by a grammar; and,
  2. Given such a grammar, the grammar derivation trees are "very close" to the trees corresponding to term datatype members such as PLUS (PLUS (INT 2,TIMES (VAR (IDE "x"),INT 5)),INT 4).
A Grammar G is a set of rules mapping nonterminals to strings of terminals and nonterminals. D's term datatype is very close to the following D grammar:
term -> FN ide => term | term term | term + term | term = term |
         IF term THEN term ELSE term | var | int | ( term )
var -> ide
ide -> (any string)
int -> (any number)
An example rule is term -> FN ide => term, with terminals FN => and nonterminals ide term.

The | notation allows many cases, much as in SML.

The datatype defines abstract syntax trees,

              PLUS
              /  \
            INT TIMES
            /   /   \
           2  VAR  INT
               |    |
              IDE   5
               |
              "x"
while the above grammar produces concrete syntax trees that include => and other irrelevant syntax.
                 term
                /  |  \
               (  term )
                 / | \
              term + term
               |    /  | \
              int var  * int
               |   |      |
               2  ide     5
                   |
                  "x"

Other forms of syntax presentation

Backus-Naur Form (BNF): a common way of presenting syntax for a programming language; it is a grammar which uses ::= in place of ->.

Syntax diagrams (use of arrows and loops) are another equivalent form, and are used in Ullman (see e.g. p. 260 of Ullman).

The D grammar above (as well as most language grammar specs, including the one in Ullman) is, however, ambiguous: different grammar derivations of the same string are possible, meaning different concrete syntax trees are possible.

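Ambiguous grammars are no good for parsing: the parser must produce a single tree for each program, and an ambiguous grammar does not say which one.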

Conclusion: re-write grammar to an equivalent but unambiguous grammar.

Making Grammars Unambiguous

This is an art; we proceed by an example that solves the common problem of multiple left-associative infix operators of different precedence.

Example grammar: simple arithmetic over identifiers.

E -> E + E | E * E | id | (E)
(note, this denotes a language with five tokens, + * id ( ))
This grammar is ambiguous, in two ways.

1: id*id+id precedence problem:
             E
            /|\
           E + E
          /|\
         E * E


             E
            /|\
           E * E
              /|\
             E + E
The second tree is bad. Need: "* has precedence over +".

2: id+id+id associativity problem (we still get the two trees above, replacing the "*" by "+").

Besides these two problems, every string produces a unique tree.
Solving these two problems:

First: enforce proper operator precedence. * binds more tightly than +.

Idea: no +'s off to the right of a * node: rules out second tree above.
Solution: new symbol T for *'s that has no +'s allowed.

E -> E + E | T
T -> T * T | id | (E)
Now * nodes can't have + nodes directly below (a + can reappear only inside parentheses). In general, this technique can be used with n operators of varying precedence.
Note: we need the rule E -> T for when there are no +'s at the top.

Next: want to rule out subtree

             E
            /|\
           E + E
              /|\
             E + E

Solution:
E -> E + T | T
...
Similar for T * T: no * wanted on right:
T -> T * F | F
F -> id | (E)
The final grammar:
E -> E + T | T
T -> T * F | F
F -> id | (E)
Assert: This grammar is unambiguous.
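
As a quick check of the final grammar, here is a leftmost derivation of id + id * id, applying one rule per step (==> marks a single derivation step):

  E ==> E + T
    ==> T + T
    ==> F + T
    ==> id + T
    ==> id + T * F
    ==> id + F * F
    ==> id + id * F
    ==> id + id * id

The * is introduced below the right-hand T of the + node, so * binds more tightly. Starting with E -> T instead would leave no way to produce the +, since T and F derive a + only inside parentheses.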

Parsing Given an Unambiguous Grammar

There are two schools (algorithms) of parsing.
  1. Top-down parsing, LL(1)
  2. Bottom-up parsing, LR(1)
We are going to present 1. only, for brevity. The UNIX C tool yacc (and ML-yacc) automatically generates an LALR(1) parser, a practical restriction of LR(1), from a grammar.

Top-down parsing algorithm

We are going to parse programs by building a grammar derivation tree (concrete syntax tree) as follows.
  1. starting initially at the top of the tree with the start symbol, in this case E;
  2. scanning the input tokens in left-to-right order;
  3. building the tree in pre-order traversal order, by applying grammar rules one at a time;
  4. only looking one token ahead to decide how to extend derivation tree next.
The trick is deciding which grammar rule to apply at the current point we are extending the tree from.

Making Grammars LL(1)

This grammar is unambiguous but it is not LL(1), meaning that from a single token of lookahead it is impossible to decide which rule to apply.

Problem: given input id+id+id+id, how many E -> E + T nodes should be strung out?
The answer is 3, but that is impossible to tell without looking at almost the whole input.
This violates the one-token lookahead restriction, which is necessary to keep the algorithm linear and not quadratic.

In general, top-down parsing weakness is with left recursion: A -> A ...

Fact: no grammar with left recursion is LL(1)

Solution: replace E -> E + T | T, which really generates T + T + ... T (think about it) by rule

E -> T {+ T }*
({ blah }* means 0 or more blah's in sequence.) The grammar has the same strings but no left recursion.
The same strategy removes the left recursion for *: replace T -> T * F | F by T -> F {* F }*.

Writing a top-down parser

Main idea: one recursive function to parse each nonterminal of the grammar.
Abstract idea: Given one such function for every nonterminal, they can call each other to build the tree. For instance, suppose a grammar had a rule
Q -> [ E ]
The function parseQ reads off a token which should be a [, calls parseE to build that entire tree, and reads off a final token which should be a ]. Done.
This is known as a recursive descent parser.

The following pseudocode is derived from the actual arith parser (which is programmed in a more general style).

(* This function has the responsibility of parsing a nonterminal F of
the grammar, using tokens on the current token stream.  It must
generate the complete parse tree for that particular nonterminal. *)

(* parseF, parseT and parseE call each other, so they are defined
   together with "and". *)

fun parseF () = case peek ()
   of Lex.VARIDE v => (next (); VAR v)  (* skip by the token and build datatype result *)
    | Lex.LPAREN => (next (); parseE ()
        before (case peek () of Lex.RPAREN => next () | _ => raise Error))
                                                      (* ^ parens unbalanced *)
    | _ => raise Error (* illegal token at this point *)

(* build a complete grammar derivation for the T nonterminal:
   F * F * F * F * ... * F * F * F, or just F
*)

and parseT () = let
  fun loop term =
    case peek () of
      Lex.TIMES => (next ();    (* another F in the list ... *)
                    loop (TIMES (term, parseF ())))  (* build datatype *)
    | _ => term    (* no more *'s therefore no more F's in the list; return *)
  in loop (parseF () (* first, parse the first F in the list *)) end


(* build a complete grammar derivation for the E nonterminal:
   T + T + T + T + ... + T + T + T, or just T
*)

and parseE () = let
  fun loop term =
    case peek () of
      Lex.PLUS => (next ();
                   loop (PLUS (term, parseT ())))
    | _ => term
  in loop (parseT ()) end


fun parse () = parseE ()
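
The pseudocode leaves peek, next and the datatypes implicit. The following self-contained sketch fills them in under stated assumptions: the token "stream" is a ref cell holding a token list, running out of tokens is reported as Lex.EOF, and the names parseTokens and Error are hypothetical. It is meant as an illustration of the recursive descent idea, not as the actual arith parser.

```sml
structure Lex = struct
  datatype ide = IDE of string
  datatype token = VARIDE of ide | LPAREN | RPAREN | PLUS | TIMES | EOF
end

datatype term = VAR of Lex.ide | PLUS of term * term | TIMES of term * term

exception Error

fun parseTokens (toks : Lex.token list) : term =
  let
    val input = ref toks   (* tokens not yet consumed *)
    fun peek () = case !input of [] => Lex.EOF | t :: _ => t
    fun next () = case !input of [] => () | _ :: rest => input := rest

    (* F -> id | ( E ) *)
    fun parseF () =
          (case peek ()
            of Lex.VARIDE v => (next (); VAR v)
             | Lex.LPAREN =>
                 (next ();
                  parseE () before
                    (case peek () of Lex.RPAREN => next () | _ => raise Error))
             | _ => raise Error)
    (* T -> F {* F}*, folding to the left as each F is read *)
    and parseT () =
          let fun loop t =
                case peek ()
                 of Lex.TIMES => (next (); loop (TIMES (t, parseF ())))
                  | _ => t
          in loop (parseF ()) end
    (* E -> T {+ T}*, folding to the left as each T is read *)
    and parseE () =
          let fun loop t =
                case peek ()
                 of Lex.PLUS => (next (); loop (PLUS (t, parseT ())))
                  | _ => t
          in loop (parseT ()) end

    val result = parseE ()
  in
    (* reject trailing tokens such as an extra right parenthesis *)
    case peek () of Lex.EOF => result | _ => raise Error
  end

val x = Lex.VARIDE (Lex.IDE "x")
val y = Lex.VARIDE (Lex.IDE "y")
val z = Lex.VARIDE (Lex.IDE "z")
val t1 = parseTokens [x, Lex.TIMES, y, Lex.PLUS, z]   (* x * y + z *)
val t2 = parseTokens [x, Lex.PLUS, y, Lex.PLUS, z]    (* x + y + z *)
```

Because each loop passes the tree built so far as the left argument of the next node, the result is left-associative even though the grammar rule itself is written as a repetition.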

Parsing other languages

The above ideas pretty much work for any language grammar where the next token determines which rule to apply. One other grammar that is difficult is one where the else clause of an if-then-else is optional:
C -> IF E THEN C ELSE C | IF E THEN C
There are two choices of the tree to build below C, and the initial terminal, IF, does not help in making the choice.
In such cases we can often hack up solutions
(here, after parsing the first C, if the next token is ELSE we are in the left case, and otherwise assume the right case).
In general it may be impossible to surmount the difficulties, however.
The LR(1) method is somewhat more flexible, but conceptually more difficult.
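
The ELSE workaround described above can be sketched as follows. Everything here is a simplifying assumption made for illustration: the token and cmd datatypes are hypothetical, the condition E is reduced to a single ATOM token, an ATOM also serves as the base case for C (which the rule above does not supply), and the parser threads the remaining token list through its results instead of using peek and next.

```sml
datatype token = IF | THEN | ELSE | ATOM of string

datatype cmd
  = Atom of string
  | IfThen of string * cmd
  | IfThenElse of string * cmd * cmd

exception Error

(* parseC consumes one command from the front of the token list and
   returns it together with the tokens that remain. *)
fun parseC (IF :: ATOM e :: THEN :: rest) =
      let val (c1, rest1) = parseC rest      (* parse the first C *)
      in case rest1
          of ELSE :: rest2 =>                (* next token is ELSE: left case *)
               let val (c2, rest3) = parseC rest2
               in (IfThenElse (e, c1, c2), rest3) end
           | _ => (IfThen (e, c1), rest1)    (* otherwise assume the right case *)
      end
  | parseC (ATOM a :: rest) = (Atom a, rest)
  | parseC _ = raise Error

(* IF a THEN IF b THEN x ELSE y *)
val (nested, leftover) =
  parseC [IF, ATOM "a", THEN, IF, ATOM "b", THEN, ATOM "x", ELSE, ATOM "y"]
```

In this sketch the inner IF sees the ELSE first and consumes it, so an ELSE attaches to the nearest unmatched IF.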