term datatype members.
Goal: given a DSR program as a string
(in the board representation),
produce the corresponding DSR term datatype member.
We are going to use a simple parser for an arithmetic language as an example to convey the basic concepts.
structure Lex = struct
  datatype ide = IDE of string
  datatype token
    = INT of int
    | BOOL of bool
    | VARIDE of ide
    | LPAREN | RPAREN
    | PLUS | TIMES
    | EOF
end
Definition: a scanner (lexical analyzer) is a
function lex : string -> Lex.token list where
Lex.token is a datatype of atomic symbols of the
language.
Terminology: strings representing atoms are lexemes ("+",
"=>", "THEN", ...); each lexeme has a compact token
representation as a member of the token datatype.
A few tokens are a bit hairy because they are parametric and have many
possible lexemes ("3434", "aVariable").
- lex "(2 + x * 5 + 4)";
val it = [LPAREN,INT 2,PLUS,VARIDE (IDE "x"),TIMES,INT 5,PLUS,INT 4,RPAREN] : token list
The parser then uses the scanner as a subroutine to return a token whenever it needs one.
- parseString("(2 + x * 5 + 4)");
val it = PLUS (PLUS (INT 2,TIMES (VAR (IDE "x"),INT 5)),INT 4)
: term
type 'a stream = {
  next  : unit -> unit,   (* advance past the current element *)
  peek  : unit -> 'a,     (* look at the first element on the stream without removing it *)
  close : unit -> unit
}
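To make the stream record concrete, here is a minimal sketch of a character stream over an in-memory string. The name stringStream and the choice of #"\000" as the end-of-input marker are assumptions made for illustration (the marker matches the EOF case of the scanner), not part of the actual course code.

```sml
type 'a stream = {
  next  : unit -> unit,
  peek  : unit -> 'a,
  close : unit -> unit
}

(* stringStream (hypothetical helper): a char stream over a string.
   Reading past the end yields #"\000" as an end-of-input marker. *)
fun stringStream (s : string) : char stream =
  let
    val pos = ref 0   (* index of the current character *)
  in
    { next  = fn () => if !pos < size s then pos := !pos + 1 else (),
      peek  = fn () => if !pos < size s then String.sub (s, !pos) else #"\000",
      close = fn () => pos := size s }
  end
```

A field is used by selecting it and applying it, for example #peek st () followed by #next st ().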
The main function of the scanner, getToken(), gets the next token from the stream:
fun getToken () =
  let val currChar = peek () in
    next ();
    case currChar
      of #"\000" => EOF
       | (#" " | #"\n" | #"\t") => getToken ()   (* junk; continue *)
       | #"(" => LPAREN
       | #")" => RPAREN
       | #"+" => PLUS
       | #"*" => TIMES
       | ch =>
           if isDigit ch then INT (getNum (ord ch - ord #"0"))   (* build number *)
           else if isAlphaNum ch then
             (case getAlpha ch   (* get next word from stream *)
                of "TRUE" => BOOL true
                 | "FALSE" => BOOL false
                 | s => VARIDE (IDE s))
           else raise Error
  end
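The getToken function relies on stream operations and helpers (getNum, getAlpha) that are not shown here. As a self-contained illustration of the same idea, the following sketch implements lex : string -> Lex.token list directly over a character list. The helper names span and lexChars and the exception LexError are assumptions introduced for this example only.

```sml
structure Lex = struct
  datatype ide = IDE of string
  datatype token
    = INT of int
    | BOOL of bool
    | VARIDE of ide
    | LPAREN | RPAREN
    | PLUS | TIMES
    | EOF
end

exception LexError of char

(* span (hypothetical helper): longest prefix satisfying p, and the rest *)
fun span p [] = ([], [])
  | span p (c :: cs) =
      if p c
      then let val (front, rest) = span p cs in (c :: front, rest) end
      else ([], c :: cs)

(* lexChars: turn a list of characters into a list of tokens *)
fun lexChars [] = []
  | lexChars (#"(" :: cs) = Lex.LPAREN :: lexChars cs
  | lexChars (#")" :: cs) = Lex.RPAREN :: lexChars cs
  | lexChars (#"+" :: cs) = Lex.PLUS :: lexChars cs
  | lexChars (#"*" :: cs) = Lex.TIMES :: lexChars cs
  | lexChars (c :: cs) =
      if Char.isSpace c then lexChars cs   (* junk; continue *)
      else if Char.isDigit c then
        let
          val (digits, rest) = span Char.isDigit (c :: cs)
          (* fold the digits, left to right, into an int *)
          val n = foldl (fn (d, acc) => 10 * acc + (ord d - ord #"0")) 0 digits
        in Lex.INT n :: lexChars rest end
      else if Char.isAlpha c then
        let
          val (letters, rest) = span Char.isAlphaNum (c :: cs)
        in
          (case implode letters
             of "TRUE" => Lex.BOOL true
              | "FALSE" => Lex.BOOL false
              | s => Lex.VARIDE (Lex.IDE s)) :: lexChars rest
        end
      else raise LexError c

fun lex (s : string) : Lex.token list = lexChars (explode s)
```

Unlike getToken, this version produces the whole token list at once and does not append an EOF token, which agrees with the sample output of lex shown earlier.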
lex is a tool for automatically building a scanner from a
list of regular expressions for the lexemes. It produces C code
as output. There is also a version of lex for SML, ML-lex.
Grammars
The parser produces term datatype
members such as PLUS (PLUS (INT 2,TIMES (VAR (IDE "x"),INT 5)),INT 4).
The term datatype is very close to the following D grammar:
term -> FN ide => term | term term | term + term | term = term |
IF term THEN term ELSE term | var | int | ( term )
var -> ide
ide -> (any string)
int -> (any number)
An example rule is term -> FN ide => term, with terminals
FN and => and nonterminals ide and term.
The | notation allows many cases, much as in SML.
The datatype defines abstract syntax trees,
        PLUS
       /    \
    INT      TIMES
     |       /   \
     2     VAR   INT
            |     |
           IDE    5
            |
           "x"
while the above
grammar produces concrete syntax trees that include
=> and other irrelevant syntax.
            term
          /  |  \
         (  term  )
           /  |  \
       term   +   term
        |        /  |  \
       int    var   *   int
        |      |         |
        2     ide        5
               |
              "x"
BNF notation writes ::= in place of ->.
Syntax diagrams (use of arrows and loops) are another equivalent form, and are used in Ullman (see e.g. p. 260 of Ullman).
The D grammar above (like most language grammar specs, including the one in Ullman) is, however, ambiguous: the same string can have different grammar derivations, and therefore different concrete syntax trees.
Ambiguous grammars are no good for parsing:
Conclusion: re-write grammar to an equivalent but unambiguous grammar.
Example grammar: simple arithmetic over identifiers.
E -> E + E | E * E | id | (E)
(note, this denotes a language with five tokens, + * id ( ))
1: id*id+id precedence problem:
        E
       /|\
      E + E
     /|\
    E * E

        E
       /|\
      E * E
         /|\
        E + E
second tree is bad. Need: "* has precedence over +"
2: id+id+id associativity problem (still get the two
trees above, replacing the "*" by "+").
Besides these two problems, every string produces a unique tree.
Solving these two problems:
First: enforce proper operator precedence. * binds more
tightly than +.
Idea: no +'s off to the right of a * node: rules out second tree above.
Solution: new symbol T for *'s that has no +'s allowed.
E -> E + E | T
T -> T * T | id | (E)
Now, * nodes can't have + nodes below. In general, this technique can be used with n operators of varying precedence.
Next: want to rule out subtree
      E
     /|\
    E + E
       /|\
      E + E
Solution:
E -> E + T | T
...
Similar for T * T: no * wanted on right:
T -> T * F | F
F -> id | (E)
The final grammar:
E -> E + T | T
T -> T * F | F
F -> id | (E)
Assert: This grammar is unambiguous.
yacc (and ML-yacc) automatically generate
an LALR(1) parser from a grammar.
E;
Problem: given input id+id+id+id -- how many E -> E +
T nodes to string out?
The answer is 3, but it is impossible to tell without looking at almost the whole input.
This violates the one-token lookahead restriction, which is necessary to keep
the algorithm linear rather than quadratic.
In general, top-down parsing weakness is with left recursion:
A -> A ...
Fact: no grammar with left recursion is LL(1)
Solution: replace E -> E + T | T, which
really generates T + T + ... T (think about it) by rule
E -> T {+ T }*
({ blah }* means 0 or more blah's in sequence)
--the grammar generates the same strings but has no left recursion. The same transformation applies to *.
For each nonterminal X, write a function
parseX to parse grammar rule X.
Whenever an X is expected, call parseX(); it should build the entire
tree for that node, reading tokens off the stream it uses.
For example, for the rule
Q -> [ E ]
function parseQ reads off a token which should be a
[, calls parseE to build that
entire tree, and reads off a final token
which should be a ]. Done.
The following pseudocode is derived from the actual arith parser (which is programmed in a more general style).
(* This function has the responsibility of parsing a nonterminal F of
   the grammar, using tokens on the current token stream. It must
   generate the complete parse tree for that particular nonterminal.
   parseF, parseT and parseE call each other, so they are joined with "and". *)
fun parseF () =
      case peek ()
        of Lex.ID v => (next (); ID v)   (* skip by the token and build datatype result *)
         | Lex.LPAREN =>
             (next ();
              parseE ()
                before (case peek ()
                          of Lex.RPAREN => next ()
                           | _ => raise Error))   (* parens unbalanced *)
         | _ => raise Error   (* illegal token at this point *)

(* build a complete grammar derivation for the T nonterminal:
   F * F * F * F * ... * F * F * F, or just F *)
and parseT () =
      let
        fun loop term =
              case peek ()
                of Lex.TIMES => (next ();   (* another F in the list ... *)
                                 loop (TIMES (term, parseF ())))   (* build datatype *)
                 | _ => term   (* no more *'s therefore no more F's in the list; return *)
      in loop (parseF ()) (* first, parse the first F in the list *) end

(* build a complete grammar derivation for the E nonterminal:
   T + T + T + T + ... + T + T + T, or just T *)
and parseE () =
      let
        fun loop term =
              case peek ()
                of Lex.PLUS => (next ();
                                loop (PLUS (term, parseT ())))
                 | _ => term
      in loop (parseT ()) end

fun parse () = parseE ()
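The pseudocode assumes that peek, next, the token type and the term datatype are defined elsewhere. The following self-contained sketch fills in those pieces so that the parser can be run on a token list. Everything that is not in the pseudocode is an assumption made for illustration: the cut-down Lex structure with an ID token, the three-constructor term datatype, the name parseTokens, and the use of a reference cell to hold the unread tokens.

```sml
structure Lex = struct
  datatype token = ID of string | LPAREN | RPAREN | PLUS | TIMES | EOF
end

datatype term = ID of string | PLUS of term * term | TIMES of term * term

exception Error

(* parseTokens (hypothetical name): parse a whole token list as an E *)
fun parseTokens (tokens : Lex.token list) : term =
  let
    val rest = ref tokens   (* tokens not yet consumed *)
    fun peek () = case !rest of [] => Lex.EOF | t :: _ => t
    fun next () = case !rest of [] => () | _ :: ts => rest := ts

    fun parseF () =
          case peek ()
            of Lex.ID v => (next (); ID v)
             | Lex.LPAREN =>
                 (next ();
                  parseE ()
                    before (case peek ()
                              of Lex.RPAREN => next ()
                               | _ => raise Error))
             | _ => raise Error
    and parseT () =
          let
            fun loop t =
                  case peek ()
                    of Lex.TIMES => (next (); loop (TIMES (t, parseF ())))
                     | _ => t
          in loop (parseF ()) end
    and parseE () =
          let
            fun loop t =
                  case peek ()
                    of Lex.PLUS => (next (); loop (PLUS (t, parseT ())))
                     | _ => t
          in loop (parseT ()) end

    val result = parseE ()
  in
    (* all input must have been consumed *)
    case peek () of Lex.EOF => result | _ => raise Error
  end
```

Because each loop wraps the tree built so far as the left argument of the new node, the resulting trees are left-associative, and * ends up below + as the unambiguous grammar requires.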
F grammar rules).
else
clause of an if-then-else is optional:
C -> IF E THEN C ELSE C | IF E THEN C
--there are two choices of the tree to build below
C, and
the initial terminal, IF, does not help in making the
choice. (Solution: after parsing IF E THEN C, if the next token is ELSE we are in
the left case, and otherwise assume the right case).
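The delayed choice for the optional ELSE can be sketched as follows. To keep the example self-contained it makes several simplifying assumptions that are not in the grammar above: conditions and non-IF commands are single ATOM tokens, and the names token, cmd, parseCmd and expect are invented for this illustration.

```sml
datatype token = IF | THEN | ELSE | ATOM of string | EOF

datatype cmd
  = Atom of string
  | IfThen of string * cmd
  | IfThenElse of string * cmd * cmd

exception ParseError

(* parseCmd (hypothetical name): parse one C from a token list *)
fun parseCmd (tokens : token list) : cmd =
  let
    val rest = ref tokens   (* tokens not yet consumed *)
    fun peek () = case !rest of [] => EOF | t :: _ => t
    fun next () = case !rest of [] => () | _ :: ts => rest := ts
    fun expect t = if peek () = t then next () else raise ParseError

    fun parseC () =
          case peek ()
            of ATOM s => (next (); Atom s)
             | IF =>
                 let
                   val _ = next ()
                   val cond = (case peek ()
                                 of ATOM s => (next (); s)
                                  | _ => raise ParseError)
                   val _ = expect THEN
                   val thenBranch = parseC ()
                 in
                   (* only now decide which of the two rules applies *)
                   case peek ()
                     of ELSE => (next (); IfThenElse (cond, thenBranch, parseC ()))
                      | _ => IfThen (cond, thenBranch)
                 end
             | _ => raise ParseError
  in
    parseC ()
  end
```

With this strategy an ELSE always attaches to the nearest unmatched IF, since the innermost call to parseC is the first to see it.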