CS 331 Spring 2013 > Lecture Notes for Monday, February 11, 2013
An LL grammar is a context-free grammar that can be handled by an LL parsing algorithm, such as Recursive Descent. Recall that the name comes from the fact that these parsers read their input Left-to-right, and they go through the steps required to generate a Leftmost derivation. Now we look at some of the properties that LL grammars must have.
Consider the following production.
\[A \rightarrow Ax\]
If we attempt to turn this production into a Recursive-Descent parsing function, we get something like this.
bool parse_A() { if (!parse_A()) return false; ... }
See the problem?
The first thing function parse_A does is to call itself. This is recursion without a base-case check; the function will never return.
The trouble lies in the grammar. The right-hand side of the production for nonterminal A starts with “A”. This is called left recursion; it is not permitted in an LL grammar.
Left recursion can also be more subtle.
\[ A \rightarrow Bx \] \[ B \rightarrow Ay \]
The above grammar is also considered left-recursive; it is not LL.
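To see why, imagine translating the two productions above into Recursive-Descent functions. The sketch below is a hypothetical illustration (not code from these notes); it is instrumented with an artificial depth counter so that it actually terminates, which a real parser would not have.

```cpp
#include <cassert>

int call_depth = 0;            // instrumentation only; a real parser has no such guard
const int DEPTH_LIMIT = 1000;

bool parse_B();                // forward declaration

// A -> B x
bool parse_A() {
    if (++call_depth > DEPTH_LIMIT) return false;  // artificial cutoff
    if (!parse_B()) return false;                  // first step: call parse_B
    // ... then match "x" ...
    return true;
}

// B -> A y
bool parse_B() {
    if (++call_depth > DEPTH_LIMIT) return false;  // artificial cutoff
    if (!parse_A()) return false;                  // first step: call parse_A
    // ... then match "y" ...
    return true;
}
```

Without the artificial cutoff, parse_A calls parse_B, which immediately calls parse_A again; no lexeme is ever consumed, so the mutual recursion never bottoms out.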
The grammar below illustrates another problem.
\[ A \rightarrow B \mid C\] \[ B \rightarrow x \] \[ C \rightarrow y \]
We cannot even begin to write a Recursive-Descent parser for this grammar. How would the code for function parse_A begin? Should it call parse_B or parse_C? There is no way to tell.
We say the first production above is not left-factored.
An LL grammar may contain only left-factored productions.
Here is another production that is not left-factored.
\[ A \rightarrow xB \mid xC \]
In general, when parsing an LL grammar, if there is a choice to be made, then we must be able to make that choice based on the next terminal (that is, lexeme). If this cannot be done, then the grammar is not LL.
We have now seen two of the properties an LL grammar must have: no left recursion, and only left-factored productions.
There can be other troubles as well. For example, the following grammar is not LL.
\[ A \rightarrow Bxs \] \[ B \rightarrow c \mid cxt \]
The string \(cxs\) lies in the language generated by the above grammar. But imagine a Recursive-Descent parser based on the grammar, attempting to parse this string. What would happen?
Simply because a grammar is not LL does not mean that it is completely useless as an aid to writing a Recursive-Descent parser. We might be able to transform the grammar into an LL grammar that generates the same language. For example, consider the first example of a non-left-factored grammar, above. Here is an LL grammar generating the same language.
\[ A \rightarrow x \mid y \]
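This left-factored grammar translates directly into code: a single look at the next input symbol selects the production. The sketch below is hypothetical (it is not from these notes, and it reads characters from a string rather than lexemes from a lexer).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Recursive-Descent sketch for the LL grammar  A -> x | y.
// pos is the current position in the input; on success it is advanced
// past the symbol matched.
bool parse_A(const std::string & input, std::size_t & pos) {
    if (pos >= input.size())
        return false;              // no lookahead symbol available
    if (input[pos] == 'x') {       // lookahead says: use A -> x
        ++pos;
        return true;
    }
    if (input[pos] == 'y') {       // lookahead says: use A -> y
        ++pos;
        return true;
    }
    return false;                  // neither production can start here
}
```

One symbol of lookahead is enough to commit to a choice, with no backtracking; that is exactly the property an LL grammar guarantees.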
Here is a more complicated example. Suppose we want to parse expressions consisting of repeated sums and differences of numbers. A correct input string will consist of one or more numeric literals separated by binary “+” or “-” operators.
expr → NUM | expr ( "+" | "-" ) NUM
It may seem like this grammar is fine: we check if the first lexeme has token NUM, and if not, we make a recursive call. But consider, all strings in the generated language begin with a NUM token. Looking only at the first lexeme, there is no way to determine which choice to make in the right-hand side. The above grammar is not LL.
We can easily fix this. Here is a revised grammar.
expr → NUM [ ( "+" | "-" ) expr ]
This grammar is LL, and it generates the same language. However, there is still likely to be a problem. Recall that, when we parse, we are not merely interested in syntactic correctness; we also want to determine the structure of our input. And the revised grammar generates the same strings, but gives them a significantly different structure. Essentially, the first grammar describes a left-associative “-” operator, while the revised grammar describes one that is right-associative. In practical programming languages, we usually want binary arithmetic operators to be left-associative:
\[ a-b-c = (a-b)-c; \] \[ a-b-c \ne a-(b-c). \]
For example, \(10-3-2 = (10-3)-2 = 5\), not \(10-(3-2) = 9\).
There would seem to be no solution to our problem, in the context of LL parsing algorithms. If expr comes first, as in the first grammar, then we have left recursion. If it comes last, as in the second grammar, then we get the wrong structure.
It turns out that we can fix this by eliminating expr from its own right-hand side entirely. Recall that braces indicate an optional, repeatable section.
expr → NUM { ( "+" | "-" ) NUM }
With the above grammar, function parse_expr would not be recursive. Instead, it would contain a loop. This loop could generate the proper parse tree as it went. Alternatively, if its job is only to evaluate the expression, it could keep a running total, which it would return.
Here is code for a possible function parse_expr, based on the above grammar. This function does not do evaluation or parse-tree generation. Again, we assume there is a Lex object named luthor_.
[C++]
bool parse_expr() {
    if (!matchToken(Lex::NUM)) return false;

    // Handle { ("+"|"-") NUM }
    while (matchString("+") || matchString("-")) {
        if (!matchToken(Lex::NUM)) return false;
    }

    // All done
    return true;
}
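The running-total idea mentioned above can be sketched as follows. This is a hypothetical, self-contained variant: instead of the class's matchToken/matchString helpers and the luthor_ lexer, it scans a string of single-digit numbers directly, so that the evaluation logic is visible on its own.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Evaluate input matching  expr -> NUM { ("+"|"-") NUM },  where NUM is a
// single digit. Returns true on success and sets result to the value,
// computed left-associatively via a running total.
bool parse_expr_eval(const std::string & input, int & result) {
    std::size_t pos = 0;
    if (pos >= input.size() ||
        !std::isdigit(static_cast<unsigned char>(input[pos])))
        return false;                        // must start with NUM
    int total = input[pos++] - '0';          // first NUM

    while (pos < input.size() &&
           (input[pos] == '+' || input[pos] == '-')) {
        char op = input[pos++];              // ( "+" | "-" )
        if (pos >= input.size() ||
            !std::isdigit(static_cast<unsigned char>(input[pos])))
            return false;                    // operator not followed by NUM
        int num = input[pos++] - '0';
        if (op == '+') total += num;         // updating the total as each
        else           total -= num;         //   operator is read keeps "-"
    }                                        //   left-associative

    if (pos != input.size()) return false;   // trailing garbage
    result = total;
    return true;
}
```

Because the total is updated as each operator is read, 9-3-2 evaluates to (9-3)-2 = 4, the left-associative reading.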
See rdparseb.h & rdparseb.cpp for a Recursive-Descent parser class based on the above grammar. Also see rdparseb_main.cpp for a simple program that uses this new parser.
Transforming grammars, as in the examples above, allows us to produce LL grammars for a relatively large class of languages. Note, however, that this cannot always be done; there are context-free languages that cannot be generated by an LL grammar.
Shift-Reduce is a bottom-up parsing algorithm, in the LR category. Shift-Reduce parsers involve a large parsing table; they are generally automatically generated, based on a grammar.
The algorithm was first published by Don Knuth in 1965; however, it was not really considered practical until the 1970s, when various issues related to the generation of large tables were resolved.
A Shift-Reduce parser is a state machine with an associated stack. Each stack item holds a state and a symbol (terminal or nonterminal). The state in the top-of-stack item is the current state. The parser uses two tables, called the action table and the goto table. These two are very similar, and have one row for each state, so I like to place them side-by-side.
The columns in the action table are indexed by terminals. An entry in this table can be “s#”, “r#”, or blank, where “#” means some number. An entry of “s#” means to shift, and the number represents some state: push the symbol and the state on the stack (and then, since the top-of-stack state is the current state, we enter that state). An entry of “r#” means to reduce, and the number represents some production: pop items representing the right-hand side of the production, push the left-hand side, and use the goto table (see below) to determine the new state. Note that, when we do a reduce, the top few items on the stack will make up the right-hand side of the appropriate production. The number of items to be popped depends on the length of the right-hand side of the production. A blank entry means an error has occurred: time to quit.
The columns in the goto table are indexed by nonterminals. An entry in this table can be “g#” or blank. Blank entries are those that are not used. The “g#” entries are used after a reduce operation; the number represents the state to go to. We do a goto-table look-up “diagonally”. We have already pushed a nonterminal symbol, but no state. So we do our look-up using the symbol on the top-of-stack item, and the state in the next item down.
When we write Shift-Reduce parsing tables, we will follow the convention that the character “$” marks the end of the input.
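The table-driven loop described above can be sketched in code. The grammar, states, and table entries below are my own small example, not the ones from the class handout: productions (1) E → E + n and (2) E → n, with terminals n, +, and the end marker $.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One stack item: a state and a symbol (terminal or nonterminal).
struct Item { int state; char symbol; };

// Shift-Reduce parser for the toy grammar
//   (1) E -> E + n      (2) E -> n
// Input is a string of 'n' and '+' characters; the end marker '$' is
// appended internally. Returns true if the input is accepted.
bool shift_reduce_parse(const std::string & input_raw) {
    const std::string input = input_raw + "$";

    // Action table: one row per state, columns indexed by terminal.
    // kind: 's'=shift to state num, 'r'=reduce by production num,
    //       'a'=accept, 0=blank (error).
    auto action = [](int state, char t, char & kind, int & num) {
        kind = 0; num = -1;
        switch (state) {
        case 0: if (t == 'n') { kind = 's'; num = 1; } break;
        case 1: if (t == '+' || t == '$') { kind = 'r'; num = 2; } break;
        case 3: if (t == '+') { kind = 's'; num = 4; }
                else if (t == '$') { kind = 'a'; } break;
        case 4: if (t == 'n') { kind = 's'; num = 5; } break;
        case 5: if (t == '+' || t == '$') { kind = 'r'; num = 1; } break;
        }
    };
    // Goto table: row = state exposed after popping, column = nonterminal.
    auto go = [](int state, char nonterm) -> int {
        if (state == 0 && nonterm == 'E') return 3;  // the only entry
        return -1;                                   // blank
    };

    std::vector<Item> stack;
    stack.push_back({0, '?'});          // start state; symbol unused
    std::size_t pos = 0;
    while (true) {
        char kind; int num;
        action(stack.back().state, input[pos], kind, num);
        if (kind == 'a') return true;                    // accept
        if (kind == 's') {                               // shift
            stack.push_back({num, input[pos]});
            ++pos;
        } else if (kind == 'r') {                        // reduce
            int rhs_len = (num == 1) ? 3 : 1;            // |E + n| or |n|
            for (int i = 0; i < rhs_len; ++i) stack.pop_back();
            int next = go(stack.back().state, 'E');      // "diagonal" look-up:
            if (next < 0) return false;                  //   exposed state + new symbol
            stack.push_back({next, 'E'});
        } else {
            return false;                                // blank entry: error
        }
    }
}
```

Note the reduce step: the items popped are exactly the right-hand side of the production, and the goto look-up uses the state exposed underneath them together with the newly pushed nonterminal.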
We went over a Shift-Reduce parser in class. See the Shift-Reduce Parsing Table handout.
Shift-Reduce Parsing will be continued next time.
ggchappell@alaska.edu