CS 331 Spring 2013 > Lecture Notes for Wednesday, February 13, 2013
We wish to design a Shift-Reduce parser for the following grammar, whose start symbol is S.
S → ( S ) | x
This grammar generates the language containing the following strings:
x (x) ((x)) (((x))) ((((x))))
... and so on.
In our grammar, we need to avoid the shorthand and split the above into two productions, which we number.
(1) S → “(” S “)”
(2) S → “x”
Here is the resulting Shift-Reduce parsing table. We created the following table in class on Wednesday, February 13.
| Note | State | Action: ( | Action: x | Action: ) | Action: $ | Goto: S |
|---|---|---|---|---|---|---|
| - [start] | 0 | s1 | s2 | | | g3 |
| ( | 1 | s1 | s2 | | | g4 |
| x | 2 | r2 | r2 | r2 | r2 | |
| -S | 3 | * | * | * | s5* | |
| ( S | 4 | | | s6 | | |
| [done] | 5* | accept | accept | accept | accept | |
| ( S) | 6 | r1 | r1 | r1 | r1 | |
* In class, we made a design decision: the string parsed using this table must be the entire input, and so must end with $. If, instead, we thought of this string as part of a larger structure, then we could make state 3 an accepting state (put “accept” in each of the action-table cells for state 3) and eliminate state 5.
Note: In the table constructed in class, state 2 above was actually two states, which were labeled “-x” and “(x”. However, the rows for the two were identical, so I have combined them into a single state.
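To see the table in action, here is a minimal Python sketch of a table-driven Shift-Reduce parser that encodes the rows above. The names `ACTION`, `GOTO`, `PRODUCTIONS`, and `sr_parse` are illustrative choices, not something we used in class. The sketch also assumes that each character of the input is one lexeme and that the caller does not supply the final `$`.

```python
# Table-driven Shift-Reduce parser for:  (1) S -> ( S )    (2) S -> x
# All names here are illustrative; each input character is treated as one lexeme.

# Production number -> (left-hand side, number of symbols on the right-hand side)
PRODUCTIONS = {1: ("S", 3), 2: ("S", 1)}

TERMINALS = "(x)$"  # "$" marks the end of the input

# ACTION[state][terminal] is ("s", state), ("r", production), or ("accept",).
# A missing entry is a blank cell in the table: a syntax error.
ACTION = {
    0: {"(": ("s", 1), "x": ("s", 2)},
    1: {"(": ("s", 1), "x": ("s", 2)},
    2: {t: ("r", 2) for t in TERMINALS},
    3: {"$": ("s", 5)},
    4: {")": ("s", 6)},
    5: {t: ("accept",) for t in TERMINALS},
    6: {t: ("r", 1) for t in TERMINALS},
}

# GOTO[state][nonterminal] is the state to push after a reduce.
GOTO = {0: {"S": 3}, 1: {"S": 4}}


def sr_parse(text):
    """Return True if text (given without the trailing $) is generated by S."""
    if any(ch not in "(x)" for ch in text):
        return False
    lexemes = list(text) + ["$"]
    pos = 0
    stack = [0]  # stack of states; state 0 is the start state
    while True:
        # Once "$" has been shifted, keep presenting "$" as the current lexeme.
        current = lexemes[pos] if pos < len(lexemes) else "$"
        action = ACTION[stack[-1]].get(current)
        if action is None:
            return False  # blank cell: syntax error
        if action[0] == "accept":
            return True
        if action[0] == "s":
            stack.append(action[1])  # shift: push the new state, consume the lexeme
            pos += 1
        else:
            lhs, rhs_len = PRODUCTIONS[action[1]]
            del stack[-rhs_len:]  # reduce: pop one state per right-hand-side symbol
            stack.append(GOTO[stack[-1]][lhs])  # then consult the goto table


print(sr_parse("((x))"))  # a string in the language
print(sr_parse("(x"))     # missing the closing parenthesis
```

Storing only states on the stack is a simplification; a parser that builds a parse tree would also keep the symbols or tree nodes that go with them.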
Shift-Reduce is an LR parsing algorithm, and so it can handle a larger class of grammars than an LL algorithm like Recursive Descent. This is an advantage, of course, but it is not a huge one, since, when we write a compiler, we only need one grammar. We should also note that there are CFLs that cannot be generated by an LR grammar.
An interesting advantage of Shift-Reduce parsers over Recursive-Descent parsers is that they tend to find syntax errors sooner in the code.
The Shift-Reduce algorithm has inspired any number of variations. These continue to dominate the field of automatically generated parsing code.
When we wrote our lexer, it was convenient to do lookahead: make decisions about the current symbol based on symbols after it. The LL and LR parsing algorithms, as we have presented them, look only at the current lexeme and do no further lookahead; but, as with our lexer, it is often convenient to add the ability to look ahead. An LL(k) algorithm is one that follows the basic ideas of LL algorithms, but is allowed to look at the next k lexemes, starting with the current one, for some constant k. So LL(1) is the same as LL. A parsing algorithm that looks one lexeme further is LL(2). The LL(2) grammars are those that can be handled by such an algorithm.
Additional lookahead makes an LL parsing algorithm more powerful. An LL(2) language is a language that can be generated by an LL(2) grammar. There are LL(2) languages that are not LL [that is, LL(1)] languages. Thus, we may wish to add an extra lexeme of lookahead to an LL parser.
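As a small illustration of what the extra lexeme buys, here is a hedged Python sketch of a Recursive-Descent parser that looks at two lexemes, counting the current lexeme as the first. The grammar is made up for this example: Stmt → “id” “=” “id” | “id” “(” “)”. Both productions begin with “id”, so the current lexeme alone cannot choose between them, but the one after it can. Note that this shows only an LL(2) *grammar*; left-factoring would give an LL(1) grammar for the same language, so it is not an example of an LL(2) language that fails to be LL(1). The names `Lexemes`, `peek`, `match`, and `parse_ll2` are illustrative.

```python
class Lexemes:
    """A list of lexemes with a cursor and k-lexeme lookahead."""

    def __init__(self, lexemes):
        self.items = list(lexemes)
        self.pos = 0

    def peek(self, k=1):
        """Return the k-th upcoming lexeme (k=1 is the current one), or None."""
        i = self.pos + k - 1
        return self.items[i] if i < len(self.items) else None

    def match(self, expected):
        """Consume the current lexeme if it equals expected."""
        if self.peek() != expected:
            return False
        self.pos += 1
        return True


def parse_stmt(lex):
    # Hypothetical grammar:  Stmt -> "id" "=" "id"  |  "id" "(" ")"
    if lex.peek(1) != "id":
        return False
    second = lex.peek(2)  # the extra lexeme of lookahead decides the production
    if second == "=":
        return lex.match("id") and lex.match("=") and lex.match("id")
    if second == "(":
        return lex.match("id") and lex.match("(") and lex.match(")")
    return False


def parse_ll2(lexemes):
    """Return True if the whole lexeme list is a single Stmt."""
    lex = Lexemes(lexemes)
    return parse_stmt(lex) and lex.peek() is None


print(parse_ll2(["id", "=", "id"]))
print(parse_ll2(["id", "(", ")"]))
```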
Similarly, we can talk about LR(2), etc. However, such parsers are rare in practice, in part because lookahead does not give such nice advantages to an LR parser. There are LR(2) grammars that are not LR grammars; however, every LR(2) language is an LR language.
A very popular variation on LR was introduced in 1969 by Frank DeRemer: LookAhead LR, or LALR. Such a parser can be generated by first generating the data required for an LR parsing table, and then analyzing it using lookahead on the table itself, in order to combine multiple states into one. The result is a much smaller table, at the cost of handling a more restricted set of grammars. The parser is executed using much the same method as a Shift-Reduce parser (shift, reduce, goto table, etc.).
The LALR method is used in many automatic parser generators, including Yacc (“Yet Another Compiler Compiler”) and its offspring, which include GNU Bison as well as various other programs with “yacc” in their names.
One technique is to find a LALR grammar that almost generates the language you want, automatically generate a LALR parser for it, and then combine the parser with custom hand-written code to obtain a parser for the desired language.
We have discussed parsing in the context of CFGs and CFLs. There are parsing algorithms that can handle all CFGs (suitably transformed to a “canonical” form) and thus all CFLs. The fastest practical algorithms that can parse every CFL, such as the CYK algorithm, run in cubic time: \(O(n^3)\). (Valiant’s Algorithm is asymptotically faster, but it is not considered practical.) Note: \(n\), the size of the input, denotes the number of lexemes in the source being parsed.
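To make the cubic bound concrete, here is a minimal sketch of a CYK recognizer in Python. It assumes the grammar has been put into Chomsky normal form; for the parenthesis grammar from the start of these notes, one possible conversion (my own, with helper nonterminals T, L, and R introduced for the purpose) is S → L T | “x”, T → S R, L → “(”, R → “)”. The names `TERMINAL_RULES`, `BINARY_RULES`, and `cyk` are illustrative.

```python
# Assumed Chomsky-normal-form version of  S -> ( S ) | x :
#   S -> L T | "x"      T -> S R      L -> "("      R -> ")"
TERMINAL_RULES = {"x": {"S"}, "(": {"L"}, ")": {"R"}}
BINARY_RULES = {("L", "T"): {"S"}, ("S", "R"): {"T"}}


def cyk(lexemes, start="S"):
    """Return True if the lexeme list can be derived from the start symbol."""
    n = len(lexemes)
    if n == 0:
        return False  # this grammar does not generate the empty string
    # table[i][length] holds the nonterminals that derive lexemes[i:i+length].
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, lexeme in enumerate(lexemes):
        table[i][1] = set(TERMINAL_RULES.get(lexeme, ()))
    # Three nested loops, each over at most n values: this is the O(n^3) part.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for left in table[i][split]:
                    for right in table[i + split][length - split]:
                        table[i][length] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n]


print(cyk(list("((x))")))
print(cyk(list("(x")))
```

The two innermost loops run over nonterminals, so their cost depends on the grammar and not on \(n\).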
The parsing algorithms that are actually used for compilation run in linear time: \(O(n)\). These include Recursive Descent and Shift-Reduce. Such algorithms can handle only a small subset of all CFGs; fortunately, most programming languages can be described by grammars on which an efficient parser can be based.
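For contrast with the cubic-time approach, here is a hedged sketch of a Recursive-Descent recognizer for the grammar S → ( S ) | x from the start of these notes. It assumes each character is one lexeme; the name `rd_parse` is illustrative. The position `pos` only ever moves forward, and each lexeme is examined a bounded number of times, which is where the linear running time comes from.

```python
def rd_parse(text):
    """Return True if text is generated by  S -> ( S ) | x ."""
    pos = 0  # index of the current lexeme; it never moves backward

    def current():
        return text[pos] if pos < len(text) else None

    def parse_s():
        nonlocal pos
        if current() == "x":  # production (2): S -> x
            pos += 1
            return True
        if current() == "(":  # production (1): S -> ( S )
            pos += 1
            if not parse_s():
                return False
            if current() == ")":
                pos += 1
                return True
        return False

    # The whole input must be consumed, as with the $ marker in the table.
    return parse_s() and pos == len(text)


print(rd_parse("((x))"))
print(rd_parse("x)"))
```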
ggchappell@alaska.edu