CS 331 Spring 2013 > Lecture Notes for Wednesday, February 6, 2013
With our lexeme description,
it is tricky to handle the strings “+.3” and “+.x”:
“+.3” is a single lexeme (a number), while
“+.x” is three lexemes
(two operators and an identifier).
In our code we handled this by peeking at the next character.
Thus, our state machine has morphed into
a somewhat more general construction that uses lookahead.
This is a common technique in all phases of parsing.
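As a rough illustration of this kind of peeking, here is a minimal Python sketch. It assumes that a number may begin with a sign, and that a sign followed by a dot begins a number only when a digit follows the dot. The names `peek` and `sign_starts_number` are made up for this example; they are not taken from the lexer written in class.

```python
def is_digit(ch):
    """True only for a single ASCII digit (False for '' at end of input)."""
    return len(ch) == 1 and "0" <= ch <= "9"


def peek(text, pos, offset=1):
    """Return the character `offset` places past pos without consuming it.

    Returns '' if that position is past the end of the input.
    """
    i = pos + offset
    return text[i] if i < len(text) else ""


def sign_starts_number(text, pos):
    """Decide whether the '+' or '-' at text[pos] begins a number lexeme."""
    nxt = peek(text, pos, 1)
    if is_digit(nxt):
        return True
    # A sign followed by a dot is a number only if a digit follows the dot.
    return nxt == "." and is_digit(peek(text, pos, 2))


print(sign_starts_number("+.3", 0))  # True: "+.3" is one number
print(sign_starts_number("+.x", 0))  # False: "+" is an operator by itself
```

Note that the decision is made before any character after the sign is consumed, which is the essence of lookahead.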
When we do lexical analysis, using lookahead can help simplify what might otherwise be convoluted code. However, it does not increase our capabilities; anything involving regular languages & grammars that can be done with lookahead can also be done without it. On the other hand, when we do syntax analysis involving more general context-free languages & grammars, lookahead does actually increase our capabilities. For the various standard parsing algorithms, there are CFGs that the algorithms cannot handle without sufficient lookahead.
Our lexeme description says that the only legal characters are ASCII 32–126 along with the various whitespace characters. What should we do if we input a character that is not legal?
There are a number of options. We could simply assume that the input consists only of legal characters, putting the responsibility on the caller to make sure this is true. We could crash (politely, with a nice error message, or impolitely).
However, recall a lexer’s job. It is the primary part of parsing code that deals directly with input characters. Relying on the caller to check for legal characters would seem to contradict this idea. Furthermore, a lexer is rarely executed directly. Rather, it is called by a parser, which may in turn be called by a compiler (which is executed by an IDE ...). Our lexer should not be thought of as user-facing code. Therefore it should not take responsibility for informing the user of errors. We see that all of the above error-handling options are unsatisfactory.
So, our lexer needs to signal the caller when it encounters
an illegal character.
We could throw an exception,
but remember that our return type already pairs a token with a string, so
we can simply create a new token for illegal characters.
When an illegal character is encountered,
we return a lexeme with this token
and a length-one string containing the offending character.
This design helps make our package robust. Robust code deals gracefully with all possible input. Doing this allows the calling code to handle illegal characters however and whenever it wants, and reduces the likelihood of painful situations for the user.
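The idea can be sketched in Python as follows. This is only an illustration under simplifying assumptions: identifiers are letters, digits, and underscores, numbers are runs of digits, and any other legal character is a one-character operator. The token names, including `BADCHAR`, and the function name `lex_with_badchar` are hypothetical, not the ones used in class.

```python
# Hypothetical token names; BADCHAR marks an illegal character.
ID, NUM, OP, BADCHAR = "ID", "NUM", "OP", "BADCHAR"
WHITESPACE = " \t\n\v\f\r"


def is_legal(ch):
    """Legal characters: ASCII 32-126 plus the whitespace characters."""
    return 32 <= ord(ch) <= 126 or ch in WHITESPACE


def is_letter(ch):
    return ("a" <= ch <= "z") or ("A" <= ch <= "Z") or ch == "_"


def is_digit(ch):
    return "0" <= ch <= "9"


def lex_with_badchar(text):
    """Return a list of (token, string) lexemes; never raises on bad input."""
    lexemes = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        start = pos
        if not is_legal(ch):
            # Signal the caller with a special token and the offending character.
            pos += 1
            lexemes.append((BADCHAR, ch))
        elif ch in WHITESPACE:
            pos += 1
        elif is_letter(ch):
            while pos < len(text) and (is_letter(text[pos]) or is_digit(text[pos])):
                pos += 1
            lexemes.append((ID, text[start:pos]))
        elif is_digit(ch):
            while pos < len(text) and is_digit(text[pos]):
                pos += 1
            lexemes.append((NUM, text[start:pos]))
        else:
            pos += 1
            lexemes.append((OP, ch))
    return lexemes


print(lex_with_badchar("a\x01b"))
# [('ID', 'a'), ('BADCHAR', '\x01'), ('ID', 'b')]
```

The caller can then decide what to do with a `BADCHAR` lexeme: report it, skip it, or stop parsing.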
By its nature, lexical analysis serves syntax analysis. The output of a lexer is almost never needed for its own sake; lexing is just the first step in the generation of a parse tree, and possibly an executable. A lexeme description is generally written with a parsed language in mind. Thus, it is difficult to look at a lexeme description, in isolation, and call it correct or incorrect. (On the other hand, given a lexeme description, we can certainly look at a proposed lexer and determine whether it is correct.)
In any case, the Lexeme Description distributed in class does contain a feature that does not quite match the intended language. (This was intentional, but it harkens back to an actual mistake I made when designing a lexer some years ago.)
Suppose we want to use our lexer output as the input for a parser that handles arithmetic expressions. We give it the following input.
k - 4
The result is 3 lexemes.
ID k OP - NUM 4
However, suppose we remove the blanks. Then our output changes.
ID k NUM -4
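If our goal is to be able to parse standard arithmetic-expression syntax, then the latter lexeme sequence will not parse correctly: an identifier followed directly by a number, with no operator between them, is not a valid expression.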
There are a number of ways to handle this.
We could leave the lexeme description unchanged,
thus requiring the programmer
to insert space in some places.
We could change the lexeme description so that
is a different kind of lexeme from “
But my favorite solution
is to change the lexeme description as follows:
if a lexeme is preceded by an Identifier or Number,
then we return an
Operator, if possible;
otherwise, we follow the longest-lexeme rule, as before.
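Here is a small Python sketch of this modified rule. It makes simplifying assumptions that are not part of the Lexeme Description: a number is an optional sign followed by digits, an identifier is a run of letters and underscores, blanks are the only whitespace, and every other character is a one-character operator. The function name `lex_expr` and the token names are chosen for illustration.

```python
def is_digit(ch):
    return len(ch) == 1 and "0" <= ch <= "9"


def is_letter(ch):
    return len(ch) == 1 and (("a" <= ch <= "z") or ("A" <= ch <= "Z") or ch == "_")


def lex_expr(text):
    """Return a list of (token, string) lexemes for a simple expression."""
    lexemes = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch == " ":
            pos += 1
            continue
        start = pos
        prev = lexemes[-1][0] if lexemes else None      # token of previous lexeme
        nxt = text[pos + 1] if pos + 1 < len(text) else ""
        if ch in "+-" and is_digit(nxt) and prev not in ("ID", "NUM"):
            # Longest-lexeme rule: the sign is part of the number.
            pos += 1
            while pos < len(text) and is_digit(text[pos]):
                pos += 1
            lexemes.append(("NUM", text[start:pos]))
        elif is_digit(ch):
            while pos < len(text) and is_digit(text[pos]):
                pos += 1
            lexemes.append(("NUM", text[start:pos]))
        elif is_letter(ch):
            while pos < len(text) and is_letter(text[pos]):
                pos += 1
            lexemes.append(("ID", text[start:pos]))
        else:
            # Includes a sign that follows an Identifier or Number.
            pos += 1
            lexemes.append(("OP", ch))
    return lexemes


print(lex_expr("k-4"))   # [('ID', 'k'), ('OP', '-'), ('NUM', '4')]
print(lex_expr("-4"))    # [('NUM', '-4')]
print(lex_expr("3*-4"))  # [('NUM', '3'), ('OP', '*'), ('NUM', '-4')]
```

With this rule, “k-4” and “k - 4” produce the same lexemes, while a leading “-4” is still a single number.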
Now we turn our attention to the second stage in parsing, syntax analysis, in which we look at the lexeme stream produced by the lexical analyzer, and, if all is well, we produce whatever data our caller needs—usually a parse tree. As we have noted, syntax analysis is also called “parsing”; this is the narrow sense of the term.
Parsing algorithms come in two big categories: top-down and bottom-up. A parser will need to go through the steps to find a derivation based on the grammar being used. It will not output this derivation; it generally will not even store it anywhere, but it will go through the steps.
A top-down parsing algorithm goes through the derivation from top to bottom, beginning with the start symbol, and ending with the string to be derived (i.e., the program). A bottom-up parsing algorithm goes through the derivation from bottom to top, beginning with the program and ending with the start symbol.
It is difficult to make strong general statements about these categories, but there are a few properties that these algorithms tend to have.
Top-down parsing code is often hand-coded, although top-down parser generators do exist. Top-down parsers typically expand the leftmost nonterminal first. Thus, they usually produce leftmost derivations. Many of them are in a category known as LL parsers, the name coming from the fact that they read the input Left-to-right and generate a Leftmost derivation. We will look closely at a top-down LL parsing algorithm called Recursive Descent.
Bottom-up parsing code is almost always automatically generated. Such parsers will usually contract the leftmost nonterminal first. But thinking of the derivation from top to bottom, this would mean that the rightmost nonterminal is expanded first, resulting in a rightmost derivation. Thus, many such algorithms are in a category known as LR parsers, the name coming from the fact that they read the input Left-to-right, and generate a Rightmost derivation. We will look at a bottom-up LR parsing algorithm called Shift-Reduce.
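To make the leftmost/rightmost distinction concrete, here is a small Python sketch using a toy grammar invented for this example: S → A B, A → a, B → b. The function `derive` is hypothetical; it simply expands either the leftmost or the rightmost nonterminal at each step and records each intermediate string. It is not a parser.

```python
# Toy grammar, one production per nonterminal (an assumption of this sketch).
RULES = {"S": ["A", "B"], "A": ["a"], "B": ["b"]}


def derive(start, leftmost=True):
    """Return the steps of a leftmost or rightmost derivation from `start`."""
    form = [start]
    steps = [" ".join(form)]
    while any(sym in RULES for sym in form):
        nonterminals = [i for i, sym in enumerate(form) if sym in RULES]
        i = nonterminals[0] if leftmost else nonterminals[-1]
        form[i:i + 1] = RULES[form[i]]   # expand the chosen nonterminal
        steps.append(" ".join(form))
    return steps


print(derive("S", leftmost=True))    # ['S', 'A B', 'a B', 'a b']
print(derive("S", leftmost=False))   # ['S', 'A B', 'A b', 'a b']

# Reading the rightmost derivation from bottom to top, as a bottom-up
# parser goes through it, the leftmost symbol is contracted first.
print(derive("S", leftmost=False)[::-1])   # ['a b', 'A b', 'A B', 'S']
```

In the reversed rightmost derivation, the first step contracts “a” to A, the leftmost piece of the input, which matches the left-to-right reading order of an LR parser.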
The grammars that an LL parser can use are called LL grammars, and similarly some grammars are LR grammars. Interestingly, every LL grammar is an LR grammar, making LR algorithms a bit more general. Note, however, that when you write a compiler, you only need one grammar, and if it is an LL grammar, then an LL parser works fine.
The following diagram shows the relationships between various grammar categories.