CS 331 Spring 2013 > Lecture Notes for Friday, February 8, 2013
Recursive Descent is a top-down parsing algorithm, in the LL category. Recursive-descent parsers are generally hand-coded, as opposed to being automatically generated based on a grammar.
Many algorithms you are familiar with (Binary Search, various sorting algorithms, etc.) can be written once, and, if the implementation is suitably generic, never written again. But Recursive Descent is not like this. When we write a Recursive-Descent parser, we choose what functions to write, depending on what our grammar looks like. Thus, a Recursive-Descent parser is tailored specifically for the grammar it uses; a new grammar requires writing a new parser.
A Recursive-Descent parser has one parsing function for each nonterminal in the grammar. Each parsing function is responsible for parsing all strings that the corresponding nonterminal can be expanded into. So the function corresponding to the start symbol is the one we call to parse our language.
The code for a parsing function is essentially a translation into code of the right-hand side of the production corresponding to the nonterminal. Nonterminals in the right-hand side become function calls, and so the parsing functions are mutually recursive. Terminals become checks that the input string contains the proper lexemes.
We wish to write a Recursive-Descent parser for the following grammar, which is mostly in EBNF. Italic indicates nonterminals. Terminals (lexemes) are either in CAPITALS, indicating a token, or in quoted "typewriter" font, when the literal characters are given. The start symbol is thing. Recall that brackets indicate an optional section.
thing → "(" thing ")"
thing → "[" thing "]"
thing → ID [ ":" num_list ]
num_list → NUM [ "," NUM [ "," NUM ] ]
The idea is that “ID” and “NUM” represent the tokens of the same names generated by our Lex class. Using that lexer as a front-end, we get a parser that will handle strings that look something like this:
[(abc: 1, -2.5)]
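To make the lexer's role concrete, here is a minimal sketch of how such a string might be split into (lexeme, token) pairs. This is not the course's Lex class: the function tokenize, the Token enumerators (in particular PUNCT), and the exact rules for what counts as an identifier or a number are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token categories; the notes name only ID and NUM.
enum class Token { ID, NUM, PUNCT };
using Lexeme = std::pair<std::string, Token>;  // (lexeme text, token)

// Split a string into lexemes, skipping whitespace (assumed rules).
std::vector<Lexeme> tokenize(const std::string & s)
{
    std::vector<Lexeme> out;
    std::size_t i = 0;

    auto isDigitAt = [&s](std::size_t k) {
        return k < s.size()
            && std::isdigit(static_cast<unsigned char>(s[k])) != 0;
    };
    auto isWordAt = [&s](std::size_t k) {
        return k < s.size()
            && (std::isalnum(static_cast<unsigned char>(s[k])) != 0
                || s[k] == '_');
    };

    while (i < s.size())
    {
        if (std::isspace(static_cast<unsigned char>(s[i])))
        {
            ++i;
            continue;
        }
        const std::size_t start = i;
        Token tok;
        if (isWordAt(i) && !isDigitAt(i))
        {   // Identifier: letter or underscore, then letters/digits/underscores
            while (isWordAt(i)) ++i;
            tok = Token::ID;
        }
        else if (isDigitAt(i) || (s[i] == '-' && isDigitAt(i + 1)))
        {   // Number: optional minus sign, digits, optional fractional part
            ++i;
            while (isDigitAt(i)) ++i;
            if (i < s.size() && s[i] == '.' && isDigitAt(i + 1))
            {
                ++i;
                while (isDigitAt(i)) ++i;
            }
            tok = Token::NUM;
        }
        else
        {   // Anything else is a one-character punctuation lexeme
            ++i;
            tok = Token::PUNCT;
        }
        out.emplace_back(s.substr(start, i - start), tok);
    }
    assert(i == s.size());  // every character was consumed or skipped
    return out;
}
```

Under these assumed rules, calling tokenize on the example string yields nine lexemes: the two opening brackets, the ID abc, the colon, the NUM 1, the comma, the NUM -2.5, and the two closing brackets.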
The handling of thing above is not quite what we need. We would like a single production for each nonterminal, so that we can turn the right-hand side into a function body. Therefore, we combine the three thing productions into one.
thing → "(" thing ")" | "[" thing "]" | ID [ ":" num_list ]
num_list → NUM [ "," NUM [ "," NUM ] ]
Typically a parser will generate a parse tree. We will write a simple parser that only determines whether a string is syntactically correct or not. Each parsing function will return a bool, with true indicating correct syntax.
As we will see, we can write the code for each parsing function simply by reading off the right-hand side of the appropriate production. Each parsing function will be named after the nonterminal it handles. For example, our parsing function for the symbol thing will be called parse_thing.
Parsing functions will be member functions of a parser class with a data member of type Lex. Thus, our parsing functions need no parameters. Again, they will return bool.
Consider function parse_thing. The string it is parsing must begin with “(”, “[”, or a lexeme with token ID; otherwise there is a syntax error, and the return value is false.
If one of the above lexemes is found, then we proceed to parse the rest of the lexemes. We adopt the convention that, when a parsing function is called, the current lexeme is the first one in the string it is to parse, and when it returns, if there was no error, then the current lexeme is the first one past the string it parsed. Thus, when we encounter a nonterminal in the right-hand side of a production, the corresponding code simply calls the appropriate parsing function and checks its return value, returning false if this value is false, and continuing otherwise.
Now we can begin writing parse_thing. Here is the part that handles "(" thing ")". We assume the existence of a Lex data member named luthor_.
[C++]
bool parse_thing()
{
    if (luthor_.current().first == "(")
    {   // Handle "(" thing ")"
        luthor_.advance();
        if (!parse_thing())
            return false;
        if (luthor_.current().first != ")")
            return false;
        luthor_.advance();
    }
    else if ...
Note that, in our handling of terminals, we repeatedly check whether a terminal meets some criterion, and, if it does, we tell the lexer to advance. Encapsulating this behavior in a pair of helper functions will allow us to simplify our code.
[C++]
// If the current lexeme is the given string, consume it.
bool matchString(const std::string & lexstring)
{
    if (luthor_.current().first == lexstring)
    {
        luthor_.advance();
        return true;
    }
    return false;
}

// If the current lexeme has the given token, consume it.
bool matchToken(Lex::Token lextoken)
{
    if (luthor_.current().second == lextoken)
    {
        luthor_.advance();
        return true;
    }
    return false;
}
Here is most of parse_thing, using the above helper functions.
[C++]
bool parse_thing()
{
    if (matchString("("))
    {   // Handle "(" thing ")"
        if (!parse_thing())
            return false;
        if (!matchString(")"))
            return false;
    }
    else if (matchString("["))
    {   // Handle "[" thing "]"
        // [ As above, but with brackets ]
    }
    else if (matchToken(Lex::ID))
    {   // Handle ID [ ":" num_list ]
        if (matchString(":"))
        {
            if (!parse_numlist())
                return false;
        }
    }
    else
    {   // First lexeme is not one of the options
        return false;
    }

    // All done
    return true;
}
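Function parse_numlist is called above but not shown. The following is a sketch of how the num_list production might be read off in the same style; it is not necessarily the code in rdparse.cpp. To keep it self-contained, it uses a hypothetical stand-in for the lexer (TokenStream, which walks a pre-tokenized sequence) and a hypothetical class name (NumListParser); only the current, advance, and done operations and the helper functions come from the discussion in these notes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token categories; the notes name only ID and NUM.
enum class Token { ID, NUM, PUNCT, END };
using Lexeme = std::pair<std::string, Token>;  // (lexeme text, token)

// Hypothetical stand-in for the Lex class.
class TokenStream {
public:
    explicit TokenStream(std::vector<Lexeme> toks) : toks_(std::move(toks)) {}
    // Past the end, report an empty lexeme that matches nothing.
    Lexeme current() const
    { return done() ? Lexeme("", Token::END) : toks_[pos_]; }
    void advance() { assert(!done()); ++pos_; }
    bool done() const { return pos_ == toks_.size(); }
private:
    std::vector<Lexeme> toks_;
    std::size_t pos_ = 0;
};

class NumListParser {
public:
    explicit NumListParser(std::vector<Lexeme> toks)
        : luthor_(std::move(toks)) {}

    // num_list -> NUM [ "," NUM [ "," NUM ] ]
    bool parse_numlist()
    {
        if (!matchToken(Token::NUM))
            return false;
        if (matchString(","))           // outer optional section
        {
            if (!matchToken(Token::NUM))
                return false;
            if (matchString(","))       // inner optional section
            {
                if (!matchToken(Token::NUM))
                    return false;
            }
        }
        return true;
    }

    bool done() const { return luthor_.done(); }

private:
    bool matchString(const std::string & lexstring)
    {
        if (luthor_.current().first != lexstring) return false;
        luthor_.advance();
        return true;
    }
    bool matchToken(Token lextoken)
    {
        if (luthor_.current().second != lextoken) return false;
        luthor_.advance();
        return true;
    }

    TokenStream luthor_;
};
```

Each pair of brackets in the production becomes an if statement: the optional section is entered only when its first lexeme, the comma, is present, and once entered, the NUM that follows is required.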
See rdparse.h & rdparse.cpp for our completed Recursive-Descent parser code. Also see rdparse_main.cpp for a simple program that uses our parser.
One aspect of the behavior of our parser is curious. The string “(x)” is syntactically correct. If we pass our parser the string “(x”, then it returns false, due to the missing right parenthesis, as we would expect. However, if we pass it the string “x)”, then it returns true, even though the left parenthesis is missing.
This is because we call function parse_thing in order to parse the entire string. However, recall that the parsing functions are mutually recursive. Function parse_thing is not just for parsing the entire input; it is also used to parse a portion of the input.
And “x” is a syntactically correct thing. As far as parse_thing knows, it might lie within some other thing, and the right parenthesis might be handled by the function that called this particular invocation of parse_thing.
So we cannot detect such errors in our input using only the return values of the parsing functions. But these errors can still be easily detected by checking whether the parser reads every lexeme in the input. After parsing, we call the lexer’s done member function. If this returns false, then we know the parser was not able to parse the entire input string, so there must be some syntax error.
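The check described above can be sketched as a small wrapper around parse_thing. The code below is an illustration under stated assumptions, not the contents of rdparse.cpp: TokenStream is a hypothetical stand-in for the Lex class that walks a pre-tokenized sequence, and the names ThingParser and parse_all are invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token categories; the notes name only ID and NUM.
enum class Token { ID, NUM, PUNCT, END };
using Lexeme = std::pair<std::string, Token>;  // (lexeme text, token)

// Hypothetical stand-in for the Lex class.
class TokenStream {
public:
    explicit TokenStream(std::vector<Lexeme> toks) : toks_(std::move(toks)) {}
    // Past the end, report an empty lexeme that matches nothing.
    Lexeme current() const
    { return done() ? Lexeme("", Token::END) : toks_[pos_]; }
    void advance() { assert(!done()); ++pos_; }
    bool done() const { return pos_ == toks_.size(); }
private:
    std::vector<Lexeme> toks_;
    std::size_t pos_ = 0;
};

class ThingParser {
public:
    explicit ThingParser(std::vector<Lexeme> toks)
        : luthor_(std::move(toks)) {}

    // Hypothetical wrapper: succeed only if a thing was parsed
    // AND the lexer has no lexemes left over.
    bool parse_all() { return parse_thing() && luthor_.done(); }

    // thing -> "(" thing ")" | "[" thing "]" | ID [ ":" num_list ]
    bool parse_thing()
    {
        if (matchString("("))
            return parse_thing() && matchString(")");
        if (matchString("["))
            return parse_thing() && matchString("]");
        if (matchToken(Token::ID))
        {
            if (matchString(":"))
                return parse_numlist();
            return true;
        }
        return false;  // First lexeme is not one of the options
    }

private:
    // num_list -> NUM [ "," NUM [ "," NUM ] ]
    bool parse_numlist()
    {
        if (!matchToken(Token::NUM)) return false;
        if (matchString(","))
        {
            if (!matchToken(Token::NUM)) return false;
            if (matchString(",") && !matchToken(Token::NUM)) return false;
        }
        return true;
    }
    bool matchString(const std::string & lexstring)
    {
        if (luthor_.current().first != lexstring) return false;
        luthor_.advance();
        return true;
    }
    bool matchToken(Token lextoken)
    {
        if (luthor_.current().second != lextoken) return false;
        luthor_.advance();
        return true;
    }

    TokenStream luthor_;
};
```

With the lexemes of “x)” as input, parse_thing alone returns true, since it stops after the ID and leaves the right parenthesis unread. In this sketch parse_all returns false for the same input, because done reports that a lexeme remains.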
Files rdparse.h, rdparse.cpp, and rdparse_main.cpp were modified to reflect the above ideas.
Recursive-Descent Parsing will be continued next time.
ggchappell@alaska.edu