CS 331 Spring 2013
Lecture Notes for Friday, February 8, 2013

Recursive-Descent Parsing

Introduction

Recursive Descent is a top-down parsing algorithm, in the LL category. Recursive-descent parsers are generally hand-coded, as opposed to being automatically generated based on a grammar.

Many algorithms you are familiar with (Binary Search, various sorting algorithms, etc.) can be written once, and, if the implementation is suitably generic, never written again. But Recursive Descent is not like this. When we write a Recursive-Descent parser, we choose what functions to write, depending on what our grammar looks like. Thus, a Recursive-Descent parser is tailored specifically for the grammar it uses; a new grammar requires writing a new parser.

How it Works

A Recursive-Descent parser has one parsing function for each nonterminal in the grammar. Each parsing function is responsible for parsing all strings that the corresponding nonterminal can be expanded into. So the function corresponding to the start symbol is the one we call to parse our language.

The code for a parsing function is essentially a translation into code of the right-hand side of the production for the corresponding nonterminal. Nonterminals in the right-hand side become function calls, and so the parsing functions are mutually recursive. Terminals become checks that the input string contains the proper lexemes.
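
As a minimal sketch of this translation, consider a hypothetical one-nonterminal grammar, item → "(" item ")" | "x". This is not the grammar used later in these notes, and the recognizer works directly on characters rather than on lexemes from a lexer. The names ItemParser, parse_item, match, and pos_ are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Recognizer for the hypothetical grammar
//     item -> "(" item ")" | "x"
// Each character is treated as one lexeme (an assumption, for brevity).
class ItemParser {
public:
    explicit ItemParser(const std::string & input)
        : input_(input), pos_(0) {}

    // Parsing function for the nonterminal "item".
    // On success, pos_ is just past the text that was parsed.
    bool parse_item()
    {
        if (match('('))
        {   // item -> "(" item ")"
            if (!parse_item()) return false;  // nonterminal: function call
            return match(')');                // terminal: check the input
        }
        return match('x');                    // item -> "x"
    }

private:
    // If the current character is c, consume it and return true.
    bool match(char c)
    {
        if (pos_ < input_.size() && input_[pos_] == c)
        {
            ++pos_;
            return true;
        }
        return false;
    }

    std::string input_;   // text being parsed
    std::size_t pos_;     // index of the current character
};
```

The single nonterminal gives a single parsing function, which calls itself where item appears on the right-hand side. A grammar with several nonterminals would have several such functions calling one another.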

Simple Example

We wish to write a Recursive-Descent parser for the following grammar, which is mostly in EBNF. Italic indicates nonterminals. Terminals (lexemes) are either in CAPITALS, indicating a token, or quoted "typewriter" font when the literal characters are given. The start symbol is thing. Recall that brackets indicate an optional section.

thing → "(" thing ")"
thing → "[" thing "]"
thing → ID [ ":" num_list ]
num_list → NUM [ "," NUM [ "," NUM ] ]

The idea is that “ID” and “NUM” refer to the tokens of the same names generated by our Lex class. Using that lexer as a front-end, we get a parser that will handle strings that look something like this:

[(abc: 1, -2.5)]

The handling of thing above is not quite what we need. We would like a single production for each nonterminal, so that we can turn the right-hand side into a function body. Therefore, we combine the three thing productions into one.

thing → "(" thing ")" | "[" thing "]" | ID [ ":" num_list ]
num_list → NUM [ "," NUM [ "," NUM ] ]

Typically a parser will generate a parse tree. We will write a simple parser that only determines whether a string is syntactically correct or not. Each parsing function will return a bool, with true indicating correct syntax. As we will see, we can write the code for each parsing function simply by reading off the right-hand side of the appropriate production.

Each parsing function will be named after the nonterminal it handles. For example, our parsing function for the symbol thing will be called parse_thing. Parsing functions will be member functions of a parser class with a data member of type Lex. Thus, our parsing functions need no parameters. Again, they will return bool.

Consider function parse_thing. The string it is parsing must begin with “(”, “[”, or a lexeme with token ID; otherwise there is a syntax error, and the return value is false.

If one of the above lexemes is found, then we proceed to parse the rest of the lexemes. We adopt the convention that, when a parsing function is called, the current lexeme is the first one in the string it is to parse, and when it returns, if there was no error, then the current lexeme is the first one past the string it parsed. Thus, when we encounter a nonterminal in the right-hand side of a production, the corresponding code simply calls the appropriate parsing function and checks its return value, returning false if this value is false, and continuing otherwise.

Now we can begin writing parse_thing. Here is the part that handles "(" thing ")". We assume the existence of a Lex data member named luthor_.

[C++]

bool parse_thing()
{
    if (luthor_.current().first == "(")
    {  // Handle "(" thing ")"
        luthor_.advance();
        if (!parse_thing())
            return false;
        if (luthor_.current().first != ")")
            return false;
        luthor_.advance();
    }
    else if ...

Note that, in our handling of terminals, we repeatedly check whether a terminal meets some criterion, and, if it does, we tell the lexer to advance. Encapsulating this behavior in a pair of helper functions will allow us to simplify our code.

[C++]

bool matchString(const std::string & lexstring)
{
    if (luthor_.current().first == lexstring)
    {
        luthor_.advance();
        return true;
    }
    return false;
}

bool matchToken(Lex::Token lextoken)
{
    if (luthor_.current().second == lextoken)
    {
        luthor_.advance();
        return true;
    }
    return false;
}

Here is most of parse_thing, using the above helper functions.

[C++]

bool parse_thing()
{
    if (matchString("("))
    {  // Handle "(" thing ")"
        if (!parse_thing()) return false;
        if (!matchString(")")) return false;
    }
    else if (matchString("["))
    {  // Handle "[" thing "]"
        [ As above, but with brackets ]
    }
    else if (matchToken(Lex::ID))
    {  // Handle ID [ ":" num_list ]
        if (matchString(":"))
        {
            if (!parse_numlist()) return false;
        }
    }
    else
    {  // First lexeme is not one of the options
        return false;
    }

    // All done
    return true;
}
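
The parsing function for num_list is not shown above. As a sketch of how the nested optional sections in num_list → NUM [ "," NUM [ "," NUM ] ] might turn into nested if statements, the following uses a hypothetical stand-in for the lexer: a pre-tokenized sequence of (string, token) pairs. The names Token, Lexeme, MiniLexer, and NumListParser are assumptions for illustration; this is not the Lex class, and not necessarily the code in rdparse.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token categories (stand-in for those of the real lexer).
enum class Token { ID, NUM, PUNCT, END };

using Lexeme = std::pair<std::string, Token>;  // (text, category)

// Hypothetical lexer stand-in: walks a pre-tokenized sequence.
class MiniLexer {
public:
    explicit MiniLexer(std::vector<Lexeme> lexemes)
        : lexemes_(std::move(lexemes)), pos_(0) {}

    // Current lexeme, or ("", END) when past the last one.
    Lexeme current() const
    {
        if (pos_ < lexemes_.size()) return lexemes_[pos_];
        return Lexeme("", Token::END);
    }

    void advance() { if (pos_ < lexemes_.size()) ++pos_; }

private:
    std::vector<Lexeme> lexemes_;
    std::size_t pos_;
};

class NumListParser {
public:
    explicit NumListParser(std::vector<Lexeme> lexemes)
        : luthor_(std::move(lexemes)) {}

    // num_list -> NUM [ "," NUM [ "," NUM ] ]
    bool parse_num_list()
    {
        if (!matchToken(Token::NUM)) return false;
        if (matchString(","))
        {   // Outer optional section: "," NUM [ "," NUM ]
            if (!matchToken(Token::NUM)) return false;
            if (matchString(","))
            {   // Inner optional section: "," NUM
                if (!matchToken(Token::NUM)) return false;
            }
        }
        return true;
    }

private:
    // If the current lexeme has the given text, consume it.
    bool matchString(const std::string & lexstring)
    {
        if (luthor_.current().first != lexstring) return false;
        luthor_.advance();
        return true;
    }

    // If the current lexeme has the given token category, consume it.
    bool matchToken(Token lextoken)
    {
        if (luthor_.current().second != lextoken) return false;
        luthor_.advance();
        return true;
    }

    MiniLexer luthor_;
};
```

Each bracketed optional section becomes an if statement whose condition tries to match the first lexeme of that section. Once a comma has been consumed, a NUM is required, so a trailing comma is a syntax error.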

See rdparse.h & rdparse.cpp for our completed Recursive-Descent parser code.

Also see rdparse_main.cpp for a simple program that uses our parser.

Handling Syntactically Incorrect Input

One aspect of the behavior of our parser is curious. The string “(x)” is syntactically correct. If we pass our parser the string “(x”, then it returns false, due to the missing right parenthesis, as we would expect. However, if we pass it the string “x)”, then it returns true, even though the left parenthesis is missing.

This is because we call function parse_thing in order to parse the entire string. However, recall that the parsing functions are mutually recursive. Function parse_thing is not just for parsing the entire input; it is also used to parse a portion of the input. And “x” is a syntactically correct thing. As far as parse_thing knows, it might lie within some larger thing, and the right parenthesis might be handled by the function that called this particular invocation of parse_thing.

So we cannot detect such errors in our input using only the return values of the parsing functions. But these errors can still be easily detected by checking whether the parser reads every lexeme in the input. After parsing, we call the lexer’s done member function. If this returns false, then we know the parser was not able to parse the entire input string, so there must be some syntax error.
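
The check described above can be sketched as a top-level function that accepts the input only if parse_thing succeeds and the lexer has nothing left to read. This is a sketch under stated assumptions: MiniLexer is a hypothetical stand-in for the Lex class (its done is assumed to return true once every lexeme has been read), the names Token, Lexeme, ThingParser, and parse are made up for illustration, and the grammar is cut down to thing → "(" thing ")" | "[" thing "]" | ID, leaving out the optional num_list part for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

enum class Token { ID, PUNCT, END };           // hypothetical token categories
using Lexeme = std::pair<std::string, Token>;  // (text, category)

// Hypothetical lexer stand-in over a pre-tokenized sequence.
class MiniLexer {
public:
    explicit MiniLexer(std::vector<Lexeme> lexemes)
        : lexemes_(std::move(lexemes)), pos_(0) {}

    // Current lexeme, or ("", END) when past the last one.
    Lexeme current() const
    {
        if (pos_ < lexemes_.size()) return lexemes_[pos_];
        return Lexeme("", Token::END);
    }

    void advance() { if (pos_ < lexemes_.size()) ++pos_; }

    // True if every lexeme has been read.
    bool done() const { return pos_ == lexemes_.size(); }

private:
    std::vector<Lexeme> lexemes_;
    std::size_t pos_;
};

class ThingParser {
public:
    explicit ThingParser(std::vector<Lexeme> lexemes)
        : luthor_(std::move(lexemes)) {}

    // Entry point: the input is correct only if a thing was parsed
    // AND no lexemes are left over.
    bool parse() { return parse_thing() && luthor_.done(); }

    // thing -> "(" thing ")" | "[" thing "]" | ID   (num_list part omitted)
    bool parse_thing()
    {
        if (matchString("(")) return parse_thing() && matchString(")");
        if (matchString("[")) return parse_thing() && matchString("]");
        return matchToken(Token::ID);
    }

private:
    // If the current lexeme has the given text, consume it.
    bool matchString(const std::string & lexstring)
    {
        if (luthor_.current().first != lexstring) return false;
        luthor_.advance();
        return true;
    }

    // If the current lexeme has the given token category, consume it.
    bool matchToken(Token lextoken)
    {
        if (luthor_.current().second != lextoken) return false;
        luthor_.advance();
        return true;
    }

    MiniLexer luthor_;
};
```

With this arrangement, the input “x)” is rejected by parse even though parse_thing by itself returns true for it: the right parenthesis is never read, so done returns false.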

Files rdparse.h, rdparse.cpp, and rdparse_main.cpp were modified to reflect the above ideas.

Recursive-Descent Parsing will be continued next time.

CS 331 Spring 2013: Lecture Notes for Friday, February 8, 2013 / Updated: 12 Feb 2013 / Glenn G. Chappell / ggchappell@alaska.edu