CS 331 Spring 2013 Lecture Notes for Wednesday, January 30, 2013

Grammars in Practice

BNF

In the specification of programming-language syntax, there is a need for a grammar format that:

• Can deal with terminals involving arbitrary character sets.
• Does not require unusual characters (like our “→” and “ε”).
• Is precisely specified enough that it can be used as input to a computer program.

In addition, we would like to be able to give our nonterminals descriptive names (e.g., “for_loop”), rather than single letters.

To meet this need, Backus-Naur Form (BNF) was developed. BNF is a notation for writing context-free grammars that allows for the use of arbitrary characters, as well as improving readability. It was invented by John Backus and improved by Peter Naur in the late 1950s and early 1960s. BNF, or some variation on it, is used to specify the syntax of many programming languages.

In BNF, nonterminals are enclosed in angle brackets: <digit>. Terminals are not. If it is not clear which symbols are terminals, then terminals may be enclosed in quotes. Our arrow is replaced by “::=”, and vertical bars (|) are used just as we have used them.

For example, here is a BNF production specifying what a digit is:

<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Some versions would allow the quotes to be left off, above.

Here is a BNF grammar for a phone number. It allows things like 555-6666 and (333)555-6666. The above <digit> production should be regarded as part of this grammar. The start symbol is <phone-number>.

<phone-number> ::= <area-code> <7-dig> | <7-dig>
<area-code> ::= "(" <digit> <digit> <digit> ")"
<7-dig> ::= <digit> <digit> <digit> "-" <digit> <digit> <digit> <digit>
<digit> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Note that the use of multi-character nonterminals (“<phone-number>”) makes it easy to tell what is going on. Note also that blanks are not allowed unless explicitly quoted.
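To see how such a grammar drives recognition, here is a minimal sketch (ours, not part of the notes) of a recursive-descent recognizer for the phone-number grammar. There is one function per nonterminal; the function and variable names are illustrative.

```python
# Recursive-descent recognizer for the BNF phone-number grammar above.
# Each function tries to match its nonterminal starting at position i,
# returning the position just past the match, or None on failure.

def is_digit(s, i):
    """Match <digit> at position i."""
    if i < len(s) and s[i] in "0123456789":
        return i + 1
    return None

def is_area_code(s, i):
    """Match <area-code>: "(" <digit> <digit> <digit> ")"."""
    if i >= len(s) or s[i] != "(":
        return None
    i += 1
    for _ in range(3):
        i = is_digit(s, i)
        if i is None:
            return None
    if i >= len(s) or s[i] != ")":
        return None
    return i + 1

def is_seven_dig(s, i):
    """Match <7-dig>: three digits, "-", four digits."""
    for _ in range(3):
        i = is_digit(s, i)
        if i is None:
            return None
    if i >= len(s) or s[i] != "-":
        return None
    i += 1
    for _ in range(4):
        i = is_digit(s, i)
        if i is None:
            return None
    return i

def is_phone_number(s):
    """Match <phone-number> against the entire string s."""
    j = is_area_code(s, 0)          # optional area code
    start = j if j is not None else 0
    j = is_seven_dig(s, start)
    return j == len(s)

print(is_phone_number("555-6666"))       # True
print(is_phone_number("(333)555-6666"))  # True
print(is_phone_number("555-66a6"))       # False
```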

EBNF

An improved syntax, sometimes called Extended Backus-Naur Form, or EBNF, essentially allows the right-hand side of each production to be a regular expression made up of terminals and nonterminals. Braces { ... } surround sections that are optional and repeatable (like “*” in our regular-expression syntax). Brackets [ ... ] surround sections that are optional and not repeatable (like the “?” shortcut we discussed). We can use parentheses ( ... ) for grouping, where necessary.

Typically, EBNF uses a simpler “arrow” as well: colon (:) or equals (=). Often, the angle brackets are left off of nonterminals.

For example, using this syntax, we could replace the phone number grammar, above, by the following.

phone-number = [ area-code ] 7-dig
area-code = "(" digit digit digit ")"
7-dig = digit digit digit "-" digit digit digit digit
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
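Since the right-hand sides in EBNF are essentially regular expressions over terminals and nonterminals, a grammar like this one maps directly onto an ordinary regular expression once the nonterminals are expanded. Here is a sketch in Python; the regex and the name phone_number are ours, not part of the notes.

```python
import re

# Regex equivalent of the EBNF phone-number grammar above:
# [ area-code ] becomes an optional group, and the repeated
# digit symbols become \d{3} and \d{4}.
phone_number = re.compile(r"(\(\d{3}\))?\d{3}-\d{4}")

print(bool(phone_number.fullmatch("555-6666")))       # True
print(bool(phone_number.fullmatch("(333)555-6666")))  # True
print(bool(phone_number.fullmatch("333-555-6666")))   # False
```

This translation works here because the grammar has no recursion; a recursive grammar (e.g., for nested parentheses) would be beyond the power of regular expressions.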

Grammars for Programming-Language Specification

Variations on the above conventions are used to write formal grammars for many programming languages.

BNF-ish grammars can be used as inputs to parser generators. For example, the parser generator yacc takes, as input, a CFG specified using a BNF-like syntax. Here is one production from a yacc-syntax grammar for the C programming language.

compound_statement
: '{' '}'
| '{' statement_list '}'
| '{' declaration_list '}'
| '{' declaration_list statement_list '}'
;

Terminals are quoted. Nonterminals are ordinary words, possibly containing underscores (_). The arrow is replaced by a colon (:). The semicolon (;) at the end is neither a terminal nor a nonterminal symbol; rather, it marks the end of the production.

A similar syntax is also used for human-readable grammars. Here is a production from a grammar for the Haskell programming language in the Haskell 2010 report.

qual → pat <- exp
| let decls
| exp

Here, typographical differences are used to reduce the amount of punctuation. Non-terminals are in italic type, while terminals are in a typewriter font, and are not quoted. An arrow (→) is used, and the vertical bar (|) is as usual.

Here is a production from a grammar for C++ in the 2011 ISO standard.

selection-statement:
if ( condition ) statement
if ( condition ) statement else statement
switch ( condition ) statement

Here again, nonterminals are in italic, and terminals use a typewriter font. A colon (:) is used instead of an arrow, and vertical bars are omitted, with the various possible right-hand sides simply being placed on separate lines.

A point to be made is that these various CFG grammars are all very understandable. Certainly there need to be notational conventions. If one uses a grammar as input to a program, then the conventions must be rigidly enforced. But it does not matter so much exactly what the conventions are.

Introduction to Lexing & Parsing

A compiler needs to do three things.

• Determine the syntactic correctness and structure of a program.
• List all identifiers and determine what they refer to.
• Generate code.

We now turn our attention to the first part, which is called parsing. (For a more in-depth look at all three parts, see CS 431.)

Parsing is usually broken up into two phases: lexical analysis and syntax analysis.

Lexical analysis, or lexing, means breaking up a program into words, which we call lexemes. This phase takes a stream of characters as input and outputs a stream of lexemes. A lexer involves computation at the level of regular languages and finite state machines.
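A lexer of this kind can be sketched with regular expressions, one per lexeme category. The toy token categories below (NUMBER, ID, OP) and all names are ours, chosen for illustration.

```python
import re

# A minimal lexer sketch: turn a character stream into lexemes.
# Each category is a regular expression, matching the claim that
# lexing is computation at the level of regular languages.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),        # whitespace: matched but discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def lex(text):
    """Return a list of (category, lexeme) pairs for the input string."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(lex("x = 3 + 42"))
# [('ID', 'x'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ('NUMBER', '42')]
```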

Syntax analysis, or parsing, takes a stream of lexemes as input, and, if the program is syntactically correct, outputs a parse tree (or some data structure containing equivalent information). A parser involves a somewhat higher level of computation, at the level of CFGs; there must be a stack used somewhere.
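A syntax analyzer at this level can be sketched as a recursive-descent parser; here the recursion itself supplies the stack (the call stack). The tiny expression grammar and all names below are ours, for illustration only.

```python
# A minimal parser sketch for the toy grammar (in EBNF style):
#   expr = term { "+" term }
#   term = NUMBER | "(" expr ")"
# Input is a list of lexemes; output is a parse tree built from tuples.
# The recursive calls use the call stack -- the stack a CFG-level
# computation requires.

def parse_expr(toks, i):
    """Parse an expr starting at index i; return (tree, next index)."""
    tree, i = parse_term(toks, i)
    while i < len(toks) and toks[i] == "+":
        right, i = parse_term(toks, i + 1)
        tree = ("+", tree, right)
    return tree, i

def parse_term(toks, i):
    """Parse a term: a number, or a parenthesized expr."""
    if toks[i] == "(":
        tree, i = parse_expr(toks, i + 1)
        if toks[i] != ")":
            raise SyntaxError("expected ')'")
        return tree, i + 1
    return ("num", toks[i]), i + 1

tree, _ = parse_expr(["1", "+", "(", "2", "+", "3", ")"], 0)
print(tree)  # ('+', ('num', '1'), ('+', ('num', '2'), ('num', '3')))
```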

Note that the term “parsing” can mean two different things: it can mean the entire process, or just the second phase. However, in practice, this rarely leads to misunderstandings.

There are a number of reasons for separating parsing into two phases. It makes our code more modular. It simplifies and speeds up lexical analysis, since more complex parsing code does not need to be involved in that phase. It makes a parser easier to write and more portable, since this code is insulated from “the outside world”: character sets, files, etc. It also simplifies the specification of a formal grammar for the parser.

CS 331 Spring 2013: Lecture Notes for Wednesday, January 30, 2013 / Updated: 12 Feb 2013 / Glenn G. Chappell / ggchappell@alaska.edu