CS 331 Spring 2013: Lecture Notes for Monday, January 28, 2013

We now turn to the second-smallest class of languages in the Chomsky Hierarchy: context-free languages.

A context-free grammar (CFG) is a grammar in which the left-hand side of every production consists of a single nonterminal.

A context-free language (CFL) is a language that is generated by some context-free grammar.

Context-free languages are important because, for nearly every programming language, the set of all syntactically correct programs forms a CFL. CFGs and CFLs are thus important in parsing: determining whether a program is syntactically correct, and, if so, finding its structure.

Example

In other words, the language is \(\{b, aba, aabaa, aaabaaa, \dots \}\). This is not a regular language.
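As a rough illustration, here is a minimal Python sketch of a recognizer for this language. It assumes the productions \(S \to aSa\) and \(S \to b\), which is one grammar that generates the language; the function name `in_language` is made up for this example.

```python
def in_language(s: str) -> bool:
    """Return True if s has the form a^n b a^n for some n >= 0.

    Each branch mirrors one of the assumed productions.
    """
    if s == "b":
        # Assumed production: S -> b
        return True
    if len(s) >= 3 and s[0] == "a" and s[-1] == "a":
        # Assumed production: S -> a S a (strip one a from each end)
        return in_language(s[1:-1])
    return False


for candidate in ["b", "aba", "aabaa", "aaba", "ab"]:
    print(candidate, in_language(candidate))
```

The recursion has to match each leading \(a\) with a trailing \(a\), which is the kind of unbounded counting a finite automaton cannot do.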

CFGs are powerful enough to do things like matching parentheses. Consider the following grammar, where “\((\)” and “\()\)” are terminals.

The language generated by the above grammar consists of all sequences of properly matched parentheses. For example “\((\,(\,(\,)\,(\,)\,)\,(\,)\,)\,(\,)\)” is a string in this language.
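Membership in a language like this can be checked with a simple depth counter. The Python sketch below is not derived from the grammar itself; it is an equivalent check, and it assumes that the empty string counts as properly matched (whether it does depends on the exact grammar). The name `is_matched` is hypothetical.

```python
def is_matched(s: str) -> bool:
    """Return True if s consists only of properly matched parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ")" appeared with no "(" left to match it.
                return False
        else:
            # Only parentheses are terminals in this language.
            return False
    # Every "(" must have been closed. Assumption: "" is accepted.
    return depth == 0


print(is_matched("((()())())()"))
print(is_matched("(()"))
```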

A Syntactic Shortcut

When we write a CFG, we often have a number of productions with the same left-hand side. As a shortcut, we write these on one line, with the various right-hand sides separated by vertical bars (“\(|\)”).

And here is the second grammar, above. Again, parentheses are terminal symbols here.

Parse Trees

As we said above, parsing involves finding the structure of a program. We can express this structure using a parse tree.

This grammar generates the language \(\{x, x+x, x+x+x, x+x+x+x, \dots\}\). Here is a derivation for the string \(x+x+x\).

We can express structure based on this derivation as a tree. The root holds the start symbol. When we apply a rule and expand a nonterminal, the symbols it is expanded into become its children in the tree. Here is a parse tree for “\(x+x+x\)” based on the above CFG and derivation.

Note that the leaves of a parse tree hold terminal symbols, while the internal nodes (non-leaves) hold nonterminal symbols. Thus, we can read off the final string by looking at the leaves of the tree.
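To make this concrete, here is a small Python sketch that stores a parse tree and reads the string back off its leaves. It assumes a grammar with productions \(E \to E + E\) and \(E \to x\); the tree representation and the helper names `leaf`, `node`, and `leaves` are choices made for this example only.

```python
# A tree node is a pair (symbol, children). A leaf has no children.
def leaf(symbol):
    return (symbol, [])


def node(symbol, *children):
    return (symbol, list(children))


# A parse tree for x+x+x under the assumed grammar E -> E + E | x,
# with the left-hand addition grouped first.
tree = node(
    "E",
    node("E", node("E", leaf("x")), leaf("+"), node("E", leaf("x"))),
    leaf("+"),
    node("E", leaf("x")),
)


def leaves(t):
    """Return the leaf symbols of t, left to right."""
    symbol, children = t
    if not children:
        return [symbol]
    result = []
    for child in children:
        result.extend(leaves(child))
    return result


print("".join(leaves(tree)))  # prints x+x+x
```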

Ambiguity

The Idea

There is another parse tree for “\(x+x+x\)” based on the above grammar. It is shown below.

This means that the string \(x+x+x\) has two possible structures. A CFG like this, in which a single string has more than one parse tree, is said to be ambiguous.
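One way to see how quickly ambiguity grows is to count parse trees. The Python sketch below assumes the grammar has productions \(E \to E + E\) and \(E \to x\); under that assumption, a parse tree is determined by which \(+\) is applied at the root, and then by the trees for the two sides. The function name `count_trees` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_trees(n: int) -> int:
    """Count parse trees for the string made of n x's joined by '+'.

    Assumes the grammar E -> E + E | x.
    """
    if n == 1:
        return 1  # only E -> x applies
    # The root uses E -> E + E; the left side gets k x's, the right n - k.
    return sum(count_trees(k) * count_trees(n - k) for k in range(1, n))


for n in range(1, 6):
    print(n, count_trees(n))
```

Under this assumption, `count_trees(3)` is 2, matching the two parse trees for \(x+x+x\), and `count_trees(4)` is 5.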

Eliminating Ambiguity

Ambiguity is a property of grammars, not of languages. The grammar above is ambiguous; however, we can find another CFG that generates the same language.

In finding such a grammar, we should note that, assuming “\(+\)” represents normal addition, we like the first parse tree better than the second, since it expresses the left associativity that we usually want addition to have.

Here is a non-ambiguous grammar that generates the same language, and keeps this left-associativity idea.
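The following Python sketch shows how a left-associative grammar yields exactly one structure per string. It assumes the productions \(E \to E + x\) and \(E \to x\), which is one non-ambiguous grammar for this language; the function name `parse_sum` and the tuple representation are made up for this example.

```python
def parse_sum(s: str):
    """Parse x, x+x, x+x+x, ... into a left-nested tuple.

    Assumes the grammar E -> E + x | x. Raises ValueError on any
    string that is not in the language.
    """
    if not s or s[0] != "x":
        raise ValueError("expected 'x' at position 0")
    tree = "x"  # E -> x
    i = 1
    while i < len(s):
        if s[i:i + 2] != "+x":
            raise ValueError(f"expected '+x' at position {i}")
        # E -> E + x: everything parsed so far becomes the left operand.
        tree = (tree, "+", "x")
        i += 2
    return tree


print(parse_sum("x+x+x"))  # prints (('x', '+', 'x'), '+', 'x')
```

Because the right operand of each \(+\) can only be a single \(x\), there is never a choice about how to group the additions.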

Inherent Ambiguity

There are CFLs that cannot be generated by non-ambiguous CFGs. Such a CFL is inherently ambiguous. Here is one standard example.

It can be demonstrated that, no matter how we write the grammar for this language, there is going to be some string with the same number of \(a\)s, \(b\)s, \(c\)s, and \(d\)s, that has two different parse trees.

Algorithmic Implications

Ambiguity in a CFG is not a good thing. For one thing, it means that a program has two possible structures; which should we use? For another, it can slow down parsing. The fastest parsing algorithms are linear-time; however, many do not work with ambiguous grammars.

Another interesting algorithmic fact is that it is known to be impossible to write a program that can determine, for an arbitrary given CFG, whether or not it is ambiguous.

Leftmost & Rightmost Derivations

The language generated by this grammar consists of only one string: \(xyz\). However, this string has multiple derivations; here is one.

Note that, while there are multiple derivations for \(xyz\), there is only one parse tree.

Since no string has more than one parse tree, the above grammar is not ambiguous.

In the first derivation above, at each step, we expand the leftmost nonterminal. Such a derivation is a leftmost derivation. In the second derivation, we always expand the rightmost nonterminal. Such a derivation is a rightmost derivation.
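The difference between the two kinds of derivation can be sketched in a few lines of Python. The grammar used here is an assumption: \(S \to XYZ\), \(X \to x\), \(Y \to y\), \(Z \to z\), which generates only \(xyz\). Each nonterminal has a single production, so the only choice at each step is which nonterminal to expand. The names `RULES` and `derive` are hypothetical.

```python
# Hypothetical grammar: each nonterminal has exactly one production.
RULES = {"S": "XYZ", "X": "x", "Y": "y", "Z": "z"}


def derive(start: str = "S", leftmost: bool = True):
    """Return the list of sentential forms in a derivation from start.

    Expands the leftmost nonterminal at every step if leftmost is True,
    and the rightmost nonterminal otherwise.
    """
    form = start
    steps = [form]
    while True:
        positions = [i for i, ch in enumerate(form) if ch in RULES]
        if not positions:
            return steps  # no nonterminals left
        i = positions[0] if leftmost else positions[-1]
        form = form[:i] + RULES[form[i]] + form[i + 1:]
        steps.append(form)


print(" => ".join(derive(leftmost=True)))   # S => XYZ => xYZ => xyZ => xyz
print(" => ".join(derive(leftmost=False)))  # S => XYZ => XYz => Xyz => xyz
```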

These concepts will come up in our study of parsing. Parsers generally go through the process of finding a derivation. They may not explicitly store or output the derivation, but they do go through all the steps required to find it. Parsers come in two big classes. One class constructs the derivation from top to bottom. Since the input is read left to right, the leftmost nonterminal will generally be expanded first, and so such parsers generally find leftmost derivations. The other class constructs derivations from bottom to top. These will generally collapse strings on the left first. But since the derivation is constructed backwards, when it is read top to bottom, the rightmost nonterminal is expanded first. Such parsers will tend to find rightmost derivations.

Note that some derivations are neither leftmost nor rightmost. Here is an example.

A Note on C++

As noted above, for nearly every programming language, the set of all syntactically correct programs forms a CFL. One notable exception is C++. We look briefly at why this is the case.

This has two possible interpretations. It could be a prototype of a function foo that takes a single parameter of type bar and returns an int. Or it could be a declaration of a variable named foo, of type int, with the value bar being passed to its constructor.

How does the compiler tell the difference? It determines whether bar represents a type. If so, then foo is a function; if not, then foo is a variable.

Now, if foo is a function, then we can name its parameters; we can replace “bar” with “bar x”. We cannot make this change if foo is a variable. And this syntactic difference depends on what bar represents, which is not something we can determine locally.

Determining what bar represents is not something we can do with a CFG. Thus, the set of all syntactically correct C++ programs does not form a CFL.

However, keeping track of what identifiers represent is something compilers need to do anyway. In an actual C++ parser, information on what bar represents is likely to be on hand. As a result, we can still write a C++ parser using some of the same techniques that are used with CFGs and CFLs.
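The idea of consulting a table of known names while parsing can be sketched as follows. This is a toy Python illustration, not how any real C++ compiler is written. It assumes the statement under discussion has the form `int foo(bar);`, and the names `DECL`, `classify`, and `type_names` are made up for this example.

```python
import re

# Matches only statements of the assumed form: int NAME(ARG);
DECL = re.compile(r"^\s*int\s+(\w+)\s*\(\s*(\w+)\s*\)\s*;\s*$")


def classify(stmt, type_names):
    """Decide how to read 'int NAME(ARG);' given the known type names."""
    m = DECL.match(stmt)
    if m is None:
        raise ValueError("not of the form 'int NAME(ARG);'")
    name, arg = m.groups()
    if arg in type_names:
        # ARG names a type, so this declares a function taking an ARG.
        return f"{name} is a function"
    # Otherwise ARG is a value passed to the constructor of an int.
    return f"{name} is a variable"


print(classify("int foo(bar);", {"bar"}))  # prints foo is a function
print(classify("int foo(bar);", set()))    # prints foo is a variable
```

The same text is classified differently depending on `type_names`, which is information that comes from elsewhere in the program.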

CS 331 Spring 2013: Lecture Notes for Monday, January 28, 2013 / Updated: 13 Feb 2013 / Glenn G. Chappell / ggchappell@alaska.edu


Context-Free Languages

Context-Free Grammars & Context-Free Languages

Definition & Purpose

Example

A Syntactic Shortcut

Parse Trees

Ambiguity

The Idea

Eliminating Ambiguity

Inherent Ambiguity

Algorithmic Implications

Leftmost & Rightmost Derivations

A Note on C++
