CS 331 Spring 2013 > Lecture Notes for Monday, January 28, 2013
We now turn to the second-smallest class of languages in the Chomsky Hierarchy: context-free languages.
A context-free grammar (CFG) is a grammar, each of whose productions has a left-hand side consisting only of a single nonterminal.
A context-free language (CFL) is a language that is generated by some context-free grammar.
Context-free languages are important because, for nearly every programming language, the set of all syntactically correct programs forms a CFL. CFGs and CFLs are thus important in parsing: determining whether a program is syntactically correct, and, if so, finding its structure.
All of the grammars we have looked at have been CFGs. Here is another one.
\[S \rightarrow A\] \[A \rightarrow xAy\] \[A \rightarrow \varepsilon\]
The language generated by the above grammar can be described as
\[\{\, x^ny^n \mid n\ge 0 \,\}.\]
In other words, the language is \(\{\varepsilon, xy, xxyy, xxxyyy, \dots \}\). This is not a regular language.
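To see how the grammar \(S \rightarrow A\), \(A \rightarrow xAy\), \(A \rightarrow \varepsilon\) produces its strings, here is a minimal Python sketch. Each use of \(A \rightarrow xAy\) adds one \(x\) on the left and one \(y\) on the right, so the result is some number of \(x\)s followed by the same number of \(y\)s. The function names `derive` and `in_language` are made up for this illustration.

```python
def derive(n):
    """Return the derivation that applies A -> xAy exactly n times
    and then finishes with A -> epsilon."""
    steps = ["S", "A"]                          # S -> A
    for k in range(1, n + 1):
        steps.append("x" * k + "A" + "y" * k)   # A -> xAy
    steps.append("x" * n + "y" * n)             # A -> epsilon
    return steps


def in_language(s):
    """Check whether s consists of n x's followed by n y's, n >= 0."""
    n, rem = divmod(len(s), 2)
    return rem == 0 and s == "x" * n + "y" * n
```

For example, `derive(2)` lists the sentential forms S, A, xAy, xxAyy, xxyy.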
CFGs are powerful enough to do things like matching parentheses. Consider the following grammar, where “\((\)” and “\()\)” are terminals.
\[S \rightarrow SS\] \[S \rightarrow (\,S\,)\] \[S \rightarrow \varepsilon\]
The language generated by the above grammar consists of all sequences of properly matched parentheses. For example “\((\,(\,(\,)\,(\,)\,)\,(\,)\,)\,(\,)\)” is a string in this language.
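As a quick check on what "properly matched" means, here is a short Python sketch. It does not use the grammar directly; it uses the familiar depth-counter test, which accepts the same set of strings. The name `balanced` is an assumption of this example.

```python
def balanced(s):
    """Return True if s is a sequence of properly matched parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a ")" with no "(" left to match
                return False
        else:
            return False         # only "(" and ")" are terminals
    return depth == 0            # every "(" must have been closed
```

Note that the empty string is accepted, matching the production \(S \rightarrow \varepsilon\).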
When we write a CFG, we often have a number of productions with the same left-hand side. As a shortcut, we write these on one line, with the various right-hand sides separated by vertical bars (“\(|\)”).
For example, the first grammar, above, can be rewritten as follows.
\[S \rightarrow A\] \[A \rightarrow xAy \mid \varepsilon\]
And here is the second grammar, above. Again, parentheses are terminal symbols here.
\[S \rightarrow SS \mid (\,S\,) \mid \varepsilon\]
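The vertical-bar shortcut is only notation: one line with several alternatives stands for several separate productions. The following Python sketch expands a line back into individual productions. It assumes a plain-text notation with "->" for the arrow and "|" for the bar; that notation and the name `expand_rule` are choices made for this example only.

```python
def expand_rule(line):
    """Split 'LHS -> alt1 | alt2 | ...' into a list of (LHS, alt) pairs."""
    lhs, rhs = line.split("->")
    return [(lhs.strip(), alt.strip()) for alt in rhs.split("|")]
```

For instance, `expand_rule("A -> xAy | ε")` gives the two productions `("A", "xAy")` and `("A", "ε")`.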
As we said above, parsing involves finding the structure of a program. We can express this structure using a parse tree.
Here is a grammar. We will allow “\(+\)” as a terminal symbol.
\[S \rightarrow S+S \mid x\]
This grammar generates the language \(\{x, x+x, x+x+x, x+x+x+x, \dots\}\). Here is a derivation for the string \(x+x+x\).
\[S\] \[S+S\] \[S+S+S\] \[x+S+S\] \[x+x+S\] \[x+x+x\]
We can express structure based on this derivation as a tree. The root holds the start symbol. When we apply a rule and expand a nonterminal, the symbols it is expanded into become its children in the tree. Here is a parse tree for “\(x+x+x\)” based on the above CFG and derivation.
Note that the leaves of a parse tree hold terminal symbols, while the internal nodes (non-leaves) hold nonterminal symbols. Thus, we can read off the final string by looking at the leaves of the tree.
There is another parse tree for “\(x+x+x\)” based on the above grammar. It is shown below.
This means that the string \(x+x+x\) has two possible structures. A CFG like this, in which a single string has more than one parse tree, is said to be ambiguous.
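We can make the two parse trees concrete with a small Python sketch that enumerates every parse tree for a sum of \(n\) copies of \(x\) under \(S \rightarrow S+S \mid x\). Trees are written as nested tuples whose first entry is the nonterminal; this representation and the names `parse_trees` and `leaves` are assumptions of the example.

```python
def parse_trees(n):
    """Return all parse trees for x+x+...+x (n copies of x)
    under the grammar S -> S+S | x."""
    if n == 1:
        return [("S", "x")]                    # S -> x
    trees = []
    for k in range(1, n):                      # k x's go in the left subtree
        for left in parse_trees(k):
            for right in parse_trees(n - k):
                trees.append(("S", left, "+", right))   # S -> S+S
    return trees


def leaves(tree):
    """Read off the string held by the leaves of a parse tree."""
    if isinstance(tree, str):                  # a terminal symbol
        return tree
    return "".join(leaves(child) for child in tree[1:])
```

Calling `parse_trees(3)` returns two trees, one grouping the sum to the right and one grouping it to the left, and `leaves` gives \(x+x+x\) for both.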
Ambiguity is a property of grammars, not of languages. The grammar above is ambiguous; however, we can find another CFG that generates the same language.
In finding such a grammar, we should note that, assuming “\(+\)” represents normal addition, we like the first parse tree better than the second, since it expresses the left associativity that we usually want addition to have.
Here is a non-ambiguous grammar that generates the same language, and keeps this left-associativity idea.
\[S \rightarrow S+x \mid x\]
Here is a derivation of \(x+x+x\) based on this grammar.
\[S\] \[S+x\] \[S+x+x\] \[x+x+x\]
And here is the unique parse tree for \(x+x+x\) based on our revised grammar.
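The following Python sketch builds the parse tree for a string under \(S \rightarrow S+x \mid x\). Since every \(+\) must be followed by a bare \(x\), the tree can only grow to the left, which is why it is unique. The nested-tuple tree representation and the name `parse` are assumptions made for this illustration.

```python
def parse(s):
    """Build the parse tree of s under S -> S+x | x,
    or raise ValueError if s is not in the language."""
    tokens = s.split("+")
    if any(tok != "x" for tok in tokens):
        raise ValueError("not in the language: " + repr(s))
    tree = ("S", "x")                      # S -> x
    for _ in tokens[1:]:
        tree = ("S", tree, "+", "x")       # S -> S+x, applied bottom-up
    return tree
```

For \(x+x+x\), the result nests to the left, matching the left associativity we wanted.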
There are CFLs that cannot be generated by non-ambiguous CFGs. Such a CFL is inherently ambiguous. Here is one standard example.
\[ \{\, a^m b^m c^n d^n \mid m\ge 0 \textrm{ and } n\ge 0 \,\} \cup \{\, a^m b^n c^n d^m \mid m\ge 0 \textrm{ and } n\ge 0 \,\} \]
It can be demonstrated that, no matter how we write the grammar for this language, there is going to be some string with the same number of \(a\)s, \(b\)s, \(c\)s, and \(d\)s, that has two different parse trees.
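The sketch below does not prove anything about ambiguity; it only reports which of the two sets in the union a given string belongs to, so that we can see that strings with equal numbers of all four letters lie in both. The name `memberships` is hypothetical.

```python
import re

_BLOCKS = re.compile(r"(a*)(b*)(c*)(d*)")


def memberships(s):
    """Return a pair of booleans: (s has the form a^m b^m c^n d^n,
    s has the form a^m b^n c^n d^m)."""
    match = _BLOCKS.fullmatch(s)
    if match is None:                      # letters out of order
        return (False, False)
    a, b, c, d = (len(group) for group in match.groups())
    return (a == b and c == d, a == d and b == c)
```

A string such as aabbccdd satisfies both conditions, while abccdd satisfies only the first.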
Ambiguity in a CFG is not a good thing. For one thing, it means that a program has two possible structures; which should we use? For another, it can slow down parsing. The fastest parsing algorithms are linear-time; however, many do not work with ambiguous grammars.
Another interesting algorithmic fact is that it is known to be impossible to write a program that can determine, for an arbitrary given CFG, whether or not it is ambiguous.
Here is yet another CFG.
\[S \rightarrow ABC\] \[A \rightarrow x\] \[B \rightarrow y\] \[C \rightarrow z\]
The language generated by this grammar consists of only one string: \(xyz\). However, this string has multiple derivations; here is one.
\[S\] \[ABC\] \[xBC\] \[xyC\] \[xyz\]
And here is another.
\[S\] \[ABC\] \[ABz\] \[Ayz\] \[xyz\]
Note that, while there are multiple derivations for \(xyz\), there is only one parse tree.
Since no string has more than one parse tree, the above grammar is not ambiguous.
In the first derivation above, at each step, we expand the leftmost nonterminal. Such a derivation is a leftmost derivation. In the second derivation, we always expand the rightmost nonterminal. Such a derivation is a rightmost derivation.
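Here is a minimal Python sketch that produces both kinds of derivation for the grammar \(S \rightarrow ABC\), \(A \rightarrow x\), \(B \rightarrow y\), \(C \rightarrow z\). It relies on the fact that each nonterminal in this grammar has exactly one production, so the only choice at each step is which nonterminal to expand. The names `RULES` and `derivation` are made up for the example.

```python
RULES = {"S": "ABC", "A": "x", "B": "y", "C": "z"}


def derivation(leftmost=True):
    """Return the leftmost (or rightmost) derivation of xyz as a list
    of sentential forms."""
    form = "S"
    steps = [form]
    while any(ch in RULES for ch in form):
        # Positions of all nonterminals in the current sentential form
        positions = [i for i, ch in enumerate(form) if ch in RULES]
        i = positions[0] if leftmost else positions[-1]
        form = form[:i] + RULES[form[i]] + form[i + 1:]
        steps.append(form)
    return steps
```

`derivation(True)` reproduces the first derivation above, and `derivation(False)` reproduces the second.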
These concepts will come up in our study of parsing. Parsers generally go through the process of finding a derivation. They may not explicitly store or output the derivation, but they do go through all the steps required to find it. Parsers come in two big classes. One class constructs the derivation from top to bottom. Since the input is read left to right, the leftmost nonterminal will generally be expanded first, and so such parsers generally find leftmost derivations. The other class constructs derivations from bottom to top. These will generally collapse strings on the left first. But since the derivation is constructed backwards, when it is read top to bottom, the rightmost nonterminal is expanded first. Such parsers will tend to find rightmost derivations.
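To illustrate the bottom-up case, here is a toy Python sketch for the grammar \(S \rightarrow ABC\), \(A \rightarrow x\), \(B \rightarrow y\), \(C \rightarrow z\). It reads the input left to right, and whenever the end of what it has read so far matches a right-hand side, it collapses that to the left-hand side. This greedy strategy is an assumption that happens to work for this tiny grammar; it is not a general parsing algorithm. The names `REDUCTIONS` and `shift_reduce` are hypothetical.

```python
# (right-hand side, left-hand side) pairs for the grammar's productions
REDUCTIONS = [("ABC", "S"), ("x", "A"), ("y", "B"), ("z", "C")]


def shift_reduce(text):
    """Return the sentential forms seen while collapsing text toward S."""
    stack, rest = "", text
    forms = [text]
    while True:
        for rhs, lhs in REDUCTIONS:
            if stack.endswith(rhs):
                stack = stack[:-len(rhs)] + lhs      # reduce
                forms.append(stack + rest)
                break
        else:
            if not rest:                             # nothing left to do
                break
            stack, rest = stack + rest[0], rest[1:]  # shift one symbol
    return forms
```

On input xyz, the forms are xyz, Ayz, ABz, ABC, S. Read in reverse, this is exactly the rightmost derivation, even though the leftmost symbols were collapsed first.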
Note that some derivations are neither leftmost nor rightmost. Here is an example.
\[S\] \[ABC\] \[AyC\] \[xyC\] \[xyz\]
As noted above, for nearly every programming language, the set of all syntactically correct programs forms a CFL. One notable exception is C++. We look briefly at why this is the case.
Consider the following code.
[C++]
int foo(bar);
This has two possible interpretations. It could be a prototype of a function foo that takes a single parameter of type bar and returns an int. Or it could be a declaration of a variable named foo, of type int, with the value bar being passed to its constructor.
How does the compiler tell the difference? It determines whether bar represents a type. If so, then foo is a function; if not, then foo is a variable.
[C++]
class bar;
int foo(bar);  // Prototype for function foo
However:
[C++]
int bar;
int foo(bar);  // Declaration of variable foo
Now, if foo is a function, then we can name its parameters; we can replace “bar” with “bar x”. We cannot make this change if foo is a variable. And this syntactic difference depends on what bar represents, which is not something we can determine locally.
[C++]
class bar;
// Lots of stuff here
int foo(bar x);  // Syntactically correct
[C++]
int bar;
// Lots of stuff here
int foo(bar x);  // Syntactically INCORRECT
This determination of what bar represents is not something we can do with a CFG. The set of all syntactically correct C++ programs does not form a CFL.
However, keeping track of what identifiers represent is something compilers need to do anyway. In an actual C++ parser, information on what bar represents is likely to be on hand. As a result, we can still write a C++ parser using some of the same techniques that are used with CFGs and CFLs.
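The idea of consulting known identifier information during parsing can be sketched in Python. The toy function below handles only statements shaped like the foo examples above, and it takes the set of known type names as a parameter. It is nowhere near a real C++ parser; the names `classify` and `type_names` and the returned strings are assumptions of this example.

```python
import re

# Matches only statements of the form:  int NAME ( IDENT [IDENT] ) ;
_DECL = re.compile(r"int\s+(\w+)\s*\(\s*(\w+)(?:\s+(\w+))?\s*\)\s*;")


def classify(statement, type_names):
    """Classify a statement like 'int foo(bar);' using the set of
    identifiers currently known to be type names."""
    match = _DECL.fullmatch(statement.strip())
    if match is None:
        return "unrecognized"
    _name, first, second = match.groups()
    if first in type_names:
        return "function prototype"        # bar is a type
    if second is None:
        return "variable declaration"      # bar is a value
    return "syntax error"                  # 'bar x' needs bar to be a type
```

The same text, `int foo(bar x);`, is accepted when `type_names` contains bar and rejected when it does not, which is the context dependence described above.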
ggchappell@alaska.edu