CS 331 Spring 2013: Lecture Notes for Friday, January 25, 2013
We now take a closer look at the smallest class of languages in the Chomsky Hierarchy: regular languages.
A regular grammar is a grammar in which every production has one of the following forms.
\[A \rightarrow \varepsilon \qquad A \rightarrow b \qquad A \rightarrow bC\]
That is, the left-hand side of every production is a single nonterminal, while the right-hand side of each is either the empty string, a single terminal, or a single terminal followed by a single nonterminal.
A regular language is a language that is generated by some regular grammar.
Regular languages have two important applications. First, they are heavily used in text search & replace. Second, in most programming languages, the set of all “words” of a particular kind forms a regular language. For example, the set of all legal C++ identifiers is a regular language. So is the set of all legal C++ numeric literals, and so is the set of all legal C++ string literals. Thus we make use of regular languages in the early stages of compilation, when we break up a program into words, a process known as lexical analysis, or lexing.
Here is a grammar that generates a regular language.
\[S \rightarrow A\] \[S \rightarrow At\] \[A \rightarrow Axy\] \[A \rightarrow \varepsilon\]
We note that this grammar does not meet the requirements given above. However, we can find a grammar that does meet those requirements, and which generates the same language.
\[S \rightarrow \varepsilon\] \[S \rightarrow t\] \[S \rightarrow xB\] \[B \rightarrow yS\]
Both grammars above generate the language \( \{\varepsilon, xy, xyxy, xyxyxy, \dots, t, xyt, xyxyt, xyxyxyt, \dots \} \). The latter grammar meets the requirements—that is, it is a regular grammar—and so this language is regular.
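To see the regular grammar at work, here is a minimal Python sketch that enumerates the strings it generates, up to a chosen length. The encoding of productions as `(terminal, next_nonterminal)` pairs and the names `GRAMMAR` and `generate` are assumptions made for this illustration only.

```python
from collections import deque

# The regular grammar from above. Each production is encoded (by assumption)
# as (terminal, next_nonterminal); "" stands for epsilon and None means
# that no nonterminal follows.
GRAMMAR = {
    "S": [("", None), ("t", None), ("x", "B")],
    "B": [("y", "S")],
}

def generate(grammar, start, max_len):
    """Return every generated string of length <= max_len, shortest first."""
    results = set()
    queue = deque([("", start)])  # (terminals so far, pending nonterminal)
    while queue:
        prefix, nonterminal = queue.popleft()
        for terminal, nxt in grammar[nonterminal]:
            word = prefix + terminal
            if len(word) > max_len:
                continue
            if nxt is None:
                results.add(word)          # derivation is finished
            else:
                queue.append((word, nxt))  # derivation continues
    return sorted(results, key=lambda w: (len(w), w))

words = generate(GRAMMAR, "S", 5)
print(words)
```

The search terminates because every production that continues the derivation also adds a terminal, so the strings under construction keep getting longer.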
A (deterministic) finite automaton (Latin plural “automata”), also known as a finite state machine, is a kind of recognizer for regular languages. A finite automaton consists of a finite collection of states, with transitions between these states, each of which is associated with some symbol in our alphabet—i.e., a terminal symbol. One state is the start state, and some states may be accepting states.
Below is a diagram of such an automaton. The circles represent states. The arrow entering the left-hand circle indicates that it is the start state; the other arrows represent transitions. The bold border on the right-hand circle indicates that it is an accepting state.
When we use such an automaton, we are always in exactly one of the states, beginning with the start state. For each character we read from the input, we follow the transition labeled with that character to a new state, if one exists; if not, we reject the input. If we reach the end of the input while in an accepting state, then we accept the input; that is, the input lies in our language.
For example, the automaton drawn above recognizes the following language: \( \{ a, ab, abb, abbb, abbbb, \dots \} \).
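The procedure just described can be sketched as a table-driven loop in Python. The state names `start` and `seen_a` are hypothetical labels, and the transition table is an assumption: it is one two-state automaton consistent with the language listed above.

```python
# Transition table: (current state, input symbol) -> next state.
# Assumed shape: "a" leads from the start state to the accepting state,
# and "b" loops on the accepting state.
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "seen_a",
}
START = "start"
ACCEPTING = {"seen_a"}

def accepts(text):
    """Run the automaton on text; return True if the input is accepted."""
    state = START
    for ch in text:
        key = (state, ch)
        if key not in TRANSITIONS:
            return False  # no transition for this symbol: give up
        state = TRANSITIONS[key]
    return state in ACCEPTING  # accept only if we end in an accepting state

for sample in ["a", "abbb", "", "b", "aba"]:
    print(repr(sample), accepts(sample))
```

Keeping the transitions in a dictionary separates the automaton's description from the code that runs it, so the same loop works for any deterministic finite automaton.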
Writing code based on finite state machines is a useful programming technique. When we write code to do lexical analysis, we will use this idea.
A generalization, called a nondeterministic finite automaton, allows multiple transitions from the same state to be labeled with the same symbol. For such automata, a string is accepted if there is some path from the start state, along transitions labeled by the symbols in the string, in order, that ends in an accepting state. Nondeterministic finite automata form another kind of recognizer for regular languages.
Given a regular grammar, it is easy to produce a nondeterministic finite automaton that recognizes the language the grammar generates. We make one state for each nonterminal; the state corresponding to the start symbol is the start state. A production like “\(A \rightarrow bC\)” becomes a transition from state \(A\) to state \(C\) labeled by the symbol \(b\). A production like “\(A \rightarrow \varepsilon\)” means that state \(A\) is an accepting state. Lastly, if there is a production like “\(A \rightarrow b\)”, then we make a new accepting state, and a transition from state \(A\) to this new state, labeled by \(b\).
Here is a drawing of the automaton derived from the regular grammar above.
This automaton recognizes the language generated by the grammar: \( \{\varepsilon, xy, xyxy, xyxyxy, \dots, t, xyt, xyxyt, xyxyxyt, \dots \} \).
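The construction can be sketched in Python as follows. The production encoding and the names `grammar_to_nfa`, `nfa_accepts`, and `FINAL` are assumptions for this illustration. As a simplification, the sketch shares one extra accepting state among all productions of the form \(A \rightarrow b\), rather than making a new state for each; since that state has no outgoing transitions, the language recognized is the same.

```python
# Productions encoded (by assumption) as (terminal, next_nonterminal);
# "" stands for epsilon and None means that no nonterminal follows.
GRAMMAR = {
    "S": [("", None), ("t", None), ("x", "B")],
    "B": [("y", "S")],
}

def grammar_to_nfa(grammar, start):
    """Build (transitions, start state, accepting states) from a regular grammar."""
    final = "FINAL"   # hypothetical name for the extra accepting state
    transitions = {}  # (state, symbol) -> set of possible next states
    accepting = set()
    for nonterminal, productions in grammar.items():
        for terminal, nxt in productions:
            if terminal == "":
                accepting.add(nonterminal)  # A -> epsilon
            elif nxt is None:
                accepting.add(final)        # A -> b
                transitions.setdefault((nonterminal, terminal), set()).add(final)
            else:                           # A -> bC
                transitions.setdefault((nonterminal, terminal), set()).add(nxt)
    return transitions, start, accepting

def nfa_accepts(nfa, text):
    """Track every state reachable so far; accept if any one is accepting."""
    transitions, start, accepting = nfa
    current = {start}
    for ch in text:
        current = {t for s in current for t in transitions.get((s, ch), ())}
    return bool(current & accepting)

nfa = grammar_to_nfa(GRAMMAR, "S")
for sample in ["", "t", "xyxyt", "x", "tt"]:
    print(repr(sample), nfa_accepts(nfa, sample))
```

Tracking the set of all reachable states is one common way to simulate nondeterminism; it checks every possible path at once.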
Note: We can also generate a deterministic finite automaton from a regular grammar, but this requires more work.
Another kind of generator for a regular language is a regular expression. Regular expressions are heavily used in text search & replace. Many modern languages include regular-expression facilities. For example, regular expressions are built into the programming language Perl; they are included in the standard library in Python and C++11.
It is important to note that the term “regular expression” is often used rather loosely in the programming field. In fact, the regular-expression libraries in the programming languages mentioned above all allow for rather more general expressions, which are capable of recognizing languages that are not actually regular. However, here we will use the term in a more formal sense.
Before we define regular expressions, let us consider a kind of expression that all of us are familiar with: the arithmetic expression. Here is an example of an arithmetic expression:
\[ 34 * (3 - n) + (5.6 / g + 3). \]
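We can define arithmetic expressions by building them up from small pieces, as follows. A numeric literal is an arithmetic expression, and so is a variable.
If \(A\) and \(B\) are arithmetic expressions, then so is each of the following: \(-A\), \(A*B\), \(A/B\), \(A+B\), \(A-B\).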
Unary “\(-\)” has the highest precedence. Operators “\(*\)” and “\(/\)” are next, and then “\(+\)” and binary “\(-\)”. Unary minus is right-associative, while all four binary operators are left-associative. If we want to override these rules, then we can use parentheses for grouping. Note: left-associative means, for example, that \(1-2-3\) is the same as \((1-2)-3\), not \(1-(2-3)\).
The above defines the syntax of an arithmetic expression. In other words, it allows us to look at some text and determine whether the text is actually an arithmetic expression. However, it does not tell us how to find the value of an arithmetic expression; it does not tell us what such an expression means: its semantics.
We can define the semantics based on the syntax. The value of a numeric literal or a variable is its numeric value. The value of \(A + B\) is the sum of the value of \(A\) and the value of \(B\). The value of \(A - B\) is the difference of the value of \(A\) and the value of \(B\), and so on.
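The idea of defining semantics on top of syntax can be sketched as a recursive evaluator in Python. The nested-tuple encoding of expressions and the name `evaluate` are assumptions made for this illustration; the sketch starts from an already-parsed expression and does no parsing itself.

```python
def evaluate(expr, variables):
    """Return the value of a tuple-encoded arithmetic expression."""
    if isinstance(expr, (int, float)):
        return expr                           # numeric literal
    if isinstance(expr, str):
        return variables[expr]                # variable
    op = expr[0]
    if op == "neg":
        return -evaluate(expr[1], variables)  # unary minus
    left = evaluate(expr[1], variables)
    right = evaluate(expr[2], variables)
    if op == "+":
        return left + right
    if op == "-":
        return left - right
    if op == "*":
        return left * right
    if op == "/":
        return left / right
    raise ValueError(f"unknown operator: {op!r}")

# 34 * (3 - n) + (5.6 / g + 3), with the grouping made explicit.
EXAMPLE = ("+", ("*", 34, ("-", 3, "n")), ("+", ("/", 5.6, "g"), 3))
value = evaluate(EXAMPLE, {"n": 1, "g": 2})

left_assoc = evaluate(("-", ("-", 1, 2), 3), {})   # (1-2)-3
right_assoc = evaluate(("-", 1, ("-", 2, 3)), {})  # 1-(2-3)
print(value, left_assoc, right_assoc)
```

The last two lines show why associativity matters: the two groupings of \(1-2-3\) give different values.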
Now we define regular expressions in a similar way. A single character is a regular expression, and so is \(\varepsilon\).
If \(A\) and \(B\) are regular expressions, then so is each of the following: \(A*\), \(AB\), \(A|B\).
The above list proceeds from high precedence to low precedence. All are left-associative. As before, parentheses can be used for grouping.
So, for example, here is a regular expression.
\[(a|x)*cb\]
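As before, the above defines the syntax of regular expressions. It allows us to determine whether some text is actually a regular expression. But it does not tell us what a regular expression means: its semantics.
A regular expression defines a language. The expression is said to match each string in the language. The rules for matching are as follows. A single character matches itself, and \(\varepsilon\) matches the empty string.
Now suppose that \(A\) and \(B\) are regular expressions. Then \(A*\) matches any concatenation of zero or more strings, each of which is matched by \(A\); \(AB\) matches any string matched by \(A\) followed by any string matched by \(B\); and \(A|B\) matches any string matched by \(A\) or by \(B\).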
Note: the asterisk (\(*\)) used in this way is called the Kleene star, after Stephen Kleene. “Kleene” is, somewhat mysteriously, pronounced KLAY-nee.
Now consider the regular expression above: “\((a|x)*cb\)”. The expressions “\(a\)” and “\(x\)” each match themselves. The expression “\(a|x\)” matches two strings: “\(a\)” and “\(x\)”. So the expression “\((a|x)*\)” (with parentheses to override the precedence rules) matches any string consisting of zero or more characters from the set \(\{a,x\}\); for example, it matches “\(aaaxaxaaaxxx\)”. We conclude that the expression “\((a|x)*cb\)” matches any sequence of zero or more characters from \(\{a,x\}\), followed by \(c\), followed by \(b\). So the language generated includes the strings \(cb\), \(acb\), \(xcb\), \(aacb\), \(axcb\), \(xacb\), \(xxcb\), \(aaacb\), \(aaxcb\), and so on.
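This reasoning can be checked with a regular-expression library. The sketch below uses `re.fullmatch` from Python's standard `re` module, which succeeds only when the entire string is matched; the helper name `matches` and the sample strings are chosen just for this illustration.

```python
import re

PATTERN = re.compile(r"(a|x)*cb")

def matches(text):
    """True if the whole of text is matched by (a|x)*cb."""
    return PATTERN.fullmatch(text) is not None

samples = ["cb", "acb", "xcb", "aaxcb", "aaaxaxaaaxxxcb", "c", "bcb", "cba"]
results = {s: matches(s) for s in samples}
for s, ok in results.items():
    print(repr(s), ok)
```

Using `fullmatch` rather than `search` matters here: `search` would report a match whenever the pattern occurs anywhere inside the string, which is not the same as the string lying in the language.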
Here is a regular expression that generates the same language as our regular grammar.
\[(xy)*(\varepsilon|t)\]
This is, again, the same language as that recognized by our second finite automaton above: \( \{\varepsilon, xy, xyxy, xyxyxy, \dots, t, xyt, xyxyt, xyxyxyt, \dots \} \).
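As a rough check, the sketch below writes this expression for Python's `re` module, where \(\varepsilon\) becomes an empty alternative, as in `(|t)`. It then tests every string over the characters x, y, t up to length 5. The names `in_language` and `matched` are assumptions for the illustration.

```python
import re
from itertools import product

# (xy)*(epsilon|t), with epsilon written as an empty alternative.
PATTERN = re.compile(r"(xy)*(|t)")

def in_language(text):
    """True if the whole of text is matched by the expression."""
    return PATTERN.fullmatch(text) is not None

# Try every string over {x, y, t} of length 0 through 5, shortest first.
matched = [
    "".join(chars)
    for n in range(6)
    for chars in product("xyt", repeat=n)
    if in_language("".join(chars))
]
print(matched)
```

An exhaustive check over short strings is not a proof, but it is a quick way to catch a mistake in a regular expression.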
Most regular-expression libraries accept something like the above syntax, except that “\(\varepsilon\)” is replaced by an actual empty string. In addition, a number of common shortcuts are almost universally used.
First, “.” matches any single character.
Second, brackets with a list of characters between them will match any one of the characters in the list. Thus, the following two expressions match the same strings.
[qwerty]    (q|w|e|r|t|y)
In addition, “-” may be used for sequences of consecutive characters. The following expressions match the same strings.
[0-9]    [0123456789]    (0|1|2|3|4|5|6|7|8|9)
Placing “^” just after the opening bracket means that all characters not in the list are matched. So this expression
[^0-9]
matches any character that is not a digit.
Third, “+” means one-or-more, in the same way that “*” means zero-or-more. Thus, the following two expressions match the same strings.
abc(abc)*    (abc)+
Fourth, “?” means zero-or-one, so that the following two expressions match the same strings.
x(abc)?    x|xabc
And last, the various special characters above can be treated as ordinary characters when preceded by a backslash (\). For example, “.” matches any single character, while “\.” matches only “.”. Note: the rules for backslash escaping vary considerably from one regular-expression library to another; read the documentation!
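The equivalences claimed for these shortcuts can be spot-checked in Python's `re` module, which supports all of them. In the sketch below, the sample strings and the helper names `full` and `agree` are assumptions for the illustration, and agreement on a handful of samples is evidence, not proof.

```python
import re

# Pairs of expressions that should match exactly the same strings.
EQUIVALENT = [
    (r"[qwerty]", r"(q|w|e|r|t|y)"),
    (r"[0-9]", r"[0123456789]"),
    (r"[0-9]", r"(0|1|2|3|4|5|6|7|8|9)"),
    (r"abc(abc)*", r"(abc)+"),
    (r"x(abc)?", r"x|xabc"),
]
SAMPLES = ["", "q", "a", "7", "x", "xabc", "xab", "abc", "abcabc", "abcab", "."]

def full(pattern, text):
    """True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None

def agree(left, right):
    """True if both patterns give the same answer on every sample."""
    return all(full(left, s) == full(right, s) for s in SAMPLES)

all_agree = all(agree(left, right) for left, right in EQUIVALENT)
print(all_agree)

# An unescaped dot matches any single character; an escaped dot only ".".
print(full(r".", "a"), full(r"\.", "a"), full(r"\.", "."))
# A negated bracket list matches any single character that is not a digit.
print(full(r"[^0-9]", "a"), full(r"[^0-9]", "7"))
```

The raw-string prefix `r"..."` keeps Python itself from interpreting the backslashes, which is one instance of the library-specific escaping rules mentioned above.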
The above are all just shortcuts. They make regular expressions more convenient, but they do not change which languages can be generated. However, as mentioned above, many languages include facilities that make their “regular expressions”—so-called—decidedly non-regular. That is, they allow for the generation of languages that are not regular.
This is most often done by allowing for a requirement that two different sections of a string are the same. For example, the following expression, used in Perl, matches the strings b, aba, aabaa, aaabaaa, etc.
(a*)b\1
The language generated by this expression is not regular. For the purposes of this class, we do not consider the above to be a regular expression.
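The notes give this expression for Perl; Python's `re` module happens to accept the same backreference syntax, so the behavior can be sketched there. The helper name `matches` and the sample strings are assumptions for the illustration.

```python
import re

# \1 must repeat exactly the text that the first parenthesized group matched.
PATTERN = re.compile(r"(a*)b\1")

def matches(text):
    """True if the whole of text is matched by (a*)b\\1."""
    return PATTERN.fullmatch(text) is not None

for sample in ["b", "aba", "aabaa", "aaabaaa", "ab", "aab", "abaa"]:
    print(repr(sample), matches(sample))
```

Only strings with the same number of a's on both sides of the b are matched, which is the kind of requirement described above.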
ggchappell@alaska.edu