CS 331 Spring 2011  >  Lecture Notes for Monday, January 24, 2011

CS 331 Spring 2011
Lecture Notes for Monday, January 24, 2011

Languages & Grammars

Languages

A language is a collection of strings. The characters in the strings all lie in some alphabet.

The kind of language (in the above sense) that we will be primarily interested in is the set of all syntactically correct programs in some programming langugage.

How do we describe a language? We can use a generator or a recognizer. A generator is a way of producing all strings in the language. A recognizer is a way of determining whether a given string lies in the language.

We typically find generators easier to write, but recognizers more useful (every compiler must have one, right?). Often we write a generator, and then have a program use it to write a recognizer.

Grammars

A phrase-structure grammar (usually just grammar) is one form of language generator.

In a grammar, we have a collection of terminal symbols; this is our alphabet. We also have a collection of nonterminal symbols; these are like variables that eventually turn into something else. One nonterminal symbol is the start symbol.

I will use lower-case letters for terminals, upper-case letters for nonterminals, and “S” for the start symbol. An epsilon (“ε”) will represent the empty string.

We also have one or more productions, which are rules for turning some string into some other string. Starting from the start symbol, repeatedly applying rules, and ending with a string consisting only of terminals is a derivation of the final string.

For example, here is a grammar.

S → A
A → AA
A → ε
A → bcd

Here is a derivation, based on the above grammar.

S
A
AA
AAA
AbcdA
bcdbcdA
bcdbcd

Thus, the string “bcdbcd” lies in the language generated by the above grammar.

The Chomsky Hierarchy

In the late 1950s, linguist Noam Chomsky published a hierarchy of types of languages, defined in terms of the kinds of grammars that could describe them. Chomsky was trying to develop a framework for studying natural languages (e.g., English); however, his hierarchy has proved to be useful in the theoretical study of computing.

Below is the hierarchy. In the Grammar column, upper-case letters represent nonterminal symbols, while lower-case letters represent terminal symbols.

Language Category Rules Allowed in Grammar
(Generator)
Recognizer Why We Care
Chomsky’s Number Name
Type 3 Regular
  • A → ε
  • A → b
  • A → bC
Finite Automaton
Think: Program that uses a small, fixed amount of memory
  • In this category lie things like the set of all legal identifiers in some programming language. Thus: lexical analysis (breaking a program into words).
  • This is what regular expressions can describe.
Type 2 Context Free
  • A → [anything]
Nondeterministic Push-Down Automaton
Think: Finite Automaton + Stack (roughly)
  • For most programming languages, the set of all syntactically correct programs, lies in this category. Thus: parsing(determining whether a program is syntactically correct and, if so, how it is structured).
Type 1 Context Sensitive
  • [string1]A[string2] → [string1][anything][string2]
Don’t worry about it
(“Linear Bounded Automaton”, if you must know)
  • Actually, we don’t care much.
Type 0 Recursively Enumerable
  • [anything] → [anything]
Turing Machine
Think: Computer
  • This category neatly encompasses the things that computers are capable of doing.

Notes


CS 331 Spring 2011: Lecture Notes for Monday, January 24, 2011 / Updated: 15 Feb 2011 / Glenn G. Chappell / ggchappell@alaska.edu