CS 331 Spring 2013
Lecture Notes for Friday, February 1, 2013

Lexical Analysis

Terminology

A lexer is a code module that does lexical analysis. Typically a lexer functions as a front-end for a parser. A lexer reads source text and outputs a stream of lexemes. A lexeme is a string holding a “word” from the source text, together with information about what kind of word it is (identifier, keyword, operator, etc.). This latter information is sometimes called a token.

An identifier is a name that a program can give to some entity within a program. Consider the following C++ code.

class MyClass {
public:
    void myFunc(Type1 & param1, Type2 & param2)
    {
        while (param1 != param2)
        {
            ++param1;
        }
        if (param2) return;
        ++param2;
    }
};

In the code above, MyClass, myFunc, Type1, param1, Type2, and param2 are all identifiers.

A keyword is a word that has special meaning to a programming language. In the above code, class, public, void, while, if, and return are all keywords.

A reserved word is a word that is not a legal identifier. In many programming languages, the keywords and the reserved words are the same. However, it is not hard to envision a variant of (say) C++ in which the compiler could determine how a word is being used solely from its position in the code. Then this might be legal code.

for (for for = -for; for; --for) ;

Indeed the programming language Fortran traditionally has no reserved words. The following is, famously, legal code in at least some versions of Fortran.

IF IF THEN THEN ELSE ELSE

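On the other hand, there can be reserved words that are not keywords. The Java Language Specification reserves the word goto, but the language assigns it no meaning; in the terminology used here, it is a reserved word that is not a keyword. Thus, this word cannot be used in a program at all, other than inside a comment or a string literal.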
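
When every keyword is also a reserved word, telling keywords from identifiers can be a simple lookup. The following is a minimal sketch of that idea, not code from our lexer; the names WordKind and classifyWord are hypothetical, and the keyword set is only a small sample.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical word categories, introduced only for this illustration.
enum class WordKind { Keyword, Identifier };

// Classify a word by looking it up in a fixed set of keywords.
// This approach assumes that every keyword is also a reserved word.
WordKind classifyWord(const std::string & word)
{
    // A small sample of C++ keywords, not the full list.
    static const std::unordered_set<std::string> keywords = {
        "class", "public", "void", "while", "if", "return", "for"
    };
    return keywords.count(word) != 0 ? WordKind::Keyword
                                     : WordKind::Identifier;
}
```

In a language without reserved words, such as the Fortran example above, a lookup like this is not enough; the decision has to wait until the position of the word in the code is known.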

Lexer Operation

There are essentially three ways to write a lexer.

Automatically generated, based on a regular grammar or regular expression.
Hand-coded state machine using a table.
Entirely hand-coded state machine.

The first method might involve a software package like lex, which generates C code for a lexer, given input that consists primarily of regular expressions.

We will write a lexer using the last method. When we are done, it should not be difficult to see how we could have used a table instead.
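
To make the difference between the last two methods concrete, here is a small sketch that recognizes a C++-style identifier both ways. It is an illustration only, not part of our lexer; the state names, the character classes, and the functions identLengthHandCoded and identLengthTable are all assumptions made for this example.

```cpp
#include <cstddef>
#include <string>

// Character classes for this illustration.
enum CharClass { LETTER = 0, DIGIT = 1, OTHER = 2 };

CharClass classOf(char c)
{
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_')
        return LETTER;
    if (c >= '0' && c <= '9')
        return DIGIT;
    return OTHER;
}

// States: START (nothing read yet), IDENT (inside an identifier), STOP.
enum State { START = 0, IDENT = 1, STOP = 2 };

// Entirely hand-coded: the transitions are written out as control flow.
// Returns the length of the identifier beginning at s[pos], or 0 if none.
std::size_t identLengthHandCoded(const std::string & s, std::size_t pos)
{
    State state = START;
    std::size_t i = pos;
    while (state != STOP && i < s.size())
    {
        CharClass cc = classOf(s[i]);
        switch (state)
        {
        case START:
            if (cc == LETTER) { state = IDENT; ++i; }
            else state = STOP;
            break;
        case IDENT:
            if (cc == LETTER || cc == DIGIT) ++i;
            else state = STOP;
            break;
        case STOP:
            break;
        }
    }
    return i - pos;
}

// Table-driven: the same transitions, stored as data.
// NEXT[state][charClass] gives the next state.
const State NEXT[2][3] = {
    /* START */ { IDENT, STOP,  STOP },
    /* IDENT */ { IDENT, IDENT, STOP }
};

std::size_t identLengthTable(const std::string & s, std::size_t pos)
{
    State state = START;
    std::size_t i = pos;
    while (i < s.size())
    {
        state = NEXT[state][classOf(s[i])];
        if (state == STOP)
            break;
        ++i;
    }
    return i - pos;
}
```

The two functions accept exactly the same strings. In the table version, changing the language being recognized means changing the array, while the loop stays the same.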

As we have said, a lexer outputs a series of lexemes. Because a lexer is usually a front-end for a parser, there is no need to store these lexemes in a large structure. Rather, the lexer can provide get-next-lexeme functionality.

Writing a Lexer I: Design Decisions

We will now begin writing a lexer. A description of the lexemes will be available at our next meeting. For now, we will simply say that the lexer is intended for handling a (roughly) C++-like programming language. In particular:

No lexeme begins or ends with whitespace.
All whitespace will usually be treated the same.
Lexemes may be arbitrarily long.
There might not be any delimiter between lexemes.

Note that there are programming languages in which the above do not apply. For example, in Python, Haskell, JavaScript, and Go, a newline can serve as an end marker, just like a semicolon (“;”) in C++. As another example, in the programming language Forth, adjacent lexemes are always separated by whitespace.

In addition:

Comments begin with a pound sign (“#”) that is not part of any lexeme that began before it, and end with the next newline or the end of the input, whichever comes first.
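
The whitespace and comment rules above can be sketched as a single scanning function. This is one possible way to write it, under the assumption that it is only ever called between lexemes (so that a “#” it encounters is not part of a lexeme that began earlier). The name skipSpaceAndComments is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the index of the first character at or after pos that is neither
// whitespace nor inside a comment, or input.size() if there is none.
// Assumption: pos lies between lexemes, not in the middle of one.
std::size_t skipSpaceAndComments(const std::string & input, std::size_t pos)
{
    while (pos < input.size())
    {
        char c = input[pos];
        if (std::isspace(static_cast<unsigned char>(c)))
        {
            ++pos;  // All whitespace is treated the same
        }
        else if (c == '#')
        {
            // Comment: skip to the next newline or the end of the input,
            // whichever comes first. The newline itself is then skipped
            // as ordinary whitespace on the next pass through the loop.
            while (pos < input.size() && input[pos] != '\n')
                ++pos;
        }
        else
        {
            break;  // Start of a lexeme
        }
    }
    return pos;
}
```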

Our lexer will be implemented as a C++ class Lex, defined in files lex.h and lex.cpp.

The interface to the class will be as follows.

The input will be given as a single string, to either a constructor, or a member function set. Using the default constructor is the same as passing an empty string as input.
Output will be retrieved using an interface similar to that of C++ iterators, but using named functions instead of operators.
- Member function current will return the current lexeme. If this function is called repeatedly, with no other member-function calls intervening, then it will return the same value.
- Member function advance will move to the next lexeme.
- Member function done will return true if there are no more lexemes to read.
A returned lexeme will be of type Lex::Lexeme, which will be a pair (std::pair) consisting of a std::string holding the text of the lexeme, and an enum of type Lex::Token.
If done is true, then current will return a lexeme with token NONE.

The following code would print the text of all lexemes in the string prog, each on a separate line.

Lex luthor;
for (luthor.set(prog); !luthor.done(); luthor.advance())
    cout << luthor.current().first << endl;

Note that current can be called as soon as the input is given to the Lex object; there is no need to tell the lexer to go to the first lexeme. This suggests that the current lexeme should be stored in a data member, and the constructor, set, and advance should all ensure that the new lexeme is stored there when they exit. We probably want the constructor to call set, which, in turn, calls advance.

Class Lex needs three data members.

input_: A std::string holding the input.
pos_: An integer (std::size_t) holding the index of the next character to be read from input_.
currlexeme_: A Lexeme holding the current lexeme. This is returned by member function current.

Note: I am fond of the convention that names of data members end with an underscore (_). You are not required to like this convention; nor are you required to follow it. But it certainly is a good idea to distinguish names of data members in some fashion.

It is important to be clear about class invariants. In particular, what is the value of member pos_ when class code is not being executed? Remember that there might be whitespace anywhere; as noted in class, the fact that pos_ is past the last lexeme does not by itself mean that it is beyond the end of the input string. In class Lex, pos_ will always point to the beginning of the next lexeme, or just past the end of the string (pos_ == input_.size()) if there are no more lexemes.

Despite the above, we should not check the value of pos_ to determine whether there are more lexemes to be read. After the final lexeme is processed and placed into currlexeme_, the value of pos_ will be input_.size(). However, current will still return a valid lexeme; from the point of view of the outside world, we are not yet past the end. Therefore, we write done as follows.

bool done() const
{ return currlexeme_.second == NONE; }

Class Lex needs two public member types.

Token: An enum used to specify the token of a lexeme. One of its values is NONE; this is the token of a returned lexeme when there are no more lexemes in the input.
Lexeme: The return type of member function current. This is defined to be std::pair<std::string, Token>.

Class Lex has a single constructor, which takes an optional string argument (defaulting to an empty string) and simply calls member function set.

The automatically generated destructor, copy constructor, and copy assignment operator will be used.
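
Pulling these design decisions together, a skeleton of such a class might look like the following. This is a sketch, not the contents of lex.h and lex.cpp. The class is named LexSketch to make that clear, the token WORD is a placeholder, and the rule used in advance (a lexeme is a maximal run of non-whitespace characters) is an assumption standing in for the real lexeme description, which we do not have yet.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

class LexSketch {
public:
    // Only NONE comes from the design above; WORD is a placeholder.
    enum Token { NONE, WORD };
    typedef std::pair<std::string, Token> Lexeme;

    // Optional string argument, defaulting to empty input.
    LexSketch(const std::string & input = "")
    { set(input); }

    // Accept new input and load the first lexeme.
    void set(const std::string & input)
    {
        input_ = input;
        pos_ = 0;
        skipSpace();  // Establish the invariant on pos_
        advance();    // Load the first lexeme into currlexeme_
    }

    Lexeme current() const
    { return currlexeme_; }

    bool done() const
    { return currlexeme_.second == NONE; }

    // Placeholder rule (an assumption): a lexeme is a maximal run of
    // non-whitespace characters.
    void advance()
    {
        if (pos_ == input_.size())
        {
            currlexeme_ = Lexeme("", NONE);
            return;
        }
        std::size_t start = pos_;
        while (pos_ < input_.size() && !isSpace(input_[pos_]))
            ++pos_;
        currlexeme_ = Lexeme(input_.substr(start, pos_ - start), WORD);
        skipSpace();  // Restore the invariant on pos_
    }

private:
    static bool isSpace(char c)
    { return std::isspace(static_cast<unsigned char>(c)) != 0; }

    // Move pos_ past whitespace and comments.
    void skipSpace()
    {
        while (pos_ < input_.size())
        {
            if (isSpace(input_[pos_]))
            {
                ++pos_;
            }
            else if (input_[pos_] == '#')
            {
                while (pos_ < input_.size() && input_[pos_] != '\n')
                    ++pos_;
            }
            else
            {
                break;
            }
        }
    }

    std::string input_;   // The input
    std::size_t pos_;     // Index of the next character to be read
    Lexeme currlexeme_;   // The current lexeme
};
```

Note that advance checks pos_ only to decide what to store in currlexeme_; done looks at the stored token, as discussed above. After the last real lexeme has been loaded, pos_ is already input_.size(), but done still returns false until advance is called once more.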

See lex.h & lex.cpp for our lexer code. As of this writing, the code is unfinished. The code compiles, and it can be called, but in its current form it does not produce correct results. All public members are defined, but function advance does not do anything. There is an internal-use function skipSpace, which moves pos_ to skip over both whitespace and comments.

Also see lex_main.cpp for a simple program that uses class Lex.

Lexical Analysis will be continued next time.

CS 331 Spring 2013: Lecture Notes for Friday, February 1, 2013 / Updated: 1 Feb 2013 / Glenn G. Chappell / ggchappell@alaska.edu
