CS 331 Spring 2013
Lecture Notes for Monday, February 4, 2013

Lexical Analysis (cont’d)

Writing a Lexer II: Coding a State Machine

State Machines

A state machine is a useful construct for various kinds of programming. The module representing the state machine has a state, which is stored somehow (I will use a simple variable). Then it repeats the following: look at the next input (character, in this case), and, based on the current state and the input, go to a new state, perhaps performing some action as well. We generally need some way to indicate that the state machine should terminate; we might use a “done” state.

It is convenient to write a lexer using a state machine. The state represents knowledge about the input so far. We can store the state in a variable of enum type, then have a while loop containing a switch on the state.
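The enum-plus-loop-plus-switch structure can be sketched as follows. This is a minimal illustration, not the course lexer: the function readLexeme, its parameters, and the toy lexeme categories (identifiers and unsigned integers only) are assumptions made for the example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical states for a toy machine; a real lexer would have more.
enum class State { START, LETTER, DIGIT, DONE };

// Reads one lexeme from text beginning at pos, advancing pos past it.
// Returns the lexeme, or "" if no identifier or integer starts at pos.
std::string readLexeme(const std::string & text, std::size_t & pos)
{
    std::string lexeme;
    State state = State::START;
    while (state != State::DONE)
    {
        // Check the subscript before reading. '\0' stands for end of
        // input (assumes the text itself contains no '\0' characters).
        char ch = (pos < text.size()) ? text[pos] : '\0';
        unsigned char uch = static_cast<unsigned char>(ch);

        switch (state)
        {
        case State::START:
            if (std::isalpha(uch))      { lexeme += ch; ++pos; state = State::LETTER; }
            else if (std::isdigit(uch)) { lexeme += ch; ++pos; state = State::DIGIT; }
            else                        { state = State::DONE; }
            break;
        case State::LETTER:  // inside an identifier
            if (std::isalnum(uch)) { lexeme += ch; ++pos; }
            else                   { state = State::DONE; }
            break;
        case State::DIGIT:   // inside an integer
            if (std::isdigit(uch)) { lexeme += ch; ++pos; }
            else                   { state = State::DONE; }
            break;
        case State::DONE:    // never handled here; the loop exits first
            break;
        }
    }
    return lexeme;
}
```

Each case handles one state and decides, from the current character alone, whether to consume it and which state comes next. The DONE state is the termination signal mentioned above.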

We will write such a lexer for the lexemes described in the Lexeme Description [PDF] handout.

An important rule: two situations correspond to the same state exactly when we would make the same decisions in both, for all possible future inputs. For example, in our lexer, reading a single digit (e.g., “3”) puts the machine in the DIGIT state. Reading “-2564” puts the machine in this same state, by the above rule. However, reading “7.” puts the machine in a different state (DIGDOT). To see why, suppose that the next character is a dot (“.”). “3.” and “-2564.” are legal numbers in our lexeme description, but “7..” is not, so a dot must be handled differently after “7.” than after “3” or “-2564”.

As we build our state machine, the above rule tells us when we need to define a new state. Having defined a state (DIGIT) for “3”, we do not need a new state for “-2564”. But we do need a new state (DIGDOT) for “7.”.
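The rule can be checked with a small transition function. The sketch below models only what is said above (an optional minus sign, digits, and a single dot). Only the names DIGIT and DIGDOT come from these notes. The type NumState, the states START, MINUS, and OTHER, and the functions step and stateAfter are hypothetical. Whatever follows a dot is lumped into the catch-all OTHER state, since the handout's full rules are not reproduced here.

```cpp
#include <cctype>
#include <string>

// Hypothetical state set; only DIGIT and DIGDOT are names from the notes.
enum class NumState { START, MINUS, DIGIT, DIGDOT, OTHER };

// One transition. The result depends only on the current state and ch,
// never on how the machine arrived in that state.
NumState step(NumState state, char ch)
{
    bool isDigit = std::isdigit(static_cast<unsigned char>(ch)) != 0;
    switch (state)
    {
    case NumState::START:
        if (isDigit)   return NumState::DIGIT;
        if (ch == '-') return NumState::MINUS;
        return NumState::OTHER;
    case NumState::MINUS:
        return isDigit ? NumState::DIGIT : NumState::OTHER;
    case NumState::DIGIT:
        if (isDigit)   return NumState::DIGIT;
        if (ch == '.') return NumState::DIGDOT;
        return NumState::OTHER;
    case NumState::DIGDOT:  // what may follow a dot is not modeled here
    case NumState::OTHER:
        return NumState::OTHER;
    }
    return NumState::OTHER;  // not reached; silences compiler warnings
}

// Runs the machine over an entire input and reports the final state.
NumState stateAfter(const std::string & input)
{
    NumState state = NumState::START;
    for (char ch : input)
        state = step(state, ch);
    return state;
}
```

With these definitions, stateAfter("3") and stateAfter("-2564") both give DIGIT, while stateAfter("7.") gives DIGDOT. A dot then moves the first two inputs to DIGDOT but moves the third out of the number states entirely, which is why “7.” needs a state of its own.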

An idea that I have found helpful—but which is by no means required—is to name each state after a short input that will put the machine into it. Thus, LETTER for a single letter, DIGIT for a single digit, DIGDOT for a digit followed by a dot, etc. Note that, by the above rule, these are not the only sequences that will put the machine into the respective states.

See lex.h & lex.cpp for our completed lexer code. The state-machine code is in function advance, whose (rather lengthy) description is in the source file.

Also see lex_main.cpp for a simple program that uses class Lex.

A Note on Lexing a String

Our lexer takes input from a string object using the bracket operator. A common source of bugs in such situations is reading characters past the end of the string. Therefore we rigidly enforce the following rule:

Every time a character is read from the input string, check first whether the subscript is in range.

Although we did not do so in our code, a useful idea is to encapsulate character reads in one or more functions that do the bounds checking, and then perform all reads using one of these functions.
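One way such encapsulation might look is sketched below. The helpers inRange and charAt are hypothetical names and do not appear in lex.cpp. The sketch also assumes that the input never contains a null character, so that '\0' can safely stand for “no character here”.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if pos is a valid subscript into text.
bool inRange(const std::string & text, std::size_t pos)
{
    return pos < text.size();
}

// Hypothetical helper: the character at pos, or '\0' if pos is out of
// range. Assumes '\0' never occurs in the input itself.
char charAt(const std::string & text, std::size_t pos)
{
    return inRange(text, pos) ? text[pos] : '\0';
}
```

If every read goes through charAt, then looking at the current character is charAt(text, pos) and peeking ahead is charAt(text, pos + 1), and neither can run off the end of the string.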

Lexical Analysis will be continued next time.


CS 331 Spring 2013: Lecture Notes for Monday, February 4, 2013 / Updated: 4 Feb 2013 / Glenn G. Chappell / ggchappell@alaska.edu