CS 331 Spring 2013: Lecture Notes for Monday, February 4, 2013
A state machine is a useful construct for various kinds of programming. The module representing the state machine has a state, which is stored somehow (I will use a simple variable). Then it repeats the following: look at the next input (character, in this case), and, based on the current state and the input, go to a new state, perhaps performing some action as well. We generally need some way to indicate that the state machine should terminate; we might use a “done” state.
It is convenient to write a lexer using a state machine. The state represents knowledge about the input so far. We can store the state in a variable, then have a switch on the state.
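As a minimal sketch of this pattern (not the lexer from the handout), the following C++ function keeps the state in a single variable and switches on it inside a loop, using a "done" state to terminate. The state names, the function `readIdentifier`, and the rule that an identifier is a letter followed by letters or digits are all assumptions made for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical states for a tiny machine that reads one identifier.
enum class IdState { START, LETTER, DONE };

// Reads an identifier (assumed here: a letter followed by letters or
// digits) from the beginning of `input`. Returns "" if there is none.
std::string readIdentifier(const std::string & input)
{
    IdState state = IdState::START;  // the current state, in a simple variable
    std::string lexeme;              // characters accepted so far
    std::size_t pos = 0;             // index of the next input character

    while (state != IdState::DONE)
    {
        // Check the subscript before reading; '\0' stands for end of input.
        char ch = (pos < input.size()) ? input[pos] : '\0';

        switch (state)
        {
        case IdState::START:
            if (std::isalpha(static_cast<unsigned char>(ch)))
            {
                lexeme += ch;
                ++pos;
                state = IdState::LETTER;
            }
            else
            {
                state = IdState::DONE;
            }
            break;
        case IdState::LETTER:
            if (std::isalnum(static_cast<unsigned char>(ch)))
            {
                lexeme += ch;
                ++pos;
            }
            else
            {
                state = IdState::DONE;
            }
            break;
        case IdState::DONE:
            break;  // not reached; the loop exits first
        }
    }
    return lexeme;
}
```

With these assumptions, `readIdentifier("abc12+3")` stops at the `+` and returns `"abc12"`, while an input that does not begin with a letter yields the empty string.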
We will write such a lexer for the lexemes described in the Lexeme Description [PDF] handout.
An important rule is that we are in the same state in two situations when we would always make the same decisions for all possible future inputs.
in our lexer, reading a digit (e.g., “
puts the machine in the
puts the machine in this same state by the above rule.
puts the machine in a different state (
To see why this is, suppose that the next character is a dot (“.”)
are legal numbers in our lexeme description.
As we build our state machine, the above rule tells us when we need to define a new state.
Having defined a state (
we do not need a new state for
But we do need a new state (
An idea that I have found helpful (but which is by no means required) is to name each state after a short input that will put the machine into it: LETTER for a single letter, DIGIT for a single digit, DIGDOT for a digit followed by a dot, etc. Note that, by the above rule, these are not the only sequences that will put the machine into the respective states.
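The naming idea and the same-state rule can be sketched together as follows. This is only an illustration: the `State` type, the `stateAfter` function, the START and OTHER states, and the simplified number syntax (digits, optionally followed by a dot and more digits) are assumptions, not the rules from the handout.

```cpp
#include <cctype>
#include <string>

// Each state is named after a short input that reaches it.
// START and OTHER are extra, hypothetical states for this sketch.
enum class State { START, LETTER, DIGIT, DIGDOT, OTHER };

// Returns the state the machine is in after reading all of `input`.
State stateAfter(const std::string & input)
{
    State state = State::START;
    for (char c : input)  // range-based loop: no subscript to go out of range
    {
        unsigned char ch = static_cast<unsigned char>(c);
        switch (state)
        {
        case State::START:
            if (std::isalpha(ch))      state = State::LETTER;
            else if (std::isdigit(ch)) state = State::DIGIT;
            else                       state = State::OTHER;
            break;
        case State::LETTER:  // letters and digits keep us in LETTER
            if (!std::isalnum(ch)) state = State::OTHER;
            break;
        case State::DIGIT:   // more digits change nothing; a dot does
            if (c == '.')               state = State::DIGDOT;
            else if (!std::isdigit(ch)) state = State::OTHER;
            break;
        case State::DIGDOT:  // after the dot, only digits are allowed
            if (!std::isdigit(ch)) state = State::OTHER;
            break;
        case State::OTHER:   // once here, stay here
            break;
        }
    }
    return state;
}
```

Under these assumptions, `stateAfter("3")` and `stateAfter("34")` both return `State::DIGIT`, because the machine would make the same decisions for every future input after either one. `stateAfter("3.")` returns `State::DIGDOT`, since a following dot is now handled differently.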
for our completed lexer code.
The state-machine code is in function
whose (rather lengthy) description is in the source file.
for a simple program that uses class
Our lexer takes input from a string using the bracket operator. A common source of bugs in such situations is reading characters past the end of the string. Therefore we rigidly enforce the following rule: every time a character is read from the input string, check first whether the subscript is in range.
Although we did not do so in our code, a useful idea is to encapsulate character reads in one or more functions that do the bounds checking, and then perform all reads using one of these functions.
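One possible way to do this is sketched below. The class name `CharReader` and its member functions are hypothetical and do not come from the lexer source; returning `'\0'` for an out-of-range read is also just one convention among several.

```cpp
#include <cstddef>
#include <string>

// Hypothetical wrapper that performs every character read with a
// bounds check, so no caller ever subscripts the string directly.
class CharReader {
public:
    explicit CharReader(const std::string & input)
        : input_(input), pos_(0)
    {}

    // True if there is a character at the current position.
    bool hasChar() const
    { return pos_ < input_.size(); }

    // Current character, or '\0' if the position is past the end.
    char current() const
    { return hasChar() ? input_[pos_] : '\0'; }

    // Character `offset` places ahead, or '\0' if that is out of range.
    char peek(std::size_t offset) const
    {
        // Compare by subtraction to avoid overflow in pos_ + offset.
        if (!hasChar() || offset >= input_.size() - pos_)
            return '\0';
        return input_[pos_ + offset];
    }

    // Move to the next character; does nothing at the end of input.
    void advance()
    { if (hasChar()) ++pos_; }

private:
    std::string input_;  // copy of the input text
    std::size_t pos_;    // index of the current character
};
```

Because the bracket operator appears only inside `current` and `peek`, the range check cannot be forgotten at an individual read. The cost of the `'\0'` convention is that a null character in the input looks the same as end of input, so callers that care should test `hasChar` as well.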
Lexical Analysis will be continued next time.