CS 331 Spring 2013 > Lecture Notes for Monday, February 4, 2013
A state machine is a useful construct for various kinds of programming. The module representing the state machine has a state, which is stored somehow (I will use a simple variable). Then it repeats the following: look at the next input (character, in this case), and, based on the current state and the input, go to a new state, perhaps performing some action as well. We generally need some way to indicate that the state machine should terminate; we might use a “done” state.
It is convenient to write a lexer using a state machine.
The state represents knowledge about the input so far.
We can store the state in a variable of enum type, then have a while loop containing a switch on the state.
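The enum, while, and switch arrangement described above might look roughly like the following. This is only a minimal sketch, not the class lexer: the states START, LETTER, and DONE and the function firstWord are hypothetical names chosen for illustration. The machine extracts the first run of letters from a string, using DONE as its terminating state.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical states for a tiny machine that extracts the first
// run of letters from a string. These are NOT the class lexer's states.
enum State { START, LETTER, DONE };

std::string firstWord(const std::string & input)
{
    State state = START;   // current state, stored in a simple variable
    std::string word;      // letters collected so far
    std::size_t pos = 0;   // subscript of the next character to look at

    while (state != DONE)
    {
        // Check the subscript before reading; end of input is its own case.
        bool atEnd = (pos >= input.size());
        char ch = atEnd ? '\0' : input[pos];
        bool isLetter = !atEnd && std::isalpha(static_cast<unsigned char>(ch));

        switch (state)
        {
        case START:        // no letter seen yet
            if (atEnd)
                state = DONE;
            else if (isLetter)
            {
                word += ch;
                ++pos;
                state = LETTER;
            }
            else
                ++pos;     // skip a non-letter; stay in START
            break;
        case LETTER:       // inside a run of letters
            if (isLetter)
            {
                word += ch;
                ++pos;
            }
            else
                state = DONE;  // the run (or the input) has ended
            break;
        case DONE:         // not reached: the loop exits first
            break;
        }
    }
    return word;
}
```

Each pass through the loop looks at one input character, and the switch decides, from the current state and that character, what action to take and which state comes next.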
We will write such a lexer for the lexemes described in the Lexeme Description [PDF] handout.
An important rule is that two situations correspond to the same state when we would always make the same decisions in both, for all possible future inputs.
For example, in our lexer, reading a digit (e.g., “3”) puts the machine in the DIGIT state. Reading “-2564” puts the machine in this same state by the above rule. However, reading “7.” puts the machine in a different state (DIGDOT). To see why this is, suppose that the next character is a dot (“.”). “3.” and “-2564.” are legal numbers in our lexeme description. However, “7..” is not.
As we build our state machine, the above rule tells us when we need to define a new state. Having defined a state (DIGIT) for “3”, we do not need a new state for “-2564”. But we do need a new state (DIGDOT) for “7.”.
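To make the DIGIT versus DIGDOT distinction concrete, here is a hedged sketch of a machine that measures the longest numeric lexeme at the start of a string. It is not the code in lex.cpp. The handout's exact rules are not reproduced here, so the sketch assumes that a number is an optional minus sign, one or more digits, an optional dot, and then optional further digits. The MINUS state, the START and DONE states, and the function numberPrefixLength are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// DIGIT and DIGDOT follow the notes; START, MINUS, and DONE are assumptions.
enum NumState { START, MINUS, DIGIT, DIGDOT, DONE };

// Length of the longest number at the start of input (0 if there is none),
// under the assumed rules: optional '-', digits, optional '.', more digits.
std::size_t numberPrefixLength(const std::string & input)
{
    NumState state = START;
    std::size_t pos = 0;       // subscript of the next character to examine
    std::size_t accepted = 0;  // length of the longest legal number so far

    while (state != DONE)
    {
        bool atEnd = (pos >= input.size());  // check range before reading
        char ch = atEnd ? '\0' : input[pos];
        bool isDigit = !atEnd && std::isdigit(static_cast<unsigned char>(ch));

        switch (state)
        {
        case START:
            if (isDigit)        { ++pos; accepted = pos; state = DIGIT; }
            else if (ch == '-') { ++pos; state = MINUS; }
            else                { state = DONE; }
            break;
        case MINUS:   // a lone '-' is not treated as a number in this sketch
            if (isDigit) { ++pos; accepted = pos; state = DIGIT; }
            else         { state = DONE; }
            break;
        case DIGIT:   // "3" and "-2564" both land here
            if (isDigit)        { ++pos; accepted = pos; }
            else if (ch == '.') { ++pos; accepted = pos; state = DIGDOT; }
            else                { state = DONE; }
            break;
        case DIGDOT:  // "7." lands here; a second dot is NOT accepted
            if (isDigit) { ++pos; accepted = pos; }
            else         { state = DONE; }
            break;
        case DONE:    // not reached: the loop exits first
            break;
        }
    }
    return accepted;
}
```

The only difference between the DIGIT and DIGDOT cases is what happens on a dot. That difference in future decisions is exactly why the two need to be separate states.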
An idea that I have found helpful (but which is by no means required) is to name each state after a short input that will put the machine into it. Thus, LETTER for a single letter, DIGIT for a single digit, DIGDOT for a digit followed by a dot, etc.
Note that, by the above rule,
these are not the only sequences
that will put the machine into the respective states.
See lex.h & lex.cpp for our completed lexer code. The state-machine code is in function advance, whose (rather lengthy) description is in the source file. Also see lex_main.cpp for a simple program that uses class Lex.
Our lexer takes input from a string object using the bracket operator. A common source of bugs in such situations is to read characters past the end of the string. Therefore we rigidly enforce the following rule:
Every time a character is read from the input string, check first whether the subscript is in range.
Although we did not do so in our code, a useful idea is to encapsulate character reads in one or more functions that do the bounds checking, and then perform all reads using one of these functions.
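One possible shape for such functions is sketched below. The names inRange and charAt are hypothetical and do not appear in lex.cpp. The sketch also assumes that the null character never occurs as a real input character, so that it can stand for "no character here".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers; these names are not taken from lex.cpp.

// True if pos is a valid subscript for input.
bool inRange(const std::string & input, std::size_t pos)
{
    return pos < input.size();
}

// Character at subscript pos, or '\0' if pos is out of range.
// Assumes '\0' never occurs as a real character in the input.
char charAt(const std::string & input, std::size_t pos)
{
    return inRange(input, pos) ? input[pos] : '\0';
}
```

If every read goes through charAt, the bounds check cannot be forgotten at any individual read, and a test such as charAt(input, pos) == '.' is automatically false at the end of the input.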
Lexical Analysis will be continued next time.
ggchappell@alaska.edu