CS 331 Spring 2013: Lecture Notes for Friday, February 1, 2013
A lexer is a code module that does lexical analysis. Typically a lexer functions as a front-end for a parser. A lexer reads source text and outputs a stream of lexemes. A lexeme is a string holding a “word” from the source text, together with information about what kind of word it is (identifier, keyword, operator, etc.). This latter information is sometimes called a token.
An identifier is a name that a program can give to some entity within a program. Consider the following C++ code.
class MyClass {
public:
    void myFunc(Type1 & param1, Type2 & param2)
    {
        while (param1 != param2) {
            ++param1;
        }
        if (param2)
            return;
        ++param2;
    }
};
In the code above, MyClass, myFunc, Type1, param1, Type2, and param2 are all identifiers.
A keyword is a word that has special meaning to a programming language. In the above code, class, public, void, while, if, and return are all keywords.
A reserved word is a word that is not a legal identifier. In many programming languages, the keywords and the reserved words are the same. However, it is not hard to envision a variant of (say) C++ in which the compiler could distinguish how a word is being used solely by its position in the code. Then the following might be legal code.
for (for for = -for; for; --for) ;
Indeed, the programming language Fortran traditionally has no reserved words. The following is, famously, legal code in at least some versions of Fortran.
IF IF THEN THEN ELSE ELSE
On the other hand, there can be reserved words that are not keywords. The Java standard specifies that goto is a reserved word. However, goto has no meaning in Java, so it is not a keyword in the sense used here. Thus, this word cannot be used in a program at all, outside of comments and string literals.
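To make the distinction concrete, here is a minimal sketch that classifies a word in an imaginary Java-like language. The function classifyWord, the type WordKind, and the tiny word lists are all made up for illustration; they are not taken from any language standard.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical categories for a word that has the form of an identifier.
enum WordKind { KEYWORD, RESERVED_ONLY, IDENTIFIER };

// Classify a word in an imaginary Java-like language (made-up word lists).
WordKind classifyWord(const std::string & word)
{
    // Words that have special meaning to the language.
    static const std::set<std::string> keywords = { "if", "while", "return" };
    // Words that are not legal identifiers. Here every keyword is reserved,
    // and "goto" is reserved even though it has no meaning.
    static const std::set<std::string> reserved = { "if", "while", "return", "goto" };

    // Sanity check on the tables: in this language, keywords are reserved.
    for (const auto & k : keywords)
        assert(reserved.count(k) == 1);

    if (keywords.count(word) == 1)
        return KEYWORD;
    if (reserved.count(word) == 1)
        return RESERVED_ONLY;   // reserved, but not a keyword
    return IDENTIFIER;
}
```

In a language like traditional Fortran, the reserved set would be empty, and a lookup like this could not decide the question; the parser would need to look at where the word appears.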
There are essentially three ways to write a lexer.
The first method might involve a software package like lex, which generates C code for a lexer, given input that consists primarily of regular expressions.
We will write a lexer using the last method. When we are done, it should not be difficult to see how we could have used a table instead.
As we have said, a lexer outputs a series of lexemes. Because a lexer is usually a front-end for a parser, there is no need to store these lexemes in a large structure. Rather, the lexer can provide get-next-lexeme functionality.
We will now begin writing a lexer. A description of the lexemes will be available at our next meeting. For now, we will simply say that the lexer is intended for handling a (roughly) C++-like programming language. In particular:
Note that there are programming languages in which the above do not apply. For example, in Python, Haskell, JavaScript, and Go, a newline can serve as an end marker, just like a semicolon (“;”) in C++. As another example, in the programming language Forth, adjacent lexemes are always separated by whitespace.
In addition:
Comments begin with a hash character (“#”) that is not part of any lexeme that began before it, and end with the next newline or the end of the input, whichever comes first.
Our lexer will be implemented as a C++ class Lex, defined in files lex.h and lex.cpp. The interface to the class will be as follows.
Input is given by passing a string to the constructor or to member function set. Using the default constructor is the same as passing an empty string as input.
Member function current will return the current lexeme. If this function is called repeatedly, with no other member-function calls intervening, then it will return the same value.
Member function advance will move to the next lexeme.
Member function done will return true if there are no more lexemes to read.
Lexemes will be of type Lex::Lexeme, which will be a pair (std::pair) consisting of a std::string holding the text of the lexeme, and an enum of type Lex::Token.
If done is true, then current will return a lexeme with token NONE.
The following code would print the text of all lexemes in the string prog, each on a separate line.
Lex luthor;
for (luthor.set(prog); !luthor.done(); luthor.advance())
    cout << luthor.current().first << endl;
Note that current can be called as soon as the input is given to the Lex object; there is no need to tell the lexer to go to the first lexeme. This suggests that the current lexeme should be stored in a data member, and the constructor, set, and advance should all ensure that the new lexeme is stored there when they exit. We probably want the constructor to call set, which, in turn, calls advance.
Class Lex needs three data members.
input_: a std::string holding the input.
pos_: an unsigned integer (std::size_t) holding the index of the next character to be read from input_.
currlexeme_: a Lexeme holding the current lexeme. This is returned by member function current.
Note: I am fond of the convention that names of data members end with an underscore (_). You are not required to like this convention; nor are you required to follow it. But it certainly is a good idea to distinguish names of data members in some fashion.
It is important to be clear about class invariants. In particular, what is the value of member pos_ when class code is not being executed? Remember that there might be whitespace anywhere; as noted in class, the fact that pos_ is past the last lexeme does not mean that it is beyond the end of the input string. In class Lex, pos_ will always point to the beginning of the next lexeme, or just past the end of the string (pos_ == input_.size()) if there are no more lexemes.
Despite the above, we should not check the value of pos_ to determine whether there are more lexemes to be read. After the final lexeme is processed and placed into currlexeme_, the value of pos_ will be input_.size(). However, current will still return a valid lexeme; from the point of view of the outside world, we are not yet past the end. Therefore, we write done as follows.
bool done() const { return currlexeme_.second == NONE; }
Class Lex needs two public member types.
Token: an enum used to specify the token of a lexeme. One of its values is NONE; this is the token of a returned lexeme when there are no more lexemes in the input.
Lexeme: the type of a lexeme, as returned by current. This is defined to be std::pair<std::string, Token>.
Class Lex has a single constructor, which takes an optional string argument (defaulting to an empty string) and simply calls member function set.
The automatically generated destructor, copy constructor, and copy assignment operator will be used.
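The design described above might be sketched as follows. This is not the code in lex.h; the class name LexSketch, the token WORD, and the rule that a lexeme is simply a maximal run of non-whitespace characters are assumptions, chosen only so that the sketch is complete enough to compile and run.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// A sketch only, NOT the course's class Lex. The name LexSketch, the token
// WORD, and the whitespace-only lexeme rule are assumptions.
class LexSketch {
public:
    enum Token { WORD, NONE };
    typedef std::pair<std::string, Token> Lexeme;

    // The single constructor simply calls set.
    LexSketch(const std::string & input = "") { set(input); }

    // Store new input and read the first lexeme.
    void set(const std::string & input)
    {
        input_ = input;
        pos_ = 0;
        skipSpace();   // establish the invariant on pos_
        advance();     // so current() is usable immediately
    }

    const Lexeme & current() const { return currlexeme_; }

    // Read the lexeme starting at pos_ into currlexeme_.
    void advance()
    {
        assert(pos_ <= input_.size());   // part of the class invariant
        std::size_t start = pos_;
        while (pos_ < input_.size() && !isSpace(input_[pos_]))
            ++pos_;
        if (pos_ == start)               // nothing left to read
            currlexeme_ = Lexeme("", NONE);
        else
            currlexeme_ = Lexeme(input_.substr(start, pos_ - start), WORD);
        skipSpace();   // pos_: start of next lexeme, or input_.size()
    }

    // Tests the token, not pos_.
    bool done() const { return currlexeme_.second == NONE; }

private:
    static bool isSpace(char c)
    { return std::isspace(static_cast<unsigned char>(c)) != 0; }

    void skipSpace()
    {
        while (pos_ < input_.size() && isSpace(input_[pos_]))
            ++pos_;
    }

    std::string input_;     // the input
    std::size_t pos_;       // index of next character to read
    Lexeme currlexeme_;     // the current lexeme
};
```

With input "  ab  cd ", once the lexeme cd has been stored, pos_ is equal to input_.size(). Even so, done returns false until advance is called once more, which is why done looks at the token rather than at pos_.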
See lex.h & lex.cpp for our lexer code. As of this writing, the code is unfinished. The code compiles, and it can be called, but in its current form it does not produce correct results. All public members are defined, but function advance does not do anything. There is an internal-use function skipSpace, which moves pos_ to skip over both whitespace and comments.
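One way skipSpace might work is sketched below, assuming, as described earlier, that a comment runs from a “#” to the next newline or the end of the input. To keep the sketch self-contained, it is written as a free function, here called skipSpaceSketch, that takes the input and a position and returns the new position; the actual member function instead updates pos_.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for skipSpace. Returns the index of the first
// character at or after pos that is neither whitespace nor part of a
// '#' comment, or input.size() if there is no such character.
std::size_t skipSpaceSketch(const std::string & input, std::size_t pos)
{
    assert(pos <= input.size());
    while (pos < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[pos]);
        if (std::isspace(c)) {
            ++pos;                       // skip one whitespace character
        }
        else if (c == '#') {
            // Skip the comment, up to (not including) the next newline
            // or the end of the input; the newline is skipped as
            // whitespace on the next pass through the loop.
            while (pos < input.size() && input[pos] != '\n')
                ++pos;
        }
        else {
            break;                       // start of a lexeme
        }
    }
    return pos;
}
```

Since a function like this is called only between lexemes, a “#” it encounters cannot be part of a lexeme that began earlier.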
Also see lex_main.cpp for a simple program that uses class Lex.
Lexical Analysis will be continued next time.
ggchappell@alaska.edu