CS 331 Spring 2009
Assignment 2
Assignment 2 is due at 5 p.m. Tuesday, February 24.
It is worth 25 points.
Procedures
E-mail
answers to the exercises below to
ffggc@uaf.edu,
using the subject
“PA2”.
- Your answers should consist of files containing the
source code from Exercise A.
These file (or an archive file containing them)
should be attached to your e-mail message.
- Be sure to include your name in your e-mail.
- I may not read your homework e-mail immediately.
If you wish to discuss the assignment (or anything else)
with me, send me a separate message with a different subject line.
Exercises (25 pts total)
Exercise A — Lexical Analyzer
Purpose
In this exercise, you will write a state-machine-based
lexical analyzer for a C-like language.
Instructions
Write, and turn in source code for:
- A state-machine-based lexical analyzer.
- A program that uses the above lexical analyzer.
General requirements.
- The source code for lexical analyzer and the program that uses it
should be stored in separate files.
- The code should be written in C or C++, unless you have gotten
approval from me to use another language.
(No general approval will be given; you must speak to me individually.)
- Code should be neat and readable, and should follow standard file conventions
for the language you use
(e.g., in C++, use the standard header-source conventions,
with #ifndef, #include, etc.)
- Code should be usable according to the standard conventions of the
language.
For example, in C++, I should be able to add all the files to a project
in any standard IDE, build, and run.
- Code may use built-in features of a language, along with its
standard libraries.
No third-party libraries may be used.
Requirements for the lexical analyzer.
- It should be usable separately from your main program.
That is, one can toss out the main program and write a different package
that uses the same lexical analyzer.
- It should take input from a file, whose name is specified by the client code.
- It should not print anything.
Rather, its results are returned to the caller.
- It should analyze according to the lexeme description below
(see “Lexemes”).
- For each lexeme, it should return an indication of the lexeme’s type
(KEYWORD, OPERATOR, etc.), as well as the text of the lexeme.
- It should do most of its processing via a state machine.
(No, no regular expressions.)
Requirements for the main program.
- It should get a filename from the outside world by some normal means.
(Generally this means either from standard input or a command-line option.
Also, it needs to be really easy for me to figure out which.)
- The given file should be considered as a source file for a program
in the language handled by the lexical analyzer.
The program should use the lexical analyzer to analyze the file,
printing a list of lexemes (for each, type and text) one on each line.
Following this, it should print a symbol table:
a list of all identifiers appearing in the file, in alphabetical order,
one on each line.
No identifier should appear in this list twice.
Lexemes
The lexical analyzer should analyze according to the following description.
Whitespace (as in the analyzer we wrote in class:
blank, tab, vertical-tab, new-line, carriage-return, form-feed)
is generally a lexeme separator.
However, whitespace that occurs in a comment or a STRING is part of the comment or STRING.
Lexemes are not required to be separated by whitespace.
Comments are C-style multiline comments.
They begin with “/*”, not inside a STRING,
and end with “*/” or the end of the file.
The beginning of a comment ends a lexeme.
The characters “/*” occuring inside a STRING
are part of the STRING, and are not considered to begin a comment.
There are 6 kinds of lexemes: STRING, KEYWORD, IDENTIFIER, INTEGER, FLOAT, OPERATOR.
- STRING
- A double-quoted string.
Begins with a double quote (“"”),
and continues until the next double quote or the end of the file.
A STRING may contain new-lines.
(Note that there are no backslash escapes, or anything like that.)
The text of a STRING lexeme does not include the quotes.
For example, if x"abc"y is encountered in the file,
then the analyzer should indicate three lexemes:
- An IDENTIFIER with text x.
- A STRING with text abc.
- An IDENTIFIER with text y.
- KEYWORD
- One of “for”, “while”, “if”, or “else”.
- IDENTIFIER
- Any C-style identifier that is not a keyword.
Begins with a letter or underscore (“_”).
Contains only letters, underscores, and digits,
and is not equal to any of the KEYWORDs.
Examples: “main”, “_1_2_3_4_”, “abc12345”, “___”.
- INTEGER
- An optional “+” or “-”,
followed by one or more digits.
Examples: “1”, “-2345”, “+12”.
- FLOAT
- An optional “+” or “-”
followed by a sequence of characters consisting only of digits and a dot (“.”),
which contains exactly one dot and at least one digit, followed by an optional EXPONENT.
An EXPONENT is an “e” or “E” followed
by a sequence of characters conforming to the description of an INTEGER.
Examples: “.123”, “-34.2”, “-34.e45”, “.01E-24”,
“+123.456e+123”, “00.E00”.
These are not FLOATs: “-.”, “+e23”, “E10”, “.e10”, “00E00”.
- OPERATOR
- “++”, “--”, “==”, “!=”, “<=”, “>=”, or any single character which is not whitespace and not the beginning of a comment or a non-OPERATOR lexeme.
A lexeme continues as far as it legally can.
Thus, “===” parses as “==” followed by “=”.
Example
Suppose your program is given the following file as input.
for+23++23e01 23.e01e01=====//comment
"/*++"/* "hello"
The output should be very much like the following.
Lexemes:
KEYWORD for
INTEGER +23
OPERATOR ++
INTEGER 23
IDENTIFIER e01
FLOAT 23.e01
IDENTIFIER e01
OPERATOR ==
OPERATOR ==
OPERATOR =
OPERATOR /
OPERATOR /
IDENTIFIER comment
STRING /*++
Symbol table:
e01
comment