CS 331 Spring 2009 > Assignment 2

CS 331 Spring 2009
Assignment 2

Assignment 2 is due at 5 p.m. Tuesday, February 24. It is worth 25 points.

Procedures

E-mail answers to the exercises below to ffggc@uaf.edu, using the subject “PA2”.

Your answers should consist of files containing the source code from Exercise A. These file (or an archive file containing them) should be attached to your e-mail message.
Be sure to include your name in your e-mail.
I may not read your homework e-mail immediately. If you wish to discuss the assignment (or anything else) with me, send me a separate message with a different subject line.

Exercises (25 pts total)

Exercise A — Lexical Analyzer

Purpose

In this exercise, you will write a state-machine-based lexical analyzer for a C-like language.

Instructions

Write, and turn in source code for:

A state-machine-based lexical analyzer.
A program that uses the above lexical analyzer.

General requirements.

The source code for lexical analyzer and the program that uses it should be stored in separate files.
The code should be written in C or C++, unless you have gotten approval from me to use another language. (No general approval will be given; you must speak to me individually.)
Code should be neat and readable, and should follow standard file conventions for the language you use (e.g., in C++, use the standard header-source conventions, with #ifndef, #include, etc.)
Code should be usable according to the standard conventions of the language. For example, in C++, I should be able to add all the files to a project in any standard IDE, build, and run.
Code may use built-in features of a language, along with its standard libraries. No third-party libraries may be used.

Requirements for the lexical analyzer.

It should be usable separately from your main program. That is, one can toss out the main program and write a different package that uses the same lexical analyzer.
It should take input from a file, whose name is specified by the client code.
It should not print anything. Rather, its results are returned to the caller.
It should analyze according to the lexeme description below (see “Lexemes”).
For each lexeme, it should return an indication of the lexeme’s type (KEYWORD, OPERATOR, etc.), as well as the text of the lexeme.
It should do most of its processing via a state machine. (No, no regular expressions.)

Requirements for the main program.

It should get a filename from the outside world by some normal means. (Generally this means either from standard input or a command-line option. Also, it needs to be really easy for me to figure out which.)
The given file should be considered as a source file for a program in the language handled by the lexical analyzer. The program should use the lexical analyzer to analyze the file, printing a list of lexemes (for each, type and text) one on each line. Following this, it should print a symbol table: a list of all identifiers appearing in the file, in alphabetical order, one on each line. No identifier should appear in this list twice.

Lexemes

The lexical analyzer should analyze according to the following description.

Whitespace (as in the analyzer we wrote in class: blank, tab, vertical-tab, new-line, carriage-return, form-feed) is generally a lexeme separator. However, whitespace that occurs in a comment or a STRING is part of the comment or STRING. Lexemes are not required to be separated by whitespace.

Comments are C-style multiline comments. They begin with “/*”, not inside a STRING, and end with “*/” or the end of the file. The beginning of a comment ends a lexeme. The characters “/*” occuring inside a STRING are part of the STRING, and are not considered to begin a comment.

There are 6 kinds of lexemes: STRING, KEYWORD, IDENTIFIER, INTEGER, FLOAT, OPERATOR.

STRING

A double-quoted string. Begins with a double quote (“"”), and continues until the next double quote or the end of the file. A STRING may contain new-lines. (Note that there are no backslash escapes, or anything like that.)

The text of a STRING lexeme does not include the quotes. For example, if x"abc"y is encountered in the file, then the analyzer should indicate three lexemes:

An IDENTIFIER with text x.
A STRING with text abc.
An IDENTIFIER with text y.

KEYWORD

One of “for”, “while”, “if”, or “else”.

IDENTIFIER

Any C-style identifier that is not a keyword. Begins with a letter or underscore (“_”). Contains only letters, underscores, and digits, and is not equal to any of the KEYWORDs.

Examples: “main”, “_1_2_3_4_”, “abc12345”, “___”.

INTEGER

An optional “+” or “-”, followed by one or more digits.

Examples: “1”, “-2345”, “+12”.

FLOAT

An optional “+” or “-” followed by a sequence of characters consisting only of digits and a dot (“.”), which contains exactly one dot and at least one digit, followed by an optional EXPONENT. An EXPONENT is an “e” or “E” followed by a sequence of characters conforming to the description of an INTEGER.

Examples: “.123”, “-34.2”, “-34.e45”, “.01E-24”, “+123.456e+123”, “00.E00”.

These are not FLOATs: “-.”, “+e23”, “E10”, “.e10”, “00E00”.

OPERATOR

“++”, “--”, “==”, “!=”, “<=”, “>=”, or any single character which is not whitespace and not the beginning of a comment or a non-OPERATOR lexeme.

A lexeme continues as far as it legally can. Thus, “===” parses as “==” followed by “=”.

Example

Suppose your program is given the following file as input.

for+23++23e01 23.e01e01=====//comment
"/*++"/* "hello"

The output should be very much like the following.

Lexemes:
KEYWORD    for
INTEGER    +23
OPERATOR   ++
INTEGER    23
IDENTIFIER e01
FLOAT      23.e01
IDENTIFIER e01
OPERATOR   ==
OPERATOR   ==
OPERATOR   =
OPERATOR   /
OPERATOR   /
IDENTIFIER comment
STRING     /*++

Symbol table:
e01
comment

CS 331 Spring 2009: Assignment 2 / Updated: 22 Feb 2009 / Glenn G. Chappell / ffggc@uaf.edu

CS 331 Spring 2009 Assignment 2