CS 331 Spring 2025 > Assignment 3 (Writing a Lexer)
CS 331 Spring 2025
Assignment 3 (Writing a Lexer)
Assignment 3 is due at 5 pm on Tuesday, February 18. It is worth 90 points.
Procedures
This assignment is to be done individually.
Turn in answers to the exercises below on the UA Canvas site, under Assignment 3 for this class.
- Your answers must consist of
the answer to Exercise A
(submitted as an attached text/PDF file),
and
the source code for Exercise B
(file
lexit.lua
). - I may not look at your homework submission immediately. If you have questions, e-mail me.
Exercises (A–B, 90 pts total)
Exercise A — Running a Scheme Program
Purpose
In this exercise you will make sure you can execute Scheme code.
Instructions
Get the file check_scheme.scm
from the class Git repository.
This is a Scheme source file for a complete program.
It is intended to be executed under DrRacket:
open it in DrRacket (in the File menu),
and click the “Run” button.
When it is executed,
this program prints
“Secret message #3:
”.
Below that, it prints a secret message.
Run the program.
What is the secret message?
Exercise B — Lexer in Lua
Purpose
In this exercise you will write a Lua module that does lexical analysis.
In the next assignment, you will build a parser on top of your lexer. And in a later assignment, you will build an interpreter that uses the output from your parser. When you are done, you will have a complete interpreter for a programming language called Fulmar.
Instructions
Write a Lua module lexit
,
contained in the file lexit.lua
.
Your module will do lexical analysis;
it must be written as a hand-coded state machine.
Be sure to follow the Coding Standards.
- The interface of module
lexit
is very similar to that of modulelexer
, which was written in class—with some differences, to be covered shortly. In particular,lexit
exports:- Function
lexit.lex
, which takes a string parameter and allows for-in iteration through lexemes in the passed string. - Numerical constants representing lexeme categories. These constants are shown in a table below.
- Table
lexit.catnames
, which maps lexeme category numbers to printable strings.
- Function
- The interface of module
lexit
differs from that oflexer
as follows:- Lexing is done based on the Lexical Specification in this document (below), not the one distributed in class.
- The exported numerical constants and category names are different.
- Module
lexit
exports nothing other than functionlexit.lex
, seven constants representing lexeme categories, and tablelexit.catnames
. You may write anything you want in the source code for the module, as long as it is declared local and not exported—and it follows the Coding Standards.
The following properties of module lexer
must hold for module lexit
as well.
- At each iteration, the iterator function returns a pair: a string, which is the string form of the lexeme, and a number representing the category of the lexeme.
- The number mentioned in the previous point
is suitable as a key for table
catnames
. - The iteration ends when there are no further lexemes.
The correspondence between lexeme category numbers and category names/strings is as follows.
Category Number |
Named Constant |
Printable Form |
---|---|---|
1 | lexit.KEY |
Keyword |
2 | lexit.ID |
Identifier |
3 | lexit.NUMLIT |
NumericLiteral |
4 | lexit.STRLIT |
StringLiteral |
5 | lexit.OP |
Operator |
6 | lexit.PUNCT |
Punctuation |
7 | lexit.MAL |
Malformed |
Thus, the following code will work.
[Lua]
lexit = require "lexit" program = "x = 3 # Set a variable\nprintln(x+4)\n" for lexstr, cat in lexit.lex(program) do print(lexstr, lexit.catnames[cat]) end
Lexical Specification
This is a specification of the lexemes in the Fulmar programming language.
Whitespace characters are blank, tab, vertical-tab, new-line, carriage-return, form-feed. No lexeme, except for a StringLiteral, may contain a whitespace character. So a whitespace character, or any contiguous group of whitespace characters, is generally a separator between lexemes. However, pairs of lexemes are not required to be separated by whitespace.
A comment begins with a pound sign (#
)
occurring outside a StringLiteral lexeme or another comment,
and ends at a newline character or the end of the input,
whichever comes first.
There are no other kinds of comments.
Any character at all may occur in a comment.
Comments are treated by the lexer as whitespace: they are not part of lexemes and are not passed on to the caller.
Legal characters outside comments and StringLiteral lexemes are whitespace and printable ASCII characters (values 32 [blank] to 126 [tilde]). Any other character outside comments and StringLiteral lexemes is illegal.
Maximal munch is followed.
There are seven lexeme categories: Keyword, Identifier, NumericLiteral, StringLiteral, Operator, Punctuation, Malformed.
Below, in a regular expression, a character preceded by a backslash means the literal character, with no special meaning.
- Keyword
- One of the following 12:
chr elif else end func if print println readnum return rnd while
- Identifier
- Any string matched by
/[a-zA-Z_][a-zA-Z_0-9]*/
that is not a Keyword.Here are some Identifier lexemes.
myvar _ ___x_37cr HelloThere RETURN
Note. The reserved words are the same as the Keyword lexemes.
- NumericLiteral
- Any string matched by
/[0-9]+([eE]\+?[0-9]+)?/
.Notes. A NumericLiteral must begin with a digit and cannot contain a dot (
.
). A minus sign is not legal in an exponent (the “e
” or “E
” and what comes after it). A plus sign is legal, and optional, in an exponent. An exponent must contain at least one digit.Here are some valid NumericLiteral lexemes.
1234 00900 123e+7 00E00 3e888
The following are not valid NumericLiteral lexemes.
-42 3e e 123E+ 1.23 123e-7
- StringLiteral
- A single quote (
'
) or double quote ("
), followed by zero or more characters that are not newlines or the same as the opening quote mark, followed by a quote that matches the opening quote mark. There are no escape sequences. Any character, legal or illegal, other than a newline or a quote that matches the opening quote mark, may appear inside a StringLiteral. The beginning and ending quote marks are both part of the lexeme.Here are some StringLiteral lexemes.
"Hello there!" '' '"' "'--#!Ωé\"
- Operator
- One of the following seventeen:
! && || == != < <= > >= + - * / % [ ] =
- Punctuation
- Any single legal character
that is not whitespace,
not part of a comment,
and not part of any valid lexeme in one of the other categories,
including Malformed.
Here are some Punctuation lexemes.
; ( ) { } , & $
- Malformed
- There are two kinds of Malformed lexemes:
bad character and bad string.
A bad character is any single character that is illegal, that is not part of a comment or a StringLiteral lexeme that began earlier.
A bad string is essentially a partial StringLiteral where the end of the line or the end of the input is reached before the ending quote mark. It begins with a double quote mark that is not part of a comment or StringLiteral that began earlier, and continues to the next newline or the end of the input, without a double quote appearing. Any character, legal or illegal, may appear in a bad string. If the lexeme ends at a newline, then this newline is not part of the lexeme.
Here are three Malformed lexemes that are bad strings.
In order to be counted as Malformed. each of the above must end at a newline (which would not be considered part of the lexeme) or at the end of the input."a-b-c 'wx yz "Ωé'
Note. The two kinds of Malformed lexemes are presented to the caller in the same way: they are both simply Malformed.
Test Program
A test program
is available in the Git repository:
lexit_test.lua
.
If you
run this program (unmodified!) with your code,
then it will test
whether your code works properly.
Do not turn in the test program.
Notes
- You will use your code from Assignment 3 again in Assignment 4. It will also need to work with your code in Assignment 6. Do an extra good job!
- To see what output your lexer produces for some input,
try
use_lexit.lua
, in the Git repository. You may wish to modify the variableprogram
inuse_lexit.lua
, in order to see how your lexer handles various inputs. - See Thoughts on Assignment 3, in the lecture slides for February 12.