CS 331 Spring 2025
Assignment 3 (Writing a Lexer)

Assignment 3 is due at 5 pm on Tuesday, February 18. It is worth 90 points.

Procedures

This assignment is to be done individually.

Turn in answers to the exercises below on the UA Canvas site, under Assignment 3 for this class.

Exercises (A–B, 90 pts total)

Exercise A — Running a Scheme Program

Purpose

In this exercise you will make sure you can execute Scheme code.

Instructions

Get the file check_scheme.scm from the class Git repository. This is a Scheme source file for a complete program. It is intended to be executed under DrRacket: open it in DrRacket (in the File menu), and click the “Run” button. When it is executed, this program prints “Secret message #3:”. Below that, it prints a secret message. Run the program. What is the secret message?

Exercise B — Lexer in Lua

Purpose

In this exercise you will write a Lua module that does lexical analysis.

In the next assignment, you will build a parser on top of your lexer. And in a later assignment, you will build an interpreter that uses the output from your parser. When you are done, you will have a complete interpreter for a programming language called Fulmar.

Instructions

Write a Lua module lexit, contained in the file lexit.lua. Your module will do lexical analysis; it must be written as a hand-coded state machine.

Be sure to follow the Coding Standards.

The following properties of module lexer must hold for module lexit as well.

The correspondence between lexeme category numbers and category names/strings is as follows.

Category Number   Named Constant   Printable Form
1                 lexit.KEY        Keyword
2                 lexit.ID         Identifier
3                 lexit.NUMLIT     NumericLiteral
4                 lexit.STRLIT     StringLiteral
5                 lexit.OP         Operator
6                 lexit.PUNCT      Punctuation
7                 lexit.MAL        Malformed
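
As an illustration of the table above, the constants and the printable forms might be laid out as follows. This is only a sketch: the local table named lexit stands in for whatever table lexit.lua actually exports, and the layout of catnames is an assumption based on how it is indexed in the usage code below.

```lua
-- Stand-in for the module table (assumption: lexit.lua exports a table
-- with these members).
local lexit = {}

-- Named constants for the lexeme category numbers.
lexit.KEY    = 1
lexit.ID     = 2
lexit.NUMLIT = 3
lexit.STRLIT = 4
lexit.OP     = 5
lexit.PUNCT  = 6
lexit.MAL    = 7

-- catnames maps a category number to its printable form.
lexit.catnames = {
    "Keyword",
    "Identifier",
    "NumericLiteral",
    "StringLiteral",
    "Operator",
    "Punctuation",
    "Malformed",
}

print(lexit.catnames[lexit.NUMLIT])
```

Because the array is indexed by category number, lexit.catnames[lexit.MAL] yields "Malformed" without any lookup logic.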

Thus, the following code will work.

[Lua]

lexit = require "lexit"

program = "x = 3  # Set a variable\nprintln(x+4)\n"

for lexstr, cat in lexit.lex(program) do
    print(lexstr, lexit.catnames[cat])
end

Lexical Specification

This is a specification of the lexemes in the Fulmar programming language.

Whitespace characters are blank, tab, vertical-tab, new-line, carriage-return, form-feed. No lexeme, except for a StringLiteral, may contain a whitespace character. So a whitespace character, or any contiguous group of whitespace characters, is generally a separator between lexemes. However, pairs of lexemes are not required to be separated by whitespace.

A comment begins with a pound sign (#) occurring outside a StringLiteral lexeme or another comment, and ends at a newline character or the end of the input, whichever comes first. There are no other kinds of comments. Any character at all may occur in a comment.

Comments are treated by the lexer as whitespace: they are not part of lexemes and are not passed on to the caller.
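
The whitespace and comment rules above could be handled by a small skipping routine like the one sketched here. The function name skipWhitespaceAndComments and its (string, position) interface are hypothetical; in a state-machine lexer the same logic would more likely live inside the lexer's own loop.

```lua
-- Hypothetical helper: returns the index of the first character at or
-- after pos that is neither whitespace nor part of a comment. The result
-- is #s + 1 if only whitespace and comments remain.
local function skipWhitespaceAndComments(s, pos)
    while pos <= #s do
        local c = s:sub(pos, pos)
        if c == " " or c == "\t" or c == "\v"
          or c == "\n" or c == "\r" or c == "\f" then
            pos = pos + 1
        elseif c == "#" then
            -- Comment: skip up to, but not including, the newline.
            -- The newline itself is consumed as whitespace on the
            -- next iteration.
            while pos <= #s and s:sub(pos, pos) ~= "\n" do
                pos = pos + 1
            end
        else
            break
        end
    end
    return pos
end
```

This helper is only meant to be called between lexemes, so a pound sign inside a StringLiteral never reaches it.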

Legal characters outside comments and StringLiteral lexemes are whitespace and printable ASCII characters (values 32 [blank] to 126 [tilde]). Any other character outside comments and StringLiteral lexemes is illegal.
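
The legality rule can be expressed as a pair of single-character predicates. The names isWhitespace and isLegal are assumptions made for this sketch; each function expects a string of length one (or the empty string at the end of the input).

```lua
-- Hypothetical predicate: is c one of the six whitespace characters?
local function isWhitespace(c)
    return c == " " or c == "\t" or c == "\v"
        or c == "\n" or c == "\r" or c == "\f"
end

-- Hypothetical predicate: is c legal outside comments and StringLiteral
-- lexemes? Legal means whitespace or printable ASCII (32 to 126).
local function isLegal(c)
    if isWhitespace(c) then
        return true
    end
    local b = c:byte()  -- nil when c is the empty string
    return b ~= nil and b >= 32 and b <= 126
end
```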

Maximal munch is followed.

There are seven lexeme categories: Keyword, Identifier, NumericLiteral, StringLiteral, Operator, Punctuation, Malformed.

Below, in a regular expression, a character preceded by a backslash means the literal character, with no special meaning.

Keyword
One of the following 12:
chr   elif   else   end   func   if   print
println   readnum   return   rnd   while
Identifier
Any string matched by /[a-zA-Z_][a-zA-Z_0-9]*/ that is not a Keyword.

Here are some Identifier lexemes.

myvar   _    ___x_37cr   HelloThere   RETURN

Note. The reserved words are the same as the Keyword lexemes.
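
One possible way to separate Keyword from Identifier, once a word matching the regular expression above has been read in full, is a set lookup. The names keywordList, isKeyword, and classifyWord are hypothetical. The list is given as strings because several Fulmar keywords (else, end, if, return, while) are also reserved words in Lua and cannot be used as bare table keys.

```lua
-- The 12 Fulmar keywords, as listed in the specification.
local keywordList = {
    "chr", "elif", "else", "end", "func", "if", "print",
    "println", "readnum", "return", "rnd", "while",
}

-- Build a set for constant-time membership tests.
local isKeyword = {}
for _, kw in ipairs(keywordList) do
    isKeyword[kw] = true
end

-- Hypothetical helper: word is assumed to already match
-- /[a-zA-Z_][a-zA-Z_0-9]*/.
local function classifyWord(word)
    if isKeyword[word] then
        return "Keyword"
    end
    return "Identifier"
end
```

Matching is case-sensitive, so RETURN is an Identifier, in agreement with the examples above.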

NumericLiteral
Any string matched by /[0-9]+([eE]\+?[0-9]+)?/.

Notes. A NumericLiteral must begin with a digit and cannot contain a dot (.). A minus sign is not legal in an exponent (the “e” or “E” and what comes after it). A plus sign is legal, and optional, in an exponent. An exponent must contain at least one digit.

Here are some valid NumericLiteral lexemes.

1234    00900   123e+7   00E00   3e888

The following are not valid NumericLiteral lexemes.

-42   3e   e   123E+   1.23   123e-7
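
The exponent rule requires lookahead: an "e" or "E" belongs to the NumericLiteral only if it is followed by a digit, or by a plus sign and then a digit. The sketch below shows that check. The names isDigit and scanNumber are hypothetical, and a hand-coded state machine would spread this logic across several states rather than use one function.

```lua
-- Hypothetical predicate: is c a single decimal digit?
local function isDigit(c)
    return #c == 1 and c >= "0" and c <= "9"
end

-- Hypothetical helper: s:sub(pos, pos) is assumed to be a digit.
-- Returns the index just past the NumericLiteral that starts at pos.
local function scanNumber(s, pos)
    local i = pos
    while isDigit(s:sub(i, i)) do
        i = i + 1
    end
    local c = s:sub(i, i)
    if c == "e" or c == "E" then
        -- Look ahead: optional "+", then at least one digit.
        local j = i + 1
        if s:sub(j, j) == "+" then
            j = j + 1
        end
        if isDigit(s:sub(j, j)) then
            -- The exponent is confirmed; consume all of its digits.
            i = j
            while isDigit(s:sub(i, i)) do
                i = i + 1
            end
        end
        -- Otherwise the "e" is left for the next lexeme.
    end
    return i
end
```

With this approach, the input 123E+ yields the NumericLiteral 123, and the characters E and + are left to be lexed separately.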
StringLiteral
A single quote (') or double quote ("), followed by zero or more characters that are not newlines or the same as the opening quote mark, followed by a quote that matches the opening quote mark. There are no escape sequences. Any character, legal or illegal, other than a newline or a quote that matches the opening quote mark, may appear inside a StringLiteral. The beginning and ending quote marks are both part of the lexeme.

Here are some StringLiteral lexemes.

"Hello there!"   ''   '"'   "'--#!Ωé\"
Operator
One of the following seventeen:
!   &&   ||   ==    !=    <    <=    >    >=
+    -    *    /    %    [    ]    =
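
Maximal munch matters for operators: == must be read as one lexeme, not as two = lexemes. A minimal sketch of one way to do this is to try the two-character operators before the one-character ones. The table and function names here are hypothetical.

```lua
-- The six two-character operators.
local twoCharOps = {
    ["&&"] = true, ["||"] = true, ["=="] = true,
    ["!="] = true, ["<="] = true, [">="] = true,
}

-- The eleven one-character operators.
local oneCharOps = {
    ["!"] = true, ["<"] = true, [">"] = true, ["+"] = true,
    ["-"] = true, ["*"] = true, ["/"] = true, ["%"] = true,
    ["["] = true, ["]"] = true, ["="] = true,
}

-- Hypothetical helper: returns the Operator lexeme starting at pos,
-- or nil if no Operator starts there.
local function matchOperator(s, pos)
    local two = s:sub(pos, pos + 1)
    if twoCharOps[two] then
        return two
    end
    local one = s:sub(pos, pos)
    if oneCharOps[one] then
        return one
    end
    return nil
end
```

A lone & or | is not an Operator, so matchOperator returns nil for it and the character falls through to the Punctuation category.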
Punctuation
Any single legal character that is not whitespace, not part of a comment, and not part of any valid lexeme in one of the other categories, including Malformed.

Here are some Punctuation lexemes.

;   (   )   {   }   ,   &   $
Malformed
There are two kinds of Malformed lexemes: bad character and bad string.

A bad character is any single illegal character that is not part of a comment or of a StringLiteral lexeme that began earlier.

A bad string is essentially a partial StringLiteral where the end of the line or the end of the input is reached before the ending quote mark. It begins with a single or double quote mark that is not part of a comment or StringLiteral that began earlier, and continues to the next newline or the end of the input, without a quote mark matching the opening one appearing. Any character, legal or illegal, may appear in a bad string. If the lexeme ends at a newline, then this newline is not part of the lexeme.

Here are three Malformed lexemes that are bad strings.

"a-b-c    'wx yz    "Ωé'
In order to be counted as Malformed, each of the above must end at a newline (which would not be considered part of the lexeme) or at the end of the input.

Note. The two kinds of Malformed lexemes are presented to the caller in the same way: they are both simply Malformed.
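
The StringLiteral and bad-string rules differ only in how the scan ends, so both can be handled by one routine, sketched below. The name scanString and its return convention (lexeme text, then printable category name) are assumptions made for illustration and are not part of the required interface.

```lua
-- Hypothetical helper: s:sub(pos, pos) is assumed to be ' or ".
-- Returns the lexeme text and its printable category name.
local function scanString(s, pos)
    local quote = s:sub(pos, pos)
    local i = pos + 1
    while i <= #s do
        local c = s:sub(i, i)
        if c == quote then
            -- Matching quote found: the lexeme includes both quotes.
            return s:sub(pos, i), "StringLiteral"
        elseif c == "\n" then
            -- A newline ends a bad string and is not part of it.
            break
        end
        i = i + 1
    end
    -- Reached a newline or the end of the input without a matching quote.
    return s:sub(pos, i - 1), "Malformed"
end
```

No legality check is applied inside the loop, since any character other than a newline or the matching quote may appear in either kind of lexeme.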

Test Program

A test program is available in the Git repository: lexit_test.lua. If you run this program (unmodified!) with your code, then it will test whether your code works properly.

Do not turn in the test program.

Notes