COP3402: Systems Software


Homework #2 (Lexical Analyzer) 
(Solo or Team assignment – Same team members who implemented HW1)
Due June 25, 2021 by 11:59 p.m. 

If you do not follow the specifications, your grade will be zero.
Goal:
In this assignment you will implement a lexical analyzer for the programming language PL/0. Your program must be able to read in a source program written in PL/0, identify some errors, and produce, as output, the source program's lexeme table and a list of lexemes. For an example of input and output, refer to Appendix A.

Lexical Conventions for PL/0:
A numerical value is assigned to each token (internal representation) as follows: 
oddsym= 1, eqlsym = 2, neqsym = 3, lessym = 4, leqsym = 5, gtrsym = 6,  geqsym = 7, modsym = 8,  multsym = 9, slashsym = 10, plussym = 11, minussym = 12, 
lparentsym = 13, rparentsym = 14, commasym = 15, periodsym = 16, semicolonsym = 17, becomessym = 18, beginsym = 19, endsym = 20, ifsym = 21, thensym = 22, elsesym = 23, whilesym = 24, dosym = 25, callsym = 26, writesym = 27, readsym = 28, constsym = 29, varsym = 30, procsym = 31, identsym = 32, numbersym = 33.

Reserved Words: const, var, procedure, call, if, then, else, while, do, begin, end, read, write, odd.
Special Symbols ::= == <>  <  <=  >  >=  %  *  /  +  -  (  )  ,  .  ;  :=
Identifiers: identsym ::= letter (letter | digit)* 
Numbers: numbersym ::= (digit)+
Invisible Characters: tab, white spaces, newline
Comments denoted by: /* . . .   */
Refer to Appendix B for a declaration of the token symbols that may be useful.
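
For example, once a whole word has been scanned, deciding between a reserved word and an identifier can be a simple table lookup using the values above. The sketch below is only an illustration; the function and table names are placeholders, not part of the provided skeleton.

#include <string.h>

/* Hypothetical helper: map a scanned word to its token value.  Reserved words
   use the numbers listed above; any other word is an identifier (identsym = 32). */
static const struct { const char *word; int token; } reserved[] = {
    {"odd", 1},   {"begin", 19}, {"end", 20},  {"if", 21},   {"then", 22},
    {"else", 23}, {"while", 24}, {"do", 25},   {"call", 26}, {"write", 27},
    {"read", 28}, {"const", 29}, {"var", 30},  {"procedure", 31}
};

static int word_token(const char *word)
{
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(word, reserved[i].word) == 0)
            return reserved[i].token;   /* reserved word */
    return 32;                          /* identsym: ordinary identifier */
}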

Constraints:
Input:
1.    Identifiers can be a maximum of 11 characters in length.
2.    Numbers can be a maximum of 5 digits in length.
3.    Comments should be ignored and not tokenized (see the comment-skipping sketch below).
4.    Invisible Characters should be ignored and not tokenized.
Important Note: Input files may NOT be grammatically valid PL/0 code. 
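
As a rough sketch of how a comment can be skipped (the helper name and the string-cursor interface are assumptions, not part of the provided skeleton): after the opening delimiter has been consumed, advance until the closing delimiter is found; reaching the end of the input first is the never-ending comment error.

/* Hypothetical helper: called after the opening comment delimiter has been read.
   Advances *pos past the closing delimiter; returns 0 on success,
   -1 if the comment is never terminated. */
static int skip_comment(const char *input, int *pos)
{
    while (input[*pos] != '\0')
    {
        if (input[*pos] == '*' && input[*pos + 1] == '/')
        {
            *pos += 2;          /* consume the closing delimiter */
            return 0;
        }
        (*pos)++;
    }
    return -1;                  /* never-ending comment error */
}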

Output:
1.    In your output's Token List, identifiers must show the token and the variable name separated by a space.
2.    In your output's Token List, numbers must show the token and the value separated by a space.
3.    The token representation of the Token List will be used in the Parser (Project 3). So, PLAN FOR IT!

Detect the Following Lexical Errors:
1.    Variable does not start with a letter.
2.    Number too long.
3.    Name too long.
4.    Invalid symbols.
5.    Never-ending comment.
 
Submission Instructions:
Submit to Webcourse:
1.     Source code named lex.c
2.     Instructions to use the program in a readme text file.
3.    This is a team assignment (the same team members who worked together in HW1) or a solo assignment.
4.    Only one submission per team.
5.     The names of all team members must be written in all source code header files, in a comment on the submission, and in the readme.
6.    Include comments in your program.
7.    Same policy on late submissions as in HW1. If there is an extension, no late submissions will be accepted.
8.    Output should print to the screen and should follow the format in Appendix A. A deduction of 5 points will be applied to submissions that do not print to the screen.
9.    The input file should be given as a command line argument. A deduction of 5 points will be applied to submissions that do not implement this.
Hints:
●    You could create a transition diagram (DFA) to recognize each lexeme in the source program; once a lexeme is accepted, generate the token, otherwise emit an error message (see the scanning-loop sketch after this list).
●    Use the C function iscntrl() to check for control characters rather than hardcoding which ones are acceptable. Your program should function regardless of what control characters are present in the input file.
●    Use the C functions isalpha() and isdigit() to check for letters and digits respectively.
●    Reading the input line by line can cause lots of problems. Instead, read it in character by character with fgetc().
●    The only guaranteed whitespace is the whitespace that separates identifiers, numbers, and reserved words. All other whitespace is optional. Assuming no length errors:
o    If an identifier is followed by a number with no whitespace, it is an identifier.
o    If an identifier is followed by a reserved word with no whitespace, it is an identifier.
o    If an identifier is followed by an identifier with no whitespace, it is an identifier.
o    If a number is followed by an identifier with no whitespace, it is an invalid identifier error.
o    If a number is followed by a reserved word with no whitespace, it is an invalid identifier error.
o    If a number is followed by a number with no whitespace, it is a number.
o    If a reserved word is followed by an identifier with no whitespace, it is an identifier.
o    If a reserved word is followed by a number with no whitespace, it is an identifier.
o    If a reserved word is followed by a reserved word with no whitespace, it is an identifier.
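
A minimal sketch of such a scanning loop follows, assuming the source has already been loaded into a string (as the provided driver.c does) and using a cursor rather than fgetc(); the same structure works if you read the file character by character. The function name is a placeholder.

#include <ctype.h>

/* Sketch of the top-level dispatch.  Each lexeme is classified by its first
   character only, which is why no whitespace is needed between, for example,
   an identifier and a reserved word. */
void scan_sketch(const char *input)
{
    int pos = 0;
    while (input[pos] != '\0')
    {
        unsigned char c = (unsigned char)input[pos];

        if (isspace(c) || iscntrl(c))
        {
            pos++;                                    /* invisible characters: skip */
        }
        else if (isalpha(c))                          /* identifier or reserved word */
        {
            while (isalnum((unsigned char)input[pos]))
                pos++;                                /* consume the whole run, digits included */
            /* the lexeme just consumed is checked against the reserved words */
        }
        else if (isdigit(c))                          /* number; a following letter is an error */
        {
            while (isdigit((unsigned char)input[pos]))
                pos++;
        }
        else
        {
            pos++;   /* special symbol, comment opener, or invalid symbol */
        }
    }
}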

Error Handling:
●    When your program encounters an error, it should print out an error message and stop.
●    When you are reading in a token that begins with a letter, read in characters until you reach one that is not alphanumeric (at which point you check for a reserved word and move on to the next token) or you reach the twelfth consecutive alphanumeric character (at which point you print the Excessive Identifier Length error and stop).
●    When you are reading in a token that begins with a digit, read in characters until you reach one that is not alphanumeric (at which point you tokenize the number and move on), OR you reach the sixth consecutive digit (at which point you print the Excessive Number Length error and stop), OR you reach a letter (at which point you print the Invalid Identifier error and stop). A sketch of these rules appears after this list.
●    When you are reading in a token that starts with an invalid symbol, print out the error message and stop.
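
Below is a sketch of the number case under those rules, assuming a string cursor and placeholder names (the actual error reporting goes through the provided printerror).

#include <ctype.h>

/* Hypothetical number scanner (names are placeholders).  It stops at the first
   character that is not alphanumeric, errors on the sixth consecutive digit,
   and errors on any letter, exactly as described above. */
static int read_number(const char *input, int pos, int *error)
{
    int len = 0;
    while (isalnum((unsigned char)input[pos]))
    {
        if (isalpha((unsigned char)input[pos]))
        {
            *error = 1;          /* Invalid Identifier: digit(s) followed by a letter */
            return pos;
        }
        if (++len > 5)
        {
            *error = 2;          /* Excessive Number Length: more than 5 digits */
            return pos;
        }
        pos++;
    }
    /* input[pos - len .. pos) is the number lexeme (numbersym = 33) */
    return pos;
}

The identifier case is symmetric: count the alphanumeric run and report Excessive Identifier Length on the twelfth character.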

For this assignment, we are providing you with a skeleton. You must implement the function lexeme *lexanalyzer(char *input) in lex.c; you may add as many helper functions and global variables as you like. This function takes the input file contents as a string and should return the lexeme list unless there is an error, in which case it should return NULL. lex.c also includes two printing functions: printerror, which prints the error message and frees the lexeme list, and printtokens, which prints the lexeme table and token list and marks the end of the lexeme list. Make sure to call one of these functions before you return from lexanalyzer. For these functions to work, you must keep the list and the list index as global variables. We also provide driver.c, which handles reading in the input file and calling lexanalyzer, and compiler.h, which includes the lexeme struct and the token type enumeration. Additionally, a makefile, two examples, and a bash script for testing are included.
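
Putting the pieces together, the overall shape of lexanalyzer might look like the sketch below. The global variable names, the printerror/printtokens call forms, and MAX_LEXEMES are assumptions for illustration; check the provided compiler.h and lex.c skeleton for the real declarations.

#include <stdlib.h>
#include "compiler.h"   /* provides the lexeme struct and token_type enum */

lexeme *list;           /* global lexeme list, required by printtokens/printerror */
int lex_index;          /* global index of the next free slot in list */

lexeme *lexanalyzer(char *input)
{
    list = malloc(MAX_LEXEMES * sizeof(lexeme));   /* MAX_LEXEMES is an assumed constant */
    lex_index = 0;

    int pos = 0, error = 0;
    while (input[pos] != '\0' && !error)
    {
        /* skip invisible characters and comments, then classify the next lexeme
           and append it to list[lex_index++], as sketched in the hints above */
        pos++;          /* placeholder: the real scanning logic goes here */
    }

    if (error)
    {
        printerror(error);   /* argument form assumed; see the provided skeleton */
        return NULL;
    }

    printtokens();           /* call form assumed; prints the lexeme table and token list */
    return list;
}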

Rubric
-100 – Does not compile
15 – Compiles
20 – Produces some entries to list/table before segfaulting or looping infinitely
5 – Follows IO specifications (takes command line argument for input file name and prints output to console)
5 – README.txt containing author names
10 – Prints both the list and table formats
15 – Prints out message and stops execution after encountering error
5 – Is not context dependent
5 – Is not whitespace dependent
10 – Supports all five errors
10 – Supports all thirty-three symbols
  
Appendix A:

If the input is:
const==var<>procedure<begin>end<=if>=then.else;while(do)call:=read,write+124-jalapeno*/*comment*//

The output should be:
Lexeme Table:
lexeme        token type
      const    29
         ==    2
        var    30
         <>    3
  procedure    31
          <    4
      begin    19
          >    6
        end    20
         <=    5
         if    21
         >=    7
       then    22
          .    16
       else    23
          ;    17
      while    24
          (    13
         do    25
          )    14
       call    26
         :=    18
       read    28
          ,    15
      write    27
          +    11
        124    33
          -    12
   jalapeno    32
          *    9
          /    10

Token List:
29 2 30 3 31 4 19 6 20 5 21 7 22 16 23 17 24 13 25 14 26 18 28 15 27 11 33 124 12 32 jalapeno 9 10 



If the input is:
12345abcdefghijklmnop$/*

The output should be:
Lexical Analyzer Error: Invalid Identifier
 
Appendix B:

Declaration of Token Types:
typedef enum {
    oddsym = 1, eqlsym, neqsym, lessym, leqsym, gtrsym, geqsym, 
    modsym, multsym, slashsym, plussym, minussym,
    lparentsym, rparentsym, commasym, periodsym, semicolonsym, 
    becomessym, beginsym, endsym, ifsym, thensym, elsesym,
    whilesym, dosym, callsym, writesym, readsym, constsym, 
    varsym, procsym, identsym, numbersym,
} token_type;

Token Definitions:
oddsym        1    odd
eqlsym        2    ==
neqsym        3    <>
lessym        4    <
leqsym        5    <=
gtrsym        6    >
geqsym        7    >=
modsym        8    %
multsym       9    *
slashsym      10   /
plussym       11   +
minussym      12   -
lparentsym    13   (
rparentsym    14   )
commasym      15   ,
periodsym     16   .
semicolonsym  17   ;
becomessym    18   :=
beginsym      19   begin
endsym        20   end
ifsym         21   if
thensym       22   then
elsesym       23   else
whilesym      24   while
dosym         25   do
callsym       26   call
writesym      27   write
readsym       28   read
constsym      29   const
varsym        30   var
procsym       31   procedure
identsym      32   identifiers
numbersym     33   numbers
