Lexical description. The lexical units of the language are integers, special notation, identifiers, keywords, and white space. Any input string that contains only those components is lexically valid.

  • Keywords. The keywords are def, if, then, else, skip, while, do, repeat, until, break, and continue. Keywords are case sensitive and contain no upper-case letters.
  • Integers. Integers are non-empty strings of digits 0-9.
  • Identifiers. Identifiers are non-empty strings consisting of letters (lower or upper case), digits, and the underscore character. The first character in an identifier must be a lower-case letter.
  • White space. White space consists of any sequence of the characters: blank (ASCII 32), \n (newline, ASCII 10), \f (form feed, ASCII 12), \r (carriage return, ASCII 13), \t (tab, ASCII 9). White space always separates tokens: whatever non-white-space text is to the left of white space must belong to a different token than whatever is to its right. The converse does not hold: two distinct tokens are not always separated by white space. For example, the string (()) consists of 4 tokens; the string 65x consists of two tokens, T_Integer(65) followed by T_Identifier("x"); and the string 65if; should be lexed into T_Integer(65) T_If T_Semicolon.
  • Special notation. The special syntactic symbols (e.g., parentheses, assignment operator, etc.) are as follows.
    	; ( ) = == < > <= >= , { } := + * - /
          
    Like white space, special notation always separates tokens.
  • Disambiguation. The rules above are ambiguous. To disambiguate, use the following two policies.
    • Use a 'longest match' policy: if the beginning of a string can be lexed in several ways, choose the tokenisation whose initial token consumes the most characters from the beginning of the string.
    • If there is more than one longest match, give preference to keywords.

    For example, the string deff should be lexed into a single identifier, not the token def followed by the identifier f. Similarly, === must be lexed as == followed by =, not as = followed by ==, nor as three occurrences of =.
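
The lexical rules and the two disambiguation policies can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name lex, the (name, value) tuple representation, and every token name other than T_Integer, T_Identifier, T_If and T_Semicolon are hypothetical. Integers, words and symbols start with disjoint character sets, so longest match only needs care within each class: the digit and word patterns are greedy, and two-character symbols are tried before one-character ones.

```python
import re

KEYWORDS = {"def", "if", "then", "else", "skip", "while", "do",
            "repeat", "until", "break", "continue"}

# Hypothetical token names, except T_Semicolon which the text itself uses.
SYMBOLS = {
    ";": "T_Semicolon", "(": "T_LParen", ")": "T_RParen",
    "=": "T_Equal", "==": "T_EqualEqual", "<": "T_Less", ">": "T_Greater",
    "<=": "T_LessEq", ">=": "T_GreaterEq", ",": "T_Comma",
    "{": "T_LBrace", "}": "T_RBrace", ":=": "T_Assign",
    "+": "T_Plus", "*": "T_Times", "-": "T_Minus", "/": "T_Div",
}

# Python's regex alternation takes the first alternative that matches, not the
# longest one, so longer symbols are listed first to obtain longest match.
SYMBOL_RE = "|".join(re.escape(s) for s in sorted(SYMBOLS, key=len, reverse=True))

TOKEN_RE = re.compile(
    r"(?P<ws>[ \n\f\r\t]+)"           # the five white space characters
    r"|(?P<int>[0-9]+)"               # non-empty string of digits
    r"|(?P<word>[a-z][A-Za-z0-9_]*)"  # identifier or keyword
    rf"|(?P<sym>{SYMBOL_RE})"         # special notation
)


def lex(text):
    """Return a list of (token_name, value) pairs; raise ValueError on bad input."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"lexical error at offset {pos}: {text[pos]!r}")
        kind, lexeme = m.lastgroup, m.group()
        if kind == "int":
            tokens.append(("T_Integer", int(lexeme)))
        elif kind == "word":
            # A word of keyword length matches both rules; keywords win the tie.
            if lexeme in KEYWORDS:
                tokens.append(("T_" + lexeme.capitalize(), None))
            else:
                tokens.append(("T_Identifier", lexeme))
        elif kind == "sym":
            tokens.append((SYMBOLS[lexeme], None))
        # White space produces no token; it only separates its neighbours.
        pos = m.end()
    return tokens
```

With this sketch, lex("65if;") yields T_Integer(65), T_If, T_Semicolon, and lex("deff") yields a single T_Identifier, matching the examples above. An input such as Foo is rejected because no token may start with an upper-case letter.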

Syntax description. The syntax of the language is given by the following context-free grammar with initial non-terminal PROG, where ε stands for the empty production.

 PROG → DEC | DEC PROG 
 DEC → def ID (VARDEC) = BLOCK
 VARDEC →  ε | VARDECNE 
 VARDECNE → ID | VARDECNE, ID 
 ID → ... (identifiers)
 INT → ... (integers)
 BLOCK → { ENE }
 ENE → E | E; ENE
 E →  INT 
   | ID 
   | if E COMP E then BLOCK else BLOCK
   | (E BINOP E)
   | skip
   | BLOCK
   | while E COMP E do BLOCK 
   | repeat BLOCK until E COMP E 
   | ID := E
   | ID (ARGS)
   | break
   | continue
 ARGS → ε | ARGSNE
 ARGSNE → E | ARGSNE, E
 COMP → == | < | > | <= | >=
 BINOP → + | - | * | /
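
One way to read the grammar is as a recursive-descent parser. The Python sketch below is an illustration under stated assumptions, not a prescribed implementation: the class Parser, its method names, and the nested-tuple AST shape are all hypothetical, and the input is assumed to be a list of lexically valid lexemes already produced by a lexer. Two points about the grammar show up in the code. VARDECNE and ARGSNE are left-recursive, so they are parsed with a loop rather than by direct recursion. After an ID, one token of lookahead (:= or an opening parenthesis) decides between assignment, call and plain variable.

```python
KEYWORDS = {"def", "if", "then", "else", "skip", "while", "do",
            "repeat", "until", "break", "continue"}
COMPS = {"==", "<", ">", "<=", ">="}
BINOPS = {"+", "-", "*", "/"}


class Parser:
    def __init__(self, lexemes):
        self.toks, self.i = list(lexemes), 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def take(self, *allowed):
        """Consume one lexeme; if `allowed` is given it must be one of them."""
        tok = self.peek()
        if tok is None or (allowed and tok not in allowed):
            raise SyntaxError(f"expected {allowed or 'a token'}, got {tok!r}")
        self.i += 1
        return tok

    def after(self, lexeme, rule):
        self.take(lexeme)
        return rule()

    def ident(self):
        tok = self.take()
        if not tok[0].islower() or tok in KEYWORDS:
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return tok

    def paren_list(self, item):
        # "(" followed by zero or more comma-separated items and ")".
        # Replaces the left-recursive VARDECNE / ARGSNE rules with a loop.
        self.take("(")
        items = []
        if self.peek() != ")":
            items.append(item())
            while self.peek() == ",":
                self.take(",")
                items.append(item())
        self.take(")")
        return items

    def prog(self):  # PROG → DEC | DEC PROG
        decs = [self.dec()]
        while self.peek() is not None:
            decs.append(self.dec())
        return decs

    def dec(self):  # DEC → def ID (VARDEC) = BLOCK
        self.take("def")
        name = self.ident()
        params = self.paren_list(self.ident)
        return ("def", name, params, self.after("=", self.block))

    def block(self):  # BLOCK → { ENE }, ENE → E | E; ENE
        self.take("{")
        exps = [self.exp()]
        while self.peek() == ";":
            self.take(";")
            exps.append(self.exp())
        self.take("}")
        return ("block", exps)

    def cond(self):  # E COMP E
        left, op, right = self.exp(), self.take(*COMPS), self.exp()
        return (op, left, right)

    def exp(self):  # E; tuple elements below are evaluated left to right
        if self.peek() == "{":
            return self.block()
        tok = self.take()
        if tok.isdigit():
            return ("int", int(tok))
        if tok in ("skip", "break", "continue"):
            return (tok,)
        if tok == "if":
            return ("if", self.cond(), self.after("then", self.block),
                    self.after("else", self.block))
        if tok == "while":
            return ("while", self.cond(), self.after("do", self.block))
        if tok == "repeat":
            return ("repeat", self.block(), self.after("until", self.cond))
        if tok == "(":
            left, op, right = self.exp(), self.take(*BINOPS), self.exp()
            self.take(")")
            return (op, left, right)
        if tok[0].islower() and tok not in KEYWORDS:
            if self.peek() == ":=":
                return ("assign", tok, self.after(":=", self.exp))
            if self.peek() == "(":
                return ("call", tok, self.paren_list(self.exp))
            return ("var", tok)
        raise SyntaxError(f"unexpected token {tok!r}")


# Usage: lexemes are separated by blanks here only to keep the example short.
src = "def f ( x ) = { while x < 10 do { x := ( x + 1 ) } ; x }"
print(Parser(src.split()).prog())
```

Because binary operations are always fully parenthesised in the grammar, the sketch needs no operator-precedence handling. An empty block such as { } is rejected, since ENE requires at least one expression.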