Lexical Analysis

When compiling a program we need to recognize the words and punctuation that make up the vocabulary of the language. This part of the compiler is therefore known as “lexical” analysis.

Usually a compiler is given one or more input programs, and the first thing it must do is read each program and work out which lexical elements it contains.

These lexical elements are typically known as tokens. For example, in the following snippet of code:

print('hi')

We have a number of lexical elements / tokens:

  • print

  • (

  • 'hi'

  • )
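
One common way to represent tokens, sketched here in Java (the Token and TokenKind names are illustrative rather than taken from any particular implementation), is to pair the kind of lexical element with the text it matched:

enum TokenKind { IDENTIFIER, LPAREN, RPAREN, STRING }

// A token records which kind of lexical element was seen and its original text.
record Token(TokenKind kind, String text) { }

// The snippet print('hi') would then yield:
//   Token(IDENTIFIER, "print"), Token(LPAREN, "("),
//   Token(STRING, "'hi'"), Token(RPAREN, ")")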

There are many different ways to implement a “lexer”, the name given to this component of the compiler.

  • We can write this code by hand. This involves scanning the input program character by character and deciding which tokens appear in it; a minimal sketch appears after this list.

  • Or we can specify the lexical elements in a grammar and have a tool generate the code that processes the input program and gives us its tokens.
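
As a rough sketch of the hand-written approach (simplified for the tiny vocabulary above, and not modelled on the EZ or Lua lexers), the lexer keeps a position into the source text, skips whitespace, and decides which token comes next by examining the next character. It reuses the Token and TokenKind types from the earlier sketch:

class SimpleLexer {
    private final String src;
    private int pos = 0;

    SimpleLexer(String src) { this.src = src; }

    // Returns the next token, or null once the end of the input is reached.
    Token nextToken() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        if (pos >= src.length()) return null;
        char c = src.charAt(pos);
        if (c == '(') { pos++; return new Token(TokenKind.LPAREN, "("); }
        if (c == ')') { pos++; return new Token(TokenKind.RPAREN, ")"); }
        if (c == '\'') {
            // String literal: scan up to the closing quote (no error handling here).
            int start = pos++;
            while (pos < src.length() && src.charAt(pos) != '\'') pos++;
            if (pos < src.length()) pos++;   // step past the closing quote
            return new Token(TokenKind.STRING, src.substring(start, pos));
        }
        if (Character.isLetter(c)) {
            // Identifier such as print: consume letters and digits.
            int start = pos;
            while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
            return new Token(TokenKind.IDENTIFIER, src.substring(start, pos));
        }
        throw new IllegalStateException("unexpected character: " + c);
    }
}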

A lexical analyser can be designed to produce tokens on demand, or it can translate the entire input source into a sequence of tokens up front.
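
The nextToken method above is an example of the on-demand style; the whole-input style can be layered on top of it by collecting tokens until the input is exhausted. A sketch, again assuming the hypothetical SimpleLexer and Token types from the previous examples:

import java.util.ArrayList;
import java.util.List;

class TokenStream {
    // Translate the entire source into a list of tokens up front.
    static List<Token> tokenizeAll(String src) {
        SimpleLexer lexer = new SimpleLexer(src);
        List<Token> tokens = new ArrayList<>();
        for (Token t = lexer.nextToken(); t != null; t = lexer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }
}

The on-demand style avoids holding every token in memory at once, while tokenizing up front makes arbitrary lookahead and backtracking straightforward for later stages of the compiler.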

Considerations

  • Should comments in the input program be retained as tokens? Usually a lexer discards comments, but in languages that allow comments to serve as documentation the lexer must keep them.

  • Should end-of-line markers be retained? Typically lexers drop all intervening whitespace, including end-of-line markers, but if the language syntax depends on line breaks then these may need to be retained.

  • Should tokens copy the input text, convert it to another form, or retain pointers into the input itself? Retaining the original form of a token may be important in some cases, for example if the lexer is used in a code formatter.

  • How much can we peek ahead? Depending on the complexity of the language grammar, later stages of the compiler may need to look ahead one or more tokens without consuming them; see the sketch after this list.

  • Ancillary information about tokens, such as the line and column number in the input source, is invaluable for error reporting.
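
The last two considerations can be handled with small additions to the earlier sketches (again illustrative rather than taken from a real lexer): each token can carry the line and column at which it started, and a small buffer in front of the on-demand lexer lets later stages peek at upcoming tokens without consuming them:

import java.util.ArrayList;
import java.util.List;

// Wraps an on-demand lexer with a buffer so callers can look ahead
// without consuming tokens. In a fuller implementation the Token type
// would also carry the line and column where each token started.
class Lookahead {
    private final SimpleLexer lexer;
    private final List<Token> buffer = new ArrayList<>();

    Lookahead(SimpleLexer lexer) { this.lexer = lexer; }

    // Look at the k-th upcoming token (k = 0 is the next one) without consuming it.
    Token peek(int k) {
        while (buffer.size() <= k) {
            Token t = lexer.nextToken();
            if (t == null) return null;   // end of input reached
            buffer.add(t);
        }
        return buffer.get(k);
    }

    // Consume and return the next token, or null at end of input.
    Token next() {
        return buffer.isEmpty() ? lexer.nextToken() : buffer.remove(0);
    }
}

A parser would typically call peek to decide which grammar rule applies, then call next to consume the tokens for the chosen rule.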

Example Hand-Coded Implementation

The Lexer module in the EZ language implementation contains an example of a hand-coded lexical analyser written in Java. This implementation returns tokens on demand.

Another example is the Lua lexer.