Chapter 2 Lexical Analysis Compiler Design BMZ

Lexical Analysis
[Figure: the phases of a compiler. Source Program -> Lexical Analyser -> Syntax Analyser -> Semantic Analyser -> Intermediate Code Generator -> Code Optimizer -> Code Generator -> Target Program. The Symbol Table Manager and the Error Handler interact with the phases.]

Languages
An alphabet (Σ) is a finite set of symbols. Example: {a, b, c}
A symbol is an element of an alphabet. Example: a
A word is a finite sequence of symbols drawn from the alphabet Σ. Example: abcaa
A language (over alphabet Σ) is a set of words. Example: {abcaa, abc, b, caa}

Languages
Σ* denotes the set of all words over the alphabet Σ.
|s| denotes the length of string s. Example: bmz is a string of length 3.
ε denotes the word of length 0, the empty word.
∅ denotes the empty set. Both ∅ and {ε}, the set containing only the empty word, are languages, but they are not the same language.
Note 1: In language theory the terms sentence and word are often used as synonyms for the term string.
Note 2: A language (over alphabet Σ) is a set of strings (over alphabet Σ). For example, with Σ = {a}, one possible language is L = {ε, a, aaa}.
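The definitions above can be tried out on a small alphabet. The following is a minimal sketch, not part of the original slides: the helper name words_up_to is hypothetical, Python strings stand in for words, and the empty string "" plays the role of ε. Because Σ* is infinite, the sketch only enumerates words up to a chosen length.

```python
from itertools import product

def words_up_to(alphabet, max_len):
    """Return every word over `alphabet` of length 0..max_len, shortest first."""
    result = []
    for n in range(max_len + 1):
        # product(..., repeat=0) yields one empty tuple, i.e. the empty word
        for symbols in product(sorted(alphabet), repeat=n):
            result.append("".join(symbols))
    return result

sigma = {"a"}
finite_slice = words_up_to(sigma, 3)   # a finite part of Σ*
language = {"", "a", "aaa"}            # L = {ε, a, aaa} from Note 2
is_language_over_sigma = language <= set(finite_slice)
```

Here language is a subset of the enumerated slice of Σ*, which is what "a language over Σ" means for this finite example.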

Terms for parts of a string
TERM: DEFINITION
prefix of s: A string obtained by removing zero or more trailing symbols of string s; ban is a prefix of banana.
suffix of s: A string formed by deleting zero or more of the leading symbols of s; nana is a suffix of banana.
substring of s: A string obtained by deleting a prefix and a suffix from s; nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s. For every string s, both s and ε are prefixes, suffixes, and substrings of s.
proper prefix, suffix, or substring of s: Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s ≠ x.
subsequence of s: Any string formed by deleting zero or more not necessarily contiguous symbols from s; baaa is a subsequence of banana.

Terms for parts of a string (examples)
Let us take this string: banana
prefix: ε, b, ba, ban, . . . , banana
suffix: ε, a, na, ana, . . . , banana
substring: ε, b, a, n, ba, an, na, . . . , banana
subsequence: ε, b, a, n, ba, bn, aa, nn, . . . , banana
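The four terms can be checked mechanically. The sketch below is an illustration only; the function names are hypothetical, and the empty Python string "" stands for ε.

```python
from itertools import combinations

def prefixes(s):
    """Remove zero or more trailing symbols."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """Remove zero or more leading symbols."""
    return [s[i:] for i in range(len(s), -1, -1)]

def substrings(s):
    """Remove a prefix and a suffix."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def subsequences(s):
    """Remove zero or more symbols, not necessarily contiguous ones."""
    return {"".join(s[i] for i in keep)
            for n in range(len(s) + 1)
            for keep in combinations(range(len(s)), n)}
```

For banana, baaa appears in subsequences("banana") but not in substrings("banana"), which matches the distinction drawn in the definitions.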

Operations on Strings
Concatenation: Concatenation of words is denoted by juxtaposition.
If x and y are strings, then the concatenation of x and y is xy.
If x = dog and y = house, then xy = doghouse.
Concatenation is associative: x(yz) = (xy)z
ε is the identity for concatenation: xε = εx = x
Concatenation is not commutative: in general xy ≠ yx.
Exponentiation:
s^0 = ε
s^1 = s
s^2 = ss
In general, s^i = s^(i-1) s for i > 0.

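Operations on Languages
Union of L and M, L ∪ M:
  L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation of L and M, LM:
  LM = { st | s ∈ L and t ∈ M }
Kleene closure of L, L*:
  L* = L^0 ∪ L^1 ∪ L^2 ∪ . . . (zero or more concatenations of L)
Positive closure of L, L+:
  L+ = L^1 ∪ L^2 ∪ L^3 ∪ . . . (one or more concatenations of L)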

Example
L is the set {A, B, . . . , Z, a, b, . . . , z} and D the set {0, 1, . . . , 9}. Since a symbol can be regarded as a string of length one, the sets L and D are each finite languages.
The following are some examples of new languages created from L and D:
1. L ∪ D is the set of letters and digits.
2. LD is the set of strings consisting of a letter followed by a digit.
3. L^4 is the set of all four-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
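The language operations in this example can be sketched with Python sets. This is an illustration under stated assumptions: the helper names concat, power, and closure_up_to are hypothetical, and because L* and D+ are infinite, the closure is cut off at a chosen maximum power.

```python
import string

def concat(L, M):
    """LM = { st | s in L and t in M }."""
    return {s + t for s in L for t in M}

def power(L, n):
    """L^n, with L^0 = {ε} (the empty string here)."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def closure_up_to(L, max_power, positive=False):
    """Finite approximation of L* (or L+ when positive=True)."""
    start = 1 if positive else 0
    out = set()
    for i in range(start, max_power + 1):
        out |= power(L, i)
    return out

letters = set(string.ascii_letters)   # the set L of the example
digits = set(string.digits)           # the set D of the example

letters_or_digits = letters | digits                  # L ∪ D
letter_then_digit = concat(letters, digits)           # LD
short_identifiers = concat(letters, closure_up_to(letters_or_digits, 1))
short_numbers = closure_up_to(digits, 2, positive=True)
```

short_identifiers is the part of L(L ∪ D)* with at most two symbols, and short_numbers is the part of D+ with at most two digits.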

Operator Associativity
Grammar rules may influence operator associativity.
How to specify operator associativity for:
1. The multiplication operator (left associative) in FORTRAN:
   1 * 2 * 3 is evaluated as (1 * 2) * 3
   3 * 1 * 5 is evaluated as (3 * 1) * 5
2. Exponentiation (right associative) in FORTRAN:
   X ** Y ** Z is evaluated as X ** (Y ** Z)

Example 1
The expression 9 - 5 + 2 can be grouped in two ways:
  left-associativity: (9 - 5) + 2 = 6
  right-associativity: 9 - (5 + 2) = 2
The choice rests with the language designer, who must take into account intuitions and convenience. By convention, most arithmetic operations use left-associativity. (We shall use this assumption in this course.)

Example 2
Left-associativity, as in 9 - 5 - 2:
  list → list + digit | list - digit | digit
  digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Right-associativity, as in a = b = c:
  right → letter = right | letter
  letter → a | b | c | … | z

Specifying Operator Associativity
* For a left-associative operator, write the grammar rule so that its LHS appears at the beginning of its RHS. Such a rule is also known as (aka) left recursive.
* For a right-associative operator, write the grammar rule so that its LHS appears at the end of its RHS. Such a rule is aka right recursive.
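The effect of left and right recursion on grouping can be sketched in a few lines. This is an illustration, not a full parser: the function names are hypothetical, the input is assumed to be a well-formed list of single-character tokens, and nested tuples stand in for parse trees. A left-recursive rule is handled with a loop that keeps extending the tree on the left; a right-recursive rule is handled by recursing on the rest of the input.

```python
def parse_left(tokens):
    """list -> list op digit | digit   (left recursive: group from the left)."""
    tree = tokens[0]
    i = 1
    while i < len(tokens):
        op, operand = tokens[i], tokens[i + 1]
        tree = (tree, op, operand)      # the tree built so far becomes the left child
        i += 2
    return tree

def parse_right(tokens):
    """right -> letter = right | letter   (right recursive: group from the right)."""
    if len(tokens) == 1:
        return tokens[0]
    # the first letter is the left child; everything after '=' is the right child
    return (tokens[0], tokens[1], parse_right(tokens[2:]))

left_tree = parse_left(list("9-5-2"))     # (('9', '-', '5'), '-', '2')
right_tree = parse_right(list("a=b=c"))   # ('a', '=', ('b', '=', 'c'))
```

The nesting of the tuples shows the grouping: 9-5-2 is read as (9-5)-2, while a=b=c is read as a=(b=c).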

Operator Precedence
Operator precedence defines the order in which an expression is evaluated when several different operators are present.
Grammar rules may influence operator precedence:
  assign → ident = expr
  ident → A | B | C
  expr → ident + expr | ident * expr | ( expr ) | ident
Draw the parse tree for A = B * A + C.
Operators generated lower in the parse tree are evaluated first and therefore have higher precedence than operators generated higher up in the parse tree.

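Precedence Levels
From higher to lower precedence: ( ), then ^, then * and /, then + and -.
  <exp> ::= <exp> + <exp> | <exp> * <exp> | <const>
  <const> ::= 0..9
[Figure: parse tree for 9+5*2. The root exp derives exp + exp; the left exp derives const 9, and the right exp derives exp * exp with operands const 5 and const 2.]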

Specifying Operator Precedence
Grammar rules can be made to exhibit operator precedence by introducing additional nonterminals and new rules.
  assign → ident = expr
  ident → A | B | C
  expr → expr + term | term
  term → term * factor | factor
  factor → ( expr ) | ident
Draw the parse tree for A = B * A + C.
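One way to see the precedence built into the expr/term/factor layering is a small recursive-descent parser. The sketch below is an assumption-laden illustration: the function name parse_assign is hypothetical, tokens are single characters, the left-recursive rules are implemented with loops (a common substitution, since recursive descent cannot call a left-recursive rule directly), and the result is a simplified tree of nested tuples rather than the full parse tree.

```python
def parse_assign(text):
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        pos += 1
        return tok

    def ident():
        tok = take()
        if tok not in ("A", "B", "C"):
            raise SyntaxError(f"expected an identifier, found {tok!r}")
        return tok

    def factor():                      # factor -> ( expr ) | ident
        if peek() == "(":
            take("(")
            node = expr()
            take(")")
            return node
        return ident()

    def term():                        # term -> term * factor | factor
        node = factor()
        while peek() == "*":
            take("*")
            node = ("*", node, factor())
        return node

    def expr():                        # expr -> expr + term | term
        node = term()
        while peek() == "+":
            take("+")
            node = ("+", node, term())
        return node

    target = ident()                   # assign -> ident = expr
    take("=")
    tree = ("=", target, expr())
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing input {peek()!r}")
    return tree

tree = parse_assign("A = B * A + C")
```

In the resulting tree the * node sits below the + node, so B * A is evaluated first, which is the higher precedence the grammar is meant to express.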

Postfix Notation
The postfix notation for an expression E can be defined inductively as follows:
1. If E is a variable or constant, then the postfix notation for E is E itself.
2. If E is an expression of the form E1 op E2, where op is any binary operator, then the postfix notation for E is E1' E2' op, where E1' and E2' are the postfix notations for E1 and E2, respectively.
3. If E is an expression of the form ( E1 ), then the postfix notation for E1 is also the postfix notation for E.
The postfix notation for (9-5)+2 is 95-2+
The postfix notation for 9-(5+2) is 952+-
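The three rules translate almost directly into code. The following is a minimal sketch, assuming single-character operands, only the binary operators + and - (both left associative), and well-formed input; the function name to_postfix is hypothetical.

```python
def to_postfix(text):
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def operand():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            inner = expression()   # rule 3: ( E1 ) has the postfix form of E1
            pos += 1               # skip the closing ")"
            return inner
        return tok                 # rule 1: a variable or constant is its own postfix form

    def expression():
        nonlocal pos
        result = operand()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            result = result + operand() + op   # rule 2: E1' E2' op
        return result

    return expression()

first = to_postfix("(9-5)+2")    # "95-2+"
second = to_postfix("9-(5+2)")   # "952+-"
```

Parentheses never appear in the output: the order of operands and operators alone fixes the grouping.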

The Role of the Lexical Analyzer
The lexical analyzer is the first phase of a compiler.
The Main Task: to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
[Figure: source program -> lexical analyzer -> (token) -> parser. The parser sends "get next token" requests to the lexical analyzer, and both consult the symbol table.]

The Role of the Lexical Analyzer (continued)
The lexical analyzer is the part of the compiler that reads the source text.
The Secondary Tasks:
1. Eliminating the following from the source program:
   a. comments
   b. whitespace
      1. tab
      2. newline characters
   Example source text:
     // global variables
     a=1 + 4;
     write ( a);
     write (a, a*2);

The Role of the Lexical Analyzer (continued)
2. Correlating error messages from the compiler with the source program. It may keep track of the number of newline characters seen, so that a line number can be associated with an error message.
3. Making a copy of the source program with errors marked (in some compilers).
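The secondary tasks can be sketched as a character-level pass over the source text. This is an illustration under stated assumptions: the function name scan is hypothetical, comments are assumed to start with // and run to the end of the line, and the set of legal characters is chosen only to fit the small sample.

```python
def scan(source):
    """Drop // comments and whitespace; return (kept, errors).

    kept   : list of (line_number, character) pairs that survive
    errors : messages tagged with the line on which they occurred
    """
    kept, errors = [], []
    line = 1
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            line += 1                      # count newlines to track line numbers
            i += 1
        elif ch in " \t\r":
            i += 1                         # eliminate blanks and tabs
        elif source.startswith("//", i):
            while i < len(source) and source[i] != "\n":
                i += 1                     # eliminate the comment, keep the newline
        elif ch.isalnum() or ch in "=+*(),;":
            kept.append((line, ch))
            i += 1
        else:
            errors.append(f"line {line}: unexpected character {ch!r}")
            i += 1
    return kept, errors

sample = "// global variables\na=1 + 4;\nwrite ( a);\nwrite (a, a*2) @;\n"
kept, errors = scan(sample)
```

The stray @ on the fourth line is reported with its line number, which is the correlation between error messages and source lines described above.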

The Role of the Lexical Analyzer (continued)
Note: in some compilers the lexical analyzer is divided into a cascade of two phases:
  Scanning: the scanner is responsible for doing simple tasks.
  Lexical analysis: the lexical analyzer is responsible for doing more complex operations.
For example, a FORTRAN compiler uses a scanner to eliminate blanks from the input.
[Figure: two examples of blank elimination. (1) "Do 5 I = 1, 25" is read as the reserved word (R.W.) Do, num 5, id I, . . . , whereas "Do 5 I = 1. 25" begins with a single id formed from "Do 5 I". (2) The input "13 2" typed at the prompt "Enter a Number ==>" is reported as "The number is 132".]

Advantages of Separating the Analysis Phase
The advantages of separating the analysis phase of compiling into lexical analysis and parsing:
1. Simpler Design: Separating lexical analysis from syntax analysis often simplifies one or the other of these phases (for example, the handling of comments and white space).
2. Improved Efficiency: A large amount of time in a compiler is spent reading the source and partitioning it into tokens. Specialized buffering techniques for reading input characters and processing tokens can significantly speed up the performance of a compiler.
3. Enhanced Portability: Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer. The representation of non-standard symbols can be isolated in the lexical analyzer.

Symbol Table
* A symbol table is a data structure used to store information about various source language constructs.
  - During lexical analysis, the character string or lexeme forming an identifier is saved in a symbol table entry.
* Later phases of the compiler might add to this entry information such as the type of the identifier, its usage (variable or label) and its position in storage (address).
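A symbol table along these lines might be sketched as follows. The class and method names are hypothetical and the attribute names are chosen only for illustration; real compilers use more elaborate structures (scopes, hashing schemes, and so on).

```python
class SymbolTable:
    """Entries are kept in a list; a dictionary maps each lexeme to its position."""

    def __init__(self):
        self._index = {}      # lexeme -> position in self.entries
        self.entries = []     # one dictionary of attributes per identifier

    def insert(self, lexeme):
        """Save a lexeme (done during lexical analysis); return its position."""
        if lexeme not in self._index:
            self._index[lexeme] = len(self.entries)
            self.entries.append({"lexeme": lexeme})
        return self._index[lexeme]

    def lookup(self, lexeme):
        """Return the position of a lexeme, or None if it was never inserted."""
        return self._index.get(lexeme)

    def set_attribute(self, position, name, value):
        """Let a later phase add information such as type, usage, or address."""
        self.entries[position][name] = value

table = SymbolTable()
position = table.insert("count")             # lexical analysis saves the lexeme
table.set_attribute(position, "type", "int") # a later phase adds the type
```

Inserting the same lexeme twice returns the same position, so every occurrence of an identifier refers to one shared entry.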

Tokens, Patterns, and Lexemes
Lexeme: a string matched by the pattern of a token.
Token: a set of strings that are treated as one category of input, such as identifiers or numbers.
Pattern: a rule associated with a token that describes the set of strings.

Attributes of Tokens
• Attributes are used to distinguish the different lexemes of a token.
Example: E = M * C ** 2
  <id, pointer to symbol-table entry for E>
  <assign_op, >
  <id, pointer to symbol-table entry for M>
  <mult_op, >
  <id, pointer to symbol-table entry for C>
  <exp_op, >
  <num, integer value 2>
Tokens affect syntax analysis; attributes affect semantic analysis.
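The token stream for E = M * C ** 2 can be reproduced with a small tokenizer. This is a sketch built on Python's re module; the patterns, the function name tokenize, and the use of a plain list index as the "pointer to symbol-table entry" are all assumptions made for illustration.

```python
import re

# Order matters: "**" must be tried before "*".
TOKEN_SPEC = [
    ("exp_op", r"\*\*"),
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("num", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    symbols = []   # stands in for the symbol table
    tokens = []    # (token, attribute) pairs
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue
        if kind == "id":
            if lexeme not in symbols:
                symbols.append(lexeme)
            tokens.append(("id", symbols.index(lexeme)))   # attribute: table position
        elif kind == "num":
            tokens.append(("num", int(lexeme)))            # attribute: integer value
        else:
            tokens.append((kind, None))                    # operators need no attribute
    return tokens, symbols

tokens, symbols = tokenize("E = M * C ** 2")
```

All three identifiers share the token id; only the attribute tells E, M, and C apart.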

Describing Tokens
* We use regular expressions to describe programming language tokens.
* A regular expression (RE) is defined inductively:
  a      an ordinary character stands for itself
  ε      the empty string
  R|S    either R or S (alternation), where R and S are REs
  RS     R followed by S (concatenation)
  R*     concatenation of R zero or more times

Language
• A regular expression R describes a set of strings of characters, denoted L(R).
• L(R) = the language defined by R
  L(abc) = { abc }
  L(hello|goodbye) = { hello, goodbye }
  L(1(0|1)*) = all binary numbers that start with a 1
• Each token can be defined using a regular expression.
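Membership in L(R) can be tested with Python's re module. This is a sketch, not a definition: Python's regular-expression syntax is richer than the five constructs listed earlier, but the three patterns used here have the same meaning in both. The helper name in_language is hypothetical.

```python
import re

def in_language(pattern, word):
    """True if the whole word is matched by the pattern, i.e. word is in L(pattern)."""
    return re.fullmatch(pattern, word) is not None

checks = {
    "abc in L(abc)": in_language("abc", "abc"),
    "goodbye in L(hello|goodbye)": in_language("hello|goodbye", "goodbye"),
    "1011 in L(1(0|1)*)": in_language("1(0|1)*", "1011"),
    "011 in L(1(0|1)*)": in_language("1(0|1)*", "011"),
}
```

re.fullmatch is used rather than re.match so that the entire word, not just a prefix of it, must fit the pattern.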

Lexical Errors
Few errors are detectable at the lexical level, because the lexical analyzer has a very localized view of a source program. For example, in
  fi(a==x) …
the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
Error-Recovery Actions:
1. Panic Mode Recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token.
2. Deleting an extraneous character
3. Inserting a missing character
4. Replacing an incorrect character by a correct character
5. Transposing two adjacent characters
(o 0 O)
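Panic mode recovery can be sketched as a loop that discards characters until a token pattern matches again. The token patterns and the function name below are assumptions chosen for illustration, not the rules of any particular language.

```python
import re

TOKEN = re.compile(r"(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>==|[=()+*;])")

def tokenize_with_panic_mode(text):
    """Return (tokens, errors); errors holds (position, deleted_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(text, pos)
        if match is None:
            # Panic mode: delete successive characters until a token can start here.
            start = pos
            while (pos < len(text) and not text[pos].isspace()
                   and TOKEN.match(text, pos) is None):
                pos += 1
            errors.append((start, text[start:pos]))
            continue
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens, errors

clean_tokens, clean_errors = tokenize_with_panic_mode("fi(a==x)")
tokens, errors = tokenize_with_panic_mode("a = 1 $# 2")
```

Note that fi(a==x) produces no lexical error at all: fi is a perfectly good identifier at this level, which illustrates how few errors the lexical analyzer can detect on its own.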

Input Buffering
There are three general approaches to implementing a lexical analyzer:
1. Use a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification. In this case, the generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

End