Lexical Analysis Role Specification Recognition Tool LEX

Number of slides: 142

Lexical Analysis
Role, Specification & Recognition
Tool: LEX
Construction:
- RE to NFA to DFA to min-state DFA
- RE to DFA

Conducting Lexical Analysis
- Techniques for specifying and implementing lexical analyzers
  - Hand-written
    - state transition diagram that reveals the structure of the tokens
    - hand-translated driver program
  - Tools: Pattern => Triggered actions
    - pattern-action language: LEX
- Other applications:
  - query language, information retrieval
  - AWK
  - shell commands
  - PCB inspection

Lexical Analyzer and Parser
[Figure: source program → lexical analyzer → (token & attributes) → parser; the parser sends "get next token" back to the lexical analyzer; both use the symbol table]
- token: smallest logically cohesive sequence of characters of interest in source program (Aho, Sethi, Ullman, p. 84)

Lexical Analysis
- Convert input lexemes to stream of tokens
  - Lexeme: a sequence of characters that comprises a single token
- Typical functions:
  - Removal of white space and comments
    - instead of writing productions that include spaces and comments
    - keeping line count for associating error message with line number
  - Digits into Token ID + value/attributes
    - instead of writing productions for integer constants
    - 31+28+59 => <num, 31> <+, > <num, 28> <+, > <num, 59>
  - Recognizing identifiers and keywords
    - Identifiers: count = count + increment => id = id + id
    - Keywords: begin, end, if, else
    - Operators/punctuation: '>', '<=', '<>'

Why Separate Lexical Analysis from Parsing
- Simpler design
  - no production rules and translations for white spaces and comments
- Improved efficiency
  - lexical analyzer can be optimized separately (e.g., using specialized buffering techniques)
- Enhanced compiler portability
  - input alphabet peculiarities and device-specific anomalies can be restricted to the lexical analyzer

Tokens, Patterns, Lexemes
- Token: terminal symbol or lexical unit of a parser, representing a set of strings of a particular type
  - e.g., "pi", "count", … => id
  - e.g., "3.1416", "6.02e23", … => num
  - Typical: keywords, operators, identifiers, constants, literal strings, punctuation symbols
  - Representation: an integer (e.g., #define ID 258)
    - with associated attributes
- Pattern: a specification of the set of strings
  - a rule describing the set of lexemes that can represent a particular token
  - e.g., id => "letter followed by letters and digits"
- Lexeme: a sequence of characters in the source program that is matched by the pattern for a token
- Examples: Fig. 3.2

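Attributes for Tokens
- Attributes: additional information about the particular lexeme, needed when more than one lexeme matches a pattern
  - used for parsing decisions and translation
- Implementation: a pointer to the symbol table entry in which the token information is kept
- Example: E = M * C ** 2
  - E (or M, C): <id, pointer to symbol-table entry for E (or M, C)>
  - =: <assign_op, >
  - *: <mult_op, >
  - **: <exp_op, >
  - 2: <num, integer value 2>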

Lexical Errors
- Matched but ambiguous:
  - left to the other phases (e.g., parser)
  - e.g., fi ( a == f(x) ) …: is "fi" an identifier, or a misspelling of "if"?
- Unmatched:
  - Panic mode recovery: delete successive characters from the remaining input until a well-formed token is found
  - Repair input (single error):
    - deleting an extraneous character
    - inserting a missing character
    - replacing an incorrect character with a correct character
    - transposing two adjacent characters
  - Minimum-distance error correction (multiple errors)
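The four single-error repairs above are exactly the unit-cost operations of an edit distance with adjacent transpositions. As a rough sketch of what "minimum distance" could mean here, the C function below computes that distance between a misspelled lexeme and a candidate keyword. The name `edit_distance`, the `MAX_LEN` bound, and the choice of the "optimal string alignment" variant are assumptions made for illustration; the slides do not prescribe a particular algorithm.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_LEN = 31 };  /* assumed upper bound on lexeme length */

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Minimum number of single-character deletions, insertions, replacements,
   and adjacent transpositions needed to turn s into t.
   Returns -1 if either string is longer than MAX_LEN. */
int edit_distance(const char *s, const char *t) {
    size_t n = strlen(s), m = strlen(t);
    if (n > MAX_LEN || m > MAX_LEN) return -1;

    int d[MAX_LEN + 1][MAX_LEN + 1];
    for (size_t i = 0; i <= n; i++) d[i][0] = (int)i;  /* delete all of s */
    for (size_t j = 0; j <= m; j++) d[0][j] = (int)j;  /* insert all of t */

    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int cost = (s[i - 1] != t[j - 1]);
            d[i][j] = min3(d[i - 1][j] + 1,          /* deletion    */
                           d[i][j - 1] + 1,          /* insertion   */
                           d[i - 1][j - 1] + cost);  /* replacement */
            /* transposition of two adjacent characters */
            if (i > 1 && j > 1 && s[i - 1] == t[j - 2] && s[i - 2] == t[j - 1]
                && d[i - 2][j - 2] + 1 < d[i][j])
                d[i][j] = d[i - 2][j - 2] + 1;
        }
    }
    return d[n][m];
}
```

With this measure, "fi" is at distance 1 from "if" (one transposition), which is why a repair pass could plausibly propose the keyword.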

Specification of Tokens
A Formal Specification for Tokens or Patterns
- Strings and Languages
- Regular Expressions & Definitions
- Recognition of Tokens

Strings and Language
- alphabet (or character class, character set): any finite set of symbols
- string over some alphabet: a finite sequence of symbols drawn from that alphabet
  - length of string s, |s|: number of symbols in s
  - empty string ε: a special string of length zero
  - (proper) prefix: e.g., abc is a proper prefix of abcdef
  - (proper) suffix: e.g., def is a proper suffix of abcdef
  - (proper) substring: e.g., cde is a proper substring of abcdef
  - subsequence: e.g., ace is a subsequence of abcdef

Strings and Language
- language: any set of strings over some alphabet
- empty set Φ: the language containing no strings at all
- {ε}: the language containing only the empty string (note that {ε} is not the same as Φ)

Operations on Strings
- Concatenation: xy
  - sε = εs = s
  - x = "Dog", y = "House" => xy = "DogHouse"
- Exponentiation: s^i = s^(i-1) s (s^0 = ε)

Operations on Languages
- Union
  - L U M = {s | s is in L or s is in M}
- Concatenation
  - LM = {st | s is in L and t is in M}
- Kleene closure: zero or more concatenations of strings drawn from the language itself
  - L*: union of L^i (i = 0 … infinity)
  - L^0 = {ε}, L^i = L^(i-1) L
- Positive closure: one or more concatenations
  - L+: union of L^i (i = 1 … infinity)

Examples
- L = {A, B, …, Z, a, b, …, z}, D = {0, 1, …, 9}
- Union
  - L U D = {letters and digits of length 1}
- Concatenation
  - LD = {a letter followed by a digit} (= {A0, A1, …, B0, …})
  - L^4 = {4-letter strings} (= {AAAA, AABC, BBBB, …})
- Kleene closure: zero or more concatenations
  - L*: {all strings of letters of length zero (i.e., ε) or more}
  - L(L U D)* = {all strings of letters and digits, starting with a letter}
- Positive closure: one or more concatenations
  - D+: {strings of one or more digits}

Regular Expression (R.E.)
A Formal Specification for Tokens

Regular Expression: Syntax for Specifying String Patterns
- Regular expression r over alphabet Σ
  - Defines the language L(r) corresponding to r
  - Regular set: a language denoted by a regular expression
- Basic symbols
  - empty string: ε
  - any symbol a in the input symbol set Σ
- Basic operators
  - disjunction (OR, union): r | s
  - concatenation (AND): r s (or simply rs)
  - closure (repetition): r*
  - identity (parenthesized): (r)

Regular Expression: Syntax for Specifying String Patterns
- Extended operators:
  - ?: optional operator
  - +: positive closure operator
  - .: any character but newline
  - [a-z]: character class
  - [^a-z]: complement (any character NOT in [a-z])
  - {m, n}: number of occurrences
  - ^: start of line
  - $: end of line
  - registers: the n-th part of the match: \1, \2
    - sed ‘s/. *]*). */1/g’
  - escape, meta-symbols: \c (character 'c' literally)
    - [a\-z]: 'a', '-' or 'z' (NOT: 'a', 'b', …, 'z')
  - r/s: r which is followed by s ('/': lookahead operator)

Notational Shorthands
- One or more instances
  - (r)+ denoting (L(r))+
  - r* = r+ | ε
  - r+ = r r*
- Zero or one instance
  - r? = r | ε
- Character classes
  - [abc] = a | b | c
  - [a-z] = a | b | ... | z

Regular Expression
- Examples: Σ = {a, b}
  - r = a|b => {a, b}
  - r = (a|b)(a|b) => {aa, ab, ba, bb}
    - = aa|ab|ba|bb (another equivalent regular expression)
  - r = a* => {ε, a, aa, aaa, aaaa, …}
  - r = (a|b)* => {all strings of a's and b's}
    - = (a*b*)*

Equivalence
- A language may be represented by two or more equivalent regular expressions.
- Equivalence:
  - L(r) = L(s) <=> r = s
- Algebraic properties of regular expressions
  - Commutative: r|s = s|r
  - Associative:
    - r|(s|t) = (r|s)|t
    - (rs)t = r(st)
  - Distribution: r(s|t) = rs|rt & (s|t)r = sr|tr
  - Identity element (ε): εr = r and rε = r
- Application of properties: proof of equivalence
  - r* = (r|ε)*
  - r** = r*

Regular Definition: A CFG-like Notation of Regular Expression
- Regular definition, similar to CFG
  - Define regular expressions in terms of named regular expressions
  - d1 → r1
  - d2 → r2
  - …
  - dn → rn

Regular Definition
- Example of regular definition:
  - letter → A | B | C | … | z
  - digit → 0 | 1 | … | 9
  - id → letter (letter | digit)*
- Another example: unsigned numbers (ex. 3.5)
  - // Unsigned numbers (512, 3.14, 6.33E4, 1.89E-5)
  - digit → 0 | 1 | … | 9
  - digits → digit digit*
  - optional_fraction → . digits | ε
  - optional_exponent → ( E (+ | - | ε) digits ) | ε
  - num → digits optional_fraction optional_exponent
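The unsigned-number definition can be translated almost rule by rule into a small hand-written recognizer. The sketch below is one possible reading, with one helper per named definition; the function names are hypothetical, and it checks a whole string rather than the longest prefix a real scanner would take.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* digits -> digit digit*
   Consumes one or more digits starting at p; returns the position after
   them, or NULL if p does not start with a digit. */
static const char *digits(const char *p) {
    if (!isdigit((unsigned char)*p)) return NULL;
    while (isdigit((unsigned char)*p)) p++;
    return p;
}

/* num -> digits optional_fraction optional_exponent
   Returns true iff the whole string s is an unsigned number. */
bool is_unsigned_number(const char *s) {
    const char *p = digits(s);
    if (p == NULL) return false;

    if (*p == '.') {                      /* optional_fraction -> . digits */
        p = digits(p + 1);
        if (p == NULL) return false;
    }
    if (*p == 'E') {                      /* optional_exponent -> E (+|-|e) digits */
        p++;
        if (*p == '+' || *p == '-') p++;
        p = digits(p);
        if (p == NULL) return false;
    }
    return *p == '\0';
}
```

Each `if` corresponds to an "| ε" alternative in the definition: the branch is simply skipped when the optional part is absent.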

Nonregular Set
- Some languages cannot be described by any regular expression
- Examples:
  - Balanced and nested constructs
    - BUT, can be specified by CFG
  - Repeating strings
    - {wcw | w is a string of a's and b's} = {aca, bcb, abcab, …}
    - Cannot be expressed in CFG either
  - Context dependent strings
    - nHa1a2…an

Regular Expression: Syntax for Specifying String Patterns
- Chomsky hierarchy:
  - regular set (R.E.)
  - context-free
  - context-sensitive
  - recursively enumerable (Turing machine)

Regular Expression: Syntax for Specifying String Patterns
- Applications:
  - matching wildcard characters (shell commands, filename expansion)
  - string pattern matching (grep, awk)
  - search engine (keyword matching, fuzzy match)
  - string pattern editing/processing (sed, vi, tr)

Recognition of Tokens

Example Task
- Grammar:
  - stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
  - expr → term relop term
         | term
  - term → id
         | num

Example Task
- Goal: construct a lexical analyzer that isolates the lexeme for the next token and produces the token and its associated attribute values
- Methods:
  - By hand: constructing FAs & a simulator for the FAs
    - Simulator (scanner) depends on the FAs
  - By tools: writing regular definitions for scanner generators to build FAs for a scanner
    - Scanner: a driver program that is independent of the forms of the FAs
- FA / FSA: Finite (State) Automata

FA and Transition Diagrams
r = (abc)+
[Figure: transition diagram for r = (abc)+]

FA/FSA and Transition Tables
NextState = Move(CurrentState, Input)

Recognition
    state = 0;
    while ((c = next_char()) != EOF) {
        switch (state) {
        case 0:                       /* expect 'a' */
            if (c == 'a') state = 1;
            else return (FALSE);
            break;
        case 1:                       /* expect 'b' */
            if (c == 'b') state = 2;
            else return (FALSE);
            break;
        case 2:                       /* expect 'c' */
            if (c == 'c') state = 3;
            else return (FALSE);
            break;
        case 3:                       /* "abc" seen; 'a' starts another repetition */
            if (c == 'a') state = 1;
            else { ungetchar(); return (TRUE); }   /* push back the lookahead */
            break;
        default:
            error();
        }
    }
    if (state == 3) return (TRUE);
    else return (FALSE);
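The same recognizer can also be driven by a transition table, following NextState = Move(CurrentState, Input). The sketch below is a minimal table-driven version for r = (abc)+ that checks a whole string instead of reading from an input stream; the names `move_table`, `input_class`, and `matches_abc_plus` are made up for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

enum { NUM_STATES = 4, NUM_INPUTS = 3, DEAD = -1 };

/* Rows are states 0..3, columns are the inputs a, b, c.
   DEAD means there is no transition. */
static const int move_table[NUM_STATES][NUM_INPUTS] = {
    /* state 0 */ { 1,    DEAD, DEAD },
    /* state 1 */ { DEAD, 2,    DEAD },
    /* state 2 */ { DEAD, DEAD, 3    },
    /* state 3 */ { 1,    DEAD, DEAD },
};

/* Maps a character to its column, or -1 if it is not a, b, or c. */
static int input_class(char c) {
    return (c >= 'a' && c <= 'c') ? c - 'a' : -1;
}

/* Returns true iff the whole string s matches (abc)+. */
bool matches_abc_plus(const char *s) {
    int state = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        int col = input_class(s[i]);
        if (col < 0) return false;
        state = move_table[state][col];
        if (state == DEAD) return false;
    }
    return state == 3;   /* state 3 is the only accepting state */
}
```

The driver loop no longer mentions the pattern at all; changing the recognized language only means changing the table, which is the property scanner generators rely on.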

Finite Automata for the Lexical Tokens
[Figure: one transition diagram per token type: IF, ID, NUM, REAL, white space (and comment starting with "--"), and error] (Appel, p. 21)

Regular expressions for tokens
    if                                       {return IF;}
    [a-z][a-z0-9]*                           {return ID;}
    [0-9]+                                   {return NUM;}
    ([0-9]+"."[0-9]*)|("."[0-9]+)            {return REAL;}
    ("--"[a-z]*"\n")|(" "|"\n"|"\t")+        {/* do nothing */}
    .                                        {error();}
(Appel, p. 20)

Recognition of the Lexical Tokens Given the FAs (Naïve Pattern Matching)
- Traverse the transition diagrams in sequence until one of them matches
  - Give different unique state numbers to the initial states (and other states) of the individual diagrams before writing a program to simulate the traversal process
  - Match the longest expression first if two state transition diagrams have a super-/sub-string relationship
    - e.g., match REAL before INTEGER
  - On failure, next_state = init_state of next FA
- Example program: [Aho 86]
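One way to picture the naive scheme is to run a separate matcher per token type and keep the longest match, letting the earlier pattern win ties (so "if" is IF rather than ID). The sketch below does this for four of the patterns listed in "Regular expressions for tokens"; the matcher functions, the `Token` enum, and `next_token` are illustrative names, not code from the slides or from [Aho 86].

```c
#include <stddef.h>
#include <string.h>

typedef enum { T_ERROR, T_IF, T_ID, T_NUM, T_REAL } Token;

static int is_lower(char c) { return c >= 'a' && c <= 'z'; }
static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Each matcher returns the length of the longest prefix of s that its
   pattern accepts, or 0 if the pattern does not match at all. */
static size_t match_if(const char *s) {           /* if */
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}
static size_t match_id(const char *s) {           /* [a-z][a-z0-9]* */
    size_t n = 0;
    if (!is_lower(s[0])) return 0;
    while (is_lower(s[n]) || is_digit(s[n])) n++;
    return n;
}
static size_t match_num(const char *s) {          /* [0-9]+ */
    size_t n = 0;
    while (is_digit(s[n])) n++;
    return n;
}
static size_t match_real(const char *s) {         /* [0-9]+"."[0-9]* | "."[0-9]+ */
    size_t i = match_num(s);                      /* digits before the point */
    if (s[i] != '.') return 0;
    size_t f = match_num(s + i + 1);              /* digits after the point */
    if (i == 0 && f == 0) return 0;               /* a lone "." is not a REAL */
    return i + 1 + f;
}

/* Tries every pattern at s; stores the longest match length in *len. */
Token next_token(const char *s, size_t *len) {
    static size_t (*const matchers[])(const char *) =
        { match_if, match_id, match_num, match_real };
    static const Token kinds[] = { T_IF, T_ID, T_NUM, T_REAL };

    Token best = T_ERROR;
    *len = 0;
    for (size_t k = 0; k < 4; k++) {
        size_t n = matchers[k](s);
        if (n > *len) {          /* strictly longer: earlier rule wins ties */
            *len = n;
            best = kinds[k];
        }
    }
    return best;
}
```

For the input "if8" the ID matcher wins with length 3, while for "if" the IF and ID matchers tie at length 2 and the earlier rule is kept.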

Finite State Automata

How to Construct a FA Systematically?
- You can construct a single complicated state transition diagram directly to recognize all token types,
  - e.g., (next page), or
- You can do it systematically by constructing simpler transition diagrams and composing them into larger networks
  - Preferred for automatic construction
  - Easy to verify its correctness

A DFA for Recognizing Common Token Types
[Figure: combined DFA for IF, ID, NUM and error; DFA states are labeled with sets of NFA states such as {1, 4, 9, 14}, {2, 5, 6, 8, 15}, {3, 6, 7, 8}, {5, 6, 7, 8, 15}, {6, 7, 8}, {10, 11, 12, 13, 15}, {11, 12, 13}, {15}; the state reached on "if" is IF (or ID): the 1st pattern, i.e., the reserved word in the LEX spec., wins; longest match] (Appel, p. 29)
- Can you extend this FA to include other patterns, like SPACES & floating point numbers & comments?

Finite (State) Automata
- A set of states: S
- A set of input symbols: Σ (the input symbol alphabet)
- A transition (move) function: δ(s, a) = s'
- Initial (start) state: s0
- A set of final (accepting) states: F

Finite (State) Automata
- Graphical representation:
  - State transition diagram
- Implementation:
  - State transition table
- Deterministic (DFA)
  - Single transition for all states on all input symbols
- Non-deterministic (NFA)
  - More than one transition for at least one state with some input symbol

NFA: Nondeterministic Finite Automata
- An NFA consists of
  - S: a finite set of states
  - Σ: a finite set of input symbols
  - δ: a transition function that maps (state, symbol) pairs to sets of states
  - s0: a state distinguished as start state
  - F: a set of states distinguished as final states

NFA: An Example
- RE: (a | b)*abb
- States: {0, 1, 2, 3}
- Input symbols: {a, b}
- Transition function:
  - δ(0, a) = {0, 1}, δ(0, b) = {0}
  - δ(1, b) = {2}, δ(2, b) = {3}
- Start state: 0
- Final states: {3}

Transition Diagram (NFA)
(a | b)*abb
[Figure: start → 0, with loops on a and b at state 0; 0 -a→ 1 -b→ 2 -b→ 3]
- States: {0/Start/init., 1, 2, 3/Final}
- Input symbols: {a, b}
- NFA transition function:
  - δ(0, a) = {0, 1}, δ(0, b) = {0}
  - δ(1, b) = {2}, δ(2, b) = {3}

Acceptance of NFA
- An NFA accepts an input string s iff there is some path in the transition diagram from the start state to some final state such that the edge labels along this path spell out s
- Example:
  - bbababb is accepted by (a|b)*abb
  - bbabab is NOT

NFA: Example with ε-transition
- RE: aa* | bb*
- States: {0, 1, 2, 3, 4}
- Input symbols: {a, b}
- Transition function:
  - δ(0, ε) = {1, 3}, δ(1, a) = {2}, δ(2, a) = {2}
  - δ(3, b) = {4}, δ(4, b) = {4}
- Start state: 0
- Final states: {2, 4}

Transition Diagram (NFA)
aa* | bb*
[Figure: start → 0, with ε-edges 0 → 1 and 0 → 3; 1 -a→ 2 with an a-loop on 2; 3 -b→ 4 with a b-loop on 4]
- NFA transition function:
  - δ(0, ε) = {1, 3}, δ(1, a) = {2}, δ(2, a) = {2}
  - δ(3, b) = {4}, δ(4, b) = {4}

Deterministic Finite Automata
- A DFA is a special case of an NFA in which
  - no state has an ε-transition
  - for each state s and input symbol a, there is at most one edge labeled a leaving s

DFA: An Example
- RE: (a | b)*abb
- States: {0, 1, 2, 3}
- Input symbols: {a, b}
- Transition function:
  - δ(0, a) = {1}, δ(1, a) = {1}, δ(2, a) = {1}, δ(3, a) = {1}
  - δ(0, b) = {0}, δ(1, b) = {2}, δ(2, b) = {3}, δ(3, b) = {0}
- Start state: 0
- Final states: {3}

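Transition Diagram
A DFA for (a | b)*abb
[Figure: start → 0; 0 -a→ 1, 0 -b→ 0, 1 -a→ 1, 1 -b→ 2, 2 -a→ 1, 2 -b→ 3, 3 -a→ 1, 3 -b→ 0]

Transition Diagram
[Figure: the NFA for (a | b)*abb (states 0 to 3) shown above the DFA for (a | b)*abb; each DFA state is labeled with the set of NFA states it stands for: 0 = {0}, 1 = {0, 1}, 2 = {0, 2}, 3 = {0, 3}]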

Recognition of Regular Expression Using DFA
- Simulating Deterministic Finite Automata (DFA)
  - initialization:
    - current_state = s0; input_symbol = 1st symbol
  - while (current_state is not fail_state && input_symbol != EOF)
    - next_state = δ(current_state, input_symbol), & current_state = next_state
    - input_symbol = next_input_symbol
  - if (current_state in final states) accept() else fail()

Simulating a DFA
Input. An input string ended with eof and a DFA with start state s0 and final states F.
Output. The answer "yes" if accepts, "no" otherwise.
    begin
      s := s0;
      c := nextchar;
      while c <> eof do begin
        s := move(s, c);    // transition function
        c := nextchar
      end;
      if s is in F then return "yes"
      else return "no"
    end.

DFA: An Example
(a | b)*abb
[Figure: transition diagram of the DFA for (a | b)*abb with states 0 to 3, start state 0 and final state 3]

An Example
bbababb:
    s = 0
    s = move(0, b) = 0
    s = move(0, b) = 0
    s = move(0, a) = 1
    s = move(1, b) = 2
    s = move(2, a) = 1
    s = move(1, b) = 2
    s = move(2, b) = 3
    s is in {3}
bbabab:
    s = 0
    s = move(0, b) = 0
    s = move(0, b) = 0
    s = move(0, a) = 1
    s = move(1, b) = 2
    s = move(2, a) = 1
    s = move(1, b) = 2
    s is not in {3}
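The simulation loop and the two traces can be reproduced with a few lines of C. This is a minimal sketch using the transition function of the DFA for (a | b)*abb given earlier; the names `dtran` and `dfa_accepts` are chosen here for illustration, and the end of the C string plays the role of eof.

```c
#include <stdbool.h>

/* dtran[s][0] is move(s, a) and dtran[s][1] is move(s, b)
   for the DFA of (a | b)*abb with states 0..3. */
static const int dtran[4][2] = {
    /* state 0 */ {1, 0},
    /* state 1 */ {1, 2},
    /* state 2 */ {1, 3},
    /* state 3 */ {1, 0},
};

/* Returns true iff the DFA ends in the final state 3 after reading input. */
bool dfa_accepts(const char *input) {
    int s = 0;                                       /* s := s0 */
    for (const char *p = input; *p != '\0'; p++) {
        if (*p != 'a' && *p != 'b') return false;    /* not in the alphabet */
        s = dtran[s][*p == 'b'];                     /* s := move(s, c) */
    }
    return s == 3;                                   /* s is in F ? */
}
```

Each input character costs exactly one table lookup, which is where the O(|x|) running time of DFA simulation comes from.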

Recognition of Regular Expression Using NFA
- Simulating Non-Deterministic Finite Automata (NFA)
  - Backtrack/Backup (sequential traversal):
    - remember the next alternative configuration (current input & next alternative state) when alternative choices are possible
  - Parallelism (parallel traversal):
    - trace every possible alternative in parallel
  - Look-ahead:
    - look at more input symbols to make it deterministic

Simulating an NFA
Input. An input string ended with eof and an NFA with start state s0 and final states F.
Output. The answer "yes" if accepts, "no" otherwise.
    begin
      S := ε-closure({s0});            // s0 =ε=> S
      c := nextchar;
      while c <> eof do begin
        S := ε-closure(move(S, c));    // S =c=> M =ε=> S'
        c := nextchar
      end;
      if S ∩ F <> Φ then return "yes"
      else return "no"
    end.

Operations on NFA states
- ε-closure: set of states reachable without consuming any input symbol
  - ε-closure(s): set of NFA states reachable from NFA state s on ε-transitions alone
  - ε-closure(S): set of NFA states reachable from some NFA state s in S on ε-transitions alone
- move(S, c): set of NFA states to which there is a transition on input symbol c from some NFA state s in S

Computation of ε-closure
Input. An NFA and a set of NFA states S.
Output. T = ε-closure(S).
    begin
      push all states in S onto stack; & initialize T := S;
      while stack is not empty do begin
        pop t, the top element, off of stack;
        for each state u with an edge from t to u labeled ε do
          if u is not in T [i.e., current ε-closure(S)] do begin
            add u to T;
            push u onto stack
          end
      end;
      return T
    end.

An Example
(a | b)*abb
[Figure: Thompson NFA for (a | b)*abb with states 0 to 10]
T = ε-closure(0), where S is the stack and t is the state just popped:
    01: S={0}, T={0}
    02: S={}; t=0; T={0}
    03: S={1, 7}; T={0, 1, 7}
    04: S={1}; t=7; T={0, 1, 7}
    05: S={1}; T={0, 1, 7}
    06: S={}; t=1; T={0, 1, 7}
    07: S={2, 4}; T={0, 1, 2, 4, 7}
    08: S={2}; t=4; T={0, 1, 2, 4, 7}
    09: S={2}; T={0, 1, 2, 4, 7}
    10: S={}; t=2; T={0, 1, 2, 4, 7}
    **: S={}; T={0, 1, 2, 4, 7}
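The stack-based computation can be sketched in C with state sets stored as bitmasks. The ε-edges below are read off the ε-closure table for the Thompson NFA of (a | b)*abb (0→1, 0→7, 1→2, 1→4, 3→6, 5→6, 6→1, 6→7); the bitmask representation and the names `eps` and `eps_closure` are assumptions of this sketch.

```c
#include <stdint.h>

enum { N_STATES = 11 };   /* NFA states 0..10 */

/* eps[s] is the set (bitmask) of states reachable from s by one ε-edge. */
static const uint16_t eps[N_STATES] = {
    [0] = (1u << 1) | (1u << 7),
    [1] = (1u << 2) | (1u << 4),
    [3] = (1u << 6),
    [5] = (1u << 6),
    [6] = (1u << 1) | (1u << 7),
};

/* Returns ε-closure(set): bit s of the result is set iff state s is
   reachable from some state in `set` using ε-edges only. */
uint16_t eps_closure(uint16_t set) {
    int stack[N_STATES];          /* each state is pushed at most once */
    int top = 0;
    uint16_t closure = set;       /* initialize T := S */

    for (int s = 0; s < N_STATES; s++)        /* push all states in S */
        if (set & (1u << s)) stack[top++] = s;

    while (top > 0) {
        int t = stack[--top];                 /* pop t */
        for (int u = 0; u < N_STATES; u++) {
            uint16_t bit = (uint16_t)(1u << u);
            if ((eps[t] & bit) && !(closure & bit)) {
                closure |= bit;               /* add u to T */
                stack[top++] = u;             /* push u */
            }
        }
    }
    return closure;
}
```

Because a state is pushed only when it first enters the closure, the stack never holds more than `N_STATES` entries.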

An Example
(a | b)*abb
[Figure: Thompson NFA for (a | b)*abb: ε-edges 0→1, 0→7, 1→2, 1→4, 3→6, 5→6, 6→1, 6→7; a-edges 2→3, 7→8; b-edges 4→5, 8→9, 9→10; start state 0, final state 10]
A = ε-closure({0}) = {0, 1, 2, 4, 7}

An Example
(a | b)*abb
[Figure: the same NFA]
move(A, a) = {3, 8}
move(A, b) = {5}

An Example
(a | b)*abb
[Figure: the same NFA]
move(A, b) = {5}
C = ε-closure(move(A, b)) = {1, 2, 4, 5, 6, 7}

An Example
(a | b)*abb
[Figure: the same NFA]
move(C, b) = {5}

An Example
(a | b)*abb
[Figure: the same NFA]
move(C, b) = {5}
C = ε-closure(move(C, b)) = {1, 2, 4, 5, 6, 7}

An Example
bbabb:
    S = ε-closure({0}) = {0, 1, 2, 4, 7} = A
    b: S = ε-closure(move({0, 1, 2, 4, 7}, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
    b: S = ε-closure(move({1, 2, 4, 5, 6, 7}, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
    a: S = ε-closure(move({1, 2, 4, 5, 6, 7}, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8}
    b: S = ε-closure(move({1, 2, 3, 4, 6, 7, 8}, b)) = ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9}
    b: S = ε-closure(move({1, 2, 4, 5, 6, 7, 9}, b)) = ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10}
    S ∩ {10} <> Φ, so bbabb is accepted
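The trace above can be checked mechanically. The following sketch simulates the Thompson NFA for (a | b)*abb on a C string, tracking the current set of states as a bitmask. The edge tables are taken from the transitions listed in the slides; the type `StateSet`, the macro `BIT`, and the function names are illustrative, and ε-closure is computed here by simple fixed-point iteration rather than with an explicit stack.

```c
#include <stdbool.h>
#include <stdint.h>

enum { N_STATES = 11, FINAL_STATE = 10 };

typedef uint16_t StateSet;            /* bit s set <=> NFA state s is in the set */
#define BIT(s) ((StateSet)(1u << (s)))

/* ε-edges of the Thompson NFA for (a | b)*abb. */
static const StateSet eps[N_STATES] = {
    [0] = (1u << 1) | (1u << 7),
    [1] = (1u << 2) | (1u << 4),
    [3] = (1u << 6),
    [5] = (1u << 6),
    [6] = (1u << 1) | (1u << 7),
};
/* Target of the a-edge / b-edge leaving each state, or -1 if there is none. */
static const int on_a[N_STATES] = { -1, -1,  3, -1, -1, -1, -1,  8, -1, -1, -1 };
static const int on_b[N_STATES] = { -1, -1, -1, -1,  5, -1, -1, -1,  9, 10, -1 };

/* ε-closure by repeating "add all ε-successors" until nothing changes. */
StateSet eps_closure(StateSet set) {
    StateSet closure = set, prev;
    do {
        prev = closure;
        for (int s = 0; s < N_STATES; s++)
            if (closure & BIT(s)) closure |= eps[s];
    } while (closure != prev);
    return closure;
}

/* move(S, c): states reachable from some state in S on input symbol c. */
StateSet move(StateSet set, char c) {
    const int *edge = (c == 'a') ? on_a : on_b;
    StateSet out = 0;
    for (int s = 0; s < N_STATES; s++)
        if ((set & BIT(s)) && edge[s] >= 0) out |= BIT(edge[s]);
    return out;
}

/* Returns true iff the NFA accepts the whole input string. */
bool nfa_accepts(const char *input) {
    StateSet S = eps_closure(BIT(0));            /* S := ε-closure({s0}) */
    for (const char *p = input; *p != '\0'; p++) {
        if (*p != 'a' && *p != 'b') return false;
        S = eps_closure(move(S, *p));            /* S := ε-closure(move(S, c)) */
    }
    return (S & BIT(FINAL_STATE)) != 0;          /* S ∩ F <> Φ ? */
}
```

Every input symbol triggers a pass over all NFA states, which is the per-symbol cost that the subset construction removes by precomputing these sets.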

Recognition of Regular Expression
- Simulating an NFA is harder than simulating a DFA
- Constructing an NFA is easier than constructing a DFA
  - Construct NFA => construct equivalent DFA
    - by pre-defining states in the NFA that can be reached in parallel as a state for the DFA
    - & pre-computing all possible transitions
    - instead of simulating the parallel transitions at run time
  - => (optional) state minimization
  - => simulate DFA

Constructing Automata from R.E.
- (1) R.E. → NFA (Thompson's construction) → DFA (Subset Construction) → State Minimization
  - R.E. decomposition into basic alphabets & operators
  - construct FA for basic alphabets
  - merging FAs by operator

Constructing Automata from R.E.
- (2) R.E. → DFA (a state transition corresponds to a position transition in the pattern) → State Minimization
  - annotate RE symbols with position labels
  - get syntax tree of the annotated pattern
  - compute {nullable, firstpos, lastpos} of subexpressions
  - compute follow(i)
  - s0 = firstpos(root)
  - construct transition function according to follow(i)

Regular Expression to NFA
R.E. → NFA (Thompson's construction)

Constructing NFA
How do we define an NFA that accepts the language of a regular expression? It is very simple. Remember that a regular expression is formed by the use of alternation, concatenation and repetition. Thus all we need to know is how to build the NFA for a single symbol, and how to compose NFAs.

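Composing NFAs with Alternation
The NFA for a symbol a (or ε) is:
[Figure: start → i -a→ f]
Given two NFAs N(s) and N(t), the NFA N(s|t) is:
[Figure: a new start state i with ε-edges to the start states of N(s) and N(t), and ε-edges from their accepting states to a new accepting state f]
(Aho, Sethi, Ullman, p. 122)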

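Composing NFAs with Concatenation
Given two NFAs N(s) and N(t), the NFA N(st) is:
[Figure: start → i, N(s) followed by N(t), ending in f; the accepting state of N(s) is merged with the start state of N(t)]
(Aho, Sethi, Ullman, p. 123)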

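Composing NFAs with Repetition
The NFA for N(s*) is:
[Figure: a new start state i and a new accepting state f; ε-edges from i into N(s), from the accepting state of N(s) to f, from i directly to f, and from the accepting state of N(s) back to its start state]
(Aho, Sethi, Ullman, p. 123)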

Properties of the NFA via Thompson's Construction
- Following the construction rules, we obtain an NFA N(r) that:
  - has at most twice as many states as the number of symbols and operators in r
  - has exactly one starting and one accepting state
  - each state has at most one outgoing transition on a symbol of the alphabet or at most two outgoing ε-transitions
    - All "nondeterministic" transitions are introduced by ε-transitions that connect to/from new/old init./final states.

An Example
(a | b)*abb
[Figure: Thompson NFA for (a | b)*abb with states 0 to 10, start state 0 and final state 10]

Comparison: NFA (by Heuristics)
(a | b)*abb
[Figure: start → 0, with loops on a and b at state 0; 0 -a→ 1 -b→ 2 -b→ 3]
- NOT constructed using Thompson's Construction
- States: {0/Start/init., 1, 2, 3/Final}
- Input symbols: {a, b}
- NFA transition function:
  - δ(0, a) = {0, 1}, δ(0, b) = {0}
  - δ(1, b) = {2}, δ(2, b) = {3}

NFA to DFA
NFA → DFA (Subset Construction)

Translating NFA into DFA
- Each state of the DFA (D) corresponds to a set of states of the NFA (N)
  - transforming N to D is done by subset construction
- D will be in state {x, y, z} after reading a given input string if and only if N could be in any of the states x, y, or z, depending on the transitions it chooses.
  - D keeps track of all the possible routes N might take and runs them in parallel.

Simulating an NFA (recall that …)
Input. An input string ended with eof and an NFA with start state s0 and final states F.
Output. The answer "yes" if accepts, "no" otherwise.
    begin
      S := ε-closure({s0});            // s0 =ε=> S           [initial state]
      c := nextchar;
      while c <> eof do begin
        S := ε-closure(move(S, c));    // S =c=> M =ε=> S'    [previous state: T, next state: U]
        c := nextchar
      end;
      if S ∩ F <> Φ then return "yes"
      else return "no"
    end.
From NFA simulation to the NFA-to-DFA construction:
- c: extends to all symbols in the alphabet (not the input symbols in some files)
- S: all states generated during NFA parallel traversal over all possible input prefixes (NOT a particular input)
- δ: all transitions during traversal

From an NFA to a DFA
Subset construction algorithm.
Input. An NFA N.
Output. A DFA D with states Dstates and transition table Dtran.
    begin
      add ε-closure(s0) as an unmarked state to Dstates;
      while there is an unmarked state T in Dstates do begin
        mark T;
        for each input symbol a do begin
          U := ε-closure(move(T, a));
          if U is not in Dstates then
            add U as an unmarked state to Dstates;
          mark U as final if U contains the original final state;
          Dtran[T, a] := U
        end
      end
    end.

An Example
(a | b)*abb
[Figure: Thompson NFA for (a | b)*abb with states 0 to 10, start state 0 and final state 10]

An Example: ε-closure(s) & move(s, x)
    s    ε-closure(s)          move(s, a)   move(s, b)   important state?
    0    {0, 1, 2, 4, 7}
    1    {1, 2, 4}
    2    {2}                   3                         Yes
    3    {1, 2, 3, 4, 6, 7}
    4    {4}                                5            Yes
    5    {1, 2, 4, 5, 6, 7}
    6    {1, 2, 4, 6, 7}
    7    {7}                   8                         Yes
    8    {8}                                9            Yes
    9    {9}                                10           Yes
    10   {10}                  ((Fin))      ((Fin))      ((?))

An Example
- Ignore ε-transitions (0, 1, …)
- a-transitions: (2, a) → 3, (7, a) → 8
- b-transitions: (4, b) → 5, (8, b) → 9, (9, b) → 10
- Good to label states sequentially, such that δ(s, x) → s+1
    ε-closure({0}) = {0, 1, 2, 4, 7} = A
    A: ε-closure(move({0, 1, 2, 4, 7}, a)) = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8} = B
    A: ε-closure(move({0, 1, 2, 4, 7}, b)) = ε-closure({5}) = {1, 2, 4, 5, 6, 7} = C
    B: ε-closure(move({1, 2, 3, 4, 6, 7, 8}, a)) = ε-closure({3, 8}) = B
    B: ε-closure(move({1, 2, 3, 4, 6, 7, 8}, b)) = ε-closure({5, 9}) = {1, 2, 4, 5, 6, 7, 9} = D
    C: ε-closure(move({1, 2, 4, 5, 6, 7}, a)) = ε-closure({3, 8}) = B
    C: ε-closure(move({1, 2, 4, 5, 6, 7}, b)) = ε-closure({5}) = C
    D: ε-closure(move({1, 2, 4, 5, 6, 7, 9}, a)) = ε-closure({3, 8}) = B
    D: ε-closure(move({1, 2, 4, 5, 6, 7, 9}, b)) = ε-closure({5, 10}) = {1, 2, 4, 5, 6, 7, 10} = E
    E: ε-closure(move({1, 2, 4, 5, 6, 7, 10}, a)) = ε-closure({3, 8}) = B
    E: ε-closure(move({1, 2, 4, 5, 6, 7, 10}, b)) = ε-closure({5}) = C

An Example
- Ignore ε-transitions (0, 1, …)
- a-transitions: (2, a) → 3, (7, a) → 8
- b-transitions: (4, b) → 5, (8, b) → 9, (9, b) → 10
- Good to label states sequentially, such that δ(s, x) → s+1
    State ({...}* denotes ε-closure)               Input symbol
                                                   a    b
    A = {0}*     = {0, 1, 2, 4, 7}                 B    C
    B = {3, 8}*  = {1, 2, 3, 4, 6, 7, 8}           B    D
    C = {5}*     = {1, 2, 4, 5, 6, 7}              B    C
    D = {5, 9}*  = {1, 2, 4, 5, 6, 7, 9}           B    E
    E = {5, 10}* = {1, 2, 4, 5, 6, 7, 10}          B    C

An Example
    State                            Input symbol
                                     a    b
    A = {0, 1, 2, 4, 7}              B    C
    B = {1, 2, 3, 4, 6, 7, 8}        B    D
    C = {1, 2, 4, 5, 6, 7}           B    C
    D = {1, 2, 4, 5, 6, 7, 9}        B    E
    E = {1, 2, 4, 5, 6, 7, 10}       B    C

An Example: Result of Subset Construction
[Figure: DFA with states A = {0, 1, 2, 4, 7} (start), B = {1, 2, 3, 4, 6, 7, 8}, C = {1, 2, 4, 5, 6, 7}, D = {1, 2, 4, 5, 6, 7, 9}, E = {1, 2, 4, 5, 6, 7, 10}; every state goes to B on a; on b: A → C, B → D, C → C, D → E, E → C]
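The subset construction for this example is short enough to sketch in C. The code below rebuilds the five DFA states A to E from the Thompson NFA of (a | b)*abb, with NFA state sets stored as bitmasks; the names `subset_construction`, `dstates`, `dtran`, and the bound `MAX_DFA` are choices made for this sketch. The "marked" states of the algorithm are simply those with an index below the loop counter.

```c
#include <stdint.h>

enum { N_NFA = 11, MAX_DFA = 32 };

typedef uint16_t StateSet;            /* bit s set <=> NFA state s is in the set */
#define BIT(s) ((StateSet)(1u << (s)))

/* Thompson NFA for (a | b)*abb: ε-edges, a-edges and b-edges. */
static const StateSet eps[N_NFA] = {
    [0] = (1u << 1) | (1u << 7),
    [1] = (1u << 2) | (1u << 4),
    [3] = (1u << 6),
    [5] = (1u << 6),
    [6] = (1u << 1) | (1u << 7),
};
static const int on_a[N_NFA] = { -1, -1,  3, -1, -1, -1, -1,  8, -1, -1, -1 };
static const int on_b[N_NFA] = { -1, -1, -1, -1,  5, -1, -1, -1,  9, 10, -1 };

StateSet eps_closure(StateSet set) {
    StateSet closure = set, prev;
    do {                              /* repeat until no new state is added */
        prev = closure;
        for (int s = 0; s < N_NFA; s++)
            if (closure & BIT(s)) closure |= eps[s];
    } while (closure != prev);
    return closure;
}

StateSet move(StateSet set, char c) {
    const int *edge = (c == 'a') ? on_a : on_b;
    StateSet out = 0;
    for (int s = 0; s < N_NFA; s++)
        if ((set & BIT(s)) && edge[s] >= 0) out |= BIT(edge[s]);
    return out;
}

/* Fills dstates[] (the NFA-state set of each DFA state) and dtran[][]
   (column 0 = a, column 1 = b). Returns the number of DFA states,
   or -1 if more than MAX_DFA states would be needed. */
int subset_construction(StateSet dstates[MAX_DFA], int dtran[MAX_DFA][2]) {
    int count = 0;
    dstates[count++] = eps_closure(BIT(0));      /* ε-closure(s0), unmarked */

    for (int marked = 0; marked < count; marked++) {   /* mark T = dstates[marked] */
        for (int sym = 0; sym < 2; sym++) {
            StateSet U = eps_closure(move(dstates[marked], "ab"[sym]));
            int j = 0;
            while (j < count && dstates[j] != U) j++;  /* is U already in Dstates? */
            if (j == count) {
                if (count == MAX_DFA) return -1;
                dstates[count++] = U;                  /* add U as unmarked */
            }
            dtran[marked][sym] = j;                    /* Dtran[T, a] := U */
        }
    }
    return count;
}
```

States are discovered in the order A, B, C, D, E (indices 0 to 4), so the resulting `dtran` rows can be compared directly with the table above; a DFA state is final when its set contains NFA state 10.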

Minimizing Number of States
- Every DFA has a unique smallest equivalent DFA.
- Given a DFA M, we use splitting to construct the equivalent minimal DFA.
  - Normally, we actually merge individual states into larger sets of states, instead of splitting wildly.

DFA to Minimum State DFA
Input. A DFA M = (S, Σ, δ, s0, F).
Output. An equivalent DFA M' = (S', Σ, δ', s0', F') with fewer states.
    begin
      initialize a partition P of two groups of states:
        {F (final states), S-F (non-final states)}
      repeat /* until Pnew unchanged */
        for each group G of P do begin
          partition G into subgroups such that any two states s and t of G
            are in the same subgroup iff for all input symbols a,
            states s and t have transitions on a to states in the same group of P;
            /* at worst, a state will be in a subgroup by itself */
          update Pnew by replacing G by the set of all subgroups formed
        end
      s0' = r(s0), representative of s0;
      S' = {representatives of subgroups};
      F' = {representatives of states in F};
      δ(s, a) = t => δ'(r(s), a) = r(t)
    end.
States s and t stay in the same subgroup when their rows agree group by group:
         a    a'   a''  …
    s    q    q'   q''  …
    t    q    q'   q''  …

Splitting into Equivalent States
Algorithm:
- Initially, there are two sets, one consisting of all accepting states of M, the other containing the remaining states.
    repeat {
      Choose a set A = {s1, s2, …, sn}
      Split A into A1, A2, …, Am so that for all Ai & all symbols a:
        if sj, sk ∈ Ai and, on input a, sj → tj and sk → tk    // source → target
        then tj and tk are in the same set.
    } until no more change.

An Example
    State                            Input symbol
                                     a    b
    A = {0, 1, 2, 4, 7}              B    C
    B = {1, 2, 3, 4, 6, 7, 8}        B    D
    C = {1, 2, 4, 5, 6, 7}           B    C
    D = {1, 2, 4, 5, 6, 7, 9}        B    E
    E = {1, 2, 4, 5, 6, 7, 10}       B    C

An Example
    Group   State                            Input symbol
                                             a    b
    -Fin    A = {0, 1, 2, 4, 7}              B    C
    -Fin    B = {1, 2, 3, 4, 6, 7, 8}        B    D
    -Fin    C = {1, 2, 4, 5, 6, 7}           B    C
    -Fin    D = {1, 2, 4, 5, 6, 7, 9}        B    E
    +Fin    E = {1, 2, 4, 5, 6, 7, 10}       B    C

An Example
    State                            Input symbol
                                     a    b
    A = {0, 1, 2, 4, 7}              B    C
    B = {1, 2, 3, 4, 6, 7, 8}        B    D
    C = {1, 2, 4, 5, 6, 7}           B    C
    D = {1, 2, 4, 5, 6, 7, 9}        B    E
    E = {1, 2, 4, 5, 6, 7, 10}       B    C

An Example (C merged into A)
    State                            Input symbol
                                     a    b
    A = {0, 1, 2, 4, 7}              B    A
    B = {1, 2, 3, 4, 6, 7, 8}        B    D
    A = {1, 2, 4, 5, 6, 7}           B    A
    D = {1, 2, 4, 5, 6, 7, 9}        B    E
    E = {1, 2, 4, 5, 6, 7, 10}       B    A
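The partition refinement behind this example can be sketched as follows. Starting from the two groups {E} and {A, B, C, D}, the function keeps splitting groups until two states share a group only if they move to the same groups on every symbol. The 5-state table is the one from the slides (A to E numbered 0 to 4); `minimize` and `group` are hypothetical names, and this quadratic formulation is chosen for brevity rather than efficiency.

```c
enum { N = 5, N_SYMS = 2 };

/* DFA from subset construction: rows A..E = 0..4, columns a, b. */
static const int dtran[N][N_SYMS] = {
    /* A */ {1, 2},
    /* B */ {1, 3},
    /* C */ {1, 2},
    /* D */ {1, 4},
    /* E */ {1, 2},
};
static const int is_final[N] = { 0, 0, 0, 0, 1 };

/* Writes the final group number of every state into group[] and
   returns the number of groups (= states of the minimized DFA). */
int minimize(int group[N]) {
    int n_groups = 2;
    for (int s = 0; s < N; s++) group[s] = is_final[s];   /* {S-F, F} */

    for (;;) {
        int next[N], n_next = 0;
        for (int s = 0; s < N; s++) {
            next[s] = -1;
            /* Join the new group of an earlier state t that is in the same
               old group and goes to the same old groups on all symbols. */
            for (int t = 0; t < s && next[s] < 0; t++) {
                int same = (group[s] == group[t]);
                for (int a = 0; same && a < N_SYMS; a++)
                    same = (group[dtran[s][a]] == group[dtran[t][a]]);
                if (same) next[s] = next[t];
            }
            if (next[s] < 0) next[s] = n_next++;          /* open a new group */
        }
        for (int s = 0; s < N; s++) group[s] = next[s];
        if (n_next == n_groups) return n_groups;          /* partition unchanged */
        n_groups = n_next;
    }
}
```

For this DFA the refinement stops with four groups, {A, C}, {B}, {D}, {E}, which matches the merge of C into A shown in the last table.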

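Transition Diagram (after State Reduction)
We said…
a DFA for (a | b)*abb
[Figure: DFA with states 0 to 3, labeled with the NFA state sets 0 = {0}, 1 = {0, 1}, 2 = {0, 2}, 3 = {0, 3}]

Transition Diagram (after State Reduction)
It really is…
a DFA for (a | b)*abb
[Figure: the same DFA with states 0 to 3, now labeled with the minimized states 0 = A, 1 = B, 2 = D, 3 = E]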

RE to DFA
Construct DFA from RE directly without intermediate NFA

Review of Thompson's Transition Diagram: An Example
(a | b)*abb
[Figure: Thompson NFA for (a | b)*abb with states 0 to 10]
A = ε-closure({0}) = {0, 1, 2, 4, 7}

Review of Thompson's Transition Diagram: An Example
(a | b)*abb
[Figure: the same NFA]
move(A, b) = {5}

Review of Thompson's Transition Diagram: An Example
(a | b)*abb
[Figure: the same NFA]
C = ε-closure(move(A, b)) = {1, 2, 4, 5, 6, 7}
{0, 1, 2, 4, 7} =b=> {1, 2, 4, 5, 6, 7}

Constructing DFA from R.E.
- "Important states":
  - ε-transitions have no effect on determining the next state since they do not really make a transition on a visible input symbol
  - ε-transitions determine equivalent states in a loose sense
  - Important states are related to a non-null symbol at a particular position in the RE, e.g., b at position 2 of (a|b)*abb#
- Re-definition of "states":
  - Thompson's transition diagram: nodes as states (the status before & after matching a symbol)
  - Alternative method: arcs as states (the position (in the RE) of a match)
    - #: simulates the last node for checking the final state
    - Only states that consume symbols matter

DFA directly from R.E.: underlying NFA
(a1|b2)*a3 b4 b5 #6
[Figure: NFA for the augmented expression with states A to F; the arcs that consume a symbol are numbered 1 to 6 after the positions in the RE]
Important states ({1 … 6}): with non-null transitions

DFA directly from R.E.: underlying NFA
(a1|b2)*a3 b4 b5 #6
[Figure: the same NFA]
Followpos(1) = {1, 2, 3}

Constructing Automata from R.E.
- RE = (a|b)*abb#
  - (a1|b2)*a3 b4 b5 #6
- Syntax tree for RE: (Fig. 3.41)
- Directed graph for followpos():
    Node         Followpos
    1 (on a)     {1, 2, 3}
    2 (on b)     {1, 2, 3}
    3 (on a)     {4}
    4 (on b)     {5}
    5 (on b)     {6}
    6 (#)        -
  - position 2: ready to match 'b' at '2'; position 3: ready to match 'a' at '3'
- Example: followpos(1): (a1|b2)* a3 b4 b5 #6 ~ ((a1|b2) … (a1|b2)) a3 b4 b5 #6

DFA directly from R.E.
a DFA for (a | b)*abb
(a1|b2)*a3 b4 b5 #6
[Figure: DFA with states 0 to 3, each labeled with its set of possible matching positions (the next possible matching positions after each transition): 0 = {1, 2, 3}, 1 = {1, 2, 3, 4}, 2 = {1, 2, 3, 5}, 3 = {1, 2, 3, 6}]

Constructing DFA from RE: FirstPos, LastPos, Nullable
- Matching REs: 3 possible cases
  - x(c1|c2)y
  - x(c1.c2)y
  - x(c*)y
- Followpos: which position(s)/symbol(s) to match after matching lastpos of 'x'?
  - Requires firstpos of c, c1, c2, y
  - Need to know whether c1, c2 can be pass-through (nullable) (c* is always nullable)

Constructing DFA from R.E.
- R.E. → DFA:
  - State = (set of) position(s) (= respective symbols) in RE
    - Set of positions = set of important states of the NFA (that consume input symbols)
    - (where an input character is being matched)
  - State transition = allowed position transition for RE
- DFA construction:
  - Augment RE: (r)# [#: end-of-pattern mark]
  - Annotate RE symbols (excluding ε) with position labels
  - Get syntax tree T of the annotated pattern
  - Compute {nullable, firstpos, lastpos} of nodes [sub-REs]
  - Compute follow(i) [by making a depth-first traversal over the tree T]
  - Initial state: s0 = firstpos(root) [complete RE]
  - Construct transition function according to follow(i) [δ(i, a) = i']

Constructing DFA from R.E.
DFA construction:
    Initial state: s0 = firstpos(root) & S = {s0}
    While there is an unmarked state Q in S do begin
      mark Q
      For each input symbol 'a' do begin
        For each position 'p' in Q s.t. symbol(p) = 'a',
          Let U = followpos(p)    // take union if more than one such p
        If U is not Φ (empty), and U ∉ S, then S += U    // new state
        δ(Q, a) = U    // new transition
      End /* a */
    End /* while */
Compare with NFA → DFA: Q = {p} =a=> U = {followpos(p)}
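As a rough check of this construction, the sketch below hard-codes the position symbols and the followpos table for (a1|b2)*a3 b4 b5 #6 and builds the DFA from firstpos(root) = {1, 2, 3}. Computing nullable/firstpos/lastpos from the syntax tree is not shown; the tables are simply copied from the followpos slide. `PosSet`, `POS`, and `build_dfa` are names invented for this illustration.

```c
#include <stdint.h>

enum { N_POS = 6, MAX_DFA = 16 };

typedef uint8_t PosSet;               /* bit p set <=> position p (1..6) is in the set */
#define POS(p) ((PosSet)(1u << (p)))

/* Symbol at each position of (a1|b2)*a3 b4 b5 #6 (index 0 is unused). */
static const char symbol_at[N_POS + 1] = { 0, 'a', 'b', 'a', 'b', 'b', '#' };

/* followpos table copied from the slides (index 0 is unused). */
static const PosSet followpos[N_POS + 1] = {
    0,
    POS(1) | POS(2) | POS(3),         /* followpos(1) */
    POS(1) | POS(2) | POS(3),         /* followpos(2) */
    POS(4),                           /* followpos(3) */
    POS(5),                           /* followpos(4) */
    POS(6),                           /* followpos(5) */
    0,                                /* followpos(6): nothing follows # */
};

/* Fills states[] and dtran[][] (column 0 = a, column 1 = b; -1 = no move).
   Returns the number of DFA states, or -1 if MAX_DFA is exceeded. */
int build_dfa(PosSet states[MAX_DFA], int dtran[MAX_DFA][2]) {
    int count = 0;
    states[count++] = POS(1) | POS(2) | POS(3);     /* s0 = firstpos(root) */

    for (int q = 0; q < count; q++) {               /* states below q are marked */
        for (int sym = 0; sym < 2; sym++) {
            PosSet U = 0;
            for (int p = 1; p <= N_POS; p++)        /* union of followpos(p) */
                if ((states[q] & POS(p)) && symbol_at[p] == "ab"[sym])
                    U |= followpos[p];
            dtran[q][sym] = -1;
            if (U == 0) continue;                   /* U is empty: no transition */
            int j = 0;
            while (j < count && states[j] != U) j++;
            if (j == count) {
                if (count == MAX_DFA) return -1;
                states[count++] = U;                /* new state */
            }
            dtran[q][sym] = j;                      /* new transition */
        }
    }
    return count;
}
```

The construction yields four states, {1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 5} and {1, 2, 3, 6}; the last one contains position 6 (the # mark) and is therefore accepting.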

Lexical Analyzer Generator
RE → (Thompson's construction) → NFA → (Subset construction) → DFA

Time-Space Tradeoffs
- RE (r) to NFA, simulate NFA on input x
  - time: O(|r| * |x|), space: O(|r|) [max. 2|r| states]
- RE to NFA, NFA to DFA, simulate DFA
  - time: O(|x|), space: O(2^|r|)
- Lazy transition evaluation
  - transitions are computed as needed at run time; computed transitions are stored in a cache for later use

LEX
A Language for Specifying Lexical Analyzers

Lex
A language for specifying lexical analyzers (for any language, say, X)
- lex.l (lexical analyzer spec.) → lex compiler → lex.yy.c
- lex.yy.c (lexical analyzer in C) → C compiler → a.out
- source code in X → a.out (lexical analyzer executable) → tokens (for parser)
  - next_token = yylex();

Using a Scanner Generator: Lex
- Lex is a lexical analyzer generator developed by Lesk and Schmidt of AT&T Bell Labs, written in C, running under UNIX.
- Lex produces an entire scanner module that can be compiled and linked with other compiler modules.
- Lex associates regular expressions with arbitrary code fragments. When an expression is matched, the code segment is executed.
- A typical lex program contains three sections separated by %% delimiters.

Lex Programs
    %{
    auxiliary declarations
    %}
    regular definitions
    %%
    translation rules
    %%
    auxiliary procedures

First Section of Lex
- The first section defines character classes and auxiliary regular expressions. (Fig. 3.5 on p. 67)
  - [] delimits character classes
  - - denotes ranges: [xyz] == [x-z]
  - \ denotes the escape character, as in C.
  - ^ complements a character class (NOT):
    - [^xy] denotes all characters except x and y.
  - |, *, and + (alternation, Kleene closure, and positive closure) are provided.
  - () can be used to control grouping of subexpressions.
  - (expr)? == (expr)|ε, i.e., matches expr zero times or once.
  - {} signals the macro expansion of a symbol defined in the first section.

First Section of Lex, cont.
- Catenation is specified by the juxtaposition of two expressions; no explicit operator is used.
  - [ab][cd] will match any of ad, ac, bc, and bd.
  - begin == "begin" == [b][e][g][i][n]

Second Section of Lex
- The second section of lex defines a table of regular expressions and corresponding commands.
  - When an expression is matched, its associated command is executed.
    - Auxiliary functions may be defined in the third section.
  - Input that is matched is stored in the string variable yytext, whose length is yyleng.
  - Lex creates an integer function yylex() that may be called from the parser.
    - The value returned is usually the token code of the token scanned by Lex.
  - When yylex() encounters end of file, it calls a user-supplied integer function named yywrap() to wrap up input processing.

Translation Rules
P1    {action1}
P2    {action2}
. . .
Pn    {actionn}
where Pi are regular expressions and actioni are program segments to be executed on matching Pi

Dealing with Multiple Input Files
- yylex() uses three user-defined functions to handle character I/O:
  - input(): retrieve a single character, 0 on EOF
  - output(c): write a single character to the output
  - unput(c): put a single character back on the input to be re-read

An Example
%{                      // auxiliary declarations (in C)
#define LT 24
#define LE 25
#define EQ 26
. . .
%}
                        // regular definitions
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%

An Example
                        // translation rules (actions are in C)
{ws}       { /* no action and no return */ }
if         { return (IF); }
then       { return (THEN); }
else       { return (ELSE); }
{id}       { yylval = install_id(); return (ID); }
{number}   { yylval = install_num(); return (NUMBER); }
"<"        { yylval = LT; return (RELOP); }
"<="       { yylval = LE; return (RELOP); }
. . .
%%
                        // auxiliary procedures (in C)
install_id()  { … /* yytext to symbol table */ }
install_num() { . . . /* yytext to symbol table */ }

Functions and Variables
- yylex(): a function implementing the lexical analyzer and returning the token matched
- yytext: a global pointer variable pointing to the lexeme matched
- yyleng: a global variable giving the length of the lexeme matched
- yylval: an external global variable storing the attribute of the token

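NFA from Lex Programs
- Combined pattern: P1 | P2 | . . . | Pn
- A new start state s0 branches (by ε-transitions) to each of N(P1), N(P2), . . ., N(Pn), where N(Pi) is the NFA for pattern Pi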

Rules
- Look for the longest lexeme
  - e.g., number
  - Match until no transition & retract to longest match
- Look for the first-listed pattern that matches the longest lexeme
  - keywords and identifiers
- List frequently occurring patterns first
  - white space

Rules
- View keywords as exceptions to the rule of identifiers
  - construct a keyword table to distinguish them from id's
- Lookahead operator: r1/r2 - match a string in r1 only if followed by a string in r2
  - DO 5 I = 1.25 (an assignment to the variable DO5I) vs. DO 5 I = 1,25 (a DO loop)
  - DO/({letter}|{digit})* = ({letter}|{digit})*,
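The longest-match rule, the retraction step, and the keyword table can be sketched as a small hand-written scanner in C. This is only an illustrative sketch: the token codes, the `keywords` table, and the function `next_token` are hypothetical names, and the number pattern is simplified to digits with an optional fraction (no exponent).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token codes for this sketch. */
enum { T_ERROR, T_WS, T_IF, T_THEN, T_ELSE, T_ID, T_NUMBER };

/* Keywords are treated as exceptions to the identifier rule. */
static const struct { const char *text; int token; } keywords[] = {
    { "if", T_IF }, { "then", T_THEN }, { "else", T_ELSE },
};

/* Scan one lexeme starting at s; store its length in *len and
 * return its token code. */
static int next_token(const char *s, size_t *len)
{
    size_t i = 0;
    assert(s != NULL && len != NULL);

    if (isspace((unsigned char)s[0])) {          /* white space first  */
        while (isspace((unsigned char)s[i])) i++;
        *len = i;
        return T_WS;
    }
    if (isalpha((unsigned char)s[0])) {
        while (isalnum((unsigned char)s[i])) i++;    /* longest lexeme */
        *len = i;
        for (size_t k = 0; k < sizeof keywords / sizeof keywords[0]; k++)
            if (strlen(keywords[k].text) == i &&
                memcmp(s, keywords[k].text, i) == 0)
                return keywords[k].token;        /* keyword table hit  */
        return T_ID;
    }
    if (isdigit((unsigned char)s[0])) {
        while (isdigit((unsigned char)s[i])) i++;
        size_t last_accept = i;                  /* "12" is a number   */
        if (s[i] == '.') {
            i++;
            if (isdigit((unsigned char)s[i])) {
                while (isdigit((unsigned char)s[i])) i++;
                last_accept = i;                 /* "12.5" is longer   */
            }
        }
        *len = last_accept;                      /* retract if needed  */
        return T_NUMBER;
    }
    *len = 1;                                    /* no pattern matches */
    return T_ERROR;
}
```

On the input `ifx` the scanner reports an identifier of length 3 rather than the keyword `if`, because the longest lexeme wins; on `12.x` it consumes the dot tentatively and then retracts to the last accepting position after `12`.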

Lexical Error Recovery
- Error:
  - none of the patterns matches a prefix of the remaining input
- Panic mode error recovery
  - delete successive characters from the remaining input until the pattern-matching can continue
- Error repair: (single error recovery)
  - delete an extraneous character
  - insert a missing character
  - replace an incorrect character
  - transpose two adjacent characters

Appendix: Regular Expression and Pattern Matching
- KMP algorithm
- AC algorithm

R.E. and Pattern Matching
- Naïve Pattern Matching:
  - Specify the pattern with a regular expression: one R.E. for each keyword
  - Construct a FA for each such R.E., and conduct left-to-right matching:
      DFA := State_Transition_Table := Construct_DFA(R.E.)
      while (input_pointer != EOF)
          stop_state = recognize(input_pointer, DFA)
          if fail (stop_state not in final_states): move input pointer by one character
          if success (stop_state in final_states): output matching status & skip over matched pattern
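The naïve loop can be sketched in C as follows. This is a simplified illustration, not the slides' code: plain character comparison stands in for running each keyword's DFA, and the names `naive_match`, `pos`, and `which` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive multi-keyword matching: at every input position, try each
 * keyword from scratch (a stand-in for running one DFA per keyword).
 * Start offsets go to pos[], keyword indices to which[]; at most max
 * matches are recorded. Returns the number of matches recorded. */
static int naive_match(const char *text, const char *const *kw, int nkw,
                       size_t *pos, int *which, int max)
{
    size_t n = strlen(text);
    size_t i = 0;
    int found = 0;

    assert(text != NULL && kw != NULL && pos != NULL && which != NULL);
    while (i < n && found < max) {
        int hit = -1;
        for (int k = 0; k < nkw && hit < 0; k++) {
            size_t m = strlen(kw[k]);
            /* restart from the initial state for every keyword */
            if (m > 0 && m <= n - i && memcmp(text + i, kw[k], m) == 0)
                hit = k;
        }
        if (hit >= 0) {
            pos[found] = i;
            which[found] = hit;
            found++;
            i += strlen(kw[hit]);    /* success: skip the matched pattern */
        } else {
            i += 1;                  /* failure: advance by one character */
        }
    }
    return found;
}
```

Every failed attempt throws away what was learned about the text just scanned, which is the cost that failure functions are meant to remove.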

R.E. and Pattern Matching
- Why Is It Slow?
  - multiple keywords are matched multiple times, once for each keyword
  - on failure, the input pointer moves backward to the character next to the last beginning of matching & the FA resets to the initial state, even though some repeated pattern might appear in the recently matched partial string
    - probability of failure is significantly larger than probability of success in most applications (success or match only a few times)
    - will therefore start the next matching session by setting the input pointer one character past the starting position of the previous match most of the time

R.E. and Pattern Matching
- RE vs. Pattern Matching
  - R.E. <=> FA for recognizing one of a set of keywords/patterns in input string
    - say "yes" if input string is in Lang(R.E.) (the regular language for the expression)
  - Pattern Matching (PM): recognizing all the occurrences of any keyword/pattern, specified in regular expression, within a text document
    - specify each pattern/keyword with a RE
    - output all occurrences, in addition to saying yes/no

R.E. and Pattern Matching
- Formal Method for Pattern Matching (PM)
  - Constructing a FA for (single/multi-keyword) PM is equivalent to constructing a FA that recognizes the regular expression PM = (.* | RE)*, and outputting a keyword upon visiting a final state of the original FA for recognizing RE
    - RE = K1 | K2 | K3 | … | Kn (the regular expression for all specified keywords)
    - ".": any character that is not a first character of K1 ~ Kn
    - ".*": unspecified patterns (or unknown keywords)

R.E. and Pattern Matching
- Constructing FA1 for recognizing RE = K1 | K2 | … | Kn
  - equivalent to merging prefixes of the keywords to avoid redundant forward matching => TRIE lexicon tree = a DFA for RE
- Constructing FA2 for recognizing PM = (.*|RE)*
  - extending FA1 by (a) including 'unknown keywords' and (b) introducing epsilon-moves from the original final states to the original initial state
    - on matching failure, redundant backward matching can be avoided if a substring preceding the current input pointer is the prefix of another keyword
    - failure function: the state (in TRIE) to back off to on failure (!= init. state if the above-mentioned substring exists and is non-null)
  - epsilon-moves & failure function make FA2 an NFA, whose DFA counterpart can be simulated by backtracking

R.E. and Fast Methods for Pattern Matching
- Fast Single Keyword Matching [KMP - Knuth, Morris & Pratt 1977]
  - Reference: [Aho et al. 1986, Ex. 3.26-3.27]
  - keyword => state_transition_table
  - reduce repeated matching suggested by keyword pattern
  - failure function: where to back off on failure

R.E. and Fast Methods for Pattern Matching
- Fast Multiple Keyword Matching [AC, Cherry 1982]
  - Reference: [Aho, Ex. 3.31-32]
  - keywords => TRIE (state_transition_table)
  - reduce repeated matching suggested by TRIE of the keywords
  - TRIE failure function
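A keyword trie with a failure function can be sketched in C as below, in the style commonly known as Aho-Corasick; that this is what the slide's "AC" refers to is an assumption. The names `AcMachine`, `ac_build`, and `ac_count`, the lowercase-only alphabet, and the node limit are all choices made for this sketch. Failure links are folded into the transition table while it is built, so the scan never moves the input pointer backward.

```c
#include <assert.h>
#include <string.h>

#define AC_MAX_NODES 256
#define AC_SIGMA 26                 /* assumption: lowercase letters only */

typedef struct {
    int go[AC_MAX_NODES][AC_SIGMA]; /* trie edges, completed into a DFA   */
    int fail[AC_MAX_NODES];         /* failure function on trie states    */
    int out[AC_MAX_NODES];          /* keywords ending at this state      */
    int n;                          /* number of states in use            */
} AcMachine;

static void ac_build(AcMachine *m, const char *const *kw, int nkw)
{
    int queue[AC_MAX_NODES], head = 0, tail = 0;

    memset(m, 0, sizeof *m);        /* 0 = "no edge"; state 0 is the root */
    m->n = 1;
    for (int k = 0; k < nkw; k++) { /* merge keyword prefixes into a trie */
        int s = 0;
        for (const char *p = kw[k]; *p; p++) {
            int c = *p - 'a';
            assert(c >= 0 && c < AC_SIGMA);
            if (m->go[s][c] == 0) {
                assert(m->n < AC_MAX_NODES);
                m->go[s][c] = m->n++;
            }
            s = m->go[s][c];
        }
        m->out[s]++;
    }
    /* Breadth-first pass: shallower states are finished first. */
    for (int c = 0; c < AC_SIGMA; c++)
        if (m->go[0][c])
            queue[tail++] = m->go[0][c];      /* depth 1: fail = root     */
    while (head < tail) {
        int s = queue[head++];
        for (int c = 0; c < AC_SIGMA; c++) {
            int t = m->go[s][c];
            if (t == 0) {
                /* missing edge: reuse the failure state's transition */
                m->go[s][c] = m->go[m->fail[s]][c];
            } else {
                m->fail[t] = m->go[m->fail[s]][c];
                m->out[t] += m->out[m->fail[t]];  /* inherit suffix hits */
                queue[tail++] = t;
            }
        }
    }
}

/* Count all keyword occurrences in text in one left-to-right pass. */
static int ac_count(const AcMachine *m, const char *text)
{
    int s = 0, hits = 0;
    for (; *text; text++) {
        int c = *text - 'a';
        s = (c >= 0 && c < AC_SIGMA) ? m->go[s][c] : 0;
        hits += m->out[s];
    }
    return hits;
}
```

With the keywords he, she, his, and hers, scanning "ushers" visits the state for "she", whose failure state is the one for "he", so both keywords are reported at the same input position without re-reading any character.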

R.E. and Fast Methods for Pattern Matching
- Boyer & Moore [1977]
- Harrison [1971]: Hashing Method

KMP: Failure Function
start -> 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -a-> 5 -a-> 6
- If failed at state 5 on x =>
  - Input = "ababax" (input pointer => x)
  - Need to re-try "babax", "abax", "bax", "ax" from state 0
  - "babax": fail again (does not start with prefix "ab…")
  - "abax": success until state 3, pointing at x
  - Look back from s5 & see longest match (s3) to prefix
    - Choose the longest one so we can re-try the least
- Do you need to go back and try all these?
  - No. Simply set s := 3 and keep the input pointer at x
  - State 3 is the "failure state" of state 5

KMP: Failure Function
start -> 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -a-> 5 -a-> 6
s      0  1  2  3  4  5  6
f(s)   0  0  0  1  2  3  1
- If failed at state 5 on x (input = "ababax"): set s := f(5) = 3 and keep the input pointer at x
- State 3 is the "failure state" of state 5

KMP: Re-Matching on Failure
start -> 0 -a-> 1 -b-> 2 -a-> 3 -b-> 4 -a-> 5 -a-> 6
s      0  1  2  3  4  5  6
f(s)   0  0  0  1  2  3  1
- If failed at state 5 on x =>
  - δ(5, x) = fail ("ababax" does not match prefix)
  - f(5) = 3 => δ(3, x) = ??; if fail ("abax" unmatched)
  - f(3) = 1 => δ(1, x) = ??; if fail ("ax" unmatched)
  - f(1) = 0 => δ(0, x) = ??; try x from the initial state (since no partial match in failed prefixes is observed)
  - If δ(., x) is a legal transition at any of these steps, just go ahead to δ(., x)
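The back-off chain f(5) = 3, f(3) = 1, f(1) = 0 can be traced with a short C sketch of the matching loop. The function name `kmp_find` is hypothetical; the keyword "ababaa" and the table values are read off the state diagram and f(s) table on this slide.

```c
#include <assert.h>
#include <string.h>

/* Find the first occurrence of kw in text using a precomputed failure
 * table f[0..m]. State s means "the last s characters read equal the
 * first s characters of kw". Returns the start offset, or -1. */
static int kmp_find(const char *text, const char *kw, const int *f)
{
    int m = (int)strlen(kw);
    int s = 0;

    assert(m > 0);
    for (int i = 0; text[i] != '\0'; i++) {
        /* delta(s, text[i]) fails: back off, input pointer stays put */
        while (s > 0 && text[i] != kw[s])
            s = f[s];
        if (text[i] == kw[s])
            s++;                    /* legal transition: go ahead      */
        if (s == m)
            return i - m + 1;       /* reached the final state         */
    }
    return -1;
}

/* Failure table from the slide for the keyword "ababaa". */
static const int f_ababaa[7] = { 0, 0, 0, 1, 2, 3, 1 };
```

On the text "ababax" the loop reaches state 5, then on x steps through states 3, 1, and 0 without ever moving the input pointer backward, and finally reports no match.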

KMP
Diagram: start -> 0 --"x"--> t --a--> t+1   . . .   ? --"x"--> s --a = b(s+1)--> s+1
- Recursively compute f(s+1) based on f(s) and f(.) of previous states
- If t = f(s), there is a max common string "x" between states 0…t and ?…s
- If further δ(s, a) = s+1 & δ(t, a) = t+1, then "x"||a is the max common prefix-suffix => f(s+1) = t+1

KMP – Recursion – if δ(f(s), a) = fail
Diagram: start -> 0 --"x"--> t (no a-transition to t+1)   . . .   ? --"x"--> s --a = b(s+1)--> s+1
- "y" is a suffix of "x"
- "y"||a is the max common prefix-suffix => f(s+1) = f(f(s))+1 (recursively)
Diagram: start -> 0 --"y"--> f(t) --a--> f(t)+1   . . .   ?? --"y"--> s --a--> s+1

KMP Algorithm
// δ(t, a) = t+1 if b(t+1) = a, otherwise fail
f(0) = f(1) = 0;
for (s = 1; s < Smax; s++) {      // find f(s+1)
    a = b(s+1); t = f(s); f(s+1) = 0;
    while (δ(t, a) == fail && t != 0) t = f(t);
    if (δ(t, a) != fail) f(s+1) = t + 1;
}
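A runnable C version of the failure-function computation might look like the sketch below. The function name `kmp_failure` and the 0-based string indexing are choices made here: the slide's b(s+1) corresponds to `kw[s]`, and δ(t, a) succeeds exactly when `kw[t] == a`.

```c
#include <assert.h>
#include <string.h>

/* Compute the KMP failure function for kw into f[0..m], m = strlen(kw).
 * f[s] is the length of the longest proper prefix of kw[0..s-1] that is
 * also a suffix of it. The caller provides room for m + 1 ints. */
static void kmp_failure(const char *kw, int *f)
{
    int m = (int)strlen(kw);

    assert(kw != NULL && f != NULL);
    f[0] = 0;
    if (m >= 1)
        f[1] = 0;
    for (int s = 1; s < m; s++) {         /* find f(s+1)                 */
        char a = kw[s];                   /* b(s+1) in slide notation    */
        int t = f[s];
        while (t > 0 && kw[t] != a)       /* delta(t, a) fails: back off */
            t = f[t];
        f[s + 1] = (kw[t] == a) ? t + 1 : 0;
    }
}
```

For the keyword "ababaa" this reproduces the table f = 0 0 0 1 2 3 1 shown on the earlier failure-function slides. The final test on `kw[t] == a` matters when t has dropped to 0: a match at the initial state still gives f(s+1) = 1.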