Lecture 2: Lexical Analysis

Lexical Analysis
INPUT: sequence of characters
OUTPUT: sequence of tokens
[Figure: Input → Scanner → Parser. The parser calls Next_token() and receives a token; the scanner calls Next_char() and receives a character. The diagram also shows the Symbol Table.]
A lexical analyzer is generally a subroutine of the parser:
• Simpler design
• Efficient
• Portable

Definitions
• token – a set of strings defining an atomic element with a defined meaning
• pattern – a rule describing a set of strings
• lexeme – a sequence of characters that matches some pattern

Examples
Token         Pattern                  Sample Lexeme
while         while                    while
relation_op   = | != | < | >           <
integer       (0-9)+                   42
string        characters between " "   "hello"

Input string: size := r * 32 + c

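To make the character-to-token step concrete, here is a minimal hand-written scanner sketch in C for the input string above. The token kind names (T_ID, T_ASSIGN, and so on), the Token struct, and the next_token() signature are assumptions made for this illustration; the slides do not define them.

```c
#include <ctype.h>
#include <string.h>

/* Token kinds are hypothetical names chosen for this sketch. */
typedef enum { T_ID, T_ASSIGN, T_INT, T_MUL, T_ADD, T_EOF, T_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[32];   /* the matched characters, truncated if too long */
} Token;

/* Scan one token starting at *p and advance *p past its lexeme. */
Token next_token(const char **p) {
    Token t = { T_EOF, "" };
    const char *s = *p;
    const char *start;
    size_t n;

    while (isspace((unsigned char)*s)) s++;            /* skip white space */
    start = s;

    if (*s == '\0') {
        t.kind = T_EOF;
    } else if (isalpha((unsigned char)*s)) {            /* letter (letter|digit)* */
        while (isalnum((unsigned char)*s)) s++;
        t.kind = T_ID;
    } else if (isdigit((unsigned char)*s)) {            /* digit+ */
        while (isdigit((unsigned char)*s)) s++;
        t.kind = T_INT;
    } else if (s[0] == ':' && s[1] == '=') {            /* assignment operator */
        s += 2;
        t.kind = T_ASSIGN;
    } else if (*s == '*') {
        s++;
        t.kind = T_MUL;
    } else if (*s == '+') {
        s++;
        t.kind = T_ADD;
    } else {                                            /* unknown character */
        s++;
        t.kind = T_ERROR;
    }

    n = (size_t)(s - start);
    if (n >= sizeof t.lexeme) n = sizeof t.lexeme - 1;
    memcpy(t.lexeme, start, n);
    t.lexeme[n] = '\0';
    *p = s;
    return t;
}
```

Calling next_token() repeatedly on the input string yields the lexemes size, :=, r, *, 32, +, c in that order, followed by T_EOF.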
Implementing a Lexical Analyzer
Practical issues:
• Input buffering
• Translating REs into executable form
• Must be able to capture a large number of tokens with a single machine
• Interface to the parser
• Tools

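Capturing Multiple Tokens
Capturing keyword "begin": [Figure: automaton with transitions b, e, g, i, n, then WS]
Capturing variable names: [Figure: automaton with transition A, a loop on AN, then WS]
WS – white space, A – alphabetic, AN – alphanumeric
What if both need to happen at the same time?
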
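Capturing Multiple Tokens
[Figure: combined automaton. One path follows b, e, g, i, n and then WS; edges labeled A-b (alphabetic other than b) and AN lead to a state that loops on AN and ends on WS.]
WS – white space, A – alphabetic, AN – alphanumeric
Machine is much more complicated – just for these two tokens!
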
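The combined machine can be sketched in C as a single loop over one state variable. This is only an illustration under assumptions: the state numbering, the WordKind enum, and the classify_word() name are hypothetical, and the word is taken to end at white space or at the end of the string.

```c
#include <ctype.h>

typedef enum { TOK_BEGIN, TOK_IDENT, TOK_NONE } WordKind;

/* Classify the word at the start of s as the keyword "begin", an
 * identifier (alphabetic followed by alphanumerics), or neither. */
WordKind classify_word(const char *s) {
    static const char keyword[] = "begin";
    /* States 0..4: that many letters of "begin" matched so far.
     * State 5: all of "begin" matched.
     * State 6: left the keyword path, so this can only be an identifier. */
    int state = 0;

    if (!isalpha((unsigned char)*s)) return TOK_NONE;   /* must start with A */

    for (; *s != '\0' && !isspace((unsigned char)*s); s++) {
        unsigned char c = (unsigned char)*s;
        if (!isalnum(c)) return TOK_NONE;               /* not AN: reject */
        if (state < 5 && c == (unsigned char)keyword[state])
            state++;                                    /* still on the keyword path */
        else
            state = 6;                                  /* fall into the identifier state */
    }
    return state == 5 ? TOK_BEGIN : TOK_IDENT;
}
```

Even for two tokens the keyword path and the identifier state have to be tracked together, which is the complication the slide points out; a prefix such as "beg" or an extension such as "beginning" ends up classified as an identifier.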
Lex – Lexical Analyzer Generator
[Figure: Lex specification → Flex/Lex → lex.yy.c → C/C++ compiler → a.out; at run time, input → a.out → tokens]

Lex Specification
Definitions (code, RE):
    %{
    #include <stdio.h>
    int charCount = 0, wordCount = 0, lineCount = 0;
    %}
    word [^ \t\n]+
    %%
Rules (RE/action pairs):
    {word}  { wordCount++; charCount += yyleng; }
    [\n]    { charCount++; lineCount++; }
    .       { charCount++; }
    %%
User routines:
    int main(void) {
        yylex();
        printf("Characters %d, Words: %d, Lines: %d\n",
               charCount, wordCount, lineCount);
        return 0;
    }

Lex definitions section
    %{
    int charCount = 0, wordCount = 0, lineCount = 0;
    %}
    word [^ \t\n]+
• C/C++ code:
  – Surrounded by %{ … %} delimiters
  – Declare any variables used in actions
• RE definitions:
  – Define shorthand for patterns:
        digit  [0-9]
        letter [a-z]
        ident  {letter}({letter}|{digit})*
  – Use shorthand in RE section: {ident}

Lex Regular Expressions
    {word}  { wordCount++; charCount += yyleng; }
    [\n]    { charCount++; lineCount++; }
    .       { charCount++; }
• Match explicit character sequences
  – integer, "+++", \<\>
• Character classes
  – [abcd]
  – [a-zA-Z]
  – [^0-9] – matches non-numeric

Lex Regular Expressions (cont.)
• Alternation
  – twelve | 12
• Closure
  – *   zero or more
  – +   one or more
  – ?   zero or one
  – {number}, {number,number}

Lex Regular Expressions (cont.)
• Other operators
  – .    matches any character except newline
  – ^    matches beginning of line
  – $    matches end of line
  – /    trailing context
  – ()   grouping
  – {}   RE definitions

Lex Matching Rules
• Lex always attempts to match the longest possible string.
• If two rules match (and the matched strings are the same length), the first rule in the specification is used.

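The two matching rules can be sketched as a small rule-selection loop in C. This is not how lex is implemented internally (lex compiles all rules into one automaton); it only mimics the observable behavior. The three matcher functions, the rule order, and the select_rule() name are assumptions made for this example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each matcher returns the length of the longest prefix of s that it
 * matches, or 0 if it does not match at all. */
static size_t match_while(const char *s) {
    return strncmp(s, "while", 5) == 0 ? 5 : 0;
}

static size_t match_ident(const char *s) {
    size_t n = 0;
    if (!isalpha((unsigned char)s[0])) return 0;
    while (isalnum((unsigned char)s[n])) n++;
    return n;
}

static size_t match_int(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Rules in specification order: earlier entries win ties. */
enum { RULE_WHILE, RULE_IDENT, RULE_INT, RULE_COUNT };
typedef size_t (*Matcher)(const char *);
static const Matcher rules[RULE_COUNT] = { match_while, match_ident, match_int };

/* Return the index of the selected rule and store the match length in
 * *len; return -1 if no rule matches. */
int select_rule(const char *s, size_t *len) {
    int best = -1;
    int i;
    *len = 0;
    for (i = 0; i < RULE_COUNT; i++) {
        size_t n = rules[i](s);
        if (n > *len) {      /* strictly longer: ties keep the earlier rule */
            best = i;
            *len = n;
        }
    }
    return best;
}
```

On the input "while" both the keyword rule and the identifier rule match five characters, so the earlier rule wins; on "whileloop" the identifier rule matches nine characters and wins on length.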
Lex Operators
Precedence, highest first: closure, concatenation, alternation
Special lex characters: - / * + > " { } . $ ( ) | % [ ] ^
Special lex characters inside [ ]: - [ ] ^

Examples
• a.*z
• (ab)+
• [0-9]{1,5}
• (ab|cd)?ef = abef, cdef, ef
• -?[0-9]\.[0-9]

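These example patterns can be tried out without lex by using POSIX extended regular expressions from <regex.h> as a stand-in. The sketch below is an approximation under assumptions: prefix_match_len() is a hypothetical helper name, it anchors the pattern at the start of the input the way a scanner matches at the current position, and POSIX syntax differs from lex in places (for example, "." here also matches a newline).

```c
#include <regex.h>
#include <stdio.h>

/* Return the length of the match of pattern at the very start of input,
 * or -1 if there is no match there or the pattern cannot be compiled. */
long prefix_match_len(const char *pattern, const char *input) {
    char anchored[256];
    regex_t re;
    regmatch_t m[1];
    long len = -1;
    int n;

    /* ^( ... ) anchors at the start and keeps any alternation grouped. */
    n = snprintf(anchored, sizeof anchored, "^(%s)", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored) return -1;
    if (regcomp(&re, anchored, REG_EXTENDED) != 0) return -1;

    if (regexec(&re, input, 1, m, 0) == 0)
        len = (long)(m[0].rm_eo - m[0].rm_so);   /* comparable to yyleng */

    regfree(&re);
    return len;
}
```

For instance, [0-9]{1,5} applied to "1234567" consumes five characters, and (ab|cd)?ef applied to "cdefg" consumes four.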
Lex Actions
Lex actions are C (C++) code that implements some required functionality.
• The default action, for input that matches no rule, is to echo it to the output
• Can ignore input (empty action)
• ECHO – macro that prints out the matched string
• yytext – matched string
• yyleng – length of matched string

User Subroutines
    int main(void) {
        yylex();
        printf("Characters %d, Words: %d, Lines: %d\n",
               charCount, wordCount, lineCount);
        return 0;
    }
• C/C++ code
• Copied directly into the lexer code
• User can supply 'main' or use the default

Uses for Lex
• Transforming input – convert input from one form to another (example 1). yylex() is called once; return is not used in the specification.
• Extracting information – scan the text and return some information (example 2). yylex() is called once; return is not used in the specification.
• Extracting tokens – standard use with a compiler (example 3). Uses return to give the next token to the caller.

Regular expression
• A regular expression is a kind of pattern that can be applied to text (Strings, in Java)
• A regular expression either matches the text (or part of the text), or it fails to match
• Regular expressions are an extremely useful tool for manipulating text
  – Regular expressions are heavily used in the automatic generation of Web pages

Pattern matching applications:
• Scan for virus signatures
• Process natural languages
• Search for information using Google
• Search and replace in word processors
• Filter text (spam, malware)
• Validate data-entry fields (dates, email, URL)

Basic Operation
• Notation to specify a set of strings

Regular expression: examples
• Notation is surprisingly expressive.

Regular Expression     Matches                       Yes                                 No
a* | (a*ba*ba*ba*)*    multiple of 3 b's             ε, bbb, aaa, abbbaaa, bbbaababba    b, bb, abbaaaa, baabbbaa
a | a(a|b)*a           begins and ends with a        a, aba, aa, abba                    ε, ab, ba
(a|b)*abba(a|b)*       contains the substring abba   abba, bbabbabb                      ε, abb, bbaaba

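One way to sanity-check the Yes and No examples is to code each English description directly as a C predicate and run the sample strings through it. The function names below are hypothetical, the inputs are assumed to contain only the letters a and b, and ε is written as the empty string "".

```c
#include <string.h>

/* a* | (a*ba*ba*ba*)* : the number of b's is a multiple of 3. */
int b_count_multiple_of_3(const char *s) {
    int count = 0;
    for (; *s != '\0'; s++)
        if (*s == 'b') count++;
    return count % 3 == 0;
}

/* a | a(a|b)*a : non-empty, first and last characters are both a. */
int begins_and_ends_with_a(const char *s) {
    size_t n = strlen(s);
    return n > 0 && s[0] == 'a' && s[n - 1] == 'a';
}

/* (a|b)*abba(a|b)* : abba occurs somewhere in the string. */
int contains_abba(const char *s) {
    return strstr(s, "abba") != NULL;
}
```

Each predicate returns 1 for the strings in the Yes column of its row and 0 for those in the No column.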
Using Regular Expressions
• Built in to Java, Perl, PHP, Unix, .NET, …
• Additional operations typically added for convenience.
  – Ex. [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*

Operation              Regular Expression   Yes                    No
Concatenation          hello                Othello, say hello     Hello
Any single character   ..oo..oo.            bloodroot, spoonfood   cookbook, choo

Using Regular Expressions
Operation                       Regular Expression   Yes                   No
Replication                     a(bc)*de             abcde, abcbcde        abc
One or more                     a(bc)+de             abcbcde               abc
Once or not at all              a(bc)?de             abcde                 abcbcde
Character classes               [a-m]*               blackmail, imbecile   above, below
Negation of character classes   [^aeiou]             b, c                  a, e
Exactly N times                 [^aeiou]{6}          rhythm, syzygy        rhythms, allowed
Between M and N times           [a-z]{4,6}           spider, tiger         jellyfish, cow
Whitespace characters           [a-z\s]*hello        say hello             Othello

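The rows in these tables mix two notions of matching: the concatenation example (hello inside Othello) succeeds when any part of the text matches, while rows such as [^aeiou]{6} only make sense when the whole text must match. The C sketch below shows both modes using POSIX <regex.h>. The re_test() helper is a hypothetical name, and since \s inside brackets is not POSIX syntax, the test of the whitespace row would spell it [[:space:]].

```c
#include <regex.h>
#include <stdio.h>

/* whole == 0: succeed if any part of text matches pattern.
 * whole != 0: succeed only if the entire text matches pattern.
 * Returns 1 on a match, 0 on no match, -1 if the pattern is invalid. */
int re_test(const char *pattern, const char *text, int whole) {
    char buf[256];
    regex_t re;
    int n, rc;

    if (whole)
        n = snprintf(buf, sizeof buf, "^(%s)$", pattern);   /* anchor both ends */
    else
        n = snprintf(buf, sizeof buf, "%s", pattern);       /* search anywhere */
    if (n < 0 || (size_t)n >= sizeof buf) return -1;

    if (regcomp(&re, buf, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With whole set to 0, hello is found in "Othello" but not in "Hello"; with whole set to 1, [^aeiou]{6} accepts "rhythm" and rejects "rhythms".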


