

Lecture 2: Lexical Analysis

Lexical Analysis
INPUT: sequence of characters
OUTPUT: sequence of tokens
[Diagram: Input, Scanner, Parser, Symbol Table. The scanner reads input characters through Next_char(); the parser obtains tokens from the scanner through Next_token().]
A lexical analyzer is generally a subroutine of the parser:
• Simpler design
• Efficient
• Portable

Definitions
• token: a set of strings defining an atomic element with a defined meaning
• pattern: a rule describing a set of strings
• lexeme: a sequence of characters that matches some pattern

Examples

Token          Pattern                   Sample lexeme
while
relation_op    = | != | < | >            <
integer        [0-9]+                    42
string         characters between " "    "hello"

Input string: size := r * 32 + c
<token, lexeme> pairs: one for each of the seven lexemes size, :=, r, *, 32, +, c
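A minimal sketch of how the input string above could be split into <token, lexeme> pairs. Python is used so the snippet runs without a generator. The token names (IDENT, ASSIGN, INTEGER, OP) are hypothetical and are not taken from the slide.

```python
import re

# Hypothetical token names and patterns, chosen only for illustration.
TOKEN_SPEC = [
    ("ASSIGN", r":="),
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("INTEGER", r"[0-9]+"),
    ("OP", r"[*+]"),
    ("SKIP", r"\s+"),  # white space separates lexemes and is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return the list of (token, lexeme) pairs found in text."""
    pairs = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs


pairs = tokenize("size := r * 32 + c")
for token, lexeme in pairs:
    print(f"<{token}, {lexeme}>")
```

The same lexeme class (here IDENT) is reported for size, r, and c; only the lexeme differs.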

Implementing a Lexical Analyzer
Practical issues:
• Input buffering
• Translating REs into executable form
• Must be able to capture a large number of tokens with a single machine
• Interface to parser
• Tools

Capturing Multiple Tokens
Capturing keyword "begin": [state diagram: transitions on b, e, g, i, n, then WS]
Capturing variable names: [state diagram: transition on A, loop on AN, then WS]
WS: white space; A: alphabetic; AN: alphanumeric
What if both need to happen at the same time?

Capturing Multiple Tokens
[State diagram: one machine combining both recognizers, with transitions on b, e, g, i, n, A-b (alphabetic other than b), AN, and WS]
WS: white space; A: alphabetic; AN: alphanumeric
The machine is much more complicated, just for these two tokens!
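One way to picture the combined machine is a hand-coded state machine, sketched below in Python. The state numbering and the `classify` helper are assumptions made for illustration; the slide's own diagram may be laid out differently. White space is assumed to have already delimited the word.

```python
KEYWORD = "begin"
IDENT, DEAD = "ident", "dead"  # two non-numeric states


def classify(word):
    """Classify one white-space-delimited word as KEYWORD, IDENT, or ERROR.

    Integer states 0..5 record how many characters of "begin" have matched.
    Any deviation falls into the identifier state (or the dead state).
    """
    state = 0
    for ch in word:
        if state == DEAD:
            break
        if state == IDENT:
            state = IDENT if ch.isalnum() else DEAD
        elif state < len(KEYWORD) and ch == KEYWORD[state]:
            state += 1  # still following the keyword path
        elif state == 0:
            state = IDENT if ch.isalpha() else DEAD  # names start with a letter
        else:
            state = IDENT if ch.isalnum() else DEAD  # left the keyword path
    if state == len(KEYWORD):
        return "KEYWORD"
    if state == IDENT or (isinstance(state, int) and state > 0):
        return "IDENT"
    return "ERROR"


for w in ["begin", "beg", "begins", "x1", "1x"]:
    print(w, classify(w))
```

Every keyword state needs an extra "escape" transition into the identifier state, which is why the combined diagram grows so quickly.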

Lex: Lexical Analyzer Generator
Lex specification -> Flex/Lex -> lex.yy.c -> C/C++ compiler -> a.out
input -> a.out -> tokens

Lex Specification
Definitions (code, RE):
    %{
    #include <stdio.h>
    int charCount=0, wordCount=0, lineCount=0;
    %}
    word [^ \t\n]+
    %%
Rules (RE/action pairs):
    {word}  {wordCount++; charCount += yyleng; }
    [\n]    {charCount++; lineCount++; }
    .       {charCount++; }
    %%
User routines:
    int main(void) {
        yylex();
        printf("Characters %d, Words: %d, Lines: %d\n",
               charCount, wordCount, lineCount);
        return 0;
    }
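For readers without Lex at hand, the counting logic of this specification can be approximated in Python. This is a sketch, not generated scanner code: the `count` function and the group names are assumptions, and the word pattern is taken to be "one or more characters other than space, tab, or newline".

```python
import re

# One alternative per rule of the specification. At any position at most one
# alternative can match, so Python's ordered alternation is safe here.
RULES = re.compile(r"(?P<word>[^ \t\n]+)|(?P<newline>\n)|(?P<other>.)")


def count(text):
    """Return (characters, words, lines), mirroring the three rule actions."""
    chars = words = lines = 0
    for m in RULES.finditer(text):
        if m.lastgroup == "word":
            words += 1
            chars += len(m.group())  # plays the role of yyleng
        elif m.lastgroup == "newline":
            chars += 1
            lines += 1
        else:  # a single space or tab
            chars += 1
    return chars, words, lines


chars, words, lines = count("hello world\nfoo\n")
print(f"Characters {chars}, Words: {words}, Lines: {lines}")
```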

Lex definitions section
    %{
    int charCount=0, wordCount=0, lineCount=0;
    %}
    word [^ \t\n]+
• C/C++ code:
    - Surrounded by %{ ... %} delimiters
    - Declare any variables used in actions
• RE definitions:
    - Define shorthand for patterns:
          digit   [0-9]
          letter  [a-z]
          ident   {letter}({letter}|{digit})*
    - Use shorthand in the RE section: {ident}

Lex Regular Expressions
    {word}  {wordCount++; charCount += yyleng; }
    [\n]    {charCount++; lineCount++; }
    .       {charCount++; }
• Match explicit character sequences
    - integer, "+++", \<\>
• Character classes
    - [abcd]
    - [a-zA-Z]
    - [^0-9]: matches any non-numeric character

Lex Regular Expressions (cont.)
• Alternation
    - twelve|12
• Closure
    - * : zero or more
    - + : one or more
    - ? : zero or one
    - {number}, {number,number} : an exact repetition count, or a range of counts

Lex Regular Expressions (cont.)
• Other operators
    - .  : matches any character except newline
    - ^  : matches beginning of line
    - $  : matches end of line
    - /  : trailing context
    - () : grouping
    - {} : RE definitions

Lex Matching Rules
• Lex always attempts to match the longest possible string.
• If two rules match strings of the same length, the first rule in the specification is used.
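The two matching rules can be imitated with a small loop, sketched here in Python. The `next_token` helper and the rule names IF and IDENT are hypothetical. Python's `re` engine does not itself guarantee the longest match for a single pattern with alternation, so this is only faithful for simple patterns like the ones shown.

```python
import re


def next_token(rules, text, pos=0):
    """Pick the rule with the longest match at pos; earlier rules win ties."""
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        if m is None or not m.group():
            continue  # no match, or an empty match that consumes nothing
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if best is None or len(m.group()) > len(best[1]):
            best = (name, m.group())
    return best


rules = [("IF", r"if"), ("IDENT", r"[a-z]+")]
print(next_token(rules, "if"))    # both rules match 2 characters
print(next_token(rules, "iffy"))  # IDENT matches 4 characters, IF only 2
```

Listing keywords before the identifier rule is what lets a keyword win the tie while longer names still fall through to the identifier rule.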

Lex Operators
Precedence, highest first: closure, concatenation, alternation
Special lex characters: - / * + > " { } . $ ( ) | % [ ] ^
Special lex characters inside [ ]: - [ ] ^

Examples
• a.*z
• (ab)+
• [0-9]{1,5}
• (ab|cd)?ef = abef, cdef, ef
• -?[0-9]\.[0-9]
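These example patterns happen to use syntax that Python's `re` module shares with Lex, so they can be tried out directly. The sample strings below are invented for illustration, and the last pattern is assumed to contain an escaped dot (`\.`), that is, a literal decimal point.

```python
import re

# pattern -> (strings that should match in full, strings that should not)
EXAMPLES = {
    r"a.*z": (["az", "abcz"], ["abc", "za"]),
    r"(ab)+": (["ab", "abab"], ["", "aba"]),
    r"[0-9]{1,5}": (["7", "12345"], ["", "123456"]),
    r"(ab|cd)?ef": (["abef", "cdef", "ef"], ["abcdef", "ab"]),
    r"-?[0-9]\.[0-9]": (["-1.5", "3.7"], ["12.5", "1x5"]),
}

results = {}
for pattern, (good, bad) in EXAMPLES.items():
    ok = all(re.fullmatch(pattern, s) for s in good)
    rejected = not any(re.fullmatch(pattern, s) for s in bad)
    results[pattern] = ok and rejected
    print(pattern, "behaves as expected:", results[pattern])
```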

Lex Actions
Lex actions are C (C++) code that implements the required functionality.
• Default action (for input that matches no rule) is to echo it to the output
• Can ignore input (empty action)
• ECHO: macro that prints out the matched string
• yytext: matched string
• yyleng: length of matched string

User Subroutines
    int main(void) {
        yylex();
        printf("Characters %d, Words: %d, Lines: %d\n",
               charCount, wordCount, lineCount);
        return 0;
    }
• C/C++ code
• Copied directly into the lexer code
• User can supply 'main' or use the default

Uses for Lex
• Transforming input: convert input from one form to another (example 1). yylex() is called once; return is not used in the specification.
• Extracting information: scan the text and return some information (example 2). yylex() is called once; return is not used in the specification.
• Extracting tokens: standard use with a compiler (example 3). Uses return to give the next token to the caller.

Regular expression
• A regular expression is a kind of pattern that can be applied to text (Strings, in Java)
• A regular expression either matches the text (or part of the text), or it fails to match
• Regular expressions are an extremely useful tool for manipulating text
    - Regular expressions are heavily used in the automatic generation of Web pages

Pattern matching applications:
• Scan for virus signatures
• Process natural languages
• Search for information using Google
• Search and replace in word processors
• Filter text (spam, malware)
• Validate data-entry fields (dates, email, URL)

Basic Operation
• Notation to specify a set of strings

Regular expression: examples
• Notation is surprisingly expressive.

Regular expression      Meaning                        Yes                                  No
a* | (a*ba*ba*ba*)*     multiple of 3 b's              ε, bbb, aaa, abbbaaa, bbbaababbaa    b, bb, abbaaaa, baabbbaa
a | a(a|b)*a            begins and ends with a         a, aba, aa, abba                     ε, ab, ba
(a|b)*abba(a|b)*        contains the substring abba    bbabbabb, abba                       ε, abb, bbaaba
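The three rows can be checked mechanically. The sketch below uses Python's `re.fullmatch`, writes ε as the empty string, and assumes the Yes/No strings are grouped per row as the flattened slide text suggests (the longest Yes string in the first row is read as bbbaababbaa).

```python
import re

# (pattern, strings in the language, strings not in the language)
ROWS = [
    (r"a*|(a*ba*ba*ba*)*",
     ["", "bbb", "aaa", "abbbaaa", "bbbaababbaa"],
     ["b", "bb", "abbaaaa", "baabbbaa"]),
    (r"a|a(a|b)*a",
     ["a", "aba", "aa", "abba"],
     ["", "ab", "ba"]),
    (r"(a|b)*abba(a|b)*",
     ["bbabbabb", "abba"],
     ["", "abb", "bbaaba"]),
]

checks = []
for pattern, yes, no in ROWS:
    accepted = all(re.fullmatch(pattern, s) is not None for s in yes)
    rejected = all(re.fullmatch(pattern, s) is None for s in no)
    checks.append(accepted and rejected)
    print(f"{pattern:22} yes-column accepted: {accepted}, no-column rejected: {rejected}")
```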

Using Regular expression
• Built in to Java, Perl, PHP, Unix, .NET, ...
• Additional operations typically added for convenience.
    - Ex. [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*

Operation               Regular expression    Yes                     No
Concatenation           hello                 Othello, say hello      Hello
Any single character    ..oo..oo.             bloodroot, spoonfood    cookbook, choo
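A quick check, in Python, of what the `[a-e]+` shorthand expands to: `+` means one or more, so the long form needs one mandatory copy of the group before the starred copy. The sample strings are invented for illustration.

```python
import re

short = re.compile(r"[a-e]+")
long_form = re.compile(r"(a|b|c|d|e)(a|b|c|d|e)*")
star_only = re.compile(r"(a|b|c|d|e)*")

samples = ["", "a", "bed", "decade", "face", "xyz"]

# The shorthand and the long form accept exactly the same sample strings.
agree = all(
    bool(short.fullmatch(s)) == bool(long_form.fullmatch(s)) for s in samples
)
print("short and long form agree:", agree)

# The starred group alone also accepts the empty string, which [a-e]+ rejects.
print("star form accepts empty string:", bool(star_only.fullmatch("")))
print("[a-e]+ accepts empty string:", bool(short.fullmatch("")))
```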

Using Regular expression

Operation                        Regular expression    Yes                    No
Replication                      a(bc)*de              abcde, abcbcde         abc
One or more                      a(bc)+de              abcbcde                abc
Once or not at all               a(bc)?de              abcde                  abcbcde
Character classes                [a-m]*                blackmail, imbecile    above, below
Negation of character classes    [^aeiou]              b, c                   a, e
Exactly N times                  [^aeiou]{6}           rhythm, syzygy         rhythms, allowed
Between M and N times            [a-z]{4,6}            spider, tiger          jellyfish, cow
Whitespace characters            [a-z\s]*hello         say hello              Othello
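The operations in this table can be exercised with Python's `re` module, whose syntax for these operators matches the table. This is a sketch under two assumptions: each pattern must match the whole string (`re.fullmatch`), and the Yes/No strings are grouped per row as the flattened slide text suggests.

```python
import re

# (operation, pattern, strings that match, strings that do not)
TABLE = [
    ("Replication", r"a(bc)*de", ["abcde", "abcbcde"], ["abc"]),
    ("One or more", r"a(bc)+de", ["abcbcde"], ["abc"]),
    ("Once or not at all", r"a(bc)?de", ["abcde"], ["abcbcde"]),
    ("Character classes", r"[a-m]*", ["blackmail", "imbecile"], ["above", "below"]),
    ("Negation of character classes", r"[^aeiou]", ["b", "c"], ["a", "e"]),
    ("Exactly N times", r"[^aeiou]{6}", ["rhythm", "syzygy"], ["rhythms", "allowed"]),
    ("Between M and N times", r"[a-z]{4,6}", ["spider", "tiger"], ["jellyfish", "cow"]),
    ("Whitespace characters", r"[a-z\s]*hello", ["say hello"], ["Othello"]),
]

outcome = {}
for operation, pattern, yes, no in TABLE:
    accepted = all(re.fullmatch(pattern, s) is not None for s in yes)
    rejected = all(re.fullmatch(pattern, s) is None for s in no)
    outcome[operation] = accepted and rejected
    print(f"{operation:30} {pattern:16} consistent: {outcome[operation]}")
```

Whole-string matching is what makes "rhythms" a No for `[^aeiou]{6}`: a substring search would find six consonants inside it.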