Lecture 2: Lexical Analysis

Lexical Analysis
INPUT: sequence of characters
OUTPUT: sequence of tokens
[Diagram: Input → character, via Next_char() → Scanner → token, via Next_token() → Parser; Symbol Table]
A lexical analyzer is generally a subroutine of the parser:
• Simpler design
• Efficient
• Portable

Definitions
• token – a set of strings defining an atomic element with a defined meaning
• pattern – a rule describing a set of strings
• lexeme – a sequence of characters that matches some pattern

Examples
Token         Pattern                   Sample Lexeme
while         while                     while
relation_op   = | != | < | >            <
integer       [0-9]+                    42
string        characters between " "    "hello"

Input string: size := r * 32 + c
The scanner turns it into seven <token, lexeme> pairs, one for each of the lexemes size, :=, r, *, 32, +, c.
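
The scan of this input string can be sketched as a small hand-written scanner in C. This is only an illustration, not the scanner from the lecture: the token names (TOK_ID, TOK_INT, TOK_ASSIGN, TOK_OP), the Token struct, and the cursor-based next_token() signature are all assumptions, and := is assumed to be a single assignment lexeme.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token classes; the names are assumptions for this sketch. */
typedef enum { TOK_ID, TOK_INT, TOK_ASSIGN, TOK_OP, TOK_EOF, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;  /* first character of the lexeme */
    size_t length;      /* number of characters in the lexeme */
} Token;

/* Scan one token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor) {
    const char *p = *cursor;
    while (*p == ' ' || *p == '\t' || *p == '\n') p++;   /* skip white space */

    Token tok = { TOK_ERROR, p, 0 };
    if (*p == '\0') {
        tok.kind = TOK_EOF;
    } else if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p)) p++;          /* letter, then letters/digits */
        tok.kind = TOK_ID;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;          /* one or more digits */
        tok.kind = TOK_INT;
    } else if (p[0] == ':' && p[1] == '=') {
        p += 2;
        tok.kind = TOK_ASSIGN;
    } else if (*p == '+' || *p == '*') {
        p++;
        tok.kind = TOK_OP;
    } else {
        p++;   /* consume the unknown character so scanning can continue */
    }
    tok.length = (size_t)(p - tok.start);
    *cursor = p;
    return tok;
}
```

A parser would call next_token() repeatedly with the same cursor until it receives TOK_EOF; each returned Token carries the token class together with the position and length of its lexeme.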

Implementing a Lexical Analyzer
Practical Issues:
• Input buffering
• Translating regular expressions (REs) into executable form
• Must be able to capture a large number of tokens with a single machine
• Interface to parser
• Tools

Capturing Multiple Tokens
Capturing keyword "begin"
[State diagram: transitions on b, e, g, i, n, then WS]
Capturing variable names
[State diagram: transition on A, loop on AN, then WS]
WS – white space
A – alphabetic
AN – alphanumeric
What if both need to happen at the same time?

Capturing Multiple Tokens
[State diagram: one combined machine with transitions on b, e, g, i, n, A-b, AN, and WS]
WS – white space
A – alphabetic
AN – alphanumeric
The machine is much more complicated, just for these two tokens!
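
A combined machine of this kind can be sketched in C as follows. The state numbering, the Result names, and the classify() function are assumptions made for illustration, and the diagram's exact states are not recoverable from the slide. The function receives the characters read before the terminating white space.

```c
#include <ctype.h>

typedef enum { RESULT_KEYWORD_BEGIN, RESULT_IDENTIFIER, RESULT_REJECT } Result;

/* States 0..4: the input so far is the first `state` letters of "begin".
   State 5: exactly "begin" has been read.
   State 6: some other variable name has been read. */
enum { STATE_KEYWORD = 5, STATE_IDENT = 6 };

Result classify(const char *word) {
    static const char keyword[] = "begin";
    int state = 0;

    for (const char *p = word; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (state < STATE_KEYWORD && c == (unsigned char)keyword[state]) {
            state++;                                  /* still on the keyword path */
        } else if (state == 0) {
            if (!isalpha(c)) return RESULT_REJECT;    /* names start with a letter */
            state = STATE_IDENT;                      /* the "A-b" edge */
        } else if (isalnum(c)) {
            state = STATE_IDENT;                      /* left the keyword path */
        } else {
            return RESULT_REJECT;
        }
    }
    if (state == STATE_KEYWORD) return RESULT_KEYWORD_BEGIN;
    if (state == 0) return RESULT_REJECT;             /* empty input */
    return RESULT_IDENTIFIER;                         /* includes prefixes such as "beg" */
}
```

Even with two tokens, every keyword state needs an extra exit into the identifier state, which is the complication the slide points at.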

Lex – Lexical Analyzer Generator
[Diagram: Lex specification → Flex/Lex → lex.yy.c → C/C++ compiler → a.out; input → a.out → tokens]

Lex Specification
Definitions (code, RE):
    %{
    int charCount=0, wordCount=0, lineCount=0;
    %}
    word [^ \t\n]*
    %%
Rules (RE/action pairs):
    {word}  {wordCount++; charCount += yyleng; }
    [\n]    {charCount++; lineCount++; }
    .       {charCount++; }
    %%
User routines:
    int main(void) {
        yylex();
        printf("Characters %d, Words: %d, Lines: %d\n", charCount, wordCount, lineCount);
        return 0;
    }
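
For comparison, the counting that this specification performs can be written by hand in plain C. This is a rough equivalent rather than the code Lex generates; the Counts struct and count_text() are names assumed for the sketch, and it reads from a string instead of standard input.

```c
typedef struct { int chars, words, lines; } Counts;

/* A word is a maximal run of characters other than space, tab, or newline,
   mirroring the `word` definition in the specification. */
Counts count_text(const char *text) {
    Counts c = { 0, 0, 0 };
    int in_word = 0;
    for (const char *p = text; *p != '\0'; p++) {
        c.chars++;                         /* every character is counted once */
        if (*p == '\n') c.lines++;
        if (*p == ' ' || *p == '\t' || *p == '\n') {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;                   /* first character of a new word */
            c.words++;
        }
    }
    return c;
}
```

The Lex version expresses the same three cases as three pattern/action rules and leaves the loop and the state tracking to the generated scanner.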

Lex definitions section
    %{
    int charCount=0, wordCount=0, lineCount=0;
    %}
    word [^ \t\n]*
• C/C++ code:
  – Surrounded by %{ … %} delimiters
  – Declare any variables used in actions
• RE definitions:
  – Define shorthand for patterns:
        digit   [0-9]
        letter  [a-z]
        ident   {letter}({letter}|{digit})*
  – Use shorthand in RE section: {ident}

Lex Regular Expressions
    {word}  {wordCount++; charCount += yyleng; }
    [\n]    {charCount++; lineCount++; }
    .       {charCount++; }
• Match explicit character sequences
  – integer, "+++", \<\>
• Character classes
  – [abcd]
  – [a-zA-Z]
  – [^0-9] – matches non-numeric

Lex Regular Expressions (cont.)
• Alternation
  – twelve|12
• Closure
  – *  zero or more
  – +  one or more
  – ?  zero or one
  – {number}, {number,number}  a fixed count or a range of repetitions

Lex Regular Expressions (cont.)
• Other operators
  – .   matches any character except newline
  – ^   matches beginning of line
  – $   matches end of line
  – /   trailing context
  – ()  grouping
  – {}  RE definitions

Lex Matching Rules
• Lex always attempts to match the longest possible string.
• If two rules match strings of the same length, the first rule in the specification is used.
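
The two matching rules can be sketched in C with a small rule table. This is an illustration of the selection policy only, not how Lex implements it (Lex compiles all rules into one automaton); the three hand-written matchers and the select_rule() function are assumptions for the example.

```c
#include <stddef.h>
#include <string.h>

/* Each matcher returns the length of the longest prefix of `s` that it
   accepts, or 0 if it does not match at all. */
typedef size_t (*Matcher)(const char *s);

static size_t match_if(const char *s) {          /* rule 0: the keyword "if" */
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}
static size_t match_ident(const char *s) {       /* rule 1: [a-z]+ */
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z') n++;
    return n;
}
static size_t match_number(const char *s) {      /* rule 2: [0-9]+ */
    size_t n = 0;
    while (s[n] >= '0' && s[n] <= '9') n++;
    return n;
}

/* Rules in specification order. */
static const Matcher rules[] = { match_if, match_ident, match_number };
enum { RULE_COUNT = sizeof rules / sizeof rules[0] };

/* Returns the index of the winning rule, or -1 if no rule matches.
   *length receives the length of the winning match. */
int select_rule(const char *s, size_t *length) {
    int best = -1;
    *length = 0;
    for (int i = 0; i < RULE_COUNT; i++) {
        size_t n = rules[i](s);
        if (n > *length) {      /* strictly longer only, so earlier rules keep ties */
            *length = n;
            best = i;
        }
    }
    return best;
}
```

On the input "if" both rule 0 and rule 1 match two characters, so the earlier keyword rule wins; on "iffy" the identifier rule matches four characters and wins on length.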

Lex Operators
Precedence, highest first:
• closure
• concatenation
• alternation
Special lex characters: - \ / * + > " { } . $ ( ) | % [ ] ^
Special lex characters inside [ ]: - \ [ ] ^

Examples
• a.*z
• (ab)+
• [0-9]{1,5}
• (ab|cd)?ef = abef, cdef, ef
• -?[0-9]\.[0-9]
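
These patterns can be tried out from C with the POSIX regex.h interface, whose extended syntax covers the operators used here. This is a sketch under assumptions: it needs a POSIX system, full_match() is a helper name invented for the example, the last pattern is assumed to contain an escaped dot, and POSIX and Lex differ in details (for instance, the POSIX dot also matches a newline by default).

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if `text` as a whole matches the extended regular expression
   `pattern`, 0 if it does not, and -1 if the pattern is too long or fails
   to compile. */
int full_match(const char *pattern, const char *text) {
    char anchored[256];
    /* Wrap the pattern so that it must cover the entire text. */
    int written = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (written < 0 || (size_t)written >= sizeof anchored) return -1;

    regex_t re;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) return -1;
    int status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0 ? 1 : 0;
}
```

For example, full_match("(ab|cd)?ef", "cdef") reports a match, while full_match("[0-9]{1,5}", "123456") does not, because the string has six digits.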

Lex Actions
Lex actions are C (C++) code that implements some required functionality.
• Default action is to echo to output
• Can ignore input (empty action)
• ECHO – macro that prints out matched string
• yytext – matched string
• yyleng – length of matched string

User Subroutines
    int main(void) {
        yylex();
        printf("Characters %d, Words: %d, Lines: %d\n", charCount, wordCount, lineCount);
        return 0;
    }
• C/C++ code
• Copied directly into the lexer code
• User can supply 'main' or use the default provided by the Lex library

Uses for Lex
• Transforming input – convert input from one form to another (example 1). yylex() is called once; return is not used in the specification.
• Extracting information – scan the text and return some information (example 2). yylex() is called once; return is not used in the specification.
• Extracting tokens – standard use with a compiler (example 3). Uses return to give the next token to the caller.

Regular expression
• A regular expression is a kind of pattern that can be applied to text (Strings, in Java).
• A regular expression either matches the text (or part of the text), or it fails to match.
• Regular expressions are an extremely useful tool for manipulating text.
  – Regular expressions are heavily used in the automatic generation of Web pages.

Pattern matching applications:
• Scan for virus signatures
• Process natural languages
• Search for information using Google
• Search and replace in word processors
• Filter text (spam, malware)
• Validate data-entry fields (dates, email, URL)

Basic Operation
• Notation to specify a set of strings

Regular expression: examples
• Notation is surprisingly expressive.

Regular Expression     Matches strings that         Yes                                 No
a* | (a*ba*ba*ba*)*    have a multiple of 3 b's     ε, bbb, aaa, abbbaaa, bbbaababbaa   b, bb, abbaaaa, baabbbaa
a | a(a|b)*a           begin and end with a         a, aba, aa, abba                    ε, ab, ba
(a|b)*abba(a|b)*       contain the substring abba   bbabbabb, abba                      ε, abb, bbaaba
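
The first row can be checked without a regex library: the language of a* | (a*ba*ba*ba*)* over the alphabet {a, b} is the set of strings whose number of b's is a multiple of three, and a three-state machine recognizes exactly that. The function name below is an assumption for this sketch.

```c
/* Returns 1 if `s` consists only of 'a' and 'b' and contains a multiple of
   three b's (zero included), otherwise 0. */
int multiple_of_three_bs(const char *s) {
    int state = 0;                          /* number of b's seen so far, modulo 3 */
    for (; *s != '\0'; s++) {
        if (*s == 'b') {
            state = (state + 1) % 3;
        } else if (*s != 'a') {
            return 0;                       /* character outside the alphabet {a, b} */
        }
    }
    return state == 0;                      /* accept only in the start state */
}
```

The empty string ε corresponds to "" in C and is accepted, since zero is a multiple of three.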

Using Regular Expressions
• Built in to Java, Perl, PHP, Unix, .NET, ….
• Additional operations typically added for convenience.
• Ex. [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*

Operation              Regular Expression   Yes                     No
Concatenation          hello                Othello, say hello      Hello
Any single character   ..oo..oo.            bloodroot, spoonfood    cookbook, choo

Using Regular Expressions

Operation                       Regular Expression   Yes                   No
Replication                     a(bc)*de             abcde, abcbcde        abc
One or more                     a(bc)+de             abcbcde               abc
Once or not at all              a(bc)?de             abcde                 abcbcde
Character classes               [a-m]*               blackmail, imbecile   above, below
Negation of character classes   [^aeiou]             b, c                  a, e
Exactly N times                 [^aeiou]{6}          rhythm, syzygy        rhythms, allowed
Between M and N times           [a-z]{4,6}           spider, tiger         jellyfish, cow
Whitespace characters           [a-z\s]*hello        say hello             Othello