5e750ae2f68367cc02bc6fce6e9a1f69.ppt
- Number of slides: 46
Scanning & Parsing with Lex and YACC
Hans-Arno Jacobsen, ECE 297
Can we generate code to support mundane coding tasks and save time? Powerful, but not easy. An example is given for Milestone 1.
• Submissions: 99
• Average for A2: 71%
• Early submission bonus: 1
• Full marks: 5
• 16 teams attempted nonce bonus
  – 7 got full marks
• 7 teams attempted ACC bonus
  – 7 got full marks
CoursePeer – try it out!
• Developed by a former ECE 297 student
  – Many of the videos under tips & tricks are from him too
• Short video about CoursePeer
• To sign up and auto-enrol under ECE 297, use this link
  – http://www.crspr.com/?rid=339
• Will have a quick demo and use it on Wednesday for our Q&A session
Know your tools!
• Can we generate code based on a specification of what we want?
• Is the specification simpler than writing a program for doing the same task?
• Fully automated program generation has been a dream since the early days of computing.
Where do we need parsing in the storage server?
• Configuration file (file)
• Bulk loading of data files (file)
• Protocol messages (network)
• Command line arguments (string)
Parsing
• default.conf, the way the disk may see it:
  server_host localhost\nserver_port 1111\ntable marks\n# This data directory may be an absolute or relative path.\ndata_directory ./data\n\n\n EOF
• The same content, line by line:
  server_host localhost
  server_port 1111
  table marks
  data_directory ./data
• Tokens: PROPERTY VALUE (TABLE-NAME)+ PROPERTY VALUE
Scenarios
Where would we like to save time by writing a quick language processor? Conceptually speaking, and in our storage servers:
• Languages
  – Data description language
  – Script language
  – Markup language
  – Data schema & data
  – Query language
  – Output formatting (Web, LaTeX, PDF, Word, Excel)
• System configurations
• Storage server configuration
• Workload generation
• Benchmarking
Parser generation from 30K feet
• Specification (written by developer) → Generator → Generated code
• Generated code + Other code (written by developer) → Compiler / Linker → Executable
Scanning & parsing I
• Input: er_host localhost\nserver_port 1111\ntable marks\n# Th
• Scanning: produces the tokens PROPERTY VALUE …
• Parsing: checks the structure PROPERTY VALUE (TABLE-NAME)+ PROPERTY VALUE
• Processing: verify content, add to data structures, …
Regular expressions
• Patterns: (TABLE-NAME)+
  – TABLE-NAME TABLE-NAME
  – …
• Regular expressions (formal languages)
• Extended regular expressions (UNIX)
Scanning & parsing II
• Parsing is really two steps
  – Scanning (a.k.a. tokenizing or lexical analysis)
  – Parsing, i.e., analysis of structure and syntax according to a grammar (i.e., a set of rules)
• flex is the scanner generator (open source)
  – Fast Lex, for lexical analysis
• YACC is the parser generator
  – Yet Another Compiler-Compiler, for structural and syntax analysis
• Lex and YACC work together
• The generated parser drives the generated scanner: it asks the scanner for the next token whenever it needs one
• We use flex (fast Lex) and Bison (GNU YACC)
• There are many other tools for Java, C++, …, some of which combine Lex/Yacc into one tool (e.g., javacc)
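To make the scanning step concrete, here is a minimal hand-written sketch in C of roughly what a generated scanner does for the configuration file. The names `token_kind`, `is_property` and `next_token` are hypothetical and not part of the course code; the PROPERTY / VALUE split follows the token picture on the earlier slides. flex produces equivalent logic from patterns instead.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, mirroring the PROPERTY / VALUE split on the slides. */
enum token_kind { TOK_EOF, TOK_PROPERTY, TOK_VALUE };

/* Returns 1 if the word of length len starting at w is a known property keyword. */
static int is_property(const char *w, size_t len)
{
    static const char *props[] = {
        "server_host", "server_port", "table", "data_directory"
    };
    for (size_t i = 0; i < sizeof props / sizeof props[0]; i++)
        if (strlen(props[i]) == len && strncmp(props[i], w, len) == 0)
            return 1;
    return 0;
}

/* Scans the next token starting at *pos and advances *pos past it.
 * Whitespace and '#' comments (up to end of line) are skipped.
 * The token text is copied into buf, truncated to bufsize - 1 characters;
 * bufsize must be at least 1. */
enum token_kind next_token(const char **pos, char *buf, size_t bufsize)
{
    const char *p = *pos;
    for (;;) {
        while (*p && isspace((unsigned char)*p)) p++;
        if (*p != '#') break;
        while (*p && *p != '\n') p++;      /* comment runs to end of line */
    }
    if (*p == '\0') {
        *pos = p;
        buf[0] = '\0';
        return TOK_EOF;
    }
    const char *start = p;
    while (*p && !isspace((unsigned char)*p)) p++;
    size_t len = (size_t)(p - start);
    size_t n = len < bufsize - 1 ? len : bufsize - 1;
    memcpy(buf, start, n);
    buf[n] = '\0';
    *pos = p;
    return is_property(start, len) ? TOK_PROPERTY : TOK_VALUE;
}
```

Calling `next_token` repeatedly on the contents of default.conf yields PROPERTY and VALUE tokens in alternation. The sketch classifies by keyword only, so a table that happened to be named `table` would be misclassified; deciding that from context is the parser's job.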
Objectives for today
• Cover the basics of Lex & Yacc
• Everybody should have an appreciation of the potential of these tools
• There is a lot more detail that remains unsaid
• To challenge you
Lex & YACC overview
• Input stream: server_host localhost\nserver_port 1111\ntable marks\n# This data directory may be an absolute or relative path.\ndata_directory ./data\n\n\n EOF
• Lexical analyzer: turns the input stream into a token stream (PROPERTY VALUE …)
• Structural analyzer: consumes the token stream according to the specification
• Output: defined by actions in the parser specification (often an in-memory representation of the input)
LEXICAL ANALYSIS WITH LEX
Lex introduction
• Synonyms: lexical analyzer, scanner, lexer, tokenizer
• Input specification (*.l) → flex → lex.yy.c → C compiler → lexical analyzer
  – flex is fast Lex
  – You can control the name of the generated file
• The lexical analyzer turns the input stream into a token stream
• You generate the lexical analyzer by using flex
Lex
• Input specification for lex – the “program”
  – Three parts: Definitions, Rules, User code
  – Use “%%” as a delimiter for each part
• First part: Definitions
  – Options used by flex inside the scanner
  – Defines variables & macros
  – Code within “%{” and “%}” is copied directly into the scanner (e.g., global variables, header files)
• Second part: Rules
  – Patterns and corresponding actions
  – Actions are executed when the corresponding pattern(s) match
  – Patterns are defined by regular expressions
Parsing the configuration file of Milestone 1 (config_parser.l)

    %{
    #include "config_parser.tab.h"
    ...
    %}
    a2Z     [a-zA-Z]
    host    server_host
    port    server_port
    dir     data_directory
    %%
    {host}      { return HOST_PROPERTY; }
    {port}      { return PORT_PROPERTY; }
    table       { return TABLE; }
    {dir}       { return DDIR_PROPERTY; }
    [\t\n ]+    { }
    #.*\n       { }
    {a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
    [0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
    .           { return yytext[0]; }
    …

• a2Z, host, port and dir are shorthands for use below
• In each rule, the left-hand part is the pattern and the part in braces is the action
flex pattern matching principles
• Actions are executed when patterns match
  – Tokens are returned to caller; next pattern …
• Patterns match a given input character or string only once
  – Input stream is consumed
• flex executes the action for the longest possible matching input
  – If several patterns match the same amount of input, the pattern listed first wins, so the order of patterns in the spec. is important
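The two matching principles (longest match first, then rule order) can be sketched in plain C. This is an illustration only, not how flex is implemented: flex compiles all patterns into one automaton. The enum `rule` and the functions `match_table`, `match_string`, `match_number` and `pick_rule` are hypothetical names, and the three patterns are a simplified subset of the configuration scanner.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule identifiers, in the same order as the rules are tried. */
enum rule { RULE_NONE = -1, RULE_TABLE = 0, RULE_STRING = 1, RULE_NUMBER = 2 };

/* Each matcher returns how many leading characters of s its pattern matches. */
static size_t match_table(const char *s)    /* pattern: table */
{
    return strncmp(s, "table", 5) == 0 ? 5 : 0;
}

static size_t match_string(const char *s)   /* pattern: [a-zA-Z]+ */
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n])) n++;
    return n;
}

static size_t match_number(const char *s)   /* pattern: [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Picks the rule with the longest match at the start of s and stores the
 * match length in *len. On a tie the rule listed first wins. */
enum rule pick_rule(const char *s, size_t *len)
{
    size_t (*matchers[])(const char *) = { match_table, match_string, match_number };
    enum rule best = RULE_NONE;
    *len = 0;
    for (int i = 0; i < 3; i++) {
        size_t n = matchers[i](s);
        if (n > *len) {       /* strictly longer only: a tie keeps the earlier rule */
            *len = n;
            best = (enum rule)i;
        }
    }
    return best;
}
```

For the input `table marks`, both the keyword rule and the string rule match five characters, and the keyword rule wins because it is listed first. For `tables`, the string rule matches six characters and wins on length.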
flex regular expressions by example I
(Really: extended regular expressions)
  `x'          match the character 'x'
  `.'          any character (byte) except newline
  `[xyz]'      match either an 'x', a 'y', or a 'z'
  `[abj-oZ]'   match an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
  `[^A-Z]'     a "negated character class", i.e., any character EXCEPT those in the class
  `[^A-Z\n]'   any character EXCEPT an uppercase letter or a newline
flex regular expressions by example II
(r is any regular expression)
  `r*'         zero or more r's
  `r+'         one or more r's
  `r?'         zero or one r (that is, "an optional r")
  `r{2,5}'     anywhere from two to five r's
  `r{2,}'      two or more r's
  `r{4}'       exactly 4 r's
  `<<EOF>>'    an end-of-file
flex regular expressions
• There are many more expressions, see manual
• Form complex expressions
  – E.g.: IP address, names, …
• The expression syntax is used in other tools as well (well worth learning)
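As an example of the same expression syntax in another tool, the sketch below uses the POSIX `regcomp`/`regexec` interface from C to test a dotted-quad IP address shape. The helper `matches` and the constant `IP_SHAPE` are hypothetical names. POSIX extended regular expressions are close to flex patterns but not identical; for instance, the `^` and `$` anchors used here have a different role in flex.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the POSIX extended regular expression pattern matches
 * somewhere in text (anchor with ^ and $ to require a full match),
 * 0 if it does not, and -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Shape of a dotted-quad address: four groups of one to three digits.
 * This checks the shape only, not that each group is at most 255. */
const char *IP_SHAPE = "^[0-9]{1,3}(\\.[0-9]{1,3}){3}$";
```

With these definitions, `matches(IP_SHAPE, "127.0.0.1")` reports a match and `matches(IP_SHAPE, "1.2.3")` does not. The counted repetition `{1,3}` is the same construct as `r{2,5}` in the pattern table.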
Parsing the configuration file of Milestone 1 (config_parser.l)

Sample input:
    server_host localhost
    server_port 1111
    table marks
    data_directory ./data

    %{
    #include "config_parser.tab.h"
    ...
    %}
    a2Z     [a-zA-Z]
    host    server_host
    port    server_port
    dir     data_directory
    %%
    {host}      { return HOST_PROPERTY; }
    {port}      { return PORT_PROPERTY; }
    table       { return TABLE; }
    {dir}       { return DDIR_PROPERTY; }
    [\t\n ]+    { }
    #.*\n       { }
    {a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
    [0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
    .           { return yytext[0]; }
    <<EOF>>     { return 0; }

• yylval is a user-defined variable in YACC (conveys the token value to YACC)
PARSING WITH YACC
YACC introduction
• Input specification (*.y) → YACC → y.tab.c → C compiler → syntax analyzer / parser
  – You can control the name of the generated file
• Input to the parser: token stream, e.g., via flex
• Output: defined by actions in the parser specification
• From the specified grammar, YACC generates a parser which recognizes “sentences” according to the grammar
YACC
• Input specification for YACC (similar to flex)
  – Three parts: Definitions, Rules, User code
  – Use “%%” as a delimiter for each part
• First part: Definitions
  – Definition of tokens for the second part and for use by flex
  – Definition of variables for use by the parser code
• Second part: Rules
  – Grammar for the parser
• Third part: User code
  – The code in this part is copied into the parser generated by YACC
Configuration file parser Milestone 1 (config_parser.y, definition section)

    %{
    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>   /* malloc, used in the rule actions */

    struct table *tl, *t;
    struct configuration *c;

    /* define a structure for the configuration information */
    struct configuration {
        char *host;
        int port;
        struct table *tlist;
        char *data_dir;
    };

    /* define a linked list of table names */
    struct table {
        char *table_name;
        struct table *next;
    };
Configuration file parser Milestone 1 (config_parser.y, definition section cont’d.)

    %}
    %union {
        char *sval;   // String value (user defined)
        int pval;     // Port number value (user defined)
    }
    %token <sval> STRING
    %token <pval> PORT_NUMBER
    %token HOST_PROPERTY PORT_PROPERTY DDIR_PROPERTY TABLE
    %%
Configuration file parser Milestone 1 (config_parser.y, (grammar) rules section, simplified)

    property_list:
        HOST_PROPERTY STRING
        PORT_PROPERTY PORT_NUMBER
        table_list
        data_directory
      ;
    table_list:
        table_list TABLE STRING
      | TABLE STRING
      ;
    data_directory:
        DDIR_PROPERTY STRING
      ;
    %%
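To see what the grammar accepts, the sketch below checks a token sequence against the same shape by hand. It is a hypothetical helper, not what YACC generates (YACC builds a table-driven parser), and it assumes the port value arrives as the PORT_NUMBER token declared in the definitions section. The `enum token` values are made up; the real codes come from the generated config_parser.tab.h.

```c
#include <stddef.h>

/* Hypothetical token codes; the real ones are generated into config_parser.tab.h. */
enum token { HOST_PROPERTY, PORT_PROPERTY, DDIR_PROPERTY, TABLE, STRING, PORT_NUMBER };

/* Returns 1 if tokens[0..n) has the shape
 *   HOST_PROPERTY STRING PORT_PROPERTY PORT_NUMBER (TABLE STRING)+ DDIR_PROPERTY STRING
 * and 0 otherwise. */
int is_valid_config(const enum token *tokens, size_t n)
{
    if (n < 4 || tokens[0] != HOST_PROPERTY || tokens[1] != STRING ||
        tokens[2] != PORT_PROPERTY || tokens[3] != PORT_NUMBER)
        return 0;

    size_t i = 4;
    size_t tables = 0;
    /* table_list: one or more TABLE STRING pairs */
    while (i + 1 < n && tokens[i] == TABLE && tokens[i + 1] == STRING) {
        i += 2;
        tables++;
    }
    if (tables == 0)
        return 0;

    /* data_directory must be the last two tokens */
    return i + 2 == n && tokens[i] == DDIR_PROPERTY && tokens[i + 1] == STRING;
}
```

The loop plays the role of the recursive `table_list` rule: it accepts one `TABLE STRING` pair and then any number of further pairs.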
config_parser.y, (grammar) rules section (details)

    struct configuration *c;
    struct configuration {
        char *host;
        int port;
        struct table *tlist;
        char *data_dir;
    };

    data_directory:
        DDIR_PROPERTY STRING          /* $1 $2 */
        {
            c = (struct configuration *) malloc(sizeof(struct configuration));
            // Check c for NULL
            c->data_dir = strdup( $2 );
        }
      ;
config_parser.y, (grammar) rules section (details)

    struct configuration *c;
    struct configuration {
        char *host;
        int port;
        struct table *tlist;
        char *data_dir;
    };

    property_list:
        HOST_PROPERTY STRING
        PORT_PROPERTY PORT_NUMBER
        table_list
        data_directory
        {
            c->host = strdup( $2 );
            c->port = $4;
            c->tlist = tl;
        }
      ;
Configuration file parser Milestone 1 (config_parser.y, (grammar) rules section, simplified)

    property_list:
        HOST_PROPERTY STRING
        PORT_PROPERTY PORT_NUMBER
        table_list
        data_directory
      ;
    table_list:
        table_list TABLE STRING
      | TABLE STRING
      ;
    data_directory:
        DDIR_PROPERTY STRING
      ;
    %%

• Highlighted: … TABLE STRING
table_list is a recursive rule
• Example table specification in configuration file
    table MyCourses
    table MyMarks
    table MyFriends
• table_list: table_list TABLE STRING | TABLE STRING ;
• Terminology
  – table_list is called a non-terminal
  – TABLE & STRING are terminals
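Recursive rule execution
• Rule: table_list: table_list TABLE STRING | TABLE STRING ;
• Input: table MyCourses, table MyMarks, table MyFriends
• The rule is applied once per table, from the first table to the last:
  – TABLE STRING (table MyCourses) becomes a table_list
  – table_list TABLE STRING (table MyCourses, then table MyMarks) becomes a table_list
  – table_list TABLE STRING (table MyCourses table MyMarks, then table MyFriends) becomes the final table_list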
config_parser.y

    struct table *tl, *t;
    struct table {
        char *table_name;
        struct table *next;
    };

    table_list:
        table_list TABLE STRING       /* $1 $2 $3 */
        {
            t = (struct table *) malloc(sizeof(struct table));
            t->table_name = strdup( $3 );
            t->next = tl;
            tl = t;
        }
      | TABLE STRING                  /* $1 $2 */
        {
            tl = (struct table *) malloc(sizeof(struct table));
            tl->table_name = strdup( $2 );
            tl->next = NULL;
        }
      ;

• Second alternative: tl points to the first table node
• First alternative: the new node t is linked in front of the list (t->next = tl), then tl is moved to t
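The list-building logic of the two actions can be tried outside the parser. The sketch below repeats the `struct table` definition and factors the prepend step into a hypothetical helper, `prepend_table`, with the NULL checks that the slide code leaves out; `free_tables` is likewise an illustrative addition.

```c
#include <stdlib.h>
#include <string.h>

struct table {
    char *table_name;
    struct table *next;
};

/* Prepends a copy of name to the list headed by tl and returns the new head.
 * Returns NULL if allocation fails; the old list is left untouched in that case. */
struct table *prepend_table(struct table *tl, const char *name)
{
    struct table *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->table_name = malloc(strlen(name) + 1);
    if (t->table_name == NULL) {
        free(t);
        return NULL;
    }
    strcpy(t->table_name, name);
    t->next = tl;          /* same step as t->next = tl in the rule action */
    return t;              /* the caller assigns this to tl, as in tl = t */
}

/* Frees every node of the list together with its name. */
void free_tables(struct table *tl)
{
    while (tl != NULL) {
        struct table *next = tl->next;
        free(tl->table_name);
        free(tl);
        tl = next;
    }
}
```

Because each node is added at the front, prepending MyCourses, MyMarks and MyFriends in that order leaves the list in reverse order of appearance: MyFriends, MyMarks, MyCourses. Code that needs file order has to reverse the list or append instead.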
How to invoke the parser

    int main (int argc, char **argv)
    {
        FILE *f;
        extern FILE *yyin;
        if (argc == 2) {
            f = fopen(argv[1], "r");
            if (!f) { …// error handling … }
            yyin = f;
            while (!feof(yyin)) {
                if (yyparse() != 0) {
                    …
                    yyerror("");
                    exit(1);   /* non-zero exit status signals the parse error */
                }
            }
            fclose(f);
        }
        …

• yylex() calls the generated scanner
• By default, yylex() is called within yyparse()
In the Makefile

lexer: config_parser.l
	${LEX} config_parser.l
	${CC} ${CFLAGS} ${INCLUDE} -c lex.yy.c

yaccer: config_parser.y
	${YACC} -d config_parser.y
	${CC} ${CFLAGS} ${INCLUDE} -c config_parser.tab.c

parser: config_parser.tab.o lex.yy.o
	${CC} ${CFLAGS} ${INCLUDE} -c parser.c
	${CC} -o p ${CFLAGS} ${INCLUDE} lex.yy.o config_parser.tab.o parser.o
Benefits
• Faster development
  – Compared to manual implementation
• Easier to change the specification and generate a new parser
  – Than to modify 1000s of lines of code to add, change, or delete a feature
• Less error-prone, as code is generated
• Cost: Learning curve
  – Invest once, amortized over a 40+ year career
If you want to know more
• Lecture, examples and some recommended reading are enough to tackle all of the parsing for Milestones 3 & 4
• 3rd and 4th year lectures on Compilers may show you the algorithms behind & inside Lex & YACC
• Lectures on Computability and Theory of Computation may also show you these algorithms
A flex specification

    %{
    #include <stdio.h>
    #include "y.tab.h"
    int c;
    extern int yylval;
    %}
    %%
    " "           ;
    [a-z]         { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
    [0-9]         { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
    [^a-z0-9\b]   { c = yytext[0]; return(c); }

• The header: everything before “%%”
• The “guts”: regular expressions annotated with actions
The header

    %{
    #include <stdio.h>
    #include "y.tab.h"
    int c;
    extern int yylval;
    %}
    %%

• int c: temporary variable(s)
• yylval: special variable
  – set in the scanner
  – used in the parser
  – for transferring values associated with tokens to the parser
  – declared extern here because the generated parser defines it
• %%: dividing line between header and rules section
The rules

    %%
    " "           ;
    [a-z]         { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
    [0-9]         { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
    [^a-z0-9\b]   { c = yytext[0]; return(c); }

• yytext: the string associated with the token
The rules

    %%
    " "           ;
    [a-z]         { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
    [0-9]         { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
    [^a-z0-9\n]   { c = yytext[0]; return(c); }

• [a-z]: sets yylval to the character’s alphabetical order
• [0-9]: sets yylval to the digit’s numerical value
• Otherwise: simply returns that character; presumably it’s an operator: +, *, -, etc.
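The value computations in the rule actions can be checked in isolation. The sketch below mimics the letter, digit and fallback rules in a hypothetical function `classify`; the token codes 258 and 259 are made up for illustration (the real ones come from y.tab.h), and the arithmetic assumes a character set such as ASCII in which 'a'..'z' are contiguous.

```c
/* Hypothetical token codes; in the real scanner they come from y.tab.h. */
enum { LETTER = 258, DIGIT = 259 };

/* Mimics the letter, digit and fallback rules: returns the token code for c
 * and, for letters and digits, stores the associated value in *value
 * (the number the scanner would place in yylval). */
int classify(char c, int *value)
{
    if (c >= 'a' && c <= 'z') {
        *value = c - 'a';      /* 'a' -> 0, 'b' -> 1, ... */
        return LETTER;
    }
    if (c >= '0' && c <= '9') {
        *value = c - '0';      /* '0' -> 0, ..., '9' -> 9 */
        return DIGIT;
    }
    return c;                  /* any other character is its own token, e.g. '+' */
}
```

For example, `classify('d', &v)` returns LETTER with `v` set to 3, and `classify('+', &v)` returns the character code of '+' and leaves `v` unchanged.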
Simple example
• Implement a calculator which can recognize adding or subtracting of numbers

    [linux33]% ./y_calc
    1+101
    = 102
    [linux33]% ./y_calc
    1000-300+200+100
    = 1000
    [linux33]%
Example – the Lex part

    %{
    #include <math.h>
    #include "y.tab.h"
    extern int yylval;
    %}
    %%
    [0-9]+    { yylval = atoi(yytext); return NUMBER; }
    [ \t]+    ;                  /* Do nothing for white space */
    \n        return 0;          /* End of the logic */
    .         return yytext[0];
    %%

• Definitions: the part before the first “%%”
• Rules: between the two “%%” lines; each rule is a pattern followed by an action
Example – the Yacc part

    %token NAME NUMBER
    %%
    statement:  NAME '=' expression
             |  expression              { printf("= %d\n", $1); }
             ;
    expression: expression '+' NUMBER   { $$ = $1 + $3; }
             |  expression '-' NUMBER   { $$ = $1 - $3; }
             |  NUMBER                  { $$ = $1; }
             ;

• Definitions: the part before “%%”
• Rules: the part after “%%”
• Include Yacc library (-ly)
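The effect of the left-recursive `expression` rule can be reproduced by hand: the running result plays the role of `$1`, each new NUMBER is `$3`, and operators are applied strictly left to right. The function `eval_line` below is a hypothetical stand-in for the generated scanner and parser together, not the code they produce.

```c
#include <ctype.h>

/* Evaluates NUMBER (('+' | '-') NUMBER)* from left to right, skipping spaces
 * and tabs, and stops at a newline or the end of the string.
 * Stores the result in *result and returns 1, or returns 0 on a syntax error.
 * Integer overflow is not checked. */
int eval_line(const char *s, int *result)
{
    int acc = 0;
    char op = '+';                 /* treat the first NUMBER as 0 + NUMBER */
    for (;;) {
        while (*s == ' ' || *s == '\t') s++;
        if (!isdigit((unsigned char)*s))
            return 0;              /* a NUMBER is expected here */
        int n = 0;
        while (isdigit((unsigned char)*s))
            n = n * 10 + (*s++ - '0');
        acc = (op == '+') ? acc + n : acc - n;   /* $$ = $1 + $3 or $$ = $1 - $3 */
        while (*s == ' ' || *s == '\t') s++;
        if (*s == '\0' || *s == '\n')
            break;                 /* end of the expression */
        if (*s != '+' && *s != '-')
            return 0;
        op = *s++;
    }
    *result = acc;
    return 1;
}
```

Evaluating `1000-300+200+100` this way gives ((1000 - 300) + 200) + 100 = 1000, which is the grouping the left-recursive grammar produces.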