Compiler Tools Lex/Yacc – Flex & Bison

Compiler Front End (from Engineering a Compiler)
Source code → Scanner → tokens → Parser → Intermediate Representation (both stages report errors)
Scanner (Lexical Analyzer)
• Maps a stream of characters into words, the basic unit of syntax
• x = x + y ; becomes <id, x> <eq, => <id, x> <plus_op, +> <id, y> <sc, ;>
• The actual characters of a word are its lexeme
• Its part of speech (or syntactic category) is called its token type
• Scanner discards white space & (often) comments
• Speed is an issue in scanning, so use a specialized recognizer
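The scanner's job described above can be sketched by hand in C. This is only an illustration, not the scanner from Engineering a Compiler: the names Token, TokenType and next_token are hypothetical, and only the four token types from the x = x + y ; example are handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical token types, named after the slide's example. */
typedef enum { TOK_ID, TOK_EQ, TOK_PLUS_OP, TOK_SC, TOK_EOF, TOK_ERROR } TokenType;

typedef struct {
    TokenType type;   /* part of speech */
    char lexeme[32];  /* the actual characters of the word */
} Token;

/* Reads one token starting at *src and advances *src past it. */
Token next_token(const char **src)
{
    Token tok = { TOK_EOF, "" };
    const char *p;

    assert(src != NULL && *src != NULL);
    p = *src;

    while (isspace((unsigned char)*p))  /* the scanner discards white space */
        p++;

    if (*p == '\0') {
        tok.type = TOK_EOF;
    } else if (isalpha((unsigned char)*p)) {
        size_t n = 0;
        while (isalnum((unsigned char)*p) && n < sizeof tok.lexeme - 1)
            tok.lexeme[n++] = *p++;
        tok.lexeme[n] = '\0';
        tok.type = TOK_ID;
    } else {
        tok.lexeme[0] = *p;
        tok.lexeme[1] = '\0';
        switch (*p++) {
            case '=': tok.type = TOK_EQ; break;
            case '+': tok.type = TOK_PLUS_OP; break;
            case ';': tok.type = TOK_SC; break;
            default:  tok.type = TOK_ERROR; break;
        }
    }
    *src = p;
    return tok;
}
```

Calling next_token repeatedly on "x = x + y ;" until it returns TOK_EOF yields the five-token stream shown on the slide. Flex generates this kind of code from regular expressions, so it does not have to be written by hand.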

The Front End (from Engineering a Compiler)
Source code → Scanner → tokens → Parser → IR (both stages report errors)
Parser
• Checks the stream of classified words (parts of speech) for grammatical correctness
• Determines whether the code is syntactically well-formed
• Guides checking at deeper levels than syntax
• Builds an IR representation of the code
Parsing is harder than scanning, so it is better to put as many rules as possible in the scanner (whitespace, etc.).

The Big Picture
• Language syntax is specified with parts of speech, not words
• Syntax checking matches parts of speech against a grammar
    1. goal → expr
    2. expr → expr op term
    3.      | term
    4. term → number
    5.      | id
    6. op   → +
    7.      | -
    S = goal
    T = { number, id, +, - }
    N = { goal, expr, term, op }
    P = { 1, 2, 3, 4, 5, 6, 7 }
No words here! Parts of speech, not words!

Why study lexical analysis?
• We want to avoid writing scanners by hand
• Finite automata are used in other applications: grep, website filtering, various "find" commands
specifications → Scanner Generator → tables or code → Scanner
source code → Scanner → parts of speech & words
• Specifications are written as "regular expressions"
• Words are represented as indices into a global table
Goals:
• To simplify specification & implementation of scanners
• To understand the underlying techniques and technologies

Finite Automata
Formally, a finite automaton is a five-tuple (S, Σ, δ, s0, SF) where
• S is the set of states, including the error state se. S must be finite.
• Σ is the alphabet or character set used by the recognizer. Typically it is the union of the edge labels (transitions between states).
• δ(s, c) is a function that encodes transitions (i.e., on character c in Σ, the automaton moves from state s to another state in S)
• s0 is the designated start state
• SF is the set of final states, drawn with a double circle in the transition diagram

Finite Automata
Finite automaton to recognize fee and fie:
    s0 --f--> s1 --e--> s2 --e--> s3
              s1 --i--> s4 --e--> s5
• S = {s0, s1, s2, s3, s4, s5, se}
• Σ = {f, e, i}
• δ(s, c) is the set of transitions shown above
• s0 is the start state
• SF = {s3, s5}
The set of words accepted by a finite automaton F forms a language L(F). It can also be described by regular expressions.
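The fee/fie automaton can be written directly as a transition function in C. This is a minimal sketch: the names delta and accepts are hypothetical, and any transition not drawn in the diagram is assumed to lead to the error state se.

```c
#include <assert.h>
#include <stdio.h>

/* States of the automaton; SE is the error state. */
enum { S0, S1, S2, S3, S4, S5, SE };

/* delta(s, c): the transition function of the fee/fie automaton. */
static int delta(int state, char c)
{
    switch (state) {
        case S0: return c == 'f' ? S1 : SE;
        case S1: return c == 'e' ? S2 : (c == 'i' ? S4 : SE);
        case S2: return c == 'e' ? S3 : SE;
        case S4: return c == 'e' ? S5 : SE;
        default: return SE;  /* S3, S5 and SE have no outgoing edges */
    }
}

/* Returns 1 if word is in L(F), i.e. the run ends in a final state. */
int accepts(const char *word)
{
    int state = S0;
    assert(word != NULL);
    for (; *word != '\0'; word++)
        state = delta(state, *word);
    return state == S3 || state == S5;
}
```

For example, accepts("fee") and accepts("fie") return 1, while accepts("fe") and accepts("feee") return 0. A scanner generator builds the same kind of table from a regular expression such as fee|fie.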

Finite Automata Quick Exercise
• Draw a finite automaton that can recognize CU | CSM | DU
(fee/fie automaton repeated for reference: s0 --f--> s1 --e--> s2 --e--> s3 and s1 --i--> s4 --e--> s5)

RE vs CFG*
• Context-free grammars are strictly more powerful than regular expressions.
    • Any language that can be generated using regular expressions can be generated by a context-free grammar.
    • There are languages that can be generated by a context-free grammar that cannot be generated by any regular expression.
• As a corollary, CFGs are strictly more powerful than DFAs and NDFAs.
• The proof is in two parts:
    • Given a regular expression R, we can generate a CFG G such that L(R) == L(G).
    • We can define a grammar G for which there is no FA F such that L(F) == L(G).
* from http://www.cs.rochester.edu/~nelson/courses/csc_173/grammars/cfg.html

Example
• No finite automaton (and therefore no regular expression) can recognize the set of strings consisting of some number of 0s followed by the same number of 1s: {0^n 1^n | n >= 1}
• A context-free grammar can:
    <S> --> 0 <S> 1
    <S> --> 01
• What does this mean for us? Recognizing tokens can be done with a regular expression, but we need a CFG to recognize sentences in our language.
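The two productions for <S> translate almost line for line into a recursive function, which is a rough way to see why a grammar can count where a finite automaton cannot: the recursion depth remembers how many 0s are still waiting for a matching 1. The function names below are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Tries to match <S> at p, where <S> --> 0 <S> 1 | 01.
 * Returns a pointer just past the matched text, or NULL on failure.
 */
static const char *parse_S(const char *p)
{
    if (*p != '0')
        return NULL;
    p++;                      /* consumed the leading 0 */
    if (*p == '0') {          /* another 0 follows: use <S> --> 0 <S> 1 */
        p = parse_S(p);
        if (p == NULL)
            return NULL;
    }
    if (*p != '1')            /* both productions end with a 1 */
        return NULL;
    return p + 1;
}

/* Returns 1 if the whole string is in {0^n 1^n | n >= 1}. */
int matches_zeros_then_ones(const char *s)
{
    const char *end;
    assert(s != NULL);
    end = parse_S(s);
    return end != NULL && *end == '\0';
}
```

matches_zeros_then_ones("0011") returns 1, while "001", "0101" and the empty string are rejected.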

Regular Expression Examples
• digit: [0-9]
• int with at least 1 digit: [0-9]+
• int that can have 0 digits: [0-9]*
• What about float?
    • [0-9]*\.[0-9]+   // literal ., at least 1 digit after the .
    • What about 0 or 2?
    • ([0-9]+)|([0-9]*\.[0-9]+)   // combine int and float, notice use of ()
    • What about unary -?
    • -?(([0-9]+)|([0-9]*\.[0-9]+))
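The final pattern can be tried outside lex with the POSIX regex functions regcomp and regexec. This sketch assumes a POSIX system that provides <regex.h>. The ^ and $ anchors are added here so the whole string must match, and is_number is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Returns 1 if text is an optionally signed int or float, else 0. */
int is_number(const char *text)
{
    regex_t re;
    int ok;

    assert(text != NULL);
    /* Same pattern as the slide, anchored to the whole string. */
    if (regcomp(&re, "^-?(([0-9]+)|([0-9]*\\.[0-9]+))$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With this helper, "42", "-3.14" and ".5" are accepted, while "3." is rejected because the pattern requires at least one digit after the decimal point.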

Regular Expressions in Lex*
The characters that form regular expressions include:
• . matches any single character except newline
• * matches zero or more copies of the preceding expression
• [] a character class that matches any character within the brackets. If the first character is ^ it will match any character except those within the brackets. A dash can be used for a character range, e.g., [0-4] is equivalent to [01234]. More in the book…
• Examples:
    • a.[0]* would match ab, ac, aa, ab0, ac0, aa0, ab00, ac00, aa00, …
    • a[bc] would match ab or ac
    • a[bc]* would match a, ab, ac, abb, acc, abc, acb, …
* from lex & yacc by Levine, Mason & Brown

Regular Expressions in Lex*
The characters that form regular expressions also include:
• ^ matches the beginning of a line as the first character of an expression (also negation within [], as listed above).
• $ matches the end of a line as the last character of an expression
• {} indicates how many times the previous pattern is allowed to match, e.g., A{1,3} matches one to three occurrences of A.
• \ used to escape metacharacters, e.g., \* is a literal asterisk, \" is a literal quote, \{ is a literal open brace, etc.
• Examples:
    • ^I[ab] matches Ia or Ib at the beginning of a line
    • I[ab]$ matches Ia or Ib at the end of a line, e.g., not Iaj
    • .\.. matches a.b, c.d etc.
* from lex & yacc by Levine, Mason & Brown

Regular Expressions, continued
• + matches one or more occurrences of the preceding expression, e.g., [0-9]+ matches "1", "11" or "1234" but not the empty string
• ? matches zero or one occurrence of the preceding expression, e.g., -?[0-9]+ matches a signed number with an optional leading minus sign
• | matches either the preceding or the following expression, e.g., cow|pig|sheep matches any of the three words
• "…" interprets everything inside the quotation marks literally
• () groups a series of regular expressions into a new regular expression, e.g., (01) becomes the character sequence 01. Useful when building up complex patterns with *, + and |.
• Examples:
    • a[0-9]+(dog|cat) matches a0cat, a21dog, etc. but NOT acat

More Regular Expression Examples
• What's a regular expression for matching quotes?
    • Try: \".*\"
    • This won't work for lines with multiple quoted parts, such as "mine" and "yours", because lex matches the largest possible pattern.
    • \"[^"\n]*["\n]
    • This will work by excluding " (it forces lex to stop as soon as a " is reached). The \n keeps a quoted string from exceeding one line.
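The difference between the two quote patterns can be mimicked with two small hand-written matchers. This is only a sketch of how the patterns behave, not how lex implements them, and both function names are hypothetical. Each returns the length of the text the corresponding pattern would match at the start of s.

```c
#include <assert.h>
#include <stdio.h>

/* Mimics \".*\" : runs to the LAST quote on the line (longest match). */
size_t match_greedy_quote(const char *s)
{
    size_t i, last = 0;
    assert(s != NULL);
    if (s[0] != '"')
        return 0;
    for (i = 1; s[i] != '\0' && s[i] != '\n'; i++)
        if (s[i] == '"')
            last = i;
    return last ? last + 1 : 0;
}

/* Mimics \"[^"\n]*["\n] : stops at the FIRST quote or newline. */
size_t match_single_quote(const char *s)
{
    size_t i = 1;
    assert(s != NULL);
    if (s[0] != '"')
        return 0;
    while (s[i] != '\0' && s[i] != '"' && s[i] != '\n')
        i++;
    if (s[i] == '\0')
        return 0;  /* neither a closing quote nor a newline was found */
    return i + 1;
}
```

On the input "mine" and "yours" the first function swallows the whole line, while the second stops after "mine", which is the behaviour the second pattern is meant to give.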

Flex – Fast Lexical Analyzer
Here's where we'll put the regular expressions to good use!
regular expressions & C-code rules → FLEX (scanner generator) → lex.yy.c (contains yylex()) → compile → executable scanner
• The scanner is a program that recognizes patterns in text
• The executable analyzes and executes input

Flex input file
• 3 sections:
    definitions
    %%
    rules
    %%
    user code

Definition Section Examples
• Each definition has the form: name definition
    DIGIT [0-9]
    ID [a-z][a-z0-9]*
• A subsequent reference to {DIGIT}+"."{DIGIT}* is identical to ([0-9])+"."([0-9])*

C Code
• Can include C code in definitions
    %{
    /* This is a comment inside the definition */
    #include <math.h>   // may need headers
    #include <stdio.h>  // for printf in BB
    #include <stdlib.h> // for exit(0) in BB
    %}

Rules
• The rules section of the flex input contains a series of rules of the form:
    pattern action
• In the definitions and rules sections, any indented text or text enclosed in %{ and %} is copied verbatim to the output (with the %{ %}'s removed). The %{ %}'s must appear unindented on lines by themselves.

Example: Simple Pascal-like recognizer
• Definitions section:
    /* scanner for a toy Pascal-like language */
    %{
    /* need for the call to atof() below */
    #include <math.h>
    %}
    DIGIT [0-9]
    ID [a-z][a-z0-9]*
• Remember that %{ and %} are on a line by themselves, unindented!
• Lines between %{ and %} are inserted as-is into the resulting code
• DIGIT and ID are definitions that can be used in the rules section

Example continued
• Rules section (each rule is: pattern action; yytext is the text that matched the pattern, a char*):
    %%
    {DIGIT}+ { printf("An integer: %s (%d)\n", yytext, atoi(yytext)); }
    {DIGIT}+"."{DIGIT}* { printf("A float: %s (%g)\n", yytext, atof(yytext)); }
    if|then|begin|end|procedure|function { printf("A keyword: %s\n", yytext); }
    {ID} { printf("An identifier: %s\n", yytext); }
    "+"|"-"|"*"|"/" { printf("An operator: %s\n", yytext); }
    "{"[^}\n]*"}" /* eat up one-line comments */
    [ \t\n]+ /* eat up whitespace */
    . { printf("Unrecognized character: %s\n", yytext); }

Example continued
• User code (required for flex, in library for lex)
    %%
    int yywrap(void) { return 1; }  // needed to link, unless libfl.a is available
                                    // OR put %option noyywrap at the top of a flex file.
    int main(int argc, char **argv)
    {
        ++argv, --argc;  /* skip over program name */
        if (argc > 0)
            yyin = fopen(argv[0], "r");  /* yyin is the lex input file */
        else
            yyin = stdin;
        yylex();  /* lexer function produced by lex */
    }

Flex exercise #1
1. Download pascal.l
2. From a command prompt (Start->Run->cmd):
    flex -opascal.c -L pascal.l
    • NOTE: without the -o option the output file will be called lex.yy.c
    • The -L option suppresses #line directives that cause problems with some compilers (e.g., DevC++)
3. Compile pascal.c (batch file on Blackboard):
    gcc -opascal.exe -Lc:\progra~1\gnuwin32\lib pascal.c -lfl -ly
4. Execute the program. Type in digits, ids, keywords etc. End the program with Ctrl-Z

Flex exercise #2
• Copy words.l (from lex & yacc)
• Use flex, then compile and execute
• What does it do?
• Extend the example with 1 new part of speech.
• Recognize lexemes R0-R9 as register names
• Recognize complex numbers, including for example -3+4i, +5-6i, +7i, 8i, -12i, but not 3++4i (hint: print a newline before displaying your complex number; the lexer may display 3+ and then recognize +4i)

Lex techniques
• Hardcoding lists is not very effective. Often a symbol table is used instead. There is an example in lex & yacc; it is not covered in class, but see me if you're interested.

Bison – like Yacc (Yet Another Compiler Compiler)
Context-free grammar in BNF form, LALR(1)* → Bison → Bison parser (C program), which groups tokens according to grammar rules
• The Bison parser provides yyparse
• You must provide:
    • the lexical analyzer (e.g., flex)
    • an error-handling routine named yyerror
    • a main routine that calls yyparse
* LALR: Look-Ahead LR (left-to-right scan, rightmost derivation in reverse)

Bison Parser
• Same sections as flex (yacc came first): definitions, rules, C code
• We'll discuss rules first, then definitions and C code

Bison Parser – Rule Section
• Consider the CFG <statement> => ID = <expression>
• It would be written in the bison "rules" section as:
    statement: NAME '=' expression
        | expression { printf("= %d\n", $1); }
        ;
    expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
        | NUMBER '-' NUMBER { $$ = $1 - $3; }
        | NUMBER { $$ = $1; }
        ;
• Use : between lhs and rhs, and place ; at the end. White space is free-form.
• What are $$, $1 and $3? See the next slide…
• NOTE: The first rule in statement won't be operational yet…

More on bison Rules and Actions
• $1, $3 refer to RHS values. $$ sets the value of the LHS.
• In expression, $$ = $1 + $3 means it sets the value of the lhs (expression) to NUMBER ($1) + NUMBER ($3)
• A rule action is executed when the parser reduces that rule (it will have recognized both NUMBER symbols)
• The lexer should have returned a value via yylval (next slide)
• When is the printf action in statement executed?
    statement: NAME '=' expression
        | expression { printf("= %d\n", $1); }
        ;
    expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
        | NUMBER '-' NUMBER { $$ = $1 - $3; }
        ;
• In the first expression rule, $$ is expression, $1 is NUMBER, $2 is '+' and $3 is NUMBER

Coordinating flex and bison
• Example to return an int value:
    [0-9]+ { yylval = atoi(yytext); return NUMBER; }
    • yylval = atoi(yytext) sets the value for use in actions; this one is just the numeric value of the string stored in yytext
    • atoi is the C function that converts a string to an integer
    • return NUMBER returns the recognized token
• In prior flex examples we just returned tokens, not values
• Also need to skip whitespace and return symbols:
    [ \t] ; /* ignore white space */
    \n return 0; /* logical EOF */
    . return yytext[0];

Bison Rule Details
• Unlike flex, bison doesn't care about line boundaries, so add white space for readability
• The symbol on the lhs of the first rule is the start symbol; you can override this with a %start declaration in the definition section
• Symbols in bison have values, which must be "declared" as some type
    • YYSTYPE determines the type
    • The default for all values is int
    • We'll be using different types for YYSTYPE in the SimpleCalc exercises

Bison Parser – Definition Section
• Tokens used in the grammar should be defined. Example rule:
    expression: NUMBER '+' NUMBER { $$ = $1 + $3; }
• The token NUMBER should be defined. Later we'll see cases where expression should also be defined, and how to define tokens with other data types.
• %token must be lowercase, e.g.:
    %token NUMBER
• From the tokens that are defined, Bison will create an appropriate header file
• Single-quoted characters can be used as tokens without declaring them, e.g., '+', '=' etc.

Lex - Definition Section
• Must include the header created by bison
• Must declare yylval as extern
    %{
    #include "simpleCalc.tab.h"
    extern int yylval;
    #include <math.h>
    %}

Bison Parser – C Section
• At a minimum, provide yyerror and main routines
    int yyerror(char *errmsg) { fprintf(stderr, "%s\n", errmsg); return 0; }
    int main(void) { yyparse(); return 0; }

Bison Intro Exercise
• Download SimpleCalc.y, SimpleCalc.l and mbison.bat
• Create the calculator executable:
    mbison simpleCalc
• FYI, mbison includes these steps:
    bison -d simpleCalc.y
    flex -L -osimpleCalc.c simpleCalc.l
    gcc -c simpleCalc.c
    gcc -c simpleCalc.tab.c
    gcc -Lc:\progra~1\gnuwin32\lib simpleCalc.o simpleCalc.tab.o -osimpleCalc.exe -lfl -ly
• Test with valid sentences (e.g., 3+6-4) and invalid sentences.

Understanding simpleCalc
simpleCalc.l:
    %{
    #include "simpleCalc.tab.h"
    extern int yylval;
    %}
    %%
    [0-9]+ { yylval = atoi(yytext); return NUMBER; }
    [ \t] ; /* ignore white space */
    \n return 0; /* logical EOF */
    . return yytext[0];
    %%
    /*--------------------*/
    /* 5. Other C code that we need. */
    int yyerror(char *errmsg) { fprintf(stderr, "%s\n", errmsg); return 0; }
    int main(void) { yyparse(); return 0; }
Explanation: When the lexer recognizes a number [0-9]+ it returns the token NUMBER and sets yylval to the corresponding integer value. When the lexer sees a newline it returns 0. If it sees a space or tab it ignores it. When it sees any other character it returns that character (the first character in the yytext buffer). If yyparse recognizes it, good! Otherwise the parser can generate an error.
simpleCalc.tab.h:
    #ifndef YYTOKENTYPE
    # define YYTOKENTYPE
    /* Put the tokens into the symbol table, so that GDB and other debuggers know about them. */
    enum yytokentype { NAME = 258, NUMBER = 259 };
    #endif
    /* Tokens. */
    #define NAME 258
    #define NUMBER 259

Understanding simpleCalc, continued
simpleCalc.y:
    %token NAME NUMBER
    %%
    statement: NAME '=' expression
        | expression { printf("= %d\n", $1); }
        ;
    expression: expression '+' NUMBER { $$ = $1 + $3; }
        | expression '-' NUMBER { $$ = $1 - $3; }
        | NUMBER { $$ = $1; }
        ;
Explanation: execute simpleCalc and enter the expression 1+2
• The main program calls yyparse.
• yyparse calls lex to recognize 1 as a NUMBER (puts 1 in yylval), then sets $$ = $1
• It calls lex, which returns +; this matches '+' in the first expression rhs
• It calls lex to recognize 2 as a NUMBER (puts 2 in yylval)
• It recognizes expression '+' NUMBER and "reduces" this rule, doing the action { $$ = $1 + $3 }
• It recognizes expression as a statement, so it does the printf action.
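The effect of the left-recursive expression rules can be imitated in plain C to make the trace concrete. This is not the code bison generates (bison uses a table-driven LALR(1) parser with a value stack). It is a hand-written sketch in which the running total plays the role of $1 and each new number plays the role of $3. All function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Skips spaces and tabs, like the [ \t] rule in the lexer. */
static const char *skip_blanks(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* Reads a NUMBER at p; returns the position after it, or NULL. */
static const char *read_number(const char *p, int *value)
{
    char *end;
    p = skip_blanks(p);
    if (!isdigit((unsigned char)*p))
        return NULL;
    *value = (int)strtol(p, &end, 10);
    return end;
}

/* Evaluates one line; returns 1 and stores the value on success. */
int eval_expression(const char *s, int *result)
{
    int lhs, rhs;
    const char *p;

    assert(s != NULL && result != NULL);
    p = read_number(s, &lhs);          /* expression: NUMBER { $$ = $1; } */
    if (p == NULL)
        return 0;
    for (;;) {
        char op;
        p = skip_blanks(p);
        op = *p;
        if (op != '+' && op != '-')
            break;
        p = read_number(p + 1, &rhs);
        if (p == NULL)
            return 0;                  /* syntax error, e.g. 3++4 */
        /* expression: expression op NUMBER { $$ = $1 op $3; } */
        lhs = (op == '+') ? lhs + rhs : lhs - rhs;
    }
    if (*p != '\0' && *p != '\n')
        return 0;                      /* unexpected trailing input */
    *result = lhs;                     /* statement: expression */
    return 1;
}
```

For the input 1+2 this stores 3, matching what the printf action in statement would print, and for 3+6-4 it stores 5 because the left recursion groups the operations from left to right.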

SimpleCalc
• Do simpleCalc exercise #1

Adding other variable types*
• YYSTYPE determines the data type of the values returned by the lexer.
• If the lexer returns different types depending on what is read, include a union:
    %union {          // C feature, allows one memory area to
        char cval;    // be interpreted in different ways.
        char *sval;   // For bison, will be used with yylval
        int ival;
    }
• The union will be placed at the top of your .y file (in the definitions section)
• Tokens and non-terminals should be defined using the union
* relates to SimpleCalc exercise 2
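For readers who have not used a C union before, the sketch below shows the underlying language feature on its own, outside bison. The union has the same three members as the %union on the slide. The Tag enum, the TaggedValue struct and format_value are hypothetical additions, since a plain union does not record which member is currently valid (in bison, the token type plays that role).

```c
#include <assert.h>
#include <stdio.h>

/* Same members as the %union on the slide. */
typedef union {
    char cval;
    char *sval;
    int ival;
} Value;

/* Hypothetical tag recording which union member is in use. */
typedef enum { TAG_CHAR, TAG_STRING, TAG_INT } Tag;

typedef struct {
    Tag tag;
    Value value;
} TaggedValue;

/* Writes a description of v into out; returns the snprintf result. */
int format_value(const TaggedValue *v, char *out, size_t size)
{
    assert(v != NULL && out != NULL);
    switch (v->tag) {
        case TAG_CHAR:
            return snprintf(out, size, "char '%c'", v->value.cval);
        case TAG_STRING:
            return snprintf(out, size, "string \"%s\"", v->value.sval);
        case TAG_INT:
            return snprintf(out, size, "int %d", v->value.ival);
    }
    return -1;
}
```

Setting value.ival = 42 with tag TAG_INT and calling format_value produces the text int 42. Reading a member other than the one last written is not meaningful, which is why each token and non-terminal is tied to one member.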

Adding other variable types - Example
• Definitions in simpleCalc.y:
    %union {
        float fval;
        int ival;
    }
    %token <ival> NUMBER
    %token <fval> FNUMBER
    %type <fval> expression
• Use the union in rules in simpleCalc.l:
    {DIGIT}+ { yylval.ival = atoi(yytext); return NUMBER; }

Processing lexemes in flex*
• Sometimes you want to modify a lexeme before it is passed to bison. This can be done by putting a function call in the flex rules
• Example: to convert input to lower case
    • put a prototype for your function in the definition section (above the first %%)
    • write the function definition in the C-code section (bottom of file)
    • call your function when the token is recognized. Use strdup to pass the value to bison.
* relates to SimpleCalc exercise 3

Example continued
    %{
    #include "example.tab.h"
    #include <ctype.h>   /* tolower */
    #include <string.h>  /* strdup */
    void make_lower(char *text_in);  /* need prototype here */
    %}
    %%
    [a-zA-Z]+ { make_lower(yytext);            /* function call to process text */
                yylval.sval = strdup(yytext);  /* make duplicate using strdup */
                return KEYWORD; }              /* return token type */
    %%
    /* function code in C section */
    void make_lower(char *text_in)
    {
        int i;
        for (i = 0; text_in[i] != '\0'; ++i)
            text_in[i] = tolower((unsigned char)text_in[i]);
    }
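The lower-casing step can be tested on its own, without flex or bison. The sketch below combines the two steps from the slide (convert to lower case, then duplicate) into one hypothetical helper, lower_copy. It uses malloc instead of strdup only so that it compiles as strict ISO C. The reason for copying is the same as on the slide: yytext is reused by the lexer for the next token, so the parser needs its own copy.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns a newly allocated lower-case copy of text,
 * or NULL if memory cannot be allocated. The caller frees it.
 */
char *lower_copy(const char *text)
{
    size_t len, i;
    char *copy;

    assert(text != NULL);
    len = strlen(text);
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    for (i = 0; i < len; ++i)
        copy[i] = (char)tolower((unsigned char)text[i]);
    copy[len] = '\0';
    return copy;
}
```

For example, lower_copy("BeGin") returns a fresh string containing begin and leaves the original text untouched.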

Adding actions to rules*
• For more complex processing, functions can be added to bison.
• Remember to add a prototype at the top, and the function at the bottom
* relates to SimpleCalc exercise 4

Processing more than one line*
• To process more than one line, ensure the \n is simply ignored
• Use a recursive rule to allow multiple inputs
* relates to SimpleCalc exercise 4

Example todo.l
    %{
    #include "todo.tab.h"
    %}
    %%
    groceries |
    clothes |
    dinner |
    book |
    bicycle |
    home { yylval.sval = strdup(yytext); return NOUN; }
    buy |
    write |
    read |
    ride |
    eat |
    go { yylval.sval = strdup(yytext); return VERB; }
    [ \t\n\r]+ ; /* ignore white space */
    [a-zA-Z]+ { yylval.sval = strdup(yytext); return NAME; }
    . { return yytext[0]; }
    <<EOF>> return 0;
    %%
    int yyerror(char *errmsg) {
        fprintf(stderr, "error is: %s\n", errmsg);
        return 0;
    }
    int main(void) { yyparse(); return 0; }
All one file (the slide used two columns to fit). The noun and verb rules must come before the [a-zA-Z]+ rule, because flex picks the earliest rule when two patterns match the same text.

Example todo.y
    // Recognizes "to-do" list in form name { verb noun }
    %union {
        char *sval;
    }
    %{
    #include <stdio.h>   /* printf */
    #include <stdlib.h>  /* exit */
    void end(char*);
    %}
    %token <sval> VERB NOUN NAME
    %type <sval> todoList person task entry
    %%
    entry: person '{' todoList '}' { end($1); }
        ;
    person: NAME { printf("Greetings, %s\n", $1); $$ = $1; }
        ;
    todoList: todoList task
        | task
        ;
    task: VERB NOUN { printf("You need to %s %s\n", $1, $2); }
        ;
    %%
    void end(char* name) {
        printf("That is all for now, %s!\n", name);
        exit(0);
    }
All one file (the slide used two columns to fit).

SimpleCalc
• Do remaining simpleCalc exercises

Using files with Bison
The lexer reads its input from yyin, which defaults to standard input. The following code can be used to take an optional command-line argument:
    extern FILE *yyin;
    int main(int argc, char **argv)
    {
        FILE *file;
        if (argc == 2) {
            file = fopen(argv[1], "r");
            if (!file) {
                fprintf(stderr, "Couldn't open %s\n", argv[1]);
                exit(1);
            }
            yyin = file;
        }
        yyparse();
        return 0;
    }
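The file-or-stdin decision can be factored into a small function that is easy to test without a generated parser. This is a sketch under that assumption: open_input is a hypothetical helper, and unlike the slide's code it returns NULL on failure instead of calling exit, leaving the decision to the caller.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Returns the stream the lexer should read from:
 * the file named by argv[1] if exactly one argument was given,
 * otherwise stdin. Returns NULL if the file cannot be opened.
 */
FILE *open_input(int argc, char **argv)
{
    assert(argv != NULL);
    if (argc == 2) {
        FILE *file = fopen(argv[1], "r");
        if (file == NULL)
            fprintf(stderr, "Couldn't open %s\n", argv[1]);
        return file;
    }
    return stdin;
}
```

In a bison program, main would assign the result to yyin (after checking for NULL) and then call yyparse().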

More details (if you're curious)
Running flex creates simpleCalc.c. This creates the following case statement (I added the printf statements):
    case 1:
    YY_RULE_SETUP
    printf("returning number value %d\n", atoi(yytext));
    { yylval = atoi(yytext); return NUMBER; }
        YY_BREAK
    case 2:
    YY_RULE_SETUP
    printf("ignoring white space\n");
    ; /* ignore white space */
        YY_BREAK
    case 3:
    YY_RULE_SETUP
    printf("recognized eof\n");
    return 0; /* logical EOF */
        YY_BREAK
    case 4:
    YY_RULE_SETUP
    printf("returning other character %c\n", yytext[0]);
    return yytext[0];
        YY_BREAK

Continuing more detail
Running bison creates simpleCalc.tab.c:
    switch (yyn) {
    case 3:
    #line 4 "simpleCalc.y"
        { printf("= %d\n", (yyvsp[0])); ;}
        break;
    case 4:
    #line 7 "simpleCalc.y"
        { (yyval) = (yyvsp[-2]) + (yyvsp[0]); ;}
        break;
    case 5:
    #line 8 "simpleCalc.y"
        { (yyval) = (yyvsp[-2]) - (yyvsp[0]); ;}
        break;
    case 6:
    #line 9 "simpleCalc.y"
        { (yyval) = (yyvsp[0]); ;}
        break;
• Notice the use of the stack pointer yyvsp for the $ values
• NOTE: I added extra printf statements to each case, which is what you can see in the trace.

Continuing more detail
• In exercise 2 you define a union. This gets translated to code within SimpleCalc.tab.h (excerpt):
    #if ! defined (YYSTYPE) && ! defined (YYSTYPE_IS_DECLARED)
    #line 1 "simpleCalcEx2.y"
    typedef union YYSTYPE {
        float fval;
        int ival;
    } YYSTYPE;
    extern YYSTYPE yylval;
• This is what makes your yylval return part of the union

Continuing more detail
• Symbols you define in bison's CFG are added to a symbol table:
    static const char *const yytname[] = {
        "$end", "error", "$undefined", "NUMBER", "FNUMBER", "NAME",
        "'='", "'+'", "'*'", "'('", "')'", "$accept",
        "statement", "expression", "term", "factor", 0
    };

Continuing more detail
• New rules make use of the union:
    switch (yyn) {
    case 3:
    #line 15 "simpleCalcEx2.y"
        { printf("= %f\n", (yyvsp[0].fval)); ;}
        break;
    case 4:
    #line 18 "simpleCalcEx2.y"
        { (yyval.fval) = (yyvsp[-2].fval) + (yyvsp[0].fval); ;}
        break;
    case 5:
    #line 19 "simpleCalcEx2.y"
        { (yyval.fval) = (yyvsp[0].fval); ;}
        break;
• expression is defined as <fval>, so is NUMBER

Summary of steps (from online manual)
The actual language-design process using Bison, from grammar specification to a working compiler or interpreter, has these parts:
1. Formally specify the grammar in a form recognized by Bison (i.e., machine-readable BNF). For each grammatical rule in the language, describe the action that is to be taken when an instance of that rule is recognized. The action is described by a sequence of C statements.
2. Write a lexical analyzer to process input and pass tokens to the parser.
3. Write a controlling function (main) that calls the Bison-produced parser.
4. Write error-reporting routines.