CS 432: Compiler Construction Lecture 5 Department of Computer Science Salisbury University Fall 2017 Instructor: Dr. Sophie Wang http://faculty.salisbury.edu/~xswang
Goals and Approach The goals are ‒ Parsers in the front end for certain Pascal constructs: compound statements, assignment statements, and expressions. ‒ Flexible, language-independent intermediate code generated by the parsers to represent these constructs.
The approach is to: ‒ begin with syntax diagrams for the Pascal constructs. ‒ create a conceptual design for the intermediate code structures, develop Java interfaces that represent the design, and write the Java classes that implement the interfaces. ‒ let the syntax diagrams guide the development of parsers that generate the appropriate intermediate code. ‒ finally, use a syntax checker utility program to help verify the code you develop for this chapter.
Quick Review of the Framework ‒ Today’s topic: Parsing assignment statements and expressions, and generating parse trees.
Grammars and Languages
‒ A grammar defines a language.
  ‒ Grammar = the set of all the BNF rules (or syntax diagrams).
  ‒ Language = the set of all the legal strings of tokens … according to the grammar.
  ‒ Legal string of tokens = a syntactically correct statement.
‒ A statement is in the language (it’s syntactically correct) if it can be derived by the grammar.
  ‒ Each grammar rule “produces” a token string in the language.
  ‒ The sequence of productions required to arrive at a syntactically correct token string is the derivation of the string.
Grammars and Languages, cont’d ‒ Example: A very simplified expression grammar:
Grammar Hierarchy
‒ A lot of our understanding of grammars came from the work of the American linguist Noam Chomsky.
‒ There are four categories of formal grammars in the Chomsky Hierarchy, spanning from Type 0, the most general, to Type 3, the most restrictive.
‒ More restrictions on the grammar make the language easier to describe and to parse efficiently, but reduce its expressive power.
Type 0: Free or unrestricted grammars
‒ These are the most general.
‒ Productions are of the form u –> v, where both u and v are arbitrary strings of symbols in V, with u non-null.
‒ There are no restrictions on what appears on the left- or right-hand side other than that the left-hand side must be non-empty.
Type 1: Context-Sensitive Grammars
‒ Context-sensitive grammars are very powerful; they can define more languages than context-free grammars can.
‒ Production rules can be of the form αAβ –> αγβ, where A is a nonterminal and the strings α and β give the context in which A may be rewritten as the non-empty string γ.
Context-Sensitive Grammars, cont’d
‒ Context-sensitive grammars are extremely unwieldy for writing compilers.
  ‒ Alternative: Use context-free grammars and rely on semantic actions, such as building symbol tables, to provide the context.
Type 2: Context-Free Grammars (CFG)
‒ Every production rule has a single nonterminal for its left-hand side.
  ‒ Example:
Type 3: Regular grammars
‒ Productions are of the form

    X –> aY
    X –> ε

  where X and Y are nonterminals and a is a terminal.
‒ That is, the left-hand side must be a single nonterminal, and the right-hand side can be either empty, a single terminal by itself, or a single terminal followed by a single nonterminal.
‒ These grammars are the most limited in terms of expressive power.
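As a small illustration (not taken from the slides), a regular grammar for binary strings fits this form exactly:

    B –> 0 B
    B –> 1 B
    B –> ε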
‒ Every Type 3 grammar is a Type 2 grammar, every Type 2 grammar is a Type 1 grammar, and so on.
‒ Type 3 grammars are particularly easy to parse because of the lack of recursive constructs.
‒ Efficient parsers exist for many classes of Type 2 grammars.
‒ Although Type 1 and Type 0 grammars are more powerful than Type 2 and Type 3, they are far less useful since we cannot create efficient parsers for them.
‒ In designing programming languages using formal grammars, we will use Type 2, or context-free, grammars (CFGs).
Issues in parsing context-free grammars: ambiguity, recursive rules, and left-factoring.
Ambiguity
‒ If a grammar permits more than one parse tree for some sentences, it is said to be ambiguous.
‒ For example, consider the following classic arithmetic expression grammar:

    E –> E op E | ( E ) | int
    op –> + | - | * | /
    int –> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
10 – 2 * 5: this grammar permits two different parse trees for the same expression, one grouping it as (10 – 2) * 5 and the other as 10 – (2 * 5), so the grammar is ambiguous.
Recursive productions
‒ Productions are often defined in terms of themselves. For example, a list of variables could be specified as:

    variable_list –> variable | variable_list , variable

‒ If the recursive nonterminal is at the left of the right side of the production, e.g. A –> u | Av, we call the production left-recursive.
‒ Similarly, we can define a right-recursive production: A –> u | vA.
Left-factoring
‒ The parser usually reads tokens from left to right, and it is convenient if, upon reading a token, it can make an immediate decision about which production from the grammar to expand.
‒ However, this causes trouble when productions share common first symbol(s) on the right side, as in:

    Stmt –> if Cond then Stmt else Stmt
          | if Cond then Stmt
‒ Left-factoring allows us to restructure the grammar to avoid this situation:

    Stmt –> if Cond then Stmt OptElse
    OptElse –> else Stmt | ε
Top-Down Parsers ‒ Start with the topmost nonterminal grammar symbol (the start symbol).
Top-Down Parsers, cont’d
‒ Write a parse method for each production (grammar) rule.
  ‒ Each parse method “expects” to see tokens from the source program that match its production rule.
  ‒ Example: Stmt –> if Cond then Stmt OptElse
  ‒ A parse method calls other parse methods that implement lower production rules.
  ‒ Parse methods consume tokens that match the production rules.
‒ A parse is successful if it is able to derive the input string (i.e., the source program) from the production rules.
  ‒ All the tokens match the production rules and are consumed.
Top-down parsing
‒ A top-down parsing program consists of a set of procedures, one for each nonterminal.
‒ Execution begins with the procedure for the start symbol, which halts and announces success if its procedure body scans the entire input string.
Top-down parsing
A typical procedure for nonterminal A in a top-down parser:

    boolean parserForA() {
        choose an A-production, A → X1 X2 … Xk;
        for (i = 1 to k) {
            if (Xi is a nonterminal)
                call the procedure for Xi();
            else if (Xi matches the current input token “a”)
                advance the input to the next token;
            else
                /* an error has occurred */;
        }
    }
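To make the generic procedure concrete, here is a small, self-contained Java sketch (illustrative only, not part of the book's framework) of a recursive-descent method for the left-factored rule Stmt –> if Cond then Stmt OptElse from the earlier slide; the class name and helper methods are hypothetical.

    import java.util.Arrays;
    import java.util.List;

    // Illustrative sketch: a recursive-descent method for
    //   Stmt -> "if" Cond "then" Stmt OptElse,   OptElse -> "else" Stmt | ε
    public class IfStmtSketch {
        private final List<String> tokens;   // token strings from a scanner
        private int pos = 0;                  // index of the current token

        public IfStmtSketch(List<String> tokens) { this.tokens = tokens; }

        private String current() { return pos < tokens.size() ? tokens.get(pos) : "<eof>"; }

        private void match(String expected) {
            if (!current().equals(expected)) {
                throw new RuntimeException("expected " + expected + " but found " + current());
            }
            pos++;  // consume the matching token
        }

        // Stmt -> "if" Cond "then" Stmt OptElse
        void parseIfStmt() {
            match("if");
            parseCond();
            match("then");
            parseStmt();
            if (current().equals("else")) {   // OptElse: decide with one token of lookahead
                match("else");
                parseStmt();
            }                                 // otherwise OptElse -> ε: consume nothing
        }

        // Stand-ins so the sketch compiles; a real parser has one method per rule.
        void parseCond() { match("cond"); }
        void parseStmt() { if (current().equals("if")) parseIfStmt(); else match("stmt"); }

        public static void main(String[] args) {
            // Parses without error, demonstrating the optional ELSE branch.
            new IfStmtSketch(Arrays.asList("if", "cond", "then", "stmt", "else", "stmt")).parseIfStmt();
        }
    }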
Bottom-Up Parsers
‒ A popular type of bottom-up parser is the shift-reduce parser.
  ‒ A bottom-up parser starts with the input tokens from the source program.
‒ A shift-reduce parser uses a parse stack.
  ‒ The stack starts out empty.
  ‒ The parser shifts (pushes) each input token (terminal symbol) from the scanner onto the stack.
Bottom-Up Parsers, cont’d
‒ When what’s on top of the parse stack matches the longest right-hand side of a production rule:
  ‒ The parser pops off the matching symbols and …
  ‒ … reduces (replaces) them with the nonterminal symbol on the left-hand side of the matching rule.
‒ Example:
Example: Shift-Reduce Parsing
‒ Parse the expression a + b*c, given the production rules (shown as a diagram on the slide):

    Parse stack (top at right)    Input      Action
                                  a + b*c    shift
    a                             + b*c      reduce
    … (the remaining rows of the trace appear only in the slide’s table)
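To illustrate the mechanism in code (this is not the book's parser and is not table-driven), here is a minimal Java sketch of a shift-reduce loop for the toy grammar E –> E + E | int; the class and method names are made up for the example.

    import java.util.ArrayDeque;
    import java.util.Arrays;
    import java.util.Deque;
    import java.util.List;

    // Minimal shift-reduce sketch for the toy grammar  E -> E + E | int .
    // A real shift-reduce parser drives these two actions from a generated
    // table instead of the hard-coded checks below.
    public class ShiftReduceSketch {
        public static boolean parse(List<String> input) {
            Deque<String> stack = new ArrayDeque<>();     // parse stack, top = last element
            int next = 0;
            while (true) {
                if (tryReduce(stack)) continue;           // reduce whenever a right-hand side is on top
                if (next < input.size()) {
                    stack.addLast(input.get(next++));     // otherwise shift the next input token
                } else {
                    // accept when only the start symbol remains
                    return stack.size() == 1 && "E".equals(stack.peekLast());
                }
            }
        }

        // Pop a matching right-hand side and push the left-hand nonterminal.
        private static boolean tryReduce(Deque<String> stack) {
            List<String> symbols = Arrays.asList(stack.toArray(new String[0]));
            int n = symbols.size();
            if (n >= 1 && "int".equals(symbols.get(n - 1))) {               // E -> int
                stack.removeLast();
                stack.addLast("E");
                return true;
            }
            if (n >= 3 && "E".equals(symbols.get(n - 1)) && "+".equals(symbols.get(n - 2))
                       && "E".equals(symbols.get(n - 3))) {                 // E -> E + E
                stack.removeLast(); stack.removeLast(); stack.removeLast();
                stack.addLast("E");
                return true;
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(parse(Arrays.asList("int", "+", "int")));    // prints true
        }
    }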
Why Bottom-Up Parsing?
‒ The shift-reduce actions can be driven by a table.
  ‒ The table is based on the production rules.
  ‒ It is almost always generated by a compiler-compiler.
‒ Like a table-driven scanner, a table-driven parser can be very compact and extremely fast.
  ‒ However, for a significant grammar, the table can be nearly impossible for a human to follow.
  ‒ Error recovery can be especially tricky.
‒ It can be very hard to debug the parser if something goes wrong.
  ‒ It’s usually an error in the grammar (of course!).
LL and LR Parsers
‒ Parsers are classified LL or LR according to the way they operate while parsing.
‒ The first L stands for left-to-right, the order in which the parser reads the source program.
‒ If the second letter is also L, it means that whenever the parser is processing a production rule, it “expands” the leftmost nonterminal symbol first.
‒ LL parsers begin at the start symbol and try to apply productions to arrive at the target string. An LL parse is a left-to-right, leftmost derivation: we consider the input symbols from left to right and attempt to construct a leftmost derivation, beginning at the start symbol and repeatedly expanding the leftmost nonterminal until we arrive at the target string.
‒ LR parsers begin at the target string and try to arrive back at the start symbol. An LR parse is a left-to-right, rightmost derivation in reverse: we scan from left to right, and the parser continuously picks a substring of the input and attempts to reverse it back to a nonterminal.
‒ During an LL parse, the parser continuously chooses between two actions:
  ‒ Predict: Based on the leftmost nonterminal and some number of lookahead tokens, choose which production ought to be applied to get closer to the input string.
  ‒ Match: Match the leftmost guessed terminal symbol with the leftmost unconsumed symbol of input.
‒ As an example, given this grammar:

    S → E
    E → T + E
    E → T
    T → int

‒ Notice that at each step the LL parser looks at the leftmost symbol in the production. If it is a terminal, we match it; if it is a nonterminal, we predict what it is going to be by choosing one of the rules.
int + int (with an LL parser)

    Production    Input        Action
    S             int + int    Predict S -> E
    E             int + int    Predict E -> T + E
    T + E         int + int    Predict T -> int
    int + E       int + int    Match int
    + E           + int        Match +
    E             int          Predict E -> T
    T             int          Predict T -> int
    int           int          Match int
                               Accept
‒ In an LR parser, there are two actions:
  ‒ Shift: Add the next token of input to a buffer for consideration.
  ‒ Reduce: Reduce a collection of terminals and nonterminals in this buffer back to some nonterminal by reversing a production.
int + int + int (with an LR parser)

    Workspace       Input              Action
                    int + int + int    Shift
    int             + int + int        Reduce T -> int
    T               + int + int        Shift
    T +             int + int          Shift
    T + int         + int              Reduce T -> int
    T + T           + int              Shift
    T + T +         int                Shift
    T + T + int                        Reduce T -> int
    T + T + T                          Reduce E -> T
    T + T + E                          Reduce E -> T + E
    T + E                              Reduce E -> T + E
    E                                  Reduce S -> E
    S                                  Accept
Pascal Statement Syntax Diagrams
Pascal Statement Syntax Diagrams, cont’d For now, greatly simplified!
Backus-Naur Form (BNF)
‒ A text-based way to describe source language syntax.
  ‒ Named after John Backus and Peter Naur.
  ‒ Text-based means it can be read by a program … such as a compiler-compiler that can automatically generate a parser for a source language after reading (and parsing) the language’s syntax rules written in BNF.
‒ Uses certain meta-symbols: symbols that are part of BNF itself but are not necessarily part of the syntax of the source language.

    ::=    “is defined as”
    |      “or”
    < >    surround names of nonterminal (not literal) items
BNF Example: U.S. Postal Address
BNF: Optional and Repeated Items
‒ To show optional items in BNF, use the vertical bar |.
  ‒ “An expression is a simple expression optionally followed by a relational operator and another simple expression.”
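Written out as a BNF rule, that sentence might read (an illustrative rule, not copied from the slide):

    <expression> ::= <simple expression> | <simple expression> <rel op> <simple expression>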
BNF Example: Pascal Number
BNF Example: Pascal IF Statement
Derivations and Productions ‒ Is (1 + 2)*3 valid in our expression language?
Extended BNF (EBNF)
‒ Extended BNF (EBNF) adds the meta-symbols { } and [ ]:

    { }   surround items to be repeated zero or more times
    [ ]   surround optional items

‒ Originally developed by Niklaus Wirth.
  ‒ Inventor of Pascal.
  ‒ Early user of syntax diagrams.
‒ Repetition (one or more): BNF vs. EBNF (illustrated after the next slide).
Extended BNF, cont’d
‒ Optional items: BNF vs. EBNF (illustrated below).
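As an illustration of both meta-symbols (example rules, not taken from the slides):

    BNF (repetition):   <identifier list> ::= <identifier> | <identifier list> , <identifier>
    EBNF (repetition):  <identifier list> ::= <identifier> { , <identifier> }

    BNF (optional):     <if statement> ::= IF <condition> THEN <statement>
                                         | IF <condition> THEN <statement> ELSE <statement>
    EBNF (optional):    <if statement> ::= IF <condition> THEN <statement> [ ELSE <statement> ]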
Parse Tree: Conceptual Design
More accurately called an abstract syntax tree (AST).

    BEGIN
        alpha := -88;
        beta := 99;
        result := alpha + 3/(beta - gamma) + 5
    END
Parse Tree: Conceptual Design, cont’d
‒ At the conceptual design level, we don’t care how we implement the tree.
  ‒ This should remind you of how we first designed the symbol table.
‒ Basic tree operations:
  ‒ Create a new node.
  ‒ Create a copy of a node.
  ‒ Set and get the root node of a parse tree.
  ‒ Set and get an attribute value in a node.
  ‒ Add a child node to a node.
  ‒ Get the list of a node’s child nodes.
  ‒ Get a node’s parent node.
Intermediate Code Interfaces Goal: Keep it source language-independent.
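The interface diagrams themselves are not reproduced here; the following sketch shows approximately what the two interfaces provide, inferred from the methods used in the slides that follow (the framework's actual declarations may differ in detail).

    // Rough sketch of the two intermediate-code interfaces (inferred, not verbatim).
    interface ICode {
        ICodeNode setRoot(ICodeNode node);   // set the root of the parse tree and return it
        ICodeNode getRoot();                 // get the root of the parse tree
    }

    interface ICodeNode {
        ICodeNodeType getType();                        // node type (ASSIGN, ADD, ...)
        ICodeNode getParent();                          // parent node
        ICodeNode addChild(ICodeNode node);             // adopt a child node and return it
        java.util.ArrayList<ICodeNode> getChildren();   // list of child nodes
        void setAttribute(ICodeKey key, Object value);  // set an attribute such as LINE or VALUE
        Object getAttribute(ICodeKey key);              // look up an attribute value
        ICodeNode copy();                               // make a copy of this node
    }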
Intermediate Code Implementations
An Intermediate Code Factory Class

    public class ICodeFactory {
        public static ICode createICode() {
            return new ICodeImpl();
        }

        public static ICodeNode createICodeNode(ICodeNodeType type) {
            return new ICodeNodeImpl(type);
        }
    }
Coding to the Interfaces (Again) ‒ Example:

    // Create the ASSIGN node.
    ICodeNode assignNode = ICodeFactory.createICodeNode(ASSIGN);

    // Create the VARIABLE node (left-hand side).
    ICodeNode variableNode = ICodeFactory.createICodeNode(VARIABLE);

    // Adopt the VARIABLE node as the first child.
    assignNode.addChild(variableNode);
Intermediate Code (ICode) Node Types
‒ Do not confuse node types (ASSIGN, ADD, etc.) with data types (integer, real, etc.).
‒ We use the enumerated type ICodeNodeTypeImpl for node types, which is different from the enumerated type PascalTokenType, to help maintain source language independence.

    public enum ICodeNodeTypeImpl implements ICodeNodeType {
        // Program structure
        PROGRAM, PROCEDURE, FUNCTION,

        // Statements
        COMPOUND, ASSIGN, LOOP, TEST, CALL, PARAMETERS,
        IF, SELECT_BRANCH, SELECT_CONSTANTS, NO_OP,

        // Relational operators
        EQ, NE, LT, LE, GT, GE, NOT,

        // Additive operators
        ADD, SUBTRACT, OR, NEGATE,

        // Multiplicative operators
        MULTIPLY, INTEGER_DIVIDE, FLOAT_DIVIDE, MOD, AND,

        // Operands
        VARIABLE, SUBSCRIPTS, FIELD,
        INTEGER_CONSTANT, REAL_CONSTANT, STRING_CONSTANT, BOOLEAN_CONSTANT,
    }
Intermediate Code Node Implementation

    public class ICodeNodeImpl extends HashMap<ICodeKey, Object> implements ICodeNode { … }
A Parent Node Adopts a Child Node

    public ICodeNode addChild(ICodeNode node) {
        if (node != null) {
            children.add(node);
            ((ICodeNodeImpl) node).parent = this;
        }
        return node;
    }

‒ When a parent node adds a child node, we can say that the parent node “adopts” the child node.
‒ Keep the parse tree implementation simple!
What Attributes to Store in a Node?

    public enum ICodeKeyImpl implements ICodeKey {
        LINE, ID, VALUE;
    }

‒ Not much! Not every node will have these attributes.
  ‒ LINE: statement line number
  ‒ ID: symbol table entry of an identifier
  ‒ VALUE: data value
‒ Most of the information about what got parsed is encoded in the node type and in the tree structure.
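A minimal illustration (not from the slides) of how a parser might set one of these attributes, assuming the setAttribute() method sketched earlier and a token object that can report its line number:

    // Record the statement's line number on a newly created ASSIGN node.
    ICodeNode assignNode = ICodeFactory.createICodeNode(ASSIGN);
    assignNode.setAttribute(LINE, token.getLineNumber());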
Statement Parser Class
‒ Class StatementParser is a subclass of PascalParserTD, which is a subclass of Parser.
  ‒ Its parse() method builds a part of the parse tree and returns the root node of the subtree.
Statement Parser Subclasses
‒ StatementParser itself has subclasses:
  ‒ CompoundStatementParser
  ‒ AssignmentStatementParser
  ‒ ExpressionParser
‒ The parse() method of each subclass returns the root node of the subtree that it builds.
‒ Note the dependency relationships among StatementParser and its subclasses.
Building a Parse Tree
‒ Each parse() method builds a subtree and returns the root node of the subtree.
‒ The caller of the parse() method adopts the subtree’s root node as a child of the subtree that it’s building.
  ‒ The caller then returns the root node of its subtree to its caller.
  ‒ This process continues until the entire source has been parsed and we have the entire parse tree.
Building a Parse Tree, cont’d
‒ Example: BEGIN alpha := 10; beta := 20 END
1. CompoundStatementParser’s parse() method creates a COMPOUND node.
2. AssignmentStatementParser’s parse() method creates an ASSIGN node and a VARIABLE node, which the ASSIGN node adopts as its first child.
Building a Parse Tree, cont’d
3. ExpressionParser’s parse() method creates an INTEGER_CONSTANT node, which the ASSIGN node adopts as its second child.
4. The COMPOUND node adopts the ASSIGN node as its first child.
Building a Parse Tree, cont’d
5. Another set of calls to the parse() methods of AssignmentStatementParser and ExpressionParser builds another assignment statement subtree.
6. The COMPOUND node adopts the subtree as its second child.
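A simplified Java sketch of this adoption pattern (illustration only; the framework's real CompoundStatementParser also handles the semicolons between statements, the END token, and error recovery, and the atEndOfCompound() helper here is hypothetical):

    // Sketch of the adoption pattern in steps 1-6.
    public ICodeNode parse(Token token) throws Exception {
        ICodeNode compoundNode = ICodeFactory.createICodeNode(COMPOUND);   // step 1
        StatementParser statementParser = new StatementParser(this);

        while (!atEndOfCompound(token)) {                          // hypothetical end-of-list test
            ICodeNode statementNode = statementParser.parse(token); // steps 2-3 (and 5) build a subtree
            compoundNode.addChild(statementNode);                   // steps 4 and 6: adopt it as a child
            token = currentToken();
        }
        return compoundNode;                                        // the caller adopts this subtree in turn
    }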
Pascal Expression Syntax Diagrams
Expression Syntax Diagrams, cont’d
Expression Syntax Diagrams, cont’d
Pascal’s Operator Precedence Rules

    Level                      Operators
    1 (factor: highest)        NOT
    2 (term)                   multiplicative: * / DIV MOD AND
    3 (simple expression)      additive: + - OR
    4 (expression: lowest)     relational: = <> < <= > >=

‒ If there are no parentheses:
  ‒ Higher level operators execute before the lower level ones.
  ‒ Operators at the same level execute from left to right.
‒ Because the factor syntax diagram defines parenthesized expressions, parenthesized expressions always execute first, from the most deeply nested outwards.
Example Decomposition ‒ alpha + 3/(beta - gamma) + 5
Parsing Expressions
‒ Pascal statement parser subclass ExpressionParser has methods that correspond to the expression syntax diagrams:
  ‒ parseExpression()
  ‒ parseSimpleExpression()
  ‒ parseTerm()
  ‒ parseFactor()
‒ Each parse method returns the root of the subtree that it builds.
  ‒ Therefore, ExpressionParser’s parse() method returns the root of the entire expression subtree.
Parsing Expressions, cont’d
‒ Pascal’s operator precedence rules determine the order in which the parse methods are called.
‒ The parse tree that ExpressionParser builds determines the order of evaluation.
  ‒ Example: 32 + centigrade/ratio
  ‒ Do a postorder traversal of the parse tree. (Visit the left subtree, visit the right subtree, then visit the root.)
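A short illustrative sketch of such a postorder walk over an expression subtree, using the getChildren() method of ICodeNode:

    // Postorder walk: visit both child subtrees, then the operator node itself,
    // which is exactly the order in which the expression's operations execute.
    private void postorder(ICodeNode node) {
        for (ICodeNode child : node.getChildren()) {   // visit the left subtree, then the right subtree
            postorder(child);
        }
        System.out.println(node.getType());            // visit the root of this subtree last
    }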
Parsing Expressions, cont’d ‒ Example: a + b + c + d + e
Example: Method parseExpression()
‒ First, we need to map Pascal token types to parse tree node types.
  ‒ We’ll use a hash table.

    // Map relational operator tokens to node types.
    private static final HashMap …
Method parseExpression(), cont’d

    private ICodeNode parseExpression(Token token) throws Exception {
        ICodeNode rootNode = parseSimpleExpression(token);

        token = currentToken();
        TokenType tokenType = token.getType();

        if (REL_OPS.contains(tokenType)) {
            ICodeNodeType nodeType = REL_OPS_MAP.get(tokenType);
            ICodeNode opNode = ICodeFactory.createICodeNode(nodeType);
            opNode.addChild(rootNode);

            token = nextToken();  // consume the operator
            opNode.addChild(parseSimpleExpression(token));

            rootNode = opNode;
        }

        return rootNode;
    }
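parseSimpleExpression() and parseTerm() follow the same pattern one precedence level down. A rough sketch of parseTerm() is shown below; the MULT_OPS set and MULT_OPS_MAP table are assumed analogues of REL_OPS and REL_OPS_MAP, and the real method may differ in detail:

    // Sketch: parse a factor, then loop while the current token is a multiplicative operator.
    private ICodeNode parseTerm(Token token) throws Exception {
        ICodeNode rootNode = parseFactor(token);

        token = currentToken();
        TokenType tokenType = token.getType();

        while (MULT_OPS.contains(tokenType)) {                  // *, /, DIV, MOD, AND
            ICodeNodeType nodeType = MULT_OPS_MAP.get(tokenType);
            ICodeNode opNode = ICodeFactory.createICodeNode(nodeType);
            opNode.addChild(rootNode);                          // adopt what was parsed so far

            token = nextToken();                                // consume the operator
            opNode.addChild(parseFactor(token));                // adopt the next factor

            rootNode = opNode;                                  // the operator node becomes the new root
            token = currentToken();
            tokenType = token.getType();
        }

        return rootNode;
    }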
Printing Parse Trees
‒ Utility class ParseTreePrinter prints parse trees.
  ‒ Prints in an XML format.
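Roughly, each parse tree node becomes an XML element named after its node type, with the node's attributes (LINE, ID, VALUE) written as XML attributes. The following is only an illustrative shape for alpha := 10, not the tool's literal output:

    <COMPOUND line="...">
        <ASSIGN line="...">
            <VARIABLE id="alpha" />
            <INTEGER_CONSTANT value="10" />
        </ASSIGN>
    </COMPOUND>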
Pascal Syntax Checker I
‒ The -i compiler option prints the intermediate code:

    java -classpath classes Pascal execute -i assignments.txt

‒ Add to the constructor of the main Pascal class:

    if (intermediate) {
        ParseTreePrinter treePrinter = new ParseTreePrinter(System.out);
        treePrinter.print(iCode);
    }

‒ Demo
  ‒ For now, all we can parse are compound statements, assignment statements, and expressions.
  ‒ More syntax error handling.


