

  • Number of slides: 99

Control Flow Prediction #1
ECE 1773 Fall 2007 © A. Moshovos (Toronto)

Roadmap
• Out-of-Order Execution Overview
• The Need for Branch Prediction
• Dynamic Branch Prediction
• Control Speculative Execution

Out-of-Order Execution

do {
  sum += a[++m];
  i--;
} while (i != 0);

loop: add r4, r4, 1
      ld  r2, 10(r4)
      add r3, r2, r3
      sub r1, r1, 1
      bne r1, r0, loop

[Figure: fetch/decode timelines contrasting in-order superscalar issue with out-of-order execution of the loop body]

Sequential Semantics?
• Execution does NOT adhere to sequential semantics
• To be precise: eventually it may
• Simplest solution: define the problem away
  – Not acceptable today: e.g., virtual memory
• Three-phase instruction execution
  – In-Progress, Completed, and Committed
[Figure: fetch/decode streams labeled consistent vs. inconsistent]

Out-of-Order Execution Overview
[Figure: processing phases and program forms — static program → dispatch (dependences) → dynamic instruction stream (trace) → issue → execution → reorder & commit; the execution window holds in-progress instructions, with completed instructions ahead of committed ones]

Instructions are like air
• If you can't breathe, nothing else matters
• If you have no instructions to execute
  – no technique is going to help you
• 1 in every 5 insts. is a control flow one
  – we'll use the term branch
  – jumps, branches, calls, etc.
• Parallelism within a basic block is small
  – Need to go beyond branches
• Useful for simple pipelining and superscalar, but more so for out-of-order execution

Another Reason to Care about Branches
• Memory is slow, very slow
  – ~300 cycles
  – One solution is to tolerate memory latencies
  – Tolerate? Do something else: find parallelism
• Well, we need instructions
  – How many branches in 100 instructions?
  – Assuming 1 in 5: 20
• Overlap long delays with other, potentially useful work

Anatomy of Branch Instructions
• Conditional/Unconditional
  – Conditional: unknown direction and target address
  – Unconditional: unknown target
• Direct/Indirect
  – Direct: target constant, usually PC + immediate (in instruction)
  – Indirect: target changes

Branches on MIPS
• Conditional branches
  – Evaluate condition, may branch
  – Next PC = PC + sizeof(inst), or
  – Next PC = target address
    • Often PC + offset encoded in the instruction
  – Target address set is static
• Jumps
  – Definitely branch
  – PC = target address; target address is static
• Indirect jumps
  – PC = register value
  – Target address set is unbounded and dynamic
• Traps

Branch Prediction
• Guess the direction of a branch
• Guess its target (if necessary)
• Fetch instructions from there
• Execute speculatively
  – Without knowing whether we should
• Eventually, verify if the prediction was correct
  – If correct, good for us
  – If not, flush/squash: discard and execute down the right path
• Ultimately we need the target instructions
• Also known as Control Flow Prediction

Branch Prediction Timeline: Correct Prediction
[Figure: original dynamic instruction stream over time — branch C is fetched, the next PC is guessed, instructions are fetched and executed speculatively until the branch resolves and the prediction is validated]

Branch Prediction Timeline: Incorrect Prediction
[Figure: same timeline, but when the branch resolves the prediction proves wrong and the speculatively fetched and executed instructions are squashed]

Nested Predictions
[Figure: branch B1 is fetched and predicted, then B2 is fetched and predicted before B1 resolves; a mispredicted B2 squashes its own speculative work, while a mispredicted B1 squashes everything after it, including B2]

Terminology
• Branch Direction Prediction
  – Taken or not
• Branch Prediction
  – Target & direction
  – Often used for branch direction prediction
• Control Flow Speculation
  – Guessing targets (includes direction)
• Control Flow Speculative Execution
  – Executing instructions at the guessed target
• Branch Misprediction
  – Incorrect control flow speculation

Elements of Branch Prediction
• Start with the branch PC and answer:
  – Q1: Branch taken or not?
    • For conditional branches: direction prediction (binary)
  – Q2: Where to?
    • For all branches: target address, not binary
  – Q3: Target instruction(s)?
    • For all branches
• All must be done to be successful
• Why just the PC?
• Let's consider these separately

Where Branch Prediction Takes Place
• At the fetch phase
  – Fetching current instructions
  – At the end of this cycle we need to know what to fetch next
  – Otherwise we have a bubble
• Note: I-Cache latencies have been increasing: 2-3 cycles
• We have available the PC of the current fetch block
  – We do not know the instructions within the block
  – So, branch predictors use just the PC
  – Any other information must have been recorded previously

Branch Prediction Performance Potential
[Figure]
Source: "Branch Prediction, Instruction-Window Size, and Cache Size: Performance Tradeoffs and Simulation Techniques", Kevin Skadron, Pritpal S. Ahuja, Margaret Martonosi, Douglas W. Clark

Static Branch Prediction
• Static: decisions do not take into account dynamic behavior
  – Non-adaptive can be another term
• Always taken / always not-taken
• Forward NT, backward T
• If X then T, but if Y then NT, but if Z then T
  – More elaborate schemes are possible
  – "Branch Prediction 'for free'", Ball and Larus
• Bottom line
  – Accuracy is high but not high enough
  – Say it's 60%: probability of 100 useful instructions is 0.6^20 = 0.000036

Static Branch Prediction Accuracy
[Figure: correct prediction rate]
Source: Alex Ramirez, UPC

The Role of the Compiler
[Figure: correct prediction rate]
Source: Alex Ramirez, UPC

Metrics
• Misprediction rate (accuracy)
  – Wrong predictions per branch
  – Not well correlated to performance
• Mispredictions per instruction, MPI (or per 1000 instructions, MPKI): accuracy and "performance"
  – Well correlated to performance
  – Example: MPI = 1%, IPC = 2, squash penalty = 10 cycles
    • MPI = 1% → squash every 100 insts
    • Squash every 50 cycles, since IPC = 2
    • 10 cycles lost per 50 cycles → 20% performance loss
  – Constant squash latency is an approximation
Source: Uri Weiser slides
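The back-of-the-envelope above can be checked in a few lines (a sketch using the slide's own numbers and its constant-squash-penalty approximation):

```python
# Slide's cost model: MPI = 1%, IPC = 2, squash penalty = 10 cycles.
mpi = 0.01        # mispredictions per instruction
ipc = 2.0         # sustained instructions per cycle (ignoring squashes)
penalty = 10      # cycles lost per misprediction

insts_per_squash = 1 / mpi                  # one squash every 100 instructions
cycles_per_squash = insts_per_squash / ipc  # i.e., every 50 cycles
loss = penalty / cycles_per_squash          # 10 cycles lost per 50 useful cycles
print(f"performance loss = {loss:.0%}")     # 20%
```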

Why we need branch prediction, revisited
• Assume 1 in every 5 insts is a branch, with prediction accuracy p
• Probability we'll fetch n useful instructions: p^(n/5)
• Probability of 100 insts (TODAY): p^20
  – 0.6^20 = 0.000036
  – 0.7^20 = 0.00079
  – 0.8^20 = 0.011, or 1%
  – 0.9^20 = 0.12, or 12%
  – 0.95^20 = 36%
  – 0.98^20 = 66%
  – 0.99^20 = 82%
• Probability of 250 insts (SOON): p^50
  – 0.9^50 = 0.0051
  – 0.95^50 = 0.08
  – 0.98^50 = 0.36
  – 0.99^50 = 0.6
• Assuming a uniform distribution — not true, but for the sake of illustration
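All the numbers above follow from one formula; a quick sketch to reproduce them (the one-branch-per-5-instructions ratio and independence are the slide's own simplifying assumptions):

```python
# Probability that every branch in an n-instruction window is predicted
# correctly, with one branch per 5 instructions and per-branch accuracy p.
def p_useful_window(p, n, branch_every=5):
    return p ** (n // branch_every)

for p in (0.9, 0.95, 0.98, 0.99):
    print(p, round(p_useful_window(p, 100), 4), round(p_useful_window(p, 250), 4))
```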

Why Dynamic Branch Prediction Works
• Why? Larger window → more opportunity for parallelism
• Basic idea: hardware guesses whether a branch will be taken, and if so where it will go
• What makes this work?
  – History repeats itself
  – Past branch behavior is a STRONG indicator of future branch behavior
  – Branches tend to exhibit regular behavior

Regularity in Branch Behavior
• Given a branch at PC X, observe its behavior:
  – Q1: Taken or not?
  – Q2: Where to?
• In typical programs the answers are often:
  – A1: Same as last time
  – A2: Same as last time
• This is different from "it's always the same"
  – Allows for changes in direction and target

Last-outcome Branch Prediction
• J. E. Smith
• Start with the PC and answer whether taken or not
  – 1 bit of information: T or NT (e.g., 1 or 0)
• Example implementation: a 2^m x 1 memory, the branch prediction table
  – Read the prediction at fetch, indexing with the m LSBs of the PC
  – Change the bit on a misprediction (i.e., write at EX or commit)
  – May use a prediction from the wrong PC
    • Aliasing: destructive vs. constructive
• Also known as 1-bit prediction
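A minimal software sketch of such a table (the table size and low-order-bit indexing are illustrative choices, not from the slide):

```python
# Last-outcome (1-bit) direction predictor: one bit per entry,
# indexed by the low-order bits of the branch PC.
class LastOutcome:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)  # initialized to taken

    def predict(self, pc):
        return self.table[pc & self.mask] == 1  # True = predict taken

    def update(self, pc, taken):
        # Written on resolution (e.g., EX or commit); flips on mispredict.
        self.table[pc & self.mask] = 1 if taken else 0
```

Two different PCs with the same low-order bits share an entry — exactly the aliasing the slide mentions.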

Last-Outcome Predictor Example
• Initial state: taken
• Three branches; large enough table
• Actual sequence: T T T T T N T T T N
• Predicted:       T T T T T T N T T T

Aliasing
• Predictor space is finite
• Aliasing: multiple branches mapping to the same entry
  – Constructive: the branches behave similarly — may benefit accuracy
  – Destructive: they don't — will hurt accuracy
• Can play with the hashing function to minimize it
  – Black magic
  – A simple hash, (PC << 16) ^ PC, works OK

Learning Time
• Number of times we have to observe a branch before we can predict its behavior
• Last-outcome has a very fast learning time
  – We just need to see the branch at least once
• Even better: initialize the predictor to taken
  – Most branches are taken, so for those the learning time will be zero
• "Problem" with last-outcome: too easy to change its mind

Saturating-Counter Predictor / Bimodal
• Consider a strongly biased branch with an infrequent outcome (e.g., a loop branch):
  T T T T T T T T N T T T T
• Last-outcome will mispredict twice per infrequent-outcome encounter
• Idea: remember the most frequent case — hysteresis
• Saturating counter: states 00 and 01 predict not-taken; 10 and 11 predict taken
  – T moves the counter up (saturating at 11), N moves it down (saturating at 00)
  – 00/11: strong bias; 01/10: weak bias
• Often called a bimodal predictor
• Captures temporal bias
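The four-state counter can be sketched as follows (state encoding as on the slide: 0-1 predict not-taken, 2-3 predict taken; table size is an illustrative choice):

```python
# Bimodal predictor: a table of 2-bit saturating counters.
class Bimodal:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ctr = [2] * (1 << index_bits)  # start weakly taken (binary 10)

    def predict(self, pc):
        return self.ctr[pc & self.mask] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.ctr[i] = min(3, self.ctr[i] + 1)  # saturate at strongly taken
        else:
            self.ctr[i] = max(0, self.ctr[i] - 1)  # saturate at strongly not-taken
```

On the stream T T T T T T T T N T T T T the single N costs one misprediction and only weakens the counter, so the following T is still predicted correctly — where last-outcome would mispredict twice.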

Bimodal Comparison
• Initial state: taken / weakly taken
• Three branches; large enough table
• Actual sequence: T T T T T N T T T N
• Last-outcome:    T T T T T T N T T T
• Bimodal:         T T T T T T T T T T

Bimodal Accuracy
[Figure: misprediction rate]
Source: Alex Ramirez, UPC

Other Automata

Automata continued
• Exhaustive search of all options
• Goal: which best captures the behavior of branches?
• Seems somewhat ad hoc

Pattern-Based Prediction
• Nested loops:
  for i = 0 to N
    for j = 0 to 3
• Branch outcome stream for the j-for branch: 1110 1110 1110 …
• Associate a sequence of events with an outcome
  – E.g., last time I saw the pattern T T T, then I saw a NT
• Patterns: 111 → 0, 110 → 1, 101 → 1, 011 → 1
• 100% accuracy; learning time: 4 instances
• What is this associated with? PC + last three outcomes of the j-branch
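The j-loop example can be played out directly: keep a table mapping the last three outcomes to whatever followed that pattern last time (in effect a last-outcome predictor per pattern):

```python
# Pattern-based prediction of the 1110 1110 ... stream.
stream = [1, 1, 1, 0] * 8       # j-branch: taken 3 times, then not taken
table = {}                      # pattern (last 3 outcomes) -> next outcome
history = (1, 1, 1)
mispredicts = 0

for outcome in stream:
    pred = table.get(history, 1)        # default: predict taken
    mispredicts += (pred != outcome)
    table[history] = outcome            # remember what followed this pattern
    history = history[1:] + (outcome,)

print(mispredicts)  # 1: only the very first N, before 111 -> 0 is learned
```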

Pattern Predictor: Correlation amongst Branches
• From the program's perspective, different branches may be correlated:
  if (aa == 2) aa = 0;
  if (bb == 2) bb = 0;
  if (aa != bb) then …
• Can be viewed as a pattern detector
  – Instead of keeping aggregate history information (i.e., the most frequent outcome)
  – Keep exact history information: the pattern of the n most recent outcomes
• Example:
  – BHR: n most recent branch outcomes
  – Use PC and BHR (xor?) to access the prediction table

Two-Level Branch Predictors
• An architecture for pattern-based prediction
• First level captures the patterns
  – Branch History Register
• Second level captures the outcome of the pattern
  – Some form of confidence mechanism
  – Bimodal, last-outcome, or another automaton
• Yeh and Patt studies

Multi-Level Predictors (Yeh and Patt)
• GAg predictor: a global BHR indexes a bimodal prediction table
• PAg predictor: the PC selects a per-branch BHR, which indexes a shared bimodal prediction table

Multi-level Predictors, cont.
• PAp predictor
  – The PC selects a BHR
  – Separate prediction table per BHR
• Naming scheme: XAy
  – X, the history: G = global (common history amongst branches), P = private (history per branch)
  – y, the predictor tables: g = global, p = private

Multi-Level Predictors
• Key result:
  – Global prediction is fast
  – Local is more accurate
• Goal: can we get the speed of global and the accuracy of local?
• Solution: combined branch predictors
  – Examples: GShare, GSelect

Global vs. Local
[Figure]
Source: P. C. Yew

Gshare Predictor (McFarling, DEC)
• The global BHR and the PC are combined by a function f to index the branch history table (1-bit entries, or 2-bit hysteresis)
• PC and BHR can be
  – Concatenated (GSelect)
  – XORed (Gshare)
• How deep should the BHR be?
  – Really depends on the program
  – But deeper increases learning time
  – May increase the quality of information
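A sketch of the XOR variant (index width, counter initialization, and updating the history at resolve time are illustrative choices; real designs checkpoint the history so the update uses the same index the prediction used):

```python
# Gshare: global history XORed with the PC selects a 2-bit counter.
class Gshare:
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.ctr = [2] * (1 << index_bits)  # weakly taken
        self.bhr = 0                        # global branch history register

    def _index(self, pc):
        return (pc ^ self.bhr) & self.mask

    def predict(self, pc):
        return self.ctr[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)  # index with the history seen at prediction time
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask  # shift in outcome
```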

Combined Predictor Accuracy
[Figure]
Source: P. C. Yew

GShare vs. Bimodal
[Figure: correct prediction rate]
Source: Alex Ramirez, UPC

Multi-Method Predictors
• Some predictors work better than others for different branches
• Idea: use multiple predictors and one that selects among them
• Example:
  – Bimodal predictor
  – Pattern-based (e.g., Gshare) predictor
  – Bimodal selector
  – Initially the selector points to the bimodal
  – On a misprediction both the predictor and the selector are updated
  – Why? Gshare takes more time to learn

Tournament Predictor
• Predictors A and B operate independently
• C (the chooser or metapredictor) observes which of the two is right
  – Can be reduced to a binary stream by giving precedence to, say, A
  – C observes: A is wrong and B is right (0), else (1)
• Typically A has a longer learning time than B
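The chooser mechanism can be sketched as a per-entry 2-bit counter between two components passed in as objects; the "move only when the components disagree" update rule below is the usual formulation, and the sizes are illustrative:

```python
# Tournament predictor: 2-bit chooser per entry between components A and B.
class Tournament:
    def __init__(self, pred_a, pred_b, index_bits=10):
        self.a, self.b = pred_a, pred_b
        self.mask = (1 << index_bits) - 1
        self.choice = [2] * (1 << index_bits)  # >= 2: trust A, else trust B

    def predict(self, pc):
        a = self.a.predict(pc)
        b = self.b.predict(pc)
        return a if self.choice[pc & self.mask] >= 2 else b

    def update(self, pc, taken):
        a = self.a.predict(pc)
        b = self.b.predict(pc)
        i = pc & self.mask
        if a != b:  # only one component was right: move the chooser toward it
            if a == taken:
                self.choice[i] = min(3, self.choice[i] + 1)
            else:
                self.choice[i] = max(0, self.choice[i] - 1)
        self.a.update(pc, taken)
        self.b.update(pc, taken)
```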

Updates
• Speculatively update the predictor or not?
  – Speculative: on branch complete
  – Non-speculative: on branch resolve
• Why care?
  – A branch may take a long time to resolve
  – What about subsequent branches? Should they see the predicted history or not?
• Trace-based studies:
  – Speculative is better
  – Faster learning
  – Not much interference

The AGREE Predictor
• Goal: reduce negative interference
  – Two branches mapped onto the same entry
• Idea: add an agree (bias) bit to each instruction
  – Where to store this? Cache, or BTB (will talk about this later)
  – Predict whether we agree with the agree bit
• Example: B1 85% taken and B2 15% taken
  – Probability of opposite outcomes: B1 T x B2 N + B1 N x B2 T
  – Conventional: 0.85 x 0.85 + 0.15 x 0.15 = 74.5% ≈ 75%
  – Agree: 0.85 x 0.15 + 0.15 x 0.85 = 25.5%
Source: Uri Weiser
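The interference numbers spelled out (the 85%/15% biases are the slide's example):

```python
# Two branches alias to one 2-bit counter. B1 is taken 85% of the time,
# B2 only 15% of the time.
p1, p2 = 0.85, 0.15

# Conventional predictor: interference is destructive whenever the two
# branches pull the shared counter in opposite directions.
opposite = p1 * (1 - p2) + (1 - p1) * p2   # 0.85*0.85 + 0.15*0.15
print(f"conventional: {opposite:.1%}")      # 74.5%, i.e. ~75%

# Agree predictor: the counter tracks agreement with each branch's bias
# bit, and both branches agree with their own bias 85% of the time.
disagree = 0.85 * 0.15 + 0.15 * 0.85
print(f"agree: {disagree:.1%}")             # 25.5%
```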

Agree Predictor
"The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference", Eric Sprangle, Robert S. Chappell, Mitch Alsup, Yale N. Patt, ISCA 1997

Agree Predictor Accuracy

The Bi-Mode Predictor
• Idea: change the bias bit dynamically
• Three predictor tables
  – Taken
  – Not-taken
  – Choice
• Update only the one pointed to by Choice
• Choice is not updated when
  – the direction was incorrect, but
  – the sub-predictor was correct
"The Bi-Mode Branch Predictor", Chih-Chieh Lee, I-Cheng K. Chen, and Trevor N. Mudge, MICRO 1997

Bi-Mode Predictor

Bimode Accuracy
[Figure: best on average; best for gcc]

YAGS Predictor
• No need to store all branches in the bi-mode tables
• Only those that disagree
• Introduce tags in the PHTs
• Allows us to use smaller PHTs than Choice

YAGS Predictor

YAGS Accuracy

GSkew Predictor
• Aliasing is more important than size
• Reduce aliasing by splitting the predictor into multiple ones
• Vote amongst them

GSKEW Predictor

GSkew Accuracy

Dynamic History Length Fitting
• What is the right history length?

Dynamic History Length Fitting
• Current length; best-so-far length
• Misprediction counter; prediction counter
• Table of measured misprediction rates per length
  – Initialized at zero
• Sampling at fixed intervals:
  – Try a new length; get its misprediction rate
  – Adjust to the new length if better than before
  – Move to a random length if the length has not changed for a while
    • Avoid local optima

Dynamic History Length Fitting: An Example

Accuracy of Dynamic History Length Fitting
[Figure: various gshare configurations vs. DHLF]

Compression and Branch Prediction
• High correlation between the two
• I-C. Chen, J. Coffey, and T. Mudge. "Compression and branch prediction." 7th Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), Oct. 1996
• Intuitively: if compressible, then high redundancy
  – Or, an automaton exists that has the same behavior

Predictors as Markov Chains
• Markov chain of length j, where j = history length
  – 2^j states
  – Probability of going from state i to state k

Two-Level Predictors vs. Markov Chains

Compression and Branch Prediction

Compression and Branch Prediction

Path Correlation
[Figure: control-flow graph where A can't be predicted based solely on the past history of B or C]

Path Correlation
• In addition to PC + direction history, add path history
• PATH: history of the last N branch PCs
• In practice only a few bits are needed
• Even just one is sometimes good enough

Perceptron Implementation

Perceptron Contd.
• Xi = 1: taken; Xi = -1: not taken

Perceptron Contd.

Perceptron Predictor
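The perceptron predictor can be sketched along the lines of Jiménez and Lin's formulation: the prediction is the sign of w0 + Σ wi·xi over the ±1 history bits, and weights train on a mispredict or when the output magnitude is below a threshold. Table size, history length, and the threshold constant below are illustrative choices:

```python
# Perceptron direction predictor: x_i = +1 for taken, -1 for not taken.
class Perceptron:
    def __init__(self, hist_len=8, entries=64):
        self.w = [[0] * (hist_len + 1) for _ in range(entries)]  # w[0] = bias
        self.history = [1] * hist_len
        self.theta = int(1.93 * hist_len + 14)  # training threshold

    def _output(self, pc):
        w = self.w[pc % len(self.w)]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0  # sign of the dot product: True = taken

    def update(self, pc, taken):
        t = 1 if taken else -1
        y = self._output(pc)
        w = self.w[pc % len(self.w)]
        if (y >= 0) != taken or abs(y) <= self.theta:
            w[0] += t
            for i, xi in enumerate(self.history):
                w[i + 1] += t * xi  # reinforce history bits correlated with t
        self.history = self.history[1:] + [t]
```

Unlike a 2-bit counter, the weights expose which history bits actually correlate with the branch, which lets long histories be used cheaply (storage grows linearly, not exponentially, with history length).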

Predictor Latency
• Larger predictors are typically more accurate
• Problem: we are time-limited at the front-end
[Figure: GSHARE throughput]
Source: "The Impact of Delay on the Design of Branch Predictors", Daniel A. Jiménez, Stephen W. Keckler, Calvin Lin, MICRO-33

Overriding Branch Predictors
• Multiple predictors with different latencies
• Predictor A
  – Quick prediction; start fetching instructions
• Predictor B
  – Later, produces a better prediction
  – Triggers a mispredict if it disagrees with A

Overriding Branch Prediction

Prophet/Critic
• Prophet
  – Regular history-based predictor
  – Last time I saw "event sequence", the outcome was this
• Critic
  – Waits until the prophet generates a few predictions
  – Last time I saw "event sequence A" and the prophet predicted "event sequence B", this was the outcome
• If Critic(outcome) != Prophet(outcome)
  – Change the prediction
  – Early mispredict

Prophet/Critic Performance

The 21264 Predictor

Branch Target Buffer
• 2nd step: where this branch goes
• Recall, 1st step: taken vs. not taken
• Associate a target PC with the branch PC
• Target PC available earlier: derived using the branch's PC
  – No pipeline bubbles
• Example implementation?
  – Think of it as a cache
  – Index & tags: branch PC (instead of address)
  – Data: target PC
  – Could be combined with branch prediction

Branch Target Buffer - Considerations
• Careful: many more bits per entry than a branch prediction table
  – Size & associativity
• Store not-taken branches?
  – Pros and cons
  – Uniform, BUT costs wasted space
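Thought of as a cache, a direct-mapped BTB is only a few lines (entry count and the divide/modulo tag split are illustrative; a real BTB would also carry the prediction bits mentioned above):

```python
# Direct-mapped Branch Target Buffer: index and tag come from the branch
# PC, the data is the predicted target PC.
class BTB:
    def __init__(self, entries=256):
        self.entries = entries
        self.table = [None] * entries      # each valid entry: (tag, target)

    def lookup(self, pc):
        entry = self.table[pc % self.entries]
        if entry is not None and entry[0] == pc // self.entries:
            return entry[1]                # hit: predicted target PC
        return None                        # miss: fetch falls through

    def update(self, pc, target):
        self.table[pc % self.entries] = (pc // self.entries, target)
```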

Coupling the BTB and the Direction Predictor

Coupled Predictor Timing

Pentium Predictor
• 256 entries
• 4-way set-associative
• BTB coupled with bimodal
  – Stores taken branches only
  – Stores target address and a 2-bit counter

PowerPC 604 Predictor
• BTB and direction are decoupled
• BTB: 64 entries, fully associative
• Direction: 512-entry PHT (global history)

Branch Target Cache
• Note the difference in terminology
• Start with the branch PC and produce
  – Prediction
  – Target PC
  – Target instruction
• Example implementation?
  – Special cache
  – Index & tags: branch PC
  – Data: target PC, target inst. & prediction
• Facilitates "branch folding", i.e.,
  – Could send the target instruction instead of the branch
  – "Zero-cycle" branches
• Considerations: more bits, size & associativity

Pentium III Predictor
Source: Uri Weiser

Pentium M

Pentium M – Indirect Branches

A BTB-based Front-End

Next Line Prediction
• Alternative to a BTB
• Better performance: less area, faster, (but) less accurate
• A BTB stores the full target address
• NLP stores set and block-within-set pointers
  – Fewer bits; less storage needed
  – Can be wrong

Next Line Prediction vs. BTB
[Figure: at the same cost, NLP accuracy depends on the I-Cache size, while BTB accuracy does not]

Jump Prediction
• When?
  – Calls/returns
  – Direct jumps
  – Indirect jumps (e.g., switch stmt.)
• Calls/returns?
  – Well-established programming convention
  – Use a small hardware stack
  – Calls push a value on top
  – Returns use the top value
  – NOTE: this is a prediction mechanism; if it's wrong it only impacts performance, NOT correctness

Indirect Jump Prediction
• Not yet used in state-of-the-art processors
  – Why? (Very) infrequent
• BUT: becoming increasingly important
  – OO programming
• Possible solutions?
  – Last-outcome prediction
  – Pattern-based prediction
• Think of branch prediction as predicting 1-bit values
  – Now think how what we learned can be used to predict n-bit values
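The 1-bit-to-n-bit generalization the slide asks for, in its simplest form, is a last-target table for indirect jumps (sizes illustrative):

```python
# Last-outcome prediction generalized to n-bit values: predict that an
# indirect jump goes to the same target it went to last time.
class LastTarget:
    def __init__(self, entries=128):
        self.entries = entries
        self.table = {}                    # index -> last observed target PC

    def predict(self, pc):
        return self.table.get(pc % self.entries)  # None until first seen

    def update(self, pc, target):
        self.table[pc % self.entries] = target
```

A pattern-based variant would index with a history of recent targets rather than the PC alone, capturing indirect jumps that cycle through several targets.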

Call/Return
• Easy to detect call/return idioms
• Use a "hardware stack"
• Kaeli et al.
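The hardware stack for call/return idioms can be sketched like this (a fixed depth with the oldest entry dropped on overflow is one common policy; depth and instruction size are illustrative):

```python
# Return-address stack: calls push the fall-through PC, returns predict
# by popping the top. A wrong prediction only costs performance.
class ReturnAddressStack:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, pc, inst_size=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: discard the oldest entry
        self.stack.append(pc + inst_size)  # return address = fall-through PC

    def predict_return(self):
        return self.stack.pop() if self.stack else None
```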

TAGE

State-of-the-Art Direction Prediction: L-TAGE

Confidence Estimation
• Guess whether the predictor is accurate or not
• Can be useful:
  – Throttle execution (power)
  – Execute something else (multithreading)
• Implementation:
  – Array of local histories
  – Count the number of mispredicts

Confidence Estimation