A Scalable Front-End Architecture for Fast Instruction Delivery
Paper by: Glenn Reinman, Todd Austin and Brad Calder
Presenter: Alexander Choong
Conventional Pipeline Architecture
- High-performance processors can be broken down into two parts:
  - Front-end: fetches and decodes instructions
  - Execution core: executes instructions
Front-End and Pipeline
[Diagram: a simple front-end, with fetch and decode stages alternating across pipeline cycles]
Front-End with Prediction
[Diagram: the simple front-end extended with a predict stage in each cycle]
Front-End Issues I
- Flynn's bottleneck: IPC is bounded by the number of instructions fetched per cycle
- Implication: as execution performance increases, the front-end must keep up to ensure overall performance
Front-End Issues II
- Two opposing forces:
  - Designing a faster front-end: increase I-cache size
  - Interconnect scaling problem (wire performance does not scale with feature size): decrease I-cache size
Key Contributions I
Key Contributions: Fetch Target Queue
- Objective: avoid using a large cache with branch prediction
- Purpose: decouple the I-cache from branch prediction
- Result: improves throughput
Key Contributions: Fetch Target Buffer
- Objective: avoid large caches with branch prediction
- Implementation: a multi-level buffer
- Results:
  - Delivers performance 25% better than a single-level design
  - Scales better with "future" feature sizes
Outline
- Scalable Front-End and Components
  - Fetch Target Queue
  - Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
Fetch Target Queue
- Decouples the I-cache from branch prediction
  - Branch predictor can generate predictions independent of when the I-cache uses them
[Diagrams: simple front-end vs. front-end with the FTQ between predict and fetch]
Fetch Target Queue
- Fetch and predict can have different latencies, as long as they have the same throughput
- Allows the I-cache to be pipelined
Fetch Blocks
- The FTQ stores fetch blocks: sequences of instructions
  - Starting at a branch target
  - Ending at a strongly biased branch
- Instructions are fed directly into the pipeline (see the sketch below)
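To make the decoupling concrete, here is a minimal C sketch of an FTQ as a ring buffer of fetch-block descriptors. All names, field widths, and the queue depth are hypothetical; the real FTQ also carries prediction state used for misprediction recovery.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fetch-block descriptor: the FTQ holds predicted
 * block boundaries (not instructions). */
typedef struct {
    uint32_t start_pc;   /* branch target that starts the block          */
    uint32_t fallthru;   /* address just past the block's ending branch  */
} FetchBlock;

#define FTQ_DEPTH 8      /* assumed depth */

typedef struct {
    FetchBlock entries[FTQ_DEPTH];
    int head, tail, count;
} FTQ;

/* Predictor side: enqueue a predicted fetch block if there is room. */
static bool ftq_push(FTQ *q, FetchBlock fb) {
    if (q->count == FTQ_DEPTH) return false;  /* full: predictor stalls */
    q->entries[q->tail] = fb;
    q->tail = (q->tail + 1) % FTQ_DEPTH;
    q->count++;
    return true;
}

/* I-cache side: dequeue the next block whenever fetch is ready,
 * independent of when the predictor produced it. */
static bool ftq_pop(FTQ *q, FetchBlock *out) {
    if (q->count == 0) return false;          /* empty: fetch stalls */
    *out = q->entries[q->head];
    q->head = (q->head + 1) % FTQ_DEPTH;
    q->count--;
    return true;
}
```

Because push and pop touch opposite ends of the queue, the predictor can run ahead whenever fetch stalls, and fetch can drain buffered blocks whenever the predictor stalls.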
Outline
- Scalable Front-End and Components
  - Fetch Target Queue
  - Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
Fetch Target Buffer: Outline
- Review: Branch Target Buffer
- Fetch Blocks
- Functionality
Review: Branch Target Buffer I
- Previous work (Perleberg and Smith [2])
- Makes fetch independent of predict
[Diagrams: simple front-end vs. front-end with a Branch Target Buffer]
Review: Branch Target Buffer II
- Characteristics:
  - Hash table
  - Makes predictions
  - Caches prediction information
Review: Branch Target Buffer III

Index/Tag (PC) | Branch Prediction | Predicted Branch Target | Fall-Through Address | Instructions at Branch
0x1718         | Taken             | 0x1834                  | 0x1788               | add, sub
0x1734         | Taken             | 0x2088                  | 0x1764               | neq, br
0x1154         | Not taken         | 0x1364                  | 0x1200               | ld, store
...            | ...               | ...                     | ...                  | ...
FTB Optimizations over BTB
- Multi-level design
  - Resolves a conundrum: need a small cache, but also enough space to successfully predict branches
FTB Optimizations over BTB
- Oversize bit
  - Indicates whether a block is larger than a cache line
  - With a multi-port cache, allows several smaller blocks to be loaded at the same time
FTB Optimizations over BTB
- Stores only a partial fall-through address
  - The fall-through address is close to the current PC
  - Only an offset needs to be stored (see the sketch below)
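An illustrative encode/decode of the partial fall-through address, assuming fixed 4-byte instructions; the field width matches the 4-5 bits reported in the results, but the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FALLTHRU_BITS 5   /* per the results: 4-5 bits suffice */

/* Store the fall-through point as a small instruction-count offset
 * from the start of the fetch block, not as a full address. */
static uint32_t encode_fallthru(uint32_t block_pc, uint32_t fallthru_pc) {
    uint32_t offset = (fallthru_pc - block_pc) >> 2;  /* 4-byte instructions */
    assert(offset < (1u << FALLTHRU_BITS));           /* must fit the field */
    return offset;
}

/* Reconstruct the full fall-through address from the current PC. */
static uint32_t decode_fallthru(uint32_t block_pc, uint32_t offset) {
    return block_pc + (offset << 2);
}
```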
FTB Optimizations over BTB
- Does not store every block:
  - Fall-through blocks
  - Blocks that are seldom taken
Fetch Target Buffer
- Entry fields (see the sketch below):
  - Target: target of the branch
  - Type: conditional, subroutine call/return
  - Oversize: set if block size > cache line
  - Next PC
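Collecting the fields above into one illustrative C struct; the widths and the tag field are assumptions, not taken from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { BR_CONDITIONAL, BR_CALL, BR_RETURN } BranchType;

typedef struct {
    uint32_t   tag;           /* identifies the fetch block's start PC  */
    uint32_t   target;        /* predicted target of the ending branch  */
    uint32_t   fallthru : 5;  /* partial fall-through offset            */
    BranchType type;          /* conditional, subroutine call/return    */
    bool       oversize;      /* set if block size > cache line         */
} FTBEntry;  /* target or fall-through supplies the next PC */
```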
Fetch Target Buffer
[Diagram: FTB lookup walkthrough, sketched in C below]
- The PC is used as an index into the FTB
- L1 hit, branch predicted not taken: fetch continues at the fall-through address
- L1 hit, branch predicted taken: fetch continues at the predicted target
- L1 miss, L2 hit: fall through; the entry arrives after an N-cycle delay
- L1 and L2 miss: fall through; eventually mispredicts
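A sketch of the two-level lookup above. Table sizes are illustrative, the entry is a simplified variant of the one sketched earlier, and the N-cycle L2 delay is noted in a comment rather than modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define L1_ENTRIES 64
#define L2_ENTRIES 4096

typedef struct {
    bool     valid;
    uint32_t tag;        /* full PC used as tag for simplicity */
    uint32_t target;     /* predicted taken target             */
    uint32_t fallthru;   /* reconstructed fall-through address */
    bool     taken;      /* direction prediction for the block */
} Entry;

static Entry l1[L1_ENTRIES], l2[L2_ENTRIES];

/* Returns the predicted start of the next fetch block. */
static uint32_t ftb_next_pc(uint32_t pc) {
    Entry *e = &l1[(pc >> 2) % L1_ENTRIES];
    if (e->valid && e->tag == pc)             /* L1 hit */
        return e->taken ? e->target : e->fallthru;

    Entry *e2 = &l2[(pc >> 2) % L2_ENTRIES];
    if (e2->valid && e2->tag == pc)           /* L1 miss, L2 hit */
        l1[(pc >> 2) % L1_ENTRIES] = *e2;     /* fill L1 (after N-cycle delay) */

    /* Either way, fetch falls through for now (simplified here to the
     * next sequential instruction); on an L2 miss this guess stands
     * and, if wrong, eventually mispredicts. */
    return pc + 4;
}
```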
Hybrid Branch Prediction
- A meta-predictor selects between (see the sketch below):
  - Local history predictor
  - Global history predictor
  - Bimodal predictor
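A skeleton of the meta-prediction step. The component predictors are stubs, and the chooser encoding (which counter values select which component) is an assumption for illustration, not the paper's scheme.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stub component predictors; real ones would consult their tables. */
static bool predict_local(uint32_t pc)   { (void)pc; return true;  }
static bool predict_global(uint32_t pc)  { (void)pc; return false; }
static bool predict_bimodal(uint32_t pc) { (void)pc; return true;  }

static uint8_t meta[4096];  /* 2-bit chooser counters, indexed by PC */

static bool hybrid_predict(uint32_t pc) {
    uint8_t c = meta[(pc >> 2) % 4096];
    if (c == 0) return predict_bimodal(pc);   /* assumed mapping */
    if (c <= 2) return predict_local(pc);
    return predict_global(pc);
}
```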
Branch Prediction
[Diagram: meta-predictor selecting among the bimodal predictor, local predictor with local history, and global predictor]
Committing Results
- When full, the speculative history queue (SHQ) commits its oldest value to the local or global history
Outline
- Scalable Front-End and Components
  - Fetch Target Queue
  - Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
Experimental Methodology I
- Baseline Architecture
  - Processor
    - 8-instruction fetch with 16-instruction issue per cycle
    - 128-entry reorder buffer with 32-entry load/store buffer
    - 8-cycle minimum branch misprediction penalty
  - Cache
    - 64K 2-way instruction cache
    - 64K 4-way data cache (pipelined)
Experimental Methodology II
- Timing Model
  - CACTI cache compiler
    - Models on-chip memory
    - Modified for 0.35um, 0.18um and 0.10um processes
- Test set
  - 6 SPEC95 benchmarks
  - 2 C++ programs
Outline
- Scalable Front-End and Components
  - Fetch Target Queue
  - Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
Comparing FTB to BTB
- FTB provides slightly better performance
- Tested for various cache sizes: 64, 256, 1K, 4K and 8K entries
[Chart: performance comparison; higher is better]
Comparing Multi-Level FTB to Single-Level FTB
- Two-level FTB performance
  - Smaller fetch size
    - Two-level average size: 6.6
    - Single-level average size: 7.5
  - Higher accuracy on average
    - Two-level: 83.3%
    - Single-level: 73.1%
  - Higher performance
    - 25% average speedup over single-level
Fall-Through Bits Used
- Number of fall-through bits: 4-5
  - Fetch distances beyond 16 instructions do not improve performance
[Chart: higher is better]
FTQ Occupancy
- Roughly indicates throughput
- On average, the FTQ is:
  - Empty: 21.1% of the time
  - Full: 10.7% of the time
[Chart: higher is better]
Scalability
- Two-level FTB scales well with feature size
[Chart: a higher slope is better]
Outline
- Scalable Front-End and Components
  - Fetch Target Queue
  - Fetch Target Buffer
- Experimental Methodology
- Results
- Analysis and Conclusion
Analysis
- 25% improvement in IPC over the best-performing single-level designs
- System scales well with feature size
- On average, the FTQ is empty only 21.1% of the time
- The FTB design requires at most 5 bits for the fall-through address
Conclusion
- FTQ and FTB design
  - Decouples the I-cache from branch prediction, producing higher throughput
  - Uses a multi-level buffer, producing better scalability
References
[1] Glenn Reinman, Todd Austin, and Brad Calder. "A Scalable Front-End Architecture for Fast Instruction Delivery." 26th Annual International Symposium on Computer Architecture (ISCA), May 1999.
[2] Chris Perleberg and Alan Smith. "Branch Target Buffer: Design and Optimization." Technical Report, December 1989.
Thank you! Questions?