- Number of slides: 24
An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory Jie Cai and Peter Strazdins Research School of Computer Science The Australian National University ICPP 2012 Pittsburgh, PA, USA
Outline • Introduction • Background • Related Work on Existing Prefetch Techniques • Stride-augmented Run-length Encoding Method (sRLE) • Dynamic Region-based Prefetch Technique • Evaluation Results • Conclusion ICPP 2012 @ Pittsburgh, PA
Introduction • Software Distributed Shared Memory (sDSM) systems provide programming environments that enable the use of shared-memory programming models such as OpenMP on clusters. • sDSM systems inherit the good programmability of shared memory programming models. • Removing explicit control of data exchange from the programmer • However, sDSM suffers from significant system overheads. • Prefetch techniques, which fit well with lazy release consistency (LRC), can be used to improve performance. • Prefetch techniques for sDSM face two major challenges: • Applications’ dynamic memory access patterns • Page misses caused by non-global synchronization operations
Introduction (Cont.) • In this talk, we address the challenges of prefetch techniques for sDSM systems • Reconstruct page miss records using the stride-augmented run-length encoding (sRLE) method • Designed a dynamic region-based prefetch (DReP) technique based on sRLE’d records to predict and issue prefetches. • Implemented it in the only commercialized sDSM system, Intel Cluster OpenMP (CLOMP) • DReP and sRLE with CLOMP are evaluated using the NPB-OMP benchmark suite, LINPACK, and a memory consistency cost micro-benchmark (MCBENCH)
Background (1) • Fork-join type shared memory programming models: • Regions are separated by global synchronizations, e.g. implicit and explicit barriers; • Region-executions are the multiple executions of the same region when that region is enclosed in a loop.
Background (2) • sDSM memory consistency model • Each process has a local view of the shared pages • The shared pages are kept consistent via mprotect (please refer to the page state machine for details).
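As a rough illustration of how mprotect keeps shared pages consistent, the sketch below models a hypothetical three-state page under an invalidate-based protocol. The state names and transitions are assumptions for illustration only; CLOMP's actual page state machine has more states (including the prefetch states introduced later in this talk).

```python
# Hypothetical three-state page model for an mprotect-driven sDSM.
# State names and transitions are illustrative, not CLOMP's actual machine.
INVALID, READ_ONLY, READ_WRITE = "invalid", "read_only", "read_write"

def on_fault(state, access):
    """Transition taken when the segfault handler catches a page fault."""
    if state == INVALID:
        # Fetch the up-to-date page (or diffs) from other processes,
        # then mprotect(addr, PAGE_SIZE, PROT_READ).
        return READ_ONLY
    if state == READ_ONLY and access == "write":
        # Make a twin copy for later diffing, then
        # mprotect(addr, PAGE_SIZE, PROT_READ | PROT_WRITE).
        return READ_WRITE
    return state
```

Each transition is where the memory consistency cost of the next slide is paid: a fault on an invalid page triggers communication, and a write fault triggers local twinning.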
Background (3) • sDSM memory consistency costs • The major sDSM system overhead is the memory consistency cost. • MCBENCH is an in-house microbenchmark that measures this cost for different OpenMP implementations, including cluster-enabled OpenMPs.
Related Work • Dynamic Aggregation (C. Amza et al. 1997) • Simple assumption of temporal paging behaviour before and after a barrier. • B+ and Adaptive++ (R. Bianchini et al. 1996 & 1998) • B+: simple assumption of temporal paging behaviour before and after a barrier. • Adaptive++: assumes page misses that occurred before a barrier, or even before the previous barrier, will occur again after the barrier. • Third order differential finite context method (TODFCM) (E. Speight et al. 2002) • Generic technique that prefetches a page when its three preceding consecutive misses have occurred before.
Related Work (Cont.) • Temporal region-based prefetch (TReP) technique (J. Cai et al. 2010) • Deployed the idea of regions and region-executions • Assumes page misses in the previous region-execution will occur in the current region-execution • Considered temporal paging behaviour between consecutive region-executions • Hybrid region-based prefetch (HReP) technique (J. Cai et al. 2010) • Deployed the idea of regions and region-executions • Combined TReP and Adaptive++ • Addressed temporal paging behaviour between consecutive region-executions and spatial paging behaviour within a region-execution.
sRLE Method -- Observation • LINPACK dynamic page access pattern with 4 processes • Corresponding dynamic page miss pattern
sRLE Method • Step (a): group sub-lists with a common stride; • Step (b): encode the sub-lists into the first-level format: • (start page, stride, run length) • Step (c): group consecutive encoded sub-lists with a common stride into the second-level encoding format: • (first-level encoded record, stride, run length) • An ordinary page fault list can be converted into 2D fault regions with sRLE.
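Steps (a)-(c) can be sketched in Python as follows. This is an illustrative reimplementation of the two-level encoding with invented function names, not CLOMP's code.

```python
def rle1(pages):
    """First-level sRLE: steps (a)+(b). Group runs of consecutive page IDs
    with a common stride into (start_page, stride, run_length) triples."""
    entries = []
    i = 0
    while i < len(pages):
        start = pages[i]
        if i + 1 < len(pages):
            stride = pages[i + 1] - pages[i]
            run = 2
            while i + run < len(pages) and pages[i + run] - pages[i + run - 1] == stride:
                run += 1
        else:
            stride, run = 0, 1  # a lone trailing page
        entries.append((start, stride, run))
        i += run
    return entries

def rle2(entries):
    """Second-level sRLE: step (c). Group consecutive first-level entries
    that share stride and run length, and whose start pages themselves
    advance by a common stride, into (first_entry, stride, run_length)."""
    out = []
    i = 0
    while i < len(entries):
        first = entries[i]
        if i + 1 < len(entries) and entries[i + 1][1:] == first[1:]:
            stride2 = entries[i + 1][0] - first[0]
            run2 = 2
            while (i + run2 < len(entries)
                   and entries[i + run2][1:] == first[1:]
                   and entries[i + run2][0] - entries[i + run2 - 1][0] == stride2):
                run2 += 1
        else:
            stride2, run2 = 0, 1
        out.append((first, stride2, run2))
        i += run2
    return out
```

For the miss list [10, 11, 12, 13, 20, 21, 22, 23, 30, 31, 32, 33], the first level yields [(10, 1, 4), (20, 1, 4), (30, 1, 4)], and the second level collapses these into the single 2D fault region ((10, 1, 4), 10, 3).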
DReP Technique Designs • All page fault records (per region) have been encoded twice with the sRLE method. • Each record contains a list of second-level encoded entries.
DReP Technique Designs (cont.) • Beginning of a region-execution: at the beginning of each region-execution, DReP predicts and prefetches pages. • If the region has not previously been executed twice, no prefetch is issued. • Otherwise, every pair of entries between the two most recent records is compared, and prefetches are issued ONLY for the following three cases: • Case 1: prefetch the entry if it is common to both lists. • Case 2: when strides and run lengths are common to both lists, predict a start page and prefetch with the common strides and run lengths: pred.l1_en_col.start_page = p_list.l1_en_col.start_page + (p_list.l1_en_col.start_page − bp_list.l1_en_col.start_page) • Case 3: when strides are common and run lengths are highly similar between the lists, predict a start page and run lengths, then prefetch with the common strides: pred.l1_en_col.run_len = p_list.l1_en_col.run_len + (p_list.l1_en_col.run_len − bp_list.l1_en_col.run_len); pred.run_len = p_list.run_len + (p_list.run_len − bp_list.run_len)
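The three prediction cases can be sketched as follows, pairing entries of the previous (p) and before-previous (bp) records by position; the linear extrapolation mirrors the formulas above. An illustrative sketch with assumed names and entry pairing, not the CLOMP implementation.

```python
def predict(prev, before_prev):
    """DReP-style prediction sketch at the start of a region-execution.
    prev / before_prev: second-level sRLE records from the two most recent
    executions of this region, entries of the form
    ((start_page, stride1, run1), stride2, run2)."""
    prefetch = []
    for p, bp in zip(prev, before_prev):
        (p_start, p_str1, p_run1), p_str2, p_run2 = p
        (bp_start, bp_str1, bp_run1), bp_str2, bp_run2 = bp
        if p == bp:
            # Case 1: entry is common to both lists -> repeat it verbatim.
            prefetch.append(p)
        elif (p_str1, p_run1, p_str2, p_run2) == (bp_str1, bp_run1, bp_str2, bp_run2):
            # Case 2: strides and run lengths match -> linearly
            # extrapolate the start page.
            pred_start = p_start + (p_start - bp_start)
            prefetch.append(((pred_start, p_str1, p_run1), p_str2, p_run2))
        elif (p_str1, p_str2) == (bp_str1, bp_str2):
            # Case 3: strides match, run lengths similar -> extrapolate
            # the start page and both run lengths.
            pred_start = p_start + (p_start - bp_start)
            pred_run1 = p_run1 + (p_run1 - bp_run1)
            pred_run2 = p_run2 + (p_run2 - bp_run2)
            prefetch.append(((pred_start, p_str1, pred_run1), p_str2, pred_run2))
        # Otherwise: no prefetch is issued for this pair.
    return prefetch
```

For example, if a region faulted on region ((10,1,4),10,3) two executions ago and ((20,1,4),10,3) last time, Case 2 extrapolates the next start page and prefetches ((30,1,4),10,3).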
DReP Implementation • DReP has been implemented in the Intel Cluster OpenMP runtime. • New region-notification user interface: • KMP_USER_NOTIFY_NEW_REGION(1): 1 indicates this is a parallel region • KMP_USER_NOTIFY_NEW_REGION(0): 0 indicates this is a sequential region • Flush filtering solves the problem that a single page can be missed multiple times within one region-execution, by removing duplicated records. • Enlarged the message header of the communication layer to accommodate 128 page IDs, to leverage network bandwidth. • Each process first communicates with its right neighbour, which avoids network congestion.
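Flush filtering amounts to order-preserving de-duplication of the per-region fault record. A minimal sketch (function name is an assumption):

```python
def flush_filter(fault_record):
    """Drop duplicate page IDs recorded within one region-execution,
    keeping only the first occurrence of each page, in fault order."""
    seen = set()
    filtered = []
    for page in fault_record:
        if page not in seen:
            seen.add(page)
            filtered.append(page)
    return filtered
```

Filtering before sRLE encoding keeps a page that faulted several times in one region-execution from inflating the encoded record.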
DReP Implementation (Cont.) • DReP has been implemented in the Intel Cluster OpenMP runtime. • The page state machine has been updated with two newly introduced page states: • Prefetched_diff • Prefetched_page
Evaluation • Experimental setup • Software and benchmarks • NPB-OMP suite • LINPACK OpenMP implementation (n=8196, nb=64) • MCBENCH (a = 4 MB, c = 4 B and 4 KB) • Hardware platform • 8-node Intel cluster • Each node consists of 2 Intel E5472 3.0 GHz CPUs • 16 GB memory • Gigabit Ethernet • DDR InfiniBand
Efficiency and Coverage • Nf: total number of page faults • Np: number of prefetches • Nu: number of useful prefetches, Nu = Nf*C • C = Nu/Nf, coverage • E = Nu/Np, efficiency • Bold font represents best results
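The two metrics follow directly from the counts; for instance, 400 useful prefetches out of 500 issued, against 1000 page faults, gives a coverage of 0.4 and an efficiency of 0.8.

```python
def coverage_and_efficiency(n_faults, n_prefetches, n_useful):
    """Coverage C = Nu/Nf (fraction of faults avoided) and
    efficiency E = Nu/Np (fraction of prefetches that were useful)."""
    return n_useful / n_faults, n_useful / n_prefetches
```

High coverage with low efficiency means many wasted prefetches; high efficiency with low coverage means most faults are still paid for.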
Efficiency and Coverage (Cont.) • MCBENCH: DReP vs TReP and HReP • c = 4 B: extreme false sharing • c = 4 KB: no false sharing • Bold font represents best results
Memory Consistency Cost • Measured using MCBENCH, a = 4 MB, c = 4 B and 4 KB • c = 4 B: extreme false sharing (reduced cost by ~86%) • c = 4 KB: no false sharing
Memory Consistency Cost (Cont.) • LINPACK OpenMP implementation with n=8196 and nb=64 • DReP is represented as a reduction rate relative to the original CLOMP implementation, i.e. (Orig − DReP)/Orig.
Memory Consistency Cost (Cont.) • NPB-OMP • Rates are represented as an average over each class from A to C.
Overhead Analysis of DReP • NPB-OMP IS.C • Tsegv: total memory consistency cost in seconds for original CLOMP and DReP-enabled CLOMP. • TMK Comm (% of Tsegv): communication time spent in the DSM layer of CLOMP (TMK) • TMK local (% of Tsegv): the local software overhead of the TMK layer • DReP Comm (% of Tsegv): communication cost of data prefetching • DReP local (% of Tsegv): the local software cost introduced by DReP • Communication costs are further broken down into costs for transferring diffs and pages.
Conclusions • With the assistance of sRLE, DReP accurately analyses the paging behaviour of applications exhibiting both static and dynamic memory access patterns, such as NPB-OMP and LINPACK. • Averaged over NPB and LINPACK, DReP improves efficiency by 34% and coverage by 47% over existing prefetch techniques; in detail: • 55% and 5% better efficiency compared to Adaptive++ and TODFCM; 55% and 44% better coverage compared to Adaptive++ and TODFCM • 47% and 30% better efficiency compared to TReP and HReP; and 56% and 34% better coverage compared to TReP and HReP. • DReP reduces memory consistency cost dramatically: by 86% in the false-sharing scenario, and by ~45% and ~38% for LINPACK and NPB on GigE and IB respectively. • A detailed breakdown analysis showed a ~2% overhead introduced by DReP.
Acknowledgement • Australian Research Council Grant LP0669726 • ANU CECS Faculty Research Grant • Intel Corp. • Sun Microsystems • NCI National Facility / ANU Supercomputer Facility