Mostly Concurrent Compaction for Mark-Sweep GC
Yoav Ossia, Ori Ben-Yitzhak, Marc Segal
IBM Haifa Research Lab, Israel
ISMM 2004

Prologue: Commercial Multi-tier Applications
- Clients (or load injectors)
  - Sending requests to the server
- Web server: using the application
- Application and database
- Transaction: a (set of) request cycle(s)
- Performance requirements
  - Throughput: transactions per second
  - Average transaction response time
  - At restricted CPU utilization (e.g., below 50%)
[Diagram: Client, Server, Application, DB]

Prologue: Observations
- GC share is negligible (every 20 sec.) in all examples
  1. Long compaction occurs
  2. Switch from 500 ms Stop-The-World (STW) GC to 250 ms mostly concurrent GC
[Two plots: average response time over time, with events 1 and 2 marked]

Prologue: Insights
- Average response time overreacts
  - To shorter GC pause time
  - To occasional compaction
Why?
- Longer GC pause times create a queue of transactions
  - The queue persists long after the GC
- Transaction timeouts create additional work
- Conclusion: “some” pause time is acceptable, but extras should be avoided

Prologue: The Physician Analogy
- The receptionist handles an incoming patient in 5 minutes, the physician in 10 minutes. Appointments are scheduled every 10 minutes
- An appointment lasts 15 minutes :=)
- But if the receptionist takes a long break…
- When he returns, appointments last ~50 minutes :=(
- Only after a while, with hard work (of both receptionist and physician), QoS may be restored
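
The arithmetic behind the analogy can be checked with a small two-stage queue simulation. This is only a sketch: the 35-minute break starting at minute 20 is an assumed value (the slide just says "a long break"), and `simulate` is a hypothetical helper, not part of the talk.

```python
def simulate(num_patients, break_start=None, break_len=0,
             arrival_gap=10, reception=5, physician=10):
    """Return, per patient, the minutes from arrival to the end of the physician visit."""
    recep_free = 0   # time at which the receptionist is next available
    phys_free = 0    # time at which the physician is next available
    durations = []
    for i in range(num_patients):
        arrival = i * arrival_gap
        start = max(arrival, recep_free)
        # Assumed model: if the receptionist is on a break at this moment,
        # the patient waits until the break ends.
        if break_start is not None and break_start <= start < break_start + break_len:
            start = break_start + break_len
        recep_free = start + reception
        phys_start = max(recep_free, phys_free)
        phys_free = phys_start + physician
        durations.append(phys_free - arrival)
    return durations


print(simulate(8))                                # no break
print(simulate(8, break_start=20, break_len=35))  # one long break
```

In this model the physician is fully loaded (one 10-minute visit per 10-minute slot), so the delay introduced by the break never drains on its own. That matches the last bullet: the queue only shrinks if someone works faster than the normal rate.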

Outline
- Prologue: commercial applications
- Mark Sweep (and Compact) GC
- Mostly Concurrent Compaction
  - Overview
  - The generic algorithm
  - Our implementation
- Results
- Related work, conclusions and future directions

Mark-Sweep (and Compact) GC
- Used by many modern memory management systems
  - Either for the entire heap, or for parts (e.g., the old objects area of a generational GC)
  - Good performance on large server heaps
  - Usually activated by an allocation request, when the heap is full
- Mark: tags all objects that are reachable from the roots
- Sweep: reclaims unmarked objects into a list of free chunks
- Result may be unsatisfactory (fragmentation)
- Compact: packs together all live objects, creating a large free chunk
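
A toy illustration of the mark and sweep steps, and of why the result can be fragmented. The heap model (object ids, a reference map, and an address-ordered layout) is an assumption made for this sketch and is not how a real JVM represents objects.

```python
def mark(heap, roots):
    """Return the set of object ids reachable from the roots.

    heap maps an object id to the list of object ids it references.
    """
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(heap[obj])
    return marked


def sweep(layout, marked):
    """Return free chunks as (address, size), merging adjacent dead objects.

    layout is a list of (object id, address, size) in address order.
    """
    free = []
    for obj, addr, size in layout:
        if obj in marked:
            continue
        if free and free[-1][0] + free[-1][1] == addr:
            free[-1] = (free[-1][0], free[-1][1] + size)  # extend previous chunk
        else:
            free.append((addr, size))
    return free


heap = {"a": ["b"], "b": [], "c": ["a"], "d": [], "e": []}
layout = [("a", 0, 16), ("c", 16, 8), ("b", 24, 16), ("d", 40, 8), ("e", 48, 8)]
live = mark(heap, roots=["a"])
print(sorted(live), sweep(layout, live))
```

In this example 24 bytes are free in total, but the largest single chunk is 16 bytes, so a 24-byte allocation would still fail without compaction.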

Characteristics of Compaction
- Includes two activities
  - Move of live objects
  - Fix-up of all references (in objects and roots) to the new locations
- Advantages
  - Eliminates fragmentation and enables (better, faster) allocation
  - Better cache locality
- Disadvantages
  - Very expensive. Typically takes much more time than Mark-Sweep
  - Done in Stop-The-World (STW) mode
  - Severe impact on pause time
- Avoided as much as possible, but is occasionally inevitable
- Compaction is the weak point of Mark Sweep GC (pause time)
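
The two activities can be sketched on a toy heap. This is a minimal sliding compaction over assumed data structures (objects as dicts with `addr`, `size` and `refs`); it is meant to show why fix-up must touch every reference, not to mirror any particular collector.

```python
def move(objects):
    """Slide live objects toward address 0; return a map of old to new address."""
    forwarding = {}
    next_free = 0
    for obj in sorted(objects, key=lambda o: o["addr"]):
        forwarding[obj["addr"]] = next_free
        obj["addr"] = next_free
        next_free += obj["size"]
    return forwarding


def fix_up(objects, roots, forwarding):
    """Rewrite every reference, in objects and in roots, to the new addresses."""
    for obj in objects:
        obj["refs"] = [forwarding[r] for r in obj["refs"]]
    return [forwarding[r] for r in roots]


objects = [
    {"addr": 0, "size": 16, "refs": [24]},
    {"addr": 24, "size": 16, "refs": []},
    {"addr": 48, "size": 8, "refs": [0]},
]
forwarding = move(objects)
roots = fix_up(objects, roots=[48], forwarding=forwarding)
print(forwarding, roots)
```

Until `fix_up` has run, any reference may still hold a stale address, which is why classic compaction keeps the application stopped for both steps.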

Mostly Concurrent Compaction - Overview
- Our goal: restrict the effect of compaction on pause time
  - Typically to less than mark time
  - For average response time QoS, critical code (e.g., heartbeat)
- Method: partial Move in STW, concurrent Fix-up
- Reduce the pause time of the Move phase by using incremental compaction
  - Select the compacted part according to sweep results
  - To optimize compaction impact and control the pause time effect
- Execute the fix-up phase after the move, when application threads are resumed
  - Correctness is preserved by page-protecting the unfixed objects from application thread access

Assumptions About the Environment
- Memory management module
  - Uses Mark Sweep GC
  - Has a move operation: able to pack objects in the heap
  - Supplies fix-up logic: knows the new location of an object by its original address
- Operating system services
  - Map2: maps physical memory into two virtual address ranges, or views
  - ProtN: protects a virtual address range of pages from read and write access
  - Unprot: removes the protection from the specified page(s)
  - Execute a Trap routine upon page access violation
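
The Map2 service can be approximated from user code by mapping the same backing file twice. The sketch below uses Python's standard `mmap` module and a temporary file as the backing store; `map_two_views` is a hypothetical helper name. ProtN, Unprot and the trap routine are not shown, since the standard `mmap` module does not expose per-page protection changes; on POSIX systems they would typically correspond to `mprotect` and a `SIGSEGV` handler.

```python
import mmap
import tempfile


def map_two_views(size):
    """Map the same backing memory twice; return (backing file, view A, view B)."""
    backing = tempfile.TemporaryFile()
    backing.truncate(size)                      # reserve `size` zero bytes
    view_a = mmap.mmap(backing.fileno(), size)  # shared, read/write by default
    view_b = mmap.mmap(backing.fileno(), size)
    return backing, view_a, view_b


backing, app_view, fixup_view = map_two_views(mmap.PAGESIZE)
fixup_view[0:4] = b"\x01\x02\x03\x04"   # write through the fix-up view
print(app_view[0:4])                     # read the same bytes through the application view
```

With two views, the application view can be protected page by page while the collector keeps reading and writing the same memory through the fix-up view.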

The Generic Algorithm - Details
- At application initialization
  - Use Map2 to create the application view and the fix-up view of the heap
- Calculating the areas to compact
  - Motivation: optimal quality at restricted move time
  - Heap is divided into small sections (e.g., 100 sections)
  - Gather object layout information during sweep
    - Per section: free space, number of small free chunks, etc.
  - Select the optimal set of sections for compaction
  - Using a configurable policy/heuristic
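
One way to picture area selection with a pluggable policy is a greedy pick under a move budget. Everything here is an assumption made for illustration: the `Section` fields, the scoring functions, the budget expressed in bytes to move, and the greedy strategy itself (which is a simple approximation, not an optimal selection and not necessarily what the authors implemented).

```python
from dataclasses import dataclass


@dataclass
class Section:
    index: int
    live_bytes: int     # bytes that would have to be moved
    free_bytes: int     # free space found by sweep
    small_chunks: int   # number of small free chunks found by sweep


def select_sections(sections, move_budget, score):
    """Pick sections with the best score per moved byte, within move_budget."""
    ranked = sorted(sections,
                    key=lambda s: score(s) / max(s.live_bytes, 1),
                    reverse=True)
    chosen, moved = [], 0
    for s in ranked:
        if moved + s.live_bytes <= move_budget:
            chosen.append(s.index)
            moved += s.live_bytes
    return sorted(chosen)


def bigger_free_chunks(s):
    """Hypothetical policy: favor sections that release the most free space."""
    return s.free_bytes


def fewer_small_chunks(s):
    """Hypothetical policy: favor sections with many small free chunks."""
    return s.small_chunks


sections = [
    Section(0, live_bytes=900, free_bytes=100, small_chunks=2),
    Section(1, live_bytes=300, free_bytes=700, small_chunks=5),
    Section(2, live_bytes=500, free_bytes=500, small_chunks=40),
    Section(3, live_bytes=100, free_bytes=900, small_chunks=1),
]
print(select_sections(sections, 600, bigger_free_chunks))
print(select_sections(sections, 600, fewer_small_chunks))
```

The two policies choose different sections from the same sweep data, which is the reason for keeping the policy configurable rather than hard-coded.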

The Generic Algorithm - Details (cont.)
- Move phase
  - Objects are compacted within the selected areas
- Fix-up of root references
- Prepare the heap pages
  - Page-protect all heap pages that contain objects
  - Reset the state of all pages that contain objects to Unfixed
  - The rest of the (“free”) pages are set to Fixed
- Resume execution of application threads

The Generic Algorithm - Concurrent Fix-up (method)
- Constraints
  - All Unfixed pages are fixed, and only once
  - A page starts as Unfixed (and protected), then becomes Busy, and finally Fixed (and unprotected)
  - Application threads access only Fixed pages
- Fix-up of a page (Exclusive Fix)
  - Done only by a thread that managed to change the page’s state from Unfixed to Busy
  - All the (protected) page’s references are fixed. The page is accessed through the (unprotected) fix-up view
  - Protection is lifted
  - Page state is set to Fixed
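
The Unfixed, Busy, Fixed state machine can be sketched as follows. A per-page lock stands in for the atomic compare-and-swap a real collector would use, and the `protected` flag stands in for real page protection; both are assumptions of this sketch, as are the `Page` class and the `forwarding` dictionary.

```python
import threading

UNFIXED, BUSY, FIXED = "Unfixed", "Busy", "Fixed"


class Page:
    def __init__(self, refs):
        self.refs = refs
        self.state = UNFIXED
        self.protected = True
        self._lock = threading.Lock()   # stand-in for an atomic compare-and-swap

    def try_claim(self):
        """Atomically change Unfixed to Busy; return True only for the winner."""
        with self._lock:
            if self.state != UNFIXED:
                return False
            self.state = BUSY
            return True


def exclusive_fix(page, forwarding):
    """Fix the page if this thread wins the claim; return whether it did the fix."""
    if not page.try_claim():
        return False                    # someone else did, or is doing, the fix-up
    # Fix all references (conceptually through the unprotected fix-up view).
    page.refs = [forwarding.get(r, r) for r in page.refs]
    page.protected = False              # lift protection
    page.state = FIXED                  # only now may application threads use it
    return True


page = Page(refs=[100, 200])
print(exclusive_fix(page, {100: 10}), page.refs, page.state)
print(exclusive_fix(page, {100: 10}))   # a second attempt must not fix again
```

Because only the thread that wins the Unfixed to Busy transition touches the references, each page is fixed exactly once even when several fixers race for it.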

The Generic Algorithm - Concurrent Fix-up (who/how)
- Concurrent Fixing: fix-up that is initiated by the collector
  - All concurrency flavors are possible
  - Concurrent Fixers scan the heap, and try to Exclusively Fix each page
  - Failure is OK; someone else did (or is doing) the fix-up
- Trapped Fixing: forced fix-up
  - An application thread that causes an access violation becomes a Trapped Fixer
  - It executes a trap routine that attempts to Exclusively Fix the accessed page
  - If that fails, the thread must wait until the page becomes Fixed
- Completed when Concurrent Fixing exhausts the heap
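
A sketch of how the two kinds of fixers might interact. The `Page` class, the `fixed_by` field and the event used for waiting are illustrative assumptions; a real trap routine would run inside an access-violation handler rather than being called directly.

```python
import threading


class Page:
    def __init__(self):
        self.state = "Unfixed"
        self.fixed_by = None
        self._lock = threading.Lock()      # stand-in for an atomic state change
        self._fixed = threading.Event()

    def exclusive_fix(self, who):
        with self._lock:
            if self.state != "Unfixed":
                return False
            self.state = "Busy"
        # ... fix the page's references here, through the fix-up view ...
        self.fixed_by = who
        self.state = "Fixed"
        self._fixed.set()
        return True

    def wait_fixed(self):
        self._fixed.wait()


def concurrent_fixer(pages, who="concurrent"):
    """Scan the heap and try to fix every page; losing a race is fine."""
    for page in pages:
        page.exclusive_fix(who)


def on_access_violation(page, who="trapped"):
    """Trap routine: fix the page ourselves, or wait for whoever is fixing it."""
    if not page.exclusive_fix(who):
        page.wait_fixed()
    # The page is Fixed now, so the faulting access can be retried.


pages = [Page() for _ in range(4)]
on_access_violation(pages[2])   # the application touches page 2 first
concurrent_fixer(pages)         # the collector's pass fixes the rest
print([p.fixed_by for p in pages])
```

The concurrent pass is what guarantees termination: once it has visited every page, no page can remain Unfixed regardless of which pages the application happened to touch.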

Our Implementation
- Implemented for Java, on top of the IBM J9 JVM
  - Using Mark Sweep GC on the entire heap
  - Reusing the parallel move code and fix-up logic of J9’s compactor
- Configurable fix-up unit, bigger than the OS page size
  - Fix up more than an OS page on each trap
  - Fewer access violations (more “hot” memory fixed each time)
    - Reduces the relative cost of traps
  - Longer trapped fixing
    - We found that a significant unit size increase can be tolerated
- Concurrent fixing by incremental work of the Java threads
  - For each X KB of allocation, fix up X*F KB of heap space
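
The allocation-paced rule ("for each X KB of allocation, fix up X*F KB") can be sketched as a credit counter. `PacedFixer` and its fields are hypothetical names, the sketch is single-threaded, and the actual fixing of a unit (an Exclusive Fix attempt) is reduced to recording its offset.

```python
class PacedFixer:
    def __init__(self, heap_kb, unit_kb, factor):
        self.heap_kb = heap_kb
        self.unit_kb = unit_kb    # fix-up unit size
        self.factor = factor      # F: KB of fix-up owed per KB allocated
        self.cursor_kb = 0        # heap below this offset has been visited
        self.credit_kb = 0.0      # fix-up work owed but not yet done

    def on_allocation(self, alloc_kb):
        """Called by an allocating thread; return the unit offsets it fixes now."""
        self.credit_kb += alloc_kb * self.factor
        fixed = []
        while self.credit_kb >= self.unit_kb and self.cursor_kb < self.heap_kb:
            fixed.append(self.cursor_kb)      # a real fixer would fix this unit
            self.cursor_kb += self.unit_kb
            self.credit_kb -= self.unit_kb
        return fixed

    def done(self):
        return self.cursor_kb >= self.heap_kb


fixer = PacedFixer(heap_kb=256, unit_kb=64, factor=4)
print(fixer.on_allocation(16))   # 16 KB * 4 = 64 KB of credit: one unit
print(fixer.on_allocation(8))    # 32 KB of credit: not enough for a unit yet
print(fixer.on_allocation(40), fixer.done())
```

A larger F finishes the concurrent pass after less allocation, at the price of more fix-up work charged to each allocating thread.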

Testing Environment
- Red Hat Linux OS
- An Intel Pentium 4 uniprocessor, and a 4-way Intel Xeon MP server
- Benchmarks: SPECjbb2000, Health (from the Java-olden suite) and SPECjvm98
- Compaction triggered every N GCs
  - N=10 for SPECjvm98, 15 for SPECjbb, and 1 for Health
- No compaction (Base) compared to compaction with three area selection heuristics:
  - Dark Matter reduction (DM)
  - Creating Bigger Free chunks (BF)
  - Round-Robin (RR)

Results: Throughput and Pause Time (Highlights)
- Minor effect on pause time
- The area selection heuristic matters, and should not be hard-coded

Results: Overall Costs of Concurrent Fix-up
- INCR-C (our mostly concurrent incremental compactor) compared to:
  - INCR-STW: same incremental move, with STW fix-up
  - FULL-STW: full heap move, with STW fix-up
- STWinc
  - Pause time contribution is 3 times the move time
  - No throughput gain over our compactor
- STWfull
  - Very large pause time increase
  - Compaction time is up to ten times the mark time
  - Significant throughput gain with Health, some gain with SPECjvm
- Concurrent fix-up is better than STW fix-up, for incremental compaction
- Partial (but “smart”) compaction may be more effective than full compaction

Results: Cost of Access Violations
- Concern: page protection techniques have recently become relatively inefficient, due to the increase in computational speed
- [Table: SPECjbb costs of Trapped fix-up]
- Conclusion: for concurrent fix-up, bigger fix-up units (64 KB-256 KB) are acceptable, and justify the use of page protection techniques

Results: Java Mutator Utilization
- Concern: Trapped fix-up cannot be controlled
  - If most pages are accessed all the time, the Java application, right after STW, will practically do nothing but fix-up
- We measured the portion of time spent on trapped fix-up in the first 450 ms
- Acceptable Java utilization
- Reasonable Java utilization after 50-100 ms
- With a 256 KB fix-up unit, results are even better
  - SPECjbb’s Java utilization in the first 100 ms improves from 16% to 48%
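
Mutator utilization over a time window can be read as one minus the fraction of the window spent in trapped fix-up. The sketch below assumes a single application thread with non-overlapping fix-up intervals; the function name and the interval values are made up for illustration and are not measurements from the talk.

```python
def mutator_utilization(fixup_intervals, window_start, window_end):
    """Fraction of [window_start, window_end) not spent in trapped fix-up.

    fixup_intervals is a list of (start_ms, end_ms) pairs, assumed not to overlap.
    """
    busy = 0.0
    for start, end in fixup_intervals:
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > 0:
            busy += overlap
    return 1.0 - busy / (window_end - window_start)


# Illustrative trace: most fix-up happens right after the threads resume.
trace = [(0, 30), (40, 60), (120, 130)]
print(mutator_utilization(trace, 0, 100))
print(mutator_utilization(trace, 100, 200))
```

A trace of this shape, with utilization low in the first window and recovering in the next, is the pattern the slide describes.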

Related Work
- Compaction techniques
  - Jonkers, Morris: The threaded algorithm. 1978, 1979
  - Flood et al.: Parallel garbage collection for shared memory multiprocessors. 2001
  - Sachindran and Moss: Mark-Copy: Fast copying GC with less space overhead. 2003
  - Abuaiadh et al.: An efficient parallel heap compaction algorithm. 2004
- Incremental compaction
  - Lang and Dupont: Incremental incrementally compacting garbage collection. 1987
  - Ben-Yitzhak et al.: An algorithm for parallel incremental compaction. 2002

Related Work (cont.)
- Concurrent copying collectors
  - Baker: List processing in real-time on a serial computer. 1978
  - Brooks: Trading data space for reduced time and code space. 1984
  - Appel et al.: Real-time concurrent collection on stock multiprocessors. 1988
- Fully concurrent compaction
  - Larose and Feeley: A compacting incremental collector and its performance... 1998
  - Bacon et al.: Controlling fragmentation and space consumption in the Metronome. 2003
- Use of page protection
  - Appel et al.: Virtual memory primitives for user programs. 1991

Conclusions
- A solution is proposed for bounding the pause time effect of compaction
- Mostly concurrent compaction:
  - A generic solution suitable for Mark Sweep, and other GCs
  - Method: partial Move in STW, concurrent Fix-up
- A Java implementation is presented, on top of the IBM J9 JVM
- Minor pause time hit (less than 1/3 of the mark time)
- Highly efficient: no significant hit due to concurrent fix-up
- Improved performance with most benchmarks

Future Directions
Explore adaptive and sophisticated methods for:
- Triggering the mostly concurrent compaction
- Choosing an optimal policy for selecting the parts to compact
- Minimizing the costs of Trapped Fix-up, by performing “proactive” concurrent fix-up
  - Fix the predicted next locations of access violations, rather than performing a sequential pass of the heap

End

Java Mutator Utilization: The Mark Perspective 2000…