Design and Implementation of a Generic Resource-Sharing Virtual-Time

Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher Tal Ben-Nun Yoav Etsion Dror Feitelson Scl. Eng & CS Hebrew University CS Dept Barcelona SC Ctr Scl. Eng & CS Hebrew University Supported by the Israel Science Foundation, grant no. 28/09

Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher Goal is to control share of resources, not to optimize performance – important in virtualization

Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher Goal is to control share of resources, not to optimize performance – important in virtualization Same module used for diverse resources

Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher Goal is to control share of resources, not to optimize performance – important in virtualization Same module used for diverse resources Mechanism used: dispatch the most deserving client at each instant

Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher Goal is to control share of resources, not to optimize performance – important in virtualization Same module used for diverse resources Mechanism used: dispatch the most deserving client at each instant Selection of deserving client using virtual time formalism

Design and Implementation of a Generic Resource-Sharing Virtual-Time Dispatcher Goal is to control share of resources, not to optimize performance – important in virtualization Same module used for diverse resources Mechanism used: dispatch the most deserving client at each instant Selection of deserving client using virtual time formalism Implemented and measured in Linux

Motivation Context: VMM for server consolidation Multiple legacy servers share physical platform Improved utilization and easier maintenance Flexibility in allocating resources to virtual machines Virtual machines typically run a single application (“appliances”)

Motivation Assumed goal: enforce predefined allocation of resources to different virtual machines (“fair share” scheduling) Based on importance / SLA Can change with time or due to external events Problem: what is “ 30% of the resources” when there are many different resources, and diverse requirements?

Global Scheduling “Fair share” usually applied to a single resource But what if this resource is not a bottleneck? Global scheduling idea: 1) Identify the system bottleneck resource 2) Apply fair share scheduling on this resource 3) This induces appropriate allocations on other resources This paper: how to apply fair-share scheduling on any resource in the system

Previous Work I: Virtual Time Accounting is inversely proportional to allocation Schedule the client that is farthest behind

Previous Work II: Traffic Shaping • Leaky bucket – Variable requests – Constant rate transmission – Bucket represent buffer • Token bucket – Variable requests – Constant allocations – Bucket represents stored capacity

Putting them Together: RSVT • “Resource sharing”: all clients make progress continuously – Generalization of processor sharing • Each job has its ideal resource sharing progress – This is considered to be the allocation ai – Grows at constant rate • Each job has its actual consumption ci – Grows only when job runs • Scheduling priority is the difference: pi = ai – ci

Example Three clients Allocations roughly 50%, 30%, 20% Consumed resource time Consumption always occur in resource time Wallclock time

Bookkeeping • The set of active jobs is A • The relative allocation of job i is ri • During an interval T job k has run • Update allocations: • Update consumptions:

The Active Set • Active jobs (the set A) are those that can use the resource now • Allocations are relative to the active set • The active set may change – New job arrives – Job terminates – Job stops using resource temporarily – Job resumes use of resource

Grace Period • Intermittent activity: process data / send packet • should retain allocations even when inactive • Thus ai continues to grow during grace period after it becomes inactive • Grace period reflects notion of continuity • Sub-second time scale

Rebirth • Resumption after very long inactive periods should be treated as new arrivals • Due to grace period, job that becomes inactive accrues extra allocation • Forget this extra allocation after rebirth period (set ai = ci) • Two order of magnitude larger than grace period

Implementation • Kernel module with generic functionality – Create / destroy module – Create / destroy client – Make request / set active / set inactive – Make allocations – Dispatch – Check-in (note resource usage) • Glue code for specific subsystems – Currently networking and CPU – Plan to add disk I/O

Networking Glue Code App Use the Linux Qo. S framework: create RSVT queueing discipline TCP IP queueing discipline Qo. S NIC

Networking Glue Code App Non-RSVT traffic has priority (e. g. NFS traffic) and is counted as dead time TCP IP no RSVT? send immediately yes enqueue select and send NIC

CPU Scheduling Glue Code • Use Linux modular scheduling core • Add an RSVT scheduling policy – RSVT module essentially replaces the policy runqueue – Initial implementation only for uniprocessors • CFS and possibly other policies also exist and have higher priority – When they run, this is considered dead time

Timer Interrupts • Linux employs timer interrupts (250 Hz) • Allocations are done at these times – Translate time into microseconds – Subtract known dead time (unavailable to us) – Divide among active clients according to relative allocations – Bound divergence of allocation from consumption • Also handling of grace period (mark as inactive) • Also handling of rebirth (set ai = ci)

Multi-Queue • At dispatch, need to find client with highest priority • But priorities change at different rates • Solution: allow only a limited discrete set of relative priorities • Each priority has a separate queue • Maintain all clients in each queue in priority order • Only need to check the first in each queue to find the maximum

Experiment – Basic Allocations rate bandwidth 1 30. 89 0. 0 5 2 61. 41 0. 0

Experiment – Basic Allocations rate bandwidt h 1 15. 69 0. 11 2 30. 81 0.

Experiment – Active Set

Experiment – Grace Period

Experiment – Rebirth

Experiment – Throttling • Two competing MPlayers • The one with higher allocation does not need all of it – Allocation tracks consumption

Conclusions • Demonstrated generic virtual-time based resource sharing dispatcher • Need to complete implementation – Support for I/O scheduling – More details, e. g. SMP support • Building block of global scheduling vision