

  • Number of slides: 37

Computers for the Post-PC Era
David Patterson, University of California at Berkeley
patterson@cs.berkeley.edu
UC Berkeley IRAM Group
UC Berkeley ISTORE Group
istore-group@cs.berkeley.edu
May 2000
Slide 1

Perspective on the Post-PC Era
• The Post-PC era will be driven by 2 technologies:
1) "Gadgets": tiny embedded or mobile devices
– ubiquitous: in everything
– e.g., successors to the PDA, cell phone, wearable computers
2) Infrastructure to support such devices
– e.g., successors to big fat web servers, database servers
Slide 2

VIRAM-1 Block Diagram
Slide 3

VIRAM-1: System on a Chip
Prototype scheduled for tape-out mid-2000
• 0.18 µm EDL process
• 16 MB DRAM, 8 banks
• MIPS scalar core and caches @ 200 MHz
• 4 64-bit vector pipelines (4 vector pipes/lanes) @ 200 MHz
• 4 100 MB/s parallel I/O lines
• 17 x 17 mm, 2 Watts
• 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
• 1.6 GFLOPS (64-bit), 6.4 GOPS (16-bit)
[Block diagram: CPU + $, four vector lanes, crossbar, two memory blocks (64 Mbits / 8 MBytes each), I/O]
Slide 4

Problem: General Element Permutation
• Hardware for a full vector permutation instruction (128 16-bit elements, 256-bit datapath)
• Datapath: 16x16 (x 16-bit) crossbar; scales as O(N²)
• Control: 16 16-to-1 multiplexors; scales as O(N·log N)
• Other problems
– Consecutive result elements not written together; time/energy wasted on a wide vector register file port
Slide 5

Simple Vector Permutations
• Simple steps of butterfly permutations
– A register provides the butterfly radix
– Separate instructions for moving elements to left/right
• Sufficient semantics for
– Fast reductions of vector registers (dot products), as in the sketch below
– Fast FFT/DCT kernels
Slide 6
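As a rough illustration of the reduction use case, the following C sketch mimics a butterfly-style reduction in scalar code: each loop iteration plays the role of one permutation step whose radix comes from a register, moving the upper half of the active elements onto the lower half before an element-wise add, so log2(VL) vector adds replace VL-1 dependent scalar adds. The function name and the plain C loops are illustrative only, not VIRAM ISA code.

```c
/* Sketch of a vector reduction using butterfly-style permutation steps.
 * Assumption: names and the scalar C emulation are illustrative only. */
#include <stdio.h>

#define VL 16                       /* vector length (power of two) */

float butterfly_reduce(float v[VL])
{
    /* Each outer iteration models one permutation (radix = half the active
     * elements) followed by a vector add. */
    for (int radix = VL / 2; radix >= 1; radix /= 2)
        for (int i = 0; i < radix; i++)
            v[i] += v[i + radix];   /* element i accumulates element i+radix */
    return v[0];
}

int main(void)
{
    float v[VL];
    for (int i = 0; i < VL; i++) v[i] = (float)(i + 1);        /* 1..16 */
    printf("sum = %g (expected %g)\n", butterfly_reduce(v), 16.0f * 17.0f / 2.0f);
    return 0;
}
```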

Hardware for Simple Permutations
• Hardware for 128 16-bit elements, 256-bit datapath
• Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16 b, or 32 b only); scales as O(N)
• Control: 6 control cases; scales as O(N)
• Other benefits
– Consecutive result elements written together
– Buses used only for small radices
Slide 7

FFT: Straightforward
Problem: most time is spent in short vectors in the later stages of the FFT
Slide 8
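A small sketch of the effect described above. Assumption: a natural-order, radix-2, decimation-in-frequency FFT where each stage is vectorized across the butterflies inside one group, so the usable vector length halves every stage; the loop ordering is chosen only to illustrate the slide's point, not taken from the VIRAM code.

```c
/* Print the natural vector length per stage of an N-point radix-2 FFT
 * under the decimation-in-frequency grouping assumed above. */
#include <stdio.h>

int main(void)
{
    int n = 1024;                                   /* FFT points */
    for (int stage = 1, span = n / 2; span >= 1; stage++, span /= 2)
        printf("stage %2d: natural vector length = %4d\n", stage, span);
    return 0;
}
```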

FFT: Transpose inside Vector Regs
Slide 9

FFT: Straightforward
Slide 10

VIRAM-1 Design Status
• MIPS scalar core
– Synthesizable RTL code received from MIPS
– Cache RAMs to be compiled for IBM technology
– FPU RTL code almost complete
• Vector unit
– RTL models for sub-blocks developed; currently integrated and tested
– Control logic to be compiled for IBM technology
– Full-custom layout for multipliers/adders developed; layout for shifters to be developed
• Memory system
– Synthesizable model for DRAM controllers done
– To be integrated with IBM DRAM macros
– Full-custom layout for crossbar under development
• Testing infrastructure
– Environment developed for automatic test & validation
– Directed tests for single/multiple instruction groups developed
– Random instruction sequence generator developed
Slide 11

FPU Features
• Executes MIPS IV ISA single-precision FP instructions
• Thirty-two 32-bit floating-point registers
• Two 32-bit control registers
• One fully pipelined, nearly full IEEE-754 compliant execution unit, 3-cycle latency (division takes 10 cycles) (from Albert Ma @ MIT)
• 6-stage pipeline (R-X-X-X-CDB-WB)
• Support for partial out-of-order execution and precise exceptions
• Scalar core dispatches FP instructions to the FPU using an interface that splits instructions into 3 classes:
– Arithmetic instructions (ADD.S, SUB.S, MUL.S, DIV.S, ABS.S, NEG.S, C.cond.S, CVT.S.W, CVT.W.S, TRUNC.W.S, MOVZ.S, MOVN.S)
– From-coprocessor data transfer instructions (SWC1, MFC1, CFC1)
– To-coprocessor data transfer instructions (LWC1, MTC1, CTC1)
Slide 12

FPU Architecture
Slide 13

Multiplier Partitioning
• 64-bit multiplier built from 16-bit multiplier subblocks
• Subblocks combined with adders to perform larger multiplies
• Performs 2 simultaneous 32-bit multiplies by grouping 4 subblocks
• Performs 4 simultaneous 16-bit multiplies by using individual subblocks
• Unused blocks turned off to conserve power
Slide 14
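To make the partitioning concrete, here is a C sketch of the arithmetic behind it: a 32x32-bit multiply composed from four 16x16-bit sub-multiplies plus shifted adds, checked against a direct 64-bit multiply. This only models the arithmetic; the actual datapath, SIMD grouping, and clock gating of unused subblocks are not represented.

```c
/* Compose a 32x32 multiply from 16x16 subblock multiplies (sketch). */
#include <stdint.h>
#include <stdio.h>

static uint64_t mul16(uint16_t a, uint16_t b)      /* one 16-bit subblock */
{
    return (uint64_t)a * (uint64_t)b;
}

static uint64_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint16_t al = a & 0xFFFF, ah = a >> 16;
    uint16_t bl = b & 0xFFFF, bh = b >> 16;
    /* Four 16x16 partial products combined with shifted adds. */
    return mul16(al, bl)
         + (mul16(al, bh) << 16)
         + (mul16(ah, bl) << 16)
         + (mul16(ah, bh) << 32);
}

int main(void)
{
    uint32_t a = 0x12345678u, b = 0x9ABCDEF0u;
    printf("composed: %016llx\n", (unsigned long long)mul32_from_16(a, b));
    printf("direct:   %016llx\n", (unsigned long long)((uint64_t)a * b));
    return 0;
}
```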

FPU Current Status
• Current functionality
– Able to execute most instructions (all except C.cond.S, CFC1 and CTC1)
– Supports precise exception semantics
– Functionality verification:
» Used a random test generator that generates/kills instructions at random and compares the results from the RTL Verilog simulator against the results from an ISA Perl simulator
• What remains to be done
– Instructions that use the control registers (C.cond.S, CFC1 and CTC1)
– Exception generation
– Integrate execution pipeline with the rest of the design
– Synthesize, place and route
– Final assembly and verification of the multiplier
• Performance
– Sustainable throughput: 1 instruction/cycle (assuming no data hazards)
– Instruction latency: 6 cycles
Slide 15

UC-IBM Agreement
• Biggest IRAM obstacle: intellectual property agreement between the University of California and IBM
• Can the university accept free fab costs ($2.0 M to $2.5 M) in return for capped non-exclusive patent licensing fees for IBM if UC files for IRAM patents?
• Process started with IBM March 1999
• IBM won't give full process info until contract
• UC started negotiating seriously Jan 2000
• Agreement June 1, 2000!
Slide 16

Other examples: IBM "Blue Gene"
• 1 PetaFLOPS in 2005 for $100 M?
• Application: protein folding
• Blue Gene chip
– 32 multithreaded RISC processors + ?? MB embedded DRAM + high-speed network interface on a single 20 x 20 mm chip
– 1 GFLOPS / processor
• 2' x 2' board = 64 chips (2 K CPUs)
• Rack = 8 boards (512 chips, 16 K CPUs)
• System = 64 racks (512 boards, 32 K chips, 1 M CPUs)
• Total: 1 million processors in just 2000 sq. ft.
Slide 17
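A quick sanity check of the slide's arithmetic; all inputs come from the bullets above, including the 1 GFLOPS-per-processor figure.

```c
/* Verify the Blue Gene chip/CPU/PFLOPS totals quoted on the slide. */
#include <stdio.h>

int main(void)
{
    long cpus_per_chip = 32, chips_per_board = 64;
    long boards_per_rack = 8, racks = 64;

    long chips = chips_per_board * boards_per_rack * racks;
    long cpus  = chips * cpus_per_chip;
    double pflops = cpus * 1.0e9 / 1.0e15;      /* 1 GFLOPS per CPU */

    printf("%ld chips, %ld CPUs, ~%.2f PFLOPS\n", chips, cpus, pflops);
    return 0;
}
```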

Other examples: Sony Playstation 2
• Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
– Superscalar MIPS core + vector coprocessor + graphics/DRAM
– Claim: "Toy Story" realism brought to games
Slide 18

Outline
1) Example microprocessor for Post-PC gadgets
2) Motivation and the ISTORE project vision
– AME: Availability, Maintainability, Evolutionary growth
– ISTORE's research principles
– Benchmarks for AME
• Conclusions and future work
Slide 19

Lampson: Systems Challenges
• Systems that work
– Meeting their specs
– Always available
– Adapting to changing environment
– Evolving while they run
– Made from unreliable components
– Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
– Understanding when it doesn't matter
"Computer Systems Research - Past and Future," keynote address, 17th SOSP, Dec. 1999. Butler Lampson, Microsoft
Slide 20

Hennessy: What Should the "New World" Focus Be?
• Availability
– Both appliance & service
• Maintainability
– Two functions:
» Enhancing availability by preventing failure
» Ease of SW and HW upgrades
• Scalability
– Especially of service
• Cost
– Per device and per service transaction
• Performance
– Remains important, but it's not SPECint
"Back to the Future: Time to Return to Longstanding Problems in Computer Systems?" keynote address, FCRC, May 1999. John Hennessy, Stanford
Slide 21

The real scalability problems: AME
• Availability
– systems should continue to meet quality-of-service goals despite hardware and software failures
• Maintainability
– systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary growth
– systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today's scales, and will only get worse as systems grow
Slide 22

Principles for achieving AME (1)
• No single points of failure
• Redundancy everywhere
• Performance robustness is more important than peak performance
– "performance robustness" implies that real-world performance is comparable to best-case performance
• Performance can be sacrificed for improvements in AME
– resources should be dedicated to AME
» compare: biological systems spend > 50% of resources on maintenance
– can make up performance by scaling the system
Slide 23

Principles for achieving AME (2)
• Introspection
– reactive techniques to detect and adapt to failures, workload variations, and system evolution
– proactive techniques to anticipate and avert problems before they happen
Slide 24

ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4 TB storage
– cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
– more CPU than NAS; fewer disks/node than a cluster
[Figure: ISTORE chassis with 80 nodes, 8 per tray; 2 levels of switches (20 x 100 Mbit/s, 2 x 1 Gbit/s); environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...]
[Figure: intelligent disk "brick" with portable PC CPU (Pentium II/266) + DRAM, redundant NICs (4 x 100 Mb/s links), diagnostic processor, disk, half-height canister]
Slide 25

ISTORE-1 Status
• 10 nodes manufactured
• Boots OS
• Diagnostic processor interface SW complete
• PCB backplane: not yet designed
• Finish 80-node system: Summer 2000
Slide 26

Hardware techniques
• Fully shared-nothing cluster organization
– truly scalable architecture
– architecture that tolerates partial failure
– automatic hardware redundancy
Slide 27

Hardware techniques (2)
• No central processor unit: distribute processing with storage
– Serial lines and switches are also growing with Moore's Law; less need today to centralize vs. bus-oriented systems
– Most storage servers are limited by the speed of their CPUs; why does this make sense?
– Why not amortize the sheet metal, power, and cooling infrastructure for the disk to add a processor, memory, and network?
– If AME is important, must provide resources to be used to help AME: local processors responsible for the health and maintenance of their storage
Slide 28

Hardware techniques (3)
• Heavily instrumented hardware
– sensors for temperature, vibration, humidity, power, intrusion
– helps detect environmental problems before they can affect system integrity
• Independent diagnostic processor on each node
– provides remote control of power, remote console access to the node, selection of node boot code
– collects, stores, and processes environmental data for abnormalities
– non-volatile "flight recorder" functionality (see the sketch below)
– all diagnostic processors connected via an independent diagnostic network
Slide 29
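As a loose illustration of the "flight recorder" idea, here is a minimal C sketch of a fixed-size ring buffer of timestamped sensor readings, the sort of structure a diagnostic processor could keep so that recent environmental history survives for post-mortem analysis. The record layout, capacity, and sensor name are assumptions for illustration; persistence to non-volatile storage is not shown.

```c
/* Minimal "flight recorder" sketch: ring buffer of sensor readings. */
#include <stdio.h>
#include <time.h>

#define RECORDER_SLOTS 128

struct reading { time_t when; const char *sensor; double value; };

static struct reading ring[RECORDER_SLOTS];
static int next_slot = 0;

static void record(const char *sensor, double value)
{
    ring[next_slot] = (struct reading){ time(NULL), sensor, value };
    next_slot = (next_slot + 1) % RECORDER_SLOTS;    /* oldest entry overwritten */
}

int main(void)
{
    for (int i = 0; i < 200; i++)                    /* more samples than slots */
        record("temp_C", 35.0 + (i % 7));

    /* Dump in chronological order, starting at the oldest surviving entry. */
    for (int i = 0; i < RECORDER_SLOTS; i++) {
        struct reading *r = &ring[(next_slot + i) % RECORDER_SLOTS];
        if (r->sensor) printf("%ld %s %.1f\n", (long)r->when, r->sensor, r->value);
    }
    return 0;
}
```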

Hardware techniques (4)
• On-demand network partitioning/isolation
– Internet applications must remain available despite component failures anyway, so a subset of nodes can be isolated for preventative maintenance
– Allows testing and repair of an online system
– Managed by the diagnostic processor and network switches via the diagnostic network
Slide 30

Hardware techniques (5)
• Built-in fault injection capabilities
– Power control to individual node components
– Injectable glitches into I/O and memory busses
– Managed by the diagnostic processor
– Used for proactive hardware introspection
» automated detection of flaky components
» controlled testing of error-recovery mechanisms
– Important for AME benchmarking (see next slide)
Slide 31

"Hardware" techniques (6)
• Benchmarking
– One reason for the 1000X improvement in processor performance was the ability to measure (vs. debate) which design is better
» e.g., which is most important to improve: clock rate, clocks per instruction, or instructions executed?
– Need AME benchmarks
"what gets measured gets done"
"benchmarks shape a field"
"quantification brings rigor"
Slide 32

Availability benchmark methodology
• Goal: quantify variation in QoS metrics as events occur that affect system availability
• Leverage existing performance benchmarks
– to generate fair workloads
– to measure & trace quality-of-service metrics
• Use fault injection to compromise the system
– hardware faults (disk, memory, network, power)
– software faults (corrupt input, driver error returns), as sketched below
– maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
– the availability analogues of performance micro- and macro-benchmarks
Slide 33
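As one concrete (and heavily simplified) example of the software-fault category above, the C sketch below wraps read() so that a configurable fraction of calls return an injected error, mimicking "driver error returns." The wrapper name, the 10% fault rate, and the zero-byte reads are illustrative assumptions; this is not the actual ISTORE fault-injection harness.

```c
/* Inject error returns into a lower-layer call for availability testing. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static double fault_rate = 0.10;            /* fraction of calls to fail */

ssize_t faulty_read(int fd, void *buf, size_t count)
{
    if ((double)rand() / RAND_MAX < fault_rate) {
        errno = EIO;                        /* pretend the device failed */
        return -1;
    }
    return read(fd, buf, count);
}

int main(void)
{
    char buf[64];
    int failures = 0;
    srand(42);
    for (int i = 0; i < 1000; i++)
        if (faulty_read(STDIN_FILENO, buf, 0) < 0)   /* 0-byte reads: no input needed */
            failures++;
    printf("injected %d faults in 1000 calls\n", failures);
    return 0;
}
```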

Benchmark Availability? Methodology for reporting results
• Results are most accessible graphically
– plot change in QoS metrics over time
– compare to "normal" behavior
» 99% confidence intervals calculated from no-fault runs
Slide 34
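A minimal sketch of that reporting step, with fabricated numbers: compute the mean and a rough 99% band (plus or minus 2.58 sigma, normal approximation) of a QoS metric from no-fault runs, then flag fault-run samples that fall outside the band. The hits/sec values and the specific band construction are assumptions for illustration, not the project's actual data or statistics.

```c
/* Flag QoS samples that leave the "normal" band derived from no-fault runs. */
#include <math.h>
#include <stdio.h>

#define N 8

int main(void)
{
    double nofault[N]  = { 995, 1003, 1001, 988, 1010, 997, 1005, 1001 };
    double faultrun[N] = { 998, 1002, 640, 655, 700, 810, 930, 996 };

    double mean = 0, var = 0;
    for (int i = 0; i < N; i++) mean += nofault[i] / N;
    for (int i = 0; i < N; i++)
        var += (nofault[i] - mean) * (nofault[i] - mean) / (N - 1);
    double half = 2.58 * sqrt(var);          /* ~99% band, normal approximation */

    printf("normal behavior: %.1f +/- %.1f hits/sec\n", mean, half);
    for (int i = 0; i < N; i++)
        if (faultrun[i] < mean - half || faultrun[i] > mean + half)
            printf("t=%d: %.0f hits/sec is outside the 99%% band\n", i, faultrun[i]);
    return 0;
}
```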

Example results: multiple faults
[Figures: QoS traces for Windows 2000/IIS and Linux/Apache]
• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application performance, while Linux reconstruction does not
Slide 35

Conclusions (1): ISTORE
• Availability, Maintainability, and Evolutionary growth are key challenges for server systems
– more important even than performance
• ISTORE is investigating ways to bring AME to large-scale, storage-intensive servers
– via clusters of network-attached, computationally-enhanced storage nodes running distributed code
– via hardware and software introspection
– we are currently performing application studies to investigate and compare techniques
• Availability benchmarks a powerful tool?
– revealed undocumented design decisions affecting SW RAID availability on Linux and Windows 2000
Slide 36

Conclusions (2)
• IRAM is attractive for two Post-PC applications because of its low power, small size, and high memory bandwidth
– Gadgets: embedded/mobile devices
– Infrastructure: intelligent storage and networks
• Post-PC infrastructure requires
– New goals: Availability, Maintainability, Evolution
– New principles: introspection, performance robustness
– New techniques: isolation/fault insertion, software scrubbing
– New benchmarks: measure, compare AME metrics
Slide 37