

  • Number of slides: 86

Introduction to Hardware/Architecture
David A. Patterson
http://cs.berkeley.edu/~patterson/talks
{patterson, kkeeton}@cs.berkeley.edu
EECS, University of California, Berkeley, CA 94720-1776  1

What is a Computer System?
[layer diagram, software down to hardware: Application (Netscape); Compiler, Assembler; Operating System (Windows 98); Instruction Set Architecture; Processor, Memory, I/O system; Datapath & Control; Digital Design; Circuit Design; transistors]
• Coordination of many levels of abstraction  2

Levels of Representation
High Level Language Program (e.g., C):
  temp = v[k]; v[k] = v[k+1]; v[k+1] = temp;
(Compiler)
Assembly Language Program (e.g., MIPS):
  lw $t0, 0($2)
  lw $t1, 4($2)
  sw $t1, 0($2)
  sw $t0, 4($2)
(Assembler)
Machine Language Program (MIPS):
  0000 1010 1100 0101 1001 1111 0110 1000
  1100 0101 1010 0000 0110 1000 1111 1001
  1010 0000 0101 1100 1111 1000 0110 0101
  1100 0000 1010 1000 0110 1001 1111 ...
(Machine Interpretation)
Control Signal Specification  3

The Instruction Set: a Critical Interface
software <=> instruction set <=> hardware  4

Instruction Set Architecture (subset of Computer Architecture)
". . . the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." (Amdahl, Blaauw, and Brooks, 1964)
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Set
-- Instruction Formats
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions  5

Anatomy: 5 components of any Computer
Personal Computer:
• Processor (active): Control ("brain") and Datapath ("brawn"); often called (IBMese) "CPU" for "Central Processor Unit"
• Memory (passive): where programs and data live when running
• Devices:
  - Input: Keyboard, Mouse
  - Output: Display, Printer
  - Disk (both): where programs and data live when not running  6

Technology Trends: Microprocessor Capacity
Called "Moore's Law": 2X transistors/chip every 1.5 years
• Alpha 21264: 15 million
• Alpha 21164: 9.3 million
• PowerPC 620: 6.9 million
• Pentium Pro: 5.5 million
• Sparc Ultra: 5.2 million  7

Technology Trends: Processor Performance
• Processor performance increases 1.54X/yr, mistakenly referred to as Moore's Law (which is about transistors/chip)  8

Computer Technology => Dramatic Change
• Processor
  - 2X in speed every 1.5 years; 1000X performance in last 15 years
• Memory
  - DRAM capacity: 2X / 1.5 years; 1000X size in last 15 years
  - Cost per bit: improves about 25% per year
• Disk
  - Capacity: >2X in size every 1.5 years
  - Cost per bit: improves about 60% per year
  - 120X size in last decade
• State-of-the-art PC "when you graduate" (1997-2001)
  - Processor clock speed: 1500 MegaHertz (1.5 GigaHertz)
  - Memory capacity: 500 MegaBytes (0.5 GigaBytes)
  - Disk capacity: 100 GigaBytes (0.1 TeraBytes)
  - New units! Mega => Giga, Giga => Tera  9
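The 1000X-in-15-years figures follow directly from the doubling period; checking the arithmetic:

```latex
2^{15\,\mathrm{yr}\,/\,1.5\,\mathrm{yr}} = 2^{10} = 1024 \approx 1000\times
```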

Integrated Circuit Costs
Die cost = Wafer cost / (Dies per wafer x Die yield)
• Die cost grows roughly with the cube of the die area: fewer dies per wafer, and yield gets worse with die area
[wafer diagram: dies and flaws]  10

Die Yield (1993 data)
Raw dies per wafer (columns: die area in mm2):
  wafer diameter   100   144   196   256   324   400
  6"/15 cm         139    90    62    44    32    23
  8"/20 cm         265   177   124    90    68    52
  10"/25 cm        431   290   206   153   116    90
Die yield:         23%   19%   16%   12%   11%   10%
Good dies per wafer (before testing!):
  6"/15 cm          31    16     9     5     3     2
  8"/20 cm          59    32    19    11     7     5
  10"/25 cm         96    53    32    20    13     9
Typical CMOS process: alpha = 2, wafer yield = 90%, defect density = 2/cm2, 4 test sites/wafer
Typical cost of an 8", 4 metal layers, 0.5 um CMOS wafer: ~$2000  11
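This table is reproducible from the classic wafer cost model in Hennessy & Patterson, which the deck appears to use; a C sketch with the slide's parameters (the function names are mine):

```c
#include <math.h>
#include <stdio.h>

static const double PI = 3.14159265358979;

/* Wafer area over die area, minus dies lost around the rim. */
double dies_per_wafer(double wafer_diam_mm, double die_area_mm2) {
    double r = wafer_diam_mm / 2.0;
    return PI * r * r / die_area_mm2
         - PI * wafer_diam_mm / sqrt(2.0 * die_area_mm2);
}

/* Yield model: wafer_yield * (1 + defect_density * area / alpha)^(-alpha) */
double die_yield(double wafer_yield, double defects_per_mm2,
                 double die_area_mm2, double alpha) {
    return wafer_yield *
           pow(1.0 + defects_per_mm2 * die_area_mm2 / alpha, -alpha);
}

int main(void) {
    /* Slide's parameters: 8" (200 mm) wafer, 100 mm2 die, alpha = 2,
       wafer yield = 90%, defect density 2/cm2 = 0.02/mm2. */
    double raw   = dies_per_wafer(200.0, 100.0);      /* ~270 raw dies */
    double yield = die_yield(0.90, 0.02, 100.0, 2.0); /* ~22.5% yield  */
    printf("raw %.0f, yield %.0f%%, good %.0f\n",
           raw, 100.0 * yield, raw * yield);          /* ~61 good dies */
    return 0;
}
```

The small remaining differences from the table (265 raw, 59 good) come from the 4 test sites the slide reserves on each wafer.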

1993 Real World Examples
Chip          Metal   Line   Wafer  Defects  Area   Dies/  Yield  Die
              layers  width  cost   /cm2     (mm2)  wafer         cost
386DX           2     0.90   $900    1.0      43    360    71%    $4
486DX2          3     0.80   $1200   1.0      81    181    54%    $12
PowerPC 601     4     0.80   $1700   1.3     121    115    28%    $53
HP PA 7100      3     0.80   $1300   1.0     196     66    27%    $73
DEC Alpha       3     0.70   $1500   1.2     234     53    19%    $149
SuperSPARC      3     0.70   $1700   1.6     256     48    n/a    $272  12

Other Costs
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Packaging cost depends on pins, heat dissipation.
Chip          Die cost  Package pins  type  cost  Test & Assembly  Total
386DX         $4        132           QFP   $1    $4               $9
486DX2        $12       168           PGA   $11   $12              $35
PowerPC 601   $53       304           QFP   $3    $21              $77
HP PA 7100    $73       504           PGA   $35   $16              $124
DEC Alpha     $149      431           PGA   $30   $23              $202
SuperSPARC    $272      293           PGA   $20   $34              $326
Pentium       $417      273           PGA   $19   $37              $473  13

System Cost: 1995-96 Workstation
System        Subsystem                % of total cost
Cabinet       Sheet metal, plastic       1%
              Power supply, fans         2%
              Cables, nuts, bolts        1%
              (Subtotal)                (4%)
Motherboard   Processor                  6%
              DRAM (64 MB)              36%
              Video system              14%
              I/O system                 3%
              Printed circuit board      1%
              (Subtotal)               (60%)
I/O Devices   Keyboard, mouse            1%
              Monitor                   22%
              Hard disk (1 GB)           7%
              Tape drive (DAT)           6%
              (Subtotal)               (36%)  14

COST v. PRICE
Q: What % of company income goes to Research and Development (R&D)?
[price build-up diagram; ranges are workstation-PC]
• Component cost (25-31% of list price): inputs: chips, displays, ...
• Direct costs (8-10%): making it: labor, scrap, returns, ...
• Gross margin (33-14%): overhead: R&D, rent, marketing, profits, ...
• Average discount (33-45%): commission: channel profit, volume discounts
Component cost +25-100% => average selling price; +50-80% more => list price  15

Outline
• Review of Five Technologies: Processor, Memory, Disk, Network, Systems
  - Description / History / Performance Model
  - State of the Art / Trends / Limits / Innovation
• Common Themes across Technologies
  - Performance: per access (latency) + per byte (bandwidth)
  - Fast: Capacity, BW, Cost; Slow: Latency, Interfaces
  - Moore's Law affecting all chips in system  16

Processor Trends / History
• Microprocessor: main CPU of "all" computers
  - <1986: +35%/yr performance increase (2X / 2.3 yr)
  - >1987 (RISC): +60%/yr performance increase (2X / 1.5 yr)
  - Cost fixed at $500/chip, power whatever can cool
• CPU time = Seconds/Program = Instructions/Program x Clocks/Instruction x Seconds/Clock (see the sketch below)
• History of innovations to reach 2X / 1.5 yr:
  - Pipelining (helps seconds/clock, or clock rate)
  - Out-of-Order Execution (helps clocks/instruction)
  - Superscalar (helps clocks/instruction)
  - Multilevel Caches (helps clocks/instruction)  17
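A minimal sketch of the slide's performance equation in C; all three inputs are made-up example values:

```c
#include <stdio.h>

int main(void) {
    /* CPU time = Instructions/Program x Clocks/Instruction x Seconds/Clock */
    double instructions = 1e9;    /* dynamic instruction count (assumed)   */
    double cpi          = 1.5;    /* average clocks per instruction (assumed) */
    double clock_hz     = 600e6;  /* 600 MHz clock, ~1.7 ns per clock      */

    double cpu_time = instructions * cpi / clock_hz;
    printf("CPU time: %.2f seconds\n", cpu_time);  /* prints 2.50 */
    return 0;
}
```

Each of the three innovations listed above attacks one factor: pipelining shortens seconds/clock, while out-of-order, superscalar, and caches lower clocks/instruction.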

Pipelining is Natural!
• Laundry example: Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away
• Washer takes 30 minutes
• Dryer takes 30 minutes
• "Folder" takes 30 minutes
• "Stasher" takes 30 minutes to put clothes into drawers  18

Sequential Laundry
[timeline figure: task order A, B, C, D; 6 PM to 2 AM; each load occupies four 30-minute stages in turn]
• Sequential laundry takes 8 hours for 4 loads  19

Pipelined Laundry: Start work ASAP
[timeline figure: task order A, B, C, D; each load starts 30 minutes after the previous one]
• Pipelined laundry takes 3.5 hours for 4 loads!  20
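Both totals drop out of the standard pipeline timing model (my notation, not the slide's), with n = 4 loads, k = 4 stages, and t = 30 minutes per stage:

```latex
T_{\mathrm{seq}} = n \cdot k \cdot t = 4 \cdot 4 \cdot 0.5\,\mathrm{h} = 8\,\mathrm{h}
\qquad
T_{\mathrm{pipe}} = (k + n - 1) \cdot t = (4 + 4 - 1) \cdot 0.5\,\mathrm{h} = 3.5\,\mathrm{h}
```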

Pipeline Hazard: Stall
[timeline figure: loads A-F; a bubble opens in the pipeline]
• A depends on D; stall since folder tied up  21

Out-of-Order Laundry: Don't Wait
[timeline figure: loads A-F; the bubble is filled by later loads]
• A depends on D; rest continue; need more resources to allow out-of-order  22

Superscalar Laundry: Parallel per stage
[timeline figure: loads A (light clothing), B (dark clothing), C (very dirty clothing), D, E, F running through parallel washers and dryers]
• More resources, HW to match mix of parallel tasks?  23

Superscalar Laundry: Mismatch Mix
[timeline figure: loads A (light clothing), B (dark clothing), C (light clothing), D]
• Task mix underutilizes extra resources  24

State of the Art: Alpha 21264
• 15M transistors
• Two 64 KB caches on chip; 16 MB L2 cache off chip
• Clock <1.7 nsec, or >600 MHz (fastest Cray supercomputer: T90 at 2.2 nsec)
• 90 watts
• Superscalar: fetches up to 6 instructions/clock cycle, retires up to 4 instructions/clock cycle
• Out-of-order execution  25

Today's Situation: Microprocessor
                         MIPS R5000   MIPS R10000    10k/5k
Clock Rate               200 MHz      195 MHz        1.0x
On-Chip Caches           32K/32K      32K/32K        1.0x
Instructions/Cycle       1 (+ FP)     4              4.0x
Pipe stages              5            5-7            1.2x
Model                    In-order     Out-of-order   ---
Die Size (mm2)           84           298            3.5x
  without cache, TLB     32           205            6.3x
Development (man yr.)    60           300            5.0x
SPECint_base95           5.7          8.8            1.6x  26

Memory History/Trends/State of the Art
• DRAM: main memory of all computers
  - Commodity chip industry: no company >20% share
  - Packaged in SIMM or DIMM (e.g., 16 DRAMs/SIMM)
• State of the Art: $152, 128 MB DIMM (16 64-Mbit DRAMs), 10 ns x 64 b (800 MB/sec)
• Capacity: 4X / 3 yrs (60%/yr), i.e. Moore's Law
• MB/$: +25%/yr
• Latency: -7%/year; Bandwidth: +20%/yr (so far)
source: www.pricewatch.com, 5/21/98  27

Memory Summary
• DRAM: rapid improvements in capacity, MB/$, bandwidth; slow improvement in latency
• Processor-memory interface (cache + memory bus) is the bottleneck to delivered bandwidth
  - Like a network, the memory "protocol" is major overhead  28

Processor Innovations/Limits
• Low cost, low power embedded processors
  - Lots of competition, innovation
  - Integer perf. of embedded proc. ~1/2 desktop processor
  - StrongARM 110: 233 MHz, 268 MIPS, 0.36 W typ., $49
• Very Long Instruction Word (Intel/HP IA-64 "Merced")
  - multiple ops/instruction, compiler controls parallelism
• Consolidation of desktop industry? Innovation?
  [figure: MIPS, PA-RISC, PowerPC, Alpha, SPARC, x86 converging toward IA-64]  29

Processor Summary
• SPEC performance doubling / 18 months
  - Growing CPU-DRAM performance gap & tax
  - Running out of ideas, competition? Back to 2X / 2.3 yrs?
• Processor tricks not as useful for transactions?
  - Clock rate increase compensated by CPI increase?
  - When >100 MIPS on TPC-C?
• Cost fixed at ~$500/chip, power whatever can cool
• Embedded processors promising
  - 1/10 cost, 1/100 power, 1/2 integer performance?  30

Processor Limit: DRAM Gap
[log-scale performance plot, 1980-2000: CPU ("Moore's Law") improves 60%/yr, DRAM 7%/yr; the processor-memory performance gap grows 50%/year]
• Alpha 21264 full cache miss, in instructions executed: 180 ns / 1.7 ns = 108 clks; x 4-way issue = 432 instructions
• Caches in Pentium Pro: 64% of area, 88% of transistors  31

The Goal: Illusion of large, fast, cheap memory
• Fact: large memories are slow, fast memories are small
• How do we create a memory that is large, cheap, and fast (most of the time)?
• Hierarchy of levels
  - Similar to Principle of Abstraction: hide details of multiple levels  32

Library
• Working on a paper in the library at a desk
• Option 1: every time you need a book:
  - Leave desk to go to shelves (or stacks)
  - Find the book
  - Bring one book back to desk
  - Read the section you are interested in
  - When done with the section, leave desk and go to shelves carrying book
  - Put the book back on shelf
  - Return to desk to work
  - Next time you need a book, go to first step  33

Memory Hierarchy Analogy: Library
• Option 2: every time you need a book:
  - Leave some books on desk after fetching them
  - Only go to shelves when you need a new book
  - When you go to shelves, bring back related books in case you need them; sometimes you'll need to return books not used recently to make space for new books on the desk
  - Return to desk to work
  - When done, replace books on shelves, carrying as many as you can per trip
• Illusion: whole library on your desktop
• Buzzword "cache" from French for hidden treasure  34

Why Hierarchy Works: Natural Locality
• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time
[figure: probability of reference vs. address, 0 to 2^n - 1]
• What programming constructs lead to the Principle of Locality?  35

Memory Hierarchy: How Does it Work?
• Temporal Locality (Locality in Time): keep most recently accessed data items closer to the processor
  - Library analogy: recently read books are kept on the desk
  - Block is the unit of transfer (like a book)
• Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels
  - Library analogy: bring back nearby books on the shelves when fetching a book; hope that you might need them later for your paper  36
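A small C illustration of the two kinds of locality (my example, not from the deck):

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int v[N];
    long sum = 0;

    /* Spatial locality: v[i] and v[i+1] are contiguous in memory, so a
       cache block fetched for one element also brings in its neighbors. */
    for (int i = 0; i < N; i++)
        v[i] = i;

    /* Temporal locality: sum is touched on every iteration, so it stays
       in a register or the closest cache level throughout the loop. */
    for (int i = 0; i < N; i++)
        sum += v[i];

    printf("%ld\n", sum);  /* 523776 */
    return 0;
}
```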

Memory Hierarchy Pyramid
[pyramid figure: Central Processor Unit (CPU) at the apex; Level 1, Level 2, Level 3, ..., Level n below; "Upper" levels near the CPU, "Lower" levels farther away]
• Increasing distance from the CPU means decreasing cost / MB
• Size of memory grows at each level down (data cannot be in level i unless it is also in level i+1)  37

Hierarchy
• Temporal locality: keep recently accessed data items closer to the processor
• Spatial locality: move contiguous words in memory to upper levels of the hierarchy
• Use smaller and faster memory technologies close to the processor
  - Fast hit time in highest level of hierarchy
  - Cheap, slow memory furthest from processor
• If the hit rate is high enough, the hierarchy has access time close to the highest (and fastest) level and size equal to the lowest (and largest) level  38
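The "hit rate is high enough" condition is usually quantified with the standard average memory access time formula; it is not on the slide, but it makes the claim concrete (the example numbers are mine):

```latex
\mathrm{AMAT} = \mathrm{HitTime} + \mathrm{MissRate} \times \mathrm{MissPenalty},
\quad \text{e.g. } 1\,\mathrm{ns} + 0.02 \times 100\,\mathrm{ns} = 3\,\mathrm{ns}
```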

Recall: 5 components of any Computer
Focus on I/O
• Processor (active): Control ("brain"), Datapath ("brawn")
• Memory (passive): where programs and data live when running
• Devices:
  - Input: Keyboard, Mouse
  - Output: Display, Printer
  - Both: Disk, Network  39

Disk Description / History
[disk figure: platter, track, sector, cylinder, arm, head; embedded processor (ECC, SCSI); track buffer]
• 1973: 1.7 Mbit/sq. in., 140 MBytes
• 1979: 7.7 Mbit/sq. in., 2,300 MBytes
source: New York Times, 2/23/98, page C3, "Makers of disk drives crowd even more data into even smaller spaces"  40

Disk History
• 1989: 63 Mbit/sq. in., 60,000 MBytes
• 1997: 1,450 Mbit/sq. in., 2,300 MBytes (2.5" diameter)
• 1997: 3,090 Mbit/sq. in., 8,100 MBytes (3.5" diameter)
• 2000: 10,100 Mbit/sq. in., 25,000 MBytes
• 2000: 11,000 Mbit/sq. in., 73,400 MBytes
source: N.Y. Times, 2/23/98, page C3  41

State of the Art: Ultrastar 72ZX
• 73.4 GB, 3.5 inch disk; 2 cents/MB
• 16 MB track buffer; 11 platters, 22 surfaces; 15,110 cylinders
• 7 Gbit/sq. in. areal density
• 17 watts (idle)
• 0.1 ms controller time; 5.3 ms avg. seek (seek 1 track => 0.6 ms); 3 ms = 1/2 rotation
• 37 to 22 MB/s to media
Latency = Queuing Time + Controller time + Seek Time + Rotation Time + Size / Bandwidth
  (per access: queuing + controller + seek + rotation; per byte: size / bandwidth; see the sketch below)
source: www.ibm.com; www.pricewatch.com; 2/14/00  42
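Plugging the drive's figures into the slide's latency equation, as a C sketch (I assume an empty queue, a 1 MB transfer, and the outer-zone 37 MB/s media rate):

```c
#include <stdio.h>

int main(void) {
    /* Latency = Queuing + Controller + Seek + Rotation + Size/Bandwidth */
    double queuing_ms    = 0.0;    /* assume no queue                    */
    double controller_ms = 0.1;    /* slide: controller time             */
    double seek_ms       = 5.3;    /* slide: average seek                */
    double rotation_ms   = 3.0;    /* slide: half rotation               */
    double size_mb       = 1.0;    /* assumed transfer size              */
    double bandwidth_mbs = 37.0;   /* slide: outer-zone media rate       */

    double transfer_ms = size_mb / bandwidth_mbs * 1000.0;
    double latency_ms  = queuing_ms + controller_ms + seek_ms
                       + rotation_ms + transfer_ms;
    printf("latency: %.1f ms (%.1f ms of it transfer)\n",
           latency_ms, transfer_ms);   /* ~35.4 ms, 27.0 ms transfer */
    return 0;
}
```

Note how the per-access terms (seek + rotation, ~8 ms) dominate small transfers, which is why the next slide worries about seek and rotation improving so slowly.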

Disk Limit
• Continued advance in capacity (60%/yr) and bandwidth (40%/yr); slow improvement in seek, rotation (8%/yr)
• Time to read whole disk:
  Year   Sequentially   Randomly
  1990   4 minutes      6 hours
  2000   12 minutes     1 week
• Dynamically change data layout to reduce seek, rotation delay? Leverage space vs. spindles?  43

A glimpse into the future?
• IBM microdrive for digital cameras: 340 MBytes
• Disk target in 5-7 years?
  - building block: 2006 MicroDrive
    » 9 GB disk, 50 MB/sec from disk
  - 10,000 nodes fit into one rack!  44

Disk Summary
• Continued advance in capacity, cost/bit, BW; slow improvement in seek, rotation
• External I/O bus bottleneck to transfer rate, cost? => move to fast serial lines (FC-AL)?
• What to do with the increasing speed of the embedded processor inside the disk?  45

Connecting to Networks (and Other I/O)
• Bus: shared medium of communication that can connect to many devices
• Hierarchy of buses in a PC  46

Buses in a PC
[figure: CPU and Memory on the memory bus; PCI internal (backplane) I/O bus with Ethernet interface and SCSI interface; SCSI external I/O bus (1 to 15 disks); Ethernet local area network]
• Data rates
  - Memory: 100 MHz, 8 bytes => 800 MB/s (peak)
  - PCI: 33 MHz, 4 bytes wide => 132 MB/s (peak)
  - SCSI: "Ultra2" (40 MHz), "Wide" (2 bytes) => 80 MB/s (peak)  47

Why Networks?
• Originally: sharing I/O devices between computers (e.g., printers)
• Then: communicating between computers (e.g., file transfer protocol)
• Then: communicating between people (e.g., email)
• Then: communicating between networks of computers => Internet, WWW  48

Types of Networks
• Local Area Network (Ethernet)
  - Inside a building: up to 1 km
  - (peak) Data Rate: 10 Mbits/sec, 1000 Mbits/sec
  - Run, installed by network administrators
• Wide Area Network
  - Across a continent (10 km to 10,000 km)
  - (peak) Data Rate: 1.5 Mbits/sec to 2,500 Mbits/sec
  - Run, installed by telephone companies  49

ABCs of Networks: 2 Computers
• Starting point: send bits between 2 computers
• Queue (First In First Out) on each end
• Can send both ways ("Full Duplex")
• Information sent is called a "message"
  - Note: messages are also called packets  50

A Simple Example: 2 Computers
• What is the message format? (Similar in idea to instruction format)
  - Fixed size? Number of bits?
  Request/Response (1 bit): 0 = please send data from Address in your memory; 1 = packet contains data corresponding to request
  Address/Data (32 bits)
• Header (trailer): information to deliver a message
• Payload: data in message (1 word above)  51

Questions About Simple Example
• What if more than 2 computers want to communicate?
  - Need a computer "address field" in the packet to know which computer should receive it (destination), and which computer it came from for the reply (source)
  Format: Req./Resp. (1 bit) | Dest. Net ID (5 bits) | Source Net ID (5 bits) | Address/Data (32 bits)
  Header = first three fields; Payload = Address/Data  52

Questions About Simple Example
• What if the message is garbled in transit?
  - Add redundant information that is checked when the message arrives to be sure it is OK
  - 8-bit sum of other bytes: called a "checksum"; upon arrival, compare the checksum to the sum of the rest of the information in the message (see the sketch below)
  Format: Req./Resp. (1 bit) | Dest. (5 bits) | Source (5 bits) | Address/Data (32 bits) | Checksum (8 bits)
  Header | Payload | Trailer  53
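A minimal sketch of that checksum in C (my code; the slide specifies only "8-bit sum of other bytes"):

```c
#include <stdint.h>
#include <stddef.h>

/* 8-bit checksum: sum every byte of the message, truncated to 8 bits.
   The receiver recomputes this over the arriving bytes and compares it
   against the trailer; a mismatch means the message was garbled. */
uint8_t checksum8(const uint8_t *buf, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];          /* wraps modulo 256, keeping 8 bits */
    return sum;
}
```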

Questions About Simple Example
• What if the message never arrives?
  - If the sender is told it arrived (and the receiver is told the reply arrived), it can resend upon failure
  - Don't discard the message until an "ACK" (acknowledgment) arrives; also, if the checksum fails, don't send an ACK
  Format: Req./Resp. (2 bits) | Dest. (5 bits) | Source (5 bits) | Address/Data (32 bits) | Check (8 bits)
  00: Request: please send data from Address
  01: Reply: message contains data corresponding to request
  10: Acknowledge (ACK) request
  11: Acknowledge (ACK) reply  54
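The final message format from these slides, written out as a C struct with bit fields (a sketch; the field names, and the 5-bit source/destination widths carried over from the earlier slide, are my reading of the deck):

```c
#include <stdint.h>

/* Message types from the slide's 2-bit Req./Resp. field */
enum msg_type {
    MSG_REQUEST   = 0,  /* 00: please send data from Address      */
    MSG_REPLY     = 1,  /* 01: packet contains the requested data */
    MSG_ACK_REQ   = 2,  /* 10: acknowledge a request              */
    MSG_ACK_REPLY = 3   /* 11: acknowledge a reply                */
};

struct packet {
    uint16_t type : 2;   /* header: request/reply/ACK               */
    uint16_t dest : 5;   /* header: destination computer ID         */
    uint16_t src  : 5;   /* header: source computer ID, for replies */
    uint32_t payload;    /* payload: 32-bit address or data word    */
    uint8_t  checksum;   /* trailer: 8-bit sum of the other bytes   */
};
```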

Observations About Simple Example
• Simple questions such as those above lead to more complex procedures to send/receive messages, and more complex message formats
• Protocol: algorithm for properly sending and receiving messages (packets)  55

Ethernet (popular LAN) Packet Format
Preamble (8 bytes) | Dest Addr (6 bytes) | Src Addr (6 bytes) | Length of Data (2 bytes) | Data (0-1500 B) | Pad (0-46 B) | Check (4 B)
• Preamble to recognize beginning of packet
• Unique address per Ethernet network interface card, so can just plug in & use (privacy issue?)
• Pad ensures minimum packet is 64 bytes
  - Easier to find packet on the wire
• Header + trailer: 24 B + pad  56

Software Protocol to Send and Receive
• SW send steps
  1: Application copies data to OS buffer
  2: OS calculates checksum, starts timer
  3: OS sends data to network interface HW and says start
• SW receive steps
  3: OS copies data from network interface HW to OS buffer
  2: OS calculates checksum; if OK, send ACK; if not, delete message (sender resends when timer expires)
  1: If OK, OS copies data to user address space and signals application to continue  57
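A compact C sketch of the sender's half of this scheme; send_to_hw, wait_for_ack, and the constants are assumed helpers for illustration, not a real NIC API (checksum8 is the routine sketched earlier):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_MSG    1536   /* assumed maximum message size     */
#define TIMEOUT_MS 100    /* assumed retransmission timeout   */

/* Assumed helpers, declarations only: */
uint8_t checksum8(const uint8_t *buf, size_t len);
void    send_to_hw(const uint8_t *buf, size_t len, uint8_t sum);
int     wait_for_ack(int timeout_ms);   /* 1 if ACK before timeout */

void protocol_send(const uint8_t *data, size_t len) {
    uint8_t os_buf[MAX_MSG];
    memcpy(os_buf, data, len);              /* 1: copy into OS buffer */
    uint8_t sum = checksum8(os_buf, len);   /* 2: compute checksum    */
    for (;;) {
        send_to_hw(os_buf, len, sum);       /* 3: hand off to NIC     */
        if (wait_for_ack(TIMEOUT_MS))       /* sender's timer         */
            return;                         /* ACK arrived: done      */
        /* timeout: message lost, ACK lost, or checksum failed; resend */
    }
}
```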

Protocol for Networks of Networks (WAN)?
• Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently
  - Enabling technologies: SW standards that allow reliable communications without reliable networks
  - Hierarchy of SW layers, giving each layer responsibility for a portion of the overall communications task; called protocol families or protocol suites
• Abstraction to cope with complexity of communication, vs. abstraction for complexity of computation  58

Protocol for Network of Networks
• Transmission Control Protocol/Internet Protocol (TCP/IP)
  - This protocol family is the basis of the Internet, a WAN protocol
  - IP makes best effort to deliver
  - TCP guarantees delivery
  - TCP/IP so popular it is used even when communicating locally: even across a homogeneous LAN  59

FTP From Stanford to Berkeley
[figure: Hennessy's machine => Ethernet => FDDI => T3 => FDDI => Ethernet => Patterson's machine]
• BARRNet is the WAN for the Bay Area
• T3 is a 45 Mbit/s leased line (WAN); FDDI is a 100 Mbit/s LAN
• IP sets up the connection, TCP sends the file  60

Protocol Family Concept
[figure: a Message travels logically straight across between peers at each level, but actually travels down the stack; each lower level wraps the message in its own header (H) and trailer (T)]  61

Protocol Family Concept
• Key to protocol families: communication occurs logically at the same level of the protocol, called peer-to-peer, but is implemented via services at the lower level
• Danger: each level lowers performance if the family is implemented as a hierarchy (e.g., multiple checksums)  62

TCP/IP packet, Ethernet packet, protocols
• Application sends message
• TCP breaks message into 64 KB segments, adds 20 B header
• IP adds 20 B header, sends to network
• If Ethernet, broken into 1500 B packets with headers, trailers (24 B)
• All headers, trailers have length field, destination, ...
[figure: Ethernet Hdr | IP Header | TCP Header | TCP data (message) | Ethernet trailer]  63
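A rough per-frame efficiency check from the slide's header sizes (my arithmetic; preamble, pad, and inter-frame gap are ignored):

```c
#include <stdio.h>

int main(void) {
    /* Overhead for a full Ethernet packet carrying TCP/IP */
    int data_field = 1500;   /* max Ethernet data field (slide)   */
    int eth_hdr_tr = 24;     /* Ethernet header + trailer (slide) */
    int ip_hdr     = 20;     /* IP header (slide)                 */
    int tcp_hdr    = 20;     /* TCP header (slide)                */

    int user_bytes = data_field - ip_hdr - tcp_hdr;   /* 1460 */
    int wire_bytes = data_field + eth_hdr_tr;         /* 1524 */
    printf("efficiency: %.1f%%\n",
           100.0 * user_bytes / wire_bytes);          /* ~95.8% */
    return 0;
}
```

So the headers themselves cost only a few percent; the slides that follow argue the real cost is the software that processes them.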

Networks
• Shared media vs. switched: in a switch, pairs communicate at the same time over "point-to-point" connections
• Aggregate BW in a switched network is many times that of shared
  - point-to-point faster since no arbitration, simpler interface
[figure: nodes on a shared bus vs. nodes on a crossbar switch]  64

Heart of Today's Data Switch
[figure: incoming line converts the serial bit stream into, say, 128-bit words; memory in the middle; outgoing line converts 128-bit words back into a serial bit stream]
• Unpack the header to find the destination and place the message into the memory of the proper outgoing port; OK as long as memory is much faster than the switch rate  65

Network Media (if time)
• Twisted pair: copper, 1 mm thick, twisted to avoid antenna effect (telephone)
• Coaxial cable: plastic covering, braided outer conductor, insulator, copper core; used by cable companies: high BW, good noise immunity
• Fiber optics: light has 3 parts: cable (silica), light source (LED or laser diode transmitter), light detector (photodiode receiver); light propagates by total internal reflection  66

Rates
• Pitfall: using the peak transfer rate of a portion of the I/O system to make performance projections or comparisons
• Peak bandwidth measurements are often based on unrealistic assumptions about the system, or unattainable because of other system limitations
  - In the example, peak bandwidth of FDDI vs. 10 Mbit Ethernet = 10:1, but the delivered BW ratio (due to software overhead) is 1.01:1
  - Peak PCI BW is 132 MByte/sec, but combined with memory, often <80 MB/s  67

Network Description/Innovations
• Shared media vs. switched: pairs communicate at the same time
• Aggregate BW in a switched network is many times shared
  - point-to-point faster; only single destination, simpler interface
• Serial line: 1-5 Gbit/sec
• Moore's Law for switches, too
  - 1 chip: 32 x 32 switch, 1.5 Gbit/sec links, $396 => 48 Gbit/sec aggregate bandwidth (AMCC S2025)  68

Network History/Limits
• TCP/UDP/IP protocols for WAN/LAN in 1980s
• Lightweight protocols for LAN in 1990s
• Limit is standards and efficient SW protocols
  - FDDI; ATM Forum for scalable LAN (still meeting)
• 10 Mbit Ethernet in 1978 (shared); 100 Mbit Ethernet in 1995 (shared, switched); 1000 Mbit Ethernet in 1998 (switched)
• Internal I/O bus limits delivered BW
  - 32-bit, 33 MHz PCI bus = 1 Gbit/sec
  - future: 64-bit, 66 MHz PCI bus = 4 Gbit/sec  69

Network Summary
• Fast serial lines, switches offer high bandwidth, low latency over reasonable distances
• Protocol software development and standards-committee bandwidth limit the innovation rate
  - Ethernet forever?
• Internal I/O bus interface to the network is the bottleneck to delivered bandwidth, latency  70

Network Summary
• Protocol suites allow heterogeneous networking
  - Another use of the principle of abstraction
  - Protocols operate in the presence of failures
  - Standardization key for LAN, WAN
• Integrated circuits revolutionizing network switches as well as processors
  - A switch is just a specialized computer
• High bandwidth networks with slow SW overheads don't deliver their promise  71

Systems: History, Trends, Innovations
• Cost/Performance leaders from PC industry
• Transaction processing, file service based on Symmetric Multiprocessor (SMP) servers
  - 4-64 processors
  - Shared memory addressing
• Decision support based on SMP and Cluster (Shared Nothing)
• Clusters of low cost, small SMPs getting popular  72

1997 State of the Art System: PC
• $1140 OEM
• 1 266 MHz Pentium II
• 64 MB DRAM
• 2 UltraDMA EIDE disks, 3.1 GB each
• 100 Mbit Ethernet interface
• (PennySort winner)
source: www.research.microsoft.com/research/barc/SortBenchmark/PennySort.ps  73

1997 State of the Art SMP: Sun E10000
[figure: 4 address buses; data crossbar switch; processors and memory on the Xbar; SCSI bridges to buses 1-16 with disk strings 1-23]
• TPC-D, Oracle 8, 3/98
• SMP: 64 336-MHz CPUs, 64 GB DRAM, 668 disks (5.5 TB)
• Disks, shelf: $2,128k; Boards, encl.: $1,187k; CPUs: $912k; DRAM: $768k; Power: $96k; Cables, I/O: $69k
• HW total: $5,160k
source: www.tpc.org  74

State of the Art Cluster: Tandem/Compaq SMP
• ServerNet switched network
• Rack mounted equipment
• SMP: 4 PPro, 3 GB DRAM, 3 disks (6/rack)
  - 10 disk shelves/rack @ 7 disks/shelf
• Total: 6 SMPs (24 CPUs, 18 GB DRAM), 402 disks (2.7 TB)
• TPC-C, Oracle 8, 4/98
  - CPUs: $191k; DRAM: $122k; Disks + cntlr: $425k; Disk shelves: $94k; Networking: $76k; Racks: $15k
  - HW total: $926k  75

1997 Berkeley Cluster: Zoom Project
• 3 TB storage system
  - 370 8 GB disks, 20 200 MHz PPro PCs, 100 Mbit Switched Ethernet
  - System cost a small delta (~30%) over raw disk cost
• Application: San Francisco Fine Arts Museum Server
  - 70,000 art images online
  - Zoom in 32X; try it yourself! www.Thinker.org  76

User Decision Support Demand vs. Processor Speed
• Database demand: 2X / 9-12 months ("Greg's Law")
• CPU speed: 2X / 18 months ("Moore's Law")
• The database-processor performance gap widens  77

Berkeley Perspective on Post-PC Era
The PostPC Era will be driven by 2 technologies:
1) "Gadgets": tiny embedded or mobile devices
  - ubiquitous: in everything
  - e.g., successor to PDA, cell phone, wearable computers
2) Infrastructure to support such devices
  - e.g., successor to Big Fat Web Servers, Database Servers  78

Intelligent RAM (IRAM)
Microprocessor & DRAM on a single chip:
  - 10X capacity vs. SRAM on-chip memory
  - on-chip memory latency 5-10X better, bandwidth 50-100X better
  - improves energy efficiency 2X-4X (no off-chip bus)
  - serial I/O 5-10X vs. buses
  - smaller board area/volume
• IRAM advantages extend to:
  - a single-chip system
  - a building block for larger systems
[figure: conventional system (processor with L1/L2 caches, bus, separate DRAM and I/O; logic fab vs. DRAM fab) next to IRAM (processor, caches, DRAM, and I/O on one chip)]  79

Other examples: IBM "Blue Gene"
• 1 PetaFLOPS in 2005 for $100M?
• Application: protein folding
• Blue Gene chip
  - 32 multithreaded RISC processors + ?? MB embedded DRAM + high speed network interface on a single 20 x 20 mm chip
  - 1 GFLOPS / processor
• 2' x 2' board = 64 chips (2K CPUs)
• Rack = 8 boards (512 chips, 16K CPUs)
• System = 64 racks (512 boards, 32K chips, 1M CPUs)
• Total 1 million processors in just 2000 sq. ft.  80

Other examples: Sony Playstation 2
• Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5)
  - Superscalar MIPS core + vector coprocessor + graphics/DRAM
  - Claim: "Toy Story" realism brought to games  81

The problem space: big data
• Big demand for enormous amounts of data
  - today: high-end enterprise and Internet applications
    » enterprise decision-support, data mining databases
    » online applications: e-commerce, mail, web, archives
  - future: infrastructure services, richer data
    » computational & storage back-ends for mobile devices
    » more multimedia content
    » more use of historical data to provide better services
• Today's SMP server designs can't easily scale
• Bigger scaling problems than performance!  82

The real scalability problems: AME
• Availability
  - systems should continue to meet quality of service goals despite hardware and software failures
• Maintainability
  - systems should require only minimal ongoing human administration, regardless of scale or complexity
• Evolutionary Growth
  - systems should evolve gracefully in terms of performance, maintainability, and availability as they are grown/upgraded/expanded
• These are problems at today's scales, and will only get worse  83

ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4 TB storage
  - cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
    » a single field-replaceable unit to simplify maintenance
  - each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
  - more CPU than NAS; fewer disks/node than cluster
• ISTORE chassis: 80 nodes, 8 per tray; 2 levels of switches (20 100 Mbit/s, 2 1 Gbit/s); environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
• Intelligent Disk "Brick": portable PC CPU (Pentium II/266 + DRAM), redundant NICs (4 100 Mb/s links), diagnostic processor; disk; half-height canister  84

Conclusion
• IRAM attractive for two Post-PC applications because of low power, small size, high memory bandwidth
  - Gadgets: embedded/mobile devices
  - Infrastructure: intelligent storage and networks
• PostPC infrastructure requires
  - New goals: availability, maintainability, evolution
  - New principles: introspection, performance robustness
  - New techniques: isolation/fault insertion, software scrubbing
  - New benchmarks: measure, compare AME metrics  85

Questions?
Contact us if you're interested:
email: patterson@cs.berkeley.edu
http://iram.cs.berkeley.edu/  86