
- Number of slides: 64
Scaleable Computing • Jim Gray, Microsoft Research, Gray@Microsoft.com, http://research.Microsoft.com/~Gray/talks/ • Outline – The bandwidth revolution – ScaleUp, ScaleOut – TerraServer (Barclay, Slutz, Gray) • Gray @ Nortel, 20 April 1999
Gilder’s Law: 3x bandwidth/year for 25 more years • Today: – 10 Gbps per channel – 4 channels per fiber: 40 Gbps – 32 fibers/bundle = 1.2 Tbps/bundle • In lab: 3 Tbps/fiber (400x WDM) • In theory: 25 Tbps per fiber • 1 Tbps = USA 1996 WAN bisection bandwidth [Figure label: 1 fiber = 25 Tbps]
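The arithmetic on this slide can be checked with a short sketch. The function names are illustrative, not from the talk, and the projection simply compounds the slide's 3x/year factor. Note that 10 Gbps x 4 x 32 is 1.28 Tbps, which the slide rounds to 1.2 Tbps.

```python
def bundle_capacity_gbps(gbps_per_channel, channels_per_fiber, fibers_per_bundle):
    """Aggregate bandwidth of one fiber bundle, in Gbps."""
    return gbps_per_channel * channels_per_fiber * fibers_per_bundle


def gilder_projection(start, years, factor=3):
    """Compound a bandwidth figure by `factor` per year (Gilder's Law as stated on the slide)."""
    return start * factor ** years


if __name__ == "__main__":
    today = bundle_capacity_gbps(10, 4, 32)
    print(f"bundle today: {today} Gbps = {today / 1000:.2f} Tbps")
    for years in (1, 5, 10):
        tbps = gilder_projection(today, years) / 1000
        print(f"after {years:2d} years at 3x/year: {tbps:,.1f} Tbps")
```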
Networking: BIG!! Changes coming! • Technology – 1 GBps bus “now” – 1 Gbps links “now” – 1 Tbps links in 10 years – Fast & cheap switches • Standard wires for interconnect – processor-processor – processor-device (=processor) • Deregulation WILL work someday • Software improving – User-level Net-IO • Software Challenge – reduce software tax on messages – Today: 30 K ins + 10 ins/byte – Goal: 1 K ins + .01 ins/byte
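The "software tax" on this slide is a linear cost model: a fixed instruction count per message plus a per-byte cost. A minimal sketch follows; the function name and the message sizes are assumptions chosen for illustration, and only the four coefficients come from the slide.

```python
# (fixed instructions per message, instructions per byte), both from the slide
TODAY = (30_000, 10.0)
GOAL = (1_000, 0.01)


def message_instructions(n_bytes, fixed_ins, ins_per_byte):
    """CPU instructions spent sending one message of n_bytes."""
    return fixed_ins + ins_per_byte * n_bytes


if __name__ == "__main__":
    for size in (100, 1_000, 8_192, 65_536):  # illustrative message sizes
        now = message_instructions(size, *TODAY)
        goal = message_instructions(size, *GOAL)
        print(f"{size:6d} bytes: today {now:>10,.0f} ins, goal {goal:>8,.0f} ins, "
              f"ratio {now / goal:,.0f}x")
```

For small messages the fixed cost dominates (about 30x), while for large messages the per-byte cost dominates and the gap widens.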
Technology (hardware) • NOW – CPU: nearing 1 BIPS, but CPI rising fast (2-10) so less than 100 mips; 1$/mips to 10$/mips – DRAM: 3 $/MB – DISK: 20 $/GB – TAPE: 20 GB/tape, 6 MBps; lags disk; 2$/GB offline, 15$/GB nearline – BUS/SAN: 10/1 GBps – WAN: 0.1 Mbps • 2003 Forecast (10x better) – CPU: 1 bips real (smp); 0.1$ - 1$/mips – DRAM: 1 Gb chip; 0.1 $/MB – Disk: 10 GB smart cards; 500 GB RAID5 packs (NTinside); 3$/GB – BUS/SAN: 100/10 GBps – WAN: 1 Gbps
Microsoft SAN Infrastructure: WinSock Direct Path • 110 MBps Winsock (that’s B not b) • 10% cpu (not 200%) • Network faster than most IO attachments [Diagram labels, user (U) and kernel (K) layers: App, Winsock, MsAfd, AFD, TCP, IP, NDIS, MiniPort, HW; direct path: Winsock, Switch, MsAfd, HwSPI, VIA, HW]
SAN: Standard Interconnect • Gbps SAN: 110 MBps; PCI: 70 MBps; UW SCSI: 40 MBps; FW SCSI: 20 MBps; SCSI: 5 MBps • LAN faster than memory bus? • 1 GBps links in lab • 100$ port cost soon • Port is computer • Winsock: 110 MBps (10% cpu utilization at each end) • RIP FDDI, RIP ATM, RIP SCI, RIP SCSI, RIP FC, RIP ?
Outline – The bandwidth revolution – ScaleUp, ScaleOut – TerraServer (Barclay, Slutz, Gray)
Latency: How Far Away is the Data? (relative latency: storage level: analogy, travel time) • 10^9: Tape/Optical Robot: Andromeda, 2,000 years • 10^6: Disk: Pluto, 2 years • 100: Memory: Sacramento, 1.5 hr • 10: On Board Cache: This Campus, 10 min • 2: On Chip Cache: This Room • 1: Registers: My Head, 1 min
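The analogy works by treating one unit of relative latency as one minute of travel. The sketch below assumes that reading (the slide pairs 1 with "1 min") and converts each level to human time. The helper name and the 365.25-day year are illustrative choices, and the results are close to, not identical with, the slide's rounded figures.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_YEAR = 60 * 24 * 365.25

# (storage level, relative latency) as listed on the slide
LEVELS = [
    ("Registers", 1),
    ("On-chip cache", 2),
    ("On-board cache", 10),
    ("Memory", 100),
    ("Disk", 10**6),
    ("Tape/optical robot", 10**9),
]


def human_scale(ticks):
    """Render a relative latency as travel time, assuming 1 unit = 1 minute."""
    minutes = ticks
    if minutes < MINUTES_PER_HOUR:
        return f"{minutes:g} min"
    if minutes < MINUTES_PER_YEAR:
        return f"{minutes / MINUTES_PER_HOUR:.1f} hr"
    return f"{minutes / MINUTES_PER_YEAR:,.1f} years"


if __name__ == "__main__":
    for name, ticks in LEVELS:
        print(f"{name:20s} {ticks:>13,d}  ->  {human_scale(ticks)}")
```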
System On A Chip • Integrate Processing with memory on one chip – chip is 75% memory now – 1 MB cache >> 1960 supercomputers – 256 Mb memory chip is 32 MB! – IRAM, CRAM, PIM, … projects abound • Integrate Networking with processing on one chip – system bus is a kind of network – ATM, FiberChannel, Ethernet, ... logic on chip – Direct IO (no intermediate bus) • Functionally specialized cards shrink to a chip.
Scaleability: Scale Up and Scale Out • Grow Up with SMP: 4xP6 is now standard • Grow Out with Cluster: cluster has inexpensive parts [Diagram labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs]
There'll be Billions (Trillions) Of Clients • Every device will be “intelligent” • Doors, rooms, cars… • Computing will be ubiquitous
Billions (Trillions) Of Clients Need Millions Of Servers • All clients networked to servers – May be nomadic or on-demand • Fast clients want faster servers • Servers provide – Shared Data – Control – Coordination – Communication [Diagram labels: Clients (mobile clients, fixed clients); Servers (server, super server)]
Thin Client Support (FAT SERVERS) • TSO comes to NT • Lower per-client costs [Diagram labels: Windows NT Server Terminal Server; Net PC; existing desktop PC; MS-DOS, UNIX, Mac clients; dedicated Windows terminal]
Windows 2000 IntelliMirror™ • Extends CMU Coda File System ideas • Files and settings mirrored on client and server • Great for disconnected users • Facilitates roaming • Easy to replace PCs • Optimizes network performance • FAT STORAGE SERVERS
SMP -> nUMA: BIG FAT SERVERS • Directory based caching lets you build large SMPs • Every vendor building a HUGE SMP – 256 way – 3x slower remote memory – 8-level memory hierarchy: L1, L2 cache; DRAM; remote DRAM (3, 6, 9, …); Disk cache; Disk; Tape cache; Tape • Needs – 64 bit addressing – nUMA sensitive OS (not clear who will do it) • Or Hypervisor – like IBM LSF – Stanford Disco www-flash.stanford.edu/Hive/papers.html • Not certain what happens next
Thesis: Many little beat few big • How to connect the many little parts? • How to program the many little parts? • Fault tolerance & Management? [Figure labels: $1 million Mainframe, $100 K Mini, $10 K Micro, Nano, Pico Processor (1 mm^3); storage: 1 MB 10 pico-second ram, 100 MB 10 nano-second ram, 10 GB 10 microsecond ram, 1 TB 10 millisecond disc, 100 TB 10 second tape archive; disk sizes 14", 9", 5.25", 3.5", 2.5", 1.8"; “Smoking, hairy golf ball”: 1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk ram, event-horizon on chip, VM reincarnated, multi-program cache, on-chip SMP]
4B PC’s (1 Bips, .1 GB dram, 10 GB disk, 1 Gbps Net, B=G): The Bricks of Cyberspace • Cost 1,000 $ • Come with – NT – DBMS – High speed Net – System management – GUI / OOUI – Tools • Compatible with everyone else • CyberBricks
Super Server: 4T Machine • Array of 1,000 4B machines – 1 Bips processors – 1 BB DRAM – 10 BB disks – 1 Bbps comm lines – 1 TB tape robot • A few megabucks • Challenge: – Manageability – Programmability – Security – Availability – Scaleability – Affordability • As easy as a single system • Future servers are CLUSTERS of processors, discs • Distributed database techniques make clusters work [Diagram: Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB Disc]
Scale OUT: Clusters Have Advantages • Fault tolerance: – Spare modules mask failures • Modular growth without limits – Grow by adding small modules • Parallel data search – Use multiple processors and disks • Clients and servers made from the same stuff – Inexpensive: built with commodity CyberBricks
1988: IBM DB2 + CICS Mainframe: 65 tps • IBM 4391 • Simulated network of 800 clients • 2 M$ computer • Staff of 6 to do benchmark [Diagram labels: 2 x 3725 network controllers; refrigerator-sized CPU; 16 GB disk farm (4 x 8 x .5 GB)]
1987: Tandem Mini @ 256 tps • 14 M$ computer (Tandem) • A dozen people (1.8 M$/y) • False floor, 2 rooms of machines [Diagram labels: 32 node processor array; 40 GB disk array (80 drives); simulate 25,600 clients; staff: Admin expert, Performance expert, Hardware experts, Network expert, DB expert, OS expert, Auditor, Manager]
1997: 9 years later, 1 Person and 1 box = 1250 tps • 1 Breadbox ~ 5x 1987 machine room • 23 GB is hand-held • One person does all the work • Cost/tps is 100,000x less: 5 micro dollars per transaction [Diagram labels: one person acting as Hardware expert, OS expert, Net expert, DB expert, App expert; 4 x 200 Mhz cpu, 1/2 GB DRAM, 12 x 4 GB disk; 3 x 7 x 4 GB disk arrays]
What Happened? Where did the 100,000x come from? • Moore’s law: 100X (at most) • Software improvements: 10X (at most) • Commodity Pricing: 100X (at least) • Total: 100,000X • 100x from commodity – DBMS was 100 K$ to start: now 1 K$ to start – IBM 390 MIPS is 7.5 K$ today – Intel MIPS is 10$ today – Commodity disk is 50$/GB vs 1,500$/GB – ... [Chart: price vs. time for mainframe, mini, micro]
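The 100,000x figure is the product of the three factors on this slide, and the commodity examples can be expressed as simple price ratios. A small sketch follows, with illustrative names; all numbers are the slide's.

```python
from math import prod

# Improvement factors as listed on the slide
FACTORS = {
    "Moore's law": 100,
    "Software improvements": 10,
    "Commodity pricing": 100,
}


def total_improvement(factors):
    """Multiply independent improvement factors together."""
    return prod(factors.values())


def price_ratio(old_price, new_price):
    """How many times cheaper the new price is."""
    return old_price / new_price


if __name__ == "__main__":
    print(f"total: {total_improvement(FACTORS):,}x")
    print(f"DBMS entry price: {price_ratio(100_000, 1_000):.0f}x cheaper")
    print(f"$/MIPS, IBM 390 vs Intel: {price_ratio(7_500, 10):.0f}x")
    print(f"$/GB disk: {price_ratio(1_500, 50):.0f}x")
```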
Computers shrink to a point • Disks 100x in 10 years: 2 TB 3.5” drive • Shrink to 1” is 200 GB • Disk is super computer! • This is already true of printers and “terminals” [Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
All Device Controllers will be Cray 1’s • TODAY – Disk controller is 10 mips risc engine with 2 MB DRAM – NIC is similar power • SOON – Will become 100 mips systems with 100 MB DRAM • They are nodes in a federation (can run Oracle on NT in disk controller) • Advantages – Uniform programming model – Great tools – Security – Economics (cyberbricks) – Move computation to data (minimize traffic) [Diagram labels: Central Processor & Memory; Tera Byte Backplane]
It’s Already True of Printers: Peripheral = CyberBrick • You buy a printer • You get – several network interfaces – a Postscript engine: cpu, memory, software, a spooler (soon) – and… a print engine.
Functionally Specialized Cards • Storage, Network, and Display cards: each is an ASIC plus a P mips processor and M MB DRAM • Today: P = 50 mips, M = 2 MB • In a few years: P = 200 mips, M = 64 MB
Implications • Conventional – Offload device handling to NIC/HBA – Higher level protocols: I2O, NASD, VIA… – SMP and Cluster parallelism is important. • Radical – Move app to NIC/device controller – Higher-higher level protocols: DCOM. – Cluster parallelism is VERY important. [Diagram label: Central Processor & Memory]
How Do They Talk to Each Other? • Each node has an OS • Each node has local resources: a federation. • Each node does not completely trust the others. • Nodes use RPC to talk to each other – DCOM? IIOP? RMI? – One or all of the above. • Huge leverage in high-level interfaces. • Same old distributed system story. [Diagram labels, on each side: Applications; ? / RPC / streams / datagrams; VIAL/VIPL; joined by Wire(s)]
Disk = Node • has magnetic storage (100 GB?) • has processor & DRAM • has SAN attachment • has execution environment [Stack diagram labels: Applications; Services; DBMS; RPC, ...; File System; SAN driver; Disk driver; OS Kernel]
Scaleability: Scale Up and Scale Out • Grow Up with SMP: 4xP6 is now standard • Grow Out with Cluster: cluster has inexpensive parts [Diagram labels: Personal System, Departmental Server, SMP Super Server, Cluster of PCs]
HotMail: ~300 Computers • FreeBSD and Solaris
Microsoft.com: ~150 nodes
Other Clusters • 16-node Cluster – 64 cpus – 2 TB of disk – Decision support • 45-node Compaq Cluster – 140 cpus – 14 GB DRAM – 4 TB RAID disk – OLTP (Debit Credit) – 1 B tpd (14 k tps)
Berkeley NOW (network of workstations) Project http://now.cs.berkeley.edu/ • 105 nodes – Sun UltraSparc 170, 128 MB, 2 x 2 GB disk – Myrinet interconnect (2 x 160 MBps per node) – SBus (30 MBps) limited • GLUNIX layer above Solaris • Inktomi (HotBot search) • NAS Parallel Benchmarks • Crypto cracker • Sort 9 GB per second
NCSA Super Cluster http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html • National Center for Supercomputing Applications, University of Illinois @ Urbana • 512 Pentium II cpus, 2,096 disks, SAN • Compaq + HP + Myricom + WindowsNT • A Super Computer for 3 M$ • Classic Fortran/MPI programming • DCOM programming model
Outline – The bandwidth revolution – ScaleUp, ScaleOut – TerraServer (Barclay, Slutz, Gray): a scaleup example
Some Tera-Byte Databases • The Web: 1 TB of HTML • TerraServer: 1 TB of images • Several other 1 TB (file) servers • Hotmail: 7 TB of email • Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked • EOS/DIS (picture of planet each week) – 15 PB by 2007 • Federal Clearing house: images of checks – 15 PB by 2006 (7 year history) • Nuclear Stockpile Stewardship Program – 10 Exabytes (???!!) [Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
Info Capture • You can record everything you see or hear or read. • What would you do with it? • How would you organize & analyze it? • Per lifetime: Video: 8 PB (10 GBph); Audio: 30 TB (10 KBps); Read or write: 8 GB (words) • See: http://www.lesk.com/mlesk/ksg97/ksg.html [Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta; landmarks in increasing size: a letter, a novel, a movie, Library of Congress (text), LoC (image), all disks, all tapes]
Michael Lesk’s Points www.lesk.com/mlesk/ksg97/ksg.html • Soon everything can be recorded and kept • Most data will never be seen by humans • Precious Resource: Human attention • Auto-Summarization and Auto-Search will be key enabling technologies.
The TerraServer http://www.terraserver.microsoft.com/
Database & application UI • Coverage: range from 70ºN to 70ºS; today: 35% U.S., 1% outside U.S. • Source Imagery: – 4 TB 1 sq meter/pixel Aerial (USGS: 60,000 46 Mb B&W, 151 Mb Color IR files) – 1 TB 1.56 meter/pixel Satellite (Spin-2: 2400 300 Mb B&W) • Concept: User navigates an ‘almost seamless’ image of earth • Display Imagery: 200 x 200 pixel images, subsample to build image pyramid • Nav Tools: – 1.5 m place names – “Click-on” Coverage map – Expedia & Virtual Globe map – Pick of the week [Image pyramid labels: 200 x 200 m tile; .4 x .4 km browse; .8 x .8 km 8 m thumbnail; 1.6 x 1.6 km “city view”]
Image Data [Diagram labels: USGS “DOQ”: 4 TB, 6 TB coming; Spin-2: 1 TB, WorldWide, new data coming; DRG: 50,000 Topo Maps, adding now]
Software Architecture [Diagram labels: Web Client (IE 3…5, Netscape 3…4; HTML, Java Viewer); The Internet; TerraServer Web Site (Internet Information Server 4.0, Active Server Pages, Active Data Object, ODBC); TerraServer DB (SQL Server 7.0, TerraServer Stored Procedures); Image Delivery Application; Image Commerce Site(s) (Microsoft Site Server EE 3.0, SPIN-2/USGS Store, Active Server Pages, SQL Server); component counts shown on the diagram: 24, 19, 39 (14 Img, 8 Place), 13]
How Images are Found • Name Search: 40% • Expedia Map: 22% • Coverage Map: 19% • Famous Places: 18% • Geo Coordinate: 1%
TerraServer: Lots of Web Hits • Summary as of Feb 28, 1999 (Total / Average / Max): – Unique Users: 17 M / 69 k / 150 k – Sessions: 24 M / 94 k / 172 k – Hits: 1.7 B / 6.8 M / 29 M – Page Views: 274 M / 1.1 M / 6.6 M – DB Queries: 1.5 B / 5.8 M / 18 M – Image Xfers: 1.3 B / 5.0 M / 15 M • Today: – 1.7 billion web hits so far – 1 TB, largest SQL DB on the Web – 100 qps average, 1,000 qps peak – 1.5 B SQL queries so far
Logical Schema • Gazetteer (“Where Am I”): Country Name, State Name, Place Name, Place Type, Feature Type – Index on: image, place, type; image, state, country, type; image, place, state, type; image, place, country, type – All lookups are fast • Image Data & Meta Data: Jump Img, Browse Img, Thumb Img, Tile Img; Spin Frame Meta, Img Meta, Tile Meta, Theme Meta Information – Lookup by UGrid or ZGrid ID plus resolution – Lookups are fast • Indices are in DRAM (auto-magically by SQL) • SQL manages all the tiles and indices • Images are brought in on demand
Image Load and Update [Diagram labels: DLT Tape “tar”; Staging Disk; Image Cutter (JPEG tiles); Merge; Dither; Image Pyramid from base; TerraLoader; Metadata Load DB; Active Server Pages Cut & Load Scheduling System; ODBC Tx links into TerraServer SQL DBMS]
TerraServer Administrator Web Site • Accessible by Microsoft, SPIN-2, and USGS • Web browser forms to: – Edit Famous Places list – Modify Image Status fields – Define new TerraServer Administrators
Load & Backup & Recovery • Backup and Recovery – Using Legato Networker integrated with SQL Backup/Restore Utility – Fast, incremental, differential, online • Restore – Fast, incremental (file oriented), not online. • SQL Server Enterprise Manager – DBA Maintenance – SQL Performance Monitor
Site Configuration [Diagram labels: 9710 TimberWolf; Enterprise Storage Array: 9 HSZ70 Ultra-SCSI dual redundant controllers, 324 9 GB Seagate disks; Alpha 8400 (8 x 440), 10 GB RAM; two Compaq 5500 4 x 200 MHz web servers; to the Web]
The Microsoft TerraServer Hardware • Compaq AlphaServer 8400 • 8 x 400 Mhz Alpha cpus • 10 GB DRAM • 324 9.2 GB StorageWorks Disks – 3 TB raw, 2.4 TB of RAID5 • STK 9710 tape robot (4 TB) • WindowsNT 4 EE, SQL Server 7.0
File System Config • Use StorageWorks to form 28 RAID5 sets; each RAID set has 11 disks (16 spare drives) • Use NTFS to form 4 595 GB NT volumes (F:, G:, H:, I:), each striped over 7 RAID sets on 7 controllers • Create 26 20,000 MB files on F:, 27 on G: • DB is File Group of 53 files (1.011 TB)
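The numbers on this slide are consistent with the 324 disks listed on the hardware slide, which a short sketch can confirm. The constant and function names are illustrative. The 1.011 TB figure only works out if a TB is taken as 1024 x 1024 MB, which is an assumption about the slide's units.

```python
RAID_SETS = 28
DISKS_PER_SET = 11
SPARE_DRIVES = 16
VOLUMES = 4
SETS_PER_VOLUME = 7
FILES_ON_F = 26
FILES_ON_G = 27
FILE_SIZE_MB = 20_000


def total_disks():
    """Disks in RAID sets plus spares."""
    return RAID_SETS * DISKS_PER_SET + SPARE_DRIVES


def db_size_tb(n_files, file_size_mb):
    """File group size in TB, assuming 1 TB = 1024 * 1024 MB."""
    return n_files * file_size_mb / 1024**2


if __name__ == "__main__":
    print("disks:", total_disks())
    print("RAID sets across volumes:", VOLUMES * SETS_PER_VOLUME)
    n_files = FILES_ON_F + FILES_ON_G
    print(f"files: {n_files}, DB size: {db_size_tb(n_files, FILE_SIZE_MB):.3f} TB")
```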
SQL 7 TerraServer Availability • Operating for 9 months: 6400 hrs • Unscheduled outage: 36.5 minutes: 99.9905% scheduled up • Scheduled outage: 60 minutes • Availability: 99.96% overall up • No NT failures (ever) • One SQL 7 Beta 2 bug • No failures in July, Aug, Oct, Dec, Jan, Feb, Mar
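Availability is uptime as a fraction of the total operating period. The sketch below assumes the 6400 hours is the whole period and reproduces only the unscheduled-outage figure; the slide does not say how the overall figure was derived, so that one is not recomputed here. The function name is illustrative.

```python
def availability_pct(period_hours, outage_minutes):
    """Percentage of the period during which the service was up."""
    period_minutes = period_hours * 60
    return 100.0 * (1.0 - outage_minutes / period_minutes)


if __name__ == "__main__":
    # 36.5 minutes of unscheduled outage over 6400 hours
    print(f"{availability_pct(6400, 36.5):.4f}%")
```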
Things we did right... • Use a database to store images: – Simplify management – Can dynamically load data into tables while viewing application is active • Simple X, Y Z-Grid navigation system • Used ImgStatus to control logical “presence” of the image in the app • “Stitching tiles together” from multiple input images to form seamless mosaic • Offering two forms of seamless: time based (SPIN-2) and theme based (DOQ)
TS3: Things are changing... • Square Tiles, power of 2 size (200 x 200) • Power of 2 zoom levels (2:1, 4:1, 8:1, etc.) so uniform tile size on each zoom (variable ground size per tile) • Indexing system independent of tile size • Digital Raster Graphics (Topo maps) • Layered Maps (Topo merge with DOQ) • Integrate with other applications and services • Later: – Digital Elevation Models (DEMs) – Other foreign data sources (EU, etc.)
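With power-of-2 zoom levels and a fixed tile size in pixels, the ground area covered by a tile doubles per side at each level. The sketch below assumes 200-pixel tiles and a 1 meter/pixel base resolution (the USGS aerial imagery figure from the database slide), with level 0 as the base. The function names and the level numbering are illustrative, not the TerraServer schema.

```python
from math import ceil


def tile_ground_size_m(level, tile_px=200, base_m_per_px=1.0):
    """Ground length of one tile side at a zoom level (level 0 = base resolution)."""
    return tile_px * base_m_per_px * 2 ** level


def tiles_to_cover(width_m, height_m, level, tile_px=200, base_m_per_px=1.0):
    """Number of tiles needed to cover a width x height ground area at a zoom level."""
    side = tile_ground_size_m(level, tile_px, base_m_per_px)
    return ceil(width_m / side) * ceil(height_m / side)


if __name__ == "__main__":
    for level in range(4):
        side = tile_ground_size_m(level)
        count = tiles_to_cover(1600, 1600, level)
        print(f"level {level} ({2 ** level}:1): tile covers {side:.0f} m, "
              f"{count} tiles for a 1.6 x 1.6 km view")
```

Under these assumptions the tile sides come out as 200 m, 400 m, 800 m, and 1.6 km.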
What TerraServer Shows • Can serve huge databases on the Internet for about a penny a page view, mostly phone bill (!). Advertising pays more than a penny a page. • Commodity tools do scale fairly far. • A few people (3 developers, 1 operator) using power tools can build an impressive web site.
Thank You! • SPIN-2 • Tom Barclay did most of this app; Slutz and Gray helped.
Outline – The bandwidth revolution – ScaleUp, ScaleOut – TerraServer (Barclay, Slutz, Gray)
End
Windows NT Versus UNIX: Best Results on an SMP • SemiLog plot shows 3x (2 year) lead by UNIX • See www.tpc.org
TPC C Improvements (MS SQL) • 250%/year on Price: 40% hardware, 100% software, 100% PC Technology • 100%/year performance
Price Breakdown (6 months old)
(dis) Economy Of Scale