
Storage Bricks
Jim Gray, Microsoft Research
http://Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 29 Jan 2002
Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
First Disk, 1956
• IBM 305 RAMAC
• 4 MB
• 50 x 24” disks
• 1200 rpm
• 100 ms access
• 35 k$/y rent
• Included computer & accounting software (tubes, not transistors)
10 years later
• 1.6 meters
[photo]
Disk Evolution
[scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
• Capacity: 100x in 10 years (1 TB 3.5” drive in 2005; 20 GB 1” micro-drive)
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is super computer!
Disks are becoming computers
• Smart drives
• Camera with micro-drive
• Replay / TiVo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…
[diagram: Applications (Web, DBMS, Files) / OS / Disk Ctlr + 1 GHz CPU + 1 GB RAM; Comm: Infiniband, Ethernet, radio…]
Data Gravity: Processing Moves to Transducers
• Smart displays, microphones, printers, NICs, disks
• Processing decentralized
• Moving to data sources
• Moving to power sources
• Moving to sheet metal
[diagram: today an ASIC with P = 50 mips, M = 2 MB at each of Storage, Network, Display; in a few years P = 500 mips, M = 256 MB. The end of computers?]
It’s Already True of Printers: Peripheral = CyberBrick
• You buy a printer
• You get:
– several network interfaces
– a PostScript engine
– CPU, memory, software, a spooler (soon)
– and… a print engine.
The Absurd Design?
• Segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl’s laws: bus: 10 B/ips; io: 1 b/ips (back-of-envelope arithmetic sketched below)
[diagram: Processors (~1 Tips), 10 TBps bus to RAM (~1 TB), 100 GBps io to Disks (~100 TB)]
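A minimal sketch (Python) of how the diagram’s bandwidth figures follow from the per-instruction ratios on the slide; the ~1 Tips aggregate is taken from the diagram, the rest is unit conversion.

```python
# Back-of-envelope check of the Amdahl ratios on the slide:
# ~10 bytes of bus traffic and ~1 bit of I/O per instruction per second.

instructions_per_sec = 1e12              # ~1 Tips of aggregate processing

bus_bytes_per_ips = 10                   # bus: 10 B/ips
io_bits_per_ips = 1                      # io:  1 b/ips

bus_bandwidth = instructions_per_sec * bus_bytes_per_ips       # bytes/s
io_bandwidth = instructions_per_sec * io_bits_per_ips / 8      # bytes/s

print(f"bus: {bus_bandwidth / 1e12:.0f} TB/s")    # ~10 TBps, as in the diagram
print(f"io:  {io_bandwidth / 1e9:.0f} GB/s")      # ~125 GBps, the diagram's ~100 GBps
```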
The “Absurd” Disk
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data); arithmetic sketched below
• It’s a tape!
• Optimizations:
– Reduce management costs
– Caching
– Sequential 100x faster than random
[drive callout: 1 TB, 100 MB/s, 200 Kaps, 200$]
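A small sketch deriving the scan-time and access-density bullets from the drive callout (1 TB, 100 MB/s, 200 Kaps); reading “200 Kaps” as roughly 200 random KB-sized accesses per second is an assumption.

```python
capacity_bytes = 1e12        # 1 TB drive
seq_bandwidth = 100e6        # 100 MB/s sequential
random_aps = 200             # ~200 Kaps, read as ~200 random accesses/s (assumed)

scan_hours = capacity_bytes / seq_bandwidth / 3600
gb_per_aps = (capacity_bytes / 1e9) / random_aps

print(f"full scan: {scan_hours:.1f} hours")        # ~2.8 hr, the slide's 2.5 hr
print(f"1 access/sec per {gb_per_aps:.0f} GB")     # 1 aps / 5 GB
```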
Disk = Node
• magnetic storage (1 TB)
• processor + RAM + LAN
• Management interface (HTTP + SOAP)
• Application execution environment
• Applications:
– File
– DB2/Oracle/SQL
– Notes/Exchange/TeamServer
– SAP/Siebel/…
– QuickBooks/TiVo/PC…
[software stack: Applications; Services; DBMS; RPC, …; File System; LAN driver; Disk driver; OS Kernel]
Implications
Conventional:
• Offload device handling to NIC/HBA
• Higher-level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and Cluster parallelism is important.
Radical:
• Move the app to the NIC/device controller
• Higher-higher-level protocols: SOAP/DCOM/RMI…
• Cluster parallelism is VERY important.
[diagram: Central Processor & Memory with a Terabyte/s Backplane]
Intermediate Step: Shared Logic
• Brick with 8-12 disk drives
• 200 mips/arm (or more)
• 2 x Gbps Ethernet
• General-purpose OS
• 10 k$/TB to 50 k$/TB
• Shared:
– Sheet metal
– Power
– Support/Config
– Security
– Network ports
• These bricks could run applications (e.g. SQL or Mail or…)
[examples: Snap ~1 TB NAS (12 x 80 GB); NetApp ~0.5 TB NAS (8 x 70 GB); Maxtor ~2 TB NAS (12 x 160 GB)]
Example (slide courtesy of Brewster Kahle, Archive.org)
• Homogeneous machines lead to quick response through reallocation
• HP desktop machines, 320 MB RAM, 3U high, 4 x 100 GB IDE drives
• $4 k/TB (street)
• 2.5 processors/TB, 1 GB RAM/TB
• JIT storage & processing: 3 weeks from order to deploy
What if Disk Replaces Tape? How does it work?
• Backup/Restore
– RAID (among the federation)
– Snapshot copies (in most OSs)
– Remote replicas (standard in DBMS and FS)
• Archive
– Use the “cold” 95% of disk space
• Interchange
– Send computers, not disks.
It’s Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days! (arithmetic below)
• Store it in two (or more) places online: a geo-plex
• Scrub it continuously (look for errors)
• On failure,
– use the other copy until the failure is repaired,
– refresh the lost copy from the safe copy.
• Can organize the two copies differently (e.g.: one by time, one by space)
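The 12-day figure is straight division; a one-line check:

```python
petabyte = 1e15          # bytes
restore_bw = 1e9         # 1 GBps

days = petabyte / restore_bw / 86_400
print(f"{days:.1f} days")    # ~11.6 days, the slide's "12 days"
```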
Archive to Disk: 100 TB for 0.5 M$ + 1.5 “free” petabytes
• If you have 100 TB active you need 10,000 mirrored disk arms (see TPC-C)
• So you have 1.6 PB of (mirrored) storage (160 GB drives)
• Use the “empty” 95% for archive storage (arithmetic sketched below).
• No extra space or extra power cost.
• Very fast access (milliseconds vs hours).
• Snapshot is read-only (software enforced)
• Makes admin easy (saves people costs)
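A quick sketch of where the 1.6 PB and the 1.5 “free” PB come from, using only the figures on the slide (the active data and its mirror both live inside the same pool):

```python
active_tb = 100            # 100 TB of active data
disk_arms = 10_000         # arms needed for the access rate (see TPC-C)
drive_gb = 160             # 160 GB drives

raw_pb = disk_arms * drive_gb / 1e6        # 1.6 PB of (mirrored) storage
free_pb = raw_pb - active_tb / 1e3         # what is left over for archive

print(f"{raw_pb:.1f} PB total, {free_pb:.1f} PB free "
      f"({free_pb / raw_pb:.0%} 'empty')")  # 1.6 PB, 1.5 PB, ~94%: the slide's ~95%
```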
Disk as Tape: Archive (slide courtesy of Brewster Kahle, Archive.org)
• Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
• Using removable hard drives to replace tape’s function has been successful
• When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
• Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.
Disk as Tape: Interchange
• Tape interchange is frustrating (often unreadable)
• Beyond 1-10 GB, send media, not data
– FTP takes too long (an hour/GB)
– Bandwidth is still very expensive (1 $/GB)
• Writing DVD is not much faster than the Internet
• New technology could change this
– a 100 GB DVD @ 10 MBps would be competitive.
• Write a 1 TB disk in 2.5 hrs (at 100 MBps); comparison below
• But, how does interchange work?
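A small comparison of the transfer times behind these bullets, using the rates quoted on the slide:

```python
GB = 1e9

def hours(size_bytes: float, rate_bytes_per_sec: float) -> float:
    return size_bytes / rate_bytes_per_sec / 3600

print(f"FTP 1 TB at ~1 hour/GB:      {1000:,} hours")                        # weeks
print(f"write 1 TB disk at 100 MBps: {hours(1000 * GB, 100e6):.1f} hours")   # ~2.8 hr
print(f"write 100 GB DVD at 10 MBps: {hours(100 * GB, 10e6):.1f} hours")     # ~2.8 hr
```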
Disk as Tape Interchange: What format?
• Today I send 160 GB NTFS/SQL disks.
• But that is not a good format for Linux/DB2 users.
• Solution: ship NFS/CIFS/ODBC servers (not disks).
• Plug the “disk” into the LAN (see the sketch below).
– DHCP, then file or DB server via a standard interface.
– “pull” data from the server.
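A hypothetical sketch of the “pull” step, assuming (purely for illustration) that the shipped brick gets an address via DHCP and serves its contents over plain HTTP; NFS/CIFS/ODBC would work the same way conceptually, and the address and manifest URL are invented for the example.

```python
import urllib.request

BRICK = "http://192.168.1.50"    # address the brick obtained via DHCP (assumed)

# Ask the brick for a manifest, then pull each file over the LAN.  The importer
# never touches the brick's on-disk format (NTFS, SQL, ext2, ...).
with urllib.request.urlopen(f"{BRICK}/manifest.txt") as resp:
    names = resp.read().decode().split()

for name in names:
    with urllib.request.urlopen(f"{BRICK}/{name}") as src, open(name, "wb") as dst:
        dst.write(src.read())
```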
Some Questions
• What is the product?
• How do I manage 10,000 nodes (disks)?
• How do I program 10,000 nodes (disks)?
• How does RAID work?
• How do I backup a PB?
• How do I restore a PB?
What is the Product?
• Concept: Plug it in and it works!
• Music/Video/Photo appliance (home)
• Game appliance
• “PC”
• File server appliance
• Data archive/interchange appliance
• Web server appliance
• DB server
• eMail appliance
• Application appliance
[diagram: the appliance plugs into just network and power]
How Does Scale-Out Work?
• Files: well-known designs:
– rooted tree partitioned across nodes
– automatic cooling (migration)
– mirrors or chained declustering (sketched below)
– snapshots for backup/archive
• Databases: well-known designs
– partitioning and remote replication, similar to files
– distributed query processing
• Applications: (hypothetical)
– Must be designed as mobile objects
– Middleware provides an object migration system
– Objects externalize methods to migrate (== backup/restore/archive)
– Web services seem to have the key ideas (XML representation)
– Example: the eMail object is a mailbox
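A minimal sketch of chained declustering, one of the mirroring schemes named above: each node holds its own primary partition plus a backup of its neighbour’s, so every partition survives any single node failure and the rebuild load spreads along the chain. The node count and helper names are illustrative.

```python
NODES = 8

def placement(partition: int) -> tuple[int, int]:
    """Return (primary_node, backup_node) for a partition."""
    primary = partition % NODES
    backup = (primary + 1) % NODES      # backup chained to the next node
    return primary, backup

def survives(partition: int, failed_node: int) -> bool:
    """A partition stays readable while at least one copy is on a live node."""
    primary, backup = placement(partition)
    return primary != failed_node or backup != failed_node

# Every partition survives any single node failure:
assert all(survives(p, f) for p in range(NODES) for f in range(NODES))
```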
Auto-Manage Storage
• 1980 rule of thumb:
– a DataAdmin per 10 GB, a SysAdmin per mips
• 2000 rule of thumb:
– a DataAdmin per 5 TB
– a SysAdmin per 100 clones (varies with the app)
• Problem:
– 5 TB is 50 k$ today, 5 k$ in a few years.
– Admin cost >> storage cost!!!! (rough comparison below)
• Challenge:
– Automate ALL storage admin tasks
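A rough sketch of the people-vs-hardware point. The storage prices and the one-admin-per-5-TB rule are from the slide; the burdened cost of an administrator is an assumption added for illustration.

```python
admin_cost_per_year = 100_000   # $/year per admin, assumed for illustration
# 2000 rule of thumb (slide): one DataAdmin covers about 5 TB
storage_cost_today = 50_000     # $ for those 5 TB today (slide)
storage_cost_soon = 5_000       # $ for 5 TB in a few years (slide)

print(f"today: a year of admin ~ {admin_cost_per_year / storage_cost_today:.0f}x the hardware")
print(f"soon:  a year of admin ~ {admin_cost_per_year / storage_cost_soon:.0f}x the hardware")
```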
Admin: TB per admin and “guessed” $/TB (does not include cost of application or overhead; not “substance”)
• Google: 1 : 100 TB, 5 k$/TB/y
• Yahoo!: 1 : 5 TB, 20 k$/TB/y
• DB: 1 : 1 TB, 60 k$/TB/y
• Wall St.: 400 k$/TB/y (reported)
• Hardware is the dominant cost only at Google.
• How can we waste hardware to save people cost?
How do I manage 10,000 nodes?
• You can’t manage 10,000 x (for any x).
• They manage themselves.
– You manage the exceptional exceptions.
• Auto-manage:
– Plug & Play hardware
– Auto load-balance & placement of storage & processing
– Simple parallel programming model
– Fault masking
How do I program 10,000 nodes?
• You can’t program 10,000 x (for any x).
• They program themselves.
– You write embarrassingly parallel programs (toy example below)
– Examples: SQL, Web, Google, Inktomi, HotMail, …
– PVM and MPI prove it must be automatic (unless you have a PhD)!
• Auto-parallelism is ESSENTIAL
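A toy illustration of the embarrassingly parallel style argued for here: the programmer writes a per-partition function and a trivial combine step, and the runtime fans it out. Python’s multiprocessing stands in for whatever a 10,000-node storage fabric would provide; the data and function names are invented.

```python
from multiprocessing import Pool

def count_errors(partition: list[str]) -> int:
    """Per-partition work: runs independently against one node's data."""
    return sum(1 for line in partition if "error" in line)

if __name__ == "__main__":
    partitions = [["ok", "error: disk 3"], ["error: nic", "ok"], ["ok", "ok"]]
    with Pool() as pool:
        per_node = pool.map(count_errors, partitions)   # the auto-parallel step
    print(sum(per_node))                                # combine: 2
```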
Summary
• Disks will become supercomputers, so:
– Lots of computing to optimize the arm
– Can put the app close to the data (better modularity, locality)
– Storage appliances (self-organizing)
• The arm/capacity tradeoff: “waste” space to save accesses.
– Compression (saves bandwidth)
– Mirrors
– Online backup/restore
– Online archive (vault to other drives, or a geoplex if possible)
• It is not that disks replace tapes: storage appliances replace tapes.
• Self-organizing storage servers (file systems); prototypes of this software exist