OceanStore: Global-Scale Persistent Storage
Ying Lu
CSCE 496/896, Spring 2011

Credits
• Many slides are from John Kubiatowicz, University of California at Berkeley
• I have modified them and added new slides

Motivation
• Personal Information Mgmt is the Killer App
  – Not corporate processing but management, analysis, aggregation, dissemination, filtering for the individual
  – Automated extraction and organization of daily activities to assist people
• Information Technology as a Utility
  – Continuous service delivery, on a planetary scale, on top of a highly dynamic information base

OceanStore Context: Ubiquitous Computing
• Computing everywhere:
  – Desktop, Laptop, Palmtop, Cars, Cellphones
  – Shoes? Clothing? Walls?
• Connectivity everywhere:
  – Rapid growth of bandwidth in the interior of the net
  – Broadband to the home and office
  – Wireless technologies such as CDMA, Satellite, laser
• Rise of the thin-client metaphor:
  – Services provided by interior of network
  – Incredibly thin clients on the leaves
    • MEMS devices: sensors + CPU + wireless net in 1 mm³
• Mobile society: people move and devices are disposable

What do we need for personal information management?

Questions about information:
• Where is persistent information stored?
  – 20th-century tie between location and content outdated
• How is it protected?
  – Can disgruntled employee of ISP sell your secrets?
  – Can't trust anyone (how paranoid are you?)
• Can we make it indestructible?
  – Want our data to survive "the big one"!
  – Highly resistant to hackers (denial of service)
  – Wide-scale disaster recovery
• Is it hard to manage?
  – Worst failures are human-related
  – Want automatic (introspective) diagnosis and repair

First Observation: Want Utility Infrastructure
• Mark Weiser from Xerox: Transparent computing is the ultimate goal
  – Computers should disappear into the background
• In storage context:
  – Don't want to worry about backup, obsolescence
  – Need lots of resources to make data secure and highly available, BUT don't want to own them
  – Outsourcing of storage already very popular
    • Pay monthly fee and your "data is out there"
  – Simple payment interface: one bill from one company

Second Observation: Need wide-scale deployment
• Many components with geographic separation
  – System not disabled by natural disasters
  – Can adapt to changes in demand and regional outages
• Wide-scale use and sharing also requires wide-scale deployment
  – Bandwidth increasing rapidly, but latency bounded by speed of light
• Handling many people with same system leads to economies of scale

OceanStore: Everyone's data, One big Utility
"The data is just out there"
• Separate information from location
  – Locality is only an optimization (an important one!)
  – Wide-scale coding and replication for durability
• All information is globally identified
  – Unique identifiers are hashes over names & keys
  – Single uniform lookup interface
  – No centralized namespace required

Amusing back of the envelope calculation (courtesy Bill Bolotsky, Microsoft)
• How many files in the OceanStore?
  – Assume 10^10 people in world
  – Say 10,000 files/person (very conservative?)
  – So 10^14 files in OceanStore!
  – If 1 gig files (not likely), get 10^23 bytes, on the order of 1 mole of bytes!
Truly impressive number of elements… but small relative to physical constants
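The arithmetic behind this estimate can be checked in a few lines. This is only a sketch: the population, files per person, and file size are the slide's own assumptions, and the constant names are chosen here for illustration.

```python
# Back-of-the-envelope estimate; every input is an assumption taken from the slide.
PEOPLE = 10**10            # assumed world population
FILES_PER_PERSON = 10**4   # assumed files per person
BYTES_PER_FILE = 10**9     # "1 gig" per file, which the slide itself calls unlikely
AVOGADRO = 6.022e23        # elements per mole

total_files = PEOPLE * FILES_PER_PERSON
total_bytes = total_files * BYTES_PER_FILE
moles_of_bytes = total_bytes / AVOGADRO

print(f"files: {total_files:.0e}")
print(f"bytes: {total_bytes:.0e}")
print(f"moles of bytes: {moles_of_bytes:.2f}")
```

Under these assumptions the byte count lands within an order of magnitude of Avogadro's number, which is the point of the joke.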

Utility-based Infrastructure
[Figure: confederation of providers labeled Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM]
• Service provided by confederation of companies
  – Monthly fee paid to one service provider
  – Companies buy and sell capacity from each other

Outline
• Motivation
• Properties of the OceanStore
• Specific Technologies and approaches:
  – Naming and Data Location
  – Conflict resolution on encrypted data
  – Replication and Deep archival storage
  – Introspective computing for optimization and repair
  – Economic models
• Conclusion

Ubiquitous Devices → Ubiquitous Storage
• Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc.
• Properties REQUIRED for OceanStore storage substrate:
  – Strong Security: data encrypted in the infrastructure; resistance to monitoring and denial of service attacks
  – Coherence: too much data for naïve users to keep coherent "by hand"
  – Automatic replica management and optimization: huge quantities of data cannot be managed manually
  – Simple and automatic recovery from disasters: probability of failure increases with size of system
  – Utility model: world-scale system requires cooperation across administrative boundaries

OceanStore Technologies I: Naming and Data Location
• Requirements:
  – System-level names should help to authenticate data
  – Route to nearby data without global communication
  – Don't inhibit rapid relocation of data
• OceanStore approach: Two-level search with embedded routing
  – Underlying namespace is flat and built from secure cryptographic hashes (160-bit SHA-1)
  – Search process combines quick, probabilistic search with slower guaranteed search
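A flat, self-certifying name of this kind can be sketched with the standard library. The slides only say that identifiers are SHA-1 hashes over names and keys; the function name `make_guid`, the `/` separator, and the byte encoding below are assumptions made for illustration.

```python
import hashlib


def make_guid(owner_key: bytes, name: str) -> str:
    """Return a 160-bit identifier as 40 hex digits.

    Hypothetical layout: SHA-1 over the owner's public key bytes, a
    separator, and the human-readable name. The real encoding is not
    given in the slides.
    """
    digest = hashlib.sha1(owner_key + b"/" + name.encode("utf-8")).digest()
    return digest.hex()


guid = make_guid(b"example-public-key", "notes/todo.txt")
print(guid)
```

Because the identifier depends on the owner's key, anyone holding the key and name can recompute it, and no central authority has to hand out names.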

Universal Location Facility
• Takes 160-bit unique identifier (GUID) and returns the nearest object that matches
[Figure: name and version OIDs pass through Global Object Resolution to a floating replica (active data) and to erasure-coded archival copies or snapshots; labels include Universal Name, Root Structure, Update OID, Commit Logs, Checkpoint OID, and archive Version OIDs 1 to 3]

Routing
Two-tiered approach
• Fast probabilistic routing algorithm (self-optimizing)
  – Entities that are accessed frequently are likely to reside close to where they are being used (ensured by introspection)
• Slower, guaranteed hierarchical routing method

Probabilistic Routing Algorithm (self-optimizing)
• Bloom filter on each node; attenuated Bloom filter on each directed edge
[Figure: nodes n1 to n4, each with a 5-bit Bloom filter; every directed edge carries an attenuated Bloom filter array with one filter per depth (1st, 2nd, 3rd) and reliability factors; a query for X (11010) is traced through the mesh]
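The idea of routing on attenuated Bloom filters can be sketched as follows. This is a simplified illustration, not the OceanStore code: the filter size, the SHA-1-derived hash positions, and the names `BloomFilter`, `first_matching_depth`, and `choose_edge` are all assumptions, and the reliability factors from the figure are left out.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter stored as an integer bitmask."""

    def __init__(self, num_bits: int = 64, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting SHA-1 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def first_matching_depth(attenuated, key):
    """attenuated[i] summarizes objects i + 1 hops away along one edge."""
    for depth, level in enumerate(attenuated, start=1):
        if level.might_contain(key):
            return depth
    return None


def choose_edge(edges, key):
    """Pick the neighbor whose attenuated filter matches at the shallowest depth."""
    best = None
    for neighbor, attenuated in edges.items():
        depth = first_matching_depth(attenuated, key)
        if depth is not None and (best is None or depth < best[1]):
            best = (neighbor, depth)
    return best[0] if best else None
```

When `choose_edge` returns `None`, no edge advertises the object within the filter depth, and the query would fall back to the slower guaranteed routing tier.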

Hierarchical Routing Algorithm
• Based on Plaxton scheme
• Every server in the system is assigned a random node-ID
• Object's root
  – Each object is mapped to a single node whose node-ID matches the object's GUID in the most bits (starting from the least significant)
  – Information about the GUID (such as location) is stored at its root
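Root selection can be illustrated with a short sketch. It compares hex digits rather than individual bits, which is a simplification, and it breaks ties by list order, whereas a real Plaxton mesh needs a deterministic global tie-break. The names `shared_suffix_len` and `pick_root` are hypothetical.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing hex digits that a and b have in common."""
    count = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        count += 1
    return count


def pick_root(guid: str, node_ids: list) -> str:
    """Return the node-ID sharing the longest suffix with the GUID.

    Simplification: assumes global knowledge of all node-IDs and
    breaks ties by position in the list.
    """
    return max(node_ids, key=lambda nid: shared_suffix_len(nid, guid))


nodes = ["79FE", "035E", "73FE", "04FE"]
print(pick_root("43FE", nodes))
```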

Construct Plaxton Mesh
[Figure: Plaxton mesh construction; nodes with four-digit IDs (for example 0324, 3714, 0265, 1215, 2344, 5724, 9834, 1624, 7144, 1324) linked by neighbor pointers labeled with levels 1 to 4]

Basic Plaxton Mesh: Incremental suffix-based routing
[Figure: routing toward GUID 0x43FE across nodes 0x79FE, 0x035E, 0x23FE, 0x73FE, 0x44FE, 0x73FF, 0x43FE, 0x555E, 0xABFE, 0x423E, 0x993E, 0xF990, 0x04FE, 0x13FE, 0x9990, 0x239E, 0x1290; links are labeled with levels 1 to 4]
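Incremental suffix-based routing resolves at least one more trailing digit of the GUID on every hop. The sketch below uses the node-IDs from the figure but makes two loud simplifications: each node is assumed to know every other node (a real mesh keeps per-level neighbor tables), and the next hop is the first candidate in list order rather than the closest one in network distance. The function names are hypothetical.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing hex digits that a and b have in common."""
    count = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        count += 1
    return count


def route(start: str, guid: str, node_ids: list) -> list:
    """Return the hop-by-hop path from start toward the GUID's root."""
    path = [start]
    current = start
    while True:
        matched = shared_suffix_len(current, guid)
        # Nodes that match the GUID in more trailing digits than we do now.
        closer = [n for n in node_ids if shared_suffix_len(n, guid) > matched]
        if not closer:
            return path  # nobody matches better: current node is the root
        # Take the smallest available improvement, i.e. the next level.
        current = min(closer, key=lambda n: shared_suffix_len(n, guid))
        path.append(current)


NODES = ["79FE", "035E", "23FE", "73FE", "44FE", "73FF", "43FE", "555E",
         "ABFE", "423E", "993E", "F990", "04FE", "13FE", "9990", "239E", "1290"]
print(route("1290", "43FE", NODES))
```

Since the matched suffix grows on every hop, a path has at most as many hops as the ID has digits.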

Use of Plaxton Mesh: Randomization and Locality

OceanStore Enhancements of the Plaxton Mesh
• Documents have multiple roots (salted hashes of the GUID)
• Each node has multiple neighbor links
• Searches proceed along multiple paths
  – Tradeoff between reliability, performance and bandwidth?
• Dynamic node insertion and deletion algorithms
  – Continuous repair and incremental optimization of links
• Goals: self-healing, self-configuration, self-optimizing
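Deriving several roots from one GUID by salting can be sketched like this. The slides do not say how the salt is combined with the GUID; appending a one-byte salt and rehashing with SHA-1 is an assumption, as is the name `salted_roots`.

```python
import hashlib


def salted_roots(guid: bytes, num_roots: int) -> list:
    """Return num_roots distinct 160-bit identifiers derived from one GUID.

    Hypothetical scheme: SHA-1 over the GUID followed by a one-byte salt,
    so num_roots must be at most 256.
    """
    return [hashlib.sha1(guid + bytes([salt])).digest()
            for salt in range(num_roots)]


guid = hashlib.sha1(b"example object").digest()
for root_id in salted_roots(guid, 3):
    print(root_id.hex())
```

Each derived identifier maps to a different root node, so losing one root does not make the object unreachable.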

OceanStore Technologies II: Rapid Update in an Untrusted Infrastructure
• Requirements:
  – Scalable coherence mechanism which can operate directly on encrypted data without revealing information
  – Handle Byzantine failures
  – Rapid dissemination of committed information
• OceanStore Approach:
  – Operations-based interface using conflict resolution
    • Modeled after Xerox Bayou; update packets include predicate/action pairs which operate on encrypted data
  – User signs updates and principal party signs commits
  – Committed data multicast to clients
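A predicate/action update in the Bayou style can be sketched as a list of pairs evaluated in order against the current object state. Everything here is illustrative: the `ObjectState` type, the rule of applying the first pair whose predicate holds, and the version counter are assumptions, and the blocks are treated as opaque ciphertext that the server never decrypts.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ObjectState:
    version: int = 0
    blocks: List[bytes] = field(default_factory=list)  # opaque ciphertext blocks


Predicate = Callable[[ObjectState], bool]
Action = Callable[[ObjectState], None]


def apply_update(state: ObjectState, pairs: List[Tuple[Predicate, Action]]) -> bool:
    """Apply the action of the first pair whose predicate is true.

    Returns True if the update commits, False if it aborts because
    no predicate matched (a detected conflict).
    """
    for predicate, action in pairs:
        if predicate(state):
            action(state)
            state.version += 1
            return True
    return False


state = ObjectState(version=0, blocks=[b"cipher-0"])
append_if_unchanged = [
    (lambda s: s.version == 0, lambda s: s.blocks.append(b"cipher-1")),
]
print(apply_update(state, append_if_unchanged))  # commits
print(apply_update(state, append_if_unchanged))  # aborts: version moved on
```

The predicate here compares only a version number, which a server can check without seeing plaintext.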

Update Model
• Concurrent updates w/o wide-area locking
  – Conflict resolution
• Updates Serialization
• A master replica?
• Role of primary tier of replicas
  – All updates submitted to primary tier of replicas, which chooses a final total order by following Byzantine agreement protocol
• A secondary tier of replicas
  – The result of the updates is multicast down the dissemination tree to all the secondary replicas

Agreement
• Need agreement in distributed systems (DS):
  – Leader, commit, synchronize
• Distributed agreement algorithm: all non-faulty processes achieve consensus in a finite number of steps
• Perfect processes, faulty channels: two-army problem
• Faulty processes, perfect channels: Byzantine generals

Two-Army Problem

Possible Consensus
• Agreement is possible in synchronous DS [e.g., Lamport et al.]
  – Messages can be guaranteed to be delivered within a known, finite time
  – Byzantine Generals Problem
• A synchronous DS: can distinguish a slow process from a crashed one

Byzantine Generals Problem

Byzantine Generals: Example (1)
The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce the time to launch the attack (by messages marked by their ids).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives, where every general passes his vector from (b) to every other general.

Byzantine Generals: Example (2)
The same as in the previous slide, except now with 2 loyal generals and 1 traitor.

Byzantine Generals
• Given three processes, if one fails, consensus is impossible
• Given N processes, if F processes fail, consensus is impossible if N ≤ 3F
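The bound can be restated as a requirement of at least 3F + 1 processes. A minimal sketch follows, assuming the classic setting of synchronous rounds and unsigned messages; the function names are chosen here for illustration.

```python
def agreement_possible(n: int, f: int) -> bool:
    """True if n processes can tolerate f Byzantine ones (needs n >= 3f + 1)."""
    return n > 3 * f


def max_faulty(n: int) -> int:
    """Largest number of Byzantine processes that n processes can tolerate."""
    return (n - 1) // 3


# The two examples from the slides:
print(agreement_possible(4, 1))  # 3 loyal generals + 1 traitor
print(agreement_possible(3, 1))  # 2 loyal generals + 1 traitor
```

This matches the two worked examples: four generals with one traitor can agree, three generals with one traitor cannot.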

Tentative Updates: Epidemic Dissemination

Committed Updates: Multicast Dissemination

Data Coding Model
• Two distinct forms of data: active and archival
• Active Data in Floating Replicas
  – Latest version of the object
• Archival Data in Erasure Coded Fragments
  – A permanent, read-only version of the object
  – During commit, previous version coded with erasure-code and spread over 100s or 1000s of nodes
  – Advantage: any 1/2 or 1/4 of fragments regenerates data
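The "any half of the fragments regenerates the data" property can be demonstrated with a toy Reed-Solomon-style code. This is not the code OceanStore uses, which the slides do not specify: it works over the small prime field GF(257), encodes k bytes at a time into n fragments, and the names `encode` and `decode` are hypothetical.

```python
P = 257  # prime just above 255, so every byte value is a field element


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) points, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, n: int):
    """Turn k = len(data) bytes into n fragments (x, value), with k <= n <= P.

    The first k fragments carry the data itself; the rest are extra
    evaluations of the same polynomial.
    """
    points = list(enumerate(data))
    return [(x, lagrange_eval(points, x)) for x in range(n)]


def decode(fragments, k: int) -> bytes:
    """Rebuild the k data bytes from any k distinct fragments."""
    points = fragments[:k]
    return bytes(lagrange_eval(points, x) for x in range(k))


fragments = encode(b"ocean", 10)       # rate 1/2: 5 data bytes, 10 fragments
survivors = fragments[5:]              # lose every original data fragment
print(decode(survivors, 5))
```

With n = 2k any half of the fragments suffices, and with n = 4k any quarter does, at the cost of proportionally more storage.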

Floating Replica and Deep Archival Coding
[Figure: floating replicas, each holding a full copy, a version list (Ver1: 0x34243, Ver2: 0x49873, Ver3: …), and conflict resolution logs, backed by erasure-coded fragments]

Proactive Self-Maintenance
• Continuous testing and repair of information
  – Slow sweep through all information to make sure there are sufficient erasure-coded fragments
  – Continuously reevaluate risk and redistribute data
  – Slow sweep and repair of metadata/search trees
• Continuous online self-testing of HW and SW
  – Detects flaky, failing, or buggy components via:
    • fault injection: triggering hardware and software error handling paths to verify their integrity/existence
    • stress testing: pushing HW/SW components past normal operating parameters
    • scrubbing: periodic restoration of potentially "decaying" hardware or software state
  – Automates preventive maintenance

OceanStore Technologies IV: Introspective Optimization
• Requirements:
  – Reasonable job on global-scale optimization problem
    • Take advantage of locality whenever possible
    • Sensitivity to limited storage and bandwidth at endpoints
  – Repair of data structures, increasing of redundancy
  – Stability in chaotic environment (active feedback)
• OceanStore Approach:
  – Introspective monitoring and analysis of relationships to cluster information by relatedness
  – Time-series analysis of user and data motion
  – Rearrangement and replication in response to monitoring
    • Clustered prefetching: fetch related objects
    • Proactive prefetching: get data there before needed

Example: Client Introspection
• Client observer and optimizer components
  – Greedy agents working on behalf of the client
    • Watch client activity and combine it with historical info
    • Perform clustering and time-series analysis
    • Forward results to infrastructure (privacy issues!)
  – Monitoring state of network to adapt behaviour
• Typical Actions:
  – Cluster related files together
  – Prefetch files that will be needed soon
  – Create/destroy floating replicas

OceanStore Conclusion
• The time is now for a Universal Data Utility
  – Ubiquitous computing and connectivity are (almost) here!
  – Confederation of utility providers is right model
• OceanStore holds all data, everywhere
  – Local storage is a cache on global storage
  – Provides security in an untrusted infrastructure
• Exploits economies of scale to:
  – Provide high availability and extreme survivability
  – Lower maintenance cost:
    • Self-diagnosis and repair
    • Insensitivity to technology changes: just unplug one set of servers, plug in others