
A Framework for Supporting DBMS-like Indexes in the Cloud
Gang Chen, Hoang Tam Vo, Sai Wu, Beng Chin Ooi, M. Tamer Özsu

Motivation
• NoSQL systems' trade-off
  – K-V model: scalability vs. functionality (ACID, indexes, …)
  – data selection on the primary key alone is not sufficient
• Ad-hoc queries on secondary attributes
  – OLTP queries: high selectivity, low-latency expectations
  – cloud storage systems hold huge volumes of data
    • parallel scan: scan 1 TB of data to get 10 tuples?
• Indexes in the cloud
  – useful when query selectivity is high
  – distributed indexes
    • a central server may become a bottleneck
    • facilitate parallelism and load balancing
  – open questions: how should the indexes be distributed? how to scale with data volume, network size and the number of indexes?

Current State of the Art
• Asynchronous view maintenance for VLSD databases [1]
  – pre-configured queries vs. ad-hoc data selection
• Open-source systems
  – Cassandra
    • built-in distributed hash secondary indexes (since v0.7)
  – HBase
    • a secondary index created as another table (ongoing) [2]
• Closed-source systems
  – Megastore [3]
    • consistent local indexes inside an entity group
    • asynchronous global indexes across groups
[1] Asynchronous View Maintenance for VLSD Databases. SIGMOD 2009.
[2] http://hbase.apache.org/book.html#secondary.indexes
[3] Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR 2011.

Current State of the Art (2)
• P2P overlays as global distributed indexes
  – distributed B+-tree-like indexes [1]
    • based on the tree-based overlay BATON
  – distributed R-tree-like indexes [2]
    • based on the CAN overlay
• Disadvantages
  – lack of scalability
  – difficult to support multiple indexes of different types
[1] Efficient B+-tree Based Indexing for Cloud Data Processing. VLDB 2010.
[2] Indexing Multi-dimensional Data in a Cloud System. SIGMOD 2010.

Our Focus
• Context
  – an efficient and elastic database service with database functionality (DaaS)
• Aims
  – provision of indexing functionality in the context of DaaS
  – efficiency
    • the ability to locate specific records among millions of distributed candidates in real time
  – scalability
    • multiple indexes (of different types) over distributed data
  – extensibility
    • users can define new indexes without knowing the structure of the underlying network
  – performance self-tuning
    • users do not have to tune the system performance themselves

Challenges of Distributed Indexes
• Different overlays are required to support different types of indexes
  – BATON for B-tree
  – CAN for R-tree
  – Chord for hash
• Overlay routing and maintenance costs are high
• Load balancing issues
  – indexed columns have different data distributions
  – difficult to balance the load of index nodes in the presence of multiple indexes

Our Approach to Providing Index Functionality in the Cloud
• Indexing as a service
• Generic overlay
• Data mapping
• Performance self-tuning
• Result
  – a simple yet efficient and extensible framework for developing distributed indexes in the cloud

Index Node
• Components: Data Mapping, Cayley Graph Manager, Buffer Manager, Connection Manager, Local Indexes, TCP/IP Connection (sketched below)
• Data are transformed into a unified Cayley key space; Chord, CAN and BATON all map onto the Cayley graph
• Index data are distributed across different cluster nodes
• Each node builds a local index for maintaining its index data
• Part of the local indexes is cached in memory
• A connection manager decides which connections should be kept open
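The component list above can be pictured as a single node object that wires these pieces together. The following is a minimal sketch with assumed class and attribute names; it is not the authors' implementation.

```python
# Minimal sketch of the index-node composition described on this slide.
# All class and attribute names are assumptions, not the epiC code.

class IndexNode:
    def __init__(self, node_id, data_mapper, cayley_manager,
                 buffer_manager, connection_manager):
        self.node_id = node_id
        self.data_mapper = data_mapper          # maps attribute values into the Cayley key space
        self.cayley = cayley_manager            # routing over the generic (Chord/CAN/BATON-like) overlay
        self.buffer = buffer_manager            # keeps the hot part of the local indexes in memory
        self.connections = connection_manager   # decides which TCP/IP connections to keep open
        self.local_indexes = {}                 # per-index local structures, e.g. hash, B+-tree, R-tree
```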

Overlay Mapping
• Two interfaces for mapping a specific type of overlay to the Cayley graph
  – generating set
  – operator
• Applying the operator to the ID of an index node and the generating set yields the routing table for that node (see the sketch below)
• Generating sets and operators of the supported overlays
  – BATON: {2^i, i = 0, …, n − 1}, operator + mod 2^n
  – CAN: {1_i, i = 1, …, n}, operator + mod 2
  – Chord: {2^i, i = 0, …, n − 1}, operator + mod 2^n
• The Cayley graph manager turns each overlay's generating set and operator into the routing algorithm used by index search
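As a concrete illustration of the generating-set/operator interface, the sketch below derives a Chord-style routing table by applying the "+ mod 2^n" operator to a node ID and the generating set {2^i}. The function names are assumptions; the actual Cayley graph manager is not shown on the slides.

```python
# Hypothetical sketch: a routing table is the operator applied to the node ID
# and every element of the overlay's generating set.

def chord_generating_set(n):
    """Generating set {2^i : i = 0, ..., n-1} for a key space of size 2^n."""
    return [2 ** i for i in range(n)]

def routing_table(node_id, generators, operator):
    """Routing table = operator(node ID, g) for every generator g."""
    return [operator(node_id, g) for g in generators]

n = 5                                        # key space of size 2^n = 32
plus_mod = lambda x, y: (x + y) % (2 ** n)   # the "+ mod 2^n" operator
print(routing_table(7, chord_generating_set(n), plus_mod))
# -> [8, 9, 11, 15, 23], i.e. the familiar Chord finger table of node 7
```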

Data Mapping
• Uniform data mapping: the attribute domain [L, U] is mapped onto the Cayley key space [0, 2^m − 1] (a sketch follows below)
• Load balance property
  – uniform data mapping provides 1-balance under the assumption that the data distribution is uniform
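A minimal sketch of the uniform mapping, assuming it is the straightforward linear scaling from the attribute domain [L, U] onto the key space [0, 2^m − 1] implied by the slide:

```python
# Assumed linear form of the uniform data mapping: values in [low, high] are
# scaled onto integer keys in [0, 2^m - 1].

def uniform_map(value, low, high, m):
    max_key = 2 ** m - 1
    fraction = (value - low) / (high - low)   # position of the value inside its domain
    return round(fraction * max_key)

print(uniform_map(250.0, low=0.0, high=1000.0, m=16))   # -> 16384, a quarter of the key space
```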

Data Mapping
• Sampling-based data mapping
  – deals with skewed data distributions
  – an equi-depth histogram built from samples (data domain) is mapped to an equi-width histogram (Cayley key space), as sketched below
  – stratified random sampling [1, 2]
    • partition the domain into disjoint subsets
    • take a specified number of samples from each subset
  – performed when bulk-loading data from external sources into cloud databases, e.g., a bulk insert of the daily feed of new items from partners into an operational table
  – skewed online updates
    • perform data migration to re-balance
[1] Sampling Issues in Parallel Database Systems. S. Seshadri and J. F. Naughton. EDBT 1992.
[2] Efficient Bulk Insertion into a Distributed Ordered Table. A. Silberstein et al. SIGMOD 2008.
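The following sketch, with assumed function names, shows the idea behind the equi-depth-to-equi-width mapping: bucket boundaries are drawn from sorted samples so that each bucket holds roughly the same number of values, and each bucket is then given an equally wide slice of the Cayley key space.

```python
# Hypothetical sketch of sampling-based (equi-depth) data mapping for skewed data.

import bisect

def equi_depth_boundaries(samples, buckets):
    """Bucket boundaries taken from sorted samples (equi-depth histogram)."""
    s = sorted(samples)
    return [s[(i * len(s)) // buckets] for i in range(1, buckets)]

def skew_aware_map(value, boundaries, m):
    """Map a value to the start key of its bucket's equi-width slice of [0, 2^m)."""
    buckets = len(boundaries) + 1
    slice_width = (2 ** m) // buckets
    bucket = bisect.bisect_right(boundaries, value)
    return bucket * slice_width

samples = [1, 1, 2, 2, 2, 3, 50, 900]             # skewed toward small values
bounds = equi_depth_boundaries(samples, buckets=4)
print(skew_aware_map(2, bounds, m=16), skew_aware_map(900, bounds, m=16))
# -> 32768 49152: hot small values no longer crowd into a single node's key range
```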

Index Building
• Each cluster node
  – acts as a peer in the P2P overlay
  – maintains "local" indexes such as hash, B+-tree and R-tree indexes
• Index building
  – when data are imported, the index entries are published to the different indexes based on the P2P routing protocols (a sketch follows below)
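A rough sketch of the publish step, with hypothetical helper names: `key_mapper` and `route` stand in for the data mapping and the overlay routing described earlier.

```python
# Hypothetical sketch: when a record is imported, one index entry per indexed
# column is mapped into the Cayley key space and routed to the node that owns
# that key.

def publish_index_entries(record, indexed_columns, key_mapper, route):
    for column in indexed_columns:
        key = key_mapper(record[column])
        entry = {"key": key,
                 "value": record[column],
                 "locator": record["primary_key"]}   # enough to reach the base record later
        route(key, entry)                            # delivered via the P2P routing protocol

# Tiny in-process stand-in for the overlay, just to make the sketch executable.
store = {}
publish_index_entries({"primary_key": "item#1007", "price": 99.5, "title": "SSD"},
                      ["price", "title"],
                      key_mapper=lambda v: hash(v) % 2 ** 16,
                      route=lambda key, entry: store.setdefault(key, []).append(entry))
print(store)
```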

Index Search
• Optimization: index + base table plan vs. index covering plan (illustrated below)
  – covering index entries contain a portion of the data record
  – support a wider range of queries than materialized views
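For illustration only (the field names are assumed), the difference between the two plans is what an index entry carries:

```python
# Index + base table: the entry only locates the base record, which must then be fetched.
plain_entry = {"key": 4211, "locator": "item#1007"}

# Index covering: the entry also carries the columns the query needs,
# so e.g. SELECT title, price can be answered from the index alone.
covering_entry = {"key": 4211, "locator": "item#1007",
                  "included": {"title": "SSD 1TB", "price": 99.5}}
```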

Index Search
• Range search
  – processed on the index nodes in parallel (see the sketch below)
• Parallel scan of different indexes
  – facilitates correlated access across multiple indexes
  – especially useful for equi-joins and range joins
  – join order?
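A minimal, self-contained sketch of the parallel range search, using a hypothetical node stub: every node whose key range overlaps the query is scanned concurrently and the partial results are merged.

```python
# Hypothetical sketch of a parallel range search across index nodes.

from concurrent.futures import ThreadPoolExecutor

class IndexNodeStub:
    def __init__(self, key_range, entries):
        self.key_range = key_range          # (low, high) keys this node is responsible for
        self.entries = sorted(entries)      # stand-in for the node's local index

    def overlaps(self, low, high):
        return self.key_range[0] <= high and low <= self.key_range[1]

    def local_range_search(self, low, high):
        return [e for e in self.entries if low <= e <= high]

def parallel_range_search(nodes, low, high):
    targets = [n for n in nodes if n.overlaps(low, high)]
    with ThreadPoolExecutor(max_workers=max(len(targets), 1)) as pool:
        partials = pool.map(lambda n: n.local_range_search(low, high), targets)
    return [e for part in partials for e in part]

nodes = [IndexNodeStub((0, 99), [5, 42, 77]), IndexNodeStub((100, 199), [101, 150])]
print(parallel_range_search(nodes, 40, 120))   # -> [42, 77, 101]
```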

Index Update
• Two steps (sketched below)
  – delete the old corresponding index entry
  – insert the new index entry
• Index consistency
  – based on the requirements of the application
  – trade-off between performance and consistency
    • strict enforcement of ACID properties
    • less demanding bulk updates
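A small sketch of the two-step update, reusing the hypothetical `key_mapper`/`route` helpers from the publish sketch; note that the delete and the insert may land on different nodes when the updated value maps to a different key.

```python
# Hypothetical sketch of an index update: remove the stale entry, add the new one.

def update_index_entry(primary_key, old_value, new_value, key_mapper, route):
    route(key_mapper(old_value), ("delete", old_value, primary_key))  # step 1: drop the old entry
    route(key_mapper(new_value), ("insert", new_value, primary_key))  # step 2: publish the new entry
```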

Performance Self-tuning
• Why?
  – optimize the performance of existing index nodes before launching new ones
  – complex settings arise with multiple indexes of different types
• Adaptively cache network connections (a sketch follows below)
• Effectively buffer local indexes
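The "adaptively cache network connections" point can be illustrated with a simple LRU policy; this is only a sketch of the policy, with assumed names, not the epiC connection manager itself.

```python
# Sketch of an adaptive connection cache: keep at most `capacity` connections
# open, evicting the least recently used one when a new connection is needed.

from collections import OrderedDict

class ConnectionCache:
    def __init__(self, capacity, connect):
        self.capacity = capacity
        self.connect = connect            # factory: node address -> open connection
        self.open = OrderedDict()         # address -> connection, in LRU order

    def get(self, address):
        if address in self.open:
            self.open.move_to_end(address)        # refresh: recently used again
            return self.open[address]
        if len(self.open) >= self.capacity:
            _, coldest = self.open.popitem(last=False)
            coldest.close()                       # drop the least recently used link
        conn = self.connect(address)
        self.open[address] = conn
        return conn
```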

Failure and Replication
• Failures in large clusters are common
• Replication of index data
  – 24x7 service provision
  – correct retrieval of index data in the presence of failures
  – two-tier load-adaptive replication [1] (sketched below)
    • first tier: k copies for data reliability
    • second tier: replicas created adaptively with the query load
  – replica consistency management
  – lost updates?
  – system recovery from different types of failures
[1] Towards Elastic Transactional Cloud Storage with Range Query Support. H. T. Vo, B. C. Ooi, C. Chen. PVLDB 3(1): 506-517 (2010).
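A back-of-the-envelope sketch of the two-tier idea, with assumed thresholds: the first tier always keeps k copies for reliability, and the second tier adds copies when the observed query load on an index range exceeds what the current replicas can absorb.

```python
# Illustrative policy only; k and the per-replica capacity are assumptions.

def desired_replicas(queries_per_sec, k=3, per_replica_capacity=1000):
    load_tier = -(-queries_per_sec // per_replica_capacity)   # ceiling division
    return max(k, load_tier)                                   # never below the reliability tier

print(desired_replicas(250))    # -> 3: the reliability tier dominates
print(desired_replicas(5200))   # -> 6: the load tier adds replicas for a hot range
```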

Evaluation
• Settings
  – 64-node in-house cluster and EC2
  – storage service: HDFS
  – TPC-W (most experiments) and a synthetically generated data set (larger number of indexes and skewed data)
• Experiments
  – index plan vs. full table scan
  – index covering vs. index+base approach
  – multiple indexes of different types
  – handling skewed data
  – scalability on EC2 (up to 256 nodes)
• Other results (not covered in this talk)
  – effect of varying the query rate
  – effect of varying the data size
  – update performance
  – performance of equi-join and range-join queries

Index plan vs. full table (parallel) scan
• The index plan performs much better than the full table scan approach
  – advantage of indexes: they quickly identify the data node that contains the qualifying tuples
  – table scan time increases almost linearly with the data set size

Index covering vs. index+base approach
• Index covering outperforms index+base when the result set is large
  – index covering: index entries contain sufficient data to answer queries directly
  – index+base is still useful compared to a table scan

Multiple indexes of different types
• The generalized index is superior to one-overlay-per-index and provides the much needed scalability
  – generalized index: one index process maintains multiple indexes and self-tunes its performance via resource sharing

Handling skewed data
• Storage load distribution
  – sampling-based data mapping can roughly estimate the data distribution, so a given percentage of nodes maintains an equivalent percentage of the index data
• Execution load imbalance
  – sampling-based data mapping distributes data among nodes better, so incoming queries on skewed data are also distributed better

Scalability on EC2
• Elastic scaling property
  – more workload can be handled by adding more nodes to the system

Conclusions
• A simple yet efficient and extensible framework for supporting indexes in the cloud
• Main characteristics
  – supports indexes using P2P overlays
  – provides a high-level abstraction for defining new indexes
• Main benefits
  – reduces index creation and maintenance cost
  – provides the much needed scalability
    • multiple indexes of different types over distributed data
More info at the epiC project: http://www.comp.nus.edu.sg/~epiC

Thank you! Questions & Answers