The State of the Art in Distributed Query Processing
by Donald Kossmann
Presented by Chris Gianfrancesco

Introduction
• Distributed database technology is becoming an increasingly attractive enhancement to many database systems
  ◦ Cost and scalability
  ◦ Software integration
    • Legacy systems
  ◦ New applications
  ◦ Market forces

Introduction
• Topics covered in this paper
  ◦ Basics of distributed query processing
  ◦ Client-server distributed DB models
  ◦ Heterogeneous distributed DB models
  ◦ Data placement techniques
  ◦ Other distributed architectures

Client-Server Database Systems
• Relationships between distributed nodes take a client-server form
• Client: makes requests of the servers, usually the source of queries
• Server: responds to client requests, usually the source of data
• System architectures: peer-to-peer, strict client-server, middleware/multitier

Architectures: Peer-to-Peer
• All nodes are equivalent
• Each can be either a client or server on demand (can store data and/or make requests)
• Ex: SHORE system
[Diagram: a peer node, acting as server or client]

Architectures: Strict Client-Server
• Client or server status is pre-defined and can never change
• Clients supply queries, servers supply data
• Most common architecture in commercial DBMSs
[Diagram: client (query source) and server (data source)]

Architectures: Middleware/Multitier
• Multiple levels of client-server interaction
• Nodes act as clients to those below them and servers to those above
• Ex: SAP R/3, web servers with DB backends
[Diagram: Node 1 is a client to Node 2; Node 2 is a server to Node 1 and a client to Node 3; Node 3 is a server to Node 2]

Architectures: Evaluation
• Peer-to-Peer
  ◦ Simplest setup
  ◦ Equal load sharing
• Strict Client-Server
  ◦ Specialization
  ◦ Administration for servers only
• Middleware/Multitier
  ◦ Functionality integration
  ◦ Scalability

Client-Server Query Processing
• Queries initiated at clients, data stored at servers
• Where do we execute the query?
• Query shipping: move the query down to the data
• Data shipping: move the data up to the query
• Hybrid shipping: combination of both

Query Shipping
• SQL query code is sent down to the server
• Server parses and evaluates query, returns result
• Used in DB2, Oracle, MS SQL Server

Data Shipping
• Client parses query and requests data from server
• Server provides data, then client executes query
• Data can be cached at client (main memory or disk)

Hybrid Shipping
• Mix-and-match data shipping and query shipping
• Query parts can be executed at any level according to query plan
• Data is cached when beneficial

Evaluation
• Query Shipping
  ◦ Reliant on server performance
  ◦ Scales poorly with increasing client load
• Data Shipping
  ◦ Good scalability
  ◦ High communication costs
• Hybrid
  ◦ Potential to outperform other options
  ◦ More complex optimizations

Hybrid Shipping Observations
• Some observations of optimal performance using hybrid shipping
• Preference to not use a client cache
  ◦ If network transfer cost < client access cost
• Shipping down cached data
  ◦ If in main memory & execution at server
• Multiple small updates
  ◦ Maintain at client and post to server only when necessary
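
The first and last observations can be sketched in a few lines of Python. This is only an illustration: `AccessCosts`, `use_client_cache`, and `UpdateBuffer` are hypothetical names, the cost numbers are assumed to come from some cost model, and `post_to_server` stands in for whatever call actually sends updates.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AccessCosts:
    network_transfer: float   # assumed cost of fetching the data from the server
    client_cache_read: float  # assumed cost of reading the same data from the client cache


def use_client_cache(costs: AccessCosts) -> bool:
    """Skip the client cache when the network transfer is strictly cheaper."""
    return not (costs.network_transfer < costs.client_cache_read)


class UpdateBuffer:
    """Keeps small updates at the client and posts them to the server in one batch."""

    def __init__(self, post_to_server: Callable[[List[Tuple[str, object]]], None]):
        self._pending: List[Tuple[str, object]] = []
        self._post = post_to_server  # hypothetical transport callback

    def update(self, key: str, value: object) -> None:
        self._pending.append((key, value))  # nothing is sent yet

    def flush(self) -> None:
        if self._pending:  # post only when there is something to send
            self._post(list(self._pending))
            self._pending.clear()
```

When `flush` is called is a policy decision left open here, for example at commit time or before a server-side operator needs the updated data.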

Query Optimization
• Query plans must also specify where the query pieces are executed
• Data shipping: all execution done at client
• Query shipping: all execution done at server
• Hybrid: choice can be made for each operator
• Results display to user is always at client
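
The site choices each policy leaves to the optimizer can be written down directly. The sketch below is a minimal illustration; the `Site` and `Policy` enums, the `allowed_sites` function, and the operator name `"display"` are all assumptions made for this example.

```python
from enum import Enum
from typing import Set


class Site(Enum):
    CLIENT = "client"
    SERVER = "server"


class Policy(Enum):
    DATA_SHIPPING = "data"
    QUERY_SHIPPING = "query"
    HYBRID_SHIPPING = "hybrid"


def allowed_sites(policy: Policy, operator: str) -> Set[Site]:
    """Return the sites an optimizer may consider for one operator."""
    if operator == "display":
        return {Site.CLIENT}  # results are always shown at the client
    if policy is Policy.DATA_SHIPPING:
        return {Site.CLIENT}  # all execution at the client
    if policy is Policy.QUERY_SHIPPING:
        return {Site.SERVER}  # all execution at the server
    return {Site.CLIENT, Site.SERVER}  # hybrid: decided per operator
```

Only the hybrid policy gives the optimizer a real decision to make, which is why its search space (and optimization cost) is larger.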

Distributed Query Plans
• Each operator is annotated with a logical site of execution – plans are shareable
• client means an operator is executed from the client where the query is issued
• server means:
  ◦ for scan operators, execute at a location that has the necessary data
  ◦ for updates, execute at all locations with the relevant data
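
Because annotations are logical, the same plan can be bound to different physical nodes depending on who issues the query. A rough sketch, assuming a hypothetical `catalog` that maps each table to the nodes holding a copy; `Operator` and `bind_sites` are illustrative names, and picking the first holder for a scan is an arbitrary choice.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Operator:
    kind: str        # e.g. "scan", "update", "display"
    site: str        # logical annotation: "client" or "server"
    table: str = ""  # table touched by scans and updates
    children: List["Operator"] = field(default_factory=list)


def bind_sites(op: Operator, issuing_client: str,
               catalog: Dict[str, List[str]]) -> List[str]:
    """Resolve one operator's logical annotation to physical node names."""
    if op.site == "client":
        return [issuing_client]  # whichever client issued the query
    holders = catalog[op.table]
    if op.kind == "update":
        return list(holders)     # updates run at every location with the data
    if op.kind == "scan":
        return holders[:1]       # scans run at one location that has the data
    raise ValueError("only scans and updates are covered in this sketch")
```

Binding the same `display` operator for two different issuing clients yields two different physical sites, which is what makes the plan shareable.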

Query Optimization: Where?
• Should optimization occur at the client or the server?
• At client: less load on servers, better scalability
• At server: more information about system statistics, especially server loads
• Potential solution: primary parsing and query rewriting at client, further optimization at server

Query Optimization: Statistics
• Even when optimization is done at a server, that server does not usually have full knowledge of the system
• System can either:
  ◦ Guess the status of other servers – less accuracy, less cost
  ◦ Ask other servers their status – fully accurate, additional communication costs
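
The guess-versus-ask tradeoff can be made concrete with a small sketch. Everything here is assumed for illustration: the `ServerStatus` class, the `probe` callback that stands in for a network round trip, and the default load used when nothing is known yet.

```python
from typing import Callable, Dict


class ServerStatus:
    """Tracks what one optimizing server believes about the load of the others."""

    def __init__(self, probe: Callable[[str], float], default_load: float = 0.5):
        self._probe = probe                    # hypothetical remote status call
        self._last_known: Dict[str, float] = {}
        self._default = default_load           # assumed load for unseen servers
        self.messages_sent = 0

    def guess(self, server: str) -> float:
        """No communication: reuse the last known value, which may be stale."""
        return self._last_known.get(server, self._default)

    def ask(self, server: str) -> float:
        """One extra message: fetch the current value and remember it."""
        self.messages_sent += 1
        load = self._probe(server)
        self._last_known[server] = load
        return load
```

`guess` never increments `messages_sent` but can return an outdated load, while `ask` is accurate at the moment of the call and pays one message per server.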

Query Optimization: When?
• Tradeoff of accuracy vs. cost
• Traditional-style: optimize once, store plan
  ◦ No support for changing DB conditions
  ◦ No incurred cost for query execution
• Plan sets: optimize for possible scenarios
  ◦ Generate a few query plans for different conditions
  ◦ Choose plans based on runtime statistics
• On-the-fly: observe intermediate results
  ◦ Re-optimize query if different from expectations
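
The plan-set option amounts to a lookup at execution time. In the sketch below, `CandidatePlan`, `PLAN_SET`, `choose_plan`, the `server_load` statistic, and the 0.7 threshold are all hypothetical; a real optimizer would derive the conditions from its cost model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CandidatePlan:
    name: str
    applies: Callable[[Dict[str, float]], bool]  # condition the plan was optimized for


# Generated once at compile time, one plan per anticipated scenario.
PLAN_SET: List[CandidatePlan] = [
    CandidatePlan("join_at_server", lambda stats: stats["server_load"] < 0.7),
    CandidatePlan("join_at_client", lambda stats: True),  # fallback scenario
]


def choose_plan(plans: List[CandidatePlan],
                stats: Dict[str, float]) -> CandidatePlan:
    """Pick the first precompiled plan whose condition matches runtime statistics."""
    for plan in plans:
        if plan.applies(stats):
            return plan
    return plans[-1]  # no condition matched: use the last plan as a default
```

For example, `choose_plan(PLAN_SET, {"server_load": 0.9})` selects the client-side plan without running the optimizer again.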

Query Optimization: Two-Step
• Compile-time: generate join order, etc.
• Runtime: perform site selection
• Reasonable cost at each end
• Responds well to changing server loads
• Fully utilizes client data caching
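
A toy version of the two steps is sketched below. The smallest-table-first ordering is only a stand-in for a real join-order optimizer, and the function names, the `client_cache` set, and the load threshold are assumptions for this example.

```python
from typing import Dict, List, Set, Tuple


def compile_time_join_order(cardinalities: Dict[str, int]) -> List[str]:
    """Step 1, done once: fix a join order (stand-in heuristic: smallest table first)."""
    return sorted(cardinalities, key=lambda table: (cardinalities[table], table))


def runtime_site_selection(join_order: List[str], client_cache: Set[str],
                           server_load: float,
                           load_threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Step 2, done before each execution: choose a site for each table's operators."""
    placement = []
    for table in join_order:
        # Use the client when the table is cached there or the server is overloaded.
        if table in client_cache or server_load > load_threshold:
            placement.append((table, "client"))
        else:
            placement.append((table, "server"))
    return placement
```

Note that step 1 never sees the cache contents or server load, so the join order it fixes may not be the best one for the sites chosen later.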

Two-Step Optimization: Downside
1. Optimal plan is generated traditional-style
2. Site selection is performed
3. True optimal plan was missed
• Optimal was missed because first optimization step was done with no knowledge of the system

Query Execution Techniques
• Standard fare: row blocking, multithread when possible
• Issues: transactions with both updates and retrieval queries using hybrid shipping
  ◦ We want to wait to propagate updates for efficiency’s sake
  ◦ Other option: perform query before update and temporarily pad results
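
Row blocking means shipping tuples in batches rather than one at a time, so that each network message carries many rows. A minimal sketch, with `row_blocks` and `block_size` as illustrative names:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def row_blocks(rows: Iterable[T], block_size: int) -> Iterator[List[T]]:
    """Group a stream of rows into blocks; each block would be one message."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    block: List[T] = []
    for row in rows:
        block.append(row)
        if len(block) == block_size:
            yield block          # block is full: ship it
            block = []
    if block:
        yield block              # ship the final, partially filled block
```

With five rows and a block size of two, the stream is sent as three messages instead of five.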

• Questions?
• Comments?