Challenges in Ubiquitous Data Management
Michael Franklin, UC Berkeley

Ubiquitous Computing
- “In ten years, billions of people will be using the Web, but a trillion "gizmos" will also be connected to the Web.” (Asilomar Rep. on DB Research, Dec. 1998)
- You’ve heard it before…
  - Wireless Internet-enabled devices are projected to soon outnumber wired Internet devices.
  - Many computing devices per person: smartphones, PDAs, smartcards, badges, wearables, light switches, toasters, …
© 2000 Michael J. Franklin

Ubiquitous Connectivity
- Tremendous improvements in Internet backbone bandwidth and reductions in diameter.
- Broadband connectivity to the home and office (i.e., the “last mile” problem) is being solved.
- Wireless technologies are enabling anytime-anywhere connectivity.

Ubiquitous Data Access
- But ubiquitous computing and connectivity aren’t worth much without ubiquitous data access.
- “Fundamentally, the ability to access all information from anywhere and have ONE unified and synchronized information repository is critical to making appliances useful.” (Hambrecht and Quist, iWord, 3/99)
- Ubiquitous data access will put existing data management techniques to the test in all aspects: searching, location, reliability, consistency, …

Ubiquitous Data: State of the Art
- Everyone uses a database system and/or search engine every day, although they may not realize it (the true test of “ubiquity”)!
- The Internet and WWW have become a ubiquitous means of global data dissemination and exchange.
  - Databases play a crucial but largely invisible role here.
- XML and related standards are enabling increasingly sophisticated interoperation.
- Wireless access provides anytime-anywhere access and enables location-centric applications.

Scenarios and Requirements
- Real “killer apps” have not yet emerged. Many in industry have begun to refer to a “user experience” rather than a particular app.
- Many of these scenarios are quite irritating
  - e.g., “buy milk now!!!!”
- Typical scenarios require three types of functionality:
  - Support for mobility: of users and data
  - Context awareness: what is the user trying to do?
  - Support for collaboration: varied and dynamic groups of people; real-time or asynchronous, …

Demands on Data Management
- A key requirement that emerges from all three of these categories is adaptivity.
  - movement/availability of data and people
  - continually changing contexts
  - dynamic groups and interactions
- A problem and a solution: “user-in-the-loop”
  - people can deal with ambiguity and conflict resolution.
  - requires a collaborative and responsive approach to information systems:
    - provide fast interactive performance
    - quickly respond to user direction.

Mobility
- Limited device capabilities:
  - storage & CPU, battery power, bandwidth, display, …
  - requires adjusting data delivery to these limits
- Varying and intermittent connectivity
  - requires proxies and smart data staging/pre-staging
  - requires global access to data
- Location-centric applications
  - “find open drugstores within two miles of my current location.”
  - must be able to deal with locations and distances
  - servers must track huge numbers of moving objects
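The drugstore query above can be sketched as a distance filter over store records. This is only an illustration, not a method from the talk: the record fields (`name`, `lat`, `lon`, `opens`, `closes`) and the function names are hypothetical, and a server tracking huge numbers of moving objects would need a spatial index rather than a linear scan.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def open_stores_within(stores, here, radius_miles, hour):
    """Names of stores open at `hour` and within `radius_miles` of `here`, nearest first.

    `stores` is a list of dicts with hypothetical keys: name, lat, lon, opens, closes.
    """
    lat, lon = here
    hits = []
    for s in stores:
        if not (s["opens"] <= hour < s["closes"]):
            continue  # closed at the requested hour
        d = haversine_miles(lat, lon, s["lat"], s["lon"])
        if d <= radius_miles:
            hits.append((d, s["name"]))
    return [name for _d, name in sorted(hits)]
```

A call such as `open_stores_within(stores, (37.8716, -122.2727), 2.0, hour=21)` answers the query for a user at that position at 9 pm.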

Context Awareness
- System must maintain an internal representation of the users’ needs, tasks, roles, preferences, etc.
  - requires “user profiles” and models
  - some information can be leveraged from PIM apps
- In some scenarios, e.g., “smart spaces”, system must continually monitor and react to changes in the environment:
  - requires processing streams of data from sensors, logs, etc.
- All require inferencing and learning techniques over dirty and incomplete data.

Collaboration
- Synchronization and consistency support
  - collaboration revolves around a set of shared data
  - requirements range from unmoderated chat rooms to complete ACID transactions
- Also need maintenance of history
  - to support asynchronous collaborations
  - to support changes in group membership
  - must be durable and highly available.

Two On-going Projects
- Two projects currently underway to address some of these issues (both part of “Endeavour”).
- Data Centers/Dissemination-Based Info Sys
  - Profile-based data management
  - includes “data recharging”
  - collaboration with Stan Zdonik at Brown and Mitch Cherniack at Brandeis
- Telegraph
  - adaptive query processing over data streams
  - with Joe Hellerstein at UC Berkeley

Data Centers Framework
- An architecture that combines data delivery techniques for responsive client access.
- 3 types of nodes:
  - Data sources
  - Clients
  - Information brokers (can add value)
- Any data delivery mode can be used.
  - Network transparency
- Dynamic

Delivery Options

                     Unicast               1-to-n
Pull, aperiodic      request/response      request/response w/snoop
Pull, periodic       polling               polling w/snoop
Push, aperiodic      email lists           publish/subscribe
Push, periodic       email list digests    publish/subscribe, broadcast disks

Network Transparency
[Figure: clients, brokers, and sources]
The type of a link matters only to nodes

DBIS Example
[Figure: an example with a DB server and a proxy cache; can vary dynamically]

“Data Recharging” for Weakly Connected Devices
- Mobile devices require 2 resources: power and data
  - It is impractical to be continuously connected to fixed sources of these.
- Devices cope with disconnection using caching:
  - Power cached in rechargeable batteries
  - Data cached in hot-synched memory
- Recharging the power is easy…
  - Anywhere, Anytime, “Hands-off” operation, Flexible connection duration

Data Recharging: Elevator Pitch
- Make recharging data as simple as recharging power:
  - Anywhere: no need to connect to your home machine
  - Anytime: no prior arrangements necessary
  - “Hands-off” operation: system knows what you need
  - Flexible connection duration: the longer you stay connected, the better your device-resident data gets.

Some Questions
- How to know where the user will be?
  - and do we care? (for context: yes; for staging: ??)
- How to know what the user wants?
- How to prioritize data delivery?
- The answer is User Profiles

“Data Recharging” Profiles
- Three main components:
  1) Content-based specifications of user interests (read “queries”)
  2) Specifications of user priorities/requirements: priority ordering, resolution, freshness, dependencies
  3) User context information: where, when, who, what
     → This info is available in the user’s PIM data!
- Profiles must be both specified explicitly and learned automatically.

First Cut at Profile Model
- Tasks, sub-tasks, and jobs
  - Dependencies and alternatives expressed in a tree
  - “Values” assigned and manipulated
- Two optimization problems:
  - Bounded (known) sync time
  - Unknown sync time
- Bounded case is an instance of the “precedence-constrained knapsack problem”
- The XFilter system allows us to process millions of standing queries over XML documents
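To make the bounded case concrete, here is a minimal sketch of a precedence-constrained knapsack: each item has a transfer cost and a value, an item may only be sent if its parent in the tree is sent too, and total cost must fit the sync budget. The item names, costs, and values are invented for illustration, and the exhaustive search is only meant to show the problem statement on small inputs; it is not the project's algorithm.

```python
from itertools import combinations


def best_sync_plan(items, parent, budget):
    """Brute-force precedence-constrained knapsack.

    items:  {name: (cost, value)}
    parent: {name: parent name, or None for a root}
    Returns (best total value, frozenset of chosen names).
    """
    names = sorted(items)
    best_value, best_set = 0, frozenset()
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            chosen = frozenset(combo)
            # Precedence: a chosen item requires its parent to be chosen.
            if any(parent[n] is not None and parent[n] not in chosen for n in chosen):
                continue
            cost = sum(items[n][0] for n in chosen)
            value = sum(items[n][1] for n in chosen)
            if cost <= budget and value > best_value:
                best_value, best_set = value, chosen
    return best_value, best_set


# Hypothetical profile: (cost in transfer-time units, value to the user).
ITEMS = {
    "calendar": (2, 5),
    "email_headers": (3, 4),
    "email_bodies": (6, 9),
    "map": (4, 3),
}
PARENT = {
    "calendar": None,
    "email_headers": None,
    "email_bodies": "email_headers",  # bodies are useless without headers
    "map": "calendar",                # map only matters for calendar entries
}
```

With a budget of 9 units, `best_sync_plan(ITEMS, PARENT, 9)` picks the e-mail headers and bodies; with a budget of 5 it falls back to calendar plus headers. The search is exponential in the number of items, so a real system would need a smarter algorithm or a heuristic.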

XFilter: An XML-Based SDI System
[Figure: XML documents from data sources (after XML conversion) and user profiles flow into the filter engine, which delivers filtered data to users.]
The challenge is to efficiently match incoming XML documents against the potentially huge set of user profiles.

Important XPath Features
- Parent/Child (‘/’) and Ancestor/Descendant (‘//’):
    /catalog/product//msrp
- Wildcards (match any single element):
    /catalog/*/msrp
- Element node filters to further refine the nodes:
  - Filters can contain nested path expressions
    //product[price/msrp < 300]/name
    (the filter is applied to the product element node)
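The three expressions can be tried on a small document with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The catalog below is made up for illustration. Paths are written relative to the root element, and because ElementTree cannot evaluate the numeric comparison in the filter, that part is done in plain Python.

```python
import xml.etree.ElementTree as ET

# Invented sample document for illustration.
CATALOG = """
<catalog>
  <product><name>Radio</name><price><msrp>120</msrp></price></product>
  <product><name>Laptop</name><price><msrp>950</msrp></price></product>
  <bundle><msrp>400</msrp></bundle>
</catalog>
"""
root = ET.fromstring(CATALOG)  # root is the <catalog> element

# /catalog/product//msrp : msrp at any depth below a product child
deep = [e.text for e in root.findall("./product//msrp")]

# /catalog/*/msrp : msrp exactly one wildcard step below catalog
wild = [e.text for e in root.findall("./*/msrp")]

# //product[price/msrp < 300]/name : the filter on the product node is
# evaluated by hand, since ElementTree has no "<" predicate.
cheap = [
    p.findtext("name")
    for p in root.iter("product")
    if any(float(m.text) < 300 for m in p.findall("./price/msrp"))
]

print(deep, wild, cheap)
```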

Architecture
[Figure: User profiles (XPath queries such as /a/b[c/d]/e, //d/*/*/e, /b/e, /a//b/c, //b/d/*/e, /c/*/d//e) go through the XPath Parser, which produces path nodes for the Query Index and profile info for the Profile Base. XML documents go through a SAX-based XML Parser, which sends element events to the Filter Engine. The Filter Engine reports successful queries, which yield the successful profiles and the filtered data.]

XML Parsing and Filtering
- Event-based XML parsing using the SAX API
- XML documents are converted to a linear sequence of events that drive the execution of the filter
- Callback functions are implemented to deal with the different events:
  - Start Element
  - End Element
  - Data
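As a rough illustration of the event model, the sketch below uses Python's standard `xml.sax` module to turn a document into a linear list of start, data, and end events tagged with the element depth. The class and function names are hypothetical; a filter would react to the callbacks directly instead of recording them.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records SAX callbacks as (kind, payload, depth) tuples."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def startElement(self, name, attrs):
        self.depth += 1
        self.events.append(("start", name, self.depth))

    def endElement(self, name):
        self.events.append(("end", name, self.depth))
        self.depth -= 1

    def characters(self, content):
        # Skip whitespace-only text between elements.
        if content.strip():
            self.events.append(("data", content.strip(), self.depth))


def to_events(xml_text):
    """Parse xml_text and return the recorded event sequence."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

The depth counter is the piece a path-matching filter cares about, since level checks compare it with the level stored for each query step.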

Filter Engine
- Tricky aspects of the XPath language:
  - Checking the order of elements in the queries
  - Handling wildcards and descendant operators
  - Evaluating filters that are applied to element nodes (nested path expressions)
- Solution:
  - Convert each XPath query into a Finite State Machine (FSM)
    - A profile is considered to be satisfied when its final state is reached
  - Index the states of the FSMs for efficient evaluation

FSM Representation
- Each element node is a state
- A state is represented using a Path Node structure:
  - Contains information to process the current state:
    - Compare the level of the element name in the input document with the level value of the path node
    - Evaluate the element node filter, if there is any
  - Locate the next path nodes for the state change in the FSM representation
    - Calculate the level values of the next states using relative distance values (in terms of levels) stored in the path nodes
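One possible way to derive path nodes from a simple path query is sketched below. It is an assumption-laden reading of the slide, not the XFilter code: element node filters are not handled, a `//` anywhere between two named elements is treated as "any distance", trailing wildcards are ignored, and the names `PathNode` and `decompose` are hypothetical. A level of -1 means "any level", and `None` stands for a level that is only known at run time.

```python
import re
from typing import NamedTuple, Optional

ANY_LEVEL = -1  # the element may appear at any depth


class PathNode(NamedTuple):
    query_id: str
    position: int             # 1-based index among the query's named elements
    element: str
    rel_dist: Optional[int]   # levels below the previous path node; None = NA
    level: Optional[int]      # fixed depth, ANY_LEVEL, or None = set at run time


def decompose(query_id, xpath):
    """Split a filter-free path such as '/a/b//c' into path nodes."""
    steps = re.findall(r"(//|/)?([^/]+)", xpath)  # (axis, name) pairs
    nodes, wildcards, descendant = [], 0, False
    for axis, name in steps:
        if axis != "/":        # '//' or a relative start: depth is not fixed
            descendant = True
        if name == "*":        # wildcards only add to the distance
            wildcards += 1
            continue
        if not nodes:          # first named element of the query
            rel, level = None, (ANY_LEVEL if descendant else wildcards + 1)
        elif descendant:
            rel, level = None, ANY_LEVEL
        else:
            rel, level = wildcards + 1, None
        nodes.append(PathNode(query_id, len(nodes) + 1, name, rel, level))
        wildcards, descendant = 0, False
    return nodes
```

For example, `decompose("Q2", "//b/*/c/d")` yields a first node for `b` at any level, then `c` at relative distance 2, then `d` at relative distance 1.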

Handling Multiple Queries
- Key insight for scalable profile matching:
  - Index the queries instead of the data
- Hash table based on the element names in the queries
- Each node contains two lists of path nodes:
  - Candidate List: stores the path nodes that represent the current state of each query
  - Wait List: stores the path nodes that represent the future states
- A state transition is represented by promoting a path node from the Wait List to the Candidate List
- Initial distribution of path nodes has a significant impact on performance

Examples

Q1 = /a/b//c
Q2 = //b/*/c/d
Q3 = /*/a/c//d
Q4 = b/d/e
Q5 = /a/*/*/c//e

Path node   Query Id   Position   Rel Dist   Level
Q1-1        Q1         1          NA         1
Q1-2        Q1         2          1          ?
Q1-3        Q1         3          NA         -1
Q2-1        Q2         1          NA         -1
Q2-2        Q2         2          2          ?
Q2-3        Q2         3          1          ?
Q3-1        Q3         1          NA         2
Q3-2        Q3         2          1          ?
Q3-3        Q3         3          NA         -1
Q4-1        Q4         1          NA         -1
Q4-2        Q4         2          1          ?
Q4-3        Q4         3          1          ?
Q5-1        Q5         1          NA         1
Q5-2        Q5         2          3          ?
Q5-3        Q5         3          NA         -1

Query Index Construction

Element hash table (CL: Candidate List, WL: Wait List)

Element   CL                  WL
a         Q1-1, Q3-1, Q5-1    (empty)
b         Q2-1, Q4-1          Q1-2
c         (empty)             Q1-3, Q2-2, Q3-2, Q5-2
d         (empty)             Q2-3, Q3-3, Q4-2
e         (empty)             Q4-3, Q5-3
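The candidate/wait list mechanism can be sketched in a few dozen lines. The code below is a simplified reading of the slides rather than the XFilter implementation: path nodes for Q1 and Q2 are written out by hand as (element, relative distance, level) triples, element node filters are ignored, promotions are undone with a per-element undo stack (an implementation choice made here), and the class and function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

ANY = -1  # level value meaning "any depth"

# Hand-written path nodes: (element, rel_dist, level).
# rel_dist None = NA; level None = "?" (computed when the node is promoted).
QUERIES = {
    "Q1": [("a", None, 1), ("b", 1, None), ("c", None, ANY)],   # /a/b//c
    "Q2": [("b", None, ANY), ("c", 2, None), ("d", 1, None)],   # //b/*/c/d
}


class QueryIndex:
    def __init__(self, queries):
        self.queries = queries
        self.candidates = defaultdict(list)  # element -> [(qid, pos, level)]
        self.wait = defaultdict(list)        # element -> [(qid, pos)]
        for qid, nodes in queries.items():
            for pos, (elem, _rel, level) in enumerate(nodes):
                if pos == 0:
                    self.candidates[elem].append((qid, pos, level))
                else:
                    self.wait[elem].append((qid, pos))
        self.undo = []       # promotions made by each currently open element
        self.depth = 0
        self.matched = set()

    def start_element(self, name):
        self.depth += 1
        promoted = []
        for qid, pos, level in list(self.candidates[name]):
            if level != ANY and level != self.depth:
                continue                      # level check failed
            nodes = self.queries[qid]
            if pos == len(nodes) - 1:
                self.matched.add(qid)         # final state reached
                continue
            # Promote a copy of the next path node (it stays in the wait list).
            nxt_elem, rel, _ = nodes[pos + 1]
            nxt_level = ANY if rel is None else self.depth + rel
            entry = (qid, pos + 1, nxt_level)
            self.candidates[nxt_elem].append(entry)
            promoted.append((nxt_elem, entry))
        self.undo.append(promoted)

    def end_element(self, name):
        # Leaving an element cancels the promotions it caused.
        for elem, entry in self.undo.pop():
            self.candidates[elem].remove(entry)
        self.depth -= 1


def filter_document(xml_text, queries):
    """Return the ids of the queries matched by one XML document."""
    index = QueryIndex(queries)
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            index.start_element(elem.tag)
        else:
            index.end_element(elem.tag)
    return index.matched
```

Each start event touches only the candidate list of one element name, which is the point of indexing the queries instead of the data: the work per event depends on the queries currently waiting for that element, not on the total number of profiles.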

Data Centers: Research Agenda
- Profile Definition and Maintenance
- Update Storage and Preparation
- Efficient integration of "recharge" updates with existing cached data.
  - Recharge, Trickle Charge, Jump Start, ...
- Consistency Guarantees
- Global Data Staging
- Approaches will be driven by (mostly PIM) applications.

Telegraph: An Adaptive Dataflow Engine
- Dataflow because that’s what data does…
  - data streaming from sensors
  - real-time processing of streams: update stream, click-stream, swipe-stream, …
  - siphon data from the “deep web”
  - “continuous queries” for dissemination-based apps
- Adaptivity due to volatility…
  - sensor nets
  - wide area internet
  - dynamic caching, replication, and staging
  - user-in-the-loop interfaces
  - mobile users and devices
- Joint work at UC Berkeley with Joe Hellerstein

Wide-area + Wrapped Sources Unpredictability
- Sources may be unreachable or slow to respond.
- Data delivery may be:
  - slower than expected
  - bursty
  - interrupted
- Data statistics/cost estimates may be unavailable or unreliable due to poor interfaces or crossing administrative domains.

User-in-the-loop Unpredictability
- Batch processing is inappropriate for many apps.
  - especially when searching the Internet
- Must provide feedback to the user as quickly as possible.
- Data access becomes a cooperative, iterative process:
  - User may correct/redirect the query.
  - User may refine/change the query.

Mobility & Data Streams Unpredictability
- Mobility
  - Location-centric queries
  - Moving endpoints change data staging needs
- Data Streams/Sensors
  - Varying data arrival rates
  - Adapting resolutions
  - Push vs. Pull

Some Solutions
- Adaptive Query Processing
  - Query Scrambling: “Reactive Query Execution”
  - XJoin: non-blocking, reactive query operator.
  - Eddies: Continuous Query Optimization
- Risk-Aware Query Planning
  - Producing robust plans or partial plans.
- Exploiting Alternative Sources
  - Mirrors or “not exactly”.
- Relaxing Query Semantics
  - Partial, Fuzzy, or Alternative answers

Query Scrambling Example
[Figure: query scrambling example over relations A, B, C, D, and E]

XJoin
[Figure: hash join in which Source A builds a hash table and Source B probes it]
- Traditional hash joins block when one input stalls.
- Symmetric Hash Join (SHJ) blocks only if both inputs stall.
- XJoin partitions data -> small footprint -> full pipelining & bushy plans -> higher adaptability.
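The difference between a build-then-probe hash join and a symmetric one is easiest to see in code. The sketch below is an in-memory symmetric hash join only; the partitioning and spilling to disk that distinguish XJoin are not shown, and the function name and the `(side, row)` stream encoding are assumptions made for the example.

```python
from collections import defaultdict


def symmetric_hash_join(stream, key_a, key_b):
    """Join rows arriving interleaved from two inputs.

    stream yields ("A", row) or ("B", row) in arrival order.
    Each arriving row is inserted into its own side's hash table and
    immediately probed against the other side's table, so results are
    produced as soon as both matching rows have arrived.
    """
    table_a, table_b = defaultdict(list), defaultdict(list)
    for side, row in stream:
        if side == "A":
            k = key_a(row)
            table_a[k].append(row)
            for other in table_b.get(k, []):
                yield (row, other)
        else:
            k = key_b(row)
            table_b[k].append(row)
            for other in table_a.get(k, []):
                yield (other, row)
```

Because neither input has to finish before probing starts, a stall on one source delays only the results that depend on its missing rows; the other source keeps filling its table in the meantime.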

Eddy: Continuous Optimization
[Figure: an eddy routing tuples from inputs R, S, and T among the operators Join RS and Join ST]
- Flow-based (“Rivers”)
- Tuples are routed via a ticket-based scheme and backpressure.
- Hellerstein and Avnur 99
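A toy version of ticket-based routing is sketched below. It is heavily simplified and should be read as an assumption about the general idea, not as the Telegraph implementation: operators are plain selection predicates instead of joins, backpressure is not modeled, and the function name, the seeding, and the "+1" smoothing of lottery weights are choices made for this example. An operator is credited a ticket when it receives a tuple and debited one when it passes the tuple back, so operators that drop many tuples accumulate tickets and tend to be tried first.

```python
import random


def eddy(tuples, operators, seed=0):
    """Route each tuple through all predicates in a lottery-chosen order.

    Returns (tuples that passed every operator, final ticket counts).
    """
    rng = random.Random(seed)
    tickets = [0] * len(operators)
    out = []
    for t in tuples:
        pending = set(range(len(operators)))
        alive = True
        while pending and alive:
            ops = sorted(pending)
            weights = [tickets[i] + 1 for i in ops]   # +1 keeps every operator eligible
            i = rng.choices(ops, weights=weights)[0]  # lottery draw
            tickets[i] += 1                           # credit: operator consumed a tuple
            if operators[i](t):
                tickets[i] -= 1                       # debit: tuple returned to the eddy
            else:
                alive = False                         # tuple dropped by this operator
            pending.discard(i)
        if alive:
            out.append(t)
    return out, tickets
```

The routing order changes per tuple as ticket counts drift, but the result is the same set of tuples a fixed plan would produce; only the amount of wasted work differs.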

Adaptive Approaches
Spectrum of approaches, from least to most adaptive:
  static plans: current DBMS
  late binding: Dynamic, Parametric, Competitive, …
  reopt.: Query Scrambling, XJoin
  continuous opt.: Eddy
  anarchy: ???
- Increased uncertainty argues for increased adaptivity.
  - Wide-area nets and admin domains introduce uncertainty.
  - Pesky users introduce uncertainty.
  - Mobility and streams introduce uncertainty.
- Implications for data-intensive Internet services.

Conclusions
- We need to build more intelligent systems to protect humans from the data flood, but good old systems performance issues still matter too.
- No killer app for Ubiquitous Data Access yet; may be the killer “user experience”
  - Scenarios give us a common (and challenging!) set of requirements for data management: adaptivity, context-awareness, global-scale, …
- The Data Centers and Telegraph projects are addressing key data management technologies for supporting ubiquitous access to data.