BMQ-Index: Shared and Incremental Processing of Border Monitoring Queries over Data Streams
Jinwon Lee, Y. Lee, S. Kang, S. Lee, H. Jin, B. Kim, and J. Song (Korea Advanced Institute of Science and Technology)

Outline
• Border Monitoring Query (BMQ)
• BMQ-Index
• Experiments
• Related work
• Conclusion

Emerging Computing Environment
• Data stream monitoring
  – Sources such as sensors and GPSs emit data streams (e.g., 10, 12, 13, 12, 14, 11)
  – Continuous range queries are evaluated over the streams (Q1: 10 < value; Q2: 11 < value < 13; ...)
• Application areas
  – Logistics: management, thief-proofing
  – Catalog advertisement
  – Remote medical service
  – Disaster prevention: flood warning, earthquake prediction, building monitoring
  – Traffic light control
  – Location-based service: tracking (friends, employees), vehicle monitoring, intelligent transportation
  – Automatic home: automatic ventilation, automatic temperature control, automatic humidity control

Motivating Service Scenario #1
• Stock trading
  – Sell when the price is expensive!! (> $640)
  – Buy when the price is cheap!! (< $600)
  – Monitor stock data streams crossing the borders!!
[Figure: SAMSUNG stock price over time, during 23 days from Nov. 16th to Dec. 23rd, 2005]

Motivating Service Scenario #2
• Location-based advertisement
  – Send a special lunch menu to people within 1 km during the lunch time!!
  – Monitor location data streams crossing the borders!!
[Figure: people coming into and going out from the query range; labels: Pet-Care, Coupon]

Border Monitoring Query
• To monitor data streams crossing the borders
  – Essential concern in many practical applications
    ▪ Users' main interest
    ▪ Useful to automatically trigger or stop relevant actions
• BMQ (Border Monitoring Query)
  – A new type of continuous range query!!
  – It reports only data crossing the borders of a query range (= coming into or going out from the query range)
• RMQ (Region Monitoring Query)
  – Conventional continuous range query
  – It reports all matching data within a query range
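The difference between the two query types can be sketched in a few lines of Python. This is only an illustration, not code from the BMQ-Index work: the function names `rmq_report` and `bmq_report`, the strict bounds, and the sample stream are assumptions made here.

```python
def in_range(value, low, high):
    """True if value lies strictly inside the query range (low, high)."""
    return low < value < high


def rmq_report(stream, low, high):
    """RMQ semantics: report every tuple that falls within the range."""
    return [v for v in stream if in_range(v, low, high)]


def bmq_report(stream, low, high):
    """BMQ semantics: report only tuples that cross a border of the range.

    '+' marks a tuple coming into the range, '-' a tuple going out of it.
    The first tuple has no predecessor, so it is never reported here
    (an assumption of this sketch).
    """
    events = []
    for prev, cur in zip(stream, stream[1:]):
        was_in, is_in = in_range(prev, low, high), in_range(cur, low, high)
        if is_in and not was_in:
            events.append(("+", cur))
        elif was_in and not is_in:
            events.append(("-", cur))
    return events


# Hypothetical stream evaluated against the range 11 < value < 13.
stream = [9, 11.5, 12, 12.5, 14]
rmq_result = rmq_report(stream, 11, 13)
bmq_result = bmq_report(stream, 11, 13)
```

On this stream the RMQ reports the three tuples inside the range, while the BMQ reports only the two tuples at which the stream comes into and goes out from the range.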

Problem: Scalability!!
• A large number of BMQs can be issued
  – Millions of stock investors will register their own queries
  – Millions of stores will register their own queries
  + A huge volume of data streams is arriving rapidly
  + Fast response is also essential for users
• How can we process BMQs over data streams efficiently?
  – (1) Naïve approach
    ▪ Individual BMQ processing at each data update: lack of scalability!!
  – (2) Based on existing mechanisms for RMQ evaluation
    ▪ Shared RMQ processing by indexing queries: costly post-processing!!
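The two baseline approaches can be sketched as follows. This is a rough illustration under assumptions: the slide does not spell out the post-processing step, so it is modeled here as a comparison of the previous and current RMQ match sets, and `rmq_lookup` is a hypothetical stand-in for a real RMQ index. The query names and ranges are made up.

```python
def naive_bmq(queries, prev, cur):
    """Approach (1): test every registered query at every data update."""
    crossed = set()
    for qid, (low, high) in queries.items():
        if (low < prev < high) != (low < cur < high):
            crossed.add(qid)
    return crossed


def rmq_lookup(queries, value):
    """Stand-in for a shared RMQ index: all queries whose range holds value.

    A real RMQ index would answer this without scanning every query; the
    scan is used here only to keep the sketch self-contained.
    """
    return {qid for qid, (low, high) in queries.items() if low < value < high}


def rmq_based_bmq(queries, prev_matches, cur):
    """Approach (2): RMQ lookup followed by post-processing.

    The post-processing compares the current match set with the previous
    one (assumed here), so its cost grows with the number of matching
    queries even when no border was crossed.
    """
    cur_matches = rmq_lookup(queries, cur)
    crossed = prev_matches ^ cur_matches  # symmetric difference
    return crossed, cur_matches


# Hypothetical queries and a single update from 12 to 18.
queries = {"A": (10, 20), "B": (15, 30), "C": (40, 50)}
naive = naive_bmq(queries, prev=12, cur=18)
crossed, matches = rmq_based_bmq(queries, rmq_lookup(queries, 12), 18)
```

Both functions return the same border-crossed set; they differ in where the work goes. The first touches every query on every update, and the second has to materialize and compare full match sets.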

Solution Approach: BMQ-Index
• Shared processing
  – By query indexing approach
    ▪ BMQ-Index is built on registered BMQs
    ▪ Upon a data arrival, only border-crossed queries are quickly searched for
  – Achieves a high level of scalability!!
[Figure: a data tuple (14) is fed into the BMQ-Index built on the registered BMQs (Q1: 10 < value; Q2: 11 < value < 13; ...), which outputs the border-crossed queries (Q1, Q2)]

Solution Approach: BMQ-Index
• Incremental processing
  – By incremental access method
    ▪ Use the previous search step for the next search: successive searches are significantly accelerated!!
    ▪ Keep only the information needed for incremental search: low storage cost!!
  – Locality of data streams!!
[Figure: a series of data tuples (10, 12, 13, 12, 14) is fed into the BMQ-Index built on the registered BMQs (Q1: 10 < value; Q2: 11 < value < 13; ...), which outputs the border-crossed queries (Q1, Q2)]

One-dimensional BMQ-Index (Example)
• Example request: "Notify me whenever the IBM stock price is coming into or going out from my reasonable price range!!" (reasonable price range: $10 to $30)
[Figure: a Stream Table maps each Stream_ID (e.g., IBM) to a node pointer into a linked list with borders at 0, 5, 10, 15, 20, 25, 30, 35, 45, and ∞ (unit: $). Borders carry query labels such as +Q1, +Q2, +Q3, +Q4, and +Q5. The registered BMQs Q1 to Q5 are drawn as ranges over the same axis.]

Search Operation in One-dimension (Example)
Notation: v_{t-1} is the previous data value and v_t is the current data value.
• Case 1) 21 → 23
  – No border-crossed query
  – No node traversal
• Case 2) 21 → 37
  – Traverse BMQ-Index to the right
  – Result: -Q2, -Q4, +Q5
• Case 3) 21 → 8
  – Traverse BMQ-Index to the left
  – Result: +Q3, -Q4, +Q1
[Figure: the one-dimensional BMQ-Index for stream IBM, with borders at 0, 5, 10, 15, 20, 25, 30, 35, 45, and ∞, and the values 8, 21, 23, and 37 marked on it]
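The three cases can be reproduced with a small sketch of an incremental border search. It is not the authors' implementation: a sorted Python list stands in for the linked list, an integer position stands in for the node pointer, ranges are treated as half-open (low <= value < high), and the first tuple of a stream reports nothing. The query ranges in `QUERIES` are not recoverable from the slide and are assumed here, chosen only so that the three cases give the results listed above.

```python
from bisect import bisect_right


class BMQIndex1D:
    """Sketch of a one-dimensional border index with incremental search."""

    def __init__(self, queries):
        # Map each border to the queries that start ('+') or end ('-')
        # there, as seen when moving from left to right.
        marks = {}
        for qid, (low, high) in queries.items():
            marks.setdefault(low, []).append(("+", qid))
            marks.setdefault(high, []).append(("-", qid))
        self.borders = sorted(marks)
        self.marks = [marks[b] for b in self.borders]
        # Stream table: stream id -> number of borders <= the last value.
        # This position plays the role of the node pointer.
        self.pointer = {}

    def update(self, stream_id, value):
        """Return the border-crossed queries for a new value of a stream."""
        if stream_id not in self.pointer:
            # First tuple: locate the position once; nothing is reported.
            self.pointer[stream_id] = bisect_right(self.borders, value)
            return []
        pos = self.pointer[stream_id]
        crossed = []
        # Traverse to the right while the next border is at or below value.
        while pos < len(self.borders) and self.borders[pos] <= value:
            crossed += [sign + qid for sign, qid in self.marks[pos]]
            pos += 1
        # Traverse to the left while the previous border is above value;
        # crossing in this direction flips the sign of each mark.
        while pos > 0 and self.borders[pos - 1] > value:
            pos -= 1
            crossed += [("-" if sign == "+" else "+") + qid
                        for sign, qid in self.marks[pos]]
        self.pointer[stream_id] = pos
        return crossed


# Assumed ranges, chosen only to be consistent with the three cases.
QUERIES = {"Q1": (0, 10), "Q2": (5, 25), "Q3": (5, 20),
           "Q4": (15, 30), "Q5": (35, 45)}


def run_case(prev, cur):
    index = BMQIndex1D(QUERIES)
    index.update("IBM", prev)
    return index.update("IBM", cur)


case1 = run_case(21, 23)
case2 = run_case(21, 37)
case3 = run_case(21, 8)
```

The work per update is proportional to the number of borders between the previous and the current value, so a small move such as 21 → 23 touches no border at all.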

Multi-dimensional BMQ-Index
Stream Table
  Stream ID   V             PX      PY
  s1          (vX1, vY1)    RS-X2   RS-Y2
  s2          (vX2, vY2)    RS-X3   RS-Y5
  s3          (vX3, vY3)    RS-X5   RS-Y4
Query Table
  Query ID    Range
  Q1          (bX1, bX3, bY1, bY4)
  Q2          (bX2, bX6, bY2, bY6)
  Q3          (bX4, bX5, bY3, bY5)
RS-X List (borders bX0 to bX7)
  Segment     RS-X1   RS-X2   RS-X3   RS-X4   RS-X5   RS-X6   RS-X7
  +DQSet-Xi   {}      {Q1}    {Q2}    {}      {Q3}    {}      {}
  -DQSet-Xi   {}      {}      {}      {Q1}    {}      {Q3}    {Q2}
RS-Y List (borders bY0 to bY7)
  Segment     RS-Y1   RS-Y2   RS-Y3   RS-Y4   RS-Y5   RS-Y6   RS-Y7
  +DQSet-Yi   {}      {Q1}    {Q2}    {Q3}    {}      {}      {}
  -DQSet-Yi   {}      {}      {}      {}      {Q1}    {Q3}    {Q2}
[Figure: the query rectangles Q1, Q2, and Q3 and the stream points v(s1), v(s2), v1(s3), v2(s3), and v3(s3) drawn on the X-Y plane]
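The +DQSet and -DQSet entries of the RS lists follow mechanically from the Query Table. The sketch below shows one way to derive them; the function name `build_rs_list` and the convention that segment i lies between border i-1 and border i are assumptions read off the slide's figure, not a stated definition. Borders such as bX3 are written as their index 3.

```python
# Query table from the slide, with each border written as its index.
QUERY_TABLE = {
    "Q1": {"X": (1, 3), "Y": (1, 4)},
    "Q2": {"X": (2, 6), "Y": (2, 6)},
    "Q3": {"X": (4, 5), "Y": (3, 5)},
}


def build_rs_list(query_table, dim, num_segments=7):
    """Return {segment number: (+DQSet, -DQSet)} for one dimension.

    Segment i is taken to lie between border i-1 and border i, so a query
    with borders (low, high) is entered on moving into segment low+1 and
    left on moving into segment high+1.
    """
    rs_list = {i: (set(), set()) for i in range(1, num_segments + 1)}
    for qid, ranges in query_table.items():
        low, high = ranges[dim]
        rs_list[low + 1][0].add(qid)   # +DQSet of the first covered segment
        rs_list[high + 1][1].add(qid)  # -DQSet of the first segment past it
    return rs_list


rs_x = build_rs_list(QUERY_TABLE, "X")
rs_y = build_rs_list(QUERY_TABLE, "Y")
```

Each query appears exactly twice per dimension, once in a +DQSet and once in a -DQSet, regardless of how wide its range is.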

Search Operation in Multi-dimension
• Overall flow for a new point (xc, yc)
  – Per-dimension search: xc → RS-X list.search() → ±XQSet; yc → RS-Y list.search() → ±YQSet
  – Validation through cross-check: ±XQSet is cross-checked with the Y-dimension → ±XBMQSet; ±YQSet is cross-checked with the X-dimension → ±YBMQSet
  – Union of per-dimension results: ±XBMQSet ∪ ±YBMQSet → ±QSet
• Performance analysis (d-dimension)
  – Search performance
    ▪ (d - 1) · d × (one-dimensional search time)
  – Storage cost
    ▪ d × (one-dimensional storage cost)

Experiments
• Workload generation
  – Stock trading scenario (one-dimensional case)
    ▪ Data stream generation (Korea stock market [9])
      Fluctuation level (FL): 0.01% ~ 0.1%
      2000 stream sources, 1000 tuples in each stream
    ▪ Query generation
      Lower bound: randomly chosen (1 ~ 10^6)
      Width of queries: 1 ~ 10 times larger than FL
      Number of queries: 10,000 ~ 100,000
• Comparisons
  – An approach based on state-of-the-art RMQ indexes (CEI [CIKM'05] and IS-list [Information Systems'96])
• Performance metrics
  – Average search time per data tuple (millisecond)
  – Index storage size (Mbyte)
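A workload with these parameters could be approximated as below. This is a guess at the setup, not the experiment's generator: the slide draws its streams from Korea stock market data [9], whereas this sketch substitutes a synthetic random walk, and it assumes that FL bounds the relative change between consecutive tuples and that a query's width is FL times a factor between 1 and 10, relative to its lower bound. The names `make_stream` and `make_queries` are hypothetical.

```python
import random


def make_stream(rng, start, length, fl):
    """Random walk whose relative step is at most fl (0.0001 = 0.01%)."""
    values = [start]
    for _ in range(length - 1):
        step = rng.uniform(-fl, fl)
        values.append(values[-1] * (1 + step))
    return values


def make_queries(rng, count, fl, max_lower=10**6):
    """Queries with a random lower bound and a width of 1 to 10 times FL."""
    queries = []
    for _ in range(count):
        low = rng.randint(1, max_lower)
        width = low * fl * rng.uniform(1, 10)
        queries.append((low, low + width))
    return queries


rng = random.Random(0)  # fixed seed so that runs are repeatable
FL = 0.0001             # fluctuation level of 0.01%
# Three streams only, to keep the example small (the slide uses 2000).
streams = [make_stream(rng, rng.randint(1, 10**6), 1000, FL)
           for _ in range(3)]
queries = make_queries(rng, 10_000, FL)
```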

Search performance
[Figure: two plots of average search time. Panels: effects of the number of queries (W = 0.1%, FL = 0.01%); effects of the widths of queries (N = 100000, FL = 0.01%).]

Storage cost
[Figure: two plots of index storage size. Panels: effects of the number of queries (W = 0.1%); effects of the widths of queries (N = 100000).]
• Storage per query
  – BMQ-Index: twice
  – IS-list: log(# of queries) times
  – CEI: all grids covered by a query range

Related Work
• Semantics
  – CQL (Continuous Query Language developed by STREAM project)
    ▪ General concept to transform a Relation to a Stream
    ▪ BMQ is a specific class of continuous range query
• Shared and Incremental Processing (previous research and its difference)
  – Data stream processing: generally not feasible for BMQs!!
    ▪ Tree-based (1-D: [2][4][5][14]): O(log N) search performance; O(N log N) storage cost
    ▪ Grid-based (1-D: [17], 2-D: [6][13]): better search performance than tree-based; require more storage cost
  – Spatiotemporal database
    ▪ SINA [11] (shared and incremental): disk-based algorithm; not purely incremental access method
    ▪ GPAC [12] (incremental): not for shared processing

Conclusion
• Summary
  – Characterize a new type of continuous range query
    ▪ Border Monitoring Query (BMQ)
    ▪ Useful and practical in many emerging applications
  – One- and multi-dimensional BMQ-Index
    ▪ Evaluates a large number of BMQs in a shared and incremental manner, thereby achieving excellent search performance and low storage cost

Thank you
Questions?

Backup slide

Performance Analysis
• 1-dimensional BMQ-Index
  – Search performance
    ▪ 2 · Nq · FL
  – Storage cost
    ▪ 2 · Nq + Nd
• d-dimensional BMQ-Index
  – Search performance
    ▪ (d - 1) · d · 2 · Nq · FL, only 2 times the one-dimensional cost when d = 2
  – Storage cost
    ▪ d · (2 · Nq + Nd) + Nq
• Nq = number of queries, Nd = number of data streams
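To get a feel for these expressions, they can be evaluated with the largest settings from the experiments (Nq = 100,000, Nd = 2000, FL = 0.01%). The sketch below assumes that FL enters the formula as a fraction (0.01% = 0.0001) and that the results are in the slide's abstract cost units rather than milliseconds or bytes. The function names are hypothetical.

```python
def search_cost(num_queries, fluctuation, dims=1):
    """Search cost: 2*Nq*FL for one dimension, times (d-1)*d for d >= 2."""
    one_dim = 2 * num_queries * fluctuation
    return one_dim if dims == 1 else (dims - 1) * dims * one_dim


def storage_cost(num_queries, num_streams, dims=1):
    """Storage cost: 2*Nq + Nd for one dimension, d*(2*Nq + Nd) + Nq for d >= 2."""
    one_dim = 2 * num_queries + num_streams
    return one_dim if dims == 1 else dims * one_dim + num_queries


NQ, ND, FL = 100_000, 2000, 0.0001

search_1d = search_cost(NQ, FL)            # about 20
search_2d = search_cost(NQ, FL, dims=2)    # about 40, twice the 1-D cost
storage_1d = storage_cost(NQ, ND)          # 202000
storage_2d = storage_cost(NQ, ND, dims=2)  # 504000
```

The one-dimensional case is handled separately because the d-dimensional expressions do not reduce to the one-dimensional ones at d = 1.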

Cross checking
• Algorithm
  – For +XQSet
    ▪ Check whether v_t is located between the Y predicates
  – For -XQSet
    ▪ Check whether v_{t-1} is located between the Y predicates
  – ±YQSet is checked against the X-dimension in a similar manner
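A two-dimensional search with this cross-check might look like the sketch below. It is an illustration under assumptions: a linear scan (`crossed_1d`) stands in for the incremental RS-list search, ranges are half-open, and the query rectangles are hypothetical, with coordinates that simply reuse the border indices of the multi-dimensional example.

```python
def crossed_1d(queries, dim, prev, cur):
    """Per-dimension search: queries entered (+) or left (-) in one dimension.

    A linear scan stands in for the incremental RS-list traversal.
    """
    plus, minus = set(), set()
    for qid, ranges in queries.items():
        low, high = ranges[dim]
        was_in, is_in = low <= prev < high, low <= cur < high
        if is_in and not was_in:
            plus.add(qid)
        elif was_in and not is_in:
            minus.add(qid)
    return plus, minus


def inside(queries, qid, dim, value):
    """True if value satisfies the query's predicate in the given dimension."""
    low, high = queries[qid][dim]
    return low <= value < high


def search_2d(queries, prev, cur):
    """Per-dimension search, validation through cross-check, then union."""
    X, Y = 0, 1
    plus_x, minus_x = crossed_1d(queries, X, prev[X], cur[X])
    plus_y, minus_y = crossed_1d(queries, Y, prev[Y], cur[Y])
    # +XQSet: the current point must lie between the Y predicates.
    plus = {q for q in plus_x if inside(queries, q, Y, cur[Y])}
    # -XQSet: the previous point must lie between the Y predicates.
    minus = {q for q in minus_x if inside(queries, q, Y, prev[Y])}
    # YQSet is checked against the X-dimension in the same way.
    plus |= {q for q in plus_y if inside(queries, q, X, cur[X])}
    minus |= {q for q in minus_y if inside(queries, q, X, prev[X])}
    return plus, minus


# Hypothetical rectangles: ((x_low, x_high), (y_low, y_high)).
RECTS = {"Q1": ((1, 3), (1, 4)), "Q2": ((2, 6), (2, 6)), "Q3": ((4, 5), (3, 5))}

moved = search_2d(RECTS, prev=(2.5, 2.5), cur=(4.5, 3.5))
filtered = search_2d(RECTS, prev=(0.5, 5.5), cur=(1.5, 5.5))
```

In the second call the point crosses the X border of Q1, but its Y value lies outside Q1's Y range, so the cross-check discards the candidate and nothing is reported.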