Multidimensional Data Many applications of databases are

Multidimensional Data • Many applications of databases are ``geographic'' = 2 dimensional data. Others involve large numbers of dimensions. • Example: data about sales. - A sale is described by (store, day, item, color, size, etc. ). Sale = point in 5 dim space. - A customer is described by (age, salary, pcode, marital status, etc. ). Typical Queries • Range queries: ``How many customers for gold jewelry have age between 45 and 55, and salary less than 100 K? '' • Nearest neighbor : ``If I am at coordinates (x, y), what is the nearest Mc. Donalds. '' • They are expressible in SQL. Do you see how?

SQL • Range queries: ``How many customers for gold jewelry have age between 45 and 55, and salary less than 100 K? '‘ SELECT * FROM Customers WHERE age>=45 AND age<=55 AND sal<100; • Nearest neighbor : ``If I am at coordinates (a, b), what is the nearest Mc. Donalds. '‘ Suppose we have a relation Points(x, y, name) SELECT * FROM Points p WHERE p. name=‘Mc. Donalds’ AND NOT EXISTS ( SELECT * FROM POINTS q WHERE (q. x a)*(q. x a)+(q. y b)*(q. y b) < (p. x a)*(p. x a)+(p. y b)*(p. y b) AND q. name=‘Mc. Donalds’ );

Big Impediment • For these types of queries, there is no clean way to eliminate lots of records that don't meet the condition of the WHERE clause. An Approach Index on attributes independently. - Intersect pointers in main memory to save disk I/O.

Attempt at using B-trees for MD-queries • Database = 1, 000 points evenly distributed in a 1000× 1000 square. Stored in 10, 000 blocks (100 recs per block) • B tree indexes on x and on y Range query {(x, y) : 450 x 550, 450 y 550} • 100, 000 pointers (i. e. 1, 000/10) for the x range, and same for y • 10, 000 pointers for answer (found by pointersection) • Retrieve 10, 000 records. If they are stored randomly we need to do 10, 000 I/O’s. Add here the cost of B Trees: • Root of each B tree in main memory • Suppose leaves have avg. 200 keys 500 disk I/O in each B tree to get pointer lists 1000 + 2(for intermediate B tree level) disk I/O’s Total • 11, 002 disk I/O’s more than sequential scan of file = 10, 000 I/O’s.

Nearest Neighbor query using B trees • Turn NN to (10, 20) into a range query {(x, y): 10 d x 10+d, 20 d y 20+d } • Possible problem: • No point in the selected range • The closest point inside may not be the answer • Solution: re execute range query with slightly larger d

NN queries, example • Same relation Points and its indexes on x and y as before, and Query: NN to (10, 20) • Choose d = 1 range query = {(x, y): 9 x 11, 19 y 21} • 2000 points in [9, 11], same in [19, 21] each dimension = 10+1 I/O’s to get pointers (+1 is because points with x=9 may not start just at the beginning of the leaf) • With an extra I/O for the intermediate node for each index 24 + 1 disk I/O’s to get the answer, assuming 1 of the 4 points is the answer, which we can determine by their coordinates, prior to getting the data blocks holding the points • However, if d is too small, we have to run another range query with a larger d

Grid files (hash like structure) • Divide data into stripes in each dimension • Rectangle in grid points to bucket • Example: database records (age, salary) for people who buy gold jewelry. Data: (25, 60) (45, 60) (50, 75) (50, 100) (50, 120) (70, 110) (85, 140) (30, 260) (25, 400) (45, 350) (50, 275) (60, 260)

Grid file

Operations Lookup Find coordinates of point in each dimension gives you a bucket to search. Nearest Neighbor Lookup point P. Consider points in that bucket. • Problem: there could be points in adjacent buckets that are closer. • Problem: there could be no points at all in the bucket: widen search? Range Queries Ranges define a region of buckets. • Buckets on border may contain points not in range. • Example: 35 < age <= 45; 50 < salary <= 100. Queries Specifying Only One Attribute • Problem: must search a whole row or column of buckets.

Insertion • Use overflow buckets, or split stripes in one or more dimensions • Insert (52, 200). Split central bucket, for instance by splitting central salary stripe • The blocks of 3 buckets are to be processed. • In general the blocks of n buckets are to be processed during a split. • n is the number of buckets in the chosen direction

Insertion • Insert (52, 200). Split central bucket, for instance by splitting central salary stripe (One possibility)

Grid files Advantages • Good for multiple key search • Supports Partial Match, Range Queries, NN queries Disadvantages • Space management overhead • Need partitioning ranges that evenly split keys • Possibility of overflow buckets for insertion

Partitioned hashing I 1. If we hash the concatenation of several keys then such a hash table cannot be used in queries specifying only one dimension (key). 2. A preferable option is to design the hash function so it produces some number of bits, say k. These k bits are divided among n attributes. 3. I. e. the hash function h is a concatenation of n hash functions, one for each dimensional attribute. 1. h = (h 1, …, hn) 2. the bucket where to put a tuple (v 1, …, vn) is computed by concatenating the bit sequences h 1(v 1)…hn(vn).

Partitioned hashing II • Example: Gold jewelry with • first bit = age mod 2 • bits 2 and 3: salary mod 4 • Works well for: • partial match (i. e. just an attribute specified) • Bad for: • range • nearest neighbor queries

Partitioned hashing III • Partial match query – specifying only the value of a: • compute hage(a), which could be, say 1. • Then, locate all the relevant buckets, which are from 100 to 111. – specifying only the value of salary: • compute hsalary(s), which could be, say 10. • Then, locate the relevant buckets, which are 010 and 110.

Grid files vs. partitioned hashing • If many dimensions many empty cells in grid. While partitioned hashing is OK. • Both support exact and partial match queries. • Grid files good for range and Nearest Neighbor queries, while partitioned hashing is not at all.