014b03a36c49f1d92d992deee9d0fedf.ppt
- Количество слайдов: 29
Promised Abstract Can database technology help manage and mine scientific data? That is the question I have been trying to answer with my astronomy colleagues (especially Alex Szalay. ) We have had some success but still face many problems. I will start by describing the odyssey of putting the Sloan Digital Sky Survey online and give some statistics about how it is used and what we are doing now. That segment will end with a discussion of how the integration of SQL with the CLR (common language run time) makes it much easier for us to handle scientific datatypes and spatial access methods. The World-Wide Telescope is an attempt to federate all the astronomy archives of the world. I will briefly describe the architecture of Sky. Query a prototype portal to several archives each of which is a web service. Urls: http: //research. microsoft. com/~gray http: //skyserver. pha. jhu. edu/dr 1/ http: //sky. Query. net
DAT 385 Yukon Features For Sky. Server Database Jim Gray: Microsoft Alex Szalay (and friends): Johns Hopkins Help from: Cathan Cook (personal Sky. Server), Maria A. Nieto-Santisteban (image cutout service)
Sky. Server Overview (10 min) 10 minute Sky. Server tour Pixel space Record space: Doc space: Ned Set space: Web & Query Logs Dr 1 Web. Service http: //skyserver. sdss. org/en/ select top 10 * from sdssad 2. weblog. dbo. weblog where yy = 2003 and mm=7 and dd =25 order by seq desc select top 10 * from sdssad 2. weblog. dbo. sql. Log order by the. Time You can download (thanks to Cathan Cook ) Desc Data + Database code: Website: http: //research. microsoft. com/~gray/SDSS Data Mining the SDSS Sky. Server Database MSR-TR-2002 -01
Cutout Service (10 min) A typical web service Show it Show WSDL Show fixing a bug Rush through code. You can download it. Maria A. Nieto-Santisteban did most of this (Alex and I started it) http: //research. microsoft. com/~gray/SDSS/personal_skyserver. htm
Sky. Query: http: //skyquery. net/ Distributed Query tool using a set of web services Fifteen astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)… Feasibility study, built in 6 weeks Tanu Malik (JHU CS grad student) Tamas Budavari (JHU astro postdoc) With help from Szalay, Thakar, Gray Implemented in C# and. NET Allows queries like: SELECT o. obj. Id, o. r, o. type, t. obj. Id FROM SDSS: Photo. Primary o, TWOMASS: Photo. Primary t WHERE XMATCH(o, t)<3. 5 AND AREA(181. 3, -0. 76, 6. 5) AND o. type=3 and (o. I - t. m_j)>2
Sky. Query Structure Each Sky. Node publishes Schema Web Service Database Web Service Portal is Plans Query (2 phase) Integrates answers Is itself a web service Image Cutout SDSS Sky. Query Portal FIRST 2 MASS INT
Four Database Topics Sparse tables: column vs row store tag and index tables pivot Maplist (cross apply) Bookmark bug Object Relational has arrived.
Column Store Pyramid Users see fat base tables (universal relation) Define popular columns index tag table 10% ~ 100 columns Obese query BASE Make many skinny indices 1% ~ 10 columns Query optimizer picks right plan Automate definition & use TAG Fast read, slow insert/update Data warehouse Note: prior to Yukon, index had 16 column limit. INDICIES A bane of my existence. Simple Fat query Typical Semijoin
Examples create table base ( id bigint, f 1 int primary key, f 2 int, …, f 1000 int) create index tag on base (id) include (f 1, …, f 100) BASE TAG Obese query Typical Semi-join Fat query create index skinny on base(f 2, …f 17) INDICIES Simpl
A Semi-Join Example create table fat(a int primary key, b int, c int, fat char (988)) declare @i int, @j int; set @i = 0 again: insert fat values(@i, cast(100*rand() as int), cast (100*rand() as int), ' ') set @i = @i + 1; if (@i < 1000000) goto again b=c 137 K IO 31 sec create index ab on fat(a, b) create index ac on fat(a, c) dbcc dropcleanbuffers with no_infomsgs select count(*) from fat with(index (0)) where c = b -- Table 'fat'. Scan 3, reads 137, 230, CPU : 1. 3 s, elapsed 31. 1 s. dbcc dropcleanbuffers with no_infomsgs select count(*) from fat where b=c -- Table 'fat'. Scan 2, reads: 3, 482 CPU 1. 1 s, 1 GB elapsed: 1. 4 s. ab ac 8 MB b=c 3. 4 K IO 1. 4 sec
Moving From Rows to Columns Pivot & Un. Pivot What if the table is sparse? LDAP has 7 mandatory and 1, 000 optional attributes Store row, col, value Features object attribute value ●●●● create table Features ( select * from (features object varchar , attribute varchar, value varchar, primary key ( object, attribute)) pivot value on attribute in (year, color) ) as T where object = ‘ 4 PNC 450’ 4 PNC 450 year color make model 2000 white Ford Taurus ●●●● T Object year 4 PNC 450 2000 color white
Maplist Meets SQL – cross apply select p. *, q. * from parent as p cross apply f(p. a, p. b, p. c) as q where p. type = 1 Your table-valued function F(a, b, c) returns all objects related to a, b, c. spatial neighbors, p 1 f(p 1) p 2 f(p 2) sub-assemblies, members of a group, items in a folder, … Apply this function to each row Classic drill-down use outer apply if f() may be null pn f(pn)
The Bookmark Bug SQL is a non-procedural language. The compiler/optimizer picks the procedure based on statistics. If the stats are wrong or missing…. Bad things happen. Queries can run VERY slowly. Strategy 1: allow users to specify plan. Strategy 2: make the optimizer smarter (and accept hints from the user. )
An Example of the Problem Bookmark plan: look in index for a subset. Lookup subset in Fat table. Index A query selects some fields of an index and of huge table. This is great if subset << table. terrible if subset ~ table. If statistics are wrong, or if predicates not independent, you get the wrong plan. How to fix the statistics? Huge table
A Fix: Let user ask for stats Create Statistics on View(f 1, . . , fn) Then the optimizer has the right data Picks the right plan. Statistics on Views, C. Galindo-Legaria, M. Josi, F. Waas, M. Wu, VLDB 2003, Q 3: Select count(*) from Galaxy where r < 22 and r_extinction > 0. 120 Bookmark: 34 M random IO, 520 minutes Create Statistics S on Galaxy(obj. ID ) Scan: 5 M sequential IO 18 minutes Ultimately this should be automated, but for now, … it’s a step in the right direction.
And. . Fully-async and synchronous (blocking) calls and multi-concurrent-result sets per connection (transaction) Queues built in (service broker): Fire-and forget asynchronous processing It listens to Port 80 for SOAP calls : TP-lite is back It’s a web service Notification service and data mining and olap and reporting and xml and xquery and. . ) But, let me focus on table-valued functions.
Object Relational Has Arrived VMs are moving inside the DB Yukon includes Common Language Runtime (Oracle & DB 2 have similar mechanisms). So, C++, VB, C# and Java are co-equal with Transact. SQL. code You can define classes and methods SQL will store the instances Access them via methods code You can put your analysis code INSIDE the database. Minimizes data movement. You can’t move petabytes to the client But we will soon have petabyte databases. data +code
Some Background Table valued functions SQL operates on tables. If you can make tables, you can extend SQL This is the idea behind OLE/DB create function Evens(@max. Val int) returns @T table (a int) begin while (@max. Val > 0) begin if (@max. Val % 2 = 0) insert @T values(@max. Val) set @max. Val = @max. Val -1 end return end select * from Evens(10) a -----10 8 6 4 2
Using table Valued Functions For Spatial Search Use function to return likely key ranges. Use filter predicate to eliminate objects outside the query box. Table valued function returns candidate ranges of some space-filling curve. Select obj. ID From Objects O join f. Get. Ranges( @latitude, @longitude, @radius) R on O. htm. ID between R. begin and R. end where abs(o. Lat - @latitude) + abs(o. Lon – @longitude) < @radius Filter discards false positives.
Hierarchical Triangular Mesh (HTM) Szalay, Kunszt, Brunner http: //www. sdss. jhu. edu/htm Every object has a 20 -deep Mesh ID Given an area routine returns set of covering triangles 2, 3, 0 2, 3, 1 2, 3, 2 Each triangle implies range query: 2, 3, 3 2, 1 2, 2 2, 3 htm. ID in triangle iff htm. ID in [traingle. min…triangle. max) Reject false positives with careful geometry test 2 Very fast: 10, 000 triangles / second / cpu
Using Hierarchical Triangular Mesh select * from photo. Obj as p, f. Htm. Cover(x, y, z, r) as n where p. htm. ID between n. start and n. end and (2*asin(sqrt(power(x-cx, 2)+power(y-cy, 2)+power(z-cz, 2))/2)) < radians(r) This is packaged as: f. Get. Nearby. Objects(x, y, z, r) This is the geometry test
The Pre CLR design Transact SQL sp_HTM (20 lines) 469 lines of “glue” looking like: // Get Coordinates param datatype, and param length information of if (srv_paraminfo(p. Srv. Proc, 1, &b. Type 1, &cb. Max. Len 1, &cb. Actual. Len 1, NULL, &f. Null 1) == FAIL) Error. Exit("srv_paraminfo failed. . . "); // Is Coordinate param a character string if (b. Type 1 != SRVBIGVARCHAR && b. Type 1 != SRVBIGCHAR && b. Type 1 != SRVVARCHAR && b. Type 1 != SRVCHAR) Error. Exit("Coordinate param should be a string. "); // Is Coordinate param non-null if (f. Null 1 || cb. Actual. Len 1 < 1 || cb. Max. Len 1 <= cb. Actual. Len 1) Error. Exit("Coordinate param is null. "); // Get pointer to Coordinate param pz. Coordinate. Spec = (char *) srv_paramdata (p. Srv. Proc, 1); if (pz. Coordinate. Spec == NULL) Error. Exit("Coordinate param is null. "); pz. Coordinate. Spec[cb. Actual. Len 1] = 0; // Get Output. Vector datatype, and param length information if (srv_paraminfo(p. Srv. Proc, 2, &b. Type 2, &cb. Max. Len 2, &cb. Actual. Len 2, NULL, &f. Null 2) == FAIL) Error. Exit("Failed to get type info on HTM Vector param. . . "); The HTM code body
The “glue” CLR design Discard 450 lines of UGLY code C# SQL sp_HTM (50 lines) Thanks!!! To Peter Kukol (who wrote this) using System; using System. Data. Sql. Server; using System. Data. Sql. Types; using System. Runtime. Interop. Services; namespace HTM { public class HTM_wrapper { [Dll. Import("SQL_HTM. dll")] static extern unsafe void * xp_HTM_Cover_get (byte *str); public static unsafe void HTM_cover_RS(string input) { // convert the input from Unicode (array of 2 bytes) to an array of bytes (not shown) byte * input; byte * output; // invoke the HTM routine output = (byte *)xp_HTM_Cover_get(input); // Convert the array to a table Sql. Result. Set output. Table = Sql. Context. Get. Return. Result. Set(); if (output[0] == 'O') { // if Output is “OK” uint c = *(UInt 32 *)(s + 4); // cast results as dataset Int 64 * r = ( Int 64 *)(s + 8); // Int 64 r[c-1, 2] for (int i = 0; i < c; ++i) { Sql. Data. Record new. Record = output. Table. Create. Record(); new. Record. Set. Sql. Int 64(0, r[0]); new. Record. Set. Sql. Int 64(1, r[1]); r++; output. Table. Insert(new. Record); } } // return output. Table; } } } The HTM code body
The Clean CLR design Discard all glue code return array cast as table CREATE ASSEMBLY HTM_A FROM '\localhostHTM. dll' CREATE FUNCTION HTM_cover( @input NVARCHAR(100) ) RETURNS @t TABLE ( HTM_ID_START BIGINT NOT NULL PRIMARY KEY, HTM_ID_END BIGINT NOT NULL ) AS EXTERNAL NAME HTM_A: HTM_NS. HTM_C: : HTM_cover using System; using System. Data. Sql; using System. Data. Sql. Server; using System. Data. Sql. Types; using System. Runtime. Interop. Services; namespace HTM_NS { public class HTM_C { public static Int 64[, 2] HTM_cover(string input) { // invoke the HTM routine return (Int 64[, 2]) xp_HTM_Cover(input); // the actual HTM C# or C++ or Java or VB code goes here. Your/My code goes here } } }
Performance (Beta 1) On a 2. 2 Ghz Xeon Call a Transact SQL function 33μs Call a C# function 50μs Table valued function + per row 42 μs 1, 580 μs Array (== table) valued function μs + per row 27 μs 200
What Is the Significance? No more inside/outside DB dichotomy. You can put your code near the data. Indeed, we are letting users put personal databases near the data archive. This avoids moving large datasets. Just move questions and answers.
Meta-Message Trying to fit science data into databases When it does not fit, something is wrong. Look for solutions Many solutions come from OR extensions Some are fundamental engine changes More structure in DB Richer operator sets Better statistics
Community Resources Microsoft Community Resources http: //www. microsoft. com/communities/default. mspx Non-Microsoft Community Resources http: //www. microsoft. com/communities/related/default. mspx Newsgroups Converse online with Microsoft Newsgroups, including Worldwide http: //www. microsoft. com/communities/newsgroups/default. mspx User Groups Meet and learn with your peers http: //www. microsoft. com/communities/usergroups/default. mspx Attend a free chat http: //www. microsoft. com/communities/chats/default. mspx Attend a free web cast http: //www. microsoft. com/usa/webcasts/default. asp Most Valuable Professional (MVP) http: //mvp. support. microsoft. com/
© 2003 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
014b03a36c49f1d92d992deee9d0fedf.ppt