Web Services and the VO: Using SDSS DR1
Alex Szalay and Jim Gray, with Tamas Budavari, Sam Carlisle, Vivek Haridas, Nolan Li, Tanu Malik, Maria Nieto-Santisteban, Wil O’Mullane, Ani Thakar
Changing Roles
• Exponential growth:
  – Projects last at least 3-5 years
  – Data sent upwards only at the end of the project
  – Data will never be centralized
• More responsibility on projects
  – Becoming publishers and curators
  – Larger fraction of budget spent on software
  – Lots of development duplicated and wasted
• More standards are needed
  – Easier data interchange, fewer tools
• More templates are needed
  – Develop less software on your own
Standards and Interoperability
• Standards driven by e-business requirements
  – Exchange of rich and structured data (XML…)
  – DB connectivity, Web Services, Grid computing
• Application to the astronomy domain
  – Data dictionaries (UCDs)
  – Data models
  – Protocols
  – Registries and resource/service discovery
  – Provenance, data quality
• Boundary conditions: dealing with the astronomy legacy
  – FITS data format
  – Software systems
Virtual Observatory
• Many new surveys are coming
  – SDSS is a dry run for the next ones
  – LSST will be 5 TB/night
• All the data will be on the Internet
  – But how? FTP, web services…
• Data and apps will be associated with the instruments
  – Distributed world-wide
  – Cross-indexed
  – Federation is a must, but how?
• Will be the best telescope in the world
  – World Wide Telescope
SkyServer.SDSS.org or Skyserver.Pha.Jhu.edu/DR1/
• Sloan Digital Sky Survey Data: Pixels + Data Mining
• About 400 attributes per “object”
• Spectrograms for 1% of objects
• Demo: pixel space, record space, set space, teaching
Show Cutout Web Service
SkyQuery (http://skyquery.net/)
• Distributed query tool using a set of web services
• Four astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)
• Feasibility study, built in 6 weeks
  – Tanu Malik (JHU CS grad student)
  – Tamas Budavari (JHU astro postdoc)
  – With help from Szalay, Thakar, Gray
• Implemented in C# and .NET
• Allows queries like:

    SELECT o.objId, o.r, o.type, t.objId
    FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
    WHERE XMATCH(o, t) < 3.5
      AND AREA(181.3, -0.76, 6.5)
      AND o.type = 3 AND (o.i - t.m_j) > 2
SkyQuery Structure
• Each SkyNode publishes
  – Schema Web Service
  – Database Web Service
• Portal
  – Plans the query (2-phase)
  – Integrates answers
  – Is itself a web service
[Diagram: SkyQuery Portal federating the SDSS, FIRST, 2MASS, and INT SkyNodes, plus the Image Cutout service]
National Virtual Observatory
• NSF ITR project, “Building the Framework for the National Virtual Observatory,” is a collaboration of 17 funded and 3 unfunded organizations
  – Astronomy data centers
  – National observatories
  – Supercomputer centers
  – University departments
  – Computer science/information technology specialists
• Trying to build standards, interfaces and prototypes
• Goal: federate datasets, enable new discoveries, make it easier to publish new data
International Collaboration
• Similar efforts now in more than 12 countries:
  – USA, Canada, UK, France, Germany, Italy, Japan, Australia, India, China, Korea, Russia, South Africa, Hungary, …
• Active collaboration among projects
  – Standards, common demos
  – International VO roadmap being developed
  – Regular telecons over 10 time zones
• Formal collaboration: the International Virtual Observatory Alliance (IVOA)
NVO: How Will It Work?
• Huge pressure to build something useful today
• We do not build ‘everything for everybody’
• Use the 90-10 rule:
  – Define the standards and interfaces
  – Build the framework
  – Build the 10% of services that are used by 90% of users
  – Let the users build the rest from the components
• Define commonly used ‘core’ services
• Build higher-level toolboxes/portals on top
Core Services
• Metadata information about resources
  – Waveband
  – Sky coverage
  – Translation of names to a universal dictionary (UCD)
• Simple search patterns on the resources
  – Cone Search (see the sketch below)
  – Image mosaic
  – Unit conversions
• Simple filtering, counting, histogramming
• On-the-fly recalibrations
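As an illustration of the simplest core service, here is roughly what a cone search might execute against the SkyServer catalog. This is a minimal sketch: the table-valued function name and its argument units (ra, dec in degrees, radius in arcminutes) follow the SkyServer convention, but should be read as assumptions rather than the exact DR1 API.

    -- Cone search sketch: all photometric objects within 2 arcminutes of a position.
    -- fGetNearbyObjEq and its (ra, dec, radius_arcmin) signature are assumed here.
    SELECT p.objID, p.ra, p.dec, p.type, n.distance
    FROM dbo.fGetNearbyObjEq(185.0, -0.5, 2.0) AS n
    JOIN PhotoPrimary AS p ON p.objID = n.objID
    ORDER BY n.distance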
Higher-Level Services
• Built on core services
• Perform more complex tasks
• Examples
  – Automated resource discovery
  – Cross-identifications
  – Photometric redshifts
  – Outlier detections
  – Visualization facilities (connect pixels to objects)
• Expectation:
  – Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)
Using SDSS DR1 as a Prototype
• SDSS DR1 (Data Release 1) is now publicly available: http://skyserver.pha.jhu.edu/dr1/
• About 1 TB of catalog data
• Using MS SQL Server 2000
• Complex schema (72 tables)
• About 80 million photometric objects
• Two versions (TARGET/BEST)
• Automated documentation
• Raw data at FNAL file server with URL access
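To give a feel for the DR1 schema, here is the kind of small catalog request a typical SkyServer user runs. The column names follow the PhotoPrimary view (objID, ra, dec, type and the ugriz magnitudes); the magnitude and colour cuts are purely illustrative.

    -- A typical small DR1 query: the 100 brightest red galaxies in a magnitude slice.
    -- type = 3 denotes galaxies in the SDSS photometric classification.
    SELECT TOP 100 objID, ra, dec, u, g, r, i, z
    FROM PhotoPrimary
    WHERE type = 3
      AND r BETWEEN 17 AND 18
      AND (g - r) > 0.7
    ORDER BY r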
DR1 SkyServer
• Classic 2-node web server ($20k total)
• 1 TB database
• 1M hits per month (180k on the peak day)
• DSS load is killing us: 12M rows per hour in downloads
• Answer set size follows a power law
Loading DR1
• Automated, table-driven workflow system for loading
  – Included lots of verification code
  – Over 16K lines of SQL code
• Loading process was extremely painful
  – Lack of systems engineering for the pipelines
  – Poor testing (lots of foreign key mismatches)
  – Detected data bugs even a month ago
  – Most of the time spent on scrubbing data
  – Fixing corrupted files (RAID 5 disk errors)
• Once the data was clean, everything loaded in 3 days
• Neighbors calculation took about 10 hours
• Reorganization of the data took about 1 week of experiments in partitioning/layouts
Public Data Release: Versions
• June 2000: EDR – Early Data Release
• July 2003: DR1
  – Contains 30% of the final data
  – 200 million photo objects
• 4 versions of the data
  – TARGET, BEST, RUNS, SPECTRO
• Total catalog volume 1.7 TB
  – See the Terascale Sneakernet paper…
• Published releases served forever
  – EDR, DR1, DR2, …
  – Soon to include email archives, annotations
  – O(N²) – only possible because of Moore’s Law!
[Timeline: EDR → DR1 → DR2 → DR3]
Spatial Features
• Precomputed Neighbors
  – All objects within 30″ (query sketch below)
• Boundaries, Masks and Outlines
  – Stored as spatial polygons
Time Domain:
• Precomputed Match
  – All objects within 1″, observed at different times
  – Found duplicates due to telescope tracking errors
  – Manual fix, recorded in the database
• MatchHead
  – The first observation of the linked list, used as a unique id for the chain of observations of the same object
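As a sketch of how the precomputed Neighbors table gets used, the query below pulls close galaxy pairs directly from it, with no on-the-fly spatial search. The column names (neighborObjID, distance) and the assumption that distance is stored in arcminutes follow the SkyServer schema, but treat them as assumptions here.

    -- Close galaxy pairs from the precomputed Neighbors table (pairs within 30 arcsec).
    -- distance is assumed to be in arcminutes; 0.1 arcmin = 6 arcsec.
    SELECT TOP 100 n.objID, n.neighborObjID, n.distance
    FROM Neighbors AS n
    JOIN PhotoPrimary AS p ON p.objID = n.objID
    JOIN PhotoPrimary AS q ON q.objID = n.neighborObjID
    WHERE p.type = 3 AND q.type = 3      -- both objects classified as galaxies
      AND n.distance < 0.1
    ORDER BY n.distance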
Spatial Data Access – SQL extension
Szalay, Gray, Kunszt, Fekete, O’Mullane, Brunner – http://www.sdss.jhu.edu/htm
• Added Hierarchical Triangular Mesh (HTM) table-valued functions for spatial joins
• Every object has a 20-deep mesh ID
• Given a spatial definition, the routine returns up to 10 covering triangles
• The spatial query is then up to 10 range queries (see the sketch below)
• Fast: 10,000 triangles/second/CPU
[Figure: HTM subdivision – triangle 2 splits into 2,0…2,3, and 2,3 splits into 2,3,0…2,3,3]
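The pattern looks roughly like the query below. The cover function name (fHtmCoverCircleEq) and its output columns (HtmIDStart, HtmIDEnd) are hypothetical stand-ins for the actual HTM library interface; the point is that the spatial predicate turns into a handful of BETWEEN range scans on the indexed htmID column.

    -- HTM range-query sketch: a (hypothetical) cover function returns the
    -- covering triangles as htmID ranges; the join becomes a few index range scans.
    SELECT p.objID, p.ra, p.dec
    FROM dbo.fHtmCoverCircleEq(181.3, -0.76, 6.5) AS c   -- ra, dec (deg), radius (arcmin)
    JOIN PhotoPrimary AS p
      ON p.htmID BETWEEN c.HtmIDStart AND c.HtmIDEnd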
Web Services in Progress
• Registry
  – Harvesting and querying, discovery of new services
• Data Delivery
  – Query-driven queue management
  – MyDB, VOSpace, VOProfile: minimize data movement
• Graphics and visualization
  – Query-driven vs. interactive
  – Show spatial objects (Chart/Navi/List)
• Footprint/intersect
  – It is a “fractal”
• Cross-matching
  – SkyQuery and SkyNode
  – Ferris-wheel
  – Distributed vs. parallel
Graphics/Visualization Tools
• Density plot
  – Show densities of attributes as a function of sky position
• Chart/Navi/List
  – Tie together catalogs and pixels
• Spectrum viewer
  – Display spectra of galaxies and stars drawn from the database
• Filter profiles
  – Catalog of astronomical filters (optical bandpasses)
• Mirage with VO connections
  – Linked multi-pane visualization tool (Bell Labs)
  – VO extensions built at JHU
Other Tools
Information Management
• Registry services
• Name resolver
• Cosmological calculator
• CASService
• MyDB
• VOSpace
• User authentication
Spatial
• Cone Search
• SkyNode
• CrossMatch (SkyQuery)
• Footprint
Spatial Cross-Match
• For a small area, HTM is close to optimal, but needs more speed
• For all-sky surveys, the zone algorithm is best (see the sketch below)
• The current heuristic is a linear chain of all nodes
• Easy to generalize to include precomputed neighbors
• But for all-sky queries this means a very large number of random reads instead of sequential ones
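A minimal sketch of the zone idea, assuming two catalogs CatA and CatB with ra/dec in degrees and a precomputed zoneID = floor((dec + 90) / zoneHeight) column; the table and column names and the 30-arcsecond zone height are illustrative, not the SkyQuery implementation.

    -- Zone-algorithm cross-match sketch: candidates come from the same or an
    -- adjacent declination zone and a small ra/dec box, then the exact
    -- great-circle distance (haversine form) confirms the match.
    -- (RA wrap-around at 0/360 degrees is ignored in this sketch.)
    DECLARE @radius float
    SET @radius = 3.0 / 3600.0          -- 3 arcsec match radius, in degrees

    SELECT a.objID AS aID, b.objID AS bID
    FROM CatA AS a
    JOIN CatB AS b
      ON  b.zoneID BETWEEN a.zoneID - 1 AND a.zoneID + 1
      AND b.dec BETWEEN a.dec - @radius AND a.dec + @radius
      AND b.ra  BETWEEN a.ra - @radius / COS(RADIANS(a.dec))
                    AND a.ra + @radius / COS(RADIANS(a.dec))
    WHERE 2 * DEGREES(ASIN(SQRT(
            POWER(SIN(RADIANS(a.dec - b.dec) / 2), 2)
          + COS(RADIANS(a.dec)) * COS(RADIANS(b.dec))
            * POWER(SIN(RADIANS(a.ra - b.ra) / 2), 2)))) < @radius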
Ferris-Wheel
• Sky split into buckets/zones
• All archives scan in sync
• Queries enter at the bottom
• Results come back after a full circle
• Only sequential access ⇒ buckets get into cache, then queries are processed
Data Access is hitting a wall
• FTP and GREP are not adequate
  – You can GREP 1 MB in a second
  – You can GREP 1 GB in a minute
  – You can GREP 1 TB in 2 days
  – You can GREP 1 PB in 3 years
  – You can FTP 1 MB in 1 sec
  – You can FTP 1 GB / min (= 1 $/GB)
  – … 2 days and 1 K$ … 3 years and 1 M$
  – Oh, and 1 PB is ~5,000 disks
• At some point you need indices to limit the search, and parallel data search and analysis
• This is where databases can help
Smart Data (active databases)
• If there is too much data to move around, take the analysis to the data!
• Do all data manipulations at the database
  – Build custom procedures and functions in the database (example below)
• Automatic parallelism guaranteed
• Easy to build in custom functionality
  – Databases & procedures being unified
  – Examples: temporal and spatial indexing, pixel processing
• Easy to reorganize the data
  – Multiple views, each optimal for certain types of analyses
  – Building hierarchical summaries is trivial
• Scalable to petabyte datasets
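As a small illustration of pushing computation into the database, here is a user-defined function that could live next to the catalog and be applied inside any query, so the work runs where the data is. The function name and the choice of a haversine angular separation are illustrative, not part of the SkyServer schema.

    -- Hypothetical UDF: angular separation in arcseconds between two positions.
    -- Once installed in the database it can be used in WHERE clauses and joins,
    -- keeping the computation next to the data instead of on the client.
    CREATE FUNCTION dbo.fAngularSepArcsec(
        @ra1 float, @dec1 float, @ra2 float, @dec2 float)
    RETURNS float
    AS
    BEGIN
        -- haversine great-circle distance, converted from degrees to arcseconds
        RETURN 3600 * 2 * DEGREES(ASIN(SQRT(
               POWER(SIN(RADIANS(@dec1 - @dec2) / 2), 2)
             + COS(RADIANS(@dec1)) * COS(RADIANS(@dec2))
               * POWER(SIN(RADIANS(@ra1 - @ra2) / 2), 2))))
    END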
Generic Catalog Access
• After 2 years of SDSS EDR and 6 months of DR1 usage, access patterns start to emerge
  – Lots of small users, instant response
  – 1/f distribution of request sizes (tail of the lognormal)
• How to make everybody happy? No clear business model…
• We need separate interactive and batch servers
• We also need access to full SQL with extensions
• Users want to access services via browsers
• Other services will need SOAP access
Data Formats
• Different data formats requested:
  – HTML, CSV, FITS binary, VOTABLE, XML, graphics
• Quick browsing and exploration
  – Small requests, need to be nicely rendered
  – Needs good random access performance
  – Simple 2D scatter plots or density plots also required
• Heavy-duty statistical use
  – Aggregate functions on complex joins, lots of scans but small output, mostly want CSV
• Successive data filter
  – Multi-step non-indexed filtering of the whole database, mostly want FITS binary
Data Delivery
• Small requests (<100 MB)
  – Put the data on the stream
• Medium requests (<1 GB)
  – Use DIME attachments to SOAP messages
• Large requests (>1 GB)
  – Save data in a scratch area and use asynchronous delivery
  – Only practical for large/long queries
• Iterative requests
  – Save data in temp tables in user space
  – Let the user manipulate it via a web browser
• Paradox: if we use a web browser to submit, users want an immediate response from batch-size queries
How To Provide a UserDB
• Goal: through several search/filter operations, reduce data transfer to manageable sizes (1-100 MB)
• Today: people download tens of millions of rows, then do their next filtering on the client side, using F77
• Could be done much better in the database
• But: users need to create/manage temporary tables
  – DoS attacks, fragmentation, who pays for it?
  – Security: who can see my data (group access)?
  – Follow progress of long jobs
  – Who does the cleanup?
Query Management Service
• Enable fast, anonymous access for small requests
• Enable large queries, with the ability to manage them
• Enable creation of temporary tables in user space
• Create multiple ways to get query output
• Needs to support multiple mirrors/load balancing
• Do all this without logging in to Windows
• Also need to support machine clients
⇒ Web Service: http://skyservice.pha.jhu.edu/devel/CasJobs/
• Two request categories:
  – Quick
  – Batch
Queue Management
• Need to register batch ‘power users’
• Query output goes to ‘MyDB’
• Can be joined with the source database
• Results are materialized from MyDB upon request
• Users can do:
  – Insert, Drop, Create, Select Into, Functions, Procedures (see the sketch below)
  – Publish their tables to a group area
• Data delivery via the CASService (C# WS): http://skyservice.pha.jhu.edu/devel/CasService.asmx
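A minimal sketch of the batch/MyDB workflow: a long-running query extracts a cut of the catalog into the user’s own database, and later interactive queries refine that small table instead of re-scanning the archive. The MyDB prefix follows the CasJobs convention described above; the table name and the science cut are illustrative.

    -- Batch job: materialize a science cut into the user's MyDB.
    SELECT objID, ra, dec, u, g, r, i, z
    INTO   MyDB.RedGalaxies
    FROM   PhotoPrimary
    WHERE  type = 3 AND (u - r) > 2.2     -- illustrative 'red galaxy' colour cut

    -- Later, quick interactive queries run against the small extracted table:
    SELECT COUNT(*) FROM MyDB.RedGalaxies WHERE r < 17.77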
Summary
• Exponential data growth – distributed data
• Web Services – hierarchical architecture
• Distributed computing – Grid services
• Primary access to data is through databases
• The key to interoperability: metadata, standards
• Build upon industry standards and commercial tools, and collaborate with the rest of the world
• Give interesting new tools into the hands of smart young people… they will quickly turn them into cutting-edge science
http://skyservice.pha.jhu.edu/develop/