ac5112ded6e510d300447071d971dbb0.ppt
- Number of slides: 48
Data Grids, Digital Libraries, and Persistent Archives
ESIP Federation Meeting
Arcot Rajasekar, Michael Wan, Reagan Moore
(sekar, mwan, moore)@sdsc.edu
San Diego Supercomputer Center
National Partnership for Advanced Computational Infrastructure
SDSC SRB Team
• Arun Jagatheesan, George Kremenek, Sheau-Yen Chen, Arcot Rajasekar, Reagan Moore, Michael Wan, Roman Olschanowsky, Bing Zhu, Charlie Cowart, Wayne Schroeder, Vicky Rowley (BIRN), Lucas Gilbert, Marcio Faerman (SCEC), Antoine De Torcy (IN2P3)
Students & emeritus
– Erik Vandekieft, Reena Mathew, Xi (Cynthia) Sheng, Allen Ding, Grace Lin, Qiao Xin, Daniel Moore, Ethan Chen, Jon Weinburg
Topics
• Concepts behind data management
• Production data grid examples
• Integration of data grids with digital libraries and persistent archives
• Data Grid demonstration based on the Storage Resource Broker (SRB)
Data Management Concepts (Elements)
• Collection
  – The organization of digital entities to simplify management and access
• Context
  – The information that describes the digital entities in a collection
• Content
  – The digital entities in a collection
Types of Context Metadata
• Descriptive
  – Provenance information, discovery attributes
• Administrative
  – Location, ownership, size, time stamps
• Structural
  – Data model, internal components
• Behavioral
  – Display and manipulation operations
• Authenticity
  – Audit trails, checksums, access controls
Metadata Standards
• METS - Metadata Encoding and Transmission Standard
  – Defines standard structure and schema extension
• OAIS - Open Archival Information System
  – Preservation packages for submission, archiving, distribution
• OAI - Open Archives Initiative
  – Metadata retrieval based on Dublin Core provenance attributes
Data Management Concepts (Mechanisms)
• Curation
  – The process of creating the context
• Closure
  – Assertion that the collection has global properties, including completeness and homogeneity under specified operations
• Consistency
  – Assertion that the context represents the content
Information Technologies
• Data collecting
  – Sensor systems, object ring buffers and portals
• Data organization
  – Collections, manage data context
• Data sharing
  – Data grids, manage heterogeneity
• Data publication
  – Digital libraries, support discovery
• Data preservation
  – Persistent archives, manage technology evolution
• Data analysis
  – Processing pipelines, manage knowledge extraction
Data Management Challenges
• Distributed data sources
  – Management across administrative domains
• Heterogeneity
  – Multiple types of storage repositories
• Scalability
  – Support for billions of digital entities, PBs of data
• Preservation
  – Management of technology evolution
Data Grids
• Distributed data sources
  – Inter-realm authentication and authorization
• Heterogeneity
  – Storage repository abstraction
• Scalability
  – Differentiation between context and content management
• Preservation
  – Support for automated processing (migration, archival processes)
Assertion
• Data Grids provide the underlying abstractions required to support
  – Digital libraries
    • Curation processes
    • Distributed collections
    • Discovery and presentation services
  – Persistent archives
    • Management of technology evolution
    • Preservation of authenticity
SRB Collections at SDSC
Common Infrastructure
• Digital libraries and persistent archives can be built on data grids
• Common capabilities are needed for each environment
• Multiple examples of production systems across scientific disciplines and federal agencies
Data Grid Components
• Federated client-server architecture
  – Servers can talk to each other independently of the client
• Infrastructure independent naming
  – Logical names for users, resources, files, applications
• Collective ownership of data
  – Collection-owned data, with infrastructure independent access control lists
• Context management
  – Record state information in a metadata catalog from data grid services such as replication
• Abstractions for dealing with heterogeneity
Data Grid Abstractions
• Logical name space for files
  – Global persistent identifier
• Storage repository abstraction
  – Standard operations supported on storage systems
• Information repository abstraction
  – Standard operations to manage collections in databases
• Access abstraction
  – Standard interface to support alternate APIs
• Latency management mechanisms
  – Aggregation, parallel I/O, replication, caching
• Security interoperability
  – GSSAPI, inter-realm authentication, collection-based authorization
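The storage repository abstraction can be illustrated with a minimal sketch. Everything below is hypothetical: the class and method names (`StorageDriver`, `put`, `get`) are invented for illustration and are not the SRB driver interface. The point is only that client code calls one set of standard operations while each driver hides the details of its own repository.

```python
from abc import ABC, abstractmethod
import os
import tempfile


class StorageDriver(ABC):
    """Hypothetical uniform interface over one kind of storage repository."""

    @abstractmethod
    def put(self, physical_name, data):
        """Store bytes under a physical name."""

    @abstractmethod
    def get(self, physical_name, offset=0, length=-1):
        """Return bytes starting at offset; length < 0 means read to the end."""


class MemoryDriver(StorageDriver):
    """Keeps objects in a dictionary (stands in for any non-file repository)."""

    def __init__(self):
        self._objects = {}

    def put(self, physical_name, data):
        self._objects[physical_name] = bytes(data)

    def get(self, physical_name, offset=0, length=-1):
        data = self._objects[physical_name]
        end = len(data) if length < 0 else offset + length
        return data[offset:end]


class FileSystemDriver(StorageDriver):
    """Keeps objects as files under a root directory."""

    def __init__(self, root):
        self._root = root

    def put(self, physical_name, data):
        with open(os.path.join(self._root, physical_name), "wb") as f:
            f.write(data)

    def get(self, physical_name, offset=0, length=-1):
        with open(os.path.join(self._root, physical_name), "rb") as f:
            f.seek(offset)
            return f.read(length)


def roundtrip(driver):
    """Client code that works unchanged against any driver."""
    driver.put("sample.dat", b"hello world")
    return driver.get("sample.dat", offset=6, length=5)
```

The same `roundtrip` call works against `MemoryDriver()` and against `FileSystemDriver(some_directory)`, which is the property the abstraction is meant to provide.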
SDSC Storage Resource Broker & Meta-data Catalog
[Architecture diagram, listed by layer]
• Application
• Access APIs: Unix Shell; C, C++ Libraries; Linux I/O; Java, NT Browsers; DLL / Python; GridFTP; OAI; WSDL
• SRB Server: Consistency Management / Authorization-Authentication; Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction
  – Databases: DB2, Oracle, Sybase, SQLServer
• Storage Abstraction
  – Archives: HPSS, ADSM, UniTree, DMF
  – HRM
  – File Systems: Unix, NT, Mac OSX
  – Databases: DB2, Oracle, Postgres
• Drivers
Production Data Grid
• SDSC Storage Resource Broker
  – Federated client-server system, managing
    • Over 90 TBs of data at SDSC
    • Over 16 million files
  – Manages data collections stored in
    • Archives (HPSS, UniTree, ADSM, DMF)
    • Hierarchical Resource Managers
    • Tapes, tape robots
    • File systems (Unix, Linux, Mac OS X, Windows)
    • FTP sites
    • Databases (Oracle, DB2, Postgres, SQLserver, Sybase, Informix)
    • Virtual Object Ring Buffers
Federated SRB server model
[Diagram, steps numbered 1-6: a read application supplies a logical name or attribute condition to an SRB server; SRB servers use peer-to-peer brokering and server(s) spawning to start SRB agents; parallel data access to resources R1 and R2]
• The MCAT provides:
  – 1. Logical-to-Physical mapping
  – 2. Identification of Replicas
  – 3. Access & Audit Control
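The three catalog steps named on this slide can be sketched as a toy lookup. This is an illustration under stated assumptions, not the MCAT schema or the SRB API: `MiniCatalog`, `register`, `resolve`, and the sample names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    replicas: list                 # physical locations as (resource, path) pairs
    readers: set                   # user names allowed to read
    audit_log: list = field(default_factory=list)


class MiniCatalog:
    """Toy stand-in for a metadata catalog (hypothetical, not the MCAT)."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, replicas, readers):
        self._entries[logical_name] = CatalogEntry(list(replicas), set(readers))

    def resolve(self, logical_name, user):
        entry = self._entries[logical_name]      # 1. logical-to-physical mapping
        replicas = list(entry.replicas)          # 2. identification of replicas
        allowed = user in entry.readers          # 3. access control ...
        entry.audit_log.append((user, logical_name, allowed))  # ... and audit
        if not allowed:
            raise PermissionError(f"{user} may not read {logical_name}")
        return replicas

    def audit(self, logical_name):
        return list(self._entries[logical_name].audit_log)


catalog = MiniCatalog()
catalog.register(
    "/home/collection/image1",                   # hypothetical logical name
    replicas=[("R1", "/disk/a/0001"), ("R2", "/tape/b/0001")],
    readers={"alice"},
)
locations = catalog.resolve("/home/collection/image1", "alice")
```

A caller never sees a physical path until the catalog has mapped the name, listed the replicas, and recorded the access decision.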
Logical Name Space
• Global, location-independent identifiers for digital entities
  – Organized as collection hierarchy
  – Attributes mapped to logical name space
    • Attributes managed in a database
• Types of administrative metadata
  – Physical location of file
  – Owner, size, creation time, update time
  – Access controls
File Identifiers
• Logical file name
  – Infrastructure independent
  – Used to organize files into a collection hierarchy
• Globally unique identifier
  – GUID for asserting equivalence across collections
• Descriptive metadata
  – Support discovery
• Physical file name
  – Location of file
Mappings on Name Space
• Define logical resource name
  – List of physical resources
• Replication
  – Write to a logical resource completes when all physical resources have a copy
• Load balancing
  – Write to a logical resource completes when a copy exists on the next physical resource in the list
• Fault tolerance
  – Write to a logical resource completes when copies exist on "k" of "n" physical resources
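The three write-completion rules can be compared in a small sketch. The classes below are hypothetical and model only when a write counts as complete; they are not how the SRB implements logical resources.

```python
class Resource:
    """Toy physical resource; raises OSError when offline."""

    def __init__(self, name, online=True):
        self.name, self.online, self.files = name, online, {}

    def write(self, logical_name, data):
        if not self.online:
            raise OSError(f"{self.name} is offline")
        self.files[logical_name] = data


class LogicalResource:
    """A logical resource name mapped to a list of physical resources."""

    def __init__(self, resources):
        self.resources = list(resources)
        self._next = 0                      # cursor for load balancing

    @staticmethod
    def _try_write(resource, name, data):
        try:
            resource.write(name, data)
            return True
        except OSError:
            return False

    def write_replicated(self, name, data):
        """Complete only when every physical resource holds a copy."""
        done = [r.name for r in self.resources if self._try_write(r, name, data)]
        if len(done) != len(self.resources):
            # Copies already written are left in place in this sketch.
            raise OSError("replication incomplete")
        return done

    def write_load_balanced(self, name, data):
        """Complete when a copy exists on the next resource in the list."""
        target = self.resources[self._next % len(self.resources)]
        self._next += 1
        target.write(name, data)
        return [target.name]

    def write_k_of_n(self, name, data, k):
        """Complete as soon as k of the n resources hold a copy."""
        done = []
        for r in self.resources:
            if self._try_write(r, name, data):
                done.append(r.name)
            if len(done) == k:
                return done
        raise OSError(f"only {len(done)} of the required {k} copies written")


# One resource is offline: full replication fails, 2-of-3 still succeeds.
pool = LogicalResource([Resource("a"), Resource("b", online=False), Resource("c")])
copies = pool.write_k_of_n("file1", b"data", k=2)
```

With one of three resources offline, `write_replicated` raises while `write_k_of_n(..., k=2)` returns the two resources that accepted the copy.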
Grid Bricks
• Integrate data management system, data processing system, and data storage system into a modular unit
  – Commodity based disk systems (1 TB)
  – Memory (1 GB)
  – CPU (1.7 GHz)
  – Network connection (Gig-E)
  – Linux operating system
• Data Grid technology to manage name spaces
  – User names (authentication, authorization)
  – File names
  – Collection hierarchy
Data Grid Brick
• Hardware components
  – Intel Celeron 1.7 GHz CPU
  – SuperMicro P4SGA PCI Local bus ATX mainboard
  – 1 GB memory (266 MHz DDR DRAM)
  – 3Ware Escalade 7500-12 port PCI bus IDE RAID
  – 10 Western Digital Caviar 200-GB IDE disk drives
  – 3Com Etherlink 3C996B-T PCI bus 1000Base-T
  – Redstone RMC-4F2-7 4U ten bay ATX chassis
  – Linux operating system
• Cost is $2,200 per TByte plus tax
• Gig-E network switch costs $500 per brick
• Effective cost is about $2,700 per TByte
Grid Bricks at SDSC
• Used to implement "picking" environments for 10-TB collections
  – Web-based access
  – Web services (WSDL/SOAP) for data subsetting
• Implemented 15 TBs of storage
  – Astronomy sky surveys, NARA prototype persistent archive, NSDL web crawls
• Must still apply Linux security patches to each Grid Brick
Grid Brick Logical Names
• Logical name space for files
  – Common collection hierarchy across modules
• Collection owned data
  – Data grid manages access control lists for files
  – Data grid manages authentication
• Logical resource name
  – Aggregate modules under a single logical name
  – Support load leveling across modules
Latency Management - Bulk Operations
• Bulk register
  – Create a logical name for a file
  – Load context (metadata)
• Bulk load
  – Create a copy of the file on a data grid storage repository
• Bulk unload
  – Provide containers to hold small files and pointers to each file location
• Requests for bulk operations for delete, access control, …
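The idea of a container that holds small files plus pointers to each file location can be sketched as follows. The layout is an assumption made for illustration (files concatenated back to back, with an index of offsets and lengths); it is not the SRB container format, and the function names are hypothetical.

```python
def pack_container(files):
    """Concatenate small files and record where each one lives.

    files: dict mapping file name -> bytes.
    Returns (container_bytes, index) where index[name] = (offset, length).
    """
    index, chunks, offset = {}, [], 0
    for name, data in files.items():
        index[name] = (offset, len(data))
        chunks.append(data)
        offset += len(data)
    return b"".join(chunks), index


def read_from_container(container, index, name):
    """Fetch one file from the container using its pointer."""
    offset, length = index[name]
    return container[offset:offset + length]


small_files = {"a.fits": b"AAAA", "b.fits": b"BB", "c.fits": b"CCCCCC"}
container, index = pack_container(small_files)
one_file = read_from_container(container, index, "b.fits")
```

Moving one container replaces many per-file transfers, and the index still allows any single file to be read back without unpacking the rest.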
SRB Latency Management
[Diagram: latency management mechanisms between a Source and a Destination across the Network]
• Remote Proxies, Staging
• Data Aggregation, Containers
• Replication
• Streaming
• Parallel I/O
• Prefetch
• Caching
• Server-initiated I/O
• Client-initiated I/O
Latency Management Example - Digital Sky Project
• 2MASS (2 Micron All Sky Survey):
  – Bruce Berriman, IPAC, Caltech; John Good, IPAC, Caltech; Wen-Piao Lee, IPAC, Caltech
• NVO (National Virtual Observatory):
  – Tom Prince, Caltech; Roy Williams, CACR, Caltech; John Good, IPAC, Caltech
• SDSC - SRB:
  – Arcot Rajasekar, Mike Wan, George Kremenek, Reagan Moore
Digital Sky Data Ingestion
[Diagram labels: input tapes from telescopes; IPAC Caltech: star catalog, Informix, SUN; SRB; SDSC: SUN E10K, 800 GB Data Cache, HPSS, 10 TB]
Digital Sky - 2MASS
• http://www.ipac.caltech.edu/2mass
• The input data was originally written to DLT tapes in the order seen by the telescope
  – 10 TBytes of data, 5 million files
• Ingestion took nearly 1.5 years - manual loading of tapes
• Images aggregated into 147,000 containers by SRB
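The figures on this slide imply rough averages for the container layout. The short calculation below uses only the quoted numbers and assumes decimal units (1 TByte = 10^12 bytes), which the slide does not state; the results are averages, not measured container sizes.

```python
# Figures quoted on the slide
total_bytes = 10 * 10**12      # 10 TBytes, assuming 1 TByte = 10^12 bytes
n_files = 5_000_000
n_containers = 147_000

files_per_container = n_files / n_containers
mean_file_mb = total_bytes / n_files / 10**6
mean_container_mb = total_bytes / n_containers / 10**6

print(f"about {files_per_container:.0f} files per container")
print(f"about {mean_file_mb:.0f} MB per file")
print(f"about {mean_container_mb:.0f} MB per container")
```

On these assumptions each container holds roughly 34 images of about 2 MB each, or about 68 MB per container.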
Digital Sky Web-based Data Retrieval
• Average 3000 images a day
[Diagram labels: WEB, SUNs, SGIs; IPAC Caltech: Informix; JPL; SRB; SDSC: SUN E10K, 800 GB, HPSS, 10 TB]
Remote Proxies
• Extract image cutout from Digital Palomar Sky Survey
  – Image size 1 Gbyte
  – Shipping the image to a server to extract the cutout took 2-4 minutes (5-10 Mbytes/sec)
• Remote proxy performed cutout directly on storage repository
  – Extracted cutout by partial file reads
  – Image cutouts returned in 1-2 seconds
• Remote proxies are a mechanism to aggregate I/O commands
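A cutout by partial file reads can be sketched as below. For scale, moving a 1 Gbyte image at 5-10 Mbytes/sec takes 100-200 seconds, which is in line with the 2-4 minutes quoted. The sketch assumes a raw, row-major image with one byte per pixel; the real survey file format and the proxy's implementation are not described in the slides, and `read_cutout` is a hypothetical name.

```python
import os
import tempfile


def read_cutout(path, width, x0, y0, w, h):
    """Read a w-by-h cutout from a raw 8-bit, row-major image of the given width."""
    rows = []
    with open(path, "rb") as f:
        for y in range(y0, y0 + h):
            f.seek(y * width + x0)      # jump to the start of this row segment
            rows.append(f.read(w))      # read only the bytes inside the cutout
    return rows


# Build a small synthetic image on disk, then extract a 4x3 cutout from it.
width, height = 64, 48
image = bytes((x + y) % 256 for y in range(height) for x in range(width))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(image)
cutout = read_cutout(tmp.name, width, x0=10, y0=5, w=4, h=3)
os.remove(tmp.name)

bytes_read = sum(len(row) for row in cutout)   # 12 bytes instead of the whole file
```

Only the bytes inside the cutout are read, so the cost scales with the cutout size rather than the image size when the code runs next to the storage.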
Real-Time Data Example - RoadNet Project
• Manage interactions with a virtual object ring buffer (VORB)
• Demonstrate federation of ORBs
• Demonstrate integration of archives, VORBs and file systems
• Support queries on objects in VORBs
Federated VORB Operation
[Diagram, steps numbered 1-6: VORB servers and VORB agents at San Diego and Nome, a VCAT catalog, ORB 1 and ORB 2]
• Get Sensor Data (from Boston), using the logical name of the sensor with stream characteristics
• Contact VORB Catalog:
  – 1. Logical-to-Physical mapping: physical sensors identified
  – 2. Identification of Replicas: ORB 1 and ORB 2 are identified as sources of the required data
  – 3. Access & Audit Control
• Check ORB 1: it is down
• Automatically contact ORB 2 through the VORB server at Nome
• Check ORB 2: it is up. Get data
• Format data and transfer
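The failover behavior in this diagram (ORB 1 is down, so ORB 2 serves the request) can be sketched as a loop over the replicas that the catalog returns. The `Orb` class, the catalog dictionary, and the sensor name are hypothetical stand-ins, not the VORB or VCAT interfaces.

```python
class Orb:
    """Toy object ring buffer holding recent sensor packets."""

    def __init__(self, name, packets, up=True):
        self.name, self.packets, self.up = name, list(packets), up

    def latest(self, n):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.packets[-n:]


def get_sensor_data(catalog, sensor, n=1):
    """Try each ORB registered for the logical sensor name, in order."""
    errors = []
    for orb in catalog[sensor]:          # replicas identified by the catalog
        try:
            return orb.name, orb.latest(n)
        except ConnectionError as exc:   # this source is down, try the next one
            errors.append(str(exc))
    raise ConnectionError("; ".join(errors))


orb1 = Orb("ORB1", [1, 2, 3], up=False)
orb2 = Orb("ORB2", [1, 2, 3])
vcat = {"sensor-42": [orb1, orb2]}       # hypothetical logical sensor name
source, packets = get_sensor_data(vcat, "sensor-42", n=2)
```

The caller asks for the sensor by logical name only; which ORB answers is decided at request time.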
Access Abstraction Example - Data Assimilation Office
HSI has implemented metadata schema in SRB/MCAT
• Origin: host, path, owner, uid, gid, perm_mask, [times]
• Ingestion: date, user_email, comment
• Generation: creator (name, uid, user, gid), host (name, arch, OS name & flags), compiler (name, version, flags), library, code (name, version), accounting data
• Data description: title, version, discipline, project, language, measurements, keywords, sensor, source, prod. status, temporal/spatial coverage, location, resolution, quality
Fully compatible with GCMD
Data Management System: Software Architecture
DODS Access Environment Integration
Data Grid Federation
• Data grids provide the ability to name, organize, and manage data on distributed storage resources
• Federation provides a way to name, organize, and manage data on multiple data grids
Peer-to-Peer Federated Systems
• Consistency constraints in federations
• Cross-register a digital entity from one collection into another
  – Who manages the access control lists?
  – Who maintains consistency between context and content?
• How can federation systems be characterized?
SRB Zones
• Each SRB zone uses a metadata catalog (MCAT) to manage the context associated with digital content
• Context includes:
  – Administrative, descriptive, authenticity attributes
  – Users
  – Resources
  – Applications
SRB Peer-to-Peer Federation
• Mechanisms to impose consistency and access constraints on:
  – Resources
    • Controls on which zones may use a resource
  – User names (user-name / domain / SRB-zone)
    • Users may be registered into another domain, but retain their home zone, similar to Shibboleth
  – Data files
    • Controls on who specifies replication of data
  – MCAT metadata
    • Controls on who manages updates to metadata
Peer-to-Peer Federation
1. Occasional Interchange - for specified users
2. Replicated Catalogs - entire state information replication
3. Resource Interaction - data replication
4. Replicated Data Zones - no user interactions between zones
5. Master-Slave Zones - slaves replicate data from master zone
6. Snow-Flake Zones - hierarchy of data replication zones
7. User / Data Replica Zones - user access from remote to home
8. Nomadic Zones, "SRB in a Box" zone - synchronize local zone to parent
9. Free-floating "myZone" - synchronize without a parent zone
10. Archival "BackUp Zone" - synchronize to an archive
SRB Version 3.0.1 released December 19, 2003
Principal peer-to-peer federation approaches (1536 possible combinations)
Comparison of peer-to-peer federation approaches
[Diagram relating the federation approaches by the controls that distinguish them]
• Approaches: Free Floating, Occasional Interchange, Replicated Data, User and Data Replica, Replicated Catalog, Resource Interaction, Nomadic, Snow Flake, Master Slave, Archival
• Distinguishing controls: Partial User-ID Sharing, Complete User-ID Sharing, One Shared User-ID, Partial Resource Sharing, Complete Resource Sharing, No Resource Sharing, No Metadata Synch, Partial Metadata Synch, Complete Metadata Synch, System Set Access Controls, System Managed Replication, System Controlled Complete Synch, Connection From Any Zone, Hierarchical Zone Organization, Super Administrator Zone Control
Knowledge Based Data Grid Roadmap
[Diagram: three layers (Knowledge, Information, Data), each with Ingest Services, Management, and Access Services]
• Knowledge: Relationships Between Concepts; Knowledge Repository for Rules; Rules - KQL; XTM DTD; Knowledge or Topic-Based Query / Browse
• Information: Attributes, Semantics; Information Repository; SDLIP; XML DTD; Attribute-based Query (Model-based Access)
• Data: Fields, Containers, Folders; Storage (Replicas, Persistent IDs); Grids; MCAT/HDF; Feature-based Query (Data Handling System)
Data Grid Demonstration
• Use web browser to access a collection housed at SDSC
• Retrieve an image
• Browse through a collection
• Search for a file
• Examine grid federation
For More Information
Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB.html