Number of slides: 41

Integrated Rule Oriented Data System (iRODS)
Reagan W. Moore, Arcot Rajasekar, Mike Wan
{moore, sekar, mwan}@diceresearch.org
http://irods.diceresearch.org

Data Management Infrastructure
• Assemble distributed data into a shared collection
  – Manage properties of the collection
  – Enforce management policies
  – Validate assessment criteria
  – Automate administrative tasks
• Support a wide range of management applications
  – Data sharing, publication, preservation, analysis
  – Works at scale (petabytes, hundreds of millions of files)

Data Management Challenges
• Data-driven research generates massive data collections
  – Data sources are remote and distributed
  – Collaborators are remote
  – Wide variety of data types: observational data, experimental data, simulation data, real-time data, office products, web pages, multi-media
• Collections contain millions of files
  – Logical arrangement is needed for distributed data
  – Discovery requires the addition of descriptive metadata
• Long-term retention requires migration of output into a reference collection
  – Automation of administrative functions is essential to minimize long-term labor support costs
  – Creation of representation information for describing file context
  – Validation of assessment criteria (authenticity, integrity)

Preservation Context
• Preservation metadata
  – Authenticity (provenance) information
  – Representation information (structure, semantics)
  – Administrative information (replication, checksums, access controls, retention, disposition)
• Preservation procedures
  – Administration procedures
  – ISO MOIMS-rac assessment procedures
  – Preservation procedures generate preservation metadata

Overview of iRODS Architecture
[Diagram: a user can search, access, add, and manage data and metadata through the iRODS Data System, which comprises the iRODS Data Server (disk, tape, etc.), the iRODS Rule Engine (tracks policies), and the iRODS Metadata Catalog (tracks data).]
* Access data with a Web-based browser, the iRODS GUI, or command-line clients.

iRODS Distributed Data Management

iRODS Resource Server

Types of File Manipulation
• Replication
• Load leveling across storage systems
• Registration
• Synchronization
• Checksums
• Aggregation
• Metadata
• Access controls (time dependent)

iRODS Micro-Services
• Function snippets that wrap a well-defined process
  – Compute checksum
  – Replicate file
  – Integrity check
  – Zoom image
  – Get SDSS image cutout
  – Search PubMed
• Written in C or Python (PHP, Java soon)
  – Recovery micro-services to handle failure
  – Web services can be wrapped as micro-services
• Can be chained to perform complex tasks
  – Micro-services invoked by rule engine

iRODS Rules
• Server-side workflows
    Action | condition | workflow chain | recovery chain
• Condition - test on any attribute:
  – Collection, file name, storage system, file type, user group, elapsed time, IRB approval flag, descriptive metadata
• Workflow chain:
  – Micro-services / rules that are executed at the storage system
• Recovery chain:
  – Micro-services / rules that are used to recover from errors
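
As a rough illustration of the action | condition | workflow chain | recovery chain structure above, here is a minimal Python sketch. It is not the iRODS rule engine: the `Rule` class, `apply_rule`, and the micro-service stand-ins are hypothetical names, and it assumes one recovery step per workflow step, run in reverse order when a step fails.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, str]
Step = Callable[[Context], None]


@dataclass
class Rule:
    action: str                           # event name that triggers the rule
    condition: Callable[[Context], bool]  # test on any attribute in the context
    workflow: List[Step]                  # micro-service stand-ins, run in order
    recovery: List[Step]                  # assumed: one recovery step per workflow step


def apply_rule(rule: Rule, ctx: Context) -> bool:
    """Run the workflow chain; on failure, run recovery steps in reverse."""
    if not rule.condition(ctx):
        return False
    for index, step in enumerate(rule.workflow):
        try:
            step(ctx)
        except Exception:
            # Recover the failed step first, then the steps that had completed.
            for undo in reversed(rule.recovery[: index + 1]):
                undo(ctx)
            return False
    return True


log: List[str] = []


def checksum(ctx: Context) -> None:
    log.append("checksum " + ctx["path"])


def replicate(ctx: Context) -> None:
    raise RuntimeError("resource offline")  # simulated failure


def nop(ctx: Context) -> None:
    log.append("nop")


def remove_replica(ctx: Context) -> None:
    log.append("remove partial replica")


rule = Rule(
    action="put",
    condition=lambda ctx: ctx["path"].startswith("/zone/home/"),
    workflow=[checksum, replicate],
    recovery=[nop, remove_replica],
)
succeeded = apply_rule(rule, {"path": "/zone/home/moore/a.dat"})
print(succeeded, log)
```

Pairing each workflow step with a recovery step keeps a failed chain from leaving partial state behind, which is the role the slide gives to the recovery chain.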

iput With Replication
[Diagram labels: Client, iput, data, iCAT, metadata, Resource 1, Resource 2, Rule Base (on each resource). Caption: Rule added to rule database.]

Policy-Virtualization: Automate Operations
• System-centric Policies & Obligations:
  – Manage retention, disposition, distribution, replication, integrity, authenticity, chain of custody, access controls, representation information, descriptive information requirement, logical arrangement, audit trails, authorization, authentication
• Domain-specific Policies:
  – Identification & Extraction of Metadata
  – Ingestion Control for Provenance Attribution
  – Processing of Data on Ingestion
    • Creation of multi-resolution images, type-identification, anonymization, …
  – Processing of Data on Access
    • IRB Approval for data access, Data sub-setting, Merging of multiple images, conversion, redaction, …

Policy/rule execution
• Immediate - enforced at time of action invocation
• Deferred - applied at a future time
• Periodic - applied at defined interval
• Interactive - applied on demand
• iSEC scheduler / batch system supports
  – Local workflows
  – Distributed workflows
  – Deferred and periodic workflows
  (Launch micro-services on clusters, clouds, supercomputers)

Checksum Validation Rule
myChecksumRule {
  msiMakeQuery("DATA_NAME, COLL_NAME, DATA_CHECKSUM", *Condition, *Query);
  msiExecStrCondQuery(*Query, *B);
  assign(*A, 0);
  forEachExec (*B) {
    msiGetValByKey(*B, COLL_NAME, *C);
    msiGetValByKey(*B, DATA_NAME, *D);
    msiGetValByKey(*B, DATA_CHECKSUM, *E);
    msiDataObjChksum(*B, *Operation, *F);
    ifExec (*E != *F) {
      writeLine(stdout, file *C/*D has registered checksum *E and computed checksum *F);
    }
    else {
      assign(*A, *A + 1);
    }
  }
  ifExec (*A > 0) {
    writeLine(stdout, have *A good files);
  }
}
*Condition can be COLL_NAME like '/ils161/home/moore/genealogy/%'
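
The logic of the checksum rule above (compare each registered checksum with a freshly computed one and count the files that match) can be sketched outside iRODS. This is a hedged Python illustration only: the in-memory `catalog` and `storage` dictionaries stand in for the metadata catalog and the storage resource, `validate_checksums` is a hypothetical name, and SHA-256 is an assumed algorithm since the slide does not name one.

```python
import hashlib
from typing import Dict, List, Tuple


def compute_checksum(data: bytes) -> str:
    # Assumption: SHA-256; the slide does not say which algorithm is used.
    return hashlib.sha256(data).hexdigest()


def validate_checksums(
    catalog: Dict[str, str], storage: Dict[str, bytes]
) -> Tuple[int, List[Tuple[str, str, str]]]:
    """Return (number of good files, list of (path, registered, computed))."""
    good = 0
    mismatches: List[Tuple[str, str, str]] = []
    for path, registered in catalog.items():
        computed = compute_checksum(storage[path])
        if computed == registered:
            good += 1
        else:
            mismatches.append((path, registered, computed))
    return good, mismatches


# Stand-in data: one file whose registered checksum is correct, one stale.
storage = {
    "/zone/home/moore/a.txt": b"alpha",
    "/zone/home/moore/b.txt": b"beta",
}
catalog = {
    "/zone/home/moore/a.txt": compute_checksum(b"alpha"),
    "/zone/home/moore/b.txt": "0" * 64,
}

good, mismatches = validate_checksums(catalog, storage)
for path, registered, computed in mismatches:
    print(f"file {path} has registered checksum {registered} "
          f"and computed checksum {computed}")
if good > 0:
    print(f"have {good} good files")
```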

Quota Checking Rule
mytestRule||
  assign(*A, 0)##
  assign(*ContInx, 1)##
  assign(*G, 0)##
  msiMakeGenQuery("DATA_SIZE", *Condition, *Query)##
  msiExecGenQuery(*Query, *B)##
  forEachExec(*B, msiGetValByKey(*B, DATA_SIZE, *C)##assign(*A, *A + *C)##assign(*G, *G + 1), nop)##
  whileExec(*ContInx > 0, msiGetMoreRows(*Query, *B, *ContInx)##forEachExec(*B, msiGetValByKey(*B, DATA_SIZE, *C)##assign(*A, *A + *C)##assign(*G, *G + 1), nop), nop)##
  writeLine(stdout, Total size of data owned by *D on resource *E is *A)##
  writeLine(stdout, Number of files is *G)|nop
*D=rods%*E=renci-vault1%*Condition=DATA_OWNER_NAME = 'rods' AND RESC_NAME = 'renci-vault1'
ruleExecOut
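
The quota rule above reads the first page of query results, then keeps fetching pages while the continuation index stays positive, adding up sizes and counting files as it goes. A minimal Python sketch of that paging pattern follows; `exec_query`, `total_usage`, and the page size are hypothetical, and a plain list of sizes stands in for the catalog query.

```python
from typing import List, Tuple


def exec_query(sizes: List[int], page_size: int, start: int = 0) -> Tuple[List[int], int]:
    """Return one page of rows and a continuation index (0 means no more rows)."""
    page = sizes[start:start + page_size]
    next_start = start + len(page)
    cont_inx = next_start if next_start < len(sizes) else 0
    return page, cont_inx


def total_usage(sizes: List[int], page_size: int = 2) -> Tuple[int, int]:
    """Sum DATA_SIZE values and count files, one page at a time."""
    total = 0
    count = 0
    page, cont_inx = exec_query(sizes, page_size)
    while True:
        for size in page:
            total += size
            count += 1
        if cont_inx <= 0:
            break
        page, cont_inx = exec_query(sizes, page_size, cont_inx)
    return total, count


# Stand-in for the sizes of files owned by one user on one resource.
file_sizes = [10, 20, 30, 40, 50]
total, count = total_usage(file_sizes, page_size=2)
print(f"Total size of data is {total}")
print(f"Number of files is {count}")
```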

Managing Structured Information
• Information exchange between micro-services
  – Parameter passing
  – White board memory structures
  – High performance message passing (iXMS)
  – Persistent metadata catalog (iCAT)
• Structured Information Resource Drivers
  – Interact with remote structured information resource (HDF5, netCDF, tar file)

Structured Data
• Aggregate data into a tar file
  – Mount a tar file to enable manipulation of files within the tar file
• Use HDF5 to manage aggregations of files
  – Micro-services that apply HDF5 library calls at the remote storage location
• Mount a remote directory
  – Synchronize files in directory with files in iRODS collection

Micro-services vs Web Services
• Micro-services
  – Manage exchange of structured information between micro-services through memory
  – Serialize information for transmission over a network
  – Optimized protocol for data transmission
    • Single message for small files (<32 MB)
    • Parallel I/O for large files
• Web Services
  – SOAP/HTTP data transmission between services
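
The small-file versus large-file distinction above can be illustrated with a short Python sketch that plans a transfer: one message below the 32 MB threshold, several byte ranges above it. The threshold comes from the slide (read here as 32 × 1024 × 1024 bytes, which is an assumption); `plan_transfer` and the default of four streams are hypothetical and do not describe the actual iRODS protocol.

```python
from typing import List, Tuple

# From the slide: files under 32 MB travel in a single message.
# Assumption: "MB" is read as 1024 * 1024 bytes.
SMALL_FILE_LIMIT = 32 * 1024 * 1024


def plan_transfer(size: int, streams: int = 4) -> List[Tuple[int, int]]:
    """Return (offset, length) ranges: one for small files, several for large."""
    if size < SMALL_FILE_LIMIT or streams <= 1:
        return [(0, size)]
    chunk = -(-size // streams)  # ceiling division
    return [(offset, min(chunk, size - offset)) for offset in range(0, size, chunk)]


small_plan = plan_transfer(1024)
large_plan = plan_transfer(64 * 1024 * 1024, streams=4)
print(len(small_plan), len(large_plan))
```

Each range in the large-file plan could be handed to its own stream; the ranges are contiguous and together cover the whole file.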

Research Collaborations
• NSF NARA - supports application of data grids to preservation environments
• NSF SDCI - supports development of core iRODS data grid infrastructure
• NSF OOI - future integration of data grids with real-time sensor data streams and grid computing
• NSF TDLC - production TDLC data grid and extension to remaining 5 Science of Learning Centers (0.3 FTE)
• NSF SCEC - current production environment (0.1 FTE)
• NSF Teragrid - production environment (0.1 FTE)

NSF Software Development for Cyberinfrastructure
• Conduct research on policy management in distributed data systems
  – Collection oriented data management
  – Adaptive middleware architecture
  – Distributed rule engine
  – Server-side (remote) workflow execution
  – Transactional recovery semantics
  – Automated validation
  – Automation of large-scale data administrative functions
  – Enforcement of management policies

Overview of iRODS Architecture
iRODS Shows Unified "Virtual Collection"
[Diagram: a user with a client views and manages data; the archivist sees a single "virtual collection" spanning a processing cache, an archive, and an access cache, each on disk, tape, database, file system, etc.]
The iRODS Data Grid installs in a "layer" over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.

Generic Data Management Systems
iRODS - integrated Rule-Oriented Data System

NARA Preservation Application
• Transcontinental Persistent Archive Prototype
  – Use data grid technology to build a preservation environment
  – Conduct research on preservation concepts
    • Infrastructure independence
    • Enforcement of preservation properties
    • Automation of administrative preservation processes
    • Validation of preservation assessment criteria
  – Demonstrate preservation on selected NARA digital holdings
    • Integration of generic infrastructure with preservation technologies (Cheshire, MVD, JHOVE, Pronom, Fedora, DSpace)

Preservation is an Integral Part of the Data Life Cycle
• Organize project data into a shared collection
• Publish data in a digital library for use by other researchers
• Enable data-discovery & data-driven analyses
• Preserve reference collections for use by future research initiatives
• Analyze new collection against prior state-of-the-art data
• Define & Enforce Policies for long-term management and curation

National Archives and Records Administration Transcontinental Persistent Archive Prototype
Federation of Seven Independent Data Grids
[Diagram: NARA I MCAT, NARA II MCAT, Georgia Tech MCAT, Rocket Center MCAT, U NC MCAT, U Md MCAT, UCSD MCAT]
Extensible environment; can federate with additional research and education sites. Each data grid uses different vendor products.

To Manage Long-term Preservation
• Define desired preservation properties
  – Authenticity / Integrity / Chain of Custody / Original arrangement
  – Life Cycle Data Requirements Guide
• Implement preservation processes
  – Appraisal / accession / arrangement / description / preservation / access
• Manage preservation environment
  – Minimize costs
  – Validate assessment criteria to verify preservation properties

ISO MOIMS repository assessment criteria
• We are developing 150 rules that implement the ISO assessment criteria. Examples:
  90 Verify descriptive metadata and source against SIP template and set SIP compliance flag
  91 Verify descriptive metadata against semantic term list
  92 Verify status of metadata catalog backup (create a snapshot of metadata catalog)
  93 Verify consistency of preservation metadata after hardware change or error

Sustainability
• Economic sustainability
  – Reference collections
  – Repurpose reference collections to support use by multiple communities
  – Federate resources across multiple communities
• Technological sustainability
  – Open source software
  – Support continued porting through international collaborations
• Policy sustainability
  – Evolve management policies to support new user communities
• Access sustainability
  – Support data manipulation and display by new communities

Data Virtualization
[Diagram layers, top to bottom: Access Interface, Standard Micro-services, Data Grid, Standard Operations, Storage Protocol, Storage System]
Map from the actions requested by the access method to a standard set of micro-services. The standard micro-services are mapped to the operations supported by the storage system.
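
The two mappings described above (client action to standard micro-services, then standard operations to the storage system's own calls) can be pictured as two lookup tables. The Python sketch below is purely illustrative: every action name, operation name, and storage call in it is a hypothetical placeholder, not an iRODS identifier.

```python
from typing import Dict, List

# Level 1 (hypothetical names): client action -> standard operations.
ACTION_MAP: Dict[str, List[str]] = {
    "get": ["open", "read", "close"],
    "put": ["create", "write", "close"],
}

# Level 2 (hypothetical names): standard operation -> storage-specific call.
STORAGE_MAP: Dict[str, Dict[str, str]] = {
    "unix_fs": {
        "open": "posix_open",
        "create": "posix_creat",
        "read": "posix_read",
        "write": "posix_write",
        "close": "posix_close",
    },
    "tape_archive": {
        "open": "stage_and_open",
        "create": "archive_create",
        "read": "archive_read",
        "write": "archive_write",
        "close": "archive_close",
    },
}


def resolve(action: str, storage: str) -> List[str]:
    """Translate a client action into the calls one storage system supports."""
    return [STORAGE_MAP[storage][op] for op in ACTION_MAP[action]]


print(resolve("get", "unix_fs"))
print(resolve("put", "tape_archive"))
```

In this picture, supporting a new client only adds entries to the first table and supporting a new storage system only adds entries to the second, so neither change touches the other side.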

Migration of Parsing Routines
• Data Grids minimize the effort needed to sustain parsing routines
  – Parsing routine is encapsulated as a micro-service
  – New clients can then be ported on top of the data grid without changing the parsing routine
    • Map from actions to standard actions
  – New storage systems can be added to the data grid without changing the parsing routine
    • Map from standard operations to storage protocol

Clients
• Unix shell commands
• Java I/O library
• C I/O redirection library
• Windows browser
• Web-DAV
• Kepler workflow
• HDF5 client
• DSpace
• Fedora
• Python library

Scale of iRODS Data Grid
• Number of files
  – Tens of millions to hundreds of millions of files
• Size of data
  – Hundreds of terabytes to petabytes of data
• Number of policy enforcement points
  – 20 actions define when policy is checked
• Amount of metadata
  – 112 metadata attributes for system information per file
• Number of policies
  – 150 policies
• Number of data grids
  – Federation of tens of data grids

Federation Across Spatial Scales
• International collaborations
  – Australian Research Collaboration Service (ARCS)
  – Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN)
  – Cinegrid
• National collaborations
  – Temporal Dynamics of Learning Center (TDLC)
  – Ocean Observatories Initiative (OOI)
• Regional collaborations
  – LSU data grid
  – HASTAC humanities data grid
  – Distributed Custodial Archive Preservation Environment (DCAPE)
• State collaborations
  – RENCI data grid
  – North Carolina State Library
• Institutional repositories
  – Carolina Digital Repository
  – SIO Repository

Integrating across Supercomputer / Cloud / Grid
[Diagram labels: iRODS Data Grid; iRODS Server Software; Supercomputer File System; Cloud Disk Cache; Teragrid Node; Parallel Application; Virtual Machine Environment; Grid Services; RENCI; OOI; SCEC]

ARCS Data Fabric

Davis – Modes

Temporal Dynamics of Learning Center
[Diagram: Scientist A adds data to the shared collection; Scientist B accesses and analyzes shared data. The iRODS Data System spans a Brain Data Server (CA), an Audio Data Server (NJ), a Video Data Server (TN), and the iRODS Metadata Catalog.]
Scientists can use iRODS as a "data grid" to share multiple types of data, near and far. iRODS Rules also enforce and audit human subjects access restrictions.

iRODS Evaluations
• NASA Jet Propulsion Laboratory
  – iRODS selected for managing distribution of Planetary Data System records
• NASA National Center for Computational Sciences
  – iRODS chosen to manage archive of simulation output and serve as access data cache for distribution
• AVETEC appraisal for DoD HPC centers
  – iRODS now provides all required capabilities
• French National Library
  – iRODS rules control ingestion, access, and audit functions
• Australian Research Collaboration Service
  – iRODS manages data distributed between academic institutions

Development Team
• DICE team
  – Arcot Rajasekar - iRODS development lead
  – Mike Wan - iRODS chief architect
  – Wayne Schroeder - iRODS developer
  – Bing Zhu - Fedora, Windows
  – Lucas Gilbert - Java (Jargon), DSpace
  – Paul Tooby - documentation, foundation
  – Sheau-Yen Chen - data grid administration
• Preservation
  – Richard Marciano - Preservation development lead
  – Chien-Yi Hou - preservation micro-services
  – Antoine de Torcy - preservation micro-services

Foundation
• Data Intensive Cyber-environments
  – Non-profit open source software development
  – Promote use of iRODS technology
  – Support standards efforts
  – Coordinate international development efforts
    • IN2P3 - quota and monitoring system
    • King's College London - Shibboleth
    • Australian Research Collaboration Service - WebDAV
    • Academia Sinica - SRM interface

iRODS is a "coordinated NSF/OCI-Nat'l Archives research activity" under the auspices of the President's NITRD Program and is identified as among the priorities underlying the President's 2009 Budget Supplement in the area of Human and Computer Interaction Information Management technology research.
Reagan W. Moore
moore@diceresearch.org
http://irods.diceresearch.org
NSF OCI-0848296 "NARA Transcontinental Persistent Archives Prototype"
NSF SDCI-0721400 "Data Grids for Community Driven Applications"