b809e4746085a1bbb4c086313a474edf.ppt
- Number of slides: 15
GO-ESSP, Paris, May 2007
Vocabulary management: a foundation for semantic interoperability through ontology development
Roy Lowry, British Oceanographic Data Centre
Presentation Outline
• SeaDataNet Project
• SeaDataNet Metadata Evolution
• Controlled Vocabularies and SeaDataNet
• Mappings and Ontologies
SeaDataNet Project
• SeaDataNet in a Nutshell
  · Combine over 40 oceanographic data centres across Europe into a single interoperable data system
  · Approach is to adopt established standards and technologies wherever possible
  · Two phases:
    * One brings 12 centres together with centralised metadata and distributed data as files. Due in autumn 2008
    * Two introduces data virtualisation, aggregation, cutting and 30 more centres. Due in 2010
  · The major problem facing the project is heterogeneous legacy content
SeaDataNet Metadata Evolution
• In the beginning there was plaintext
  · In the 1980s the enlightened few saw the value and potential of metadata
  · Most data suppliers saw metadata as an unnecessary waste of their valuable time
  · The enlightened in the MAST Data Committee and Search aimed to change this by making metadata creation as easy as possible
  · Plaintext is much easier to create than structured metadata, so metadata formats based on plaintext fields, such as EDMED, were promoted and populated
SeaDataNet Metadata Evolution
• The problem is that whilst humans have the intelligence to read and understand plaintext, it is of very limited use to a computer
• Consider some example EDMED plaintext parameter descriptions from a knowledge management viewpoint:
  · A wide variety of chemical and biological parameters
  · CTD data
  · Amplitude de l'echo retrodiffuse
  · Cu, Zn, Fe, Pb, Cd, Cr, Ni in biota
  · MACR0-MEIOFAUNA, SED BIOCHEMISTRY, ZOOPLANKTON, CILIATES, BACT CELLS, BACT BIOMASS, LEUCINE UPT, PRIM. PROD, METABOL, COCCOLITH
• Consequently, the chances of these data sets being discovered by conventional parameter searching are virtually zero
SeaDataNet Metadata Evolution
• Plaintext kills interoperability stone dead
  · In 2005 Taco de Bruin asked me to build Antarctic Portal DIFs from Dutch EDMED entries
  · DIF is a structured format representing each parameter thus:
      <Parameters>
        <Category>EARTH SCIENCE</Category>
        <Topic>Oceans</Topic>
        <Term>Salinity/Density</Term>
        <Variable>Salinity</Variable>
      </Parameters>
• How does one derive this automatically from the plaintext ‘CTD data’, ‘T/S’, ‘temp+salin’, ‘temperature and salinity’ and all the other variants in EDMED?
• Needless to say, Taco is still waiting…
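
One way to see the difficulty is a minimal Python sketch in which every plaintext variant has to be entered by hand in a lookup table before any DIF can be generated. The table name `VARIANT_TO_DIF`, its contents and both helper functions are illustrative assumptions, not part of any EDMED, DIF or SeaDataNet tool; only the Salinity parameter is taken from the slide above.

```python
import xml.etree.ElementTree as ET

# The one DIF parameter shown on the slide.
SALINITY = ("EARTH SCIENCE", "Oceans", "Salinity/Density", "Salinity")

# Hypothetical hand-curated lookup: normalised plaintext -> DIF parameters.
# A real table would also carry temperature and many other keywords.
VARIANT_TO_DIF = {
    "ctd data": [SALINITY],
    "t/s": [SALINITY],
    "temp+salin": [SALINITY],
    "temperature and salinity": [SALINITY],
}


def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def to_dif_parameters(text):
    """Return DIF <Parameters> elements as strings; empty list if unmapped."""
    results = []
    for category, topic, term, variable in VARIANT_TO_DIF.get(normalise(text), []):
        params = ET.Element("Parameters")
        for tag, value in (("Category", category), ("Topic", topic),
                           ("Term", term), ("Variable", variable)):
            ET.SubElement(params, tag).text = value
        results.append(ET.tostring(params, encoding="unicode"))
    return results


if __name__ == "__main__":
    print(to_dif_parameters("Temp+Salin"))
    # An unanticipated description yields nothing, which is the core problem.
    print(to_dif_parameters("A wide variety of chemical and biological parameters"))
```

Anything not already in the table comes back empty, so the work lies in curating the table rather than in the code.
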
SeaDataNet Metadata Evolution
• SeaDataNet has inherited thousands of EDMED dataset descriptions from Search
• SeaDataNet wants to establish interoperability with other metadata repositories
• The key to EDMED’s evolution to interoperability is to replace plaintext descriptions by keywords from controlled vocabularies
Vocabularies
• Controlled vocabularies featured in the legacy metadata inherited by SeaDataNet
• However
  · Content governance was total anarchy
    * Decisions made by individuals – even students
    * Terms were set up and used with inadequate thought about their meaning and formal definitions were conspicuous by their absence
  · Technical governance wasn’t much better
    * No formal maintenance or versioning
    * Vocabularies delivered on an ad-hoc basis as CSV files on FTP servers or web sites
    * Data models differed from one vocabulary to the next
Vocabularies
• All this has now changed for SeaDataNet
  · Content governance
    * SeaDataNet internal vocabularies are governed by the Technical Task Team
    * Vocabularies with wider implications are governed by the SeaDataNet and MarineXML Vocabulary Content Governance Group (SeaVoX) e-mail list
  · Technical governance
    * Vocabulary Server technology developed by NERC DataGrid adopted
      – Oracle back end with automated versioning and audit trail maintenance
      – Web Service API front end
      – Clients using the API are available
        » http://vocab.ndg.nerc.ac.uk/client/vocabServer.jsp (BODC)
        » http://seadatanet.maris2.nl/v_bodc_vocab/welcome.asp (Maris)
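
The idea of automated versioning with an audit trail can be sketched in a few lines of Python. This is an in-memory stand-in under stated assumptions: the `VocabularyList` class and its methods are hypothetical and do not reflect the schema or API of the NERC DataGrid Vocabulary Server, which uses an Oracle back end.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class VocabularyList:
    """Hypothetical in-memory stand-in for one controlled vocabulary list."""
    name: str
    version: int = 0
    terms: dict = field(default_factory=dict)   # key -> formal definition
    audit: list = field(default_factory=list)   # (version, timestamp, action, key)

    def _record(self, action, key):
        # Every change bumps the list version and leaves an audit entry.
        self.version += 1
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit.append((self.version, stamp, action, key))

    def add(self, key, definition):
        if key in self.terms:
            raise KeyError(f"{key} already exists; use modify()")
        if not definition.strip():
            # Refuse terms without a definition, the gap noted for legacy lists.
            raise ValueError("a formal definition is required")
        self.terms[key] = definition
        self._record("add", key)

    def modify(self, key, definition):
        if key not in self.terms:
            raise KeyError(key)
        self.terms[key] = definition
        self._record("modify", key)


if __name__ == "__main__":
    vocab = VocabularyList("demo")
    vocab.add("SALINITY", "Salinity of the water body")
    vocab.modify("SALINITY", "Practical salinity of the water body")
    print(vocab.version, [entry[2] for entry in vocab.audit])
```

Because no change can bypass `_record`, the version number and the audit trail cannot drift apart, which is the property the ad-hoc CSV deliveries lacked.
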
Vocabularies
• The importance of controlled vocabularies to SeaDataNet will increase as we stitch together disparate data sources for real
• BODC client currently exposes 69 lists
• Maris client exposes 27 of these that are particularly relevant to SeaDataNet
• These numbers will grow as vocabulary harmonisation within SeaDataNet and between SeaDataNet and the wider community progresses
Mappings and Ontologies
• The availability of heavily populated, well-managed controlled vocabularies provides the basis for interoperability within SeaDataNet
• However, this leaves important unaddressed issues:
  · Creating standard content from partner legacy systems
  · Applying new standards to Search legacy content
  · External interoperability with metadata formats such as DIF and ISO 19115 profiles
• The solution is ontology development
Mappings and Ontologies
• Basic bottom-up ontologies are simply sets of lists with the relationships between their entries
• These are being built between legacy, internal and external lists
• Draft maps produced between parameter vocabularies from BODC, GCMD and CF using SKOS relationships (5-6000 relationships)
• These provide the basis for automatic creation of SeaDataNet and GCMD metadata from BODC and CF file aggregations
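
A bottom-up ontology of this kind can be sketched as a list of (subject, relationship, object) triples. The Python below is a minimal illustration, assuming SKOS mapping relationship names and made-up prefixed identifiers; the three mappings are examples for illustration and are not taken from the draft BODC, GCMD and CF maps.

```python
EXACT, BROAD, NARROW = "skos:exactMatch", "skos:broadMatch", "skos:narrowMatch"

# Illustrative mappings (subject, relationship, object); not from the draft maps.
# "A skos:broadMatch B" reads as: B is a broader concept than A.
MAPPINGS = [
    ("BODC:PSALST01", EXACT, "CF:sea_water_salinity"),
    ("BODC:PSALST01", BROAD, "GCMD:Salinity"),
    ("CF:sea_water_salinity", BROAD, "GCMD:Salinity"),
]


def translate(term, target_prefix, relations=(EXACT, BROAD)):
    """Return terms in the target vocabulary that `term` maps to directly."""
    return sorted(
        obj for subj, rel, obj in MAPPINGS
        if subj == term and rel in relations and obj.startswith(target_prefix + ":")
    )


def metadata_keywords(file_terms, target_prefix):
    """Collect de-duplicated target-vocabulary keywords for a set of source terms."""
    keywords = set()
    for term in file_terms:
        keywords.update(translate(term, target_prefix))
    return sorted(keywords)


if __name__ == "__main__":
    # Terms found in a hypothetical file aggregation, translated to GCMD keywords.
    print(metadata_keywords(["BODC:PSALST01", "CF:sea_water_salinity"], "GCMD"))
```

Only exact and broader matches are followed by default, since a broader keyword is still a true description for discovery metadata whereas a narrower one may not be.
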
Mappings and Ontologies
• Ontology building won’t solve all SeaDataNet’s interoperability problems
• The only cure for legacy plaintext is manual standardisation
• Fortunately SeaDataNet has many partners to share this burden
Some Conclusions
• Plaintext is a destroyer of interoperability
• Vocabularies need rigorous management if they are to support interoperability effectively
• Multiple controlled vocabularies covering a common domain topic, whilst undesirable, can be repaired by mapping/ontology building
• Establishing interoperability in a scenario with legacy content is 10% inspiration and 90% perspiration
That’s All Folks
Thank you for your attention
Any questions?