13450bcd9076b361ddd35111d06b847c.ppt
- Number of slides: 78
Future Database Needs: SC 32 Study Period
February 5, 2007
JTC 1 SC 32 N 1633
Bruce Bargmeyer, Lawrence Berkeley National Laboratory, University of California
Tel: +1 510-495-2905
bebargmeyer@lbl.gov
Topics
- Study period purpose
- New challenges
- A brief tutorial on semantics and semantic computing
- Where XMDR fits
  - Semantic computing technologies
  - Traditional data administration
- Some limitations of current relational technologies
- Some input from other sources
Future Database Needs Study Period
- A one-year study period to identify and understand case studies related to this area.
- Bring together a small group of experts in a meeting on “Case Studies on new Database Standards Requirements”.
- The workshop would provide input to existing SC 32 projects and may provide background material for new proposals for upgrades or for new work within SC 32, in time for the 2007 SC 32 Plenary.
Source: Document 32 N 1451
The Internet Revolution
A world wide web of diverse content.
The information glut is nothing new. The access to it is astonishing.
Challenge: Find and process non-explicit data
For example: patient data on drugs contains brand names (e.g., Tylenol, Anacin-3, Datril, …); however, we want to study patients taking analgesic agents.
[Figure: concept hierarchy with the nodes Analgesic Agent, Non-Narcotic Analgesic, Analgesic and Antipyretic, Nonsteroidal Antiinflammatory Drug, Acetaminophen, and the brand names Tylenol, Anacin-3, Datril]
Challenge: Specify and compute across relations, e.g., within a food web in an Arctic ecosystem
An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
Challenge: Combine Data, Metadata & Concept Systems
Inference search query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003”
Data:
  ID  Date      Temp  Hg
  A   06-09-13  4.4   4
  B   06-09-13  9.3   2
  X   06-09-13  6.7   78
Concept system: Contamination covers Biological, Radioactive, and Chemical; Chemical covers mercury, lead, and cadmium.
Metadata:
  Name  Datatype  Definition                     Units
  ID    text      Monitoring Station Identifier  not applicable
  Date  date      Date                           yy-mm-dd
  Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
  Hg    number    Mercury contamination          micrograms per liter
Challenge: Use data from systems that record the same facts with different terms
[Figure: registries and repositories that hold common content under different terms: database catalogs, ISO 11179 registries, OASIS/ebXML registries, UDDI registries, CASE tool repositories, software component registries, Dublin Core registries, ontological registries. The same item (e.g., Country Identifier) appears variously as a data element, XML tag, table column, business specification, attribute, business object, coverage, or term hierarchy.]
Same Fact, Different Terms
Data Element Concept
  Name: Country Identifiers
  Unique ID: 5769
  Other attributes (values not shown): Context, Definition, Conceptual Domain, Maintenance Org., Steward, Classification, Registration Authority, Others
  Value meanings: Algeria, Belgium, China, Denmark, Egypt, France, ..., Zimbabwe
Data Elements
  Unique ID: 4572
  Other attributes (values not shown): Name, Context, Definition, Value Domain, Maintenance Org., Steward, Classification, Registration Authority, Others
ISO 3166 representations of the same countries:
  English Name  French Name  2-Alpha Code  3-Character Code  3-Numeric Code
  Algeria       L`Algérie    DZ            DZA               012
  Belgium       Belgique     BE            BEL               056
  China         Chine        CN            CHN               156
  Denmark       Danemark     DK            DNK               208
  Egypt         Egypte       EG            EGY               818
  France        La France    FR            FRA               250
  ...
  Zimbabwe                   ZW            ZWE               716
Challenge: Draw information together from a broad range of studies, databases, reports, etc.
Challenge: Gain Common Understanding of meaning between Data Creators and Data Users
A common interpretation of what the data represents
[Figure: data creation at EEA, USGS, DoD, EPA, and others feeds information systems used by many users. Each source labels its numeric data with its own term list, in English (environ, agriculture, climate, human health, industry, tourism, soil, water, air) or in Spanish equivalents.]
Challenge: Drawing Together Dispersed Data
A common interpretation of what the data represents
[Figure: data creation at EEA, USGS, DoD, EPA, and others feeds information systems used by many users. Each source labels its numeric data with its own term list, in English (environ, agriculture, climate, human health, industry, tourism, soil, water, air) or in Spanish equivalents.]
Semantic Computing
- We are laying the foundation to make a quantum leap toward a substantially new way of computing: Semantic Computing
- How can we make use of semantic computing?
- What do organizations need to do to prepare for and stimulate semantic computing?
Coming: A Semantic Revolution
- Searching and ranking
- Pattern analysis
- Knowledge discovery
- Question answering
- Reasoning
- Semi-automated decision making
The Nub of It
- Processing that takes “meaning” into account
- Processing based on the relations between things, not just computing about the things themselves
- Computing that takes people out of the processing, reducing the human toil
  - Data access, extraction, mapping, translation, formatting, validation, inferencing, …
- Delivering higher-level results that are more helpful for the user’s thought and action
Semantics Challenges
- Managing, harmonizing, and vetting semantics is essential to enable enterprise semantic computing
- Managing, harmonizing, and vetting semantics is important for traditional data management
  - In the past we just covered the basics
- Enabling “community intelligence” through efforts similar to Wikipedia, Wiktionary, Flickr
A Brief Tutorial on Semantics
- What is meaning?
- What are concepts?
- What are relations?
- What are concept systems?
- What is “reasoning”?
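Meaning: The Semiotic Triangle
[Figure: triangle with the vertices Thought or Reference (Concept), Referent, and Symbol (“Rose”, “ClipArt”). The symbol symbolises the concept, the concept refers to the referent, and the symbol stands for the referent.]
Source: C. K. Ogden and I. A. Richards, The Meaning of Meaning.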
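Semiotic Triangle: Concepts, Definitions and Signs
[Figure: triangle with the vertices CONCEPT (with its Definition), Referent, and Sign (“Rose”, “ClipArt”). The sign symbolizes the concept, the concept refers to the referent, and the sign stands for the referent.]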
Definitions in the EPA Environmental Data Registry
- Mailing Address: http://www.epa/gov/edr/sw/AdministeredItem#MailingAddress
  The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box
- State USPS Code: http://www.epa/gov/edr/sw/AdministeredItem#StateUSPSCode
  The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada
- Mailing Address State Name: http://www.epa/gov/edr/sw/AdministeredItem#StateName
  The name of the state where mail is delivered
SNOMED – Terms Defined by Relations
Computable Meaning
[Figure: the semiotic triangle (CONCEPT, Referent, Symbol “Rose”, “ClipArt”) annotated with the relations rdfs:subClassOf, owl:equivalentClass, and owl:disjointWith]
If “rose” is owl:disjointWith “daffodil”, then a computer can determine that an assertion is invalid if it states that a rose is also a daffodil (e.g., in a knowledge base).
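
The disjointness check described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory dictionaries stand in for a real knowledge base, and the individuals flower1 and flower2 are hypothetical. It does not use an OWL reasoner.

```python
# Toy stand-in for a knowledge base; all names here are hypothetical.
# Each frozenset is one owl:disjointWith axiom between two classes.
DISJOINT = {frozenset({"Rose", "Daffodil"})}


def find_disjointness_violations(types_by_individual, disjoint_pairs):
    """Return (individual, class_a, class_b) for every individual that is
    asserted to belong to two classes declared disjoint."""
    violations = []
    for individual, classes in sorted(types_by_individual.items()):
        ordered = sorted(classes)
        for i, class_a in enumerate(ordered):
            for class_b in ordered[i + 1:]:
                if frozenset({class_a, class_b}) in disjoint_pairs:
                    violations.append((individual, class_a, class_b))
    return violations


assertions = {
    "flower1": {"Rose"},
    "flower2": {"Rose", "Daffodil"},  # contradicts the disjointness axiom
}
violations = find_disjointness_violations(assertions, DISJOINT)
print(violations)
```

A production system would more likely delegate this check to an OWL reasoner; the point here is only that a declared relation between concepts makes the inconsistency mechanically detectable.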
What are Relations?
[Figure: the concepts Merced River, Fletcher Creek, and Merced Lake are each linked to the concept WaterBody by the relation isA]
Concepts and relations can be represented as nodes and edges in formal graph structures, e.g., “is-a” hierarchies.
Concept Systems have Nodes and may have Relations
- Nodes represent concepts
- Lines (arcs) represent relations
[Figure: example graph with the nodes A, 1, 2, a, b, c, d]
Concept systems are concepts and the relations between them. Concept systems can be represented and queried as graphs.
A More Complex Concept Graph
Concept lattice of inland water features
[Figure: lattice relating the attributes Linear, Large linear, Large non-linear, Small non-linear, Deep, Shallow, Natural, Artificial, Flowing, Stagnant to the features River, Stream, Canal, Reservoir, Lake, Marsh, Pond]
From: Paulo Santos and Brandon Bennett, Supervaluation Semantics for an Inland Water Feature Ontology, http://ijcai.org/papers/1187.pdf#search=%22terminology%20water%20ontology%22
Types of Concept System Graph Structures
- Tree
- Partial Order Tree
- Ordered Tree
- Partial Order Graph
- Bipartite Graph
- Faceted Classification
- Powerset of 3 element set
- Directed Acyclic Graph
- Clique
- Compound Graph
Graph Taxonomy
[Figure: taxonomy of graph types with the nodes Graph, Directed Graph, Undirected Graph, Directed Acyclic Graph, Bipartite Graph, Clique, Partial Order Graph, Faceted Classification, Lattice, Partial Order Tree, Tree, Ordered Tree]
Note: not all bipartite graphs are undirected.
What Kind of Relations are There? Lots!
Relationship class: A particular type of connection existing between people related to or having dealings with each other.
- acquaintanceOf - A person having more than slight or superficial knowledge of this person but short of friendship.
- ambivalentOf - A person towards whom this person has mixed feelings or emotions.
- ancestorOf - A person who is a descendant of this person.
- antagonistOf - A person who opposes and contends against this person.
- apprenticeTo - A person to whom this person serves as a trusted counselor or teacher.
- childOf - A person who was given birth to or nurtured and raised by this person.
- closeFriendOf - A person who shares a close mutual friendship with this person.
- collaboratesWith - A person who works towards a common goal with this person.
- …
Example of relations in a food web in an Arctic ecosystem
An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
Ontologies are a type of Concept System
- Ontology: explicit formal specifications of the terms in the domain and relations among them (Gruber 1993)
- An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them.
- Why would someone want to develop an ontology? Some of the reasons are:
  - To share common understanding of the structure of information among people or software agents
  - To enable reuse of domain knowledge
  - To make domain assumptions explicit
  - To separate domain knowledge from the operational knowledge
  - To analyze domain knowledge
Source: http://www.ksl.stanford.edu/people/dlm/papers/ontology101-noy-mcguinness.html
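What is Reasoning? Inference
[Figure: is-a hierarchy. Infectious Disease and Chronic Disease are each linked to Disease by is-a; Polio and Smallpox are linked to Infectious Disease; Diabetes and Heart disease are linked to Chronic Disease. A distinct arrow style signifies an inferred is-a relationship, such as one from a specific disease directly to Disease.]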
Reasoning: Taxonomies & partonomies can be used to support inference queries
E.g., if a database contains information on events by city, we could query that database for events that happened in a particular county or state, even though the event data does not contain explicit state or county codes.
[Figure: part-of hierarchy. Oakland and Berkeley are part-of Alameda County; Santa Clara and San Jose are part-of Santa Clara County; both counties are part-of California.]
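
A minimal Python sketch of the partonomy query described above, assuming the part-of edges read off the slide's figure. The EVENTS rows are hypothetical sample data, and the function names are illustrative rather than part of any registry API.

```python
# part-of edges as read from the slide's figure (child -> containing region).
PART_OF = {
    "Oakland": "Alameda County",
    "Berkeley": "Alameda County",
    "Santa Clara": "Santa Clara County",
    "San Jose": "Santa Clara County",
    "Alameda County": "California",
    "Santa Clara County": "California",
}

# Hypothetical event table that stores only a city, with no county or state code.
EVENTS = [
    ("street fair", "Oakland"),
    ("lecture", "Berkeley"),
    ("expo", "San Jose"),
]


def containers(place, part_of):
    """Return every region that transitively contains `place`."""
    found = []
    while place in part_of and part_of[place] not in found:
        place = part_of[place]
        found.append(place)
    return found


def events_in(region, events, part_of):
    """Select events whose city is the region itself or lies inside it."""
    return [name for name, city in events
            if city == region or region in containers(city, part_of)]


print(events_in("Alameda County", EVENTS, PART_OF))
print(events_in("California", EVENTS, PART_OF))
```

The event rows never mention a county or state; the answer comes entirely from walking the part-of relation at query time.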
Reasoning: Relationship metadata can be used to infer non-explicit data
For example:
(1) patient data on drugs currently being taken contains brand names (e.g., Tylenol, Anacin-3, Datril, …);
(2) a concept system connects different drug types and names with one another (via is-a, part-of, etc. relationships);
(3) so patient data can be linked and searched by inferred terms like “acetaminophen” and “analgesic” as well as by trade names explicitly stored as text strings in the database.
[Figure: concept hierarchy with the nodes Analgesic Agent, Non-Narcotic Analgesic, Analgesic and Antipyretic, Nonsteroidal Antiinflammatory Drug, Acetaminophen, and the brand names Tylenol, Anacin-3, Datril]
Reasoning: Least Common Ancestor Query
What is the least common ancestor concept in the NCI Thesaurus for Acetaminophen and Morphine Sulfate? (answer = Analgesic Agent)
[Figure: concept hierarchy with the nodes Analgesic Agent, Opioid, Opiate, Morphine Sulfate, Codeine Phosphate, Non-Narcotic Analgesic, Analgesic and Antipyretic, Nonsteroidal Antiinflammatory Drug, Acetaminophen]
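
The least-common-ancestor query can be sketched as follows. The parent links below are an assumption: they are one plausible reading of the slide's figure, not an extract of the NCI Thesaurus, and the sketch handles only hierarchies in which each concept has a single parent.

```python
# Assumed single-parent hierarchy (child -> parent), read from the slide's figure.
PARENT = {
    "Opioid": "Analgesic Agent",
    "Non-Narcotic Analgesic": "Analgesic Agent",
    "Opiate": "Opioid",
    "Morphine Sulfate": "Opiate",
    "Codeine Phosphate": "Opiate",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Acetaminophen": "Analgesic and Antipyretic",
}


def path_to_root(concept, parent):
    """Return the concept followed by each of its ancestors, nearest first."""
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def least_common_ancestor(a, b, parent):
    """Return the nearest concept that is an ancestor of (or equal to) both."""
    ancestors_of_a = set(path_to_root(a, parent))
    for candidate in path_to_root(b, parent):
        if candidate in ancestors_of_a:
            return candidate
    return None


lca = least_common_ancestor("Acetaminophen", "Morphine Sulfate", PARENT)
print(lca)
```

Under these assumed links the result matches the answer given on the slide. A concept system where concepts have several parents would need a graph search rather than a single upward walk.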
Reasoning: Example “sibling” queries: concepts that share a common ancestor
- Environmental
  - “Siblings” of Wetland (in NASA SWEET ontology)
- Health
  - Siblings of ERK1 finds all 700+ other kinase enzymes
  - Siblings of Novastatin finds all other statins
- 11179 Metadata
  - Sibling values in an enumerated value domain
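
A sibling query can be sketched as a lookup of concepts that share a direct parent. The toy hierarchy below reuses the disease example from the tutorial as an assumption; it is not taken from SWEET, the NCI Thesaurus, or any 11179 value domain.

```python
# Toy is-a hierarchy (concept -> set of direct parents); illustrative only.
PARENTS = {
    "Infectious Disease": {"Disease"},
    "Chronic Disease": {"Disease"},
    "Polio": {"Infectious Disease"},
    "Smallpox": {"Infectious Disease"},
    "Diabetes": {"Chronic Disease"},
    "Heart Disease": {"Chronic Disease"},
}


def siblings(concept, parents):
    """Return the concepts that share at least one direct parent with `concept`."""
    own_parents = parents.get(concept, set())
    return sorted(other for other, other_parents in parents.items()
                  if other != concept and own_parents & other_parents)


polio_siblings = siblings("Polio", PARENTS)
print(polio_siblings)
print(siblings("Chronic Disease", PARENTS))
```

Because parents are stored as sets, the same function also covers concepts with more than one parent.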
Reasoning: More complex “sibling” queries: concepts with multiple ancestors
- Health
  - Find all the siblings of Breast Neoplasm
  [Figure: Breast Neoplasm has two parents, neoplasms by site and breast disorders; its siblings include Eye neoplasm, Respiratory System neoplasm, and Non-Neoplastic Breast Disorder]
- Environmental
  - Find all chemicals that are a
    - carcinogen (cause cancer) and
    - toxin (are poisonous) and
    - teratogenic (cause birth defects)
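
The environmental query above amounts to intersecting the descendants of several concepts. The sketch below assumes a hypothetical concept system; the chemical names are placeholders, not real classifications.

```python
# Hypothetical concept system (concept -> direct narrower concepts).
CHILDREN = {
    "carcinogen": ["chemical A", "chemical B"],
    "toxin": ["chemical B", "chemical C", "heavy metal"],
    "heavy metal": ["chemical A"],
    "teratogen": ["chemical B", "chemical D"],
}


def descendants(concept, children):
    """Return every concept reachable from `concept` via narrower-concept links."""
    found = set()
    stack = [concept]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def in_all_categories(categories, children):
    """Return the concepts that fall under every one of the given categories."""
    sets = [descendants(category, children) for category in categories]
    return sorted(set.intersection(*sets))


matches = in_all_categories(["carcinogen", "toxin", "teratogen"], CHILDREN)
print(matches)
```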
End of Tutorial about concept systems
What are the “Database Language” challenges?
Metadata Registries & Database Technologies – Which Does What?
Traditional Data Registries (11179 Edition 2)
- Register metadata which describes data in databases, applications, XML Schemas, data models, flat files, paper
- Assist in harmonizing, standardizing, and vetting metadata
- Assist data engineering
- Provide a source of well-formed data designs for system designers
- Record reporting requirements
- Assist data generation, by describing the meaning of data entry fields and the potential valid values
- Register provenance information that can be provided to end users of data
- Assist with information discovery by pointing to systems where particular data is maintained
Traditional MDR: Manage Code Sets
[Figure: the Data Element Concept “Country Identifiers” (Unique ID 5769, value meanings Algeria, Belgium, China, Denmark, Egypt, France, ..., Zimbabwe) and its Data Elements (Unique ID 4572), which represent each country by ISO 3166 English Name, French Name, 2-Alpha Code, 3-character code, and 3-Numeric Code; for example Denmark, Danemark, DK, DNK, 208]
What Can XMDR Do?
Support a new generation of semantic computing
- Concept system management
- Harmonizing and vetting concept systems
- Linkage of concept systems to data
- Interrelation of multiple concept systems
- Grounding ontologies and RDF in agreed-upon semantics
- Reasoning across XMDR content (concept systems and metadata)
- Provision of semantic services
We are trying to manage semantics in an increasingly complex content space
- Structured data
- Semi-structured data
- Unstructured data
- Text
- Pictographic
- Graphics
- Multimedia
- Voice
- Video
Case Study
- Combining concept systems, data, and metadata to answer queries
Linking Concepts: Text Document
Title 40--Protection of Environment
CHAPTER I--ENVIRONMENTAL PROTECTION AGENCY
PART 141--NATIONAL PRIMARY DRINKING WATER REGULATIONS
40 CFR Ch. I (7–1–02 Edition)
§ 141.62 Maximum contaminant levels for inorganic contaminants.
(a) [Reserved]
(b) The maximum contaminant levels for inorganic contaminants specified in paragraphs (b)(2)–(6), (b)(10), and (b)(11)–(16) of this section apply to community water systems and non-transient, non-community water systems. The maximum contaminant level specified in paragraph (b)(1) of this section only applies to community water systems. The maximum contaminant levels specified in (b)(7), (b)(8), and (b)(9) of this section apply to community water systems; non-transient, non-community water systems; and transient non-community water systems.
  Contaminant    MCL (mg/l)
  (1) Fluoride   4.0
  (2) Asbestos   7 Million Fibers/liter (longer than 10 μm)
  (3) Barium     2
  (4) Cadmium    0.005
  (5) Chromium   0.1
  (6) Mercury    0.002
  (7) Nitrate    10 (as Nitrogen)
Thesaurus Concept System (From GEMET)
Chemical Contamination
- Definition: The addition or presence of chemicals to, or in, another substance to such a degree as to render it unfit for its intended purpose.
- Broader Term: contamination
- Narrower Terms: cadmium contamination, lead contamination, mercury contamination
- Related Terms: chemical pollutant, chemical pollution
- Deutsch: Chemische Verunreinigung
- English (US): chemical contamination
- Español: contaminación química
Source: General Multi-Lingual Environmental Thesaurus (GEMET)
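Concept System (Thesaurus)
[Figure: Contamination has the narrower concepts Biological, Radioactive, and Chemical; Chemical has the narrower concepts cadmium, lead, and mercury, and the related terms chemical pollutant and chemical pollution]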
Chemicals in EPA Environmental Data Registry
[Screenshot of registry search results with the columns Name, Type (Chemical or Biological Organism), CAS Number, TSN, ICTV, and EPA ID. Visible entries include the names “Mercury, bis(acetato-.kappa.O)(benzenamine)-”, “Mercury, (acetato-.kappa.O)phenyl-, mixt. with phenylmercuric propionate”, and the organism Acalypha ostryifolia (TSN 28189); the CAS numbers 7439-97-6 and 63549-47-3, plus one entry with no CAS number; and the EPA IDs E17113275 and E965269.]
Data
[Figure: map showing station X on Fletcher Creek, station B on Merced River, and station A on Merced Lake]
Monitoring Stations:
  Name  Latitude  Longitude  Location
  A     41.45 N   125.99 W   Merced Lake
  B     43.23 N   120.50 W   Merced River
  X     39.45 N   118.12 W   Fletcher Creek
Measurements:
  ID  Date        Temp  Hg
  A   2006-09-13  4.4   4
  B   2006-09-13  9.3   2
  X   2006-09-15  5.2   3
  X   2006-09-13  6.7   78
Metadata
Contaminants:
  Contaminant  Threshold
  mercury      5
  lead         42?
  cadmium      250?
Metadata (System / Data Element: Definition; Units; Precision):
- Measurements / ID: Monitoring Station Identifier; units not applicable
- Measurements / Date: Date sample was collected; units not applicable
- Measurements / Temp: Temperature; degrees Celsius; precision 0.1
- Measurements / Hg: Mercury contamination; micrograms per liter; precision 0.004
- Monitoring Stations / Name: Monitoring Station Identifier
- Monitoring Stations / Latitude: Latitude where sample was taken
- Monitoring Stations / Longitude: Longitude where sample was taken
- Monitoring Stations / Location: Body of water monitored
- Contaminants / Contaminant: Name of contaminant
- Contaminants / Threshold: Acceptable threshold value
Relations among Inland Bodies of Water
[Figure: Fletcher Creek feeds into Merced River (equivalently, Merced River is fed from Fletcher Creek); Merced River feeds into Merced Lake]
Combining Data, Metadata & Concept Systems
Inference search query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 2 parts per billion between December 2001 and March 2003”
Data:
  ID  Date      Temp  Hg
  A   06-09-13  4.4   4
  B   06-09-13  9.3   2
  X   06-09-13  6.7   78
Concept system: Contamination covers Biological, Radioactive, and Chemical; Chemical covers mercury, lead, and cadmium.
Metadata:
  Name  Datatype  Definition                     Units
  ID    text      Monitoring Station Identifier  not applicable
  Date  date      Date                           yy-mm-dd
  Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
  Hg    number    Mercury contamination          micrograms per liter
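
The combined query can be sketched in Python under several assumptions. The feeds-into edges are one reading of the deck's figure, the sample rows come from the case-study data slide, and micrograms per liter is treated as equivalent to parts per billion, which is a common approximation for dilute aqueous samples. Because the sample rows are dated 2006 while the slide's query asks about 2001 to 2003, the date range is a parameter here and the example call uses a range that covers the sample rows.

```python
from datetime import date

# Concept system: which specific contaminants count as "chemical contamination".
NARROWER = {"chemical contamination": ["mercury", "lead", "cadmium"]}
# Metadata: which measurement field records which contaminant (only Hg exists).
FIELD_FOR = {"mercury": "Hg"}
# Relations among water bodies (assumed direction of flow).
FEEDS_INTO = {"Fletcher Creek": "Merced River", "Merced River": "Merced Lake"}
# Data: monitoring stations and measurements from the case-study slides.
STATION_LOCATION = {"A": "Merced Lake", "B": "Merced River", "X": "Fletcher Creek"}
MEASUREMENTS = [
    {"ID": "A", "Date": date(2006, 9, 13), "Temp": 4.4, "Hg": 4},
    {"ID": "B", "Date": date(2006, 9, 13), "Temp": 9.3, "Hg": 2},
    {"ID": "X", "Date": date(2006, 9, 15), "Temp": 5.2, "Hg": 3},
    {"ID": "X", "Date": date(2006, 9, 13), "Temp": 6.7, "Hg": 78},
]


def downstream(body):
    """Return the water bodies reached by following feeds-into links."""
    reached = []
    while body in FEEDS_INTO and FEEDS_INTO[body] not in reached:
        body = FEEDS_INTO[body]
        reached.append(body)
    return reached


def contaminated_downstream(source, concept, limit, start, end):
    """Water bodies downstream of `source` with any reading above `limit`."""
    fields = [FIELD_FOR[c] for c in NARROWER[concept] if c in FIELD_FOR]
    targets = set(downstream(source))
    hits = set()
    for row in MEASUREMENTS:
        location = STATION_LOCATION[row["ID"]]
        if location in targets and start <= row["Date"] <= end:
            if any(row[field] > limit for field in fields):
                hits.add(location)
    return sorted(hits)


result = contaminated_downstream(
    "Fletcher Creek", "chemical contamination", 2,
    date(2006, 9, 1), date(2006, 9, 30))
print(result)
```

Each of the three inputs contributes one step: the concept system expands "chemical contamination" to mercury, the metadata maps mercury to the Hg field and its units, and the water-body relations supply "downstream".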
Example – Environmental Text Corpus
- Idea: Develop an environmental research corpus that could attract R&D efforts. Include the reports and other material from over $1B of EPA-sponsored research.
  - Prepare the corpus and make it available
    - Research results from years of ORD R&D
  - Publish associated metadata and concept systems in XMDR
  - Use open source software for EPA testing
Information Extraction & Semantic Computing
[Figure: an extraction engine (Segment, Classify, Associate, Normalize, Deduplicate) and analysis steps (Discover patterns, Select models, Fit parameters, Inference, Report results), supported by 11179-3 (E3) XMDR, produce actionable information for decision support]
Metadata Registries are Useful
Registered semantics
- For “training” extraction engines
- The “Normalize” function can make use of standard code sets that have mappings between representation forms.
- The “Classify” function can interact with pre-established concept systems.
Provenance
- High precision for proper nouns, less precision (e.g., 70%) for other concepts. This impacts downstream processing, so precision needs to be tracked.
Normalize – Need Registered and Mapped Concepts/Code Sets
[Figure: the Data Element Concept “Country Identifiers” (Unique ID 5769, value meanings Algeria, Belgium, China, Denmark, Egypt, France, ..., Zimbabwe) and its Data Elements (Unique ID 4572), which represent each country by ISO 3166 English Name, French Name, 2-Alpha Code, 3-character code, and 3-Numeric Code; for example Denmark, Danemark, DK, DNK, 208]
Challenge for Database Languages
- The extraction database can contain graphs with more than a billion nodes.
  - Types of queries that can be done
  - Query performance
  - Linkage of “extract database” concepts and relations to the same concepts and relations in traditional databases
Example – 11179-3 (E3) Support Semantic Web Applications
XMDR may be used to “ground” the semantics of an RDF statement.
The address state code is “AB”. This can be expressed as a directed graph, e.g., an RDF statement:
  Graph  RDF        Example
  Node   Subject    Address
  Edge   Predicate  State Code
  Node   Object     AB
Example: Grounding RDF nodes and relations: URIs Reference a Metadata Registry
dbA:e0139 ai:MailingAddress dbA:ma344
dbA:ma344 ai:StateUSPSCode “AB”^^ai:StateCode
@prefix dbA: “http:/www.epa.gov/databaseA”
@prefix ai: “http://www.epa.gov/edr/sw/AdministeredItem#”
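
One way to picture the grounding step is to expand each prefixed name into a full URI and look that URI up in a registry. In the sketch below the REGISTRY dictionary is a hypothetical in-memory stand-in for a metadata registry, not an XMDR API; the dbA namespace is written with a conventional "http://" scheme and a trailing "#" as an assumption; and the definitions are copied from the deck's EPA Environmental Data Registry slide.

```python
PREFIXES = {
    "dbA": "http://www.epa.gov/databaseA#",  # assumed form of the slide's dbA prefix
    "ai": "http://www.epa.gov/edr/sw/AdministeredItem#",
}

# The two statements from the slide; the object of the second is a typed literal.
TRIPLES = [
    ("dbA:e0139", "ai:MailingAddress", "dbA:ma344"),
    ("dbA:ma344", "ai:StateUSPSCode", ("AB", "ai:StateCode")),
]


def expand(qname):
    """Turn a prefixed name such as ai:StateUSPSCode into a full URI."""
    prefix, local_name = qname.split(":", 1)
    return PREFIXES[prefix] + local_name


# Hypothetical registry content keyed by administered-item URI.
REGISTRY = {
    expand("ai:MailingAddress"): (
        "The exact address where a mail piece is intended to be delivered, "
        "including urban-style address, rural route, and PO Box"),
    expand("ai:StateUSPSCode"): (
        "The U.S. Postal Service (USPS) abbreviation that represents a state "
        "or state equivalent for the U.S. or Canada"),
}


def grounded_definition(qname):
    """Return the registered definition for a predicate, or None if unregistered."""
    return REGISTRY.get(expand(qname))


for subject, predicate, obj in TRIPLES:
    print(expand(predicate), "->", grounded_definition(predicate))
```

The design point is that the predicate's URI is not just an opaque label: it resolves to an administered item whose definition has been registered and vetted.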
Definitions in the EPA Environmental Data Registry
- Mailing Address: http://www.epa/gov/edr/sw/AdministeredItem#MailingAddress
  The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box
- State USPS Code: http://www.epa/gov/edr/sw/AdministeredItem#StateUSPSCode
  The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada
- Mailing Address State Name: http://www.epa/gov/edr/sw/AdministeredItem#StateName
  The name of the state where mail is delivered
Ontologies for Data Mapping
Ontologies can help to capture and express semantics
[Figure: the concepts Geographic Area and Geographic Sub-Area are linked to Country Identifier, which is linked to Country Name (Short Name, Long Name, Mailing Address Country Name, Distributor Country Name) and Country Code (ISO 3166 2-Character Code, ISO 3166 3-Numeric Code, ISO 3166 3-Character Code, FIPS Code)]
Example: Content Mapping Service
- Collect data from many sources: files contain data in which the same facts are represented by different terms. E.g., one system responds with Danemark or DK, another with DNK, another with 208; map all to Denmark.
- XMDR could accept XML files with the data from different code sets and return a result mapped to a single code set.
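
The mapping step can be sketched as a lookup table built from a registered code set. The rows below are taken from the deck's ISO 3166 example; the function names are illustrative assumptions rather than an XMDR interface, and the sketch skips the XML input and output handling the slide mentions.

```python
# (English name, French name, 2-alpha code, 3-character code, numeric code)
COUNTRY_CODE_SET = [
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("France", "La France", "FR", "FRA", "250"),
]


def build_lookup(rows):
    """Map every representation of a country to its English name."""
    lookup = {}
    for row in rows:
        english_name = row[0]
        for form in row:
            lookup[form.casefold()] = english_name
    return lookup


def to_english_name(value, lookup):
    """Normalize one incoming value; returns None when the value is unmapped."""
    return lookup.get(str(value).strip().casefold())


LOOKUP = build_lookup(COUNTRY_CODE_SET)
incoming = ["Danemark", "DK", "DNK", "208"]
normalized = [to_english_name(value, LOOKUP) for value in incoming]
print(normalized)
```

Returning None for unmapped values, rather than guessing, leaves it to the caller to decide how to handle terms that the registry has not yet harmonized.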
Actions to Manage Enterprise Semantics
- Define data, concepts, and relations
- Harmonize and vet data and concept systems
- Ground semantics for RDF, concept systems, ontologies
- Provide semantic services
Challenge: Concept System Store
[Figure: users access a metadata registry that holds concept systems (thesaurus, themes, ontology, GEMET) alongside structured metadata and data standards]
Concept systems:
- Keywords
- Controlled vocabularies
- Thesauri
- Taxonomies
- Ontologies
- Axiomatized ontologies
(Essentially graphs: node-relation-node + axioms)
Challenge: Management of Concept Systems
[Figure: users access a metadata registry that holds concept systems (thesaurus, themes, ontology, GEMET) alongside structured metadata and data standards]
Concept system:
- Registration
- Harmonization
- Standardization
- Acceptance (vetting)
- Mapping (correspondences)
Challenge: Life Cycle Management
[Figure: users access a metadata registry that holds concept systems (thesaurus, themes, ontology, GEMET) alongside structured metadata and data standards]
Life cycle management: data and concept systems (ontologies)
Challenge: Grounding Semantics
[Figure: users access metadata registries that hold concept systems (thesaurus, themes, ontology, GEMET), structured metadata, and data standards. Semantic Web RDF triples and ontologies are grounded in the registries: Subject (node URI), Verb (relation URI), Object (node URI).]
Some Limitations of Relational Technologies & SQL
- Limited graph computations
  - Weak graph query language
- Limited object computations
  - Weak object query language
- Inadequate linkage of metadata to data (underspecified “catalog”)
  - CASE tools also disable, rather than enable, data administration & semantics management
Limitations (Cont.)
- Limited linkage of concept systems (graphs) to data (relational, graph, object)
Some Input From WG 2 and XMDR
- Look at recent work on a graph query language by David Silberberg of Johns Hopkins University Applied Physics Lab.
Input from WG 2 and XMDR
- David Jensen, of the University of Massachusetts Amherst (http://kdl.cs.umass.edu/people/jensen/), has been developing a very interesting Proximity system (http://kdl.cs.umass.edu/proximity/index.html) and in the process has worked with complex patterns in very large data sets, including alternative query languages and database technologies.
- QGRAPH is a new visual language for querying and updating graph databases. A key feature of QGRAPH is that the user can draw a query consisting of vertices and edges with specified relations between their attributes. The response will be the collection of all subgraphs of the database that have the desired pattern.
Input from WG 2 and XMDR
- Query languages are necessary to extract useful information from massive data sets. Moreover, annotated corpora require thousands of hours of manual annotation to create, revise and maintain. Query languages are also useful during this process. For example, queries can be used to find parse errors or to transform annotations into different schemes. However, they suffer from several problems.
  - First, updates are not supported as query languages focus on the needs of linguists searching for syntactic constructions.
  - Second, their relationship to existing database query languages is poorly understood, making it difficult to apply standard database indexing and query optimization techniques. As a consequence they do not scale well.
  - Finally, linguistic annotations have both a sequential and a hierarchical organization. Query languages must support queries that refer to both of these types of structure simultaneously. Such hybrid queries should have a concise syntax. The interplay between these factors has resulted in a variety of mutually-inconsistent approaches.
Source: Catherine Lai and Steven Bird, Department of Computer Science and Software Engineering, University of Melbourne, Victoria 3010, Australia
Input from WG 2 and XMDR
- Try to keep an eye on companies that are grappling with advanced database, knowledge management, information extraction, and analysis requirements, such as Metamatrix, I2, NetViz, Top Quadrant, OntologyWorks, Franz, Cogito, or Objectivity, with new ones cropping up very often.
- Check out the EU sites given the large investments being made there in areas of interest. For example, KAON.
- Watch the outcome of an NSF-funded project on querying linguistic databases, including annotated corpora (http://projects.ldc.upenn.edu/QLDB/). Steven Bird at U. Melbourne is one of the principals on that project.
Input from WG 2 and XMDR
- Need for graph query languages that go beyond RDF and XML
- Frank Olken: Make SQL a strongly typed language with respect to measurement dimensionality.
- Performance: project graph structured queries against graph structured data. Express with great difficulty the query in SQL. Complex objects. Model gets complex. Putting humpty dumpty together again at query time. Political problem in govt. Vendors on board, hard to pursue other technologies. Object systems. OMG working on it? (OQL?). JAVA has ugly layer that maps into relational system. Franz has SPARQL built on top of a graph store.
Input from WG 2 and XMDR
Link Mining Applications: Progress and Challenges - Ted E. Senator
- Link mining is a fairly new research area that lies at the intersection of link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. However, and perhaps more important, it also represents an important and essential set of techniques for constructing useful applications of data mining in a wide variety of real and important domains, especially those involving complex event detection from highly structured data.
- Imagine a complete “link mining toolkit.” What would such a toolkit look like?
Input from WG 2 and XMDR
Link Mining Applications: Progress and Challenges - Ted E. Senator
- Most important, it would require a language that enabled the natural representation of entities and links. Such a language would also allow for the representation of pattern templates and for specifying matches between the templates and their instantiations.
- The language would have to accept an arbitrary database schema as input, with a specified mapping between relations in the database and fundamental link types in the language.
- It would have to compile into efficient and rapidly executable database queries.
- It would need to be able to represent grouped entities and multiple abstraction hierarchies and reason at all levels.
- It would have to enable the creation of new schema elements in the database to represent newly discovered concepts.
Input from WG 2 and XMDR
Link Mining Applications: Progress and Challenges - Ted E. Senator
- It would need to represent both pattern templates and pattern instances, and to have a mechanism for tracking matches between the two.
- It would have to have constructs for representing fundamental relationships such as part-of, is-a, and connected-to (the most generic link relationship), as well as perhaps other high-level link types such as temporal relationships (e.g., before, after, during, overlapping, etc.), geo-spatial relationships, organizational relationships, trust relationships, and activities and events.
- The toolkit would include at least one and possibly many pattern matchers. It would require tools for creating and editing patterns. It would have to include visualizations for many different types of structured data.
- It would need mechanisms for handling uncertainty and confidence.
- It would have to track the dependence of any conclusion (e.g., pattern match or discovered pattern) back to the underlying data, and perhaps incorporate backtracking so the impact of data corrections could be detected.
Input from WG 2 and XMDR
Link Mining Applications: Progress and Challenges - Ted E. Senator
- It would need configuration management tools to track the history of discovered and matched patterns.
- It would need workflow mechanisms to support multiple users in an organizational structure.
- It would need mechanisms for ingesting domain-specific knowledge.
- It would have to be able to deal with multiple data types including text and imagery.
- And it would have to be able to rapidly incorporate new link mining techniques as they are developed.
- Finally, it would need to include mechanisms for maximum privacy protection.
Where to Progress Semantics Management?
- SC 32 in WG 2 and WG 3 as extensions to ongoing work or as New Work Items
- W3C as XQuery, SPARQL, Semantic Web Deployment WG (RDF vocabularies, SKOS)
- OMG as extensions to the MOF
- …
Thanks & Acknowledgements
- John McCarthy
- Karlo Berket
- Kevin Keck
- Frank Olken
- Harold Solbrig
- L8 and SC 32/WG 2 Standards Committees
- Major XMDR Project Sponsors and Collaborators
  - U.S. Environmental Protection Agency
  - Department of Defense
  - National Cancer Institute
  - U.S. Geological Survey
  - Mayo Clinic
  - Apelon