- Number of slides: 31
The C2M system
Paul van der Vet, Peter Geurts, Theo Huibers, Hans Roosendaal, Sjoerd van Tongeren
ECCI CTIT, University of Twente, Netherlands
Setting
• Scientist works with multiple, heterogeneous resources such as
  º Databases
  º Knowledge bases
  º Programs
• Task requires co-operation of resources
• Whether resources are in-house or remote makes no difference
SciDashboard™
• Long-term vision: scientist's dashboard
• SciDashboard™ allows the scientist to visually:
  º Select resources
  º Connect resources
  º Identify sources and sinks
  º Specify data transformations underway
• C2M is a first step towards this vision
Co-operating resources
• First problem: format multiplicity
• Format multiplicity is unavoidable
  º Standardisation is a social process with high stakes
  º No format caters for all needs
• Second problem: combining resources
  º Merging, comparing, deduplicating
Format multiplicity
• Chemical example: molecular structure files
Molecular structure files
• About 20 formats in daily use, for example:
  º MDL Molfile (MOL)
  º Connection table (CT)
  º Standard Molecular Description file (SMD)
• Almost all formats specify plaintext files with a record-field structure
• Delimiters are often space and newline characters
CT-file: ethanol, CH3CH2OH
  ethanol.ct
  3 2
  -0.8667 -0.2500 0.0000 C
  0.0000 0.2500 0.0000 C
  0.8667 -0.2500 0.0000 O
  1 2 1 1
  2 3 1 1
CT-file: ethanol, CH3CH2OH (atoms numbered)
  ethanol.ct
  3 2
  -0.8667 -0.2500 0.0000 C (1)
  0.0000 0.2500 0.0000 C (2)
  0.8667 -0.2500 0.0000 O (3)
  1 2 1 1
  2 3 1 1
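The record-field structure of the CT file above can be read with a few lines of hand-written code. The following Python sketch is only an illustration of such a "roll your own" reader, not part of C2M (which is implemented in Quintus Prolog). It assumes the layout visible on the slide: a name line, a counts line (atoms, bonds), one line per atom, one line per bond. The names `Atom`, `Bond` and `parse_ct` are hypothetical, and reading the third bond field as the bond order is an assumption; the fourth bond field is not explained on the slides and is ignored here.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    x: float
    y: float
    z: float
    element: str


@dataclass
class Bond:
    first: int   # 1-based index of the first atom
    second: int  # 1-based index of the second atom
    order: int   # assumption: third field is the bond order


CT_TEXT = """ethanol.ct
3 2
-0.8667 -0.2500 0.0000 C
0.0000 0.2500 0.0000 C
0.8667 -0.2500 0.0000 O
1 2 1 1
2 3 1 1
"""


def parse_ct(text):
    """Parse the simplified CT layout shown on the slide."""
    lines = text.strip().splitlines()
    name = lines[0].strip()
    # Counts line: number of atoms, then number of bonds.
    n_atoms, n_bonds = (int(tok) for tok in lines[1].split()[:2])
    atoms = []
    for line in lines[2:2 + n_atoms]:
        x, y, z, element = line.split()[:4]
        atoms.append(Atom(float(x), float(y), float(z), element))
    bonds = []
    for line in lines[2 + n_atoms:2 + n_atoms + n_bonds]:
        # Only the first three fields are used; the fourth is ignored.
        a, b, order = (int(tok) for tok in line.split()[:3])
        bonds.append(Bond(a, b, order))
    return name, atoms, bonds


name, atoms, bonds = parse_ct(CT_TEXT)
```

A reader like this is tied to one format; supporting the MOL layout would need a second, separate parser, which is the maintenance problem the following slides address.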
MOL-file: ethanol, CH3CH2OH
  ethanol.mol
  ChemDraw03070310372D
  3 2 0 0 0 0 0 0 0 0999 V2000
  -1.2975 -0.3750 0.0000 C 0 0 0 0 0
  0.0025 0.3750 0.0000 C 0 0 0 0 0
  1.3000 -0.3750 0.0000 O 0 0 0 0 0
  1 2 1 0
  2 3 1 0
Solving format multiplicity: Wrappers
Wrappers
• Wrapper tools exist, such as
  º Chemistry: Babel, ChemDraw
  º Molecular biology: SRS
  º Bibliography management: EndNote, bp
• Disadvantage: adding a new format is impossible or very difficult
• "Roll your own" wrappers: awk, perl
  º Difficult to maintain
Wrapper generators
• Basic idea: produce wrapper from high-level description of formats
• Often a two-step process: A → R → B, with R an internal representation
• Obvious argument: two-step process takes fewer converters than direct conversion
• Disadvantage: R is fixed and dedicated
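The "fewer converters" argument can be made concrete with a small count. This Python sketch assumes one-way converters: direct conversion needs one converter per ordered pair of formats, while the two-step route through R needs one reader and one writer per format. The function names are illustrative only.

```python
def direct_converters(n_formats):
    """One converter for every ordered pair (A, B) with A != B."""
    return n_formats * (n_formats - 1)


def two_step_converters(n_formats):
    """One reader (A -> R) and one writer (R -> B) per format."""
    return 2 * n_formats


# With the roughly 20 molecular structure formats mentioned earlier:
n = 20
print(direct_converters(n), two_step_converters(n))  # prints: 380 40
```

The two counts are equal for three formats; the two-step route pays off from four formats onwards.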
Preparing for middleware
• Keyword: modularisation
• Stakeholders are responsible for their own specifications, for example:
  º Content provider offers syntactic format description
  º User determines internal representation
• Internal representation allows combination of resources
The C2M system
• C2M: chemical configurable middleware
• Implemented in Quintus Prolog
• Current state: a wrapper generator
• Wrappers are produced from high-level specifications of formats and internal representation
• Internal representation is chosen by the user, per task if desired
• C2M can be extended to middleware
Current C2M is …
• a specification language
  º for specifying the format of foreign files
  º for specifying the internal representation
• a programming language
  º for programming wrappers by means of specifications
  º for inserting copious documentation
• a system
  º for producing wrappers and their documentation
C2M system overview
File conversion by C2M
C2M specifications
• Two kinds of specifications, each in a file of its own:
  º Specification of internal representation
  º Specification of file format
• Internal representation: ontology
• File format specification: read-only, write-only, or both read and write
Language design principles
• Adhere to well-known designs
  º HTML (tags and tag attributes)
  º context-free grammar (as in BNF)
  º functions
• Use or mimic well-known symbols
  º grammar rules: lhs -> rhs1 rhs2 rhs3 (→) or lhs ::= rhs1 rhs2 rhs3 (as in BNF)
  º instantiation: lhs <- funct(arg1, arg2) (←)
Ontology
• Frame system
• Tree structure with concepts and attributes
• Three kinds of concepts:
  º concept1 = concept2 concept3 concept4
  º concept1 = repeated(concept2)
  º primitive concepts (leaves)
• Leaves hold information
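As an illustration of the three kinds of concepts, the tree can be modelled in a few lines of Python. This is a sketch under assumptions, not C2M's own notation: the class `Concept`, the helper `leaves`, and the molecule ontology with its concept names are all hypothetical and chosen only to match the ethanol example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    name: str
    # Composite concept: concept1 = concept2 concept3 concept4
    parts: List["Concept"] = field(default_factory=list)
    # Repeated concept: concept1 = repeated(concept2)
    repeated: Optional["Concept"] = None

    def kind(self):
        if self.repeated is not None:
            return "repeated"
        if self.parts:
            return "composite"
        return "primitive"  # a leaf, which holds the information


def leaves(concept):
    """Names of the primitive concepts below `concept`, depth first."""
    if concept.repeated is not None:
        return leaves(concept.repeated)
    if not concept.parts:
        return [concept.name]
    result = []
    for part in concept.parts:
        result.extend(leaves(part))
    return result


# Hypothetical ontology for a molecule (names are assumptions).
atom = Concept("atom", parts=[Concept("x"), Concept("y"),
                              Concept("z"), Concept("element")])
bond = Concept("bond", parts=[Concept("first"), Concept("second"),
                              Concept("order")])
molecule = Concept("molecule", parts=[
    Concept("name"),
    Concept("atoms", repeated=atom),
    Concept("bonds", repeated=bond),
])
```

Here `molecule` is a composite concept, `atoms` and `bonds` are repeated concepts, and `x`, `element`, `order` and the like are the leaves.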
Ontology example
File format specification
• File format specification: grammar + semantic bindings
• Grammar specifies structure
• System uses grammar to produce parse tree
• Semantic bindings map nodes in parse tree onto concepts in internal representation
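The split between grammar and semantic bindings can be sketched in Python for a single atom record of the CT file. This is an illustration of the idea only: it does not reproduce C2M's specification syntax, and the names `ATOM_RULE`, `BINDINGS`, `parse_line` and `bind`, as well as the node and concept labels, are hypothetical.

```python
# Grammar (structure only): an atom line consists of four fields.
ATOM_RULE = ("atom_line", ["x_field", "y_field", "z_field", "symbol_field"])

# Semantic bindings: parse-tree node label -> (concept name, converter).
BINDINGS = {
    "x_field": ("x", float),
    "y_field": ("y", float),
    "z_field": ("z", float),
    "symbol_field": ("element", str),
}


def parse_line(rule, line):
    """Use the grammar rule to turn one record into a small parse tree."""
    label, children = rule
    tokens = line.split()
    if len(tokens) != len(children):
        raise ValueError(f"{label}: expected {len(children)} fields, "
                         f"got {len(tokens)}")
    return (label, list(zip(children, tokens)))


def bind(tree, bindings):
    """Map the nodes of the parse tree onto concepts of the ontology."""
    _label, nodes = tree
    frame = {}
    for node_label, text in nodes:
        concept, convert = bindings[node_label]
        frame[concept] = convert(text)
    return frame


tree = parse_line(ATOM_RULE, "0.8667 -0.2500 0.0000 O")
frame = bind(tree, BINDINGS)
```

Keeping the two parts separate means the same grammar could be paired with different bindings when the user chooses a different internal representation.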
File format spec example
File format spec: readgram
File format spec: sbread
Claims
C2M is
• sufficiently expressive
• fully declarative
• a literate programming environment (specification and documentation in one)
• easy to learn
• amenable to division of labour
Claims (contd.)
• Compared to ChemDraw and the like, C2M:
  º Allows for easy addition of new formats
  º Allows format specifications to be reused
  º Prepares for true middleware
• Compared to "roll-your-own" wrappers, C2M:
  º Facilitates reuse and adaptation
  º Facilitates extensive documentation
To be done (short term)
• Stabilise system
• Experiment
• Provide extensive manual and documentation
• Prepare system for others to experiment
  º But current version is implemented in a proprietary software platform
To be done (long term)
• Test language by means of user surveys
• Develop version 2
• Version x may well be wholly visual
• Embed system in larger environment
  º SciDashboard™
  º "Habitable Interfaces"


