
The C2M system

The C2M system
Paul van der Vet, Peter Geurts, Theo Huibers, Hans Roosendaal, Sjoerd van Tongeren
ECCI CTIT, University of Twente, Netherlands

Setting
• Scientist working with multiple, heterogeneous resources like
  º Databases
  º Knowledge bases
  º Programs
• Task requires co-operation of resources
• Resources in-house or remote makes no difference

SciDashboard™
• Long-term vision: scientist's dashboard
• SciDashboard™ allows scientist to visually:
  º Select resources
  º Connect resources
  º Identify sources and sinks
  º Specify data transformations underway
• C2M is a first step towards this vision

Co-operating resources
• First problem: format multiplicity
• Format multiplicity is unavoidable
  º Standardisation is a social process with high stakes
  º No format caters for all needs
• Second problem: combining resources
  º Merging, comparing, deduplicating

Format multiplicity
• Chemical example: molecular structure files

Molecular structure files
• About 20 formats in daily use, for example:
  º MDL Molfile (MOL)
  º Connection table (CT)
  º Standard Molecular Description file (SMD)
• Almost all formats specify plaintext files with record-field structure
• Delimiters often space and newline characters

CT-file ethanol CH3CH2OH

    ethanol.ct
     3 2
     -0.8667 -0.2500 0.0000 C
      0.0000  0.2500 0.0000 C
      0.8667 -0.2500 0.0000 O
     1 2 1 1
     2 3 1 1


CT-file ethanol CH3CH2OH

    ethanol.ct
     3 2
     -0.8667 -0.2500 0.0000 C   (1)
      0.0000  0.2500 0.0000 C   (2)
      0.8667 -0.2500 0.0000 O   (3)
     1 2 1 1
     2 3 1 1
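
The record-field structure of the CT file can be illustrated with a small hand-written reader. This is only a sketch of the kind of "roll your own" wrapper the talk goes on to discuss, not part of C2M. It assumes the layout suggested by the slide: a name line, a counts line (atoms, bonds), one line per atom (x, y, z, element) and one line per bond. Reading the first three bond fields as first atom, second atom and bond order is an assumption, and the fourth field is ignored. The names `Atom`, `Bond` and `parse_ct` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    x: float
    y: float
    z: float
    element: str


@dataclass
class Bond:
    first: int   # 1-based number of the first atom (assumed meaning)
    second: int  # 1-based number of the second atom (assumed meaning)
    order: int   # bond order (assumed meaning)


def parse_ct(text):
    """Parse a CT-style file: name line, counts line, atom lines, bond lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    name = lines[0].strip()
    n_atoms, n_bonds = (int(tok) for tok in lines[1].split()[:2])

    atoms = []
    for ln in lines[2:2 + n_atoms]:
        x, y, z, element = ln.split()[:4]
        atoms.append(Atom(float(x), float(y), float(z), element))

    bonds = []
    for ln in lines[2 + n_atoms:2 + n_atoms + n_bonds]:
        # Only the first three fields are used; the fourth is skipped.
        first, second, order = (int(tok) for tok in ln.split()[:3])
        bonds.append(Bond(first, second, order))

    return name, atoms, bonds


ETHANOL_CT = """ethanol.ct
 3 2
 -0.8667 -0.2500 0.0000 C
  0.0000  0.2500 0.0000 C
  0.8667 -0.2500 0.0000 O
 1 2 1 1
 2 3 1 1
"""

name, atoms, bonds = parse_ct(ETHANOL_CT)
```

Every assumption about field order is buried in the code here, which is the maintenance problem that a declarative format specification is meant to avoid.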

MOL-file ethanol CH3CH2OH

    ethanol.mol
      ChemDraw03070310372D

      3  2  0  0  0  0  0  0  0  0999 V2000
       -1.2975   -0.3750    0.0000 C   0  0  0  0
        0.0025    0.3750    0.0000 C   0  0  0  0
        1.3000   -0.3750    0.0000 O   0  0  0  0
      1  2  1  0
      2  3  1  0

Solving format multiplicity: Wrappers

Wrappers
• Wrapper tools exist such as
  º Chemistry: Babel, ChemDraw
  º Molecular biology: SRS
  º Bibliography management: EndNote, bp
• Disadvantage: adding a new format is impossible or very difficult
• "Roll your own" wrappers: awk, perl
  º Difficult to maintain

Wrapper generators
• Basic idea: produce wrapper from high-level description of formats
• Often two-step process: A → R → B with R an internal representation
• Obvious argument: two-step process takes fewer converters than direct conversion
• Disadvantage: R fixed and dedicated
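
The "fewer converters" argument is simple counting. Assuming conversion is wanted in both directions between every pair of n formats, direct conversion needs one converter per ordered pair, n(n-1), while the two-step route needs one reader (A → R) and one writer (R → B) per format, 2n. The function names below are hypothetical and the sketch only illustrates the arithmetic.

```python
def direct_converters(n_formats):
    """One converter per ordered pair (A, B) with A != B."""
    return n_formats * (n_formats - 1)


def two_step_converters(n_formats):
    """One reader (A -> R) plus one writer (R -> B) per format."""
    return 2 * n_formats


for n in (3, 5, 20):
    print(n, direct_converters(n), two_step_converters(n))
```

The two counts are equal at three formats and diverge quickly after that: for the roughly 20 molecular structure formats mentioned earlier, direct conversion would take 380 converters against 40 for the two-step process.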

Preparing for middleware
• Keyword: modularisation
• Stakeholders are responsible for their own specifications, for example:
  º Content provider offers syntactic format description
  º User determines internal representation
• Internal representation allows combination of resources

The C2M system
• C2M: chemical configurable middleware
• Implemented in Quintus Prolog
• Current state: a wrapper generator
• Wrappers produced from high-level specifications of formats and internal representation
• Internal representation chosen by user, per task if desired
• C2M can be extended to middleware

Current C2M is …
• a specification language
  º for specifying the format of foreign files
  º for specifying the internal representation
• a programming language
  º for programming wrappers by means of specifications
  º for inserting copious documentation
• a system
  º for producing wrappers and their documentation

C2M system overview
[diagram]

File conversion by C2M
[diagram]

C2M specifications
• Two kinds of specifications:
  º Specification of internal representation
  º Specification of file format
  each in a file of its own
• Internal representation: ontology
• File format specification: read-only, write-only, or both read and write

Language design principles
• Adhere to well-known designs
  º HTML (tags and tag attributes)
  º context-free grammar (as in BNF)
  º functions
• Use or mimic well-known symbols
  º grammar rules: lhs -> rhs1 rhs2 rhs3 (→) or lhs ::= rhs1 rhs2 rhs3 (as in BNF)
  º instantiation: lhs <- funct(arg1, arg2) (←)

Ontology
• Frame system
• Tree structure with concepts and attributes
• Three kinds of concepts:
  º concept1 = concept2 concept3 concept4
  º concept1 = repeated(concept2)
  º primitive concepts (leaves)
• Leaves hold information
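
One way to picture the three kinds of concepts is as three node types in a tree. The sketch below is written in Python, not in the C2M specification language, and says nothing about how C2M itself stores an ontology. The class names `Primitive`, `Sequence` and `Repeated`, the helper `leaves`, and the molecule example are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Primitive:
    """Leaf concept: the only kind of node that holds information."""
    name: str


@dataclass
class Sequence:
    """concept1 = concept2 concept3 concept4"""
    name: str
    parts: List["Concept"]


@dataclass
class Repeated:
    """concept1 = repeated(concept2)"""
    name: str
    item: "Concept"


Concept = Union[Primitive, Sequence, Repeated]


def leaves(concept):
    """Return the names of the primitive concepts below a concept, in tree order."""
    if isinstance(concept, Primitive):
        return [concept.name]
    if isinstance(concept, Repeated):
        return leaves(concept.item)
    names = []
    for part in concept.parts:
        names.extend(leaves(part))
    return names


# Hypothetical ontology: atom-list = repeated(atom), atom = element x y z
atom = Sequence("atom", [Primitive("element"), Primitive("x"),
                         Primitive("y"), Primitive("z")])
atom_list = Repeated("atom-list", atom)
print(leaves(atom_list))
```

Because only leaves hold information, a wrapper's job reduces to filling these primitive slots from a parsed file, or writing them back out.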

Ontology example

    <C2M-SPECIFICATION type="ontology" name="simpleont">
    <ONTOLOGY>
    sentence = repeated(word)
    </ONTOLOGY>
    </C2M-SPECIFICATION>

File format specification
• File format specification: grammar + semantic bindings
• Grammar specifies structure
• System uses grammar to produce parse tree
• Semantic bindings map nodes in parse tree onto concepts in internal representation

File format spec example

    <C2M-SPECIFICATION type="file-format" name="simple-form">
    <READGRAM>
    ...
    </READGRAM>
    <SBREAD>
    ...
    </SBREAD>
    </C2M-SPECIFICATION>

File format spec: readgram

    <READGRAM>
    <ULG>
    line -> string
    line -> sp-string+
    sp-string -> spaces string
    spaces -> space+
    string -> printable-char+
    </ULG>
    </READGRAM>

File format spec: sbread

    <SBREAD>
    sentence =^ line
    word <- identity(string)
    </SBREAD>
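
The effect of the grammar and the semantic bindings together can be mimicked by hand. The Python sketch below is not generated by C2M and is only meant to show the two stages, parse tree first and binding second. It assumes that a printable-char is any non-whitespace character, that `=^` ties the sentence concept to the line node, and that `identity` copies the matched text unchanged. The functions `parse_line` and `bind` are hypothetical.

```python
import re


def parse_line(line):
    """Build a parse tree ('line', [('string', text), ...]) for one line.

    Follows the rules as given: a line is either a single string, or one
    or more sp-strings, each being spaces followed by a string.
    """
    if not re.fullmatch(r"\S+|( +\S+)+", line):
        raise ValueError("line does not match the grammar")
    return ("line", [("string", text) for text in re.findall(r"\S+", line)])


def bind(tree):
    """Apply the bindings: sentence =^ line, word <- identity(string)."""
    _label, children = tree
    return {"sentence": [{"word": text} for _kind, text in children]}


result = bind(parse_line(" wrappers are generated"))
print(result)
```

Note that under the rules exactly as listed, a multi-word line has to begin with spaces, because every sp-string starts with spaces. A line such as "a b" is rejected by this sketch for that reason.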

Claims
C2M is
• sufficiently expressive
• fully declarative
• a literate programming environment (specification and documentation in one)
• easy to learn
• amenable to division of labour

Claims (contd.)
• Compared to ChemDraw and the like, C2M:
  º Allows for easy addition of new formats
  º Allows format specifications to be reused
  º Prepares for true middleware
• Compared to "roll-your-own" wrappers, C2M:
  º Facilitates reuse and adaptation
  º Facilitates extensive documentation

To be done (short term)
• Stabilise system
• Experiment
• Provide extensive manual and documentation
• Prepare system for others to experiment
  º But current version implemented in proprietary software platform

To be done (long term)
• Test language by means of user surveys
• Develop version 2
• Version x may well be wholly visual
• Embed system in larger environment
  º SciDashboard™
  º "Habitable Interfaces"