fe28ae6a0247e197948e021fbf8891a9.ppt
- Number of slides: 52
Project Prospect and the Semantic Web
Colin Batchelor
Royal Society of Chemistry, Cambridge, UK
batchelorc@rsc.org
Project Prospect and the Semantic Web
  Who we are
  What we’ve done
  Motivation
  Means
  The InChI and the Semantic Web
  Ontology development for chemistry
  RXNO and MOP
Who we are
Royal Society of Chemistry: Advancing the Chemical Sciences
§ Learned and professional society
§ Scientific publisher
§ 25 journals, 8 databases and a growing book program
§ 8000 articles yearly
§ Covering a broad spectrum of chemical sciences, from systems biology (Molecular BioSystems) to physical and theoretical chemistry (PCCP)
What we’ve done
The motivation
The motivation
§ Scientific papers are formulaic and consistently structured (but not necessarily IMRD: see later)
§ There may be infinitely many possible chemical compounds
BUT
§ Nomenclature is productive and susceptible to machine parsing
The means
The means: how publishing really works
[Workflow diagram, first stage. Labels: Data capture; Editing and proof-reading]
[Workflow diagram, second stage. Labels: Enhanced HTML; Database; Text mining (Oscar); Manual QA; Enhanced RSS]
Regular polysemy
… where words stand for multiple things in a consistent way.
Examples:
§ Brand names
§ Grinding
§ Figure–ground
§ Exact–class–part polysemy in chemistry
Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM 08 at LREC 2008, Marrakech, Morocco.
Regular polysemy
Brand names: “Learning to buy a Renault and talk to BMW”
Grinding: “The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind” vs. “[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”
Regular polysemy
Figure–ground
§ Audrey Hepburn painted the door (figure)
§ Audrey Hepburn walked through the door (ground)
§ The Incredible Hulk walked through the door (ambiguous)
Imidazole
An imidazole
The imidazole side-chain/group/ring/etc.
Can ChEBI handle this?
☺ Imidazoles (!) (CHEBI:24780)
☺ Imidazole (CHEBI:16069)
☹ Imidazole ring
☹ Imidazolyl group: not yet (but methyl, benzyl, … etc.)
and there are no disambiguation cues
Disambiguation
One Sense per Discourse (Gale et al. 1992) … this doesn’t hold at all
One Sense per Collocation (Yarowsky 1993) … matches our intuitions
Disambiguation: toy model
CLASS:
§ w(−1) = a, an, the, this
§ w(0) plural (bit of a cheat, as not a collocation)
PART:
§ w(−1) = bridging, terminal
§ w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
§ w(+1)w(+2) = “building block”, “protecting group”, “side chain”
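The toy model above can be sketched in a few lines of Python. This is only an illustration, not the rules actually used in Project Prospect: the cue lists are copied from the slide (which says there are "many more"), the names `classify`, `CLASS_PREV`, `PART_PREV`, `PART_NEXT` and `PART_NEXT_BIGRAMS` are hypothetical, and two choices are assumptions of the sketch: PART cues are tested before CLASS cues (so that "the imidazole side chain" comes out as a part), and a name with no cue at all is read as the exact compound.

```python
# Cue lists copied from the slide; the real lists are longer ("and many more").
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"),
                     ("side", "chain")}


def classify(tokens, i):
    """Guess the sense of the chemical name at tokens[i].

    Returns "PART", "CLASS" or "EXACT". The ordering of the tests and the
    "EXACT" fallback are assumptions of this sketch, not from the slide.
    """
    def word_at(j):
        # Lower-cased token at position j, or "" when j is out of range.
        return tokens[j].lower() if 0 <= j < len(tokens) else ""

    word, prev = word_at(i), word_at(i - 1)
    nxt, nxt2 = word_at(i + 1), word_at(i + 2)

    # PART cues: w(-1), w(+1) and the bigram w(+1)w(+2).
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # CLASS cues: a determiner at w(-1), or a plural name at w(0).
    # Testing for a final "s" is a crude stand-in for real plural detection.
    if prev in CLASS_PREV or word.endswith("s"):
        return "CLASS"
    return "EXACT"


print(classify("an imidazole was added".split(), 1))
print(classify("the imidazole side chain".split(), 1))
print(classify("imidazole was added".split(), 0))
```

The three calls correspond to the "An imidazole", "The imidazole side-chain" and bare "Imidazole" slides.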
Why is this hard?
Coordination resolution
Part-of-speech ambiguity: is “tosylates” a noun or a verb?
Why is this hard?
How many numbered compounds actually are named in a given paper?
  iloprost (1)
  tributyl-1-hexynylstannane (2)
  the desired 2-heptyne (3)
  methyl–Pd(II) iodide 4 or 4′
  alkynylstannane 5
  the hypervalent stannate 6
  (alkynyl)(methyl)Pd(II) complex 7
  the desired methylalkyne 8
  compounds 9–14
  the stannyl precursors 15 and 16
  methylated compounds 17 and 18
  stannyl precursor 19
  iloprost methyl ester 20
“iloprost methyl ester” is the real name, but you need to know that iloprost is a monocarboxylic acid!
Why is this hard?
For compound names:
  ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
  ~20% PubChem
  ~20% ChemDraw
For compound numbers:
  ~70% author ChemDraw
  ~30% editors
What are we marking up?
§ Chemical compounds (InChI, ChEBI)
§ Chemical classes and parts (ChEBI)
§ Nanoparticles (in ChEBI from end of October)
§ Chemical terms from the IUPAC Gold Book
§ Name reactions (RXNO)
§ Gene products: function, process, location (GO)
§ Nucleotide and polypeptide sequence terms (SO)
§ Cell types (CL)
InChI and the Semantic Web
What InChI is for
§ Can represent complete molecules (may be ions or radicals) of less than 1024 heavy (non-H) atoms.
(however)
§ Cannot yet represent metal atom geometry.
§ Cannot yet represent polymers.
§ Cannot yet represent diradicals etc.
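The size limit quoted above can be checked from the formula layer of an InChI string, which is the text between the first and second "/". The following is a minimal sketch, assuming a plain Hill-style formula layer with optional leading multipliers and "."-separated components; the function names `heavy_atom_count` and `within_size_limit` are hypothetical, and the threshold is simply the "less than 1024" figure from the slide.

```python
import re


def heavy_atom_count(inchi):
    """Count non-hydrogen atoms using only the formula layer of an InChI."""
    formula = inchi.split("/")[1]
    total = 0
    for component in formula.split("."):
        # A component may carry a leading multiplier, e.g. "2ClH".
        multiplier, body = re.match(r"(\d*)(.*)", component).groups()
        factor = int(multiplier or 1)
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", body):
            if element != "H":
                total += factor * int(count or 1)
    return total


def within_size_limit(inchi, limit=1024):
    """True if the molecule has fewer heavy atoms than the quoted limit."""
    return heavy_atom_count(inchi) < limit


print(heavy_atom_count("InChI=1/C15H22O9"))
print(within_size_limit("InChI=1/C15H22O9"))
```

The formula C15H22O9 is the one from the first InChI on the "Some RDF" slide; only its formula layer is needed for the count.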
What InChI is not for
§ Classes of molecule
§ Parts of molecule
(these have been done in ChemBlast)
InChI in RDF
(We don’t like this.)
We use the RSS content module. (As if articles contained molecules.)
And we use info:inchi URIs.
Look…
Some RDF
<content:items>
  <rdf:Bag>
    <rdf:li>
      <content:item rdf:about="info:inchi/InChI=1/C15H22O9/c1-8(16)19-6-15(7-20-9(2)17)12(21-10(3)18)11-13(24-15)23-14(4,5)22-11/h11-13H,6-7H2,1-5H3/t11?,12-,13+/m1/s1"/>
    </rdf:li>
    <rdf:li>
      <content:item rdf:about="info:inchi/InChI=1/C21H34O9/c1-6-9-14(22)25-12-21(13-26-15(23)10-7-2)18(27-16(24)11-8-3)17-19(30-21)29-20(4,5)28-17/h17-19H,6-13H2,1-5H3/t17?,18-,19+/m1/s1"/>
    </rdf:li>
  </rdf:Bag>
</content:items>
<content:item>
  <owl:Class rdf:ID="GO_0016298">
    <rdfs:label>lipase activity</rdfs:label>
  </owl:Class>
</content:item>
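A fragment like the one above could be generated along the following lines. This is a sketch using Python's standard `xml.etree.ElementTree`, not the code behind the RSC feeds; the helper names `inchi_uri` and `content_items` are hypothetical, and the sketch assumes (as on the slide) that the InChI is appended to "info:inchi/" without any extra escaping.

```python
import xml.etree.ElementTree as ET

# Namespaces of RDF and of the RSS 1.0 content module.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONTENT = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("content", CONTENT)


def inchi_uri(inchi):
    """Build an info:inchi URI by prefixing the InChI string (no escaping)."""
    return "info:inchi/" + inchi


def content_items(inchis):
    """Wrap each InChI as a content:item inside an rdf:Bag."""
    items = ET.Element(f"{{{CONTENT}}}items")
    bag = ET.SubElement(items, f"{{{RDF}}}Bag")
    for inchi in inchis:
        li = ET.SubElement(bag, f"{{{RDF}}}li")
        ET.SubElement(li, f"{{{CONTENT}}}item",
                      {f"{{{RDF}}}about": inchi_uri(inchi)})
    return items


# Methane is used here only as a short illustrative InChI.
root = content_items(["InChI=1/CH4/h1H4"])
print(ET.tostring(root, encoding="unicode"))
```

Each compound gets its own rdf:li, which keeps the bag well-formed however many compounds an article mentions.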
RXNO
David Barden, Colin Batchelor, Celia Gitterman
RXNO: the name reaction ontology (1)
§ Every chemist knows about famous chemists like Wittig, Cannizzaro, Diels, Alder, benzoin
§ They’re pretty unambiguous and well-suited to logical definitions
§ But what organizing principle do we use?
RXNO: the name reaction ontology (2)
§ Sort reactions by what they do to the ‘skeleton’ of the molecule.
§ Skeleton-changing reactions:
  § Joinings, cleavings, rearrangements, ring formation, ring expansion
§ Skeleton-preserving reactions:
  § Additions, eliminations, substitutions, protections, deprotections
RXNO: the name reaction ontology (3)
§ Quality? Subjectivity?
§ Get our curators to assign reactions to categories without conferring, check percentage agreement, discuss disagreements, improve guidelines, iterate to convergence.
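The "check percentage agreement" step can be sketched as pairwise agreement between curators. This is only one reasonable reading of the slide, which does not say how agreement was measured: the function names `percentage_agreement` and `disagreements` are hypothetical, and the curator assignments in the example are invented for illustration, not real RXNO classifications.

```python
from itertools import combinations


def percentage_agreement(assignments):
    """Pairwise percentage agreement.

    assignments maps curator -> {reaction: category}. Only reactions that
    every curator has classified are compared.
    """
    curators = list(assignments)
    if len(curators) < 2:
        raise ValueError("need at least two curators")
    shared = set.intersection(*(set(a) for a in assignments.values()))
    agree = total = 0
    for reaction in shared:
        for first, second in combinations(curators, 2):
            total += 1
            agree += assignments[first][reaction] == assignments[second][reaction]
    return 100.0 * agree / total if total else 0.0


def disagreements(assignments):
    """Reactions on which at least two curators differ, for discussion."""
    shared = set.intersection(*(set(a) for a in assignments.values()))
    return sorted(r for r in shared
                  if len({a[r] for a in assignments.values()}) > 1)


# Invented assignments, for illustration only.
example = {
    "curator_1": {"Wittig": "joining", "Cannizzaro": "substitution"},
    "curator_2": {"Wittig": "joining", "Cannizzaro": "addition"},
}
print(percentage_agreement(example))
print(disagreements(example))
```

The list returned by `disagreements` is what would feed the "discuss disagreements, improve guidelines" step before the next round.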
RXNO: the name reaction ontology (4)
What do people say?
The spectroscopist’s tale
“The enriched HTML version came as something of a revelation and the current emphasis on links to, and through biomolecular terminology was very much a plus for us, since my colleagues and I are a mix of physical and biological chemists who are dabbling in inter-disciplinary waters. Given the steadily increasing burden of keeping up with the current literature and accessing earlier publications - a fortiori when conventional disciplinary boundaries are being crossed - the ability to 'grow a tree' from current articles (including one's own) is going to make 'targeted sleuthing' a great deal easier.”
John Simons, Oxford
The high-throughput screener’s tale
“An interesting opportunity particularly for managers, students and beginners that are not that deeply immersed in the detail and the terminology. It further opens access to those who want to explore areas they are not specialists in. Great idea!”
Eberhard Krausz, MPI-CBG Dresden
Lastly…
“My only criticism would be the need for a time warning… I spent 4 hours digging about which generated at least six new research ideas printed half a ream of paper and I missed my bus home. At least it was a new excuse my wife had not heard, so another first.”
An analytical chemist, The North.
Acknowledgements
Royal Society of Chemistry: Richard Kidd, Jeff White, David Barden, Celia Gitterman, Hilary Burch, the Informatics team
University of Cambridge: Peter Corbett, Simone Teufel, Ann Copestake, Peter Murray-Rust
OBO: Karen Eilbeck, Midori Harris, Jen Deegan, Jane Lomax, Chris Mungall, Barry Smith, the ChEBI team