Design of a Grid Enabled Database System to Facilitate Reuse, Provenance Tracking and Automated Processing of Chemical Information
Robert Gledhill
University of Southampton

He who would change the world should first change himself
• We are building a system to automate the management of some of our data and compute resources, and to provide an interface that lets people we choose, inside or outside the university, make use of these as we see fit
• We would also like to provide general web access to all the data we are legally entitled to

General Aims of Project
• To automate the calculation of molecular properties from experimental information
• To simplify the development of new property calculation algorithms
• To provide a storage mechanism for this information, along with the original structures and measurements
• To track the provenance of individual items of information
• To develop a system with both chemist-friendly and script-friendly frontends

What is the Data?
• Crystal structures from the NCS
• Crystal structures from elsewhere
• Experimentally measured physical properties from diverse databases, both public and private
• Properties derived from the experimental data by calculation

Who are the Users?
• NCS, as a test bed system
• Grad students working in computational chemistry, developing new ways of deriving unknown physical properties from known ones
• Organic chemists, who should benefit from the pooling of diverse sources of information

What Do We Want to Calculate?
• pKa values from QM calculations
• Electron densities, polarisabilities, etc. from QM calculations
• Diffusion constants, RDFs, etc. from MC
• Binding affinities to proteins
• QSAR properties
• Statistically calculated solubilities

What Type of User Interfaces are Needed?
• A user friendly one!
• Many of the users are anticipated to have a straight chemistry background
• For those users with a higher degree of computer sophistication, a WSDL API will make scripting their jobs easier
• All interaction between the system and its users goes through a single chokepoint: the webserver

What Hardware Do We Have at the Moment?
• A dual Xeon server machine
• A RAID array, currently with about a terabyte of space, but easily expandable
• A spare machine to use as an internal firewall
• A cluster of Linux machines, dedicated to running calculations and under our control
• A number of other machines dotted around the department that hold particular software under single-seat licences

Security: What are We Very Worried About?
• An external user compromising the server and using it to attack other machines, either inside or outside the university firewall

Security: What are We Less Worried About?
• A remote user compromising the server and damaging the software system or the data stored on it: so long as any irreplaceable data is backed up, we just reboot, reinstall, patch the hole and continue

Security: The Firewall
• Only one machine, the one running the web server, should be reachable from outside the university firewall
• If we assume that the morass of Perl/Python/etc. CGI scripts on this machine is inherently hard to secure, then the webserver itself must be considered unsafe
• We need an internal firewall pointing towards the server machine, blocking most traffic out from it!

Security: Access Control
• Authentication is by means of CombeChem certificates
• Authorisation is controlled by the local system administrators
• No direct access to the database is allowed: everything goes through the WWW/WSDL interface, and the server software is implicitly trusted not to break consistency

Architecture
• The firewall comes between the web server and the rest of the campus network
• The web server machine also runs the database (in the present design)
• An internal dispatcher machine connects to the web server to check for jobs that need doing or to provide the results from them
• The dispatcher machine communicates with other machines running calculation web services
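The direction of the connections matters here: the dispatcher reaches out to the web server, so the web server never needs to open a connection into the campus network. A minimal sketch of one polling pass follows. All names in it (`Job`, `WebServerQueue`, `dispatch_once`) are hypothetical, and an in-memory list stands in for the real web server and network calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Job:
    job_id: int
    service: str                    # name of the calculation service to run
    molecule: str                   # identifier of the input molecule
    result: Optional[float] = None  # filled in once the job has run


@dataclass
class WebServerQueue:
    """Hypothetical stand-in for the job list held on the web server."""
    jobs: List[Job] = field(default_factory=list)

    def pending(self) -> List[Job]:
        return [j for j in self.jobs if j.result is None]

    def post_result(self, job_id: int, value: float) -> None:
        for j in self.jobs:
            if j.job_id == job_id:
                j.result = value


def dispatch_once(queue: WebServerQueue,
                  services: Dict[str, Callable[[str], float]]) -> int:
    """One polling pass: fetch pending jobs, run them, post results back."""
    done = 0
    for job in queue.pending():
        service = services.get(job.service)
        if service is None:
            continue  # no calculation machine offers this service yet
        queue.post_result(job.job_id, service(job.molecule))
        done += 1
    return done
```

In a real deployment `dispatch_once` would presumably run on a timer, and each entry in `services` would be a call to a calculation web service on another machine rather than a local function.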

Web Services: What?
• Take a piece of code that calculates some useful chemical information
• Write a wrapper around this that provides an API in a standardised format
• Add authentication/authorisation checking to the wrapper
• Add the appropriate hooks into the dispatcher and database to interface with this
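The wrapping steps can be illustrated with a small decorator. This is only a sketch under stated assumptions: the names `web_service`, `NotAuthorised` and `heavy_atom_count` are hypothetical, the calculation is a toy, and a set of user names stands in for certificate-based authentication and administrator-controlled authorisation.

```python
from functools import wraps
from typing import Callable, Set


class NotAuthorised(Exception):
    """Raised when a caller lacks permission for a calculation."""


def web_service(name: str, allowed_users: Set[str]) -> Callable:
    """Wrap a calculation with an authorisation check and a uniform reply."""
    def decorate(calc: Callable[[str], float]) -> Callable[[str, str], dict]:
        @wraps(calc)
        def wrapper(user: str, molecule: str) -> dict:
            if user not in allowed_users:
                raise NotAuthorised(f"{user} may not run {name}")
            # Every wrapped calculation answers in the same format.
            return {"service": name, "molecule": molecule,
                    "value": calc(molecule)}
        return wrapper
    return decorate


@web_service("heavy_atom_count", allowed_users={"alice"})
def heavy_atom_count(smiles: str) -> float:
    # Toy calculation: count upper-case letters as a stand-in for atoms.
    return float(sum(ch.isupper() for ch in smiles))
```

The point of the design is that the chemistry code stays untouched: the access check and the standardised reply live entirely in the wrapper.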

Web Services: Why?
• Now a user of the website with the correct authorisation can ask for the newly wrapped calculation to be performed on a selection of molecules, and for the generated information to be inserted into the database (along with metadata noting who asked for the calculation to be done, when, what program version, etc.)
• The web service wrapping should streamline and simplify this sort of task

Database: Requirements
• Store information of many different data types (e.g. boiling point, 3D structure)
• Cope with multiple units (e.g. Celsius, Kelvin)
• Cope with conditions (e.g. boiling point at 1 atm pressure)
• Cope with multiple forms of a molecule (e.g. stereoisomers)
• Cope with degenerate datasets (e.g. 5 different measurements of the melting point, along with values calculated by 9 different versions of a particular algorithm)
• Retain information about the provenance of dataset items
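To make these requirements concrete, here is a minimal sketch of a record that carries a unit, conditions and a provenance note, and of a lookup that keeps degenerate values side by side. It is an illustration only, not the project's schema: `PropertyValue`, `values_for` and the field names are hypothetical, and only two temperature units are handled.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Offsets for converting a temperature to kelvin (sketch covers two units).
_TO_KELVIN = {"K": 0.0, "C": 273.15}


@dataclass
class PropertyValue:
    molecule: str    # identifier for one specific form of a molecule
    prop: str        # e.g. "melting_point"
    value: float
    unit: str        # "K" or "C" in this sketch
    source: str      # provenance: who or what produced the value
    conditions: Dict[str, str] = field(default_factory=dict)

    def in_kelvin(self) -> float:
        return self.value + _TO_KELVIN[self.unit]


def values_for(store: List[PropertyValue], molecule: str,
               prop: str) -> List[PropertyValue]:
    """Return every stored value; repeated measurements are kept, not merged."""
    return [v for v in store if v.molecule == molecule and v.prop == prop]
```

Keeping each measurement or calculation as its own record, with its own `source`, is what lets five measurements and nine algorithm versions coexist for the same property.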

Database: Precedents
• The most common type of database is the ‘relational’ scheme, where data is thought of as being stored in tables
• A database which deals with most of our requirements (degeneracy, in particular) is DTHERM, a private store of thermodynamic data on organic molecules

DTHERM
• DTHERM is a monument to what can be achieved with the relational database model
• It has many, many tables, and is very, very complicated
• Many tables have no single primary key, but require subsearches to achieve halfway reasonable speeds
• If we choose to go down the SQL route, the properties database will likely end up looking like DTHERM

A Saner Path?
• An alternative to the straight relational model was drawn to our attention: Triplestore
• This is a database whose structure is described not by tables, but by ‘subject’, ‘predicate’, ‘object’ triples
• Effectively, one creates a graph of relationships between entities, and searches by specifying subgraphs of this
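The idea of searching by subgraph can be shown with a toy in-memory matcher. This is a sketch of the general technique, not of any particular triplestore product or its query language: the function names and the convention that terms beginning with `?` are variables are assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Binding = Dict[str, str]        # variable name -> matched value


def _unify(pattern: Triple, triple: Triple,
           binding: Binding) -> Optional[Binding]:
    """Extend binding so that pattern matches triple, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):             # variable term
            if new.setdefault(p, t) != t:
                return None               # clashes with an earlier binding
        elif p != t:                      # constant term must match exactly
            return None
    return new


def match(triples: Iterable[Triple],
          patterns: List[Triple]) -> Iterator[Binding]:
    """Yield every binding that satisfies all patterns, i.e. one subgraph."""
    stored = list(triples)

    def solve(i: int, binding: Binding) -> Iterator[Binding]:
        if i == len(patterns):
            yield binding
            return
        for t in stored:
            extended = _unify(patterns[i], t, binding)
            if extended is not None:
                yield from solve(i + 1, extended)

    return solve(0, {})
```

A query such as `[("?mol", "hasMeasurement", "?m"), ("?m", "property", "boiling_point")]` then describes a two-edge subgraph, and each result binds `?mol` and `?m` to entities that fit it. A real store would index the triples rather than scan them for every pattern.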

Triplestore
• We are currently experimenting with this form of database
• The description of the database and its queries, while strange and new, seems more straightforward than something like DTHERM
• The impression created is one of working with the database, in contrast to that given by DTHERM, whose designers seem to have been fighting the relational model every step of the way

A Primary Key
• We would like a single identifier for a given molecular structure
• We have been working with InChI codes to do this
• We have a command-line Linux application to generate these
• Some sort of substructure searching would be nice for this

CIFs
• We are going to store CIFs more or less as-is
• We will then extract (to begin with) just those pieces we are most interested in
• These will be inserted into the database, with the original CIF still available for those interested in the extra data contained in it
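The extract-a-few-pieces step might look like the sketch below, which pulls selected single-line `_tag value` items out of CIF text. It is deliberately incomplete: the function name `extract_items` is hypothetical, and loops and multi-line values are ignored on the assumption that they remain available in the stored original.

```python
from typing import Dict, Iterable


def extract_items(cif_text: str, wanted: Iterable[str]) -> Dict[str, str]:
    """Pull selected single-line `_tag value` items out of CIF text.

    Loops and multi-line values are skipped in this sketch; they stay
    available in the original file, which is stored unchanged.
    """
    wanted = set(wanted)
    found: Dict[str, str] = {}
    for line in cif_text.splitlines():
        parts = line.strip().split(None, 1)   # tag, then rest of the line
        if len(parts) == 2 and parts[0] in wanted:
            # Values are kept as strings, with surrounding quotes removed.
            found[parts[0]] = parts[1].strip().strip("'\"")
    return found
```

Values are returned as strings so that any uncertainty written in parentheses after a number is preserved rather than silently dropped.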

A Project in Motion
• We aim to have a working system by the first quarter of next year

Thank You
• Jeremy Frey
• Jonathan Essex
• Mike Hursthouse
• Simon Coles
• Everyone from ITI
• Steve Harris
• Keiron and Jamie
• You, the audience