93724c6edc2816276da34be6fa4cdf51.ppt
- Number of slides: 72
Grids for Chemical Informatics Chemistry, IU Bloomington Oct. 21 2005 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 gcf@indiana.edu http://www.infomall.org
Why are Grids Important • Grids are important for Chemistry because they support key functionalities that grow in importance as we are deluged with data from instruments and simulations • Grids provide information access, storage and management • Grids manage multiple simulations with different defining parameters • Grids allow complex workflows with data flowing between filters • Grids define models for portals • Grids are built on top of commodity web service technology with broad industry support: the next generation information technology • Grids are used in multiple NIH and other life science/chemistry projects across the world (BIRN, caBIG, myGrid, Comb-e-Chem)
Internet Scale Distributed Services • Grids use Internet technology and are distinguished by managing or organizing sets of network connected resources – Classic Web allows independent one-to-one access to individual resources – Grids integrate together and manage multiple Internet-connected resources: People, Sensors, computers, data systems • Organization can be explicit as in – TeraGrid which federates many supercomputers; – Deep Web Technologies IR Grid which federates multiple data resources; – CrisisGrid which federates first responders, commanders, sensors, GIS, (Tsunami) simulations, science/public data • Organization can be implicit as in Internet resources such as curated databases and simulation resources that “harmonize a community”
Different Visions of the Grid • Grid just refers to the technologies – Or Grids represent the full system/Applications • DoD’s vision of Network Centric Computing can be considered a Grid (linking sensors, warfighters, commanders, backend resources) and they are building the GiG (Global Information Grid) • Utility Computing or X-on-demand (X=data, computer ..) is a major computer Industry interest in Grids and this is a key part of enterprise or campus Grids • e-Science or Cyberinfrastructure are virtual organization Grids supporting global distributed science (note sensors, instruments and people are all distributed) • Skype (Kazaa) VOIP system is a Peer-to-peer Grid (and VRVS/GlobalMMCS like Internet A/V conferencing are Collaboration Grids) • Commercial 3G Cell-phones and DoD ad-hoc network initiative are forming mobile Grids
Types of Computing Grids • Running “Pleasingly Parallel Jobs” as in United Devices, Entropia (Desktop Grid) “cycle stealing systems” • Can be managed (“inside” the enterprise as in Condor) or more informal (as in SETI@Home) • Computing-on-demand in Industry where jobs spawned are perhaps very large (SAP, Oracle …) • Support distributed file systems as in Legion (Avaki), Globus with (web-enhanced) UNIX programming paradigm – Particle Physics will run some 30,000 simultaneous jobs • Distributed Simulation HLA/RTI style Grids • Linking Supercomputers as in TeraGrid • Pipelined applications linking data/instruments, compute, visualization • Seamless Access where Grid portals allow one to choose one of multiple resources with a common interface • Parallel Computing typically NOT suited for a Grid (latency)
Old Style Metacomputing Grid [diagram: Large Scale Parallel Computers, Large Disks, Analysis and Visualization] • Original: Spread a single large Problem over multiple supercomputers • Now-1: Control multiple smallish jobs each on independent Computers • Now-2: Choose which of a few supercomputers to use
Towards an International Compute Grid Infrastructure [map: US TeraGrid (SDSC, NCSA, PSC), UK NGS (Leeds, Manchester, Oxford, RAL), UCL, Starlight (Chicago), Netherlight (Amsterdam), UKLight, SC05; legend: Computation, Steering clients, Network PoP, Service Registry; local laptops in Seattle and UK] All sites connected by production network (not all shown)
Information/Knowledge Grids • Distributed data sources (10’s to 1000’s): instruments, file systems, curated databases … • Data Deluge: 1 (now) to 100’s petabytes/year (2012) – Moore’s law for Sensors • Possible filters assigned dynamically (on-demand) – Run image processing algorithm on telescope image – Run Gene sequencing algorithm on compiled data • Needs decision support front end with “what-if” simulations • Metadata (provenance) critical to annotate data • Integrate across experiments as in multi-wavelength astronomy • Data Deluge comes from pixels/year available
Data Deluged Science • Now particle physics will get 100 petabytes from CERN using around 30,000 CPUs simultaneously 24x7 • Exponential growth in data; compare to: – The Bible = 5 Megabytes – Annual refereed papers = 1 Terabyte – Library of Congress = 20 Terabytes – Internet Archive (1996 – 2002) = 100 Terabytes • Weather, climate, solid earth (EarthScope) • Bioinformatics curated databases (Biocomplexity only 1000’s of data points at present) • Virtual Observatory and SkyServer in Astronomy • Environmental Sensor nets • In the past, the HPCC community worried about data in the form of parallel I/O or MPI-IO, but we didn’t consider it as an enabler of new science and new ways of computing • Data assimilation was not central to HPCC • DoE ASCI was set up because they didn’t want test data!
Virtual Observatory Astronomy Grid: Integrate Experiments [image panels: Radio, Far-Infrared, Visible, Dust Map, Visible + X-ray, Galaxy Density Map]
International Virtual Observatory Alliance • Reached international agreements on Astronomical Data Query Language, VOTable 1.1, UCD 1+, Resource Metadata Schema • Image Access Protocol, Spectral Access Protocol and Spectral Data Model, Space-Time Coordinates definitions and schema • Interoperable registries by Jan 2005 (NVO, AstroGrid, AVO, JVO) using OAI publishing and harvesting • So each Community of Interest builds data AND service standards that build on GS-* and WS-*
myGrid Project • Imminent ‘deluge’ of data • Highly heterogeneous • Highly complex and inter-related • Convergence of data and literature archives
The Williams Workflows • A: Identification of overlapping sequence • B: Characterisation of nucleotide sequence • C: Characterisation of protein sequence
Web services • Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on SOA (service oriented architecture) principles • Web Services interact by exchanging messages in SOAP format • The contracts for the message exchanges that implement those interactions are described via WSDL interfaces
A typical Web Service • In principle, services can be in any language (Fortran .. Java .. Perl .. Python) and the interfaces can be method calls, Java RMI Messages, CGI Web invocations, or totally compiled away (inlining) • The simplest implementations involve XML messages (SOAP) and programs written in net friendly languages like Java and Python [diagram: Portal Service, Security, Payment, Credit Card, Catalog, Warehouse and Shipping control Web Services connected through WSDL interfaces]
Two-level Programming I • The Web Service (Grid) paradigm implicitly assumes a two-level Programming Model • We make a Service (same as a “distributed object” or “computer program” running on a remote computer) using conventional technologies – C++, Java or Fortran Monte Carlo module – Data streaming from a sensor or Satellite – Specialized (JDBC) database access • Such services accept and produce data from users, files and databases [diagram: Service, Data] • The Grid is built by coordinating such services, assuming we have solved the problem of programming the service
Two-level Programming II • The Grid is discussing the composition of distributed services with the runtime interfaces to Grid as opposed to UNIX pipes/data streams [diagram: Service 1, Service 2, Service 3, Service 4] • Familiar from use of UNIX Shell, PERL or Python scripts to produce real applications from core programs • Such interpretative environments are the single processor analog of Grid Programming • Some projects like GrADS from Rice University are looking at integration between service and composition levels but the dominant effort looks at each level separately
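The two-level model can be sketched in a few lines of Python, in the spirit of the shell/script analogy on this slide. This is only an illustration: the four "services" below are hypothetical local functions standing in for remote services, and `compose` is an assumed helper, not part of any Grid toolkit. Level one is the body of each function; level two is the composition.

```python
from functools import reduce
from typing import Callable, List

# Level 1: each "service" is programmed with conventional technology.
# These are hypothetical local stand-ins for remote services.
Service = Callable[[List[float]], List[float]]

def sensor_source(_: List[float]) -> List[float]:
    """Service 1: pretend to stream readings from an instrument."""
    return [1.0, 4.0, 9.0, 16.0]

def calibrate(data: List[float]) -> List[float]:
    """Service 2: apply a fixed calibration factor."""
    return [x * 0.5 for x in data]

def threshold(data: List[float]) -> List[float]:
    """Service 3: filter out readings below a cutoff."""
    return [x for x in data if x >= 2.0]

def summarize(data: List[float]) -> List[float]:
    """Service 4: reduce the stream to its mean."""
    return [sum(data) / len(data)] if data else []

# Level 2: composition, the analog of a UNIX pipe between programs.
def compose(*services: Service) -> Service:
    def workflow(data: List[float]) -> List[float]:
        return reduce(lambda acc, svc: svc(acc), services, data)
    return workflow

workflow = compose(sensor_source, calibrate, threshold, summarize)
result = workflow([])
print(result)
```

In a real Grid the function calls would be replaced by message exchanges with remote services, but the composition layer would look much the same.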
Grid of Grids: Research Grid and Education Grid [diagram labels: Repositories, Federated Databases, Database, Sensors, Streaming Data, Field Trip Data, Database Grid, Sensor Grid, Research Compute Grid, Computer Farm, Data Filter Services, Research Simulations, SERVOGrid, ?GIS Grid, Discovery Services, Education Grid, Customization Services, Analysis and Visualization Portal; arrow: From Research to Education]
SERVOGrid Requirements • Seamless Access to Data repositories and large scale computers • Integration of multiple data sources including sensors, databases, file systems with analysis system – Including filtered OGSA-DAI (Grid database access) • Rich meta-data generation and access with SERVOGrid specific Schema extending openGIS (Geography as a Web service) standards and using Semantic Grid • Portals with component model for user interfaces and web control of all capabilities • Collaboration to support world-wide work • Basic Grid tools: workflow and notification • NOT metacomputing
SERVOGrid Portal Screen Shots
Earthquake Grid and DoD NCOW Grid [layered diagram, top to bottom] • CoI Specific Grids/Services: DoD NCOW Grid with C2 (JBI CEE etc.) and NCOW-IS Services; Earthquake Grid with Earthquake Data & Simulation Services and ServoIS • Grids: Information Grid, Compute Grid, Collaboration Grid, GIS Grid, Sensor Grid • Numbered services: 1: Management, 2: Security, 3: Messaging, 4: Discovery, 5: Mediation, 6: Collaboration, 7: Portals, 8: Data Access/Storage, 9: Application Services, 10: Policy (ECS), 11: Metadata • Core Low Level Grid Services • Physical Network • “n: Service” refers to core services identified by DoD; CoI: Community of Interest; GIS: Geographical Information System
BioInformatics Grid and Chemical Informatics Grid [layered diagram, top to bottom] • Domain Specific Grids/Services: Chemical Informatics Grid with HTS Tools, Quantum Calculations and CIS; BioInformatics Grid with Sequencing Tools, Biocomplexity Simulations and BIS • Grids: Information Grid, Compute Grid, MIS Grid, Instrument Grid, Collaboration Grid • Numbered services: 1: Management, 2: Security, 3: Messaging, 4: Discovery, 5: Workflow, 6: Collaboration, 7: Portals, 8: Data Access/Storage, 9: Application Services, 10: Policy, 11: Metadata • Core Low Level Grid Services • Physical Network • M(B,C)IS: Molecular (Bio, Chem) Information System
GIS Grid with WMS, WFS, data sources and GML <gml:featureMember> <fault> <name>Northridge2</name> <segment>Northridge2</segment> <author>Wald D. J.</author> <gml:lineStringProperty> <gml:LineString srsName="null"> <gml:coordinates>-118.72,34.243 118.591,34.176</gml:coordinates> </gml:LineString> </gml:lineStringProperty> </fault> </gml:featureMember> GML becomes CML, CellML, SBML
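As a rough illustration of how a client might consume such a feature, the sketch below parses the slide's fragment with Python's standard `xml.etree.ElementTree`. Two things are assumptions: the slide omits the `xmlns:gml` declaration, so the usual OGC GML namespace URI is supplied here, and the coordinate values are copied as they appear on the slide without checking them.

```python
import xml.etree.ElementTree as ET

# Assumption: the slide does not show the namespace declaration;
# this is the customary OGC GML namespace URI.
GML_NS = "http://www.opengis.net/gml"

fragment = f"""
<gml:featureMember xmlns:gml="{GML_NS}">
  <fault>
    <name>Northridge2</name>
    <segment>Northridge2</segment>
    <author>Wald D. J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates>-118.72,34.243 118.591,34.176</gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>
"""

root = ET.fromstring(fragment)
ns = {"gml": GML_NS}

# <fault> and its children carry no namespace; GML elements do.
fault = root.find("fault")
name = fault.findtext("name")
coords_text = fault.findtext(".//gml:coordinates", namespaces=ns)

# GML 2 style coordinates: tuples separated by spaces, values by commas.
points = [tuple(float(v) for v in pair.split(","))
          for pair in coords_text.split()]

print(name, points)
```

The same pattern would apply to CML, CellML or SBML documents, with the namespace and element names changed.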
Electric Power and Natural Gas data from LANL Interdependent Critical Infrastructure Simulations [map viewer screenshot; controls: Zoom-in, Zoom-out, FeatureInfo mode, Measure distance mode, Clear Distance, Drag and Drop mode, Refresh to initial map]
Integrating Archived Web Feature Services and Google Maps Google maps can be integrated with Web Feature Service Archives to filter and browse seismic records.
What is Happening? • Grid ideas are being developed in (at least) four communities – Web Service: W3C, OASIS, (DMTF) – Grid Forum (High Performance Computing, e-Science) – Enterprise Grid Alliance (Commercial “Grid Forum” with a near term focus) • Service Standards are being debated • Grid Operational Infrastructure is being deployed • Grid Architecture and core software being developed – Apache has several important projects as do academia; large and small companies • Particular System Services are being developed “centrally”: OGSA or GS-* framework for this in GGF; WS-* for OASIS/W3C/Microsoft-IBM • Lots of fields are setting domain specific standards and building domain specific services • USA started but now Europe is probably in the lead and Asia will soon catch USA if momentum (roughly zero for USA) continues
The Grid and Web Service Institutional Hierarchy • 4: Application or Community of Interest Specific Services such as “Run BLAST” or “Look at Houses for sale” • 3: Generally Useful Services and Features such as “Access a Database” or “Submit a Job” or “Manage Cluster” or “Support a Portal” or “Collaborative Visualization” (OGSA GS-* and some WS-*; GGF/W3C/…) • 2: System Services and Features: Handlers like WS-RM, Security, Programming Models like BPEL or Registries like UDDI (WS-* from OASIS/W3C/Industry) • 1: Container and Run Time (Hosting) Environment (Apache Axis, .NET etc.) • Must set standards to get interoperability
Location of software for Grid Projects in Community Grids Laboratory • http://www.naradabrokering.org provides Web service (and JMS) compliant distributed publish-subscribe messaging (software overlay network) • http://www.globalmmcs.org is a service oriented (Grid) collaboration environment (audio-video conferencing) • http://www.crisisgrid.org is an OGC (Open Geospatial Consortium) Geographical Information System (GIS) compliant GIS and Sensor Grid (with POLIS center) • http://www.opengrids.org has WS-Context, Extended UDDI etc. • The work is still in progress but NaradaBrokering is quite mature • All software is open source and freely available
Project Goals • Establish Requirements from stakeholders – Research – Pharmaceutical Industry – Government • Consider educational implications – e-Science v Bio/Chem/Molecular Informatics • Consider other national and international projects to ensure we either lead or use best practice • Design a Grid architecture and staged implementation • Start pilot projects led by Chemistry/Chemical Informatics • Evaluate and iterate • Design and implement ? (Chem, Life Science, Molecular) Informatics educational program that will attract students • Write winning center grant in 2006-7
Web Services Introduction • What are “Web Services”? – A distributed invocation system built on Grid computing • Independent of platform and programming language • Built on existing Web standards – A service oriented architecture with • Interfaces based on Internet protocols • Messages in XML (except for binary data attachments)
Web Services Introduction • A web-based architecture providing for interoperability among resources – Centralized service registry – Solves problems associated with finding, using, and combining online resources • Employ standard Internet protocols for: – Communication with resources – Automated discovery using centralized registries • Communicate with devices, people, and each other with the protocols and computer languages
Service Oriented Architecture (SOA) • Goal is to achieve loose coupling among interacting software agents • Define service: a unit of work done by a service provider to achieve desired end results for a service consumer • Both provider and consumer are roles played by software agents on behalf of their owners.
How does SOA work? • Two architectural constraints are employed – Small set of simple and ubiquitous interfaces to all participating software agents – Descriptive messages constrained by an extensible schema delivered through the interfaces
Web Services Architectures • Individual services are registered globally – Broken down into individual services with inputs and outputs specified • Services are published • Services are requested • Open registry, publishing, and requesting
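The publish/request cycle above can be illustrated with a toy in-memory registry. This is a minimal sketch, not UDDI: the `Registry` and `ServiceEntry` classes, the `molecular_weight` service and its small table of atomic masses are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ServiceEntry:
    """A published service with its declared inputs and outputs."""
    name: str
    inputs: List[str]
    outputs: List[str]
    endpoint: Callable[[dict], dict]

@dataclass
class Registry:
    """Toy open registry: anyone can publish, anyone can request."""
    _entries: Dict[str, ServiceEntry] = field(default_factory=dict)

    def publish(self, entry: ServiceEntry) -> None:
        self._entries[entry.name] = entry

    def find(self, output: str) -> List[ServiceEntry]:
        # Discovery by declared output, not by implementation details.
        return [e for e in self._entries.values() if output in e.outputs]

def molecular_weight(request: dict) -> dict:
    """Hypothetical service: sums atomic masses for an element count map."""
    masses = {"C": 12.011, "H": 1.008, "O": 15.999}
    total = sum(masses[el] * n for el, n in request["composition"].items())
    return {"weight": total}

registry = Registry()
registry.publish(ServiceEntry(
    name="MolecularWeight",
    inputs=["composition"],
    outputs=["weight"],
    endpoint=molecular_weight,
))

# A consumer requests any service that produces "weight" and invokes it.
service = registry.find("weight")[0]
response = service.endpoint({"composition": {"H": 2, "O": 1}})
print(service.name, response)
```

The consumer never refers to `molecular_weight` directly; it only knows the declared output it needs, which is the loose coupling the registry is meant to provide.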
Service-Oriented Architecture • From Curcin et al. DDT, 2005, 10(12), 867
Web Services for Science • Invisible Services, Semantic Web, and Grid • Easy-to-use tools for any scientist • High throughput, resource intensive computing done for low cost/resources • Shared community – Collaborations between labs and fields – Shared data – Shared tools
e-Science and the Grid 1 • e-Science: Major UK Program – global collaboration in key areas of science and the next generation of infrastructure that will enable it • reflects growing importance of international laboratories, satellites and sensors and their integrated analysis by distributed teams • total investment of some £200M over the five-year period from 2001 to 2006 • CyberInfrastructure: the analogous US initiative • Grid Technology: supports e-Science & Cyberinfrastructure
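Basic Architectures: Servlets/CGI and Web Services [diagram: in the Servlets/CGI architecture a Browser sends HTTP GET/POST to a Web Server that uses JDBC to reach a DB or MPI Appl.; in the Web Services architecture a Browser or GUI Client reaches the DB or MPI Appl. through Web Servers that expose WSDL interfaces and exchange SOAP]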
Importance of Web Services • Building a true science community • Enabling interoperability between tools and the integration of data • Less time coding, more time for science • Change the way scientists work by achieving new levels of integration
When To Use Web Services? • Applications do not have severe restrictions on reliability and speed. • Two or more organizations need to cooperate. – One needs to write an application that uses another’s service. • Services can be upgraded independently of clients. • Services can be easily expressed with simple request/response semantics and simple state.
Web Services Benefits • Web services provide a clean separation between a capability and its user interface. • Increase in productivity • Increase in flexibility • Rapid return on investment • Integration across multiple applications
Web Services Advantages • Output in human- and computer-readable formats • I/O formats based on standard Internet protocols • Resources accessible server to server allow automated I/O • Integration based on specific services: you select services or data needed without downloading the entire data set
Web Services Advantages • Description protocols provide details of service provided and interface components • Semantic Web standards increase efficiency • Use a central registry and standardized description of services • Quality and status of the information is dynamically available
Web Services Drawbacks • Based on new technologies • Time and commitment required to learn • Standards still in a state of rapid flux • Issues with quality of data (and for chemistry, quantity of open data), security, and privacy
Components of Web Services • Protocols – SOAP – WSDL – UDDI • XML as a basis for the protocols • Ontologies – OWL: Web Ontology Language • Semantic Web
Components of the Semantic Web for Chemistry • XML – eXtensible Markup Language • RDF – Resource Description Framework • RSS – Rich Site Summary • Dublin Core – allows metadata-based newsfeeds • OWL – for ontologies • BPEL4WS – for workflow and web services – Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-3203.
SOAP: Simple Object Access Protocol • Flexible protocol to communicate information between server and server or client and server using XML • Supports Remote Procedure Calls • Allows layers (security, authentication, transactions) over the basic SOAP elements
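To make the message format concrete, the sketch below builds an RPC-style SOAP 1.1 request with Python's standard library. The envelope namespace is the SOAP 1.1 one; the service namespace `urn:example:chem`, the operation `GetMolecularWeight` and its `smiles` parameter are hypothetical and chosen only for illustration. The empty `Header` element marks where layers such as security or transactions would attach.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:chem"  # hypothetical service namespace

def build_request(operation: str, params: dict) -> str:
    """Serialize an RPC-style call as a SOAP envelope string."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    # Header: optional extension point (security, authentication, ...).
    ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The remote procedure call is one element inside the Body.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for key, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{key}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

request_xml = build_request("GetMolecularWeight", {"smiles": "CCO"})
print(request_xml)

# Round trip: the receiver parses the same XML regardless of language.
parsed = ET.fromstring(request_xml)
ns = {"env": SOAP_ENV, "svc": SERVICE_NS}
smiles = parsed.findtext("env:Body/svc:GetMolecularWeight/svc:smiles",
                         namespaces=ns)
```

In practice the string would be sent as the body of an HTTP POST to the service endpoint; that transport step is left out here.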
WSDL: Web Services Description Language • Describes a service’s interface to clients • Services register themselves with Web Services • WSDL describes how to contact and interact with services – I/O, operations and messages to aid interaction with client
WSDL Overview • An XML-based Interface Definition Language. – You can define the APIs for all of your services in WSDL. • WSDL docs are broken into five major parts: – Data definitions (in XML) for custom types – Abstract message definitions (request, response) – Organization of messages into “ports” and “operations” (classes and methods). – Protocol bindings (to SOAP, for example) – Service point locations (URLs) • Some interesting features – A single WSDL document can describe several versions of an interface. – A single WSDL doc can describe several related services.
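The five parts can be seen in a small example. The abridged WSDL 1.1 document below is hypothetical (service name, messages and the `http://example.org/chem` endpoint are invented, and the binding omits per-operation details); the Python code simply reads each part back with the standard library to show where it lives in the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abridged WSDL 1.1 document showing the five parts.
WSDL_DOC = """
<definitions name="ChemService" targetNamespace="urn:example:chem"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:tns="urn:example:chem"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <types/>                                            <!-- 1: data definitions -->
  <message name="WeightRequest">                      <!-- 2: abstract messages -->
    <part name="smiles" type="xsd:string"/>
  </message>
  <message name="WeightResponse">
    <part name="weight" type="xsd:double"/>
  </message>
  <portType name="ChemPort">                          <!-- 3: ports and operations -->
    <operation name="GetMolecularWeight">
      <input message="tns:WeightRequest"/>
      <output message="tns:WeightResponse"/>
    </operation>
  </portType>
  <binding name="ChemSoapBinding" type="tns:ChemPort"> <!-- 4: protocol binding -->
    <soap:binding style="rpc"
        transport="http://schemas.xmlsoap.org/soap/http"/>
  </binding>
  <service name="ChemService">                        <!-- 5: service location -->
    <port name="ChemPort" binding="tns:ChemSoapBinding">
      <soap:address location="http://example.org/chem"/>
    </port>
  </service>
</definitions>
""".strip()

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "soap": "http://schemas.xmlsoap.org/wsdl/soap/",
}

root = ET.fromstring(WSDL_DOC)
messages = [m.get("name") for m in root.findall("wsdl:message", NS)]
operations = [op.get("name")
              for op in root.findall("wsdl:portType/wsdl:operation", NS)]
endpoints = [a.get("location")
             for a in root.findall("wsdl:service/wsdl:port/soap:address", NS)]

print(messages, operations, endpoints)
```

A client toolkit would normally generate stubs from such a document; reading it by hand as above is only meant to show the structure.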
UDDI: Universal Description, Discovery, and Integration • Provides ways for clients and services to interact with other services • Uses XML • Defines the means of access, e.g., – URL – E-Mail • Defines services hosted by an entity • Business-oriented tags • Uses SOAP for communicating
XML: eXtensible Markup Language • Allows definitions of types of documents • Tags are used to specify components of documents • Allows specification of namespaces to differentiate between identical tag names • Tag names do not provide semantics other than simple hierarchical relations
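The namespace point can be shown with a short sketch. The two namespace URIs and the record content below are hypothetical; the example only demonstrates that two elements both written as `title` remain distinct once their namespaces are taken into account.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces: one vocabulary for compounds, one for papers.
CHEM_NS = "urn:example:chemistry"
PUB_NS = "urn:example:publication"

doc = f"""
<record xmlns:chem="{CHEM_NS}" xmlns:pub="{PUB_NS}">
  <chem:title>ethanol</chem:title>
  <pub:title>Representation of ethanol in XML</pub:title>
</record>
"""

root = ET.fromstring(doc)

# The parser expands each prefix, so the two "title" tags differ.
tags = [child.tag for child in root]

ns = {"chem": CHEM_NS, "pub": PUB_NS}
compound = root.findtext("chem:title", namespaces=ns)
paper = root.findtext("pub:title", namespaces=ns)

print(tags)
print(compound, "|", paper)
```

Note that the namespaces only disambiguate the names; they say nothing about what a title means, which is the gap that ontologies are meant to fill.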
XML Overview • A language for building languages • Basic rules: be well formed and be valid • Particular XML “dialects” are defined by XML schemas. – XML itself is defined by its own schema. • Extensible via namespaces • Many non-Web services dialects – RDF, SVG, GML, CML, XForms, XHTML • Many basic tools available: parsers, XPath and XQuery for searching/querying, etc.
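A brief sketch of the basic tooling mentioned above, using only Python's standard library. The molecule catalog is invented for the example. Two caveats: `xml.etree.ElementTree` checks well-formedness only (validation against a schema needs a separate library), and it supports a limited subset of XPath rather than full XPath or XQuery.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML (no schema validation)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical catalog in an ad hoc dialect, for illustration only.
catalog = ET.fromstring("""
<molecules>
  <molecule id="m1"><name>water</name><formula>H2O</formula></molecule>
  <molecule id="m2"><name>ethanol</name><formula>C2H6O</formula></molecule>
</molecules>
""")

# XPath-style queries (the subset supported by ElementTree).
names = [m.findtext("name") for m in catalog.findall("molecule")]
ethanol = catalog.find(".//molecule[@id='m2']")
ethanol_formula = ethanol.findtext("formula")

print(is_well_formed("<a><b></b></a>"), is_well_formed("<a><b></a>"))
print(names, ethanol_formula)
```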
XML and Web services • XML lends itself to distributed computing: – It’s just a data description. – Platform, programming language independent • Web Services Description Language (WSDL) – Describes how to invoke a service – Can bind to SOAP, other protocols for actual invocation • Simple Object Access Protocol (SOAP) – Wire protocol extension for conveying RPC calls – Can be carried over HTTP, SMTP
OWL: Web Ontology Language • Builds on RDF and RDFS and adds a means for richer descriptions of properties and classes – Disjoint classes – Cardinality of classes – Characteristics of relations, like symmetry
Standards for Web Services • Business Process Execution Language for Web Services (BPEL4WS) • Web Ontology Language for Services (OWL-S) • Web Service Modeling Ontology (WSMO)
Standards Setting Boards • OASIS: Organization for the Advancement of Structured Information Standards – ebXML: e-business XML – UDDI: Universal Description, Discovery and Integration • Global Grid Forum – community of users, developers, and vendors leading the global standardization effort for grid computing
Standards Setting Boards • W3C: World Wide Web Consortium – OWL: Web Ontology Language – RDF/RDFS: Resource Description Framework/Schema – SOAP: Simple Object Access Protocol – URI/URL/URN: Uniform Resource Identifier/Locator/Name – WSDL: Web Services Description Language – XML: eXtensible Markup Language
SWWS: Semantic Web-Enabled Web Services • Main objectives: – Provide a comprehensive Web Service description framework – Define a Web Service discovery framework – Provide a scalable Web Service mediation middleware • A program of the European Commission to run 2002-2005 – http://swws.semanticweb.org
Web Services Integration Projects: Biosciences • myGrid – http://www.mygrid.org.uk/ • BIOPIPE – http://biopipe.org/ • BioMOBY – http://biomoby.org/
Web Services for Chemistry: Problems • Performance and scalability • Proprietary data • Competition from high-performance desktop applications -- Geoff Hutchison, it’s a puzzle blog, 2005-01-05 • ALSO: – Lack of a substantial body of trustworthy Open Access databases – Non-standard chemical data formats (over 40 in regular use and requiring normalization to one another)
Missing Ingredients in Chemistry • Chemical communities to assemble Open Access databases – Well-defined quality assurance procedures performed by distributed peer-review systems – Software underlying the databases needs to be open source.
Chemistry Databases on the Web • Marc Nicklaus lists 37 databases as of October 2001 – Must have structure searching and at least 100 molecules – http://cactus.nci.nih.gov/ncidb2/chem_www.html • SoaringBear’s List has 15 databases – http://geocities.com/soaringbear/biomed/chem.html
Institutional Repositories • NARSTO Quality Systems Science Center – http://cdiac.esd.ornl.gov/programs/NARSTO/ – Pollutant species in the troposphere over North America – Part of the Carbon Dioxide Information Analysis Center at ORNL – NARSTO Data and Information Sharing Tool • http://mercury.ornl.gov/narsto/
Public Data Repositories • Developmental Therapeutics Program/NCI – Some assay data for download – Structures for over 200,000 compounds • http://dtp.nci.nih.gov/docs/dtp_search.html • Zinc and other screening databases • NIST computational chemistry database • Environmental fate and exposure databases
Other Public Repositories 1 • ChemExper Chemical Directory – > 200,000 substances; > 10,000 IR spectra – http://chemexper.com/ • HIC-Up; Hetero-Compound Identification Centre – Uppsala – 5384 substances as of 1/15/05 – http://xray.bmc.uu.se/hicup/ • Chemicals with Pharmaceutical Activity; a 3D Structural Database – 400 3D structures – http://www.chem.ox.ac.uk/mom/chemical-database/
Other Public Repositories 2 • Cheminformatics.org – 41 data sets in 9 categories as of 8/18/05 – http://www.cheminformatics.org/ • WebReactions – http://webreactions.net/
Other Public Repositories 3 • MolTable – http://www.moltable.org/ • MatWeb Materials Property Data – http://www.matweb.com/index.asp?ckck=1 • Spectral Database for Organic Compounds (SDBS) – Over 32,000 compounds – Has EI-MS, FT-IR, 1H NMR, 13C NMR, Raman, ESR – http://www.aist.go.jp/RIODB/SDBS/cgi-bin/cre_index.cgi • NMRShiftDB (Christoph Steinbeck) – 14,753 structures as of 8/19/05 – Features peer-reviewed submission of data sets – http://www.nmrshiftdb.org/
Other Public Repositories: Commercial Teasers • FTIRsearch.com (Thermo Electron) – Demo file of 575 spectra from 87,000 in the full database – https://ftirsearch.com/default3.htm • ChemACX – 30 of >350 suppliers’ catalog data – http://chemacx.cambridgesoft.com/chemacx/index.asp • Sunset Molecular Discovery, LLC – Wombat (World of Molecular BioAcTivity) • 117,007 entries with over 230,000 biological activities – Wombat PK • Database for Clinical Pharmacokinetics: 643 substances with 4668 measurements – Three sample files from Wombat containing 341 Histamine-1 receptor antagonists – http://www.sunsetmolecular.com/
BlueObelisk.org • A group of chemists, programmers, and informaticians working collaboratively on projects such as: – Chemistry Development Kit (CDK) – JChemPaint – Jmol – JUMBO – NMRShiftDB – Octet – Open Babel – QSAR – World Wide Molecular Matrix (WWMM)
Indiana University Existing Projects • System for the Integration of Bioinformatics Services (SIBIOS) – http://sibios.engr.iupui.edu • PlatCom: A Platform for Computational Comparative Genomics – http://bio.informatics.indiana.edu/sunkim/Platcom/ • Reciprocal Net – http://www.reciprocalnet.org/index.html
Indiana University Planned Projects • Design of a Grid-based distributed data architecture • Development of tools for HTS data analysis and virtual screening • Database for quantum mechanical simulation data • Chemical prototype projects – Novel routes to enzymatic reaction mechanisms – Mechanism-based drug design – Data-inquiry-based development of new methods in natural product synthesis
Web Services Future • Depends on – Adoption of standards – Incorporation of WS in current and newly developed applications – Security, privacy, quality of data issues – Development of WS tools and resources for eScience