b65cb2e0b1ca456bed05bc4c2c8b3468.ppt
- Number of slides: 45

Bioinformatics Analytical Pipelines
[Diagram: Biological/Clinical Experiments → Instruments → Data Pre-Processing → Analytical Algorithms → Interpretation of Results. Perl scripts, algorithms, files, and databases (DB) surround the Oracle Life Sciences Platform; outputs: New Paper, New Drug, New Treatment, New DB Entries.]
Life Science Discovery Phases:
• Exploratory/Prototype Analysis
• Application Development
• Production System

DNA Microarray Analysis of Lymphoma: BioOracle Integrated Demo (eSeminar, Feb. 13th, 2003)
[Diagram: Biopsy samples (DLBC, Follicular) → Instruments (Affymetrix microarray, microarray lab) → Filtering and Pre-Processing (SQL, XML, Java) → Feature Selection (SQL) → Molecular Pattern Recognition (Oracle Data Mining: feature selection, Bayesian classifier) → Interpretation of Results (Discoverer reports, Java servlets, portals) → Prediction: DLBC or Follicular]
Dataset from Golub et al., Science 286: 531-537.

More Examples
• DNA microarray analysis for cancer classification
  – Gene expressions from leukemia cells
  – Target: leukemia morphologies & treatment outcome
• Early disease screening by proteomic analysis
  – Mass spec. “peaks” from patient blood samples
  – Target: cancer or normal status
• Drug activity analysis
  – Molecular characteristics of new drug candidates
  – Target: binding affinities to targets

Caprion
• Discover & develop innovative products for the diagnosis & treatment of diseases
  – Scalability for a multi-TB system
  – Integration of all components with the existing computing environment
  – Security & protection of data integrity
• Oracle Environment
  – Oracle database
  – Oracle 9i Application Server
  – Oracle 9i Developer Suite
  – Oracle 9iAS Discoverer
  – Oracle Warehouse Builder
• Key Advantages of Oracle
  – Easy access & management of integrated information
  – Rapid deployment of new ad hoc queries
  – Scalability necessary to accommodate growth
• “The Oracle Data Warehouse is a key component of our IT platform for proteomics analysis. The massive amount of information we produce every day requires a system with proven performance to effectively capture our biological data.” - Bernard Gagnon, IT Director

Myriad Proteomics
• Mapping human protein interactions at a system scale using two-hybrid & mass spec.
  – Online database system to automate laboratory flows
  – Databases for intermediate results for quality control and tracking
  – Data marts that are specific to customer needs
• Oracle Environment
  – Oracle 9i database with partitioning
  – Oracle Enterprise Manager
  – Plan to use XML DB and External Tables in 2003
• Key Advantages of Oracle
  – Ease of maintenance
  – Partitioning keeps up with end-user demands for fast query times
  – Meets scalability needs
• “One of the keys to the technological success of this project is our use of Oracle software. Every aspect of our business touches Oracle technology; it’s a key component of our work.” - Marcel Davidson, Head of DB Architecture & Administration

Applied Biosystems
• Enterprise software for laboratory automation and integration
  – Life Science LIMS
  – SQL*LIMS™ Software
• Key Advantages
  – Sample and container management to support complex sample fan-outs
  – Full audit trail support to help meet regulatory compliance
  – Application-specific interfaces to meet customers’ needs
  – Integration with third-party software
  – Supported by a worldwide professional services group
• Oracle Features
  – Scalable, highly available
  – Open standards for messaging and program integration
  – Powerful reporting tools
  – Web publishing supported
• www.appliedbiosystems.com

5. Collaborate Securely
• Oracle Collaboration Suite - Integrated communications
  – Single enterprise search across all repositories (internal & external)
  – Flexible access (Web, desktop, wireless, and telephone)

5. Collaborate Securely
• Oracle 10g AS Portal
  – Build personalized portals
• Oracle Workflow
  – Automate laboratory and business processes
• Oracle 10g AS Files
  – Enable content management and collaboration: revision control, check-in/check-out, access control
• Virtual Private Database
  – Different users have unique access privileges
• Auditing
  – Create an audit trail to facilitate FDA compliance
• Oracle 10g AS Web Services
  – Standard way to collaborate through the web

Taratec e-Compliance™: An iFS Application
• Taratec e-Compliance™
  – Built specifically to support FDA 21 CFR Part 11 compliance
  – Designed for life sciences data & file management
• Features
  – Versioning, advanced searching, check-in/check-out
  – Integrated storage of files from any source
  – Universal access through a Web browser
  – Complete audit trail of file operations
[Screen shot or diagram]
“With Oracle as the foundation, we were able to develop a solution that can secure a vast array of file-based data with vault-like security.” - Bill Gargano, President and COO, Taratec Development Corporation
Taratec and Taratec’s logo are registered trademarks of Taratec Development Corporation. © 1999 Taratec Development Corporation

GenSys Software
• Products
  – GenSys/ELN (Electronic Laboratory Notebook)
  – GenSys/R&D (Research & Discovery software integration platform)
• Key Advantages
  – Most dependable and secure enterprise-wide application
  – Easy for researchers to learn & use
  – Integrates well with scientist desktop & back-end applications (i.e. registration, LIMS, search & document management)
  – Also supports legal, regulatory and records management users
• Oracle 9i Features
  – XML: automatic creation of XML Open views
  – Adobe PDF support
  – Oracle’s scalability and flexibility is key to GenSys’ enterprise solution
  – iFS and XML DB offer powerful potential in future releases
• www.gensys.com
Web Services - Life Sciences Data Sources & Applications

Oracle Web Services: SOAP, WSDL, UDDI Together
• I3C participation
• Communicate Oracle’s support for web services industry standards
[Diagram: 1. Publish: the Web Service Supplier (servlet) publishes a service WSDL document to the UDDI Repository. 2. Find: the Web Service Consumer finds the service in the UDDI Repository. 3. Invoke: the consumer invokes the service over SOAP.]
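
The publish, find, invoke cycle in the diagram can be sketched with a toy in-memory registry. This is only an illustration of the roles: the `Registry` class, its method names, and the `gc_content` service are hypothetical, and nothing here speaks real UDDI, WSDL, or SOAP.

```python
class Registry:
    """Toy in-memory stand-in for a UDDI repository (not a real UDDI client)."""

    def __init__(self):
        self._services = {}

    def publish(self, name, description, endpoint):
        # Step 1: the supplier registers a description and a callable endpoint.
        self._services[name] = {"description": description, "endpoint": endpoint}

    def find(self, keyword):
        # Step 2: the consumer searches descriptions for a keyword.
        return [
            name
            for name, entry in sorted(self._services.items())
            if keyword.lower() in entry["description"].lower()
        ]

    def invoke(self, name, **params):
        # Step 3: the consumer calls the service it found.
        return self._services[name]["endpoint"](**params)


def gc_content(sequence):
    """Hypothetical service: fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


registry = Registry()
registry.publish("gc_content", "Compute the GC fraction of a DNA sequence", gc_content)
found = registry.find("dna sequence")
value = registry.invoke(found[0], sequence="ATGC")
```

In a real deployment the description would be a WSDL document and the call would travel as a SOAP message; the sketch keeps only the division of roles.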

Oracle 10g Unbreakable Security
• Complete data protection
• Manage user access
• Detect data misuse with Auditing
• Facilitate regulatory compliance (HIPAA, 21 CFR Part 11)
• Proven against 15 independent evaluations

Security Evaluations             | Oracle | Microsoft | IBM
US TCSEC, Level B1               | 1      | -         | -
US TCSEC, Level C2               | 1      | 1         | -
UK ITSEC, Levels E3/F-C2         | 3      | -         | -
UK ITSEC, Levels E3/F-B1         | 3      | -         | -
ISO Common Criteria, EAL-4       | 4      | -         | -
Russian Criteria, Levels III, IV | 2      | -         | -
US FIPS 140-1, Level 2           | 1      | Failed    |
TOTAL                            | 15     | 1         | 0

University of California San Diego School of Medicine
• The Patient Centered Access to Secure Systems Online (PCASSO)
  – 178,000 medical records
  – Provides trusted access to a patient’s health information from healthcare providers over the Internet
  – Oracle Label Security & Virtual Private Database
• The security is locked to the data and therefore can’t be subverted.
• No application coding needed to implement security.

San Diego Supercomputing Center
“In the beginning, we considered using MySQL, Oracle, and another database. But when we evaluated our project needs over the next ten years and realized that our database could grow to terabytes, we decided we needed a scalable database and one that was reliable. We didn’t want to be forced to change databases in the middle of the project.” … “We do not need a lot of DBAs to maintain the database.” - Joshua Li, Senior Computational Scientist, University of California, San Diego, Supercomputing Center
Systemwide, SDSC relies on only three DBAs to run over 40 Oracle databases.

AMR Research
“Regulatory compliance has become a business risk. Big fines can be levied and the FDA can shut down manufacturing lines. The FDA wants to make sure companies have electronic signatures and a full, auditable track record. All the IT systems deployed have to guarantee full accountability and change management. Oracle offers full traceability of the database. In other products, you have to make sure the application that’s using the data gives you the change management. Oracle provides security and authentication built into the database technology.” - Roddy Martin, Service Director of Consumer Package Goods and Life Sciences, AMR Research

European Bioinformatics Institute
“Our mission is to build molecular biology databases of importance and place them in the public domain so they can be used by the research community as easily as possible.” - Peter Stoehr, Head of Database Operations, EBI
“Researchers tap directly into the data repository we host here. We’re an international data repository.” - Weimin Zhu, Head of Database Application Group, EBI

Oracle’s Contribution to Life Sciences
“Find me any compound that looks like my current structure, that has been tested on any assay in my company where the IC50 > 200 nM, where I know that I have a unique patent position, and that hasn't been published in any journal.”
Oracle 9i (illustrative query):
  select c.id, p.structure
  from compound c, protein p, assay a
  where a.compound_id = c.id
    and a.protein_id = p.id
    and a.company = 'BIO_SYS'
    and a.IC50 > 200   /* nM */
    and similar_to(p.id, 'protein kinase')
    and not_published(p.id, 'Medline')
    and extract_value(p.id, 'Dgene/Protein/Id') = p.id
[Data types shown: Message, XML, Text, Relational, Image]

IDC Analysts
“Even IBM's own partners say that DB2 and DiscoveryLink have failed to gain much ground in the life sciences despite IBM's giveaways. According to Hall, Oracle, the "de facto standard," still holds a commanding 75 percent to 80 percent market share in this vertical.” - Mark Hall, Director of Life Sciences, IDC, in InfoWeek 12/12/2002

Life Sciences Highlights
• Life Sciences featured in Oracle Magazine
  – Features San Diego Supercomputer Center, European Bioinformatics Institute, and Celera Genomics Group
• InfoWorld article
  – Even IBM's own partners say that DB2 and DiscoveryLink have failed to gain much ground in the life sciences despite IBM's giveaways. According to Hall, Oracle, the "de facto standard," still holds a commanding 75 percent to 80 percent market share in this vertical.
  – Mark Hall, Director of Life Sciences, IDC, in InfoWeek 12/12/2002
• BioInform article, Feb 2003
  – Oracle currently claims to hold 85 percent of the life science research database market, but the company isn't resting on its laurels. On the contrary, the database giant is expanding the capabilities of its software in a bid to retain its edge in the increasingly competitive market.

Additional Life Sciences Information
• Server Technology Development, Life Sciences Product Management Team
  – charlie.berger@oracle.com
  – yao-chun.peng@oracle.com
  – susie.stephens@oracle.com
• OTN: http://otn.oracle.com/industries/life_sciences/content.html
  – Oracle Life Sciences Platform
  – Tech eSeminars, white papers, Partner Solutions, Customer Profiles, OTN Discussion Forum, etc.
• Oracle.com: http://www.oracle.com/industries/life_sciences/index.html?content.html
• Internal site: http://bioinformatics.us.oracle.com

Oracle Life Sciences User Community
• Customer Advisory Board (CAB)
• User group meetings being formed in North America, Europe, and Asia Pacific
  – May 2003 at Hinxton Hall Conference Centre, Wellcome Trust Genome Campus, Hinxton, UK
  – Sept 10, 2003, OracleWorld, San Francisco
• Discussion Forum on OTN
• “Oracle Life Sciences” SourceForge.net project administered by SDSC to facilitate code & experience sharing: http://sourceforge.net/projects/oraclelifesci/
Oracle 10g: “The Bioinformatics Release”

Life Sciences 10g Features
• Data Access
  – Heterogeneous transportable tablespaces
  – Merge enhancements
• Variety of Data Types
  – XML DB enhancements
  – Enhanced text processing and searches (to cluster and classify)
  – Network Data Model feature for managing “graph” databases
• Scalability and High Throughput
  – Grid
  – Distributed query optimization
  – Data pump
• Finding Patterns and Insights
  – Data Mining: DM4J GUI, 2 new algorithms (SVMs & NMF), PL/SQL API
  – BLAST
  – Text Mining
  – Regular expression searches
  – Expanded basic statistics
  – IEEE floating point
• Collaborate Securely

Oracle Data Mining BLAST
• Implemented using a table function interface
• BLAST search functions can be placed in SQL queries
• Different functions for match & alignment
• SQL queries can be used to pre-filter the database of sequences & post-process the search results
• The combination of SQL queries & BLAST is very powerful and flexible

Oracle Data Mining BLAST
• Web Services GUI (available via OTN)

Sample BLAST Query
• For the query sequence “ATCGCGTT”, find the top 3 matches above a similarity threshold from each organism:
  select seq_id, organism, score, expect
  from (select t.seq_id, t.score, t.expect, g.organism,
               RANK() OVER (PARTITION BY organism ORDER BY score DESC) as o_rank
        from SwissProt_DB g,
             Table(SYS_BLASTP_MATCH('ATCGCGTT',
                                    cursor(select seq_id, sequence from SwissProt_DB),
                                    5)) t   /* expect_value */
        where t.seq_id = g.seq_id)
  where o_rank <= 3
• BLAST “Delighters”
  – Queries performed in the database
  – Ability to perform combinatorial queries, e.g. sequence similarity AND annotation contains “Lymphoma”
[Diagram: query plan. SYS_BLASTP_MATCH(query_sequence, parameters) over SwissProt_DB is joined to SwissProt_DB on t.seq_id = g.seq_id, ranked with RANK, and filtered by o_rank <= 3 to return seq_id, organism, score, expect.]
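
The ranking step of the sample query (RANK() partitioned by organism, keeping o_rank <= 3) can be sketched outside the database as follows. The hit rows are made-up stand-ins for what a BLAST match would return, and `top_matches_per_group` is a hypothetical helper, not an Oracle API.

```python
from collections import defaultdict


def top_matches_per_group(hits, group_key, score_key, n=3):
    """Keep hits whose rank within their group is <= n; tied scores share a rank."""
    by_group = defaultdict(list)
    for hit in hits:
        by_group[hit[group_key]].append(hit)

    selected = []
    for group in sorted(by_group):
        rows = sorted(by_group[group], key=lambda r: r[score_key], reverse=True)
        for row in rows:
            # RANK() semantics: 1 + number of rows in the group with a strictly higher score.
            rank = 1 + sum(1 for other in rows if other[score_key] > row[score_key])
            if rank <= n:
                selected.append({**row, "o_rank": rank})
    return selected


# Hypothetical hits (seq_id, organism, score) for illustration only.
hits = [
    {"seq_id": "P1", "organism": "human", "score": 90},
    {"seq_id": "P2", "organism": "human", "score": 85},
    {"seq_id": "P3", "organism": "human", "score": 85},
    {"seq_id": "P4", "organism": "human", "score": 70},
    {"seq_id": "P5", "organism": "mouse", "score": 60},
]
result = top_matches_per_group(hits, "organism", "score")
```

Because tied scores share a rank, P2 and P3 both get rank 2 and P4 falls to rank 4, so a "top 3" can return more or fewer than three distinct score levels per organism.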

Oracle Data Mining
• New GUI: “DM4J”
  – JDeveloper GUI add-in wizards for building data mining components in the database
  – Results browser
• New algorithms
  – Support vector machines: to handle very wide & shallow data; regression
  – Nonnegative Matrix Factorization: feature creation
  – Ability to mine “text” data: combine unstructured data and structured data
• PL/SQL API

Oracle Text: Enhanced Advanced Text Searches
• Perform enhanced information searches (using Oracle data mining functionality)
• Fast autoclustering of documents, URLs, etc. into natural groupings for more useful searches
• “Example documents” search: supply example documents and classify documents “likely” to be similar, based on patterns beyond simple keyword searches

New Statistics & SQL Analytics
• Ranking functions
  – rank, dense_rank, cume_dist, percent_rank, ntile
• Window Aggregate functions (moving and cumulative)
  – avg, sum, min, max, count, variance, stddev, first_value, last_value
• LAG/LEAD functions
  – Direct inter-row reference using offsets
• Reporting Aggregate functions
  – sum, avg, min, max, variance, stddev, count, ratio_to_report
• Statistical Aggregates
  – Correlation, linear regression family, covariance
• Linear regression
  – Fitting of an ordinary-least-squares regression line to a set of number pairs
  – Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions
• Descriptive Statistics
  – average, standard deviation, variance, min, max, median (via percentile_cont), mode, group-by & roll-up
  – DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- 3 sigma values, top/bottom 5 values
• Correlations
  – Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric)
• Cross Tabs
  – Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
• Hypothesis Testing
  – t-test, F-test, one-way ANOVA, chi-square, Mann-Whitney, Kolmogorov-Smirnov, Wilcoxon signed ranks
• Distribution Fitting
  – Normal, uniform, Poisson, exponential, Weibull
• Pareto Analysis (documented)
  – 80:20 rule, cumulative results table
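
As a rough illustration of what the linear regression and correlation aggregates compute, the sketch below fits an ordinary-least-squares line and Pearson's r to a handful of number pairs in plain Python. The function name and the sample data are assumptions for illustration; in the database these values would come from the SQL aggregates listed above.

```python
import math


def pearson_and_ols(xs, ys):
    """Return (slope, intercept, r) for the OLS line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-deviations about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical number pairs lying exactly on y = 2x + 1.
slope, intercept, r = pearson_and_ols([1, 2, 3, 4], [3, 5, 7, 9])
```

The population covariance is sxy / n and the sample covariance is sxy / (n - 1), which is why covariance and correlation functions are so often used alongside the regression family.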

Other Features Important to Life Sciences
• Grid Computing
• IEEE Floating Point
• XML DB
• Heterogeneous Transportable Tablespaces
• Distributed Query Optimization
• Network Data Model
• “Upsert” (Merge), or not
• Regular Expression Searches

Grid Computing
• Automated job scheduling across the Grid
• Extensive support for Grid concepts already exists within the Oracle environment
  – Distributed queries
  – External tables
  – Security
  – RAC
• Participate in Global Grid Forum
• Incremental Grid support

IEEE Floating Point
• Support for industry-standard treatment of numbers and precision
• Critical for compute-intensive operations
• Faster performance
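
What "industry-standard treatment of numbers and precision" means in practice can be seen in a small sketch. It assumes a standard Python build, where `float` is an IEEE 754 double; the comparison with decimal arithmetic is added here for contrast and is not taken from the slide.

```python
import math
import struct
from decimal import Decimal


def ieee754_double_hex(value):
    """Return the 64-bit IEEE 754 encoding of a float as big-endian hex."""
    return struct.pack(">d", value).hex()


# Binary floating point cannot represent 0.1 exactly, so sums carry rounding error.
binary_sum_is_exact = (0.1 + 0.2 == 0.3)
# Comparing with a tolerance is the usual remedy.
binary_sum_is_close = math.isclose(0.1 + 0.2, 0.3)
# Decimal arithmetic represents these values exactly, at a higher compute cost.
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```

The trade-off is the one the slide hints at: binary floating point maps directly onto hardware arithmetic, which matters for compute-intensive work, while results must be compared with a tolerance rather than for exact equality.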

XML DB
• Already has the best support for XML today
• Applications can use standard SQL/XML operators to generate complex XML documents from SQL queries and to store XML documents
• The XML Parser is extended to support the updated and new W3C XML standards
• Support for evolution of XML schemas
• Major improvements in XML processing performance
  – XML Developer Kit (XDK) libraries and interfaces in Java, C, and C++ all transparently support the database XMLType, increasing throughput and scalability without high resource and processing costs. Additionally, the architecture has been redesigned using a pipeline process model and SAX to increase performance while reducing resources.

Heterogeneous Transportable Tablespaces
• Mechanism to quickly move a tablespace across Oracle databases
• Most efficient means to move bulk data between databases
• Enhanced to support transport across platforms and operating systems
[Diagram: source database → target database]

Distributed Query Optimization
• Excellent support for distributed queries today
• Performance addressed in each release
• Cost-based optimizer enhanced to capture complete statistics for remote tables
• Considers network bandwidth and latency in deciding what parts of the query plan should be remotely mapped
[Diagram: remote sources include flat files and MySQL]

Network Data Model
• Model, store, manage and analyze generic connectivity relationships in the DB, i.e. represent data as nodes and links
  – Can model hierarchies, logical or spatial information, directionality
• Network analysis at client or application level, e.g. shortest-path, tracing, within-distance analysis, minimum cost spanning tree, nearest neighbor
• Network management, e.g. add, delete, modify, load
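
To make the nodes-and-links idea concrete, here is a minimal sketch of one of the analyses named above, shortest path, using Dijkstra's algorithm over a list of directed, weighted links. The link data and the `shortest_path` function are hypothetical; this does not use or imitate the Oracle Network Data Model API.

```python
import heapq


def shortest_path(links, start, goal):
    """Return (total_cost, path) from start to goal, or None if unreachable."""
    # Build an adjacency list from (source, destination, cost) links.
    graph = {}
    for src, dst, cost in links:
        graph.setdefault(src, []).append((dst, cost))
        graph.setdefault(dst, [])

    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None


# Hypothetical directed links; costs must be non-negative for Dijkstra's algorithm.
links = [("A", "B", 1), ("B", "C", 2), ("A", "C", 5), ("C", "D", 1)]
route = shortest_path(links, "A", "D")
```

Directionality matters here: the same links give no route from "D" back to "A", which is the kind of property the slide says the model can capture.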

“Upsert” (Merge), or not
• Provides conditional “insert or update” processing
  – e.g. perform a checksum on an annotation or DNA sequence
• Used in periodic data loads where new data is merged with existing data and the content of source and/or destination is unknown, so INSERT or UPDATE cannot be used exclusively

Merge Statement Example
  MERGE INTO table
  USING table/view/subquery
  ON ( condition )
  WHEN MATCHED THEN update clause
  WHEN NOT MATCHED THEN insert clause
[The diagram also shows an optional SKIP WHEN ( condition ) branch.]
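
The conditional "insert or update" logic can be sketched in plain Python, with a dictionary standing in for the target table and a checksum deciding whether a matched row actually changed. The `merge` and `checksum` helpers and the sample sequences are assumptions for illustration; the choice of SHA-256 is arbitrary.

```python
import hashlib


def checksum(sequence):
    """Digest used to detect whether a stored sequence differs from the incoming one."""
    return hashlib.sha256(sequence.encode("ascii")).hexdigest()


def merge(target, source):
    """Upsert source rows into target in place; return a count per action."""
    counts = {"inserted": 0, "updated": 0, "skipped": 0}
    for seq_id, sequence in source.items():
        if seq_id not in target:
            # WHEN NOT MATCHED: insert the new row.
            target[seq_id] = sequence
            counts["inserted"] += 1
        elif checksum(target[seq_id]) != checksum(sequence):
            # WHEN MATCHED and the content changed: update the row.
            target[seq_id] = sequence
            counts["updated"] += 1
        else:
            # WHEN MATCHED and the content is identical: leave the row alone.
            counts["skipped"] += 1
    return counts


# Hypothetical existing table and incoming load.
target = {"s1": "ATCG", "s2": "GGCC"}
source = {"s1": "ATCG", "s2": "GGCA", "s3": "TTAA"}
counts = merge(target, source)
```

Comparing digests rather than full values pays off when the stored side keeps a precomputed checksum column, so long sequences or annotations need not be read back for every load.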

Regular Expression Searches
• Enable Regexp support in the database through SQL and PL/SQL
• Provide SQL and PL/SQL functions for Regexp matching and string manipulation
• Follow POSIX-style Regexp syntax
• Support standard Regexp operators including *, +, ?, |, ^, $, ., [ ], {m,n}, etc.
• Include common extensions such as case-insensitive matching, sub-expression back-references, etc.
• Compatible with popular Regexp implementations like GNU, Perl, Awk
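
The operators listed above can be tried out on sequence data with a short sketch. It uses Python's `re` module, which follows Perl-style rather than strictly POSIX syntax, so it illustrates the operators and not the database functions themselves; the two patterns and helper names are hypothetical examples.

```python
import re

# Start codon, any number of whole codons, then a stop codon; uses ^, $, *, |, [ ], {m}.
ORF = re.compile(r"^ATG(?:[ACGT]{3})*(?:TAA|TAG|TGA)$", re.IGNORECASE)
# A dinucleotide immediately repeated, found with a sub-expression back-reference.
TANDEM = re.compile(r"([ACGT]{2})\1", re.IGNORECASE)


def is_open_reading_frame(sequence):
    """True if the whole sequence is start codon + codons + stop codon."""
    return ORF.match(sequence) is not None


def first_tandem_repeat(sequence):
    """Return the first doubled dinucleotide (e.g. 'ACAC'), or None."""
    match = TANDEM.search(sequence)
    return match.group(0) if match else None
```

For example, `is_open_reading_frame("atgGCCtaa")` accepts mixed case because of the case-insensitive flag, and `first_tandem_repeat("GGACACTT")` finds "ACAC".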

Oracle 10g Customer Quotes
"Oracle 10g's new BLAST feature will enable us to easily integrate multiple types of genomic and proteomic data for complicated queries used in the mining of our proprietary protein-protein interaction and cDNA sequence datasets." - Jake Chen, Principal Bioinformatics Scientist, Myriad Proteomics
“Using InforSense discovery workflows built upon the world leading Oracle data mining, text mining and R&D Database functionality, researchers and organizations can now automate large scale and complex knowledge discovery and management activities with performance and reliability.” - Yike Guo, CEO, InforSense
"Oracle 10g's Network Data Model feature is great for building a semantic work infrastructure. Oracle 10g's graphical representation is an excellent tool for planning our Y2H protein interaction data storage needs and for building a signaling network from our Nature-AfCS Molecule Pages Database." - Joshua Li, Sr. Computational Scientist, San Diego Supercomputer Center / UCSD
"Thanks to Oracle 10g's Regular Expressions (RE) query support, it's no longer necessary to export data from the database, process it with a RE enabled tool and then import the data back into the database. Now, RE processing can be handled with a single query." - Marcel Davidson, Head of Database Administration, Myriad Proteomics

Oracle 10g Customer Quotes
"With Oracle 10g, sequence data that formerly needed to be exported, BLASTed, and reimported can now be analyzed with a single SQL statement." - Marcel Davidson, Head of Database Administration, Myriad Proteomics
"Oracle 10g's implementation of REs enables the expression of complex query logic, particularly against text strings, which is extremely useful in bioinformatics applications where queries are often formulated against complex genetic or proteomic code patterns." - Jake Chen, Principal Bioinformatics Scientist, Myriad Proteomics
"Beyond Genomics, Inc., as a leading systems biology company, believes that Oracle 10g's network data model will significantly advance the integration of metabolomic, proteomic, transcriptomic, and clinical data sets and the applications that derive value from these data." - Eric Neumann, Vice President Strategic Informatics, Beyond Genomics, Inc.

Oracle Life Sciences Platform Summary
• Life sciences is not just a “wet lab” environment
  – In silico drug discovery is now a critical component
  – Oracle, the “de facto standard”, enjoys an 80% market share (IDC)
• Enables you to
  – Access data from multiple sources
  – Integrate a variety of data types
  – Manage vast quantities of data
  – Find patterns and insights
  – Collaborate securely with other researchers
• Oracle 10g is an ideal platform for life sciences

Q & A: Questions and Answers