49f97ce964072b223a32407316b5ed89.ppt
- Number of slides: 31
IBM Research
Impliance: an Information Management Appliance
Bishwaranjan Bhattacharjee
IBM Watson Research Center
Vuk Ercegovac, Joseph Glider, Richard Golding, Guy Lohman, Volker Markl, Hamid Pirahesh, Jun Rao, Robert Rees, Frederick Reiss, Eugene Shekita, Garret Swart
Almaden Research Center
Impliance -- Information Management Appliance © 2002 IBM Corporation
Agenda
§ Motivation: Observations → Requirements
§ What is Impliance?
§ How is Impliance different from…?
§ Research opportunities
§ Conclusions
Impliance -- Information Management Appliance © 2007 IBM Corporation
After all our successes (and last night’s revelry), it’s easy to become self-congratulatory. Sorry, time for…
Some embarrassing questions:
§ Why is most (>80%) of the world’s data still not in databases?
§ Didn’t we “solve” this problem in the 1980s with object-relational systems?
§ Do you use a database to store your data on your laptop?
§ Why not? (You are a database bigot, aren’t you?)
§ Have you ever tried to query (with SQL) a database that:
  – You didn’t create, and…
  – Had more than 500 tables?
§ Just how easy is it to incrementally add DB capacity beyond 1 machine? 100 machines?
§ Have “self-managing” databases significantly simplified administration?
Observation → Requirements (1 of 5)
Observation #1: Information converging
§ Many types of data in today’s enterprise
  – Structured (traditional Data Base)
  – Semi-structured (traditional Content Management, XML)
  – Unstructured (text, multimedia)
§ Each needs a different search interface, today
  – SQL
  – JSR-170
  – Keyword search / Information Retrieval
Requirement #1: Store / Search / Analyze all data
§ Need to rapidly relate information of different types
§ With one unified interface!
§ Real use cases in paper
Observation → Requirements (2 of 5)
Observation #2: Awash in data, but not information
§ Typical complaint: “I can’t find what I’m looking for!”
§ But just finding data isn’t enough!
§ Today’s Business Intelligence is too human-intensive
Requirement #2: Pro-actively derive useful information
§ Need to glean more business value from enterprise data
§ What sort of analytics exploit unstructured data?
§ Need to automatically extract the semantics of text
§ A rebirth of data mining?
Observation → Requirements (3 of 5)
Obs. #3: Total Cost of Ownership (TCO) is paramount
§ People costs dominate TCO
  – Hardware often less than 50% of TCO
§ Minimize Time To Value
  – Databases take too long to set up!
§ Wizards & Advisors simply mask complexity, add brittleness
Reqmt. #3: System must be simple, robust, & secure
§ Sacrifice resource utilization for radical simplification of:
  – Setup / Configuration / Deployment (e.g., Self-Organizing)
  – Operation
§ KISS (you know this one) + KIWI – Kill It With Iron [Weikum]!
§ Example: “Good enough” plans exploiting massive parallelism
Observation → Requirements (4 of 5)
Observation #4: Data volumes growing fast
§ Data is kept longer
§ Lots of new kinds of data: RFID, email, photos, videos
§ Disk densities improving, but not seek times!
  – 1 TB disk for $399 (Hitachi)
Requirement #4: Simple & massive scale-out
§ 1000s of nodes
§ With low management overhead
§ No single point of failure
Observation → Requirements (5 of 5)
Obs. #5: Today’s Info. Mgmt. software based upon hardware 30 yrs. ago
§ Example: Update-in-place databases due to expensive disk
§ Today: Cheap CPUs, large storage, fast networks
Requirement #5: Need new (software) architecture
§ Opportunity to radically rethink Info. Mgmt. software architecture (Stonebraker: “refactor”), based upon:
  – Hardware economics
    • e.g., cheap (multi-core) CPUs, storage, memory, network
  – Software:
    • Formats (e.g., XML, semi-structured data)
    • Functionality required (e.g., unstructured search, analytics)
  – Specified in the right order:
    • Service requirements → Software → Hardware
What is Impliance?
Impliance – Information Management Appliance
Administrator-less:
  √ Low Time to Value by Self-Organizing
  √ Low Total Cost of Ownership
Scalable:
  √ Massively parallel scale-out…
  √ …to Petabytes!
Bundled:
  √ HW & SW
  √ Pre-configured
  √ Pre-tuned
  √ Limited APIs
Manage and Search All Data (Structured Data (Tables), XML, Text):
  √ Structured, Semi-Structured, …
  √ …Even Unstructured Text!
Pro-actively Mine Information:
  √ Glean business insight from data
What Does Impliance Actually Do?
§ All enterprise information:
  √ Stores & Retrieves (Search / Query)
  √ Composes / Integrates / Mashups
  √ Finds trends & exceptions (Business Intelligence)
Think of Impliance as…
§ Content Management on steroids (beyond JSR-170)
§ File System with all content searchable
§ Data Warehouse with all your enterprise’s data
  – Not just structured information
  – Excluding high-rate OLTP (web site)
§ A Jambalaya
Where does Impliance fit?
[Figure: Impliance positioned against DBMS, Content Management, and Archiving Products. Axes: Types of Data (Structured, Semi-Structured, XML, Unstructured) and Lifetime of Data (Transaction Ingestion, OLTP, Warehousing/OLAP, Archiving).]
How is Impliance related to…
§ Google Base?
  – Primary data store
  – Appliance (product, i.e., sits in customer site), not a Service
  – Enterprise, not “the masses”
§ DataSpaces / Google “Pay as you go”?
  – Primary data store (vs. lazy federation of existing data sources)
  – Enterprise, not “the web”
§ Database “Appliances” (Netezza, DATAllegro, Greenplum, etc.)?
  – Not just structured (relational) data
  – Discovery of semantics
  – More pro-active
Research Opportunities
§ Reducing TCO
  – Make categories of administration just GO AWAY
  – Self-Organizing to obviate database design
  – Exploit appliance’s limited externalized interfaces
§ New HW & SW architectures using off-the-shelf components
  – Achieving fine-grained scale-out
  – Targeting robust, “good enough” designs
  – Exploiting integration of components
§ Data and query models that
  – Unify all data, yet are simple
  – Tolerate “schema chaos”
  – Combine best features of keyword search & SQL
§ Automated discovery of
  – Data & query semantics for
    • Improving precision of queries
    • Organizing data adaptively
  – Trends, exceptions, etc. (pro-active Business Intelligence)
Conclusions
§ We’ve come a long way towards
  – the autonomic dream
  – incorporating all data
§ But we can do much more!
§ Impliance provides exciting opportunity for DB research
  – To lower TCO for information management
  – To exploit today’s hardware and software advances
  – To rethink information management in a fundamentally new way
§ Join us!
Thank You (English)
Gracias (Spanish)
Obrigado (Brazilian Portuguese)
Danke (German)
Grazie (Italian)
Merci (French)
Hindi, Thai, Traditional Chinese, Russian, Arabic, Simplified Chinese, Tamil, Japanese, Korean
Appendix
Redefining Information Systems -- Players
Web 2.0 oriented next generation systems (delivered through services or appliances):
§ Google, Yahoo, MSN, (IBM)
  – Google Base (a semi-structured/un-structured information base)
  – Google OneBox
§ NextGen systems built by integration of successful open source (Greenplum)
  – Data models: RSS/ATOM/Wiki/…
  – Architecture: DB+Search+Content systems (e.g., MySQL+Lucene+Jackrabbit)
Entrenched HW/Storage/middleware companies
§ Storage-driven:
  – EMC -- Moving up the value chain, brought in a classic Content system
  – IBM – IDS: synergy between classic CM (JCR) and storage
§ Server-driven:
  – Netezza, DATAllegro (for BI)
  – Zantaz (for email compliance)
  – Data Power (XSLT filtering)
§ Middleware-driven (IBM, Oracle, Microsoft)
  – Oracle Secure Enterprise Search
Research Focus 1: Reducing TCO
§ Make entire categories of administration JUST GO AWAY
§ Reducing time-to-value through new design principles
  – Self-organization of “schema chaos” obviates lengthy logical & physical design, REORG
  – Fine-grained scale-out (instead of scale-up) obviates need for load balancing, etc.
§ New software architecture
  – Target robust, highly-predictable, “good enough” utilization (KIWI = Kill It With Iron)
  – Componentization
    • Each component simple, robust, and adaptive
  – Virtual service model
    • Service Broker optimizes resources and assigns the workload
§ Exploit integrated hardware and storage systems to provide
  – Built-in redundancy and availability
  – Automated backup and archiving (ILM)
  – Easy cluster management
  – Schema chaos support at storage level (semantic storage)
  – Ability to use new types of grid elements (cell blade server) seamlessly
Research Focus 2: Scalability
§ True Grid Model
  – Off-the-shelf, commodity hardware
  – Dedicate blades to different tasks
  – Analytics, Search, Content, …
§ Fine-grained scale-out
  – Different blade types scale independently
  – From SMB to largest enterprises
§ Integrating modern HW & storage, e.g. BC 3, intelligent bricks
  – Logic pushdown into storage
  – Data+Content+Search+Digital Media
Blade types:
  – Data: storage and simple filtering
  – Analytical: aggregation & mining
  – Transaction: search, transactional get/put
Supports Mixed Workloads
[Figure labels: Xaction Stream, Data Stream, Content Stream, Archive/ILM Stream; Transactional Cluster, Analytic Grid; Transaction Blade, Analytic Blade, Data Blade (Data proc, RAID); Commodity Interconnect; Data Array; Predicate application, Aggregation, Redundancy management]
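The "logic pushdown into storage" idea on this slide can be illustrated with a small sketch. It is not Impliance code: the blade layout, the sample rows, and the function names (`data_blade_scan`, `analytic_blade_merge`, `total_by_region`) are all assumptions made for illustration. It only shows the division of labor the slide names, with data blades doing predicate application and partial aggregation, and an analytic blade merging the small partial results.

```python
from collections import defaultdict

# Hypothetical contents of three data blades; each row is (region, amount).
DATA_BLADES = [
    [("east", 10), ("west", 5), ("east", 7)],
    [("west", 20), ("north", 1)],
    [("east", 3), ("north", 4), ("west", 2)],
]


def data_blade_scan(rows, min_amount):
    """Runs next to the data: filter rows, then pre-aggregate locally."""
    partial = defaultdict(int)
    for region, amount in rows:
        if amount >= min_amount:        # predicate application at the data blade
            partial[region] += amount   # partial aggregation at the data blade
    return dict(partial)


def analytic_blade_merge(partials):
    """Runs on the analytic blade: combine the per-blade partial sums."""
    total = defaultdict(int)
    for partial in partials:
        for region, subtotal in partial.items():
            total[region] += subtotal
    return dict(total)


def total_by_region(min_amount):
    """Only the small partial results cross the interconnect, not raw rows."""
    partials = [data_blade_scan(rows, min_amount) for rows in DATA_BLADES]
    return analytic_blade_merge(partials)


result = total_by_region(3)
```

In this sketch each blade could run its scan independently, which is one way the "different blade types scale independently" bullet might play out.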
Parallel Run-time: Comparison of Plumbing
Platform WS XD Querying model ETL (streaming) (cleansing, DataStage (E2) transformation, composition) moderate yes rich Transactional (composition; no search, no BI) Parallelism Resource Scheduling limited Application Fault tolerance high yes yes limited GPFS Storage DB2 ESE with DPF Analytics for relational rich high yes Google Map/Reduce Analytics for anything (search, transformation, simplistic composition) limited extremely high yes Impliance Analytics for anything, Search, Composition rich extremely high yes extremely limited extremely high
§ Data Analyzer, Discovery, Query: Large-scale computation
§ Data Modeler: Simple, generic
§ SRRS (Scalable Reliable Runtime Support): Fault tolerant
§ DDS (Distributed Data Store): Provide reliability
§ VSCR (Virtual Storage and Computing Resource): Security Control
[Figure labels: Applications; JCR, SQL, XSLT, HTTP; content, Relational data, XML, Web page, Video, Archive ILM, …; Data Analyzer, Discovery, Query; Data/Query Modeler; Objects; Resource Modeler; Commodity HW]
Research Focus 3: Information Modeling and Querying
§ Simple, rich, unified information model & associated query languages, e.g.
  – Google Base approach promising
    • Defined typed attributes for navigation
    • Defined label for keyword search
  – Infosphere, MUSIC
  – Open community (RSS / Atom / wiki)
§ Automatic schema discovery and integration – self-organizing!
  – Integrating solutions from Infosphere, CLIO
§ Intelligence discovery
  – Automatic discovery of semantics (UIMA, Web Fountain, Avatar)
  – Pro-active, continuous mining (vs. passive BI model)
  – Contextual information supply
  – Including reporting and advanced analytics
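A minimal sketch of what "typed attributes for navigation" plus a "label for keyword search" might look like in one model follows. The `Item` class, the `search` function, and the sample items are hypothetical; they are not the Google Base or Impliance data model, only an illustration of combining a keyword match with typed predicates in a single query.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Item:
    label: str                                                # free text, matched by keyword
    attributes: Dict[str, Any] = field(default_factory=dict)  # typed values, matched by predicate


def search(items, keyword=None, where=None):
    """Return items whose label contains the keyword and whose typed
    attributes satisfy every predicate in `where` (name -> test function)."""
    results = []
    for item in items:
        if keyword is not None and keyword.lower() not in item.label.lower().split():
            continue
        if where is not None and not all(
            name in item.attributes and test(item.attributes[name])
            for name, test in where.items()
        ):
            continue
        results.append(item)
    return results


# Hypothetical items of different kinds sharing one model.
ITEMS = [
    Item("Quarterly sales report", {"year": 2006, "pages": 12}),
    Item("Sales call transcript", {"duration_min": 34.5}),
    Item("Product photo", {"year": 2007, "width_px": 1024}),
]

by_keyword = search(ITEMS, keyword="sales")
combined = search(ITEMS, keyword="sales", where={"year": lambda y: y >= 2006})
```

Items without a given attribute simply fail that predicate, so the same query can run over items that have no schema in common.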
Eliminate Admin Tasks… …Rather than adding layers (1 of 3):
§ Special-purpose, turn-key appliances for basic services vs. today’s general-purpose SW (but still uses off-the-shelf hardware!)
  – Bundled, Pre-installed, Pre-configured, Pre-tuned software!
  – Examples:
    • Information Management appliance
    • Web Server appliance
  – Minimizes interfaces user has to worry about
    • No need to externalize underlying operating system, storage details
  – Eliminates need to install, configure, and tune
§ Self-organizing data systems
  – Automatic discovery of data structure
  – Obviates need to
    • Define logical and physical schema a priori, reducing time to value
    • Migrate schema when organization changes
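As a rough illustration of "automatic discovery of data structure", the sketch below infers attribute names, observed value types, and coverage from records that arrive without a declared schema. The `discover_schema` function and the sample records are assumptions for illustration; the slides do not say how Impliance would perform discovery.

```python
def discover_schema(records):
    """Summarize the attributes seen across schema-less records.

    Returns {attribute: {"types": set of type names, "coverage": fraction
    of records that carry the attribute}}.
    """
    seen = {}
    for record in records:
        for name, value in record.items():
            entry = seen.setdefault(name, {"types": set(), "count": 0})
            entry["types"].add(type(value).__name__)
            entry["count"] += 1
    total = len(records)
    return {
        name: {"types": entry["types"], "coverage": entry["count"] / total}
        for name, entry in seen.items()
    }


# Hypothetical records loaded with no schema defined a priori.
RECORDS = [
    {"id": 1, "name": "pump", "price": 9.5},
    {"id": 2, "name": "valve"},
    {"id": "A-3", "name": "hose", "color": "red"},
]

schema = discover_schema(RECORDS)
```

Here `id` shows up with two types and `price` in only a third of the records, which is the kind of "schema chaos" a self-organizing system would have to tolerate rather than reject.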
Eliminate Admin Tasks (2 of 3):
§ Universal Data Management
  – Today: Plethora of special-purpose data managers:
    • Databases for structured data
    • Content managers for semi-structured data
    • File systems for unstructured data
  – For each, very different
    • User interfaces (SQL, JSR-170, file interface)
    • Degrees of semantic knowledge about the data’s contents
    • Degrees of searchability
    • Consistency semantics (e.g., transactions) when updated
    • Management capabilities and interfaces
  – Tomorrow: Single mechanism for managing all data
    • Uniform interfaces for all types of data, for Searching, Updating, Management
    • Universal indexing (“Google model”) of all data – default search mechanism
    • Plus more precise searching for auto-discovered (above) structured information
  – Obviates need to
    • Impose naming conventions to find desired data
Eliminate Admin Tasks (3 of 3):
§ Robust storage mechanisms to eliminate need for backups
  – Never throw out data – keep versions!
    • Update-in-place is an anachronism from days of expensive disk, increases complexity of transactions, and jeopardizes compliance requirements (Sarbanes-Oxley)
    • Versions permit queries “as of” some time
    • Exploits storage density increases (relative to number of disk arms)
  – RAID provides local reliability
    • Widely accepted and deployed
    • Weaver Codes extend to multiple simultaneous failures
  – How to provide universal reliability (i.e., against site disasters)?
    • Selective, automated replication of new versions?
    • Cross-site RAID?
§ Universal “Call Home” technology for remote management of
  – Monitoring
  – Problem determination
  – Software maintenance & upgrades
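The "keep versions" and "as of" points can be sketched with an append-only store in which a write never overwrites an earlier value. The `VersionedStore` class, its method names, and the integer timestamps are hypothetical choices for illustration, not a description of Impliance's storage layer.

```python
import bisect


class VersionedStore:
    """Append-only key/value store: every put adds a version, nothing is updated in place."""

    def __init__(self):
        self._versions = {}  # key -> (sorted list of timestamps, parallel list of values)

    def put(self, key, value, timestamp):
        times, values = self._versions.setdefault(key, ([], []))
        if times and timestamp <= times[-1]:
            raise ValueError("timestamps must increase for each key")
        times.append(timestamp)
        values.append(value)

    def get(self, key, as_of=None):
        """Return the latest value, or the value that was current at time `as_of`."""
        times, values = self._versions.get(key, ([], []))
        if not times:
            raise KeyError(key)
        if as_of is None:
            return values[-1]
        idx = bisect.bisect_right(times, as_of)  # number of versions written at or before as_of
        if idx == 0:
            raise KeyError(f"{key!r} did not exist as of {as_of}")
        return values[idx - 1]


store = VersionedStore()
store.put("price", 10, timestamp=1)
store.put("price", 12, timestamp=5)
latest = store.get("price")
earlier = store.get("price", as_of=3)
```

Because old versions are retained, an audit-style question such as "what was the price at time 3?" needs no backup restore; the cost is extra storage, which the slide argues is now cheap relative to disk arms.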
Observation / Requirements
§ Information converging: Store / Search / Analyze ALL data
  – Structured (traditional Data Base)
  – Semi-structured (traditional Content Management, XML, multi-media, call center records)
  – Unstructured (text)
  – Same advanced functionality required
§ Data volume growing fast: On Demand strategy requires massive scale-out
  – Lots of new data: RFID, email, photos, videos (Deep Internet-scale systems being built)
  – Data is kept longer, due to compliance requirements
§ Total Cost of Ownership (TCO) is paramount: System simple & robust (not smart & fragile)
  – People costs dominate TCO: Hardware often less than 50% of TCO
  – Hence, sacrifice resource utilization for radical simplification
  – Delivered in services or appliances
§ Today’s IM software based upon hardware 30 yrs ago: Need new software architecture
  – Cheap CPUs, large storage, fast network in hardware
  – Opportunity to radically rethink IM software architecture, based upon:
    • Hardware economics (e.g., cheap CPUs, storage, memory, & network)
    • Data: Formats (e.g., XML, semi-structured data)
    • Functionality required (e.g., unstructured search, analytics)
Total Cost of Ownership is the Driver 29 Impliance -- Information Management Appliance © 2006 IBM Corporation
Changing Characteristics of Data
§ Categories: Transactions and structured data; Text and other human data; Machine-generated and unstructured data
§ Dimensions: Actionability, Heterogeneity, Scale
§ Examples:
  – Seat on an airplane: Easy to find, structured
  – LifeScience data - protein folding, gene expression: Difficult to analyze but we know where to look
  – Satellite and surveillance data: An infinite space of "patterns"
Impliance: A Highly-Scalable, Rich-Functional Information Management Appliance
What functions?
§ Store and manage all information
  – accept all types of enterprise data
§ Deliver all intelligence
  – Integrate cross silo information
  – Advanced analytics with richer semantics
What properties?
§ Low TCO
  – easy to deploy (“plug & play”)
  – simple and stable
§ Scalability
  – From SMB to Very Large (PetaBytes)
  – (Not for high-end OLTP!)
How delivered to enterprise: appliance or service
[Figure labels: A box with software pre-installed; JCR, Native content retrieval, SQL interface, XSLT, HTTP, Native Impliance update/load interface; Relational data, XML, Web page, Video, Archive ILM, …; Data+Content+Digital Media]