  • Number of slides: 37

Why Search Engines are used increasingly to Offload Queries from Databases
Bjørn Olstad, CTO, FAST Search & Transfer; Adjunct Prof., The Norwegian University of Science & Technology
Email: bjorn.olstad@fast.no Cell: +47 48011157

The Typo Problem...

Talent Offloading..

The Web Search Experience

The RDBMS Experience
High input barrier
"You are viewing 5 random jobs out of 2461 jobs in total.."

CareerBuilder use scenario, part 1: 30956 jobs (step 1)

CareerBuilder use scenario, part 2: 1084 jobs (step 2)

CareerBuilder use scenario, part 3: 30 jobs (step 3)

CareerBuilder use scenario, part 4: 5 jobs
From 30956 jobs to 5 targeted jobs in 3 steps

Challenger Shuttle Launch
Fax to NASA from contractor with O-ring concern

Presentation Matters …

IYP: A Disruptive Change
Taylor or Gibson guitar?
Good local offers?
Compare offerings
Phone / Directions
What is the phone number to Will's Barber shop?
BTW: I'm using my iPAQ
ESP: Cleansing, Mining, Relevance and Discovery
Product & Company Services web site Blogs++

ISVs: A Disruptive Change
Siebel 2000: Search is a tactical afterthought
Siebel 2005: Search is a strategic enabler
[Diagram labels: "my" CRM Application; Search; Information Access Layer; 3rd party content]

Revisit the Assumptions …
Relational algebra: large but "finite" data sets; structured data
Search & Explore focused: "infinite" data sets; unstructured & structured
80% Unstructured
[Timeline chart, axis label GIGABYTES. Milestones: Cave paintings, bone tools 40,000 BCE; Writing 3500 BCE; 0 C.E.; Paper 105; Printing 1450; Electricity, Telephone 1870; Transistor 1947; Computing 1950; Internet (DARPA) late 1960s; The Web 1993; 1999. SQL milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03. Data labels: 2000: 3 B; 2001: 6 B; 2002: 12 B; 2003: 24 B]

Extreme Capabilities?
• Feeding/streaming, transaction, retrieval or analytics centric?
• Content size: M, L, VVVL or Vn ∞ L?
• Schema centric, Semi-structured XML, Text, Agnostic?
• Fuzzy & Value vs. Binary & Completeness?
• Discovery primitives?
• User interaction part of design target?

Query Latency RDBMS vs ESP
Test Data:
• Structured data:
  • 5 million records
  • 13 fields per record
• Structured queries:
  • 22 SQL queries (representative in ERP)
The Result:
• #1: FAST ESP w/ disk
  • Mean = 99 [ms]
  • St. dev. = 36 [ms]
• #2: Oracle w/ memory mapping
  • Mean = 4 057 [ms]
  • St. dev. = 9 368 [ms]
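The two result sets can be compared directly from the four figures on the slide. The following is a minimal sketch that uses only those numbers; the variable names are illustrative and are not part of the original benchmark.

```python
# Figures quoted on the slide, in milliseconds.
esp_mean_ms, esp_std_ms = 99.0, 36.0
rdbms_mean_ms, rdbms_std_ms = 4057.0, 9368.0

# How many times larger the mean RDBMS latency is than the mean ESP latency.
mean_ratio = rdbms_mean_ms / esp_mean_ms

# Coefficient of variation (st. dev. / mean): a unitless measure of spread.
# A value above 1 means the standard deviation exceeds the mean.
esp_cv = esp_std_ms / esp_mean_ms
rdbms_cv = rdbms_std_ms / rdbms_mean_ms

print(f"mean latency ratio (RDBMS / ESP): {mean_ratio:.1f}x")
print(f"coefficient of variation, ESP:    {esp_cv:.2f}")
print(f"coefficient of variation, RDBMS:  {rdbms_cv:.2f}")
```

The comparison says nothing about the 22 individual queries, which the slide does not list.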

Queries per Second RDBMS vs ESP
• Identical HW: single node, 2 CPU, 4 GB RAM, 3 SCSI disks
• Identical data: auction data from eBay, 3.6 million docs
• Identical queries: 200 queries defined by Oracle
[Chart: QPS, RDBMS vs ESP]

Disruptive Change
Relational Model: Queries that fit The Model; Queries that don't fit The Model
Alternative I: Incremental fixes to painful shortcomings. Adds complexity.
• Star, snowflake schemas++
• Cubes / datamarts ++
Alternative II:
• Schema agnostic
• Scalable ad-hoc querying
• BLOBS
• Contextual Insight
• Real-time fusion of disparate data models
• Massive fault tolerant scalability

Extreme Capabilities
ESP Design Targets Powering Search Derivative Applications (SDAs)
Value/Noise (SNR); Contextual Refinement; Contextual Insight; User Interaction
Game Changer driven by Extreme Retrieval and on-the-fly Analytics

Database Query Offloading
Example: AutoTrader.com
RDBMS:
• HW-cost: $320 K (32 CPU on 4 Sun servers)
• 90% sub-second query response; average = 12 s for the rest
• Relevance = Sorting
• 5 FTE to maintain
ESP:
• HW-cost: $90 K
• 100% sub-second query response
• Flexible relevance and discovery
• 0.5 FTE to maintain
[Diagram labels: ESP; Car Dealers - Product Supply]
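The "90% sub-second" figure hides how much the slow tail costs. A short calculation, sketched here from the slide's numbers alone, bounds the overall mean RDBMS response time and gives the cost ratios. The only assumption is that a sub-second query takes between 0 s and 1 s.

```python
# Figures quoted on the slide.
rdbms_hw_cost_usd, esp_hw_cost_usd = 320_000, 90_000
rdbms_fte, esp_fte = 5.0, 0.5
fast_fraction = 0.90   # share of RDBMS queries answered in under 1 s
slow_mean_s = 12.0     # average latency of the remaining queries

slow_fraction = 1.0 - fast_fraction

# Bounds on the overall RDBMS mean latency. The slow tail alone
# contributes slow_fraction * slow_mean_s seconds to the mean.
mean_lower_s = slow_fraction * slow_mean_s
mean_upper_s = fast_fraction * 1.0 + slow_fraction * slow_mean_s

hw_cost_ratio = rdbms_hw_cost_usd / esp_hw_cost_usd
fte_ratio = rdbms_fte / esp_fte

print(f"overall RDBMS mean latency: {mean_lower_s:.1f} s to {mean_upper_s:.1f} s")
print(f"hardware cost ratio (RDBMS / ESP): {hw_cost_ratio:.2f}x")
print(f"maintenance staff ratio (RDBMS / ESP): {fte_ratio:.0f}x")
```

So even with nine out of ten queries returning in under a second, the mean RDBMS response time on these figures is above one second.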

Content Scalability RDBMS vs ESP
Examples of ESP deployments
• Compliance case:
  – 50 B documents @ 80 k average
  – 4 PB (around 100 web indexes)
• Storage:
  – Intelligent content addressable storage
  – XML metadata and full content
  – EMC Centera: N * 256 TB (N = 1..400)
• Webmining – Webfountain:
  – 60,000 : 1 in query capacity (ESP : DB)
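The compliance-case figures are consistent with each other, which a quick check makes explicit. This sketch assumes that "80 k" means 80 kilobytes and that units are decimal (1 PB = 10^15 bytes); neither is stated on the slide.

```python
documents = 50 * 10**9        # "50 B documents"
avg_size_bytes = 80 * 10**3   # "80 k average", assumed to be 80 kB (decimal)

total_bytes = documents * avg_size_bytes
total_pb = total_bytes / 10**15          # decimal petabytes

# Implied size of one "web index" if 4 PB is about 100 of them.
web_index_tb = total_bytes / 100 / 10**12

# EMC Centera range quoted as N * 256 TB with N = 1..400.
centera_max_tb = 400 * 256
centera_max_pb = centera_max_tb / 1000

print(f"total content: {total_pb:.1f} PB")
print(f"implied web index size: {web_index_tb:.0f} TB")
print(f"largest Centera configuration: {centera_max_pb:.1f} PB")
```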

Intelligent Storage and Search Unite
Discover; Simple; Scalable; Secure

Contextual Search
From ACCESS To INSIGHT
FIND: Where is the email from Peter about ROI analysis?
Contextual Relevance
• "Best of Web" Recommender / Authority
• "Best of Enterprise" Linguistic / Statistic
EXPLORE: Any new suspicious financial transaction patterns?
Contextual Navigation
• Contextual fact discovery
• On-the-fly meta-data analysis

Turning around the Pyramid
HBZ.de – Leading German Library Service Center
From: Librarians To: Researchers
[Diagram labels: Single Field Search; Querying; WWW (HTML, XML, WML, JavaScript); SQL; LIB; FAST ESP; …; DB; DB; STRUCTURED DB]

ESP @ SCOPUS
• >200 M articles / 180 M citations
• 180 TB capacity / 14000 journals
• David Goodman standing up and declaring in public that Scopus is the best-designed database he's ever seen …

Relevance Drives Revenue
Search Reduces Clicks to Purchase and Browsing…
• Reduced # of clicks to buy content from > 4 to < 2
• 50% reduction in ringtone browsing
… and Drives Revenue
• 100% increase in search
• 20% increase in ringtone revenue
[Charts: "Clicks to Purchase" (page views per sale, by week) and percentage change in Search, Browsing and Revenue (by week), each with the "Launched search" point marked]

Business Analytics
Processing of real-time streams
Example: Norwegian Customs Foreign Exchange Transaction Monitoring
[Architecture diagram labels: ACL Monitor; Security Access Module; User Monitor; Message Queue; Real-time Registration; Queries; Results; Alerts; Database connector; Transaction Log; Data Validation; Firewall; ØKOKRIM]

Technology Maturity... RDBMS vs ESP

Business Intelligence: ESP vs. RDBMS Technology
OBSERVATION: The Enterprise Search Platform (ESP), a relatively new concept, integrating advanced technologies typically associated with search engines, database tools, and analytical systems, is fast becoming able to solve modern business intelligence problems (using both structured and unstructured data) in a way that is fundamentally different from, and ultimately superior to, that of other currently available analytical or database software.
PREDICTION: Enterprise Search Platform and search centric application technology represents a true paradigm shift in the way data will be stored, analyzed and reported on in the future. Resulting realignments in the marketplace may be both rapid and tumultuous.
- Chief strategist, leading BI vendor

If your only tool is a hammer... every problem looks like a nail

UIMA: Architecture

Text Structure
<Category>FINANCIAL</Category>
BC-dynegy-enron-offer-update5
Dynegy May Offer at Least $8 Bln to Acquire Enron (Update 5)
By George Stein
SOURCE c.2001 Bloomberg News
BODY
……. "Dynegy has to act fast," said Roger Hamilton, a money manager with John Hancock Advisers Inc., which sold its Enron shares in recent weeks. "If Enron can't get financing and its bonds go to junk, they lose counterparties and their marvelous business vanishes." Moody's Investors Service lowered its rating on Enron's bonds to "Baa2" and Standard & Poor's cut the debt to "BBB." in the past two weeks. ……
[Annotation labels: Event; Fact. Annotated items: George Stein; Dynegy Inc; Roger Hamilton; John Hancock Advisers Inc.; Roger Hamilton, money manager, John Hancock Advisers Inc.; Enron Corp; Moody's Investors Service; Moody's Investors Service, Enron Corp, downgraded, Baa2, <__Type>bonds]

The BI "hammer" Approach
Document Vector: Antibiotics, Peptidyl, Eubacteria, RNA, Mg, …
SVD Analysis: (λ1, λ2, ..., λn)
{ λ1, λ2, ..., λn, Structured attributes }
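The pipeline on this slide (term vector, SVD, reduced values joined with structured attributes) can be illustrated on a toy matrix. The sketch below is not the method of any particular BI product: it uses plain power iteration to find only the leading singular direction, the term counts and the year attribute are made up, and reading the λ values as reduced document coordinates is an assumption.

```python
import math

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def leading_singular_pair(matrix, iterations=200):
    """Power iteration on A^T A. Returns (sigma_1, v_1): the largest
    singular value of A and the matching right singular vector."""
    a_t = transpose(matrix)
    v = [1.0 / math.sqrt(len(a_t))] * len(a_t)
    for _ in range(iterations):
        w = matvec(a_t, matvec(matrix, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    sigma = math.sqrt(sum(x * x for x in matvec(matrix, v)))
    return sigma, v

# Rows are documents, columns are term counts. The counts are invented.
terms = ["antibiotics", "peptidyl", "eubacteria", "rna", "mg"]
docs = [
    [2, 1, 0, 3, 1],
    [1, 0, 1, 2, 0],
    [0, 2, 1, 0, 1],
]
sigma1, v1 = leading_singular_pair(docs)

# One reduced coordinate per document, followed by a hypothetical
# structured attribute (a publication year).
structured = [[2001], [2003], [2004]]
features = [[sum(d * x for d, x in zip(doc, v1))] + attrs
            for doc, attrs in zip(docs, structured)]

for row in features:
    print(row)
```

A full decomposition would keep several components per document; one is enough to show how the text-derived values end up in the same record as the structured fields.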

Contextual Refinement
ETL and Semantic understanding unite
Inputs:
• Direct access to RDBMSs for info from some telcos
• XML feed from other telcos
• Flat files (CSV or fixed) from the 'laggards'
Processing:
• Logic for cleansing
• ESP lookup
• Ordered hits (by quality)
• Cleansed data to ESP (XML)
Outputs:
• Clean data: Master database for persistent storage
• Ambiguous data (close hits or unidentified): 'Error' database for manual inspection, correction, storage/learning
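The routing step in this diagram, where a lookup returns hits ordered by quality and the record goes either to the master store or to manual review, can be sketched as follows. This is an illustration only: it uses Python's standard `difflib` as a stand-in for the ESP lookup, and the reference names, the 0.9 threshold and the function names are all hypothetical.

```python
from difflib import SequenceMatcher

def best_match(name, reference_names):
    """Return (similarity, reference_name) for the closest reference entry."""
    scored = [(SequenceMatcher(None, name.lower(), ref.lower()).ratio(), ref)
              for ref in reference_names]
    return max(scored)

def route(name, reference_names, clean_threshold=0.9):
    """Send confident matches to the master store and everything else
    (close hits or unidentified records) to the error store for review."""
    score, ref = best_match(name, reference_names)
    if score >= clean_threshold:
        return ("master", ref)
    return ("error", name)

# Invented reference list standing in for already-cleansed directory data.
reference = ["Will's Barber Shop", "Taylor Guitars", "Gibson Guitar Shop"]

print(route("Wills Barber Shop", reference))
print(route("Acme Plumbing", reference))
```

The threshold is the design choice that matters: set too low, bad matches reach the master database; set too high, the manual-inspection queue grows.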

Contextual Insight
"…entry probe carried to [Saturn]'s moon Titan as part of the…"
Intent; Concepts
Query-time fact analysis @ sub-document level

Contextual Navigation
ThisIsTravel: Automated visitor ratings

Revisit the Assumptions …
Scalable Search
80% Unstructured
[Timeline chart, axis label GIGABYTES. Milestones: Cave paintings, bone tools 40,000 BCE; Writing 3500 BCE; 0 C.E.; Paper 105; Printing 1450; Electricity, Telephone 1870; Transistor 1947; Computing 1950; Internet (DARPA) late 1960s; The Web 1993; 1999. SQL milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03. Data labels: 2000: 3 B; 2001: 6 B; 2002: 12 B; 2003: 24 B]