- Number of slides: 37
Why Search Engines Are Increasingly Used to Offload Queries from Databases. Bjørn Olstad, CTO, FAST Search & Transfer; Adjunct Prof., The Norwegian University of Science & Technology. Email: bjorn.olstad@fast.no. Cell: +47 48011157
The Typo Problem...
Talent Offloading..
The Web Search Experience
The RDBMS Experience: High input barrier. “You are viewing 5 random jobs out of 2461 jobs in total...”
CareerBuilder use scenario, part 1 (step 1): 30956 jobs
CareerBuilder use scenario, part 2 (step 2): 1084 jobs
CareerBuilder use scenario, part 3 (step 3): 30 jobs
CareerBuilder use scenario, part 4: 5 jobs. From 30956 jobs to 5 targeted jobs in 3 steps
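The step-by-step narrowing in the CareerBuilder scenario can be illustrated with a small faceted drill-down. This is a minimal sketch, not the ESP implementation: the job records, the field names (`category`, `city`) and the helper functions `facet_counts` and `refine` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical job records; field names and values are illustrative only.
JOBS = [
    {"title": "Java Developer", "city": "Chicago", "category": "IT"},
    {"title": "Sales Manager", "city": "Chicago", "category": "Sales"},
    {"title": "Java Architect", "city": "Boston", "category": "IT"},
    {"title": "Nurse", "city": "Chicago", "category": "Health"},
    {"title": "Java Developer", "city": "Boston", "category": "IT"},
]


def facet_counts(jobs, field):
    """Count how many of the current hits fall under each value of a field."""
    return Counter(job[field] for job in jobs)


def refine(jobs, field, value):
    """Keep only the hits whose field equals the chosen facet value."""
    return [job for job in jobs if job[field] == value]


# Each step shows the remaining hit count plus the facets offered next.
hits = JOBS
print(len(hits), dict(facet_counts(hits, "category")))

hits = refine(hits, "category", "IT")
print(len(hits), dict(facet_counts(hits, "city")))

hits = refine(hits, "city", "Boston")
print(len(hits), [job["title"] for job in hits])
```

The point of the sketch is that every refinement returns both a smaller hit list and the counts for the next possible refinement, so the user never has to guess a query that returns nothing.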
Challenger Shuttle Launch Fax to NASA from contractor with O-ring concern
Presentation Matters …
IYP: A Disruptive Change. Taylor or Gibson guitar? Good local offers? Compare offerings. Phone / directions. What is the phone number to Will’s Barber shop? BTW: I’m using my iPAQ. ESP: Cleansing, Mining, Relevance and Discovery. Product & Company, Services, web site, Blogs++
ISVs: A Disruptive Change. Siebel 2000: search is a tactical afterthought. Siebel 2005: search is a strategic enabler. “my” CRM Application, Search, Information Access Layer, 3rd-party content
Revisit the Assumptions …
Relational algebra: large but “finite” data sets; structured data
Search & Explore focused: “infinite” data sets; unstructured & structured
80% unstructured
[Timeline figure, time axis marked at 0 C.E. and 1999: Cave paintings, bone tools (40,000 BCE); Writing (3500 BCE); Paper (105); Printing (1450); Electricity, telephone (1870); Transistor (1947); Computing (1950); Internet (DARPA) (late 1960s); The Web (1993). SQL milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03. GIGABYTES: 2000: 3 B; 2001: 6 B; 2002: 12 B; 2003: 24 B]
Extreme Capabilities?
• Feeding/streaming, transaction, retrieval or analytics centric?
• Content size: M, L, VVVL or V^n L with n → ∞?
• Schema centric, semi-structured XML, text, agnostic?
• Fuzzy & value vs. binary & completeness?
• Discovery primitives?
• User interaction part of design target?
Query Latency: RDBMS vs ESP
Test data:
• Structured data: 5 million records, 13 fields per record
• Structured queries: 22 SQL queries (representative of ERP)
The result:
• #1: FAST ESP w/ disk: mean = 99 ms, st. dev. = 36 ms
• #2: Oracle w/ memory mapping: mean = 4 057 ms, st. dev. = 9 368 ms
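To put the latency figures in proportion, the short calculation below derives the ratio of the means and the relative spread (standard deviation divided by mean) for each system. It uses only the four numbers quoted on the slide; the variable names and the choice of coefficient of variation as a spread measure are this sketch's own.

```python
# Latency figures quoted on the slide, in milliseconds.
esp_mean, esp_std = 99.0, 36.0
rdbms_mean, rdbms_std = 4057.0, 9368.0


def coefficient_of_variation(mean, std):
    """Relative spread: standard deviation as a fraction of the mean."""
    return std / mean


# How many times longer the mean RDBMS query took than the mean ESP query.
mean_ratio = rdbms_mean / esp_mean

esp_cv = coefficient_of_variation(esp_mean, esp_std)
rdbms_cv = coefficient_of_variation(rdbms_mean, rdbms_std)

print(f"mean latency ratio (RDBMS / ESP): {mean_ratio:.1f}x")
print(f"relative spread, ESP:   {esp_cv:.2f}")
print(f"relative spread, RDBMS: {rdbms_cv:.2f}")
```

A standard deviation larger than the mean, as in the RDBMS figures, suggests that a few very slow queries dominate the average, which matters as much for user experience as the mean itself.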
Queries per Second (QPS): RDBMS vs ESP
• Identical HW: single node, 2 CPUs, 4 GB RAM, 3 SCSI disks
• Identical data: auction data from eBay, 3.6 million documents
• Identical queries: 200 queries defined by Oracle
Disruptive Change
Relational model: queries that fit the model vs. queries that don’t fit the model
Alternative I: incremental fixes to painful shortcomings; adds complexity
• Star, snowflake schemas++
• Cubes / datamarts++
Alternative II:
• Schema agnostic
• Scalable ad-hoc querying
• BLOBs
• Contextual insight
• Real-time fusion of disparate data models
• Massive fault-tolerant scalability
Extreme Capabilities: ESP Design Targets. Powering Search Derivative Applications (SDAs). Value/Noise (SNR), Contextual Refinement, Contextual Insight, User Interaction. Game changer driven by extreme retrieval and on-the-fly analytics
Database Query Offloading. Example: AutoTrader.com (Car Dealers - Product Supply)
RDBMS:
• HW cost: $320 K (32 CPUs on 4 Sun servers)
• 90% sub-second query response; average = 12 s for the rest
• Relevance = sorting
• 5 FTE to maintain
ESP:
• HW cost: $90 K
• 100% sub-second query response
• Flexible relevance and discovery
• 0.5 FTE to maintain
Content Scalability: RDBMS vs ESP
Examples of ESP deployments
• Compliance case:
  – 50 B documents @ 80 k average
  – 4 PB (around 100 web indexes)
• Storage:
  – Intelligent content-addressable storage
  – XML metadata and full content
  – EMC Centera: N * 256 TB (N = 1..400)
• Webmining, Webfountain:
  – 60,000 : 1 in query capacity (ESP : DB)
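The compliance-case numbers can be checked with a few lines of arithmetic. This is a sketch under one stated assumption: "80 k average" is read as 80 kB per document in decimal units (1 kB = 1000 bytes), which the slide does not spell out.

```python
# Figures quoted on the slide. ASSUMPTION: "80 k" means 80 kB (decimal units).
DOCUMENTS = 50_000_000_000   # 50 B documents
AVG_DOC_BYTES = 80_000       # assumed: 80 kB per document
WEB_INDEXES = 100            # "around 100 web indexes"

total_bytes = DOCUMENTS * AVG_DOC_BYTES
total_pb = total_bytes / 10**15                       # petabytes
tb_per_web_index = total_bytes / WEB_INDEXES / 10**12  # implied size of one web index

# EMC Centera capacity is quoted as N * 256 TB with N = 1..400.
centera_max_tb = 400 * 256

print(f"total content: {total_pb} PB")
print(f"implied size of one web index: {tb_per_web_index} TB")
print(f"largest quoted Centera configuration: {centera_max_tb} TB")
```

Under that assumption the product matches the 4 PB stated on the slide, and the largest quoted Centera configuration is large enough to hold it.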
Intelligent Storage and Search Unite Discover Simple Scalable Secure
Contextual Search: From ACCESS to INSIGHT
FIND: Where is the email from Peter about ROI analysis?
Contextual Relevance
• “Best of Web”: recommender / authority
• “Best of Enterprise”: linguistic / statistic
EXPLORE: Any new suspicious financial transaction patterns?
Contextual Navigation
• Contextual fact discovery
• On-the-fly meta-data analysis
Turning around the Pyramid. HBZ.de: Leading German Library Service Center. From: Librarians. To: Researchers. Single-field search querying WWW (HTML, XML, WML, JavaScript). [Diagram labels: SQL, LIB, FAST ESP, structured DBs]
ESP @ SCOPUS
• >200 M articles / 180 M citations
• 180 TB capacity / 14000 journals
David Goodman standing up and declaring in public that Scopus is the best-designed database he's ever seen …
Relevance Drives Revenue
Search Reduces Clicks to Purchase and Browsing…
• Reduced # of clicks to buy content from > 4 to < 2
• 50% reduction in ringtone browsing
… and Drives Revenue
• 100% increase in search
• 20% increase in ringtone revenue
[Charts: clicks to purchase (page views per sale) and relative change in browsing, search and revenue, by week, with the search launch marked]
Business Analytics: Processing of real-time streams. Example: Norwegian Customs, foreign exchange transaction monitoring. [Diagram components: ACL monitor, security access module, user monitor, message queue, real-time registration, queries, results, alerts, database connector, transaction log, data validation, firewall, ØKOKRIM]
Technology Maturity... RDBMS vs ESP
Business Intelligence: ESP vs. RDBMS Technology
OBSERVATION: The Enterprise Search Platform (ESP), a relatively new concept, integrating advanced technologies typically associated with search engines, database tools, and analytical systems, is fast becoming able to solve modern business intelligence problems (using both structured and unstructured data) in a way that is fundamentally different from, and ultimately superior to, that of other currently available analytical or database software.
PREDICTION: Enterprise Search Platform and search-centric application technology represents a true paradigm shift in the way data will be stored, analyzed and reported on in the future. Resulting realignments in the marketplace may be both rapid and tumultuous.
- Chief strategist, leading BI vendor
If your only tool is a hammer... every problem looks like a nail
UIMA: Architecture
Text Structure
The BI “hammer” Approach. Document vector: Antibiotics, Peptidyl, Eubacteria, RNA, Mg, … SVD analysis: ( λ1, λ2, ..., λn ). { λ1, λ2, ..., λn, structured attributes }
Contextual Refinement: ETL and semantic understanding unite
Inputs:
• Direct access to RDBMSs for info from some telcos
• XML feed from other telcos
• Flat files (CSV or fixed) from the ’laggards’
Processing: logic for cleansing; ESP lookup; ordered hits (by quality); cleansed data to ESP (XML)
Outputs:
• Clean data: master database for persistent storage
• Ambiguous data (close hits or unidentified): ’error’ database for manual inspection, correction, storage/learning
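The cleansing flow on this slide (look up each incoming record, rank hits by quality, send clean data to the master database and ambiguous data to the error database) can be sketched as below. This is an illustration only: it substitutes Python's standard `difflib.SequenceMatcher` for the ESP lookup, and the master list, the thresholds `accept` and `review`, and the function names `lookup` and `route` are all assumptions of this sketch.

```python
from difflib import SequenceMatcher

# Hypothetical master entries; a stand-in for the ESP index.
MASTER = ["Will's Barber Shop", "Taylor Guitars", "Gibson Guitar Corporation"]


def lookup(name, master, limit=3):
    """Return (score, candidate) pairs ordered by match quality, best first."""
    scored = [
        (SequenceMatcher(None, name.lower(), cand.lower()).ratio(), cand)
        for cand in master
    ]
    return sorted(scored, reverse=True)[:limit]


def route(name, master, accept=0.9, review=0.6):
    """Decide where a record goes; the thresholds are illustrative assumptions."""
    best_score, best = lookup(name, master)[0]
    if best_score >= accept:
        return ("master", best)               # clean data
    if best_score >= review:
        return ("error_db_close_hit", best)   # ambiguous: close hit
    return ("error_db_unidentified", None)    # ambiguous: unidentified


for incoming in ["Will's Barber Shop", "Gibson Guitars", "zzzz qqqq"]:
    print(incoming, "->", route(incoming, MASTER))
```

Keeping the ranked candidate alongside a close hit gives the person inspecting the error database a suggested correction instead of a bare rejection.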
Contextual Insight “…entry probe carried to [Saturn]’s moon Titan as part of the…” Intent Concepts Query-time fact analysis @ sub-document level
Contextual Navigation: ThisIsTravel. Automated visitor ratings
Revisit the Assumptions …
Scalable Search
80% unstructured
[Timeline figure, time axis marked at 0 C.E. and 1999: Cave paintings, bone tools (40,000 BCE); Writing (3500 BCE); Paper (105); Printing (1450); Electricity, telephone (1870); Transistor (1947); Computing (1950); Internet (DARPA) (late 1960s); The Web (1993). SQL milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03. GIGABYTES: 2000: 3 B; 2001: 6 B; 2002: 12 B; 2003: 24 B]