MASS COLLABORATION AND DATA MINING
Raghu Ramakrishnan
Founder and CTO, QUIQ
Professor, University of Wisconsin-Madison
Keynote Talk, KDD 2001, San Francisco

DATA MINING
Extracting actionable intelligence from large datasets
• Is it a creative process requiring a unique combination of tools for each application?
• Or is there a set of operations that can be composed using well-understood principles to solve most target problems?
• Or perhaps there is a framework for addressing large classes of problems that allows us to systematically leverage the results of mining.

“MINING” APPLICATION CONTEXT
• Scalability is important.
  – But when is a 2x speed-up or scale-up important? When is 10x unimportant?
• What is the appropriate measure, model?
  – Recall, precision
  – MT for search vs. MT for content conversion
Answers to these questions come from the context of the application.

TALK OUTLINE
• A New Approach to Customer Support
  – Mass Collaboration
• Technical challenges
  – A framework and infrastructure for P2P knowledge capture and delivery
• Role of data mining
  – Confluence of DB, IR, and mining

TYPICAL CUSTOMER SUPPORT
[Diagram; labels: Web Support, KB, Customer Support Center]

TRADITIONAL KNOWLEDGE MANAGEMENT
[Diagram; labels: QUESTION, ANSWER, KNOWLEDGE BASE, EXPERTS, CONSUMERS]
Knowledge created and structured by trained experts using a rigorous process.

MASS COLLABORATION
People using the web to share knowledge and help each other find solutions
[Diagram; labels: QUESTION, SELF SERVICE, KNOWLEDGE BASE, ANSWER, MASS COLLABORATION (Experts, Partners, Customers, Employees)]
Answer added to power self service

TIMELY ANSWERS
77% of answers are provided within 24 h
• No effort to answer each question
• No added experts
• No monetary incentives for enthusiasts
[Chart: 6,845 questions; 74% answered]
Answers provided in 3 h: 40% (2,057)
Answers provided in 12 h: 65% (3,247)
Answers provided in 24 h: 77% (3,862)
Answers provided in 48 h: 86% (4,328)

MASS CONTRIBUTION
Users who on average provide only 2 answers provide 50% of all answers
[Chart]
Answers: 100% (6,718); contributed by mass of users: 50% (3,329)
Contributing users: top users 7% (120); mass of users 93% (1,503)

POWER OF KNOWLEDGE CREATION
[Diagram: customer support incidents pass through two shields before becoming agent cases]
SHIELD 1: Self-Service *), -85%
SHIELD 2: Mass Collaboration *), Knowledge Creation, -64%
Agent cases: 5% of support incidents
*) Averages from QUIQ implementations

TYPICAL SERVICE CHAIN
[Diagram; cost tiers $, $$, $$$; shares shown: 40%, 50%, 10%]
Channels: FAQ, Self Service, Knowledge base, Auto Email, Manual Email, Call Center, Chat, 2nd Tier Support
QUIQ SERVICE CHAIN
[Diagram; cost tiers $, $$, $$$; shares shown: 80%, 15%, 5%]
Channels: QUIQ Mass Collaboration, QUIQ Self Service, Manual Email, Chat, Call Center, 2nd Tier Support

CASE STUDIES: COMPAQ
“In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.”
“Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.”
– Steve Young, VP of Customer Care, Compaq

ASP 2001 “Top Ten Support Site”
“Austin-based National Instruments deployed … a Network to capture the specialized knowledge of its clients and take the burden off its costly support engineers, and is pleased with the results. QUIQ increased customers’ participation, flattened call volume and continues to do the work of 50 support engineers.”
– David Daniels, Jupiter Media Metrix

MASS COLLABORATION
Internet-scale P2P knowledge sharing
[Diagram; labels: + Knowledge Management, + Service Workflows, Support Newsgroups, Mass Collaboration, Few Experts, Communities, Many Experts, Call Center Support, Knowledge Base, Interactions, Solutions]

CORPORATE MEMORY
Untapped Knowledge in Extended Business Community
[Diagram; labels: Customers, Partners, Suppliers, Employees, Knowledgebase]

[Diagram; labels: User-to-User Exchange, Structured User Forum, Self-Organizing, User-to-Enthusiast, User-to-Expert, Incentive to Participate, User Acquisition, Areas of Interest]

GOALS & ISSUES
• Interactions must be structured to encourage creation of “solutions”
  – Resolve issue; escalate if necessary
  – Capture knowledge from interactions
  – Encourage participation
• Sociology
  – Privacy, security
  – Credibility, authority, history
  – Accountability, incentives

REQUIRED CAPABILITIES
• Roles: Credibility, administration
  – Moderators, experts, editors, enthusiasts
• Groups: Privacy, security, entitlements
  – Departments, gold customers
• Workflow: QoS, validation, escalation

TECHNICAL CHALLENGES

SEARCHING “PEOPLE-BASES”
[Diagram; labels: SEARCH, ROUTING, NOTIFICATION, ?]
“If it’s not there, find someone who knows”
- And get “it” there (knowledge creation)!

QUIQ, the “Best in Class” Support Channel
[Diagram comparing support channels by how many support incidents become agent cases. Channels: Email Support / Call Center with Automated Emails 1); Web Self-Service 2); Self-Service with Mass Collaboration and Knowledge Creation. Values shown: 100%, -20%, 80%, -42%, 68%, -85%, -64%, 5%.]
1) Source: QUIQ Client Information
2) Source: Association of Support Professionals

SEARCH AND INDEXING
• User types in “How can I configure the IP address on my Presario?”
  – Need to find most relevant content that is of high quality and is approved for external viewing, and that this user is entitled to see based on her roles, groups, and service levels.
• User decides to post question because no good answer was found in the KB.
  – Search controls when experts and other users will see this new question; need to make this real-time.
  – Concurrency, recovery issues!

SEARCH AND INDEXING
• Data is organized into tabular channels
  – Questions, responses, users, …
• Each item has several fields, e.g., a question:
  – Author id, author status, service level, item popularity metrics, rating metrics, answer status, approval status, visibility group, update timestamp, notification timestamp, usage signature, category, relevant products, relevant problems, subject, body, responses
Which 5 items should be returned?
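
As a rough illustration of what one row in the "questions" channel might look like, here is a minimal Python sketch. It covers only a subset of the fields listed above; the field types, default values, and the example status strings are assumptions made for illustration and are not taken from the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """One item in a hypothetical 'questions' channel (subset of fields)."""
    author_id: str
    subject: str
    body: str
    category: str
    approval_status: str = "pending"   # assumed values: "pending" / "approved"
    visibility_group: str = "public"   # assumed group name
    rating: float = 0.0                # stands in for the rating metrics
    update_timestamp: int = 0          # assumed to be an integer clock value
    responses: list = field(default_factory=list)


q = Question(
    author_id="u42",
    subject="Configure IP address",
    body="How can I configure the IP address on my Presario?",
    category="Laptop",
)
q.responses.append("Open the network settings and edit the adapter properties.")
```

Keeping filterable attributes (approval status, visibility group) next to text fields (subject, body) in one record is what lets a single search combine yes/no criteria with relevance ranking.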

RUNTIME ARCHITECTURE
[Architecture diagram; labels: Web server (x2), Real-time Indexing, Caching, Alerts, Email, Cache, Hive Manager, Indexer, Alerts, Files, Logs, Warehouse, DBMS, RAID STORAGE]

LEARNING FROM ACTIVITY DATA TO KNOWLEDGE
[Architecture diagram; labels: Periodic offline activity, Miner, Indexer, Large R/W, Small reads, Files, Logs, Warehouse, DBMS, RAID STORAGE]

SEARCH AND INDEXING
Which 5 items should be returned?
• Question text, user attributes, system policies
• IR-style ranked output
• Search constraints:
  – Show matches; subject match twice as important
  – Show only approved answers to non-editors
  – Give preference to category Laptop
  – Give preference to recent solutions
  – Weight quality of solution

VECTOR SPACE MODEL
• Documents, queries are vectors in term space
• Vector distance from the query is used to rank retrieved documents
$$Q_1 = (w_{11}, w_{12}, \ldots, w_{1t})$$
$$D_2 = (w_{21}, w_{22}, \ldots, w_{2t})$$
$$\mathrm{sim}(Q_1, D_2) = \sum_{i=1}^{t} w_{1i} \, w_{2i} \quad \text{(unnormalized)}$$
The i’th term in the summation can be seen as the “relevance contribution” of term i
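
The unnormalized similarity on this slide is a plain inner product, and a short Python sketch makes the per-term "relevance contribution" explicit. The function name and the example weights are illustrative assumptions.

```python
def similarity(query, doc):
    """Unnormalized inner product of two term-weight vectors.

    Returns the total score and the list of per-term contributions
    (the i'th entry is w_1i * w_2i).
    """
    if len(query) != len(doc):
        raise ValueError("both vectors must cover the same t terms")
    contributions = [q * d for q, d in zip(query, doc)]
    return sum(contributions), contributions


q1 = [1.0, 0.0, 2.0]   # query weights over t = 3 terms (made-up values)
d2 = [0.5, 3.0, 1.0]   # document weights over the same terms
score, contributions = similarity(q1, d2)
```

Here the second term contributes nothing because the query weight is zero, even though the document weight for that term is the largest.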

TF-IDF DOCUMENT VECTOR
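
The slide body did not survive extraction, so the exact weighting used in the talk is unknown. As a hedged illustration, the sketch below builds document vectors with one common TF-IDF variant, raw term frequency times log(N / document frequency); the function name and the toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of weight vectors)."""
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, vectors


docs = [
    ["configure", "ip", "address", "presario"],
    ["presario", "battery", "presario"],
    ["configure", "printer"],
]
vocab, vectors = tfidf_vectors(docs)
```

With this variant a term that appears in every document gets weight zero, and a term repeated within one document (here "presario" in the second document) gets a proportionally higher weight.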

A HYBRID DB-IR SYSTEM
• Searches are queries with three parts:
  – Filter
    • DB-style yes/no criteria
  – Match
    • TF-IDF relevance based on a combination of fields
  – Quality
    • Relevance “boost” based on a policy
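
To make the three parts concrete, here is a minimal Python sketch of a search that filters, matches, and then boosts. It is not QUIQ's implementation: simple term overlap stands in for TF-IDF, subject hits are counted double as in the earlier search-constraints slide, and the `rating` field and the multiplicative boost are assumed policies.

```python
def search(items, query_terms, viewer_is_editor, k=5):
    """Return the top-k (score, item id) pairs for a three-part query."""
    results = []
    for item in items:
        # Filter: DB-style yes/no criterion (non-editors see approved items only).
        if not viewer_is_editor and not item["approved"]:
            continue
        # Match: term overlap as a stand-in for TF-IDF; subject counts double.
        subject = set(item["subject"].lower().split())
        body = set(item["body"].lower().split())
        match = sum(2.0 * (t in subject) + 1.0 * (t in body) for t in query_terms)
        if match == 0:
            continue
        # Quality: policy-driven boost from a hypothetical rating in [0, 1].
        results.append((match * (1.0 + item["rating"]), item["id"]))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results[:k]


items = [
    {"id": 1, "subject": "configure ip address", "body": "presario network setup",
     "approved": True, "rating": 0.5},
    {"id": 2, "subject": "presario battery", "body": "configure power settings",
     "approved": True, "rating": 0.0},
    {"id": 3, "subject": "ip address presario", "body": "configure ip",
     "approved": False, "rating": 1.0},
]
query = ["configure", "ip", "presario"]
for_customer = search(items, query, viewer_is_editor=False)
for_editor = search(items, query, viewer_is_editor=True)
```

The unapproved item scores highest on match and quality, yet it is visible only to editors because the filter is applied as a hard criterion before any ranking.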

A HYBRID DB-IR SYSTEM
• A query is built up from atomic constraints using Boolean operators.
• Atomic constraint:
  – [ value op term, constraint-type ]
  – Terms are drawn from discrete domains and are of two types: hierarchy and scalar
  – Constraint-type is exact or approximate

A HYBRID DB-IR SYSTEM
• Applying an atomic constraint to a set of items returns a tagged result set:
  – The result inherits the constraint-type
  – Each result item has a (TF-IDF) relevance score; 0 for exact
• Combining two tagged item sets using Boolean operators yields a tagged set:
  – The result type is exact if both inputs are exact, and approximate otherwise
  – Result contains intersection of input item sets if either input is exact; union otherwise
  – Each result item is tagged with a combined relevance
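
The combination rule can be sketched in a few lines of Python. This is a literal reading of the bullets above, not the system's actual code: the `TaggedSet` type is hypothetical, and summing scores is an assumed choice for the "combined relevance", which the slide does not define. The full semantics described in the talk may track more state than this sketch does.

```python
from dataclasses import dataclass


@dataclass
class TaggedSet:
    exact: bool    # constraint-type: True = exact, False = approximate
    scores: dict   # item id -> relevance score (0.0 for exact constraints)


def combine(a, b):
    """Combine two tagged sets following the rule on the slide."""
    if a.exact or b.exact:
        ids = a.scores.keys() & b.scores.keys()   # intersection
    else:
        ids = a.scores.keys() | b.scores.keys()   # union
    # Assumption: combined relevance is the sum of the input scores.
    scores = {i: a.scores.get(i, 0.0) + b.scores.get(i, 0.0) for i in ids}
    return TaggedSet(exact=a.exact and b.exact, scores=scores)


approved = TaggedSet(True, {1: 0.0, 2: 0.0, 4: 0.0})   # exact filter
subject = TaggedSet(False, {1: 0.75, 3: 0.5})          # approximate match
body = TaggedSet(False, {2: 0.25, 3: 0.25})            # approximate match

match = combine(subject, body)      # both approximate: union, scores add up
result = combine(approved, match)   # one exact input: intersection
```

In this example item 3 has the joint-highest match score but is dropped from the final result because it fails the exact constraint.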

A HYBRID DB-IR SYSTEM
• Semantics of Boolean expressions over constraints is associative and commutative
• Evaluating exact constraints and approximate constraints separately (in DB and IR subsystems) is a special case. Additionally:
  – Uniform handling of relevance contributions of categories, popularity metrics, recency, etc.
• Absolute and relative relevance modifiers can be introduced for greater flexibility.

CONCURRENCY, RECOVERY, PARALLELISM
• Concurrency
  – Index is updated in real-time
  – Automatic partitioning, two-step locking protocol result in very low overhead
  – Relies upon post-processing to address some anomalies
• Recovery
  – Partitioning is again the key
  – Leverages recovery guarantees of DBMS
  – Approach also supports efficient refresh of global statistics
• Parallelism
  – Hash based partitioning
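
The talk does not say how items are hashed, so the following Python sketch only illustrates the general idea of hash-based partitioning. The use of SHA-256, the string item ids, and the function names are all assumptions.

```python
import hashlib


def partition_for(item_id: str, num_partitions: int) -> int:
    """Map an item id to a partition with a hash that is stable across runs."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_items(item_ids, num_partitions):
    """Distribute item ids over num_partitions independent index partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for item_id in item_ids:
        partitions[partition_for(item_id, num_partitions)].append(item_id)
    return partitions


ids = [f"question-{n}" for n in range(20)]
parts = partition_items(ids, 4)
```

A cryptographic digest is used here instead of Python's built-in `hash()` because string hashing is randomized per process, which would send the same item to different partitions after a restart.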

NOTIFICATION
• Extension of search: Each user can define one or more “standing searches”, and request instant or periodic notification.
  – Boolean combinations of atomic constraints.
• Major challenges:
  – Scaling with number of standing searches.
    • Requires multiple timestamps, indexing searches.
  – Exactly-once delivery property.
    • Many subtleties center around “notifiability” of updates!
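
A toy Python sketch shows why timestamps matter for standing searches. Each search remembers the newest update it has already delivered, so a repeated notification pass does not resend old matches. The `StandingSearch` type, the integer timestamps, and the in-memory state are assumptions; a real exactly-once guarantee would also have to survive crashes and concurrent updates, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StandingSearch:
    user: str
    matches: Callable[[dict], bool]   # the user's saved search predicate
    last_seen: int = 0                # newest update timestamp already delivered


def notify(searches, items):
    """Run one notification pass; return (user, item id) pairs to deliver."""
    deliveries = []
    for s in searches:
        fresh = [it for it in items
                 if it["updated"] > s.last_seen and s.matches(it)]
        for it in sorted(fresh, key=lambda it: it["updated"]):
            deliveries.append((s.user, it["id"]))
        if fresh:
            s.last_seen = max(it["updated"] for it in fresh)
    return deliveries


items = [
    {"id": 1, "category": "Laptop", "updated": 10},
    {"id": 2, "category": "Printer", "updated": 11},
]
laptop_watch = StandingSearch("ann", lambda it: it["category"] == "Laptop")
first = notify([laptop_watch], items)
second = notify([laptop_watch], items)    # nothing new since the first pass
items.append({"id": 3, "category": "Laptop", "updated": 12})
third = notify([laptop_watch], items)
```

This loop evaluates every standing search against every item, which is exactly the scaling problem the slide raises; indexing the searches themselves is one way to avoid that.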

ROLE OF DATA MINING

DATA MINING TASKS
• There is a lot of insight to be gained by analyzing the data.
  – What will help the user with her problem?
  – Who does a given user trust?
  – Characteristic metrics for high-quality content.
  – Identify helpful content in similar, past queries.
  – Summarize content.
  – Who can answer this question?

LEVERAGING DATA MINING
• How do we get at the data?
  – Relevant information is distributed across several sources, not just the DBMS.
  – Aggregated in a warehouse.
• How do we incorporate the insights obtained by mining into the search phase?
  – Need to constantly update info about every piece of content (Qs, As, users …)

LEVERAGING DATA MINING
• Three-step approach:
  – Off-line analysis to gather new insight
  – Periodically refresh indexes
  – Use insight (from KB/index) to improve search using the extended DB/IR query framework
Use mining to create useful metadata

SOME UNIQUE TWISTS
• Identify the kinds of feedback that would be helpful in refining a search.
  – I.e., not just specific terms, but the types of concepts that would be useful discriminators (e.g., a good hierarchy of feedback concepts)
• Metrics of quality
  – Link-analysis is a good example, but what are the “links” here?
• Self-tuning searches
  – The more the knobs, the more the choices
  – Next step: self-personalizing searches?

CONCLUSIONS

CONFLUENCES
[Diagram; labels: IR, SEARCH, KM, DB, QUERIES, P2P, ?]