523826c29abeafe47cff69e9802612a4.ppt
- Number of slides: 35
Data Mining: Crossing the Chasm Rakesh Agrawal IBM Almaden Research Center
Thesis • The greatest challenge facing data mining is to make the transition from being an early market technology to mainstream technology • We have the opportunity to make this transition successful
Outline • Chasm in the technology adoption life cycle, à la Geoffrey Moore† • Experience with Quest/Intelligent Miner • Ideas for successful chasm crossing † Geoffrey A. Moore. Crossing the Chasm. Harper Business. http://www.chasmgroup.com
Technology Adoption Life Cycle • Innovators (Techies): Try it! • Early Adopters (Visionaries): Get ahead of the herd! • Early Majority (Pragmatists): Stick with the herd! • Late Majority (Conservatives): Hold on! • Laggards (Skeptics): No way! Psychographic profile of each group is different
Innovators: Technology Enthusiasts • Intrigued by any fundamental advance in technology • Like to alpha test new products • Can ignore the missing elements • Want access to top technologists • Want no-profit pricing (preferably free) Gatekeepers to early adopters
Early Adopters: Visionaries • Driven by vision of dramatic competitive advantage via revolutionary breakthroughs • Great imagination for strategic applications • Not so price-sensitive • Want rapid time to market • Demand high degree of customization Fund the development of early market
Early Majority: Pragmatists • Want sustainable productivity improvement through evolutionary change • Astute managers of mission-critical apps • Understand real-world issues and tradeoffs • Focus on proven applications; want to see the solution in production Bulwark of the mainstream market
Late Majority: Conservatives • Want to stay even with the competition • Risk averse • Price sensitive • Need completely pre-assembled solutions Extend technology life cycles
Laggards: Skeptics • Driven to maintain status quo • Good at debunking marketing hype • Disbelieve productivity-improvement arguments • Can be formidable opposition to early adoption of a technology Retard the development of high-tech markets
Crack in the curve [Figure: adoption curve with the chasm separating the early market from the mainstream market] The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.
Visionaries vs. Pragmatists Visionaries: • Adventurous • First strike capability • Early buy-in • State of the art • Think big • Spend big Pragmatists: • Prudent • Staying power • Wait-and-see • Industry standard • Manage expectation • Spend to budget
Is data mining following this curve? • Yes!!! • My personal viewpoint based on Quest/Intelligent Miner experience
Quest • Started as a skunkworks project in the early nineties • Inspired by needs articulated by industry visionaries: – Transaction data collected over a long period – Current tools/SQL don’t cut it – About ready to throw the data away
Approach • Examine “real” applications • Identify operations that cut across applications • Design fast, scalable algorithms for each operation • Develop applications by composing operations
Operations • Associations • Sequential Patterns • Similar time series • Classification • Clustering • Deviations • New Operations • Completeness, scalability • Adopted from Statistics/Learning • Scalability http://www.almaden.ibm.com/cs/quest
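To illustrate the association operation and what completeness means for it (every itemset that meets the support threshold is reported, none skipped), here is a minimal level-wise sketch. The function name, the toy baskets, and the threshold are assumptions made for illustration; this is not Quest's implementation and it makes no attempt at scalability.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least
    min_support transactions (hypothetical helper, level-wise search)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Candidates: k-item unions of frequent (k-1)-itemsets whose
        # (k-1)-item subsets are all frequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count each candidate by scanning the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
itemsets = frequent_itemsets(baskets, min_support=3)
```

With these toy baskets and a threshold of 3, the only frequent pair is bread and milk; all other pairs appear in just two baskets.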
Bringing Quest to market • Visionaries who inspired Quest did not become first customers: – Wanted evidence that the technology “worked” • Frustrating attempts to interest major IBM customers: – Integration with existing applications – Too-far-out technology – Resistance from in-house analytic groups
First hits • Small information-based companies who provided data in exchange for free results • CIO who wanted to be seen as the technology pioneer in his industry • CIO who wanted the success story to feature in the company’s annual report Led to the formation of a group offering services using Quest
Characteristics of engagements • Mostly associations and sequential patterns • Completeness a big plus • Unanticipated uses • Feedback for further development
Into the product land • Formation of a small “out-of-plan” product group to productize Quest • Facilitated by a closet mathematician • Successes of the services group used for market validation • Continued development and infusion of technology
Intelligent Miner • Serious product • Integrates technologies from various groups • Fast, scalable, runs on multiple platforms • Several “early market” success stories http://www.software.ibm.com/data/iminer/
Are we in the chasm? • Perceived to be sophisticated technology, usable only by specialists • Long, expensive projects • Stand-alone, loosely coupled with data infrastructures • Difficult to infuse into existing mission-critical applications
Chasm Crossing • Personal speculations on some technical challenges • Do not imply IBM research/product directions
XML-based Data Mining Standard (1) [Diagram: Data Specs and Parameters (standard DTD), Operator Library, Model (standard DTD)] • Model Building: – A pair of standard DTDs for each operation – Interchangeable library of operator implementations Ack: Mattos, Pirahesh, Schwenkries
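A rough sketch of the model side of this idea: if the model produced by an operator is written in an agreed XML vocabulary, any implementation can read it back. The element and attribute names below (`AssociationModel`, `Rule`, `Antecedent`, `Consequent`) are invented for illustration and do not come from any actual standard DTD.

```python
import xml.etree.ElementTree as ET

def rules_to_xml(rules):
    """Serialize (antecedent items, consequent, support, confidence)
    tuples using a hypothetical XML vocabulary."""
    root = ET.Element("AssociationModel")
    for antecedent, consequent, support, confidence in rules:
        rule = ET.SubElement(
            root, "Rule", support=str(support), confidence=str(confidence)
        )
        for item in antecedent:
            ET.SubElement(rule, "Antecedent").text = item
        ET.SubElement(rule, "Consequent").text = consequent
    return ET.tostring(root, encoding="unicode")

def rules_from_xml(text):
    """Parse the same vocabulary back into tuples; this could be a
    different operator library than the one that wrote the model."""
    rules = []
    for rule in ET.fromstring(text).findall("Rule"):
        antecedent = [e.text for e in rule.findall("Antecedent")]
        consequent = rule.find("Consequent").text
        rules.append(
            (antecedent, consequent,
             float(rule.get("support")), float(rule.get("confidence")))
        )
    return rules

model = [(["bread"], "milk", 0.6, 0.75)]
round_trip = rules_from_xml(rules_to_xml(model))
```

The point of the round trip is that producer and consumer share only the document format, not code.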
XML-based Data Mining Standard (2) [Diagram: Data Record, Model, and Mapping (standard DTDs), Application Library, Result (standard DTD)] • Model Deployment: – Mapping XML object provides mapping between names and format in the model object and the data record – Model could have been developed on a different system
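The deployment step can be pictured with a small sketch: a mapping document translates the names and formats of an application's data record into those the model expects. Everything here is hypothetical, including the `Mapping`/`Field` vocabulary, the column names, and the stand-in `classify` function that plays the role of a model built elsewhere.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping document: model field name, record column, type.
MAPPING_XML = """
<Mapping>
  <Field model="age" record="CUST_AGE" type="int"/>
  <Field model="income" record="ANNUAL_INC" type="float"/>
</Mapping>
"""

CONVERTERS = {"int": int, "float": float, "str": str}

def load_mapping(xml_text):
    """Return {model_field: (record_column, converter)}."""
    root = ET.fromstring(xml_text.strip())
    return {
        f.get("model"): (f.get("record"), CONVERTERS[f.get("type")])
        for f in root.findall("Field")
    }

def to_model_input(record, mapping):
    """Rename and convert a raw data record into model inputs."""
    return {
        name: convert(record[column])
        for name, (column, convert) in mapping.items()
    }

def classify(inputs):
    """Stand-in for a model developed on a different system."""
    return "high" if inputs["income"] > 50000 and inputs["age"] >= 30 else "low"

mapping = load_mapping(MAPPING_XML)
inputs = to_model_input({"CUST_AGE": "42", "ANNUAL_INC": "72000.5"}, mapping)
label = classify(inputs)
```

Because the model only ever sees its own field names, the same model could be deployed against a differently named table by swapping the mapping document.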
Implications • Standard interfaces for application developers to incorporate data mining • Coupling with relational databases – mappings from DTDs to relational schemas – implementation using existing infrastructure
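One way to picture the coupling with relational databases is to store a mined model in ordinary tables and let applications query it with SQL. The sketch below uses SQLite only because it ships with Python; the table and column names are assumptions, not a schema derived from any real DTD.

```python
import sqlite3

def store_rules(conn, rules):
    """Store association rules in two hypothetical tables:
    one row per rule, one row per antecedent item."""
    conn.execute(
        "CREATE TABLE rule (rule_id INTEGER PRIMARY KEY, "
        "consequent TEXT, support REAL, confidence REAL)"
    )
    conn.execute(
        "CREATE TABLE rule_antecedent ("
        "rule_id INTEGER REFERENCES rule(rule_id), item TEXT)"
    )
    for rule_id, (antecedent, consequent, support, confidence) in enumerate(
        rules, start=1
    ):
        conn.execute(
            "INSERT INTO rule VALUES (?, ?, ?, ?)",
            (rule_id, consequent, support, confidence),
        )
        conn.executemany(
            "INSERT INTO rule_antecedent VALUES (?, ?)",
            [(rule_id, item) for item in antecedent],
        )

def rules_for_item(conn, item, min_confidence):
    """Consequents of rules whose antecedent contains item."""
    rows = conn.execute(
        "SELECT r.consequent, r.confidence FROM rule r "
        "JOIN rule_antecedent a ON a.rule_id = r.rule_id "
        "WHERE a.item = ? AND r.confidence >= ? "
        "ORDER BY r.confidence DESC",
        (item, min_confidence),
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")
store_rules(conn, [
    (["bread"], "milk", 0.6, 0.75),
    (["bread", "beer"], "diapers", 0.2, 0.5),
])
suggestions = rules_for_item(conn, "bread", 0.6)
```

An application then needs nothing beyond its existing database access layer to use the mining results.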
Data Mining Benchmarks • UC Irvine repository • Generating synthetic benchmarks modeled after real data sets is a hard problem – How to map names into meaningful literals – How to preserve empirical distributions Ack: Srikant, Ullman
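A deliberately naive sketch of the two sub-problems for a single categorical column: real names are replaced by synthetic literals, and synthetic values are drawn with the same relative frequencies as the real ones. The function name and the `v0`, `v1` literal scheme are assumptions. The sketch preserves only a one-column marginal distribution, which hints at why the general problem is hard: correlations across columns and meaningful literals are not addressed.

```python
import random
from collections import Counter

def synthesize_column(values, n, rng):
    """Draw n synthetic values whose frequencies follow the empirical
    distribution of `values`, with real names replaced by literals."""
    counts = Counter(values)
    # Map each real name to an anonymous literal (v0, v1, ...).
    literals = {name: f"v{i}" for i, name in enumerate(sorted(counts))}
    population = [literals[name] for name in counts]
    weights = list(counts.values())
    return rng.choices(population, weights=weights, k=n), literals

real = ["gold"] * 10 + ["basic"] * 90
synthetic, literals = synthesize_column(real, 1000, random.Random(0))
```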
Auto-focus data mining • Automatic parameter tuning • Automatic algorithm selection (à la join method selection in database query optimization) Ack: Andreas Arning
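As a toy illustration of automatic algorithm selection, the sketch below builds each candidate on training data and picks the one with the best holdout accuracy, loosely analogous to an optimizer choosing among join methods by estimated cost. The two candidate "algorithms", the data, and all function names are assumptions for illustration; a real system would also tune parameters and use a less naive selection criterion.

```python
def majority_class(train):
    """Baseline: always predict the most common training label."""
    labels = [label for _, label in train]
    top = max(sorted(set(labels)), key=labels.count)
    return lambda x: top

def threshold_rule(train):
    """Predict 1 when x >= t, with t chosen to maximize training accuracy."""
    best_t, best_hits = None, -1
    for t in sorted({x for x, _ in train}):
        hits = sum(int(x >= t) == label for x, label in train)
        if hits > best_hits:
            best_t, best_hits = t, hits
    return lambda x: int(x >= best_t)

def accuracy(model, data):
    """Fraction of (x, label) pairs the model predicts correctly."""
    return sum(model(x) == label for x, label in data) / len(data)

def auto_select(candidates, train, holdout):
    """Build every candidate and return (name, holdout accuracy) of the best."""
    scores = {
        name: accuracy(build(train), holdout)
        for name, build in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

train = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1)]
holdout = [(0, 0), (8, 1), (2, 0), (9, 1)]
candidates = {"majority_class": majority_class, "threshold_rule": threshold_rule}
choice = auto_select(candidates, train, holdout)
```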
Web: Greatest opportunity • Huge collection of data (e.g., Yahoo collecting ~50 GB every day) • Universal digital distribution medium makes data mining results actionable in fundamentally new ways • But watch for privacy pitfall
Privacy-preserving data mining • Technical vs. legislated solutions • Implication for data mining algorithms when some fields of a data record have been fudged according to the user’s privacy sensitivity Ack: R. Srikant
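To make the algorithmic implication concrete, here is a sketch using randomized response, a classical survey technique swapped in for illustration; the slide does not say which perturbation scheme is intended. Each user reports a sensitive yes/no field truthfully with probability `keep_prob` and flipped otherwise, and the miner corrects for the known flip rate when estimating the aggregate. Function names and parameter values are assumptions.

```python
import random

def randomize(bit, keep_prob, rng):
    """Report the true bit with probability keep_prob, else its flip."""
    return bit if rng.random() < keep_prob else 1 - bit

def estimate_true_rate(reported, keep_prob):
    """Recover the population rate of 1s from perturbed reports.

    observed = keep_prob * true + (1 - keep_prob) * (1 - true), so
    true = (observed - (1 - keep_prob)) / (2 * keep_prob - 1).
    Requires keep_prob != 0.5.
    """
    observed = sum(reported) / len(reported)
    return (observed - (1 - keep_prob)) / (2 * keep_prob - 1)

rng = random.Random(1)
true_bits = [1] * 6000 + [0] * 14000          # true rate is 0.30
reported = [randomize(b, 0.8, rng) for b in true_bits]
estimate = estimate_true_rate(reported, 0.8)  # close to 0.30, not exact
```

No individual report can be trusted, yet the aggregate remains estimable; a lower `keep_prob` gives more privacy at the cost of a noisier estimate.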
Personalization • Internet might provide for the first time tools necessary for users to capture information about themselves and to selectively release this information† • Will we be providing these tools? † John Hagel, Marc Singer. Net Worth. Harvard Business School Press.
What about Association Rules? • Very long patterns • Separating wheat from chaff • Principled introduction of domain knowledge
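One common way to start separating wheat from chaff is to score rules with an interestingness measure in addition to support and confidence. The sketch below uses lift, chosen here only as a familiar example; the slide does not name a measure, and the function name and counts are assumptions.

```python
def rule_metrics(n, count_a, count_b, count_ab):
    """Support, confidence, and lift of the rule A -> B, given the number
    of transactions n and the counts containing A, B, and both."""
    support = count_ab / n
    confidence = count_ab / count_a
    lift = confidence / (count_b / n)
    return support, confidence, lift

# A rule with 80% confidence looks strong, but B already appears in 80%
# of all transactions, so A tells us nothing new about B.
support, confidence, lift = rule_metrics(n=1000, count_a=100, count_b=800, count_ab=80)
```

A lift near 1 marks a rule whose consequent is about as likely with or without the antecedent, which is one simple filter for uninformative rules.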
What else? • Formal foundations of data mining
Summary • Closely couple data mining with database systems • Embed data mining into applications • Focus on web • Standard interfaces • Benchmarks • Auto focussing • Personalization • Privacy
Concluding remarks • Data mining, a great technology – Combination of intriguing theoretical questions with large commercial interest in the technology • Poised for transitioning into mainstream technology • Will we rise to the challenge as a community?
Acknowledgments