Data Mining and Data Warehousing: Concepts and Techniques


  • Number of slides: 20

Data Mining and Data Warehousing: Concepts and Techniques
Course outline
• Motivation
• Evolution of Database Technology: overview
• Why Data Mining? Potential Applications
• What Is Data Mining? Data Mining: A KDD Process
• Data Mining: On What Kind of Data?
Next
• What is a Data Warehouse?
• Data Warehouse vs. other systems, OLTP vs. OLAP

Motivation
• Data explosion problem: automated data collection tools and database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
• Data are collected from everywhere and in huge amounts.
• We are data rich but information poor.
• How can you make good use of your data?

We are Data Rich but Information Poor
• Databases are too big
• Data mining can help discover knowledge
• Terrorbytes

Data Warehousing and Data Mining: Overview
• On-line analytical processing (OLAP)
• Extraction of interesting knowledge (rules, patterns, …) from data in large databases.
• Bring together scattered information from multiple sources so as to provide a consistent database source for decision support queries.
• Provide architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.

Evolution of Database Technology: Overview
• 1960s: Data collection, database creation, IMS and network DBMS.
• 1970s: Relational data model, relational DBMS implementation.
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
• 1990s: Data mining and data warehousing, multimedia databases, and Web technology.

Why Data Mining? Potential Applications
• Database analysis and decision support
  • Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation.
  • Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis.
  • Fraud detection and management
• Other applications
  • Text mining (newsgroups, email, documents) and Web analysis.
  • Intelligent query answering

Market Analysis and Management
• Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
• Target marketing: find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time, e.g., conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
  • Associations/correlations between product sales
  • Prediction based on the association information.
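The "associations between product sales" item can be made concrete with two standard measures, support and confidence. The following is a minimal sketch, assuming a tiny hypothetical set of transactions; the item names and the `pair_statistics` helper are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data; each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk", "cereal"},
]

def pair_statistics(transactions):
    """Return {(a, b): (support, confidence of a -> b)} for every ordered item pair."""
    n = len(transactions)
    # How many transactions contain each single item.
    item_counts = Counter(item for t in transactions for item in t)
    # How many transactions contain each unordered pair of items.
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of all transactions containing both items
        stats[(a, b)] = (support, count / item_counts[a])  # confidence of a -> b
        stats[(b, a)] = (support, count / item_counts[b])  # confidence of b -> a
    return stats

stats = pair_statistics(transactions)
support, confidence = stats[("butter", "bread")]
print(f"butter -> bread: support={support:.2f}, confidence={confidence:.2f}")
```

A rule with high confidence, such as butter -> bread in this toy data, is the kind of association that the "prediction based on the association information" item would build on. Real systems prune candidate item sets (for example with Apriori) rather than counting every pair.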

Market Analysis and Management (2)
• Customer profiling
  • Data mining can tell you what types of customers buy what products (clustering or classification).
• Identifying customer requirements
  • Identifying the best products for different customers
  • Use prediction to find what factors will attract new customers
• Provides summary information
  • Various multidimensional summary reports
  • Statistical summary information (data central tendency and variation)
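As a small illustration of "statistical summary information (data central tendency and variation)", the sketch below summarizes a hypothetical list of monthly spending figures with Python's standard `statistics` module. The numbers are invented for the example.

```python
import statistics

# Hypothetical monthly spending for one customer segment (arbitrary currency units).
spending = [120, 135, 150, 110, 400, 125, 140]

summary = {
    # Central tendency
    "mean": statistics.mean(spending),
    "median": statistics.median(spending),
    # Variation
    "stdev": statistics.stdev(spending),        # sample standard deviation
    "range": max(spending) - min(spending),
}

for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

In this data the single large value (400) pulls the mean well above the median, which is one reason summary reports usually show both.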

Corporate Analysis and Risk Management
• Finance planning and asset evaluation:
  • Cash flow analysis and prediction
  • Contingent claim analysis to evaluate assets
  • Cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
• Resource planning: summarize and compare the resources and spending
• Competition:
  • Monitor competitors and market directions (CI: competitive intelligence).
  • Group customers into classes and set up a class-based pricing procedure.
  • Set pricing strategy in a highly competitive market (e.g., REPSOL gas station chain in Spain).

Fraud Detection and Management
• Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
• Examples:
  • Auto insurance: detect a group of people who stage accidents to collect on insurance
  • Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  • Medical insurance: detect professional patients and rings of doctors and rings of references

Fraud Detection and Management (2)
• More examples:
  • Detecting inappropriate medical treatment: the Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
  • Detecting telephone fraud:
    • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
    • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion-dollar fraud.
  • Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.
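One simple way to read "analyze patterns that deviate from an expected norm" is as a z-score test against a subscriber's own call history. The sketch below is only an illustration under that assumption: the `flag_deviations` helper, the threshold of 3, and the call durations are hypothetical, and the slides do not say which method the telephone fraud systems actually used.

```python
import statistics

def flag_deviations(history, new_values, threshold=3.0):
    """Return the new values whose z-score against the history exceeds threshold.

    Assumes the history has at least two values and is not constant,
    so that the sample standard deviation is defined and non-zero.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [v for v in new_values if abs(v - mean) / stdev > threshold]

# Hypothetical call durations in minutes for one subscriber.
past_calls = [3, 5, 4, 6, 5, 4, 3, 5, 4, 6]
todays_calls = [4, 5, 95, 6]

suspicious = flag_deviations(past_calls, todays_calls)
print(suspicious)
```

A real call model would combine several attributes (destination, duration, time of day or week) rather than a single number, but the idea of comparing new behavior with a learned norm is the same.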

Other Applications
• Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
• Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
• Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyze the effectiveness of Web marketing, and improve the Web site.

Data Mining Should Not Be Used Blindly!
• Data mining finds regularities from history, but history is not the same as the future.
• Association does not dictate trend or causality!?
  • Drinking diet drinks leads to obesity!
  • David Heckerman's counter-example (1997): barbecue sauce, hot dogs, and hamburgers.
• Some abnormal data could be caused by humans!
  • 37 C? Why not registered by doctors?

What Is Data Mining? (1/2)
• Data mining (part of knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information from data in large databases
• Alternative names and their "inside stories":
  • Data mining: a misnomer?
  • Knowledge Discovery in Databases (KDD: SIGKDD), knowledge extraction, data archeology, data dredging, information harvesting, business intelligence, etc.
• What is not data mining?
  • (Deductive) query processing.
  • Expert systems or small statistical programs

Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: KDD process diagram with the stages Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge]

Steps of a KDD Process
[Figure: KDD process diagram with the stages Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge]
• Learning the application domain: relevant prior knowledge and goals of application
• Data warehousing
  • Creating a target data set: data selection
  • Data cleaning and preprocessing (may take 60% of effort!)
  • Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation.
• Data mining
  • Choosing functions of data mining: summarization, classification, regression, association, clustering.
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
• Interpretation: analysis of results (visualization, transformation, removing redundant patterns, etc.)
• Use of discovered knowledge
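The steps above can be sketched as a chain of small functions, one per stage. This is a toy illustration only: the records, field names, and the choice of summarization as the mining function are assumptions made for the example, not part of the slides.

```python
from collections import Counter

# Hypothetical raw records, as they might arrive from an operational database.
raw_records = [
    {"customer": "a", "region": "north", "amount": "120"},
    {"customer": "b", "region": "north", "amount": None},   # missing value
    {"customer": "c", "region": "south", "amount": "80"},
    {"customer": "d", "region": "north", "amount": "200"},
    {"customer": "e", "region": "east", "amount": "n/a"},   # unparsable value
]

def select(records, regions):
    """Data selection: keep only the task-relevant records."""
    return [r for r in records if r["region"] in regions]

def clean(records):
    """Data cleaning: drop records whose amount is missing or not numeric."""
    cleaned = []
    for r in records:
        try:
            cleaned.append({**r, "amount": float(r["amount"])})
        except (TypeError, ValueError):
            continue  # skip records that cannot be repaired in this toy example
    return cleaned

def project(records):
    """Data reduction and projection: keep only the features used for mining."""
    return [(r["region"], r["amount"]) for r in records]

def mine(pairs):
    """Data mining (here: summarization): total amount per region."""
    totals = Counter()
    for region, amount in pairs:
        totals[region] += amount
    return dict(totals)

patterns = mine(project(clean(select(raw_records, {"north", "south"}))))
print(patterns)
```

Interpretation and use of the discovered knowledge remain human steps here: someone still has to decide whether the per-region totals are interesting and act on them.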

Architecture: Typical Data Mining System
[Figure: layered architecture with the components Graphical user interface; Pattern evaluation; Data mining engine; Knowledge-base; Database or data warehouse server; Data cleaning & data integration; Filtering; Databases; Data Warehouse]

Data Mining and Business Intelligence
[Figure: layered diagram; potential to support business decisions increases toward the top]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DBA

Data Mining: Confluence of Multiple Disciplines
• Database systems, data warehouse and OLAP
• Statistics
• Machine learning
• Visualization
• Information science
• High performance computing
• Other disciplines: neural networks, mathematical modeling, information retrieval, pattern recognition, etc.

Data Mining: On What Kind of Data?
Data mining is performed on data coming from:
• Relational databases
• Transactional databases
• Advanced DB systems and information repositories
  • Object-oriented and object-relational databases
  • Spatial databases
  • Time-series data and temporal data
  • Text databases and multimedia databases
  • Heterogeneous and legacy databases
  • WWW
… and accumulated in a data warehouse for long periods of time (several months or sometimes years)