Data Mining and Data Warehousing Concepts and Techniques

Скачать презентацию Data Mining and Data Warehousing Concepts and Techniques

425beaa305c28aa18c9855ed7cc839b3.ppt

Количество слайдов: 38

Data Mining and Data Warehousing: Concepts and Techniques Course outlines Motivation Evolution of Database Technology - overview Why Data Mining? — Potential Applications What Is Data Mining? Data Mining: A KDD Process Data Mining: On What Kind of Data? What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling of Data Warehouses Defining a Snowflake Schema in Data Mining Query Language DMQL Multi-Tiered Architecture - Approaches to Building OLAP Server Indexing OLAP Data: Bitmap Index Data Warehouse Back-End Tools and Utilities From OLAP to On Line Analytical Mining OLAM, An OLAM Architecture Data Mining Functionalities Are All the “Discovered” Patterns Interesting? Market-Basket Data; typical case, Frequent Pairs in SQL, A-Priori Algorithm

Motivation Data explosion problem: Automated data collection tools and database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories. Ê Data are collected from everywhere and in huge amounts Ê We are Data Rich but Information Poor Ê How to make good use of your data? Ê Data warehousing and data mining Ê On-line analytical processing Ê Extraction of interesting knowledge (rules, patterns, …) from data in large databases. Ê Bring together scattered information from multiple sources as to provide a consistent database source for decision support queries. Ê Provide architectures and tools for business executives to systematically organize, understand, and use their data to make 2

Evolution of Database Technology - overview 1960 s: Data collection, database creation, IMS and network DBMS. 1970 s: Relational data model, relational DBMS implementation. 1980 s: RDBMS, advanced data models (extendedrelational, OO, deductive, etc. ) and application-oriented DBMS (spatial, scientific, engineering, etc. ). 1990 s: Data mining and data warehousing, multimedia databases, and Web technology. 3

Why Data Mining? — Potential Applications Ê Database analysis and decision support Ê Market analysis and management Ê target marketing, Ê customer relation management, Customer profiling, Identifying customer requirements, market basket analysis, cross selling, market segmentation. Ê Risk analysis and management Ê Finance planning and asset evaluation: Ê Forecasting, customer retention, improved underwriting, quality control, competitive analysis. Ê Fraud detection and management Ê Applications- widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. Ê Approach- use historical data to build models of fraudulent behavior and use data mining to help identify similar instances. More examples: Ê Detecting inappropriate medical treatment Ê Detecting telephone fraud Ê money laundering - detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) Ê Other Applications: Ê Text mining and Web analysis. Ê Etc. 4

What Is Data Mining? Ê Data mining (part of knowledge discovery in databases): Ê Extraction of interesting ( non-trivial, implicit, previously unknown and potentially useful) information from data in large databases Ê Alternative names and their “inside stories”: Ê Data mining: a misnomer? Ê Knowledge Discovery in Databases (KDD: SIGKDD), knowledge extraction, data archeology, data dredging, information harvesting, business intelligence, etc. Ê What is not data mining? Ê (Deductive) query processing. Ê Expert systems or small statistical programs 5

Data Mining: A KDD Process Data mining: the core of knowledge discovery process. Pattern Evaluation Knowledge Data Mining Task-relevant Data Warehouse Selection Data Cleaning Data Integration Databases 6

Knowledge Pattern Evaluation Data Mining Taskrelevant Data Steps of a KDD Process Data Warehouse Data Cleaning Selection Data Integration Learning the application domain – relevant prior knowledge and goals of application Data where-housing Databases Ê Creating a target data set: data selection Ê Data cleaning and preprocessing: (may take 60% of effort!) Ê Data reduction and projection – Find useful features, dimensionality/variable reduction, invariant representation. Data mining Ê Choosing functions of data mining - summarization, classification, regression, association, clustering. Ê Choosing the mining algorithm(s) Ê Data mining: search for patterns of interest Ê Interpretation: analysis of results - visualization, transformation, removing redundant patterns, etc. Ê Use of discovered knowledge 7

Data Mining and Business Intelligence Increasing potential to support business decisions Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA Data Sources DBA Paper, Files, Information Providers, Database Systems, OLTP 8

Data Mining: On What Kind of Data? Data mining is performed on data coming from: Ê Relational databases Ê Transactional databases Ê Advanced DB systems and information repositories Ê Object-oriented and object-relational databases Ê Spatial databases Ê Time-series data and temporal data Ê Text databases and multimedia databases Ê Heterogeneous and legacy databases Ê WWW … and accumulated in a data warehouse for long periods of time (several months or sometimes years) 9

What is a Data Warehouse? Defined in many different ways, but not rigorously. Ê A decision support database that is maintained separately from the organization’s operational database Ê Support information processing by providing a solid platform of consolidated, historical data for analysis. Knowledge Pattern Evaluation Data Mining “A data warehouse is a Taskrelevant Data Êsubject-oriented, Êintegrated, Êtime-variant, W. H. Inmon Êand nonvolatile collection of data in support of management’s decisionmaking process. ” Data Warehouse Data Cleaning Selection Data Integration Databases Data warehousing: The process of constructing and using data warehouses 10

Knowledge What is a data warehouse? Pattern Evaluation Data Mining Data Warehouse (1/2) Taskrelevant Data Warehouse Data Cleaning Subject Oriented Selection Data Integration Databases Ê Organized around major subjects, such as customer, product, sales. Ê Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Ê Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. Integrated Ê Integrate multiple, heterogeneous data sources - relational databases, flat files, on-line transaction records Ê Data cleaning and data integration techniques are applied. Ê Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources (E. g. , Hotel price: currency, tax, breakfast covered, etc. ) Ê When data is moved to the warehouse, it is converted. 11

Knowledge What is a data warehouse? Data Warehouse (2/2) Time Variant Pattern Evaluation Data Mining Taskrelevant Data Warehouse Data Cleaning Selection Data Integration Databases Ê The time horizon for the data warehouse is significantly longer than that of operational systems. Ê Operational database: current value data. Ê Data Warehouse data: provide information from a historical perspective (e. g. , past 5 -10 years) Ê Every key structure in the data warehouse Ê Contains an element of time, explicitly or implicitly Ê But the key of operational data may or may not contain “time element”. Non-Volatile Ê A physically separate store of data transformed from the operational environment. Ê Operational update of data does not occur in the data warehouse environment. Ê Does not require transaction processing, recovery, and concurrency control mechanisms Ê Requires only two operations in data accessing: initial loading of data and access of data. 12

What is a data warehouse? Data Warehouse vs. Heterogeneous DBMS Traditional heterogeneous DB integration: ÊBuild wrappers/mediators on top of heterogeneous databases ÊQuery driven approach: ÊWhen a query is posed to a client site, a metadictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set. ÊComplex information filtering, compete for sources 13

What is a data warehouse? Data Warehouse vs. Operational DB Systems OLTP (on-line transaction processing) Major task of traditional relational DBMS - Most database operations are of a type called OLTP. Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making þ Distinct features (OLTP vs. OLAP): P P P User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries 14

What is a data warehouse? OLTP vs. OLAP - The most complex OLAP queries are often referred to as data mining Common architecture Ê Local databases, say one per branch store, handle OLTP, Ê while a warehouse integrating information from all branches handles OLAP. 15

Knowledge What is a data warehouse? Why Separate Data Warehouse? Ê High performance for both systems: Pattern Evaluation Data Mining Taskrelevant Data Warehouse Data Cleaning Selection Data Integration Databases Ê DBMS — tuned for OLTP: access methods, indexing, concurrency control, recovery Ê Warehouse — tuned for OLAP: complex OLAP queries, multidimensional view, consolidation. Ê Different functions and different data: missing data: Decision support requires historical data which operational DBs do not typically maintain data consolidation: Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled. 16

A multi-dimensional data model Conceptual Modeling of Data Warehouses Modeling data warehouses: dimensions & measurements Star schema: A single object (fact table) in the middle connected to a number of objects (dimension tables) Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables. Fact constellations: Multiple fact tables share dimension tables. 17

Knowledge Conceptual Modeling of Data Warehouses Pattern Evaluation Data Mining Taskrelevant Data Example of Star Schema Data Warehouse Data Cleaning Date Day Month Year Selection Data Integration Databases Product Sales Fact Table Date Store Product Store. ID City State Country Region Store Product. No Prod. Name Prod. Desc Category QOH Customer Cust unit_sales Cust. Id Cust. Name Cust. City Cust. Country dollar_sales Yen_sales Measurements 18

Knowledge Conceptual Modeling of Data Warehouses Example of Snowflake Schema Date Day Month Store City State Country Region Taskrelevant Data Warehouse Data Cleaning Selection Data Integration Product Month Year Country Data Mining Databases Year Pattern Evaluation Store. ID City State Country Sales Fact Table Date Product Store Customer unit_sales dollar_sales Yen_sales Product. No Prod. Name Prod. Desc Category QOH Cust. Id Cust. Name Cust. City Cust. Country Measurements 19

Conceptual Modeling of Data Warehouses A Data Mining Query Language: DMQL Language Primitives Cube Definition ( Fact Table ) Knowledge Pattern Evaluation Data Mining Taskrelevant Data Warehouse Data Cleaning Selection Data Integration Databases define cube []: Dimension Definition ( Dimension Table ) define dimension as () Special Case (Shared Dimension Tables) ÊFirst time as “cube definition” Êdefine dimension as in cube 20

Conceptual Modeling of Data Warehouses Defining a Snowflake Schema in DMQL define cube sales [date, product, store, customer]: dollar_sales = sum(sales_in_dollars), yen_sales = sum(sales_in_yens), unit_sales = count(*) define dimension product as (product_no, prod_name, prod_desc, category, QOH) define dimension cust as (cust. ID, cust_name, cust_city, cust_country) define dimension date as (day, month (month_key, year (year_key) ) ) define dimension store as ( store. ID, city ( city_key, state( state_key, country(country_key, region) ))) 21

A multi-dimensional data model Pr od TV PC VCR sum 1 Qtr 2 Qtr Date 3 Qtr 4 Qtr Total annual sales sum of TV in U. S. A Canada Mexico Country uc t A Sample Data Cube sum 22

Data warehousing Typical OLAP Operations Ê Roll up (drill-up): summarize data; by climbing up hierarchy or by dimension reduction Ê Drill down (roll down): reverse of roll-up; from higher level summary to lower level summary or detailed data, or introducing new dimensions Ê Slice and dice: project and select Ê Pivot (rotate): reorient the cube, visualization, 3 D to series of 2 D planes. Ê Other operations Ê drill across: involving (across) more than one fact table. Ê drill through: through the bottom level to its back-end relational tables. 23

Data warehouse software architecture Multi-Tiered Architecture Other sources Operational DBs Metadata Extract Transform Load ETL Refresh Monitor & Integrator Data Warehouse OLAP Server Serve Analysis Query Reports Data mining Data Marts Data Sources Data Storage OLAP Engine Front-End Tools 24

Data warehouse software architecture Three-Tier Data Warehouse Architecture Ê Enterprise warehouse: Other sourc es Operational DBs collects all of the information about subjects spanning the entire organization. Ê Data Mart: Metadata Extract Transform Load ETL Refresh Monitor & Integrator Data Warehouse OLAP Server Analysis Query Reports Data mining Serv e Data Marts a subset of corporate-wide data that is of value to a specific groups of users. Ê Its scope is confined to specific, selected groups, such as marketing data mart. Ê Independent vs. dependent (directly from warehouse) data mart Ê Virtual warehouse: Ê A set of views over operational databases. Ê Only some of the possible summary views may be materialized. 25

Building Data Warehouses Approaches to Building Warehouses … OLAP Server Architectures Relational OLAP (ROLAP): Êrelational database system tuned for star schemas, e. g. , using special index structures such as: Ê “Bitmap indexes” (for each key of a dimension table, e. g. , bar name, a bit-vector telling which tuples of the fact table have that value). Ê Materialized views = answers to general queries from which more specific queries can be answered with less work than if we had to work from the raw data. ÊUse relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces. ÊInclude optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services Multidimensional OLAP (MOLAP) - A specialized model based on a “cube” view of data. ÊArray-based multidimensional storage engine (sparse matrix techniques) Êfast indexing to pre-computed summarized data Hybrid OLAP (HOLAP) - User flexibility, e. g. , low level: relational, high-level: array. Specialized SQL servers - specialized support for SQL queries over star, snowflake schemas 26

Building Data Warehouses Indexing OLAP Data: Bitmap Index on a particular column Each value in the column has a bit vector: bit-op is fast The length of the bit vector: # of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column not suitable for high cardinality domains Base table Index on Region Join Indices Ê Join index: JI(R-id, S-id) where R (R-id, …) S (S-id, …) Ê Traditional indices map the values to a list of record ids Ê In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table. Ê E. g. fact table: Sales and two dimensions city and product ; A join index on city maintains for each distinct city a list of R- Index on Type 27

Building Data Warehouses Metadata Repository Meta data are the data defining warehouse objects Ê Description the structure of the warehouse - schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents Ê Operational meta-data - data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails) Ê The algorithms used for summarization Ê The mapping from operational environment to the data warehouse Ê Data related to system performance - warehouse schema, view and derived data definitions Ê Business data - business terms and definitions, ownership of data, charging policies 28

Building Data Warehouses Data Warehouse Back-End Tools and Utilities Other sourc es Operational DBs Metadata Extract Transform Load ETL Refresh Monitor & Integrator Data Warehouse OLAP Server Analysis Query Reports Data mining Serv e Data Marts Data Extraction: get data from multiple, heterogeneous, and external sources Data cleaning: detect errors in the data and rectify them when possible Data Transformation: convert data from legacy or host format to warehouse format Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions Refresh: propagate the updates from the data sources to the warehouse 29

From data warehousing to data mining Data Warehouse Usage Three kinds of data warehouse applications Information processing - supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing Ê multidimensional analysis of data warehouse data Ê supports basic OLAP operations, slice-dice, drilling, pivoting Data mining Ê knowledge discovery from hidden patterns Ê supports associations, constructing analytical models, performing classification and prediction, and presenting the 30

From data warehousing to data mining From On-Line Analytical Processing OLAP to On Line Analytical Mining OLAM Ê Why online analytical mining? ü High quality of data in data warehouses ÊDW contains integrated, consistent, cleaned data ÊAvailable information processing structure surrounding data warehouses ÊODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools ÊOLAP-based exploratory data analysis Êmining with drilling, dicing, pivoting, etc. ÊOn-line selection of data mining functions Êintegration and swapping of multiple mining functions, algorithms, and tasks. Ê Architecture of OLAM 31

From data warehousing to data mining An OLAM Architecture Mining query On Line Analytical Mining result Layer 4 User Interface User GUI API OLAM Engine Data Cube API OLAP Engine MDDB Meta Database API Filtering & Integration Data cleaning Databases Data integration Filtering Data Warehouse Layer 3 OLAP/OLAM Layer 2 MDDB Layer 1 Data Repository 32

Data Mining Ê Ê Data mining is a popular term for queries that summarize big data sets in useful ways. Large-scale queries designed to extract patterns from data. Examples: Clustering all Web pages by topic. Finding characteristics of fraudulent credit-card use.

Knowledge Pattern Evaluation Data Mining Taskrelevant Data Mining Functionalities Data Warehouse Data Cleaning Selection Data Integration Ê Concept description: Characterization and Comparison Generalize, summarize, and possibly contrast data characteristics. Databases Ê Association Ê From association, correlation, to causality. Ê finding rules like “inside(x, city) à near(x, highway)”. Ê Classification and Prediction Ê Classify data based on the values in a classifying attribute, e. g. , classify countries based on climate, or classify cars based on gas mileage. Ê Predict some unknown or missing attribute values based on other information. Ê Cluster analysis – Group data to form new classes, e. g. , cluster houses to find distribution patterns. Ê Outlier and exception data analysis Ê Time-series analysis (trend and deviation) Ê Trend and deviation analysis: regression, sequential pattern, similar sequences, trend and deviation, e. g. , stock analysis. Ê Similarity-based pattern-directed analysis Ê Full vs. partial periodicity analysis 34

Knowledge Pattern Evaluation Are All the “Discovered” Patterns Interesting? Data Mining Taskrelevant Data Warehouse Data Cleaning Selection Data Integration Ê A data mining system/query may generate thousands of patterns, not all of them are interesting - Suggested approach: Human-centered, query-based, focused mining Databases Ê Interestingness measures: A pattern is interesting if it is Ê Ê easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to confirm Ê Objective vs. subjective interestingness measures: Ê Objective: based on statistics and structures of patterns, e. g. , support, confidence, etc. Ê Subjective: based on user’s beliefs in the data, e. g. , unexpectedness, novelty, etc. Can It Find All and Only Interesting Patterns? Ê Find all the interesting patterns: Completeness - Can a data mining system find all the interesting patterns? Ê Search for only interesting patterns: Optimization. Ê Can a data mining system find only the interesting patterns? Ê Approaches Ê First general all the patterns and then filter out the uninteresting ones. Ê Generate only the interesting patterns --- mining query optimization 35

Market-Basket Data; typical case An important form of mining from relational data involves market baskets = sets of “items” that are purchased together as a customer leaves a store. Summary of basket data is frequent itemsets = sets of items that often appear together in baskets. Example: Market Baskets If people often buy hamburger and ketchup together, the store can: 1. Put hamburger and ketchup near each other and put potato chips between. 2. Run a sale on hamburger and raise the price of ketchup. Finding Frequent Pairs Ê The simplest case is when we only want to find “frequent pairs” of items. Ê Assume data is in a relation Baskets(basket, item). Ê The support threshold s is the minimum number of baskets in which a pair appears before we are interested. 36

Frequent Pairs in SQL SELECT b 1. item, b 2. item FROM Baskets b 1, Baskets b 2 WHERE b 1. basket = b 2. basket AND b 1. item < b 2. item GROUP BY b 1. item, b 2. item HAVING COUNT(*) >= s; Look for two Basket tuples with the same basket and different items. First item must precede second, so we don’t count the same pair twice. Create a group for each pair of items that appears in at least one basket. Throw away pairs of items that do not appear at least s times. FA-Priori Trick Above query is prohibitively expensive for large data. O Ê A-priori algorithm uses the fact that a pair (i, j) cannot have support s unless i and j both have support s by themselves. Ê More efficient implementation uses an intermediate relation Baskets 1(a materialized view to hold only information about frequent items). INSERT INTO Baskets 1(bid, item) SELECT * FROM Baskets WHERE item IN ( SELECT Items that appear in item FROM Baskets at least s baskets. GROUP BY item HAVING COUNT(*) >= s Then run the query for pairs); on Baskets 1 instead of Baskets. 37

A-Priori Algorithm ÊMaterialize the view Baskets 1. ÊRun the obvious query, but on Baskets 1 instead of Baskets. ÊBaskets 1 is cheap, since it doesn’t involve a join. ÊBaskets 1 probably has many fewer tuples than Baskets. Running time shrinks with the square of the number of tuples involved in the join. Example: A-Priori Suppose: 1. A supermarket sells 10, 000 items. 2. The average basket has 10 items. 3. The support threshold is 1% of the baskets. Ê At most 1/10 of the items can be frequent. 38