Скачать презентацию Lecture 1 — Introduction to DW Reading Recommendations Скачать презентацию Lecture 1 — Introduction to DW Reading Recommendations

c8f7ea8129435ac04d9d101b3559b931.ppt

  • Количество слайдов: 47

Lecture 1 - Introduction to DW Reading Recommendations ”An overview of Data Warehousing and Lecture 1 - Introduction to DW Reading Recommendations ”An overview of Data Warehousing and OLAP Technology” by Chaudhuri & Bayal, Keywords DW, DSS, OLTP, OLAP, MDM, Data Mart, Data Mining

”We are drowing in information, but starving for knowledge” - John Naisbett ”We are drowing in information, but starving for knowledge” - John Naisbett

The Data Warehouse - definition B. Inmon: ”A data warehouse is a subject oriented, The Data Warehouse - definition B. Inmon: ”A data warehouse is a subject oriented, integrated, non-volatile, and time-variant collection of data in support of manadement’s decisions”.

Subject-oriented Operational Systems Data Warehouse Sales System Customer Data Payroll System Employee Data Purchasing Subject-oriented Operational Systems Data Warehouse Sales System Customer Data Payroll System Employee Data Purchasing System Vendor Data

Integrated Operational Systems Data Warehouse Marketing System Order System Billing System Customer Data Integrated Operational Systems Data Warehouse Marketing System Order System Billing System Customer Data

Time variant Operational Systems Data Warehouse Order System Customer Data 60 -90 days 5 Time variant Operational Systems Data Warehouse Order System Customer Data 60 -90 days 5 -10 years

Non-volatile Operational Systems Data Warehouse Create Update Delete Order System Load Insert Access Customer Non-volatile Operational Systems Data Warehouse Create Update Delete Order System Load Insert Access Customer Data

The Data Warehouse - definition B. Inmon: ”A data warehouse is a subject oriented, The Data Warehouse - definition B. Inmon: ”A data warehouse is a subject oriented, integrated, non-volatile, and time-variant collection of data in support of manadement’s decisions”. S. Chaudhuri & U. Dayal: ”Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions. ”

Decision Support and OLAP (by Navathe) • Information technology to help the knowledge worker Decision Support and OLAP (by Navathe) • Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions. – Will a 10% discount increase sales volume sufficiently? – Which of two new medications will result in the best outcome: higher recovery rate & shorter hospitality rate? – How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years? • On-Line Analytical Processing (OLAP) is an element of decision support system (DSS).

Data Warehouse (Navathe) • A decision support database that is maintained separately from the Data Warehouse (Navathe) • A decision support database that is maintained separately from the organisation’s operational databases. • A data warehouse is a – – subject oriented, integrated, time-varying, non-volatile collection of data that is used primarily in the organisational decision making.

OLTP OLAP vs. • • holds current data stores detailed data • • data OLTP OLAP vs. • • holds current data stores detailed data • • data is dynamic repetitive processing • • • high level of transaction throughput predictable pattern of usage transaction driven application oriented support day-to-day decisions serves large number of operational users • • • holds historic data stores detailed and summarised data is largely static ad-hoc, unstructured and heuristic processing medium or low-level of transaction throughput unpredictable pattern of usage analysis driven subject oriented supports strategic decisions serves relatively lower level of managerial users

Why separate data warehouse? • Performance – The operational DBs are tuned to support Why separate data warehouse? • Performance – The operational DBs are tuned to support known OLTP workloads – Supporting OLAP requires special data organisations, access methods and implementation methods • Function – The decision support requires data that may be missing from the operational DBs – Decision support usually requires consolidating data from many heterogeneous sources

Architectur e Monitoring & Administration Tools Metadata repository Data sources OLAP servers Analysis Data Architectur e Monitoring & Administration Tools Metadata repository Data sources OLAP servers Analysis Data warehouse External sources Operational DBs Extract Transform Load Refresh Serve Query/Reporting Data mining Data marts Falö aöldf flaöd aklöd falö alksdf

OLAP for Decision Support (Navathe) • Goal of OLAP is to support ad-hoc querying OLAP for Decision Support (Navathe) • Goal of OLAP is to support ad-hoc querying for the business analyst • Business analysts are familiar with spreadsheets • Extend spreadsheet analysis model to work with warehouse data – Large data set – Semantically enriched to understand business terms (e. g. , time, geography) – combined with reporting features • Multidimensional view of data is the foundation for OLAP

“Multidimensional” view of the data - a popular conceptual model that influenced front-end tools, “Multidimensional” view of the data - a popular conceptual model that influenced front-end tools, database design, and the query engine for OLAP - numeric measures/facts (e. g. number of, sum, total sales) depends on a set of dimensions A data cube: Spreadsheets: e fic f 130 2 300 5 024 200 quarter o ice f of 5 024 200 product

“Multidimensional” view of the data promotion campaign ice ff quarter o product ice ff “Multidimensional” view of the data promotion campaign ice ff quarter o product ice ff o quarter f of ice product customer group

“Multidimensional” view of the data Promotion campaign Quarter Measures/facts Promotion campaign Office Customer group “Multidimensional” view of the data Promotion campaign Quarter Measures/facts Promotion campaign Office Customer group

Dimensional modelling - Star schema Service used Time - date - month - quarter Dimensional modelling - Star schema Service used Time - date - month - quarter - year - service name - service group Telephone calls - sum ($) - number of calls Sales Dimension - seller name - office Customer - customer name - address - region - income group

Dimensional modelling - Star-join schemas Service Dimension Time Dimension Fact table - Transactions C Dimensional modelling - Star-join schemas Service Dimension Time Dimension Fact table - Transactions C 210 C 212 C 213 C 214 Sales Dimension S 1 S 3 S 2 S 1 S 4 F 11 F 13 991011 991012 Sum 25: 00 05: 00 89: 00 12: 00 08: 00 Number of calls 3 1 1 Customer Dimension

Dimensional modelling - Star-join schemas Service Dimension Time Dimension Fact table - Transactions C Dimensional modelling - Star-join schemas Service Dimension Time Dimension Fact table - Transactions C 210 C 212 C 213 C 214 S 1 S 3 S 2 S 1 S 4 F 11 F 13 991011 991012 Sum 25: 00 05: 00 89: 00 12: 00 08: 00 S=37: 00 Sales Dimension Number of calls 3 1 1 Query: For how much did customers in Sthlm use service “Local call” in october 1999? Customer Dimension

Snow-flake schema Year Service used - service name Time Month - date Quarter Telephone Snow-flake schema Year Service used - service name Time Month - date Quarter Telephone calls Service group - sum ($) - number of calls Sales Dimension - seller name Office Region Customer - customer name - address Income group

Architectur e Monitoring & Administration Tools Metadata repository Data sources OLAP servers Analysis Data Architectur e Monitoring & Administration Tools Metadata repository Data sources OLAP servers Analysis Data warehouse External sources Operational DBs Extract Transform Load Refresh Serve Query/Reporting Data mining Data marts Falö aöldf flaöd aklöd falö alksdf

Back End Tools and Utilities Extract & Transform • data selection • data cleaning Back End Tools and Utilities Extract & Transform • data selection • data cleaning – Data migration: “replace the string gender by sex” – Data scrubbing: based on domain specific knowledge – Data auditing: a variant of data mining • data enrichment • data aggregation

Back End Tools and Utilities • Load – full loading: a long batch transaction, Back End Tools and Utilities • Load – full loading: a long batch transaction, takes a long time – incremental loading: during refresh • Refresh – when: periodically e. g. , daily or weekly – how: • extracting the entire source: sometimes the only way when dealing with legacy data sources • incremental refresh: supported by replication servers – data shipping – transaction shipping

Approaches to OLAP Servers • Relational OLAP (ROLAP) – Relational and Extended Relational DBHS Approaches to OLAP Servers • Relational OLAP (ROLAP) – Relational and Extended Relational DBHS to store and manage warehouse data • schema design • extended SQL • Multidimensional OLAP (MOLAP) – – Array-based storage structure (n-dimensional array) Direct access to array data structure Good indexing properties Poor storage utilisation when the data is sparse.

Front End Tools - Basic Functionality • • • Pivoting Rollup (drill-up) and Drill-down Front End Tools - Basic Functionality • • • Pivoting Rollup (drill-up) and Drill-down Slice-and-dice Ranking (sorting) Selection Computed attributes

Metadata Data about data • Administrative metadata (includes all information necessary for setting up Metadata Data about data • Administrative metadata (includes all information necessary for setting up and using a DW, e. g. Information about source databases, dw schemas, dimensions, hierachies, predefined queries, physical organisation, rules and script for extraction, transformation and load, back-end and front end tools) • Business metadata (business terms and definitions, ownership of data) • Operational metadata (information collected during the operations of the DW, e. g. usage statistics, error reports)

Metadata Repository • • • warehouse schema view & derived data definitions predefined queries Metadata Repository • • • warehouse schema view & derived data definitions predefined queries and reports data marts locations and contents data partitions data extraction, cleaning, transformations rules, defaults • data refresh and purging rules • user profiles, user groups • security: user authorisation, access control

Problems of Data Warehousing • • • Underestimation of resources for data loading Hidden Problems of Data Warehousing • • • Underestimation of resources for data loading Hidden problems with source systems Required data not captured Increased end-user demands Data homogenisation High demand of resources Data ownership High maintenance Long duration projects Complexity of integration

Problems of Data Warehousing • • • Underestimation of resources for data loading Hidden Problems of Data Warehousing • • • Underestimation of resources for data loading Hidden problems with source systems Required data not captured Increased end-user demands Data homogenisation High demand of resources Data ownership High maintenance Long duration projects Complexity of integration

Problems of Data Warehousing • • • Underestimation of resources for data loading Hidden Problems of Data Warehousing • • • Underestimation of resources for data loading Hidden problems with source systems Required data not captured Increased end-user demands Data homogenisation High demand of resources Data ownership High maintenance Long duration projects Complexity of integration

Data Warehouse vs. Data Mart (Navathe) • Enterprise warehouse: collects all information about subject Data Warehouse vs. Data Mart (Navathe) • Enterprise warehouse: collects all information about subject (customer, products, sales, assets, personnel) that span the entire organisation – Requires extensive business modelling – May take years to design and build • Data Mart: Departmental subsets that focus on selected subjects: Marketing data mart: customer, product, sales – Faster roll-out – Complex integration in the long term

The Data Warehouse Bus s Order tion Produc Dimensions Time Sales Rep Customer Promotion The Data Warehouse Bus s Order tion Produc Dimensions Time Sales Rep Customer Promotion Product Plant Distr. Center Allows the parallell dvlpmt of business process data marts with ability to integrate

The Business Dimensional Lifecycle Technical Architecture Design Product Selection & Installation Business Project Planning The Business Dimensional Lifecycle Technical Architecture Design Product Selection & Installation Business Project Planning Requirement Dimensional Modeling Physical Design Data Staging Design & Development Definition End-User Application Specification End-User Application Development Project Management Deployment Maintenance and Growth

What is data mining? Data Mining is data analysis in order to discover hidden What is data mining? Data Mining is data analysis in order to discover hidden correlations (pattern, rules) in huge data sets “Data Mining is the process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial business decisions” Cabena, Hadjinian, Stadler, Verhees, Zanasi

Enabling factors for data mining Data availability • Increased amount of electronically stored data Enabling factors for data mining Data availability • Increased amount of electronically stored data • Increased processing power • Increased data storage ability • Increased data gathering ability (networks, extraction tools) • Increased number of data warehouses Business conditions • Increased need to compete effectively • Increased awareness of need to know customers

Data mining uses in enterprises • Predict customer pattern of behaviour, e. g buying Data mining uses in enterprises • Predict customer pattern of behaviour, e. g buying pattern • Discover market developments driven by demographic changes • Discover shifts in consumption • Identification of new customers • Anticipation of demands on inventory

Data mining process Report results Business Problem Data mining expert needed This step: 70%-80% Data mining process Report results Business Problem Data mining expert needed This step: 70%-80% of the total time X Extraction and transformation of data Y, Z Analysis of results Mining data using specific function Data mining expert needed

Primary operations in data mining A number of basic operations/functions/techniques can be used for Primary operations in data mining A number of basic operations/functions/techniques can be used for prediction and depiction – – – Link Analysis: Associations discovery Link Analysis: Sequential pattern discovery Database Segmentation: Clustering Predictive modelling: Classification Predictive modelling: Value prediction Forensic analysis: Discover anomalous

Link Analysis: Association discovery • Occurrences that are linked to a single event, e. Link Analysis: Association discovery • Occurrences that are linked to a single event, e. g centered on the transaction • For exampel, discovers items that are bought/visited/done together. • Often in the form: x% of all record containing items A and B, also contain items D and E “When a customer buys orange juice: then the customer also buy brandy in 60% of cases”

Link Analysis: Sequential pattern discovery • Discover sequences, that show events linked over time Link Analysis: Sequential pattern discovery • Discover sequences, that show events linked over time • Often in the form: x% of the customers who get B will get C at a later time • Often used on a long time series of records in order to discover trends “ 20% of customers who buy a new carpet, will later buy new curtains”

Database Segmentation: Clustering • Clustering identifies undiscovered grouping • A cluster is a group Database Segmentation: Clustering • Clustering identifies undiscovered grouping • A cluster is a group of objects grouped together because of their similarity of proximity, for example similiar behavior Dept XXX X Profitable customers! XX XX X X XX Income

Predictive modelling: Classification • Classify data items into one of several predefined classes • Predictive modelling: Classification • Classify data items into one of several predefined classes • For example, to predict if a person is going to stay or leave as a customer Customer>2, 5 (yrs) Yes STAY (A) 3 No Service<3 2 1 LEAVE (B) 1 2 Yes STAY (C) 3 4 Services STAY (A) LEAVE (B) No STAY (C)

Predictive modelling: Value prediction • Value prediction or regression is a common statistical technique Predictive modelling: Value prediction • Value prediction or regression is a common statistical technique for modelling the relationship between two or more variables • Linear prediction/regression attempts to fit a straight line through data items. Nonlinear prediction attempts to fit a nonlinear line through data set, see fig. X X XX XXX X X

Mining in e-Commerce systems Information in a Web Server´s Log that can be used Mining in e-Commerce systems Information in a Web Server´s Log that can be used for data mining analysis: - cookie ID (anonymous user) - user ID (registred user), registration information - IP address, MAC nr - date, time - which webpages accessed and in what order - products sold to whom - “Comet_cursor” - Double-click

Problems in data mining • • Limited information Noise and missing values Spurious (false) Problems in data mining • • Limited information Noise and missing values Spurious (false) associations/patterns Expert knowledge needed

”We are drowing in information, but starving for knowledge” - John Naisbett ”We are drowing in information, but starving for knowledge” - John Naisbett