Скачать презентацию What is a Data Warehouse by W H Скачать презентацию What is a Data Warehouse by W H

9eae20fd9bcf75dd8b9e665f549c34f0.ppt

  • Количество слайдов: 33

What is a Data Warehouse by W. H. Inmon http: //www. cait. wustl. edu/cait/papers/prism/vol What is a Data Warehouse by W. H. Inmon http: //www. cait. wustl. edu/cait/papers/prism/vol 1_no 1/

What is a Data Warehouse? • A data warehouse is a: subject-oriented, integrated, time-variant, What is a Data Warehouse? • A data warehouse is a: subject-oriented, integrated, time-variant, nonvolatile, collection of data in support of management's decision making process • The data comes from the operational environment • The data warehouse is always a physically separate store

Difference Between Operational Systems and Data and the Data Warehouse (DW) • DW is Difference Between Operational Systems and Data and the Data Warehouse (DW) • DW is oriented around the major subjects of the enterprise • The data-driven, subject orientation is in contrast to the more classical process/functional orientation of applications • The DW world focuses on data modeling and database design exclusively • DW data excludes data that will not be used for DSS processing • DW data spans a spectrum of time and the relationships found in the data warehouse are many

The data warehouse has a strong subject orientation Operational Data warehouse Customer Loans Savings The data warehouse has a strong subject orientation Operational Data warehouse Customer Loans Savings Vendor Product Bank card Activity Trust An application orientation A subject orientation

Integration • Data found within the DW is integrated – – – ALWAYS WITH Integration • Data found within the DW is integrated – – – ALWAYS WITH NO EXCEPTIONS consistent naming conventions consistent measurement of variables consistent encoding structures consistent physical attributes of data • data needs to be stored in the DW in a singular, globally-acceptable fashion

When data is moved to the DW from the application-oriented operational environment, the data When data is moved to the DW from the application-oriented operational environment, the data is integrated before entering the DW Operational Data warehouse appl A - m, f appl B - 1, 0 appl C - x, y appl D - male, female m, f pipeline cm appl A - pipeline cm appl B - pipeline inches appl C - pipeline mcf appl D - pipeline yds appl A - balance dec fixed (13, 2) appl B - balance pic 9(9)v 99 appl C - balance dec fixed (11, 0) appl D - balance pic s 9(7)v 99 comp 3 appl A - description appl B - description appl C - description appl D - description appl A - bal-on-hand appl B - current-balance appl C - cash-in-house appl D - balance appl A - date (Julian) appl B - date (yymmdd) appl C - date (mmddyy) appl D - date (absolute) ? balance dec fixed (13, 2) description balance date (Julian)

Integration • The collective ability of many application designers to create inconsistent applications is Integration • The collective ability of many application designers to create inconsistent applications is legendary • The integration affects almost every aspect of design - the physical characteristics of data, the dilemma of having more than one source of data, the issue of inconsistent naming standards, inconsistent date formats, and so forth

Time Variancy • All data in the data warehouse is accurate as of some Time Variancy • All data in the data warehouse is accurate as of some moment in time (i. e. , not "right now") • In the operational environment data is accurate as of the moment of access • Data found in the warehouse is said to be "time variant

Time Variancy Operational Data warehouse Current value data: • time horizon -- 60 - Time Variancy Operational Data warehouse Current value data: • time horizon -- 60 - 90 days • key may or may not have an element of time • data can be updated Snapshot data: • time horizon -- 5 - 10 years • key contains an element of time • once snapshot is made, record cannot be updated

Nonvolatile Change Replace Insert Replace Load Access Change Operational Data is updated on a Nonvolatile Change Replace Insert Replace Load Access Change Operational Data is updated on a recordby-record basis regularly Data warehouse Data is loaded into the warehouse and is accessed there, but once the snapshot of data is made, the data in the warehouse does not change

Nonvolatile • The basic manipulation of data that occurs in the data warehouse is Nonvolatile • The basic manipulation of data that occurs in the data warehouse is simple • There are only two kinds of operations – the initial loading of data – the access of data • There is no update of data • The need to be cautious of the update anomaly is no factor • Liberties can be taken to optimize the access of data

Nonvolatile • Another consequence is in the technology • Technologies to support: – – Nonvolatile • Another consequence is in the technology • Technologies to support: – – record-by-record update in an on-line mode backup and recovery transaction and data integrity detection and remedy of deadlock are quite complex and unnecessary for data warehouse processing • DW environment is VERY, VERY different from the classical operational environment

Nonvolatile • The source of nearly all data warehouse data is the operational environment Nonvolatile • The source of nearly all data warehouse data is the operational environment • It is a temptation to think that there is massive redundancy of data between the two environments • In fact there is a MINIMUM of data redundancy – data is filtered; much data never passes out of the operational environment – the time horizon of data is very different – the data warehouse contains summary data – data undergoes a fundamental transformation as it passes into the data warehouse

The Structure of the Warehouse • Data warehouses have a distinct structure • Different The Structure of the Warehouse • Data warehouses have a distinct structure • Different components of the data warehouse are: – – – meta data current detail data older detail data lightly summarized data highly summarized data • The major concern is the current detail data – – – the most recent happenings are always of great interest voluminous, stored at the lowest level of granularity disk storage is fast to access but expensive and complex to manage

There are different levels of summarization and detail that demark the data warehouse Highly There are different levels of summarization and detail that demark the data warehouse Highly summarized M E T A D A T A Lightly summarized Current data Older detail data

The Structure of the Warehouse • Older detail data is stored on some form The Structure of the Warehouse • Older detail data is stored on some form of mass storage – it is infrequently accessed – it is stored at a level of detail consistent with current detailed data • Lightly summarized data is distilled from the low level of detail found at the current detailed level – it is almost always stored on disk storage – the design issues are: • what unit of time is the summarization done over • what attributes will the lightly summarized data contain

The Structure of the Warehouse • Highly summarized data is compact and easily accessible The Structure of the Warehouse • Highly summarized data is compact and easily accessible • Meta data plays a special and very important role in the data warehouse • It is used as: – a directory to help locate the contents – a guide to the mapping of data as the data is transformed from the operational to the DW environment – a guide to the algorithms used for summarization

An example of the levels of summarization that might be found in the data An example of the levels of summarization that might be found in the data warehouse national sales by month 1988 -1996 national sales by week 1986 -1996 M E T A monthly sales by product line 1993 -1996 weekly sales by subproduct 1988 -1996 sales detail 1995 -1996 D A T A sales detail 1985 -1994

An Example of the Data Warehouse • Old sales detail is that detail about An Example of the Data Warehouse • Old sales detail is that detail about sales that is older than 1995 • The current value detail contains data from 1995 to 1996 • The sales detail is summarized weekly by subproduct line and by region to produce the lightly summarized stores of data • The weekly sales detail is further summarized monthly along even broader lines to produce the highly summarized data • Meta data contains (at the least!): – the structure of the data – the algorithms used for summarization – the mapping from the operational environment to the data warehouse

Old Detail Storage Medium • A wide variety of storage media that should be Old Detail Storage Medium • A wide variety of storage media that should be considered for storing older detail data – – – photo optical storage CD-ROM micro fiche magnetic tape mass storage • It is entirely likely that other storage media will serve the needs

The Flow of Data Inside the Data Warehouse Summarization process Operational environment Aging process The Flow of Data Inside the Data Warehouse Summarization process Operational environment Aging process

Flow of Data • As data enters the data warehouse from the operational environment, Flow of Data • As data enters the data warehouse from the operational environment, it is transformed • Upon entering the data warehouse, data goes into the current detail level of detail • It resides there and is used there until one of three events occurs: – it is purged – it is summarized, and/or – it is archived

The higher the levels of summarization, the more the usage of the data The higher the levels of summarization, the more the usage of the data

Summarized Data • The more summarized the data, the quicker and more efficient it Summarized Data • The more summarized the data, the quicker and more efficient it is to get to the data • The DSS analyst in a pre-data warehouse environment has used data at the detailed level • One of the tasks of the data architect is to wean the DSS user from constantly using data at the lowest level of detail – installing a chargeback system – pointing out very good response time when dealing with data at a high level of summarization

Other Considerations • Data at the higher levels of summarization can be freely indexed Other Considerations • Data at the higher levels of summarization can be freely indexed • Data at the lower levels of detail is so voluminous that it can be indexed sparingly • The data model and formal design applies almost exclusively to the current level of detail • The data modeling activities do not apply to the levels of summarization

Indexes and Data Model Data model Indexes and Data Model Data model

Partitioning of DW Data • Partitioning can be done in two ways – at Partitioning of DW Data • Partitioning can be done in two ways – at the DBMS level • the DBMS is aware of the partitions and manages them accordingly • the automatic management of the partitions is inflexible – at the application level • the responsibility for the management of the partitions is left up to the programmer • provides flexibility in the management of data in the data warehouse

Current detail data is almost always partitioned Current detail data is almost always partitioned

The internal structuring of data in a sample data warehouse current detailed data vendor/supplier The internal structuring of data in a sample data warehouse current detailed data vendor/supplier history customer history order/customer 91 - present 88 - present parts/order history 91 92 93 94 95 part/order parts manufacture history q 1 q 2 q 3 q 4 93 93 94 94 95 95 part q 1 96 part/assembly parts shipments assembly history 91 92 93 94 95 87 - present parts bill of material 92 93 94 95

An Example of a Data Warehouse • The levels of summarization are not shown, An Example of a Data Warehouse • The levels of summarization are not shown, nor is the old detail archive shown • There are tables of the same type divided over time • For different types of tables there are different units of time physically dividing the units of data • Different tables are linked by means of a common identifier

Other Anomalies • Public summary data is summary data that has been calculated outside Other Anomalies • Public summary data is summary data that has been calculated outside the boundaries of the data warehouse but is used throughout the corporation • Another anomaly is that of external data • Another exceptional type of data sometimes found in a data warehouse is that of permanent detail data stored for ethical or legal reasons – the medium the data is stored on must be as safety proof as possible – the data must be able to be restored – the data needs special treatment in the indexing to be accessible

Summary A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in Summary A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision needs There are four levels of data warehouse data: – – old detail current detail lightly summarized data highly summarized data Meta data is also an important part of the data warehouse environment