b362b591c252911a373adae9c01f18a5.ppt
- Количество слайдов: 36
ISQS 3358, Business Intelligence Data Warehousing Zhangxi Lin Texas Tech University
Outlines So far students have learned ◦ Basic concepts of business intelligence ◦ The definition and importance of data warehouse In this lecture, the following topics will be covered ◦ SQL Server 2008 data mart case study How to access data in a network directory How to access SQL Server 2008 on the Citrix Server How to load data from an Excel file to a database ◦ Data warehouse overview ◦ Data warehouse architecture ISQS 3358 BI 2
Data Warehousing Definitions and Concepts Data warehouse ◦ Video – Overview of data warehouse 2’ 38” A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format Benefits of data warehouse 3’ 18” ISQS 3358 BI 3
Data Mart - The IMW Case IMW, standing for Internet Media Works!, is an ASP in real estate information services. It is headquartered in Austin, Texas. CEO is Gary Anderson. Web page: http: //www. inetworks. com
Why need Data Mart? Data mart complements the centralized data warehousing based on UDM model, for the situations where UDM cannot be used ◦ ◦ Legacy databases Data are from nondatabase sources No physical connection the centralized data warehouse Data are not clean ISQS 3358 BI 5
Data mart Definition A localized data warehouse that stores only relevant data to a department or even an individual ◦ Dependent data mart A subset that is created directly from a data warehouse ◦ Independent data mart A small data warehouse designed for a strategic business unit or a department ISQS 3358 BI 6
Commrex Real Estate Operational Database Users: property listors, webmaster, marketing manager of IMW Objective: Encourage realtors to use the online ASP services with the best information services to increase IMW’s revenue. Value Chain ◦ Listors create their account ◦ Listors post their real estate properties to the web-based database services and pay listing fees ◦ Property buyers search the website-based database and buy properties from listors. This is the incentive for listors to use the ASP services Business Processes ◦ Listor sign up ◦ Listor account management ◦ Property data posting ◦ Property search ◦ Property database maintenance 7 ISQS 3358 BI 7
IMW’s Database ERD Model Property Listing Database Transaction. ID M: 1 User. ID Property ID Prop. ID Listor ID M: M Property Type Membership Database M: 1 Listor ID Listor Name Property Type Name Address Company ID Subtype 1 City Subtype 2 Chapter Update. Date Feature Subtype n M: M Functions Specializations Legends Company ID Comp Name Primary Key Address Secondary Key Telephone # Link to a table ISQS 3358 BI 8
Data Mart Structures Fact tables ◦ Measures Dimension tables ◦ Dimensions and Hierarchies ◦ Attributes (or columns) Dimensional modeling – Stars and Snowflakes ISQS 3358 BI 9
Measures A numeric quantity expressing some of the organization’s performance. The information represented by this quantity is used to support or evaluate the decision making and performance of the organization. A measure is also called a fact The table holding measure information is called as a fact table Dimensions vs. Measures 2’ 38” ISQS 3358 BI 10
Commrex Data Warehousing Users: CEO of IMW, IMW business analyst, IMW marketing manager Analytic themes ◦ Fast retrieval of business key performance indicators (KPIs) ◦ Decision making on business promotions Applications ◦ Geographic distribution of property listings ◦ Scorecard for main performance indicators ◦ Dashboard Questions ◦ ◦ How to model data warehouse? What are required in data transformation and preprocessing? Any missing dimension for data ware housing? How to perform routine data warehouse updates – frequency, timing, etc. ISQS 3358 BI 11
IMW’s Data Warehouse Dimensional Model Property Listing Fact Property Type Dimension Membership Dimension Property ID Listor ID Listor Name Prop. Type Sub. Name Address Company ID City Chapter … Update. Date Functions Features Specializations Year Legends Primary Key Company ID Quarter Comp Name Month Date Company Dimension Address Secondary Key Telephone # Link to a table ISQS 3358 BI 12
Data Warehouse Overview
Data Warehousing Characteristics Basic characteristics of data warehousing ◦ ◦ Subject oriented Integrated Time variant (time series) Nonvolatile (not allow to change) Others ◦ ◦ ◦ Web based Relational/multidimensional Client/server Real-time Include metadata ISQS 3358 BI 14
Data Warehousing Process Overview Data in DW are constantly accumulated. ◦ Organizations continuously collect data, information, and knowledge at an increasingly accelerated rate and store them in computerized systems The number of users is constantly increasing. ◦ The number of users needing to access the information continues to increase as a result of improved reliability and availability of network access, especially the Internet The organization using data warehouse relied on DW more and more ISQS 3358 BI 15
Data Warehousing More Concepts Operational data stores (ODS) A type of database often used as an interim area for a data warehouse, especially for customer information files Enterprise data warehouse (EDW) A large-scale data warehouse used across the enterprise for decision support. It integrates different sources of information into a consolidated information system. Metadata (Video 1’ 41”) Data about data. In a data warehouse, metadata describe the contents of a data warehouse and the manner of its use ◦ Syntactic metadata, structural metadata, and semantic metadata ISQS 3358 BI 16
Data Warehousing Process Overview ISQS 3358 BI 17
Data Warehousing Process Overview The major components of a data warehousing process ◦ Data sources ◦ Data extraction ◦ Data loading ◦ Comprehensive database ◦ Metadata ◦ Middleware tools ISQS 3358 BI 18
Data Warehouse Architectures
Three Parts of Data Warehouse The data warehouse that contains the data and associated software Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse Client (front-end) software that allows users to access and analyze data from the warehouse ISQS 3358 BI 20
Three-Tier Data Warehouse ISQS 3358 BI 21
Alternative Data Warehouse Architectures (1) ISQS 3358 BI 22
Alternative Data Warehouse Architectures (2) ISQS 3358 BI 23
Alternative Data Warehouse Architectures (3) ISQS 3358 BI 24
Alternative Data Warehouse Architectures (4) ISQS 3358 BI 25
Alternative Data Warehouse Architectures (5) ISQS 3358 BI 26
Architectures Comparison ISQS 3358 BI 27
Teradata’s EDW ISQS 3358 BI 28
Structure and Components of Business Intelligence MS SQL Server 2008 SSMS SSIS SSAS BIDS SSRS SAS EG SAS EM ISQS 3358 BI 29
Exercise 1 – Walk through data warehousing process Learning Objectives ◦ To gain a general impression how to use SQL Server 2008 to implement a data mart Tasks ◦ Create your database with SSMS, named as ISQS 3358 -002 -2016 -lastname ◦ Import data from Commrex_2011. xls Path: \techsharecobadisqs 3358datasetcommrex_2011. xls ◦ Use SSMS to create a ERD diagram ◦ Create a SSAS project using BIDS ◦ Define data source, data source view, and cube Deliverable: ◦ One-page printout of the cube diagram screenshot ISQS 3358 BI 30 30
Election data analysis HARVARD ELECTION DATA ARCHIVE ◦ Sharing and Improving Election Data ◦ Understand the data -ak_geo_id - row identifier from raw Census VTD file -ak_vote_id - row identifier from raw vote total file -VAP - Voting Age Population from Census FTP -Pop 100 - full population from Census FTP -STATEFP 10 - state code from census -COUNTYFP 10 - county code from census -VTDST 10 - voting district code from census -GEOID 10 - geo identifier from census -VTDI 10 - voting code from census -Name 10 - voting district string - best description -Name. LSAD 10 - voting district string, with ``voting district at end …… ISQS 3358 BI 31
Distributed business intelligence Deal with big data – the open & distributed approach ◦ LAMP: Linux, Apache, My. SQL, PHP/Perl/Python ◦ Hadoop (video) ◦ Map. Reduce ◦ HDFS ◦ NOSQL ◦ Zookeeper ◦ Storm ISQS 3358 BI 32
Hadoop – for BI in the Cloudera Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Hadoop was inspired by Google's Map. Reduce, a software framework in which anapplication is broken down into numerous small parts. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. ISQS 3358 BI 33
Map. Reduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster or a grid. ISQS 3358 BI 34
Cloudera’s Hadoop System ISQS 3358 BI 35
Comparison between big data platform and traditional BI platform ISQS 3358 BI 36
b362b591c252911a373adae9c01f18a5.ppt