e05237d91d8185291f1dd2bf21cdfd55.ppt
- Количество слайдов: 55
Data and Knowledge Management Sumit Sarkar & Varghese S. Jacob University of Texas at Dallas Sumit Sarkar & Varghese Jacob, UT Dallas
Overview of Session u Data l l l Management Technical considerations: Databases and database management systems Design Issues Managerial issues u Data Warehousing u Knowledge Discovery and Data Mining u Knowledge Management
What Should Managers Know About Data Management? u Organizations store and use large quantities of data · Sheer volume of data alone means that proper management is essential u Data are a valuable resource that must be managed · Value can be derived by capturing, validating, and protecting the data Sumit Sarkar & Varghese Jacob, UT Dallas
What Should Managers Know About Database Management? u. Management of database is an important issue · No longer just a technical issue, the management of databases has become increasingly important throughout all types of organizations. Sumit Sarkar & Varghese Jacob, UT Dallas
Data Issues u. Amount of data we capture doubles approximately every 18 months. · How do we capture it? · Where do we store it? · What do we do with it? Sumit Sarkar & Varghese Jacob, UT Dallas
Database Approach u Database: A non redundant collection of logically related records or files. It enables a common pool of data records to serve many processing applications. u Abstraction of the real world. u Database Management Systems (DBMS): A set of computer programs that handle all access to the database. It allows one to define a collection of related files, manipulate this collection (i. e. insert, delete, and update) and read records. E. g. Access, Oracle u Data occupies the central position; referenced as needed. Sumit Sarkar & Varghese Jacob, UT Dallas
DBMS: Database Management Systems
Database Approach (cont’d) u Data Sharing: data is not the property of one person; one representation for each piece of data; avoids (minimizes) redundancy. u User Views: allows a user to have a view of the database that is different from the view used by others; user is isolated from changes in the data/programs. u Query Language: An English-like language designed for end-users to query the database. E. g. , Structured Query Language (SQL) u Database Administrator: Specialist who manages the database. Sumit Sarkar & Varghese Jacob, UT Dallas
Major Database Activities u Database development - design and implement the database u Database interrogation - query the database u Database maintenance - keep it operational u Application development - write application programs against it Sumit Sarkar & Varghese Jacob, UT Dallas
Database Administrator u Coordinates all the activities of the database management system u Must have a good understanding of the enterprise’s information resources and needs Sumit Sarkar & Varghese Jacob, UT Dallas
Database Administrator u Duties include: · DB Planning · Design & Development · Performance Analysis · Liaison with Users · Security and Authorization · Backup & Recovery · Maintenance · Training Sumit Sarkar & Varghese Jacob, UT Dallas
Database Users u Differentiated by the way they expect to interact with the system · Application programmers – interact with system through query languages and other programs · Sophisticated users – form requests in a database query language · Naive users – invoke one of the permanent application programs that have been written previously Sumit Sarkar & Varghese Jacob, UT Dallas
What Should Managers Know about Database Technology? u. How data is represented can significantly affect the ways in which data · can be manipulated to assist in decisionmaking · can be controlled and managed Sumit Sarkar & Varghese Jacob, UT Dallas
Data Models • Hierarchical • Network • Relational • Object Oriented • Multi-Dimensional
Which Data Model is Best? § Depends! § Transaction Processing (OLTP) • Relational • Hierarchical § Decision Support (OLAP) • Multi-Dimensional • Relational § Multi-Media § Object Oriented • Relational Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Model u. Most widely used type of database management system u. Examples: Oracle (Oracle), DB 2 (IBM), Informix (Informix), SQL Server (Microsoft), MS Access (Microsoft), etc. Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Model u One basic construct: the relation u Relations represent both entities and relationships u Data Manipulation Language is English-like (e. g. , SQL) u Note - A relationship occurs during conceptual design (E-R Model) and should not be confused with a relation Sumit Sarkar & Varghese Jacob, UT Dallas
Book Relation Column Record Field Sumit Sarkar & Varghese Jacob, UT Dallas
Notation u Relations can be expressed concisely by eliminating the sample data and showing just the relation name and the column names. For example: Book: [title-id, title, type, price, pub_id] (Note that the key is underlined) Sumit Sarkar & Varghese Jacob, UT Dallas
Properties of relations u Each row is unique. No two rows may have identical values. u Each column must have a unique name. u The sequence of columns (left to right) is insignificant. u The sequence of rows (top to bottom) is insignificant. u Attribute values are atomic Sumit Sarkar & Varghese Jacob, UT Dallas
Relations and keys u. A primary key is a column or group of columns in a relation whose values uniquely identify each row in a relation. u. A foreign key is a column or group of columns whose values are the same as the primary key of another relation. You can think of a foreign key as a copy of the primary key from another relation. Note that you do not underline a foreign key. Sumit Sarkar & Varghese Jacob, UT Dallas
Example Patient Bill Patient No Location 3249 137 -2 Patient Name Date Admitted 5/6/99 Date Discharged 5/9/99 Item_Code Description 200 Room Semi-pr 205 Television 307 X-Ray Mary Baker Charge $150. 00 $100. 00 Total: $260. 00 Sumit Sarkar & Varghese Jacob, UT Dallas
Example (Cont. ) Patient & Charges Data P_No Patient_ Patient Fname Lname Date Admit Date Disch Item Code Desc Charge 1379 John Cribbs 4/23/99 4/28/99 413 307 Lab Test $40. 00 X-Ray $100. 00 3249 Mary Baker 5/6/99 5/9/99 307 X-Ray 3249 Mary Baker 5/6/99 5/9/99 205 Television $10. 00 3249 Mary Baker 5/6/99 5/9/99 200 Room S-pr$150. 00 6213 David Rose 5/2/99 5/6/99 205 Television $10. 00 $100. 00 Redundancies and Anomalies Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Databases (RDBMS) PATIENT TABLE ITEM TABLE CHARGES TABLE
Design Steps User Requirements Conceptual Design (E. g. entity-relationship model) Logical Design (E. g. relational model) Physical Design (optimize for performance) Sumit Sarkar & Varghese Jacob, UT Dallas
Relational Operations u. Database operators (operations) represent ways in which data can be manipulated to assist in decisionmaking · Without some sense of the possibilities of queries and reports, people often have a misconception of what they can expect from a database. Sumit Sarkar & Varghese Jacob, UT Dallas
What does the user want? • Patients admitted after February 26, 2001 • Items with Charge Amount > $100 Select • Admit Date for all Patients Project • All Patients and associated Charges Join • Patients in Semi Private rooms Combination
Relational Databases (RDBMS) PATIENT TABLE ITEM TABLE CHARGES TABLE
Security Features u. Four Dimensions • Who (User) • What (Data Objects) • Where (Point of Access) • How (Nature of Access)
Questions?
In Class Exercise XYZ is a luxury hotel chain with several branches across the US. The clientele consists of corporate travelers and well-heeled individuals. A key to XYZ’s success is in providing personalized attention to its guests. XYZ relies on two information systems. One is GLOBAL, which is a centralized reservation system owned by an international travel organization TRV. The other is LOCAL, which keeps records of reservations for each individual hotel and preferences for guests who have stayed in each hotel. Given their clientele, XYZ’s management feels that it is imperative to remember their repeat guests, and their preferences. While LOCAL allows a hotel to easily access its own guest list, it is currently not possible to locate the preferences of a guest who may have stayed in a different hotel of the chain. A related problem is that when a repeat guest with a common name (e. g. , John Doe) calls, all the John Doe’s in the system would be displayed, and the reservation agents would have no way of knowing who that caller is. Furthermore, if John Doe had been listed previously as John Dow, the system would not recognize this as the same person. Consequently, a new record would be created for John Doe.
In Class Exercise (cont. ) Your group has been designated to address the above problems. § What could be done to enable a reservation agent to access data from some other hotel? What issues must you consider in making such a decision? § What can help in identifying a caller with a common name? § What can be done to resolve the “duplicate records” problem? § In your preliminary discussions with the CEO, the possibility of moving to a completely new system has been considered. You believe that it would be easier to justify a new system if additional value could be generated from the data on your guests. What could be possible ways to achieve that? § Lead discussion group: 0153
DATA WAREHOUSING § Repository of Data for Decision Support Applications § End Users: Business Analysts, Executive Management, etc. § Goal: Deliver needed data (information) when user wants, in form that user understands, to user’s preferred location
DATA WAREHOUSING § Working Definitions: l l Technique to assemble and manage data from various sources in order to gain a single, detailed view of a business (Stephen R. Gardner, NCR) An integrated and consistent store of subject oriented data, structured for query and retrieval in order to support management decision making
TYPICAL APPLICATIONS u Marketing · · Applications Scanner Data Market Basket analyses One-to-one Marketing Customer Support u Logistics u Financial Services u Health Care
DRIVERS FOR DATA WAREHOUSING § Business Drivers • Quick reaction to business needs • Extensive analysis requirements • E. g, Understanding customer behavior, Assessing performance in different markets, etc.
DRIVERS FOR DATA WAREHOUSING § Data Drivers • Patterns hidden in transactional data • Need for “Unified View” of data from “Islands” of data ØExternal data needs ØDisparate platforms and formats ØMultiple database technologies ØDifferent application priorities
DATA WAREHOUSE ARCHITECTURE
KEY FUNCTIONS: FRONT END § Data Retrieval and Analysis • Querying Tools • Multidimensional Analysis • Data Visualization • Knowledge Discovery and Data Mining
KEY FUNCTIONS: BACK END § Data Preparation • Extraction • Transformation • Cleansing • Data Integration/Reconciliation • Aggregation • Summarization • Loading
MULTIDIMENSIONAL ANALYSIS § Business Analysis based on • Products Ø Trends • Customers in Sales of Product P 23 (Jazz Drives) Ø Purchase • Dates • habits of Customer C 1234 (Jill) Ø Seasonality in sales Combination of dimensions (Product & Date) Ø Sales of Product P 23 in Spring § On-Line Analytical Processing (OLAP): Techniques and tools used to analyze data in warehousing environments
MULTIDIMENSIONAL DATABASES § OLAP tools: On Line Analytical Processing • Relational OLAP (ROLAP) • Multidimensional OLAP (MOLAP)
RELATIONAL: Example Sumit Sarkar & Varghese Jacob, UT Dallas
MULTIDIMENSIONAL: Example § Implications for viewing data from an end-user standpoint: Slice the data on relevant dimensions Sumit Sarkar & Varghese Jacob, UT Dallas
Data Warehouse Vs Operational (Transactional) Database Data Warehouse Operational DB Integrated Limited Integration Non-volatile Continuously Updated Ad-hoc Retrieval Predictable Retrieval Historical data Current data
Development Approaches u Data Centered · Variety of Applications Supported u Application Centered (Data Marts) · Quick implementation · Low cost (relatively) · High payoffs u Staged Data Marts
DATA WAREHOUSING LIFE CYCLE § Define Business Strategy § Identify Scope § Design Architecture/Infrastructure § Determine Data Model § Identify Data Extraction Routines § Implementation & Testing § On-going application development §Process, not Product
KNOWLEDGE DISCOVERY & DATA MINING (KDD) § Process of Identifying • Novel • Valid • Potentially Useful ØPatterns in Data
KDD: Concepts u Pattern · If Income > $20, 000 and House-owner then person will not default payment u Novel · Relationships previously unknown to users u Valid · Discovered patterns should be true for new situations in general u Potentially Useful · Interesting · Actionable
Sample Examples of Data Mining u Campaign management by propensity modeling · u Risk management by credit scoring · u People who bought Data Mining books are likely to buy Data Warehousing books Customized products and care · u Customers with a credit score of 700 or less default at an average rate of 5% Cross-selling by affinity prediction · u Young Males are 75% more likely to respond to sports memorabilia offers than young girls Residents in 90210 prefer Geo Metros Assignment of call center agents · People who call on Monday Mornings call due to billing problems Sumit Sarkar & Varghese Jacob, UT Dallas
Data Mining: Analytical Techniques u Regression (Linear and Non-Linear) u Multivariate Analysis · Discriminant Analysis · Clustering · Multi-Dimensional Analysis (MDA) u Decision Tree Induction u Neural Networks u Dependency Graphs & Belief Networks u Genetic Algorithms
Web Mining: Emerging Area u Types of Web mining · Web content mining · Web structure mining · Web usage mining u Interesting problems on Web mining · Mining what Web search engine finds · Weblog mining (usage, access) · Web document classification u New Applications · Personalization of web sites · Targeted advertising on web sites · Product recommendations (Books, Movies etc. ) Sumit Sarkar & Varghese Jacob, UT Dallas
COLLABORATIVE FILTERING u Very popular in Recommendation systems u E. g. , when a customer of Amazon. com logs on, they are provided recommendations for books, music, etc. How is it done? u Case-based approach used to identify other customers with similar profiles (e. g. , common purchases, demographics, etc. ) u Recommendation made based on purchases of other such customers Sumit Sarkar & Varghese Jacob, UT Dallas
Data Mining Myths u Data Mining is a Magic Wand u All patterns are useful u You don’t need to understand statistical tools to use data mining Sumit Sarkar & Varghese Jacob, UT Dallas
Knowledge Management u u Case - KPMG Peat Marwick U. S. : One Giant Brain Lead discussion group 0224 • • • u Which primary activity of KPMG’s value chain is most significantly impacted by it’s Knowledge Management initiative? Why was the Shadow Partner initiative not implemented? Was that a good decision? How should KPMG approach the technology and systems implementation issue? Lead discussion group 0225 · How should knowledge and intellectual assessment be performed? • How can KPMG integrate knowledge management activities into its business processes? • What cultural changes are necessary? How can they be achieved? Sumit Sarkar & Varghese Jacob, UT Dallas


