

  • Number of slides: 25

Other Topics 2: Warehousing, Mining and Information Retrieval
Database System Concepts, 6th Ed.
©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use

Decision Support Systems
• Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
• Examples of business decisions:
  • What items to stock?
  • What insurance premium to charge?
  • To whom to send advertisements?
• Examples of data used for making decisions:
  • Retail sales transaction details
  • Customer profiles (income, age, gender, etc.)

Decision-Support Systems: Overview
• Data analysis tasks are simplified by specialized tools and SQL extensions
  • Example tasks
    • For each product category and each region, what were the total sales in the last quarter, and how do they compare with the same quarter last year?
    • As above, for each product category and each customer category
• Statistical analysis packages (e.g., S++) can be interfaced with databases
  • Statistical analysis is a large field, but not covered here
• Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
• A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
  • Important for large businesses that generate data from multiple divisions, possibly at multiple sites
  • Data may also be purchased externally
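
The first example task above can be sketched as a single grouped query. This is only an illustration using Python's built-in sqlite3 module; the `sales` table, its column names, and the sample rows are assumptions invented for the example, not part of the slides.

```python
import sqlite3

# Hypothetical sales table; schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (category TEXT, region TEXT, year INTEGER, quarter INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("clothing", "east", 2023, 4, 100.0),
        ("clothing", "east", 2024, 4, 150.0),
        ("clothing", "west", 2023, 4, 80.0),
        ("clothing", "west", 2024, 4, 60.0),
        ("toys", "east", 2023, 4, 40.0),
        ("toys", "east", 2024, 4, 70.0),
    ],
)


def quarterly_comparison(conn, year, quarter):
    """Total sales per (category, region) for a quarter and for the same quarter one year earlier."""
    query = """
        SELECT category, region,
               SUM(CASE WHEN year = ? THEN amount ELSE 0 END) AS this_year,
               SUM(CASE WHEN year = ? THEN amount ELSE 0 END) AS last_year
        FROM sales
        WHERE quarter = ? AND year IN (?, ?)
        GROUP BY category, region
        ORDER BY category, region
    """
    return conn.execute(query, (year, year - 1, quarter, year, year - 1)).fetchall()


rows = quarterly_comparison(conn, 2024, 4)
for category, region, this_year, last_year in rows:
    print(category, region, this_year, last_year)
```

Replacing `region` with a customer-category column would give the second example task in the same way.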

Data Warehousing
• Data sources often store only current data, not historical data
• Corporate decision making requires a unified view of all organizational data, including historical data
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  • Greatly simplifies querying, permits study of historical trends
  • Shifts decision support query load away from transaction processing systems

Data Warehousing

Warehouse Schemas
• Dimension values are usually encoded using small integers and mapped to full values via dimension tables
• Resultant schema is called a star schema
  • More complicated schema structures
    • Snowflake schema: multiple levels of dimension tables
    • Constellation: multiple fact tables
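
A minimal star-schema sketch, again using Python's sqlite3 module. The table names (`sales`, `item_info`, `store`), their columns, and the rows are hypothetical, chosen only to show a fact table holding small integer keys that are resolved through dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables map small integer keys to full descriptive values.
    CREATE TABLE item_info (item_id INTEGER PRIMARY KEY, item_name TEXT, category TEXT);
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT);

    -- The fact table stores only the integer keys plus the measures.
    CREATE TABLE sales (
        item_id INTEGER REFERENCES item_info(item_id),
        store_id INTEGER REFERENCES store(store_id),
        number INTEGER,
        price REAL
    );

    INSERT INTO item_info VALUES (1, 'shirt', 'clothing'), (2, 'robot', 'toys');
    INSERT INTO store VALUES (1, 'Boston'), (2, 'Denver');
    INSERT INTO sales VALUES (1, 1, 2, 20.0), (1, 2, 1, 20.0), (2, 1, 3, 15.0);
""")


def revenue_by_category_and_city(conn):
    """Join the fact table to both dimension tables and aggregate revenue."""
    query = """
        SELECT i.category, s.city, SUM(f.number * f.price) AS revenue
        FROM sales AS f
        JOIN item_info AS i ON i.item_id = f.item_id
        JOIN store AS s ON s.store_id = f.store_id
        GROUP BY i.category, s.city
        ORDER BY i.category, s.city
    """
    return conn.execute(query).fetchall()


rows = revenue_by_category_and_city(conn)
```

A snowflake schema would add a further table that a dimension table itself references (for example, a hypothetical `category_info` keyed from `item_info`).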

Data Warehouse Schema

Data Mining

Data Mining
• Data mining is the process of semi-automatically analyzing large databases to find useful patterns
• Prediction based on past history
  • Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history
  • Predict if a pattern of phone calling card usage is likely to be fraudulent
• Some examples of prediction mechanisms:
  • Classification
    • Given a new item whose class is unknown, predict to which class it belongs
  • Regression formulae
    • Given a set of mappings for an unknown function, predict the function result for a new parameter value

Data Mining (Cont.)
• Descriptive Patterns
  • Associations
    • Find books that are often bought by “similar” customers. If a new such customer buys one such book, suggest the others too.
  • Associations may be used as a first step in detecting causation
    • E.g. association between exposure to chemical X and cancer
  • Clusters
    • E.g. typhoid cases were clustered in an area surrounding a contaminated well
    • Detection of clusters remains important in detecting epidemics

Classification Rules
• Classification rules help assign new objects to classes.
  • E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
• Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.
  • ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  • ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
• Rules are not necessarily exact: there may be some misclassifications
• Classification rules can be shown compactly as a decision tree.
• Several algorithms for constructing decision trees: see book for details
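
The two rules above can be read as a small function. This is a sketch only: the function name is hypothetical, the bounds of the second rule are assumed to be inclusive, and returning `None` when no rule fires is a choice made for the example.

```python
def classify_credit(degree, income):
    """Apply the two example rules; return None when neither rule fires."""
    # Rule 1: masters degree and income above 75,000 => excellent
    if degree == "masters" and income > 75_000:
        return "excellent"
    # Rule 2: bachelors degree and income between 25,000 and 75,000
    # (bounds assumed inclusive) => good
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    # No rule covers this person.
    return None


print(classify_credit("masters", 90_000))
print(classify_credit("bachelors", 50_000))
print(classify_credit("bachelors", 10_000))
```

A decision tree encodes the same tests as nested branches, for example first on `degree` and then on `income`.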

Decision Tree

Other Types of Classifiers
• Neural net classifiers are studied in artificial intelligence and are not covered here
• Bayesian classifiers (see book for details)
• Support Vector Machines (see book for details)

Association Rules
• Retail shops are often interested in associations between different items that people buy.
  • Someone who buys bread is quite likely also to buy milk
  • A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
• Association information can be used in several ways.
  • E.g. when a customer buys a particular book, an online shop may suggest associated books.
• Association rules:
    bread ⇒ milk
    DB-Concepts, OS-Concepts ⇒ Networks
  • Left hand side: antecedent, right hand side: consequent
  • An association rule must have an associated population; the population consists of a set of instances
    • E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population

Association Rules (Cont.)
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  • E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
• Confidence is a measure of how often the consequent is true when the antecedent is true.
  • E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
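
Support and confidence as defined above can be computed directly from a list of transactions. The sketch below assumes each transaction is a Python set of item names; the ten sample purchases are invented so that bread ⇒ milk comes out at 80 percent confidence.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing both antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return sum(1 for t in transactions if both <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0  # rule never applies in this population
    return sum(1 for t in with_antecedent if both <= t) / len(with_antecedent)


# Invented population: each set is one purchase (instance).
purchases = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "milk", "screwdrivers"},
    {"bread"},
    {"eggs"},
    {"milk"},
    {"eggs", "cereal"},
    {"cereal"},
    {"screwdrivers"},
]

print(support(purchases, {"bread"}, {"milk"}))
print(confidence(purchases, {"bread"}, {"milk"}))
print(support(purchases, {"milk"}, {"screwdrivers"}))
```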

Clustering
• Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
• Can be formalized using distance metrics in several ways
  • Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
    • Centroid: point defined by taking average of coordinates in each dimension.
  • Another metric: minimize average distance between every pair of points in a cluster
• Has been studied extensively in statistics, but on small data sets
  • Data mining systems aim at clustering techniques that can handle very large data sets
  • E.g. the Birch clustering algorithm (more shortly)
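
The first formalization above can be made concrete by evaluating its objective for a given grouping. The sketch below only scores candidate groupings with Euclidean distance (an assumption; the slides do not fix a metric) and does not search for the best grouping. It is not the Birch algorithm.

```python
import math


def centroid(points):
    """Average of the coordinates in each dimension."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def average_distance_to_centroid(clusters):
    """Average distance of every point from the centroid of its assigned cluster."""
    total = 0.0
    count = 0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(math.dist(p, c) for p in cluster)
        count += len(cluster)
    return total / count


# Two candidate groupings (k = 2) of the same four invented points.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(average_distance_to_centroid(good))
print(average_distance_to_centroid(bad))
```

The grouping with the lower score keeps nearby points together, which is what the objective is meant to reward.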

Hierarchical Clustering
• Example from biological classification
  • (the word classification here does not mean a prediction mechanism)

    chordata
      mammalia
        leopards
        humans
      reptilia
        snakes
        crocodiles

• Other examples: Internet directory systems (e.g. Yahoo, more on this later)

Other Types of Mining
• Text mining: application of data mining to textual documents
  • cluster Web pages to find related pages
  • cluster pages a user has visited to organize their visit history
  • classify Web pages automatically into a Web directory

Information Retrieval

Information Retrieval Systems
• Information retrieval (IR) systems use a simpler data model than database systems
  • Information organized as a collection of documents
  • Documents are unstructured, no schema
• Information retrieval locates relevant documents, on the basis of user input such as keywords or example documents
  • e.g., find documents containing the words “database systems”
• Can be used even on textual descriptions provided with non-textual data such as images
• Web search engines are the most familiar example of IR systems

Information Retrieval Systems (Cont.)
• Differences from database systems
  • IR systems don’t deal with transactional updates (including concurrency control and recovery)
  • Database systems deal with structured data, with schemas that define the data organization
  • IR systems deal with some querying issues not generally addressed by database systems
    • Approximate searching by keywords
    • Ranking of retrieved answers by estimated degree of relevance

Keyword Search
• In full text retrieval, all the words in each document are considered to be keywords.
  • We use the word term to refer to the words in a document
• Ranking of documents on the basis of estimated relevance to a keyword query is critical
  • Relevance ranking is based on factors such as
    • Term frequency: frequency of occurrence of the query keyword in the document
    • Inverse document frequency: how many documents the query keyword occurs in
      • The fewer the documents, the more importance is given to the keyword
    • Hyperlinks to documents: the more links there are to a document, the more important the document
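
Term frequency and inverse document frequency are often combined into a single score. The slides give no formula, so the sketch below assumes one common choice: term frequency as the share of the document's terms equal to the keyword, and inverse document frequency as the natural log of (number of documents / number of documents containing the keyword). Whitespace tokenization and the sample documents are also assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(documents, query_terms):
    """Score each document against the query; higher means estimated more relevant."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for terms in tokenized:
        counts = Counter(terms)
        score = 0.0
        for q in query_terms:
            docs_with_q = sum(1 for t in tokenized if q in t)
            if docs_with_q == 0 or not terms:
                continue  # keyword occurs nowhere, or document is empty
            tf = counts[q] / len(terms)            # term frequency
            idf = math.log(n_docs / docs_with_q)   # rarer keyword => larger weight
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    "database systems store data",
    "database database tuning",
    "cooking with data",
]
scores = tf_idf_scores(docs, ["database"])
print(scores)
```

Under this formula a keyword that occurs in every document contributes nothing, since log(1) = 0.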

Relevance Using Hyperlinks
• Use number of hyperlinks to a site as a measure of the popularity or prestige of the site
  • Count only one hyperlink from each site (why? - see previous slide)
  • Popularity measure is for site, not for individual page
    • But, most hyperlinks are to root of site
    • Also, concept of “site” difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity
• Refinements
  • When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige
    • Definition is circular
    • Set up and solve system of simultaneous linear equations
  • Above idea is basis of the Google PageRank ranking mechanism
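
The circular definition of prestige can be resolved numerically. The sketch below does not solve the linear system directly; it repeatedly re-applies the definition until the values settle, which is a swapped-in iterative approach. The damping factor of 0.85, the handling of sites with no outgoing links, and the four-site link graph are all assumptions for illustration, not the actual Google mechanism.

```python
def prestige(links, damping=0.85, iterations=100):
    """links maps each site to the list of distinct sites it links to."""
    sites = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(sites)
    rank = {s: 1.0 / n for s in sites}  # start with equal prestige
    for _ in range(iterations):
        new_rank = {s: (1.0 - damping) / n for s in sites}
        for source in sites:
            targets = links.get(source, [])
            if targets:
                # A link carries more weight when its source has more prestige.
                share = damping * rank[source] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Site with no outgoing links: spread its prestige evenly.
                share = damping * rank[source] / n
                for t in sites:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Invented link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = prestige(links)
for site in sorted(ranks, key=ranks.get, reverse=True):
    print(site, round(ranks[site], 3))
```

Site a ends up well above b and d even though each of them has at most one incoming link, because a's single link comes from the high-prestige site c.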

Web Search Engines
• Web crawlers are programs that locate and gather information on the Web
  • Recursively follow hyperlinks present in known documents, to find other documents
    • Starting from a seed set of documents
  • Fetched documents
    • Handed over to an indexing system
    • Can be discarded after indexing, or stored as a cached copy
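
The crawling loop above can be sketched without any network access by standing in a dictionary for the Web. The `WEB` dictionary, the URLs, and the `fetch_links` and `index` callbacks are hypothetical; a real crawler would also need politeness delays, robots.txt handling, and error handling.

```python
from collections import deque

# Hypothetical in-memory "Web": each URL maps to the links found on that page.
WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
    "d.example/": ["a.example/"],  # nothing reachable from the seed links here
}


def crawl(seeds, fetch_links, index):
    """Follow hyperlinks breadth-first from the seed set, indexing each page once."""
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        url = frontier.popleft()
        links = fetch_links(url)  # "fetch" the document
        index(url)                # hand it over to the indexing system
        for link in links:
            if link not in seen:  # avoid refetching known documents
                seen.add(link)
                frontier.append(link)
    return seen


indexed = []
found = crawl(["a.example/"], lambda url: WEB.get(url, []), indexed.append)
print(indexed)
```

Pages that no known document links to, such as `d.example/` here, are never discovered, which is why the choice of seed set matters.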

Information Retrieval and Structured Data
• Information retrieval systems originally treated documents as a collection of words
• Information extraction systems infer structure from documents, e.g.:
  • Extraction of house attributes (size, address, number of bedrooms, etc.) from a text advertisement
  • Extraction of topic and people named from a news article
• Relations or XML structures used to store extracted data
  • System seeks connections among data to answer queries
  • Keyword querying on structured data