Chapter 22 Advanced Querying and Information Retrieval n

Скачать презентацию Chapter 22 Advanced Querying and Information Retrieval n

a8981495828ef5ba5985b23ba9b308f5.ppt

Количество слайдов: 77

Chapter 22: Advanced Querying and Information Retrieval n Decision-Support Systems n Data Analysis and OLAP n Data Mining n Data Warehousing n Information-Retrieval Systems 1 Database System Concepts 4 th Edition 22. 1 ©Silberschatz, Korth and Sudarshan

Decision Support Systems n Decision-Support systems are used to make business decisions often based on data collected by on-line transactionprocessing systems. n Examples of business decisions: H what items to stock? H What insurance premium to change? H Who to send advertisements to? n Examples of data used for making decisions H Retail sales transaction details H Customer profiles (income, age, sex, etc. ) 2 Database System Concepts 4 th Edition 22. 2 ©Silberschatz, Korth and Sudarshan

Decision-Support Systems: Overview n Data analysis tasks are simplified by specialized tools and SQL extensions H Example tasks 4 For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year 4 As above, for each product category and each customer category n Statistical analysis packages (e. g. , : S++) can be interfaced with databases H Statistical analysis is a large field will not study it here n Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from Large databases. n A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site. H Important for large businesses which generate data from multiple divisions, possibly at multiple sites H Data may also be purchased externally 3 Database System Concepts 4 th Edition 22. 3 ©Silberschatz, Korth and Sudarshan

Data Analysis and OLAP n Aggregate functions summarize large volumes of data n Online Analytical Processing (OLAP) H Tools to support interactive analysis of data, allowing data to be summarized and viewed in different ways n A histogram partitions the values taken by an attribute into ranges, and computes an aggregate over the values in each range; cumbersome to use standard SQL to construct a histogram. Extension proposed by Red Brick: H select percentile, avg (balance) H from account H group by N_tile (balance, 10) as percentile 4 Database System Concepts 4 th Edition 22. 4 ©Silberschatz, Korth and Sudarshan

Cross Tabulation of sales by item-name and color n The table above is an example of a cross-tabulation(or cross- tab) also referred to as a pivot-table. In general, a cross-table is a table where values for one attribute form the row headers, values for another attribute form the column headers, and the values in an individual cell are derived as follows. n A cross tab with summary rows/columns can be represented by introducing a special value all to represent subtotals. Database System Concepts 4 th Edition 22. 5 5 ©Silberschatz, Korth and Sudarshan

Relational Representation of the Data in Figure 22. 1 6 Database System Concepts 4 th Edition 22. 6 ©Silberschatz, Korth and Sudarshan

Three-Dimensional Data Cube 7 Database System Concepts 4 th Edition 22. 7 ©Silberschatz, Korth and Sudarshan

Online Analytical Processing n The operation of changing the dimensions used in a cross-tab is called pivoting. n An OLAP system provides other functionality as well. For instance, the analyst may wish to see a cross-tab on item-name and color for a fixed value of size, for example, large, instead of the sum across all sizes. Such an operation is referred to as slicing. The operation is sometimes called dicing, particularly when values for multiple dimensions are fixed. n The operation of moving from finer-granularity data to a coarser granularity is called a rollup. n The opposite operation - that of moving from coarser-granularity data to finer-granularity data – is called a drill down. 8 Database System Concepts 4 th Edition 22. 8 ©Silberschatz, Korth and Sudarshan

OLAP Implementation n The earliest OLAP systems used multidimensional arrays in memory to store data cubes, and are referred to as mutidimensional OLAP (MOLAP) systems. n Hybrid systems, which store some summaries in memory and store the base data and other summaries in a relational database, are called hybrid OLAP (HOLAP) systems. 9 Database System Concepts 4 th Edition 22. 9 ©Silberschatz, Korth and Sudarshan

Hierarchies on Dimensions 10 Database System Concepts 4 th Edition 22. 10 ©Silberschatz, Korth and Sudarshan

Cross Tabulation of sales With Hierarchy on item-name 11 Database System Concepts 4 th Edition 22. 11 ©Silberschatz, Korth and Sudarshan

Data Analysis (Cont. ) Small Medium Large Total Light Dark 8 20 35 10 310 5 53 35 Total 28 45 15 88 n Cross-tabulation of number by size and color of sample relation sales with the schema Sales(color, size, number). 12 Database System Concepts 4 th Edition 22. 12 ©Silberschatz, Korth and Sudarshan

Data Analysis (Cont. ) Color Size Number Light Dark all all Small Medium Large all 8 35 10 53 20 10 5 35 28 45 15 88 n Can represent subtotals in relational form by using the value all n E. g. : obtain (Light, all, 53) and (Dark, all, 35) by aggregating individual tuples with different values for size for each color. 13 Database System Concepts 4 th Edition 22. 13 ©Silberschatz, Korth and Sudarshan

Data Analysis (Cont. ) n Rollup: Moving from finer-granularity data to a coarser granularity by means of aggregation. n Drill down: Moving from coarser-granularity data finer- granularity data. n Proposed extensions to SQL, such as the cube operation help to support generation of summary data n The following query generates the previous table. select color, size, sum (number) from sales groupby color, size with cube 14 Database System Concepts 4 th Edition 22. 14 ©Silberschatz, Korth and Sudarshan

Data Analysis (Cont. ) n Figure shows the combinations of dimensions size, color, price n In general computing cube operation with n groupby columns gives 2 nd different groupby combinations. 15 Database System Concepts 4 th Edition 22. 15 ©Silberschatz, Korth and Sudarshan

Data Mining n Like knowledge discovery in artificial intelligence data mining discovers statistical rules and patterns it differs from machine learning in that it deals with large volumes of data stored primarily on disk. n Knowledge discovered from a database can be represented by a set of rules. n e. g. , : “Young women with annual incomes greater than $50, 000 are most likely to buy sports cars” n Discover rules using one of two models: 1. The user is involved directly in the process of knowledge discovery. 2. The system is responsible for automatically discovering knowledge from the database by detecting patterns and correlation's in the data. 16 Database System Concepts 4 th Edition 22. 16 ©Silberschatz, Korth and Sudarshan

Knowledge Representation Using Rules n General form of rules X antecedent consequent H X is a list of one or more variables with associated ranges. n The rule transactions T, buys (T, bread) buys(T, milk) states: if there is a tuple (t 1, bread) in the relation buys, there must also be a tuple (t 1, milk) in the relation buys. n Population: Cross-product of the ranges of the variables in the rule. H In the above example, the set of all transactions. n Support: Measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. H e. g. , 10% of transactions buy bread and milk. n Confidence : Measure of how often the consequent is true when the antecedent is true. H e. g. , 80% of transactions that buy bread also buy milk 17 Database System Concepts 4 th Edition 22. 17 ©Silberschatz, Korth and Sudarshan

Some Classes of Data-Mining Problems n Classification : Finding rules that partition the given data into disjoint groups (classes) that are relevant for making a decision H (e. g. , : which of several factors help classify a person’s credit worthiness). n Associations: Useful to determine associations between different items H (e. g. , : someone who buys bread is quite likely also to buy milk). n Sequence correlations: determine correlations between related sequence data. H (e. g. , : when bond rates go up stock prices go down within two days. ) 18 Database System Concepts 4 th Edition 22. 18 ©Silberschatz, Korth and Sudarshan

User-Guided Data Mining n In user-guided data mining, primary responsibility for discovering rules is with the user. n User may runs tests on the database to verify or refute a hypothesis. Confidence and support for rules expressing a hypothesis are derived from the database. n An iterative process of forming and refining rules is used. Example: Test the hypothesis “People who hold master’s degrees are the most likely to have an excellent credit rating. ” If confidence of rule is low, may refine it into the rule: people P, P. degree = Masters and C. income 75, 000 C. credit = excellent n Data-visualization though graphical representations like maps, charts, and color-coding, helps detect patterns in data 19 Database System Concepts 4 th Edition 22. 19 ©Silberschatz, Korth and Sudarshan

Classification Rules n Classification rules help assign new objects to a set of classes. E. g. , given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk? n Classification rules for above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc. n Classification rules can be compactly shown as a Classification tree. 20 Database System Concepts 4 th Edition 22. 20 ©Silberschatz, Korth and Sudarshan

Part of Credit Risk Classification Tree 21 Database System Concepts 4 th Edition 22. 21 ©Silberschatz, Korth and Sudarshan

Discovery of Classification Rules n Training set: a data sample in which the grouping for each tuple is already known. n Top down generation of classification tree. H Each internal node of the tree partitions the data into groups based on the attribute. H The data at a node is not partitioned further if either all (or most) of the items at the node belong to the same class, or all attributes have been considered. Such a node is a leaf node. H Otherwise the data at the node is partitioned further by picking an attribute for partitioning data at the node. 22 Database System Concepts 4 th Edition 22. 22 ©Silberschatz, Korth and Sudarshan

Discovery of Classification Rules (Cont. ) n Consider credit risk example: Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value. n At each child node of the root, further classification is done tuple if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval. n Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are. n In general, different branches of the tree could grow to different levels. Different nodes at the same level may use difficult partitioning attributes. 23 Database System Concepts 4 th Edition 22. 23 ©Silberschatz, Korth and Sudarshan

Discovery of Association Rules n Example: transactions T, buys (T, bread) buys (T, milk) n In general: notion of transaction , and its intemset, the set of items contained in the transaction n General form of rule: transactions T, c(T, i 1) and. . . and c(T, io) c(T, i 0) where c(T, ik) denotes that transaction T contains item ik. Above can be represented as A b where A = {i 1, i 2, . . . , in} and b = io. n Support of rule = number of transactions whose itemsets contain A {b} n Usually desire rules with strong support, which will involve only items purchased in a significant percentage of the transactions. 24 Database System Concepts 4 th Edition 22. 24 ©Silberschatz, Korth and Sudarshan

Discovery of Association Rules (Cont. ) n Consider all possible sets of relevant items. n For each set find its support (i. e. , how many transactions purchase all items in the set). n Use sets with sufficiently high support to generate association rules. H From set A generate the rules A - {b} b for each b A. H Support of each of the rules is support of A. H Confidence of a rule is support of A divided by support of A - {b}. 25 Database System Concepts 4 th Edition 22. 25 ©Silberschatz, Korth and Sudarshan

Finding Support n Few sets: Determine level of support via a single pass. H A count is maintained for each set, initially set to 0. H When a transaction is fetched, the count is incremented for each set of items which contained in the itemset of the transaction. H Sets with a high count at the end of the pass correspond to items with a high degree of association. n Many sets: If memory not enough to hold all counts for all sets Use multiple passes, considering only some sets in each pass. n Optimization: Once a set is eliminated because it occurs in too small a fraction of the transactions, none of its supersets needs to be considered. 26 Database System Concepts 4 th Edition 22. 26 ©Silberschatz, Korth and Sudarshan

Data Warehousing n A data warehouse is a repository of information gathered from multiple sources. 27 Database System Concepts 4 th Edition 22. 27 ©Silberschatz, Korth and Sudarshan

Data Warehousing (Cont. ) n Provides a single consolidated interface to data n Data stored for an extended period, providing access to historical data n Data/updates are periodically downloaded form online transaction processing (OLTP) systems. H Typically, download happens each night. H Data may not be completely up-to-date, but is recent enough for analysis. n Running large queries at the warehouse ensures that OLTP systems are not affected by the decision-support workload. 28 Database System Concepts 4 th Edition 22. 28 ©Silberschatz, Korth and Sudarshan

Issues in Building a Warehouse n When and how to gather data. H Source driven: data source initiates data transfer H Destination driven: warehouse initiates data transfer n What schema to use. H Schema integration H Cleaning and conversion of incoming data n What data to summarize. H Raw data may be too large to store on-line H Aggregate values (totals/subtotals) often suffice H Queries on raw data can often be transformed by query optimizer to use aggregate values n How to propagate updates. H Date at warehouse is a view on source data H Efficient view maintenance techniques required 29 Database System Concepts 4 th Edition 22. 29 ©Silberschatz, Korth and Sudarshan

Information Retrieval Systems n Information retrieval (IR) systems use a simpler data model than database systems, but provide more powerful querying capabilities within the restricted model. n Queries attempt to locate documents that are of interest by specifying, for example, sets of keywords. H e. g. , find documents containing the words “database systems” n Information retrieval systems order answers based on their estimated relevance. H e. g. , user may really only want documents about database systems, but the system may retrieve all documents that mention the phrase database systems”. H Documents may be ordered by, for example, how many times the phrase appears in the document. 30 Database System Concepts 4 th Edition 22. 30 ©Silberschatz, Korth and Sudarshan

Queries n Combinations of keywords H motorcycle and maintenance H computer or micro-processor H computer but not database n Closeness of keyword s in the and case affects ranking. Some systems allow user to specify that the keywords must occur close to each other. n Synonyms H To retrieve document title motorcycle repair for the query motorcycle and maintenance, need to realize that maintenance and repair are synonyms n Similarity based retrieval - retrieve documents similar to a given document. Similarity may be defined based on metrics such as number of common keywords. 31 Database System Concepts 4 th Edition 22. 31 ©Silberschatz, Korth and Sudarshan

Differences From Database Systems n Information retrieval systems, unlike traditional database systems, handle: H Unstructured documents H Searching using keywords and relevance ranking n Most information retrieval systems do not handle: H High update rates H Concurrency control H Data structured using more complex data models (e. g. , relational or object oriented data models) H Complex queries written in, e. g. , SQL 32 Database System Concepts 4 th Edition 22. 32 ©Silberschatz, Korth and Sudarshan

Indexing of Documents n Documents that contain a specified keyword can be located using an inverted index which maps each keyword Ki to the set Si of identifiers of documents that contain Ki. n IR systems save space by using index structures that support only approximate retrieval. May result in: H false drop - some relevant documents may not be retrieved. H false positive - some irrelevant documents may be retrieved. H For many applications a good index should not permit any false drops, but may permit a few false positives. n Relevant performance metrics: H Precision - what percentage of the retrieved documents are relevant to the query. H Recall - what percentage of the documents relevant to the query were retrieved. 33 Database System Concepts 4 th Edition 22. 33 ©Silberschatz, Korth and Sudarshan

Indexing of Documents (Cont. ) n and operation: Finds documents that contain all of a set of keywords K 1, K 2, . . . , Kn. H Retrieve the corresponding sets of identifiers of documents S 1, S 2, . . . Sn. H The intersection, S 1 S 2 . . . Sn, gives the identifiers of the desired set of documents. n or operation: Gives the set of all documents that contain at least one of the keywords K 1, K 2, …, Kn H Found by computing the union, S 1 S 2 . . . Sn, of the sets. 34 Database System Concepts 4 th Edition 22. 34 ©Silberschatz, Korth and Sudarshan

Indexing of Documents (Cont. ) n not operation: Finds documents that do not contain a specified keyword Ki H Let Si be the set of identifiers of documents that contain the keyword Ki. H Given a set of document identifies S, eliminate documents that contain the specified keyword Ki by taking the difference S-Si n A full-text index uses every work in the document as a keyword. n Stop words are very commonly occurring words that are useless as key words, e. g, a, an, the, it etc. These are eliminated from the index. 35 Database System Concepts 4 th Edition 22. 35 ©Silberschatz, Korth and Sudarshan

Browsing n Storing related documents together facilitates browsing, where a user can see not only requested document but also related ones. n Browsing in a library facilitated by classification system that organizes logically related books together. 36 Database System Concepts 4 th Edition 22. 36 ©Silberschatz, Korth and Sudarshan

A Classification Tree 37 Database System Concepts 4 th Edition 22. 37 ©Silberschatz, Korth and Sudarshan

Classification DAG n Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important. n Classification hierarchy is thus Directed Acyclic Graph (DAG). 38 Database System Concepts 4 th Edition 22. 38 ©Silberschatz, Korth and Sudarshan

A Classification DAG for a library information retrieval system 39 Database System Concepts 4 th Edition 22. 39 ©Silberschatz, Korth and Sudarshan

Locating of Information on the Web n The Archie system automatically follows Gopher links to locate information, and creates a centralized index of information found various sites. n Web indexing systems (Web crawlers) follow the hypertext links in documents to find other documents, and build an index on the documents. n The indices are often full-text indices, and are stored locally at the indexing system. n These systems run a background process to H find new sites. H obtain updated information from known sites. H discard defunct sites. 40 Database System Concepts 4 th Edition 22. 40 ©Silberschatz, Korth and Sudarshan

Locating of Information on the Web (Cont. ) n Web indexing systems permit documents to be located even though they are not registered with any central authority. n Drawback: poor precision of recall H Full text indexing retrieves unrelated documents that just happen to mention the requested keyword H HTML extensions now allow documents to be tagged with keywords to be used by search engines; unfortunately, many documents do not provide such keywords. H The extremely large number of documents on the Web often leads to far more results than a human can handle. n Alternative approach: a cataloging system for the Web, such as that provide by Yahoo n Combinations of catalogs and indexing are quite useful Provided by, e. g. , Yahoo (www. yahoo. com). 41 Database System Concepts 4 th Edition 22. 41 ©Silberschatz, Korth and Sudarshan

Classification Tree 42 Database System Concepts 4 th Edition 22. 42 ©Silberschatz, Korth and Sudarshan

Data-Warehouse Architecture 43 Database System Concepts 4 th Edition 22. 43 ©Silberschatz, Korth and Sudarshan

Star Schema For A Data Warehouse 44 Database System Concepts 4 th Edition 22. 44 ©Silberschatz, Korth and Sudarshan

A Classification Hierarchy For A Library System 45 Database System Concepts 4 th Edition 22. 45 ©Silberschatz, Korth and Sudarshan

A Classification DAG For A Library Information Retrieval System 46 Database System Concepts 4 th Edition 22. 46 ©Silberschatz, Korth and Sudarshan

Data Analysis and OLAP n Online Analytical Processing n Data that can be modeled as dimension attributes and measure attributes are called multidimensional data. H Given a relation used for data analysis, we can identify some of its attributes as measure attributes, since they measure some value, and can be aggregated upon. For instance, the attribute number of the sales relation is a measure attribute, since it measures the number of units sold. H Some of the other attributes of the relation are identified as dimension attributes, since they define the dimensions on which measure attributes, and summaries of measure attributes, are viewed. 47 Database System Concepts 4 th Edition 22. 47 ©Silberschatz, Korth and Sudarshan

Extended Aggregation n SQL: 1999 also supports generalizations of the group by constructs, using the cube and rollup constructs. A representative use of the cube construct is; select item-name, color, size, sum(number) from sales group by cube(item-name, color, size) n This query computes the union of eight different groupings of the sales relation: { (item-name, color, size), (item-name, color), (item-name, size), (color, size), (item-name), (color), (size), ( ) } Where ( ) denotes an empty group by list. n For each grouping, the result contains the null value for attributes not present in the grouping. For instance, with occurrences of all replaced by null, can be computed by the query select item-name, color, sum(number) from sales group by cube(item-name, color) 48 Database System Concepts 4 th Edition 22. 48 ©Silberschatz, Korth and Sudarshan

Extended Aggregation (Cont. ) n A representative rollup construct is select item-name, color, size, sum(number) from sales group by rollup(item-name, color, size) n Here only four grouping are generated: { (item-name, color, size), (item-name, color), (item-name), ( ) } n Rollup can be used to generate aggregates at multiple levels of a hierarchy on a column. For instance, we have a table itemcategory(item-name, category) giving the category of each item. Then the query select category, item-name, sum(number) from sales, category where sales. item-name = itemcategory. item-name group by rollup(category, item-name) would give a hierarchical summary by item-name and by category. 49 Database System Concepts 4 th Edition 22. 49 ©Silberschatz, Korth and Sudarshan

Extended Aggregation (Cont. ) n Multiple rollups and cubes can be used in a single group by clause. For instances, the following query select item-name, color, size, sum(number) from sales group by rollup(item-name), rollup(color, size) generates the groupings { (item-name, color, size), (item-name, color), (item-name), (color, size), (color), ( ) } n The function grouping can be applied on an attribute; it returns 1 if the value is a null value representing all, and returns 0 in all other cases. Consider the following query: select item-name, color, size, sum(number), grouping(item-name) as item-name-flag, grouping(color) as color-flag, grouping(size) as size-flag, from sales group by cube(item-name, color, size) 50 Database System Concepts 4 th Edition 22. 50 ©Silberschatz, Korth and Sudarshan

Ranking n Ranking is done in conjunction with an order by specification. Suppose we are given a relation student-marks(student-id, marks) which stores the marks obtained by each student. The following query gives the rank of each student. select student-id, rank( ) over (order by (marks) desc as s-rank from student-marks n An extra order by clause is needed to get them in sorted order, as shown below. select student-id, rank ( ) (order by (marks) desc) as s-rank from student-marks order by s-rank n Ranking can be done within partition of the data. The following query then gives the rank of students within each section. select student-id, section, rank ( ) over (partition by section order by marks desc) as sec -rank from student-marks, student-section where student-marks. student-id = student-section. student-id order by section, sec-rank 51 Database System Concepts 4 th Edition 22. 51 ©Silberschatz, Korth and Sudarshan

Ranking (Cont. ) n For a given constant n, the ranking the function ntile(n) takes the tuples in each partition in the specified order, and divides them into n buckets with qual numbers of tuples. For instance, we an sort employees by salary, and use ntile(3) to find which range (bottom third, middle third, or top third) each employee is in, and compute the total salary earned by employees in each range: select threetile, sum(salary) from ( select salary, ntile(3) over (order by (salary) as threetile from employee) as s group by threetile n SQL: 1999 permits the user to specify where they should occur by using nulls first or nulls last, for instance select student-id, rank ( ) over (order by marks desc nulls last) as s-rank from student-marks 52 Database System Concepts 4 th Edition 22. 52 ©Silberschatz, Korth and Sudarshan

Windowing n An example of window query is that, given sales values for each date, calculates for each date the average of the sales on that day, the previous day, and the next day; such moving average queries are used to smooth out random variations. n In contrast to group by, the same tuple can exist in multiple windows. Suppose we are given a relation transaction(accountnumber, date-time, value), where value is positive for a deposit and negative for a withdrawal, consider a query select account-number, date-time, sum(value) over (partition by account-number order by date-time rows unbounded preceding) as balance from transaction order by account-number, date-time 53 Database System Concepts 4 th Edition 22. 53 ©Silberschatz, Korth and Sudarshan

Decision - Tree Construction n The main idea of decision tree construction is to evaluate different attributes and different partitioning conditions and pick the attribute and partitioning condition that results in the maximum information gain ratio. Procedure Grow. Tree(S) Partition(S); Procedure Partition (S) if (purity(S) > p or |S| < s) then return; for each attribute A evaluate splits on attribute A; Use best split found (across all attributes) to partition S into S 1, S 2, …. , Sr, for i = 1, 2, …. . , r Partition(Si); n We can generate classification rules from a decision tree, if we so desire. For each leaf we generate a rule as follows: The left hand side is the conjunction of all the split conditions on the path to the leaf, and the class is the class of the majority of the training instances at the leaf. An example of such a classification rule is degree = masters and income > 75, 000 excellent 54 Database System Concepts 4 th Edition 22. 54 ©Silberschatz, Korth and Sudarshan

Other Types of Classifiers n There are two types of classifiers H Neural net classifiers H Bayesian classifiers n Neural net classifiers use the training data to train artificial neural nets. n To find the probability p(cj|d) of instance d being in class cj, Bayesian classifiers use Baye’s theorem, which says p(cj|d) = p(d|cj)p(cj) p(d) where p(d|cj) is the probability of generating instance d given class cj, p(cj) is the probability of occurrence of class cj, and p(d) is the probability of instanced occuring. n To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate p(d|cj) = p(d 1|cj) * p(d 2|cj) * …. * (p(dn|cj) That is, the probability of the instance d occuring is the product of the probability of occurrence of each of the attribute values di of d, given the class cj. 55 Database System Concepts 4 th Edition 22. 55 ©Silberschatz, Korth and Sudarshan

Regression n Regression deals with the prediction of a value, rather than a class. Given values for a set of variables, X 1, X 2, …, Xn, we wish to predict the value of a variable Y. n One way is to infer coefficients a 0, a 1, …, an such that Y = a 0 + a 1 * X 1 + a 2 * X 2 + … + an * X n n Finding such a linear polynomial is called linear regression. The process to find a curve that fits the data is called curve fitting. n The fit may only be approximate, because of noise in the data or because the relationship is not exactly a polynomial, so regression aims to find coefficients that give the best possible fit. 56 Database System Concepts 4 th Edition 22. 56 ©Silberschatz, Korth and Sudarshan

Association Rules n Retail shops are often interested in associations between different items that people buy. Examples of such associations are: H Someone who buys bread is quite likely also to buy milk H A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts. n Associations information can be used in several ways. When a customer buys a particular book, an online shop may suggest associated books. n Association Rules An example of Association Rules is bread milk An association rule must have an associated population; the population consists of a set of instances. 57 Database System Concepts 4 th Edition 22. 57 ©Silberschatz, Korth and Sudarshan

Association Rules (Cont. ) n Association Rules: An example of Association Rules is bread milk n An association rule must have an associated population; the population consists of a set of instances. Rules have an associated support, as well as an associated confidence. These are defined in the context of population: n Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. For instance, suppose only 0. 001 percent of all purchases include milk and screwdrivers. The support for the rule is milk screwdrivers is low. n Confidence is a measure of how often the consequent is true when the antecedent is true. For instance, the rule bread milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. A rule with a low confidence is not meaningful. Note that the confidence of bread milk may be very different from the confidence of milk bread, although both have the same supports. 58 Database System Concepts 4 th Edition 22. 58 ©Silberschatz, Korth and Sudarshan

Components of Data Warehouse n When and how to gather data. n What schema to use. n Data cleansing n How to propagate updates. n What data to summarize 62 Database System Concepts 4 th Edition 22. 62 ©Silberschatz, Korth and Sudarshan

Keyword Search n Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not. n A query containing keywords without any of the above connectives is assumed to have ands implicitly connecting the keywords n In full text retrieval, all the words in each document are considered to be keywords. We shall use the word term to refer to the words in a document, since all words are keywords. 65 Database System Concepts 4 th Edition 22. 65 ©Silberschatz, Korth and Sudarshan

Relevance Ranking Using Terms n One way of measuring r(d, t), the relevance of a document d to a term t, is r(d, t) = log 1 + n(d, t) n(d) Where n(d) denotes the number of terms in the document and n(d, t) denotes the number of occurrences of term t in the document d. r(d, t) r(d, Q) = n(t) t Q 66 Database System Concepts 4 th Edition 22. 66 ©Silberschatz, Korth and Sudarshan

Applications of Data Mining person P, P. degree = masters and P. income > 75, 000 P. credit – excellent person P, P. degree = bachelors or (p. income 25, 000 and P. income 75, 000) P. credit = good 75 Database System Concepts 4 th Edition 22. 75 ©Silberschatz, Korth and Sudarshan

Best Splits n The purity of a set S of training instances can be measured quantitatively in several ways. Suppose there are k classes, and of the instances in S the fraction of instances in class I is pi. One measure of purity, the Gini measure is defined as k [ Gini (S) = 1 - p 2 i i- 1 n When all instances are in a single class, the Gini value is 0, while it reaches its maximum (of 1 – 1 /k) if each class the same number of instances. Another measure of purity is the entropy measure, which is defined as k Entropy (S) = - pilog 2 pi i- 1 n When a set S is split into multiple sets Si, I=1, 2, …, r, we can measure the purity of the resultant set of sets as: r |Si| Purity(S 1, S 2, …. . , Sr) = i= 1 |S| purity (Si) The above formula can be used with both the Gini measure and the entropy 76 measure of purity. Database System Concepts 4 th Edition 22. 76 ©Silberschatz, Korth and Sudarshan

Best Splits (Cont. ) n The information gain due to particular split of S into Si, I = 1, 2, …. , r is then Information-gain (S, {S 1, S 2, …. , Sr) = purity(S) – purity (S 1, S 2, … Sr) n The information content of a particular split can be defined in terms of entropy as r Information-content(S, {S 1, S 2, …. . , Sr})) = - |Si| i- 1 |S| log 2 |Si| |S| n All of this leads to a definition: The best split for an attribute is the one that gives the maximum information gain ratio, defined as Information-gain (S{S 1, S 2, ……, Sr}) Information-content (S, {S 1, S 2, …. . , Sr}) 77 Database System Concepts 4 th Edition 22. 77 ©Silberschatz, Korth and Sudarshan