Database Clustering and Summary Generation Tae-Wan Ryu and Christoph F. Eick

Similarity Measures for Multivalued Attributes for Database Clustering
Tae-Wan Ryu and Christoph F. Eick
Department of Computer Science, University of Houston

Talk Organization
- Database Clustering
- Problems of Database Clustering
- Extended Data Sets
- Similarity Measures for Sets and Bags
- An Architecture for Database Clustering
- Summary and Conclusion

General KDD Steps
[Figure: the KDD process as a pipeline. Stages: data sources -> selected/preprocessed data -> transformed data -> extracted information -> knowledge. Steps: select/preprocess, transform (data preparation), data mine, interpret/evaluate/assimilate.]

Research Goal
To develop methodologies, techniques, and tools to create summaries from databases using cluster analysis and genetic programming.

Our approach
- Partition the database into groups of similar objects using cluster analysis
- Find the commonalities shared by the objects in each group using genetic programming

Database Summary Generation Steps and Example
[Figure: two parallel columns. Steps: database -> database clustering -> clusters (groups of similar objects) -> summary generation -> summaries describing the commonalities within each group. Example: restaurant database; cluster labels Young, White collar, Retired; summary labels Midnight, Dinner, Lunch.]

An Example Schema Diagram

Preprocessing for Database Clustering
Preparing input data sets for clustering: selecting and preparing appropriate data from a database is an important task.

Key problems
- How to support a user’s viewpoint, including attribute selection
- The data model discrepancy between the storage format and the input format that clustering algorithms assume
- How to cope with structural information, especially 1:n and n:m relationships

Input Format for Data Mining Algorithms
Data formats for input data sets
- Single flat file format (basically, the data set has to be stored as a single(!) relation)
- Complex and structured formats

Problem: Almost all existing data mining and clustering approaches assume that the input data set is in single flat file format.

An Example Database to Illustrate the Problems with Relationship Information in Database Clustering

(a) An example Personal relational database

Person
ssn    name   age  sex
11111  Johny  43   M
22222  Andy   21   F
33333  Post   67   M
44444  Jenny  35   F

Purchase
ssn    location   ptype  amount  date
11111  Warehouse  1      400     02-10-96
11111  Grocery    2      70      05-14-96
11111  Mall       3      200     12-24-96
22222  Mall       2      300     12-23-96
22222  Grocery    3      100     06-22-96
33333  Mall       1      30      11-05-96

(b) The joined table from the Person and Purchase relations

name   age  sex  ptype  amount  location
Johny  43   M    1      400     Warehouse
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Mall
Andy   21   F    2      300     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

ptype (payment type): 1 for cash, 2 for credit, and 3 for check. The cardinality ratio between Person and Purchase is 1:n.

Existing Approaches
Apply aggregate functions or generalization operators to convert a multi-valued attribute into a single-valued attribute.

Problems
- The user has to make a critical decision (e.g., which aggregate function to use?)
- Valuable related information may be lost.
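The information loss mentioned above can be illustrated with a small sketch. The two bags of purchase amounts below are hypothetical values chosen for illustration, and the mean stands in for whichever aggregate function a user might pick.

```python
from statistics import mean

# Hypothetical purchase amounts for two customers (illustrative values only).
amounts_a = [100, 100]
amounts_b = [20, 180]

# Collapse each multi-valued attribute to a single value with an aggregate.
avg_a = mean(amounts_a)
avg_b = mean(amounts_b)

# After aggregation the two customers look identical on this attribute ...
same_after_aggregation = avg_a == avg_b
# ... although the underlying bags of values are different.
same_before_aggregation = sorted(amounts_a) == sorted(amounts_b)

print(same_after_aggregation, same_before_aggregation)
```

A different aggregate (for example the maximum) would separate these two customers but merge others, which is why the choice is a critical decision for the user.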

Extended Data Sets

The joined (flat) table:

name   age  sex  ptype  amount  location
Johny  43   M    1      400     Warehouse
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Mall
Andy   21   F    2      100     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

A converted table with a bag of values:

name   age  sex  p.ptype    p.amount        p.location
Johny  43   M    {1, 2, 3}  {400, 70, 200}  {Warehouse, Grocery, Mall}
Andy   21   F    {2, 3}     {100, 100}      {Mall, Grocery}
Post   67   M    1          30              Mall
Jenny  35   F    null       null            null

How to measure similarity between bags of values? -> Group similarity measures are needed.

Approaches for Database Clustering
Current approach: structured database -> manual transformation -> flat file -> clustering algorithms
Proposed approach: structured database -> automated preprocessing -> extended data set -> generalized clustering algorithms

Related Work
- LABYRINTH (Thompson et al.)
- Ketterlin’s extended COBWEB
- KATE (Manago et al.)
- SUBDUE (Holder et al.)
- INLEN (Ribeiro et al.)
- KBG (Bisson et al.), KLUSTER (Kietz et al.)

Research Objectives for Database Clustering
- To alleviate the representational gap between databases on the one hand and the input formats of clustering algorithms on the other hand
- To design and implement semi-automatic tools to facilitate database clustering
- To generalize clustering algorithms

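Generating Extended Data Sets From a Structured Database
[Figure: the data sets d1, d2, …, dn of a structured database, together with the user’s interests and objectives, are inputs to the extended data set generator, which outputs an extended data set.]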

A Unified Similarity Measure for Clustering Extended Data Sets
Group Similarity Measures
Mixed types: qualitative and quantitative.

Qualitative type: Tversky’s set-theoretical similarity models, where $a$ and $b$ are two objects, $A$ and $B$ denote their sets of features, and $f$ is the cardinality of a set.
- Contrast model: $S(a, b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A)$, for some $\theta, \alpha, \beta \ge 0$
- Ratio model (e.g., normalized similarity): $S(a, b) = f(A \cap B) / [f(A \cap B) + \alpha f(A - B) + \beta f(B - A)]$, $\alpha, \beta \ge 0$
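As a minimal sketch of the two Tversky models, the functions below take f to be set cardinality, as on the slide. The parameter names theta, alpha, and beta, their default values of 1, and the convention that two empty sets have ratio similarity 1 are assumptions made for illustration.

```python
def contrast_similarity(A, B, theta=1.0, alpha=1.0, beta=1.0):
    """Contrast model: weighted common features minus weighted distinctive features."""
    A, B = set(A), set(B)
    return theta * len(A & B) - alpha * len(A - B) - beta * len(B - A)


def ratio_similarity(A, B, alpha=1.0, beta=1.0):
    """Ratio model: common features normalised by common plus distinctive features."""
    A, B = set(A), set(B)
    common = len(A & B)
    denominator = common + alpha * len(A - B) + beta * len(B - A)
    # Assumption: two empty feature sets are treated as identical.
    return common / denominator if denominator else 1.0


# Location sets of two customers from the running example.
johny = {"Mall", "Grocery", "Warehouse"}
andy = {"Mall", "Grocery"}

contrast = contrast_similarity(johny, andy)  # 2 common, 1 only in johny
ratio = ratio_similarity(johny, andy)        # 2 / (2 + 1 + 0)
print(contrast, ratio)
```

With alpha = beta = 1 the ratio model coincides with the Jaccard coefficient, which makes it a convenient default for set-valued attributes.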

Group Similarity Measures (continued)
Quantitative type: group average
- Group average between groups $A$ and $B$: $d(A, B) = \frac{1}{n} \sum_{i=1}^{n} d(a, b)_i$, where $n$ is the total number of object pairs and $d(a, b)_i$ is the dissimilarity measure for the $i$th pair of objects $a$ and $b$, with $a \in A$ and $b \in B$.
- That is, the measure is the average of all inter-object measures over those pairs whose two objects come from different groups.
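The group average can be sketched as follows. The absolute difference used as the per-pair dissimilarity is an assumption; any other dissimilarity function could be passed in instead. The amounts are those of Johny and Andy in the example Purchase relation.

```python
from itertools import product


def group_average(A, B, d=lambda a, b: abs(a - b)):
    """Average dissimilarity over all pairs (a, b) with a in A and b in B."""
    pairs = list(product(A, B))
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(d(a, b) for a, b in pairs) / len(pairs)


johny_amounts = [400, 70, 200]
andy_amounts = [300, 100]

# Six pairs: |400-300|, |400-100|, |70-300|, |70-100|, |200-300|, |200-100|
dissimilarity = group_average(johny_amounts, andy_amounts)
print(dissimilarity)
```

Because the inputs are lists rather than sets, repeated values in a bag contribute once per occurrence.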

A Framework for Mixed Type Similarity Measures for Extended Data Sets
- Gower’s similarity measure for data sets with mixed types.
- Extended similarity measure for multi-valued data sets with mixed types, where $m = l + q$ is the total number of attributes ($l$ qualitative and $q$ quantitative), and the functions $s_l(a, b)$ and $s_q(a, b)$ are similarity functions for qualitative attributes and quantitative attributes, respectively.
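The formulas themselves did not survive on this slide. One plausible reading, in the spirit of Gower's measure, is a weighted average of per-attribute similarities over all m = l + q attributes. The sketch below assumes that form; the choice of the ratio model for qualitative attributes, a range-normalised group average for quantitative ones, the unit weights, and the value ranges are all illustrative assumptions.

```python
from itertools import product


def s_qualitative(A, B):
    """Ratio model with alpha = beta = 1 on the distinct values (assumption)."""
    A, B = set(A), set(B)
    union = A | B
    return len(A & B) / len(union) if union else 1.0


def s_quantitative(A, B, value_range):
    """1 minus the range-normalised group-average difference (assumption).

    Both bags are assumed to be non-empty.
    """
    pairs = list(product(A, B))
    average = sum(abs(x - y) for x, y in pairs) / len(pairs)
    return 1.0 - average / value_range


def extended_similarity(a, b, attributes):
    """attributes: list of (name, kind, weight, value_range) tuples."""
    weighted_sum = total_weight = 0.0
    for name, kind, weight, value_range in attributes:
        if kind == "qualitative":
            s = s_qualitative(a[name], b[name])
        else:
            s = s_quantitative(a[name], b[name], value_range)
        weighted_sum += weight * s
        total_weight += weight
    return weighted_sum / total_weight


johny = {"sex": ["M"], "ptype": [1, 2, 3], "age": [43], "amount": [400, 70, 200]}
andy = {"sex": ["F"], "ptype": [2, 3], "age": [21], "amount": [300, 100]}

# Value ranges taken from the example relations: age 21..67, amount 30..400.
attributes = [
    ("sex", "qualitative", 1.0, None),
    ("ptype", "qualitative", 1.0, None),
    ("age", "quantitative", 1.0, 46),
    ("amount", "quantitative", 1.0, 370),
]
similarity = extended_similarity(johny, andy, attributes)
print(similarity)
```

Single-valued attributes are handled as bags of size one, so the same function covers ordinary flat attributes and multi-valued ones.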

Clustering Algorithms for Extended Data Sets
- Nearest-neighbor clustering
- DBSCAN
- Leader algorithm
- Hierarchical clustering
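To show how a similarity-based algorithm can work directly on bags of values, here is a minimal sketch of the leader algorithm with a pluggable similarity function. The threshold of 0.5 and the use of the Jaccard coefficient on location sets are assumptions for illustration, not choices stated on the slides.

```python
def jaccard(A, B):
    """Ratio-model similarity with alpha = beta = 1."""
    A, B = set(A), set(B)
    union = A | B
    return len(A & B) / len(union) if union else 1.0


def leader_clustering(objects, similarity, threshold):
    """Assign each object to the first leader it is similar enough to.

    An object that matches no existing leader starts a new cluster
    and becomes that cluster's leader.
    """
    leaders = []   # first object of each cluster
    clusters = []  # clusters[i] holds the objects assigned to leaders[i]
    for obj in objects:
        for i, leader in enumerate(leaders):
            if similarity(obj, leader) >= threshold:
                clusters[i].append(obj)
                break
        else:
            leaders.append(obj)
            clusters.append([obj])
    return clusters


# (name, set of purchase locations) taken from the running example.
people = [
    ("Johny", {"Mall", "Grocery", "Warehouse"}),
    ("Andy", {"Mall", "Grocery"}),
    ("Post", {"Mall"}),
]
clusters = leader_clustering(people, lambda p, q: jaccard(p[1], q[1]), 0.5)
names = [[name for name, _ in cluster] for cluster in clusters]
print(names)
```

The result of the leader algorithm depends on the order in which objects are presented, which is worth keeping in mind when comparing it with the other algorithms listed.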

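Database Clustering Environment
[Figure: components of the environment. Data extraction tool: reads from the DBMS and produces an extended data set. Similarity measure tool: uses a library of similarity measures, default choices and domain information, and type and weight information to provide a similarity measure. Clustering tool: uses a library of clustering algorithms and produces a set of clusters. All tools are reached through a user interface.]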

A More Detailed Tool Architecture

A Join Template Form

  Begin-spec
    Database-name: DB;
    Link-definitions: Link-list;
    Begin-join
      Dataset-of-interest: Dsetintrest;
      Selected-attributes: Attr-list;
      Objective-attributes: Obj-attr-list;
      Extended-data-set: E;
    End-join
  End-spec

An Example of the Interface of the Extended Data Set Generation Tool

  Begin-spec
    DB-name: Company
    Link-definitions:
      superv(Employee.ssn, Employee.superssn),
      husband(Employee.ssn, Marriage.hssn),
      wife(Employee.ssn, Marriage.wssn),
      ehusband(Marriage.hssn, Employee.ssn),
      ewife(Marriage.wssn, Employee.ssn),
      works_on(Employee.ssn, Works_on.essn),
      project(Works_on.pno, Project.pnum),
      works_for(Employee.dno, Department.dnum),
      works_loc(Department.dnum, Dept_loc.dnum)
    Begin-join
      Dataset-of-interest: Employee
      Selected-attributes: ssn, sex, salary, superv.salary, wife.ewife.salary, works_on.hours, works_on.project.pname, works_for.works_loc.dloc
      Objective-attributes: ssn
      Output-data-set: E1
    End-join
  End-spec
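One way to read the specification above is that each link names a join condition between two relations, and a selected attribute such as works_on.project.pname names a chain of links followed by an attribute. The sketch below parses link definitions and resolves such a path into join steps. This interpretation, and the helper names parse_links and resolve_path, are assumptions for illustration and not the tool's actual implementation.

```python
import re

# name(SourceRelation.attr, TargetRelation.attr)
LINK_RE = re.compile(r"(\w+)\(\s*(\w+)\.(\w+)\s*,\s*(\w+)\.(\w+)\s*\)")


def parse_links(text):
    """Map each link name to (source relation, source attr, target relation, target attr)."""
    return {
        name: (src_rel, src_attr, dst_rel, dst_attr)
        for name, src_rel, src_attr, dst_rel, dst_attr in LINK_RE.findall(text)
    }


def resolve_path(path, start, links):
    """Turn 'link1.link2.attr' into join conditions plus the final relation and attribute."""
    *link_names, attribute = path.split(".")
    current, steps = start, []
    for link_name in link_names:
        src_rel, src_attr, dst_rel, dst_attr = links[link_name]
        if src_rel != current:
            raise ValueError(f"link {link_name!r} does not start at {current}")
        steps.append(f"{src_rel}.{src_attr} = {dst_rel}.{dst_attr}")
        current = dst_rel
    return steps, current, attribute


links = parse_links(
    "works_on(Employee.ssn, Works_on.essn), project(Works_on.pno, Project.pnum)"
)
steps, relation, attribute = resolve_path("works_on.project.pname", "Employee", links)
print(steps, relation, attribute)
```

A path without any link prefix, such as salary, resolves to the data set of interest itself with no join steps.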

Algorithm to Generate Extended Data Sets
- Project the data set of interest onto its primary key and the selected attributes
- Join the data set of interest with the related data sets to get all related attributes for each join path
- Group together the attributes that describe the same object
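The three steps can be sketched on the Person and Purchase relations from the earlier example. The function name, its parameters, the "p." prefix, and the use of None for objects without related rows are assumptions made for this illustration; only a single join path is handled.

```python
from collections import defaultdict

person = [
    {"ssn": 11111, "name": "Johny", "age": 43, "sex": "M"},
    {"ssn": 22222, "name": "Andy", "age": 21, "sex": "F"},
    {"ssn": 33333, "name": "Post", "age": 67, "sex": "M"},
    {"ssn": 44444, "name": "Jenny", "age": 35, "sex": "F"},
]
purchase = [
    {"ssn": 11111, "location": "Warehouse", "ptype": 1, "amount": 400},
    {"ssn": 11111, "location": "Grocery", "ptype": 2, "amount": 70},
    {"ssn": 11111, "location": "Mall", "ptype": 3, "amount": 200},
    {"ssn": 22222, "location": "Mall", "ptype": 2, "amount": 300},
    {"ssn": 22222, "location": "Grocery", "ptype": 3, "amount": 100},
    {"ssn": 33333, "location": "Mall", "ptype": 1, "amount": 30},
]


def extended_data_set(interest, key, selected, related, related_key,
                      related_attrs, prefix="p."):
    # Step 1: project the data set of interest onto key and selected attributes.
    result = {row[key]: {a: row[a] for a in [key] + selected} for row in interest}

    # Step 2: join with the related data set, collecting values per object.
    bags = defaultdict(lambda: defaultdict(list))
    for row in related:
        if row[related_key] in result:
            for a in related_attrs:
                bags[row[related_key]][a].append(row[a])

    # Step 3: group the collected values into one bag per object and attribute.
    for k, obj in result.items():
        for a in related_attrs:
            obj[prefix + a] = bags[k][a] or None  # None when nothing is related
    return list(result.values())


E = extended_data_set(person, "ssn", ["name", "age", "sex"],
                      purchase, "ssn", ["ptype", "amount", "location"])
for row in E:
    print(row)
```

Each object appears exactly once in the output, so the 1:n relationship no longer multiplies rows the way a plain join does.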

Summary Representation
- Our approach uses database queries as the summary representation language.
- A query that computes exactly the objects belonging to a cluster, and no other objects, is considered a perfect summary for that cluster.

An example query for a cluster:

  SELECT ssn, name, address
  FROM person, purchase
  WHERE (amount-spent > 1000) AND (payment-type = 'cash') AND (store-name = 'flea-market')

“Typically, members of the cluster have spent more than $1,000 in cash shopping at a flea market.”
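The notion of a perfect summary can be checked mechanically by comparing the objects a query selects with the members of the cluster. In the sketch below the query is written as a Python predicate, the rows are hypothetical, and the precision and recall scores for imperfect summaries are an added assumption rather than something stated on the slide.

```python
# Hypothetical joined rows; the values are illustrative only.
rows = [
    {"ssn": 1, "amount_spent": 1500, "payment_type": "cash", "store_name": "flea-market"},
    {"ssn": 2, "amount_spent": 1200, "payment_type": "cash", "store_name": "flea-market"},
    {"ssn": 3, "amount_spent": 800, "payment_type": "cash", "store_name": "flea-market"},
    {"ssn": 4, "amount_spent": 2000, "payment_type": "credit", "store_name": "mall"},
]


def query(row):
    """The example query's WHERE clause as a predicate."""
    return (row["amount_spent"] > 1000
            and row["payment_type"] == "cash"
            and row["store_name"] == "flea-market")


def is_perfect_summary(selected_ids, cluster_ids):
    """Perfect: the query returns the cluster's members and no other objects."""
    return set(selected_ids) == set(cluster_ids)


def summary_quality(selected_ids, cluster_ids):
    """Precision and recall of the query result with respect to the cluster."""
    selected, cluster = set(selected_ids), set(cluster_ids)
    hits = len(selected & cluster)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(cluster) if cluster else 0.0
    return precision, recall


selected = [row["ssn"] for row in rows if query(row)]
perfect = is_perfect_summary(selected, {1, 2})
quality = summary_quality(selected, {1, 2, 3})
print(selected, perfect, quality)
```

A graded score of this kind is one way to compare candidate queries that are not yet perfect summaries.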

Summary and Contributions
- Discussed the data model discrepancy between the database storage format and the input data format for traditional clustering algorithms
- Discussed the problems of dealing with relationship information in database clustering
- Presented a different way of representing related information using extended data sets
- Introduced the design and architecture of an automatic tool to generate extended data sets from databases
- Generalized the traditional similarity measures and presented a framework to cope with extended data sets in similarity-based clustering

Architecture of MASSON
[Figure: the clustering module takes the object set and schema information and produces clusters g1, g2, ..., gk. The GP-based discovery system takes a cluster as system input, user input through the user interface, and domain knowledge from the KB. Its GP engine generates and selects queries in a query set, applies them to the DB through the DBMS interface, evaluates the query results, and returns the discovered query set.]

Evolution Process
[Figure: the initial population q11, q12, ..., q1m (initial generation) evolves through selection, crossover, and mutation into generation 2 (q21, q22, ..., q2m) and onward to generation n (qn1, qn2, ..., qnm); a final selection yields the solution Q. n: number of generations; m: size of the population.]
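The selection, crossover, and mutation loop can be sketched with a simplified stand-in. The code below is a plain bit-string genetic algorithm, not genetic programming over query trees: each bit is assumed to say whether a candidate predicate is included in a query, and the fitness function is a toy that counts agreement with a hypothetical target mask. Population size, mutation rate, tournament selection, and elitism are all illustrative assumptions.

```python
import random


def tournament(population, fitness, rng):
    """Pick the fitter of two randomly drawn individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def evolve(fitness, genome_length, population_size=20, generations=30,
           mutation_rate=0.1, seed=0):
    """Evolve bit strings of length >= 2 and return the fittest final individual."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        best = max(population, key=fitness)
        next_population = [best[:]]  # elitism: the best individual survives
        while len(next_population) < population_size:
            parent1 = tournament(population, fitness, rng)
            parent2 = tournament(population, fitness, rng)
            cut = rng.randrange(1, genome_length)   # one-point crossover
            child = parent1[:cut] + parent2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]                 # bit-flip mutation
            next_population.append(child)
        population = next_population
    return max(population, key=fitness)


# Hypothetical target: which of six candidate predicates an ideal query uses.
TARGET = [1, 0, 1, 1, 0, 0]


def fitness(individual):
    """Toy fitness: number of positions that agree with the target mask."""
    return sum(1 for g, t in zip(individual, TARGET) if g == t)


best = evolve(fitness, genome_length=len(TARGET))
print(best, fitness(best))
```

In a query discovery setting the toy fitness would be replaced by a measure of how well the decoded query matches the members of a cluster.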