Clustering: Tackling Challenges with Data Recovery Approach. B. Mirkin, School of Computer Science, Birkbeck University of London. Advert of a Special Issue: The Computer Journal, Profiling Expertise and Behaviour; deadline 15 Nov. 2006. To submit: http://www.dcs.bbk.ac.uk/~mark/cfp_cj_profiling.txt
WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
What is clustering? Finding homogeneous fragments, mostly sets of entities, in data for further analysis
Example: W. Jevons (1857) planet clusters, updated (Mirkin, 1996) Pluto doesn’t fit in the two clusters of planets
Example: A Few Clusters. Clustering interface to WEB search engines (Grouper). Query: Israel (after O. Zamir and O. Etzioni, 2001). Cluster 1, 24 sites, Society, religion: Israel and Judaism; Judaica collection. Cluster 2, 12 sites, Middle East, War, History: The state of Israel; Arabs and Palestinians. Cluster 3, 31 sites, Economy, Travel: Israel Hotel Association; Electronics in Israel.
Clustering algorithms: Nearest neighbour, Ward, Conceptual clustering, K-Means, Kohonen SOM, etc.
K-Means: a generic clustering method. Entities are presented as multidimensional points.
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum-distance rule.
2. Put centroids at the gravity centres of the clusters thus obtained.
3. Iterate steps 1 and 2 until convergence.
4. Output the final centroids and clusters.
(The original slides animate these steps for K = 3 on a two-dimensional scatter, with entities shown as * and centroids as @.)
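The procedure above is easy to write down; below is a minimal NumPy sketch of generic K-Means (function and variable names are illustrative, not taken from the slides):

```python
import numpy as np

def k_means(Y, K, max_iter=100, seed=0):
    """Generic K-Means: (0) pick K seeds, (1) assign by the minimum-distance rule,
    (2) move centroids to gravity centres, iterate until convergence."""
    rng = np.random.default_rng(seed)
    # 0. Put K hypothetical centroids (seeds): here, K randomly chosen entities
    centroids = Y[rng.choice(len(Y), K, replace=False)]
    for _ in range(max_iter):
        # 1. Assign points to the centroids by the minimum-distance rule
        dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 2. Put centroids at the gravity centres of the obtained clusters
        new_centroids = np.array([Y[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        # 3. Iterate 1 and 2 until convergence
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 4. Output the final centroids and clusters
    return centroids, labels
```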
Advantages of K-Means: models typology building; computationally efficient; can be used incrementally, 'on-line'. Shortcomings of K-Means: instability of results; convex cluster shapes only.
Initial Centroids: Correct Two cluster case
Initial Centroids: Correct Initial Final
Different Initial Centroids
Different Initial Centroids: Wrong Initial Final
Clustering issues. K-Means gives no advice on: number of clusters; initial setting; data normalisation; mixed variable scales; multiple data sets. K-Means gives limited advice on: interpretation of results.
Data recovery for data mining (= "discovery of patterns in data"). Types of data: similarity, temporal, entity-to-feature, co-occurrence. Types of model: regression, principal components, clusters. Model: Data = Model_Derived_Data + Residual. Pythagoras: Data² = Model_Derived_Data² + Residual². The better the fit, the better the model.
The Pythagorean decomposition in the data recovery approach provides for: data scatter as a unique data characteristic (a perspective on data normalisation); additive contributions of entities or features to clusters (a perspective for interpretation); feature contributions that are correlation/association measures affected by scaling (so mixed-scale data become treatable); clusters that can be extracted one by one (a data mining perspective: incomplete clustering, number of clusters); multiple data sets that can be approximated as well as single-sourced ones (not discussed today).
Example: Mixed scale data table
Conventional quantitative coding + … data standardisation
Standardisation of features: Yik = (Xik − Ak) / Bk, where X is the original data, Y the standardised data, i indexes entities and k indexes features; Ak is the shift of the origin, typically the average; Bk is the rescaling factor, traditionally the standard deviation, but the range may be better in clustering.
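A small Python illustration of this standardisation (the function name is made up for the example; range scaling is the slide's preferred choice, z-scoring the conventional one):

```python
import numpy as np

def standardise(X, scale="range"):
    """Y_ik = (X_ik - A_k) / B_k with A_k the mean and B_k the range (or the std)."""
    A = X.mean(axis=0)                      # shift of the origin: the average
    if scale == "range":
        B = X.max(axis=0) - X.min(axis=0)   # rescaling factor: the range (preferred here)
    else:
        B = X.std(axis=0)                   # conventional z-scoring
    return (X - A) / B                      # constant features (B = 0) would need special care
```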
No standardisation Tom Sawyer
Z-scoring (scaling by std) Tom Sawyer
Standardising by range & weight Tom Sawyer
K-Means as a data recovery method
Representing a partition Cluster k: Centroid ckv (v - feature) Binary 1/0 membership zik (i - entity)
Basic equations (analogous to PCA, with the score vectors zk constrained to be binary): yiv = Σk ckv zik + eiv, where y is a data entry, z a cluster membership (not a score), c a cluster centroid, i an entity (N entities in total), v a feature/category, and k a cluster.
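The Pythagorean decomposition implied by these equations can be checked numerically; the sketch below (illustrative names; the data matrix is assumed already standardised as described earlier) verifies that the data scatter equals the explained part Σk Nk·||ck||² plus the residual scatter:

```python
import numpy as np

def scatter_decomposition(Y, labels):
    """Check that T(Y) = sum_k N_k * ||c_k||^2 + within-cluster residual scatter."""
    T = (Y ** 2).sum()                          # data scatter
    explained, residual = 0.0, 0.0
    for k in np.unique(labels):
        Yk = Y[labels == k]
        ck = Yk.mean(axis=0)                    # centroid of cluster k
        explained += len(Yk) * (ck ** 2).sum()  # N_k * ||c_k||^2
        residual += ((Yk - ck) ** 2).sum()      # squared residuals within cluster k
    return T, explained + residual              # the two numbers must coincide
```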
Meaning of the data scatter: it is the sum of contributions of individual features, each proportional to that feature's summary variance; this is the basis for feature pre-processing (dividing by range rather than by std).
Contribution of a feature F to a partition. Contrib(F) is proportional to the correlation ratio η² if F is quantitative, and to a contingency coefficient between the cluster partition and F if F is nominal: Pearson chi-square (Poisson-normalised categories) or Goodman-Kruskal tau-b (range-normalised categories).
Contribution of a quantitative feature to a partition: proportional to the correlation ratio η².
Contribution of a nominal feature to a partition: proportional to a contingency coefficient: Pearson chi-square if the binary categories are Poisson-normalised, Goodman-Kruskal tau-b if they are range-normalised (Bj = 1).
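For concreteness, the two association measures that these contributions are proportional to can be computed as follows; this is a sketch using the standard definitions (the slide's contributions equal these up to the normalising factors mentioned above):

```python
import numpy as np

def correlation_ratio(feature, labels):
    """eta^2: share of a quantitative feature's variance explained by the partition."""
    grand = feature.mean()
    total = ((feature - grand) ** 2).sum()
    between = sum(np.sum(labels == k) * (feature[labels == k].mean() - grand) ** 2
                  for k in np.unique(labels))
    return between / total

def pearson_chi_square(categories, labels):
    """Pearson chi-square contingency coefficient (divided by N) between a nominal
    feature and the partition."""
    N = len(categories)
    x2 = 0.0
    for v in np.unique(categories):
        for k in np.unique(labels):
            p_vk = np.sum((categories == v) & (labels == k)) / N
            p_v, p_k = np.mean(categories == v), np.mean(labels == k)
            x2 += (p_vk - p_v * p_k) ** 2 / (p_v * p_k)
    return x2
```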
Pythagorean Decomposition of data scatter for interpretation
Contribution-based description of clusters. C. Dickens: FCon = 0; M. Twain: LenD < 28; L. Tolstoy: NumCh > 3 or Direct = 1.
PCA-based Anomalous Pattern clustering: yiv = cv zi + eiv, where zi = 1 if i ∈ S and zi = 0 if i ∉ S. With the squared Euclidean distances, the fitted c and S must be anomalous, that is, interesting.
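A minimal sketch of the Anomalous Pattern procedure, assuming the usual formulation (reference point at the origin of the pre-centred data, seed at the entity farthest from it, then K-Means-like iterations with the reference kept fixed); names are illustrative:

```python
import numpy as np

def anomalous_pattern(Y, reference=None):
    """One Anomalous Pattern cluster: a two-'centroid' K-Means in which one centroid
    stays frozen at the reference point (e.g. the origin of pre-centred data)."""
    if reference is None:
        reference = np.zeros(Y.shape[1])
    d_ref = ((Y - reference) ** 2).sum(axis=1)
    c = Y[d_ref.argmax()].copy()                    # seed: entity farthest from the reference
    while True:
        in_S = ((Y - c) ** 2).sum(axis=1) < d_ref   # closer to c than to the reference
        if not in_S.any():
            return in_S, c
        new_c = Y[in_S].mean(axis=0)                # gravity centre of the anomalous cluster
        if np.allclose(new_c, c):
            return in_S, c                          # membership mask and centroid
        c = new_c
```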
Initial setting with Anomalous Pattern Cluster Tom Sawyer
Anomalous Pattern clusters: iterate. (The slide shows anomalous clusters extracted in sequence, labelled 0-3, on the scatter with Tom Sawyer marked.)
iK-Means: Anomalous clusters + K-Means. After extracting 2 clusters (how can one know that 2 is right?); the slide shows the final result.
Example of iK-Means: Media-Mirrored Russian Corruption (55 cases), with M. Levin and E. Bakaleinik. Features: corrupt office (1), client (1), rendered service (6), mechanism of corruption (2), environment (1).
A schema for bribery (diagram relating Environment, Client, Office and Service through Interaction).
Data standardisation: categories coded as one/zero variables; the average subtracted from every feature; all features normalised by range; category columns sometimes further divided by the number of categories.
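A sketch of this standardisation for a mixed table, using pandas dummy coding for the nominal features; dividing the category columns by the square root of the number of categories follows the later DRA slide and is one reading of "sometimes by the number of them":

```python
import numpy as np
import pandas as pd

def standardise_mixed(df, nominal_cols):
    """Categories -> 0/1 dummies; every column centred by its mean and scaled by its
    range; dummy columns of a nominal feature additionally divided by sqrt(#categories)."""
    Y = pd.get_dummies(df, columns=nominal_cols).astype(float)
    Y = (Y - Y.mean()) / (Y.max() - Y.min())             # centre and normalise by range
    for col in nominal_cols:
        cats = [c for c in Y.columns if c.startswith(col + "_")]
        Y[cats] = Y[cats] / np.sqrt(len(cats))           # rescale the nominal feature as a whole
    return Y
```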
iK-Means: initial setting with iterative Anomalous Pattern clustering. 13 clusters are found with the Anomalous Pattern procedure, of which 8 do not fit (4 singletons, 4 doublets); the 5 remaining clusters supply the initial seeds (their centres are taken as the seeds).
Interpretation II: patterning. (Interpretation I: representatives; Interpretation III: conceptual description.) Patterns are sought in the centroid values of salient features. Salience of feature v at cluster k: ~ (grand mean − within-cluster mean)².
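Computed per feature, this salience is a one-liner; a small sketch (assuming the data matrix has already been range-standardised so that features are comparable), after which the most salient features of each cluster are the ones to report:

```python
import numpy as np

def salience(Y, labels, k):
    """Salience of each feature at cluster k ~ (grand mean - within-cluster mean)^2."""
    grand = Y.mean(axis=0)
    within = Y[labels == k].mean(axis=0)
    return (grand - within) ** 2        # sort descending to pick the features to report
```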
Interpretation II (pattern) and III (description). Cluster 1 (7 cases): Other branch (877%), Improper categorisation (439%), Level of client (242%); description: Branch = Other. Cluster 2 (19 cases): Obstruction of justice (467%), Law enforcement (379%), Occasional (251%); description: Branch = Law Enforcement & Service: No Cover-Up & Client Level = Organisation.
Interpretation II (pattern) and III (appcod). Cluster 3 (10 cases): Extortion (474%), Organisation (289%), Government (275%); description: 0 <= Extort − Obstruct <= 1 & 2 <= Extort + Bribe <= 3 & No Inspection & No Protection. NO ERRORS.
Overall description: it is Branch that matters. Government: extortion for free services (Cluster 3); protection (Cluster 4). Law enforcement: obstruction of justice (Cluster 2); cover-up (Cluster 5). Other: category change (Cluster 1). Is this knowledge enhancement?
Data recovery clustering of similarities. Example: similarities between algebraic functions in an experimental method for knowledge evaluation. Similarity scores (scale 1 to 7) given by a 6th-grade student, lower triangle:

      lnx   x²   x³   x½
x²     1
x³     1    6
x½    2.5  2.5   3
x¼    2.5   3    3    4
Additive clustering: similarities are modelled as the sum of the intensities of the clusters containing both functions. In the original slides the upper triangle of the matrix shows one cluster's contribution while the lower triangle repeats the student's scores.
Cl. 0: "All are functions", {lnx, x², x³, x½, x¼}, intensity 1.
Cl. 1: "Power functions", {x², x³, x½, x¼}, intensity 2.
Cl. 2: "Sub-linear functions", {lnx, x½, x¼}, intensity 1.
Cl. 3: "Fast-growing functions", {x², x³}, intensity 3.
Residuals: relatively small (at most ±0.5).
Data recovery additive clustering: the observed similarity matrix is B = Ag + A1 + A2 + A3 + E. Problem: given B, find the cluster matrices A to minimise E, the difference between B and the summary model: ||B − (Ag + A1 + A2 + A3)||² → min over the A's.
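The summary model Ag + A1 + A2 + A3 can be rebuilt directly from the cluster lists and intensities given on the preceding slides; a small sketch (the entity order lnx, x², x³, x½, x¼ is assumed, and the diagonal is left out since self-similarities are not modelled):

```python
import numpy as np

def model_similarity(clusters, intensities, n):
    """Model-derived similarity matrix: sum over clusters of intensity * s s^T, off-diagonal."""
    A = np.zeros((n, n))
    for members, lam in zip(clusters, intensities):
        s = np.zeros(n)
        s[members] = 1.0
        A += lam * np.outer(s, s)
    np.fill_diagonal(A, 0.0)
    return A

# Clusters and intensities from the slides (indices: lnx=0, x²=1, x³=2, x½=3, x¼=4)
clusters = [[0, 1, 2, 3, 4], [1, 2, 3, 4], [0, 3, 4], [1, 2]]
intensities = [1, 2, 1, 3]
A = model_similarity(clusters, intensities, 5)   # e.g. A[1, 2] = 1 + 2 + 3 = 6 for x², x³
```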
Doubly greedy strategy. OUTER LOOP, one cluster at a time: find a real c and a binary z to minimise L²(B, c, z); take the cluster S = { i : zi = 1 }; update B ← B − c z zᵀ; reiterate. After m iterations: clusters Sk with sizes Nk = |Sk| and intensities ck, and the decomposition T(B) = c1²N1² + … + cm²Nm² + L² (●).
INNER LOOP: finding one cluster. Maximise the contribution to (●): max (c·NS)². Property: the average similarity b(i, S) of i to S is greater than c/2 if i ∈ S and less than c/2 if i ∉ S. Algorithm ADDI-S: take S = {i} for an arbitrary i; given S, find c = c(S) and b(i, S) for all i; if b(i, S) − c/2 > 0 for some i ∉ S, or < 0 for some i ∈ S, change the state of that i; otherwise stop and output S. The resulting S satisfies the property. Related methods: Holzinger (1941) B-coefficient; Arkadiev & Braverman (1964, 1967) Specter; Mirkin (1976, 1987) ADDI-…; Ben-Dor, Shamir, Yakhini (1999) CAST.
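A sketch of the ADDI-S local search under the c/2 rule stated above; starting from the pair {i, nearest neighbour of i} instead of a bare singleton is an implementation convenience of the sketch, not part of the original description:

```python
import numpy as np

def addi_s(B, seed, max_passes=100):
    """Flip the state of the entity whose average similarity to S most violates the c/2
    rule; stop when b(i,S) > c/2 for all i in S and b(i,S) < c/2 for all i outside S."""
    n = B.shape[0]
    sims = B[seed].astype(float).copy()
    sims[seed] = -np.inf
    S, c = {seed, int(sims.argmax())}, 0.0     # seed entity plus its closest neighbour
    for _ in range(max_passes):
        members = sorted(S)
        sub = B[np.ix_(members, members)]
        # c(S): average within-cluster similarity over off-diagonal pairs
        c = (sub.sum() - np.trace(sub)) / (len(members) * (len(members) - 1))
        best_i, best_gain = None, 0.0
        for i in range(n):
            others = [j for j in members if j != i]
            if not others:
                continue
            b_iS = B[i, others].mean()         # average similarity of i to S
            gain = (b_iS - c / 2) if i not in S else (c / 2 - b_iS)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:                     # the c/2 property holds: S is the answer
            break
        S.symmetric_difference_update({best_i})
        if len(S) < 2:                         # safeguard for this sketch
            break
    return S, c
```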
DRA on mixed variable scales and normalisation. Feature normalisation: any measure clear of the distribution, e.g. the range. Nominal scales: binary categories are normalised so that the total feature contribution comes out right, e.g. by the square root of the number of categories.
DRA on interpretation. Cluster centroids are supplemented with contributions of feature/cluster pairs or entity/cluster pairs. K-Means: what is a representative? Conventionally, the entity at minimum distance from the centroid; in the data recovery view, the entity with maximum inner product with the centroid.
DRA on incomplete clustering. With the model assigning un-clustered entities to the "norm" (e.g. the gravity centre): Anomalous Pattern clustering (iterated).
DRA on the number of clusters: iK-Means (under the assumption that every cluster, in sequence, contributes more than the next one, a "planetary" model). Otherwise the issue remains rather bleak.
Failure of statistically sound criteria. Ming-Tso Chiang (2006): 100 entities in 6D, 4 clusters, between-cluster distances 50 times greater than within-cluster distances; Hartigan's F coefficient and the Jump statistic fail.
Conclusion Data recovery approach should be the major mathematical underpinning for data mining as a framework for finding patterns in data