
Clustering

Cluster: a number of things of the same kind being close together in a group (Longman Dictionary of Contemporary English).

CS 240B lecture notes by C. Zaniolo.

Example: Customer Segmentation
- Given: a large database of customer data containing their properties and past buying records
- Find groups of customers with similar behavior (clusters)
- Find customers with unusual behavior (outliers)

Problem Definition
- Given: a set of N items in D dimensions
- Find: a natural partitioning of the data set into a number of clusters (k) plus outliers, such that:
  - items in the same cluster are similar (intra-cluster similarity is maximized)
  - items from different clusters are different (inter-cluster similarity is minimized)
- No predefined classes: this is unsupervised learning
- Used either as a stand-alone tool to get insight into the data distribution or as a preprocessing step for other algorithms

These slides are based on those downloaded from www.cs.uiuc.edu/~hanj: Data Mining: Concepts and Techniques, Chapter 7. Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign. © 2006 Jiawei Han and Micheline Kamber.

Clustering: Rich Applications and Multidisciplinary Efforts
- Pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters, or use them for other spatial mining tasks
- Image processing
- Economic science (especially market research)
- WWW
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identify areas of similar land use in an earth observation database
- Insurance: identify groups of motor insurance policy holders with a high average claim cost
- City planning: identify groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults

K-Means
K-means (MacQueen, 1967) is one of the simplest clustering algorithms; it aims to minimize the distance of the objects from their cluster centers.
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
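The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, the choice of Euclidean distance, the use of randomly sampled data points as initial centroids, and the handling of empty clusters are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as the initial centroids (assumption).
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```

On this toy data set the two visually obvious groups of three points each are recovered. On less well-separated data the result depends on the initial centroids, so in practice the procedure is often run several times with different seeds.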

K-means example, step 1: Pick 3 initial cluster centers k1, k2, k3 (randomly). [Figure: points in the X-Y plane with centers k1, k2, k3]

K-means example, step 2: Assign each point to the closest cluster center. [Figure: points in the X-Y plane with centers k1, k2, k3]

K-means example, step 3: Move each cluster center to the mean of its cluster. [Figure: points in the X-Y plane with centers k1, k2, k3]

K-means example, step 4: Reassign the points that are now closest to a different cluster center. Q: Which points are reassigned? [Figure: points in the X-Y plane with centers k1, k2, k3]

K-means example, step 4 (continued): Reassign points to the closest center. [Figure: the points reassigned in this step]

K-means example, step 5: Recompute the cluster means. [Figure: points in the X-Y plane with centers k1, k2, k3]

K-means example, step 6: Reassign points to clusters. No change: the end. [Figure: points in the X-Y plane with centers k1, k2, k3]

K-means clustering summary
Advantages
- Simple, understandable
- Items are automatically assigned to clusters
Disadvantages
- Must pick the number of clusters beforehand
- All items are forced into a cluster
- Too sensitive to outliers

Similarity and Distance
- K-means and all other methods group together the most similar objects
- Some notion of distance is used to define similarity
  - Close by, i.e., similar
  - Far apart, i.e., dissimilar
- Distance is obvious in our X-Y planes, but not so obvious in general: categorical, boolean, vectors, etc.

Dissimilarity between Items is Expressed by their Distance
- Data matrix
  - No assumption
- Typically a symmetric matrix of pairwise distances d(i, j)
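As a small illustration of how a symmetric distance matrix can be derived from a data matrix, the sketch below computes all pairwise distances. The function name `dissimilarity_matrix` and the choice of Euclidean distance are assumptions for the example; any distance function could be substituted.

```python
import math


def dissimilarity_matrix(data):
    """Return the n x n matrix of pairwise Euclidean distances."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Symmetry: d(i, j) = d(j, i), so each pair is computed once.
            d[i][j] = d[j][i] = math.dist(data[i], data[j])
    return d


# Data matrix: one row per item, one column per dimension.
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(data)
```

Because the matrix is symmetric with a zero diagonal, only the entries below (or above) the diagonal need to be stored.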

Types of data in clustering analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types

Interval-Scaled Variables
- Interval-scaled variables are continuous measurements on a roughly linear scale (e.g., temperature, weight, coordinates), which are then assumed to range over an interval.
- Notion of distance between two vectors $X = \langle x_1, \ldots, x_n \rangle$ and $Y = \langle y_1, \ldots, y_n \rangle$:
  $d(X, Y) = \left( |x_1 - y_1|^q + \cdots + |x_n - y_n|^q \right)^{1/q}$
  - $q = 2$: Euclidean distance
  - $q = 1$: Manhattan distance
  - general $q \ge 1$: Minkowski distance
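The distance formula on this slide translates directly into code. The sketch below is illustrative only; the function name `minkowski` and the sample vectors are assumptions.

```python
def minkowski(x, y, q):
    """(|x1 - y1|^q + ... + |xn - yn|^q)^(1/q) for equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(x, y, 1)  # q = 1: |3| + |4| = 7
euclidean = minkowski(x, y, 2)  # q = 2: sqrt(9 + 16) = 5
```

For the same pair of points the distance does not increase as q grows, which is why the Manhattan value here is larger than the Euclidean one.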

Metric Properties
- Are satisfied by all three previous distances:
  - $d(i, j) \ge 0$
  - $d(i, i) = 0$
  - $d(i, j) = d(j, i)$
  - $d(i, j) \le d(i, k) + d(k, j)$

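Heterogeneous Variables
- Standardization is needed. E.g., if we have n values $x_1, \ldots, x_n$ for a variable x:
  - Calculate the mean absolute deviation with respect to the mean $m = \frac{1}{n}(x_1 + \cdots + x_n)$:
    $s = \frac{1}{n}\left( |x_1 - m| + \cdots + |x_n - m| \right)$
  - Calculate the standardized measurement (z-score):
    $z_i = \frac{x_i - m}{s}$
- Using the mean absolute deviation is more robust than using the standard deviation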
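A minimal sketch of this standardization follows, assuming the usual definitions: the mean absolute deviation s is the average of |x_i - m| around the mean m, and the standardized value is (x_i - m) / s. The function name `standardize` and the sample values are hypothetical.

```python
def standardize(values):
    """Standardize values using the mean absolute deviation around the mean."""
    n = len(values)
    m = sum(values) / n                        # mean
    s = sum(abs(v - m) for v in values) / n    # mean absolute deviation
    if s == 0:
        raise ValueError("all values are identical; cannot standardize")
    return [(v - m) / s for v in values]


z = standardize([2.0, 4.0, 6.0, 8.0])
# mean = 5, mean absolute deviation = (3 + 1 + 1 + 3) / 4 = 2
```

Deviations are not squared here, so a single extreme value inflates s less than it would inflate the standard deviation, and outliers remain more visible after standardization.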

Dissimilarity between Binary Variables
- Example
  - gender is a symmetric attribute
  - the remaining attributes are asymmetric binary (0 denotes the normal condition)
  - let the values Y and P be set to 1, and the value N be set to 0


Binary Variables: vector of size p
- A contingency table for binary data (rows: object i; columns: object j)
- Distance measure for symmetric binary variables
- Jaccard coefficient (similarity measure for asymmetric binary variables)
- Distance measure for asymmetric binary variables: 1 - sim
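The measures named on this slide can be sketched as follows. The cell names a, b, c, d are an assumed notation: a counts positions where both objects are 1, b where only object i is 1, c where only object j is 1, and d where both are 0. The formulas used are the standard simple matching distance and Jaccard coefficient; the function names and the sample vectors are hypothetical.

```python
def contingency(i, j):
    """Count the four cells of the 2x2 contingency table for two 0/1 vectors."""
    if len(i) != len(j):
        raise ValueError("vectors must have the same length")
    a = sum(1 for u, v in zip(i, j) if (u, v) == (1, 1))
    b = sum(1 for u, v in zip(i, j) if (u, v) == (1, 0))
    c = sum(1 for u, v in zip(i, j) if (u, v) == (0, 1))
    d = sum(1 for u, v in zip(i, j) if (u, v) == (0, 0))
    return a, b, c, d


def symmetric_distance(i, j):
    """Symmetric case: 0-0 matches count as agreement."""
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def jaccard_similarity(i, j):
    """Asymmetric case: 0-0 matches are ignored."""
    a, b, c, _ = contingency(i, j)
    if a + b + c == 0:
        return 1.0  # Assumption: two all-zero vectors are treated as identical.
    return a / (a + b + c)


def asymmetric_distance(i, j):
    """Distance for asymmetric binary variables: 1 - sim."""
    return 1.0 - jaccard_similarity(i, j)


obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
# Here a = 2, b = 0, c = 1, d = 3.
sym = symmetric_distance(obj_i, obj_j)     # 1/6
sim = jaccard_similarity(obj_i, obj_j)     # 2/3
asym = asymmetric_distance(obj_i, obj_j)   # 1/3
```

The asymmetric distance is larger than the symmetric one on this pair because the three shared zeros, which usually carry little information for asymmetric attributes, no longer count as agreement.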

Dissimilarity between Binary Variables
- Example
  - gender is a symmetric attribute
  - the remaining attributes are asymmetric binary
  - dissimilarity is computed for the asymmetric attributes only

Categorical Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: simple matching
  - m: number of matches, p: total number of variables: $d(i, j) = \frac{p - m}{p}$
- Method 2: use a large number of binary variables
  - create a new binary variable for each of the M nominal states
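Both methods can be illustrated briefly. The sketch assumes the usual simple matching dissimilarity (p - m) / p, with m the number of matching variables and p the total number of variables; the function names `simple_matching_distance` and `one_hot` and the sample records are hypothetical.

```python
def simple_matching_distance(i, j):
    """Method 1: fraction of categorical variables on which i and j disagree."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of variables")
    p = len(i)                                    # total number of variables
    m = sum(1 for u, v in zip(i, j) if u == v)    # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: encode one nominal value as one binary variable per state."""
    return [1 if value == s else 0 for s in states]


item_a = ("red", "small", "round")
item_b = ("red", "large", "round")
dist = simple_matching_distance(item_a, item_b)   # 2 of 3 variables match

colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("yellow", colors)
```

After the encoding of Method 2, the binary-variable measures from the earlier slides can be applied to the resulting 0/1 vectors.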

Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled variables:
  - replace each value $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - compute the dissimilarity using the methods for interval-scaled variables
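The rank-based mapping onto [0, 1] can be sketched as below, assuming the usual formula z = (r - 1) / (M - 1), where r is the rank of a value (starting at 1) and M is the number of ordered states. The function name and the medal example are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by its rank, rescaled to the range [0, 1]."""
    M = len(ordered_states)
    if M < 2:
        raise ValueError("need at least two ordered states")
    # Ranks run from 1 (lowest state) to M (highest state).
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]


states = ["bronze", "silver", "gold"]   # listed in increasing order
z = ordinal_to_unit_interval(["bronze", "gold", "silver"], states)
```

The rescaled values can then be fed to any interval-scaled distance, and variables with different numbers of states end up on a comparable scale.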

Ratio-Scaled Variables
- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately exponential, such as $Ae^{Bt}$ or $Ae^{-Bt}$
- Methods:
  - treat them like interval-scaled variables: not a good choice, because the scale can be distorted
  - apply a logarithmic transformation, $y_{if} = \log(x_{if})$
  - treat them as continuous ordinal data and treat their rank as interval-scaled
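A short sketch of the logarithmic transformation shows why it helps: values that grow by a constant factor become evenly spaced after taking logs. The function name `log_transform` and the sample values are assumptions for illustration.

```python
import math


def log_transform(values):
    """Apply y = log(x) to strictly positive ratio-scaled measurements."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(v) for v in values]


x = [1.0, 10.0, 100.0, 1000.0]   # grows by a factor of 10 at each step
y = log_transform(x)
# Differences between consecutive transformed values.
gaps = [b - a for a, b in zip(y, y[1:])]
```

On the raw scale the gaps are 9, 90, and 900, so a distance computed directly would be dominated by the largest values; after the transformation every gap is approximately log(10).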

Combining Variables of Mixed Types
- Bring all the variables onto a common scale, typically ranging between 0 and 1.

Vector Objects
- Vector objects: keywords in documents, gene features in micro-arrays, etc.
- Broad applications: information retrieval, biological taxonomy, etc.
- Cosine measure
- A variant: Tanimoto coefficient (for binary vectors)
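The two measures named on this slide can be sketched with their standard definitions: the cosine measure is x·y / (|x| |y|), and the Tanimoto coefficient is x·y / (x·x + y·y - x·y). The function names and the sample keyword vectors are hypothetical.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both must be non-zero)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def tanimoto(x, y):
    """Tanimoto coefficient; for 0/1 vectors this is shared / total attributes."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


# Hypothetical keyword-presence vectors for two documents.
doc_1 = [1, 1, 0, 1]
doc_2 = [1, 0, 0, 1]
cos = cosine_similarity(doc_1, doc_2)   # 2 / sqrt(3 * 2)
tan = tanimoto(doc_1, doc_2)            # 2 / (3 + 2 - 2)
```

For binary vectors the Tanimoto coefficient counts the attributes the two objects share divided by the attributes present in either one, which is the same quantity as the Jaccard coefficient.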