
Dimension reduction: PCA and Clustering
Slides by Agnieszka Juncker and Chris Workman, modified by Hanne Jarmer
The DNA Array Analysis Pipeline (flow diagram):
Question → Experimental Design → Array design → Probe design → Sample Preparation → Hybridization → Buy Chip/Array → Image analysis → Normalization → Expression Index Calculation → Comparable Gene Expression Data → Statistical Analysis / Fit to Model (time series) → Advanced Data Analysis (Clustering, Meta-analysis, PCA, Classification, Promoter Analysis, Survival analysis, Regulatory Network)
Motivation: Multidimensional data
Expression matrix of 36 probe sets (rows) measured across 8 patients (columns), for example:
Probe set      Pat 1  Pat 2  Pat 3  Pat 4  Pat 5  Pat 6  Pat 7  Pat 8
209619_at       7758   4705   5342   7443   8747   4933   7950   5031
32541_at         280    387    392    238    385    329    337    163
206398_s_at     1050    835   1268   1723   1377    804   1846   1180
219281_at        391    593    298    265    491    517    334    387
… (32 more probe sets)
Dimension reduction methods
• Principal component analysis (PCA)
  – Singular value decomposition (SVD)
• Multidimensional Scaling (MDS)
• Correspondence Analysis (CA)
• Cluster analysis
  – Can be thought of as a dimension reduction method, as clusters summarize the data
Principal Component Analysis (PCA)
• Used for visualization of high-dimensional data
• Projects high-dimensional data into a small number of dimensions
  – Typically 2-3 principal component dimensions
• Often captures much of the total data variation in only a few dimensions
• Exact solutions require a fully determined system (matrix with full rank)
  – i.e. a "square" matrix with independent entries
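In practice the projection itself is a one-liner. A minimal sketch (not part of the original slides), assuming the expression data is a samples-by-genes NumPy array and using scikit-learn's PCA:

```python
# Minimal PCA sketch: project a samples-by-genes expression matrix
# onto its first two principal components for an XY-plot.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 8973))      # hypothetical: 34 patients x 8973 genes

pca = PCA(n_components=2)            # keep only PC1 and PC2
scores = pca.fit_transform(X)        # shape (34, 2): one point per patient

print(scores[:3])
print("variance explained:", pca.explained_variance_ratio_)
```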
PCA
Singular Value Decomposition
Principal components
• 1st principal component (PC1)
  – Direction along which there is greatest variation
• 2nd principal component (PC2)
  – Direction with maximum variation left in the data, orthogonal to PC1
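The principal components can be read directly off a singular value decomposition of the centered data matrix. A NumPy-only sketch (illustrative, with made-up data) that also computes the variance captured by each dimension, as shown on the next slide:

```python
# Sketch: PCA via SVD of the centered data matrix (X: samples x features).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 36))             # hypothetical small expression matrix

Xc = X - X.mean(axis=0)                  # center each gene
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

pcs = Vt                                 # rows are the principal directions
scores = U * s                           # sample coordinates along each PC
var_explained = s**2 / np.sum(s**2)      # fraction of variance per component

print(var_explained[:3])                 # PC1 captures the most variation, then PC2, ...
```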
PCA: Variance by dimension
PCA dimensions by experiment
PCA projections (as XY-plot)
PCA: Leukemia data, precursor B and T Plot of 34 patients, dimension of 8973 genes reduced to 2
PCA of genes (Leukemia data) Plot of 8973 genes, dimension of 34 patients reduced to 2
Why do we cluster?
• Organize observed data into meaningful structures
• Summarize large data sets
• Used when we have no a priori hypotheses
• Optimization:
  – Minimize within-cluster distances
  – Maximize between-cluster distances
Many types of clustering methods
• Hierarchical, e.g. UPGMA
  – Agglomerative (bottom-up)
  – Divisive (top-down)
• Partitioning
  – K-means
  – PAM
  – SOM
Hierarchical clustering
• Representation of all pair-wise distances
• Parameters: none (apart from the choice of distance measure)
• Results:
  – One large cluster
  – Hierarchical tree (dendrogram)
• Deterministic
Hierarchical clustering – UPGMA
Algorithm:
• Assign each item to its own cluster
• Join the nearest clusters
• Re-estimate the distance between clusters
• Repeat until all items are joined into one cluster
UPGMA: Unweighted Pair Group Method with Arithmetic mean
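In practice a UPGMA-style tree can be built with SciPy, where average linkage corresponds to the unweighted pair-group arithmetic mean. A sketch with hypothetical data (the cut into 3 clusters is just an example):

```python
# Sketch: UPGMA-style hierarchical clustering with SciPy (average linkage).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 50))            # hypothetical: 20 samples x 50 genes

Z = linkage(X, method='average', metric='euclidean')   # merge table = dendrogram
labels = fcluster(Z, t=3, criterion='maxclust')        # cut the tree into 3 clusters
print(labels)

# dendrogram(Z)   # draws the tree if a matplotlib backend is available
```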
Hierarchical clustering
Hierarchical clustering
Hierarchical clustering (figure): data with clustering order and distances, and the corresponding dendrogram representation
Leukemia data - clustering of patients
Leukemia data - clustering of patients on top 100 significant genes
Leukemia data - clustering of genes
K-means clustering
• Input: N objects given as data points in R^p
• Specify the number k of clusters
• Initialize k cluster centers. Iterate until convergence:
  – Assign each object to the cluster with the closest center (Euclidean distance)
  – Take the centroids of the obtained clusters as the new cluster centers
• K-means can be seen as an optimization problem: minimize the sum of squared within-cluster distances
• The result depends on the initialization
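A minimal NumPy sketch of these iteration steps (illustrative only; the random data and the empty-cluster handling are assumptions, not part of the slides):

```python
# Minimal K-means sketch mirroring the steps above (NumPy only).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # initialize k centers
    for _ in range(n_iter):
        # assign each object to the cluster with the closest center (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute the centroids of the obtained clusters
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))            # hypothetical data: 100 objects in R^5
labels, centers = kmeans(X, k=3)
```

Because the result depends on the initialization, it is common to run the procedure with several random starts and keep the solution with the smallest within-cluster sum of squares.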
K-means - Algorithm
K-means clustering, k=3
K-means clustering, k=3
K-means clustering, k=3
K-means clustering of Leukemia data
K-means clustering of Cell Cycle data
Partitioning Around Medoids (PAM)
• PAM is a partitioning method like K-means
• For a prespecified number of clusters k, the PAM procedure searches for k representative objects, or medoids M = (m1, ..., mk)
• The medoids minimize the sum of the distances of the observations to their closest medoid
• After finding a set of k medoids, k clusters are constructed by assigning each observation to the nearest medoid
• PAM can be applied to general data types and tends to be more robust than K-means
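PAM itself uses a greedy build phase followed by swap moves; the sketch below is a simplified k-medoids variant (alternating assignment and medoid update) and is only illustrative. Because it works on a precomputed distance matrix rather than raw coordinates, it can handle general data types:

```python
# Simplified k-medoids sketch (not the full PAM build/swap search).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def k_medoids(D, k, n_iter=100, seed=0):
    """D: precomputed n x n dissimilarity matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)          # nearest medoid per object
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # pick the member minimizing the total within-cluster distance
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    return labels, medoids

X = np.random.default_rng(4).normal(size=(50, 10))     # hypothetical data
D = squareform(pdist(X))                               # any dissimilarity works here
labels, medoids = k_medoids(D, k=3)
```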
Self-Organizing Maps (SOM)
• Partitioning method (similar to the K-means method)
• Clusters are organized in a two-dimensional grid
• The SOM algorithm finds the optimal organization of the data in the grid
• Iteration steps (20,000-50,000):
  – Pick a data point P at random
  – Move all nodes in the direction of P, the closest node the most
  – Decrease the amount of movement over time
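A minimal sketch of this update rule (the grid size, learning-rate schedule and neighbourhood width are illustrative assumptions, not the exact SOM settings used in the examples that follow):

```python
# Minimal SOM sketch following the iteration steps above (NumPy only).
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))                         # hypothetical: 200 points in R^10

grid_h, grid_w = 4, 4                                  # 4x4 grid of nodes
nodes = rng.normal(size=(grid_h * grid_w, X.shape[1]))
grid_pos = np.array([(i, j) for i in range(grid_h) for j in range(grid_w)], float)

n_iter = 20000
for t in range(n_iter):
    p = X[rng.integers(len(X))]                        # pick a data point at random
    best = np.linalg.norm(nodes - p, axis=1).argmin()  # closest node ("winner")
    lr = 0.5 * (1 - t / n_iter)                        # movement decreases over time
    sigma = 2.0 * (1 - t / n_iter) + 0.1               # neighbourhood shrinks over time
    grid_dist = np.linalg.norm(grid_pos - grid_pos[best], axis=1)
    influence = np.exp(-(grid_dist**2) / (2 * sigma**2))
    nodes += lr * influence[:, None] * (p - nodes)     # closest node moves the most
```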
SOM - example
SOM - example
SOM - example
SOM - example
SOM - example
Comparison of clustering methods
• Hierarchical
  – Advantage: fast to compute
  – Disadvantage: rigid
• Partitioning
  – Advantage: provides clusters that roughly satisfy an optimality criterion
  – Disadvantage: needs an initial k and is time consuming
Distance measures
• Euclidean distance
• Vector angle distance
• Pearson's distance
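As a sketch, the three measures applied to two expression profiles (the example vectors reuse the first four patients of probes 209619_at and 32541_at from the motivation table; the function names are illustrative):

```python
# Sketch of the three distance measures for two expression profiles x and y.
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

def vector_angle(x, y):
    # 1 - cosine of the angle between the two profiles
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    # 1 - Pearson correlation (like the angle distance, but on centered profiles)
    return 1 - np.corrcoef(x, y)[0, 1]

x = np.array([7758., 4705., 5342., 7443.])   # 209619_at, Pat 1-4
y = np.array([280., 387., 392., 238.])       # 32541_at, Pat 1-4
print(euclidean(x, y), vector_angle(x, y), pearson(x, y))
```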
Comparison of distance measures
Summary
• Dimension reduction is important to visualize data
• Methods:
  – PCA/SVD
  – Clustering (the distance measure is important)
    • Hierarchical
    • K-means/PAM
    • SOM
Coffee break. Next: exercises in dimension reduction and clustering.