DATA MINING: Algorithms, Applications and Beyond
Chandan K. Reddy
Department of Computer Science, Wayne State University, Detroit, MI 48202

Organization
- Introduction
- Basic components
- Fundamental Topics
  - Classification
  - Clustering
  - Association Analysis
- Research Topics
  - Probabilistic Graphical Models
  - Boosting Algorithms
  - Active Learning
  - Mining under Constraints
- Teaching

Lots of Data ...
- Customer Transactions
- Bioinformatics
- Banking
- Internet / Web
- Biomedical Imaging

So What?
- Computers have become cheaper and more powerful, so storage is not an issue
- There is often information “hidden” in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all
- We are drowning in data, but starving for knowledge!

Data Mining is ...
- “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data”
- “the science of extracting useful information from large data sets or databases” (Wikipedia.org)
- A more appropriate term would be Knowledge Discovery in Databases (KDD)

Steps in Knowledge Discovery

Steps in the KDD Procedure
- Data Cleaning (removal of noise and inconsistent records)
- Data Integration (combining multiple sources)
- Data Selection (only data relevant for the task are retrieved from the database)
- Data Transformation (converting data into a form more appropriate for mining)
- Data Mining (application of intelligent methods in order to extract data patterns)
- Model Evaluation (identification of truly interesting patterns representing knowledge)
- Knowledge Presentation (visualization or other knowledge presentation techniques)

What can Data Mining do?
- Figures out intelligent ways of handling the data
- Finds valuable information hidden in large volumes of data
- Analyzes the data and finds patterns and regularities in it
- Mining analogy: in a mining operation, large amounts of low-grade material are sifted through in order to find something of value
- Identifies abnormal/suspicious activities
- Provides guidelines to humans on what to look for in a dataset

Related CS Topics
[Figure: Data Mining shown together with the related fields]
- Pattern Recognition
- Database Systems
- Artificial Intelligence
- Machine Learning
- Visualization
- Optimization Algorithms
- Statistics

Typical Data Mining Tasks are ...
- Prediction Methods (you know what to look for)
  - Use some variables to predict unknown or future values of other variables.
- Description Methods (you don’t know what to look for)
  - Find human-interpretable patterns that describe the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996

Basic components
- Data Pre-processing
- Data Visualization
- Model Evaluation
- Classification
- Clustering
- Association Analysis

Different kinds of Data
- Record Data
  - Data Matrix
  - Document Data
  - Transaction Data
- Graph Data
- Ordered Data
  - Temporal Data
  - Sequence Data
  - Spatio-Temporal Data

Record Data
- Data that consists of a collection of records, each of which consists of a fixed set of attributes

Document Data
- Each document becomes a 'term' vector,
- each term is a component (attribute) of the vector,
- the value of each component is the number of times the corresponding term occurs in the document.
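The term-vector idea can be sketched in a few lines. This is a minimal illustration, not the deck's own code: the function name `term_vector`, the simple lowercase tokenizer, and the two example sentences are all assumptions made for the example.

```python
import re
from collections import Counter


def term_vector(document, vocabulary):
    """Return the count of each vocabulary term in `document`, in vocabulary order."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    # Counter returns 0 for terms that never occur in the document.
    return [counts[term] for term in vocabulary]


# Hypothetical two-document corpus, for illustration only.
docs = [
    "The team won the game",
    "The coach praised the team and the team cheered",
]

# The vocabulary is the sorted set of all terms seen in the corpus.
vocabulary = sorted({t for d in docs for t in re.findall(r"[a-z]+", d.lower())})
vectors = [term_vector(d, vocabulary) for d in docs]

for doc, vec in zip(docs, vectors):
    print(doc, "->", dict(zip(vocabulary, vec)))
```

Every document maps to a vector of the same length, so the corpus becomes a data matrix with one row per document and one column per term.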

Transaction Data
- A special type of record data, where each record (transaction) involves a set of items.
- The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

Graph Data
- Data with relationships among objects
- Examples: (a) Generic Web Data (b) Citation Data Analysis

Ordered Data
- Time series data: a series of measurements taken over a certain time frame
- E.g. financial data

Ordered Data
- Sequence data: no time stamps, but order is still important
- E.g. genome data

Ordered Data
- Spatio-Temporal Data
- Average monthly temperature of land and ocean, collected for a variety of geographical locations (a total of 250,000 data points)

Data Pre-Processing
- Removal of noise and outliers
  - Will improve the performance of mining
- Sampling is employed for data selection
  - Processing the entire data might be expensive
- Dealing with high-dimensional data
  - Curse of dimensionality
- Data Normalization
  - Different features have different ranges of values, e.g. human age, height, weight
- Feature Selection
  - Remove unnecessary features: redundant or irrelevant
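As an illustration of data normalization, the sketch below rescales each feature to the range [0, 1]. Min-max scaling is one common choice and is an assumption here, since the slide does not name a specific method; the function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical records with features on very different scales.
ages_years = [20, 35, 50]
heights_cm = [150, 165, 180]
weights_kg = [50, 80, 95]

for name, feature in [("age", ages_years), ("height", heights_cm), ("weight", weights_kg)]:
    print(name, min_max_normalize(feature))
```

After scaling, no single feature dominates a distance computation simply because it is measured in larger units.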

Data Visualization
- Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
[Figures: Histograms; Pie Chart]

Scatter Plot Array of Iris Attributes

Contour Plot Example: Celsius

Parallel Coordinates Plots for Iris Data

Chernoff Faces for Iris Data
[Figure panels: Setosa, Versicolour, Virginica]

A Sample Data Cube
[Figure: data cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum) and Country (U.S.A., Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]

Classification
[Figure: Training Phase: Existing Data -> Training Algorithm -> Learn Model; Testing Phase: New Data (???) -> Apply Model -> Result]

Classification models
Example decision tree:
Outlook
- Sunny: Humidity
  - High: No
  - Normal: Yes
- Overcast: Yes
- Rainy: Windy
  - True: No
  - False: Yes
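A decision tree like this one is just a nested set of tests. The sketch below assumes the flattened diagram follows the classic weather example (Sunny splits on Humidity, Overcast is always Yes, Rainy splits on Windy); the function name `predict_play` and the argument encoding are hypothetical.

```python
def predict_play(outlook, humidity, windy):
    """Walk the assumed weather decision tree and return the class label, "Yes" or "No"."""
    if outlook == "Overcast":
        # Overcast is a leaf node in the assumed tree.
        return "Yes"
    if outlook == "Sunny":
        # Under Sunny, the decision depends only on Humidity.
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rainy":
        # Under Rainy, the decision depends only on Windy.
        return "No" if windy else "Yes"
    raise ValueError(f"unknown outlook: {outlook!r}")


# Applying the learned model to new records (the testing phase).
print(predict_play("Sunny", "High", False))
print(predict_play("Rainy", "Normal", False))
```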

Metrics for Performance Evaluation

                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   a (TP)       b (FN)
CLASS    Class=No    c (FP)       d (TN)

Most widely-used metric: Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
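The four confusion-matrix cells translate directly into a single number. The sketch below assumes the metric meant here is accuracy, the fraction of records on the diagonal; the function name and the counts are hypothetical.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of correctly classified records: (TP + TN) / (TP + FN + FP + TN)."""
    total = tp + fn + fp + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts for cells a (TP), b (FN), c (FP), d (TN).
a, b, c, d = 40, 10, 5, 45
print(f"accuracy = {accuracy(a, b, c, d):.2f}")
```

Accuracy can be misleading when one class is rare, because a model that always predicts the majority class still scores highly.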

Evaluating Data Mining techniques
- Predictive Accuracy (ability of a model to predict the future), or
- Descriptive Quality (ability of a model to find meaningful descriptions of the data, e.g. clusters)
- Speed (computation cost involved in generating and using the model)
- Robustness (ability of a model to work well even with noisy or missing data)
- Scalability (ability of a model to scale up well with large amounts of data)
- Interpretability (level of understanding and insight provided by the model)

Clustering
- No class labels, so no prediction
- Groupings in the data (descriptive)
- Can be used to summarize the data
- Can help in removing outliers and noise
- Image segmentation, document clustering, gene expression data, etc.

Association Analysis
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
[Table: Market-Basket transactions]
Example of Association Rules:
  {Diaper} -> {Beer}
  {Milk, Bread} -> {Eggs, Coke}
  {Beer, Bread} -> {Milk}
Implication means co-occurrence, not causality!
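Rules of this kind are usually scored by how often they hold. The sketch below computes two standard measures, support and confidence, which the slide itself does not define; the five transactions are hypothetical market-basket data chosen for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    ante = support(antecedent, transactions)
    both = support(set(antecedent) | set(consequent), transactions)
    return both / ante if ante > 0 else 0.0


# Hypothetical market-basket transactions, one set of items per shopping trip.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

rule_support = support({"Diaper", "Beer"}, transactions)
rule_confidence = confidence({"Diaper"}, {"Beer"}, transactions)
print(f"{{Diaper}} -> {{Beer}}: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Both measures count co-occurrence only, which is why a high-confidence rule still says nothing about causality.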

Probabilistic Graphical Models
- Real-world data is very complicated
- We would like to understand the underlying distribution that generated the data
- If it is unimodal, then it is easy to solve
- But usually the distribution is multimodal, not unimodal

Parameter Estimation
- Modeling with Probabilistic Graphical Models
  - Mixture Models
  - Hidden Markov Models
  - Mixture-of-Experts
  - Bayesian Networks
  - Mixture of Factor Analyzers
  - Neural Networks
  - And so on ...
- We don’t want sub-optimal models

Example

Motivation
“Searching for a needle in a haystack”

Problems with Local Optimization
- Local methods offer only “fine-tuning” capability: they refine the solution nearest to the starting point and cannot move beyond it.
- There is a need for a method that explores a subspace in a systematic manner.

TRUST-TECH Approach Systematic Tier-by-Tier search

Mixture Models
- Let $x = [x_1, x_2, \ldots, x_d]^T$ be the $d$-dimensional feature vector
- Assumption: $K$ components in the mixture model
- Let $\Theta = \{\alpha_1, \alpha_2, \ldots, \alpha_K, \theta_1, \theta_2, \ldots, \theta_K\}$ represent the collection of parameters

Maximum Likelihood Estimation
- Let $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ be the set of $n$ i.i.d. samples
- Goal: Find $\Theta$ that maximizes the likelihood function
- Difficulty: (i) no closed-form solution and (ii) the likelihood surface is highly nonlinear

EM Algorithm
- Initialization: Set the initial parameters
- Iteration: Iterate the following until convergence
  - E-Step: Compute the Q-function, i.e. the expectation of the log likelihood given the current parameters
  - M-Step: Maximize the Q-function with respect to $\Theta$
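The EM iteration can be made concrete for the simplest case. The sketch below fits a one-dimensional Gaussian mixture with two components. It is an illustration under stated assumptions, not the deck's implementation: the function names, the synthetic data, and the starting parameters are all hypothetical. The E-step computes the responsibilities that define the Q-function, and the M-step uses the closed-form updates that maximize it.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Log likelihood of the data under the mixture."""
    params = list(zip(weights, means, variances))
    return sum(math.log(sum(w * gaussian_pdf(x, m, v) for w, m, v in params)) for x in data)


def em_step(data, weights, means, variances):
    """One EM iteration: E-step (responsibilities), then M-step (closed-form updates)."""
    params = list(zip(weights, means, variances))
    # E-step: posterior probability that each point came from each component.
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in params]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: re-estimate weights, means and variances from the soft assignments.
    new_w, new_m, new_v = [], [], []
    for j in range(len(weights)):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_v.append(max(var, 1e-6))  # floor guards against a collapsing component
    return new_w, new_m, new_v


def fit_em(data, weights, means, variances, tol=1e-8, max_iter=500):
    """Iterate EM until the log likelihood improves by less than `tol`."""
    prev = log_likelihood(data, weights, means, variances)
    for _ in range(max_iter):
        weights, means, variances = em_step(data, weights, means, variances)
        cur = log_likelihood(data, weights, means, variances)
        if cur - prev < tol:
            break
        prev = cur
    return weights, means, variances


# Synthetic data: two well-separated Gaussian clusters (assumed for illustration).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [random.gauss(5.0, 1.0) for _ in range(200)]
init = ([0.5, 0.5], [-1.0, 6.0], [1.0, 1.0])
weights, means, variances = fit_em(data, *init)
print("weights:", weights, "means:", means, "variances:", variances)
```

EM only climbs to the local maximum nearest the starting parameters, so a different `init` can end at a different solution.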

Nonlinear Transformation [JCB ’06]
One-to-one correspondence of the critical points:

Original Function       Dynamical System
Local Minimum           Stable Equilibrium Point
Saddle Point            Decomposition Point
Local Maximum           Source
Likelihood Function     Energy Function

Experimental Results [IEEE PAMI ’08]

Finding Motifs using Probabilistic Models

j      k=b     k=1     k=2     k=3     k=4     ...    k=l
{A}    C0,1    C1,1    C2,1    C3,1    C4,1    ...    Cl,1
{T}    C0,2    C1,2    C2,2    C3,2    C4,2    ...    Cl,2
{G}    C0,3    C1,3    C2,3    C3,3    C4,3    ...    Cl,3
{C}    C0,4    C1,4    C2,4    C3,4    C4,4    ...    Cl,4

Results

Results [BMC AMB ’06]
Different motifs and the average score using random starts, with the first-tier and second-tier improvements.

Neural Network Diagram
- Inputs: xi
- Output: y
- Weights: wij
- Biases: bi
- Targets: t
- # of Input Nodes: n
- # of Hidden Layers: 1
- # of Hidden Nodes: k
- # of Output Nodes: 1

Results – Classification Error (%) [IJCNN ’07]

              Train                                      Test
Dataset       BP       TRUST-TECH+BP   Improvement (%)   Best BP   TRUST-TECH+BP   Improvement (%)
Cancer         2.21     1.74             27.01             3.95      2.63            50.19
Image          9.37     8.04             16.54            11.08      9.74            13.76
Ionosphere     2.35     0.57            312.28            10.25      7.96            28.77
Iris           1.25     1.00             25.00             3.33      2.67            24.72
Diabetes      22.04    20.69              6.52            23.83     20.58            15.79
Sonar          1.56     0.72            116.67            19.17     12.98            47.69
Wine           4.56     3.58             27.37            14.94      6.73           121.99
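The slide does not say how the Improvement (%) column is computed. The reported figures are consistent with the reduction in error measured relative to the TRUST-TECH+BP error, which the sketch below reproduces for a few rows; the function name is hypothetical.

```python
def relative_improvement(bp_error, trust_tech_error):
    """Error reduction as a percentage of the TRUST-TECH+BP error (assumed definition)."""
    return (bp_error - trust_tech_error) / trust_tech_error * 100.0


# (dataset, BP error, TRUST-TECH+BP error, reported improvement), taken from the table.
rows = [
    ("Cancer (train)", 2.21, 1.74, 27.01),
    ("Ionosphere (train)", 2.35, 0.57, 312.28),
    ("Wine (test)", 14.94, 6.73, 121.99),
]

for name, bp, tt, reported in rows:
    computed = round(relative_improvement(bp, tt), 2)
    print(f"{name}: computed {computed}, reported {reported}")
```

Dividing by the smaller error is why some entries exceed 100%: an improvement of 312.28% for Ionosphere means the BP training error is roughly four times the TRUST-TECH+BP error.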

Boosting Algorithms for Biomedical Imaging
- Tumor detection and tumor tracking must be performed in almost real time
- Wavelet features are reasonably good classifiers, but not very good ones individually

Medical Image Retrieval using Boosting Methods
- Retrieving similar medical images is very valuable for diagnosis (automated diagnosis systems)
- Each category is trained separately and different models are learned
- Given a query image, the most similar images are displayed

Identification of Microbes
- Segment the objects by accurately identifying the boundaries
- Semi-automated methods perform very well
- Apply Active Learning methods for labeling the pixels

Results [JMA ’04]

Active Learning for Biomedical Imaging
- Labeling/annotating images is a daunting task
- We need to help the medical doctors label the images efficiently
- Rather than showing the images in random order, Active Learning can pick the hardest ones
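One common way to pick the hardest items is uncertainty sampling: ask the expert to label the examples on which the current model is least sure. The slide does not name a selection criterion, so this is an assumed technique; the function name and the probabilities below are hypothetical.

```python
def pick_most_uncertain(probabilities, batch_size=1):
    """Return indices of the items whose predicted P(positive) is closest to 0.5."""
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:batch_size]


# Hypothetical model outputs for five unlabeled images.
predicted = [0.95, 0.52, 0.10, 0.45, 0.80]

# Images to show the doctor next, hardest first.
to_label = pick_most_uncertain(predicted, batch_size=2)
print(to_label)
```

After the selected images are labeled, the model is retrained and the selection is repeated, so each round of expert effort goes to the most informative images.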

Mining Under Constraints [submitted]
- Business problems pose many real-world constraints
- Obviously, models trained without the knowledge of these constraints do not perform well
[Figure: Training Phase: Learn Model; Testing Phase: Constraints, Apply Model]

Mining Under Constraints
[Figure: Training Phase: Learn Model, Learn Constraints Model; Testing Phase: Apply Model, Apply Model]

Conclusion
- Different data mining related tasks are discussed in general
- Core data mining algorithms are illustrated
- Data mining helps existing technologies but it doesn’t override them
- A few challenges still remain unsolved
  - Problems like parameter estimation and automated parameter selection are still on-going research tasks
  - Handling real-world constraints
  - Incorporating domain knowledge during the training phase

Teaching
- Fall 2007: CSC 5991 Data Mining I – Fundamentals of Data Mining
  http://www.cs.wayne.edu/~reddy/Courses/CS5991/
- Winter 2008: CSC 7991 Data Mining II – Topics in Data Mining
  http://www.cs.wayne.edu/~reddy/Courses/CSC7991/

Data Mining I (Fall 2007)
- This course introduces the fundamental principles, algorithms and applications of data mining.
- Topics covered in this course include:
  - data pre-processing
  - data visualization
  - model evaluation
  - predictive modeling
  - association analysis
  - clustering
  - anomaly detection

Data Mining II (Winter 2008)
- This will be a continuation course. Data mining problems that arise in various application domains will be discussed. (No Prereq: special classes)
- The following topics will be covered:
  - Text Mining
  - Data Warehousing
  - Mining Data Streams
  - Probabilistic Graphical Models
  - Frequent Pattern Mining
  - Multi-relational Data Mining
  - Graph Mining
  - Visual Data Mining
  - Sequence Pattern Mining
  - Time-Series Data
  - Privacy-preserving Data Mining
  - High-Dimensional Data Clustering

Thank You
Questions and Comments!
Contact Information:
- Office: 452 State Hall
- Email: reddy@cs.wayne.edu
- WWW: http://www.cs.wayne.edu/~reddy/