Knowledge Mining (Data Mining)
Χ. Παπαθεοδώρου
Laboratory on Digital Libraries and Electronic Publishing
Department of Archives and Library Science, Ionian University

Data Mining
- Extraction of knowledge from very large data collections
- Knowledge: rules, behaviour patterns, and associations between objects (non-obvious, latent, previously unknown, and useful)
- Object: consists of a set of features
- What it is not:
  - (Deductive) query processing
  - Expert systems, small machine learning/statistical programs

Why Data Mining? Potential Applications
- Database analysis and decision support
  - Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (newsgroups, email, documents) and Web analysis
  - Intelligent query answering

Market Analysis and Management (1)
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information

Market Analysis and Management (2)
- Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identifying the best products for different customers
  - Use prediction to find what factors will attract new customers
- Provides summary information
  - Various multidimensional summary reports
  - Statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
- Resource planning:
  - Summarize and compare the resources and spending
- Competition:
  - Monitor competitors and market directions
  - Group customers into classes and a class-based pricing procedure
  - Set pricing strategy in a highly competitive market

Steps of a KDD Process
- Learning the application domain:
  - Relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation:
  - Find useful features, dimensionality/variable reduction, invariant representation
- Choosing functions of data mining
  - Summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.

Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process.
[Figure: KDD pipeline from Databases through Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, and Data Mining to Pattern Evaluation]

Data pre-processing
- Data preparation is a big issue for data mining
- Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but this is still an active area of research
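
As one illustration of the discretization step, the sketch below uses equal-width binning, which is only one of many possible methods and is not named on the slide. The function name `equal_width_bins` and the `ages` values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in the range 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 25, 31, 40, 47, 58, 60]  # hypothetical attribute values
bins = equal_width_bins(ages, 3)
print(bins)
```

Each continuous value is replaced by a small integer code, which is the form many rule-based and association mining methods expect.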

[Figure: Data pre-processing]

Clustering
- Partition the data set into clusters; one can then store only the cluster representation
- Clustering can be hierarchical, and the clusters can be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
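
A minimal sketch of partitioning a data set into clusters, assuming k-means as the algorithm (the slide does not commit to any particular one). The function `kmeans`, the seeding rule (the first k points), and the sample `points` are all illustrative assumptions.

```python
def kmeans(points, k, n_iter=10):
    """Tiny k-means for 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, clusters


# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, 2)
print(centroids)
```

Here the two centroids play the role of the stored cluster representation: they summarize six points with two.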

[Figure: Cluster Analysis]

Classification
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with a lot of extensions
- Scalability is still an important issue for database applications; combining classification with database techniques should therefore be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.

Classification process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction: training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown objects
  - Estimate accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected

Classification Process (1): Model Construction
[Figure: Training Data -> Classification Algorithms -> Classifier (Model)]
Example of a learned model: IF rank = professor OR years > 6 THEN tenured = yes

Classification Process (2): Use the Model in Prediction
[Figure: the Classifier is applied to Testing Data and to Unseen Data]
Unseen data: (Jeff, Professor, 4). Tenured?
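
The two classification slides can be sketched together: the rule from the model-construction slide is coded by hand, its accuracy is estimated on a small test set, and it is then applied to the unseen tuple (Jeff, Professor, 4). The test tuples and their labels are hypothetical, since the original tables were not recovered, and `predict_tenured` is an illustrative name.

```python
def predict_tenured(rank, years):
    """Rule from the slide: IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Hypothetical labelled test set: (name, rank, years, known label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy rate: share of test samples whose predicted label matches the known label.
correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in test_set
)
accuracy = correct / len(test_set)

# Model usage on the unseen tuple from the slide.
jeff = predict_tenured("Professor", 4)
print(accuracy, jeff)
```

With these assumed test tuples the rule misclassifies one sample out of four, which shows why accuracy is measured on data the model was not built from.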

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Document category modelling
- Example: filtering spam email.
- Task: classify incoming email as spam or legitimate (2 document categories).
- Simple blacklist and keyword-based methods have failed.
- More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modelling).

Document category modelling
- Step 1 (linguistic pre-processing): tokenization, removal of stopwords, stemming/lemmatization.
- Step 2 (vector representation): bag-of-words or n-gram modelling (n = 2, 3).
- Step 3 (feature selection): information gain evaluation.
- Step 4 (machine learning): Bayesian modelling, using word/n-gram frequency.
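
The steps above can be sketched as a very small multinomial naive Bayes filter. It is a simplification under stated assumptions: stemming (part of Step 1) and information gain selection (Step 3) are omitted, only single words are used, Laplace smoothing is my addition, and the stopword list and training messages are hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "to", "is", "at", "for", "you", "your"}  # tiny illustrative list


def tokenize(text):
    """Step 1: lowercase, split into word tokens, drop stopwords (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def train(docs):
    """Steps 2 and 4: build per-category bag-of-words frequency counts."""
    word_counts = {}        # category -> Counter of word frequencies
    doc_counts = Counter()  # category -> number of training documents
    for text, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest log posterior (Laplace smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages for the two document categories.
model = train([
    ("win cash prize now", "spam"),
    ("cheap pills win money", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("project report attached", "legitimate"),
])
print(classify("win a cash prize", model))
print(classify("monday project meeting", model))
```

Because the model is just word frequencies per category, it can be retrained as new messages arrive, which is what makes the approach adaptive compared with a fixed blacklist.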

What Is Association Mining?
- Association rule mining:
  - Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications:
  - Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Examples:
  - Rule form: "Body => Head [support, confidence]"
  - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]

Association Rule: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * => Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  - Home Electronics => * (what other products the store should stock up on)

Rule Measures: Support and Confidence
- Find all the rules X & Y => Z with minimum confidence and support
  - support, s: probability that a transaction contains {X & Y & Z}
  - confidence, c: conditional probability that a transaction having {X & Y} also contains Z
- Find the rules with support and confidence equal to or greater than a given threshold
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]

Mining Association Rules: An Example
- Min. support 50%, min. confidence 50%
- For rule A => C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.6%
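
The support and confidence figures for the rule A => C can be checked with a short sketch. The four transactions are an assumed data set, chosen so that the results match the 50% and 66.6% quoted on the slide; the slide's own transaction table was not recovered. The function names are illustrative.

```python
# Assumed transaction database (one set of items per transaction).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(body, head):
    """Conditional probability that a transaction containing `body` also contains `head`."""
    return support(body | head) / support(body)


s_ac = support({"A", "C"})        # support of the rule A => C
c_a_c = confidence({"A"}, {"C"})  # confidence of A => C
c_c_a = confidence({"C"}, {"A"})  # confidence of the reverse rule C => A
print(s_ac, c_a_c, c_c_a)
```

Both rules share the same support because support depends only on the combined itemset, while confidence depends on which side is the body; with thresholds of 50% for each, both A => C and C => A would be reported on this data.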
