
- Number of slides: 52
1 Advanced databases – Inferring new knowledge from data(bases): Knowledge Discovery in Databases
Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/
Last update: 15 November 2007
Berendt: Advanced databases, winter term 2007/08
2 Agenda
- Motivation: Application examples
- The process of knowledge discovery
- Origins and context
- Major issues in knowledge discovery
- A short overview of key techniques
3 What is the impact of genetically modified organisms?
4 Is our school system good for immigrants and/or children from poor backgrounds?
5 What are the effects of teaching in English at universities?
6 What makes people happy?
7 What do men and women like?
8 Is this a man or a woman? [Figure; label: "clicked on"]
9 Primary Tasks of Data Mining
- Classification: finding the description of several predefined classes and classifying a data item into one of them.
- Regression: mapping a data item to a real-valued prediction variable.
- Deviation and change detection: discovering the most significant changes in the data.
- Clustering: identifying a finite set of categories or clusters to describe the data.
- Dependency modeling: finding a model that describes significant dependencies between variables.
- Summarization: finding a compact description for a subset of data.
10 Agenda
- Motivation: Application examples
- The process of knowledge discovery
- Origins and context
- Major issues in knowledge discovery
- A short overview of key techniques
11 "Data mining" and "knowledge discovery"
- Informal definition: data mining is about discovering knowledge in (huge amounts of) data.
- Therefore, it is clearer to speak about "knowledge discovery in data(bases)".
12 Recall: Data, information, and knowledge
- Data represents a fact or statement of an event without relation to other things.
  - Ex: It is raining.
- Information embodies the understanding of a relationship of some sort, possibly cause and effect.
  - Ex: The temperature dropped 15 degrees and then it started raining.
- Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described or what will happen next.
  - Ex: If the humidity is very high and the temperature drops substantially, the atmosphere is often unlikely to be able to hold the moisture, so it rains.
(This is from knowledge-management theory. If you want to know about wisdom, check the Web page: G. Bellinger, D. Castro, & A. Mills: Data, Information, Knowledge, and Wisdom. http://www.systems-thinking.org/dikw.htm)
13 Why Data Mining?
- The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability: automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, …
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining is the automated analysis of massive data sets.
14 Background: Evolution of Database Technology
- 1960s: data collection, database creation, IMS and network DBMS
- 1970s: relational data model, relational DBMS implementation
- 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s: data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems
15 The KDD process
"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
- non-trivial process: a multi-step process
- valid: justified patterns/models
- novel: previously unknown
- useful: can be used
- understandable: by human and machine
16 The process part of knowledge discovery: CRISP-DM
- CRoss Industry Standard Process for Data Mining
- A data mining process model that describes commonly used approaches that expert data miners use to tackle problems.
17 Knowledge discovery, machine learning, data mining
- Knowledge discovery = the whole process
- Machine learning = the application of induction algorithms and other algorithms that can be said to "learn"; this corresponds to the "modeling" phase
- Data mining: sometimes = KD, sometimes = ML
18 The KDD Process
[Figure: process diagram with five numbered stages. Box labels: Data organized by function; Data warehousing; Create/select target database; Select sampling technique and sample data; Supply missing values; Eliminate noisy data; Normalize values; Transform values; Create derived attributes; Find important attributes & value ranges; Select DM task(s); Select DM method(s); Extract knowledge; Test knowledge; Refine knowledge; Transform to different representation; Query & report generation; Aggregation & sequences; Advanced methods]
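Two of the preprocessing boxes on this slide, "Supply missing values" and "Normalize values", can be illustrated with a minimal sketch. The slide does not say which techniques are meant; mean imputation and min-max scaling are assumed here as common choices, and the function names and sample values are hypothetical.

```python
def supply_missing_values(values):
    """Replace None entries with the mean of the observed values (assumed technique)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def normalize(values):
    """Min-max normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute column with one missing measurement.
raw = [10.0, None, 30.0, 20.0]
filled = supply_missing_values(raw)
scaled = normalize(filled)
```

Running the cleaning step before the scaling step matters: the minimum and maximum are only defined once every entry is a number.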
19 Agenda
- Motivation: Application examples
- The process of knowledge discovery
- Origins and context
- Major issues in knowledge discovery
- A short overview of key techniques
20 Main Contributing Areas of KDD
- Databases: store, access, search, update data (deduction)
  - Data warehouses: integrated data
  - OLAP: On-Line Analytical Processing
- Statistics: infer info from data (deduction & induction, mainly numeric data)
- Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
21 Data Mining: Classification Schemes
- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views lead to different classifications
  - Data view: kinds of data to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted
22 Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, Other Disciplines]
23 Why Not Traditional Data Analysis?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle data volumes such as terabytes
- High dimensionality of data
  - A micro-array may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications
24 Agenda
- Motivation: Application examples
- The process of knowledge discovery
- Origins and context
- Major issues in knowledge discovery
- A short overview of key techniques
25 Data Mining: On What Kinds of Data?
- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia databases
  - Text databases
  - The World-Wide Web
26 Data Mining Functionalities
- Multidimensional concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Frequent patterns, association, correlation vs. causality
  - Diaper → Beer [0.5%, 75%] (Correlation or causality?)
- Classification and prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction
    - E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  - Predict some unknown or missing numerical values
27 Data Mining Functionalities (2)
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Maximizing intra-class similarity & minimizing inter-class similarity
- Outlier analysis
  - Outlier: data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection, rare events analysis
- Trend and evolution analysis
  - Trend and deviation: e.g., regression analysis
  - Sequential pattern mining: e.g., digital camera → large SD memory
  - Periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
28 Are All the "Discovered" Patterns Interesting?
- Data mining may generate thousands of patterns: not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
29 Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering
- Search for only interesting patterns: an optimization problem
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns: mining query optimization
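The two approaches on this slide can be contrasted in a small sketch for frequent itemsets. It assumes that "interesting" simply means "support at least `min_support`", which is only one of the measures the lecture discusses; the transactions and function names are hypothetical. The second function extends only itemsets that are already frequent (a level-wise search in the style of Apriori), so it never builds most of the candidates that the first function enumerates and then discards.

```python
from itertools import combinations

# Hypothetical toy transactions (not taken from the lecture).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def generate_then_filter(transactions, min_support):
    """Approach 1: enumerate every possible itemset, then filter by support."""
    items = sorted(set().union(*transactions))
    candidates = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
    ]
    return {c for c in candidates if support(c, transactions) >= min_support}


def generate_only_frequent(transactions, min_support):
    """Approach 2: grow itemsets level by level from frequent ones only."""
    items = sorted(set().union(*transactions))
    level = {
        frozenset([i])
        for i in items
        if support(frozenset([i]), transactions) >= min_support
    }
    frequent = set(level)
    while level:
        # Join two frequent k-itemsets that differ in one item into a (k+1)-itemset.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = {c for c in candidates if support(c, transactions) >= min_support}
        frequent |= level
    return frequent
```

Both functions return the same set of itemsets; the difference lies in how many candidates are examined along the way.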
30 Other Pattern Mining Issues
- Precise patterns vs. approximate patterns
  - Association and correlation mining: it is possible to find sets of precise patterns
    - But approximate patterns can be more compact and sufficient
    - How to find high-quality approximate patterns?
  - Gene sequence mining: approximate patterns are inherent
    - How to derive efficient approximate pattern mining algorithms?
- Constrained vs. non-constrained patterns
  - Why constraint-based mining?
  - What are the possible kinds of constraints? How to push constraints into the mining process?
31 Data Mining Query Languages
- Automated vs. query-driven?
  - Finding all the patterns autonomously in a database? Unrealistic, because there could be too many patterns, most of them uninteresting
- Data mining should be an interactive process
  - User directs what is to be mined
- Users must be provided with a set of primitives to be used to communicate with the data mining system
- Incorporating these primitives in a data mining query language
  - More flexible user interaction
  - Foundation for design of graphical user interface
  - Standardization of data mining industry and practice
32 Primitives that Define a Data Mining Task
- Task-relevant data
- Type of knowledge to be mined
- Background knowledge
- Pattern interestingness measurements
- Visualization/presentation of discovered patterns
33 Primitive 1: Task-Relevant Data
- Database or data warehouse name
- Database tables or data warehouse cubes
- Condition for data selection
- Relevant attributes or dimensions
- Data grouping criteria
34 Primitive 2: Types of Knowledge to Be Mined
- Characterization
- Discrimination
- Association
- Classification/prediction
- Clustering
- Outlier analysis
- Other data mining tasks
35 Primitive 3: Background Knowledge
A typical kind of background knowledge: concept hierarchies
- Schema hierarchy
  - E.g., street < city < province_or_state < country
- Set-grouping hierarchy
  - E.g., {20-39} = young, {40-59} = middle_aged
- Operation-derived hierarchy
  - E.g., an email address at uiuc.edu: login-name < department < university < country
- Rule-based hierarchy
  - low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
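Three of these hierarchy kinds can be written down directly as code. The following is a minimal sketch using the slide's own examples; the function names and the behaviour outside the stated ranges (returning `None`) are assumptions.

```python
# Schema hierarchy from the slide, ordered from most specific to most general.
SCHEMA_HIERARCHY = ["street", "city", "province_or_state", "country"]


def roll_up(level):
    """Return the next more general level in the schema hierarchy, or None at the top."""
    i = SCHEMA_HIERARCHY.index(level)
    return SCHEMA_HIERARCHY[i + 1] if i + 1 < len(SCHEMA_HIERARCHY) else None


def age_group(age):
    """Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # the slide defines no group outside 20-59


def low_profit_margin(price, cost):
    """Rule-based hierarchy: low_profit_margin(X) if (P1 - P2) < $50."""
    return (price - cost) < 50
```

A mining system could apply such mappings to replace raw attribute values by higher-level concepts before or during pattern search.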
36 Primitive 4: Pattern Interestingness Measure
- Simplicity: e.g., (association) rule length, (decision) tree size
- Certainty: e.g., confidence, P(A|B) = #(A and B) / #(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility: potential usefulness, e.g., support (association), noise threshold (description)
- Novelty: not previously known, surprising (used to remove redundant rules, e.g., Illinois vs. Champaign rule implication support ratio)
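The certainty and utility measures for an association rule can be stated side by side. For a rule "B implies A" over N records, confidence is the slide's P(A|B); the support definition is the standard one and is not spelled out on the slide. The counts in the worked line are hypothetical.

```latex
\[
\mathrm{support}(B \Rightarrow A) = \frac{\#(A \wedge B)}{N},
\qquad
\mathrm{confidence}(B \Rightarrow A) = P(A \mid B) = \frac{\#(A \wedge B)}{\#(B)}
\]
% Hypothetical counts: N = 1000, #(B) = 200, #(A and B) = 150
\[
\mathrm{support} = \frac{150}{1000} = 15\%,
\qquad
\mathrm{confidence} = \frac{150}{200} = 75\%
\]
```

A rule can thus be highly certain (high confidence) while applying to few records (low support), which is why the two measures are reported together.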
37 Primitive 5: Presentation of Discovered Patterns
- Different backgrounds/usages may require different forms of representation
  - E.g., rules, tables, crosstabs, pie/bar chart, etc.
- Concept hierarchy is also important
  - Discovered knowledge might be more understandable when represented at a high level of abstraction
  - Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on data
- Different kinds of knowledge require different representation: association, classification, clustering, etc.
38 Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom]
- Graphical User Interface
- Pattern Evaluation
- Data Mining Engine (both Pattern Evaluation and the engine draw on a Knowledge Base)
- Database or Data Warehouse Server
- Data cleaning, integration, and selection
- Database, Data Warehouse, World-Wide Web, Other Info Repositories
39 Major Issues in Data Mining
- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining & invisible data mining
  - Protection of data security, integrity, and privacy
40 Agenda
- Motivation: Application examples
- The process of knowledge discovery
- Origins and context
- Major issues in knowledge discovery
- A short overview of key techniques
41 Classification
"What factors determine cancerous cells?"
[Figure: Examples (cancerous cell data) → data mining algorithm (classification algorithm) → general patterns]
Classification algorithms:
- Rule induction
- Decision tree
- Neural network
42 Classification: Rule Induction
"What factors determine whether a cell is cancerous?"
- If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
- If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
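Once induced, such rules are easy to apply mechanically. The sketch below stores the slide's two rules as data and returns the first rule whose conditions all hold; it assumes that the 92% certainty belongs to the healthy rule and the 87% to the cancerous rule (the order in which they appear on the slide). The names and the first-match policy are illustrative choices, and the induction step itself is not shown.

```python
# Each rule: (conditions, predicted class, certainty). Certainty pairing is an assumption.
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]


def classify_with_rules(cell, rules=RULES):
    """Return (label, certainty) of the first matching rule, or None if no rule fires."""
    for conditions, label, certainty in rules:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None


# Hypothetical cell description.
result = classify_with_rules({"color": "dark", "tails": 2, "nuclei": 2})
```

Unlike a decision tree, a rule set may leave some cells unclassified, which is why the function can return `None`.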
43 Classification: Decision Trees
[Figure: decision tree]
- Color = dark
  - #nuclei = 1
    - #tails = 1: healthy
    - #tails = 2: cancerous
  - #nuclei = 2: cancerous
- Color = light
  - #nuclei = 1: healthy
  - #nuclei = 2
    - #tails = 1: healthy
    - #tails = 2: cancerous
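A decision tree classifies a cell by following one branch per attribute test until a leaf is reached. The sketch below encodes one reading of the slide's tree as nested dictionaries; the extracted slide text does not preserve the tree layout, so the structure used here (split on color, then nuclei, then tails where needed) is an assumption, as are all names.

```python
# Inner nodes test an attribute; leaves are class labels. Structure is an assumed reading.
TREE = {
    "attribute": "color",
    "branches": {
        "dark": {
            "attribute": "nuclei",
            "branches": {
                1: {"attribute": "tails", "branches": {1: "healthy", 2: "cancerous"}},
                2: "cancerous",
            },
        },
        "light": {
            "attribute": "nuclei",
            "branches": {
                1: "healthy",
                2: {"attribute": "tails", "branches": {1: "healthy", 2: "cancerous"}},
            },
        },
    },
}


def classify_with_tree(cell, node=TREE):
    """Walk from the root to a leaf, choosing the branch that matches the cell."""
    while isinstance(node, dict):
        node = node["branches"][cell[node["attribute"]]]
    return node
```

Every cell with known color, nuclei and tails values reaches exactly one leaf, so the tree always produces a prediction.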
44 Classification: Neural Networks
"What factors determine whether a cell is cancerous?"
[Figure: network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy, Cancerous]
45 Clustering
"Are there clusters of similar cells?"
[Figure: groups of cells, labeled: Light color with 1 nucleus; Dark color with 2 tails, 2 nuclei; 1 nucleus and 1 tail; Dark color with 1 tail and 2 nuclei]
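The slide does not name a clustering algorithm. As one common choice, the following sketch runs k-means on cells encoded as numeric triples (color: 0 = light, 1 = dark; number of nuclei; number of tails). The encoding, the data, the starting centroids and all names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until the assignment is stable."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean(members)
    return labels, centroids


# Hypothetical cells as (color, nuclei, tails).
CELLS = [(0, 1, 1), (0, 1, 2), (0, 1, 1), (1, 2, 2), (1, 2, 1), (1, 2, 2)]
labels, centroids = k_means(CELLS, [CELLS[0], CELLS[3]])
```

With these starting centroids the light single-nucleus cells end up in one cluster and the dark two-nuclei cells in the other; no class labels are used at any point.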
46 Association Rule Discovery
- Task: discovering association rules among items in a transaction database.
- An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.
- In general: A1, A2, … => B
47 Association Rule Discovery
"Are there any associations between the characteristics of the cells?"
- If color = light and # nuclei = 1 then # tails = 1 (support = 12.5%; confidence = 50%)
- If # nuclei = 2 and Cell = Cancerous then # tails = 2 (support = 25%; confidence = 100%)
- If # tails = 1 then Color = light (support = 37.5%; confidence = 75%)
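Support and confidence of such rules follow directly from counting records. The lecture's underlying cell data is not part of the slide text, so the eight records below are hypothetical: they were constructed so that the three rules come out with the figures quoted on the slide. The function names are illustrative as well.

```python
# Hypothetical cell records, constructed to be consistent with the slide's figures.
CELLS = [
    {"color": "light", "nuclei": 1, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 1, "tails": 2, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "dark", "nuclei": 1, "tails": 1, "cell": "healthy"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 1, "tails": 2, "cell": "cancerous"},
]


def matches(record, conditions):
    """True if the record has the required value for every listed attribute."""
    return all(record[attr] == value for attr, value in conditions.items())


def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_antecedent = sum(matches(r, antecedent) for r in records)
    n_both = sum(matches(r, antecedent) and matches(r, consequent) for r in records)
    support = n_both / len(records)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


rule1 = rule_stats(CELLS, {"color": "light", "nuclei": 1}, {"tails": 1})
rule2 = rule_stats(CELLS, {"nuclei": 2, "cell": "cancerous"}, {"tails": 2})
rule3 = rule_stats(CELLS, {"tails": 1}, {"color": "light"})
```

On this data, `rule2` holds for every record that satisfies its antecedent, while `rule1` holds for only half of them, which is the difference the confidence value expresses.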
48 Many Other Data Mining Techniques
- Genetic Algorithms
- Rough Sets
- Bayesian Networks
- Text Mining
- Statistics
- Time Series
49 A goal: From databases to deductive databases to inductive databases
- A deductive database system is a database system which can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database.
- Inductive databases (IDBs)
  - contain not only data, but also patterns.
  - In an IDB, inductive queries can be used to generate (mine), manipulate, and apply patterns.
  - The IDB framework supports the process of knowledge discovery in databases (KDD):
    - the results of one (inductive) query can be used as input for another
    - nontrivial multi-step KDD scenarios can be supported, rather than just single data mining operations.
50 Next lecture
- Motivation: Application examples
- The process of knowledge discovery
- Origins and context
- Major issues in knowledge discovery
- A short overview of key techniques
- Deductive databases
51 References / background reading; acknowledgements
- Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:
  - a databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook
  - a machine learning perspective: Witten, I. H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
  - a statistics perspective: Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2
- pp. 9, 15, 18, 20, 41-44 were taken from
  - Tzacheva, A. A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/Overview.I.ppt
- pp. 45-48 were taken from
  - Tzacheva, A. A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/Overview.II.ppt
- pp. 13, 14, 22, 23, 25-39 were taken from
  - Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques. Chapter 1: Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt
52 Picture credits; CRISP-DM reference
- p. 3: http://www.siu-weeds.com/publications/Wheat_field.jpg
- p. 4: http://www.dkimages.com/discover/previews/889/30039025.JPG
- p. 5: http://www.viebahnfinearts.com/website/Pages/Photos/Furniture/Mirror%201005.jpg
- p. 6: http://charles.robinsontwins.org/twinsdays_96/john/smiley.jpg
- p. 16: http://www.palagems.com/Images/ceylon_mining.jpg, http://www.crisp-dm.org/Images/187343_CRISPart.jpg
The CRISP-DM phase model can be found at http://www.crisp-dm.org