  • Number of slides: 52

1 Advanced databases – Inferring new knowledge from data(bases): Knowledge Discovery in Databases
Last update: 15 November 2007
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/
Berendt: Advanced databases, winter term 2007/08, http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/

2 Agenda
• Motivation: Application examples
• The process of knowledge discovery
• Origins and context
• Major issues in knowledge discovery
• A short overview of key techniques

3 What is the impact of genetically modified organisms?

4 Is our school system good for immigrants and/or children from poor backgrounds?

5 What are the effects of teaching in English at universities?

6 What makes people happy?

7 What do men and women like?

8 Is this a man or a woman?
[Figure; surviving label: “clicked on”]

9 Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Regression: mapping a data item to a real-valued prediction variable.
• Deviation and change detection: discovering the most significant changes in the data.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Dependency modeling: finding a model that describes significant dependencies between variables.
• Summarization: finding a compact description for a subset of data.

10 Agenda
• Motivation: Application examples
• The process of knowledge discovery
• Origins and context
• Major issues in knowledge discovery
• A short overview of key techniques

11 “Data mining” and “knowledge discovery”
• (Informal definition) Data mining is about discovering knowledge in (huge amounts of) data.
• Therefore, it is clearer to speak about “knowledge discovery in data(bases)”.

12 Recall: Data, information, and knowledge
Data represents a fact or statement of an event without relation to other things.
• Ex: It is raining.
Information embodies the understanding of a relationship of some sort, possibly cause and effect.
• Ex: The temperature dropped 15 degrees and then it started raining.
Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described or what will happen next.
• Ex: If the humidity is very high and the temperature drops substantially, the atmosphere is often unlikely to be able to hold the moisture, so it rains.
(This is from knowledge-management theory. If you want to know about wisdom, check the Web page: G. Bellinger, D. Castro, & A. Mills: Data, Information, Knowledge, and Wisdom. http://www.systems-thinking.org/dikw.htm)

13 Why Data Mining?
The explosive growth of data: from terabytes to petabytes
• Data collection and data availability
  – Automated data collection tools, database systems, Web, computerized society
• Major sources of abundant data
  – Business: Web, e-commerce, transactions, stocks, …
  – Science: remote sensing, bioinformatics, scientific simulation, …
  – Society and everyone: news, digital cameras, …
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining, the automated analysis of massive data sets

14 Background: Evolution of Database Technology
1960s:
• Data collection, database creation, IMS and network DBMS
1970s:
• Relational data model, relational DBMS implementation
1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems

15 The KDD process
“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
• non-trivial process: multiple process
• valid: justified patterns/models
• novel: previously unknown
• useful: can be used
• understandable: by human and machine

16 The process part of knowledge discovery
CRISP-DM
• CRoss Industry Standard Process for Data Mining
• A data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

17 Knowledge discovery, machine learning, data mining
• Knowledge discovery = the whole process
• Machine learning = the application of induction algorithms and other algorithms that can be said to “learn”; this is the “modeling” phase
• Data mining
  – sometimes = KD, sometimes = ML

18 The KDD Process
[Diagram with five numbered stages; box labels in source order:]
• Data organized by function; data warehousing
• Create/select target database
• Select sampling technique and sample data
• Supply missing values
• Eliminate noisy data
• Normalize values
• Transform values
• Create derived attributes
• Find important attributes & value ranges
• Select DM task(s)
• Transform to different representation
• Select DM method(s)
• Extract knowledge
• Test knowledge
• Refine knowledge
• Query & report generation
• Aggregation & sequences
• Advanced methods
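
Two of the preprocessing boxes in the KDD process, "supply missing values" and "normalize values", can be illustrated with a minimal sketch. The slide does not prescribe a method; mean imputation and min-max scaling are assumptions chosen here for simplicity, and the function names and the sample `ages` column are hypothetical.

```python
from statistics import mean


def supply_missing_values(values):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def normalize_values(values):
    """Min-max scale values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute column with one missing entry.
ages = [20, None, 40, 60]
cleaned = supply_missing_values(ages)
scaled = normalize_values(cleaned)
```

Other imputation or scaling choices (median, z-scores) would fit the same two boxes; which one is appropriate depends on the data and the mining method selected later in the process.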

19 Agenda
• Motivation: Application examples
• The process of knowledge discovery
• Origins and context
• Major issues in knowledge discovery
• A short overview of key techniques

20 Main Contributing Areas of KDD
• Databases: store, access, search, update data (deduction)
  – [data warehouses: integrated data]
  – [OLAP: On-Line Analytical Processing]
• Statistics: infer info from data (deduction & induction, mainly numeric data)
• Machine learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

21 Data Mining: Classification Schemes
General functionality
• Descriptive data mining
• Predictive data mining
Different views lead to different classifications
• Data view: kinds of data to be mined
• Knowledge view: kinds of knowledge to be discovered
• Method view: kinds of techniques utilized
• Application view: kinds of applications adapted

22 Data Mining: Confluence of Multiple Disciplines
[Diagram: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]

23 Why Not Traditional Data Analysis?
Tremendous amount of data
• Algorithms must be highly scalable to handle, e.g., terabytes of data
High dimensionality of data
• A micro-array may have tens of thousands of dimensions
High complexity of data
• Data streams and sensor data
• Time-series data, temporal data, sequence data
• Structure data, graphs, social networks and multi-linked data
• Heterogeneous databases and legacy databases
• Spatial, spatiotemporal, multimedia, text and Web data
• Software programs, scientific simulations
New and sophisticated applications

24 Agenda
• Motivation: Application examples
• The process of knowledge discovery
• Origins and context
• Major issues in knowledge discovery
• A short overview of key techniques

25 Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
• Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bio-sequences)
• Structure data, graphs, social networks and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data and spatiotemporal data
• Multimedia database
• Text databases
• The World-Wide Web

26 Data Mining Functionalities
Multidimensional concept description: characterization and discrimination
• Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
• Diaper → Beer [0.5%, 75%] (correlation or causality?)
Classification and prediction
• Construct models (functions) that describe and distinguish classes or concepts for future prediction
  – E.g., classify countries based on (climate), or classify cars based on (gas mileage)
• Predict some unknown or missing numerical values

27 Data Mining Functionalities (2)
Cluster analysis
• Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
• Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
• Outlier: data object that does not comply with the general behavior of the data
• Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
• Trend and deviation: e.g., regression analysis
• Sequential pattern mining: e.g., digital camera → large SD memory
• Periodicity analysis
• Similarity-based analysis
Other pattern-directed or statistical analyses

28 Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns: not all of them are interesting
• Suggested approach: human-centered, query-based, focused mining
Interestingness measures
• A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
• Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
• Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

29 Find All and Only Interesting Patterns?
Find all the interesting patterns: completeness
• Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
• Heuristic vs. exhaustive search
• Association vs. classification vs. clustering
Search for only interesting patterns: an optimization problem
• Can a data mining system find only the interesting patterns?
• Approaches
  – First generate all the patterns and then filter out the uninteresting ones
  – Generate only the interesting patterns: mining query optimization

30 Other Pattern Mining Issues
Precise patterns vs. approximate patterns
• Association and correlation mining: it is possible to find sets of precise patterns
  – But approximate patterns can be more compact and sufficient
  – How to find high-quality approximate patterns?
• Gene sequence mining: approximate patterns are inherent
  – How to derive efficient approximate pattern mining algorithms?
Constrained vs. non-constrained patterns
• Why constraint-based mining?
• What are the possible kinds of constraints? How to push constraints into the mining process?

31 Data Mining Query Languages
Automated vs. query-driven?
• Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many but uninteresting
Data mining should be an interactive process
• User directs what is to be mined
Users must be provided with a set of primitives to be used to communicate with the data mining system
Incorporating these primitives in a data mining query language
• More flexible user interaction
• Foundation for design of graphical user interface
• Standardization of data mining industry and practice

32 Primitives that Define a Data Mining Task
• Task-relevant data
• Type of knowledge to be mined
• Background knowledge
• Pattern interestingness measurements
• Visualization/presentation of discovered patterns

33 Primitive 1: Task-Relevant Data
• Database or data warehouse name
• Database tables or data warehouse cubes
• Condition for data selection
• Relevant attributes or dimensions
• Data grouping criteria

34 Primitive 2: Types of Knowledge to Be Mined
• Characterization
• Discrimination
• Association
• Classification/prediction
• Clustering
• Outlier analysis
• Other data mining tasks

35 Primitive 3: Background Knowledge
A typical kind of background knowledge: concept hierarchies
Schema hierarchy
• E.g., street < city < province_or_state < country
Set-grouping hierarchy
• E.g., {20-39} = young, {40-59} = middle_aged
Operation-derived hierarchy
• email address: [email protected] uiuc.edu
  login-name < department < university < country
Rule-based hierarchy
• low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
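
As a rough illustration of how such concept hierarchies might be encoded, the sketch below covers three of the four kinds from this slide. The function and constant names are hypothetical; only the hierarchy levels, the age ranges, and the $50 threshold come from the slide.

```python
# Schema hierarchy from the slide, ordered from most specific to most general.
SCHEMA_HIERARCHY = ["street", "city", "province_or_state", "country"]


def generalize(level):
    """Return the next, more general schema level, or None at the top."""
    i = SCHEMA_HIERARCHY.index(level)
    return SCHEMA_HIERARCHY[i + 1] if i + 1 < len(SCHEMA_HIERARCHY) else None


def age_group(age):
    """Set-grouping hierarchy: {20-39} = young, {40-59} = middle_aged."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return None  # the slide defines no group outside 20-59


def low_profit_margin(price, cost):
    """Rule-based hierarchy: low margin if (price - cost) < $50."""
    return (price - cost) < 50
```

For example, `generalize("city")` climbs one level to `"province_or_state"`, which is the operation a mining system would use to report patterns at a higher level of abstraction.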

36 Primitive 4: Pattern Interestingness Measure
Simplicity
• e.g., (association) rule length, (decision) tree size
Certainty
• e.g., confidence, P(A|B) = #(A and B) / #(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
Utility
• potential usefulness, e.g., support (association), noise threshold (description)
Novelty
• not previously known, surprising (used to remove redundant rules, e.g., Illinois vs. Champaign rule implication support ratio)
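
The certainty formula P(A|B) = #(A and B) / #(B) and the support measure can be computed directly by counting. The following is a minimal sketch; the helper names and the four sample baskets are made up for illustration and are not data from the lecture.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, items):
    """Fraction of all transactions that contain `items`."""
    return count(transactions, items) / len(transactions)


def conditional_confidence(transactions, a, b):
    """P(A|B) = #(A and B) / #(B), as written on the slide."""
    n_b = count(transactions, b)
    if n_b == 0:
        return 0.0  # convention chosen here when B never occurs
    return count(transactions, set(a) | set(b)) / n_b


# Hypothetical market baskets.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]
```

With these baskets, `support(baskets, {"diaper", "beer"})` is 2 of 4 transactions, and `conditional_confidence(baskets, {"beer"}, {"diaper"})` is 2 of the 3 transactions that contain diapers.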

37 Primitive 5: Presentation of Discovered Patterns
Different backgrounds/usages may require different forms of representation
• E.g., rules, tables, crosstabs, pie/bar chart, etc.
Concept hierarchy is also important
• Discovered knowledge might be more understandable when represented at a high level of abstraction
• Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
Different kinds of knowledge require different representation: association, classification, clustering, etc.

38 Architecture: Typical Data Mining System
[Layered diagram, top to bottom:]
• Graphical User Interface
• Pattern Evaluation
• Data Mining Engine
• Database or Data Warehouse Server
• data cleaning, integration, and selection
• Database, Data Warehouse, World-Wide Web, Other Info Repositories
[Side component: Knowledge Base]

39 Major Issues in Data Mining
Mining methodology
• Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
• Performance: efficiency, effectiveness, and scalability
• Pattern evaluation: the interestingness problem
• Incorporation of background knowledge
• Handling noise and incomplete data
• Parallel, distributed and incremental mining methods
• Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
• Data mining query languages and ad-hoc mining
• Expression and visualization of data mining results
• Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
• Domain-specific data mining & invisible data mining
• Protection of data security, integrity, and privacy

40 Agenda
• Motivation: Application examples
• The process of knowledge discovery
• Origins and context
• Major issues in knowledge discovery
• A short overview of key techniques

41 Classification
“What factors determine cancerous cells?”
[Diagram: Examples (cancerous cell data) → Data Mining Algorithm → General patterns]
Classification algorithms
• Rule induction
• Decision tree
• Neural network

42 Classification: Rule Induction
“What factors determine a cell is cancerous?”
• If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
• If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
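
Induced rules of this kind can be applied as a simple ordered rule list. The sketch below encodes the two rules as read from the slide (the pairing of each certainty value with its rule is an assumption, since the slide layout was flattened); the data layout and the `classify` function are hypothetical.

```python
RULES = [
    # (conditions, predicted class, certainty), as read from the slide
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]


def classify(cell, rules=RULES):
    """Return (label, certainty) of the first rule whose conditions all hold, else None."""
    for conditions, label, certainty in rules:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # no rule covers this cell
```

A cell that matches no rule yields `None`, which shows one practical difference from a decision tree: a rule set need not cover every possible input.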

43 Classification: Decision Trees
Color = dark
  #nuclei = 1
    #tails = 1: healthy
    #tails = 2: cancerous
  #nuclei = 2: cancerous
Color = light
  #nuclei = 1: healthy
  #nuclei = 2
    #tails = 1: healthy
    #tails = 2: cancerous
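
Classifying with a decision tree means following one branch per tested attribute until a leaf is reached. The sketch below encodes one plausible reading of the slide's tree diagram (the exact branch layout is an assumption, because the figure survives only as flattened text); the dictionary format and `predict` are hypothetical.

```python
# Inner nodes test an attribute; leaves are class labels (plain strings).
TREE = {
    "attr": "color",
    "branches": {
        "dark": {
            "attr": "nuclei",
            "branches": {
                1: {"attr": "tails", "branches": {1: "healthy", 2: "cancerous"}},
                2: "cancerous",
            },
        },
        "light": {
            "attr": "nuclei",
            "branches": {
                1: "healthy",
                2: {"attr": "tails", "branches": {1: "healthy", 2: "cancerous"}},
            },
        },
    },
}


def predict(tree, cell):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][cell[node["attr"]]]
    return node
```

Unlike the rule list on the previous slide, every combination of the three attribute values used here ends in exactly one leaf.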

44 Classification: Neural Networks
“What factors determine a cell is cancerous?”
[Diagram: inputs Color = dark, # nuclei = 1, …, # tails = 2; outputs Healthy, Cancerous]

45 Clustering
“Are there clusters of similar cells?”
[Diagram; cluster labels as far as recoverable:]
• Light color with 1 nucleus
• Dark color with 2 tails, 2 nuclei
• 1 nucleus and 1 tail
• Dark color with 1 tail and 2 nuclei

46 Association Rule Discovery
Task: Discovering association rules among items in a transaction database.
An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.
In general: A1, A2, … => B
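
A brute-force version of this task, restricted to rules A => B between single items, can be sketched as "generate all candidate rules, then filter by thresholds". This is an illustrative assumption, not the algorithm used in the lecture; real miners such as Apriori prune the candidate space instead of enumerating it. The function name, thresholds, and sample records are hypothetical.

```python
from itertools import permutations


def mine_pair_rules(transactions, min_support, min_confidence):
    """Return (a, b, support, confidence) for every single-item rule a => b
    that meets both thresholds. Generates all candidates, then filters."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in transactions if a in t)
        n_ab = sum(1 for t in transactions if a in t and b in t)
        sup, conf = n_ab / n, n_ab / n_a
        if sup >= min_support and conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules


# Hypothetical transaction database.
records = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B"},
]
rules = mine_pair_rules(records, min_support=0.5, min_confidence=0.6)
```

Here support is the fraction of records containing both A and B, and confidence is the fraction of records containing A that also contain B, so "implies" is a statistical tendency rather than a logical guarantee.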

47 Association Rule Discovery
“Are there any associations between the characteristics of the cells?”
• If color = light and # nuclei = 1 then # tails = 1 (support = 12.5%; confidence = 50%)
• If # nuclei = 2 and Cell = Cancerous then # tails = 2 (support = 25%; confidence = 100%)
• If # tails = 1 then Color = light (support = 37.5%; confidence = 75%)

48 Many Other Data Mining Techniques
• Genetic Algorithms
• Rough Sets
• Bayesian Networks
• Text Mining
• Statistics
• Time Series

49 A goal: From databases to deductive databases to inductive databases
• A deductive database system is a database system which can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database.
• Inductive databases
  – contain not only data, but also patterns.
  – In an IDB, inductive queries can be used to generate (mine), manipulate, and apply patterns.
  – The IDB framework supports the process of knowledge discovery in databases (KDD):
    – the results of one (inductive) query can be used as input for another
    – nontrivial multi-step KDD scenarios can be supported, rather than just single data mining operations.
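
To make "conclude additional facts based on rules and facts" concrete, here is a minimal forward-chaining sketch. The parent/ancestor rules and the sample facts are a standard textbook illustration chosen here as an assumption; they do not come from the lecture, and a real deductive database would use a language such as Datalog rather than hand-written loops.

```python
def deduce_ancestors(parent_facts):
    """Forward-chain two rules until no new fact appears (a fixpoint):
         ancestor(X, Y) <- parent(X, Y)
         ancestor(X, Z) <- parent(X, Y), ancestor(Y, Z)
    """
    ancestors = set(parent_facts)  # first rule: every parent is an ancestor
    changed = True
    while changed:
        changed = False
        for (x, y) in parent_facts:
            for (y2, z) in list(ancestors):
                if y == y2 and (x, z) not in ancestors:
                    ancestors.add((x, z))  # second rule fires
                    changed = True
    return ancestors


# Stored facts: parent(ann, bob), parent(bob, cid), parent(cid, dee).
parents = {("ann", "bob"), ("bob", "cid"), ("cid", "dee")}
derived = deduce_ancestors(parents) - parents
```

The set `derived` holds exactly the facts that were not stored but follow from the rules. An inductive database would go the other way, discovering candidate rules or patterns from the stored facts.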

50 Next lecture
• Motivation: Application examples
• The process of knowledge discovery
• Origins and context
• Major issues in knowledge discovery
• A short overview of key techniques
• Deductive databases

51 References / background reading; acknowledgements
• Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:
  – a machine learning perspective: Witten, I. H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
  – a databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook
  – a statistics perspective: Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2
• pp. 9, 15, 18, 20, 41-44 were taken from
  – Tzacheva, A. A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/Overview.I.ppt
• pp. 45-48 were taken from
  – Tzacheva, A. A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/Overview.II.ppt
• pp. 13, 14, 22, 23, 25-39 were taken from
  – Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques, Chapter 1: Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt

52 Picture credits; CRISP-DM reference
p. 3: http://www.siu-weeds.com/publications/Wheat_field.jpg
p. 4: http://www.dkimages.com/discover/previews/889/30039025.JPG
p. 5: http://www.viebahnfinearts.com/website/Pages/Photos/Furniture/Mirror%201005.jpg
p. 6: http://charles.robinsontwins.org/twinsdays_96/john/smiley.jpg
p. 16: http://www.palagems.com/Images/ceylon_mining.jpg, http://www.crisp-dm.org/Images/187343_CRISPart.jpg
The CRISP-DM phase model can be found at http://www.crisp-dm.org