Knowledge discovery & data mining
Towards KD Support Environments
Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT 2000, Konstanz, March 2000

Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
  - from tools …
  - … to suites
  - … to solutions
- Towards data mining query languages
- DATASIFT: a logic-based KDSE
- Future research challenges
EDBT 2000 tutorial - KDSE, Konstanz, 27-28.3.2000

Vertical applications
- We outlined three classes of vertical data analysis applications that can be tackled using KDD & DM techniques
  - Fraud detection
  - Market basket analysis
  - Customer segmentation

Why are these applications challenging?
- Require manipulation and reasoning over knowledge and data at different abstraction levels
  - conceptual
    - semantic integration of domain knowledge, expert (business) rules and extracted knowledge
    - semantic integration of different analysis paradigms
  - logical/physical
    - interoperability with external components: DBMSs, data mining tools, desktop tools
    - querying/mining optimization: loose vs. tight coupling between query language and specialized mining tools

Why are these applications challenging?
- The associated KDD process needs to be carefully specified, tuned and controlled
[Figure: the KDD process. Data Sources, Data Consolidation, Consolidated Data (Warehouse), Selection and Preprocessing, Prepared Data, Data Mining, Patterns & Models (e.g. p(x)=0.02), Interpretation and Evaluation, Knowledge]

Why are these applications challenging?
- Still not properly supported by available KDD technology
- what is offered: horizontal, customizable toolkits/suites of data mining primitives
- what is needed: KD support environments for vertical applications

Data mining vs. traditional software development process
Traditional
- Focus on knowledge transfer, design and coding
- 30%: analysis and design
- 70%: program design, coding and testing
- Prototyping: expensive
- Development process has few loops
- Maintenance requires human analysis
Data mining
- Focus on data selection, representation and search
- 70%: data preparation
- 30%: model generation and testing
- Prototyping: cheap
- Development process is inherently iterative
- Maintenance requires relearning the model

From R. Agrawal's invited lecture @ KDD'99
[Figure: adoption curve with the Chasm between the Early Market and the Mainstream Market]
"The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists."

Is data mining in the chasm?
- Perceived to be sophisticated technology, usable only by specialists
- Long, expensive projects
- Stand-alone, loosely coupled with data infrastructures
- Difficult to infuse into existing mission-critical applications

Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
  - from tools …
  - … to suites …
  - … to solutions
- Towards data mining query languages
- DATASIFT: a logic-based KDSE
- Future research challenges

Generation 1: data mining tools
- ~1980: first generation of DM systems
- research-driven tools for single tasks, e.g.
  - build a decision tree, say C4.5
  - find clusters, say Autoclass (Cheeseman 88)
  - …
- Difficult to use more than one tool on the same data: lots of data/metadata transformation
- Intended user: a specialist, technically sophisticated

Generation 2: data mining suites
- ~1995: second generation of DM systems
- toolkits for multiple tasks with support for data preparation and interoperability with DBMS, e.g.
  - SPSS Clementine
  - IBM Intelligent Miner
  - SAS Enterprise Miner
  - SFU DBMiner
- Intended user: data analyst; suites require significant knowledge of statistics and databases

Growth of DM tools (source: kdnuggets.com)
- From G. Piatetsky-Shapiro. The data-mining industry coming of age. IEEE Intelligent Systems, Dec. 1999.

Generation 3: data mining solutions
- Beginning at the end of the 1990s
- vertical data mining-based applications and solutions oriented to solving one specific business problem, e.g.
  - detecting credit card fraud
  - customer retention
  - …
- Address the entire KDD process, and push the result into a front-end application
- Intended user: business user; the interfaces hide the data mining complexity

Emerging short-term technology trends
- Tighter interoperability by means of standards which facilitate the integration of data mining with other applications:
  - KDD process: e.g., the Cross-Industry Standard Process for Data Mining model (www.crisp-dm.org)
  - representation of mining models: e.g., the PMML predictive modeling markup language (www.dmg.org)
  - DB interoperability: the Microsoft OLE DB for data mining interface

Approaches in data mining suites
- Database-oriented approach
  - IBM Intelligent Miner
- OLAP-based mining
  - DBMiner (Jiawei Han's group @ SFU)
- Machine learning
  - CART, ID3/C4.5/C5.0, Angoss Knowledge Studio
- Statistical approaches
  - The SAS Institute Enterprise Miner
- Visualization approach
  - SGI MineSet, VisDB (Keim et al. 94)

Other approaches in data mining suites
- Neural network approach:
  - Cognos 4Thought, NeuroRule (Lu et al. '95)
- Deductive DB integration:
  - KnowledgeMiner (Shen et al. '96)
  - Datasift (Pisa KDD Lab, see refs)
- Rough sets, fuzzy sets:
  - Datalogic/R, 49er
- Multi-strategy mining:
  - INLEN, KDW+, Explora

SFU DBMiner: OLAP-centric mining
[Screenshot with callouts: Active Object Elements, Warehouse Workplace, Active Object]

IBM Intelligent Miner: DB-centric mining
[Screenshot with callouts: Contents Container, Mining Base Container, Work Area]

IBM Intelligent Miner (IM) architecture
[Figure: architecture diagram]

Angoss Knowledge Studio (KS): ML-centric mining
[Screenshot with callouts: Work Area, Project Outline, Additional Visualizations]

KS project outline tool
- (Limited) support for the KDD process

Support for data consolidation step
- DBMiner
  - ODBC databases: SQL + SmartDrives
  - Single database, multiple tables
  - Consolidation of heterogeneous sources unsupported
- Intelligent Miner
  - DB2 and text: SQL without SmartDrives
  - Multiple databases
  - Consolidation of heterogeneous sources supported
- Knowledge Studio
  - ODBC databases and text
  - Single table
  - Consolidation of heterogeneous sources unsupported

Support for selection and preprocessing
- DBMiner
  - SQL only
- Intelligent Miner
  - SQL + standard and advanced statistical functionalities
- Knowledge Studio
  - descriptive statistics

Support for data mining step
- DBMiner
  - Association rules
  - Decision trees
  - Prediction
- Intelligent Miner
  - Association rules
  - Sequential patterns
  - Clustering
  - Classification
  - Prediction
  - Similar time series
- Knowledge Studio
  - Decision trees
  - Clustering
  - Prediction

Support for interpretation and evaluation
- Predefined interestingness measures
- Emphasis on visualization
- Limited export capability of analysis results
- Gain charts for comparison of predictive models (KS and IM)
- Limited model combination capabilities (KS)

Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
  - from tools …
  - … to suites …
  - … to solutions
- Towards data mining query languages
- DATASIFT: a logic-based KDSE
- Future research challenges

Data Mining Query Languages
- A DMQL can provide the ability to support ad-hoc and interactive data mining
- Hope: achieve the same effect that SQL had on relational databases
- Various proposals:
  - DMQL (Han et al. 96)
  - MINE operator (Meo et al. 96)
  - M-SQL (Imielinski et al. 99)
  - query flocks (Tsur et al. 98)

MINE operator of (Meo et al. 96)

References - DMQL
- J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A Data Mining Query Language for Relational Databases. In Proc. 1996 SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), pp. 27-33, Montreal, Canada, June 1996.
- R. Meo, G. Psaila, S. Ceri. A New SQL-like Operator for Mining Association Rules. In Proc. VLDB 96, 1996 Int. Conf. Very Large Data Bases, Bombay, India, pp. 122-133, Sept. 1996.
- T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3: 373-408, 1999.
- S. Tsur, J. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov. Query flocks: a generalization of association rule mining. In Proc. 1998 ACM-SIGMOD, pp. 1-12, 1998.

Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
  - from tools …
  - … to suites …
  - … to solutions
- Towards data mining query languages
- DATASIFT: a logic-based KDSE
- Future research challenges

DATASIFT - towards a logic-based KDSE
- DATASIFT is LDL++ (Logic Data Language, MCC & UCLA) extended with mining primitives (decision trees & association rules)
- LDL++ syntax: Prolog-like deductive rules
- LDL++ semantics: SQL extended with recursion (and more)
- Integration of deduction and induction
- Employed to systematically develop the methodology for MBA (market basket analysis) and audit planning
- See Pisa KDD Lab references

Our position
- A suitable integration of
  - deductive reasoning (logic database languages)
  - inductive reasoning (association rules & decision trees)
- provides a viable solution to high-level problems in knowledge-intensive data analysis applications

Our goal
- Demonstrate how we support design and control of the overall KDD process and the incorporation of background knowledge
  - data preparation
  - knowledge extraction
  - post-processing and knowledge evaluation
  - business rules
  - autofocus data mining

With respect to other DMQLs
- extending logic query languages yields the extra expressiveness needed to bridge the gap between
  - data mining (e.g., association rule mining)
  - vertical applications (e.g., market basket analysis)

Architecture - client agent
- User interface
- Access to business rules and visualization of results through
  - web browser to control interaction
  - MS Excel objects (sheets and charts) to represent output of analysis (association rules)

Architecture - server agent
- A query engine (mediator)
  - record previous analyses
  - metadata/meta knowledge
  - interaction with other components
- LDL++ server
  - extended with external calls to DBMSs and to …
- Inductive modules
  - Apriori
  - classifiers (decision trees)
- Coupling with DBMS using the Cache-mine approach
- Performance comparable with SQL-based approaches on the same mining queries (Giannotti et al. 2000)

Deductive rules in LDL++
A small database of cash register transactions:
  basket(1, fish). basket(1, bread).
  basket(2, bread). basket(2, milk). basket(2, onions). basket(2, fish).
  basket(3, bread). basket(3, orange). basket(3, milk).
- E.g.: select transactions involving milk
  milk_basket(T, I) <- basket(T, I), basket(T, milk).
- Querying ?- milk_basket(T, I)
  milk_basket(2, bread). milk_basket(2, milk). milk_basket(2, onions). milk_basket(2, fish).
  milk_basket(3, bread). milk_basket(3, orange). milk_basket(3, milk).
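
The selection expressed by the milk_basket rule can be cross-checked with a small Python sketch. This is only an illustration, not how LDL++ evaluates rules: storing basket facts as a set of (transaction, item) tuples and the constant name BASKET are assumptions of the sketch.

```python
# Facts from the slide: basket(Transaction, Item).
BASKET = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def milk_basket(basket):
    """Return every (T, I) whose transaction T also contains milk."""
    # Transactions that contain milk: the basket(T, milk) goal.
    milk_tids = {t for t, item in basket if item == "milk"}
    # Join back with basket(T, I) to list all items of those transactions.
    return sorted((t, item) for t, item in basket if t in milk_tids)


if __name__ == "__main__":
    for t, item in milk_basket(BASKET):
        print(f"milk_basket({t}, {item}).")
```

Transaction 1 contains no milk, so none of its items appear in the answer, which matches the query result listed on the slide.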

Aggregates in LDL++
A small database of cash register transactions:
  basket(1, fish). basket(1, bread).
  basket(2, bread). basket(2, milk). basket(2, onions). basket(2, fish).
  basket(3, bread). basket(3, orange). basket(3, milk).
- E.g.: count occurrences of pairs of distinct items in all transactions (count is the aggregate)
  pair(I1, I2, count<T>) <- basket(T, I1), basket(T, I2), I1 ≠ I2.
- Querying ?- pair(fish, bread, N)
  pair(fish, bread, 2) (i.e., N=2)
- Aggregates are the logical interface between the deductive and the inductive environment.
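
To make the counting aggregate concrete, here is a minimal Python sketch of the same computation. The names pair_counts and BASKET are illustrative assumptions, and the grouping by transaction is just one way to evaluate the join.

```python
from collections import Counter
from itertools import permutations

# Facts from the slide: basket(Transaction, Item).
BASKET = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def pair_counts(basket):
    """Count, for each ordered pair of distinct items, the transactions containing both."""
    items_by_tid = {}
    for t, item in basket:
        items_by_tid.setdefault(t, set()).add(item)

    counts = Counter()
    for items in items_by_tid.values():
        # permutations(items, 2) yields ordered pairs of distinct items (I1 != I2).
        for i1, i2 in permutations(items, 2):
            counts[(i1, i2)] += 1
    return counts


if __name__ == "__main__":
    counts = pair_counts(BASKET)
    print(counts[("fish", "bread")])
```

Fish and bread occur together in transactions 1 and 2, so the count for that pair is 2, in agreement with the query answer N=2.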

Association rules in LDL++
  basket(1, fish). basket(1, bread).
  basket(2, bread). basket(2, milk). basket(2, onions). basket(2, fish).
  basket(3, bread). basket(3, orange). basket(3, milk).
- E.g., compute one-to-one association rules with at least 40% support
  rules(patterns<0.4, 0, {I1, I2}>) <- basket(T, I1), basket(T, I2).
- patterns is the aggregate interfacing the computation of association rules
- patterns

Association rules in LDL++
  basket(1, fish). basket(1, bread).
  basket(2, bread). basket(2, milk). basket(2, onions). basket(2, fish).
  basket(3, bread). basket(3, orange). basket(3, milk).
- Result of the query ?- rules(X, Y, S, C)
  rules({milk}, {bread}, 0.66, 1) i.e. milk → bread [0.66, 1]
  rules({bread}, {milk}, 0.66)
  rules({fish}, {bread}, 0.66, 1)
  rules({bread}, {fish}, 0.66)
- Same status for data and induced rules
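
The support and confidence figures above can be reproduced with a brute-force Python sketch. It is not the Apriori-based module used by DATASIFT; the function name one_to_one_rules and its parameters are assumptions made for illustration, with support taken as the fraction of transactions containing both items and confidence as that count divided by the count of the left-hand item.

```python
from itertools import permutations

# Facts from the slide: basket(Transaction, Item).
BASKET = {
    (1, "fish"), (1, "bread"),
    (2, "bread"), (2, "milk"), (2, "onions"), (2, "fish"),
    (3, "bread"), (3, "orange"), (3, "milk"),
}


def one_to_one_rules(basket, min_support, min_confidence=0.0):
    """Return {(left, right): (support, confidence)} for single-item rules."""
    items_by_tid = {}
    for t, item in basket:
        items_by_tid.setdefault(t, set()).add(item)
    n_transactions = len(items_by_tid)

    def frequency(*items):
        # Number of transactions that contain all the given items.
        wanted = set(items)
        return sum(1 for present in items_by_tid.values() if wanted <= present)

    all_items = sorted({item for _, item in basket})
    rules = {}
    for left, right in permutations(all_items, 2):
        both = frequency(left, right)
        support = both / n_transactions
        confidence = both / frequency(left)
        if support >= min_support and confidence >= min_confidence:
            rules[(left, right)] = (support, confidence)
    return rules


if __name__ == "__main__":
    for (left, right), (s, c) in sorted(one_to_one_rules(BASKET, 0.4).items()):
        print(f"rules({{{left}}}, {{{right}}}, {s:.2f}, {c:.2f})")
```

With a 40% support threshold only the milk/bread and fish/bread pairs survive, each with support 2/3; the slide shows this value truncated to 0.66.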

Reasoning on item hierarchies
- Which rules survive/decay up/down the item hierarchy?
  rules_at_level(I, pattern<S, …>) <- itemset_abstraction(I, Tid, Itemset).
  preserved_rules(Left, Right) <- rules_at_level(I, Left, Right, _, _), rules_at_level(I+1, Left, Right, _, _).

Business rules: reasoning on promotions
- Which rules are established by a promotion?
  interval(before, -∞, 3/7/1998).
  interval(promotion, 3/8/1998, 3/30/1998).
  interval(after, 3/31/1998, +∞).
  established_rules(Left, Right) <- not rules_partition(before, Left, Right, _, _), rules_partition(promotion, Left, Right, _, _), rules_partition(after, Left, Right, _, _).
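
The established_rules definition amounts to a set expression: rules that hold during and after the promotion but not before it. A minimal Python sketch follows, assuming the rules mined in each interval are already available as sets of (left, right) pairs; the sample rules are hypothetical and not taken from the tutorial.

```python
def established_rules(rules_by_interval):
    """Rules absent before the promotion but present during and after it."""
    before = rules_by_interval.get("before", set())
    promotion = rules_by_interval.get("promotion", set())
    after = rules_by_interval.get("after", set())
    return (promotion & after) - before


# Hypothetical mined rules per interval, for illustration only.
EXAMPLE = {
    "before": {("milk", "bread")},
    "promotion": {("milk", "bread"), ("fish", "wine"), ("pasta", "sauce")},
    "after": {("milk", "bread"), ("fish", "wine")},
}

if __name__ == "__main__":
    print(established_rules(EXAMPLE))
```

In this made-up data the milk/bread rule already held before the promotion and the pasta/sauce rule did not outlive it, so only the fish/wine rule counts as established.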

Business rules: temporal reasoning
- How does rule support change over time?

Decision tree construction in DATASIFT
- construct training and test set using rules
  training_set(P, Case_list) <- ...
  test_tuple(ID, F1, ..., F20, Rec, Act_rec, CAR) <- ...
- construct classifier using external call to C5.0
  tree_rules(Tree_name, P, PF, MC, BO, Rule_list) <- training_set(P, Case_list), tree_induction(Case_list, PF, MC, BO, Rule_list).
- parameters
  - pruning factor PF
  - misclassification costs MC
  - boosting BO
[Slide annotations: external call; induced classifier]

Putting decision trees at work
- prediction of target variable
  prediction(Tree_name, ID, CAR, Predicted_CAR) <- tree_rules(Tree_name, _, _, Rule_list), test_subject(ID, F1, …, F20, _, _, CAR), classify(Rule_list, [F1, …, F20], Predicted_CAR).
- Model evaluation: actual recovery of a classifier (= sum of the recovery of tuples classified as positive; sum is the aggregate)
  actual_recovery(Tree_name, sum<Actual_Recovery>) <- prediction(Tree_name, ID, _, pos), test_subject(ID, F1, …, F20, _, Actual_Recovery, _).
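
The actual-recovery measure can be illustrated with a short Python sketch. It assumes the classifier's predictions and the recovered amounts are given as dictionaries keyed by test-subject ID; the IDs and amounts below are hypothetical.

```python
def actual_recovery(predictions, recovered):
    """Sum the actual recovery over test subjects classified as positive."""
    return sum(
        recovered[subject_id]
        for subject_id, label in predictions.items()
        if label == "pos"
    )


# Hypothetical test subjects: predicted class and amount actually recovered.
PREDICTIONS = {101: "pos", 102: "neg", 103: "pos"}
RECOVERED = {101: 500.0, 102: 1200.0, 103: 0.0}

if __name__ == "__main__":
    print(actual_recovery(PREDICTIONS, RECOVERED))
```

Subject 102 carries the largest recovery but is classified as negative, so it does not contribute; this is the kind of loss the measure is meant to expose when comparing classifiers.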

Combining decision trees
- Model conjunction:
  tree_conjunction(T1, T2, ID, CAR, pos) <- prediction(T1, ID, CAR, pos), prediction(T2, ID, CAR, pos).
  tree_conjunction(T1, T2, ID, CAR, neg) <- test_subject(ID, F1, …, F20, _, _, CAR), ~tree_conjunction(T1, T2, ID, CAR, pos).
- More interesting combinations readily expressible:
  - e.g. meta learning (Chan and Stolfo 93)
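
Model conjunction can be read as follows: a test subject is positive only if both trees say so, and negative otherwise. The Python sketch below is a simplified rendering that drops the CAR argument and takes each tree's predictions as a dictionary from subject ID to label; these representations and the sample data are assumptions.

```python
def tree_conjunction(predictions_t1, predictions_t2):
    """Label a subject "pos" only when both classifiers predict "pos"."""
    combined = {}
    for subject_id, label_t1 in predictions_t1.items():
        both_positive = label_t1 == "pos" and predictions_t2.get(subject_id) == "pos"
        # Negation as failure: anything not derivable as pos is neg.
        combined[subject_id] = "pos" if both_positive else "neg"
    return combined


# Hypothetical predictions of two trees over the same test subjects.
TREE_1 = {1: "pos", 2: "pos", 3: "neg"}
TREE_2 = {1: "pos", 2: "neg", 3: "neg"}

if __name__ == "__main__":
    print(tree_conjunction(TREE_1, TREE_2))
```

The combined model is more conservative than either tree alone, which trades recall for precision on the positive class.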

We proposed...
- a KDD methodology for audit planning:
  - define an audit cost model
  - monitor training- and test-set construction
  - assess the quality of a classifier
  - tune classifier construction to specific policies
- and its formalization in a prototype logic-based KDSE, supporting:
  - integration of deduction and induction
  - integration of domain and induced knowledge
  - separation of conceptual and implementation level

Module outline
- Data analysis and KD Support Environments
- Data mining technology trends
  - from tools …
  - … to suites …
  - … to solutions
- Towards data mining query languages
- DATASIFT: a logic-based KDSE
- Future research challenges

A data mining research agenda
1. Integration with data warehouse and relational DB
2. Scalable, parallel/distributed and incremental mining
3. Data mining query language optimization
4. Multiple, integrated data mining methods
5. KDSE and methodological support for vertical applications
6. Interactive, exploratory data mining environments
7. Mining on other forms of data:
  - spatio-temporal databases
  - text
  - multimedia
  - web

Scale up!
- Scaling up existing algorithms (AI, ML, IR)
  - Association rules
  - Correlation rules
  - Causal relationship
  - Classification
  - Clustering
  - Bayesian networks

Background knowledge & constraints
- Incorporating background knowledge and constraints into existing data mining techniques
- Double benefit for DMQL: semantics and optimization!
  - traditional algorithms
    - Disproportionate computational cost for selective users
    - Overwhelming volume of potentially useless results
  - need user-controlled focus in mining process
    - Association rules containing certain items
    - Sequential patterns containing certain patterns
    - Classification?

Vertical applications of data mining
- More success stories needed!
- Current data mining systems lack a thick semantic layer (similarly to the early relational database systems)
- Verticalized data mining systems, e.g.
  - Market analysis systems
  - Fraud detection systems
- Automated mining and interactive mining: how far are they?

Autofocus data mining
[Diagram labels: policy options, business rules; selection of data mining function; fine parameter tuning of mining function]

DBMS coupling
- Tight coupling with DBMS
  - Most data mining algorithms are based on flat file data (i.e. loose coupling with DBMS)
  - A set of standard data mining operators (e.g. sampling operator)

Web mining: why?
- No standards on the web, enormous blob of unstructured and heterogeneous info
- Very dynamic
  - One new WWW server every 2 hours
  - 5 million documents in 1995
  - 320 million documents in 1998
- Indices get obsolete very quickly
- Better means needed for discovering resources and extracting knowledge

Web mining: challenges
- Today's search engines are plagued by problems
  - the abundance problem: 99% of info of no interest to 99% of people!
  - limited coverage of the Web
  - limited query interface based on keyword-oriented search
  - limited customization to individual users

Web mining
- Web content mining
  - mining what Web search engines find
  - Web document classification (Chakrabarti et al. 99)
  - warehousing a Meta-Web (Zaïane and Han 98)
  - intelligent query answering in Web search
- Web usage mining
  - Web log mining: find access patterns and trends (Zaïane et al. 98)
  - customized user tracking and adaptive sites (Perkowitz et al. 97)
- Web structure mining
  - discover authoritative pages: a page is important if important pages point to it (Chakrabarti et al. 99, Kleinberg 98)
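
The idea that a page is important if important pages point to it can be sketched as a hub/authority iteration in the spirit of (Kleinberg 98). The Python code below is a simplified illustration, not the algorithm as published: the function name hits, the fixed iteration count, and the tiny link graph are assumptions.

```python
import math


def hits(links, iterations=50):
    """Iteratively compute hub and authority scores for a link graph.

    links maps each page to the list of pages it points to.
    """
    pages = set(links) | {target for targets in links.values() for target in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    def normalize(scores):
        norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
        return {page: value / norm for page, value in scores.items()}

    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages that link to p.
        auth = normalize({
            page: sum(hub[source] for source in pages if page in links.get(source, ()))
            for page in pages
        })
        # Hub score of p: sum of authority scores of the pages p links to.
        hub = normalize({
            page: sum(auth[target] for target in links.get(page, ()))
            for page in pages
        })
    return hub, auth


# Hypothetical four-page web: a and b both point to c, and c points to d.
LINKS = {"a": ["c"], "b": ["c"], "c": ["d"]}

if __name__ == "__main__":
    hub_scores, auth_scores = hits(LINKS)
    print(max(auth_scores, key=auth_scores.get))
```

In this toy graph page c is pointed to by two pages and ends up with the highest authority score, while a and b, which nobody links to, have authority zero.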

Warehousing a Meta-Web (Zaïane & Han 98)
- Meta-Web: summarizes the contents and structure of the Web, which evolves with the Web
- Layer 0: the Web itself
- Layer 1: the lowest layer of the Meta-Web
  - an entry: a Web page summary, including class, time, URL, contents, keywords, popularity, weight, links, etc.
- Layer 2 and up: summary/classification/clustering
- Meta-Web is warehoused and incrementally updated
- Querying and mining is performed on or assisted by the Meta-Web
- Is it feasible/sustainable? Is XML of any help?

Meta-Web (from Jiawei Han's panel talk @ SIGMOD 99)
[Figure: layered Meta-Web. Layer 0 at the bottom; Layer 1: Generalized Descriptions; ...; Layer n: More Generalized Descriptions]

Weblog mining
- Web servers register a log entry for every single access.
- A huge number of accesses (hits) are registered and collected in an ever-growing web log.
- Why warehouse/mine web logs?
  - Enhance server performance by learning access patterns of general or particular users (guess what the user will ask next and pre-cache!)
  - Improve system design of web applications
  - Identify potential prime advertisement locations
- Greatest peril: the privacy pitfall
  - See e.g. (Markoff 99), the rise of the Little Brother.
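
The "guess what the user will ask next" idea can be sketched as a first-order transition count over page sequences. This is an illustrative Python sketch only: it assumes the raw log has already been split into per-user sessions, and the function names and sample sessions are hypothetical.

```python
from collections import Counter, defaultdict


def next_page_model(sessions):
    """Count, for each page, how often each other page was requested right after it."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(model, page):
    """Return the most frequent successor of a page, or None if it was never followed."""
    if page not in model:
        return None
    return model[page].most_common(1)[0][0]


# Hypothetical sessions reconstructed from a web log.
SESSIONS = [
    ["/", "/products", "/cart"],
    ["/", "/products", "/products/42"],
    ["/", "/about"],
    ["/products", "/cart"],
]

if __name__ == "__main__":
    model = next_page_model(SESSIONS)
    print(predict_next(model, "/"))
```

A server could use such a prediction to pre-cache the likely next page; richer models (longer histories, per-user profiles) follow the same pattern but raise exactly the privacy concerns noted above.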

Some web mining references
- M. Perkowitz and O. Etzioni. Adaptive sites: Automatically learning from user access patterns. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
- J. Pitkow. In search of reliable usage data on the www. In Proc. 6th Int. World Wide Web Conf., Santa Clara, California, April 1997.
- T. Sullivan. Reading reader reaction: A proposal for inferential analysis of web server log files. In Proc. 3rd Conf. Human Factors & the Web, Denver, Colorado, June 1997.
- O. R. Zaiane, M. Xin, and J. Han. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In Proc. Advances in Digital Libraries Conf. (ADL'98), pages 19-29, Santa Barbara, CA, April 1998.
- O. R. Zaiane and J. Han. Resource and knowledge discovery in global information systems: a preliminary design and experiment. In Proc. KDD'95, pp. 331-336, 1995.
- O. R. Zaiane and J. Han. WebML: querying the world-wide web for resources and knowledge. In Proc. Int. Workshop on Web Information and Data Management (WIDM 98), pp. 9-12, 1998.
- S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, et al. Mining the web's link structure. COMPUTER, 32: 60-67, 1999.
- S. Chakrabarti, B. E. Dom, P. Indyk. Enhanced hypertext classification using hyperlinks. In Proc. 1998 ACM-SIGMOD, pp. 307-318, 1999.
- J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. ACM-SIAM Symp. on Discrete Algorithms, 1998.
- J. Markoff. The Rise of Little Brother. Upside, Apr. 1999; http://www.upside.com/texis/mvm/story?id=36d4613c0

Pisa KDD Lab references
- F. Giannotti and G. Manco. Making Knowledge Extraction and Reasoning Closer. In Proc. PAKDD'99, The Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, Kyoto, 2000.
- F. Giannotti and G. Manco. Querying Inductive Databases via Logic-Based User Defined Aggregates. In Proc. PKDD'99, The Third Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases, Prague, Sept. 1999.
- F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. Using Data Mining Techniques in Fiscal Fraud Detection. In Proc. DaWaK'99, First Int. Conf. on Data Warehousing and Knowledge Discovery, Florence, Italy, Sept. 1999.
- F. Bonchi, F. Giannotti, G. Mainetto, D. Pedreschi. A Classification-based Methodology for Planning Audit Strategies in Fraud Detection. In Proc. KDD-99, ACM-SIGKDD Int. Conf. on Knowledge Discovery & Data Mining, San Diego (CA), August 1999.
- F. Giannotti, G. Manco, D. Pedreschi and F. Turini. Experiences with a logic-based knowledge discovery support environment. In Proc. 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (SIGMOD'99 DMKD), Philadelphia, May 1999.
- F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Integration of Deduction and Induction for Mining Supermarket Sales Data. In Proc. PADD'99, Practical Application of Data Discovery, Int. Conference, London, April 1999.
- F. Giannotti, G. Manco, M. Nanni, D. Pedreschi. Nondeterministic, Nonmonotonic Logic Databases. IEEE Trans. on Knowledge and Data Engineering, 2000.
- F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Using deduction for intelligent data analysis. Submitted, 2000. http://www-kdd.di.unipi.it/
- P. Becuzzi, M. Coppola, S. Ruggieri and M. Vanneschi. Parallelisation of C4.5 as a particular divide and conquer computation. Proc. 3rd Workshop on High Performance Data Mining, Springer-Verlag LNCS, 2000.