Contributions to MiningMart
Petr Berka
Laboratory for Intelligent Systems, University of Economics, Prague
berka@vse.cz
MiningMart presentation (c) Petr Berka, LISp, 2001
University of Economics, Prague
- LISp - Laboratory for Intelligent Systems
- SALOME - Laboratory for Multidisciplinary Approaches to Decision-making Support in Economics and Management
LISp research
- probabilistic methods - decomposable probability models and Bayesian networks
- symbolic ML methods - 4FT association rules and decision rules
- logical calculi for knowledge discovery in databases
LISp activities
- Organized conferences
  - ECML'97, PKDD'99
- Organized workshops
  - Discovery Challenge (PKDD'99, PKDD 20001), WUPES'97, WUPES 2000
- International projects
  - MLNet, Sol-Eu-Net, EUNITE, KDNet
  - MUM, MGT
SALOME research
- Quantitative and AI (pattern recognition, fuzzy, neural nets) approaches to decision-making support in economics and management
SALOME activities
- Organized workshops
  - STIPR'97, MME'99
- International projects
  - Univ. Salzburg, Univ. Hokkaido, Univ. Cambridge
LISp software
- LISp-Miner (data mining system)
  - DataSource (for data manipulation)
  - 4FT-Miner (4FT association rules) and
  - KEX (decision rules)
- experimental software for building graphical models
- preprocessing procedures
  - related to KEX
  - based on information theoretic approach
LISp-Miner procedures
- DataSource
  - creating new (virtual) attributes using SQL
  - equidistant and equifrequent discretization
  - grouping attribute values
  - computing attribute-value frequencies
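
The two discretization schemes named on this slide can be illustrated with a minimal sketch. This is not the DataSource implementation; the function names and the sample `ages` data are made up for illustration. Equidistant cuts split the value range into intervals of equal width, while equifrequent cuts place roughly the same number of values in each interval.

```python
def equidistant_cuts(values, bins):
    """Cut points splitting [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def equifrequent_cuts(values, bins):
    """Cut points chosen so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def discretize(value, cuts):
    """Index of the interval the value falls into (a value equal to a cut goes right)."""
    return sum(value >= c for c in cuts)


# Hypothetical, skewed sample data.
ages = [18, 19, 20, 21, 22, 25, 30, 40, 60, 90]
wide = equidistant_cuts(ages, 3)    # equal-width intervals over 18..90
even = equifrequent_cuts(ages, 3)   # intervals with about the same number of ages
```

On skewed data such as `ages`, the equidistant cuts leave most values in the first interval, whereas the equifrequent cuts spread them almost evenly.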
LISp-Miner procedures
- 4FT-Miner (GUHA procedure)
  - 4FT association rules in the form Ant ~ Suc / Cond
- KEX
  - weighted decision rules in the form Ant => C (weight)
4FT-Miner basic idea
- Generate a (potential) rule, e.g.
  - COLOUR(red) & SIZE(small) =>(0.9, 20) TEMP(high)
  - AGE(21-30) & SALARY(low) =>(0.85, 15) PAYMENTS(High) LOAN(bad)
- Verify the rule using a four-fold table
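
A minimal sketch of the verification step follows. It assumes the two numbers attached to a rule (e.g. 0.9, 20) are read as a minimum validity a/(a+b) and a minimum count a of rows satisfying both sides; the slide does not spell this out. The function names and the data are hypothetical.

```python
def four_fold_table(rows, ant, suc):
    """Count the four frequencies (a, b, c, d) of Ant versus Suc over the rows."""
    a = b = c = d = 0
    for row in rows:
        has_ant, has_suc = ant(row), suc(row)
        if has_ant and has_suc:
            a += 1      # Ant and Suc
        elif has_ant:
            b += 1      # Ant and not Suc
        elif has_suc:
            c += 1      # not Ant and Suc
        else:
            d += 1      # neither
    return a, b, c, d


def rule_verified(a, b, p, base):
    """Assumed test: at least `base` supporting rows and validity a/(a+b) >= p."""
    return a >= base and a + b > 0 and a / (a + b) >= p


# Made-up data: 22 red and small objects (20 of them with high TEMP), 8 others.
rows = (
    [{"COLOUR": "red", "SIZE": "small", "TEMP": "high"}] * 20
    + [{"COLOUR": "red", "SIZE": "small", "TEMP": "low"}] * 2
    + [{"COLOUR": "blue", "SIZE": "small", "TEMP": "high"}] * 5
    + [{"COLOUR": "blue", "SIZE": "large", "TEMP": "low"}] * 3
)
table = four_fold_table(
    rows,
    ant=lambda r: r["COLOUR"] == "red" and r["SIZE"] == "small",
    suc=lambda r: r["TEMP"] == "high",
)
holds = rule_verified(table[0], table[1], p=0.9, base=20)
```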
KEX basic idea
- Generate a (potential) rule, e.g. YEARS-IN-COMPANY(0-3) & AGE(0-25) => LOAN(GOOD)
- If the rule refines the current set of rules (its validity a/(a+b) differs from the weight inferred during consultation), add it to the rule base with a proper weight
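
The refinement test can be sketched under two loudly stated assumptions that are not on the slide: weights are combined with a PROSPECTOR-like pseudo-Bayesian function (odds multiply), and a fixed tolerance stands in for whatever test KEX actually uses to decide that validity "differs" from the inferred weight. All names and numbers are hypothetical, and weights are assumed to lie strictly between 0 and 1.

```python
def combine(w1, w2):
    """Assumed combination of two weights: the odds of the inputs multiply."""
    return (w1 * w2) / (w1 * w2 + (1 - w1) * (1 - w2))


def refines(validity, inferred, tolerance=0.05):
    """Stand-in test: the candidate rule adds information if the two values differ."""
    return abs(validity - inferred) > tolerance


def weight_for_new_rule(validity, inferred):
    """Weight w such that combine(inferred, w) equals the observed validity."""
    odds = (validity / (1 - validity)) / (inferred / (1 - inferred))
    return odds / (1 + odds)


# Hypothetical candidate rule: a = 18 rows satisfy Ant and C, b = 2 satisfy Ant only.
a, b = 18, 2
validity = a / (a + b)
inferred = 0.6   # made-up weight the current rule base infers for the same Ant
new_weight = weight_for_new_rule(validity, inferred) if refines(validity, inferred) else None
```

With these numbers the candidate is added, and combining its weight with the previously inferred weight reproduces the observed validity.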
LISp-Miner architecture
[Diagram components: MetaData (ODBC ACCESS), LM, Data (ODBC ACCESS), Windows, Results]
Preprocessing (LISp)
- KEX-oriented
  - (fuzzy) discretization + grouping of values
  - computing the amount of noise in data
  - random sampling + balancing of data
  - handling missing values
- Information theory
  - attribute selection
  - attribute grouping
… fuzzy discretization
… amount of noise
Amount of noise: 20%
Max. possible accuracy = 80%
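
One plausible reading of "amount of noise" is the share of examples that contradict the majority class among examples with identical attribute values; no classifier using only those attributes can get them right, which caps accuracy at 1 minus the noise. The sketch below is an assumption about the definition, not the LISp procedure, and the data is invented.

```python
from collections import Counter, defaultdict


def noise_amount(examples):
    """Share of examples outvoted by the majority class of their attribute vector."""
    groups = defaultdict(Counter)
    for attributes, label in examples:
        groups[attributes][label] += 1
    contradicting = sum(
        sum(counts.values()) - max(counts.values()) for counts in groups.values()
    )
    return contradicting / len(examples)


# Invented data: 10 examples, 2 of which contradict their group's majority class.
examples = (
    [(("red", "small"), "yes")] * 4
    + [(("red", "small"), "no")] * 1
    + [(("blue", "large"), "no")] * 4
    + [(("blue", "large"), "yes")] * 1
)
noise = noise_amount(examples)
max_accuracy = 1 - noise
```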
… data sampling
- random split into training and testing set
- select random stratified sample
- balance unbalanced classes
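
A minimal sketch of a stratified train/test split and of class balancing, using only the standard library. Balancing is done here by undersampling the larger classes, which is one option among several and an assumption about what the original procedure does; the function names and the data are hypothetical.

```python
import random
from collections import defaultdict


def group_by_class(examples, label_of):
    """Collect examples into lists keyed by their class label."""
    by_class = defaultdict(list)
    for example in examples:
        by_class[label_of(example)].append(example)
    return by_class


def stratified_split(examples, label_of, test_fraction, seed=0):
    """Random split that keeps the class proportions in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for members in group_by_class(examples, label_of).values():
        members = members[:]            # do not shuffle the caller's data
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def balance_by_undersampling(examples, label_of, seed=0):
    """Keep a random sample of each class, as large as the smallest class."""
    rng = random.Random(seed)
    by_class = group_by_class(examples, label_of)
    smallest = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced


# Invented, unbalanced data: 80 good loans and 20 bad loans.
loans = [("client%d" % i, "good") for i in range(80)]
loans += [("client%d" % i, "bad") for i in range(80, 100)]
train, test = stratified_split(loans, label_of=lambda e: e[1], test_fraction=0.25)
balanced = balance_by_undersampling(loans, label_of=lambda e: e[1])
```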
… handling missing values
- remove example
- substitute missing with new value
- substitute missing with majority value
- proportional substitution
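
The four strategies can be sketched as follows. Missing values are represented by `None`, and "proportional substitution" is interpreted as drawing a replacement at random from the known values, so that replacements follow the observed value frequencies; both choices are assumptions. Function names and data are hypothetical.

```python
import random
from collections import Counter


def remove_examples(rows, attr):
    """Drop every example whose value of `attr` is missing."""
    return [r for r in rows if r[attr] is not None]


def substitute_new_value(rows, attr, new_value="unknown"):
    """Treat 'missing' as one more regular value of the attribute."""
    return [{**r, attr: new_value} if r[attr] is None else r for r in rows]


def substitute_majority(rows, attr):
    """Replace missing values with the most frequent known value."""
    known = Counter(r[attr] for r in rows if r[attr] is not None)
    majority = known.most_common(1)[0][0]
    return [{**r, attr: majority} if r[attr] is None else r for r in rows]


def substitute_proportionally(rows, attr, seed=0):
    """Replace each missing value with a random draw from the known values."""
    rng = random.Random(seed)
    known = [r[attr] for r in rows if r[attr] is not None]
    return [{**r, attr: rng.choice(known)} if r[attr] is None else r for r in rows]


# Invented data with one missing SALARY value.
rows = [{"SALARY": "low"}, {"SALARY": "low"}, {"SALARY": "high"}, {"SALARY": None}]
```

Each function returns a new list and leaves the input rows unchanged.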
… information theory
- Attribute selection - based on mutual information
- Attribute grouping - based on information content
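
A minimal sketch of attribute selection by mutual information: each attribute is scored by its mutual information with the class (estimated from relative frequencies, in bits), and attributes are ranked by that score. This illustrates the general idea only; the function names and the tiny data set are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (count / n) * math.log2((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in pxy.items()
    )


def rank_attributes(rows, attributes, target):
    """Attributes sorted from most to least informative about the target."""
    classes = [row[target] for row in rows]
    scores = {
        attr: mutual_information([row[attr] for row in rows], classes)
        for attr in attributes
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented data: SALARY determines LOAN completely, COLOUR is unrelated to it.
rows = [
    {"SALARY": "low", "COLOUR": "red", "LOAN": "bad"},
    {"SALARY": "low", "COLOUR": "blue", "LOAN": "bad"},
    {"SALARY": "high", "COLOUR": "red", "LOAN": "good"},
    {"SALARY": "high", "COLOUR": "blue", "LOAN": "good"},
]
ranking = rank_attributes(rows, ["COLOUR", "SALARY"], target="LOAN")
```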
Preprocessing architecture
[Diagram components: Input data (ASCII), Data procedure, Output data (ASCII), Results (ASCII)]
SALOME software
- Feature Selection Toolbox (Multi-Purpose Tool for Pattern Recognition)
  - feature selection
  - approximation-based modeling
  - classification
- a consulting system helping to choose the most suitable method is being developed
Search strategies for FS
Search for a subset maximizing a criterion function (distance, divergence):
- with a priori information
  - exhaustive search
  - branch and bound based algorithms
  - floating search algorithms
- without a priori information
  - approximation method
  - divergence method
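
The simplest strategy in the list, exhaustive search, can be sketched as below: every subset of the requested size is evaluated and the one with the highest criterion value is returned. The criterion is a made-up stand-in for a distance or divergence measure, not the one used in the Feature Selection Toolbox.

```python
from itertools import combinations


def exhaustive_search(features, size, criterion):
    """Evaluate every subset of the given size and return the best one."""
    return max(combinations(features, size), key=criterion)


# Hypothetical criterion: summed relevance minus a penalty for a redundant pair.
RELEVANCE = {"f1": 9, "f2": 8, "f3": 3}
REDUNDANT_PAIRS = {frozenset(("f1", "f2")): 6}


def toy_criterion(subset):
    score = sum(RELEVANCE[f] for f in subset)
    for pair, penalty in REDUNDANT_PAIRS.items():
        if pair <= set(subset):
            score -= penalty
    return score


best = exhaustive_search(["f1", "f2", "f3"], size=2, criterion=toy_criterion)
```

The two individually best features are not the best pair here because they are redundant, which is why subset search is used instead of ranking features one by one. The number of subsets grows combinatorially, which motivates the branch and bound and floating search alternatives on the slide.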
FST architecture
[Diagram components: Data (ASCII), FST, Results, Windows]
References
LISp-Miner:
- Berka, P., Ivanek, J.: Automated Knowledge Acquisition for PROSPECTOR-like Expert Systems. In: Bergadano, de Raedt (eds.): Proc. ECML'94, Springer, 1994, 339-342.
- Berka, P., Rauch, J.: Data Mining using GUHA and KEX. In: Callaos, Yang, Aguilar (eds.): 4th Int. Conf. on Information Systems, Analysis and Synthesis ISAS'98, 1998, Vol. 2, 238-244.
- Rauch, J.: Classes of Four Fold Table Quantifiers. In: Zytkow, Quafafou (eds.): Principles of Data Mining and Knowledge Discovery. Springer, 1998, 203-211.
References
Preprocessing:
- Bruha, I., Berka, P.: Discretization and Fuzzification of Numerical Attributes in Attribute-Based Learning. In: Szczepaniak, Lisboa, Kacprzyk (eds.): Fuzzy Systems in Medicine, Physica Verlag, 2000, 112-138.
- Pudil, P., Novovičová, J.: Novel Methods for Subset Selection with Respect to Problem Knowledge. IEEE Transactions on Intelligent Systems, Special Issue on Feature Transformation and Subset Selection, 1998, 66-74.
- Zvarova, J., Studeny, M.: Information theoretical approach to constitution and reduction of medical data. International Journal of Medical Informatics 45 (1997), n. 1-2, pp. 65-74.