1e99aefd35a755075f029eea760e8055.ppt
- Number of slides: 27
Arno Knobbe, Arne Koopman (LIACS). Data Mining course: an introduction
Course Textbook: Data Mining: Practical Machine Learning Tools and Techniques, second edition, by Ian Witten and Eibe Frank, Morgan Kaufmann, ISBN 0-12-088407-0
Course Information
- New website: http://www.liacs.nl/~akoopman/DaMi/
- Old website discontinued: http://www.liacs.nl/~joost/DM/CollegeDataMining.htm
- New lecturers, new style
  - Old book (this may change next year)
  - Updated slides + some new material
  - Practical exercises
- New style of exam
  - fewer definitions, more understanding and applying
  - old exams should not be used
  - exam preparation (Dec 8) is important
Course Outline
- 08-Sep: Knobbe
- 15-Sep: Knobbe
- 22-Sep: Koopman
- 29-Sep: Knobbe
- 06-Oct: Knobbe
- 13-Oct: Koopman
- 20-Oct: Koopman
- 27-Oct: Knobbe
- 03-Nov: Knobbe
- 10-Nov: Koopman
- 17-Nov
- 24-Nov: Koopman
- 01-Dec: Koopman
- 08-Dec: Koopman, Knobbe
- 14-Jan, 10:00-13:00
Annotations, in slide order: today; practical exercise; no lecture!; start at 9:00!; exam preparation!; exam
Introduction: Data Mining, an overview and some examples
Data Mining definitions
- Data Mining: the concept of extracting previously unknown and potentially useful information from large sets of data.
- Secondary statistics: analyzing data that wasn’t originally collected for analysis.
Data Mining, the big idea
- Organizations collect large amounts of data
  - Often for administrative purposes
  - Large body of experience
- Learning from experience
- Goals
  - Prediction
  - Optimization
  - Forecasting
  - Diagnostics
  - …
2 Streams
- Mining for insight
  - Understanding a domain
  - Finding regularities between variables
  - Goal of Data Mining is mostly undefined
  - Interpretable models
  - Examples: Medicine, production, maintenance
- ‘Black-box’ Mining
  - Don’t care how you do it, just do it well
  - Optimization
  - Examples: Marketing, forecasting (financial, weather)
example: Direct Mail
Optimize the response to a mailing by targeting only those who are likely to respond:
- more response
- fewer letters
[Diagram labels: test mailing, final mailing, remainder, customer information, Data Mining, customer model, response 3%, response 30%]
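The idea of targeting can be sketched in a few lines. This is only an illustration under stated assumptions: the scores stand in for the output of some customer model, the outcomes are made-up held-out responses, and the names `response_rate` and `select_targets` are hypothetical helpers, not part of the course material.

```python
def response_rate(customers):
    """Fraction of the mailed customers who responded."""
    if not customers:
        return 0.0
    return sum(c["responded"] for c in customers) / len(customers)


def select_targets(customers, fraction):
    """Keep the top `fraction` of customers, ranked by model score."""
    ranked = sorted(customers, key=lambda c: c["score"], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical toy data: a model score and the observed response (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
customers = [{"score": s, "responded": r} for s, r in zip(scores, outcomes)]

mail_everyone = response_rate(customers)                        # all 10 letters
mail_targeted = response_rate(select_targets(customers, 0.4))   # top 4 letters only

print(f"everyone: {mail_everyone:.2f}, targeted: {mail_targeted:.2f}")
```

With these toy numbers the targeted mailing sends fewer letters and reaches a higher response rate, which is the trade-off the slide describes.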
example: Bioinformatics
- Find genes involved in disease (Parkinson’s, Celiac, Neuroblastoma)
  - Measurements from patients (1) and controls (0)
  - Gene expression: measurements of 20k genes
  - Dataset: 20,001 x 100
- Challenges
  - many variables
  - few examples (patients), testing is expensive
  - interactions between genes
Data Mining paradigms
- Classification
  - binary class variable
  - predict class of future cases
  - most popular paradigm
- Clustering
  - divide dataset into groups of similar cases
- Regression
  - numeric target variable
- Association
  - find dependencies between variables
  - basket analysis, …
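Classification
Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given).
[Figure: decision tree. Rent: Age < 35 → Yes, Age ≥ 35 → No. Buy: Price < 200K → Yes, Price ≥ 200K → No. Other → No. Node values shown in the figure: 0.2, 0.4 (Rent), 0.1 (Buy), 0.07 (Other), 0.64 (Age < 35), 0.25 (Age ≥ 35), 0.51 (Price < 200K), 0.01 (Price ≥ 200K)]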
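Building (inducing) a decision tree
[Table: training examples with columns Age, Gender, House, Price, Mortgage?. Surviving cell values by column: Age: 21, 30, 40, 32, 30, 55, 25, …; Gender: M, F, F, M, F; House: Rent, Buy, Buy; Price: 300K, 260K, 180K; Mortgage?: No, Yes]
[Figure: induced tree. Rent: Age < 35 / Age ≥ 35. Buy: Price < 200K / Price ≥ 200K. Other]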
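The slide does not say which criterion is used to choose the splits. One common choice is information gain, the reduction in class entropy after splitting on an attribute. The sketch below assumes that criterion and uses made-up rows, because the table on the slide is only partly legible; the attribute names mirror the slide's columns, but the values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical training examples (NOT the rows from the slide).
rows = [
    {"house": "Rent", "gender": "M", "mortgage": "Yes"},
    {"house": "Rent", "gender": "F", "mortgage": "Yes"},
    {"house": "Buy", "gender": "M", "mortgage": "No"},
    {"house": "Buy", "gender": "F", "mortgage": "Yes"},
    {"house": "Other", "gender": "M", "mortgage": "No"},
    {"house": "Other", "gender": "F", "mortgage": "No"},
]

gains = {a: information_gain(rows, a, "mortgage") for a in ("house", "gender")}
best = max(gains, key=gains.get)  # attribute chosen for the root split
print(gains, best)
```

A tree inducer would repeat this choice recursively on each resulting subset until the subsets are (nearly) pure.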
Applying a classifier (decision tree)
New customer: (House = Rent, Age = 32, …), prediction = Yes
[Figure: decision tree. Rent: Age < 35 → Yes, Age ≥ 35 → No. Buy: Price < 200K → Yes, Price ≥ 200K → No. Other → No]
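The tree on this slide can be written as nested conditionals. A minimal sketch, assuming the root test is on House, that the predicted class is the Mortgage? column of the training table, and that 200K means 200,000; the function and argument names are illustrative only.

```python
def predict_mortgage(house, age=None, price=None):
    """Walk the slide's decision tree from the root to a leaf."""
    if house == "Rent":
        # Renters are split on age.
        return "Yes" if age < 35 else "No"
    if house == "Buy":
        # Buyers are split on price (assumed: 200K = 200,000).
        return "Yes" if price < 200_000 else "No"
    # Every other housing situation ends in a single leaf.
    return "No"


# The new customer from the slide: House = Rent, Age = 32.
print(predict_mortgage("Rent", age=32))
```

Only the attributes on the path from root to leaf are consulted, so the customer's other attributes do not influence the prediction.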
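Graphical interpretation
- dataset with two variables + 1 class (+/-)
- graphical interpretation of decision tree
[Figure: scatter plot of + and - cases over axes x and y; the decision tree tests x < t and y < t’ are drawn as axis-parallel lines at x = t and y = t’]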
Graphical interpretation
- dataset with two variables + 1 class (+/-)
- other classifiers
[Figure: the same scatter plot of + and - cases over axes x and y, with panels labelled Support Vector Machine and Neural Network]
Applications of DM
- Marketing
  - outgoing
  - incoming
- Bioinformatics & Medicine
- Fraud detection
- Risk management
- Insurance
- Enterprise resource planning
Rhinoplastic surgery: ‘does this concern affect your daily life?’
InfraWatch: monitoring of infrastructure
Continuous monitoring of a large bridge, the ‘Hollandse Brug’
- 145 sensors
- time-dependent, at frequencies up to 100 Hz
- multi-modal (sensor, video, different frequencies)
- managing large data quantities, >1 Gb per day
InfraWatch: monitoring of infrastructure
- 34 ‘geo-phones’ (vibration sensors)
- 44 embedded strain gauges, 47 gauges outside
- 20 thermometers
- video camera
- weather station
InfraWatch sensors
Real-world application: Maintenance planning at KLM
- Routine checks of aircraft
- Maintenance requires up to 10k different parts
- Ordering parts incurs delay (costs)…
- … but so does stocking
- In theory 10k individual predictions
- Input
  - maintenance history
  - flight history, Sahara/North Pole
- Only few parts predictable
Cashflow Online
- Online personal finance overview
- All bank transactions are loaded into the application
- Transactions are classified into different categories
- Data Mining predicts category
67 Categories
- Gas, water, electricity
- House and garden maintenance
- Telephone + Internet + TV
- Membership fees ((sports) clubs)
- Life insurance / Annuity
- Interest received
- Groceries
- Mortgage interest
- To savings account
- Cash withdrawal / Chipknip
- Other insurance
- Lotteries
- Gifts
- Internal transfer
- Holidays & Recreation
- Going out, hobbies and sports
- Credit card
- Health insurance
- Fuel
- Home / Buildings insurance
- Other household
- School and study costs
- Other income
- Clothing & Shoes
- Loans
- Public transport / Taxi
- …
Fragmented results: Groceries (Boodschappen), Membership fees (Contributie)
Decision Tree over all categories
[Figure: decision tree with false/true branches]
Data Mining at LIACS
- Applications
  - bioinformatics (LUMC)
  - law enforcement (KLPD, NFI)
  - rhinoplastic surgery (NKI)
  - Hollandse Brug (Strukton, RWS, Reef Infra)
  - …
- Complex data
  - graphical data (molecules)
  - relational data (criminal careers)
  - stream data (sensor-data, click-streams)
  - …