
lecture_1.pptx

  • Number of slides: 28

Practical Data Mining
Lecture 1. Introduction to machine learning
Victor Kantor, MIPT, 2014

Lecturer
Victor Kantor, ABBYY – Head of learning group, R&D
Courses:
• MIPT, Data Mining special course: since Feb 2012
• MIPT, DIHT Machine learning course: Feb 2013 – Jun 2014
• Yandex, Text Mining course for MIPT bachelors: since Sep 2014
• MIPT, DIHT Image Analysis course: since Sep 2014
Projects:
• “Smart tagger” – automatic text document tagger, with Nikita Pustovoytov & Dmitry Elisov
• “2Long2Read” (2l2r.ru) – automatic text summarization, with Makar Krasnoperov, Anatoly Prokhorchuk, Vitaly Pavlenko, Julia Olkhovskaya, Dmitry Elisov, Kirill Yanykin, Artyom Sannikov, Diana Khammatova
• “Broca” (broca.ru) – off-the-shelf natural language processing tools as a service, with Anatoly Prokhorchuk, Alexander Nikitin, Azat Davletshin

Plan
1. Data Scientist?! WAT???
2. Typical machine learning task example
3. Classification, clustering, regression tasks and simple algorithms
4. Feature engineering
5. Overfitting

Typical machine learning example
German credit data set (UCI repository)
Training set

Typical machine learning example
German credit data set (UCI repository)
Attribute 1: Status of existing checking account
1 : ... < 0 DM
2 : 0 <= ... < 200 DM
3 : ... >= 200 DM / salary assignments for at least 1 year
4 : no checking account

Typical machine learning example
German credit data set (UCI repository)
Attribute 2: Duration in months

Typical machine learning example
German credit data set (UCI repository)
Answer: 1 – Good, 2 – Bad

Typical machine learning example
Task (supervised classification): predict the classes (1 or 2) of the objects in the test set, where the answers are unknown (shown as "?")
Global task: construct an algorithm that can generate a classification algorithm (“fitted model”) from a given train set
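
The distinction between the learning algorithm and the fitted model it produces can be sketched in a few lines of Python. This is only an illustration: the function name `fit_majority_class` and the toy rows are hypothetical (they are not taken from the German credit data set), and the "model" is a deliberately trivial baseline that always answers with the most frequent train class.

```python
from collections import Counter

def fit_majority_class(train_features, train_labels):
    """Learning algorithm: takes a train set, returns a fitted model (a function)."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]

    def predict(features):
        # Trivial baseline: ignores the features of the new object.
        return most_common_label

    return predict

# Hypothetical toy train set: (checking account status, duration in months) -> class
train_features = [(1, 6), (2, 48), (4, 12), (1, 42), (1, 24)]
train_labels = [1, 2, 1, 1, 2]

model = fit_majority_class(train_features, train_labels)  # fitting
test_features = [(4, 36), (2, 9)]                         # answers unknown
predictions = [model(x) for x in test_features]
print(predictions)
```

Any real classifier follows the same shape: a fitting step that consumes the train set and a prediction step that is applied to new objects.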

Standard tasks: classification
Input (train set): features of N objects with known class
Output: classifier (class prediction algorithm) for new data.

Standard tasks: classification

Simple classifier: kNN
k nearest neighbours, k = 1
Possibly outliers

Simple classifier: kNN
k nearest neighbours, k = 5
Mistakes?
Possibly outliers
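
A minimal sketch of the k nearest neighbours classifier, assuming Euclidean distance and a simple majority vote (the slides do not fix either choice). The points, the labels and the name `knn_classify` are made up for illustration; one "bad" point is planted inside the "good" group to play the role of an outlier.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k):
    """Predict the majority class among the k train points closest to `query`."""
    # Indices of train points, sorted by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: a "good" group, a "bad" group, and one "bad" outlier
# at (0.5, 0.4) sitting in the middle of the "good" group.
train_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.4),
                (5, 5), (5, 6), (6, 5), (6, 6)]
train_labels = ["good", "good", "good", "good", "bad",
                "bad", "bad", "bad", "bad"]

query = (0.6, 0.5)
label_k1 = knn_classify(train_points, train_labels, query, k=1)
label_k5 = knn_classify(train_points, train_labels, query, k=5)
print(label_k1, label_k5)
```

With k = 1 the prediction copies the single nearest point, so the outlier decides the answer; with k = 5 the four surrounding "good" points outvote it.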

Clustering (unsupervised classification)
Input (train set): features of N objects
Output: classes (clusters) discovered in the train set, cluster labels for the train examples, and cluster label prediction for new data
Example: market segmentation

Clustering (unsupervised classification)
From clusters.uk.com:

Clustering (unsupervised classification)

Simple clustering algorithm: kMeans

Simple clustering algorithm: kMeans
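
The slides show kMeans only as pictures. As a rough sketch, the usual formulation (Lloyd's iterations) alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name `kmeans`, the fixed number of iterations, the hand-picked initial centroids and the toy points are all assumptions made for this example.

```python
import math

def kmeans(points, initial_centroids, n_iterations=10):
    """Plain k-means: returns final centroids and a cluster label per point."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
centroids, labels = kmeans(points, initial_centroids=[(0, 0), (5, 5)])
print(labels)
print(centroids)
```

In practice the initial centroids are usually chosen at random and the result can depend on that choice, which is why the sketch takes them as an explicit argument.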

Regression
Input (train set): features of N objects with known values of the real-valued parameter to be predicted
Output: prediction algorithm for new data
Pictures for one-feature case: x – feature, y – predicted value
[Figure panels: Linear model, Quadratic model]
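
For the one-feature case, a linear and a quadratic model can both be fitted by least squares. The sketch below is one possible way to do it using only the standard library: it solves the normal equations with Gauss-Jordan elimination. The names `fit_polynomial` and `predict`, and the toy data (y = x²), are assumptions for illustration; the slides do not say how the models in the pictures were fitted.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest degree first."""
    n = degree + 1
    # Augmented normal equations: row r is [sum x^(r+c) for each c | sum y*x^r].
    matrix = [[sum(x ** (r + c) for x in xs) for c in range(n)]
              + [sum(y * x ** r for x, y in zip(xs, ys))]
              for r in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    # Assumes enough distinct x values, otherwise the system is singular.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(matrix[r][col]))
        matrix[col], matrix[pivot] = matrix[pivot], matrix[col]
        for r in range(n):
            if r != col:
                factor = matrix[r][col] / matrix[col][col]
                matrix[r] = [a - factor * b
                             for a, b in zip(matrix[r], matrix[col])]
    return [matrix[i][n] / matrix[i][i] for i in range(n)]

def predict(coefficients, x):
    return sum(w * x ** i for i, w in enumerate(coefficients))

# Hypothetical toy data generated from y = x^2.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]

linear = fit_polynomial(xs, ys, degree=1)
quadratic = fit_polynomial(xs, ys, degree=2)

def squared_error(coefficients):
    return sum((predict(coefficients, x) - y) ** 2 for x, y in zip(xs, ys))

print(linear, squared_error(linear))
print(quadratic, squared_error(quadratic))
```

On this data the linear model cannot follow the curvature and keeps a visible error, while the quadratic model matches the points almost exactly.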

Simple regression algorithm: kNN
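
kNN carries over to regression by averaging the known values of the k nearest train objects instead of voting. A minimal one-feature sketch, assuming an unweighted mean (weighted variants exist too); the name `knn_regress` and the toy data are hypothetical.

```python
def knn_regress(train_x, train_y, query_x, k):
    """Predict the mean target value of the k train objects closest to query_x."""
    order = sorted(range(len(train_x)),
                   key=lambda i: abs(train_x[i] - query_x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

# Hypothetical toy data: y = x^2 at five points.
train_x = [1, 2, 3, 4, 5]
train_y = [1.0, 4.0, 9.0, 16.0, 25.0]

prediction_k1 = knn_regress(train_x, train_y, query_x=2.9, k=1)
prediction_k2 = knn_regress(train_x, train_y, query_x=2.9, k=2)
print(prediction_k1, prediction_k2)
```

With k = 1 the prediction is a step function of x; larger k smooths it by averaging over more neighbours.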

Feature engineering
• Extraction – getting features from data
• Selection – choosing “top” features
• Transformation – constructing new, better features based on already existing features
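
The three steps can be illustrated on a tiny made-up example. Everything here is hypothetical: the raw record format, the field names and the helper names `extract`, `transform` and `select`. In particular, dropping constant features is only a stand-in for selection; choosing “top” features normally means ranking them by some relevance score.

```python
from statistics import pvariance

# Hypothetical raw data: one semicolon-separated string per object.
raw_records = [
    "duration=6;amount=1200;currency=DM",
    "duration=48;amount=6000;currency=DM",
    "duration=12;amount=2100;currency=DM",
]

def extract(record):
    """Extraction: turn a raw record into numeric features."""
    fields = dict(item.split("=") for item in record.split(";"))
    return {
        "duration": float(fields["duration"]),
        "amount": float(fields["amount"]),
        "is_dm": 1.0 if fields["currency"] == "DM" else 0.0,
    }

def transform(features):
    """Transformation: build a new feature from existing ones."""
    enriched = dict(features)
    enriched["amount_per_month"] = features["amount"] / features["duration"]
    return enriched

def select(rows):
    """Selection (simplified): keep only features that vary across objects."""
    names = list(rows[0])
    kept = [name for name in names
            if pvariance([row[name] for row in rows]) > 0]
    return [{name: row[name] for name in kept} for row in rows]

rows = select([transform(extract(record)) for record in raw_records])
print(sorted(rows[0]))
```

Here `is_dm` is removed because it takes the same value for every object and so cannot help separate them.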

Example: text features
• Dataset: 20 Newsgroups
• Email messages – 20 topics (classes)
• Let’s try classification for messages of two classes: auto and politics.mideast

Text feature extraction
• Message example 1:
From: carl_f_hoffman@cup.portal.com
Newsgroups: rec.autos
Subject: 1993 Infiniti G20
Message-ID: <78834@cup.portal.com>
Date: Mon, 5 Apr 93 07:36:47 PDT
Organization: The Portal System (TM)
Lines: 26
I am thinking about getting an Infiniti G20. In consumer reports it is ranked high in many catagories including highest in reliability index for compact cars. Mitsubushi Galant was second followed by Honda Accord). A couple of things though: 1) In looking around I have yet to see anyone driving this car. I see lots of Honda's and Toyota's.

Text feature extraction
• Message example 2:
From: Bob.Waldrop@f418.n104.z1.fidonet.org (Bob Waldrop)
Subject: Celebrate Liberty! 1993
Message-ID: <1993Apr5.201336.16132@dsd.es.com>
Followup-To: talk.politics.misc
Announcing... CELEBRATE LIBERTY! 1993 LIBERTARIAN PARTY NATIONAL CONVENTION AND POLITICAL EXPO THE MARRIOTT HOTEL AND THE SALT PALACE SALT LAKE CITY, UTAH INCLUDES INFORMATION ON DELEGATE DEALS! (Back by Popular Demand!) The convention will be held at the Salt Palace Convention Center and the

Text features: bag-of-words model
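
In the bag-of-words model a text is represented by how many times each vocabulary word occurs in it, ignoring word order. A minimal sketch, assuming lowercasing and a letters-only tokenizer (real pipelines make many other choices here); the two short documents and the names `tokenize` and `bag_of_words` are made up for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    vocabulary = sorted({token for doc in documents for token in tokenize(doc)})
    vectors = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical toy documents.
documents = ["The car is a fast car", "The party convention"]
vocabulary, vectors = bag_of_words(documents)
print(vocabulary)
print(vectors)
```

Each document becomes a fixed-length numeric vector, which is exactly the kind of input the classifiers above expect.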

Classifying texts by topic
Text document -> Bag-of-words -> Classifier
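
The whole pipeline fits in a short sketch. The slide does not name the classifier, so a 1-nearest-neighbour rule with cosine similarity between word-count vectors is assumed here, purely as an example. The train texts, topic names and function names are hypothetical and much smaller than the real newsgroup messages.

```python
import math
import re
from collections import Counter

def bag(text):
    """Bag-of-words as a word -> count mapping."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(text, train_texts, train_topics):
    """Return the topic of the most similar train text (1-nearest neighbour)."""
    query = bag(text)
    best = max(range(len(train_texts)),
               key=lambda i: cosine(query, bag(train_texts[i])))
    return train_topics[best]

# Hypothetical toy train set with two topics.
train_texts = ["reliable compact car with a good engine",
               "party convention and political expo"]
train_topics = ["autos", "politics"]

topic_1 = classify("which compact car is most reliable", train_texts, train_topics)
topic_2 = classify("the political convention", train_texts, train_topics)
print(topic_1, topic_2)
```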

Overfitting
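
One way to see overfitting is to compare accuracy on the train set with accuracy on separate test data. The sketch below reuses the kNN idea on a made-up one-feature problem (true class is 1 when x >= 5) in which two train labels are deliberately flipped to act as noise. The data and function names are hypothetical.

```python
from collections import Counter

def knn_classify(train_x, train_y, query, k):
    """Majority class among the k train objects nearest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def accuracy(train_x, train_y, xs, ys, k):
    hits = sum(knn_classify(train_x, train_y, x, k) == y for x, y in zip(xs, ys))
    return hits / len(xs)

# Hypothetical train set: labels at x = 2 and x = 7 are flipped (noise).
train_x = list(range(10))
train_y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Hypothetical test set with clean labels, slightly shifted from the train points.
test_x = [x + 0.1 for x in range(10)]
test_y = [0] * 5 + [1] * 5

results = {}
for k in (1, 3):
    results[k] = (accuracy(train_x, train_y, train_x, train_y, k),
                  accuracy(train_x, train_y, test_x, test_y, k))
    print(k, results[k])
```

With k = 1 the model memorises the train set, including the two noisy labels: it is perfect on the train data and worse on the test data. With k = 3 it no longer reproduces the noisy labels, so train accuracy drops while test accuracy improves.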

Plan
1. Data Scientist?! WAT???
2. Typical machine learning task example
3. Classification, clustering, regression tasks and simple algorithms
4. Feature engineering
5. Overfitting
Thanks for your attention :)