lecture_1.pptx
- Number of slides: 28
Practical Data Mining. Lecture 1: Introduction to machine learning. Victor Kantor, MIPT, 2014
Lecturer: Victor Kantor, ABBYY – Head of learning group, R&D. Courses: • MIPT, Data Mining special course: since Feb 2012 • MIPT, DIHT Machine learning course: Feb 2013 – Jun 2014 • Yandex, Text Mining course for MIPT bachelors: since Sep 2014 • MIPT, DIHT Image Analysis course: since Sep 2014. Projects: • “Smart tagger” – automatic text document tagger, with Nikita Pustovoytov & Dmitry Elisov • “2Long2Read” (2l2r.ru) – automatic text summarization, with Makar Krasnoperov, Anatoly Prokhorchuk, Vitaly Pavlenko, Julia Olkhovskaya, Dmitry Elisov, Kirill Yanykin, Artyom Sannikov, Diana Khammatova • “Broca” (broca.ru) – off-the-shelf natural language processing tools as a service, with Anatoly Prokhorchuk, Alexander Nikitin, Azat Davletshin
Plan 1. Data Scientist?! WAT??? 2. Typical machine learning task example 3. Classification, clustering, regression tasks and simple algorithms 4. Feature engineering 5. Overfitting
Typical machine learning example German credit data set (UCI repository) Training set
Typical machine learning example German credit data set (UCI repository) Attribute 1: Status of existing checking account. 1: ... < 0 DM; 2: 0 <= ... < 200 DM; 3: ... >= 200 DM / salary assignments for at least 1 year; 4: no checking account
Typical machine learning example German credit data set (UCI repository) Attribute 2: Duration in months
Typical machine learning example German credit data set (UCI repository) Answer: 1 – Good, 2 – Bad
Typical machine learning example Task (supervised classification): predict the classes (1 or 2) of the test set objects, whose answers are unknown (? ? ? ?). Global task: construct an algorithm that generates a classification algorithm (“fitted model”) from a given train set
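The distinction between the learning algorithm and the fitted model it produces can be sketched in a few lines. This is only an illustration: the rows below are made up (they are not taken from the German credit data set), and `fit_majority` is a hypothetical baseline that ignores the features, chosen because it is the shortest possible "algorithm that generates a classification algorithm".

```python
from collections import Counter

def fit_majority(train_X, train_y):
    """Learning algorithm: takes a train set and returns a fitted model."""
    majority = Counter(train_y).most_common(1)[0][0]

    def predict(x):
        # This baseline ignores the features and always answers
        # with the most frequent class seen in the train set.
        return majority

    return predict

# Made-up rows: (checking-account status code, duration in months).
train_X = [(1, 24), (4, 12), (2, 36), (4, 6), (3, 18)]
train_y = [2, 1, 2, 1, 1]          # 1 = good, 2 = bad

model = fit_majority(train_X, train_y)          # the "fitted model"
test_pred = [model(x) for x in [(2, 48), (4, 9)]]  # answers for the test set
```

Any real classifier follows the same shape: a fitting step that consumes the train set, and a returned object that predicts classes for new objects.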
Standard tasks: classification Input (train set): features of N objects with known class Output: classifier (class prediction algorithm) for new data.
Standard tasks: classification
Simple classifier: kNN (k nearest neighbours) k = 1 Possibly outliers
Simple classifier: kNN (k nearest neighbours) k = 5 Mistakes? Possibly outliers
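A minimal sketch of kNN classification, assuming Euclidean distance and a plain majority vote (the slides do not fix either choice). The points are made up; one class-2 point is deliberately placed inside the class-1 region to play the role of an outlier.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Predict the class of x by majority vote among its k nearest train points."""
    # Indices of train points, sorted by Euclidean distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Class 1 sits near the origin, class 2 near (5, 5);
# the last point is a class-2 outlier inside the class-1 region.
train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (0.6, 0.6)]
train_y = [1, 1, 1, 1, 2, 2, 2, 2]

query = (0.7, 0.7)
pred_k1 = knn_predict(train_X, train_y, query, k=1)  # follows the outlier
pred_k5 = knn_predict(train_X, train_y, query, k=5)  # outlier is outvoted
```

With k = 1 the query takes the label of the single nearest point, which here is the outlier; with k = 5 the four surrounding class-1 points outvote it.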
Clustering (unsupervised classification) Input (train set): features of N objects Output: classes (clusters) discovered in the train set, cluster labels for the train examples, and cluster label prediction for new data Example: market segmentation
Clustering (unsupervised classification) From clusters.uk.com:
Clustering (unsupervised classification)
Simple clustering algorithm: kMeans
Simple clustering algorithm: kMeans
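The slides show kMeans only as pictures; a minimal sketch of the usual formulation (Lloyd's algorithm: assign each point to the nearest centre, then move each centre to the mean of its points) might look as follows. The points and the starting centres are made up, and passing the initial centres explicitly is a simplification chosen here to keep the run deterministic.

```python
import math

def kmeans(points, centers, n_iter=10):
    """Lloyd's algorithm: alternate the assignment step and the update step."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each centre moves to the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # empty cluster: keep the old centre
        if new_centers == centers:              # nothing moved: converged
            break
        centers = new_centers
    return centers, labels

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
centers, labels = kmeans(points, centers=[(0, 0), (1, 0)])
```

Even though both starting centres lie in the same group of points, the second centre is pulled towards the far group after the first update, and the run ends with one centre per group.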
Regression Input (train set): features of N objects with known values of the real-valued parameter to be predicted Output: prediction algorithm for new data Pictures for the one-feature case: x – feature, y – predicted value; linear model and quadratic model
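For the one-feature case, the linear and quadratic models can both be fitted by least squares. The sketch below assumes a polynomial basis and solves the normal equations with a small Gauss-Jordan elimination; the data is made up and lies exactly on y = x^2, so the two fits are easy to compare. The helper names `fit_polynomial` and `sse` are illustrative only.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) w = A^T y."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y] for the basis 1, x, ..., x^degree.
    M = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]   # coefficients w0, w1, ...

def sse(xs, ys, w):
    """Sum of squared errors of the polynomial with coefficients w."""
    return sum((sum(c * x ** i for i, c in enumerate(w)) - y) ** 2
               for x, y in zip(xs, ys))

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]                     # made-up data on y = x^2

w_lin = fit_polynomial(xs, ys, degree=1)     # linear model: y = w0 + w1*x
w_quad = fit_polynomial(xs, ys, degree=2)    # quadratic model: adds w2*x^2
```

On this data the best straight line is the horizontal line through the mean of y and leaves a substantial squared error, while the quadratic model fits the points essentially exactly.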
Simple regression algorithm: kNN
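The slide names kNN for regression without giving a formula. A common variant, assumed here, predicts the unweighted mean of the target over the k nearest train points. The one-feature data below is made up.

```python
def knn_regress(train_x, train_y, x, k):
    """Predict y at x as the mean target of the k nearest train points (one feature)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.1, 1.9, 3.2, 3.9, 5.1]

# The three nearest neighbours of 2.6 are x = 3, 2 and 4.
pred = knn_regress(train_x, train_y, x=2.6, k=3)
```

Unlike the linear and quadratic models, this predictor has no fitted coefficients: the train set itself is the model.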
Feature engineering • Extraction – getting features from data • Selection – choosing “top” features • Transformation – constructing new better features based on already existing features
Example: text features • Dataset: 20 Newsgroups • Email messages – 20 topics (classes) • Let’s try classification for messages of two classes: rec.autos and talk.politics.mideast
Text feature extraction • Message example 1: From: carl_f_hoffman@cup.portal.com Newsgroups: rec.autos Subject: 1993 Infiniti G20 Message-ID: <78834@cup.portal.com> Date: Mon, 5 Apr 93 07:36:47 PDT Organization: The Portal System (TM) Lines: 26 I am thinking about getting an Infiniti G20. In consumer reports it is ranked high in many catagories including highest in reliability index for compact cars. Mitsubushi Galant was second followed by Honda Accord). A couple of things though: 1) In looking around I have yet to see anyone driving this car. I see lots of Honda's and Toyota's.
Text feature extraction • Message example 2: From: Bob.Waldrop@f418.n104.z1.fidonet.org (Bob Waldrop) Subject: Celebrate Liberty! 1993 Message-ID: <1993Apr5.201336.16132@dsd.es.com> Followup-To: talk.politics.misc Announcing... CELEBRATE LIBERTY! 1993 LIBERTARIAN PARTY NATIONAL CONVENTION AND POLITICAL EXPO THE MARRIOTT HOTEL AND THE SALT PALACE SALT LAKE CITY, UTAH INCLUDES INFORMATION ON DELEGATE DEALS! (Back by Popular Demand!) The convention will be held at the Salt Palace Convention Center and the
Text features: bag-of-words model
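A minimal sketch of the bag-of-words idea: a text is reduced to word counts, word order is discarded, and each document becomes a vector over a shared vocabulary. The tokenisation rule (lowercase, runs of Latin letters) is an assumption made for brevity; the two sample sentences are taken from the message examples.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts, ignoring word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

def vectorize(bag, vocabulary):
    """Turn a bag into a fixed-length count vector over the vocabulary."""
    return [bag[word] for word in vocabulary]

docs = [
    "I see lots of Honda's and Toyota's.",
    "The convention will be held at the Salt Palace Convention Center",
]
bags = [bag_of_words(d) for d in docs]
vocabulary = sorted(set().union(*bags))       # all words seen in any document
vectors = [vectorize(b, vocabulary) for b in bags]
```

Each position of a vector is one feature (the count of one vocabulary word), so any classifier that accepts numeric features can be applied to texts.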
Classifying texts by topic: Text document → Bag-of-words → Classifier
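The text-to-topic pipeline can be sketched end to end with a nearest-neighbour classifier over bags of words. The choice of cosine similarity and of k = 1 is an assumption made for this sketch, not something the slides specify, and the four training sentences are made up rather than drawn from 20 Newsgroups.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Text document -> bag-of-words (word counts)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(text, train):
    """Bag-of-words -> topic: label of the most similar training document."""
    bag = bag_of_words(text)
    scored = ((label, cosine(bag, bag_of_words(doc))) for doc, label in train)
    return max(scored, key=lambda pair: pair[1])[0]

# Made-up training documents for two topics.
train = [
    ("the car has a reliable engine and good mileage", "autos"),
    ("honda and toyota compact cars are ranked high", "autos"),
    ("the peace talks in the middle east continued", "mideast"),
    ("the government discussed the border conflict", "mideast"),
]

label = classify("I am thinking about getting a compact car from Honda", train)
```

The query shares words such as "compact", "car" and "honda" only with the first two documents, so it receives their topic.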
Overfitting
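One way to see overfitting with the tools from this lecture is to compare accuracy on the train set with accuracy on separate test points. The sketch below assumes kNN with Euclidean distance and majority vote; all points are made up, and one class-2 train point is placed inside the class-1 region as an outlier.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest train points (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def accuracy(train_X, train_y, X, y, k):
    """Share of objects in (X, y) whose class is predicted correctly."""
    hits = sum(knn_predict(train_X, train_y, x, k) == t for x, t in zip(X, y))
    return hits / len(X)

# Class 1 near the origin, class 2 near (5.5, 5.5); (0.6, 0.6) is an outlier.
train_X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.6, 0.6),
           (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = [1, 1, 1, 1, 2, 2, 2, 2, 2]
test_X = [(0.7, 0.7), (0.5, 0.6), (0.2, 0.3), (5.5, 5.5)]
test_y = [1, 1, 1, 2]

train_acc_k1 = accuracy(train_X, train_y, train_X, train_y, k=1)
test_acc_k1 = accuracy(train_X, train_y, test_X, test_y, k=1)
train_acc_k5 = accuracy(train_X, train_y, train_X, train_y, k=5)
test_acc_k5 = accuracy(train_X, train_y, test_X, test_y, k=5)
```

With k = 1 every train point is its own nearest neighbour, so the train set is reproduced perfectly, outlier included, and test points near the outlier are misclassified. With k = 5 the model gets the outlier itself wrong on the train set but classifies the test points correctly.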
Plan 1. Data Scientist?! WAT??? 2. Typical machine learning task example 3. Classification, clustering, regression tasks and simple algorithms 4. Feature engineering 5. Overfitting Thanks for your attention :)


