
Machine Learning Part 2: Intermediate and Active Sampling Methods
Jaime Carbonell (with contributions from Pinar Donmez and Jingrui He)
Carnegie Mellon University, jgc@cs.cmu.edu
December, 2008 © 2008, Jaime G. Carbonell

Beyond "Standard" Learning
• Multi-Objective Learning
• Structuring Unstructured Data
  - Text Categorization
• Temporal Prediction
  - Cycle & trend detection
• Semi-Supervised Methods
  - Labeled + Unlabeled Data
  - Active Learning
  - Proactive Learning
• "Unsupervised" Learning
  - Predictor attributes, but no explicit objective
  - Clustering methods
  - Rare category detection

Multi-Objective Supervised Learning
• Several objectives to predict, overlapping sets of predictor attributes
  [Diagram: objectives obj1, obj2, obj3 each drawing on an overlapping subset of predictor attributes p1-p6]
• Independent predictions (each solved ignoring the others)
• Dependent predictions (results of earlier predictions partially feed the next round)
• Dependent case: sequence the predictions; if there is feedback, cycle until stability (or a fixed N iterations)

The Vector Space Model: How to Convert Text to "Data"
• Documents and queries are represented as term-count vectors: d_i = [c(w_1, d_i), c(w_2, d_i), ..., c(w_n, d_i)], where w_j = jth word and c(w_j, d_i) = number of occurrences of w_j in document d_i
• For topic categorization, use w_{n+1} as the objective category to predict (e.g. "finance", "sports")
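As a concrete illustration (my own minimal sketch, not code from the course), the document-to-vector conversion described above:

```python
# Minimal sketch: turn raw text into term-count vectors c(w_j, d_i),
# with a category label as the extra "objective" attribute w_{n+1}.
from collections import Counter

def term_count_vector(text, vocabulary):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]        # c(w_j, d_i) for each w_j

docs = ["stocks fell as the bank cut rates",
        "the team won the game in overtime"]
labels = ["finance", "sports"]                    # objective category to predict

vocabulary = sorted({w for d in docs for w in d.lower().split()})
vectors = [term_count_vector(d, vocabulary) for d in docs]
print(vectors[0], labels[0])
```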

Refinements to Word-Based Features
• Well-known methods
  - Stop-word removal (e.g. "it", "the", "in", ...)
  - Phrasing (e.g. "White House", "heart attack", ...)
  - Morphology (e.g. "countries" => "country")
• Feature Expansion
  - Query expansion (e.g. "cheap" => "inexpensive", "discount", "economic", ...)
• Feature Transformation & Reduction
  - Singular-value decomposition (SVD)
  - Linear discriminant analysis (LDA)
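A toy sketch of the first two refinements (hand-made stop list and suffix rules, purely illustrative; real systems use proper stemmers and phrase lists):

```python
STOP_WORDS = {"it", "the", "in", "a", "of", "and"}

def stem(token):
    # Very crude morphology, e.g. "countries" -> "country", "rates" -> "rate"
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def refine(text):
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in tokens]

print(refine("The countries in the region cut rates"))
# -> ['country', 'region', 'cut', 'rate']
```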

Query-Document Similarity (For Retrieval and for kNN)
• Traditional "cosine similarity": sim(q, d) = (q · d) / (||q|| ||d||)
• Each element in the query and document vectors is a word weight
• Rare words count more, e.g. the inverse-document-frequency weight idf(w_i) = log2(D_all / D_freq(w_i)), where D_all is the total number of documents and D_freq(w_i) is the number of documents containing w_i
• Getting the top-k documents (or web pages): rank all documents by sim(q, d) and return the k highest-scoring ones
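A small sketch (illustration only) of idf-weighted vectors, cosine similarity, and top-k retrieval as described above:

```python
import math
from collections import Counter

def idf(word, docs):
    df = sum(1 for d in docs if word in d)            # D_freq(word)
    return math.log2(len(docs) / df) if df else 0.0   # log2(D_all / D_freq)

def weighted_vector(tokens, vocab, docs):
    counts = Counter(tokens)
    return [counts[w] * idf(w, docs) for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [d.lower().split() for d in
        ["stocks fell as the bank cut rates",
         "the team won the game in overtime",
         "the bank reported record profits"]]
vocab = sorted({w for d in docs for w in d})
doc_vecs = [weighted_vector(d, vocab, docs) for d in docs]

q_vec = weighted_vector("bank rates".split(), vocab, docs)
top_k = sorted(range(len(docs)),
               key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True)[:2]
print(top_k)   # indices of the two most similar documents
```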

Multi-tier Text Categorization
• Given text, predict the category at each level
• Issue: What if we need to go beyond words as features?

Time Series Prediction Process
• Find leading indicators
  - "predictor" variables from earlier epochs
  - Code values per distinct time interval
    - E.g. "sales at t-1, t-2, t-3, ..."
    - E.g. "advertisement $ at t, t-1, t-2"
  - Objective is to predict the desired variable at current or future epochs
    - E.g. "sales at t, t+1, t+2"
• Apply the machine learning methods you learned
  - Regression, d-trees, kNN, Bayesian, ...
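A hedged sketch (illustrative numbers, not the lecture's data pipeline) of coding lagged predictor values per epoch and fitting a least-squares model on them:

```python
import numpy as np

sales = np.array([9.5, 8.5, 7.5, 11.0, 11.0, 10.0, 8.5, 13.0, 12.0, 11.0, 9.5])
ads   = np.array([1.0, 0.9, 0.8, 1.4, 1.2, 1.1, 0.9, 1.5, 1.3, 1.2, 1.0])

# Predictors: sales at t-1, t-2, t-3 and ad spend at t; objective: sales at t
X, y = [], []
for t in range(3, len(sales)):
    X.append([sales[t-1], sales[t-2], sales[t-3], ads[t], 1.0])   # 1.0 = intercept
    y.append(sales[t])

coef, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
forecast = coef @ [sales[-1], sales[-2], sales[-3], 1.1, 1.0]      # next epoch
print(round(float(forecast), 2))
```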

Time Series Prediction: caveat 1

  2006 Total Sales   2007 Total Sales   2008 Total Sales
  Q1: 9.5M           Q1: 11M            Q1: 12M
  Q2: 8.5M           Q2: 10M            Q2: 11M
  Q3: 7.5M           Q3: 8.5M           Q3: 9.5M
  Q4: 11M            Q4: 13M            Q4: ??

1. Determine the periodic cycle
2. Find the within-cycle trend
3. Find the cross-cycle trend
4. Combine both components

Time Series Prediction: caveat 2

  2006 Total Sales   2007 Total Sales   2008 Total Airline Sales
  Q1: 9.5M           Q1: 11M            Q1: 12M
  Q2: 8.5M           Q2: 10M            Q2: 11M
  Q3: 7.5M           Q3: 8.5M           Q3: 9.5M
  Q4: 11M            Q4: 13M            Q4: ??

• Watch for exogenous variables! (The World Trade Center attack wreaked havoc with airline-industry predictions)
• Less tragic and less obvious one-of-a-kind events matter too

Leveraging Existing Data Collecting Systems
[Figure: multiple surveillance signals during the 1999 influenza outbreak, plotted by week (1999-2000): influenza cultures, sentinel physicians, WebMD queries about 'cough' etc., school absenteeism, sales of cough and cold meds, sales of cough syrup, ER respiratory complaints, ER 'viral' complaints, influenza-related deaths. Source: Moore, 2002]

Adaptive Filtering over a Document Stream
[Diagram: a time-ordered stream split into past training documents and future test documents; for each topic (Topic 1, Topic 2, Topic 3, ...), the system decides whether the current document is on-topic or off-topic, using relevance feedback (RF) as unlabeled documents arrive]

[Figure: Classifier = Rocchio, Topic = Civil War (R76 in TREC-10), Threshold = MLR threshold function: locally linear, globally non-linear]

Time Series in a Nutshell
• Time-series prediction requires regression, except:
  - Historical data per time period (aka "epoch")
  - Predictor attributes come from both the current and earlier epochs
  - The objective attribute from earlier epochs becomes a predictor attribute for the current epoch
• Process difference with normal machine learning
  - First detect cyclical patterns among epochs
  - Predict within a cycle
  - Predict cross-cycle using corresponding epochs only (then combine with the within-cycle prediction)

Active Learning
• Assume:
  - Very few "labeled" instances
  - Very many "unlabeled" instances
  - An omniscient "oracle" which can assign a label to an unlabeled instance
• Objective:
  - Select instances to label such that learning accuracy is maximized with the fewest oracle labeling requests

Active Learning (overall idea)
[Diagram: a data source supplies unlabeled data to the learning mechanism; the learner sends label requests to an expert, receives labeled data back, learns a new model, and produces output for the user]

Why is Active Learning Important?
• Labeled data volumes ≪ unlabeled data volumes
  - 1.2% of all proteins have known structures
  - 0.01% of all galaxies in the Sloan Sky Survey have consensus type labels
  - 0.0001% of all web pages have topic labels
• If labeling is costly, or limited, we want to select the points that will have maximal impact

Review of Supervised Learning
• Training data: labeled pairs (x_i, y_i)
• Functional space: the set of hypotheses f: X -> Y the learner may choose from
• Fitness criterion: minimize a loss function over the training data
• Variants: online learning, noisy data, ...

Active Learning
• Training data (special case: starting with very few or no labels)
• Functional space
• Fitness criterion, a.k.a. loss function
• Sampling strategy: which unlabeled instances to ask the oracle to label

Sampling Strategies
• Random sampling (preserves the distribution)
• Uncertainty sampling (Tong & Koller, 2000)
  - proximity to the decision boundary
  - maximal distance to labeled x's
• Density sampling (kNN-inspired, McCallum & Nigam, 2004)
• Representative sampling (Xu et al., 2003)
• Instability sampling (probability-weighted)
  - x's that maximally change the decision boundary
• Ensemble strategies
  - Boosting-like ensemble (Baram, 2003)
  - DUAL (Donmez & Carbonell, 2007)
    - Dynamically switches strategies from density-based to uncertainty-based by estimating the derivative of the expected residual error reduction
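A minimal sketch of two of the strategies listed above, assuming a classifier that exposes per-class probabilities (my illustration, not the cited implementations):

```python
import numpy as np

def uncertainty_query(proba):
    """Pick the unlabeled point the model is least sure about
    (1 - max class probability peaks near the decision boundary)."""
    return int(np.argmax(1.0 - proba.max(axis=1)))

def diversity_query(X_unlabeled, X_labeled):
    """Pick the unlabeled point maximally distant from all labeled x's."""
    d = np.linalg.norm(X_unlabeled[:, None, :] - X_labeled[None, :, :], axis=-1)
    return int(np.argmax(d.min(axis=1)))   # farthest from its nearest labeled point

# Hypothetical usage with a pool X_pool and a fitted model:
#   i = uncertainty_query(model.predict_proba(X_pool))
#   j = diversity_query(X_pool, X_train)
```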

Which point to sample?
[Scatter plot: green = unlabeled, red = class A, brown = class B]

Density-Based Sampling
• Centroid of the largest unsampled cluster

Uncertainty Sampling
• Closest to the decision boundary

Maximal Diversity Sampling
• Maximally distant from labeled x's

Ensemble-Based Possibilities
• Uncertainty + Diversity criteria
• Density + Uncertainty criteria

Active Learning Issues
• Interaction of active sampling with the underlying classifier(s)
• On-line sampling vs. batch sampling
• Active sampling for rank learning and for structured learning (e.g. HMMs, sCRFs)
• What if the oracle is fallible, reluctant, or differentially expensive? -> proactive learning
• How does noisy data affect active learning?
• What if we do not have even the first labeled point(s) for one or more classes? -> new-class discovery
• How to "optimally" combine active-learning strategies

Strategy Selection: No Universal Optimum
• The optimal operating range for AL sampling strategies differs
• How to get the best of both worlds?
• (Hint: ensemble methods, e.g. DUAL)

Motivation for DUAL
• Strength of DWUS (density-weighted uncertainty sampling):
  - favors higher-density samples close to the decision boundary
  - fast decrease in error
• But DWUS exhibits diminishing returns. Why?
  - Early iterations: many points are highly uncertain
  - Later iterations: points with high uncertainty are no longer in dense regions
  - DWUS wastes time picking instances with no direct effect on the error

How does DUAL do better?
• Runs DWUS until it estimates a cross-over point
• Monitors the change in expected error at each iteration to detect when it is stuck in a local minimum
• DUAL uses a mixture model after the cross-over (saturation) point
• The goal is to minimize the expected future error
  - If we knew the future error of uncertainty sampling (US) to be zero, we would put all the weight on US
  - But in practice, we do not know it

More on DUAL
• After the cross-over, US does better => the uncertainty score should be given more weight
• The mixing weight should reflect how well US performs
  - it can be calculated from the expected error of US on the unlabeled data*
• This gives the selection criterion for DUAL: a weighted combination of the uncertainty and density-weighted scores
• *US is allowed to choose data only from among the already sampled instances, and its error is estimated on the remaining unlabeled set
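A heavily simplified sketch of the idea (not the paper's exact criterion): combine a density-weighted uncertainty score with a pure uncertainty score, shifting weight toward the latter after the estimated cross-over:

```python
import numpy as np

def dwus_scores(proba, density):
    """Density-weighted uncertainty: uncertain AND in a dense region."""
    return (1.0 - proba.max(axis=1)) * density

def dual_query(proba, density, pi):
    """pi in [0, 1]: estimated relative benefit of pure uncertainty sampling.
    pi = 0 behaves like DWUS; pi = 1 behaves like pure US."""
    uncertainty = 1.0 - proba.max(axis=1)
    score = pi * uncertainty + (1.0 - pi) * dwus_scores(proba, density)
    return int(np.argmax(score))
```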

Results: DUAL vs DWUS

Paired Density-Based Sampling (Donmez & Carbonell, 2008)
• Desiderata
  - Balanced sampling from both (all) classes
  - Combine density-based with coverage-based sampling
• Method
  - Non-Euclidean distance function
  - Select maximally separated pairs of points based on maximizing a utility function

Paired Density Method (cont.)
• Utility function
• Select the two points that optimize the utility and are maximally distant from each other

Results of Paired-Density Sampling

Active Learning model in NLP
[Diagram: an active learner selects samples from an un-annotated corpus / unlabeled set for annotation or translation; the added samples extend the active training set used to build NLP components (parsing model, machine translation system, named entity recognition module, word sense disambiguation model), which are then evaluated on test data]

Word-Sense Disambiguation
• Needed in NLP for parsing, translation, search, ...
• Example:
  - "Line": ax+by+c, rope, queue, track, ...
  - "Banco": bench, financial institution, sand bank, ...
• Challenge: how to disambiguate from context
• Approach: build an ML classifier (sense = class)
• Problem: insufficient training data
• Amelioration: active learning

Word Sense Disambiguation: Active Learning Methods
• Entropy Sampling
  - Vector q represents the trained model's predictions
  - q_c = prediction probability of class c
  - Pick the example whose prediction vector displays the greatest entropy
• Margin Sampling
  - If c and c' are the two most likely categories, the margin is the gap between their probabilities
  - Pick the example with the smallest margin
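The two selection scores, written out as a small sketch over a matrix q of per-example class probabilities (illustration only):

```python
import numpy as np

def entropy_query(q):
    """Pick the example whose prediction vector has the greatest entropy."""
    h = -np.sum(q * np.log(q + 1e-12), axis=1)
    return int(np.argmax(h))

def margin_query(q):
    """Pick the example with the smallest gap between the top two classes."""
    top2 = np.sort(q, axis=1)[:, -2:]
    return int(np.argmin(top2[:, 1] - top2[:, 0]))

q = np.array([[0.90, 0.05, 0.05],
              [0.40, 0.35, 0.25],
              [0.60, 0.30, 0.10]])
print(entropy_query(q), margin_query(q))   # both favour the second example (index 1)
```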

Word Sense Disambiguation: Experiment
• On 5 English verbs that had coarse-grained senses
• Double-blind tagging applied to 50 instances of the target word
• If the inter-tagger agreement (ITA) < 90%, the sense entry is revised by adding examples and explanations

Word Sense Disambiguation Results

Active vs. Proactive Learning

ACTIVE LEARNING:
• All x's cost the same to label
  - Max number of labels
• Omniscient oracle
  - Never errs
• Indefatigable oracle
  - Always answers
• Single oracle
  - Oracle selection unnecessary

PROACTIVE LEARNING:
• Labeling cost is f1(D(x), O)
  - Max labeling budget
• Fallible oracles
  - Errs with p(E(x)) ~ f2(D(x), O)
• Reluctant oracles
  - Answers with p(A(x)) ...
• Multiple oracles
  - Joint optimization of oracle and instance selection

Scenario 1: Reluctance
• 2 oracles:
  - reliable oracle: expensive but always answers with a correct label
  - reluctant oracle: cheap but may not respond to some queries
• Define a utility score as the expected value of information at unit cost

How to estimate the oracle's probability of answering?
• Cluster the unlabeled data using k-means
• Ask the reluctant oracle for the label of each cluster centroid:
  - label received: increase the estimated answer probability of nearby points
  - no label: decrease the estimated answer probability of nearby points
  - (the update indicator equals 1 when a label is received, -1 otherwise)
• The number of clusters depends on the clustering budget and the oracle fee
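A rough sketch of this exploration phase as I read it from the slide (the update rule, kernel width, cluster count, and prior below are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_answer_prob(X, ask_reluctant_oracle, n_clusters=10, sigma=1.0):
    """ask_reluctant_oracle(x) returns a label, or None when it declines."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    p_answer = np.full(len(X), 0.5)                               # uninformed prior
    for c in km.cluster_centers_:
        sign = 1.0 if ask_reluctant_oracle(c) is not None else -1.0  # +1 if a label came back
        w = np.exp(-np.linalg.norm(X - c, axis=1) ** 2 / (2 * sigma ** 2))
        p_answer = np.clip(p_answer + 0.5 * sign * w, 0.0, 1.0)      # nudge nearby points
    return p_answer
```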

Algorithm for Scenario 1

Scenario 2: Fallibility
• Two oracles:
  - One perfect but expensive oracle
  - One fallible but cheap oracle that always answers
• Algorithm similar to Scenario 1, with slight modifications
• During exploration:
  - The fallible oracle provides the label together with its confidence
  - If the confidence is too low, then we don't use the label, but we still update the estimate

Scenario 3: Non-uniform Cost
• Uniform cost: fraud detection, face recognition, etc.
• Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc.
• 2 oracles:
  - Fixed-cost oracle
  - Variable-cost oracle

Outline of Scenario 3

Underlying Sampling Strategy
• Conditional-entropy-based sampling, weighted by a density measure
• Captures the information content of a close neighborhood (the close neighbors of x)

Results: Reluctance

Cost varies non-uniformly
[Figure: results are statistically significant (p < 0.01)]

Proactive Learning in General
• Multiple experts (a.k.a. oracles)
  - Different areas of expertise
  - Different costs
  - Different reliabilities
  - Different availability
• What question to ask and whom to query?
  - Joint optimization of query & oracle selection
  - Referrals among oracles (with referral fees)
  - Learn about oracle capabilities as well as solving the active learning problem at hand

Unsupervised Learning in DM
• What does it mean to learn without an objective?
  - Explore the data for natural groupings
  - Learn association rules, and later examine whether they can be of any business use
• Illustrative examples
  - Market basket analysis: later optimize shelf allocation & placements
  - Cascaded or correlated mechanical faults
  - Demographic grouping beyond known classes
  - Plan product bundling offers

Example Similarity Functions
• Determine a similarity metric
  - Cosine
  - Euclidean
  - KL-divergence
• Determine a clustering algorithm
  - Incremental, agglomerative, k-means, ...
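The three measures named above, as a short sketch (illustration only):

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_dist(u, v):
    return float(np.linalg.norm(u - v))

def kl_divergence(p, q):
    """KL(p || q) for probability vectors; note it is not symmetric."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

u, v = np.array([1.0, 2.0, 0.0]), np.array([1.0, 1.0, 1.0])
print(cosine_sim(u, v), euclidean_dist(u, v))
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```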

Hierarchical Agglomerative Clustering Methods
• Generic agglomerative procedure (Salton '89), resulting in nested clusters via iterations:
  1. Compute all pairwise document-document similarity coefficients
  2. Place each of the n documents into a class of its own
  3. Merge the two most similar clusters into one:
     - replace the two clusters by the new cluster
     - recompute inter-cluster similarity scores w.r.t. the new cluster
     - if the cluster radius > max-size, block further merging
  4. Repeat step 3 until there are only k clusters left (note k could be 1)
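A compact sketch of the merge loop above (single-link merging on cosine similarity, with similarities computed on demand; the max-size check in step 3 is omitted for brevity):

```python
import numpy as np

def agglomerative(vectors, k):
    clusters = [[i] for i in range(len(vectors))]        # step 2: singletons
    unit = lambda v: v / (np.linalg.norm(v) + 1e-12)
    sim = lambda a, b: max(unit(vectors[i]) @ unit(vectors[j]) for i in a for j in b)
    while len(clusters) > k:                             # step 4
        i, j = max(((i, j) for i in range(len(clusters))
                            for j in range(i + 1, len(clusters))),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)                   # step 3: merge the closest pair
    return clusters

X = np.array([[1.0, 0.1], [0.9, 0.2], [5.0, 5.0], [5.2, 4.9], [0.1, 9.0]])
print(agglomerative(X, k=2))
```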

Group Agglomerative Clustering
[Diagram: nine numbered points being merged into nested clusters step by step]

K-Means Clustering
1. Select k seeds s.t. d(k_i, k_j) > d_min
2. Assign points to clusters by minimum distance: Cluster(p_i) = argmin over s_j in {s_1, ..., s_k} of d(p_i, s_j)
3. Compute new cluster centroids (the mean of the points in each cluster)
4. Reassign points to clusters (as in step 2)
5. Iterate until no points change clusters
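The five steps as a minimal sketch (random seeding here instead of the d_min-separated seeding in step 1):

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 1 (simplified)
    while True:
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)                               # steps 2 and 4
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])   # step 3
        if np.allclose(new, centroids):                         # step 5: no change
            return assign, new
        centroids = new

X = np.array([[1, 1], [1.2, 0.8], [5, 5], [5.1, 4.8], [9, 1], [8.8, 1.2]])
labels, centers = kmeans(X, k=3)
print(labels)
```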

K-Means Clustering: Initial Data Points
[Plot] Step 1: Select k random initial seeds (here k = 3) s.t. d(k_i, k_j) > d_min

K-Means Clustering: First-Pass Clusters
[Plot] Step 2: Assign points to clusters by minimum distance to the initial seeds: Cluster(p_i) = argmin over s_j in {s_1, ..., s_k} of d(p_i, s_j)

K-Means Clustering: Seeds -> Centroids
[Plot] Step 3: Compute new cluster centroids

K-Means Clustering: Second-Pass Clusters
[Plot] Step 4: Recompute Cluster(p_i) = argmin over c_j in {c_1, ..., c_k} of d(p_i, c_j). Note: some data points are reassigned

Cluster Optimization (finding "k")

Clustering for Novelty Detection

Functionality:
• Build background model
  - Expected events (clusters)
• Find divergences
  - Individual outliers (but many false positives)
  - New mini-clusters (unmasked new-event detection)
  - Detect when a novel event is masked by ordinary ones
• Trigger alerts
  - Route & prioritize
  - Formulate hypotheses for the analyst

Technology:
• Modeling methods
  - (Hierarchical) k-means
• Divergence metrics
  - Radial density gradients from the cluster centroid
  - Temporally-adaptive distance measures
  - Secondary peaks in the density function
• Create analyst profiles
  - RETE-based SAMs methods (last PI-meeting ARGUS paper)

Cluster Evolution
[Diagram panels: constant event; new obfuscated event; new un-obfuscated event; growing event]

Cluster Density Changes
[Diagram panels: constant event; new obfuscated event; new unobfuscated event; growing event]

Novelty Detection and Profile Management
[Diagram: data streams feed a novelty-detection component and a profile matcher; novel events and alerts are routed to the analyst, who creates new profiles]

Results on Medical Data
• New mini-cluster analysis reveals outbreaks of:
  - Tularemia
  - Dengue Fever
  - Myiasis
  - Chagas Disease
  - SARS
• Outbreak simulation
  - Added new records for patients from a small geographical region diagnosed with influenza in 9/2001
  - Graph shows the resulting secondary peak in the pulmonary disease density function

What's Rare Category Detection?
• Start de novo
• Very skewed classes
  - Majority classes
  - Minority classes
• Labeling oracle
• Goal
  - Discover the minority classes with a few label requests

Comparison with Outlier Detection
• Rare classes
  - A group of points
  - Clustered
  - Non-separable from the majority classes
• Outliers
  - A single point
  - Scattered
  - Separable

Applications
• Fraud detection
• Network intrusion detection
• Astronomy
• Spam image detection

The Big Picture
[Diagram: raw data (relational, temporal) -> feature extraction -> feature representation; an unbalanced, unlabeled data set feeds rare category detection, then learning in unbalanced settings produces a classifier]

Questions We Want to Address
• How to detect rare categories in an unbalanced, unlabeled data set with the help of an oracle?
• How to detect rare categories with different data types, such as graph data, stream data, etc.?
• How to do rare category detection with the least information about the data set?
• How to select relevant features for the rare categories?
• How to design effective classification algorithms which fully exploit the properties of the minority classes (rare category classification)?

Notation
• Unlabeled examples
• m classes: m-1 rare classes and one majority class
• Goal: find at least ONE example from each rare class by requesting a few labels

Assumptions
• The distribution of the majority class is sufficiently smooth
• Examples from the minority classes form compact clusters in the feature space

Two Classes: NNDB
1. Calculate the class-specific radius
2. Calculate nearest-neighbor-based scores for the unlabeled examples
3. Pick the candidate with the highest score (increase t by 1 on each pass)
4. Query its label
5. Rare class? If no, repeat from step 3
6. If yes, output the discovered rare-class example

NNDB: Calculate Nearest Neighbors

NNDB: Calculate the Scores
[Plot: the point with the highest score is queried]

NNDB: Pick the Next Candidate
[Plot: increase t by 1 and query the next highest-scoring point]

Why NNDB Works
• Theoretically
  - Theorem 1 [He & Carbonell 2007]: under certain conditions, with high probability, after a few iteration steps, NNDB queries at least one example whose probability of coming from the minority class is at least 1/3
• Intuitively
  - The score measures the change in local density
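A rough sketch of that intuition only (score = local change in neighbour counts); this is my simplification, not the exact NNDB algorithm or its guarantees:

```python
import numpy as np

def nndb_like_scores(X, prior):
    """prior: rough fraction of points expected in the rare class."""
    n = len(X)
    K = max(1, int(prior * n))                       # expected rare-class size
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    r = np.sort(D, axis=1)[:, K].min()               # class-specific radius
    counts = (D <= r).sum(axis=1)                    # neighbours within radius r
    # score each point by how sharply its neighbour count exceeds its neighbours'
    return np.array([counts[i] - counts[D[i] <= r].min() for i in range(n)])

# Usage: query the label of X_pool[np.argmax(nndb_like_scores(X_pool, prior=0.02))],
# and if it is not from the rare class, move on to the next-highest score.
```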

Multiple Classes: ALICE
• m-1 rare classes, one majority class
1. For each rare class c:
2. If we have already found examples from class c, skip it
3. Otherwise, run NNDB with the prior of class c

Why ALICE Works
• Theoretically
  - Theorem 2 [He & Carbonell 2008]: under certain conditions, with high probability, in each outer loop of ALICE, after a few iteration steps in NNDB, ALICE queries at least one example whose probability of coming from one minority class is at least 1/3

Implementation Issues
• ALICE
  - Problem: repeatedly sampling from the same rare class
• MALICE
  - Solution: relevance feedback (using the class-specific radius)

Results on Synthetic Data Sets

Summary of Real Data Sets
• Abalone
  - 4177 examples
  - 7-dimensional features
  - 20 classes
  - Largest class: 16.50%
  - Smallest class: 0.34%
• Shuttle
  - 4515 examples
  - 9-dimensional features
  - 7 classes
  - Largest class: 75.53%
  - Smallest class: 0.13%

Results on Real Data Sets
[Figure: results on Abalone and Shuttle, comparing MALICE, Interleave, and random sampling]

Imprecise Priors
[Figure: results with imprecise priors on Abalone and Shuttle]

Specially Designed Exponential Families [Efron & Tibshirani 1996]
• Favorable compromise between parametric and nonparametric density estimation
• Estimated density: g(x) = g0(x) exp(beta_0 + t(x)^T beta), where
  - g0(x): carrier density
  - beta: parameter vector
  - beta_0: normalizing parameter
  - t(x): vector of sufficient statistics

SEDER Algorithm
• Carrier density: kernel density estimator
• To decouple the estimation of the different parameters:
  - Decompose
  - Relax the constraint

Parameter Estimation
• Theorem 3 [to appear]: the maximum likelihood estimates of the two parameter sets satisfy a pair of conditions [equations on the original slide]

Parameter Estimation (cont.)
[Equation slide: the estimates are expressed in terms of a positive parameter; the stated approximation holds in most cases]

Scoring Function
• The estimated density
• Scoring function: the norm of the gradient of the estimated density

Results on Synthetic Data Sets

Summary of Real Data Sets

  Data Set     n     d   m   Largest Class   Smallest Class
  Ecoli        336   7   6   42.56%          2.68%    (moderately skewed)
  Glass        214   9   6   35.51%          4.21%    (moderately skewed)
  Page Blocks  5473  10  5   89.77%          0.51%    (extremely skewed)
  Abalone      4177  7   20  16.50%          0.34%    (extremely skewed)
  Shuttle      4515  9   7   75.53%          0.13%    (extremely skewed)

Moderately Skewed Data Sets
[Figure: results on Ecoli and Glass; MALICE shown for comparison]

Extremely Skewed Data Sets
[Figure: results on Page Blocks, Abalone, and Shuttle; MALICE shown for comparison]

Additional Notation
• Pair-wise similarity matrix
• Diagonal matrix
• Normalized matrix
• Global similarity matrix, defined using an identity matrix and a positive parameter close to 1

Global Similarity Matrix
• Better than the pair-wise similarity matrix for rare category detection

GRADE: Full Prior Information
1. For each rare class c:
2. Calculate the class-specific similarity
3. Select the next candidate (increase t by 1 on each pass)
4. Relevance feedback
5. Query the candidate's label
6. Is it class c? If no, repeat from step 3
7. If yes, output the discovered example

GRADE-LI: Less Prior Information
1. Calculate the problem-specific similarity
2. Select the next candidate (increase t by 1 on each pass)
3. Relevance feedback
4. Query the candidate's label
5. A new class? If yes:
6. Output the discovered example
7. Budget exhausted? If not, repeat from step 2

Results on Real Data Sets
[Figure: results on Ecoli, Glass, Abalone, and Shuttle; MALICE shown for comparison]

Applying Machine Learning for Data Mining in Business
• Step 1: Have a clear objective to optimize
• Step 2: Have sufficient data
• Step 3: Clean, normalize, clean the data some more
• Step 4: Make sure there isn't an easy solution (e.g. a small number of rules from an expert)
• Step 5: Do the data mining for real
• Step 6: Cross-validate, improve, go to step 5

Managing the Data Mining Process
• Ingredients for successful DM
  - Data (warehouse, stream, DBs, ...)
  - Right problems (objectives, ...)
  - Tools (machine learning tool suites, ...)
  - People (analogy to a surgical team: next slide)
• Estimate (size) the problem, approach, progress
  - ROI (max, min, realistic)
  - Determine if DM is likely the best approach
  - Deploy the team
  - Evaluate intermediate results

The Data Mining Team
• The Administrator (manager & domain)
  - Pick problem, resources, ROI calc, monitor, ...
• The Surgeon (ML specialist with domain knowledge)
  - Select ML method, predictor attributes, objective, ...
• The Anesthesiologist (preparer)
  - Chief data specialist: sampling, coverage, ...
• The Nurses (assistants)
  - DB manager, programmers, gophers, ...
• The Medical Students
  - Prepare new surgeons: learn by doing

Need Some Domain Expertise
• Data preparation
  - What are good candidate predictor attributes?
  - How to combine multiple objectives?
  - How to sample? (e.g. identify cyclic periods)
• Progress monitoring and results interpretation
  - How accurate must the prediction be?
  - Do we need more or different data?
  - Are we pursuing reasonable objective(s)?
• Application of DM after it is accomplished
• Updating of DM when/as the environment evolves

Typical Data Mining Pitfalls
• Insufficient data to establish predictive patterns
• Incorrect selection of predictor attributes
  - Statistics to the rescue (e.g. the χ² test)
• Unrealistic objectives (e.g. fraud recovery)
• Inappropriate ML method selection
• Data preparation problems
  - Failure to normalize across data sets
  - Systematic bias in the original data collection
• Belief in DM as a panacea or black magic
• Giving up too soon (very common)

Final Words on Data Mining
• Data mining is:
  - 1/3 science (math, algorithms, ...)
  - 1/3 engineering (data prep, analysis, ...)
  - 1/3 "art" (experience really counts)
• 10 years ago it was mostly art
• 10 years from now it will be mostly engineering
• What to expect from the research labs?
  - Better supervised algorithms
  - Focus on unsupervised learning + optimization
  - Moves to incorporate semi-structured (text) data

THANK YOU!