Bell Laboratories Data Complexity Analysis: Linkage between Context and Solution in Classification
Tin Kam Ho
With contributions from Mitra Basu, Ester Bernado-Mansilla, Richard Baumgartner, Martin Law, Erinija Pranckeviciene, Albert Orriols-Puig, Nuria Macia

Pattern Recognition: Research vs. Practice
Steps to solve a practical pattern recognition problem: Data Collection → Sensory Data → Feature Extraction → Feature Vectors → Classifier Training → Classifier → Classification → Decision.
The practical focus (study of the problem context) sits at the front of this pipeline; the research focus (study of the mathematical solution) sits at the end. Between the two lies a danger of disconnection.

Reconnecting Context and Solution
Data complexity analysis studies the properties of feature vectors, reconnecting two views:
• Study of the problem context: to understand how changes in the problem set-up and data collection procedures may affect such properties
• Study of the mathematical solution: to understand how such properties may impact the classification solution, its improvements, limitations, and expectations

Focus is on Boundary Complexity
• Kolmogorov complexity
• Boundary length can be exponential in dimensionality
• A trivial description is to list all points & class labels
• Is there a shorter description?

Early Discoveries
• Problems distribute in a continuum in complexity space
• Several key measures provide independent characterization
• There exist identifiable domains of classifiers' dominant competency
• Feature selection and transformation induce variability in complexity estimates

Parameterization of Data Complexity

Complexity Classes vs. Complexity Scales
• The study is driven by observed limits in classifier accuracy, even with new, sophisticated methods (e.g., ensembles, SVM, …)
• Analysis is needed for each instance of a classification problem, not just the worst case of a family of problems
• Linear separability: the earliest attempt to address classification complexity
• Real-world problems show different degrees of linear non-separability, so a continuous scale is needed (see the LP-based sketch below)
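
To make the continuous scale concrete, here is a minimal sketch of one standard linear-programming formulation: minimize the total slack of a would-be separating hyperplane. The talk does not specify its exact LP, so this formulation and the function name are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def linear_separability_slack(X, y):
    """Total slack of the best separating hyperplane, found by LP.

    Solves  min sum(t)  s.t.  y_i * (w . x_i + b) >= 1 - t_i,  t_i >= 0,
    with labels y in {-1, +1}.  An optimum of 0 means the classes are
    linearly separable; a positive optimum gives a continuous degree of
    non-separability.
    """
    n, d = X.shape
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])     # minimize sum of slacks
    # y_i*(w.x_i + b) + t_i >= 1   <=>   -y_i*x_i.w - y_i*b - t_i <= -1
    A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n   # w, b free; t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.fun
```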

Some Useful Measures of Geometric Complexity
• Degree of linear separability: find a separating hyperplane by linear programming; error counts and distances to the plane measure separability
• Length of class boundary: compute the minimum spanning tree and count class-crossing edges
• Fisher's discriminant ratio: classical measure of class separability; maximize over all features to find the most discriminating
• Shapes of class manifolds: cover same-class points with maximal balls; the ball counts describe the shape of the class manifold
Two of these measures are sketched in code below.
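
The following sketch implements two of the named measures with NumPy/SciPy: the maximum single-feature Fisher discriminant ratio and the MST-based boundary-length measure. The talk's exact definitions may differ in detail; the function names are ours.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def fisher_ratio(X, y):
    """Maximum Fisher's discriminant ratio over single features:
    max_j (mu1_j - mu2_j)^2 / (s1_j^2 + s2_j^2), for labels in {-1, +1}."""
    X1, X2 = X[y == 1], X[y == -1]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    return float(np.max(num / np.maximum(den, 1e-12)))

def boundary_fraction(X, y):
    """Length of class boundary: fraction of minimum-spanning-tree edges
    whose endpoints carry different class labels."""
    mst = minimum_spanning_tree(squareform(pdist(X))).tocoo()
    return float(np.mean(y[mst.row] != y[mst.col]))
```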

Continuous Distributions in Complexity Space
• Real-world data sets: benchmarking data from the UC-Irvine archive; 844 two-class problems, of which 452 are linearly separable and 392 are not
• Synthetic data sets: random labeling of randomly located points; 100 problems in 1-100 dimensions
[Figure: complexity metric 1 vs. metric 2, with randomly labeled data, linearly separable real-world data, and linearly non-separable real-world data forming a continuum.]

Measures of Geometrical Complexity

The First 6 Principal Components

Interpretation of the First 4 PCs
• PC 1 (50% of variance): linearity of boundary and proximity of opposite-class neighbor
• PC 2 (12% of variance): balance between within-class scatter and between-class distance
• PC 3 (11% of variance): concentration & orientation of intrusion into opposite class
• PC 4 (9% of variance): within-class scatter
A PCA sketch of this analysis follows.
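
A hedged sketch of the analysis pipeline: standardize the per-problem complexity measures and project them onto principal components. The `measures` matrix below is random stand-in data, not the study's actual measurements.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
measures = rng.random((844, 9))      # stand-in: one row of 9 measures per problem
z = (measures - measures.mean(axis=0)) / measures.std(axis=0)  # standardize
pca = PCA(n_components=6).fit(z)
coords = pca.transform(z)            # per-problem coordinates in PC space
print(pca.explained_variance_ratio_) # the talk reports ~50/12/11/9% for PCs 1-4
```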

Problem Distribution in 1st & 2nd Principal Components
• Continuous distribution
• Known easy & difficult problems occupy opposite ends (linearly separable problems at one end, random labels at the other)
• Few outliers
• Empty regions

Apparent vs. True Complexity: Uncertainty in Measures due to Sampling Density
A problem may appear deceptively simple or complex with small samples (illustrated on the slide with 2, 10, 100, 500, and 1000 points); the sketch below replays this effect.
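
The sampling-density effect is easy to reproduce: fix a boundary and re-estimate a complexity measure at the sample sizes shown on the slide. This sketch reuses `boundary_fraction` from the earlier sketch; the boundary shape is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (2, 10, 100, 500, 1000):
    X = rng.uniform(size=(n, 2))
    # a fixed wavy boundary; only dense samples reveal its true complexity
    y = np.where(X[:, 1] > 0.5 + 0.2 * np.sin(6 * np.pi * X[:, 0]), 1, -1)
    print(n, boundary_fraction(X, y))   # estimates stabilize only at larger n
```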

Observations
• Problems distribute in a continuum in complexity space
• Several key measures/dimensions provide independent characterization
• Further analysis is needed on uncertainty in complexity estimates due to small-sample-size effects

Relating Classifier Behavior to Data Complexity

Class Boundaries Inferred by Different Classifiers
[Figure: class boundaries learned on the same data by XCS (a genetic algorithm), a nearest neighbor classifier, and a linear classifier.]

Accuracy Depends on the Goodness of Match between Classifiers and Problems
• Problem A: XCS error = 1.9%, NN error = 0.06% (NN clearly better)
• Problem B: XCS error = 0.6%, NN error = 0.7% (XCS better)

Domains of Competence of Classifiers
Given a classification problem, we want to determine which classifier is the best for it. Can data complexity give us a hint?
[Figure: a complexity space (metric 1 vs. metric 2) partitioned into regions where XCS, LC, decision forest, or NN dominates; a new problem ("Here is my problem!") is to be located in this space.]

Domain of Competence Experiment
• Use a set of 9 complexity measures: Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim
• Characterize 392 two-class problems from UCI data, all shown to be linearly non-separable
• Evaluate 6 classifiers: NN (1-nearest neighbor), LP (linear classifier by linear programming), Odt (oblique decision tree), Pdfc (random subspace decision forest) and Bdfc (bagging-based decision forest), both ensemble methods, and XCS (a genetic-algorithm-based classifier)
A cross-validation sketch of this protocol is given below.
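
A minimal sketch of the protocol: score several classifiers per problem by cross-validation and record the winner. scikit-learn has no XCS or oblique decision tree, so the roster below uses stand-ins; names and settings are illustrative.

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

CLASSIFIERS = {
    "nn":   KNeighborsClassifier(n_neighbors=1),
    "lp":   LogisticRegression(max_iter=1000),         # linear stand-in for the LP classifier
    "pdfc": RandomForestClassifier(n_estimators=100),  # random-subspace-style forest
    "bdfc": BaggingClassifier(n_estimators=100),       # bagging-based forest
}

def best_classifier(X, y, cv=10):
    """Score each classifier by cross-validation and return the winner,
    to be recorded alongside the problem's complexity measures."""
    scores = {name: cross_val_score(clf, X, y, cv=cv).mean()
              for name, clf in CLASSIFIERS.items()}
    return max(scores, key=scores.get), scores
```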

Identifiable Domains of Competence by NN and LP
[Figure: best classifier for benchmarking data, shown in complexity space.]

Less Identifiable Domains of Competence
Regions in complexity space where the best classifier is a single method (nn, lp, or odt) vs. an ensemble technique, shown in the Boundary vs. NonLinNN, IntraInter vs. Pretop, and MaxEff vs. VolumeOverlap planes.

Uncertainty of Estimates at Two Levels
• Sparse training data in each problem & complex geometry cause ill-posedness of class boundaries (uncertainty in feature space)
• A sparse sample of problems causes difficulty in identifying regions of dominant competence (uncertainty in complexity space)

Complexity and Data Dimensionality: Class Separability after Dimensionality Reduction
Feature selection/transformation may change the difficulty of a classification problem by:
• Widening the gap between classes
• Compressing the discriminatory information
• Removing irrelevant dimensions
It is often unclear to what extent these happen; we seek a quantitative description of such changes.

Spread of classification accuracy and geometrical complexity due to forward feature selection
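
A sketch of how such a spread can be produced: run forward feature selection, then re-measure complexity on the selected subset, reusing `fisher_ratio` and `boundary_fraction` from the earlier sketch. The wrapped estimator and subset size are arbitrary choices.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

def complexity_after_selection(X, y, k):
    """Forward feature selection wrapped around a 1-NN classifier, then
    complexity measures recomputed on the selected feature subset."""
    sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=1),
                                    n_features_to_select=k,
                                    direction="forward")
    X_k = sfs.fit_transform(X, y)
    return fisher_ratio(X_k, y), boundary_fraction(X_k, y)
```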

Designing a Strategy for Classifier Evaluation

A Complete Platform for Evaluating Learning Algorithms
To facilitate progress on learning algorithms, we need a way to systematically create learning problems that:
• Provide a complete coverage of the complexity space
• Are representative of all the known problems, i.e., every classification problem arising in the real world should have a close neighbor representing it in the complexity space
Is this possible?

Ways to Synthesize Classification Problems
• Synthesizing data with targeted levels of complexity (see the sketch below)
• e.g., compute an MST over a uniform point distribution, then assign class-crossing edges randomly [Macia et al. 2008]
• or create partitions with increasing resolution
• can create a continuous cover of the complexity space
• but are the data similar to those arising from reality?
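
One possible reading of the MST-based generator, sketched in code: since a tree has no cycles, any chosen subset of its edges can be made exactly the set of class-crossing edges by propagating labels from a root and flipping the class across the chosen edges. The details of Macia et al.'s procedure may differ; this is an assumption-laden sketch.

```python
import numpy as np
from scipy.sparse.csgraph import breadth_first_order, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def synthesize_problem(n, dim, crossing_frac, seed=0):
    """Uniform points labeled so that ~crossing_frac of the MST edges
    cross classes, giving direct control of the boundary-length measure."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, dim))
    mst = minimum_spanning_tree(squareform(pdist(X)))
    order, parent = breadth_first_order(mst + mst.T, i_start=0, directed=False)
    flip = (rng.random(n) < crossing_frac).astype(int)  # flip across this node's parent edge?
    y = np.zeros(n, dtype=int)
    for node in order[1:]:                    # root keeps label 0
        y[node] = y[parent[node]] ^ flip[node]
    return X, 2 * y - 1                       # labels in {-1, +1}
```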

Ways to Synthesize Classification Problems (continued)
• Synthesizing data to simulate natural processes, e.g., the Neyman-Scott process (simulated below)
• how many such processes have explicit models?
• how many are needed to cover all real-world problems?
• Systematically degrade real-world datasets: increase noise, reduce image resolution, …
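
For concreteness, a minimal simulation of a Gaussian Neyman-Scott cluster process: Poisson parents, Poisson offspring counts, Gaussian scatter around each parent. All parameter values are illustrative.

```python
import numpy as np

def neyman_scott(mean_parents, mean_offspring, sigma, seed=0):
    """One realization of a Gaussian Neyman-Scott cluster process in the
    unit square: Poisson parents, each spawning a Poisson number of
    offspring scattered isotropically around it."""
    rng = np.random.default_rng(seed)
    parents = rng.uniform(size=(rng.poisson(mean_parents), 2))
    clumps = [p + sigma * rng.standard_normal((rng.poisson(mean_offspring), 2))
              for p in parents]
    return np.vstack(clumps) if clumps else np.empty((0, 2))

# Two independent realizations give a naturally clumpy two-class problem.
Xa, Xb = neyman_scott(5, 40, 0.03, seed=1), neyman_scott(5, 40, 0.03, seed=2)
X = np.vstack([Xa, Xb])
y = np.concatenate([np.ones(len(Xa), dtype=int), -np.ones(len(Xb), dtype=int)])
```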

Simplification of Class Geometry

Manifold Learning and Dimensionality Reduction
• Manifold learning techniques highlight intrinsic dimensions
• But the class boundary may not follow the intrinsic dimensions

Manifold Learning and Dimensionality Reduction (continued)
• Supervised manifold learning: seek mappings that exaggerate class separation [de Ridder et al., 2003]
• Ideally, the mapping should be sought to directly minimize some measure of data complexity (a sketch follows)
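
One way to act on that last bullet, sketched under stated assumptions: directly search for a linear projection that minimizes a data complexity measure. This is not de Ridder et al.'s method; it reuses `boundary_fraction` from the earlier sketch and applies derivative-free Nelder-Mead because the measure is not differentiable.

```python
import numpy as np
from scipy.optimize import minimize

def complexity_minimizing_projection(X, y, out_dim=2, seed=0):
    """Search for a linear map W that minimizes the MST boundary-length
    measure of the projected data."""
    d = X.shape[1]
    def objective(w):
        return boundary_fraction(X @ w.reshape(d, out_dim), y)
    w0 = np.random.default_rng(seed).standard_normal(d * out_dim)
    res = minimize(objective, w0, method="Nelder-Mead",
                   options={"maxiter": 2000})
    return res.x.reshape(d, out_dim)
```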

Seeking Optimizations Upstream
Back to the application context:
• Use data complexity measures for guidance
• Change the setup and definition of the classification problem
• Collect more samples, at finer resolution; extract more features …
• Consider alternative representations, e.g., dissimilarity-based [Pekalska & Duin 2005]
Data complexity gives an operational definition of learnability. Optimization in the upstream means formalizing the intuition of seeking invariance, and systematically optimizing the problem setup and data acquisition scenario to reduce data complexity.

Recent Examples from the Internet

CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart
Also known as:
• Reverse Turing Test
• Human Interactive Proofs [von Ahn et al., CMU 2000]
CAPTCHAs exploit limitations in the accuracy of machine pattern recognition.

The Netflix Challenge
• $1 million prize for the first team to improve 10% over the company's own recommender system
• But is the goal achievable? Do the training data support such a possibility?

Amazon's Mechanical Turk
• "Crowd-sourcing" tedious human intelligence (pattern recognition) tasks
• Which ones are doable by machines?

Conclusions

Summary
Automatic classification is useful, but can be very difficult. We know the key steps and many promising methods, but we have not fully understood how they work or what else is needed. We found measures of geometric complexity that are useful for characterizing the difficulty of classification problems and the domains of competence of classifiers. A better understanding of how data and classifiers interact can guide practice and re-establish the linkage between context and solution.

For the Future
Further progress in statistical and machine learning will need systematic, scientific evaluation of algorithms on problems that are difficult for different reasons. A "problem synthesizer" would be useful to provide a complete evaluation platform and to reveal the "blind spots" of current learning algorithms. Rigorous statistical characterization of complexity estimates from limited training data will help gauge the uncertainty and determine the applicability of data complexity methods.