
Pattern Classification. All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Bayes Decision Theory (Feb. 17)

2 Classification. Supervised learning: based on training examples (E), learn a model that performs well on previously unseen examples. Classification: a supervised learning task of categorising entities into a predefined set of classes.

3 Pattern Classification, Chapter 2 (Part 1)

4 Posterior, likelihood, evidence: P(ωj | x) = P(x | ωj) P(ωj) / P(x), i.e. Posterior = (Likelihood × Prior) / Evidence, where in the case of two categories P(x) = P(x | ω1) P(ω1) + P(x | ω2) P(ω2). Pattern Classification, Chapter 2 (Part 1)
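A minimal numeric sketch of this formula; the likelihood values and priors below are made up for illustration and do not come from the slides:

```python
import numpy as np

# Hypothetical likelihoods P(x | w1), P(x | w2) at a fixed observation x,
# and class priors P(w1), P(w2); the numbers are illustrative only.
likelihood = np.array([0.6, 0.3])
prior = np.array([2 / 3, 1 / 3])

evidence = np.sum(likelihood * prior)      # P(x) = sum_j P(x | wj) P(wj)
posterior = likelihood * prior / evidence  # P(wj | x), Bayes formula

print(posterior)           # the posteriors sum to 1
print(posterior.argmax())  # index of the most probable state of nature
```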

5 Pattern Classification, Chapter 2 (Part 1)

6 Bayesian Decision. • Decision given the posterior probabilities: x is an observation for which: if P(ω1 | x) > P(ω2 | x), decide that the true state of nature is ω1; if P(ω1 | x) < P(ω2 | x), decide that the true state of nature is ω2. This rule minimizes the probability of error. Pattern Classification, Chapter 2 (Part 1)

Bayesian Decision Theory – Generalization 7. • Use of more than one feature • Use of more than two states of nature • Allowing actions, and not only deciding on the state of nature • Introduce a loss function which is more general than the probability of error. Pattern Classification, Chapter 2 (Part 1)

8 Let {ω1, ω2, …, ωc} be the set of c states of nature (or “categories”). Let {α1, α2, …, αa} be the set of possible actions. Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj. Pattern Classification, Chapter 2 (Part 1)

Bayes decision theory example: Automatic trading (on stock exchanges). ω1: the prices will increase (in the future!); ω2: the prices will be lower; ω3: the prices won’t change much. We cannot observe ω (it is latent)! α1: buy; α2: sell. x: actual prices (and historical prices); x is observed. λ: how much we lose with an action.
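A sketch of how the loss function λ(αi | ωj) of this trading example could be written down as a table; the loss values are hypothetical placeholders, not values given in the slides:

```python
import numpy as np

states = ["prices increase", "prices fall", "prices unchanged"]  # w1, w2, w3
actions = ["buy", "sell"]                                        # a1, a2

# loss[i, j] = lambda(ai | wj): loss for taking action ai when wj is true.
# Illustrative numbers: buying before a fall (or selling before a rise) hurts.
loss = np.array([
    [0.0, 2.0, 0.5],  # buy
    [2.0, 0.0, 0.5],  # sell
])
print(loss[actions.index("buy"), states.index("prices fall")])  # -> 2.0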

10 Overall risk: R = the sum of all R(αi | x) for i = 1, …, a. Conditional risk: R(αi | x) = Σj λ(αi | ωj) P(ωj | x). The decision rule minimizes R(αi | x) for i = 1, …, a. Pattern Classification, Chapter 2 (Part 1)

11 Select the action αi for which R(αi | x) is minimum. R is then minimum, and R in this case is called the Bayes risk = the best performance that can be achieved! Pattern Classification, Chapter 2 (Part 1)
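A short sketch of this minimum-risk rule, using a hypothetical loss table λ(αi | ωj) and hypothetical posteriors P(ωj | x):

```python
import numpy as np

# Hypothetical loss table lambda(ai | wj) (rows: actions, columns: states)
# and hypothetical posteriors P(wj | x); all numbers are illustrative.
loss = np.array([[0.0, 2.0, 0.5],
                 [2.0, 0.0, 0.5]])
posterior = np.array([0.5, 0.2, 0.3])

# Conditional risk R(ai | x) = sum_j lambda(ai | wj) * P(wj | x)
risk = loss @ posterior
best_action = risk.argmin()   # Bayes decision: take the minimum-risk action
print(risk, best_action)
```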

12 • Two-category classification. α1: deciding ω1; α2: deciding ω2. λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj. Conditional risk: R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x); R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x). Pattern Classification, Chapter 2 (Part 1)

13 Our rule is the following: if R(α1 | x) < R(α2 | x), action α1 (“decide ω1”) is taken. This results in the equivalent rule: decide ω1 if (λ21 − λ11) P(x | ω1) P(ω1) > (λ12 − λ22) P(x | ω2) P(ω2), and decide ω2 otherwise. Pattern Classification, Chapter 2 (Part 1)

14 Likelihood ratio: the preceding rule is equivalent to the following rule: if P(x | ω1) / P(x | ω2) > (λ12 − λ22) P(ω2) / ((λ21 − λ11) P(ω1)), then take action α1 (decide ω1); otherwise take action α2 (decide ω2). Pattern Classification, Chapter 2 (Part 1)
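A sketch of this likelihood-ratio form of the two-category rule; the losses, likelihood values, and priors below are made up for illustration:

```python
# Two-category minimum-risk decision via the likelihood ratio.
# All numbers are hypothetical.
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0   # lambda_ij = lambda(ai | wj)
p1, p2 = 2 / 3, 1 / 3                     # priors P(w1), P(w2)
lik1, lik2 = 0.6, 0.3                     # P(x | w1), P(x | w2) at some x

threshold = (l12 - l22) * p2 / ((l21 - l11) * p1)
if lik1 / lik2 > threshold:
    print("take action a1: decide w1")
else:
    print("take action a2: decide w2")
```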

15 Exercise. Select the optimal decision where: Ω = {ω1, ω2}; P(x | ω1) ~ N(2, 0.5) (normal distribution); P(x | ω2) ~ N(1.5, 0.2); P(ω1) = 2/3; P(ω2) = 1/3. Pattern Classification, Chapter 2 (Part 1)
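One possible way to check the decision numerically for this exercise, assuming the second parameter of N(·, ·) is the variance (the slide does not say whether it is σ or σ²) and picking an arbitrary observation x:

```python
from scipy.stats import norm

p1, p2 = 2 / 3, 1 / 3
# Assumption: N(2, 0.5) means mean 2, variance 0.5 (so std = sqrt(0.5)).
lik1 = lambda x: norm.pdf(x, loc=2.0, scale=0.5 ** 0.5)
lik2 = lambda x: norm.pdf(x, loc=1.5, scale=0.2 ** 0.5)

x = 1.8  # an arbitrary example observation
# Decide w1 if P(x | w1) P(w1) > P(x | w2) P(w2) (the evidence cancels).
print("decide w1" if lik1(x) * p1 > lik2(x) * p2 else "decide w2")
```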

Zero-one loss function (Bayes Classifier) 16. The zero-one loss assigns λ(αi | ωj) = 0 if i = j and 1 if i ≠ j; therefore, the conditional risk is R(αi | x) = Σj≠i P(ωj | x) = 1 − P(ωi | x). “The risk corresponding to this loss function is the average probability of error.” Pattern Classification, Chapter 2 (Part 2)

Classifiers, Discriminant Functions and Decision Surfaces 17. • The multi-category case • Set of discriminant functions gi(x), i = 1, …, c • The classifier assigns a feature vector x to class ωi if: gi(x) > gj(x) for all j ≠ i. Pattern Classification, Chapter 2 (Part 2)

18 • Let gi(x) = −R(αi | x) (maximum discriminant corresponds to minimum risk!) • For the minimum error rate, we take gi(x) = P(ωi | x) (maximum discriminant corresponds to maximum posterior!) • gi(x) = P(x | ωi) P(ωi), or equivalently gi(x) = ln P(x | ωi) + ln P(ωi) (ln: natural logarithm!) Pattern Classification, Chapter 2 (Part 2)
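A minimal sketch of the log form gi(x) = ln P(x | ωi) + ln P(ωi) for univariate Gaussian class-conditional densities; the means, variances, and priors are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities and priors for c = 3 classes.
means = np.array([0.0, 2.0, 4.0])
stds = np.array([1.0, 0.5, 1.5])
priors = np.array([0.5, 0.3, 0.2])

def g(x):
    """Discriminant functions g_i(x) = ln p(x | w_i) + ln P(w_i)."""
    return norm.logpdf(x, loc=means, scale=stds) + np.log(priors)

x = 1.2
scores = g(x)
print(scores, "-> assign x to class", scores.argmax() + 1)
```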

19 • The feature space is divided into c decision regions: if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means: assign x to ωi). • The two-category case: a classifier is a “dichotomizer” that has two discriminant functions g1 and g2. Let g(x) ≡ g1(x) − g2(x). Decide ω1 if g(x) > 0; otherwise decide ω2. Pattern Classification, Chapter 2 (Part 2)

20 • The computation of g(x). Pattern Classification, Chapter 2 (Part 2)

21 Discriminant functions of the Bayes Classifier with Normal Density. Pattern Classification, Chapter 2 (Part 1)

22 The Normal Density. Univariate density: • a density which is analytically tractable • a continuous density • a lot of processes are asymptotically Gaussian • handwritten characters and speech sounds are an ideal or prototype corrupted by a random process (central limit theorem). p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)), where μ = mean (or expected value) of x and σ² = expected squared deviation, or variance. Pattern Classification, Chapter 2 (Part 2)

23 Pattern Classification, Chapter 2 (Part 2)

24 • Multivariate density. The multivariate normal density in d dimensions is: p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp(−½ (x − μ)ᵗ Σ⁻¹ (x − μ)), where: x = (x1, x2, …, xd)ᵗ (t stands for the transpose); μ = (μ1, μ2, …, μd)ᵗ is the mean vector; Σ is the d×d covariance matrix; |Σ| and Σ⁻¹ are its determinant and inverse, respectively. Pattern Classification, Chapter 2 (Part 2)
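A small sketch evaluating the multivariate normal density with scipy, using a hypothetical μ and Σ in d = 2 dimensions, and checking it against the formula above:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical mean vector and covariance matrix (d = 2).
mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])

x = np.array([0.5, 0.5])
print(multivariate_normal.pdf(x, mean=mu, cov=sigma))

# The same value computed directly from the density formula on this slide:
d = len(mu)
diff = x - mu
pdf = (np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)
       / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma)))
print(pdf)
```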

Discriminant Functions for the Normal Density 25. • We saw that minimum error-rate classification can be achieved by the discriminant function gi(x) = ln P(x | ωi) + ln P(ωi). • Case of the multivariate normal: gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi). Pattern Classification, Chapter 2 (Part 3)

26 • Case Σi = σ²I (I stands for the identity matrix): the features are statistically independent and each has the same variance σ². Pattern Classification, Chapter 2 (Part 3)

27 • A classifier that uses linear discriminant functions is called a “linear machine”. • The decision surfaces for a linear machine are pieces of hyperplanes defined by: gi(x) = gj(x). Pattern Classification, Chapter 2 (Part 3)
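A sketch of the linear-machine case Σi = σ²I, where each discriminant reduces to a linear function of x; the means, σ², and priors are hypothetical:

```python
import numpy as np

# Hypothetical setup: two classes in d = 2 with Sigma_i = sigma^2 * I.
mu = np.array([[0.0, 0.0],
               [2.0, 1.0]])
sigma2 = 1.0
priors = np.array([0.5, 0.5])

# For Sigma_i = sigma^2 I the discriminant is linear:
#   g_i(x) = w_i . x + w_i0,   w_i = mu_i / sigma^2,
#   w_i0 = -(mu_i . mu_i) / (2 sigma^2) + ln P(w_i)
W = mu / sigma2
w0 = -np.sum(mu * mu, axis=1) / (2 * sigma2) + np.log(priors)

x = np.array([1.2, 0.3])
g = W @ x + w0
print(g, "-> decide class", g.argmax() + 1)
```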

28 The hyperplane is always orthogonal to the line linking the means! Pattern Classification, Chapter 2 (Part 3)

29 • The hyperplane separating Ri and Rj is always orthogonal to the line linking the means! Pattern Classification, Chapter 2 (Part 3)

30 Pattern Classification, Chapter 2 (Part 3)

31 Pattern Classification, Chapter 2 (Part 3)

32 • Case Σi = Σ (the covariance matrices of all classes are identical but arbitrary!) • The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means! Pattern Classification, Chapter 2 (Part 3)

33 Pattern Classification, Chapter 2 (Part 3)

34 Pattern Classification, Chapter 2 (Part 3)

35 • Case Σi = arbitrary • The covariance matrices are different for each category. The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperboloids. Pattern Classification, Chapter 2 (Part 3)
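A sketch of this general case, where each gi(x) is quadratic in x because the covariances differ; all class parameters below are hypothetical:

```python
import numpy as np

# Hypothetical parameters for two classes with different covariances.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
sigmas = [np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[2.0, 0.5], [0.5, 0.5]])]
priors = [0.5, 0.5]

def g_i(x, mu, sigma, prior):
    """Quadratic discriminant g_i(x) = ln p(x | w_i) + ln P(w_i), Gaussian case."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

x = np.array([1.0, 0.5])
scores = [g_i(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
print(scores, "-> decide class", int(np.argmax(scores)) + 1)
```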

36 Pattern Classification, Chapter 2 (Part 3)

37 Pattern Classification, Chapter 2 (Part 3)

38 Exercise. Select the optimal decision where: Ω = {ω1, ω2}; P(x | ω1) ~ N(2, 0.5) (normal distribution); P(x | ω2) ~ N(1.5, 0.2); P(ω1) = 2/3; P(ω2) = 1/3. Pattern Classification, Chapter 2

39 Parameter estimation. Pattern Classification, Chapter 3

• Data availability in a Bayesian framework • We could design an optimal classifier if we knew: P(ωi) (priors) and P(x | ωi) (class-conditional densities). Unfortunately, we rarely have this complete information! • Design a classifier from a training sample • No problem with prior estimation • Samples are often too small for class-conditional estimation (large dimension of the feature space!)

• A priori information about the problem • E.g., assume normality of P(x | ωi): P(x | ωi) ~ N(μi, Σi), characterized by 2 parameters • Estimation techniques: Maximum-Likelihood (ML) and Bayesian estimation • Results are nearly identical, but the approaches are different

• Parameters in ML estimation are fixed but unknown! • The best parameters are obtained by maximizing the probability of obtaining the samples observed • Bayesian methods view the parameters as random variables having some known distribution • In either approach, we use P(ωi | x) for our classification rule!

• Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category • Suppose that D contains n samples, x1, x2, …, xn • The ML estimate of θ is, by definition, the value that maximizes P(D | θ): “it is the value of θ that best agrees with the actually observed training sample”

• Optimal estimation • Let θ = (θ1, θ2, …, θp)ᵗ and let ∇θ be the gradient operator • We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ) • New problem statement: determine the θ that maximizes the log-likelihood
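A minimal sketch of ML estimation for a univariate Gaussian: setting the gradient of l(θ) = ln P(D | θ) to zero over θ = (μ, σ²) yields the sample mean and the (biased) sample variance. The data below are randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=0.7, size=200)   # synthetic training sample

# Closed-form ML estimates for a Gaussian: sample mean and biased variance.
mu_ml = D.mean()
var_ml = ((D - mu_ml) ** 2).mean()
print(mu_ml, var_ml)
```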

Bayesian Estimation • In MLE, θ was assumed to be fixed • In BE, θ is a random variable • The computation of the posterior probabilities P(ωi | x) lies at the heart of Bayesian classification • Goal: compute P(ωi | x, D). Given the sample D, Bayes formula can be written: P(ωi | x, D) = P(x | ωi, D) P(ωi | D) / Σj P(x | ωj, D) P(ωj | D)


• Bayesian Parameter Estimation: Gaussian Case • Goal: estimate μ using the a-posteriori density P(μ | D) • The univariate case: μ is the only unknown parameter (μ0 and σ0 are known!) 47

• Reproducing density. Identifying (1) and (2) yields: μn = (nσ0² / (nσ0² + σ²)) μ̂n + (σ² / (nσ0² + σ²)) μ0 and σn² = σ0²σ² / (nσ0² + σ²), where μ̂n is the sample mean of the n observations.
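A sketch of these update formulas for the univariate Gaussian with known σ² and a Gaussian prior N(μ0, σ0²) on μ; the prior parameters and data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 1.0                              # known variance of P(x | mu)
mu0, sigma0_2 = 0.0, 4.0                  # prior P(mu) = N(mu0, sigma0^2)
D = rng.normal(loc=2.0, scale=np.sqrt(sigma2), size=30)

n = len(D)
xbar = D.mean()
# Posterior P(mu | D) = N(mu_n, sigma_n^2):
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * xbar \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)
print(mu_n, sigma_n2)
```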

• The univariate case: P(x | D) • P(μ | D) has been computed • P(x | D) remains to be computed! It provides the desired class-conditional density P(x | Dj, ωj) • Therefore, using P(x | Dj, ωj) together with P(ωj) and Bayes formula, we obtain the Bayesian classification rule: choose the ωj that maximizes P(x | Dj, ωj) P(ωj)

• Bayesian Parameter Estimation: General Theory • The P(x | D) computation can be applied to any situation in which the unknown density can be parametrized. The basic assumptions are: • The form of P(x | θ) is assumed known, but the value of θ is not known exactly • Our knowledge about θ is assumed to be contained in a known prior density P(θ) • The rest of our knowledge about θ is contained in a set D of n samples x1, x2, …, xn drawn according to P(x)

The basic problem is: “Compute the posterior density P(θ | D)”, then “Derive P(x | D)”. Using Bayes formula, we have: P(θ | D) = P(D | θ) P(θ) / ∫ P(D | θ) P(θ) dθ, and by the independence assumption: P(D | θ) = Πk P(xk | θ).

52 MLE vs. Bayes estimation • If n → ∞, they are equal! • MLE: simple and fast (optimisation vs. numerical integration) • Bayes estimation: we can express our uncertainty about θ through the prior P(θ)

Summary • Bayes decision theory: a general framework for probabilistic decision making • Bayes classifier: classification is a special case of decision making (αi: choose ωi) • Zero-one loss function: λ can be omitted • Bayes classifier with zero-one loss and Normal densities

Summary • Parameter estimation: general procedures for estimating the parameters of densities from a sample (they can be applied beyond the Bayes classifier) • Bayesian machine learning: the marriage of Bayes decision theory and parameter estimation from a training sample