
4. Inductive Learning from Examples: Decision Tree Learning
Prof. Gheorghe Tecuci, Learning Agents Laboratory, Computer Science Department, George Mason University

Overview
• The decision tree learning problem
• The basic ID3 learning algorithm
• Discussion and refinement of the ID3 method
• Applicability of decision tree learning
• Exercises
• Recommended reading

The decision tree learning problem

Given
• language of instances: feature-value vectors
• language of generalizations: decision trees
• a set of positive examples (E1, ..., En) of a concept
• a set of negative examples (C1, ..., Cm) of the same concept
• learning bias: preference for shorter decision trees

Determine
• a concept description in the form of a decision tree which is a generalization of the positive examples and does not cover any of the negative examples

Illustration

Feature vector representation of examples. That is, there is a fixed set of attributes, each attribute taking values from a specified set.

Examples:

height  hair   eyes   class
short   blond  blue   +
tall    blond  brown  -
tall    red    blue   +
short   dark   blue   -
tall    dark   blue   -
tall    blond  blue   +
tall    dark   brown  -
short   blond  brown  -

Decision tree concept: shown on the next slide.
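The same training set can be written down as data for the sketches that appear later in these notes; the (attribute-dict, label) encoding and the name EXAMPLES are illustrative choices, not part of the slides:

```python
# The eight training examples from the table above, encoded as
# (attribute-dict, label) pairs; True marks the + class.
EXAMPLES = [
    ({"height": "short", "hair": "blond", "eyes": "blue"},  True),
    ({"height": "tall",  "hair": "blond", "eyes": "brown"}, False),
    ({"height": "tall",  "hair": "red",   "eyes": "blue"},  True),
    ({"height": "short", "hair": "dark",  "eyes": "blue"},  False),
    ({"height": "tall",  "hair": "dark",  "eyes": "blue"},  False),
    ({"height": "tall",  "hair": "blond", "eyes": "blue"},  True),
    ({"height": "tall",  "hair": "dark",  "eyes": "brown"}, False),
    ({"height": "short", "hair": "blond", "eyes": "brown"}, False),
]
```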

What is the logical expression represented by the decision tree?

Decision tree concept:

hair:
  dark  -> -
  red   -> +
  blond -> eyes:
             blue  -> +
             brown -> -

Which is the concept represented by this decision tree?
Disjunction of conjunctions (one conjunct per path to a + node):
(hair = red) ∨ [(hair = blond) & (eyes = blue)]
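The same tree can be written as a plain classification function, which makes the correspondence with the logical expression explicit; the function name classify and the dict encoding of instances are assumptions, not from the slides:

```python
def classify(instance):
    """Classify an instance with the hair/eyes decision tree above.

    Follows one path from the root attribute 'hair' to a leaf;
    '+' leaves return True, '-' leaves return False.
    """
    if instance["hair"] == "dark":
        return False              # path: hair = dark  -> -
    if instance["hair"] == "red":
        return True               # path: hair = red   -> +   (first disjunct)
    # hair = blond: test the 'eyes' attribute next
    if instance["eyes"] == "blue":
        return True               # (hair = blond) & (eyes = blue)  (second disjunct)
    return False                  # hair = blond, eyes = brown -> -
```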

Feature-value representation

Is the feature-value representation adequate?

If the training set (i.e. the set of positive and negative examples from which the tree is learned) contains a positive example and a negative example that have identical values for each attribute, it is impossible to differentiate between the instances with reference only to the given attributes. In such a case the attributes are inadequate for the training set and for the induction task.

Feature-value representation (cont.)

When could a decision tree be built?

If the attributes are adequate, it is always possible to construct a decision tree that correctly classifies each instance in the training set.

So what is the difficulty in learning a decision tree?

The problem is that there are many such correct decision trees, and the task of induction is to construct a decision tree that correctly classifies not only instances from the training set but other (unseen) instances as well.

Overview
• The decision tree learning problem
• The basic ID3 learning algorithm
• Discussion and refinement of the ID3 method
• Applicability of decision tree learning
• Exercises
• Recommended reading

The basic ID3 learning algorithm

• Let C be the set of training examples.
• If all the examples in C are positive, then create a node with label +.
• If all the examples in C are negative, then create a node with label -.
• If there is no attribute left, then create a node with the same label as the majority of examples in C.
• Otherwise:
  - select the best attribute A and create a decision node, where v1, v2, ..., vk are the values of A;
  - partition the examples into subsets C1, C2, ..., Ck according to the values of A;
  - apply the algorithm recursively to each of the sets Ci which is not empty;
  - for each Ci which is empty, create a node with the same label as the majority of examples in C.
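A minimal sketch of this recursion in Python; it assumes examples are (attribute-dict, label) pairs and that the caller supplies a select_attribute heuristic (ID3 uses information gain, introduced on the following slides). The function and variable names are illustrative, not from the slides:

```python
from collections import Counter

def majority_label(examples):
    """Most common class label among (attribute-dict, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def id3(examples, attributes, select_attribute):
    """Sketch of the basic ID3 recursion described above.

    examples:         list of (attribute-dict, label) pairs
    attributes:       attribute names still available for testing
    select_attribute: heuristic that picks the 'best' attribute

    Returns a class label for a leaf, or ('node', attribute, branches),
    where branches maps each attribute value to a subtree.
    """
    labels = {label for _, label in examples}
    if len(labels) == 1:                  # all examples positive, or all negative
        return labels.pop()
    if not attributes:                    # no attribute left: majority label
        return majority_label(examples)

    a = select_attribute(examples, attributes)
    # Partition C into C1, ..., Ck by the values of A observed in the examples;
    # an empty Ci would get the majority label of C, as stated on the slide.
    branches = {}
    for v in {inst[a] for inst, _ in examples}:
        subset = [(inst, label) for inst, label in examples if inst[a] == v]
        branches[v] = id3(subset, [x for x in attributes if x != a], select_attribute)
    return ("node", a, branches)
```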

Feature selection: information theory

Let us consider a set S containing objects from n classes S1, ..., Sn, such that the probability of an object belonging to class Si is pi.

According to information theory, the amount of information needed to identify the class of one particular member of S is:

  Ii = -log2 pi

Intuitively, Ii represents the number of yes/no questions required to identify the class Si of a given element in S.

The average amount of information needed to identify the class of an element in S is:

  -∑i pi log2 pi
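A small numeric illustration of these two formulas; the function name average_info and the list-of-counts encoding are mine:

```python
import math

def average_info(counts):
    """Average information, -sum_i p_i * log2(p_i), for a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(average_info([1] * 8))   # 8 equally likely classes: 3.0 bits per element
print(average_info([3, 5]))    # 3 positive vs. 5 negative examples: 0.954434...
```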

Discussion

Consider the following letters: A B C D E F G H

Think of one of them (call it the secret letter). How many questions need to be asked in order to find the secret letter?
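A worked instance, assuming the eight letters are equally likely (pi = 1/8 for each):

```latex
I_i = -\log_2 p_i = -\log_2 \tfrac{1}{8} = 3
```

Three yes/no questions, each halving the set of remaining candidates, therefore always suffice.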

Feature selection: the best attribute

Let us suppose that the decision tree is to be built from a training set C consisting of p positive examples and n negative examples. The average amount of information needed to classify an instance from C is:

  I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

If attribute A with values {v1, v2, ..., vk} is used for the root of the decision tree, it will partition C into {C1, C2, ..., Ck}, where each Ci contains pi positive examples and ni negative examples. The expected information required to classify an instance in Ci is I(pi, ni). The expected amount of information required to classify an instance after the value of the attribute A is known is therefore:

  Ires(A) = ∑i ((pi + ni)/(p + n)) · I(pi, ni)

The information gained by branching on A is: gain(A) = I(p, n) - Ires(A)
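A minimal sketch of I(p, n), Ires(A) and gain(A) as just defined; the function names and the (pi, ni) pair encoding of the partition are mine:

```python
import math

def info(p, n):
    """I(p, n): information needed to classify an instance of a set with
    p positive and n negative examples."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c > 0)

def gain(p, n, partition):
    """Information gained by branching on an attribute.

    partition is a list of (p_i, n_i) pairs, one per attribute value;
    Ires(A) is the count-weighted average of I(p_i, n_i)."""
    i_res = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partition)
    return info(p, n) - i_res
```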

Feature selection: the heuristic

The information gained by branching on A is: gain(A) = I(p, n) - Ires(A)

What would be a good heuristic?
Choose the attribute which leads to the greatest information gain.

Why is this a heuristic and not a guaranteed method?
Hint: What kind of search method for the best attribute does ID3 use?

Feature selection: the heuristic

Why is this a heuristic and not a guaranteed method?
Hint: Think of a situation where a is the best single attribute, but the combination "b and c" would actually be better than either "a and b" or "a and c". That is, knowing b and c you can classify, but knowing only a and b (or only a and c) you cannot. This shows that the attributes may not be independent.

How could we deal with this?
Hint: Consider also combinations of attributes: not only a, b, c, but also ab, bc, ca.

What is a problem with this approach?

Illustration of the method

Examples: the eight training examples from the table above (3 positive, 5 negative).

1. Find the attribute that maximizes the information gain: gain(A) = I(p, n) - Ires(A)

I(3+, 5-) = -3/8 log2(3/8) - 5/8 log2(5/8) = 0.954434003

Height: short (1+, 2-), tall (2+, 3-)
Gain(height) = 0.954434003 - 3/8·I(1+, 2-) - 5/8·I(2+, 3-)
             = 0.954434003 - 3/8·(-1/3 log2(1/3) - 2/3 log2(2/3)) - 5/8·(-2/5 log2(2/5) - 3/5 log2(3/5))
             = 0.003228944

Hair: blond (2+, 2-), red (1+, 0-), dark (0+, 3-)
Gain(hair) = 0.954434003 - 4/8·(-2/4 log2(2/4) - 2/4 log2(2/4)) - 1/8·(-1/1 log2(1/1) - 0) - 3/8·(0 - 3/3 log2(3/3))
           = 0.954434003 - 0.5 = 0.454434003

Eyes: blue (3+, 2-), brown (0+, 3-)
Gain(eyes) = 0.954434003 - 5/8·(-3/5 log2(3/5) - 2/5 log2(2/5)) - 3/8·0
           = 0.954434003 - 0.606844122 = 0.347589881

"Hair" is the best attribute.
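As a check, the three gains on this slide can be reproduced with a few lines of code; the helpers repeat the info/gain sketch given earlier so the snippet runs on its own:

```python
import math

def info(p, n):
    t = p + n
    return -sum((c / t) * math.log2(c / t) for c in (p, n) if c > 0)

def gain(p, n, partition):
    return info(p, n) - sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partition)

print(info(3, 5))                             # 0.954434...
print(gain(3, 5, [(1, 2), (2, 3)]))           # height: 0.003229...
print(gain(3, 5, [(2, 2), (1, 0), (0, 3)]))   # hair:   0.454434...
print(gain(3, 5, [(3, 2), (0, 3)]))           # eyes:   0.347590...
```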

Illustration of the method (cont.)

2. "Hair" is the best attribute. Build the tree using it:

hair:
  dark  -> -   (0+, 3-)
  red   -> +   (1+, 0-)
  blond -> ?   (2+, 2-; this branch still has to be expanded)

Illustration of the method (cont.)

3. Select the best attribute for the set of examples in the blond branch:

short, blond, blue:  +
tall,  blond, brown: -
tall,  blond, blue:  +
short, blond, brown: -

I(2+, 2-) = -2/4 log2(2/4) - 2/4 log2(2/4) = -log2(1/2) = 1

Height: short (1+, 1-), tall (1+, 1-)
Gain(height) = 1 - 2/4·I(1+, 1-) - 2/4·I(1+, 1-) = 1 - I(1+, 1-) = 1 - 1 = 0

Eyes: blue (2+, 0-), brown (0+, 2-)
Gain(eyes) = 1 - 2/4·I(2+, 0-) - 2/4·I(0+, 2-) = 1 - 0 = 1

"Eyes" is the best attribute.

Illustration of the method (cont.)

4. "Eyes" is the best attribute. Expand the tree using it:

hair:
  dark  -> -
  red   -> +
  blond -> eyes:
             blue  -> +
             brown -> -

Illustration of the method (cont.)

5. Build the decision tree.

What induction hypothesis is made?

Overview
• The decision tree learning problem
• The basic ID3 learning algorithm
• Discussion and refinement of the ID3 method
• Applicability of decision tree learning
• Exercises
• Recommended reading

How could we transform a tree into a set of rules?

Answer:
IF (hair = red) THEN positive example
IF [(hair = blond) & (eyes = blue)] THEN positive example

Why should we make such a transformation?
Converting to rules improves understandability.
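A sketch of the conversion, assuming the ('node', attribute, branches) tree encoding and boolean leaf labels from the ID3 sketch earlier in these notes; each path that ends in a positive leaf becomes one IF-THEN rule:

```python
def tree_to_rules(tree, conditions=()):
    """Collect one rule (a list of (attribute, value) tests) per path that
    ends in a positive leaf of the given decision tree."""
    if not isinstance(tree, tuple):                  # leaf: a class label
        return [list(conditions)] if tree is True else []
    _, attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

# For the hair/eyes tree this yields the two rules from the slide:
# [('hair', 'red')] and [('hair', 'blond'), ('eyes', 'blue')].
```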

Learning from noisy data

What errors could be found in an example (also called noise in the data)?
• errors in the values of attributes (due to measurements or subjective judgments);
• errors of classification of the instances (for instance, a negative example that was considered a positive example).

What are the effects of noise?
How to change the ID3 algorithm to deal with noise?

How to deal with noise?

What are the effects of noise?
• Noise may cause the attributes to become inadequate.
• Noise may lead to decision trees of spurious complexity (overfitting).

How to change the ID3 algorithm to deal with noise?
• The algorithm must be able to work with inadequate attributes, because noise can cause even the most comprehensive set of attributes to appear inadequate.
• The algorithm must be able to decide that testing further attributes will not improve the predictive accuracy of the decision tree. For instance, it should refrain from increasing the complexity of the decision tree to accommodate a single noise-generated special case.

How to deal with an inadequate attribute set? (inadequacy due to noise)

A collection C of instances may contain representatives of both classes, yet further testing of C may be ruled out, either because the attributes are inadequate and unable to distinguish among the instances in C, or because each attribute has been judged to be irrelevant to the class of instances in C. In this situation it is necessary to produce a leaf labeled with class information, even though the instances in C are not all of the same class.

What class should be assigned to a leaf node that contains both + and - examples?

What class should be assigned to a leaf node that contains both + and - examples?

Approaches:
1. The notion of class could be generalized from a binary value (0 for negative examples and 1 for positive examples) to a number in the interval [0, 1]. In such a case, a class of 0.8 would be interpreted as 'belonging to class P with probability 0.8'.
2. Opt for the more numerous class, i.e. assign the leaf to class P if p > n, and to class N if p < n.

How to avoid overfitting the data?

One says that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances.

How to avoid overfitting?
• Stop growing the tree before it overfits;
• Allow the tree to overfit and then prune it.

How to determine the correct size of the tree?
Use a testing set of examples to compare the likely errors of various trees.

Rule post-pruning to avoid overfitting the data

Rule post-pruning algorithm:
1. Infer a decision tree.
2. Convert the tree into a set of rules.
3. Prune (generalize) the rules by removing antecedents as long as this improves their accuracy.
4. Sort the rules by their accuracy and use this order in classification.

Compare tree pruning with rule post-pruning.
Rule post-pruning is more general: we can remove an attribute test from the top of the tree without removing all the tests that follow it.
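A rough sketch of the antecedent-removal step (step 3), assuming a rule is a list of (attribute, value) tests, validation is a held-out set of (attribute-dict, label) pairs, and accuracy is estimated as the fraction of matching instances that are positive; all names are illustrative:

```python
def rule_accuracy(rule, validation):
    """Estimated accuracy of one rule: among the validation instances that
    satisfy all of the rule's antecedents, the fraction that are positive."""
    matched = [label for inst, label in validation
               if all(inst[attr] == val for attr, val in rule)]
    return sum(matched) / len(matched) if matched else 0.0

def post_prune_rule(rule, validation):
    """Greedily remove antecedents as long as this improves estimated accuracy."""
    improved = True
    while improved and rule:
        improved = False
        current = rule_accuracy(rule, validation)
        for test in list(rule):
            candidate = [t for t in rule if t != test]
            if rule_accuracy(candidate, validation) > current:
                rule, improved = candidate, True
                break
    return rule
```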

How to use continuous attributes?

Transform a continuous attribute into a discrete one.

Give an example of such a transformation.
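One common transformation, sketched here under the usual assumption (as in C4.5-style extensions and in Mitchell's Chapter 3): sort the examples by the continuous value and introduce a boolean test value <= t, with candidate thresholds t placed between adjacent values whose classes differ:

```python
def candidate_thresholds(values_with_labels):
    """Candidate cut points for turning a continuous attribute into a boolean
    test (value <= t): midpoints between adjacent sorted values whose class
    labels differ."""
    pts = sorted(values_with_labels)                 # (value, label) pairs
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pts, pts[1:])
            if la != lb and a != b]

# e.g. temperatures 40(-), 48(-), 60(+), 72(+), 80(+), 90(-)
print(candidate_thresholds([(40, 0), (48, 0), (60, 1), (72, 1), (80, 1), (90, 0)]))
# -> [54.0, 85.0]
```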

How to deal with missing attribute values?

Estimate the value from the values of the other examples.

How?
• Assign the value that is most common among the training examples at that node.
• Assign a probability to each of the possible values.

How does this affect the algorithm?
Consider fractional examples.
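A sketch of the simplest of these strategies, filling in the most common value observed at the node; the None encoding of "missing" and the function name are mine:

```python
from collections import Counter

def fill_most_common(examples, attribute):
    """Replace missing values (encoded here as None) of one attribute with
    the value most common among the training examples at this node."""
    observed = [inst[attribute] for inst, _ in examples if inst[attribute] is not None]
    default = Counter(observed).most_common(1)[0][0]
    return [({**inst, attribute: default} if inst[attribute] is None else inst, label)
            for inst, label in examples]
```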

Comparison with the candidate elimination algorithm

Generalization language
  ID3: disjunctions of conjunctions
  CE: conjunctions

Bias
  ID3: preference bias (Occam's razor)
  CE: representation bias

Search strategy
  ID3: hill climbing (may not find the concept but only an approximation)
  CE: exhaustive search

Use of examples
  ID3: all at the same time (can deal with noise and missing values)
  CE: one at a time (can determine the most informative example)

Overview
• The decision tree learning problem
• The basic ID3 learning algorithm
• Discussion and refinement of the ID3 method
• Applicability of decision tree learning
• Exercises
• Recommended reading

What problems are appropriate for decision tree learning?

Problems for which:
• Instances can be represented by attribute-value pairs;
• Disjunctive descriptions may be required to represent the learned concept;
• Training data may contain errors;
• Training data may contain missing attribute values.

What practical applications could you envision?

Classify:
• Patients by their disease;
• Equipment malfunctions by their cause;
• Loan applicants by their likelihood to default on payments.

Which are the main features of decision tree learning?

• May employ a large number of examples.
• Discovers efficient classification trees that are theoretically justified.
• Learns disjunctive concepts.
• Is limited to attribute-value representations.
• Has a non-incremental nature (there are, however, incremental versions that are less efficient).
• The tree representation is not very understandable.
• The method is limited to learning classification rules.
• The method was successfully applied to complex real-world problems.

Overview
• The decision tree learning problem
• The basic ID3 learning algorithm
• Discussion and refinement of the ID3 method
• Applicability of decision tree learning
• Exercises
• Recommended reading

Exercise

Build two different decision trees corresponding to the examples and counterexamples from the following table. Indicate the concept represented by each decision tree.

The instances are eight animals, described by the attributes food (herbivore, carnivore, omnivorous), medium (land, water, amphibious, air), type (harmless, harmful, moody), and class of animal (mammal, fish, amphibian, bird, reptile); the full table of attribute values is given in the original slide. The examples are deer (e1, mammal, +), goldfish (e2, fish, +), cobra (e3, reptile, +), and bear (e4, mammal, +); the counterexamples are lion (c1, mammal, -), frog (c2, amphibian, -), parrot (c3, bird, -), and lizard (c4, reptile, -).

Apply the ID3 algorithm to build the decision tree corresponding to the examples and counterexamples from the above table.

Exercise

Consider the following positive and negative examples of a concept, described by the attributes shape (ball, brick, cube) and size (large, small) — the positive examples e1, e2 and the counterexamples c1, c2 given in the table on the original slide — and the following background knowledge (given in the original slide as a figure).

a) You will be required to learn this concept by applying two different learning methods: the Induction of Decision Trees method and the Versions Space (candidate elimination) method. Do you expect to learn the same concept with each method or different concepts? Explain your prediction in detail (you will need to consider various aspects such as the instance space, the hypothesis space, and the method of learning).
b) Learn the concept represented by the above examples by applying:
   - the Induction of Decision Trees method;
   - the Versions Space method.
c) Explain the results obtained in b) and compare them with your predictions.
d) What will be the results of learning with the above two methods if only the first three examples are available?

Exercise

Consider the following positive and negative examples of a concept:

workstation  software        printer      class
maclc        macwrite        laserwriter  +  (e1)
sun          frame-maker     laserjet     +  (e2)
hp           accounting      laserwriter  -  (c1)
sgi          spreadsheet     proprinter   -  (c2)
macII        microsoft-word               +  (e3)

and the following background knowledge (given in the original slide as a figure).

a) Build two decision trees corresponding to the above examples. Indicate the concept represented by each decision tree. In principle, how many different decision trees could you build?
b) Learn the concept represented by the above examples by applying the Versions Space method. What is the learned concept if only the first four examples are available?
c) Compare and justify the obtained results.

Exercise

True or false: If decision tree D2 is an elaboration of D1 (according to ID3), then D1 is more general than D2.

Recommended reading

Mitchell T. M., Machine Learning, Chapter 3: Decision tree learning, pp. 52-80, McGraw Hill, 1997.

Quinlan J. R., Induction of decision trees, Machine Learning Journal, 1:81-106, 1986. Also in Shavlik J. and Dietterich T. (eds), Readings in Machine Learning, Morgan Kaufmann, 1990.

Barr A., Cohen P., and Feigenbaum E. (eds), The Handbook of Artificial Intelligence, vol. III, pp. 406-410, Morgan Kaufmann, 1982.

Edwards E., Information Transmission, Chapter 4: Uncertainty, pp. 28-39, Chapman and Hall, 1964.