COMP 578 Artificial Neural Networks for Data Mining
Keith C. C. Chan
Department of Computing
The Hong Kong Polytechnic University

Human vs. Computer
• Computers
  – Not good at tasks such as visual or audio processing and recognition.
  – Execute instructions one after another, extremely rapidly.
  – Good at serial activities (e.g., counting, adding).
• Human brain
  – Units respond at about 10/s (vs. PV 2.5 GHz).
  – Works on many different things at once.
  – Performs vision or speech recognition through the interaction of many different pieces of information.

The Brain
• The human brain is complicated and poorly understood.
• It contains approximately 10^10 basic units called neurons.
• Each neuron is connected to about 10,000 others.
[Figure: a neuron, showing dendrites, soma (cell body), axon, and synapse]

The Neuron
[Figure: dendrites, soma, axon, synapse]
• A neuron accepts many inputs (through its dendrites).
• The inputs are all added up in some fashion.
• If enough active inputs are received at once, the neuron is activated and “fires” (through its axon).

The Synapse
• The axon produces a voltage pulse called an action potential (AP).
• More than one AP must arrive to trigger the synapse.
• The synapse releases neurotransmitters when the AP is raised sufficiently.
• Neurotransmitters diffuse across the gap, chemically activating dendrites on the other side.
• Some synapses pass a large signal across, whilst others allow very little through.

Modeling the Single Neuron
• n inputs.
• The efficiency of the synapses is modeled by a multiplicative factor on each of the inputs to the neuron.
• Multiplicative factor = the weight associated with each input line.
• The neuron’s tasks:
  – Calculate the weighted sum of its inputs.
  – Compare the sum to some internal threshold.
  – Turn on if the threshold is exceeded.
[Figure: inputs x1, x2, …, xn with weights w1, w2, …, wn feeding a summation unit Σ that produces output y]

A Mathematical Model of Neurons
• The neuron computes the weighted sum of its inputs: $SUM = \sum_{i=1}^{n} w_i x_i$.
• It fires if SUM exceeds a threshold θ:
  – y = 1 if SUM > θ
  – y = 0 if SUM ≤ θ
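
The threshold model above can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_output` and the example weights and threshold are assumptions, not values from the slides.

```python
def neuron_output(inputs, weights, threshold):
    """Threshold neuron: return 1 if the weighted sum exceeds the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))  # SUM = sum_i w_i * x_i
    return 1 if total > threshold else 0


# Illustrative values (assumed): two inputs, equal weights, threshold 1.0.
print(neuron_output([1, 1], [0.6, 0.6], 1.0))  # weighted sum 1.2 exceeds 1.0
print(neuron_output([1, 0], [0.6, 0.6], 1.0))  # weighted sum 0.6 does not
```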

Learning in Simple Neurons
• We need to be able to determine the connection weights.
• Inspiration comes from looking at real neural systems:
  – Reinforce good behavior and reprimand bad.
  – E.g., train a NN to recognize two characters, H and F.
  – Output 1 when an H is presented and 0 when an F is presented.
  – If the NN produces an incorrect output, we want to reduce the chances of that happening again.
  – This is done by modifying the weights.

Learning in Simple Neurons (2)
• The neuron is given random initial weights.
  – In its starting state, the neuron knows nothing.
• Present an H.
  – The neuron computes the weighted sum of its inputs.
  – It compares the weighted sum with the threshold.
  – If the sum exceeds the threshold, it outputs a 1, otherwise a 0.
• If the output is 1, the neuron is correct.
  – Do nothing.
• Otherwise, if the neuron produces a 0:
  – Increase the weights so that next time the sum will exceed the threshold and the neuron will produce a 1.

A Simple Learning Rule
• By how much should the weights be increased?
• We can follow a simple rule:
  – Add the input values to the weights when we want the output to be on.
  – Subtract the input values from the weights when we want the output to be off.
• This learning rule is called the Hebb rule:
  – It is a variant of one proposed by Donald Hebb and is called Hebbian learning.
  – It is the earliest and simplest learning rule for a neuron.

The Hebb Net
• Step 0. Initialize all weights: wi = 0 (i = 1 to n).
• Step 1. For each input training record s and its target output t, do Steps 2-4.
  – Step 2. Set activations for all input units: xi = si (i = 1 to n).
  – Step 3. Set activation for the output unit: y = t.
  – Step 4. Adjust the weights and the bias:
    • wi(new) = wi(old) + xi y (i = 1 to n) (note: Δwi = xi y)
    • θ(new) = θ(old) + y
• The bias (the θ) is adjusted like a weight from a unit whose output signal is always 1.
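
A minimal Python sketch of Steps 0 to 4 follows. The function name `hebb_train` and the sample lists are illustrative assumptions; the two data sets are the AND function in bipolar and in binary form, chosen to show how the target encoding affects what the rule can learn.

```python
def hebb_train(samples):
    """One pass of the Hebb rule over (inputs, target) pairs; returns (weights, bias)."""
    n = len(samples[0][0])
    weights = [0] * n  # Step 0: all weights start at zero
    bias = 0           # the bias is treated as a weight on a constant input of 1
    for inputs, target in samples:                              # Step 1
        y = target                                              # Step 3: y = t
        weights = [w + x * y for w, x in zip(weights, inputs)]  # Step 4: w_i += x_i * y
        bias += y                                               # Step 4: bias += y
    return weights, bias


# AND function with bipolar inputs and targets.
bipolar_and = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
# AND function with binary inputs and targets.
binary_and = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0), ((0, 0), 0)]

print(hebb_train(bipolar_and))  # ([2, 2], -2)
print(hebb_train(binary_and))   # ([1, 1], 1)
```

With binary targets, every record whose target is 0 leaves the weights unchanged, so only the first record contributes.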

A Hebb Net Example

The Data Set
• Attributes
  – HS_Index: {Drop, Rise}
  – Trading_Vol: {Small, Medium, Large}
  – DJIA: {Drop, Rise}
• Class Label
  – Buy_Sell: {Buy, Sell}

The Data Set
Record  HS_Index  Trading_Vol  DJIA  Buy_Sell
1       Drop      Large        Drop  Buy
2       Rise      Large        Rise  Sell
3       Rise      Medium       Drop  Buy
4       Drop      Small        Drop  Sell
5       Rise      Small        Drop  Sell
6       Rise      Large        Drop  Buy
7       Rise      Small        Rise  Sell
8       Drop      Large        Rise  Sell

Transformation
• Input Features
  – HS_Index_Drop: {-1, 1}
  – HS_Index_Rise: {-1, 1}
  – Trading_Vol_Small: {-1, 1}
  – Trading_Vol_Medium: {-1, 1}
  – Trading_Vol_Large: {-1, 1}
  – DJIA_Drop: {-1, 1}
  – DJIA_Rise: {-1, 1}
  – Bias: {1}
• Output Feature
  – Buy_Sell: {-1, 1}
[Figure: single-layer network with input units (HSI=Drop, HSI=Rise, DJIA=Drop, DJIA=Rise, Bias) feeding the output unit B/S]

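Transformed Data
Record  Input Feature                   Output Feature
1       <1, -1, -1, -1, 1, 1, -1, 1>    <1>
2       <-1, 1, -1, -1, 1, -1, 1, 1>    <-1>
3       <-1, 1, -1, 1, -1, 1, -1, 1>    <1>
4       <1, -1, 1, -1, -1, 1, -1, 1>    <-1>
5       <-1, 1, 1, -1, -1, 1, -1, 1>    <-1>
6       <-1, 1, -1, -1, 1, 1, -1, 1>    <1>
7       <-1, 1, 1, -1, -1, -1, 1, 1>    <-1>
8       <1, -1, -1, -1, 1, -1, 1, 1>    <-1>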

Record 1
• Input Feature: <1, -1, -1, -1, 1, 1, -1, 1>
• Output Feature: <1>
• Original Weight: <0, 0, 0, 0, 0, 0, 0, 0>
• Weight Change: <1, -1, -1, -1, 1, 1, -1, 1>
• New Weight: <1, -1, -1, -1, 1, 1, -1, 1>

Record 2
• Input Feature: <-1, 1, -1, -1, 1, -1, 1, 1>
• Output Feature: <-1>
• Original Weight: <1, -1, -1, -1, 1, 1, -1, 1>
• Weight Change: <1, -1, 1, 1, -1, 1, -1, -1>
• New Weight: <2, -2, 0, 0, 0, 2, -2, 0>

Record 3
• Input Feature: <-1, 1, -1, 1, -1, 1, -1, 1>
• Output Feature: <1>
• Original Weight: <2, -2, 0, 0, 0, 2, -2, 0>
• Weight Change: <-1, 1, -1, 1, -1, 1, -1, 1>
• New Weight: <1, -1, -1, 1, -1, 3, -3, 1>

Record 4
• Input Feature: <1, -1, 1, -1, -1, 1, -1, 1>
• Output Feature: <-1>
• Original Weight: <1, -1, -1, 1, -1, 3, -3, 1>
• Weight Change: <-1, 1, -1, 1, 1, -1, 1, -1>
• New Weight: <0, 0, -2, 2, 0, 2, -2, 0>

Record 5
• Input Feature: <-1, 1, 1, -1, -1, 1, -1, 1>
• Output Feature: <-1>
• Original Weight: <0, 0, -2, 2, 0, 2, -2, 0>
• Weight Change: <1, -1, -1, 1, 1, -1, 1, -1>
• New Weight: <1, -1, -3, 3, 1, 1, -1, -1>

Record 6
• Input Feature: <-1, 1, -1, -1, 1, 1, -1, 1>
• Output Feature: <1>
• Original Weight: <1, -1, -3, 3, 1, 1, -1, -1>
• Weight Change: <-1, 1, -1, -1, 1, 1, -1, 1>
• New Weight: <0, 0, -4, 2, 2, 2, -2, 0>

Record 7
• Input Feature: <-1, 1, 1, -1, -1, -1, 1, 1>
• Output Feature: <-1>
• Original Weight: <0, 0, -4, 2, 2, 2, -2, 0>
• Weight Change: <1, -1, -1, 1, 1, 1, -1, -1>
• New Weight: <1, -1, -5, 3, 3, 3, -3, -1>

Record 8
• Input Feature: <1, -1, -1, -1, 1, -1, 1, 1>
• Output Feature: <-1>
• Original Weight: <1, -1, -5, 3, 3, 3, -3, -1>
• Weight Change: <-1, 1, 1, 1, -1, 1, -1, -1>
• New Weight: <0, 0, -4, 4, 2, 4, -4, -2>
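
The eight record updates can be replayed with a short Python sketch. The names `RECORDS`, `encode`, and `hebb_train` are illustrative assumptions; the encoding follows the bipolar transformation of the data set (seven indicator features plus a constant bias input of 1, with Buy = 1 and Sell = -1).

```python
RECORDS = [
    ("Drop", "Large", "Drop", "Buy"),
    ("Rise", "Large", "Rise", "Sell"),
    ("Rise", "Medium", "Drop", "Buy"),
    ("Drop", "Small", "Drop", "Sell"),
    ("Rise", "Small", "Drop", "Sell"),
    ("Rise", "Large", "Drop", "Buy"),
    ("Rise", "Small", "Rise", "Sell"),
    ("Drop", "Large", "Rise", "Sell"),
]


def encode(hs_index, volume, djia):
    """Bipolar indicator encoding (+1 / -1) followed by a bias input of 1."""
    flags = [
        hs_index == "Drop", hs_index == "Rise",
        volume == "Small", volume == "Medium", volume == "Large",
        djia == "Drop", djia == "Rise",
    ]
    return [1 if flag else -1 for flag in flags] + [1]


def hebb_train(records):
    """Apply the Hebb update w += x * t once per record, starting from zero weights."""
    weights = [0] * 8
    for hs_index, volume, djia, label in records:
        x = encode(hs_index, volume, djia)
        t = 1 if label == "Buy" else -1
        weights = [w + xi * t for w, xi in zip(weights, x)]
    return weights


print(hebb_train(RECORDS))  # [0, 0, -4, 4, 2, 4, -4, -2]
```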

A Hebb Net Example 2
Input (x1 x2 1)   Target
(1 1 1)           +1
(1 -1 1)          -1
(-1 1 1)          -1
(-1 -1 1)         -1

Presenting the first input:
Input (x1 x2 1)   Target   Weight Changes (Δw1 Δw2 Δb)   Weights (w1 w2 b)
                                                         (0 0 0)
(1 1 1)           1        (1 1 1)                       (1 1 1)
The separating line becomes x2 = -x1 - 1.

Presenting the second input:
Input (x1 x2 1)   Target   Weight Changes (Δw1 Δw2 Δb)   Weights (w1 w2 b)
                                                         (1 1 1)
(1 -1 1)          -1       (-1 1 -1)                     (0 2 0)
The separating line becomes x2 = 0.

Presenting the third input:
Input (x1 x2 1)   Target   Weight Changes (Δw1 Δw2 Δb)   Weights (w1 w2 b)
                                                         (0 2 0)
(-1 1 1)          -1       (1 -1 -1)                     (1 1 -1)
The separating line becomes x2 = -x1 + 1.

Presenting the fourth input:
Input (x1 x2 1)   Target   Weight Changes (Δw1 Δw2 Δb)   Weights (w1 w2 b)
                                                         (1 1 -1)
(-1 -1 1)         -1       (1 1 -1)                      (2 2 -2)
Even though the weights have changed, the separating line is still x2 = -x1 + 1. The graph of the decision regions (the positive response and the negative response) remains as shown.

A Hebb Net Example 3
Input (x1 x2 1)   Target
(1 1 1)           1
(1 0 1)           0
(0 1 1)           0
(0 0 1)           0

Presenting the first input:
Input (x1 x2 1)   Target   Weight Changes (Δw1 Δw2 Δb)   Weights (w1 w2 b)
                                                         (0 0 0)
(1 1 1)           1        (1 1 1)                       (1 1 1)
The separating line becomes x2 = -x1 - 1.

Presenting the second, third, and fourth inputs:
Input (x1 x2 1)   Target   Weight Changes (Δw1 Δw2 Δb)   Weights (w1 w2 b)
                                                         (1 1 1)
(1 0 1)           0        (0 0 0)                       (1 1 1)
(0 1 1)           0        (0 0 0)                       (1 1 1)
(0 0 1)           0        (0 0 0)                       (1 1 1)
Since the target value is 0, no learning occurs. Using binary target values prevents the net from learning any pattern for which the target is “off”.

Characteristics of the Hebb Net
• The choice of representation for the training records determines which problems can be solved.
• Training records corresponding to the AND function can be learned if the inputs and targets are in bipolar form.
• The bipolar representation allows a weight to be modified when the input and target are both “on” and also when they are both “off”.

The Perceptron Learning Rule
• More powerful than the Hebb rule.
• The Perceptron learning rule convergence theorem states that:
  – If weights exist that allow the neuron to respond correctly to all training patterns, then the rule will find such weights.
  – The neuron will find these weights in a finite number of training steps.
• Let SUM be the weighted sum. The output of the Perceptron, y = f(SUM), can be 1, 0, or -1.
• The activation function, with a fixed threshold θ, is:
  – f(SUM) = 1 if SUM > θ
  – f(SUM) = 0 if -θ ≤ SUM ≤ θ
  – f(SUM) = -1 if SUM < -θ

Perceptron Learning
• For each training record, the net calculates the response of the output unit.
• The net determines whether an error occurred for this pattern (by comparing the calculated value with the target value).
• If an error occurred, the weights are changed according to: wi(new) = wi(old) + α t xi, where t is +1 or -1 and α is the learning rate.
• If no error occurred, the weights are not changed.
• Training continues until no error occurs.

Perceptron for classification
• Step 0. Initialize all weights and the bias. (For simplicity, set weights and bias to zero.) Set the learning rate α (0 < α ≤ 1). (For simplicity, α can be set to 1.)
• Step 1. While the stopping condition is false, do Steps 2-6.
• Step 2. For each training pair, do Steps 3-5:
• Step 3. Set the activation of each input unit, xi.
• Step 4. Compute the response of the output unit: SUM = b + Σi xi wi, y' = f(SUM).
• Step 5. Update the weights and bias if an error occurred for this vector.
  – If y' ≠ t (the computed output differs from the target): wi(new) = wi(old) + α t xi, b(new) = b(old) + α t
  – else: wi(new) = wi(old), b(new) = b(old)
• Step 6. If no weights changed in Step 2, stop; else continue.
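
The training loop can be sketched in Python as follows. The names `activation` and `perceptron_train`, the `max_epochs` guard, and the default threshold of 0.2 are assumptions for illustration; the sample data is the AND function with binary inputs and bipolar targets.

```python
def activation(total, theta):
    """Three-valued response with an undecided band of half-width theta."""
    if total > theta:
        return 1
    if total < -theta:
        return -1
    return 0


def perceptron_train(samples, theta=0.2, rate=1, max_epochs=100):
    """Train until an epoch makes no weight change; returns (weights, bias, epochs)."""
    n = len(samples[0][0])
    weights = [0] * n  # Step 0: zero weights and bias
    bias = 0
    for epoch in range(1, max_epochs + 1):        # Step 1
        changed = False
        for inputs, target in samples:            # Step 2
            total = bias + sum(w * x for w, x in zip(weights, inputs))  # Step 4
            if activation(total, theta) != target:                      # Step 5
                weights = [w + rate * target * x for w, x in zip(weights, inputs)]
                bias += rate * target
                changed = True
        if not changed:                           # Step 6
            return weights, bias, epoch
    raise RuntimeError("no convergence within max_epochs")


and_samples = [((1, 1), 1), ((1, 0), -1), ((0, 1), -1), ((0, 0), -1)]
print(perceptron_train(and_samples))  # ([2, 3], -4, 10)
```

The epoch count includes the final pass in which no weight changes, which is what triggers the stopping condition.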

Perceptron for classification (2)
• Only weights connecting active input units (xi ≠ 0) are updated.
• Weights are updated only for patterns that do not produce the correct value of y.
• Less learning takes place as more training patterns produce the correct response.
• The threshold on the activation function for the response unit is a fixed, non-negative value θ.
• The form of the activation function for the output unit creates an undecided band, of fixed width determined by θ, separating the region of positive response from that of negative response.

Perceptron for classification (3)
• Instead of one separating line, we have a line separating the region of positive response from the region of zero response (the line bounding the inequality):
  – w1 x1 + w2 x2 + b > θ
• and a line separating the region of zero response from the region of negative response (the line bounding the inequality):
  – w1 x1 + w2 x2 + b < -θ
[Figure: the regions w1 x1 + w2 x2 + b > θ and w1 x1 + w2 x2 + b < -θ]

Perceptron

The Data Set (1)
• Attributes
  – HS_Index: {Drop, Rise}
  – Trading_Vol: {Small, Medium, Large}
  – DJIA: {Drop, Rise}
• Class Label
  – Buy_Sell: {Buy, Sell}

The Data Set (2)
Record  HS_Index  Trading_Vol  DJIA  Buy_Sell
1       Drop      Large        Drop  Buy
2       Rise      Large        Rise  Sell
3       Rise      Medium       Drop  Buy
4       Drop      Small        Drop  Sell
5       Rise      Small        Drop  Sell
6       Rise      Large        Drop  Buy
7       Rise      Small        Rise  Sell
8       Drop      Large        Rise  Sell

Transformation
• Input Features
  – HS_Index_Drop: {0, 1}
  – HS_Index_Rise: {0, 1}
  – Trading_Vol_Small: {0, 1}
  – Trading_Vol_Medium: {0, 1}
  – Trading_Vol_Large: {0, 1}
  – DJIA_Drop: {0, 1}
  – DJIA_Rise: {0, 1}
  – Bias: {1}
• Output Feature
  – Buy: 1
  – Sell: -1

Transformed Data
Record  Input Feature               Output Feature
1       <1, 0, 0, 0, 1, 1, 0, 1>    <1>
2       <0, 1, 0, 0, 1, 0, 1, 1>    <-1>
3       <0, 1, 0, 1, 0, 1, 0, 1>    <1>
4       <1, 0, 1, 0, 0, 1, 0, 1>    <-1>
5       <0, 1, 1, 0, 0, 1, 0, 1>    <-1>
6       <0, 1, 0, 0, 1, 1, 0, 1>    <1>
7       <0, 1, 1, 0, 0, 0, 1, 1>    <-1>
8       <1, 0, 0, 0, 1, 0, 1, 1>    <-1>

Record 1
• Input Feature: <1, 0, 0, 0, 1, 1, 0, 1>
• Output Feature: <1>
• Original Weight: <0, 0, 0, 0, 0, 0, 0, 0>
• Output: f(0) = 0
• Weight Change: <1, 0, 0, 0, 1, 1, 0, 1>
• New Weight: <1, 0, 0, 0, 1, 1, 0, 1>

Record 2
• Input Feature: <0, 1, 0, 0, 1, 0, 1, 1>
• Output Feature: <-1>
• Original Weight: <1, 0, 0, 0, 1, 1, 0, 1>
• Output: f(2) = 1
• Weight Change: <0, -1, 0, 0, -1, 0, -1, -1>
• New Weight: <1, -1, 0, 0, 0, 1, -1, 0>

Record 3
• Input Feature: <0, 1, 0, 1, 0, 1, 0, 1>
• Output Feature: <1>
• Original Weight: <1, -1, 0, 0, 0, 1, -1, 0>
• Output: f(0) = 0
• Weight Change: <0, 1, 0, 1, 0, 1, 0, 1>
• New Weight: <1, 0, 0, 1, 0, 2, -1, 1>

Record 4
• Input Feature: <1, 0, 1, 0, 0, 1, 0, 1>
• Output Feature: <-1>
• Original Weight: <1, 0, 0, 1, 0, 2, -1, 1>
• Output: f(4) = 1
• Weight Change: <-1, 0, -1, 0, 0, -1, 0, -1>
• New Weight: <0, 0, -1, 1, 0, 1, -1, 0>

Record 5
• Input Feature: <0, 1, 1, 0, 0, 1, 0, 1>
• Output Feature: <-1>
• Original Weight: <0, 0, -1, 1, 0, 1, -1, 0>
• Output: f(0) = 0
• Weight Change: <0, -1, -1, 0, 0, -1, 0, -1>
• New Weight: <0, -1, -2, 1, 0, 0, -1, -1>

Record 6
• Input Feature: <0, 1, 0, 0, 1, 1, 0, 1>
• Output Feature: <1>
• Original Weight: <0, -1, -2, 1, 0, 0, -1, -1>
• Output: f(-2) = -1
• Weight Change: <0, 1, 0, 0, 1, 1, 0, 1>
• New Weight: <0, 0, -2, 1, 1, 1, -1, 0>

Record 7
• Input Feature: <0, 1, 1, 0, 0, 0, 1, 1>
• Output Feature: <-1>
• Original Weight: <0, 0, -2, 1, 1, 1, -1, 0>
• Output: f(-3) = -1
• Weight Change: <0, 0, 0, 0, 0, 0, 0, 0>
• New Weight: <0, 0, -2, 1, 1, 1, -1, 0>

Record 8
• Input Feature: <1, 0, 0, 0, 1, 0, 1, 1>
• Output Feature: <-1>
• Original Weight: <0, 0, -2, 1, 1, 1, -1, 0>
• Output: f(0) = 0
• Weight Change: <-1, 0, 0, 0, -1, 0, -1, -1>
• New Weight: <-1, 0, -2, 1, 0, 1, -2, -1>
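
A single pass of the perceptron rule over the eight records can be replayed with the sketch below. The names `RECORDS`, `encode`, `activation`, and `perceptron_pass` are illustrative, and the threshold of 0.2 and learning rate of 1 are assumptions, since the slides do not state them for this data set; any threshold from 0 up to (but not including) 2 gives the same updates here.

```python
RECORDS = [
    ("Drop", "Large", "Drop", "Buy"),
    ("Rise", "Large", "Rise", "Sell"),
    ("Rise", "Medium", "Drop", "Buy"),
    ("Drop", "Small", "Drop", "Sell"),
    ("Rise", "Small", "Drop", "Sell"),
    ("Rise", "Large", "Drop", "Buy"),
    ("Rise", "Small", "Rise", "Sell"),
    ("Drop", "Large", "Rise", "Sell"),
]


def encode(hs_index, volume, djia):
    """Binary indicator encoding (1 / 0) followed by a bias input of 1."""
    flags = [
        hs_index == "Drop", hs_index == "Rise",
        volume == "Small", volume == "Medium", volume == "Large",
        djia == "Drop", djia == "Rise",
    ]
    return [1 if flag else 0 for flag in flags] + [1]


def activation(total, theta):
    """Three-valued response: 1 above theta, -1 below -theta, otherwise 0."""
    if total > theta:
        return 1
    if total < -theta:
        return -1
    return 0


def perceptron_pass(records, theta=0.2, rate=1):
    """One epoch of perceptron updates, starting from zero weights."""
    weights = [0] * 8
    for hs_index, volume, djia, label in records:
        x = encode(hs_index, volume, djia)
        t = 1 if label == "Buy" else -1
        total = sum(w * xi for w, xi in zip(weights, x))
        if activation(total, theta) != t:  # update only when the output is wrong
            weights = [w + rate * t * xi for w, xi in zip(weights, x)]
    return weights


print(perceptron_pass(RECORDS))  # [-1, 0, -2, 1, 0, 1, -2, -1]
```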

A Perceptron Example
Input (x1 x2 1)   Target
(1 1 1)           1
(1 0 1)           -1
(0 1 1)           -1
(0 0 1)           -1

The results for the first epoch are:
Input (x1 x2 1)   Net   Out   Target   Weight Changes   Weights (w1 w2 b)
                                                        (0 0 0)
(1 1 1)           0     0     1        (1 1 1)          (1 1 1)
(1 0 1)           2     1     -1       (-1 0 -1)        (0 1 0)
(0 1 1)           1     1     -1       (0 -1 -1)        (0 0 -1)
(0 0 1)           -1    -1    -1       (0 0 0)          (0 0 -1)
After the first input, the separating lines become x1 + x2 + 1 = .2 and x1 + x2 + 1 = -.2.
After the second input, the separating lines become x2 = .2 and x2 = -.2.

The results for the second epoch are:
Input (x1 x2 1)   Net   Out   Target   Weight Changes   Weights (w1 w2 b)
                                                        (0 0 -1)
(1 1 1)           -1    -1    1        (1 1 1)          (1 1 0)
(1 0 1)           1     1     -1       (-1 0 -1)        (0 1 -1)
(0 1 1)           0     0     -1       (0 -1 -1)        (0 0 -2)
(0 0 1)           -2    -1    -1       (0 0 0)          (0 0 -2)
After the first input, the separating lines become x1 + x2 = .2 and x1 + x2 = -.2.
After the second input, the separating lines become x2 - 1 = .2 and x2 - 1 = -.2.

The results for the third epoch are:
Input (x1 x2 1)   Net   Out   Target   Weight Changes   Weights (w1 w2 b)
                                                        (0 0 -2)
(1 1 1)           -2    -1    1        (1 1 1)          (1 1 -1)
(1 0 1)           0     0     -1       (-1 0 -1)        (0 1 -2)
(0 1 1)           -1    -1    -1       (0 0 0)          (0 1 -2)
(0 0 1)           -2    -1    -1       (0 0 0)          (0 1 -2)

The results for the fourth epoch are:
Input (x1 x2 1)   Net   Out   Target   Weight Changes   Weights (w1 w2 b)
(1 1 1)           -1    -1    1        (1 1 1)          (1 2 -1)
(1 0 1)           0     0     -1       (-1 0 -1)        (0 2 -2)
(0 1 1)           0     0     -1       (0 -1 -1)        (0 1 -3)
(0 0 1)           -3    -1    -1       (0 0 0)          (0 1 -3)

For the fifth epoch, we have:
(1 1 1)           -2    -1    1        (1 1 1)          (1 2 -2)
(1 0 1)           -1    -1    -1       (0 0 0)          (1 2 -2)
(0 1 1)           0     0     -1       (0 -1 -1)        (1 1 -3)
(0 0 1)           -3    -1    -1       (0 0 0)          (1 1 -3)

And for the sixth epoch:
(1 1 1)           -1    -1    1        (1 1 1)          (2 2 -2)
(1 0 1)           0     0     -1       (-1 0 -1)        (1 2 -3)
(0 1 1)           -1    -1    -1       (0 0 0)          (1 2 -3)
(0 0 1)           -3    -1    -1       (0 0 0)          (1 2 -3)

The results for the seventh epoch are:
Input (x1 x2 1)   Net   Out   Target   Weight Changes   Weights (w1 w2 b)
(1 1 1)           0     0     1        (1 1 1)          (2 3 -2)
(1 0 1)           0     0     -1       (-1 0 -1)        (1 3 -3)
(0 1 1)           0     0     -1       (0 -1 -1)        (1 2 -4)
(0 0 1)           -4    -1    -1       (0 0 0)          (1 2 -4)

The eighth epoch yields:
(1 1 1)           -1    -1    1        (1 1 1)          (2 3 -3)
(1 0 1)           -1    -1    -1       (0 0 0)          (2 3 -3)
(0 1 1)           0     0     -1       (0 -1 -1)        (2 2 -4)
(0 0 1)           -4    -1    -1       (0 0 0)          (2 2 -4)

And the ninth:
(1 1 1)           0     0     1        (1 1 1)          (3 3 -3)
(1 0 1)           0     0     -1       (-1 0 -1)        (2 3 -4)
(0 1 1)           -1    -1    -1       (0 0 0)          (2 3 -4)
(0 0 1)           -4    -1    -1       (0 0 0)          (2 3 -4)

Finally, the results for the tenth epoch are:
Input (x1 x2 1)   Net   Out   Target   Weight Changes   Weights (w1 w2 b)
(1 1 1)           1     1     1        (0 0 0)          (2 3 -4)
(1 0 1)           -2    -1    -1       (0 0 0)          (2 3 -4)
(0 1 1)           -1    -1    -1       (0 0 0)          (2 3 -4)
(0 0 1)           -4    -1    -1       (0 0 0)          (2 3 -4)
• The positive response is given by:
  – 2 x1 + 3 x2 - 4 > .2
• with boundary line
  – x2 = -2/3 x1 + 7/5
• The negative response is given by:
  – 2 x1 + 3 x2 - 4 < -.2
• with boundary line
  – x2 = -2/3 x1 + 19/15

The 2nd Perceptron Algorithm
Input (x1 x2 1)   Net   Out   Target   Weight Changes   Weights (w1 w2 b)
                                                        (0 0 0)
(1 1 1)           0     0     1        (1 1 1)          (1 1 1)
(1 -1 1)          1     1     -1       (-1 1 -1)        (0 2 0)
(-1 1 1)          2     1     -1       (1 -1 -1)        (1 1 -1)
(-1 -1 1)         -3    -1    -1       (0 0 0)          (1 1 -1)

In the second epoch of training, we have:
(1 1 1)           1     1     1        (0 0 0)          (1 1 -1)
(1 -1 1)          -1    -1    -1       (0 0 0)          (1 1 -1)
(-1 1 1)          -1    -1    -1       (0 0 0)          (1 1 -1)
(-1 -1 1)         -3    -1    -1       (0 0 0)          (1 1 -1)
Since all the Δw’s are 0 in epoch 2, the system was fully trained after the first epoch.

Limitations of Perceptrons
• A perceptron finds a straight line that separates the classes.
• It cannot learn exclusive-or (XOR) problems.
• Such patterns are not linearly separable.
• Little work was done on perceptrons after Minsky and Papert published their book in 1969.
• Rumelhart and McClelland produced an improvement in 1986.
  – They proposed some modern adaptations to the Perceptron, called the multilayer Perceptron.
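
The XOR limitation can be observed with a small experiment. This is a sketch under assumptions: the function name `converges`, the zero threshold, the learning rate of 1, and the epoch limit are all illustrative choices. Because no single line separates the XOR classes, every epoch contains at least one misclassified pattern, so training never reaches an error-free pass.

```python
def activation(total, theta=0.0):
    """Three-valued response: 1 above theta, -1 below -theta, otherwise 0."""
    if total > theta:
        return 1
    if total < -theta:
        return -1
    return 0


def converges(samples, max_epochs=1000):
    """Return True if perceptron training reaches an epoch with no errors."""
    w1 = w2 = b = 0
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), t in samples:
            if activation(b + w1 * x1 + w2 * x2) != t:
                w1, w2, b = w1 + t * x1, w2 + t * x2, b + t  # perceptron update
                errors += 1
        if errors == 0:
            return True
    return False


and_samples = [((1, 1), 1), ((1, -1), -1), ((-1, 1), -1), ((-1, -1), -1)]
xor_samples = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]

print(converges(and_samples))  # True
print(converges(xor_samples))  # False
```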

The Multilayer Perceptron
• Overcoming linear inseparability:
  – Use more perceptrons.
  – Set each one up to identify a small, linearly separable section of the inputs.
  – Combine their outputs in another perceptron.
• Each neuron still takes the weighted sum of its inputs, thresholds it, and outputs 1 or 0.
• But how can the network learn?

The Multilayer Perceptron (2)
• Perceptrons in the 2nd layer do not know which of the real inputs were on.
• A two-state output (on or off) gives no indication of how much to adjust the weights.
  – Some weighted inputs definitely turn a neuron on.
  – Some weighted inputs only just turn a neuron on and should not be altered to the same extent.
  – What should change to produce a better solution next time?
  – Which of the input weights should be increased and which should not?
  – We have no way of finding out (the credit assignment problem).

The Solution
• We need a non-binary thresholding function.
• Use a slightly different nonlinearity, so that the unit more or less turns on or off.
• One possible new thresholding function is the sigmoid function.
• A sigmoid thresholding function does not mask the inputs from the outputs.
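
A short Python sketch contrasts the hard step with the sigmoid. It assumes the standard logistic form f(x) = 1 / (1 + e^(-x)), whose derivative is f(x)(1 - f(x)); the function names are illustrative.

```python
import math


def step(x, threshold=0.0):
    """Hard threshold: the output jumps from 0 to 1 and hides how close x was."""
    return 1 if x > threshold else 0


def sigmoid(x):
    """Logistic sigmoid: a smooth, graded version of the step."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Slope of the sigmoid, f(x) * (1 - f(x)), largest near x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-2.0, -0.1, 0.1, 2.0):
    print(x, step(x), round(sigmoid(x), 3), round(sigmoid_derivative(x), 3))
```

Inputs of 0.1 and 2.0 both give a step output of 1, while the sigmoid distinguishes them, which is what lets a learning rule decide how much to adjust each weight.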

The Multi-layer Perceptron
• An input layer, an output layer, and a hidden layer.
• Each unit in the hidden and output layers is like a perceptron unit.
• But the thresholding function is a sigmoid.
• Units in the input layer serve to distribute the values they receive to the next layer.
• Input units do not perform a weighted sum or threshold.

The Backpropagation Rule
• The single-layer perceptron model is changed:
  – The thresholding function changes from a step to a sigmoid function.
  – A hidden layer is added.
  – The learning rule needs to be altered.
• The new learning rule for the multilayer perceptron is called the “generalized delta rule”, or the “backpropagation rule”.
  – Show the NN a pattern and calculate its response.
  – Compare the response with the desired response.
  – Alter the weights so that the NN can produce a more accurate output next time.
  – The learning rule provides the method for adjusting the weights so as to decrease the error next time.

Backpropagation Details
• Define an error function to represent the difference between the NN's current output and the correct output.
• The backpropagation rule aims to reduce the error by:
  – Calculating the value of the error for a particular input.
  – Back-propagating the error from one layer to the previous one.
  – Adjusting the weights of each unit in the net so that the value of the error function is reduced.
• For units in the output layer:
  – Their output and desired output are known, so adjusting the weights is relatively simple.
• For units in the middle:
  – Those connected to outputs with a large error should have their weights adjusted a lot.
  – Those that feed almost correct outputs should not be altered much.

The Detailed Algorithm
• Step 0. Initialize weights (set to small random values).
• Step 1. While the stopping condition is false, do Steps 2-9.
  – Step 2. For each training pair, do Steps 3-8.
Feedforward:
• Step 3. Each input unit (Xi, i = 1, …, n) receives input signal xi and broadcasts this signal to all units in the layer above (the hidden units).
• Step 4. Each hidden unit (Zj, j = 1, …, p) sums its weighted input signals, $z\_in_j = v_{0j} + \sum_{i=1}^{n} x_i v_{ij}$,
  – applies its activation function to compute its output signal, $z_j = f(z\_in_j)$,
  – and sends this signal to all units in the layer above (the output units).
• Step 5. Each output unit (Yk, k = 1, …, m) sums its weighted input signals, $y\_in_k = w_{0k} + \sum_{j=1}^{p} z_j w_{jk}$,
  – and applies its activation function to compute its output signal, $y_k = f(y\_in_k)$.

The Detailed Algorithm (2)
Feedbackward (backpropagation of error):
• Step 6. Each output unit (Yk, k = 1, …, m) receives a target pattern corresponding to the input training pattern and computes its error information term, $\delta_k = (t_k - y_k) f'(y\_in_k)$,
  – calculates its weight correction term (used to update wjk later), $\Delta w_{jk} = \alpha \delta_k z_j$,
  – calculates its bias correction term (used to update w0k later), $\Delta w_{0k} = \alpha \delta_k$,
  – and sends δk to units in the layer below.
• Step 7. Each hidden unit (Zj, j = 1, …, p) sums its delta inputs (from units in the layer above), $\delta\_in_j = \sum_{k=1}^{m} \delta_k w_{jk}$,
  – multiplies by the derivative of its activation function to calculate its error information term, $\delta_j = \delta\_in_j f'(z\_in_j)$,
  – calculates its weight correction term (used to update vij later), $\Delta v_{ij} = \alpha \delta_j x_i$,
  – and calculates its bias correction term (used to update v0j later), $\Delta v_{0j} = \alpha \delta_j$.

The Detailed Algorithm (3)
Update weights and biases:
• Step 8. Each output unit (Yk, k = 1, …, m) updates its bias and weights (j = 0, …, p): wjk(new) = wjk(old) + Δwjk.
  – Each hidden unit (Zj, j = 1, …, p) updates its bias and weights (i = 0, …, n): vij(new) = vij(old) + Δvij.
• Step 9. Test the stopping condition.

An Example: Multilayer Perceptron Network with Backpropagation Training
[Figure: multilayer network with input units HSI=Rise, Vol=High, and DJIA=Drop]

Initial Weights and Bias Values
• wij = weight between nodes i and j.
• θi = bias value of node i.
• For node 4,
  – w14 = 0.2, w24 = 0.4, w34 = -0.5, θ4 = -0.4
• For node 5,
  – w15 = -0.3, w25 = 0.1, w35 = 0.2, θ5 = 0.2
• For node 6,
  – w16 = 0.6, w26 = 0.7, w36 = -0.1, θ6 = 0.1
• For node 7,
  – w47 = -0.3, w57 = -0.2, w67 = 0.1, θ7 = 0.6
• For node 8,
  – w48 = -0.5, w58 = 0.1, w68 = -0.3, θ8 = 0.3

Training (1) • • Learning Rate = 0. 9 Input: <1, 0, 1> Output: <1, 0> For node 4, – Input: 0. 2 + 0 – 0. 5 – 0. 4 = – 0. 7 – Output: 1 / (1 + e 0. 7) = 0. 332 • For node 5, – Input: – 0. 3 + 0. 2 = 0. 1 – Output: 1 / (1 + e – 0. 1) = 0. 525 • For node 6, – Input: 0. 6 + 0 – 0. 1 + 0. 1 = 0. 6 – Output: 1 / (1 + e – 0. 6) = 0. 646 • For node 7, – Input: 0. 332 * (– 0. 3) + 0. 525 * (– 0. 2) + 0. 646 * 0. 1 + 0. 6 = 0. 460 – Output: 1 / (1 + e 0. 460) = 0. 613 • For node 8, – Input: 0. 322 * (– 0. 5) + 0. 525 * 0. 1 + 0. 646 * (– 0. 3) + 0. 3 = – 0. 007 – Output: 1 / (1 + e – 0. 007) = 0. 498 76

Training (2)
• For node 7,
  – Error: 0.613 (1 - 0.613) (1 - 0.613) = 0.092
• For node 8,
  – Error: 0.498 (1 - 0.498) (0 - 0.498) = -0.125
• For node 4,
  – Error: 0.332 (1 - 0.332) (0.092 * (-0.3) + (-0.125) * (-0.5)) = 0.008
• For node 5,
  – Error: 0.525 (1 - 0.525) (0.092 * (-0.2) + (-0.125) * 0.1) = -0.008
• For node 6,
  – Error: 0.646 (1 - 0.646) (0.092 * 0.1 + (-0.125) * (-0.3)) = 0.011

Training (3)
• For each weight (the input to node 1 is x1 = 1),
  – w14 = 0.2 + 0.9 (0.008) (1) = 0.207
  – w15 = -0.3 + 0.9 (-0.008) (1) = -0.307
  – …
• For each bias,
  – θ4 = -0.4 + 0.9 (0.008) = -0.393
  – θ5 = 0.2 + 0.9 (-0.008) = 0.193
  – …
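
The training step can be recomputed from the initial weights and biases with the sketch below. It assumes the network layout implied by the weight names (inputs 1 to 3, hidden nodes 4 to 6, output nodes 7 and 8), a sigmoid activation, and the update rule Δw_ij = rate × error_j × output_i; the dictionary layout and variable names are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# weights[(i, j)] is the weight from node i to node j; bias[j] is the bias of node j.
weights = {
    (1, 4): 0.2, (2, 4): 0.4, (3, 4): -0.5,
    (1, 5): -0.3, (2, 5): 0.1, (3, 5): 0.2,
    (1, 6): 0.6, (2, 6): 0.7, (3, 6): -0.1,
    (4, 7): -0.3, (5, 7): -0.2, (6, 7): 0.1,
    (4, 8): -0.5, (5, 8): 0.1, (6, 8): -0.3,
}
bias = {4: -0.4, 5: 0.2, 6: 0.1, 7: 0.6, 8: 0.3}
input_nodes, hidden_nodes, output_nodes = [1, 2, 3], [4, 5, 6], [7, 8]
rate = 0.9

out = {1: 1.0, 2: 0.0, 3: 1.0}  # input pattern <1, 0, 1>
targets = {7: 1.0, 8: 0.0}      # target output <1, 0>

# Feedforward: hidden layer first, then output layer.
for layer, previous in ((hidden_nodes, input_nodes), (output_nodes, hidden_nodes)):
    for j in layer:
        net = bias[j] + sum(weights[(i, j)] * out[i] for i in previous)
        out[j] = sigmoid(net)

# Backpropagation of error: output nodes, then hidden nodes.
delta = {}
for k in output_nodes:
    delta[k] = out[k] * (1 - out[k]) * (targets[k] - out[k])
for j in hidden_nodes:
    delta[j] = out[j] * (1 - out[j]) * sum(delta[k] * weights[(j, k)] for k in output_nodes)

# Weight and bias updates.
new_weights = {(i, j): w + rate * delta[j] * out[i] for (i, j), w in weights.items()}
new_bias = {j: b + rate * delta[j] for j, b in bias.items()}

for node in hidden_nodes + output_nodes:
    print(node, round(out[node], 3), round(delta[node], 3), round(new_bias[node], 3))
```

Unlike the hand calculation, the code carries unrounded intermediate values, so the last printed digit can differ slightly from results obtained with rounded intermediates.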

Using ANN for Data Mining
• Constructing a network
  – input data representation
  – selection of the number of layers and the number of nodes in each layer
• Training the network using training data
• Pruning the network
• Interpreting the results

Step 1: Constructing the Network
• Multi-layer perceptron (MLP): feedforward, trained with backpropagation.
[Figure: MLP with inputs x1 (# of Terms), x2 (GPA), x3 (Demographics), x4 (Courses), x5 (Fin Aid), …, xn; weights w1 … wn; outputs o1 (Persist) and o2 (Not-persist)]

Constructing the Network (2)
• The number of input nodes corresponds to the dimensionality of the input tuples.
  – Thermometer coding:
    • age 20-80: 6 intervals
    • [20, 30) → 000001, [30, 40) → 000011, …, [70, 80) → 111111
• Number of hidden nodes: adjusted during training.
• Number of output nodes: the number of classes.
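
Thermometer coding of the age attribute might be implemented as in the sketch below. The function name `thermometer` and its parameter names are assumptions; the range 20 to 80 and the six intervals come from the slide.

```python
def thermometer(value, low=20, high=80, bins=6):
    """Encode value as a bit string with one more trailing 1 for each higher interval."""
    if not low <= value < high:
        raise ValueError("value outside the coded range")
    width = (high - low) / bins
    index = int((value - low) // width)  # 0 for [20, 30), 1 for [30, 40), ...
    ones = index + 1
    return "0" * (bins - ones) + "1" * ones


print(thermometer(25))  # 000001
print(thermometer(35))  # 000011
print(thermometer(79))  # 111111
```

Each bit becomes one input node, so nearby ages share most of their bits, unlike a one-of-n encoding.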

ANN vs. Others for Data Mining
• Advantages
  – Prediction accuracy is generally high.
  – Robust: works when training examples contain errors.
  – Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes.
  – Fast evaluation of the learned target function.
• Criticism
  – Long training time.
  – Difficult to understand the learned function (weights).
  – Not easy to incorporate domain knowledge.