- Number of slides: 47

Decisions under uncertainty
Reading: Ch. 16, AIMA 2nd Ed.
Rutgers CS 440, Fall 2003

Outline
• Decisions, preferences, utility functions
• Influence diagrams
• Value of information

Decision making
• A decision is an irrevocable allocation of domain resources
• Decisions should be made so as to maximize expected utility
• Questions:
– Why make decisions based on average or expected utility?
– Why can one assume that utility functions exist?
– Can an agent act rationally by expressing preferences between states without giving them numeric values?
– Can every preference structure be captured by assigning a single number to every state?

Simple decision problem
• Party decision problem: inside or outside?
– Action OUT: state Dry → Perfect!; state Wet → Disaster
– Action IN: state Wet → Relief; state Dry → Regret

Value function
• Numerical score over all possible states of the world
Action   Weather   Value
OUT      Dry       $100
IN       Wet       $60
IN       Dry       $50
OUT      Wet       $0

Preferences
• Agent chooses among prizes (A, B, …) and lotteries (situations with uncertain prizes)
– L1 = (.2, $40,000; .8, $0)
– L2 = (.25, $30,000; .75, $0)
• Notation:
– A ≻ B: A is preferred to B
– A ≺ B: B is preferred to A
– A ~ B: indifference between A & B

Desired properties for preferences over lotteries
• If $100 is preferred over $0 and p < q, then L2 should be preferred to L1:
L1 = ( p, $100; 1-p, $0 ) ≺ L2 = ( q, $100; 1-q, $0 )

Properties of (rational) preference
• Lead to rational agent behavior
1. Orderability: (A ≻ B) ∨ (B ≻ A) ∨ (A ~ B)
2. Transitivity: (A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C)
3. Continuity: A ≻ B ≻ C ⇒ ∃p, ( p, A; (1-p), C ) ~ B
4. Substitutability: A ~ B ⇒ ∀p, ( p, A; (1-p), C ) ~ ( p, B; (1-p), C )
5. Monotonicity: A ≻ B ⇒ ( p > q ⇔ ( p, A; (1-p), B ) ≻ ( q, A; (1-q), B ) )

Preference & expected utility
• The properties of preference imply the existence (Ramsey 1931, von Neumann & Morgenstern 1944) of a utility function U such that, for L1 = ( p, $100; 1-p, $0 ) and L2 = ( q, $100; 1-q, $0 ),
L1 ≻ L2 IFF q U($100) + (1-q) U($0) < p U($100) + (1-p) U($0)
• The left-hand side is the expected utility of L2, EU(L2); the right-hand side is the expected utility of L1, EU(L1)

Properties of utility
• Utility is a function that maps states to real numbers
• Standard approach to assessing utilities of states:
1. Compare state A to a standard lottery L = ( p, Ubest; 1-p, Uworst ), where Ubest is the best possible event and Uworst is the worst possible event
2. Adjust p until A ~ L
• Example: $30 ~ ( 0.999999, continue as before; 0.000001, instant death )

Utility scales
• Normalized utilities: Ubest = 1.0, Uworst = 0.0
• Micromorts: one-millionth chance of death – useful for Russian roulette, paying to reduce product risks, etc.
• QALYs: quality-adjusted life years – useful for medical decisions involving substantial risk
• Note: behavior is invariant w.r.t. positive linear transformation U’(s) = A U(s) + B, A > 0

Utility vs. money
• Utility is NOT monetary payoff
– L1 = (.8, $40,000; .2, $0), EMV(L1) = $32,000
– L2 = (1, $30,000; 0, $0), EMV(L2) = $30,000
– EMV(L1) > EMV(L2)
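The expected monetary value (EMV) figures on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `emv` and the list-of-pairs representation of a lottery are illustrative assumptions, not part of the slides.

```python
# A lottery is represented (by assumption) as a list of (probability, prize) pairs.
def emv(lottery):
    """Expected monetary value: probability-weighted sum of the prizes."""
    return sum(p * prize for p, prize in lottery)

L1 = [(0.8, 40_000), (0.2, 0)]   # risky lottery from the slide
L2 = [(1.0, 30_000), (0.0, 0)]   # sure $30,000

emv_l1 = emv(L1)
emv_l2 = emv(L2)
print(emv_l1, emv_l2, emv_l1 > emv_l2)
```

An agent that ranked lotteries by EMV alone would always pick L1 here; the slide's point is that utility need not follow EMV.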

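Attitudes toward risk
• Plot of U( $reward ) against $ reward, with lottery L = ( 0.5, $1000; 0.5, $0 )
• For a risk-averse agent, U( $500 ) > U( L ): the sure expected monetary value is preferred to the lottery
• Certain monetary equivalent of L: $400; the gap between it and the $500 expected monetary value is the insurance (risk) premium
• U concave – risk averse
• U linear – risk neutral
• U convex – risk seeking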

Human judgment under uncertainty
• Is decision theory compatible with human judgment under uncertainty?
• Are people “experts” in reasoning under uncertainty? How well do they perform? What kind of heuristics do they use?
• Choice 1: (.2, $40,000; .8, $0) vs. (.25, $30,000; .75, $0). Preferring the first implies .2 U($40k) > .25 U($30k), i.e., .8 U($40k) > U($30k)
• Choice 2: (.8, $40,000; .2, $0) vs. (1, $30,000; 0, $0). Preferring the second implies .8 U($40k) < U($30k)
• The two preferences cannot both hold for a single utility function (taking U($0) = 0)

Student group utility
• For each $ amount, adjust p until half the class votes for lottery ($10000)

Technology forecasting
• “I think there is a world market for about five computers.” - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1943
• “There doesn't seem to be any real limit to the growth of the computer industry.” - Thomas J. Watson, Sr., Chairman of the Board of IBM, 1968

Maximizing expected utility
• P(Dry) = 0.7, P(Wet) = 0.3
Action   State   Value   Utility
IN       Dry     $50     U($50) = 0.632
IN       Wet     $60     U($60) = 0.699
OUT      Dry     $100    U($100) = 0.865
OUT      Wet     $0      U($0) = 0
• EU(IN) = 0.7 * 0.632 + 0.3 * 0.699 = 0.6521
• EU(OUT) = 0.7 * 0.865 + 0.3 * 0 = 0.6055
• Action* = arg MEU(IN, OUT) = arg max{ EU(IN), EU(OUT) } = IN
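The expected-utility comparison on this slide can be reproduced directly. The sketch below uses the slide's probabilities and utilities; the dictionary layout and the names `utility`, `p_weather`, and `expected_utility` are illustrative assumptions.

```python
# Utilities of each (action, weather) outcome, as listed on the slide.
utility = {
    ("IN", "Dry"): 0.632,    # U($50)
    ("IN", "Wet"): 0.699,    # U($60)
    ("OUT", "Dry"): 0.865,   # U($100)
    ("OUT", "Wet"): 0.0,     # U($0)
}
p_weather = {"Dry": 0.7, "Wet": 0.3}

def expected_utility(action):
    """Probability-weighted utility of one action over the weather states."""
    return sum(p * utility[(action, w)] for w, p in p_weather.items())

eu = {action: expected_utility(action) for action in ("IN", "OUT")}
best_action = max(eu, key=eu.get)   # arg max over actions
print(eu, best_action)
```

Note that ranking by monetary value alone would favor OUT (0.7 * $100 = $70 against 0.7 * $50 + 0.3 * $60 = $53), so the choice of IN comes from the shape of the utility function.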

Multi-attribute utilities
• Many aspects of an outcome combine to determine our preferences:
– Vacation planning: cost, flying time, beach quality, food quality, etc.
– Medical decision making: risk of death (micromort), quality of life (QALY), cost of treatment, etc.
• For rational decision making, all relevant factors must be combined into a single utility function: U(a, b, c, …) = f[ f1(a), f2(b), … ], where f is a simple function such as addition
• f = + in the case of mutual preference independence, which occurs when it is always preferable to increase the value of an attribute given that all other attributes are fixed

Decision graphs / Influence diagrams
• Diagram nodes: earthquake, burglary, alarm, newscast, call, goods recovered, miss meeting
• Action node: go home?
• Utility node: Utility

Optimal policy
• Choose action given evidence: MEU( go home | call )
Call?   EU( Go home )   EU( Stay )
Yes     ?               ?
No      ?               ?

Optimal policy
• Choose action given evidence: MEU( go home | call )
Call?   EU( Go home )   EU( Stay )   MEU( Call )   Optimal action
Yes     37              13           37            A*(Call=Yes) = Go home
No      53              83           83            A*(Call=No) = Stay
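Reading the policy off the table is an arg max per evidence value. A minimal sketch, taking the expected utilities from the slide as given (the inference over the influence diagram that produces them is not shown, and the names `eu_table`, `policy`, and `meu` are illustrative):

```python
# EU of each action for each value of the evidence variable Call (from the slide).
eu_table = {
    "Yes": {"Go home": 37, "Stay": 13},
    "No":  {"Go home": 53, "Stay": 83},
}

# Optimal policy: for each evidence value, the action with the highest EU.
policy = {call: max(eus, key=eus.get) for call, eus in eu_table.items()}
# Maximum expected utility attained under that policy.
meu = {call: max(eus.values()) for call, eus in eu_table.items()}
print(policy, meu)
```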

Value of information
• What is it worth to get another piece of information?
• What is the increase in (maximized) expected utility if I make a decision with an additional piece of information?
• Additional information (if free) cannot make you worse off.
• There is no value of information if you will not change your decision.

Optimal policy with additional evidence
• How much better can we do if we have evidence about newscast?
• (Should we ask for evidence about newscast?)

Optimal policy with additional evidence
Call   Newscast   Go home?   EU( Go home ) / EU( Stay )
Yes    Quake      NO         44 / 45
Yes    No         YES        35 / 6
No     Quake      NO         51 / 80
No     No         NO         52 / 84

Value of perfect information
• The general case: we assume that exact evidence can be obtained about the value of some random variable Ej.
• The agent's current knowledge is E.
• The value of the current best action α is defined by (as in AIMA, Ch. 16): EU(α | E) = max_A Σ_i U(Result_i(A)) P(Result_i(A) | Do(A), E)
• With the new evidence Ej, the value of the new best action α_Ej will be: EU(α_Ej | E, Ej) = max_A Σ_i U(Result_i(A)) P(Result_i(A) | Do(A), E, Ej)

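VPI (cont’d)
• However, we do not have this new evidence in hand. Hence, we can only say what we expect the expected utility given Ej to be, averaging over its possible values e_jk: Σ_k P(Ej = e_jk | E) EU(α_e_jk | E, Ej = e_jk)
• The value of perfect information of Ej is then: VPI_E(Ej) = ( Σ_k P(Ej = e_jk | E) EU(α_e_jk | E, Ej = e_jk) ) - EU(α | E)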

Properties of VPI
1. Nonnegative: ∀E, E1: VPI_E(E1) ≥ 0
2. Non-additive (in general): VPI_E(E1, E2) ≠ VPI_E(E1) + VPI_E(E2)
3. Order-invariant: VPI_E(E1, E2) = VPI_E(E1) + VPI_{E,E1}(E2) = VPI_E(E2) + VPI_{E,E2}(E1)

Example
• What is the value of information of Newscast?

Example (cont’d)
Call?   MEU( Call )
Yes     36.74
No      83.23
Call?   Newscast   MEU( Call, Newscast )
Yes     Quake      45.20
Yes     NoQuake    35.16
No      Quake      80.89
No      NoQuake    83.39
Call?   P( Newscast=Quake | Call )   P( Newscast=NoQuake | Call )
Yes     .1794                        .8206
No      .0453                        .9547
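The tables on this slide contain everything needed to evaluate the value of perfect information about Newscast for each value of Call: the probability-weighted average of MEU( Call, Newscast ) minus MEU( Call ). The sketch below does this with the rounded numbers from the slide; the variable and function names are illustrative assumptions.

```python
# MEU given only Call (first table on the slide).
meu_call = {"Yes": 36.74, "No": 83.23}

# MEU given Call and Newscast (second table).
meu_call_news = {
    ("Yes", "Quake"): 45.20, ("Yes", "NoQuake"): 35.16,
    ("No", "Quake"): 80.89,  ("No", "NoQuake"): 83.39,
}

# P(Newscast | Call) (third table).
p_news_given_call = {
    "Yes": {"Quake": 0.1794, "NoQuake": 0.8206},
    "No":  {"Quake": 0.0453, "NoQuake": 0.9547},
}

def vpi_newscast(call):
    """Expected MEU after observing Newscast, minus MEU without it."""
    expected_meu = sum(
        p * meu_call_news[(call, news)]
        for news, p in p_news_given_call[call].items()
    )
    return expected_meu - meu_call[call]

vpi = {call: vpi_newscast(call) for call in meu_call}
print(vpi)
```

With these rounded inputs the result is about 0.22 when Call=Yes and about 0.05 when Call=No. Both are nonnegative, as the VPI properties require, and both are small: the newscast changes the optimal action only in the Call=Yes, Quake case, and only by a narrow margin.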

Sequential decisions
• So far, decisions in static situations. But most situations are dynamic!
– If I don’t attend CS 440 today, will I be kicked out of the class?
– If I don’t attend CS 440 today, will I be better off in the future?
• Action A: Attend / Do not attend
• State S: Professor hates me / Professor does not care, with P(S | A)
• Evidence E: Professor looks upset / not upset
• Utility U: Probability of being expelled from class, U( S )

Sequential decisions
• Extend static structure over time – just like an HMM, with decisions and utilities.
• One small caveat: a different representation is slightly better…
• Diagram: time slices t = 0, 1, 2, … with nodes At (action), St (state), Et (evidence), Ut (utility)

Partially observable Markov decision processes (POMDPs)
• Actions at time t should impact state at t+1
• Use rewards (R) instead of utilities (U)
• Actions directly determine rewards
• Diagram: time slices t = 0, 1, 2, … with nodes At, St, Et, Rt; edges labeled P(St | At-1), P(St | St-1), P(Et | St), R(St)

POMDP problems
• Objective: find a sequence of actions that takes one from an initial state to a final state while maximizing some notion of total/future “reward”.
• E.g.: find a sequence of actions that takes a car from point A to point B while minimizing time and consumed fuel.

Example (POMDPs)
• Optimal dialog modeling: e.g., automated airline reservation system
– Actions: system prompts such as “How may I help you?”, “Please specify your favorite airline”, “Where are you leaving from?”, “Do you mind leaving at a different time?”, …
– States: (Origin, Destination, Airline, Flight#, Departure, Arrival, …)
• Reference: “A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies”, Levin, Pieraccini, and Eckert, IEEE TSAP, 2000

Example #2 (POMDP)
• Optimal control
– States/actions are continuous
– Objective: design optimal control laws for guiding objects from start position to goal position (e.g., lunar lander)
– Actions: engine thrust, robot arm torques, …
– States: positions, velocities of objects/robotic arm joints, …
– Reward: usually specified in terms of cost (reward^-1): cost of fuel, battery charge loss, energy loss, …

Markov decision processes (MDPs)
• Let’s make life a bit simpler – assume we know the state of the world exactly
• Diagram: time slices t = 0, 1, 2, … with nodes At, St, Rt (no evidence nodes); edges labeled P(St | At-1), P(St | St-1), R(St)

Examples
• Blackjack game
– Objective: have your card sum be greater than the dealer’s without exceeding 21.
– States (200 of them): current sum (12-21), dealer’s showing card (ace-10), do I have a usable ace?
– Reward: +1 for winning, 0 for a draw, -1 for losing
– Actions: stick (stop receiving cards), hit (receive another card)

MDP fundamentals
• We mentioned (in POMDPs) that the goal is to select actions that maximize “reward”
• What “reward”?
– Immediate? at* = arg max E[ R(st+1) ]
– Cumulative? at* = arg max E[ R(st+1) + R(st+2) + R(st+3) + … ]
– Discounted? at* = arg max E[ R(st+1) + γ R(st+2) + γ^2 R(st+3) + … ]
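The three notions of reward differ only in how future terms are weighted. A small sketch, using a made-up reward sequence (hypothetical, not from the slides) and an assumed helper name `discounted_return`:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite reward sequence (k starts at 0)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical rewards R(s_{t+1}), R(s_{t+2}), R(s_{t+3}) along one trajectory.
rewards = [1.0, 1.0, 1.0]

immediate = rewards[0]                          # only the next reward counts
cumulative = discounted_return(rewards, 1.0)    # gamma = 1: plain sum
discounted = discounted_return(rewards, 0.9)    # gamma = 0.9: later rewards count less
print(immediate, cumulative, discounted)
```

With a discount factor below 1 the infinite sum stays finite for bounded rewards, which is one reason the discounted form is the one used in the rest of the deck.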

Utility & utility maximization in MDPs
• Assume we are in state st and want to find the best sequence of future actions that will maximize discounted reward from st on.
• Diagram: actions at, at+1, at+2, …; states st, st+1, st+2, …; rewards Rt, Rt+1, Rt+2, …

Utility & utility maximization in MDPs
• Assume we are in state st and want to find the best sequence of future actions that will maximize discounted reward from st on.
• Convert it into a “simplified” model by compounding states, actions, rewards
• Diagram: compound action A, state st, and utility U (maximum expected utility)

Bellman equations & value iteration algorithm
• Do we need to search over all |A|^T action sequences, T → ∞?
• No! It is enough to choose the best immediate action in each state and to compute utilities with the Bellman update
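The equations that accompanied this slide did not survive extraction. The following is a reconstruction, not the slide's own typesetting: it is written to agree with the worked example later in the deck (U(L) = 1 + 0.9 max{ … }) and uses standard notation as an assumption, with R(s) the reward in state s, γ the discount factor, and P(s' | s, a) the transition model. The three lines are the Bellman equation, the best immediate action, and the Bellman update used by value iteration.

```latex
\begin{align}
U(s) &= R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s') \\
a^{*}(s) &= \arg\max_{a} \sum_{s'} P(s' \mid s, a)\, U(s') \\
U_{k+1}(s) &\leftarrow R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U_{k}(s')
\end{align}
```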

Proof of Bellman update equation

Example
• “If I don’t attend CS 440 today, will I be better off in the future?”
• Actions: Attend / Don’t attend
• States: Learned topic / Did not learn topic
• Reward: +1 Learned, -1 Did not learn
• Discount factor: γ = 0.9
• Transition probabilities (columns: action and current state; rows: next state):
                     Attend (A)                          Do not attend (NA)
                     Learned (L)   Did not learn (NL)    Learned (L)   Did not learn (NL)
Learned (L)          0.9           0.5                   0.6           0.2
Did not learn (NL)   0.1           0.5                   0.4           0.8
• U(L) = 1 + 0.9 max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
• U(NL) = -1 + 0.9 max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }

Computing MDP state utilities: value iteration
• How can one solve for U(L) and U(NL) in the previous example?
• Answer: value-iteration algorithm. Start with some initial utility U(L), U(NL), then iterate
• Plot: U(L) and U(NL) against iteration number (up to 50); vertical axis from -1 to 8
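Value iteration for the attendance example can be sketched as follows. The rewards, discount factor, and transition probabilities are taken from the example slide; the data layout, the function names, the zero initial utilities, and the stopping tolerance are assumptions made for illustration.

```python
GAMMA = 0.9
REWARD = {"L": 1.0, "NL": -1.0}

# P[action][state][next_state], from the example's transition table.
P = {
    "A":  {"L": {"L": 0.9, "NL": 0.1}, "NL": {"L": 0.5, "NL": 0.5}},
    "NA": {"L": {"L": 0.6, "NL": 0.4}, "NL": {"L": 0.2, "NL": 0.8}},
}

def bellman_update(U):
    """One sweep: U(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')."""
    return {
        s: REWARD[s] + GAMMA * max(
            sum(p * U[s2] for s2, p in P[a][s].items()) for a in P
        )
        for s in REWARD
    }

def value_iteration(tol=1e-10, max_iters=10_000):
    U = {s: 0.0 for s in REWARD}          # assumed starting point
    for _ in range(max_iters):
        U_next = bellman_update(U)
        if max(abs(U_next[s] - U[s]) for s in U) < tol:
            return U_next
        U = U_next
    return U

U = value_iteration()
print(U)
```

Run to convergence, this settles at roughly U(L) ≈ 7.19 and U(NL) ≈ 4.06. The values 7.1574 and 4.0324 used on the next slide are slightly lower, which would be consistent with stopping after about 50 iterations from a zero start; that reading is an inference, not something the slides state.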

Optimal policy
• Given utilities from value iteration, find the optimal policy
A*(L) = arg max{ 0.9 U(L) + 0.1 U(NL), 0.6 U(L) + 0.4 U(NL) }
= arg max{ 0.9*7.1574 + 0.1*4.0324, 0.6*7.1574 + 0.4*4.0324 }
= arg max{ 6.8449, 5.9074 }
= Attend
A*(NL) = arg max{ 0.5 U(L) + 0.5 U(NL), 0.2 U(L) + 0.8 U(NL) }
= arg max{ 5.5949, 4.6574 }
= Attend

Policy iteration
• Instead of iterating in the space of utility values, iterate over policies
1. Assume an optimal policy, e.g., A*(L) & A*(NL)
2. Compute utility values for that policy, e.g., U(L) & U(NL) for A*
3. Compute a new optimal policy from the utilities, e.g., U(L) & U(NL), and repeat from step 2 until the policy stops changing
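The three steps can be sketched on the attendance example. The model numbers come from the example slide; the function names, the initial policy (never attend), and the use of repeated sweeps for policy evaluation instead of solving the linear system exactly are assumptions made to keep the sketch short.

```python
GAMMA = 0.9
REWARD = {"L": 1.0, "NL": -1.0}

# P[action][state][next_state], from the example's transition table.
P = {
    "A":  {"L": {"L": 0.9, "NL": 0.1}, "NL": {"L": 0.5, "NL": 0.5}},
    "NA": {"L": {"L": 0.6, "NL": 0.4}, "NL": {"L": 0.2, "NL": 0.8}},
}

def q_value(U, s, a):
    """Expected next-state utility of taking action a in state s."""
    return sum(p * U[s2] for s2, p in P[a][s].items())

def evaluate(policy, sweeps=500):
    """Step 2: utilities of a fixed policy (iterative approximation)."""
    U = {s: 0.0 for s in REWARD}
    for _ in range(sweeps):
        U = {s: REWARD[s] + GAMMA * q_value(U, s, policy[s]) for s in REWARD}
    return U

def improve(U):
    """Step 3: greedy policy with respect to the current utilities."""
    return {s: max(P, key=lambda a: q_value(U, s, a)) for s in REWARD}

def policy_iteration(policy):
    while True:
        U = evaluate(policy)
        new_policy = improve(U)
        if new_policy == policy:     # policy is stable: stop
            return policy, U
        policy = new_policy

# Step 1: start from an arbitrary guess (here: never attend).
policy, U = policy_iteration({"L": "NA", "NL": "NA"})
print(policy, U)
```

On this two-state problem the loop stabilizes after a single improvement step, ending with Attend in both states, the same policy obtained from the value-iteration utilities.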