SEQUENTIAL DECISIONS - Making decisions in a stochastic environment.
- SEQUENTIAL DECISION PROBLEMS: The agent's utility depends on a sequence of decisions.
  - The utility function depends on a sequence of states (the ENVIRONMENT HISTORY) because the decision problem is sequential.
- MARKOV DECISION PROCESS (MDP): A sequential decision problem for a fully observable, stochastic environment with a Markovian transition model and additive rewards.
  - Additive rewards: sum the rewards of the chain of states the agent has visited.
  - Consists of
    - A set of states with an initial state s₀
    - A set ACTIONS(s) of actions available in each state
    - A transition model P(s′ | s, a)
    - A reward function R(s)
- What does a solution look like? A fixed sequence of actions cannot guarantee which state the agent will reach. Therefore a solution must specify what the agent should do in any state it might reach (a POLICY).
  - POLICY: π. π(s) is the action recommended by policy π for state s. With a complete policy, the agent always knows what to do next.
  - The quality of a policy is measured by the expected utility of the possible environment histories generated by that policy.
  - An OPTIMAL POLICY, π*, is a policy that yields the highest expected utility.
  - Given π*, the agent decides what to do by consulting its current percept (which tells it the current state s) and then executing the action π*(s).

Utilities over time p648
- Horizons
  - FINITE HORIZON: There is a fixed time N after which nothing matters; this implies that the optimal action in a given state could change over time (NONSTATIONARY policy).
  - INFINITE HORIZON: The optimal action depends only on the current state (STATIONARY policy).
- γ: DISCOUNT FACTOR between 0 and 1. Describes the agent's preference for current rewards over future rewards.
  - A discount factor of γ is equivalent to an interest rate of (1/γ) − 1.
- Reward calculation for state sequences
  - Additive rewards: Uh([s₀, s₁, s₂, ...]) = R(s₀) + R(s₁) + R(s₂) + ...
  - Discounted rewards: Uh([s₀, s₁, s₂, ...]) = R(s₀) + γ R(s₁) + γ² R(s₂) + ...

Optimal policies and utilities of states (comparison of policies) p650
- The probability distribution over state sequences S₁, S₂, ... is determined by the initial state s and the policy π.
- The expected utility obtained by executing π starting in s is $U^\pi(s) = E\left[\sum_{t=0}^{\infty}\gamma^t\,R(S_t)\right]$, where the expectation is with respect to the probability distribution over state sequences determined by s and π.
- With discounted infinite-horizon utilities, the optimal policy is independent of the starting state.
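The additive and discounted reward formulas above can be sketched as one function, since additive rewards are just the γ = 1 case. The reward values below are invented for illustration.

```python
# Utility of a state sequence: Uh([s0, s1, s2, ...]) = R(s0) + g*R(s1) + g^2*R(s2) + ...
# With gamma = 1.0 this reduces to plain additive rewards.
def utility(rewards, gamma=1.0):
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0]           # R(s0), R(s1), R(s2) for a short, made-up history
print(utility(rewards))             # additive:   1.0
print(utility(rewards, gamma=0.5))  # discounted: 0.5**2 * 1.0 = 0.25
```

Note how discounting shrinks the contribution of the later reward, reflecting the agent's preference for current rewards over future ones.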