GraphPlan

Mutex Conditions

Inconsistent Effects: The effects of one negates the effects of the other

Complements: (Literal) A and !A

Interference: One deletes the pre-conditions of another

Inconsistent Support: (Literal) Actions are mutex

Competing Needs: Mutually exclusive preconditions

Partial Order Planning

Setup

Steps

Add Start (with postconditions) and Finish (with preconditions)

Causal Links

Ordering Constraints

Querying Information

\(Pr(x_1,...x_n)=\prod\limits_{i=1}^{n} Pr(x_i|\,parents(X_i))\)

For probabilities not specified, we need to take the sum of all possibilities inside the products.

Conditional Independence

A node is conditionally independent of its other predecessors, given its parents.

Each variable is conditionally independent of its non-descendents, given its parents

**Markov blanket**: A node is conditionally independent of all other nodes in the blanket, given its parents, children, and children’s parents.Three Conditions for all paths

Common parent known

Middle of chain known

Common child unknown

Bayes rule: flip

Information Gain

Entropy of a boolean variable that is true with probability q

\(B(q) = -(q \text{ log}_2q+(1-q)\text{ log}_2(1-q))\)

Entropy of the goal attribute on the whole set is \(B(\frac{p}{p+n})\)

\(\text{Gain}(A) = B(\frac{p}{p+n})-\text{Remainder}(A)\)

\(\text{Remainder}(A) = \sum\limits_{k=1}^{d}\frac{p_k+n_k}{p+n}\,B(\frac{p_k}{p_k+n_k})\) where each subset \(E_k\) has \(p_k\) positive examples and \(n_k\) negative examples.

Decision Trees

Bellman Equation

\(U(s) = R(s) + \gamma\, max_{a \in A(s)} \sum\limits_{s'} P(s'|\,s,a)U(s')\)

Value Iteration

\(U_{i+1}(s) \leftarrow R(s) + \gamma \,max_{a\in A(s)} \sum\limits_{s'} P(s'|\,s,a)U_i(s')\)

Convergence of Value Iteration

For the updates on all states; \(\delta = \) Maximum difference between old and new utility.

Quit when \(\delta < \epsilon(1-\gamma)/\gamma\) where \(\gamma\) is the discount factor.

https://www.dropbox.com/s/s7rl78omhzjj4sq/Screenshot%202014-04-25%2015.47.28.png

Dont’ know transition model \((P'|s,a)\) or reward function \(R(s)\) for each state.

Markov Decision Process: State at time t depends on only previous state and action

\(\text{Pr}(s_{t+1}=s'|\,s_t=s,\,a_t=a) = T(s,a,s')\)

Direct Estimation

Given trials, reward functions

Utility of a state is the sum_of_total_rewards/number_of_trials_it_is_in

Adaptive Dynamic Programming (ADP)

Active ADP

Temporal Difference Learning (TD)

Look forward one step to figure out utilities.

Q-Learning

\(\text{Pr}(H_{bad} \text{ consistent with random example}) \le 1 - \epsilon.\)