Incorporating Markov Decision Processes within General Linear Bilevel Programs



Let \(x_{i},\ i=1,\ldots,n,\) be the \(n\) decision variables governed by the leader, and let \(c_{i},\ i=1,\ldots,n,\) be the cost associated with the \(i^{\textrm{th}}\) decision. We assume that the leader’s decision must satisfy a set of linear constraints defined by the matrix \(A\) and the right-hand-side (RHS) vector \(b\), and that it belongs to a compact set \(\mathcal{X}\).

For the follower’s MDP, we assume the set of states is \(S=\{1,\ldots,|S|\}\) and the set of actions available in state \(s\in S\) is \(A_{s}\). Let \(p_{s,s^{\prime}}^{a},\ s,s^{\prime}\in S,\ a\in A_{s},\) be the probability of moving to state \(s^{\prime}\) when action \(a\) is chosen in state \(s\). For given states \(s,s^{\prime}\in S\) and action \(a\in A_{s}\), let \(r_{s,s^{\prime}}^{a}\) be the reward of the follower. To model the effect of the leader’s decision on the follower’s problem, we define the vector \(d_{s,s^{\prime}}^{a}\in\mathbb{R}^{n}\). If the leader’s decision is \(x\), the follower chooses action \(a\in A_{s}\) in state \(s\in S\), and the follower’s state in the next period becomes \(s^{\prime}\), then the reward \(r_{s,s^{\prime}}^{a}\) is reduced by \({d_{s,s^{\prime}}^{a}}^{T}x\).

Let \(v_{s}\) be the value of state \(s\in S\). Then, the leader will solve the following problem.

\begin{aligned} \min\quad & \label{m1:0} c^{T}x+f(v) \\ \textrm{s.t.}\quad & \label{m1:1} Ax=b \\ & \label{m1:2} x\in\mathcal{X} \\ & \label{m1:3} v\ \textrm{solves}\ \Phi(x), \end{aligned}

where \(\Phi(x)\) is defined as follows.

\begin{aligned} \Phi(x)=\max\quad & \label{m2:0} \mu^{T}v \\ \textrm{s.t.}\quad & \label{m2:1} v_{s}\leq\sum_{s^{\prime}\in S}p_{s,s^{\prime}}^{a}\left(r_{s,s^{\prime}}^{a}-{d_{s,s^{\prime}}^{a}}^{T}x+\gamma v_{s^{\prime}}\right)\ \ \ \ \forall s\in S,\ a\in A_{s}. \end{aligned}
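For a fixed leader decision \(x\), \(\Phi(x)\) is a finite linear program and can be solved directly. The following sketch builds the constraints for a small synthetic instance and solves it with \texttt{scipy.optimize.linprog}; all numerical data (two states, two actions, the transition probabilities, the rewards, and the coupling vectors \(d\)) are illustrative assumptions, not part of the model.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative instance (all numbers are assumptions): 2 states, 2 actions
# per state, one leader variable (n = 1), discount factor gamma < 1.
gamma = 0.9
nS, nA, n = 2, 2, 1
mu = np.array([0.5, 0.5])                      # objective weights mu_s
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(nS), size=(nS, nA))  # p[s, a, s'], rows sum to 1
r = np.ones((nS, nA, nS))                      # rewards r_{s,s'}^a
d = 0.5 * np.ones((nS, nA, nS, n))             # coupling vectors d_{s,s'}^a
x = np.array([0.4])                            # a fixed leader decision

# Phi(x):  max mu^T v  s.t.  v_s - gamma * sum_{s'} p[s,a,s'] v_{s'}
#          <= sum_{s'} p[s,a,s'] * (r - d^T x)   for every pair (s, a).
A_ub = np.zeros((nS * nA, nS))
b_ub = np.zeros(nS * nA)
for s in range(nS):
    for a in range(nA):
        row = s * nA + a
        A_ub[row, s] += 1.0
        A_ub[row, :] -= gamma * p[s, a, :]
        b_ub[row] = p[s, a, :] @ (r[s, a, :] - d[s, a] @ x)

# linprog minimizes, so negate the objective; v is a free variable.
res = linprog(c=-mu, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nS)
v = res.x
print(v)  # value of each state under this leader decision
```

Because \(\gamma<1\), the LP is bounded and its optimum is the fixed point of the corresponding Bellman operator; with the symmetric data above, every state receives the same value.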

The follower’s LP formulation allows us to reformulate the leader’s problem as a single-level optimization problem. Note that the follower’s problem may have alternative optimal solutions (since an MDP may admit multiple optimal policies in general); hence, we must distinguish the pessimistic case from the optimistic case when solving this problem. Let \(\lambda_{s}^{a}\) be the dual variable associated with the follower’s constraint for \(s\in S\) and \(a\in A_{s}\).

\begin{aligned} \min\quad & \label{m3:0} c^{T}x+f(v) \\ \textrm{s.t.}\quad & \label{m3:1} Ax=b \\ & \label{m3:2} x\in\mathcal{X} \\ & \label{m3:3}\mu^{T}v-\sum_{s\in S}\sum_{a\in A_{s}}\lambda_{s}^{a}\sum_{s^{\prime}\in S}p_{s,s^{\prime}}^{a}\left(r_{s,s^{\prime}}^{a}-{d_{s,s^{\prime}}^{a}}^{T}x\right)=0 \\ & \label{m3:4} v_{s}-\gamma\sum_{s^{\prime}\in S}p_{s,s^{\prime}}^{a}v_{s^{\prime}}-\sum_{s^{\prime}\in S}p_{s,s^{\prime}}^{a}\left(r_{s,s^{\prime}}^{a}-{d_{s,s^{\prime}}^{a}}^{T}x\right)\leq 0\ \ \ \ \forall s\in S,\ a\in A_{s} \\ & \label{m3:5}\sum_{a\in A_{s}}\lambda_{s}^{a}-\gamma\sum_{s^{\prime}\in S}\sum_{a^{\prime}\in A_{s^{\prime}}}p_{s^{\prime},s}^{a^{\prime}}\lambda_{s^{\prime}}^{a^{\prime}}=\mu_{s}\ \ \ \ \forall s\in S \\ & \label{m3:6}\lambda_{s}^{a}\geq 0\ \ \ \ \forall s\in S,\ a\in A_{s}\\ \end{aligned}
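The single-level model replaces the lower-level optimality requirement with LP duality: primal feasibility of \(v\), dual feasibility of \(\lambda\), and equality of the primal and dual objectives. A quick numerical sanity check on a small synthetic instance (all data below are illustrative assumptions) confirms that at optimality \(\mu^{T}v=\sum_{s,a}\lambda_{s}^{a}\sum_{s^{\prime}}p_{s,s^{\prime}}^{a}(r_{s,s^{\prime}}^{a}-{d_{s,s^{\prime}}^{a}}^{T}x)\):

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative data (assumptions): 2 states, 2 actions per state.
gamma = 0.9
nS, nA = 2, 2
mu = np.array([0.5, 0.5])
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(nS), size=(nS, nA))   # p[s, a, s']

# RHS b_{s,a} = sum_{s'} p (r - d^T x); here r = 1 and d^T x = 0.2 (assumed).
b = np.zeros(nS * nA)
A_mat = np.zeros((nS * nA, nS))                 # rows: pairs (s, a); cols: states
for s in range(nS):
    for a in range(nA):
        row = s * nA + a
        b[row] = p[s, a, :] @ (np.ones(nS) - 0.2)
        A_mat[row, s] += 1.0
        A_mat[row, :] -= gamma * p[s, a, :]

# Primal follower LP:  max mu^T v  s.t.  A_mat v <= b  (v free).
primal = linprog(c=-mu, A_ub=A_mat, b_ub=b, bounds=[(None, None)] * nS)

# Dual LP:  min b^T lam  s.t.  A_mat^T lam = mu,  lam >= 0
# (the equality system is the dual-feasibility constraint of the model).
dual = linprog(c=b, A_eq=A_mat.T, b_eq=mu, bounds=[(0, None)] * (nS * nA))

# Strong duality: the two optimal objectives coincide.
print(mu @ primal.x, b @ dual.x)
```

Embedding both feasibility systems together with the equal-objectives condition is exactly what forces \(v\) to solve \(\Phi(x)\) inside the single-level program.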

We note that the