TRPO Details
The origin paper: Schulman, John, et al. “Trust region policy optimization.” International conference on machine learning. PMLR, 2015.
Overview
The obejective is
\[J(\pi) := \mathbb{E}_{\pi} \left[ \sum\limits_{t=0}^{\infty} \gamma^t\cdot r(s_{t}, a_{t}) \right].\]The Problem is
\[\max\limits_{\pi} J(\pi).\]The performance difference lemma (see below) shows that
\[\begin{aligned} J(\textcolor{blue}{\pi'}) =& J(\pi) + \mathbb{E}_{\pi'} \left[ \sum\limits_{t=0}^\infty \gamma^t\cdot A_{\pi} (s_{t},a_{t}) \right] \\ =& J(\pi) + \sum\limits_{s} d_{\textcolor{blue}{\pi'}}(s) \sum\limits_{a} \textcolor{blue}{\pi'}(a\mid s) \cdot A_{\pi} (s,a), \end{aligned}\]where $d_{\pi}(s)$ is the discounted state visitation frequencies and $A_{\pi}$ is the advantage function under policy $\pi$.
TRPO makes a local approximation, whereby $d_{\textcolor{blue}{\pi’}}$ is replaced by $d_{\pi}(s)$
One can define
\[L_{\pi}(\textcolor{blue}{\pi'}) := J(\pi)+\sum_{s} d_{\pi}(s) \sum_{a} \textcolor{blue}{\pi'}(a \mid s) \cdot A_{\pi}(s, a)\]and derive the lower bound $J(\textcolor{blue}{\pi’}) \geq L_{\pi}(\textcolor{blue}{\pi’})-c \cdot D_{\mathrm{KL}}^{\max }(\pi, \textcolor{blue}{\pi’})$, where $D_{\mathrm{KL}}^{\max }$ is the KL divergence maximized over states and $c$ depends on $\pi$. The KL divergence penalty can be replaced by a constraint, so the problem becomes
\[\begin{aligned} & \max _{\textcolor{blue}{\theta'}} \sum_{s} d_{\theta}(s) \sum_{a} \textcolor{blue}{\pi'}_{\textcolor{blue}{\theta'}}(a \mid s) \cdot A_{\theta}(s, a) \\ & \text { s.t. } \bar{D}_{\mathrm{KL}}^\theta(\theta, \textcolor{blue}{\theta'}) \leq \delta, \end{aligned}\]where $\bar{D}_{\mathrm{KL}}^\theta$ is the KL divergence averaged over states $s \sim d_{\theta}$. Using importance sampling, the summation over actions $\sum_{a}(\cdot)$ is replaced by $\mathbb{E}_{a \sim q}\left[\frac{1}{q(a \mid s)}(\cdot)\right]$. It is convenient to choose $q=\pi_{\theta}$, which results in:
\[\begin{aligned} & \max _{\textcolor{blue}{\theta'}} \mathbb{E}_{s \sim d_{\theta}, a \sim \pi_{\theta}}\left[\frac{\textcolor{blue}{\pi'}_{\textcolor{blue}{\theta'}}(a \mid s)}{\pi_{\theta}(a \mid s)} A_{\theta}(s, a)\right] \\ & \text { s.t. } \mathbb{E}_{s \sim d_{\theta}}\left[D_{\mathrm{KL}}\left(\pi_{\theta}(\cdot \mid s), \textcolor{blue}{\pi'}_{\textcolor{blue}{\theta'}}(\cdot \mid s)\right)\right] \leq \delta . \end{aligned}\]During online learning, the $\textcolor{blue}{\theta’}$ that is optimized and the old $\theta$ are the same at each iteration, so the gradient estimate is
\[\mathbb{E}_{\pi_{\theta}}\left[\frac{\nabla_{\theta} \pi_{\theta}(a \mid s)}{\pi_{\theta}(a \mid s)} A_{\theta}(s, a)\right] .\]Performance Difference Lemma
For all policies $\pi, \textcolor{blue}{\pi’}$ and states $s_{0}$,
\[\begin{aligned} V^{\textcolor{blue}{\pi'}}(s_{0}) - V^{\pi}(s_{0}) =& \mathbb{E}_{\tau \sim {\Pr}^{\textcolor{blue}{\pi'}}(\tau|s_{0}=s) } \left[\sum_{t=0}^\infty \gamma^t A^{\pi}(s_{t},a_{t})\right] \\ =& \frac{1}{1-\gamma}\mathbb{E}_{s\sim d_{s_{0}}^{\textcolor{blue}{\pi'}} }\mathbb{E}_{a\sim \textcolor{blue}{\pi'}(\cdot|s) } \left[ A^{\pi}(s,a)\right]. \end{aligned}\]Kakade, Sham, and John Langford. “Approximately optimal approximate reinforcement learning.” Proceedings of the Nineteenth International Conference on Machine Learning. 2002.
Proof
The proof is provided in the appendix of “On the theory of policy gradient methods: Optimality, approximation, and distribution shift” and I just transcribed it here with additional details.
Let $\Pr^{\textcolor{blue}{\pi’}}(\tau \mid s_{0} = s)$ denote the probability of observing a trajectory $\tau$ when starting in state $s$ and following the policy $\textcolor{blue}{\pi’}$. Using a telescoping argument, we have:
\[\begin{aligned} &V^{\textcolor{blue}{\pi'}}(s) - V^{\pi}(s) \\ =& \mathbb{E}_{\tau \sim {\Pr}^{\textcolor{blue}{\pi'}}(\tau|s_{0}=s) } \left[\sum_{t=0}^\infty \gamma^t r(s_{t},a_{t})\right] - V^{\pi}(s) \\ =& \mathbb{E}_{\tau \sim {\Pr}^{\textcolor{blue}{\pi'}}(\tau|s_{0}=s) } \left[\sum_{t=0}^\infty \gamma^t \left(r(s_{t},a_{t})+V^{\pi}(s_{t})-V^{\pi}(s_{t}) \right)\right]-V^{\pi}(s)\\ \stackrel{(a)}{=}& \mathbb{E}_{\tau \sim {\Pr}^{\textcolor{blue}{\pi'}}(\tau|s_{0}=s) } \left[\sum_{t=0}^\infty \gamma^t \left(r(s_{t},a_{t})+\gamma V^{\pi}(s_{t+1})-V^{\pi}(s_{t})\right)\right]\\ \stackrel{(b)}{=}&\mathbb{E}_{\tau \sim {\Pr}^{\textcolor{blue}{\pi'}}(\tau|s_{0}=s) } \left[\sum_{t=0}^\infty \gamma^t \left(r(s_{t},a_{t})+\gamma \mathbb{E}[V^{\pi}(s_{t+1})|s_{t},a_{t}]-V^{\pi}(s_{t})\right)\right]\\ \stackrel{(c)}{=}& \mathbb{E}_{\tau \sim {\Pr}^{\textcolor{blue}{\pi'}}(\tau|s_{0}=s) } \left[\sum_{t=0}^\infty \gamma^t A^{\pi}(s_{t},a_{t})\right] \\ =& \frac{1}{1-\gamma}\mathbb{E}_{s'\sim d^{\textcolor{blue}{\pi'}}_{s} }\,\mathbb{E}_{a\sim \textcolor{blue}{\pi'}(\cdot | s')} \left[ A^{\pi}(s',a) \right], \end{aligned}\]where $(a)$ rearranges terms in the summation and cancels the $V^{\pi}(s_{0})$ term with the $-V^{\pi}(s)$ outside the summation, and $(b)$ uses the tower property of conditional expectations and the final equality follows from the definition of $d^{\textcolor{blue}{\pi’}}_{s}$. $\blacksquare$
Details
$(a)$:
\[- a_{0} +\sum\limits_{k=0}^{\infty} \left(a_k - b_k \right) = \sum\limits_{k=0}^{\infty} \left(a_{k+1} - b_k\right).\]$(b)$: The tower property of conditional expectations (or law of total probability): If $\mathcal{H} \subseteq \mathcal{G}$, then
\[\mathbb{E}\left[\mathbb{E}\left[X\mid \mathcal{G} \right] \mid \mathcal{H} \right] = \mathbb{E}\left[X\mid \mathcal{H} \right].\]Correspondingly,
- $\mathcal{G} = \tau \sim {\Pr}^{\textcolor{blue}{\pi’}}(\tau \mid s_{0}=s)$,
- $\mathcal{H} = (s_{t},a_{t})$.
$(c)$: Step $(b)$ is necessary. Note that
\[Q^{\pi}(s, a) \ne r(s, a) + \gamma \cdot V^{\pi}(s').\]But
\[Q^{\pi}(s, a) = r(s, a) + \gamma \cdot \sum\limits_{s'} P(s' \mid s,a) \cdot V^{\pi}(s').\]The final equality: the (normalized) discounted state visitation distribution is
\[d^{\textcolor{blue}{\pi'}}_{s}(s') := (1-\gamma)\sum\limits_{t=0}^\infty \gamma^t \cdot {\Pr}^{\textcolor{blue}{\pi'}}(s_{t} = s' \mid s_{0} = s),\]so, pushing the trajectory expectation onto the per-step state marginals,
\[\begin{aligned} \mathbb{E}_{\tau \sim {\Pr}^{\textcolor{blue}{\pi'}}(\tau|s_{0}=s)}\left[\sum_{t=0}^\infty \gamma^t A^{\pi}(s_{t},a_{t})\right] =& \sum_{t=0}^\infty \gamma^t \sum_{s'} {\Pr}^{\textcolor{blue}{\pi'}}(s_{t}=s'\mid s_{0}=s) \sum_{a} \textcolor{blue}{\pi'}(a\mid s') A^{\pi}(s',a) \\ =& \sum_{s'} \underbrace{\left(\sum_{t=0}^\infty \gamma^t {\Pr}^{\textcolor{blue}{\pi'}}(s_{t}=s'\mid s_{0}=s)\right)}_{=\,\frac{1}{1-\gamma}d^{\textcolor{blue}{\pi'}}_{s}(s')} \sum_{a} \textcolor{blue}{\pi'}(a\mid s') A^{\pi}(s',a) \\ =& \frac{1}{1-\gamma}\mathbb{E}_{s'\sim d^{\textcolor{blue}{\pi'}}_{s}}\,\mathbb{E}_{a\sim\textcolor{blue}{\pi'}(\cdot\mid s')}\left[A^{\pi}(s',a)\right]. \end{aligned}\]The Surrogate Objective and Its Guarantees
Abbreviate the expected current-policy advantage taken under the candidate $\textcolor{blue}{\pi’}$, and the un-normalized discounted visitation, as
\[\bar{A}_{\pi}(s) := \mathbb{E}_{a\sim \textcolor{blue}{\pi'}(\cdot\mid s)}\left[ A_{\pi}(s,a) \right], \qquad d_{\pi}(s) := \sum_{t=0}^\infty \gamma^t \cdot {\Pr}^\pi(s_{t} = s).\]The performance difference lemma gives the exact improvement
\[J(\textcolor{blue}{\pi'}) = J(\pi) + \sum_{s} d_{\textcolor{blue}{\pi'}}(s) \cdot \bar{A}_{\pi}(s).\]This is hard to optimize because the visitation $d_{\textcolor{blue}{\pi’}}$ depends on the unknown candidate. Replacing it by the current policy’s visitation $d_{\pi}$ yields the surrogate
\[L_{\pi}(\textcolor{blue}{\pi'}) = J(\pi) + \sum_{s} d_{\pi}(s) \cdot \bar{A}_{\pi}(s) = J(\pi) + \sum_{s} d_{\pi}(s) \sum_{a} \textcolor{blue}{\pi'}(a\mid s) \cdot A_{\pi}(s,a).\]First-order matching
$L_{\pi}$ matches $J$ to first order at $\theta = \theta_{0}$ (i.e. at $\textcolor{blue}{\pi’} = \pi$):
\[L_{\pi_{\theta_{0}}}(\pi_{\theta_{0}}) = J(\pi_{\theta_{0}}), \qquad \nabla_{\theta} L_{\pi_{\theta_{0}}}(\pi_{\theta})\big|_{\theta=\theta_{0}} = \nabla_{\theta} J(\pi_{\theta})\big|_{\theta=\theta_{0}}.\]Proof. Value matching. When $\textcolor{blue}{\pi’} = \pi$,
\[\bar{A}_{\pi}(s) = \mathbb{E}_{a\sim\pi}\left[ A_{\pi}(s,a) \right] = \mathbb{E}_{a\sim\pi}\left[ Q_{\pi}(s,a) - V_{\pi}(s) \right] = V_{\pi}(s) - V_{\pi}(s) = 0,\]so every term vanishes and $L_{\pi}(\pi) = J(\pi)$.
Gradient matching. Both objectives are functions of $\theta$ with the baseline frozen at $\theta_{0}$. By the performance difference lemma,
\[\begin{aligned} J(\pi_{\theta}) =& J(\pi_{\theta_{0}}) + \sum_{s} d_{\pi_{\theta}}(s) \sum_{a} \pi_{\theta}(a\mid s) \cdot A_{\pi_{\theta_{0}}}(s,a), \\ L_{\pi_{\theta_{0}}}(\pi_{\theta}) =& J(\pi_{\theta_{0}}) + \sum_{s} d_{\pi_{\theta_{0}}}(s) \sum_{a} \pi_{\theta}(a\mid s) \cdot A_{\pi_{\theta_{0}}}(s,a). \end{aligned}\]The only difference is $d_{\pi_{\theta}}$ versus the frozen $d_{\pi_{\theta_{0}}}$. Differentiating $J$ at $\theta_{0}$ with the product rule,
\[\nabla_{\theta} J(\pi_{\theta})\big|_{\theta_{0}} = \underbrace{\sum_{s} \left[\nabla_{\theta} d_{\pi_{\theta}}(s)\right]_{\theta_{0}} \underbrace{\sum_{a} \pi_{\theta_{0}}(a\mid s) A_{\pi_{\theta_{0}}}(s,a)}_{=\,0}}_{=\,0} + \sum_{s} d_{\pi_{\theta_{0}}}(s) \sum_{a} \left[\nabla_{\theta} \pi_{\theta}(a\mid s)\right]_{\theta_{0}} A_{\pi_{\theta_{0}}}(s,a).\]The first group dies because $\sum_{a} \pi_{\theta_{0}}(a\mid s) A_{\pi_{\theta_{0}}}(s,a) = 0$. The surviving term is exactly $\nabla_{\theta} L_{\pi_{\theta_{0}}}(\pi_{\theta})$ evaluated at $\theta_{0}$, since $d_{\pi_{\theta_{0}}}$ does not depend on $\theta$. $\blacksquare$
So a small step that improves $L_{\pi}$ also improves $J$ — as long as the step is small enough that the second-order gap does not dominate. The next theorem bounds that gap.
The surrogate lower bound
Write the maximum advantage magnitude and the worst-case total-variation radius as
\[\epsilon = \max_{s,a}\lvert A_{\pi}(s,a)\rvert, \qquad \alpha = D_{\mathrm{TV}}^{\max}(\pi,\textcolor{blue}{\pi'}) = \max_{s} D_{\mathrm{TV}}\left(\pi(\cdot\mid s),\textcolor{blue}{\pi'}(\cdot\mid s)\right).\]Then
\[J(\textcolor{blue}{\pi'}) \geq L_{\pi}(\textcolor{blue}{\pi'}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2.\]Proof. Subtracting the two expressions and expanding $d(s) = \sum_{t} \gamma^t {\Pr}(s_{t} = s)$,
\[J(\textcolor{blue}{\pi'}) - L_{\pi}(\textcolor{blue}{\pi'}) = \sum_{s} \left( d_{\textcolor{blue}{\pi'}}(s) - d_{\pi}(s) \right) \bar{A}_{\pi}(s) = \sum_{t=0}^\infty \gamma^t \sum_{s} \left( {\Pr}^{\textcolor{blue}{\pi'}}(s_{t} = s) - {\Pr}^{\pi}(s_{t} = s) \right) \bar{A}_{\pi}(s).\]Step 1 — bound $\bar{A}_{\pi}$. Since $\mathbb{E}_{a\sim\pi}[A_{\pi}(s,a)] = 0$,
\[\bar{A}_{\pi}(s) = \sum_{a} \left( \textcolor{blue}{\pi'}(a\mid s) - \pi(a\mid s) \right) A_{\pi}(s,a),\] \[\lvert \bar{A}_{\pi}(s) \rvert \leq \epsilon \sum_{a} \lvert \textcolor{blue}{\pi'}(a\mid s) - \pi(a\mid s) \rvert = 2\epsilon \cdot D_{\mathrm{TV}}\!\left(\pi(\cdot\mid s),\textcolor{blue}{\pi'}(\cdot\mid s)\right) \leq 2\epsilon\alpha.\]Step 2 — couple the two trajectories. Build a coupling of $\pi$ and $\textcolor{blue}{\pi’}$ that picks the same action whenever possible; at any state they can be made to disagree with probability at most $\alpha$. Call time $t$ clean if the coupled trajectories took identical actions at all steps $0,\dots,t-1$ (hence visited identical states up to $t$). Then
\[\Pr(\text{$t$ is clean}) \geq (1-\alpha)^t.\]On the clean event $s_{t}$ has the same distribution under both policies, so those contributions cancel in the difference; only the non-clean event survives:
\[\left\lvert \sum_{s} \left( {\Pr}^{\textcolor{blue}{\pi'}}(s_{t} = s) - {\Pr}^{\pi}(s_{t} = s) \right) \bar{A}_{\pi}(s) \right\rvert \leq \Pr(\text{not clean}) \cdot 2\max_{s}\lvert \bar{A}_{\pi}(s) \rvert \leq \left(1-(1-\alpha)^t\right) \cdot 4\epsilon\alpha.\]Step 3 — sum the geometric series.
\[\begin{aligned} \left\lvert J(\textcolor{blue}{\pi'}) - L_{\pi}(\textcolor{blue}{\pi'}) \right\rvert \leq& 4\epsilon\alpha \sum_{t=0}^\infty \gamma^t \left(1-(1-\alpha)^t\right) \\ =& 4\epsilon\alpha \left( \frac{1}{1-\gamma} - \frac{1}{1-\gamma(1-\alpha)} \right) = \frac{4\epsilon\alpha \cdot \gamma\alpha}{(1-\gamma)\left(1-\gamma(1-\alpha)\right)}. \end{aligned}\]Since $1-\gamma(1-\alpha) \geq 1-\gamma$,
\[\left\lvert J(\textcolor{blue}{\pi'}) - L_{\pi}(\textcolor{blue}{\pi'}) \right\rvert \leq \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2,\]which gives the claimed lower bound. $\blacksquare$
From total variation to KL
Using the relationship $D_{\mathrm{TV}}(p,q)^2 \leq D_{\mathrm{KL}}(p,q)$ (Pollard, 2000), we have $\alpha^2 = \left(D_{\mathrm{TV}}^{\max}\right)^2 \leq D_{\mathrm{KL}}^{\max}(\pi,\textcolor{blue}{\pi’})$, hence
\[J(\textcolor{blue}{\pi'}) \geq L_{\pi}(\textcolor{blue}{\pi'}) - \frac{4\epsilon\gamma}{(1-\gamma)^2} \cdot D_{\mathrm{KL}}^{\max}(\pi,\textcolor{blue}{\pi'}).\]Monotonic improvement: the MM view
Write $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$ and define the surrogate
\[M_{i}(\pi) := L_{\pi_{i}}(\pi) - C \cdot D_{\mathrm{KL}}^{\max}(\pi_{i}, \pi).\]The bound says $J(\pi) \geq M_{i}(\pi)$ for every $\pi$, with equality at $\pi = \pi_{i}$ (both the KL term and the surrogate gap vanish there). Thus $M_{i}$ minorizes $J$ and touches it at $\pi_{i}$. Choosing
\[\pi_{i+1} = \arg\max_{\pi} M_{i}(\pi)\](minorization–maximization) yields a monotonically non-decreasing sequence of true returns:
\[J(\pi_{i+1}) \geq M_{i}(\pi_{i+1}) \geq M_{i}(\pi_{i}) = J(\pi_{i}),\]where the first inequality is the bound, the second is the $\arg\max$, and the equality is the tightness at $\pi_{i}$. $\blacksquare$
From the penalty to the trust region
Maximizing $M_{i}$ directly uses the worst-case coefficient $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$, which is large and blows up as $\gamma\to 1$ — so the penalty forces tiny steps. Two relaxations turn the penalized problem into TRPO.
(1) Penalty $\to$ hard constraint with a tunable radius $\delta$:
\[\max_{\theta} L_{\pi_{\theta_{0}}}(\pi_{\theta}) \quad \text{s.t.} \quad D_{\mathrm{KL}}^{\max}(\pi_{\theta_{0}}, \pi_{\theta}) \leq \delta.\](2) Worst-case KL $\to$ average KL over visited states, since the per-state max imposes too many constraints:
\[\max_{\theta} \sum_{s} d_{\pi_{\theta_{0}}}(s) \sum_{a} \pi_{\theta}(a\mid s) \cdot A_{\pi_{\theta_{0}}}(s,a) \quad \text{s.t.} \quad \mathbb{E}_{s\sim d_{\pi_{\theta_{0}}}}\left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{0}}(\cdot\mid s), \pi_{\theta}(\cdot\mid s)\right) \right] \leq \delta,\]where the constant $J(\pi_{\theta_{0}})$ has been dropped from the objective.
(3) Importance sampling. Rewrite the inner action sum as an expectation over $a\sim q$, then choose $q = \pi_{\theta_{0}}$:
\[\sum_{a} \pi_{\theta}(a\mid s) \cdot A_{\pi_{\theta_{0}}}(s,a) = \mathbb{E}_{a\sim q}\left[ \frac{\pi_{\theta}(a\mid s)}{q(a\mid s)} A_{\pi_{\theta_{0}}}(s,a) \right].\]This delivers the trust region policy optimization problem
\[\begin{aligned} & \max_{\theta} \mathbb{E}_{s\sim d_{\pi_{\theta_{0}}},\, a\sim\pi_{\theta_{0}}}\left[ \frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{0}}(a\mid s)} A_{\pi_{\theta_{0}}}(s,a) \right] \\ & \text{s.t.} \quad \mathbb{E}_{s\sim d_{\pi_{\theta_{0}}}}\left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{0}}(\cdot\mid s), \pi_{\theta}(\cdot\mid s)\right) \right] \leq \delta, \end{aligned}\]which is exactly the objective stated in the Overview, closing the loop.