TRPO Details

Posted Apr 24, 2024 Updated Dec 31, 2024

By Yue Lin 3 min read

The original paper: Schulman, John, et al. “Trust region policy optimization.” International conference on machine learning. PMLR, 2015.

Overview

This derivation follows Appendix A.1 of: Yang, Jiachen, et al. “Adaptive incentive design with multi-agent meta-gradient reinforcement learning.” arXiv preprint arXiv:2112.10859 (2021).

The objective is

\[J(\pi) := \mathbb{E}_{\pi} \left[ \sum\limits_{t=0}^{\infty} \gamma^t\cdot r(s_t, a_t) \right].\]

The problem is

\[\max\limits_{\pi} J(\pi).\]

The performance difference lemma shows that, for any two policies $\pi$ and $\textcolor{blue}{\pi'}$,

\[\begin{aligned} J(\textcolor{blue}{\pi'}) =& J(\pi) + \mathbb{E}_{\pi'} \left[ \sum\limits_{t=0}^\infty \gamma^t\cdot A_{\pi} (s_t,a_t) \right] \\ =& J(\pi) + \sum\limits_{s} d_{\textcolor{blue}{\pi'}}(s) \sum\limits_{a} \textcolor{blue}{\pi'}(a\mid s) \cdot A_{\pi} (s,a), \end{aligned}\]

where $d_{\pi}(s) := \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid \pi)$ is the (unnormalized) discounted state visitation frequency and $A_{\pi}$ is the advantage function under policy $\pi$.
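The identity above can be checked numerically on a small tabular MDP. The sketch below is only an illustration: the random MDP (`P`, `R`, `mu`), the sizes, and the helper names `random_policy` and `evaluate` are all assumptions introduced here, not part of either paper. It computes $J$, $A_\pi$, and the unnormalized $d_\pi$ in closed form and compares both sides of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9  # hypothetical sizes and discount factor

# Hypothetical random MDP: P[s, a, s'] transitions, R[s, a] rewards, mu start distribution.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
mu = np.full(S, 1.0 / S)


def random_policy():
    """Draw a strictly positive tabular policy pi[s, a]."""
    pi = rng.random((S, A)) + 0.1
    return pi / pi.sum(axis=1, keepdims=True)


def evaluate(pi):
    """Return J(pi), the advantages A_pi[s, a], and the unnormalized d_pi[s]."""
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V                   # Q[s, a]
    adv = Q - V[:, None]
    # d_pi = sum_t gamma^t Pr(s_t = s), i.e. d = (I - gamma * P_pi^T)^{-1} mu
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return float(mu @ V), adv, d


pi_old, pi_new = random_policy(), random_policy()
J_old, adv_old, _ = evaluate(pi_old)
J_new, _, d_new = evaluate(pi_new)

# Right-hand side of the lemma: states weighted by d_{pi'}, actions by pi', advantages from pi.
rhs = J_old + d_new @ (pi_new * adv_old).sum(axis=1)
print(f"J(pi') = {J_new:.6f}, lemma RHS = {rhs:.6f}")
```

Note that the check only works with the unnormalized $d_\pi$ (it sums to $1/(1-\gamma)$); with a normalized distribution an extra factor $1/(1-\gamma)$ would appear in front of the sum.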

TRPO makes a local approximation, whereby $d_{\textcolor{blue}{\pi'}}$ is replaced by $d_{\pi}$.

One can define

\[L_\pi(\textcolor{blue}{\pi'}) := J(\pi)+\sum_s d_\pi(s) \sum_a \textcolor{blue}{\pi'}(a \mid s) \cdot A_\pi(s, a)\]

and derive the lower bound $J(\textcolor{blue}{\pi'}) \geq L_\pi(\textcolor{blue}{\pi'})-c \cdot D_{\mathrm{KL}}^{\max }(\pi, \textcolor{blue}{\pi'})$, where $D_{\mathrm{KL}}^{\max }$ is the KL divergence maximized over states and $c$ depends on $\pi$. Writing the policies in terms of their parameters $\theta$ and $\textcolor{blue}{\theta'}$ and replacing the KL divergence penalty by a constraint, the problem becomes
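To get a feel for the surrogate and the lower bound, the following sketch evaluates $J(\pi')$, $L_\pi(\pi')$, and $L_\pi(\pi') - c \cdot D_{\mathrm{KL}}^{\max}(\pi, \pi')$ on a hypothetical random tabular MDP. The text above only says that $c$ depends on $\pi$; the code assumes the constant from Schulman et al. (2015), $c = 4\epsilon\gamma/(1-\gamma)^2$ with $\epsilon = \max_{s,a}\lvert A_\pi(s,a)\rvert$. The MDP, the mixture policies, and all helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9  # hypothetical sizes and discount factor

# Hypothetical random MDP: P[s, a, s'] transitions, R[s, a] rewards, mu start distribution.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
mu = np.full(S, 1.0 / S)


def evaluate(pi):
    """Return J(pi), the advantages A_pi[s, a], and the unnormalized d_pi[s]."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(axis=1))
    adv = R + gamma * P @ V - V[:, None]
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return float(mu @ V), adv, d


def surrogate(pi, pi_new):
    """L_pi(pi'): like the lemma, but with states weighted by d_pi instead of d_pi'."""
    J, adv, d = evaluate(pi)
    return J + d @ (pi_new * adv).sum(axis=1)


def max_kl(pi, pi_new):
    """KL(pi(.|s) || pi'(.|s)) maximized over states."""
    return (pi * np.log(pi / pi_new)).sum(axis=1).max()


pi_old = rng.dirichlet(np.ones(A), size=S)
other = rng.dirichlet(np.ones(A), size=S)
J_old, adv_old, _ = evaluate(pi_old)

# Assumed penalty coefficient (Schulman et al., 2015): c = 4 * eps * gamma / (1 - gamma)^2.
c = 4.0 * np.abs(adv_old).max() * gamma / (1.0 - gamma) ** 2

results = []
for alpha in (0.01, 0.1, 0.5, 1.0):
    pi_new = (1.0 - alpha) * pi_old + alpha * other  # move a fraction alpha away from pi
    J_new = evaluate(pi_new)[0]
    L = surrogate(pi_old, pi_new)
    bound = L - c * max_kl(pi_old, pi_new)
    results.append((alpha, J_new, L, bound))
    print(f"alpha={alpha:4.2f}  J={J_new:.4f}  L={L:.4f}  lower bound={bound:.4f}")
```

In runs of this kind the surrogate tracks $J$ closely for small steps while the penalized bound is very conservative, which is one way to see why the penalty is replaced by a constraint in practice.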

\[\begin{aligned} & \max _{\textcolor{blue}{\theta'}} \sum_s d_\theta(s) \sum_a \textcolor{blue}{\pi'}_{\textcolor{blue}{\theta'}}(a \mid s) \cdot A_\theta(s, a) \\ & \text { s.t. } \bar{D}_{\mathrm{KL}}^\theta(\theta, \textcolor{blue}{\theta'}) \leq \delta, \end{aligned}\]

where $\bar{D}_{\mathrm{KL}}^\theta$ is the KL divergence averaged over states $s \sim d_\theta$ and $\delta$ is the size of the trust region. Using importance sampling, the summation over actions $\sum_a(\cdot)$ is replaced by $\mathbb{E}_{a \sim q}\left[\frac{1}{q(a \mid s)}(\cdot)\right]$ for any sampling distribution $q$ with $q(a \mid s) > 0$. It is convenient to choose $q=\pi_\theta$, which results in:

\[\begin{aligned} & \max _{\textcolor{blue}{\theta'}} \mathbb{E}_{s \sim d_\theta, a \sim \pi_\theta}\left[\frac{\textcolor{blue}{\pi'}_{\textcolor{blue}{\theta'}}(a \mid s)}{\pi_\theta(a \mid s)} A_\theta(s, a)\right] \\ & \text { s.t. } \mathbb{E}_{s \sim d_\theta}\left[D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid s), \textcolor{blue}{\pi'}_{\textcolor{blue}{\theta'}}(\cdot \mid s)\right)\right] \leq \delta . \end{aligned}\]
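The sample-based form above can be illustrated with a small Monte Carlo check. This is a sketch under stated assumptions rather than a TRPO implementation: the tabular softmax policies, the state distribution `d` (a stand-in for a normalized $d_\theta$), and the advantage table `adv` (a stand-in for $A_\theta$) are all made up for the example. It compares the importance-weighted sample average with the exact double sum and also evaluates the averaged KL constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, N = 5, 3, 200_000  # hypothetical sizes and sample count


def softmax(logits):
    """Row-wise softmax, giving a tabular policy pi[s, a]."""
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


theta = rng.normal(size=(S, A))                     # current parameters
theta_new = theta + 0.1 * rng.normal(size=(S, A))   # candidate parameters
pi_old, pi_new = softmax(theta), softmax(theta_new)

d = rng.dirichlet(np.ones(S))                       # stand-in for normalized d_theta
adv = rng.normal(size=(S, A))                       # stand-in for A_theta
adv -= (pi_old * adv).sum(axis=1, keepdims=True)    # advantages average to zero under pi_old

# Exact objective and constraint, using the full sums over states and actions.
exact_obj = d @ (pi_new * adv).sum(axis=1)
kl_per_state = (pi_old * np.log(pi_old / pi_new)).sum(axis=1)
exact_kl = d @ kl_per_state

# Sample s ~ d and a ~ pi_old(.|s); actions are drawn by inverting each row's CDF.
states = rng.choice(S, size=N, p=d)
cdf = pi_old.cumsum(axis=1)
u = rng.random(N)
actions = np.minimum((u[:, None] > cdf[states]).sum(axis=1), A - 1)

# Importance-weighted estimate of the objective and sample average of the KL.
ratio = pi_new[states, actions] / pi_old[states, actions]
sample_obj = np.mean(ratio * adv[states, actions])
sample_kl = np.mean(kl_per_state[states])

print(f"objective: exact={exact_obj:.4f}  sampled={sample_obj:.4f}")
print(f"mean KL:   exact={exact_kl:.5f}  sampled={sample_kl:.5f}")
```

Treating $d_\theta$ as a normalized distribution only rescales the objective by a constant factor $(1-\gamma)$, so it does not change the maximizer. In an actual algorithm the states, actions, and advantage estimates would come from rollouts of $\pi_\theta$ rather than from the tables assumed here.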
