TRPO Details
The original paper: Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. PMLR, 2015.
Overview
This derivation follows Appendix A.1 of: Yang, Jiachen, et al. "Adaptive incentive design with multi-agent meta-gradient reinforcement learning." arXiv preprint arXiv:2112.10859 (2021).
The objective is the expected discounted return

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big], \qquad s_0 \sim \rho_0,\ a_t \sim \pi(\cdot \mid s_t),\ s_{t+1} \sim P(\cdot \mid s_t, a_t).$$

The problem is to find a new policy $\tilde{\pi}$ that is guaranteed to improve on the current policy $\pi$, i.e. to increase $\eta(\tilde{\pi})$.

The performance difference lemma (see below) shows that

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi}(s_t, a_t)\Big] = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_{\pi}(s, a),$$

where $A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s)$ is the advantage function and $\rho_{\tilde{\pi}}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi})$ is the (unnormalized) discounted state visitation frequency.

TRPO makes a local approximation, whereby $\rho_{\tilde{\pi}}$ is replaced by the visitation frequency $\rho_{\pi}$ of the current policy:

$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a \mid s) A_{\pi}(s, a).$$

$L_{\pi}$ matches $\eta$ to first order around the current policy parameters.

One can define

$$\epsilon = \max_{s, a} \lvert A_{\pi}(s, a) \rvert$$

and derive the lower bound

$$\eta(\tilde{\pi}) \ge L_{\pi}(\tilde{\pi}) - C \, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),$$

where $C = \frac{4 \epsilon \gamma}{(1 - \gamma)^2}$ and $D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}) = \max_s D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big)$. Maximizing this lower bound at every step yields a monotonically improving sequence of policies.
During online learning, the theoretically recommended penalty coefficient $C$ leads to very small steps, so in practice TRPO maximizes the sample estimate of $L_{\theta_{\mathrm{old}}}(\theta)$ subject to a trust-region constraint on the average KL divergence, $\bar{D}_{\mathrm{KL}}(\theta_{\mathrm{old}}, \theta) \le \delta$.
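To make the surrogate objective and the trust-region constraint concrete, here is a minimal sketch (not the authors' code) for a tabular softmax policy. It evaluates $L_{\pi_{\mathrm{old}}}$ and the average KL from assumed advantage and visitation estimates (`adv`, `rho_old` are illustrative placeholders), then takes a crude backtracking step along a plain finite-difference gradient instead of TRPO's conjugate-gradient natural-gradient direction.

```python
import numpy as np

# Minimal sketch: surrogate objective L and mean KL for a tabular softmax
# policy, plus a crude backtracking line search enforcing the trust region.
# Sizes and the quantities rho_old / adv are illustrative placeholders.

rng = np.random.default_rng(0)
n_states, n_actions, gamma, delta = 4, 3, 0.9, 0.01

# Quantities assumed to be estimated from rollouts under the old policy:
rho_old = rng.dirichlet(np.ones(n_states))            # discounted visitation (normalized)
adv = rng.normal(size=(n_states, n_actions))          # advantage estimates A_pi_old(s, a)
theta_old = rng.normal(size=(n_states, n_actions))    # old policy logits

def policy(theta):
    """Tabular softmax policy pi(a|s) from logits theta[s, a]."""
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def surrogate(theta):
    """L_{pi_old}(pi_theta), up to the constant eta(pi_old)."""
    return float(np.sum(rho_old[:, None] * policy(theta) * adv))

def mean_kl(theta):
    """Average KL(pi_old || pi_theta) under the old state distribution."""
    p_old, p_new = policy(theta_old), policy(theta)
    kl = np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=1)
    return float(np.dot(rho_old, kl))

# Ascent direction: a finite-difference gradient of L (a stand-in for the
# natural-gradient direction that TRPO computes with conjugate gradient).
grad = np.zeros_like(theta_old)
eps = 1e-5
for idx in np.ndindex(theta_old.shape):
    d = np.zeros_like(theta_old)
    d[idx] = eps
    grad[idx] = (surrogate(theta_old + d) - surrogate(theta_old - d)) / (2 * eps)

# Backtracking line search: accept the largest step that improves the
# surrogate while keeping the mean KL below delta.
step = 1.0
for _ in range(20):
    theta_new = theta_old + step * grad
    if surrogate(theta_new) > surrogate(theta_old) and mean_kl(theta_new) <= delta:
        break
    step *= 0.5

print(f"step={step:.4g}  L={surrogate(theta_new):.4f}  KL={mean_kl(theta_new):.5f}")
```

The accept/reject logic of the line search mirrors TRPO's: shrink the step until the surrogate improves and the KL constraint holds; only the search direction is simplified here.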
Performance Difference Lemma
In this section, the performance difference lemma used in the derivation above is stated and proved.

For all policies $\pi$ and $\tilde{\pi}$,

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi}(s_t, a_t)\Big].$$

The lemma is due to:
Kakade, Sham, and John Langford. “Approximately optimal approximate reinforcement learning.” Proceedings of the Nineteenth International Conference on Machine Learning. 2002.
Proof
The proof is given in the appendix of "On the theory of policy gradient methods: Optimality, approximation, and distribution shift" (Agarwal et al., 2021); it is transcribed here with additional details.
Let $\Pr^{\tilde{\pi}}(\tau)$ denote the probability of observing a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ when $s_0 \sim \rho_0$ and actions are taken according to $\tilde{\pi}$, where $\rho_0$ is the initial state distribution. Then

$$\eta(\tilde{\pi}) - \eta(\pi) = \mathbb{E}_{\tau \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\Big] - \mathbb{E}_{s_0 \sim \rho_0}\big[V_{\pi}(s_0)\big].$$

Because the sum $\sum_{t=0}^{\infty} \gamma^t \big(\gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t)\big)$ telescopes to $-V_{\pi}(s_0)$, the second term can be absorbed into the expectation over trajectories:

$$\eta(\tilde{\pi}) - \eta(\pi) = \mathbb{E}_{\tau \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t, a_t) + \gamma V_{\pi}(s_{t+1}) - V_{\pi}(s_t)\big)\Big].$$

But $\mathbb{E}\big[r(s_t, a_t) + \gamma V_{\pi}(s_{t+1}) \mid s_t, a_t\big] = Q_{\pi}(s_t, a_t)$ by definition, so by the tower property of expectation

$$\eta(\tilde{\pi}) - \eta(\pi) = \mathbb{E}_{\tau \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t \big(Q_{\pi}(s_t, a_t) - V_{\pi}(s_t)\big)\Big] = \mathbb{E}_{\tau \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi}(s_t, a_t)\Big],$$

which is the statement of the lemma. $\blacksquare$

Correspondingly, grouping the expectation by states rather than by time steps gives the second form used above:

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_{\pi}(s, a).$$
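As a quick numerical sanity check (not part of the original proof), the sketch below builds a small random MDP, solves for $V_{\pi}$, $A_{\pi}$, and the discounted visitation exactly, and confirms that both sides of the lemma agree; the MDP, policies, and sizes are all illustrative.

```python
import numpy as np

# Check: eta(pi_tilde) - eta(pi) == sum_s rho_{pi_tilde}(s) sum_a pi_tilde(a|s) A_pi(s, a)
# on a small random MDP, with all quantities computed exactly.

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] transition probabilities
r = rng.normal(size=(nS, nA))                   # rewards r(s, a)
rho0 = rng.dirichlet(np.ones(nS))               # initial state distribution

def random_policy():
    return rng.dirichlet(np.ones(nA), size=nS)  # pi[s, a]

def value(pi):
    """Exact V_pi by solving (I - gamma * P_pi) V = r_pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

def advantage(pi):
    V = value(pi)
    Q = r + gamma * P @ V            # Q[s, a] = r(s, a) + gamma * E[V(s')]
    return Q - V[:, None]

def eta(pi):
    return float(rho0 @ value(pi))

def visitation(pi):
    """Unnormalized discounted visitation rho_pi(s) = sum_t gamma^t P(s_t = s)."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho0)

pi, pi_tilde = random_policy(), random_policy()
lhs = eta(pi_tilde) - eta(pi)
rhs = float(np.sum(visitation(pi_tilde)[:, None] * pi_tilde * advantage(pi)))
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")      # the two sides agree
assert np.isclose(lhs, rhs)
```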
Other proofs
The remaining proofs are unfinished. I found I do not need them for now, so I will come back to them when needed.