
TRPO Details

The original paper: Schulman, John, et al. “Trust region policy optimization.” International Conference on Machine Learning. PMLR, 2015.

Overview

This derivation comes from Appendix A.1 of the following paper: Yang, Jiachen, et al. “Adaptive incentive design with multi-agent meta-gradient reinforcement learning.” arXiv preprint arXiv:2112.10859 (2021).

The objective is

$$J(\pi) := \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t} r(s_t, a_t)\right].$$

The problem is

$$\max_{\pi} J(\pi).$$

The performance difference lemma (see below) shows that

$$J(\pi') = J(\pi) + \mathbb{E}_{\pi'}\left[\sum_{t=0}^{\infty}\gamma^{t} A_{\pi}(s_t, a_t)\right] = J(\pi) + \sum_{s} d_{\pi'}(s)\sum_{a}\pi'(a\mid s)\,A_{\pi}(s,a),$$

where $d_{\pi'}(s)$ denotes the discounted state visitation frequencies under $\pi'$ and $A_{\pi}$ is the advantage function of policy $\pi$.
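
For reference, these are the standard (unnormalized) discounted visitation frequencies and advantage function, as in the TRPO paper:

$$d_{\pi'}(s) := \sum_{t=0}^{\infty}\gamma^{t}\Pr(s_t = s \mid \pi'), \qquad A_{\pi}(s,a) := Q_{\pi}(s,a) - V_{\pi}(s).$$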

TRPO makes a local approximation, whereby $d_{\pi'}$ is replaced by $d_{\pi}(s)$.

One can define

$$L_{\pi}(\pi') := J(\pi) + \sum_{s} d_{\pi}(s)\sum_{a}\pi'(a\mid s)\,A_{\pi}(s,a)$$

and derive the lower bound $J(\pi') \geq L_{\pi}(\pi') - c\,D_{\mathrm{KL}}^{\max}(\pi, \pi')$, where $D_{\mathrm{KL}}^{\max}$ is the KL divergence maximized over states and $c$ depends on $\pi$ (in the TRPO paper, $c = 4\epsilon\gamma/(1-\gamma)^{2}$ with $\epsilon = \max_{s,a}\lvert A_{\pi}(s,a)\rvert$). The KL divergence penalty can be replaced by a constraint, so the problem becomes

$$\max_{\theta'}\;\sum_{s} d_{\theta}(s)\sum_{a}\pi_{\theta'}(a\mid s)\,A_{\theta}(s,a) \quad \text{s.t.} \quad \bar{D}_{\mathrm{KL}}^{\theta}(\theta, \theta') \leq \delta,$$

where $\bar{D}_{\mathrm{KL}}^{\theta}$ is the KL divergence averaged over states $s \sim d_{\theta}$. Using importance sampling, the summation over actions $\sum_{a}(\cdot)$ is replaced by $\mathbb{E}_{a\sim q}\!\left[\frac{1}{q(a\mid s)}(\cdot)\right]$. It is convenient to choose $q = \pi_{\theta}$, which results in:

$$\max_{\theta'}\;\mathbb{E}_{s\sim d_{\theta},\,a\sim\pi_{\theta}}\!\left[\frac{\pi_{\theta'}(a\mid s)}{\pi_{\theta}(a\mid s)}\,A_{\theta}(s,a)\right] \quad \text{s.t.} \quad \mathbb{E}_{s\sim d_{\theta}}\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta}(\cdot\mid s),\,\pi_{\theta'}(\cdot\mid s)\right)\right] \leq \delta.$$
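
Both the objective and the constraint are easy to estimate from samples collected with the old policy. Below is a minimal sketch for a tabular softmax policy, assuming the advantage estimates are already computed; the function and array names are my own illustration, not the notation of either paper (actual TRPO then solves the constrained step with a conjugate-gradient natural-gradient step plus a line search).

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the action dimension."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def surrogate_and_kl(theta_new, theta_old, states, actions, advantages):
    """Monte Carlo estimates of the surrogate objective and the average KL.

    theta_new, theta_old : (num_states, num_actions) logits of tabular softmax policies
    states, actions      : (N,) indices sampled with the old policy (s ~ d_theta, a ~ pi_theta)
    advantages           : (N,) estimates of A_theta(s, a) under the old policy
    """
    pi_new = softmax(theta_new)   # pi_{theta'}(a|s)
    pi_old = softmax(theta_old)   # pi_theta(a|s), the sampling policy

    # Importance-weighted surrogate: E[(pi_{theta'} / pi_theta) * A_theta]
    ratio = pi_new[states, actions] / pi_old[states, actions]
    surrogate = np.mean(ratio * advantages)

    # KL(pi_theta(.|s), pi_{theta'}(.|s)), averaged over the sampled states
    kl_per_state = np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=-1)
    avg_kl = np.mean(kl_per_state[states])

    return surrogate, avg_kl
```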

During online learning, the $\theta'$ being optimized and the old $\theta$ coincide at each iteration, so the gradient estimate is

$$\mathbb{E}_{\pi_{\theta}}\!\left[\frac{\nabla_{\theta}\pi_{\theta}(a\mid s)}{\pi_{\theta}(a\mid s)}\,A_{\theta}(s,a)\right].$$
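
Since $\frac{\nabla_{\theta}\pi_{\theta}(a\mid s)}{\pi_{\theta}(a\mid s)} = \nabla_{\theta}\log\pi_{\theta}(a\mid s)$, this is the familiar likelihood-ratio gradient. As a sketch of the Monte Carlo estimate, reusing the `softmax` helper from the snippet above (again, the names are illustrative):

```python
def policy_gradient_estimate(theta, states, actions, advantages):
    """Estimate E_{pi_theta}[ (grad pi_theta(a|s) / pi_theta(a|s)) * A_theta(s, a) ]."""
    pi = softmax(theta)            # (num_states, num_actions)
    grad = np.zeros_like(theta)
    for s, a, adv in zip(states, actions, advantages):
        # For a softmax policy, grad_theta log pi_theta(a|s) equals
        # one_hot(a) - pi_theta(.|s) in row s and is zero elsewhere.
        grad_log = -pi[s].copy()
        grad_log[a] += 1.0
        grad[s] += grad_log * adv
    return grad / len(states)
```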

Performance Difference Lemma

In this section, the $\pi$ and $\pi'$ from the previous text have been swapped.

For all policies $\pi, \pi'$ and states $s_0$,

$$V^{\pi}(s_0) - V^{\pi'}(s_0) = \mathbb{E}_{\tau\sim\Pr^{\pi}(\tau\mid s_0)}\!\left[\sum_{t=0}^{\infty}\gamma^{t}A^{\pi'}(s_t,a_t)\right] = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi}_{s_0}}\mathbb{E}_{a\sim\pi(\cdot\mid s)}\!\left[A^{\pi'}(s,a)\right].$$

Kakade, Sham, and John Langford. “Approximately optimal approximate reinforcement learning.” Proceedings of the Nineteenth International Conference on Machine Learning. 2002.
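
As a quick sanity check before the proof, the lemma can be verified numerically on a small random MDP by computing both sides exactly; everything below (the random MDP, the helper names, the normalized visitation $d^{\pi}_{s_0}$) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, s0 = 4, 3, 0.9, 0

# Random MDP and two random policies pi and pi'
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s'] = P(s'|s, a)
r = rng.random((nS, nA))                        # r[s, a]
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi[s, a]
pi_p = rng.dirichlet(np.ones(nA), size=nS)      # pi'[s, a]

def value(policy):
    """Exact V^policy via the Bellman linear system (I - gamma * P_policy) V = r_policy."""
    P_pol = np.einsum('sa,sat->st', policy, P)
    r_pol = np.einsum('sa,sa->s', policy, r)
    return np.linalg.solve(np.eye(nS) - gamma * P_pol, r_pol)

V_pi, V_pp = value(pi), value(pi_p)
Q_pp = r + gamma * P @ V_pp                     # Q^{pi'}[s, a]
A_pp = Q_pp - V_pp[:, None]                     # A^{pi'}[s, a]

# Normalized discounted visitation d^pi_{s0}(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s)
P_pi = np.einsum('sa,sat->st', pi, P)
d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, np.eye(nS)[s0])

lhs = V_pi[s0] - V_pp[s0]
rhs = np.einsum('s,sa,sa->', d, pi, A_pp) / (1 - gamma)
print(np.isclose(lhs, rhs))                     # True
```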

Proof

The proof is provided in the appendix of “On the theory of policy gradient methods: Optimality, approximation, and distribution shift” (Agarwal et al.); I transcribe it here with additional details.

Let $\Pr^{\pi}(\tau\mid s_0 = s)$ denote the probability of observing a trajectory $\tau$ when starting in state $s$ and following the policy $\pi$. Using a telescoping argument, we have:

$$\begin{aligned}
V^{\pi}(s) - V^{\pi'}(s)
&= \mathbb{E}_{\tau\sim\Pr^{\pi}(\tau\mid s_0=s)}\!\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_t,a_t)\right] - V^{\pi'}(s)\\
&= \mathbb{E}_{\tau\sim\Pr^{\pi}(\tau\mid s_0=s)}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_t,a_t) + V^{\pi'}(s_t) - V^{\pi'}(s_t)\right)\right] - V^{\pi'}(s)\\
&\overset{(a)}{=} \mathbb{E}_{\tau\sim\Pr^{\pi}(\tau\mid s_0=s)}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_t,a_t) + \gamma V^{\pi'}(s_{t+1}) - V^{\pi'}(s_t)\right)\right]\\
&\overset{(b)}{=} \mathbb{E}_{\tau\sim\Pr^{\pi}(\tau\mid s_0=s)}\!\left[\sum_{t=0}^{\infty}\gamma^{t}\left(r(s_t,a_t) + \gamma\,\mathbb{E}\!\left[V^{\pi'}(s_{t+1})\mid s_t,a_t\right] - V^{\pi'}(s_t)\right)\right]\\
&\overset{(c)}{=} \mathbb{E}_{\tau\sim\Pr^{\pi}(\tau\mid s_0=s)}\!\left[\sum_{t=0}^{\infty}\gamma^{t}A^{\pi'}(s_t,a_t)\right]\\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s'\sim d^{\pi}_{s}}\mathbb{E}_{a\sim\pi(\cdot\mid s')}\!\left[A^{\pi'}(s',a)\right],
\end{aligned}$$

where (a) rearranges terms in the summation and cancels the $V^{\pi'}(s_0)$ term with the $V^{\pi'}(s)$ outside the summation, (b) uses the tower property of conditional expectations, (c) uses the definition of the advantage $A^{\pi'} = Q^{\pi'} - V^{\pi'}$, and the final equality follows from the definition of $d^{\pi}_{s}$.

Details

(a):

$$-a_0 + \sum_{k=0}^{\infty}\left(a_k - b_k\right) = \sum_{k=0}^{\infty}\left(a_{k+1} - b_k\right).$$
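
To see how this applies in step (a), take $a_k = b_k = \gamma^{k}V^{\pi'}(s_k)$ and $a_0 = V^{\pi'}(s)$; the reward terms are untouched, and the value terms regroup as

$$-V^{\pi'}(s) + \sum_{t=0}^{\infty}\left(\gamma^{t}V^{\pi'}(s_t) - \gamma^{t}V^{\pi'}(s_t)\right) = \sum_{t=0}^{\infty}\left(\gamma^{t+1}V^{\pi'}(s_{t+1}) - \gamma^{t}V^{\pi'}(s_t)\right) = \sum_{t=0}^{\infty}\gamma^{t}\left(\gamma V^{\pi'}(s_{t+1}) - V^{\pi'}(s_t)\right).$$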

(b): The tower property of conditional expectations (the law of total expectation): if $\mathcal{H} \subseteq \mathcal{G}$, then

$$\mathbb{E}\!\left[\mathbb{E}\!\left[X\mid\mathcal{G}\right]\mid\mathcal{H}\right] = \mathbb{E}\!\left[X\mid\mathcal{H}\right].$$

Correspondingly,

  • $\mathcal{G}$ is the $\sigma$-algebra generated by the full trajectory $\tau\sim\Pr^{\pi}(\tau\mid s_0=s)$,
  • $\mathcal{H}$ is the $\sigma$-algebra generated by $(s_t, a_t)$.

(c): Step (b) is necessary. Note that, for a single realized next state $s'$,

$$Q^{\pi'}(s,a) \neq r(s,a) + \gamma V^{\pi'}(s').$$

But

$$Q^{\pi'}(s,a) = r(s,a) + \gamma\sum_{s'}P(s'\mid s,a)\,V^{\pi'}(s').$$
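
Combining the two displays above, the term inside the expectation after step (b) is exactly the advantage:

$$r(s_t,a_t) + \gamma\,\mathbb{E}\!\left[V^{\pi'}(s_{t+1})\mid s_t,a_t\right] - V^{\pi'}(s_t) = Q^{\pi'}(s_t,a_t) - V^{\pi'}(s_t) = A^{\pi'}(s_t,a_t),$$

which is why (c) relies on the conditional expectation introduced in (b).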

Other proofs

Check other proofs here and here.

The rest is not finished yet… I found I do not need it for now, so I will come back to it when it is needed.
