RL Toolbox
This note will be consistently updated.
PPO Tricks
There are a total of 37 tricks, among which 13 are relatively core.
Adam Optimizer Epsilon Parameter
self.actor_optim = torch.optim.Adam(self.actor.parameters(), lr=config.lr_actor, eps=1e-5)
self.critic_optim = torch.optim.Adam(self.critic.parameters(), lr=config.lr_critic, eps=1e-5)
Gradient Clip
self.critic_optim.zero_grad()
critic_loss.backward()
torch.nn.utils.clip_grad_norm_(self.critic.parameters(), config.clip_range) # here
self.critic_optim.step()
self.actor_optim.zero_grad()
loss_actor.mean().backward()
torch.nn.utils.clip_grad_norm_(self.actor.parameters(), config.clip_range) # here
self.actor_optim.step()
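As a small, hedged illustration of what the clipping call above does: `torch.nn.utils.clip_grad_norm_` rescales all gradients in place so that their combined L2 norm is at most `max_norm`, and it returns the norm measured before clipping. The toy model and the threshold `max_norm = 0.5` below are assumptions for illustration only; they are not taken from this note's config.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)   # toy stand-in for the actor/critic (assumption)
max_norm = 0.5                  # illustrative gradient-norm threshold (assumption)

# Produce deliberately large gradients.
loss = (model(torch.randn(8, 4)) * 100.0).pow(2).mean()
loss.backward()

# Returns the total L2 norm *before* clipping; gradients are rescaled in place.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(float(norm_before), float(norm_after))
```

Only the magnitude of the update is limited; the direction of the gradient is unchanged.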
Tanh Activation Function
# A continuous actor
class Actor(torch.nn.Module):
    def __init__(self):
        super(Actor, self).__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(config.state_size, config.mlp_dim), torch.nn.Tanh(),
            torch.nn.LayerNorm(config.mlp_dim),
            torch.nn.Linear(config.mlp_dim, config.action_num),
            torch.nn.Tanh()
        )
        self.log_std = torch.nn.Parameter(torch.zeros(1, config.action_num))  # Gaussian std, learnable

    def forward(self, state):
        mean_raw = self.mlp(state)  # [-1, 1]
        mean = mean_raw * config.action_space_range  # [-max_a, max_a]
        std = torch.exp(self.log_std)  # std=exp(log_std)>0
        distribution = torch.distributions.Normal(mean, std)
        return distribution
Policy Entropy
def choose_action(self, state):
    state_tensor = torch.tensor(state).to(torch.float32).squeeze()
    distribution = self.actor(state_tensor)
    dist_entropy = distribution.entropy()
    action = distribution.sample().squeeze(dim=0)
    log_prob = distribution.log_prob(action)
    return action.detach().numpy(), log_prob, dist_entropy

negative_loss_actor = log_prob * TD_error.detach() + dist_entropy * config.entropy_coe
loss_actor = - negative_loss_actor
self.actor_optim.zero_grad()
loss_actor.mean().backward()
self.actor_optim.step()
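For intuition about the entropy bonus above: the entropy of a Gaussian policy depends only on its standard deviation, $H = \frac{1}{2}\log(2\pi e \sigma^2)$ per action dimension. The following is a minimal check that this closed form matches `distribution.entropy()`; the numbers are arbitrary illustrative values, not settings from this note.

```python
import math
import torch

mean = torch.tensor([0.3, -1.2])      # arbitrary illustrative values
log_std = torch.tensor([0.0, -0.5])   # same parameterization as the actor's log_std
std = torch.exp(log_std)

dist = torch.distributions.Normal(mean, std)
entropy_torch = dist.entropy()        # per-dimension entropy

# 0.5 * log(2*pi*e*sigma^2) = 0.5 * log(2*pi*e) + log(sigma)
entropy_closed = 0.5 * math.log(2 * math.pi * math.e) + log_std

print(entropy_torch, entropy_closed)
```

Because the entropy grows with `log_std`, the bonus term rewards a wider action distribution, which is one way to keep the policy exploring.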
Reward Scaling
Incremental mean
I have a dataset with $n$ samples $\{x_1, x_2, \ldots, x_n\}$. The expectation of $X$ is calculated as
\[\mu_n = \frac{1}{n}\sum\limits_{i=1}^n x_i\]
Then I get a new sample $x_{n+1}$, so the expectation of $X$ should be updated. The update can be expressed in terms of the current expectation:
\[\mu_{n+1} = \mu_n + \frac{1}{n+1}\left(x_{n+1} - \mu_n \right)\]
Derivation:
\[\begin{aligned} \mu_{n+1} =& \frac{1}{n+1}\sum\limits_{i=1}^{n+1} x_i = \frac{1}{n+1}\left(x_{n+1} + \sum\limits_{i=1}^n x_i \right) \\ =& \frac{1}{n+1}x_{n+1} + \frac{n}{n+1} \cdot \frac{1}{n}\sum\limits_{i=1}^n x_i \\ =& \frac{1}{n+1}x_{n+1} + \left(1 - \frac{1}{n+1}\right) \mu_n \\ =& \mu_n + \frac{1}{n+1}\left(x_{n+1} - \mu_n \right) \end{aligned}\]
To reduce the impact of previous samples, the coefficient is fixed as a constant $\alpha$:
\[\begin{aligned} \mu_{n+1} =& \mu_n + \alpha\left(x_{n+1} - \mu_n \right) \\ =& \alpha\cdot x_{n+1} + \left(1-\alpha\right)\cdot\mu_n \end{aligned}\]
Incremental Variance
For the population variance
\[s_n^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu_n)^2,\]
the incremental update is
\[s_n^2 = \frac{n-1}{n}s_{n-1}^2 + \frac{(x_n - \mu_{n-1})(x_n - \mu_n)}{n}.\]
Derivation: substitute the incremental mean
\[\mu_n = \mu_{n-1} + \frac{1}{n}(x_n - \mu_{n-1})\]
into the definition and expand the square:
\[\begin{aligned} s_n^2 =& \frac{1}{n} \sum_{i=1}^n \left(x_i - \mu_{n-1} - \frac{1}{n}(x_n - \mu_{n-1})\right)^2 \\ =& \frac{1}{n} \sum_{i=1}^{n}(x_i - \mu_{n-1})^2 - \frac{2}{n^2}(x_n - \mu_{n-1})\sum_{i=1}^{n}(x_i - \mu_{n-1}) + \frac{1}{n^3}(x_n - \mu_{n-1})^2\sum_{i=1}^{n}1 \end{aligned}\]
Since
\[\sum_{i=1}^{n-1}(x_i - \mu_{n-1}) = 0,\]
we have $\sum_{i=1}^{n}(x_i - \mu_{n-1}) = x_n - \mu_{n-1}$ and $\sum_{i=1}^{n}(x_i - \mu_{n-1})^2 = (n-1)s_{n-1}^2 + (x_n - \mu_{n-1})^2$. Therefore
\[\begin{aligned} s_n^2 =& \frac{n-1}{n}s_{n-1}^2 + \frac{1}{n}(x_n - \mu_{n-1})^2 - \frac{2}{n^2}(x_n - \mu_{n-1})^2 + \frac{1}{n^2}(x_n - \mu_{n-1})^2 \\ =& \frac{n-1}{n}s_{n-1}^2 + \frac{n-1}{n^2}(x_n - \mu_{n-1})^2 \\ =& \frac{n-1}{n}s_{n-1}^2 + \frac{1}{n}(x_n - \mu_{n-1})(x_n - \mu_n), \end{aligned}\]
where the last step uses $x_n - \mu_n = \frac{n-1}{n}(x_n - \mu_{n-1})$.
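The incremental mean and variance can be checked numerically. The sketch below tracks the population variance with the update $s_n^2 = \frac{n-1}{n}s_{n-1}^2 + \frac{1}{n}(x_n - \mu_{n-1})(x_n - \mu_n)$ and compares the result with NumPy's batch statistics. The class name `RunningMeanVar` and the synthetic reward stream are hypothetical, chosen only for illustration.

```python
import numpy as np

class RunningMeanVar:
    """Incrementally tracks the mean and population variance of a scalar stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var = 0.0  # population variance s_n^2

    def update(self, x):
        self.n += 1
        delta = x - self.mean          # x_n - mu_{n-1}
        self.mean += delta / self.n    # mu_n
        # s_n^2 = (n-1)/n * s_{n-1}^2 + (x_n - mu_{n-1}) * (x_n - mu_n) / n
        self.var = (self.n - 1) / self.n * self.var + delta * (x - self.mean) / self.n

rng = np.random.default_rng(0)
rewards = rng.normal(loc=2.0, scale=3.0, size=1000)  # synthetic rewards (assumption)

stats = RunningMeanVar()
for r in rewards:
    stats.update(r)

print(stats.mean, rewards.mean())
print(stats.var, rewards.var())

# One possible use: scale a reward by the running standard deviation.
scaled = rewards[-1] / (np.sqrt(stats.var) + 1e-8)
```

Dividing by the running standard deviation is only one way to use these statistics for reward scaling; some implementations track the statistics of the discounted return instead.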
Embedding for the Q-value Critic
Check the implementation of DIAL.
In my understanding, once inputs with different types and ranges have each been passed through an embedding, they can be treated as vectors living in the same space, so they can be added directly.
# From the CoLab: https://colab.research.google.com/gist/MJ10/2c0d1972f3dd1edcc3cd17c636aac8d2/dial.ipynb#scrollTo=G5e0IeqmIJJj
import torch
import torch.nn as nn
from torch.autograd import Variable

class CNet(nn.Module):
    def __init__(self, opts):
        """
        Initializes the CNet model
        """
        super(CNet, self).__init__()
        self.opts = opts
        self.comm_size = opts['game_comm_bits']
        self.init_param_range = (-0.08, 0.08)
        ## Lookup tables for the state, action and previous action.
        self.action_lookup = nn.Embedding(opts['game_nagents'], opts['rnn_size'])
        self.state_lookup = nn.Embedding(2, opts['rnn_size'])
        self.prev_action_lookup = nn.Embedding(opts['game_action_space_total'], opts['rnn_size'])
        # Single layer MLP(with batch normalization for improved performance) for producing embeddings for messages.
        self.message = nn.Sequential(
            nn.BatchNorm1d(self.comm_size),
            nn.Linear(self.comm_size, opts['rnn_size']),
            nn.ReLU(inplace=True)
        )
        # RNN to approximate the agent’s action-observation history.
        self.rnn = nn.GRU(input_size=opts['rnn_size'], hidden_size=opts['rnn_size'], num_layers=2, batch_first=True)
        # 2 layer MLP with batch normalization, for producing output from RNN top layer.
        self.output = nn.Sequential(
            nn.Linear(opts['rnn_size'], opts['rnn_size']),
            nn.BatchNorm1d(opts['rnn_size']),
            nn.ReLU(),
            nn.Linear(opts['rnn_size'], opts['game_action_space_total'])
        )

    def forward(self, state, messages, hidden, prev_action, agent):
        """
        Returns the q-values and hidden state for the given step parameters
        """
        state = Variable(torch.LongTensor(state))
        hidden = Variable(torch.FloatTensor(hidden))
        prev_action = Variable(torch.LongTensor(prev_action))
        agent = Variable(torch.LongTensor(agent))
        # Produce embeddings for rnn from input parameters
        z_a = self.action_lookup(agent)
        z_o = self.state_lookup(state)
        z_u = self.prev_action_lookup(prev_action)
        z_m = self.message(messages.view(-1, self.comm_size))
        # Add the input embeddings to calculate final RNN input.
        z = z_a + z_o + z_u + z_m
        z = z.unsqueeze(1)
        rnn_out, h = self.rnn(z, hidden)
        # Produce final CNet output q-values from GRU output.
        out = self.output(rnn_out[:, -1, :].squeeze())
        return h, out
Gumbel-Softmax
- Reparameterization.
- Maintain gradients from the sampled variables.
- Commonly used in communication methods.
What is gumbel-softmax for?
If $a_t\sim \pi_\theta(\cdot \mid s_t)$, then how to calculate $\nabla_\theta a_t$?
What is reparameterization?
This trick decouples the deterministic part and the random part of a variable.
This concept can be best illustrated with the example of the Gaussian distribution.
If $z\sim \mathcal{N}(\mu,\sigma^2)$, then $z = \mu + \sigma \cdot \epsilon$, where $\epsilon\sim \mathcal{N}(0,1)$. In this way, $\frac{\partial z}{\partial \mu} = 1$ and $\frac{\partial z}{\partial \sigma} = \epsilon$. Usually $\mu$ and $\sigma$ are estimated by a neural network, and the gradients with respect to its parameters can then be computed automatically by deep learning frameworks.
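The Gaussian case can be verified with autograd in a few lines. This is a minimal sketch with arbitrary illustrative values for $\mu$ and $\sigma$; in practice they would be network outputs.

```python
import torch

torch.manual_seed(0)
mu = torch.tensor(1.5, requires_grad=True)     # illustrative values
sigma = torch.tensor(0.7, requires_grad=True)

eps = torch.randn(())       # random part, independent of mu and sigma
z = mu + sigma * eps        # deterministic, differentiable transform
z.backward()

print(mu.grad, sigma.grad, eps)  # dz/dmu = 1, dz/dsigma = eps
```

`torch.distributions.Normal(mu, sigma).rsample()` performs the same construction, whereas `.sample()` returns a value that carries no gradient.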
What does Gumbel-Softmax do?
We often use neural networks to generate a point on the probability simplex, i.e., a probability vector with $p_i \ge 0$ and $\sum\limits_{i} p_i = 1$. Then we sample an $x$ from this distribution.
An example scenario is in RL, where an agent needs to choose an action $a_t$. We output a distribution $\pi(\cdot \mid s_t)$ and then sample an action $a_t\sim \pi(\cdot \mid s_t)$ based on this distribution to execute.
Gumbel-Softmax is used to reparameterize this kind of categorical distribution. This technique allows samples to be drawn according to the original distribution while still enabling gradient computation.
\[z = \arg\max\limits_i \left(\log(p_i) + g_i\right),\]
where $g_i = -\log(-\log (u_i))$, $u_i\sim U(0,1)$.
The argmax, $i = \arg\max\limits_{j} (x_j)$, is non-differentiable, so it is replaced with a softmax with temperature $T$:
\[\mathrm{softmax}_T (x)_j = \frac{e^{x_j/T}}{\sum_k e^{x_k/T}}.\]
If the temperature $T$ is small enough, the output of the softmax is close to a one-hot vector that indicates $i$.
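The two steps above (Gumbel-max sampling, then the softmax relaxation) can be written out by hand. This is a minimal sketch, assuming an arbitrary three-way distribution `p` and temperature `T = 0.1`; it is not the internals of `torch.nn.functional.gumbel_softmax`, only the same idea spelled out.

```python
import torch

torch.manual_seed(0)
p = torch.tensor([0.1, 0.6, 0.3])   # illustrative categorical distribution
n_samples = 200_000

# Gumbel noise g = -log(-log(u)), u ~ U(0, 1); the clamp avoids log(0).
u = torch.rand(n_samples, 3).clamp_min(1e-20)
g = -torch.log(-torch.log(u))
perturbed = torch.log(p) + g

# Gumbel-max: the argmax index is a sample from p.
idx = perturbed.argmax(dim=-1)
freq = torch.bincount(idx, minlength=3).float() / n_samples

# Relaxation: a softmax with temperature T replaces the argmax.
T = 0.1
soft = torch.softmax(perturbed / T, dim=-1)  # each row sums to 1

print(freq)     # empirical frequencies, close to p
print(soft[0])  # one relaxed sample
```

The empirical frequencies of the hard samples approach `p`, while each relaxed sample is a differentiable function of `log(p)`.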
$x\ne \log(\mathrm{softmax}(x))$
\[\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}\]
\[\begin{aligned} \log(\mathrm{softmax}(x_i)) =& \log\left(\frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}\right) \\ =& x_i - \log\left(\sum_{j=1}^{n} e^{x_j}\right) \end{aligned}\]
import torch
x = torch.rand(5)
x1 = torch.nn.Softmax(dim=0)(x)
x2 = torch.nn.functional.softmax(x, dim=0)
x3 = torch.nn.functional.log_softmax(x, dim=0)
print(x1)
print(x2)
print(torch.log(x1))
print(x3)
tensor([0.1385, 0.1978, 0.2231, 0.2861, 0.1543])
tensor([0.1385, 0.1978, 0.2231, 0.2861, 0.1543])
tensor([-1.9766, -1.6204, -1.4999, -1.2512, -1.8686])
tensor([-1.9766, -1.6204, -1.4999, -1.2512, -1.8686])
Example code
import torch

if __name__ == '__main__':
    batch_size = int(1e7)
    logits_distribution = [2, 3]
    logits_batch = torch.tensor(logits_distribution, dtype=torch.float64) \
        .unsqueeze(dim=0).expand(batch_size, len(logits_distribution))
    softmax = torch.nn.Softmax(dim=-1)
    pi = softmax(logits_batch)

    # -----
    # The standard way.
    temperature = 1
    actions_sampled = torch.nn.functional.gumbel_softmax(logits_batch, tau=temperature, hard=True)
    a0_num = torch.sum(actions_sampled[:, 0])
    a1_num = torch.sum(actions_sampled[:, 1])
    print(pi[0], a0_num, a1_num, sep="\n")

    # -----
    # In RL, the common epsilon-greedy is an operation on the policy space.
    # To sample it, we need to edit the policy first, and then put log(pi) into the gumbel-softmax.
    # See https://stackoverflow.com/questions/64980330/input-for-torch-nn-functional-gumbel-softmax
    print('===============')
    temperature = 1
    actions_sampled = torch.nn.functional.gumbel_softmax(torch.log(pi), tau=temperature, hard=True)
    a0_num = torch.sum(actions_sampled[:, 0])
    a1_num = torch.sum(actions_sampled[:, 1])
    print(pi[0], a0_num, a1_num, sep="\n")
tensor([0.2689, 0.7311], dtype=torch.float64)
tensor(2687766., dtype=torch.float64)
tensor(7312234., dtype=torch.float64)
===============
tensor([0.2689, 0.7311], dtype=torch.float64)
tensor(2690092., dtype=torch.float64)
tensor(7309908., dtype=torch.float64)
Applying Gumbel-Softmax may cause NaN during training. Changing the data type of the variable to float64 seems to have avoided this issue.
Computation graph
Check my note on computation graph.
Advantage
This section is generated by ChatGPT-4.
The reparameterization trick reduces the variance of gradient estimates primarily by altering the way stochastic variables are sampled, introducing randomness only through differentiable transformations rather than direct sampling from the policy distribution. This approach offers several key benefits that help understand why it can reduce the variance of gradient estimates:
Direct backpropagation through the stochastic component: Before reparameterization, the policy’s gradient could not be directly backpropagated through the random action sampling process, as this process was non-differentiable. This meant that gradient updates had to rely on external estimates of sampled actions, such as using Monte Carlo methods, typically leading to higher variance. Reparameterization allows gradients to be directly backpropagated through the action generation process, now involving a differentiable transformation (i.e., the output of the policy network plus noise), directly reducing the variance caused by sampling.
Reducing the impact of sampling noise: By confining randomness to noise from a known distribution and transforming it into actions through the policy network, reparameterization reduces the direct impact of sampling noise on the policy output. The effect of noise is modulated through a differentiable, network-parameterized function, allowing the algorithm to adjust these effects more efficiently through gradient descent rather than relying solely on the outcomes of random sampling.
Smoothing the optimization process: Since gradients can be directly calculated through the action generation process, each gradient update reflects direct improvements for the current policy parameters, rather than indirect estimates from sampling. This makes the optimization process smoother, reducing the fluctuations in optimization due to high sampling variance.
Improving sample efficiency: Reducing the variance of gradient estimates means that for the same number of samples, the algorithm can obtain more accurate gradient estimates. This improves sample efficiency because the information provided by each sample is utilized more effectively, accelerating learning speed and enhancing policy performance.
In summary, reparameterization reduces uncertainty and variance caused by sampling by changing the way stochastic variables are sampled and processed, making gradient estimates more stable and accurate. This contributes to improved performance and efficiency of reinforcement learning algorithms.
Social Influence
- A MARL method.
- An intrinsic reward.
- Agent $i$ chooses the action that has the most impact on others.
In the principal-agent communication:
\[r^i = D_{KL}\left[ \pi^j(a^j\mid\sigma^i) \Big\Vert \sum\limits_{\sigma^{i\prime}}\varphi^i(\sigma^{i\prime}\mid s)\cdot \pi^j(a^j\mid\sigma^{i\prime})\right]\]
Basics
TD(0)
Resampling techniques are a class of statistical methods that involve creating new samples by repeatedly drawing observations from the original data sample.
Bootstrapping is a method where new “bootstrap samples” are created by drawing observations with replacement from the original sample.
In RL, a common example is Temporal Difference (TD) learning. This method bootstraps from the current estimate of the value function. The value function is defined as
\[V(s) = \mathbb{E}\left[\sum\limits_{t=0}^\infty \gamma^t \cdot r_{t+1} \mid s_0 = s\right]\]
But if the trajectory never ends, we cannot collect all the rewards needed to calculate the expectation.
According to the Bellman equation, the value function can be calculated as
\[V(s) = \mathbb{E}\left[r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s\right]\]
Now, when I get a new sample of $r_{t+1}$, I can use it to update $V(s_t)$ with the incremental mean trick:
\[V(s_t) \gets V(s_t) + \alpha\left(x_{n+1} - V(s_t) \right),\]
where $x_{n+1} = r_{t+1} + \gamma V(s_{t+1})$. Here $V(s_{t+1})$ is not the ground-truth value, but we can still use it. (Proving convergence is a separate matter.)
So we can say that the value function is updated based on itself: this method uses $V(s_{t+1})$ instead of $\sum\limits_{k=t}^\infty \gamma^{k-t}\cdot r_{k+2}$. That is what bootstrapping means.
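The TD(0) update can be sketched on a tiny tabular problem. The two-state deterministic chain below (s0 -> s1 -> terminal, reward 1 on every step) and the values of `gamma` and `alpha` are assumptions chosen so that the true values are easy to verify by hand: $V(s_1) = 1$ and $V(s_0) = 1 + 0.9 \cdot 1 = 1.9$.

```python
# Toy deterministic chain (assumption): s0 -> s1 -> terminal, reward 1 on each step.
gamma = 0.9
alpha = 0.1
V = {0: 0.0, 1: 0.0, "terminal": 0.0}  # V(terminal) stays at 0

def step(state):
    """Returns (reward, next_state) for the toy chain."""
    return (1.0, 1) if state == 0 else (1.0, "terminal")

for episode in range(500):
    s = 0
    while s != "terminal":
        r, s_next = step(s)
        td_target = r + gamma * V[s_next]   # x_{n+1} = r_{t+1} + gamma * V(s_{t+1})
        V[s] += alpha * (td_target - V[s])  # incremental-mean update with constant alpha
        s = s_next

print(V[0], V[1])  # approaches 1.9 and 1.0
```

Note that `V[0]` is updated using `V[1]` long before `V[1]` is accurate; the estimates still converge because each one improves as the estimate it bootstraps from improves.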
LLM Text Embedding for State
ToM-agent: Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection https://arxiv.org/html/2501.15355v1
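Let an LLM describe the environment state, then use the text embedding API provided by OpenAI to convert the natural-language description of the state into an embedding, and pass that embedding to the RL agent to learn from.
This makes it possible to introduce prior knowledge.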
Fourier Features
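If a network's input is low-dimensional and its output is high-dimensional, this trick can be used to capture the mapping.
Original paper: Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains https://arxiv.org/abs/2006.10739
Example application: Solving Infinite-Player Games with Player-to-Strategy Networks https://arxiv.org/pdf/2501.09330
The core idea in one sentence: instead of storing a separate strategy for every player, train a function $s_\theta$ (a neural network with parameters $\theta$) that takes a player as input and outputs that player's strategy. To find out how player $i$ plays, run the forward pass $s_\theta(i)$.
The network input consists of three parts (which ones are used depends on the game):
- Features that identify the player: for example, in a spatial game a player is a point in the plane (2D coordinates); in the "continuous traders" game a player is a scalar in the unit interval $[0,1]$; if there is a similarity structure among players, embedding vectors can be used (similar players get nearby embeddings).
- The observation received by the player (in imperfect-information games).
- Random noise: used to "randomize" the output so that it can represent mixed strategies (see §4.3).
Shared parameters: all players share the same set of network parameters $\theta$. Thanks to generalization, one set of parameters covers infinitely many players at once, which is the key to handling the "infinite" case.
Why Fourier features (with background):
First, what the "player feature space" is. In P2SN, the input of the network $s_\theta$ is the set of coordinates/features used to identify a player. The space spanned by these features is the network's input domain, i.e., the space in which the strategy function $s_\theta$ "lives". It is usually very low-dimensional:
- In a spatial game, a player is a point in the plane and the input is the 2D coordinate $(x_1, x_2) \in [0,1]^2$, so the player feature space is the 2D unit square;
- In the "continuous traders" game, a player is a scalar $x \in [0,1]$ in the unit interval, so the player feature space is a 1D line segment.
In other words, what $s_\theta$ has to represent is a function defined on this low-dimensional space: "where the player is located → how the player at that location should play".
Where the difficulty lies. Equilibrium strategies often change sharply, at "high frequency", as the player's position changes. For example, in an anti-coordination game neighboring players must choose different resources, so "strategy as a function of coordinates" forms dense alternating stripes, which is high spatial frequency in the signal-processing sense. It is a known fact that standard feed-forward networks (MLPs) have a spectral bias: they naturally favor learning "low-frequency, smooth" functions and struggle to capture such detailed, rapidly varying patterns on low-dimensional inputs (they either fail to learn them at all or converge extremely slowly). Intuitively, a plain MLP on a 2D input behaves as if it were drawing a "smooth surface" and cannot draw sharp alternating patterns.
The fix (borrowed from the random Fourier features of Tancik et al. 2020). Before feeding the input to the MLP, pass it through a sine/cosine mapping that lifts the low-dimensional input into a high-dimensional feature space:
\[f(x) = \big(\sin(Bx+b),\ \cos(Bx+b)\big),\]
where the frequency matrix $B$ is initialized from a normal distribution with standard deviation $\sigma=100$ (the larger $\sigma$ is, the higher the sampled frequencies and the more high-frequency detail can be expressed), and the phase $b$ is initialized uniformly on $[0, 2\pi)$. Intuition: an MLP alone only draws smooth surfaces, but once a set of sine/cosine waves of various frequencies is laid down as a "basis", the MLP can assemble "sharp, alternating" strategy patterns from them, much like a Fourier series. The authors found that this step significantly improves the representational power of P2SN.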
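The mapping $f(x) = (\sin(Bx+b), \cos(Bx+b))$ can be sketched as a small PyTorch module placed in front of an MLP. This is a hedged illustration, not the paper's implementation: the class name `FourierFeatures`, the number of frequencies, the MLP sizes, and the choice to keep $B$ and $b$ fixed (stored as buffers rather than trained) are all assumptions.

```python
import math
import torch

class FourierFeatures(torch.nn.Module):
    """Maps x to (sin(Bx + b), cos(Bx + b)) with random B and b."""

    def __init__(self, in_dim, num_frequencies, sigma=100.0):
        super().__init__()
        # Buffers are saved with the module but not trained (assumption of this sketch).
        self.register_buffer("B", torch.randn(num_frequencies, in_dim) * sigma)
        self.register_buffer("b", torch.rand(num_frequencies) * 2 * math.pi)

    def forward(self, x):
        proj = x @ self.B.T + self.b  # (batch, num_frequencies)
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

torch.manual_seed(0)
features = FourierFeatures(in_dim=2, num_frequencies=64)
mlp = torch.nn.Sequential(  # illustrative sizes
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3)
)

players = torch.rand(5, 2)  # 5 players as points in [0, 1]^2
encoded = features(players)
out = mlp(encoded)
print(encoded.shape, out.shape)
```

The 2D player coordinate becomes a 128-dimensional feature vector, so the MLP sees many oscillating functions of the position rather than the raw coordinate.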