Theory of Mind and Markov Models
We do not see things as they are, we see them as we are. — Anaïs Nin.
What is Theory of Mind?
In psychology, theory of mind refers to the capacity to understand other people by ascribing mental states to them (that is, surmising what is happening in their mind). This includes the knowledge that others’ beliefs, desires, intentions, emotions, and thoughts may be different from one’s own.
Possessing a functional theory of mind is considered crucial for success in everyday human social interactions. People use such a theory when analyzing, judging, and inferring others’ behaviors. The discovery and development of theory of mind primarily came from studies done with animals and infants.
Empathy—the recognition and understanding of the states of mind of others, including their beliefs, desires, and particularly emotions—is a related concept. Empathy is often characterized as the ability to “put oneself into another’s shoes”. Recent neuro-ethological studies of animal behaviour suggest that even rodents may exhibit empathetic abilities. While empathy is known as emotional perspective-taking, theory of mind is defined as cognitive perspective-taking.[1]
In my understanding, theory of mind refers to an individual’s ability to model others’ decision-making processes based on the partial observations those others receive.
Basic Markov Models
Dec-POMDP
A decentralized partially observable Markov decision process (Dec-POMDP) is a tuple $(I, S, \{A^i\}_{i\in I}, \mathcal{P}, R, \{O^i\}_{i\in I}, \mathcal{Q}, \gamma)$, where
- $I$ is a set of agents ($\vert I\vert=n$ and they are cooperative),
- $S$ is a set of global states of the environment (agents never observe the sampled state directly, but they know the state set),
- $A^i$ is a set of actions for agent $i$, with $\boldsymbol{A}=\times_{i\in I} A^i$ being the set of joint actions,
- $\mathcal{P}:S\times \boldsymbol{A}\to\Delta(S)$ is the state transition probability where $\Delta(S)$ is a set of distributions over $S$,
- $R:S\times \boldsymbol{A}\to \mathbb{R}$ is a reward function (not $\mathbb{R}^n$ since the agents are cooperative),
- $O^i$ is a set of observations for agent $i$, with $\boldsymbol{O} = \times_{i\in I} O^i$ being the set of joint observations,
- $\mathcal{Q}:S\times \boldsymbol{A}\to\Delta(\boldsymbol{O})$ is an observation emission function (sometimes $\mathcal{Q}:S\to\Delta(\boldsymbol{O})$), and
- $\gamma\in[0,1]$ is the discount factor.
One step of the process is: $(s_{t}, \boldsymbol{o}_{t})\to \boldsymbol{a}_{t}\to (s_{t+1} \to (\boldsymbol{o}_{t+1}, r_{t}))$. At each timestep $t$,
- each agent takes an action $a_t^i\in A^i$ based on its belief about the current state, formed from its observation $o_t^i$ and its previous belief (the term “belief” will be introduced later),
- $s_{t+1}\sim \mathcal{P}(\cdot \mid s_t, \boldsymbol{a}_t)$,
- $\boldsymbol{o}_{t+1} \sim \mathcal{Q}(\cdot \mid s_{t+1}, \boldsymbol{a}_{t})$ and a reward is generated for the whole team based on the reward function $R(s,\boldsymbol{a})$.
These timesteps repeat until some given horizon (the finite-horizon case) or forever (the infinite-horizon case). The discount factor $\gamma$ keeps the cumulative reward finite in the infinite-horizon case ($\gamma \in [0,1)$). The goal is to maximize the expected cumulative reward over a finite or infinite number of steps.[1]
What is different from what I previously thought is that the observations are sampled after agents make decisions at each timestep.
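To make this ordering concrete, here is a minimal Python sketch of one Dec-POMDP step with tabular $\mathcal{P}$, $\mathcal{Q}$, and $R$. The toy sizes, array layout, and function name are my own choices for illustration, not part of the formal definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: 2 agents, 2 global states, 2 actions per agent, 2 observations per agent.
n_states = 2
joint_actions = [(a1, a2) for a1 in range(2) for a2 in range(2)]

# Tabular model: P[s, ja] is a distribution over next states,
# Q[s_next, ja] is a distribution over the 2x2 = 4 joint observations,
# R[s, ja] is the shared team reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, len(joint_actions)))
Q = rng.dirichlet(np.ones(4), size=(n_states, len(joint_actions)))
R = rng.normal(size=(n_states, len(joint_actions)))

def dec_pomdp_step(s, joint_action):
    """One step (s_t, a_t) -> (s_{t+1}, o_{t+1}, r_t). The joint observation is
    emitted only after the next state has been sampled, i.e. o ~ Q(. | s_{t+1}, a_t)."""
    ja = joint_actions.index(joint_action)
    r = R[s, ja]                                 # team reward R(s_t, a_t)
    s_next = rng.choice(n_states, p=P[s, ja])    # s_{t+1} ~ P(. | s_t, a_t)
    o_joint = rng.choice(4, p=Q[s_next, ja])     # o_{t+1} ~ Q(. | s_{t+1}, a_t)
    o1, o2 = divmod(o_joint, 2)                  # split into per-agent observations
    return s_next, (o1, o2), r

s_next, (o1, o2), r = dec_pomdp_step(0, (1, 0))
```

The joint observation is factored back into per-agent observations at the end, since each agent only ever sees its own component.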
This definition is adapted from the one on Wikipedia[2]. When discussing Dec-POMDPs, these papers[3][4] are often referenced.
MDP & POMDP
Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) are degenerate cases of Dec-POMDPs:
- Dec-POMDP: $(I, S, \{A^i\}_{i\in I}, \mathcal{P}, R, \{O^i\}_{i\in I}, \mathcal{Q}, \gamma)$.
- POMDP: $(S, A, \mathcal{P}, R, O, \mathcal{Q}, \gamma)$, a single-agent version of Dec-POMDP.
- MDP: $(S, A, \mathcal{P}, R, \gamma)$, a fully observable version of POMDP.
One may check these slides[5] to understand the comparison between MDP, POMDP, and Dec-POMDP.
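As a small illustration of the relationship, the sketch below encodes a POMDP as a Python dataclass and builds the MDP special case by letting the agent observe the state directly; the container names and type aliases are hypothetical, not from any library.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence

State, Action, Obs = Any, Any, Any  # hypothetical type aliases for this sketch

@dataclass
class POMDP:
    S: Sequence[State]
    A: Sequence[Action]
    P: Callable[[State, Action], Dict[State, float]]  # P(. | s, a)
    R: Callable[[State, Action], float]
    O: Sequence[Obs]
    Q: Callable[[State, Action], Dict[Obs, float]]    # Q(. | s', a)
    gamma: float

def mdp_as_pomdp(S, A, P, R, gamma) -> POMDP:
    """An MDP is the fully observable special case: the observations are the
    states themselves and the emission function reveals the state with probability 1."""
    return POMDP(S=S, A=A, P=P, R=R, O=S, Q=lambda s, a: {s: 1.0}, gamma=gamma)
```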
Belief
So how does the agent know which state it is in, in POMDPs?
Since it is the state, rather than the observation, that determines the rewards and the state transitions (and thus the future rewards), the agent needs to estimate the current state $s_t$ from its past observations before choosing $a_t$.
The state is Markovian by assumption, meaning that maintaining a belief over the current state $s_t$ solely requires knowledge of
- the previous belief state $b_{t-1} = b(\cdot\mid o_{t-1}, a_{t-2})$,
- the last action taken $a_{t-1}$,
- the current observation $o_{t}$,
- and the environment’s model:
  - the sets $S$ and $O$,
  - the observation emission function $\mathcal{Q}$,
  - and the state transition function $\mathcal{P}$.
The belief $b$ is a posterior distribution over states. The conditional probability of the current state $s_t$ can be calculated recursively by Bayes’ rule as follows.
If the agent has access to the environment’s model, then given $a_{t-1}$, $o_{t}$, and $b(\cdot\mid o_{t-1}, a_{t-2})$ (or $b_{t-1}$ in short),
\[\begin{aligned} b\Big(s_{t}\mid o_{t}, a_{t-1}, b_{t-1} \Big) &= \frac{ P(o_{t}\mid s_t, a_{t-1}, b_{t-1})\cdot P(s_t\mid a_{t-1}, b_{t-1}) } { P(o_{t}\mid a_{t-1}, b_{t-1}) }\\ &= \frac{ P(o_{t}\mid s_t, a_{t-1})\cdot \sum\limits_{s_{t-1}} P(s_{t}\mid s_{t-1}, a_{t-1}) \cdot b_{t-1}(s_{t-1}) } { \sum\limits_{s_t} P(o_{t}\mid s_t, a_{t-1})\cdot \sum\limits_{s_{t-1}} P(s_{t}\mid s_{t-1}, a_{t-1}) \cdot b_{t-1}(s_{t-1}) }\\ &= \frac{ \mathcal{Q}(o_{t}\mid s_t, a_{t-1}) \sum\limits_{s_{t-1}} \mathcal{P}(s_{t}\mid s_{t-1}, a_{t-1}) \cdot b_{t-1}(s_{t-1}) } { \sum\limits_{s_{t}} \mathcal{Q}(o_{t}\mid s_t, a_{t-1}) \sum\limits_{s_{t-1}} \mathcal{P}(s_{t}\mid s_{t-1}, a_{t-1}) \cdot b_{t-1}(s_{t-1}) }. \end{aligned}\]
Note that
\[P(o_{t}\mid a_{t-1}, b_{t-1}) = \sum\limits_{s_{t}} \mathcal{Q}(o_{t}\mid s_t, a_{t-1}) \sum\limits_{s_{t-1}} \mathcal{P}(s_{t}\mid s_{t-1}, a_{t-1}) \cdot b_{t-1}(s_{t-1}),\]
which will be used again in the belief MDP section below.
This derivation is adapted from the one on Wikipedia[6]. I found the original notation a bit confusing, since the agent observes $o$ only after reaching $s’$; I suppose that $o$ ought to be written as $o_{t+1}$ or $o’$.
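As a sanity check on the formula, here is a small NumPy sketch of the update, assuming tabular $\mathcal{P}$ and $\mathcal{Q}$ stored as arrays; the array layout and function name are my own conventions.

```python
import numpy as np

def belief_update(b_prev, a_prev, o_t, P, Q):
    """Bayes update of the belief, following the derivation above.

    b_prev : shape (|S|,), the previous belief b_{t-1}
    P      : shape (|S|, |A|, |S|), P[s, a, s'] = P(s' | s, a)
    Q      : shape (|S|, |A|, |O|), Q[s', a, o] = Q(o | s', a)
    Returns (b_t, evidence), where evidence = P(o_t | a_{t-1}, b_{t-1}).
    """
    # Predict: P(s_t | a_{t-1}, b_{t-1}) = sum_{s_{t-1}} P(s_t | s_{t-1}, a_{t-1}) b_{t-1}(s_{t-1})
    predicted = b_prev @ P[:, a_prev, :]
    # Correct: weight by the observation likelihood Q(o_t | s_t, a_{t-1})
    unnormalized = Q[:, a_prev, o_t] * predicted
    evidence = unnormalized.sum()          # the normalizer P(o_t | a_{t-1}, b_{t-1})
    return unnormalized / evidence, evidence
```

Returning the normalizer as well is convenient, because it is exactly the $P(o_{t}\mid a_{t-1}, b_{t-1})$ term noted above, which reappears in the belief MDP section below.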
Theory of Mind in Dec-POMDPs
Fuchs et al. proposed nested beliefs for deep RL in Hanabi.[7] In this section, I will focus on the setting of their paper rather than their method, since the method is specifically designed to tackle the Hanabi problem.
Consider a two-player game. Each agent makes decisions based on its belief about the other’s policy, so the two agents’ policies are recursively dependent:
- $i$ makes decisions based on $j$’s policy.
- And $j$ acts the same way.
- If $i$ becomes aware of the second step, then $i$ will speculate about how $j$ is modeling $i$ and make decisions based on that.
- And so forth.
Formally, a belief at depth $k$ is given by $b_k^{ij\ldots}\in\Delta(S_k)$, such that $\vert ij\ldots \vert = k+1$, where the subscript of $S$ indicates the depth. The beliefs at even depths are self-reflective, and the ones at odd depths model the other agent’s modeling. E.g.,
- $b_0^{i}$ is $i$’s prior knowledge of states.
- $b_1^{ij}$ is $i$’s belief about $j$’s prior knowledge of states, i.e., $i$’s model of $b_0^{j}$. It is still a distribution over states.
- $b_2^{iji}$ is $i$’s belief about $b_1^{ji}$. It is still a distribution over states.
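Under my (possibly incomplete) reading of this setup, every level of the hierarchy is just another distribution over the same state set, so a minimal sketch of the data structure for agent $i$ could be a list indexed by depth; the variable names below are mine, not from the paper.

```python
import numpy as np

n_states = 5
uniform = np.ones(n_states) / n_states

# Depth-k nested beliefs held by agent i, one state distribution per level.
nested_beliefs_of_i = [
    uniform.copy(),  # b_0^{i}:   i's own belief over states
    uniform.copy(),  # b_1^{ij}:  i's model of j's belief over states
    uniform.copy(),  # b_2^{iji}: i's model of j's model of i's belief
]
```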
My questions
- I am not sure why the agent’s higher-level beliefs are still distributions over states rather than distributions over the lower-level beliefs.
- After we have the tool of belief, what can we do? How should agents make decisions based on their beliefs?
An example
Simplified Action Decoder (SAD). See my other note.
Belief MDP
Given a POMDP $(S, A, \mathcal{P}, R, O, \mathcal{Q}, \gamma)$, the corresponding belief MDP is $(B, A, \tau, \mathcal{R}, \gamma)$, where
- $B$ is the set of belief states, and each element of it is a distribution over the states of the POMDP,
- $A$ is the same as that of the POMDP,
- $\tau: B\times A\to\Delta(B)$ is the belief state transition function,
- $\mathcal{R}: B\times A \to \mathbb{R}$ is the reward function on belief states,
- $\gamma$ is the same as that of the POMDP.
More specifically,
\[\tau(b_{t+1} \mid b_t, a_t) = \sum\limits_{o_{t+1}} P(b_{t+1}\mid b_t, a_t, o_{t+1}) \cdot P(o_{t+1}\mid b_t, a_t),\]where $P(o_{t+1}\mid b_t, a_t)$ is defined in the previous section and
\[P(b_{t+1}\mid b_t, a_t, o_{t+1}) = \begin{cases} 1 & \text{if the belief update with } b_t, a_t, o_{t+1} \text{ returns } b_{t+1}, \\ 0 & \text{otherwise}. \end{cases}\]
And $\mathcal{R}(b, a) = \sum\limits_{s} b(s) \cdot R(s,a)$.
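To see how $\tau$ and $\mathcal{R}$ arise from the POMDP model, here is a sketch that enumerates the successor beliefs of a pair $(b, a)$, reusing the tabular array layout from the belief-update sketch above; the function name is my own.

```python
def belief_mdp_step(b, a, P, Q, R):
    """Reward R(b, a) and successor distribution tau(. | b, a) of the belief MDP.

    b is a length-|S| NumPy array; P and Q are the tabular arrays used in
    belief_update above, and R has shape (|S|, |A|). Returns the scalar reward
    and a list of (next_belief, probability) pairs, one per observation with
    nonzero probability.
    """
    reward = b @ R[:, a]                    # R(b, a) = sum_s b(s) R(s, a)
    predicted = b @ P[:, a, :]              # P(s_{t+1} | b_t, a_t)
    successors = []
    for o in range(Q.shape[-1]):
        unnormalized = Q[:, a, o] * predicted
        p_o = unnormalized.sum()            # P(o_{t+1} | b_t, a_t)
        if p_o > 0:
            successors.append((unnormalized / p_o, p_o))
    return reward, successors
```

Because the belief update is deterministic given $(b_t, a_t, o_{t+1})$, $\tau$ simply places probability $P(o_{t+1}\mid b_t, a_t)$ on each updated belief, matching the indicator in the definition above.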
Compared to the original POMDP, the corresponding belief MDP is no longer partially observable, and the agent makes decisions at each timestep based on the current belief state. Its policy is denoted as $\pi(b)$, and its goal is $\max\limits_{\pi} V_{\pi}(b_0)$, where $V_{\pi}(b_0) = \mathbb{E}_\pi\left[ \sum\limits_{t=0}^{\infty}\gamma^t\cdot \mathcal{R}(b_t,a_t)\mid b_0 \right].$[6]
References
1. Wikipedia: Theory of Mind.
2. Wikipedia: Decentralized partially observable Markov decision process.
3. Daniel S. Bernstein, Robert Givan, Neil Immerman, Shlomo Zilberstein. “The complexity of decentralized control of Markov decision processes.” Mathematics of Operations Research (2002).
4. Frans A. Oliehoek, Christopher Amato. “A Concise Introduction to Decentralized POMDPs.” Springer (2016).
5. Alina Vereshchaka’s slides about MDPs.
6. Wikipedia: Partially observable Markov decision process.
7. Andrew Fuchs, Michael Walton, Theresa Chadwick, Doug Lange. “Theory of mind for deep reinforcement learning in Hanabi.” NeurIPS Workshop (2019).