Theory of Mind and Markov Models
We do not see things as they are, we see them as we are. — Anaïs Nin.
What is Theory of Mind?
In psychology, theory of mind refers to the capacity to understand other people by ascribing mental states to them (that is, surmising what is happening in their mind). This includes the knowledge that others’ beliefs, desires, intentions, emotions, and thoughts may be different from one’s own.
Possessing a functional theory of mind is considered crucial for success in everyday human social interactions. People use such a theory when analyzing, judging, and inferring others’ behaviors. The discovery and development of theory of mind primarily came from studies done with animals and infants.
Empathy—the recognition and understanding of the states of mind of others, including their beliefs, desires, and particularly emotions—is a related concept. Empathy is often characterized as the ability to “put oneself into another’s shoes”. Recent neuro-ethological studies of animal behaviour suggest that even rodents may exhibit empathetic abilities. While empathy is known as emotional perspective-taking, theory of mind is defined as cognitive perspective-taking [1].
In my understanding, theory of mind refers to the ability of an individual to model others’ decision-making processes based on those others’ partial observations.
Basic Markov Models
Dec-POMDP
A decentralized partially observable Markov decision process (Dec-POMDP) is a tuple $\langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, T, R, \{\Omega_i\}, O, \gamma \rangle$, where
- $\mathcal{I} = \{1, \dots, n\}$ is a set of agents (and they are cooperative),
- $\mathcal{S}$ is a set of global states of the environment (agents cannot see the sampled state at any time, but they know the state set),
- $\mathcal{A}_i$ is a set of actions for agent $i$, with $\mathcal{A} = \times_i \mathcal{A}_i$ the set of joint actions,
- $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the state transition probability, where $\Delta(\mathcal{S})$ is the set of distributions over $\mathcal{S}$,
- $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function (not one $R_i$ per agent, since the agents are cooperative),
- $\Omega_i$ is a set of observations for agent $i$, with $\Omega = \times_i \Omega_i$ the set of joint observations,
- $O : \mathcal{S} \times \mathcal{A} \to \Delta(\Omega)$ is an observation emission function (sometimes $O : \mathcal{S} \to \Delta(\Omega)$),
- and $\gamma \in [0, 1]$ is the discount factor.
One step of the process is:
- each agent $i$ takes an action $a_i$ based on its belief of the current state, given its observation and previous belief (the term “belief” will be introduced later),
- the environment transitions to a new state $s'$ sampled according to $T(s' \mid s, \mathbf{a})$,
- each agent $i$ then receives a new observation $o_i$, with the joint observation sampled according to $O(\mathbf{o} \mid s', \mathbf{a})$,
- and a reward $R(s, \mathbf{a})$ is generated for the whole team based on the reward function $R$.

These timesteps repeat until some given horizon (called finite horizon) or forever (called infinite horizon). The discount factor $\gamma$ weights future rewards against immediate ones, and in the infinite-horizon case it must be strictly less than $1$ for the discounted return to remain finite.
What is different from what I previously thought is that the observations are sampled after agents make decisions at each timestep.
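To make the order of events within one timestep concrete, here is a minimal Python sketch of a tabular Dec-POMDP step. The array layout, the toy sizes, and the encoding of joint actions and joint observations as single indices are my own illustrative assumptions, not taken from the referenced papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: 3 states, 4 joint actions (2 agents x 2 actions), 4 joint observations.
n_states, n_joint_actions, n_joint_obs = 3, 4, 4

# Random placeholder model; a real problem would specify these tables.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_joint_actions))     # T[s, a] is a distribution over s'
R = rng.normal(size=(n_states, n_joint_actions))                           # R[s, a] is the shared team reward
O = rng.dirichlet(np.ones(n_joint_obs), size=(n_states, n_joint_actions))  # O[s', a] is a distribution over joint observations

def step(s, joint_a):
    """One Dec-POMDP timestep: reward, state transition, then observation emission."""
    r = R[s, joint_a]                                         # team reward for the joint action in state s
    s_next = rng.choice(n_states, p=T[s, joint_a])            # environment transitions to s'
    joint_o = rng.choice(n_joint_obs, p=O[s_next, joint_a])   # observations are sampled *after* the decision
    return s_next, joint_o, r

s = rng.choice(n_states)  # hidden from the agents; they only ever see observations and rewards
s, joint_o, r = step(s, joint_a=1)
print(joint_o, r)
```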
This definition is adapted from Wikipedia’s [2]. When discussing Dec-POMDPs, these papers [3][4] are often referenced.
MDP & POMDP
Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) are degenerate cases of Dec-POMDPs:
- Dec-POMDP: $\langle \mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, T, R, \{\Omega_i\}, O, \gamma \rangle$.
- POMDP: $\langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle$, a single-agent version of the Dec-POMDP.
- MDP: $\langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$, a fully observable version of the POMDP.
One may check these slides [5] to understand the comparison between MDPs, POMDPs, and Dec-POMDPs.
Belief
So how does the agent know which state it is in, in POMDPs?
Since it is the state that affects the payoffs and the state transitions (and thus the future payoffs), rather than the observation, the agent needs to estimate the current state $s$.
The state is Markovian by assumption, meaning that maintaining a belief over the current states requires knowledge of only
- the previous belief state $b$,
- the taken action $a$,
- the current observation $o$,
- and the environment’s model:
  - the sets $\mathcal{S}$ and $\Omega$,
  - the observation emission function $O$,
  - and the state transition function $T$.
The belief $b$ is a distribution over the states: $b(s)$ denotes the probability that the environment is in state $s$. If the agent has access to the environment’s model $\langle \mathcal{S}, \Omega, O, T \rangle$, then, given the previous belief $b$, the taken action $a$, and the current observation $o$, the updated belief is

$$b'(s') = \eta\, O(o \mid s', a) \sum_{s \in \mathcal{S}} T(s' \mid s, a)\, b(s),$$

where $\eta = 1 / \Pr(o \mid b, a)$ is a normalizing constant. Note that

$$\Pr(o \mid b, a) = \sum_{s' \in \mathcal{S}} O(o \mid s', a) \sum_{s \in \mathcal{S}} T(s' \mid s, a)\, b(s),$$

and we will meet it again later.
This definition is adapted from Wikipedia’s [6]. I found its original definition a bit confusing, for the agent observes $o$ after reaching $s'$; I suppose that $o$ ought to be denoted $o'$ or indexed as the next timestep’s observation.
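As a sanity check on the update above, here is a small Python sketch with tabular $T$ and $O$; the array layout (`T[s, a, s']`, `O[s', a, o]`) and the toy numbers are my own assumptions for illustration.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """b'(s') = O(o | s', a) * sum_s T(s' | s, a) b(s), normalized by Pr(o | b, a).

    b: (S,) current belief; T: (S, A, S) with T[s, a, s']; O: (S, A, Obs) with O[s', a, o].
    """
    predicted = b @ T[:, a, :]               # sum_s b(s) T(s' | s, a), shape (S,)
    unnormalized = O[:, a, o] * predicted    # weight by the probability of emitting o from each s'
    pr_o = unnormalized.sum()                # Pr(o | b, a), i.e. 1 / eta
    return unnormalized / pr_o, pr_o

# Tiny example: 2 states, 1 action, 2 observations.
T = np.array([[[0.9, 0.1]],
              [[0.2, 0.8]]])                 # T[s, a, s']
O = np.array([[[0.7, 0.3]],
              [[0.1, 0.9]]])                 # O[s', a, o]
b_new, pr_o = belief_update(np.array([0.5, 0.5]), a=0, o=0, T=T, O=O)
print(b_new, pr_o)
```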
Theory of Mind in Dec-POMDPs
Fuchs et al. proposed nested beliefs for deep RL in Hanabi [7]. In this section, I will focus on the setting of their paper rather than their method, since the method is specifically designed to tackle the Hanabi problem.
Consider a two-player game between agents $i$ and $j$. Each agent makes decisions based on its belief of the other’s policy. So the two agents’ policies are recursively dependent:
- $i$ makes decisions based on $j$’s policy.
- And $j$ acts the same way.
- If $i$ becomes aware of the second step, then $i$ will speculate how $j$ is guessing it and make decisions based on that.
- And so forth.
Formally, a belief at depth $m$ is defined as follows:
- $b^0_i$ is $i$’s prior knowledge of states.
- $b^1_{ij}$ is $i$’s belief about $j$’s prior knowledge that $b^0_j$ models. It is still a distribution over states.
- $b^2_{iji}$ is $i$’s belief about $b^1_{ji}$. It is still a distribution over states.
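One literal reading of these definitions is that every level of the nesting has the same type: a length-$|\mathcal{S}|$ probability vector. The sketch below only illustrates that data layout; the container and names are hypothetical and not Fuchs et al.’s implementation.

```python
import numpy as np

n_states = 3
uniform = np.full(n_states, 1.0 / n_states)

# Agent i's nested beliefs: every level is still a distribution over the same state set.
nested_beliefs_i = {
    "b0_i":   uniform.copy(),  # i's own prior over states
    "b1_ij":  uniform.copy(),  # what i thinks j's prior over states is
    "b2_iji": uniform.copy(),  # what i thinks j thinks i's prior over states is
}

for name, b in nested_beliefs_i.items():
    assert b.shape == (n_states,) and np.isclose(b.sum(), 1.0)
    print(name, b)
```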
My questions
- I am not sure why the agent’s higher-level beliefs are still distributions over states rather than distributions over the previous level’s beliefs.
- After we have the tool of belief, what can we do? How should agents make decisions based on their beliefs?
An example
Simplified Action Decoder (SAD). See my other note.
Belief MDP
Given a POMDP $\langle \mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma \rangle$, the corresponding belief MDP is a tuple $\langle \mathcal{B}, \mathcal{A}, \tau, r, \gamma \rangle$, where
- $\mathcal{B}$ is the set of belief states, and each element in it is a distribution over the states of the POMDP,
- $\mathcal{A}$ is the same as the one of the POMDP,
- $\tau$ is the belief state transition function,
- $r$ is the reward function on belief states,
- and $\gamma$ is the same as the one of the POMDP.

More specifically,

$$\tau(b' \mid b, a) = \sum_{o \in \Omega} \Pr(b' \mid b, a, o)\, \Pr(o \mid b, a),$$

where $\Pr(b' \mid b, a, o) = 1$ if the belief update with arguments $b, a, o$ returns $b'$ and $0$ otherwise, and $\Pr(o \mid b, a)$ is the normalizing term we met in the belief update above.

And

$$r(b, a) = \sum_{s \in \mathcal{S}} b(s)\, R(s, a).$$
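To make $\tau$ and $r$ concrete, here is a Python sketch that computes the belief reward and samples a successor belief state by first drawing $o \sim \Pr(\cdot \mid b, a)$ and then applying the deterministic belief update. The array layouts are the same assumptions as in the earlier belief-update sketch.

```python
import numpy as np

def belief_reward(b, a, R):
    """r(b, a) = sum_s b(s) R(s, a), with R of shape (S, A)."""
    return b @ R[:, a]

def sample_next_belief(b, a, T, O, rng):
    """Sample b' ~ tau(. | b, a): draw o with probability Pr(o | b, a), then update the belief."""
    predicted = b @ T[:, a, :]                    # Pr(s' | b, a), shape (S,)
    pr_o = O[:, a, :].T @ predicted               # Pr(o | b, a) for every o, shape (Obs,)
    o = rng.choice(len(pr_o), p=pr_o)
    b_next = O[:, a, o] * predicted / pr_o[o]     # the (deterministic) belief update for (b, a, o)
    return b_next, o

# Reuse the tiny 2-state, 1-action, 2-observation example from before.
rng = np.random.default_rng(0)
T = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])        # T[s, a, s']
O = np.array([[[0.7, 0.3]], [[0.1, 0.9]]])        # O[s', a, o]
R = np.array([[1.0], [0.0]])                      # R[s, a]
b = np.array([0.5, 0.5])
print(belief_reward(b, a=0, R=R))
print(sample_next_belief(b, a=0, T=T, O=O, rng=rng))
```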
Compared to the original POMDP, the corresponding belief MDP is not partially observable anymore, and the agent makes decisions at each timestep based on the current belief state. Its policy is thus a mapping from belief states to actions, denoted as $\pi(b)$.
References
1. Wikipedia: Theory of Mind.
2. Wikipedia: Decentralized partially observable Markov decision process.
3. Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. “The complexity of decentralized control of Markov decision processes.” Mathematics of Operations Research (2002).
4. Frans A. Oliehoek and Christopher Amato. “A concise introduction to decentralized POMDPs.” Springer (2016).
5. Alina Vereshchaka’s slides about MDPs.
6. Wikipedia: Partially observable Markov decision process.
7. Andrew Fuchs, Michael Walton, Theresa Chadwick, and Doug Lange. “Theory of mind for deep reinforcement learning in Hanabi.” NeurIPS Workshop (2019).