
MARL Basics

This note is not finished yet; you may check my writing schedule.

Markov Models

  1. MDP
    • Markov decision process
    • $(S, A, \mathcal{P}, R, \gamma)$
    • Single-agent, fully observable, and dynamic.
  2. POMDP
    • Partially observable Markov decision process
    • $(S, A, \mathcal{P}, R, O, \mathcal{Q}, \gamma)$
    • Single-agent, partially observable, and dynamic.
  3. Dec-POMDP
    • Decentralized partially observable Markov decision process
    • $(I, S,\set{A^i}_{i\in I}, \mathcal{P}, R, \set{O^i}_{i\in I}, \mathcal{Q}, \gamma)$
    • Multi-agent, partially observable, cooperative, and dynamic.
  4. Stochastic games (or Markov games)
    • $(I, S,\set{A^i}_{i\in I}, \mathcal{P}, \set{R^i}_{i\in I}, \gamma)$
    • Multi-agent, fully observable, C/A/M (meaning that it can be cooperative, adversarial, or mixed-motive), and dynamic.
      % Markov Game
      @incollection{MarkovGame1994,
      title={Markov games as a framework for multi-agent reinforcement learning},
      author={Littman, Michael L},
      booktitle={Machine learning proceedings 1994},
      pages={157--163},
      year={1994},
      publisher={Elsevier}
      }
      % Stochastic Game (= Markov Game)
      @book{StochasticGame2013,
      title={Game theory},
      author={Owen, Guillermo},
      year={2013},
      publisher={Emerald Group Publishing}
      }
      

where

  • $I$ is a set of agents (and $\vert I\vert = n$, where $n\ge 2$),
  • $S$ is the state space,
  • $A^i$ is the action space of agent $i$, and $A = \prod_{i\in I} A^i$ is the joint action space,
  • $\mathcal{P}: S\times A \to \Delta(S)$ is the state transition function (where $\Delta(S)$ denotes the set of probability distributions over $S$),
  • $R^i: S\times A \to \mathbb{R}$ is the reward function of agent $i$,
  • $O^i$ is the observation space of agent $i$, $\mathcal{Q}$ is the observation function, and $\gamma\in[0,1)$ is the discount factor.
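
To make the stochastic-game tuple concrete, here is a minimal sketch of a container for a finite Markov game; the class and field names are my own, not from any particular library:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class StochasticGame:
    """A finite stochastic (Markov) game (I, S, {A^i}, P, {R^i}, gamma) in tabular form."""

    n_agents: int            # |I|
    n_states: int            # |S|
    n_actions: tuple         # (|A^1|, ..., |A^n|)
    transitions: np.ndarray  # transitions[s, a_1, ..., a_n, s'] = probability of moving to s'
    rewards: np.ndarray      # rewards[i, s, a_1, ..., a_n] = reward of agent i
    gamma: float = 0.95      # discount factor

    def step(self, state, joint_action, rng=np.random):
        """Sample the next state and return the vector of per-agent rewards."""
        probs = self.transitions[(state, *joint_action)]
        next_state = rng.choice(self.n_states, p=probs)
        r = self.rewards[(slice(None), state, *joint_action)]
        return next_state, r
```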
  1. Normal-form game (when played repeatedly, a repeated game)
    • $(I,\set{A^i}_{i\in I}, \set{R^i}_{i\in I})$
    • Multi-agent, C/A/M, and static.
  2. Extensive-form game
    • $(I, X, Z, f,\set{A^i}_{i\in I}, \set{R^i}_{i\in I})$
    • Multi-agent, C/A/M, and dynamic.
  3. Bayesian game

Check this note, A Memo on Game Theory.

(TODO) Some potentially confusing concepts:

  • Repeated game
  • Matrix game
  • Static game
  • Dynamic game
  • Normal-form game
  • Extensive-form game
  • Bayesian game

Learning Goals

The learning goals of MARL are mainly stability and adaptation.

Stability

Stability is a property of the learning process of the algorithm under consideration: it means convergence to stationary policies, or convergence with respect to weaker measurable criteria.

Here are some common criteria (in repeated games, where $t$ indexes the rounds the agents have played), ordered from weakest to strongest (Bowling 2004); a small computational sketch follows the list:

  1. Average reward $\sum\limits_{t\le T} (r_t\cdot \pi_t)/T$;
  2. Empirical distribution of actions $\sum\limits_{t\le T} (\pi_t)/T$;
  3. Expected reward $r_t\cdot \pi_t$;
  4. Policies $\pi_t$.
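
As a small computational illustration (my own sketch, not from Bowling 2004), the following computes the first three quantities from a history of reward vectors and mixed policies; an algorithm meets a criterion if the corresponding quantity converges as $T$ grows:

```python
import numpy as np


def stability_criteria(rewards, policies):
    """Compute the quantities whose convergence defines the criteria above.

    rewards  : (T, k) array, rewards[t, a] is the reward of action a at round t.
    policies : (T, k) array, policies[t] is the mixed policy pi_t at round t.
    """
    rewards = np.asarray(rewards, dtype=float)
    policies = np.asarray(policies, dtype=float)
    T = len(rewards)

    expected_reward = np.einsum("ta,ta->t", rewards, policies)  # r_t . pi_t for each t
    average_reward = expected_reward.sum() / T                  # sum_t (r_t . pi_t) / T
    empirical_distribution = policies.sum(axis=0) / T           # sum_t pi_t / T
    return average_reward, empirical_distribution, expected_reward


# Example: two actions, three rounds of play.
rewards = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
policies = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
print(stability_criteria(rewards, policies))
```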

The analysis of equilibria should take the dynamics of the environment into account.

I have a question: do the resulting policies form an equilibrium when convergence occurs? What if learning converges to a suboptimal point, or fails to converge due to insufficient exploration?

Adaptation

This concept includes several aspects:

  1. The performance of agents should be maintained or improved.
  2. Agents should adapt to changes in the other agents’ policies, since the others are learning too.
  3. Agents should avoid being exploited by the others.
  4. When agents have several optimal joint actions in coordination settings (which are common), they should still be able to coordinate to some extent.

To check whether a designed algorithm is adaptive, there are two criteria:

  1. Rationality: whether the agent converges to a best response when the other agents’ policies remain stationary.
    • However, even in this kind of equilibrium, rationality alone cannot prevent the learner from being exploited by the other agents.
  2. No-regret: an algorithm has no-regret $\Leftrightarrow$ its average regret is at most $0$ against all of the other agents’ policies (see the sketch after this list).
    • No-regret can prevent the learner from being exploited by the other agents.
    • Regret after $T$ rounds: $\mathscr{R}_T = \max\limits_{a\in A}\sum\limits_{t\le T}\left(r_t(a) - r_t\cdot \pi_t\right)$, where $r_t(a)$ is the reward action $a$ would have received at round $t$; the average regret is $\mathscr{R}_T/T$.
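
Here is a minimal sketch (my own illustration) of the average regret of a sequence of mixed policies against the best fixed action in hindsight:

```python
import numpy as np


def average_regret(rewards, policies):
    """Average regret of the played policies against the best fixed action in hindsight.

    rewards  : (T, k) array, rewards[t, a] is the reward action a would have received at round t.
    policies : (T, k) array, policies[t] is the mixed policy pi_t played at round t.
    """
    rewards = np.asarray(rewards, dtype=float)
    policies = np.asarray(policies, dtype=float)
    T = len(rewards)

    earned = np.einsum("ta,ta->t", rewards, policies).sum()  # sum_t r_t . pi_t
    best_fixed = rewards.sum(axis=0).max()                   # max_a sum_t r_t(a)
    return (best_fixed - earned) / T


# An algorithm is no-regret if this quantity stays at or below 0 as T grows,
# no matter how the other agents (who determine the reward vectors) behave.
```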

Scenarios (or Tasks)

There are three types of scenarios: fully cooperative, fully competitive, and mixed. They are classified according to the reward functions assigned to the agents: a scenario is

  1. fully cooperative if all the reward functions are the same function (i.e. $r^i=r^j\ \forall i, j\in I$),
    • (Without a centralized controller, a coordination problem will arise. Discussed below.)
  2. fully competitive if $r^i = -r^j$ (two-agent scenarios are mostly discussed in this case),
    • (Games in this case are called zero-sum.)
  3. and mixed otherwise.

Some additional points:

  1. If there is no constraint on the reward functions, the game is called a general-sum game; fully cooperative games and zero-sum games are its special cases.
  2. The second and third cases are called noncooperative games.
  3. In coordination games, the agents’ rewards are positively correlated. (A small classification sketch follows these points.)
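
As a concrete illustration of this classification, here is a minimal sketch (my own, for finite normal-form reward tables) that decides which case a game falls into:

```python
import numpy as np


def classify_rewards(reward_tables):
    """Classify a finite game by the structure of its reward tables.

    reward_tables : list of n arrays of identical shape (one entry per joint action);
                    reward_tables[i] gives agent i's reward for each joint action.
    """
    tables = [np.asarray(r, dtype=float) for r in reward_tables]

    if all(np.allclose(tables[0], r) for r in tables[1:]):
        return "fully cooperative"             # r^i = r^j for all i, j
    if len(tables) == 2 and np.allclose(tables[0], -tables[1]):
        return "fully competitive (zero-sum)"  # r^1 = -r^2
    return "mixed (general-sum)"


# Example: matching pennies is zero-sum; a pure coordination game is fully cooperative.
pennies = [np.array([[1, -1], [-1, 1]]), np.array([[-1, 1], [1, -1]])]
coord = [np.array([[2, 0], [0, 1]]), np.array([[2, 0], [0, 1]])]
print(classify_rewards(pennies))  # fully competitive (zero-sum)
print(classify_rewards(coord))    # fully cooperative
```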

Algorithms

MARL algorithms are designed for different kinds of tasks. Some of them rely on task-specific assumptions, which makes them applicable only to those particular tasks.

For fully cooperative tasks

In this case, if a centralized controller is available, the task reduces to an MDP over joint actions. Otherwise the agents take actions independently, and a coordination problem arises: when there are multiple optimal joint actions (equilibria), the agents must break ties in the same way.

The algorithms designed for fully cooperative scenarios without a centralized controller (i.e. for coordination problems) can be classified into three categories based on how they address coordination:

  1. Coordination-free methods: teammate-independent
  2. Direct coordination methods:
    1. teammate-independent if they rely on common knowledge,
    2. and teammate-aware if they use negotiation.
  3. Indirect coordination methods: teammate-aware

(TODO)

  1. Team-Q: Q-learning applied to the joint action space, assuming that the optimal joint action is unique (which is rarely the case); a simplified sketch follows this list.
  2. Distributed-Q: it works only in deterministic problems; each agent observes the global state (but not the others’ actions) and keeps an optimistic local Q-function that is updated only when the estimate increases.
  3. OAL (optimal adaptive learning)
  4. JAL (joint action learners)
  5. FMQ (frequency maximum Q-value)
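
As a rough illustration of the coordination-free idea, here is a simplified sketch of Q-learning over joint actions with deterministic tie-breaking, in the spirit of Team-Q (my own simplification, not the original algorithm):

```python
import itertools

import numpy as np


class JointActionQ:
    """Tabular Q-learning over joint actions for a fully cooperative (shared-reward) task."""

    def __init__(self, n_states, action_sizes, alpha=0.1, gamma=0.95):
        # Enumerate all joint actions, e.g. action_sizes = (2, 3) gives 6 joint actions.
        self.joint_actions = list(itertools.product(*[range(k) for k in action_sizes]))
        self.q = np.zeros((n_states, len(self.joint_actions)))
        self.alpha, self.gamma = alpha, gamma

    def update(self, s, joint_a, r, s_next):
        """Standard Q-learning update, where joint_a is a tuple of the agents' actions."""
        j = self.joint_actions.index(tuple(joint_a))
        target = r + self.gamma * self.q[s_next].max()
        self.q[s, j] += self.alpha * (target - self.q[s, j])

    def greedy_joint_action(self, s):
        # Deterministic tie-breaking (lowest index): every agent that runs this rule on the
        # same Q-table selects the same joint action, which is how Team-Q sidesteps
        # coordination -- but only if the optimum is unique or ties are broken identically.
        return self.joint_actions[int(np.argmax(self.q[s]))]
```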

For fully competitive tasks

  1. minimax-Q: Q-learning for two-agent zero-sum stochastic games, where the value of each state is the minimax value of the matrix game defined by the current Q-values (Littman 1994); see the sketch below.
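
The core step of minimax-Q is solving, at every state, the matrix game defined by the current Q-values. Here is a minimal sketch of that inner step using SciPy's linear-programming routine (the helper name and interface are my own):

```python
import numpy as np
from scipy.optimize import linprog


def minimax_value(Q):
    """Minimax value and maximin mixed policy of a matrix game.

    Q[a, o] is the payoff to the maximizing agent when it plays a and the opponent plays o.
    In minimax-Q this is solved at every state, with Q given by the current Q-values.
    """
    n_a, n_o = Q.shape
    # Variables: [pi_1, ..., pi_{n_a}, v]; maximize v  <=>  minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o:  v - sum_a pi_a * Q[a, o] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # The probabilities must sum to one.
    A_eq = np.concatenate([np.ones(n_a), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    pi, v = res.x[:-1], res.x[-1]
    return v, pi


# Matching pennies: value 0, uniform maximin policy.
print(minimax_value(np.array([[1.0, -1.0], [-1.0, 1.0]])))
```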

For mixed-motive tasks

Static

  1. Fictitious play
  2. MetaStrategy
  3. IGA
  4. WoLF-IGA
  5. GIGA
  6. GIGA-WoLF
  7. AWESOME
  8. Hyper-Q
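
To give a flavor of the gradient-based entries above, here is a simplified sketch of the IGA update for a two-action game together with the WoLF (“win or learn fast”) variable learning rate; it assumes the equilibrium strategy `alpha_eq` is known, as WoLF-IGA does (names and simplifications are mine):

```python
import numpy as np


def iga_gradient(payoff, alpha, beta):
    """d/d(alpha) of player 1's expected payoff in a 2x2 game.

    payoff[a, b] is player 1's reward when player 1 plays a and player 2 plays b;
    alpha (resp. beta) is the probability that player 1 (resp. 2) plays action 0.
    """
    u = payoff[0, 0] - payoff[0, 1] - payoff[1, 0] + payoff[1, 1]
    return beta * u + (payoff[0, 1] - payoff[1, 1])


def expected_payoff(payoff, alpha, beta):
    """Player 1's expected payoff under the mixed strategies (alpha, beta)."""
    p1 = np.array([alpha, 1 - alpha])
    p2 = np.array([beta, 1 - beta])
    return p1 @ payoff @ p2


def wolf_iga_step(payoff, alpha, beta, alpha_eq, eta_win=0.01, eta_lose=0.04):
    """One WoLF-IGA step for player 1: learn slowly when 'winning', fast when 'losing'.

    'Winning' means the current strategy earns more against the opponent than the
    (assumed known) equilibrium strategy alpha_eq would.
    """
    winning = expected_payoff(payoff, alpha, beta) > expected_payoff(payoff, alpha_eq, beta)
    eta = eta_win if winning else eta_lose
    return float(np.clip(alpha + eta * iga_gradient(payoff, alpha, beta), 0.0, 1.0))
```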

Dynamic

  1. Single-agent RL
  2. Nash-Q
  3. CE-Q
  4. Asymmetric-Q
  5. NSCP
  6. EIRA
  7. WoLF-PHC
  8. PD-WoLF
  9. EXORL
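
As one example from this list, here is a simplified sketch of WoLF-PHC: ordinary Q-learning plus a policy hill-climbing step whose step size depends on whether the agent is currently “winning” (its policy does better than its own average policy under the current Q-values). The simplex projection at the end is a simplification of the original constrained update:

```python
import numpy as np


class WoLFPHC:
    """Simplified WoLF-PHC agent: Q-learning plus win-or-learn-fast policy hill climbing."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95,
                 delta_win=0.01, delta_lose=0.04):
        self.q = np.zeros((n_states, n_actions))
        self.pi = np.full((n_states, n_actions), 1.0 / n_actions)      # current policy
        self.pi_avg = np.full((n_states, n_actions), 1.0 / n_actions)  # average policy
        self.counts = np.zeros(n_states)
        self.alpha, self.gamma = alpha, gamma
        self.delta_win, self.delta_lose = delta_win, delta_lose

    def act(self, s, rng=np.random):
        """Sample an action from the current mixed policy at state s."""
        return rng.choice(len(self.pi[s]), p=self.pi[s])

    def update(self, s, a, r, s_next):
        # 1. Ordinary Q-learning update.
        target = r + self.gamma * self.q[s_next].max()
        self.q[s, a] += self.alpha * (target - self.q[s, a])

        # 2. Update the average policy.
        self.counts[s] += 1
        self.pi_avg[s] += (self.pi[s] - self.pi_avg[s]) / self.counts[s]

        # 3. Win or learn fast: small step when winning, large step when losing.
        winning = self.pi[s] @ self.q[s] > self.pi_avg[s] @ self.q[s]
        delta = self.delta_win if winning else self.delta_lose

        # 4. Hill-climb toward the greedy action (simplified projection to the simplex).
        greedy = int(np.argmax(self.q[s]))
        n = len(self.pi[s])
        self.pi[s] -= delta / (n - 1)
        self.pi[s, greedy] += delta + delta / (n - 1)
        self.pi[s] = np.clip(self.pi[s], 0.0, None)
        self.pi[s] /= self.pi[s].sum()
```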

Reading List

  • Convergence and No-Regret in Multiagent Learning.
    Michael Bowling.
    Advances in Neural Information Processing Systems (NIPS) 2004.
    • GIGA-WoLF (Generalized Infinitesimal Gradient Ascent, Win or Learn Fast)
  • Multi-Agent Reinforcement Learning: A survey.
    Lucian Busoniu, Robert Babuska, Bart De Schutter.
    International Conference on Control, Automation, Robotics and Vision (ICARCV) 2006.
    • This note is basically built on the framework of this paper.
  • Playing is Believing: The Role of Beliefs in Multi-Agent Learning.
    Yu-Han Chang, Leslie Pack Kaelbling.
    Advances in Neural Information Processing Systems (NIPS) 2001.
  • Nash Q-Learning for General-Sum Stochastic Games.
    Junling Hu, Michael P. Wellman.
    Journal of Machine Learning Research (JMLR) 2003.
  • Convergence of Q-learning: A Simple Proof.
    Francisco S. Melo.
  • Online Convex Programming and Generalized Infinitesimal Gradient Ascent.
    Martin Zinkevich.
    International Conference on Machine Learning (ICML) 2003.
  • Value-Function Reinforcement Learning in Markov Games.
    Michael L. Littman.
    Cognitive Systems Research 2001.
  • Multi-Agent Reinforcement Learning: A Critical Survey.
    Yoav Shoham, Rob Powers, Trond Grenager.
    Technical report, Stanford University 2003.
  • An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective.
    Yaodong Yang, Jun Wang.
    arXiv 2020.
  • A Survey and Critique of Multiagent Deep Reinforcement Learning.
    Pablo Hernandez-Leal, Bilal Kartal, Matthew E. Taylor.
    Autonomous Agents and Multi-Agent Systems (AAMAS) 2019.
  • Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms.
    Kaiqing Zhang, Zhuoran Yang, Tamer Basar.
    Handbook of Reinforcement Learning and Control, 2021.

Disclaimer: The above content partially uses materials from the cited papers. Corresponding links or references have been provided.

This post is licensed under CC BY 4.0 by the author.