Description
Reinforcement learning (RL) is becoming increasingly important in machine learning (ML). One subfield of RL is multi-agent RL (MARL), in which several agents learn to solve a problem simultaneously instead of a single agent. This makes the approach suitable for many real-world problems.
Since learning in a multi-agent scenario is highly complex, additional challenges arise on top of the difficulties of single-agent RL, for example scalability problems, non-stationarity, or ambiguous learning goals.
To explore the difficulties of MARL, we have implemented an environment for a wireless network communication problem: the assignment of frequency resources to guarantee reliable communication. Different devices in a given area should learn on their own which frequency they can use without disturbing the communication of other devices. Since overlapping frequencies can cause communication failures, it is important that each device selects its own frequency.
To accomplish this task, all devices that need to communicate become agents. The given area in which their communication takes place, the so-called communication cell, is the environment of the MARL problem. At each time step, every device chooses a communication channel (a frequency band) to use for its communication.
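The following is a minimal sketch of how such an environment step could look. The class name WirelessCellEnv, the neighbour mapping, and the simplified QoS model (1 if no neighbour shares the channel, 0 otherwise) are illustrative assumptions, not the exact implementation presented in the talk.

    import numpy as np

    class WirelessCellEnv:
        """Toy communication cell: every agent picks one channel per time step."""

        def __init__(self, n_agents, n_channels, neighbours):
            self.n_agents = n_agents
            self.n_channels = n_channels
            self.neighbours = neighbours          # dict: agent index -> list of neighbour indices
            self.last_channels = np.zeros(n_agents, dtype=int)

        def qos(self, agent, channels):
            # Simplified QoS: 1.0 if no neighbour uses the same channel, 0.0 otherwise
            clash = any(channels[n] == channels[agent] for n in self.neighbours[agent])
            return 0.0 if clash else 1.0

        def step(self, channels):
            # channels: one chosen channel per agent for this time step
            qos_per_agent = np.array([self.qos(a, channels) for a in range(self.n_agents)])
            self.last_channels = np.asarray(channels)
            return qos_per_agent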
To ensure that the problem can be solved reliably, each agent receives the following information in its state: the communication channel it used in the previous step, its own Quality of Service (QoS) achieved by the last action, a vector of all neighbouring devices, and the communication channels those neighbours used in their last action.
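Assuming the WirelessCellEnv sketch above, an agent's state could be assembled from these four components roughly as follows; the exact encoding may differ.

    def build_state(env, agent, qos_per_agent):
        # (own previous channel, own QoS, neighbour list, neighbours' previous channels)
        own_channel = int(env.last_channels[agent])
        own_qos = float(qos_per_agent[agent])
        neighbours = tuple(env.neighbours[agent])
        neighbour_channels = tuple(int(env.last_channels[n]) for n in neighbours)
        return (own_channel, own_qos, neighbours, neighbour_channels)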
After all agents have selected a communication channel, they receive their next state and a reward. We choose the reward to be the sum of the QoS achieved by all agents, since a shared reward avoids adversarial behavior and encourages cooperation between the agents.
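With the per-agent QoS values from the environment, the shared reward is then simply their sum, handed identically to every agent:

    def shared_reward(qos_per_agent):
        # Every agent receives the same reward: the summed QoS of all agents
        return float(qos_per_agent.sum())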
We train this MARL task with a Q-Learning algorithm, and additionally with a NashQ algorithm, which adapts Q-Learning to multi-agent learning using concepts from game theory. The results show that the agents learn to communicate reliably. However, the number of agents influences the training, since the number of possible state combinations grows exponentially with the number of agents. Comparing the two algorithms, NashQ needs slightly fewer episodes than Q-Learning to converge to an optimal policy.
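As a rough illustration, the tabular Q-Learning update each agent could run looks like the sketch below; the learning rate, discount factor, and epsilon are illustrative values. NashQ differs in the target: instead of the maximum over the agent's own actions, it uses the value of a Nash equilibrium of the stage game over joint actions.

    import random
    from collections import defaultdict
    import numpy as np

    def make_q_table(n_channels):
        # One vector of Q-values per (hashable) state, one entry per channel
        return defaultdict(lambda: np.zeros(n_channels))

    def epsilon_greedy(q_table, state, n_channels, epsilon=0.1):
        # Explore with probability epsilon, otherwise pick the greedy channel
        if random.random() < epsilon:
            return random.randrange(n_channels)
        return int(np.argmax(q_table[state]))

    def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.95):
        # Standard Q-Learning target: r + gamma * max_a' Q(s', a')
        target = reward + gamma * np.max(q_table[next_state])
        q_table[state][action] += alpha * (target - q_table[state][action])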