## 1 Introduction

Imitation learning avoids excessive interaction with an environment by learning from a set of expert demonstrations during training. A common problem with imitation learning is distributional shift (Daumé et al., 2009; Ross and Bagnell, 2010): when an agent makes a mistake, it may encounter observations that differ from those seen in the expert demonstrations. As a result, imitating the expert can become misaligned with the true objective of imitation learning.

de Haan et al. (2019) studied the relation between causal misidentification and distributional shift. They argued that for an imitation learning algorithm to be maximally robust to distributional shift, a policy must rely solely on the true causes of expert actions. This highlights the need for disentangled representations of expert trajectories that account for factors of variation. These variations can be external, such as when one is learning from a mixture of expert demonstrations. When learning in new environments not demonstrated in the expert trajectories, one has to take internal factors of variation into account as well. For example, observations in a game environment can vary across levels. As in reinforcement learning (RL), the agent must be able to transfer its experience to new environments. These two types of variation exacerbate the distributional shift problem.

Two successful frameworks for tackling these problems of imitation learning are inspired by game-theoretic concepts, namely no-regret learning and minimax games. We shed some light on the landscape of existing solution concepts in game theory that are specifically applicable to imitation learning. We then provide a hybrid as well as an extension of no-regret learning and minimax training for imitation learning. To generalize imitation learning to these new game-theoretic concepts, one has to equip the algorithms with queues of discriminators and agents, in contrast with the classical approach, where there is a single discriminator and a single agent.

As the main contribution of this paper, we then argue that a specific family of correlated equilibria, namely the maximum entropy correlated equilibrium, is more suitable for imitation learning. This type of equilibrium is stricter than its no-regret counterparts, and under it players cannot predict the actual outcomes. Such a correlated equilibrium is achieved through a mediating neural architecture, which uses auxiliary codes to augment the observations seen by the queues of discriminators and agents. At every training step, the ‘mediator’ network computes feedback from the rewards of the discriminators and agents and augments the observations accordingly. By intervening in the game in this way, it steers the training dynamics towards more suitable regions.

We argue that this type of framework has three benefits for imitation learning. First, it makes adapting and transferring the learned model to new environments straightforward and leads to more efficient learning. Second, it is suitable for learning from a mixture of expert trajectories. Third, it needs minimal hyper-parameter adjustment: unlike previous GAN-like implementations, it is free of the classic difficulties of non-convex optimization faced by the discriminator. We conduct several experiments to confirm these statements.

Related adversarial training

Among the recent developments in imitation learning algorithms is the application of generative adversarial networks (GAN) (Goodfellow et al., 2014; Schmidhuber, 1992) to imitation learning, known as GAIL (Ho and Ermon, 2016). Building on a minimax framework, the discriminator network of GAIL learns the reward function implicitly by distinguishing between the demonstrated state-action trajectories and those generated by the agent.

Disentangled representations of the trajectories are also important when learning from a mixture of experts. Building on InfoGAN (Chen et al., 2016), InfoGAIL (Li et al., 2017) maximizes a lower bound on the mutual information between the latent code and the corresponding demonstrations. Instead of assigning latent codes to trajectories, Lee et al. (2018) handle the multimodality of trajectories by modeling the agent policy as a sparse mixture density network.

Related no-regret frameworks

No-regret learning is not new to imitation learning. It is, in fact, one of the main frameworks for addressing distributional shift problems (Ross et al., 2010). Another key feature of regret minimization frameworks is the ability of the learner to adapt to evolving environments.

All of the GAIL extensions and most of the GAN works are concerned with the notion of Pure Nash Equilibrium (PNE). GANs were recently studied in regret minimization frameworks (Schuurmans and Zinkevich, 2016; Grnarova et al., 2017; Hazan et al., 2017; Kodali et al., 2018). However, one major difficulty is that standard regret minimization fails in the non-convex setting of GANs unless the discriminator is a shallow network. One recent development is non-convex FTPL (Gonen and Hazan, 2018), which invokes an offline oracle and introduces exponentially distributed random noise into the loss function.

Despite the progress made in developing no-regret GANs, GAIL-type architectures have not yet been studied in no-regret frameworks. Although we provide an implementation of no-regret GAIL, we note that the main contribution of our work is the modification of such a no-regret learning framework with a mediator network, to arrive at the correlated equilibrium (Aumann, 1987).

## 2 Problem Formulation

### 2.1 Preliminaries

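Let us use the tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$ to describe an infinite-horizon, discounted Markov decision process (MDP) with state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability distribution $P(s' \mid s, a)$, reward function $r(s, a)$, distribution of the initial state $\rho_0$, and discount factor $\gamma \in (0, 1)$.

### 2.2 Adversarial Imitation

In the case of imitation learning, we are given access to a set of expert trajectories generated by an expert policy $\pi_E$. We are interested in estimating a stochastic policy $\pi$. To estimate $\pi$, GAIL optimizes the following:

$$\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}\left[\log D(s, a)\right] + \mathbb{E}_{\pi_E}\left[\log\left(1 - D(s, a)\right)\right] - \lambda H(\pi) \tag{1}$$

with the expected terms defined as $\mathbb{E}_{\pi}[f(s, a)] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} f(s_t, a_t)\right]$, where $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and $H(\pi) = \mathbb{E}_{\pi}[-\log \pi(a \mid s)]$ is the $\gamma$-discounted causal entropy.

In game-theoretic terms, GAIL introduces an auxiliary adversary player $D$, the discriminator, who tries to distinguish state-action pairs generated during roll-outs of $\pi$ from the demonstrated trajectories generated by $\pi_E$, in a minimax standoff with an agent with policy $\pi$.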
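To make the two-player objective concrete, the following is a minimal sketch of a sample-based estimate of the GAIL saddle-point value. It assumes the common convention that the discriminator outputs values close to 1 on agent samples, and it omits discounting for brevity; the function name `gail_objective` and its arguments are hypothetical and do not come from the paper.

```python
import math

def gail_objective(d_agent, d_expert, policy_log_probs, entropy_weight=0.0):
    """Sample estimate of E_pi[log D] + E_expert[log(1 - D)] - weight * H(pi).

    d_agent:          discriminator outputs in (0, 1) on agent state-action pairs
    d_expert:         discriminator outputs in (0, 1) on expert state-action pairs
    policy_log_probs: log pi(a|s) on the agent samples (used for the entropy term)
    Assumption: undiscounted averages stand in for the discounted expectations.
    """
    agent_term = sum(math.log(d) for d in d_agent) / len(d_agent)
    expert_term = sum(math.log(1.0 - d) for d in d_expert) / len(d_expert)
    entropy = -sum(policy_log_probs) / len(policy_log_probs)
    return agent_term + expert_term - entropy_weight * entropy

# A confused discriminator (0.5 everywhere) versus a sharp one.
confused = gail_objective([0.5, 0.5], [0.5, 0.5], [-1.0, -1.0])
sharp = gail_objective([0.9, 0.9], [0.1, 0.1], [-1.0, -1.0])
```

The discriminator tries to increase this value (the sharp case scores higher), while the agent tries to push it back down by producing samples the discriminator cannot separate from the expert's.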

### 2.3 Relation between balances of players, causal misidentification and distributional shift

GAIL inherits the issues of GAN training, including non-convergence, mode collapse, and diminished gradients, as well as the criticality of the balance between the players. The latter issue is even more important for imitation learning. Unlike in GANs, the environment is treated as a black box, so the optimization objective is not differentiable end-to-end and requires Monte-Carlo estimation of policy gradients. This favors the discriminator and causes GAIL’s discriminator to overfit by exploiting task-irrelevant/non-causal features. As recently stated in (de Haan et al., 2019), for an imitation learning algorithm to be maximally robust to distributional shift, a policy must rely solely on the true causes of expert actions. Therefore, a fruitful path for improving the efficiency of GAIL-type architectures is to adhere to two main principles: 1) removing causal misidentification, and 2) creating a balance between the players. To set up a proper formulation for addressing these, we first need to introduce some extra game-theoretic formulations.

### 2.4 Mixed-GAIL

The standard formulation of GAN is concerned with Pure Nash Equilibrium (PNE). If one relaxes the assumptions on the equilibrium, one arrives at various other game-theoretic solution concepts, as shown in Fig. 1. PNE is the strictest of these and has the highest computational complexity. If the selection of neural network parameters is viewed as a deterministic strategy, PNE seems to be the natural first choice. It is not clear, however, to what extent the strictness of PNE is essential for efficient learning and for the expressivity of the adversarial training setup.

Consider the discriminator $D_w$ and the agent policy $\pi_\theta$ to be parameterized by $w$ and $\theta$, respectively. To formalize the game-theoretic notation for the two-player minimax game, let the action set be $A = A_D \times A_\pi$, with $w \in A_D$ and $\theta \in A_\pi$ the pure actions of the discriminator and the agent, respectively. In a deep learning game, each pure action is a parameter setting of a given neural architecture. Let $\Delta(\cdot)$ denote the probability simplex. If we assume that the sets of parameters are finite, we can define individual mixed strategies $\sigma_D \in \Delta(A_D)$ and $\sigma_\pi \in \Delta(A_\pi)$ as probability distributions over the sets of parameters. For example, $\sigma_\pi(\theta)$ is the probability that the agent selects a policy with neural network parameters $\theta$. For defining equilibrium notions, we also need the joint mixed strategy $\sigma \in \Delta(A)$.

The minimax loss then has an extra expectation over the mixed strategies:

$$\min_{\sigma_\pi} \max_{\sigma_D} \; \mathbb{E}_{w \sim \sigma_D,\, \theta \sim \sigma_\pi}\left[ V(w, \theta) \right] \tag{2}$$

where $V(w, \theta)$ denotes the GAIL objective of Eq. 1 evaluated at the discriminator $D_w$ and the policy $\pi_\theta$.

If a Nash equilibrium is achieved over the probability distributions of strategies, it is a mixed Nash equilibrium (MNE). In the degenerate case, where each mixed strategy puts all of its mass on a single parameter setting, this is the same as standard GAIL, which amounts to achieving a PNE.
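The difference between pure and mixed strategies can be illustrated on a tiny matrix game. The sketch below assumes a finite set of two parameter settings per player and uses matching pennies as a stand-in payoff table (it is not the GAIL objective); `expected_payoff` is a hypothetical helper name.

```python
def expected_payoff(payoff, sigma_d, sigma_a):
    """E[payoff[i][j]] with i ~ sigma_d and j ~ sigma_a drawn independently."""
    return sum(
        sigma_d[i] * sigma_a[j] * payoff[i][j]
        for i in range(len(sigma_d))
        for j in range(len(sigma_a))
    )

# Matching pennies: rows index discriminator parameters, columns agent parameters.
# The row player maximizes the entry, the column player minimizes it.
payoff = [[1.0, -1.0],
          [-1.0, 1.0]]

pure = expected_payoff(payoff, [1.0, 0.0], [1.0, 0.0])   # degenerate (pure) strategies
mixed = expected_payoff(payoff, [0.5, 0.5], [0.5, 0.5])  # uniform mixed strategies
```

Matching pennies has no pure Nash equilibrium, since one player always wants to deviate from any pure pair, but the uniform mixed pair is an equilibrium with value 0. This is the kind of relaxation that mixed strategies over parameters make available.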

## 3 No-regret GAIL

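The least strict of the equilibrium notions is no-regret, also known as coarse correlated equilibrium (CCE) or the Hannan set. It corresponds to the empirical distributions that arise from repeated joint play by no-regret learners. In no-regret games, each player begins with no model of their opponent. After the play has progressed, each player can look into the past and ask whether they could have done better, using a notion of regret. Regret is the difference in payoff between an alternate strategy and the pursued strategy. Regret minimization algorithms aim to ensure that long-term regret is sublinear in the number of time steps. More formally, let $w_1, \dots, w_T$ and $\theta_1, \dots, \theta_T$ be the history of weight updates of the discriminator and the agent when the game is played repeatedly up to time $T$. In the external regret minimization framework, each player compares its average payoff to the payoff that would have been received with the best constant action. Translated to the context of GAIL, this implies that the objective of no-regret GAIL is:

$$\frac{1}{T}\left( \max_{w} \sum_{t=1}^{T} V(w, \theta_t) - \sum_{t=1}^{T} V(w_t, \theta_t) \right) \to 0, \qquad \frac{1}{T}\left( \sum_{t=1}^{T} V(w_t, \theta_t) - \min_{\theta} \sum_{t=1}^{T} V(w_t, \theta) \right) \to 0 \tag{3}$$

where $V$ is the objective of standard GAIL (Eq. 1), written as a function of the discriminator parameters $w$ and the agent parameters $\theta$. Several classes of algorithms can be used to satisfy Eq. 3, that is, to yield sublinear regret. One important class of no-regret learners is known as Follow The Regularized Leader (FTRL) with $\ell_2$-regularization:

$$w_{t+1} = \arg\max_{w} \sum_{i=1}^{t} V(w, \theta_i) - \frac{1}{\eta}\lVert w \rVert_2^2, \qquad \theta_{t+1} = \arg\min_{\theta} \sum_{i=1}^{t} V(w_i, \theta) + \frac{1}{\eta}\lVert \theta \rVert_2^2 \tag{4}$$

where $\eta > 0$ controls the strength of the regularization.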

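This was successfully applied to GANs (Grnarova et al., 2017), but the convergence guarantees hold only when the discriminator is a shallow network. To address this issue, a non-convex Follow-The-Perturbed-Leader (FTPL) algorithm (Gonen and Hazan, 2018) was recently proposed, which is quite simple to implement:

$$w_{t+1} = \arg\max_{w} \sum_{i=1}^{t} V(w, \theta_i) + \xi_t^{\top} w \tag{5}$$

where $V$ is the GAIL objective as a function of the discriminator parameters $w$ and the past agent parameters $\theta_i$, and the perturbation $\xi_t \sim \mathrm{Exp}(\eta)$ is drawn at each step. We use the same type of regularization for the implementation of no-regret GAIL.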
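The idea behind follow-the-perturbed-leader and external regret can be sketched on a finite set of candidate actions. This is a simplified illustration, assuming a small discrete action set with random payoffs rather than neural network parameters and an offline oracle; the names `ftpl_choice` and `external_regret` are hypothetical.

```python
import random

def ftpl_choice(cumulative_payoffs, eta, rng):
    """Pick the action maximizing cumulative payoff plus exponential noise."""
    perturbed = [p + rng.expovariate(eta) for p in cumulative_payoffs]
    return max(range(len(perturbed)), key=perturbed.__getitem__)

def external_regret(payoff_history, chosen):
    """Total payoff of the best fixed action minus the payoff actually collected."""
    n_actions = len(payoff_history[0])
    best_fixed = max(
        sum(round_[a] for round_ in payoff_history) for a in range(n_actions)
    )
    collected = sum(round_[c] for round_, c in zip(payoff_history, chosen))
    return best_fixed - collected

rng = random.Random(0)
n_actions, horizon = 3, 2000
cumulative = [0.0] * n_actions
history, chosen = [], []
for t in range(horizon):
    # Choose before the payoffs of this round are revealed.
    action = ftpl_choice(cumulative, eta=0.1, rng=rng)
    payoffs = [rng.random() * scale for scale in (0.2, 0.5, 1.0)]  # action 2 is best on average
    history.append(payoffs)
    chosen.append(action)
    cumulative = [c + p for c, p in zip(cumulative, payoffs)]

average_regret = external_regret(history, chosen) / horizon
```

Early on the noise dominates and the learner explores; as the cumulative payoffs separate, it settles on the best action and the average regret shrinks, which is the sublinear-regret behavior that Eq. 3 asks for.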

### 3.1 No regret and causal misidentification

A discriminator can focus on non-relevant features. For example, when training a model to learn how to drive, the model can focus on a slight background difference, such as a brake indicator signal, instead of the ongoing scene in the environment. When a model learns using a selection of past parameters/strategies, as is done in no-regret GAIL, there is a good chance that some of the past models have not been trained using the brake indicator information. This reduces the likelihood that a model exploits task-irrelevant/non-causal features.

## 4 Correlated GAIL

The no-regret framework, however, does not remove causal misidentification by itself. This is because the no-regret equilibrium is determined by the history of play, and the history of the game can reveal information about the players’ next moves. When one player is more adversarial than the other, as is the case in GAIL, the discriminator’s optimization might benefit from ignoring the rest of the policies/strategies. In the extreme degenerate case, it is as if the discriminator plays against only one policy, and we are back to the original problem of causal misidentification that no-regret learning was trying to solve.

Therefore, we would like a setting in which players are not able to predict each other’s moves. We use the concept of a correlation device from game theory to turn the problem into a maximum entropy correlated equilibrium (MaxEnt CE) (Ortiz et al., 2006). To turn the described no-regret GAIL into a correlated one, we introduce a third network. This mediating network learns to guide the equilibrium to desired regions by augmenting the observations. A desirable region is an equilibrium where a) neither player can predict the strategy of the other player based on their play, which addresses the issue with no-regret learning, and b) no player would benefit from ignoring the provided code; otherwise, they would put zero weight on the mediating codes.

To satisfy these requirements, we set a reward function for the mediator network that leads to a MaxEnt CE. Since it is an equilibrium, players incur more loss if they ignore the mediator’s signals. The maximum entropy part guarantees unpredictability.

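Let us define $G_D(w', w, \theta) = u_D(w', \theta) - u_D(w, \theta)$ and $G_\pi(\theta', \theta, w) = u_\pi(w, \theta') - u_\pi(w, \theta)$ as the discriminator’s and the agent’s payoff gains for selecting parameters $w'$ (respectively $\theta'$) instead of $w$ (respectively $\theta$), where $u_D$ and $u_\pi$ are the utilities of the discriminator and the agent.

Definition: correlated equilibrium (CE) (Aumann, 1987). It is a joint mixed strategy $\sigma$ over pairs of discriminator and agent parameters such that, for all discriminator parameters $w, w'$ and all agent parameters $\theta, \theta'$,

$$\sum_{\theta} \sigma(w, \theta)\, G_D(w', w, \theta) \le 0, \qquad \sum_{w} \sigma(w, \theta)\, G_\pi(\theta', \theta, w) \le 0 \tag{6}$$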

Computing CE amounts to solving a linear program. As a result, it is computationally less expensive than computing NE, which amounts to computing a fixed point.

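If the joint mixed strategy $\sigma$ is set to be a product distribution $\sigma_D \times \sigma_\pi$ of the players’ individual mixed strategies, CE reduces to NE. In other words, standard GAIL can be considered a special case of correlated GAIL in which agents and discriminators play independently.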
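Because the correlated equilibrium conditions are linear in the joint distribution, they are easy to verify numerically. The sketch below checks them for a two-player game given by payoff matrices, using the classic game of Chicken as an assumed example; `is_correlated_equilibrium` is a hypothetical helper, not part of corGAIL.

```python
def is_correlated_equilibrium(joint, payoff_row, payoff_col, tol=1e-9):
    """Check the CE inequalities; joint[i][j] is the probability of recommending (i, j)."""
    n_rows, n_cols = len(joint), len(joint[0])
    # Row player: switching from recommendation i to k must not pay off.
    for i in range(n_rows):
        for k in range(n_rows):
            gain = sum(joint[i][j] * (payoff_row[k][j] - payoff_row[i][j])
                       for j in range(n_cols))
            if gain > tol:
                return False
    # Column player: switching from recommendation j to k must not pay off.
    for j in range(n_cols):
        for k in range(n_cols):
            gain = sum(joint[i][j] * (payoff_col[i][k] - payoff_col[i][j])
                       for i in range(n_rows))
            if gain > tol:
                return False
    return True

# Game of Chicken, actions ordered (Dare, Chicken).
payoff_row = [[0.0, 7.0], [2.0, 6.0]]
payoff_col = [[0.0, 2.0], [7.0, 6.0]]

correlated = [[0.0, 1 / 3], [1 / 3, 1 / 3]]      # mediator never recommends (Dare, Dare)
uniform_product = [[0.25, 0.25], [0.25, 0.25]]   # independent uniform play

ce_ok = is_correlated_equilibrium(correlated, payoff_row, payoff_col)
product_ok = is_correlated_equilibrium(uniform_product, payoff_row, payoff_col)
```

The correlated recommendation is an equilibrium even though it is not a product distribution, while independent uniform play is not an equilibrium of this game. A mediator can therefore reach joint behaviors that independently acting players cannot.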

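Definition: MaxEnt CE. It is a CE with joint mixed strategy $\sigma^{*} = \arg\max_{\sigma \in \mathrm{CE}} H(\sigma)$, where $H(\sigma) = -\sum_{w, \theta} \sigma(w, \theta) \log \sigma(w, \theta)$ is the entropy.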

###### Theorem 1

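MaxEnt CE Representation (Ortiz et al., 2006). MaxEnt CE has the following parameterization:

$$\sigma^{*}(w, \theta) = \frac{1}{Z(\lambda)} \exp\left( -\sum_{w'} \lambda_{w, w'}\, G_D(w', w, \theta) - \sum_{\theta'} \lambda_{\theta, \theta'}\, G_\pi(\theta', \theta, w) \right) \tag{7}$$

where $G_D(w', w, \theta)$ and $G_\pi(\theta', \theta, w)$ are the discriminator’s and the agent’s payoff gains for switching from $w$ to $w'$ and from $\theta$ to $\theta'$, $Z(\lambda)$ is the normalizing constant, and $\lambda_{w, w'} \ge 0$ and $\lambda_{\theta, \theta'} \ge 0$ are the Lagrange multipliers. The multipliers have a natural interpretation: if $\lambda_{w, w'} > 0$, the player is indifferent between $w$ and $w'$, and $\lambda_{w, w'} = 0$ when the player has a strict preference for $w$ over $w'$ (and likewise for $\lambda_{\theta, \theta'}$).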
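The exponential form of the MaxEnt CE can be sketched for a small two-player matrix game. The code below assumes the multipliers are given rather than solved for, and uses the game of Chicken as an illustrative payoff table; `maxent_ce_distribution` and the multiplier layout are hypothetical choices made for this example.

```python
import math

def maxent_ce_distribution(payoff_row, payoff_col, lam_row, lam_col):
    """Joint distribution proportional to exp(-sum of multiplier-weighted switching gains).

    lam_row[i][k] weights the row player's gain from switching i -> k,
    lam_col[j][k] weights the column player's gain from switching j -> k.
    """
    n_rows, n_cols = len(payoff_row), len(payoff_row[0])
    weights = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            penalty = 0.0
            for k in range(n_rows):
                penalty += lam_row[i][k] * (payoff_row[k][j] - payoff_row[i][j])
            for k in range(n_cols):
                penalty += lam_col[j][k] * (payoff_col[i][k] - payoff_col[i][j])
            weights[i][j] = math.exp(-penalty)
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]

# Game of Chicken, actions ordered (Dare, Chicken).
payoff_row = [[0.0, 7.0], [2.0, 6.0]]
payoff_col = [[0.0, 2.0], [7.0, 6.0]]

zeros = [[0.0, 0.0], [0.0, 0.0]]
uniform = maxent_ce_distribution(payoff_row, payoff_col, zeros, zeros)

# Put weight on the Dare -> Chicken deviation for both players.
lam = [[0.0, 1.0], [0.0, 0.0]]
tilted = maxent_ce_distribution(payoff_row, payoff_col, lam, lam)
```

With all multipliers at zero the distribution is uniform, which is the maximum-entropy distribution when no constraint is active. Positive multipliers shift mass away from joint actions where a deviation would be profitable, here (Dare, Dare).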

When the players’ utilities are known, computing the MaxEnt CE is a convex optimization problem for which efficient algorithms are available. In our case, however, the utilities are being learned simultaneously. Instead, we augment the utilities by absorbing the Lagrange multipliers. This is natural given the interpretation of the Lagrange multipliers as a way of quantifying the indifference between strategies: we can affect this indifference level by augmenting the observations of the players. We therefore augment the utilities of the players by appending codes to their observations through an RL-based mediator. The reward of the mediator for its action is set to:

(8) |

Maximizing the mediator reward is then equivalent to approximating the MaxEnt CE. To avoid extra notation, we still refer to the utilities as u, but from here on they denote the augmented utilities rather than those of the original game (without appended codes).

### 4.1 Correlated GAIL implementation

Correlated GAIL (corGAIL) is described in detail in Alg. 3. Several points are worth mentioning:

Both the discriminator and the agent use the follow-the-leader (FTL) algorithm to update their strategies. The role of the mediator can then also be thought of as a means of turning external regret into correlated equilibrium (Blum and Mansour, 2005).

To keep the high variance of the mediator from leaking too much into the imitation training dynamics, we parameterize the policy of the mediator using the reparameterization trick (Kingma et al., 2015).

The concept of adding codes to the policy network is similar to InfoGAIL (Li et al., 2017). InfoGAIL uses a fixed code to guide an entire trajectory. Moreover, it adds regularization terms to the policy gradient objective to make sure that the codes are not ignored. The correlated codes have different properties. First, they are generated per state-action pair (they are also fed into the discriminator) and therefore address multimodality and other types of variation within trajectories as well. Second, the motivation for players to use the mediator codes is implicit in the definition of correlated equilibrium: ignoring the codes would lead to a loss in the optimization steps, so there is no need to include any extra regularization terms.

Treating the space of strategies as a small, finite-capacity queue might sound like a gross approximation. In practice, however, it leads to a very efficient implementation. We used the pruning strategy in Alg. 2 (Grnarova et al., 2017) to keep the queue size fixed. Other efficient techniques can be used to prune the set of parameters, for example (Hazan and Comandur, 2009). CorGAIL is sensitive to the pruning parameter. One reason for this sensitivity is that not allowing enough time before pruning could introduce a bad policy into the strategies and make the training unstable. A similar type of instability was a practical challenge for (Daumé et al., 2009; Ross and Bagnell, 2010; Kakade and Langford, 2002), where a finite mixture of policies is collected by adding a new policy at each step of training. We did, however, find that this instability can be avoided when the pruning parameter is set to a large enough value.

We used the utility definition of Wasserstein GAN (Arjovsky et al., 2017) for our final implementation. However, unlike in previous minimax approaches, entropy regularization, gradient penalty, and even parameter clipping for the discriminator are unnecessary. This removes the difficulties of past training methodologies.
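The fixed-capacity queue of strategies mentioned in the points above can be sketched as follows. This is only an illustration under a simple assumption: the oldest snapshot is evicted first, which is not necessarily the pruning rule of Alg. 2, and the class name `StrategyQueue` is hypothetical.

```python
import random
from collections import deque

class StrategyQueue:
    """Fixed-capacity store of past parameter snapshots (oldest dropped first)."""

    def __init__(self, capacity):
        # deque with maxlen discards the oldest item automatically when full.
        self._items = deque(maxlen=capacity)

    def push(self, params):
        self._items.append(params)

    def sample(self, rng):
        """Draw one stored snapshot uniformly at random."""
        return rng.choice(list(self._items))

    def __len__(self):
        return len(self._items)

queue = StrategyQueue(capacity=3)
for step in range(5):
    queue.push({"step": step})  # stand-in for a copy of network weights

picked = queue.sample(random.Random(0))
```

Each player would keep one such queue, so that the opponent is trained against a sample of past strategies instead of only the most recent one. Replacing the eviction rule is a matter of changing `push`.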

## 5 Experiments

### 5.1 Adaptability and transferability

One concern with introducing an RL-based mediator to imitation learning is the possible difficulty of training in environments with high-dimensional observations. Moreover, we wanted an environment in which we could run controlled experiments to test the generalization capability of corGAIL in new and adaptive environments. We therefore conduct the main body of our experiments using the CoinRun environment (Cobbe et al., 2018). It is a procedurally generated environment with a configurable number of levels and difficulty. It can provide insight into an agent’s ability to generalize to new and unseen environments. It was specifically designed to study generalization in RL, but it can also be used to study the adaptivity and transferability of imitation learning algorithms. At each time step, the agent receives an RGB observation and controls a character that spawns on the far left, while the coin spawns on the far right. Depending on the level, several obstacles, both stationary and non-stationary, lie between the agent and the coin. Levels vary widely in difficulty, so the distribution of levels forms a curriculum for the agent. A collision with an obstacle results in the agent’s death. The only reward in the environment is obtained by collecting the coin, and this reward is a fixed positive constant. The level terminates when the agent dies, the coin is collected, or after 1000 time steps.

In all of the experiments, we use queue size and code size of . We used a larger value of for stable and efficient training in such environments, whereas smaller values are suitable for classical RL training environments, which we set to be between 10 and 100. We first trained a PPO (Schulman et al., 2017) agent using the 3-layer convolutional architecture proposed in (Mnih et al., 2015). We stopped the training when it achieved a reward of 6.3 with 500 levels. We then generated expert trajectories with the same number of levels. For imitation learning training, however, three different numbers of levels are used: 1, 500, and an unbounded set of levels. A higher number of levels decreases the chance that a given environment is encountered more than once. For the unbounded number of levels, this probability is almost 0. These different numbers of levels provide insight into the adaptability and transferability of corGAIL to new environments. In Fig. 2, we visualize the performance of corGAIL and GAIL, averaged over three different seeds. When there is only one level, the gap between GAIL and corGAIL is not very wide. In fact, in the beginning, GAIL shows better sample efficiency. However, as the number of levels increases, corGAIL outperforms GAIL substantially.

CoinRun also has a difficulty setting. Several choices made during CoinRun’s level generation process are conditioned on this difficulty setting, including the number of sections in the level, the length and height of each section, and the frequency of obstacles. We measure the log-likelihood of the mediator’s policy over the course of several games. Fig. 3 shows the result in a quiver plot. It can be seen that in difficult environments, the log-likelihoods of the mediator’s actions change more sharply. One plausible explanation is that a high difficulty setting increases the probability of the agent making a mistake, which increases the distributional shift. To address this, the mediator has to steer the dynamics harder toward efficient training regions.

We also ran a few baselines on MountainCarContinuous-v0 and Pendulum-v0. We failed to make no-regret GAIL stable with different choices of regularization. We therefore only report GAIL and corGAIL. As in the single-level CoinRun setting, the two perform the same.

### 5.2 Imitating mixture of state-action trajectories

To validate the assertion that corGAIL is suitable for imitating a mixture of experts, we use the Synthetic 2D Circle World experiment of (Li et al., 2017). The goal is to select a direction at each time step, using a window of recent observations, such that the resulting path mimics those demonstrated in the expert trajectories. These expert trajectories come from stochastic policies that produce circle-like trajectories. They contain three different modes, as shown in Fig. 5. A proper imitation learning method should be able to distinguish the experts in the mixture from each other. The results in Fig. 5 show the learned trajectories during the last 40K of the overall 200K training steps. It can be seen that corGAIL can distinguish the expert trajectories and imitate the demonstrations more efficiently than no-regret GAIL and GAIL.

## 6 Conclusion and future work

We proposed a novel game-theoretic framework for imitation learning that replaces the problem of computing a Nash equilibrium with that of computing a MaxEnt correlated equilibrium. Computing a MaxEnt CE is a convex optimization problem in its original form. When it is applied to the imitation learning context, however, its derivation is not straightforward, because the utility of the game is simultaneously being learned during imitation learning training. Therefore, an RL-based mediator network is proposed to approximate the MaxEnt equilibrium by augmenting the observations/states with codes.

We highlighted several benefits of such a framework. We argued that it helps with causal misidentification and distributional shift, because no-regret loss minimization can attack the sequential nature of causal misidentification. We also showed how turning a no-regret imitation learning framework into a MaxEnt correlated framework can provide further benefits. For example, it avoids the non-convex optimization challenges of no-regret frameworks. It also prevents the degenerate cases that can arise from the predictability of the players’ moves. Thanks to the mediator codes, corGAIL is suitable for learning from a mixture of experts and is capable of generalizing to unseen environments.

Although we discussed the role of mediator codes for efficient learning, we leave the use of correlated codes for interpretability to future work. The probabilistic and computationally efficient nature of CE is also a potential area of future imitation learning research. It could be interesting to explore the applicability of such a game-theoretic framework to meta-imitation learning. We are also investigating the use of the mediator codes for efficient recurrent policies.

## References

- Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML-17, pp. 214–223.
- Correlated equilibrium as an expression of Bayesian rationality.
- Variational dropout and the local reparameterization trick. In NIPS.
- From external to internal regret. J. Mach. Learn. Res. 8, pp. 1307–1324.
- InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In NIPS.
- Quantifying generalization in reinforcement learning. ArXiv abs/1812.02341.
- Search-based structured prediction. Machine Learning 75, pp. 297–325.
- Causal confusion in imitation learning. In NeurIPS.
- Learning in non-convex games with an optimization oracle. In COLT.
- Generative adversarial nets. In NIPS.
- An online learning approach to generative adversarial networks. ArXiv abs/1706.03269.
- Efficient learning algorithms for changing environments. In ICML ’09.
- Efficient regret minimization in non-convex games. In ICML.
- Generative adversarial imitation learning. In NIPS.
- Approximately optimal approximate reinforcement learning. In ICML.
- On convergence and stability of GANs.
- PyTorch implementations of reinforcement learning algorithms. GitHub. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail
- Maximum causal Tsallis entropy imitation learning. In NeurIPS.
- InfoGAIL: interpretable imitation learning from visual demonstrations. In NIPS.
- Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
- Maximum entropy correlated equilibria. In AISTATS.
- Efficient reductions for imitation learning. In AISTATS.
- A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS.
- Learning factorial codes by predictability minimization.
- Proximal policy optimization algorithms. ArXiv abs/1707.06347.
- Deep learning games. In NIPS.
