by Mengyuan Yan, Yao Lu

Jun 19, 2021

AW-Opt: Learning Robotic Skills with Imitation and Reinforcement at Scale

Learning methods for robotic control are conventionally divided into two groups: methods based on autonomous trial-and-error (reinforcement learning), and methods based on imitating user-provided demonstrations (imitation learning). These two approaches have complementary strengths and weaknesses. Reinforcement learning (RL) enables robots to improve autonomously, but introduces significant challenges with exploration, safety, and data efficiency. Imitation learning (IL) methods learn from expert demonstrations, but it is challenging to generalize and adapt the learned control policies to an increasingly broad range of situations without constantly requiring more demonstrations.

In “AW-Opt: Learning Robotic Skills with Imitation and Reinforcement at Scale”, we collaborated with robotics research at Google to systematically investigate design decisions made in several existing algorithms that sought to combine the best of both worlds from IL and RL (IL+RL) [1] [2] [3] [4]. Leveraging the diverse set of robot platforms, tasks, and control modalities that we experiment with, we developed a complete and scalable system for integrating IL and RL that enables learning robotic control policies from demonstration data, suboptimal experience, and online autonomous trial-and-error. We hope that the analysis, ablation experiments, and detailed evaluation of individual design decisions presented in this work will help guide further developments in the community.

We begin our investigation with two existing methods: AWAC [5], which combines IL and RL, and QT-Opt [6], a scalable RL algorithm we have been using on our robots. Our testbed consists of six tasks: one navigation task with dense reward and five manipulation tasks with sparse reward. The manipulation tasks run on two different robot platforms that use different control modalities (KUKA and ours).

Our tasks cover varying levels of difficulty, from indiscriminate grasping (figures a and c) to semantic grasping (figure d, grasping compostable objects) to instance grasping (figure b, grasping the green bowl).

Both algorithms are provided with demonstrations, either from humans or from previous successful RL rollouts, for offline pretraining, and then switch to on-policy data collection and training.

An example of real-world demonstration data collection

We found that basic QT-Opt fails to learn from only successful rollouts, and even fails to make progress during on-policy training for tasks with a 7-DoF action space. AWAC [5], on the other hand, does attain a non-zero success rate from the demonstrations, but its performance is still poor and collapses during online fine-tuning on all of our sparse-reward manipulation tasks.


We introduced a series of modifications to AWAC that bring it closer to QT-Opt, retaining the ability to utilize demonstrations while improving overall learning performance, culminating in our full AW-Opt algorithm.

Positive Sample Filtering: One possible explanation for the poor performance of AWAC is that, with the relatively low success rate after pretraining, the large number of failed episodes collected during online exploration drowns out the initial successful demonstrations, and the actor unlearns the promising policy. To address this issue, we used positive filtering for the actor, applying the AWAC actor update only to successful samples. As a result, the algorithm no longer collapses during on-policy training.
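To make the filtering step concrete, here is a minimal sketch in plain Python. It assumes the usual advantage-weighted form of the AWAC actor update, where each sample's log-likelihood is weighted by exp(advantage / temperature). The function names, the `temperature` parameter, and the `max_weight` clipping are illustrative assumptions, not details taken from the paper.

```python
import math


def filtered_actor_weights(advantages, episode_success,
                           temperature=1.0, max_weight=20.0):
    """Per-sample actor weights with positive sample filtering.

    Samples from failed episodes get weight 0, so they never contribute
    to the actor update. Samples from successful episodes get an
    advantage-based weight (clipped here for stability; the clipping
    value is an assumption of this sketch).
    """
    weights = []
    for advantage, success in zip(advantages, episode_success):
        if not success:
            weights.append(0.0)
        else:
            weights.append(min(math.exp(advantage / temperature), max_weight))
    return weights


def actor_loss(log_probs, weights):
    """Weighted negative log-likelihood over the samples that were kept."""
    kept = [(lp, w) for lp, w in zip(log_probs, weights) if w > 0.0]
    if not kept:
        return 0.0  # no successful samples in this batch
    return -sum(lp * w for lp, w in kept) / len(kept)
```

The critic is not filtered in this sketch: only the actor loss ignores failed episodes, which matches the description of applying positive filtering to the actor.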


Hybrid Actor-Critic Exploration: QT-Opt uses the cross-entropy method (CEM) to optimize the action using its critic, which can be viewed as an implicit policy (the CEM policy). The CEM process has intrinsic noise due to sampling and can act as the exploration policy. AWAC, on the other hand, explores by sampling from the actor network, although we could also obtain a CEM policy from its critic. We found that using both the actor and the CEM policies for exploration, switching randomly between the two on a per-episode basis, performs better than using either one alone.

Algorithm 3
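The per-episode switching described above can be sketched as follows. This is a toy illustration under stated assumptions: `reset_fn`, `step_fn`, the two policy callables, and the mixing probability `p_actor` are hypothetical names introduced here, and the 50/50 default is not a value from the paper.

```python
import random


def collect_episode(reset_fn, step_fn, actor_policy, cem_policy, rng,
                    p_actor=0.5, max_steps=100):
    """Roll out one episode with a single, randomly chosen exploration policy.

    The choice between the actor and the CEM policy is made once, before
    the episode starts, and is held fixed for every step of that episode.
    """
    use_actor = rng.random() < p_actor
    policy = actor_policy if use_actor else cem_policy

    state = reset_fn()
    transitions = []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = step_fn(state, action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return ("actor" if use_actor else "cem"), transitions
```

Passing a seeded `random.Random` instance as `rng` keeps data collection reproducible, and returning the policy label makes it easy to log which policy produced each episode.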

Action Selection in the Bellman Update: QT-Opt uses CEM to find the optimal action for the Bellman backup target. AWAC, on the other hand, samples from the actor network for the Bellman backup target. We compared both methods, as well as two new ones that combine the actor and critic networks: (a) using the actor-predicted action as the initial mean for CEM; (b) using the actor-predicted action as an additional candidate in each round of CEM. We found that option (b) gave us the best performance.

Algorithm 4
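Option (b) can be illustrated with a small, self-contained CEM loop in plain Python. This is a sketch rather than the paper's implementation: `q_fn` stands in for the critic, and the zero initial mean, population size, number of elites, iteration count, and discount are assumptions chosen for readability.

```python
import random
import statistics


def cem_with_actor_candidate(q_fn, state, actor_action, rng, iterations=3,
                             population=16, num_elites=4, init_std=1.0):
    """Maximize q_fn(state, action) with CEM, adding the actor's action
    to the candidate set in every round (option (b))."""
    dim = len(actor_action)
    mean = [0.0] * dim        # assumed initial mean for this sketch
    std = [init_std] * dim
    best_action, best_q = None, float("-inf")

    for _ in range(iterations):
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(population)]
        candidates.append(list(actor_action))  # extra candidate from the actor

        ranked = sorted(candidates, key=lambda a: q_fn(state, a), reverse=True)
        elites = ranked[:num_elites]

        top_q = q_fn(state, elites[0])
        if top_q > best_q:
            best_action, best_q = elites[0], top_q

        # Refit the sampling distribution to the elite candidates.
        mean = [statistics.fmean([e[i] for e in elites]) for i in range(dim)]
        std = [statistics.pstdev([e[i] for e in elites]) + 1e-6
               for i in range(dim)]
    return best_action, best_q


def bellman_target(reward, done, next_q, discount=0.9):
    """One-step backup target using the maximized next-state Q-value."""
    return reward + (0.0 if done else discount * next_q)
```

Because the actor's action is scored alongside the sampled candidates in every round, the returned Q-value is never lower than the critic's value for the actor's own action, so the backup target is at least as optimistic as one built from the actor's action alone.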

In summary, our results suggest that AW-Opt can be a powerful IL+RL method for scaling up robotic skill learning. With AW-Opt, we have shown that, depending on task difficulty, a few hours to a few days of human demonstrations plus additional simulated on-policy training can yield high-performing manipulation or navigation policies without task-specific engineering.

Read the paper.

A compilation of AW-Opt evaluation videos on several tasks


[1] S. Schaal et al. Learning from demonstration. Advances in Neural Information Processing Systems, pages 1040–1046, 1997.

[2] J. Kober, B. Mohler, and J. Peters. Imitation and reinforcement learning for motor primitives with perceptual coupling. From Motor Learning to Interaction Learning in Robots, pages 209–225. Springer, 2010. 

[3] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel. Overcoming exploration in reinforcement learning with demonstrations. 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE, 2018.

[4] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothorl, T. Lampe, and M. Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.

[5] A. Nair, M. Dalal, A. Gupta, and S. Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020. 

[6] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. 2018 Conference on Robot Learning, pages 651–673, 2018.