by Mohi Khansari

May 13, 2020

Action-Image Representation: Learning Scalable Deep Grasping Policies with Zero Real World Data

We introduce Action-Image, a new grasp-proposal representation that enables learning an end-to-end deep grasping policy with no real-world data. Our model achieves 84% grasp success on 172 real-world objects while being trained only in simulation on 48 objects with just naive domain randomization. The representation works with a variety of inputs, including color images (RGB), depth images (D), and combined color-depth images (RGB-D).

We achieve this level of performance by helping the network extract the salient features of the grasping task: 1) leveraging knowledge of the robot's kinematics, letting the model focus on understanding the underlying task; 2) exploiting the object-centric (i.e., local) nature of grasping, so unnecessary detail is not passed to the model; and 3) imposing translation invariance in image space, similar to object detection models.
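To make these ideas concrete, here is a minimal, hypothetical sketch of how a grasp proposal might be rendered into an "action image" via known kinematics and combined with a local crop of the observation. The keypoint projection, crop size, and channel layout below are illustrative assumptions, not the exact pipeline from the paper.

```python
import numpy as np

def project_points(points_3d, intrinsics):
    """Project 3D gripper keypoints (from robot kinematics) into pixel coordinates."""
    # points_3d: (N, 3) in the camera frame; intrinsics: 3x3 camera matrix (assumed known).
    uvw = (intrinsics @ points_3d.T).T
    return (uvw[:, :2] / uvw[:, 2:3]).astype(int)

def render_action_image(grasp_keypoints_3d, intrinsics, image_shape):
    """Render a grasp proposal as a sparse single-channel image of projected keypoints."""
    action_img = np.zeros(image_shape[:2], dtype=np.float32)
    for u, v in project_points(grasp_keypoints_3d, intrinsics):
        if 0 <= v < image_shape[0] and 0 <= u < image_shape[1]:
            action_img[v, u] = 1.0
    return action_img

def crop_around_grasp(tensor, center_uv, size=64):
    """Crop a local window around the grasp center to exploit the locality of grasping."""
    u, v = center_uv
    half = size // 2
    padded = np.pad(tensor, ((half, half), (half, half), (0, 0)), mode="constant")
    return padded[v:v + size, u:u + size, :]

# Example: build the network input for one grasp candidate (placeholder observations).
rgb = np.zeros((480, 640, 3), dtype=np.float32)
depth = np.zeros((480, 640, 1), dtype=np.float32)
intrinsics = np.array([[500.0, 0.0, 320.0],
                       [0.0, 500.0, 240.0],
                       [0.0, 0.0, 1.0]])                 # assumed camera matrix
grasp_keypoints = np.array([[0.02, 0.0, 0.5],
                            [-0.02, 0.0, 0.5]])          # e.g., fingertip positions from kinematics

action = render_action_image(grasp_keypoints, intrinsics, rgb.shape)[..., None]
center = project_points(grasp_keypoints.mean(axis=0, keepdims=True), intrinsics)[0]
net_input = crop_around_grasp(np.concatenate([rgb, depth, action], axis=-1), center)
print(net_input.shape)  # (64, 64, 5): RGB + D + action channel, localized around the grasp
```

Because the grasp is encoded as an image channel and the input is cropped around the proposal, the same convolutional policy can score candidates anywhere in the scene, which is what gives the representation its translation invariance.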

Read the full paper.