by Chelsea Finn, Eric Jang
Feb 08, 2022
Contact-rich manipulation problems are ubiquitous in the physical world. Over millions of years of evolution, humans have developed a remarkable ability to understand the physics of their environment and thereby acquire general contact-rich manipulation skills. Combining visual and tactile perception with end-effectors such as fingers and palms, humans effortlessly manipulate objects of various shapes and dynamics properties in complex environments.
Robots, on the other hand, lack this capability, due to the difficulty of interpreting high-dimensional perception and modeling complicated contact physics. Recent developments in deep reinforcement learning (RL) have shown great potential for solving manipulation problems by leveraging two key advantages. First, the representational capacity of deep neural networks can capture complicated dynamics models. Second, policy optimization explores a vast space of contact interactions. However, contact-rich manipulation tasks are generally dynamics-dependent: because RL policies are trained under a specific dynamics setting, they specialize to the training scenario and are vulnerable to variations in dynamics. Learning a policy that is robust to dynamics variations is pivotal for deployment in scenarios with diverse object dynamics properties. In this work, we design a deep RL method that takes multi-modal perceptual input and uses a deep representation to capture contact-rich dynamics properties.
The proposed method, Contact-aware Online COntext Inference (COCOI), uses prior camera frames and force readings in a contact-aware way to encode dynamics information into a latent context representation. This allows the RL policy to plan with dynamics awareness and improves robustness against domain variations. Unlike the feedforward baseline (Fig. 1), which only has access to a single sensory input and cannot infer the dynamics properties of the object (necessary for our non-planar pushing task), the Online COntext Inference (COI) module takes history observation samples and encodes them into a dynamics context representation, equipping the control policy with the ability to infer the object's dynamics.

COI consists of a set of additional streams in the policy network that encode past sensor observations into a dynamics context representation. Each stream takes a pair of consecutive sensory inputs separated in time by 0.5 s (the sensor update interval in our robot system). The encoded outputs of all streams are then averaged to obtain the final dynamics context representation, which is concatenated with the state-action representation to estimate the Q value. To ensure that each history sample contains useful information, we also propose a contact-aware sampling strategy that actively checks the force-torque sensor mounted at the robot gripper and only collects a sample when the contact force magnitude is considerably large. This guarantees that the samples are representative, in the sense that the gripper and the object are in contact. We call this combination COntact-aware COI, or COCOI.
Fig. 1: The baseline feedforward neural network Q function (top) and the proposed COntext Inference (COI) module (bottom).
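The two pieces described above can be sketched in a few lines. The following is a minimal, hypothetical numpy illustration, not the authors' implementation: `COI` stands in for the shared per-stream encoder whose outputs are averaged into one context vector, and `contact_aware_sample` stands in for the contact-aware sampling rule; all layer sizes, weights, and the force threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class COI:
    """Toy sketch of the COI module: a shared two-layer MLP encodes each
    pair of history observations; the per-pair encodings are averaged
    into a single dynamics context vector."""
    def __init__(self, obs_dim, hidden=128, context_dim=32):
        # Random untrained weights, purely for illustration.
        self.w1 = rng.normal(0.0, 0.1, (2 * obs_dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, context_dim))

    def encode(self, obs_pairs):
        # obs_pairs: (n_streams, 2 * obs_dim) -- each row concatenates two
        # observations separated by the 0.5 s sensor update interval.
        h = np.maximum(obs_pairs @ self.w1, 0.0)  # ReLU hidden layer
        z = h @ self.w2                           # per-stream encodings
        return z.mean(axis=0)                     # average over streams

def contact_aware_sample(observations, forces, threshold):
    """Contact-aware sampling: keep only observations recorded while the
    measured contact force magnitude exceeds a threshold, so every
    retained sample reflects gripper-object contact."""
    return [o for o, f in zip(observations, forces) if abs(f) > threshold]
```

In a full Q function, the vector returned by `encode` would be concatenated with the state-action representation before the final value head, as described above.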
We apply COCOI to a novel pushing task where reasoning about dynamics properties plays a vital role: the robot needs to push an object to a target location while avoiding knocking it over. Prior work on pushing mostly focuses on objects that are inherently stable when pushed on a flat surface, which essentially reduces the task to a 2D planar problem. As a result, these methods cannot handle our proposed class of "non-planar pushing tasks," where real-world 3D objects can move with the full six degrees of freedom during pushing.
Despite being commonly seen in everyday life, these tasks have the following challenges:
Visual perception: Unlike in planar pushing, where concrete features can be retrieved from a top-down view, in non-planar pushing, key information cannot be easily extracted from a third-person perspective image.
Contact-rich dynamics: The task's dynamics properties are not directly observable from raw sensor information. Furthermore, in our non-planar pushing task, reasoning about dynamics properties is vital to avoid knocking the object over.
Generalization across domain variations: The policy needs to be effective for objects with different appearances, shapes, masses, and friction properties.
Fig. 2: Our method, COCOI, achieves dynamics-aware, non-planar pushing of an upright 3D object. The method is robust against domain variations, including various objects and environments, in both simulation and the real world. The first and second columns show the table simulation setting from the robot's perspective and a third-person perspective, respectively. The third column shows the simulated and real-world trash bin settings from the robot's perspective.
We tested our method in two settings: on a tabletop and inside a sorting bin. In simulation, we worked with 75 different objects, such as cups, bottles, cans, and mugs, divided into a training set and an unseen testing set. Across different domains, COCOI consistently outperforms the baseline methods. Specifically, as shown in the tables, COCOI achieves an average relative improvement in success rate of 50% over the baseline and 20% over VCOI.
Fig. 3: The scenes and objects used to test our approach.
To inspect the dynamics context learned by COCOI, we also visualized the inferred representations for settings with different dynamics parameters. We ran sample episodes and visualized the representations using a combination of principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). The visualization shows a clear separation between settings, indicating that COCOI learns to infer the dynamics properties. To test real-world deployment, we designed a push-in-bin task in both the simulator and the real world, and trained a RetinaGAN model to adapt simulation images into synthetic images with realistic appearance. We achieved a 90% success rate, demonstrating the capability of our 3D pushing policy to overcome both the visual and the dynamics domain gap.
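The PCA-then-t-SNE visualization pipeline is straightforward to reproduce with scikit-learn. The sketch below uses synthetic context vectors for two made-up dynamics settings in place of COCOI's learned contexts (the cluster means, dimensions, and t-SNE parameters are all illustrative assumptions, not values from the paper):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in context vectors from two hypothetical dynamics settings
# (e.g. a light vs. a heavy object); in the actual method these would
# be the latent contexts produced by COCOI's context encoder.
light = rng.normal(0.0, 0.5, (30, 32))
heavy = rng.normal(3.0, 0.5, (30, 32))
contexts = np.vstack([light, heavy])

# First reduce dimensionality with PCA, then embed in 2D with t-SNE,
# mirroring the two-stage visualization described above.
reduced = PCA(n_components=10).fit_transform(contexts)
embedded = TSNE(n_components=2, perplexity=10,
                random_state=0).fit_transform(reduced)
# embedded has one 2D point per episode sample; well-separated clusters
# suggest the context encodes the underlying dynamics setting.
```

Plotting `embedded` colored by setting then reveals whether the contexts separate, as they do for COCOI in our experiments.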