by Vincent Vanhoucke, Distinguished Scientist and Head of Robotics at Google Research

Dec 12, 2022

This piece was first published on The Keyword blog.

We’ve often relied on technology to supplement — and even superpower — our human capabilities. We developed the printing press to help share information, the abacus (and then the calculator) to help us do math, the airplane to help us get from one point to another. In recent years, and specifically in the field of machine learning, we’ve developed novel ways to process information to power helpful technologies like Search, Assistant, Maps and more.

Underlying many of these technologies is a foundational breakthrough from 2017: the Transformer. Before the Transformer, machine learning systems struggled to determine which parts of their input were relevant to arriving at the correct answer. Transformers introduced the notion of attention: by attending to the important parts of its input, a model can dynamically choose what information matters and what doesn’t, like applying a highlighter to the most relevant sentences in a book. The Transformer proved so powerful that it has become the mother of modern language models, powering much of the artificial intelligence we see today across the industry, even in AI that generates images, like Imagen and Parti.
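To make the idea of attention a bit more concrete, here is a minimal sketch of the scaled dot-product attention at the core of the Transformer, written in plain Python with NumPy. It is an illustration of the mechanism rather than any production implementation, and the variable names simply follow the original paper’s terminology.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention over one sequence.

    query: (num_queries, d_k)  what each position is looking for
    key:   (num_inputs,  d_k)  what each input position offers
    value: (num_inputs,  d_v)  the information actually passed along
    """
    d_k = query.shape[-1]
    # How well each query matches each input: the "highlighter" scores.
    scores = query @ key.T / np.sqrt(d_k)
    # Turn scores into weights that say how much each input matters.
    weights = softmax(scores, axis=-1)
    # Blend the values according to those weights.
    return weights @ value, weights

# Tiny example: 3 positions with 4-dimensional features.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
output, attn = attention(q, k, v)
print(attn.round(2))  # each row sums to 1: where the model "looked"
```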

Over the years, Transformers have been trained on large quantities of text data from the web. They help identify trends and patterns in language to provide translation services, model human conversation and power high-quality search results. Lately, Transformers have been more widely adopted to help make sense of other types of information beyond language, including images, video and speech. And because Transformers excel at both language and vision tasks, we’ve been able to use this technology to make sense of what robots see and how they act.

Applying Transformers to robots

Earlier this year, we worked with Everyday Robots to demonstrate that integrating a powerful language model such as PaLM into a robot learning model could not only enable people to communicate with a robot — but also improve the robot’s overall performance. This language model made it possible for helper robots to understand several types of requests — like “I’m hungry, bring me a snack” or “help me clean up this spill” — and execute them.

Now, we’re using the same architectural foundation that underlies PaLM, the Transformer, to help robots learn more generally from what they’ve already seen. So rather than merely understanding the language underpinning a request like “I’m hungry, bring me a snack,” a robot can learn, just like we do, from all of its collective experience doing things like looking at and fetching snacks.

We did this by training a Transformer model on data collected from 130,000 demonstrations (episodes in which a person operates the robot to execute a task) covering over 700 types of tasks, performed by 13 helper robots from Everyday Robots. The tasks include skills like picking and placing items, opening and closing drawers, getting items in and out of drawers, placing elongated items upright, knocking objects over, pulling napkins and opening jars. The result is a state-of-the-art Robotics Transformer model, or RT-1, that can perform over 700 tasks at a 97% success rate, and even generalize what it has learned to new tasks, objects and environments.

RT-1 is able to generalize learnings across tasks, objects and environments.
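As a rough sketch of what one of those demonstrations looks like as data, the Python below represents a single teleoperated episode as a sequence of camera images, the language instruction, and the action commanded at each step. The field names and array sizes here are illustrative assumptions, not the exact RT-1 data format.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Step:
    """One timestep of a teleoperated demonstration (illustrative schema)."""
    image: np.ndarray   # RGB camera frame, e.g. shape (300, 300, 3)
    instruction: str    # natural-language description of the task
    action: np.ndarray  # robot command for this step (arm, gripper, base, ...)

@dataclass
class Episode:
    """A full demonstration: every step a person took to complete one task."""
    task: str
    steps: List[Step] = field(default_factory=list)

def dummy_episode(task: str, instruction: str, length: int = 5) -> Episode:
    rng = np.random.default_rng(0)
    steps = [
        Step(
            image=rng.integers(0, 256, size=(300, 300, 3), dtype=np.uint8),
            instruction=instruction,
            action=rng.uniform(-1.0, 1.0, size=8),  # placeholder action vector
        )
        for _ in range(length)
    ]
    return Episode(task=task, steps=steps)

# The training set is then just a very large collection of such episodes,
# covering hundreds of tasks performed by many different robots.
dataset = [
    dummy_episode("pick apple", "pick up the apple"),
    dummy_episode("open drawer", "open the top drawer"),
]
print(len(dataset), "episodes,", sum(len(e.steps) for e in dataset), "steps")
```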

Just like a Transformer-based language model predicts the next word based on trends and patterns it sees in text, RT-1 has been trained on robotic perception data (images) and the corresponding actions, so it can identify the next most likely behavior a robot should take. For example, if the model is trained on many examples of how a robot should pick up a banana, the robot can learn to identify and pick up a banana even when it encounters one in a new kitchen, surrounded by objects it’s never seen before. This approach enables the robot to generalize what it’s learned to new tasks, handling new objects and environments based on experiences in its training data. That’s a rare feat for robots, which are typically strictly coded for narrow tasks.
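One way to picture that “next most likely behavior” framing is the toy sketch below: each action dimension is discretized into bins, and the model is trained, exactly as a language model is trained on next-word prediction, to put high probability on the bins the demonstrator actually chose. The bin count, action dimensions and value ranges are illustrative assumptions, not RT-1’s exact configuration.

```python
import numpy as np

NUM_BINS = 256  # each action dimension is discretized into this many "tokens"

def discretize(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each continuous action dimension to an integer bin (an action 'token')."""
    clipped = np.clip(action, low, high)
    bins = np.floor((clipped - low) / (high - low) * (num_bins - 1) + 0.5)
    return bins.astype(int)

def undiscretize(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map action tokens back to continuous commands the robot can execute."""
    return low + tokens / (num_bins - 1) * (high - low)

# A demonstrated action (e.g. normalized arm displacement and gripper command).
demo_action = np.array([0.12, -0.40, 0.05, 0.9])
target_tokens = discretize(demo_action)

# During training, the model's job is a language model's job: given the recent
# images and the instruction, output a distribution over the next action tokens,
# and pay a cross-entropy penalty when that distribution puts low probability
# on the demonstrated tokens. (Random logits stand in for the model here.)
logits = np.random.default_rng(0).normal(size=(demo_action.size, NUM_BINS))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
loss = -np.log(probs[np.arange(demo_action.size), target_tokens]).mean()

# At execution time, the most likely tokens are decoded back into a command.
predicted = undiscretize(probs.argmax(axis=-1))
print(f"cross-entropy loss: {loss:.2f}; predicted action: {predicted.round(2)}")
```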

Learning from each other

As human beings, we learn from our personal experiences and from each other. We often share what we’ve learned and rework systems based on failures we’ve encountered. While our robots don’t communicate with each other, this research shows that we can successfully combine datasets from different types of robots and transfer behaviors across them. In fact, our research shows that by combining data from different robots we’re able to nearly double the model’s ability to generalize to a new scene. That means that as we continue to experiment with different robots and new tasks, we may be able to augment the training data for RT-1 to improve robot behavior, making it a flexible and scalable approach to robot learning.
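To give a feel for what combining datasets means in practice, here is a small sketch of a sampler that draws each training example from one of several robots’ datasets according to fixed mixing weights, so a single model learns from all of them at once. The dataset names and weights are purely hypothetical, and the real training pipeline is of course more involved.

```python
import random

def mixed_batches(datasets, weights, batch_size=4, seed=0):
    """Yield training batches drawn from several robots' datasets at once.

    datasets: dict mapping a robot name to its list of episodes
    weights:  dict mapping a robot name to its sampling weight
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[n] for n in names]
    while True:
        batch = []
        for _ in range(batch_size):
            # Pick which robot's data to sample from, then pick an episode.
            source = rng.choices(names, weights=probs, k=1)[0]
            batch.append((source, rng.choice(datasets[source])))
        yield batch

# Hypothetical example: mix demonstrations from two different robots,
# weighting the larger dataset more heavily.
datasets = {
    "robot_type_a": [f"a_episode_{i}" for i in range(10)],
    "robot_type_b": [f"b_episode_{i}" for i in range(3)],
}
stream = mixed_batches(datasets, weights={"robot_type_a": 0.7, "robot_type_b": 0.3})
print(next(stream))
```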

Towards more helpful robotics

Just like we open-sourced our Transformer research when we first developed it, we’ll be open-sourcing RT-1 to promote further research in the robotics space. This is an early step towards robot learning systems that may be able to handle the near-infinite variability of human-centered environments. We hope that together we can continue to advance robot learning in a way that will supercharge the more helpful robots of the future.