AI Atlas: How RT-2 Adds a New Layer to Robotics

AI breakthroughs, concepts, and techniques that are tangibly valuable, specific, and actionable. Written by Glasswing Founder and Managing Partner, Rudina Seseri.

🗺️ What is RT-2?

Created by researchers at Google DeepMind, Robotics Transformer 2 (RT-2) is an AI model that enables robots to perform tasks they were not explicitly trained to perform by incorporating knowledge learned from the web. The model is built on a transformer architecture, which is extremely effective at analyzing language and inferring context. For example, the transformer behind RT-2 enables the model to understand that an apple falls under the word “fruit,” rather than requiring the system to be trained on every possible fruit it might come across. RT-2 also takes in visual data, enabling it to recognize objects in the real world, and directly outputs robotic actions as code-like text instructions that a robot can execute. In other words, the system is able to generalize to new tasks and act in the real world based on natural language commands.

The image above provides an illustration of RT-2 in action. With only natural language and a camera feed as inputs, the system is able to identify objects it has never seen before, break down complicated instructions, and react appropriately.
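To make the idea of actions as code-like instructions more concrete, here is a minimal sketch of how a string of action tokens might be decoded into a robot command. It assumes the encoding described in the RT-2 paper, in which each action is a short sequence of integers (a terminate flag, changes in position and rotation, and a gripper value), each drawn from 256 bins. The class name, function names, field order, and physical ranges below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

NUM_BINS = 256  # bins per action dimension, as described in the RT-2 paper

# Hypothetical physical ranges, chosen only for illustration.
POSITION_RANGE = (-0.1, 0.1)  # metres of end-effector displacement
ROTATION_RANGE = (-0.5, 0.5)  # radians of end-effector rotation
GRIPPER_RANGE = (0.0, 1.0)    # 0 = closed, 1 = fully open


@dataclass
class RobotAction:
    terminate: bool
    delta_position: Tuple[float, float, float]
    delta_rotation: Tuple[float, float, float]
    gripper: float


def bin_to_value(bin_index: int, low: float, high: float) -> float:
    """Map a bin index in [0, NUM_BINS - 1] linearly onto [low, high]."""
    if not 0 <= bin_index < NUM_BINS:
        raise ValueError(f"bin index out of range: {bin_index}")
    return low + (high - low) * bin_index / (NUM_BINS - 1)


def decode_action(text: str) -> RobotAction:
    """Turn a space-separated string of 8 integers into a RobotAction."""
    tokens = [int(t) for t in text.split()]
    if len(tokens) != 8:
        raise ValueError(f"expected 8 action tokens, got {len(tokens)}")
    # Assumed order: terminate flag, 3 position bins, 3 rotation bins, gripper.
    terminate, *motion, gripper = tokens
    position = tuple(bin_to_value(b, *POSITION_RANGE) for b in motion[:3])
    rotation = tuple(bin_to_value(b, *ROTATION_RANGE) for b in motion[3:])
    return RobotAction(
        terminate=bool(terminate),
        delta_position=position,
        delta_rotation=rotation,
        gripper=bin_to_value(gripper, *GRIPPER_RANGE),
    )


# Example: a model output string decoded into a structured command.
action = decode_action("0 128 91 241 5 101 127 255")
print(action)
```

Because the model's output is just text, the same language model that reads an instruction can also "write" the robot's next move; a small decoder like this one is all that sits between the two.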

🤔 What is the significance of RT-2 and what are its limitations?

The innovation behind RT-2 is that adding an AI model as a “brain” enables the system not just to generalize across objects and tasks, but also to break down the steps required to make progress and take actions in the real world. Historically, the largest barrier to general-purpose robotics has been that training is extremely expensive and time-consuming. For comparison, imagine manually guiding a robot around your home, trying to teach it to recognize every single object (e.g., to differentiate between towels, rugs, and sweaters) and to map out the exact steps necessary to complete tasks under any circumstances. This type of training can require billions of data points for a robot to learn a task as simple as cleaning the floor.

At an enterprise level, the logic is the same: training robots to be general-purpose has traditionally been prohibitively expensive, so it has been easier to design specialized robots as tools tailored to simple tasks, such as moving products from A to B or installing components by following rote assembly instructions. Compared to such traditional robotic systems, RT-2 is:

Applicable across contexts: Instead of memorizing every detail of a single situation, RT-2 has the ability to adapt to novel environments and translate context into action.
Scalable: Because it is built on a transformer model, which is very good at recognizing context and can translate both human language and visual data into code, RT-2 can learn from massive amounts of data and apply what it has learned at scale.
Able to perform complex tasks: Using chain-of-thought reasoning, a technique for breaking multi-step problems into intermediate steps, RT-2 is able to respond appropriately to complicated instructions and even complete tasks it has never encountered before.
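
As a rough illustration of the chain-of-thought idea, the RT-2 paper describes having the model first write a short natural-language plan and then the action tokens that carry it out. The sketch below separates those two parts of a model output. The exact "Plan: ... Action: ..." string format, the function name, and the sample output are assumptions made for this example, not the model's actual interface.

```python
import re
from typing import List, Tuple

# Assumed output format: a short natural-language plan, then action tokens.
_PLAN_ACTION = re.compile(
    r"Plan:\s*(?P<plan>.*?)\s*Action:\s*(?P<action>\d+(?:\s+\d+)*)",
    re.DOTALL,
)


def split_plan_and_action(output: str) -> Tuple[str, List[int]]:
    """Split a model output into its plan text and its integer action tokens."""
    match = _PLAN_ACTION.search(output)
    if match is None:
        raise ValueError("output does not contain a 'Plan: ... Action: ...' pair")
    plan = match.group("plan").rstrip(". ")  # drop trailing period and spaces
    tokens = [int(t) for t in match.group("action").split()]
    return plan, tokens


# Hypothetical model output for an instruction such as "I'm hungry".
sample = "Plan: pick up the apple. Action: 1 128 91 241 5 101 127 217"
plan, tokens = split_plan_and_action(sample)
print(plan)
print(tokens)
```

Keeping the plan as readable text has a practical side benefit: an operator can inspect what the robot intended to do before or after the action tokens are executed.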

However, as DeepMind works on expanding the model’s potential in the field of robotics, RT-2 has outstanding limitations worth addressing or mitigating:

Expanded use cases: In the demonstrated proof of concept, RT-2 was applied to a small robotic arm in a controlled setting. It remains to be seen how the model performs against benchmarks in production settings or in larger environments.
Training for new robotic systems: The model outputs instructions in a format specific to the robotic system it was trained on. Because different systems use different sets of instructions, RT-2 will need to be trained for each one individually.
Susceptibility to hallucinations: RT-2 is a transformer-based language and vision model, which means it can produce confident but incorrect outputs and its error rate is not zero. In an industrial use case with real-world consequences, the risk of misidentifying objects or workers may be too great to justify the added flexibility.

🛠️ Applications of RT-2

The pairing of deep learning with robotics opens up substantial opportunities across industries, as the vision of general-purpose robotics moves closer to reality. RT-2’s ability to leverage web data to make decisions based on real objects has valuable potential in several key industries, including:

Supply chain and distribution: In a warehouse, RT-2 could give robotic systems greater context for their actions, such as the ability to follow more specific instructions when moving objects.
Medicine and biotech: With a more nuanced ability to differentiate and appropriately interact with objects, robotics could find greater applicability as assistants for doctors and researchers.
Manufacturing: On assembly lines with multiple products, RT-2 could enable robots to provide even greater value by expanding their use beyond individual tasks.
