
[Octo] General-purpose Robot Trained On A Large Robotics Dataset



3 main points
✔️ Octo's pre-training on 800,000 robot trajectories allows it to solve a wide variety of tasks zero-shot
✔️ Octo's compositional design makes it easy to fine-tune to new inputs and action spaces, making it applicable to a wide range of robot control problems

✔️ Demonstrates high performance, but suggests room for further improvement through data diversification, such as better handling of wrist-camera inputs and stronger language-instruction following

Octo: An Open-Source Generalist Robot Policy
written by Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, Sergey Levine
(Submitted on 20 May 2024 (v1), last revised 26 May 2024 (this version, v2))
Comments: Project website: this https URL

Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

code:
 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

In robotics, policies are usually learned on datasets collected for a specific robot or task. However, this approach requires a large amount of data collection for each task, and the resulting policies yield only limited generalization. Leveraging experience from other robots and tasks promises broader generalization and better performance on downstream tasks, but it requires handling diverse robot morphologies, sensor setups, task specifications, and environments, so building a "general-purpose robot model" that achieves this is considered a very difficult problem.

Against this backdrop, several studies have proposed "robot foundation models" that map directly from robot observations to actions. These models can generalize zero-shot or few-shot to new domains and robots, enabling visuomotor control across different tasks, environments, and robotic systems. For example, the GNM model generalizes across robot navigation scenarios, the RoboCat model handles goal-conditioned tasks across robot embodiments, and the RT-X models perform language-conditioned manipulation.

These models represent an important step toward a "generic robot model," but there is still work to be done. In this paper, the authors design a system to pre-train general-purpose robot policies that can serve a variety of interfaces in downstream robotic applications.

The core of the system is a transformer architecture that maps arbitrary input tokens (created from observations or tasks) to output tokens (translated into actions). This allows various camera configurations and robots to be controlled without additional learning and guided by verbal commands or target images. In addition, new robot configurations can be accommodated by adding appropriate adapters and fine tuning with small data sets and computational cost.

The authors develop Octo, a transformer-based policy pre-trained on 800,000 robot demonstrations from the Open X-Embodiment dataset. Octo is the first generalist robot policy (GRP) that can be effectively fine-tuned to new observations and action spaces, and it is fully open source, including the training pipeline, model checkpoints, and data. While Octo's individual components (transformer backbone, diffusion head, etc.) have been discussed in prior work, their integration into a powerful general-purpose robot policy is new.

Through extensive experiments on nine robots, the system proposed in this paper provides state-of-the-art multi-robot control performance in single- and dual-arm manipulation tasks and serves as an effective initialization for unseen settings with new observations and action spaces. We also carefully study the impact of design decisions during GRP pre-training to evaluate how the choice of data distribution, model architecture, and policy formulation affects the quality of pre-trained GRPs. Evaluation results demonstrate the usefulness of scale and flexibility.

Note that this paper releases all the resources needed to train, use, reproduce, and fine-tune the Octo model: pre-trained Octo checkpoints at 27M and 93M parameters, with support for multiple RGB camera inputs and task specification via language or goal images. Also provided are scripts for fine-tuning these models in new domains, the pre-training pipeline, an optimized data loader, a transformer implementation for multimodal inputs, and tools for monitoring training progress.

Octo Model Design

The paper develops the Octo model, an open-source general-purpose robot policy that can be fine-tuned and adapted to new robots and tasks, and presents its key design decisions, training objectives, training datasets, and infrastructure.

The Octo model is designed with an emphasis on flexibility and scalability. It is a general-purpose, scalable model that supports a variety of robots, sensor configurations, and actions and can be trained on large amounts of data. In particular, it employs natural language instructions, target images, observation history, and diffusion decoding to support multimodal action prediction; Octo can efficiently adapt to new action spaces and robots with different combinations of cameras and sensors. This design makes Octo a flexible and versatile robot policy that can be used in a wide variety of robotics applications and research projects.

The core of Octo is a transformer-based policy consisting of three main parts: first, the input tokenizers, which convert language instructions, goals, and observation sequences into tokens (figure, left); second, the transformer backbone, which processes the tokens and produces embeddings (figure, top); and third, the readout heads, which generate the desired outputs (actions).

Task definitions (e.g., language instructions or goal images) and observations (e.g., wrist or third-person camera streams) are transformed into a common token format using modality-specific tokenizers. Language input is tokenized and passed through a pre-trained transformer to produce a sequence of language-embedding tokens. Image observations and goals pass through a shallow convolution stack and are split into a sequence of flattened patches.
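As a toy illustration of the patch-tokenization step, the sketch below splits an image into a sequence of flattened patch tokens. This is a pure-Python stand-in with no learned convolution; the function name and arguments are illustrative, not the paper's implementation.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into a sequence of flattened
    patches, one token per patch. A simplified stand-in for Octo's
    shallow-conv + patch tokenizer."""
    H, W = len(image), len(image[0])
    assert H % patch_size == 0 and W % patch_size == 0
    tokens = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            patch = []
            for di in range(patch_size):
                for dj in range(patch_size):
                    patch.extend(image[i + di][j + dj])  # flatten pixels + channels
            tokens.append(patch)
    return tokens

# A 4x4 RGB image split into 2x2 patches -> 4 tokens, each of length 2*2*3 = 12
img = [[[0, 0, 0] for _ in range(4)] for _ in range(4)]
tokens = patchify(img, 2)
```

In the real model, each flattened patch would additionally be projected to the transformer's embedding dimension and given a position embedding.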

The unified token sequence is processed by the transformer. This is similar to prior work that learns transformer-based policies over sequences of observations and actions. The Octo transformer's attention pattern is block-wise masked, so that observation tokens attend only to tokens from the same or earlier time steps and to the task tokens. This modular design allows observations and tasks to be added or removed during fine-tuning.
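The block-wise masking rule can be sketched as follows: task tokens form one block, and each time step's observation tokens form another. An observation token may attend to all task tokens and to observation tokens at the same or earlier time steps. This is a minimal sketch of the masking logic only, not Octo's actual implementation.

```python
def build_block_mask(num_task_tokens, tokens_per_step, num_steps):
    """Boolean attention mask (True = query may attend to key).
    Layout: [task tokens][step-0 obs tokens][step-1 obs tokens]...
    Observation tokens attend to task tokens and to observations at the
    same or earlier time steps; task tokens attend only to task tokens."""
    total = num_task_tokens + tokens_per_step * num_steps

    def step_of(idx):
        return -1 if idx < num_task_tokens else (idx - num_task_tokens) // tokens_per_step

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            sq, sk = step_of(q), step_of(k)
            if sk == -1:                 # every token may attend to task tokens
                mask[q][k] = True
            elif sq >= 0 and sk <= sq:   # same or earlier time step only
                mask[q][k] = True
    return mask

# 2 task tokens, 3 observation tokens per step, 2 time steps -> 8x8 mask
mask = build_block_mask(num_task_tokens=2, tokens_per_step=3, num_steps=2)
```

Because the mask depends only on block membership, adding or removing an observation block during fine-tuning just changes the layout, which is what makes the design modular.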

The readout token is similar to BERT's [CLS] token and serves as a compact vector embedding of the observation sequence so far. A lightweight "action head" implementing a diffusion process is applied to the readout token's embedding. This action head predicts a "chunk" of several consecutive actions.
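The idea of diffusion-style action decoding can be sketched as starting from Gaussian noise and iteratively refining it toward an action chunk. The sketch below is a schematic refinement loop, not the paper's actual DDPM noise schedule; `denoise_fn` stands in for the learned network, and the conditioning on the readout-token embedding is omitted.

```python
import random

def sample_action_chunk(denoise_fn, chunk_dim, num_steps=20, seed=0):
    """Schematic diffusion-style action decoding: start from Gaussian noise
    and repeatedly blend toward the denoiser's prediction. The linear
    blending schedule here is illustrative, not Octo's actual sampler."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(chunk_dim)]  # pure noise
    for k in reversed(range(num_steps)):
        pred = denoise_fn(x, k)          # network's denoised estimate
        alpha = k / num_steps            # fraction of current sample to keep
        x = [alpha * xi + (1.0 - alpha) * pi for xi, pi in zip(x, pred)]
    return x

# Toy denoiser that always targets the zero action; the sample converges to it.
actions = sample_action_chunk(lambda x, k: [0.0] * len(x), chunk_dim=4)
```

Predicting a chunk of consecutive actions (rather than one action at a time) reduces compounding errors during execution, which is the motivation the action-chunking literature gives for this output format.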

This design allows new tasks, observation inputs, or action output heads to be added to the model during downstream fine-tuning. New tasks, observations, or loss functions can be accommodated by retaining the pre-trained transformer weights and adding only new position embeddings, lightweight encoders, or head parameters.
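The adaptation recipe amounts to merging freshly initialized modules into the pre-trained parameter set while keeping the backbone intact. The sketch below uses plain dicts; the module names are illustrative, not the released checkpoint's actual parameter keys.

```python
def add_new_modules(pretrained, new_modules):
    """Sketch of Octo-style adaptation for fine-tuning: keep every
    pre-trained transformer weight and attach freshly initialized
    parameters for a new observation encoder or action head.
    Returns the merged parameters and the set of newly added names."""
    params = dict(pretrained)  # reuse all pre-trained weights unchanged
    for name, weights in new_modules.items():
        assert name not in params, "new modules must not clash with the backbone"
        params[name] = weights
    return params, set(new_modules)

# Hypothetical backbone plus two new modules for a fine-tuning setup with
# force-torque observations and joint-position actions.
backbone = {"transformer.block_0": [0.1, 0.2], "language_tokenizer": [0.3]}
params, fresh = add_new_modules(
    backbone,
    {"force_torque_encoder": [0.0], "joint_position_head": [0.0]},
)
```

In practice the `fresh` set is what an optimizer would initialize from scratch, while the backbone weights start from the pre-trained checkpoint (and may be further fine-tuned or kept frozen, depending on the recipe).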

This flexibility is key to making Octo a general-purpose model. Since it is impossible to cover every robot's sensor and action configuration during pre-training, the ability to adapt Octo's inputs and outputs during fine-tuning makes it a versatile tool for the robotics community. While past model designs that fused standard transformer backbones or visual encoders with MLP output heads fixed the type and order of inputs the model expects, switching Octo's observations and tasks requires reinitializing only a small part of the model.

As such, the Octo model is a flexible and versatile robot policy that can demonstrate its capabilities in a wide variety of robotics applications.

The Octo model is trained on 25 datasets from the Open X-Embodiment dataset, a collection of diverse robot learning data. It contains demonstrations of a wide variety of tasks across different robot embodiments and scenes. These datasets differ not only in robot type, but also in sensor types and labels (e.g., with or without a wrist camera, with or without language instructions).

When creating the training mixture, the authors first exclude datasets that do not contain image streams or that do not use delta end-effector control. They also exclude datasets that are too repetitive, have low image resolution, or are overly biased toward a particular task. The remaining datasets are classified as "more diverse" or "less diverse," and the weight of the "more diverse" datasets is doubled during training. The weights of highly repetitive datasets are lowered so the mixture is not biased. Finally, missing camera channels are zero-padded and the gripper action spaces are unified. Note that although this training mixture proved very effective, the authors believe a more detailed analysis of data quality is needed in the future.
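The re-weighting step can be sketched as computing normalized sampling weights over datasets. The boost factor of 2 for "more diverse" datasets follows the description above; everything else (size-proportional base weights, the example dataset names and sizes) is a simplifying assumption, not the paper's exact scheme.

```python
def mixture_weights(dataset_sizes, diverse, diversity_boost=2.0):
    """Compute normalized sampling weights over datasets: start from each
    dataset's size, double the weight of datasets tagged "more diverse"
    (per the article), then normalize so the weights sum to 1.
    A simplified sketch, not the paper's exact weighting scheme."""
    raw = {}
    for name, size in dataset_sizes.items():
        w = float(size)
        if name in diverse:
            w *= diversity_boost  # up-weight diverse data
        raw[name] = w
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

# Hypothetical dataset names and episode counts, for illustration only.
weights = mixture_weights(
    {"bridge": 50_000, "rt1": 70_000, "toto": 1_000},
    diverse={"bridge"},
)
```

A data loader would then sample each training batch's trajectories according to these probabilities rather than uniformly over datasets.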

In this paper, we study two variants, Octo-Small with a ViT-S sized transformer backbone and Octo-Base with a ViT-B sized transformer backbone.

Experiment

Here we experimentally analyze the performance of the Octo model and evaluate its ability to serve as a basic model for a general-purpose robot. The key points of the evaluation are as follows

  • Can Octo control multiple robot forms and solve language and target tasks as is?
  • Are Octo weights suitable for data-efficient fine tuning to new tasks and robots? And are they better than learning from scratch or commonly used pre-trained models?
  • Which design decisions are most important in building Octo as a general-purpose robotic policy?

Octo's capabilities are evaluated through nine robot learning setups at four institutions, as shown in the figure below. Robot setups that match the pre-training data are used to test zero-shot control capability for language and target image tasks. In these setups, all robots are controlled with delta end-effector control actions and the observation space is RGB images.

In addition, Octo is evaluated for data-efficient fine-tuning in new environments and tasks. This includes new observations (e.g., force-torque inputs in "Berkeley Insertion"), new action spaces (e.g., joint position control in "Berkeley Pick-Up"), and new robot embodiments (e.g., "Berkeley Coke" and "Berkeley Bimanual"). Each fine-tuning setup uses approximately 100 in-domain demonstrations and takes less than 5 hours on an NVIDIA A5000 GPU. The evaluation tasks test Octo's ability to interact with diverse objects (e.g., "WidowX BridgeV2"), solve long-horizon tasks (e.g., "Stanford Coffee"), and perform precise manipulation (e.g., "Berkeley Insertion").

Octo's ability to control multiple robots zero-shot is compared to RT-1-X, the best publicly available general-purpose robot policy. Like Octo, RT-1-X is pre-trained on the Open X-Embodiment robotics dataset and is designed for zero-shot control of multiple robots. Octo's zero-shot capability is also compared to RT-2-X, a 55-billion-parameter vision-language model fine-tuned on the Open X-Embodiment dataset to generate robot actions. Note that the RT-X models are trained on a more restricted subset of 350,000 episodes, versus Octo's 800,000. In addition, Octo's performance as a policy initialization for data-efficient fine-tuning is compared to two popular approaches: learning from scratch on target-domain demonstrations and using pre-trained visual representations.

For fine-tuning, the authors found that training a large transformer architecture from scratch leads to early overfitting on small datasets. Instead, better from-scratch results were achieved with a standard policy architecture used in many prior studies: a ResNet visual encoder with FiLM language conditioning, combined with a small transformer action decoder trained with a diffusion objective, totaling 28M parameters, similar to RT-1. This architecture serves as the from-scratch baseline ("ResNet+Transformer Scratch"). They also compare against pre-trained visual representations following the procedure of Majumdar et al.: a ViT-B visual encoder is initialized with VC-1 weights, a state-of-the-art visual representation pre-trained on egocentric video and ImageNet, and combined with an MLP action decoder. The entire model is trained to predict expert actions with an MSE loss.

The figure below compares the zero-shot manipulation capabilities of Octo, RT-1-X, and RT-2-X. The evaluation covers several tasks from the pre-training dataset, such as pick-and-place, wiping a table with a cloth, and opening and closing drawers. For each robot, two language tasks were chosen from the corresponding OXE dataset, and each task was attempted 10 times under different initial conditions. The chosen tasks are "in-distribution" for the pre-training data, but require generalization to new object locations, lighting conditions, backgrounds, and distractor objects.

While all methods perform tasks reasonably well in the pre-training environments, Octo averages 29% higher success rates than RT-1-X (35M parameters). Evaluations on the WidowX and RT-1 robots also compared Octo to RT-2-X (55B parameters) and found Octo's performance comparable. Furthermore, RT-1-X and RT-2-X support only language instructions, whereas Octo also supports goal images. Evaluating the WidowX tasks with goal images achieved a 25% higher success rate than with language instructions, possibly because goal images carry more information about what task completion looks like. The BridgeV2 setup provides a detailed analysis of zero-shot ability in new environments, scenes, and skills: the Octo model achieves a high success rate on new objects, but slightly lower performance on new scenes and new behaviors (e.g., flips and precise insertions).

The results for data-efficient fine-tuning in new domains are shown in the table below: fine-tuning Octo yields better policies than either learning from scratch or using pre-trained VC-1 weights. Averaged across the six evaluation setups, Octo outperforms the next best baseline by 52%, indicating that Octo is a strong default initialization.

Octo has also been shown to support new observations (e.g., force-torque input in "Berkeley Insertion"), new action spaces (e.g., joint position control in "Berkeley Pick-Up"), and new robot forms (e.g., "Berkeley Coke" and "Berkeley Bimanual"). This confirms that Octo has the flexibility to go beyond a single camera input and end-effector position control and is applicable to a wide range of single and dual arm robotic manipulation problems.

Summary

Octo is a large transformer-based policy pre-trained on the largest robot manipulation dataset to date (800,000 robot trajectories). The paper demonstrates that Octo can solve a wide variety of tasks zero-shot, and that its compositional design allows fine-tuning to new inputs and action spaces, making it a flexible initialization for a wide range of robot control problems. In addition, training and fine-tuning code and tools are released to assist training on large robot datasets.

While Octo shows strong performance in zero-shot and fine-tuning evaluations, the current model also has several shortcomings, primarily due to characteristics of the training data. First, the current Octo model struggles to use wrist-camera information: fine-tuning results were often better with only a third-person camera than when combined with a wrist camera. There were also significant performance differences between language-conditioned and goal-conditioned policies. Both issues may stem from the scarcity of each modality in the training data: only 27% of the data contains wrist-camera information, and 56% contains language annotations.

Expanding the data used to train Octo is a natural direction for improvement. Since the Open X-Embodiment dataset consists of optimal robot demonstrations, the current model learns by imitation; future research should consider learning from sub-optimal or online interaction data and alternative training objectives. Furthermore, while Octo is currently trained and evaluated only on single- and dual-arm manipulators, extending it to a broader set of robots, including navigation and mobile manipulation, has great potential.

While Octo represents a step toward a general-purpose robot policy that works out of the box with a wide variety of robot setups, much work remains to improve the model, for example better language conditioning, better wrist-camera support, and incorporation of sub-optimal demonstration data. Octo gives researchers and practitioners access to a large robotics dataset and a simple way to leverage pre-trained robot models to learn new tasks efficiently; it is hoped that it will serve as a starting point for broad generality in the future.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
