In-Context Imitation Learning via Next-Token Prediction
Abstract
We explore how to enhance next-token prediction models to perform in-context imitation learning on a real robot, where the robot executes new tasks by interpreting contextual information provided during the input phase, without updating its underlying policy parameters. We propose In-Context Robot Transformer (ICRT), a causal transformer that performs autoregressive prediction on sensorimotor trajectories without relying on any linguistic data or reward function. This formulation enables flexible and training-free execution of new tasks at test time, achieved by prompting the model with sensorimotor trajectories of the new task, composed of image observation, action, and state tuples, collected through human teleoperation. Experiments with a Franka Emika robot demonstrate that ICRT can adapt to new tasks specified by prompts, even in environment configurations that differ from both the prompt and the training data. In a multi-task environment setup, ICRT significantly outperforms current state-of-the-art next-token prediction models in robotics at generalizing to unseen tasks. Code, checkpoints, and data are available at https://icrt.dev/
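The abstract sketches the core recipe: interleave image, state, and action tokens from robot trajectories and train a causal transformer to predict actions with a next-token objective. Below is a minimal, illustrative PyTorch sketch of that sequence format and loss. The class name `ICRTPolicy`, the use of pre-extracted image features, and the specific layer sizes are assumptions made for illustration, not the actual ICRT architecture or released code.

```python
# Hypothetical sketch of an interleaved (image, state, action) sequence and a
# next-action prediction objective, in the spirit of the abstract above.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class ICRTPolicy(nn.Module):
    """Causal transformer over interleaved (image, state, action) tokens."""

    def __init__(self, img_dim=512, state_dim=8, action_dim=7, d_model=256):
        super().__init__()
        # Project each modality into a shared token space.
        self.img_proj = nn.Linear(img_dim, d_model)      # pre-extracted image features
        self.state_proj = nn.Linear(state_dim, d_model)  # proprioceptive state
        self.action_proj = nn.Linear(action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, img_feats, states, actions):
        # img_feats: (B, T, img_dim), states: (B, T, state_dim), actions: (B, T, action_dim)
        B, T, _ = img_feats.shape
        tokens = torch.stack(
            [self.img_proj(img_feats), self.state_proj(states), self.action_proj(actions)],
            dim=2,
        ).reshape(B, 3 * T, -1)  # interleave as [img_1, state_1, act_1, img_2, ...]
        causal_mask = nn.Transformer.generate_square_subsequent_mask(3 * T)
        h = self.backbone(tokens, mask=causal_mask)
        # Predict action t from the state token at step t (the token right before act_t),
        # which under the causal mask cannot see act_t itself.
        return self.action_head(h[:, 1::3, :])


# Toy training step on a random "concatenated trajectory": L2 loss on predicted actions.
model = ICRTPolicy()
img_feats, states, actions = torch.randn(2, 16, 512), torch.randn(2, 16, 8), torch.randn(2, 16, 7)
pred = model(img_feats, states, actions)
loss = ((pred - actions) ** 2).mean()
loss.backward()
```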
Community
TL;DR: We approach in-context, multi-task imitation learning on a physical robot as a next-token prediction problem. We train a causal transformer on concatenated robot trajectories. At test time, the model can execute a new task in a different environment configuration without fine-tuning, simply by prompting it with raw robot trajectories of the new task collected via human teleoperation; a rough sketch of this prompting loop follows below.
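As a concrete, hypothetical illustration of the test-time procedure described above, the loop below conditions a policy with the toy `ICRTPolicy` interface from the sketch under the abstract on concatenated teleoperated prompt trajectories and rolls it out step by step. `get_image_features`, `get_robot_state`, and `send_action` are placeholder hooks standing in for a real perception and control stack; none of this is the released ICRT code.

```python
# Hypothetical test-time prompting loop (not the actual ICRT inference code).
import torch


@torch.no_grad()
def rollout_with_prompt(policy, prompt, get_image_features, get_robot_state,
                        send_action, horizon=200):
    """Execute a new task by conditioning on teleoperated prompt trajectories.

    prompt: dict with "img_feats" (1, Tp, D_img), "states" (1, Tp, D_state),
            "actions" (1, Tp, D_act), concatenated from one or more human
            demonstrations of the new task.
    """
    img_feats = prompt["img_feats"].clone()
    states = prompt["states"].clone()
    actions = prompt["actions"].clone()
    d_act = actions.shape[-1]

    for _ in range(horizon):
        # Append the current observation; pad the not-yet-known action with zeros
        # so the sequence stays aligned (the causal mask keeps it from leaking).
        img_feats = torch.cat([img_feats, get_image_features()], dim=1)  # (1, 1, D_img)
        states = torch.cat([states, get_robot_state()], dim=1)           # (1, 1, D_state)
        actions = torch.cat([actions, torch.zeros(1, 1, d_act)], dim=1)

        # The prediction at the final step is the action for the current observation.
        pred = policy(img_feats, states, actions)
        action = pred[:, -1, :]
        send_action(action)

        # Record the executed action so it becomes context for the next step.
        actions[:, -1, :] = action
    return actions
```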
Website: https://icrt.dev/
Code, checkpoints, dataset: https://github.com/Max-Fu/icrt
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Actra: Optimized Transformer Architecture for Vision-Language-Action Models in Robot Learning (2024)
- Robotic Control via Embodied Chain-of-Thought Reasoning (2024)
- Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts (2024)
- GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy (2024)
- Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals (2024)