---
library_name: custom
tags:
- robotics
- diffusion
- mixture-of-experts
- multi-modal
license: mit
datasets:
- CALVIN
languages:
- en
pipeline_tag: robotics
---
# MoDE (Mixture of Denoising Experts) Diffusion Policy

## Model Description

<div style="text-align: center">
    <img src="MoDE_Figure_1.png" width="800px"/>
</div>

- [Github Link](https://github.com/intuitive-robots/MoDE_Diffusion_Policy) 
- [Project Page](https://mbreuss.github.io/MoDE_Diffusion_Policy/) 

This model implements a Mixture of Diffusion Experts architecture for robotic manipulation, combining transformer-based backbone with noise-only expert routing. For faster inference, we can precache the chosen expert for each timestep to reduce computation time.

The model has been pretrained on a subset of OXE for 300k steps and finetuned for downstream tasks on the CALVIN/LIBERO dataset.

## Model Details

### Architecture
- **Base Architecture**: MoDE with custom Mixture of Experts Transformer
- **Vision Encoder**: ResNet-50 with FiLM conditioning finetuned from ImageNet
- **EMA**: Enabled
- **Action Window Size**: 10
- **Sampling Steps**: 5 (optimal for performance)
- **Sampler Type**: DDIM

### Input/Output Specifications

#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings

#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions

## Usage

```python
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image
    }
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)
```

## Training Details

### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 0.0001
- **Weight Decay**: 0.05

## License
This model is released under the MIT license.