SmolVLA — SO-101 Space Decluttering

A SmolVLA policy fine-tuned on the SO-101 Space Decluttering Dataset v1 for language-conditioned pick-and-place decluttering on a 6-DoF SO-101 robotic arm. Trained with LeRobot.

Training Details

Policy: SmolVLA (Vision-Language-Action)
Steps: 20,000
Robot: SO-101 6-DoF arm, leader-follower teleoperation setup
Cameras: Dual-view (fixed top view plus wrist-mounted egocentric view)
Framework: LeRobot
Language conditioning: Task descriptions passed as natural language instructions
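At inference time, the instruction is supplied alongside the camera images and joint state. The sketch below assembles one such frame as a plain Python dict. The two image keys come from this card. The `observation.state` and `task` key names follow common LeRobot conventions but are assumptions here. The helper name, the 6-value state length, the placeholder image values, and the example instructions are also hypothetical, not this model's confirmed input spec.

```python
from typing import Any, Sequence

# Image keys as listed in this card; "observation.state" and "task"
# are assumed LeRobot-style key names, not confirmed by this card.
TOP_KEY = "observation.images.topview"
WRIST_KEY = "observation.images.wristview"
STATE_KEY = "observation.state"
TASK_KEY = "task"

# Assumption: one joint reading per motor of the 6-DoF arm.
STATE_DIM = 6


def build_observation(
    top_image: Any,
    wrist_image: Any,
    joint_state: Sequence[float],
    instruction: str,
) -> dict[str, Any]:
    """Hypothetical helper: bundle both camera views, the joint state,
    and a natural-language instruction into one frame."""
    if len(joint_state) != STATE_DIM:
        raise ValueError(f"expected {STATE_DIM} joint values, got {len(joint_state)}")
    instruction = instruction.strip()
    if not instruction:
        raise ValueError("instruction must be a non-empty string")
    return {
        TOP_KEY: top_image,
        WRIST_KEY: wrist_image,
        STATE_KEY: [float(x) for x in joint_state],
        TASK_KEY: instruction,
    }


# Same scene, two made-up instructions: the frames differ only in "task".
scene_state = [0.0, -1.2, 1.5, 0.3, 0.0, 0.8]
frame_a = build_observation("top.png", "wrist.png", scene_state, "Put the red block in the bin.")
frame_b = build_observation("top.png", "wrist.png", scene_state, "Move the cup to the left side.")
```

Keeping the instruction in its own string field, separate from the images, lets the same observation be paired with different commands.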

Dataset

Trained on ShubhamK32/so101_declutter_v1, a multi-view teleoperation dataset in which spatial distractors are injected to discourage the policy from learning visual shortcuts.

Usage

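# Import path for recent LeRobot releases; older releases placed it under lerobot.common.policies.
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Download the fine-tuned weights from the Hugging Face Hub and load them.
policy = SmolVLAPolicy.from_pretrained("ShubhamK32/smolvla_so101_declutter")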
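After loading, LeRobot policies are typically driven in a closed loop. You call `reset()` at the start of each episode, then `select_action()` once per control step. SmolVLA is an action-chunking policy: it predicts several future actions at once and returns them one per call. This is why `reset()` matters, since it discards any actions still queued from the previous episode.

The sketch below mimics that pattern without loading the real model. `StubChunkPolicy`, `read_frame`, and the chunk size of 4 are all hypothetical and illustrative. With the real policy, frames must also be converted to batched tensors and, depending on the LeRobot version, passed through the policy's pre- and post-processing steps.

```python
from collections import deque
from typing import Any


class StubChunkPolicy:
    """Stand-in for a chunking policy: predicts `chunk_size` actions at once
    and returns them one per `select_action` call. Not the real SmolVLA."""

    def __init__(self, chunk_size: int = 4, action_dim: int = 6) -> None:
        self.chunk_size = chunk_size
        self.action_dim = action_dim
        self.inference_calls = 0
        self._queue = deque()

    def reset(self) -> None:
        # Drop leftover actions so a new episode starts from fresh predictions.
        self._queue.clear()

    def _predict_chunk(self, frame: dict[str, Any]) -> list[list[float]]:
        # Placeholder "model" that emits zero actions; a real policy would
        # run its vision-language backbone and action head here.
        self.inference_calls += 1
        return [[0.0] * self.action_dim for _ in range(self.chunk_size)]

    def select_action(self, frame: dict[str, Any]) -> list[float]:
        # Only run inference when the queued chunk has been used up.
        if not self._queue:
            self._queue.extend(self._predict_chunk(frame))
        return self._queue.popleft()


def read_frame(step: int) -> dict[str, Any]:
    # Hypothetical sensor read; a real loop would grab both camera images
    # and the follower arm's joint positions here.
    return {"step": step, "task": "Clear the table."}


def run_episode(policy: StubChunkPolicy, num_steps: int) -> list[list[float]]:
    policy.reset()
    actions = []
    for step in range(num_steps):
        action = policy.select_action(read_frame(step))
        actions.append(action)  # a real loop would send this to the robot
    return actions


policy = StubChunkPolicy(chunk_size=4)
actions = run_episode(policy, num_steps=10)
```

With a chunk size of 4, ten steps trigger only three inference calls. Without the `reset()` call, a second episode would begin by replaying the two actions left over from the first.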

Camera Views

observation.images.topview: fixed overhead camera. Better suited to unoccluded pick-and-place tasks.
observation.images.wristview: egocentric wrist-mounted camera. Better suited to scenes with overlapping objects and clutter.
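The policy was trained with the two image keys above. A robot setup whose cameras carry other names will therefore likely need its observations renamed before inference. Below is a minimal sketch that renames keys and confirms both expected views are present. The helper is illustrative, not part of LeRobot, and the local camera names `overhead` and `gripper` are hypothetical.

```python
from typing import Any, Mapping

# Image keys this policy was trained on (from this card).
EXPECTED_IMAGE_KEYS = ("observation.images.topview", "observation.images.wristview")


def remap_camera_keys(frame: Mapping[str, Any], rename: Mapping[str, str]) -> dict[str, Any]:
    """Return a copy of `frame` with keys renamed via `rename`, then check
    that both expected camera views are present."""
    remapped = {rename.get(key, key): value for key, value in frame.items()}
    missing = [key for key in EXPECTED_IMAGE_KEYS if key not in remapped]
    if missing:
        raise KeyError(f"missing camera views: {missing}")
    return remapped


# Hypothetical camera names from a local robot config.
local_frame = {
    "observation.images.overhead": "overhead.png",
    "observation.images.gripper": "gripper.png",
    "observation.state": [0.0] * 6,
}
renamed = remap_camera_keys(
    local_frame,
    {
        "observation.images.overhead": "observation.images.topview",
        "observation.images.gripper": "observation.images.wristview",
    },
)
```

Failing loudly on a missing view is a deliberate choice. Silently feeding only one camera to a policy trained on both would be harder to diagnose.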