OpenTrackVLA
Visual Navigation & Following for Everyone.
OpenTrackVLA is a fully open-source Vision-Language-Action (VLA) stack that turns monocular video and natural-language instructions into actionable, short-horizon waypoints.
While we explore massive backbones (8B/30B) internally, this repository is dedicated to democratizing embodied AI. We have intentionally released our highly efficient 0.6B checkpoint along with the full training pipeline.
Why OpenTrackVLA?
- Fully Open Source: We release the model weights, inference code, and the training stack, not just the inference wrapper.
- Accessible: Designed so you can reproduce, fine-tune, and deploy it on affordable compute.
- Multimodal Control: Combines learned priors with visual input to guide real or simulated robots via simple text prompts.
Acknowledgment: OpenTrackVLA builds on the ideas introduced by the original TrackVLA project. Their partially-open release inspired this community-driven effort to keep the ecosystem open so researchers and developers can continue improving the stack together.
Demo In Action
The system processes video history and text instructions to predict future waypoints. Below are examples of the tracker in action:
This directory contains the HuggingFace-friendly export of the OpenTrackVLA planner.
Full project (code, datasets, training pipeline): https://github.com/om-ai-lab/OpenTrackVLA
Downloading from HuggingFace
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("omlab/opentrackvla-qwen06b").eval()
```
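The actual inference entry point is defined by the model code that ships with this export. Purely as an illustration, a call might look like the sketch below; the predict_waypoints method, its arguments, and the output format are assumptions, not the published API:

```python
# Hypothetical usage sketch: predict_waypoints and its signature are assumed
# for illustration; consult the export's model code for the real interface.
import numpy as np
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("omlab/opentrackvla-qwen06b").eval()

# A short history of monocular RGB frames (T, H, W, 3) plus a text instruction.
frames = np.zeros((8, 224, 224, 3), dtype=np.uint8)   # placeholder video history
instruction = "follow the person in the red jacket"

with torch.no_grad():
    # Assumed entry point returning short-horizon waypoints for the robot.
    waypoints = model.predict_waypoints(frames, instruction)

print(waypoints)
```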
Habitat evaluation using this export
- OpenTrackVLA GitHub Repository: https://github.com/om-ai-lab/OpenTrackVLA
- Full Project Documentation
trained_agent.py prefers HuggingFace weights when either env var is set:
- HF_MODEL_DIR=/abs/path/to/open_trackvla_hf (already downloaded)
- HF_MODEL_ID=omlab/opentrackvla-qwen06b (auto-download via huggingface_hub)
Example:
```bash
HF_MODEL_ID=omlab/opentrackvla-qwen06b bash eval.sh
```
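For illustration only, the precedence described above can be sketched roughly as follows; the actual logic lives in trained_agent.py and may differ in its details:

```python
# Rough sketch of the env-var precedence described above; not the actual
# trained_agent.py implementation.
import os
from huggingface_hub import snapshot_download

def resolve_model_dir() -> str:
    local_dir = os.environ.get("HF_MODEL_DIR")
    if local_dir:
        # Already-downloaded export: use the local directory directly.
        return local_dir
    # Otherwise auto-download the repo named by HF_MODEL_ID from the Hub.
    model_id = os.environ.get("HF_MODEL_ID", "omlab/opentrackvla-qwen06b")
    return snapshot_download(repo_id=model_id)

print(resolve_model_dir())
```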