---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: transformers
pipeline_tag: image-to-text
tags:
- multimodal
- video-understanding
- spatial-reasoning
- vision-language
datasets:
- nyu-visionx/VSI-590K
model-index:
- name: Cambrian-S-7B
  results:
  - task:
      type: visual-question-answering
      name: VSI-Bench
    dataset:
      type: vsi-bench
      name: VSI-Bench
    metrics:
    - type: accuracy
      name: accuracy
      value: 67.5
  - task:
      type: visual-question-answering
      name: TOMATO
    dataset:
      type: tomato
      name: TOMATO
    metrics:
    - type: accuracy
      name: accuracy
      value: 27.0
  - task:
      type: visual-question-answering
      name: HourVideo
    dataset:
      type: hourvideo
      name: HourVideo
    metrics:
    - type: accuracy
      name: accuracy
      value: 36.5
  - task:
      type: visual-question-answering
      name: EgoSchema
    dataset:
      type: egoschema
      name: EgoSchema
    metrics:
    - type: accuracy
      name: accuracy
      value: 76.8
  - task:
      type: visual-question-answering
      name: Perception Test
    dataset:
      type: perception-test
      name: Perception Test
    metrics:
    - type: accuracy
      name: accuracy
      value: 69.9
  - task:
      type: visual-question-answering
      name: VideoMME
    dataset:
      type: videomme
      name: VideoMME
    metrics:
    - type: accuracy
      name: accuracy
      value: 63.4
  - task:
      type: visual-question-answering
      name: MVBench
    dataset:
      type: mvbench
      name: MVBench
    metrics:
    - type: accuracy
      name: accuracy
      value: 64.5
  - task:
      type: visual-question-answering
      name: LongVideoBench
    dataset:
      type: longvideobench
      name: LongVideoBench
    metrics:
    - type: accuracy
      name: accuracy
      value: 59.4
  - task:
      type: visual-question-answering
      name: VideoMMMU
    dataset:
      type: videommmu
      name: VideoMMMU
    metrics:
    - type: accuracy
      name: accuracy
      value: 38.6
  - task:
      type: visual-question-answering
      name: MMVP
    dataset:
      type: mmvp
      name: MMVP
    metrics:
    - type: accuracy
      name: accuracy
      value: 60.0
  - task:
      type: visual-question-answering
      name: 3DSR
    dataset:
      type: 3dsr
      name: 3DSR
    metrics:
    - type: accuracy
      name: accuracy
      value: 54.8
  - task:
      type: visual-question-answering
      name: CV-Bench
    dataset:
      type: cv-bench
      name: CV-Bench
    metrics:
    - type: accuracy
      name: accuracy
      value: 76.9
language:
- en
---
# Cambrian-S-7B
**[Website](https://cambrian-mllm.github.io/cambrian-s/)** | **[Paper](https://arxiv.org/abs/2511.04670)** | **[GitHub](https://github.com/cambrian-mllm/cambrian-s)** | **[Cambrian-S Family](https://huggingface.co/collections/nyu-visionx/cambrian-s-models)**
**Authors**: [Shusheng Yang*](https://github.com/vealocia), [Jihan Yang*](https://jihanyang.github.io/), [Pinzhi Huang†](https://pinzhihuang.github.io/), [Ellis Brown†](https://ellisbrown.github.io/), et al.
Cambrian-S-7B is a spatially grounded multimodal large language model built for spatial reasoning in video understanding. It achieves state-of-the-art results on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
## Model Details
- **Architecture**: Qwen2.5-7B-Instruct LLM + SigLIP2-SO400M vision encoder + 2-layer MLP adapter (sketched below)
- **Parameters**: 7B language backbone (the SO400M encoder adds ~0.4B)
- **Vision Encoder**: SigLIP2-SO400M at 384×384 input resolution
- **Training**: 4-stage pipeline (image alignment → image instruction tuning → video instruction tuning → spatial instruction tuning)
- **Training Data**: [VSI-590K](https://huggingface.co/datasets/nyu-visionx/VSI-590K) for spatial reasoning, plus general video instruction data
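The 2-layer MLP adapter projects vision-encoder features into the LLM's embedding space. The sketch below illustrates the idea; the hidden sizes come from the public SigLIP2-SO400M (1152) and Qwen2.5-7B (3584) configs, while the `MLPAdapter` name, GELU activation, and exact layout are illustrative assumptions, not the released module.
```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Illustrative 2-layer MLP projector (hypothetical layout, not the released code)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # SigLIP2-SO400M width -> Qwen2.5-7B width
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1152) -> (batch, num_patches, 3584)
        return self.proj(vision_feats)
```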
## Usage
Cambrian-S is loaded through its own codebase (a LLaVA-style API) rather than plain `transformers`; install it from the [GitHub repo](https://github.com/cambrian-mllm/cambrian-s) first. A minimal single-image example (helper names follow the Cambrian/LLaVA-style API; check the repo for the exact entry points):
```python
from PIL import Image
from cambrian.constants import IMAGE_TOKEN_INDEX
from cambrian.conversation import conv_templates
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.model.builder import load_pretrained_model

model_path = "nyu-visionx/Cambrian-S-7B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-7b", device_map="cuda")

# Build a prompt containing the <image> placeholder token
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess the image and tokenize the prompt
image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=model.dtype)
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

# Generate
output_ids = model.generate(input_ids, images=image_tensor, image_sizes=[image.size], max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
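For video, a common pattern with LLaVA-style codebases is to sample a fixed number of frames and feed them through the same image pipeline. The official video interface lives in the GitHub repo, so treat the following as a sketch: `decord`, the 32-frame count, and `example.mp4` are assumptions, and it reuses `process_images`, `image_processor`, and `model` from the snippet above.
```python
import numpy as np
from PIL import Image
from decord import VideoReader  # pip install decord

# Uniformly sample 32 frames across the clip
vr = VideoReader("example.mp4")
indices = np.linspace(0, len(vr) - 1, num=32, dtype=int)
frames = [Image.fromarray(vr[int(i)].asnumpy()) for i in indices]

# Reuse the image preprocessing; pass the result to model.generate via `images`
video_tensor = process_images(frames, image_processor, model.config).to(model.device, dtype=model.dtype)
```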
## Citation
```bibtex
@article{yang2025cambrian,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and others},
  journal={arXiv preprint arXiv:2511.04670},
  year={2025}
}
``` |