---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: transformers
pipeline_tag: image-to-text
tags:
- multimodal
- video-understanding
- spatial-reasoning
- vision-language
datasets:
- nyu-visionx/VSI-590K
model-index:
- name: Cambrian-S-7B
  results:
  - task:
      type: visual-question-answering
      name: VSI-Bench
    dataset:
      type: vsi-bench
      name: VSI-Bench
    metrics:
      - type: accuracy
        name: accuracy
        value: 67.5
  - task:
      type: visual-question-answering
      name: Tomato
    dataset:
      type: Tomato
      name: Tomato
    metrics:
      - type: accuracy
        name: accuracy
        value: 27.0
  - task:
      type: visual-question-answering
      name: HourVideo
    dataset:
      type: hourvideo
      name: HourVideo
    metrics:
      - type: accuracy
        name: accuracy
        value: 36.5
  - task:
      type: visual-question-answering
      name: EgoSchema
    dataset:
      type: egoschema
      name: EgoSchema
    metrics:
      - type: accuracy
        name: accuracy
        value: 76.8
  - task:
      type: visual-question-answering
      name: Perception Test
    dataset:
      type: perception-test
      name: Perception Test
    metrics:
      - type: accuracy
        name: accuracy
        value: 69.9
  - task:
      type: visual-question-answering
      name: VideoMME
    dataset:
      type: videomme
      name: VideoMME
    metrics:
      - type: accuracy
        name: accuracy
        value: 63.4
  - task:
      type: visual-question-answering
      name: MVBench
    dataset:
      type: mvbench
      name: MVBench
    metrics:
      - type: accuracy
        name: accuracy
        value: 64.5
  - task:
      type: visual-question-answering
      name: LongVideoBench
    dataset:
      type: longvideobench
      name: LongVideoBench
    metrics:
      - type: accuracy
        name: accuracy
        value: 59.4
  - task:
      type: visual-question-answering
      name: VideoMMMU
    dataset:
      type: videommmu
      name: VideoMMMU
    metrics:
      - type: accuracy
        name: accuracy
        value: 38.6
  - task:
      type: visual-question-answering
      name: MMVP
    dataset:
      type: mmvp
      name: MMVP
    metrics:
      - type: accuracy
        name: accuracy
        value: 60.0
  - task:
      type: visual-question-answering
      name: 3DSR
    dataset:
      type: 3dsr
      name: 3DSR
    metrics:
      - type: accuracy
        name: accuracy
        value: 54.8
  - task:
      type: visual-question-answering
      name: CV-Bench
    dataset:
      type: cv-bench
      name: CV-Bench
    metrics:
      - type: accuracy
        name: accuracy
        value: 76.9
language:
- en
---


# Cambrian-S-7B

**[Website](https://cambrian-mllm.github.io/cambrian-s/)** | **[Paper](https://arxiv.org/abs/2511.04670)** | **[GitHub](https://github.com/cambrian-mllm/cambrian-s)** | **[Cambrian-S Family](https://huggingface.co/collections/nyu-visionx/cambrian-s-models)**

**Authors**: [Shusheng Yang*](https://github.com/vealocia), [Jihan Yang*](https://jihanyang.github.io/), [Pinzhi Huang†](https://pinzhihuang.github.io/), [Ellis Brown†](https://ellisbrown.github.io/), et al.

Cambrian-S-7B is a spatially-grounded multimodal large language model that excels at spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while maintaining competitive performance on general video understanding tasks.

## Model Details

- **Architecture**: Qwen2.5-7B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter (sketched below)
- **Parameters**: 7B
- **Vision Encoder**: SigLIP2-SO400M (384×384 input resolution)
- **Training**: 4-stage pipeline (image alignment → image instruction tuning → video instruction tuning → spatial instruction tuning)
- **Training Data**: [VSI-590K](https://huggingface.co/datasets/nyu-visionx/VSI-590K) (spatial reasoning) plus general video instruction data
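
The adapter is a standard two-layer MLP projector in the LLaVA style. A minimal sketch, assuming SigLIP2-SO400M's 1152-d patch features and Qwen2.5-7B's 3584-d hidden size (both dimensions are assumptions; check the released config for the exact values):

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Two-layer MLP that projects vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_features)
```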


## Usage

```python
import torch
from PIL import Image

from cambrian.model.builder import load_pretrained_model
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.conversation import conv_templates
from cambrian.constants import IMAGE_TOKEN_INDEX

model_path = "nyu-visionx/Cambrian-S-7B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-7b", device_map="cuda")

# Build a single-image prompt ("example.jpg" is a placeholder path)
image = Image.open("example.jpg").convert("RGB")
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess the image and tokenize the prompt around the <image> token
# (depending on the repo version, image_tensor may need to be moved to the
# model's device/dtype before generation)
image_tensor = process_images([image], image_processor, model.config)
image_sizes = [image.size]
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

# Generate
with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, image_sizes=image_sizes, max_new_tokens=256)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```
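
For video, the model consumes sampled frames. A hedged sketch of uniform frame sampling, assuming frames flow through the same `process_images` path as single images (see the GitHub repo for the exact video utilities):

```python
import numpy as np
from decord import VideoReader  # pip install decord
from PIL import Image

# Uniformly sample 32 frames ("example.mp4" is a placeholder path)
vr = VideoReader("example.mp4")
idx = np.linspace(0, len(vr) - 1, num=32, dtype=int)
frames = [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]
video_tensor = process_images(frames, image_processor, model.config)
```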

## Citation

```bibtex
@article{yang2025cambrian,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and others},
  journal={arXiv preprint arXiv:2511.04670},
  year={2025}
}
```