---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: transformers
pipeline_tag: image-to-text
tags:
- multimodal
- video-understanding
- spatial-reasoning
- vision-language
datasets:
- nyu-visionx/VSI-590K
model-index:
- name: Cambrian-S-7B
  results:
  - task:
      type: visual-question-answering
      name: VSI-Bench
    dataset:
      type: vsi-bench
      name: VSI-Bench
    metrics:
    - type: accuracy
      name: accuracy
      value: 67.5
  - task:
      type: visual-question-answering
      name: TOMATO
    dataset:
      type: tomato
      name: TOMATO
    metrics:
    - type: accuracy
      name: accuracy
      value: 27.0
  - task:
      type: visual-question-answering
      name: HourVideo
    dataset:
      type: hourvideo
      name: HourVideo
    metrics:
    - type: accuracy
      name: accuracy
      value: 36.5
  - task:
      type: visual-question-answering
      name: EgoSchema
    dataset:
      type: egoschema
      name: EgoSchema
    metrics:
    - type: accuracy
      name: accuracy
      value: 76.8
  - task:
      type: visual-question-answering
      name: Perception Test
    dataset:
      type: perception-test
      name: Perception Test
    metrics:
    - type: accuracy
      name: accuracy
      value: 69.9
  - task:
      type: visual-question-answering
      name: VideoMME
    dataset:
      type: videomme
      name: VideoMME
    metrics:
    - type: accuracy
      name: accuracy
      value: 63.4
  - task:
      type: visual-question-answering
      name: MVBench
    dataset:
      type: mvbench
      name: MVBench
    metrics:
    - type: accuracy
      name: accuracy
      value: 64.5
  - task:
      type: visual-question-answering
      name: LongVideoBench
    dataset:
      type: longvideobench
      name: LongVideoBench
    metrics:
    - type: accuracy
      name: accuracy
      value: 59.4
  - task:
      type: visual-question-answering
      name: VideoMMMU
    dataset:
      type: videommmu
      name: VideoMMMU
    metrics:
    - type: accuracy
      name: accuracy
      value: 38.6
  - task:
      type: visual-question-answering
      name: MMVP
    dataset:
      type: mmvp
      name: MMVP
    metrics:
    - type: accuracy
      name: accuracy
      value: 60.0
  - task:
      type: visual-question-answering
      name: 3DSR
    dataset:
      type: 3dsr
      name: 3DSR
    metrics:
    - type: accuracy
      name: accuracy
      value: 54.8
  - task:
      type: visual-question-answering
      name: CV-Bench
    dataset:
      type: cv-bench
      name: CV-Bench
    metrics:
    - type: accuracy
      name: accuracy
      value: 76.9
language:
- en
---
# Cambrian-S-7B
**[Website](https://cambrian-mllm.github.io/cambrian-s/)** | **[Paper](https://arxiv.org/abs/2511.04670)** | **[GitHub](https://github.com/cambrian-mllm/cambrian-s)** | **[Cambrian-S Family](https://huggingface.co/collections/nyu-visionx/cambrian-s-models)**
**Authors**: [Shusheng Yang*](https://github.com/vealocia), [Jihan Yang*](https://jihanyang.github.io/), [Pinzhi Huang†](https://pinzhihuang.github.io/), [Ellis Brown†](https://ellisbrown.github.io/), et al.
Cambrian-S-7B is a spatially grounded multimodal large language model built for spatial reasoning in video understanding. It achieves state-of-the-art results on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
## Model Details
- **Architecture**: Qwen2.5-7B-Instruct LLM + SigLIP2-SO400M vision encoder + 2-layer MLP adapter (sketched below)
- **Parameters**: 7B language backbone (the SO400M encoder adds ~0.4B)
- **Vision Encoder**: SigLIP2-SO400M at 384×384 input resolution
- **Training**: 4-stage pipeline (image alignment → image instruction tuning → video instruction tuning → spatial instruction tuning)
- **Training Data**: [VSI-590K](https://huggingface.co/datasets/nyu-visionx/VSI-590K) for spatial reasoning, plus general video instruction data
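The 2-layer MLP adapter projects vision-encoder features into the LLM's embedding space. The sketch below illustrates the idea; the hidden sizes come from the public SigLIP2-SO400M (1152) and Qwen2.5-7B (3584) configs, while the `MLPAdapter` name, GELU activation, and exact layout are illustrative assumptions, not the released module.
```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Illustrative 2-layer MLP projector (hypothetical layout, not the released code)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),  # SigLIP2-SO400M width -> Qwen2.5-7B width
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 1152) -> (batch, num_patches, 3584)
        return self.proj(vision_feats)
```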
## Usage
Cambrian-S is loaded through its own codebase (a LLaVA-style API) rather than plain `transformers`; install it from the [GitHub repo](https://github.com/cambrian-mllm/cambrian-s) first. A minimal single-image example (helper names follow the Cambrian/LLaVA-style API; check the repo for the exact entry points):
```python
from PIL import Image
from cambrian.constants import IMAGE_TOKEN_INDEX
from cambrian.conversation import conv_templates
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.model.builder import load_pretrained_model

model_path = "nyu-visionx/Cambrian-S-7B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-7b", device_map="cuda")

# Build a prompt containing the <image> placeholder token
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess the image and tokenize the prompt
image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=model.dtype)
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

# Generate
output_ids = model.generate(input_ids, images=image_tensor, image_sizes=[image.size], max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
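For video, a common pattern with LLaVA-style codebases is to sample a fixed number of frames and feed them through the same image pipeline. The official video interface lives in the GitHub repo, so treat the following as a sketch: `decord`, the 32-frame count, and `example.mp4` are assumptions, and it reuses `process_images`, `image_processor`, and `model` from the snippet above.
```python
import numpy as np
from PIL import Image
from decord import VideoReader  # pip install decord

# Uniformly sample 32 frames across the clip
vr = VideoReader("example.mp4")
indices = np.linspace(0, len(vr) - 1, num=32, dtype=int)
frames = [Image.fromarray(vr[int(i)].asnumpy()) for i in indices]

# Reuse the image preprocessing; pass the result to model.generate via `images`
video_tensor = process_images(frames, image_processor, model.config).to(model.device, dtype=model.dtype)
```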
## Citation
```bibtex
@article{yang2025cambrian,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and others},
  journal={arXiv preprint arXiv:2511.04670},
  year={2025}
}
``` |