---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: transformers
pipeline_tag: image-to-text
tags:
- multimodal
- video-understanding
- spatial-reasoning
- vision-language
datasets:
- nyu-visionx/VSI-590K
model-index:
- name: Cambrian-S-7B
  results:
  - task:
      type: visual-question-answering
      name: VSI-Bench
    dataset:
      type: vsi-bench
      name: VSI-Bench
    metrics:
    - type: accuracy
      name: accuracy
      value: 67.5
  - task:
      type: visual-question-answering
      name: Tomato
    dataset:
      type: Tomato
      name: Tomato
    metrics:
    - type: accuracy
      name: accuracy
      value: 27.0
  - task:
      type: visual-question-answering
      name: HourVideo
    dataset:
      type: hourvideo
      name: HourVideo
    metrics:
    - type: accuracy
      name: accuracy
      value: 36.5
  - task:
      type: visual-question-answering
      name: EgoSchema
    dataset:
      type: egoschema
      name: EgoSchema
    metrics:
    - type: accuracy
      name: accuracy
      value: 76.8
  - task:
      type: visual-question-answering
      name: Perception Test
    dataset:
      type: perception-test
      name: Perception Test
    metrics:
    - type: accuracy
      name: accuracy
      value: 69.9
  - task:
      type: visual-question-answering
      name: VideoMME
    dataset:
      type: videomme
      name: VideoMME
    metrics:
    - type: accuracy
      name: accuracy
      value: 63.4
  - task:
      type: visual-question-answering
      name: MVBench
    dataset:
      type: mvbench
      name: MVBench
    metrics:
    - type: accuracy
      name: accuracy
      value: 64.5
  - task:
      type: visual-question-answering
      name: LongVideoBench
    dataset:
      type: longvideobench
      name: LongVideoBench
    metrics:
    - type: accuracy
      name: accuracy
      value: 59.4
  - task:
      type: visual-question-answering
      name: VideoMMMU
    dataset:
      type: videommmu
      name: VideoMMMU
    metrics:
    - type: accuracy
      name: accuracy
      value: 38.6
  - task:
      type: visual-question-answering
      name: MMVP
    dataset:
      type: mmvp
      name: MMVP
    metrics:
    - type: accuracy
      name: accuracy
      value: 60.0
  - task:
      type: visual-question-answering
      name: 3DSR
    dataset:
      type: 3dsr
      name: 3DSR
    metrics:
    - type: accuracy
      name: accuracy
      value: 54.8
  - task:
      type: visual-question-answering
      name: CV-Bench
    dataset:
      type: cv-bench
      name: CV-Bench
    metrics:
    - type: accuracy
      name: accuracy
      value: 76.9
language:
- en
---

# Cambrian-S-7B

**[Website](https://cambrian-mllm.github.io/cambrian-s/)** | **[Paper](https://arxiv.org/abs/2511.04670)** | **[GitHub](https://github.com/cambrian-mllm/cambrian-s)** | **[Cambrian-S Family](https://huggingface.co/collections/nyu-visionx/cambrian-s-models)**

**Authors**: [Shusheng Yang*](https://github.com/vealocia), [Jihan Yang*](https://jihanyang.github.io/), [Pinzhi Huang†](https://pinzhihuang.github.io/), [Ellis Brown†](https://ellisbrown.github.io/), et al.

Cambrian-S-7B is a spatially grounded multimodal large language model that excels at spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
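The benchmark scores listed in this card's metadata can also be read programmatically. A minimal sketch using `huggingface_hub` (the repo id is taken from this card; field names are those exposed by `ModelCard.load`):

```python
from huggingface_hub import ModelCard

# Load this model card and iterate over the eval results declared in its model-index.
card = ModelCard.load("nyu-visionx/Cambrian-S-7B")
for result in card.data.eval_results or []:
    print(f"{result.dataset_name}: {result.metric_value} ({result.metric_name})")
```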
## Model Details

- **Architecture**: Qwen2.5-7B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter
- **Parameters**: 7B
- **Vision Encoder**: SigLIP2-SO400M (384×384 input)
- **Training**: 4-stage pipeline (image alignment → image instruction tuning → video instruction tuning → spatial instruction tuning)
- **Training Data**: [VSI-590K](https://huggingface.co/datasets/nyu-visionx/VSI-590K) (spatial reasoning) + general video instruction data

## Usage

```python
from PIL import Image

from cambrian.constants import IMAGE_TOKEN_INDEX
from cambrian.conversation import conv_templates
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.model.builder import load_pretrained_model

model_path = "nyu-visionx/Cambrian-S-7B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-7b", device_map="cuda")

# Preprocess an image (or sampled video frames); helper names follow the
# LLaVA-style API the Cambrian codebase builds on -- see the GitHub repo for details.
image = Image.open("path/to/image.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
image_sizes = [image.size]

# Build the prompt; the <image> placeholder marks where visual tokens are inserted
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

# Generate
output_ids = model.generate(input_ids, images=image_tensor, image_sizes=image_sizes, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Citation

```bibtex
@article{yang2025cambrian,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and others},
  journal={arXiv preprint arXiv:2511.04670},
  year={2025}
}
```
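## Video Inference Example

For video inputs, frames are typically sampled uniformly and passed to the model as a sequence of images. The sketch below reuses `model`, `tokenizer`, and `image_processor` from the Usage section, uses `decord` for frame decoding, and treats the video path, frame count, and question as placeholders; the Cambrian-S repository ships its own inference and evaluation scripts, so consult it for the exact video pipeline.

```python
import numpy as np
from decord import VideoReader, cpu
from PIL import Image

from cambrian.constants import IMAGE_TOKEN_INDEX
from cambrian.conversation import conv_templates
from cambrian.mm_utils import process_images, tokenizer_image_token


def sample_frames(video_path: str, num_frames: int = 32) -> list[Image.Image]:
    """Uniformly sample `num_frames` RGB frames from a video file."""
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames, dtype=int)
    return [Image.fromarray(vr[i].asnumpy()) for i in indices]


frames = sample_frames("path/to/video.mp4")  # placeholder path

# Preprocess the frames and ask a spatial question about the whole clip.
frames_tensor = process_images(frames, image_processor, model.config)
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nHow many chairs are in this room?")
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX,
                                  return_tensors="pt").unsqueeze(0).to(model.device)

output_ids = model.generate(input_ids, images=frames_tensor,
                            image_sizes=[f.size for f in frames], max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```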