---
pipeline_tag: text-generation
license: apache-2.0
---

# Model Card for AtomThink-EMOVA-8B

AtomThink-EMOVA-8B is post-trained from EMOVA-8B with the AtomThink framework and is intended for solving complex multimodal mathematical reasoning problems.
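
Below is a minimal inference sketch. It assumes the checkpoint exposes a `transformers`-compatible interface with custom remote code (as EMOVA-based models typically do); the repository id, processor behavior, and prompt wording are illustrative assumptions, not the confirmed API.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Hypothetical repository id; replace with the actual checkpoint path.
MODEL_ID = "AtomThink/AtomThink-EMOVA-8B"

# EMOVA-based checkpoints ship custom modeling code, so trust_remote_code
# is assumed to be required here.
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# A multimodal math problem: one image plus a step-by-step instruction.
image = Image.open("problem.png")
prompt = "Solve the problem in the image step by step and state the final answer."

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```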

# Accuracy comparison with state-of-the-art methods on MathVista and MathVerse

*General*, *Math*, and *MathVista Total* report MathVista accuracy; *TL* (Text Lite), *TD* (Text Dominant), *VI* (Vision Intensive), *VD* (Vision Dominant), *VO* (Vision Only), and *MathVerse Total* report MathVerse accuracy.

| **Model**             | **Inference** | **General** | **Math** | **MathVista Total** | **TL**   | **TD**   | **VI**   | **VD**   | **VO**   | **MathVerse Total** |
|-----------------------|---------------|-------------|----------|-----------|----------|----------|----------|----------|----------|-----------|
| Random Choice         | -             | -           | -        | 17.9      | 12.4     | 12.4     | 12.4     | 12.4     | 12.4     | 12.4      |
| Human                 | -             | -           | -        | -         | 70.9     | 71.2     | 61.4     | 68.3     | 66.7     | 66.7      |
| OpenAI o1             | Slow Think    | -           | -        | 73.9      | -        | -        | -        | -        | -        | -         |
| GPT-4o                | CoT           | -           | -        | 63.8      | -        | -        | -        | -        | -        | -         |
| GPT-4V                | CoT           | -           | -        | 49.9      | 56.6     | 63.1     | 51.4     | 50.8     | 50.3     | 54.4      |
| LLaVA-NeXT-34B        | Direct        | -           | -        | 46.5      | 25.5     | 33.8     | 23.5     | 20.3     | 15.7     | 23.8      |
| InternLM-XComposer2   | Direct        | -           | -        | 57.6      | 17.0     | 22.3     | 15.7     | 16.4     | 11.0     | 16.5      |
| Qwen-VL-Plus          | Direct        | -           | -        | 43.3      | 11.1     | 15.7     | 9.0      | 13.0     | 10.0     | 11.8      |
| LLaVA-1.5-13B         | Direct        | -           | -        | 27.6      | 15.2     | 19.4     | 16.8     | 15.2     | 11.3     | 15.6      |
| G-LLaVA-7B            | Direct        | -           | -        | 53.4      | 20.7     | 20.9     | 17.2     | 14.6     | 9.4      | 16.6      |
| MAVIS-7B              | Direct        | -           | -        | -         | 29.1     | 41.4     | 27.4     | 24.9     | 14.6     | 27.5      |
| LLaVA-Llama3-8B       | Direct        | 34.1        | 25.6     | 29.5      | 16.0     | 19.3     | 16.4     | 13.1     | 15.0     | 15.9      |
| EMOVA-8B-200k         | Direct        | 52.4        | 51.1     | 51.7      | 34.4     | 39.0     | 33.4     | 30.1     | 23.5     | 32.1      |
| EMOVA w/ Formatted    | CoT           | 30.9        | 31.3     | 31.1      | 26.5     | 36.5     | 25.3     | 20.4     | 19.8     | 25.7      |
| AtomThink-EMOVA       | Direct        | 53.9        | 52.4     | 53.1      | 33.6     | 39.0     | 33.8     | 28.0     | 24.4     | 31.8      |
| AtomThink-EMOVA       | Quick Think   | 48.7        | **54.4** | **51.8**  | **36.5** | **42.4** | **34.1** | **32.9** | **29.7** | **35.1**  |
| AtomThink-EMOVA       | Slow Think    | 48.9        | **57.0** | **53.3**  | **42.1** | **51.5** | **39.0** | **36.7** | **33.1** | **40.5**  |
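
The *Direct*, *Quick Think*, and *Slow Think* rows differ only in inference strategy: Direct asks for the answer in one shot, while the AtomThink modes decode atomic reasoning steps one at a time, with Slow Think additionally searching over candidate steps. The sketch below is an illustrative step-by-step decoding loop under assumed prompt and stop-marker conventions, not the exact AtomThink search procedure:

```python
from typing import Callable

def slow_think(
    generate_steps: Callable[[str], list[str]],  # returns candidate next steps
    score_step: Callable[[str, str], float],     # process-reward-style scorer
    question: str,
    max_steps: int = 10,
) -> str:
    """Illustrative slow-thinking loop: at each turn, sample candidate
    atomic steps, keep the highest-scored one, and stop once a final
    answer appears. The real AtomThink search (beam width, reward model
    details) may differ."""
    context = question
    for _ in range(max_steps):
        candidates = generate_steps(context)
        best = max(candidates, key=lambda step: score_step(context, step))
        context += "\n" + best
        if "final answer" in best.lower():  # assumed stop marker
            break
    return context
```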


# Citation
If you use this model in your research, please cite:
```bibtex
@article{xiang2024atomthink,
  title={AtomThink: A Slow Thinking Framework for Multimodal Mathematical Reasoning},
  author={Xiang, Kun and Liu, Zhili and Jiang, Zihao and Nie, Yunshuang and Huang, Runhui and Fan, Haoxiang and Li, Hanhui and Huang, Weiran and Zeng, Yihan and Han, Jianhua and others},
  journal={arXiv preprint arXiv:2411.11930},
  year={2024}
}
@article{chen2024emova,
  title={EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}
```

# License
The checkpoint is released under the Apache 2.0 license. Please ensure proper attribution when using this model.