wchai/AuroraCap-7B-VID-xtuner


	---
	license: apache-2.0
	datasets:
	- wchai/AuroraCap-trainset
	base_model:
	- lmsys/vicuna-7b-v1.5-16k
	tags:
	- caption
	model-index:
	- name: AuroraCap-7B
	  results:
	  - task:
	      type: video detailed caption
	    dataset:
	      type: VDC
	      name: VDC
	    metrics:
	    - type: Acc
	      value: 38.21
	      name: VDCScore
	    - type: Acc
	      value: 48.33
	      name: VDD
	    - type: cider
	      value: 9.51
	    - type: bleu
	      value: 30.9
	      name: bleu@1
	    - type: bleu
	      value: 4.06
	      name: bleu@4
	    - type: meteor
	      value: 19.09
	    - type: rouge
	      value: 21.58
	      name: rouge-l
	  - task:
	      type: video caption
	    dataset:
	      type: MSR-VTT
	      name: MSR-VTT
	    metrics:
	    - type: cider
	      value: 33.1
	    - type: bleu
	      value: 58.6
	      name: bleu@1
	    - type: bleu
	      value: 21
	      name: bleu@4
	    - type: meteor
	      value: 23.9
	    - type: rouge
	      value: 49.5
	      name: rouge-l
	  - task:
	      type: video caption
	    dataset:
	      type: VATEX
	      name: VATEX
	    metrics:
	    - type: cider
	      value: 33.8
	    - type: bleu
	      value: 57.1
	      name: bleu@1
	    - type: bleu
	      value: 18.4
	      name: bleu@4
	    - type: meteor
	      value: 19
	    - type: rouge
	      value: 40.8
	      name: rouge-l
	  - task:
	      type: video question answering
	    dataset:
	      type: ActivityNet
	      name: ActivityNet
	    metrics:
	    - type: Acc
	      value: 61.8
	  - task:
	      type: video question answering
	    dataset:
	      type: MSVD
	      name: MSVD
	    metrics:
	    - type: Acc
	      value: 62.6
	  - task:
	      type: video question answering
	    dataset:
	      type: MSR-VTT
	      name: MSR-VTT
	    metrics:
	    - type: Acc
	      value: 43.5
	  - task:
	      type: video question answering
	    dataset:
	      type: iVQA
	      name: iVQA
	    metrics:
	    - type: Acc
	      value: 55.2
	pipeline_tag: video-text-to-text
	---

	<img src="assets/teaser.png" align="center">

	## Resources

	- [Website](https://rese1f.github.io/aurora-web/)
	- [arXiv: Paper](https://arxiv.org/abs/2410.03051)
	- [GitHub: Code](https://github.com/rese1f/aurora)
	- [Hugging Face: AuroraCap Model](https://huggingface.co/collections/Reself/auroracap-66d117ffe13bedda96702013)
	- [Hugging Face: VDC Benchmark](https://huggingface.co/datasets/Reself/Video-Detailed-Caption)
	- [Hugging Face: Trainset](https://huggingface.co/datasets/Reself/AuroraCap-trainset)

	## Features

	<img src="assets/assets_vdc_baseline.png" align="center">

	AuroraCap is a multimodal large language model for image and video captioning.

	## Quick Start
	See [Docs](https://github.com/rese1f/aurora/blob/main/docs/auroracap/README.md).

	## FAQ

	Q: Can token merging be used only during inference?

	A: No. Our experiments show that token merging also accelerates training while maintaining similar performance. Besides AuroraCap, token merging can be applied to other LLaVA-like models as well.
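
To illustrate what merging tokens means in general, here is a minimal pure-Python sketch. It assumes a ToMe-style bipartite matching scheme and is not AuroraCap's actual implementation; the names `cosine` and `merge_tokens` and the parameter `r` (number of tokens to remove) are hypothetical and chosen for this example only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, r):
    """Reduce the token count by up to r by averaging similar tokens.

    Illustrative sketch only (assumed ToMe-style bipartite matching).
    Note that the original token order is not preserved.
    """
    # Alternate tokens into two sets; merges only happen from src into dst.
    src, dst = tokens[0::2], tokens[1::2]
    if not dst:
        return [list(t) for t in tokens]
    r = max(0, min(r, len(src)))

    # For every source token, find its most similar destination token.
    matches = []
    for i, s in enumerate(src):
        sims = [cosine(s, d) for d in dst]
        j = max(range(len(dst)), key=sims.__getitem__)
        matches.append((sims[j], i, j))

    # The r most similar (most redundant) source tokens are merged away.
    matches.sort(key=lambda m: m[0], reverse=True)
    merged_src = {i: j for _, i, j in matches[:r]}

    # Track sums and counts so each merged token is the mean of its members.
    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    for i, j in merged_src.items():
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1

    kept = [list(s) for i, s in enumerate(src) if i not in merged_src]
    merged = [[a / c for a in s] for s, c in zip(sums, counts)]
    return kept + merged


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(len(merge_tokens(tokens, 1)))  # 3
```

Because each merge step shortens the sequence passed to later layers, the same reduction can apply in both the forward pass at inference time and the forward and backward passes during training.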

	Q: Why do we provide both official LLaVA-format and XTuner-format weights for AuroraCap?

	A: While XTuner supports saving checkpoints in multiple formats, it currently only allows continued training with the XTuner format. Therefore, we currently provide the model in the XTuner format for both continued training and inference. In the future, we will also provide the model in the official LLaVA format for both training and inference, enabling quicker SGLang deployment and integration with the transformers library.

	## Citation

	```
	@article{chai2024auroracap,
	  title={AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark},
	  author={Wenhao Chai and Enxin Song and Yilun Du and Chenlin Meng and Vashisht Madhavan and Omer Bar-Tal and Jenq-Neng Hwang and Saining Xie and Christopher D. Manning},
	  journal={arXiv preprint arXiv:2410.03051},
	  year={2024}
	}
	```