metadata
license: apache-2.0
language:
- en
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
tags:
- video temporal grounding
- dense video caption
- video highlight detection
Overview
In this work:
- We model videos as a series of events and propose a causal event modeling framework to capture the inherent structure of videos.
- We present TRACE, a novel task-interleaved video LLM tailored to implement the causal event modeling framework through the sequential encoding and decoding of timestamps, salient scores, and textual captions.
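To make the event-based view concrete, here is a minimal sketch of how a video could be represented as an ordered list of events, each flattened into a timestamp, a salient score, and a caption. The `Event` class, the `interleave` and `deinterleave` helpers, and the text format of each field are hypothetical names and choices made for illustration; they are not taken from the TRACE codebase and do not reflect its actual tokenization.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """One video event (hypothetical container, not the TRACE API)."""
    start: float   # start time in seconds
    end: float     # end time in seconds
    score: float   # salient score of the event
    caption: str   # textual description of the event


def interleave(events: List[Event]) -> List[Tuple[str, str]]:
    """Flatten events into a (task, value) sequence.

    Events are ordered by start time, and each event contributes
    three consecutive items: time, then score, then caption.
    """
    seq: List[Tuple[str, str]] = []
    for ev in sorted(events, key=lambda e: e.start):
        seq.append(("time", f"{ev.start:.1f}-{ev.end:.1f}"))
        seq.append(("score", f"{ev.score:.1f}"))
        seq.append(("caption", ev.caption))
    return seq


def deinterleave(seq: List[Tuple[str, str]]) -> List[Event]:
    """Rebuild events from an interleaved sequence, three items at a time."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    events: List[Event] = []
    for i in range(0, len(seq), 3):
        (k1, t), (k2, s), (k3, c) = seq[i:i + 3]
        if (k1, k2, k3) != ("time", "score", "caption"):
            raise ValueError(f"unexpected task order at position {i}")
        start, end = (float(x) for x in t.split("-"))
        events.append(Event(start, end, float(s), c))
    return events
```

In this sketch, decoding one event at a time in temporal order means every event is produced after all earlier ones, which is the sense in which the sequence is causal.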
Model Zoo
Checkpoints | Description | URL |
---|---|---|
Initialization | Weights initialized from VideoLLaMA2 | trace-init |
Stage-1 | Model checkpoints trained after stage-1 | trace-stage1 |
Stage-2 | Model checkpoints trained after stage-2 | trace |
Results
YouCook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
---|---|---|---|---|
TRACE | 8.1 | 2.8 | 2.2 | 22.4 |
Charades-STA (Zero-Shot) | R@1 (IoU=0.3) | R@1 (IoU=0.5) | R@1 (IoU=0.7) | mIoU |
---|---|---|---|---|
TRACE | 58.6 | 40.3 | 19.4 | 38.7 |
QVHighlights (Zero-Shot) | mAP | Hit@1 |
---|---|---|
TRACE | 26.8 | 42.7 |
ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
---|---|---|---|---|
TRACE | 25.9 | 6.0 | 6.4 | 39.3 |
ActivityNet-MR | R@1 (IoU=0.3) | R@1 (IoU=0.5) | R@1 (IoU=0.7) | mIoU |
---|---|---|---|---|
TRACE | 53.0 | 37.7 | 24.0 | 39.0 |
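
For readers unfamiliar with the moment retrieval metrics, the sketch below shows how they are commonly computed. It assumes the 0.3, 0.5, and 0.7 columns in the Charades-STA and ActivityNet-MR tables are Recall@1 at those temporal IoU thresholds, that a prediction counts as a hit when its IoU is greater than or equal to the threshold, and that there is one predicted span per query. The function names are illustrative and this is not the evaluation script used to produce the numbers above.

```python
from typing import Dict, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: Sequence[Span],
    gts: Sequence[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Recall@1 at each IoU threshold and mean IoU, both in percent.

    Assumes one top-1 prediction per query, aligned with `gts` by index.
    """
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal in length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    out = {
        f"R@1 IoU={t}": 100.0 * sum(iou >= t for iou in ious) / n
        for t in thresholds
    }
    out["mIoU"] = 100.0 * sum(ious) / n
    return out
```

A call such as `grounding_metrics([(0, 10), (5, 15)], [(0, 10), (10, 20)])` returns a dictionary keyed by `"R@1 IoU=0.3"`, `"R@1 IoU=0.5"`, `"R@1 IoU=0.7"`, and `"mIoU"`.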