README.md · Yongxin-Guo/trace at f276f2d1c09fa5455c08949f9ef64c4f9d9ce352

metadata

license: apache-2.0
language:
  - en
base_model:
  - mistralai/Mistral-7B-Instruct-v0.2
tags:
  - video temporal grounding
  - dense video caption
  - video highlight detection

Overview

In this work:

We model videos as a series of events and propose a causal event modeling framework to capture the inherent structure of videos.
We present TRACE, a novel task-interleaved video LLM tailored to implement the causal event modeling framework through the sequential encoding and decoding of timestamps, salient scores, and textual captions.
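
To make the event-based view concrete, the following is a minimal sketch of how a video could be represented as a time-ordered list of events, each flattened into a timestamp, a salient score, and a caption, in that order. The `Event` class, the `interleave` and `causal_context` helpers, and the string formats are hypothetical illustrations and are not taken from the TRACE codebase; the actual model handles each of these tasks with its own encoding and decoding components, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """One video event (hypothetical container, not the TRACE data format)."""
    start: float   # start time in seconds
    end: float     # end time in seconds
    score: float   # salient score of the event
    caption: str   # textual description of the event


def interleave(events: List[Event]) -> List[Tuple[str, str]]:
    """Flatten events into a (task, value) sequence: time, score, text per event."""
    sequence = []
    # Events are emitted in temporal order so that later events follow earlier ones.
    for event in sorted(events, key=lambda e: e.start):
        sequence.append(("time", f"{event.start:.1f}-{event.end:.1f}"))
        sequence.append(("score", f"{event.score:.1f}"))
        sequence.append(("text", event.caption))
    return sequence


def causal_context(events: List[Event], k: int) -> List[Tuple[str, str]]:
    """Return the flattened sequence of the k earliest events.

    In a causal setup this is the event history that a prediction of the
    next event would be conditioned on (alongside the video and the prompt).
    """
    ordered = sorted(events, key=lambda e: e.start)
    return interleave(ordered[:k])


events = [
    Event(12.0, 20.5, 3.0, "stir the pan"),
    Event(0.0, 10.0, 4.5, "chop the onions"),
]
for task, value in interleave(events):
    print(task, value)
```

Keeping the three fields in a fixed order per event is what allows a single left-to-right decoder to produce timestamps, scores, and captions as one sequence.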

Model Zoo

Checkpoints	Description	URL
Initialization	Weights initialized from VideoLLaMA2	trace-init
Stage-1	Model checkpoints trained after stage-1	trace-stage1
Stage-2	Model checkpoints trained after stage-2	trace
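
A checkpoint from the table above can be fetched with the `huggingface_hub` library. The sketch below assumes that all three checkpoints live under the `Yongxin-Guo` account and that the repository names match the link texts in the table; only `Yongxin-Guo/trace` is confirmed by this page, so the other two identifiers should be checked before use. The helper names `repo_id_for` and `download_checkpoint` are hypothetical.

```python
from typing import Optional

# Assumption: every checkpoint is hosted under the same account as this repository.
REPO_OWNER = "Yongxin-Guo"

# Assumption: repository names equal the link texts in the Model Zoo table.
CHECKPOINTS = {
    "init": "trace-init",      # weights initialized from VideoLLaMA2
    "stage1": "trace-stage1",  # checkpoint after stage-1 training
    "stage2": "trace",         # checkpoint after stage-2 training
}


def repo_id_for(stage: str) -> str:
    """Build the Hub repository id for a training stage."""
    return f"{REPO_OWNER}/{CHECKPOINTS[stage]}"


def download_checkpoint(stage: str = "stage2", local_dir: Optional[str] = None) -> str:
    """Download all files of a checkpoint and return the local folder path."""
    # Imported lazily so the mapping above can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_for(stage), local_dir=local_dir)
```

Calling `download_checkpoint("stage2")` requires network access and `pip install huggingface_hub`; it returns the path of the downloaded snapshot.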

Results

YouCook2 (Zero-Shot)	CIDEr	METEOR	SODA_c	F1
TRACE	8.1	2.8	2.2	22.4

Charades-STA (Zero-Shot)	R@1 (IoU=0.3)	R@1 (IoU=0.5)	R@1 (IoU=0.7)	mIoU
TRACE	58.6	40.3	19.4	38.7
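
For readers unfamiliar with the moment-retrieval columns, the sketch below shows how temporal IoU, recall at an IoU threshold, and mean IoU are commonly computed when there is one predicted segment per query. It is an illustration of the standard definitions, not the evaluation script used to produce the numbers in this README; the function names are hypothetical, and counting a prediction as a hit when its IoU is greater than or equal to the threshold is an assumption.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time segments."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Percentage of queries whose top prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds: List[Segment], gts: List[Segment]) -> float:
    """Average IoU over all queries, as a percentage."""
    total = sum(temporal_iou(p, g) for p, g in zip(preds, gts))
    return 100.0 * total / len(gts)


# Toy data: the first prediction is exact, the second overlaps by 5 of 15 seconds.
preds = [(0.0, 10.0), (5.0, 15.0)]
gts = [(0.0, 10.0), (10.0, 20.0)]
for t in (0.3, 0.5, 0.7):
    print(f"R@1 (IoU={t}):", recall_at_iou(preds, gts, t))
print("mIoU:", round(mean_iou(preds, gts), 2))
```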

QVHighlights (Zero-Shot)	mAP	Hit@1
TRACE	26.8	42.7

ActivityNet-DVC	CIDEr	METEOR	SODA_c	F1
TRACE	25.9	6.0	6.4	39.3

ActivityNet-MR	R@1 (IoU=0.3)	R@1 (IoU=0.5)	R@1 (IoU=0.7)	mIoU
TRACE	53.0	37.7	24.0	39.0