VTSUM-BLIP Model Card
Model details
Model type: VTSUM-BLIP is an end-to-end cross-modal video summarization model.
Model description:
- VTSUM-BLIP + Temporal Transformer (TT): vtsum_tt.pth
- VTSUM-BLIP + Temporal Transformer (TT) + Context Aggregation (CA): vtsum_tt_ca.pth
- VT-CLIP for VT-CLIPScore metric: vt_clip.pth
- BLIP w/ ViT-B and CapFilt-L: model_base_capfilt_large.pth
The file structure of the model zoo looks like:
outputs
├── blip
│   └── model_base_capfilt_large.pth
├── vt_clipscore
│   └── vt_clip.pth
├── vtsum_tt
│   └── vtsum_tt.pth
└── vtsum_tt_ca
    └── vtsum_tt_ca.pth
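For a quick sanity check, a minimal sketch of inspecting one of these checkpoints with PyTorch is shown below. The wrapping "model" key is an assumption based on common BLIP-style checkpoints, not the official loading path; see the project page for the supported code.

```python
import torch

# Minimal sketch: inspect a downloaded checkpoint from the model zoo.
# ASSUMPTION: the checkpoint may wrap its state_dict under a "model"
# key (common for BLIP-style checkpoints); fall back to the raw dict.
ckpt = torch.load("outputs/vtsum_tt/vtsum_tt.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)
print(f"loaded {len(state_dict)} parameter tensors")
```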
Paper or resources for more information: https://videoxum.github.io/
Training dataset
- VideoXum training set: 8K long videos with 80K pairs of aligned video and text summaries.
Evaluation dataset
- VideoXum val set: 2K long videos with 20K pairs of aligned video and text summaries.
- VideoXum test set: 4K long videos with 40K pairs of aligned video and text summaries.
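The pair counts follow from the split sizes at a fixed ratio of summaries per video, as the training split implies (80K pairs / 8K videos = 10). The sketch below just makes that arithmetic explicit; the ratio is derived from this card, not independently verified.

```python
# Pair counts per split, assuming the 10-pairs-per-video ratio
# implied by the training split (80K pairs / 8K videos).
PAIRS_PER_VIDEO = 80_000 // 8_000  # = 10
splits = {"train": 8_000, "val": 2_000, "test": 4_000}
pairs = {name: n * PAIRS_PER_VIDEO for name, n in splits.items()}
print(pairs)  # {'train': 80000, 'val': 20000, 'test': 40000}
```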