VTSUM-BLIP Model Card

Model details

Model type: VTSUM-BLIP is an end-to-end cross-modal video summarization model.

Model description:

  • VTSUM-BLIP + Temporal Transformer (TT): vtsum_tt.pth
  • VTSUM-BLIP + Temporal Transformer (TT) + Context Aggregation (CA): vtsum_tt_ca.pth
  • VT-CLIP for the VT-CLIPScore evaluation metric (see the scoring sketch below): vt_clip.pth
  • BLIP w/ ViT-B and CapFilt-L: model_base_capfilt_large.pth

The file structure of the model zoo looks like:

outputs
β”œβ”€β”€ blip
β”‚   └── model_base_capfilt_large.pth
β”œβ”€β”€ vt_clipscore
β”‚   └── vt_clip.pth
β”œβ”€β”€ vtsum_tt
β”‚   └── vtsum_tt.pth
└── vtsum_tt_ca
    └── vtsum_tt_ca.pth
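
To sanity-check downloaded weights against this layout, the checkpoints can be inspected with plain PyTorch. This is a minimal sketch only: the paths assume the tree above, the "model" key is an assumption about how the state dict may be wrapped, and the actual model classes that consume these weights live in the VideoXum codebase linked below.

```python
import torch

# Checkpoint paths, following the model-zoo tree above.
ckpt_paths = {
    "blip": "outputs/blip/model_base_capfilt_large.pth",
    "vt_clip": "outputs/vt_clipscore/vt_clip.pth",
    "vtsum_tt": "outputs/vtsum_tt/vtsum_tt.pth",
    "vtsum_tt_ca": "outputs/vtsum_tt_ca/vtsum_tt_ca.pth",
}

for name, path in ckpt_paths.items():
    ckpt = torch.load(path, map_location="cpu")
    # Some checkpoints wrap the weights under a "model" key (an assumption);
    # fall back to the loaded object itself otherwise.
    state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
    print(f"{name}: {len(state_dict)} tensors")
```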

Paper or resources for more information: https://videoxum.github.io/
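
The vt_clip.pth checkpoint backs the VT-CLIPScore metric, which scores the semantic alignment between a generated summary and a reference. As a rough illustration, a CLIPScore-style score is the rescaled, clipped cosine similarity between pooled embeddings. The sketch below follows the generic CLIPScore recipe; the function name, the 512-dim embeddings, and the w = 2.5 rescaling are assumptions borrowed from the original CLIPScore, not the VideoXum implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_score(video_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     w: float = 2.5) -> torch.Tensor:
    # Normalize both embeddings to unit length, take their cosine
    # similarity, clip negatives to zero, and rescale by w.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    cos = (v * t).sum(dim=-1)
    return w * cos.clamp(min=0)

# Usage with dummy embeddings (batch of 4, hypothetical 512-dim features):
scores = clip_style_score(torch.randn(4, 512), torch.randn(4, 512))
print(scores)
```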

Training dataset

  • VideoXum training set: 8K long videos with 80K pairs of aligned video and text summaries.

Evaluation dataset

  • VideoXum val set: 2K long videos with 20K pairs of aligned video and text summaries.
  • VideoXum test set: 4K long videos with 40K pairs of aligned video and text summaries.