Liangrj5 committed
Commit · a638e43
Parent(s): f2d2d1a
init
This view is limited to 50 files because it contains too many changes. See raw diff.
- .gitattributes +2 -0
- .gitignore +1 -0
- README.md +47 -3
- config/config.py +227 -0
- config/model_config.json +3 -0
- config/tvr_ranking_data_config_top01.json +3 -0
- config/tvr_ranking_data_config_top20.json +3 -0
- config/tvr_ranking_data_config_top40.json +3 -0
- data_loader/second_stage_start_end_dataset.py +349 -0
- inference.py +570 -0
- model/__init__.py +0 -0
- model/backbone/__init__.py +0 -0
- model/backbone/encoder.py +235 -0
- model/conquer.py +205 -0
- model/head/__init__.py +0 -0
- model/head/ml_head.py +61 -0
- model/head/vs_head.py +42 -0
- model/layers.py +196 -0
- model/modeling_utils.py +135 -0
- model/qal/__init__.py +0 -0
- model/qal/query_aware_learning_module.py +92 -0
- model/transformer/__init__.py +0 -0
- model/transformer/bert.py +275 -0
- model/transformer/bert_embed.py +64 -0
- ndcg_iou_topk.py +66 -0
- optim/adamw.py +106 -0
- results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01.log +3 -0
- results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01_back.log +3 -0
- results/tvr-top01-2024_07_08_17_18_30/best_test_predictions.json +3 -0
- results/tvr-top01-2024_07_08_17_18_30/best_val_predictions.json +3 -0
- results/tvr-top01-2024_07_08_17_18_30/code.zip +3 -0
- results/tvr-top01-2024_07_08_17_18_30/model.ckpt +3 -0
- results/tvr-top01-2024_07_08_17_18_30/opt.json +3 -0
- results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20.log +3 -0
- results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20_back.log +3 -0
- results/tvr-top20-2024_07_08_21_19_47/best_test_predictions.json +3 -0
- results/tvr-top20-2024_07_08_21_19_47/best_val_predictions.json +3 -0
- results/tvr-top20-2024_07_08_21_19_47/code.zip +3 -0
- results/tvr-top20-2024_07_08_21_19_47/model.ckpt +3 -0
- results/tvr-top20-2024_07_08_21_19_47/opt.json +3 -0
- results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40.log +3 -0
- results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40_back.log +3 -0
- results/tvr-top40-2024_07_11_10_58_46/best_test_predictions.json +3 -0
- results/tvr-top40-2024_07_11_10_58_46/best_val_predictions.json +3 -0
- results/tvr-top40-2024_07_11_10_58_46/code.zip +3 -0
- results/tvr-top40-2024_07_11_10_58_46/model.ckpt +3 -0
- results/tvr-top40-2024_07_11_10_58_46/opt.json +3 -0
- run_disjoint_top01.sh +19 -0
- run_disjoint_top20.sh +19 -0
- run_disjoint_top40.sh +19 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
+*.log filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1 @@
+*__pycache__
README.md
CHANGED
@@ -1,3 +1,47 @@
----
-license: mit
----
+---
+license: mit
+datasets:
+- axgroup/Ranking_TVR
+language:
+- en
+---
+# CONQUER_RVMR
+
+This repository contains the CONQUER model, a baseline for the Ranked Video Moment Retrieval (RVMR) task. The associated paper is titled "Video Moment Retrieval in Practical Setting: A Dataset of Ranked Moments for Imprecise Queries."
+
+The main repository of the paper is [TVR-Ranking](https://huggingface.co/axgroup/TVR-Ranking), and this model is adapted from [CONQUER](https://github.com/houzhijian/CONQUER.git). The environment setup is the same as for RelocNet_RVMR, as detailed in the [TVR-Ranking](https://huggingface.co/axgroup/TVR-Ranking) repository.
+
+
+CONQUER leverages video retrieval results from [HERO](https://github.com/linjieli222/HERO.git). We continue to use these
+results when training on our TVR-Ranking dataset. Note that, because the HERO results are obtained from the TVR dataset, there could be a data leak issue in our task setting. However, this issue is negligible for two reasons: (i) the queries used in our setting are imprecise, rewritten queries, and (ii) each query has multiple ground-truth moments in our task setting, which were not annotated in the original TVR dataset.
+
+
+## Performance
+
+
+| **Model**   | **Train Set Top N** | **IoU=0.3** |          | **IoU=0.5** |          | **IoU=0.7** |          |
+|-------------|---------------------|-------------|----------|-------------|----------|-------------|----------|
+|             |                     | **Val**     | **Test** | **Val**     | **Test** | **Val**     | **Test** |
+| **NDCG@10** |                     |             |          |             |          |             |          |
+| CONQUER     | 1                   | 0.0999      | 0.0859   | 0.0844      | 0.0709   | 0.0530      | 0.0512   |
+| CONQUER     | 20                  | 0.2406      | 0.2249   | 0.2222      | 0.2104   | 0.1672      | 0.1517   |
+| CONQUER     | 40                  | 0.2450      | 0.2219   | 0.2262      | 0.2085   | 0.1670      | 0.1515   |
+| **NDCG@20** |                     |             |          |             |          |             |          |
+| CONQUER     | 1                   | 0.0952      | 0.0835   | 0.0808      | 0.0687   | 0.0526      | 0.0484   |
+| CONQUER     | 20                  | 0.2130      | 0.1995   | 0.1976      | 0.1867   | 0.1527      | 0.1368   |
+| CONQUER     | 40                  | 0.2183      | 0.1968   | 0.2022      | 0.1851   | 0.1524      | 0.1365   |
+| **NDCG@40** |                     |             |          |             |          |             |          |
+| CONQUER     | 1                   | 0.0974      | 0.0866   | 0.0832      | 0.0718   | 0.0557      | 0.0510   |
+| CONQUER     | 20                  | 0.2029      | 0.1906   | 0.1891      | 0.1788   | 0.1476      | 0.1326   |
+| CONQUER     | 40                  | 0.2080      | 0.1885   | 0.1934      | 0.1775   | 0.1473      | 0.1323   |
+
+
+## Quick Start
+
+Modify the path in `run_disjoint_top20.sh` and then execute the script:
+
+```sh
+sh run_disjoint_top20.sh
+```
+
+Feel free to contribute or raise issues for any problems encountered.
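The NDCG@K numbers in the README table above count a predicted moment as relevant only when it lies in the right video and its temporal IoU with a ground-truth moment meets the threshold (0.3/0.5/0.7). Below is a minimal sketch of that metric for orientation only; the graded `relevance` field and the matching rule are assumptions of this sketch, and the shipped implementation is `calculate_ndcg_iou` in `ndcg_iou_topk.py`.

```python
# Sketch of NDCG@K with an IoU threshold (simplified gains; not the code in ndcg_iou_topk.py).
import math

def temporal_iou(pred_span, gt_span):
    """IoU between two [start, end] spans given in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = (pred_span[1] - pred_span[0]) + (gt_span[1] - gt_span[0]) - inter
    return inter / union if union > 0 else 0.0

def ndcg_at_k(predictions, ground_truths, k, iou_thd):
    """predictions: ranked list of {"video_name", "timestamp"} dicts.
    ground_truths: list of {"video_name", "timestamp", "relevance"} dicts (relevance is assumed graded).
    A prediction earns the gain of the best ground truth it matches with IoU >= iou_thd."""
    gains = []
    for pred in predictions[:k]:
        matched = [gt["relevance"] for gt in ground_truths
                   if gt["video_name"] == pred["video_name"]
                   and temporal_iou(pred["timestamp"], gt["timestamp"]) >= iou_thd]
        gains.append(max(matched) if matched else 0.0)
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal_gains = sorted((gt["relevance"] for gt in ground_truths), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0
```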
config/config.py
ADDED
@@ -0,0 +1,227 @@
import os
import time
import torch
import argparse
import sys
import pprint

import json
from utils.basic_utils import mkdirp, load_json, save_json, make_zipfile


def parse_with_config(parser):
    args = parser.parse_args()
    if args.config is not None:
        config_args = json.load(open(args.config))
        override_keys = {arg[2:].split('=')[0] for arg in sys.argv[1:]
                         if arg.startswith('--')}
        for k, v in config_args.items():
            if k not in override_keys:
                setattr(args, k, v)
    del args.config
    return args


class BaseOptions(object):
    saved_option_filename = "opt.json"
    ckpt_filename = "model.ckpt"
    tensorboard_log_dir = "tensorboard_log"
    train_log_filename = "train.log.txt"
    eval_log_filename = "eval.log.txt"

    def __init__(self):
        self.parser = argparse.ArgumentParser()
        self.initialized = False
        self.opt = None

    def initialize(self):
        self.initialized = True
        self.parser.add_argument("--dset_name", type=str, default="tvr", choices=["tvr", "didemo"])
        self.parser.add_argument("--eval_split_name", type=str, default="val",
                                 help="should match keys in video_duration_idx_path, must set for VCMR")
        self.parser.add_argument("--data_ratio", type=float, default=1.0,
                                 help="how many training and eval data to use. 1.0: use all, 0.1: use 10%."
                                      "Use small portion for debug purposes. Note this is different from --debug, "
                                      "which works by breaking the loops, typically they are not used together.")
        self.parser.add_argument("--debug", action="store_true",
                                 help="debug (fast) mode, break all loops, do not load all data into memory.")
        self.parser.add_argument("--disable_eval", action="store_true",
                                 help="disable eval")
        self.parser.add_argument("--results_root", type=str, default="results")
        self.parser.add_argument("--exp_id", type=str, default=None, help="id of this run, required at training")
        self.parser.add_argument("--seed", type=int, default=2018, help="random seed")
        self.parser.add_argument("--device", type=int, default=0, help="0 cuda, -1 cpu")
        self.parser.add_argument("--device_ids", type=int, nargs="+", default=[0], help="GPU ids to run the job")
        self.parser.add_argument("--num_workers", type=int, default=8,
                                 help="num subprocesses used to load the data, 0: use main process")

        # training config
        self.parser.add_argument("--lr", type=float, default=1e-4, help="learning rate")
        self.parser.add_argument("--lr_warmup_proportion", type=float, default=0.01,
                                 help="Proportion of training to perform linear learning rate warmup for. "
                                      "E.g., 0.1 = 10% of training.")
        self.parser.add_argument("--wd", type=float, default=0.01, help="weight decay")
        self.parser.add_argument("--n_epoch", type=int, default=50, help="number of epochs to run")
        self.parser.add_argument("--max_es_cnt", type=int, default=3,
                                 help="number of epochs to early stop, use -1 to disable early stop")
        self.parser.add_argument("--eval_tasks_at_training", type=str, nargs="+",
                                 default=["VCMR", "SVMR", "VR"], choices=["VCMR", "SVMR", "VR"],
                                 help="evaluate and report numbers for tasks specified here.")
        self.parser.add_argument("--bsz", type=int, default=128, help="mini-batch size")
        self.parser.add_argument("--eval_query_bsz", type=int, default=8,
                                 help="mini-batch size at inference, for query")
        self.parser.add_argument("--no_eval_untrained", action="store_true", help="Evaluate on un-trained model")
        self.parser.add_argument("--grad_clip", type=float, default=-1, help="perform gradient clip, -1: disable")
        self.parser.add_argument("--eval_epoch_num", type=int, default=1, help="eval_epoch_num")

        # Data config
        self.parser.add_argument("--max_ctx_len", type=int, default=100,
                                 help="max number of snippets, 100 for tvr clip_length=1.5, only 109/21825 > 100")
        self.parser.add_argument("--max_desc_len", type=int, default=30, help="max number of query token")
        self.parser.add_argument("--clip_length", type=float, default=1.5,
                                 help="each video will be uniformly segmented into small clips")
        self.parser.add_argument("--ctx_mode", type=str, default="visual_sub",
                                 help="adopted modality list for each clip")
        self.parser.add_argument("--dataset_config", type=str, help="data config")

        # Model config
        self.parser.add_argument("--visual_dim", type=int, default=4352, help="visual modality feature dimension")
        self.parser.add_argument("--text_dim", type=int, default=768, help="textual modality feature dimension")
        self.parser.add_argument("--query_dim", type=int, default=768, help="query feature dimension")
        self.parser.add_argument("--hidden_dim", type=int, default=768, help="joint dimension")
        self.parser.add_argument("--no_output_moe_weight", action="store_true",
                                 help="whether NOT to use query dependent fusion")
        self.parser.add_argument("--model_config", type=str, help="model config")

        ## Train config
        self.parser.add_argument("--lw_st_ed", type=float, default=0.01, help="weight for moment cross-entropy loss")
        self.parser.add_argument("--lw_video_ce", type=float, default=0.05, help="weight for video cross-entropy loss")
        self.parser.add_argument("--lr_mul", type=float, default=1, help="Learning rate multiplier for backbone module")
        self.parser.add_argument("--use_extend_pool", type=int, default=1000,
                                 help="use_extend_pool")
        self.parser.add_argument("--neg_video_num", type=int, default=3,
                                 help="sample the number of negative video, "
                                      "if neg_video_num=0, then disable shared normalization training objective")
        self.parser.add_argument("--encoder_pretrain_ckpt_filepath", type=str,
                                 default="None",
                                 help="first_stage_pretrain checkpoint")
        self.parser.add_argument("--use_interal_vr_scores", action="store_true",
                                 help="whether to interal_vr_scores, true only for general similarity measure function")

        ## Eval config
        self.parser.add_argument("--similarity_measure",
                                 type=str, choices=["general", "exclusive", "disjoint"],
                                 default="general", help="similarity_measure_function")
        # post processing
        self.parser.add_argument("--min_pred_l", type=int, default=0,
                                 help="constrain the [st, ed] with ed - st >= 1"
                                      "(1 clips with length 1.5 each, 1.5 secs in total"
                                      "this is the min length for proposal-based method)")
        self.parser.add_argument("--max_pred_l", type=int, default=24,
                                 help="constrain the [st, ed] pairs with ed - st <= 24, 36 secs in total"
                                      "(24 clips with length 1.5 each, "
                                      "this is the max length for proposal-based method)")
        self.parser.add_argument("--max_before_nms", type=int, default=200)
        self.parser.add_argument("--max_vcmr_video", type=int, default=10,
                                 help="ranking in top-max_vcmr_video")
        self.parser.add_argument("--nms_thd", type=float, default=-1,
                                 help="additionally use non-maximum suppression "
                                      "(or non-minimum suppression for distance)"
                                      "to post-processing the predictions. "
                                      "-1: do not use nms. 0.7 for tvr")
        self.parser.add_argument("--eval_num_per_epoch", type=float)

        # can use config files
        self.parser.add_argument('--config', help='JSON config files')
        self.parser.add_argument('--model_name', type=str)

    def display_save(self, opt):
        args = vars(opt)
        # Display settings
        # print("------------ Options -------------\n{}\n-------------------"
        #       .format({str(k): str(v) for k, v in sorted(args.items())}))
        print("------------ Options -------------\n{}\n-------------------"
              .format(pprint.pformat({str(k): str(v) for k, v in sorted(args.items())}, indent=4)))

        # Save settings
        if not isinstance(self, TestOptions):
            option_file_path = os.path.join(opt.results_dir, self.saved_option_filename)  # not yaml file indeed
            save_json(args, option_file_path, save_pretty=True)

    def parse(self):
        if not self.initialized:
            self.initialize()
        opt = parse_with_config(self.parser)

        if opt.debug:
            opt.results_root = os.path.sep.join(opt.results_root.split(os.path.sep)[:-1] + ["debug_results", ])
            # opt.disable_eval = True

        if isinstance(self, TestOptions):
            # modify model_dir to absolute path
            opt.model_dir = os.path.join("results", opt.model_dir)

            saved_options = load_json(os.path.join(opt.model_dir, self.saved_option_filename))
            for arg in saved_options:  # use saved options to overwrite all BaseOptions args.
                if arg not in ["results_root", "nms_thd", "debug", "dataset_config", "model_config", "device",
                               "eval_split_name", "bsz", "eval_context_bsz", "device_ids",
                               "max_vcmr_video", "max_pred_l", "min_pred_l", "external_inference_vr_res_path"]:
                    setattr(opt, arg, saved_options[arg])
        else:
            if opt.exp_id is None:
                raise ValueError("--exp_id is required for at a training option!")

            opt.results_dir = os.path.join(opt.results_root,
                                           "-".join([opt.dset_name, opt.exp_id,
                                                     time.strftime("%Y_%m_%d_%H_%M_%S")]))
            mkdirp(opt.results_dir)
            # save a copy of current code
            code_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
            code_zip_filename = os.path.join(opt.results_dir, "code.zip")
            make_zipfile(code_dir, code_zip_filename,
                         enclosing_dir="code",
                         exclude_dirs_substring="results",
                         exclude_dirs=["condor", "data", "results", "debug_results", "__pycache__"],
                         exclude_extensions=[".pyc", ".ipynb", ".swap"], )

        self.display_save(opt)

        # assert opt.stop_task in opt.eval_tasks_at_training
        opt.ckpt_filepath = os.path.join(opt.results_dir, self.ckpt_filename)
        opt.train_log_filepath = os.path.join(opt.results_dir, self.train_log_filename)
        opt.eval_log_filepath = os.path.join(opt.results_dir, self.eval_log_filename)
        opt.tensorboard_log_dir = os.path.join(opt.results_dir, self.tensorboard_log_dir)
        opt.device = torch.device("cuda:%d" % opt.device_ids[0] if opt.device >= 0 else "cpu")

        self.opt = opt
        return opt


class TestOptions(BaseOptions):
    """add additional options for evaluating"""
    def initialize(self):
        BaseOptions.initialize(self)
        # also need to specify --eval_split_name
        self.parser.add_argument("--eval_id", type=str, help="evaluation id")
        self.parser.add_argument("--model_dir", type=str,
                                 help="dir contains the model file, will be converted to absolute path afterwards")
        self.parser.add_argument("--tasks", type=str, nargs="+",
                                 choices=["VCMR", "SVMR", "VR"], default=["VCMR", "SVMR", "VR"],
                                 help="Which tasks to run."
                                      "VCMR: Video Corpus Moment Retrieval;"
                                      "SVMR: Single Video Moment Retrieval;"
                                      "VR: regular Video Retrieval. (will be performed automatically with VCMR)")


if __name__ == '__main__':
    print(__file__)
    print(os.path.realpath(__file__))
    code_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
    print(code_dir)
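A minimal usage sketch of the option classes defined above, for orientation: the driver shown here is hypothetical (the shipped entry points are the training script and `inference.py`, which imports `TestOptions` in the same way), but the attribute names all come from the parser above.

```python
# Hypothetical driver illustrating how BaseOptions / TestOptions are consumed.
from config.config import BaseOptions, TestOptions

if __name__ == "__main__":
    # Training-style parsing would use BaseOptions().parse(): it requires --exp_id
    # and creates <results_dir>, code.zip, and opt.json.

    # Evaluation-style parsing: --model_dir points at a results folder; most fields
    # are then overwritten from the opt.json saved there at training time.
    opt = TestOptions().parse()
    print(opt.similarity_measure)   # "general", "exclusive", or "disjoint"
    print(opt.ckpt_filepath)        # <results_dir>/model.ckpt
```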
config/model_config.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1458b56e285bd34b5db29a8e6babc61f9bf02d377a7ce594579baa833190f582
size 1637
config/tvr_ranking_data_config_top01.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:03ed22c7ab836800651a9ab882496e71d93266bb6dff35c13d308243d1a5c98e
size 926
config/tvr_ranking_data_config_top20.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:509c13907d08921dd59c41b040166b4e0fd6e49260fa79adca9d23f46a804f70
size 926
config/tvr_ranking_data_config_top40.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:75a6540a46a85534dcf79b5049cc47053cd48232f6983268a584565b4a55d48b
size 926
data_loader/second_stage_start_end_dataset.py
ADDED
@@ -0,0 +1,349 @@
import torch
from torch.utils.data import Dataset
import math
import os
import random
import numpy as np
from utils.basic_utils import load_json, l2_normalize_np_array
import h5py


class StartEndDataset(Dataset):
    """
    Args:
        dset_name, str, ["tvr"]
    Return:
        a dict: {
            "model_inputs": {
                "query"
                    "feat": torch.tensor, (max_desc_len, D_q)
                    "feat_mask": torch.tensor, (max_desc_len)
                    "feat_pos_id": torch.tensor, (max_desc_len)
                    "feat_token_id": torch.tensor, (max_desc_len)
                "visual"
                    "feat": torch.tensor, (max_ctx_len, D_video)
                    "feat_mask": torch.tensor, (max_ctx_len)
                    "feat_pos_id": torch.tensor, (max_ctx_len)
                    "feat_token_id": torch.tensor, (max_ctx_len)
                "sub" (optional)
                "st_ed_indices": torch.LongTensor, (2, )
            }
        }
    """
    def __init__(self, config, data_path, vr_rank_path, max_ctx_len=100, max_desc_len=30, clip_length=1.5, ctx_mode="visual_sub",
                 is_eval=False, mode="train",
                 neg_video_num=3, data_ratio=1,
                 use_extend_pool=500, inference_top_k=10):

        self.dset_name = config.dset_name
        self.root_path = config.root_path

        self.desc_bert_path = os.path.join(self.root_path, config.desc_bert_path)
        self.vid_feat_path = os.path.join(self.root_path, config.vid_feat_path)

        self.ctx_mode = ctx_mode
        self.use_sub = "sub" in self.ctx_mode

        if self.use_sub:
            self.sub_bert_path = os.path.join(self.root_path, config.sub_bert_path)

        self.max_ctx_len = max_ctx_len
        self.max_desc_len = max_desc_len
        self.clip_length = clip_length

        self.neg_video_num = neg_video_num
        self.is_eval = is_eval

        self.mode = mode
        if mode in ["val", "test"]:
            self.annotations = load_json(data_path)
            self.ground_truth = self.get_relevant_moment_gt()
            self.annotations = self.expand_annotations(self.annotations)
        if mode == "train":
            self.annotations = self.expand_annotations(load_json(data_path))

        self.first_VR_ranklist_pool_txn = h5py.File(vr_rank_path, "r")
        self.query_bert_h5 = h5py.File(self.desc_bert_path, "r")
        self.vid_feat_txn = h5py.File(self.vid_feat_path, "r")
        if self.use_sub:
            self.sub_bert_txn = h5py.File(self.sub_bert_path, "r")

        self.inference_top_k = inference_top_k
        video_data = load_json(os.path.join(self.root_path, config.video_duration_idx_path))

        self.video_data = [{"vid_name": k, "duration": v[0]} for k, v in video_data.items()]
        self.video2idx = {k: v[1] for k, v in video_data.items()}
        self.idx2video = {v[1]: k for k, v in video_data.items()}
        self.use_extend_pool = use_extend_pool

        self.normalize_vfeat = True
        self.normalize_tfeat = False

        self.visual_token_id = 0
        self.text_token_id = 1

    def __len__(self):
        return len(self.annotations)

    def expand_annotations(self, annotations):
        new_annotations = []
        for i in annotations:
            query = i["query"]
            query_id = i["query_id"]
            for moment in i["relevant_moment"]:
                moment.update({'query': query, 'query_id': query_id})
                new_annotations.append(moment)
        return new_annotations

    def get_relevant_moment_gt(self):
        gt_all = {}
        for data in self.annotations:
            gt_all[data["query_id"]] = data["relevant_moment"]
        return gt_all

    def pad_feature(self, feature, max_ctx_len):
        """
        Args:
            feature: original feature without padding
            max_ctx_len: the maximum length of video clips (or query token)

        Returns:
            feat_pad : padded feature
            feat_mask : feature mask
        """
        N_clip, feat_dim = feature.shape

        feat_pad = torch.zeros((max_ctx_len, feat_dim))
        feat_mask = torch.zeros(max_ctx_len, dtype=torch.long)
        feat_pad[:N_clip, :] = torch.from_numpy(feature)
        feat_mask[:N_clip] = 1

        return feat_pad, feat_mask

    def get_query_feat_by_query_id(self, query_id, token_id=1):
        """
        Args:
            query_id: unique query description id
            token_id: specify modality embedding
        Returns:
            a dict for query: {
                "feat": torch.tensor, (max_desc_len, D_q)
                "feat_mask": torch.tensor, (max_desc_len)
                "feat_pos_id": torch.tensor, (max_desc_len)
                "feat_token_id": torch.tensor, (max_desc_len)
            }
        """
        query_feat = self.query_bert_h5[str(query_id)][:self.max_desc_len]

        if self.normalize_tfeat:
            query_feat = l2_normalize_np_array(query_feat)

        feat_pad, feat_mask = \
            self.pad_feature(query_feat, self.max_desc_len)

        temp_model_inputs = dict()
        temp_model_inputs["feat"] = feat_pad
        temp_model_inputs["feat_mask"] = feat_mask
        temp_model_inputs["feat_pos_id"] = torch.arange(self.max_desc_len, dtype=torch.long)
        temp_model_inputs["feat_token_id"] = torch.full((self.max_desc_len,), token_id, dtype=torch.long)

        return temp_model_inputs

    def get_visual_feat_from_storage(self, vid_name):
        """
        Args:
            vid_name: unique video description id
        Returns:
            visual_feat: torch.tensor, (max_ctx_len, D_v)
            Use ResNet + SlowFast, D_v = 2048 + 2304 = 4352
        """
        visual_feat = self.vid_feat_txn[vid_name][:][:self.max_ctx_len]

        if self.normalize_vfeat:
            visual_feat = l2_normalize_np_array(visual_feat)

        return visual_feat

    def get_sub_feat_from_storage(self, vid_name):
        """
        Args:
            vid_name: unique video description id
        Returns:
            sub_feat: torch.tensor, (max_ctx_len, D_s)
            Use RoBERTa, D_s = 768
        """
        sub_feat = self.sub_bert_txn[vid_name][:][:self.max_ctx_len]

        if self.normalize_tfeat:
            sub_feat = l2_normalize_np_array(sub_feat)

        return sub_feat

    def __getitem__(self, index):

        raw_data = self.annotations[index]
        # initialize with basic data
        meta = dict(
            query_id=raw_data["query_id"],
            desc=raw_data["query"],
            vid_name=raw_data["video_name"],
            ts=raw_data["timestamp"],
        )

        # If mode is test_public, no ground-truth video_id is provided. So use a fixed dummy ground-truth video_id
        if self.mode == "test_public":
            meta["vid_name"] = "placeholder"

        model_inputs = dict()
        ## query information
        model_inputs["query"] = self.get_query_feat_by_query_id(meta["query_id"],
                                                                token_id=self.text_token_id)

        query_id = meta["query_id"]
        if query_id == 7806:
            query_id += 1

        _external_inference_vr_res = self.first_VR_ranklist_pool_txn[str(query_id)][:]
        if not self.is_eval:
            ## get the rank location of the ground-truth video for the first VR search engine
            location = 100
            for idx, item in enumerate(_external_inference_vr_res):
                if meta["vid_name"] == self.idx2video[item[0]]:
                    location = idx
                    break

            ## check all the location is below 100 when mode is train
            # if self.mode == "train":
            #     assert 0 <= location < 100, meta["query_id"]

            ## get the ranklist without the ground-truth video
            negative_video_pool_list = [self.idx2video[item[0]] for item in _external_inference_vr_res if meta["vid_name"] != self.idx2video[item[0]]]

            ## sample neg_video_num negative videos for shared normalization
            sampled_negative_video_pool = random.sample(negative_video_pool_list[:location + self.use_extend_pool],
                                                        k=self.neg_video_num)
            ## the complete sampled video list, [pos, neg1, neg2, ...]
            total_vid_name_list = [meta["vid_name"], ] + sampled_negative_video_pool

            self.shared_video_num = 1 + self.neg_video_num

        else:
            ## during eval, use top-k videos recommended by the first VR search engine
            inference_video_list = [self.idx2video[item[0]] for item in _external_inference_vr_res[:self.inference_top_k]]
            inference_video_scores = [item[1] for item in _external_inference_vr_res[:self.inference_top_k]]
            model_inputs["inference_vr_scores"] = torch.FloatTensor(inference_video_scores)
            total_vid_name_list = [meta["vid_name"], ] + inference_video_list
            self.shared_video_num = 1 + self.inference_top_k

        # sampled neg_video_num negative videos or top-k videos
        meta["sample_vid_name_list"] = total_vid_name_list[1:]

        """
        a dict for visual modality: {
            "feat": torch.tensor, (shared_video_num, max_ctx_len, D_v)
            "feat_mask": torch.tensor, (shared_video_num, max_ctx_len)
            "feat_pos_id": torch.tensor, (shared_video_num, max_ctx_len)
            "feat_token_id": torch.tensor, (shared_video_num, max_ctx_len)
        }
        """
        groundtruth_visual_feat = self.get_visual_feat_from_storage(meta["vid_name"])
        ctx_l, feat_dim = groundtruth_visual_feat.shape

        visual_feat_pad = torch.zeros((self.shared_video_num, self.max_ctx_len, feat_dim))
        visual_feat_mask = torch.zeros((self.shared_video_num, self.max_ctx_len), dtype=torch.long)
        visual_feat_pos_id = \
            torch.repeat_interleave(torch.arange(self.max_ctx_len, dtype=torch.long).unsqueeze(0),
                                    self.shared_video_num, dim=0)
        visual_feat_token_id = torch.full((self.shared_video_num, self.max_ctx_len), self.visual_token_id,
                                          dtype=torch.long)

        for index, video_name in enumerate(total_vid_name_list, start=0):
            visual_feat = self.get_visual_feat_from_storage(video_name)

            feat_pad, feat_mask = \
                self.pad_feature(visual_feat, self.max_ctx_len)

            visual_feat_pad[index] = feat_pad
            visual_feat_mask[index] = feat_mask

        temp_model_inputs = dict()
        temp_model_inputs["feat"] = visual_feat_pad
        temp_model_inputs["feat_mask"] = visual_feat_mask
        temp_model_inputs["feat_pos_id"] = visual_feat_pos_id
        temp_model_inputs["feat_token_id"] = visual_feat_token_id

        model_inputs["visual"] = temp_model_inputs

        """
        a dict for sub modality: {
            "feat": torch.tensor, (shared_video_num, max_ctx_len, D_t)
            "feat_mask": torch.tensor, (shared_video_num, max_ctx_len)
            "feat_pos_id": torch.tensor, (shared_video_num, max_ctx_len)
            "feat_token_id": torch.tensor, (shared_video_num, max_ctx_len)
        }
        """
        if self.use_sub:
            groundtruth_sub_feat = self.get_sub_feat_from_storage(meta["vid_name"])

            _, feat_dim = groundtruth_sub_feat.shape

            sub_feat_pad = torch.zeros((self.shared_video_num, self.max_ctx_len, feat_dim))
            sub_feat_mask = torch.zeros((self.shared_video_num, self.max_ctx_len), dtype=torch.long)
            sub_feat_pos_id = \
                torch.repeat_interleave(torch.arange(self.max_ctx_len, dtype=torch.long).unsqueeze(0),
                                        self.shared_video_num, dim=0)
            sub_feat_token_id = torch.full((self.shared_video_num, self.max_ctx_len), self.text_token_id, dtype=torch.long)

            for index, video_name in enumerate(total_vid_name_list, start=0):
                sub_feat = self.get_sub_feat_from_storage(video_name)

                feat_pad, feat_mask = \
                    self.pad_feature(sub_feat, self.max_ctx_len)

                sub_feat_pad[index] = feat_pad
                sub_feat_mask[index] = feat_mask

            temp_model_inputs = dict()
            temp_model_inputs["feat"] = sub_feat_pad
            temp_model_inputs["feat_mask"] = sub_feat_mask
            temp_model_inputs["feat_pos_id"] = sub_feat_pos_id
            temp_model_inputs["feat_token_id"] = sub_feat_token_id

            model_inputs["sub"] = temp_model_inputs

        if not self.is_eval:
            model_inputs["st_ed_indices"] = self.get_st_ed_label(meta["ts"],
                                                                 max_idx=ctx_l - 1)

        return dict(meta=meta, model_inputs=model_inputs)

    def get_st_ed_label(self, ts, max_idx):
        """
        Args:
            ts: [st (float), ed (float)] in seconds, ed > st
            max_idx: length of the video

        Returns:
            [st_idx, ed_idx]: int,
            ed_idx >= st_idx
            st_idx, ed_idx both belong to [0, max_idx-1]

            Given ts = [3.2, 7.6], st_idx = 2, ed_idx = 6,
            clips should be indexed as [2: 6), the translated back ts should be [3:9].
            # TODO which one is better, [2: 5] or [2: 6)
        """
        st_idx = min(math.floor(ts[0] / self.clip_length), max_idx)
        ed_idx = min(math.ceil(ts[1] / self.clip_length) - 1, max_idx)  # st_idx could be the same as ed_idx
        assert 0 <= st_idx <= ed_idx <= max_idx, (ts, st_idx, ed_idx, max_idx)
        return torch.LongTensor([st_idx, ed_idx])
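A minimal sketch of wiring this dataset for evaluation, mirroring the DataLoader setup used in `inference.py` below. `data_cfg`, `val_path`, `vr_rank_h5`, and `opt` are hypothetical placeholders for the loaded data config, the annotation JSON, the HERO rank-list HDF5, and the parsed options.

```python
# Hypothetical evaluation wiring; placeholder names are not defined by this repository.
from torch.utils.data import DataLoader
from utils.model_utils import start_end_collate, move_cuda
from data_loader.second_stage_start_end_dataset import StartEndDataset

eval_dataset = StartEndDataset(config=data_cfg, data_path=val_path, vr_rank_path=vr_rank_h5,
                               is_eval=True, mode="val", inference_top_k=opt.max_vcmr_video)
eval_loader = DataLoader(eval_dataset, collate_fn=start_end_collate,
                         batch_size=opt.eval_query_bsz, num_workers=opt.num_workers,
                         shuffle=False, pin_memory=True)
for batch in eval_loader:
    # batch["model_inputs"] holds the query/visual/sub feature dicts built in __getitem__
    model_inputs = move_cuda(batch["model_inputs"], opt.device)
    break
```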
inference.py
ADDED
@@ -0,0 +1,570 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
import pprint
|
3 |
+
from tqdm import tqdm
|
4 |
+
import numpy as np
|
5 |
+
|
6 |
+
import torch
|
7 |
+
import torch.nn.functional as F
|
8 |
+
import torch.backends.cudnn as cudnn
|
9 |
+
from torch.utils.data import DataLoader
|
10 |
+
|
11 |
+
from config.config import TestOptions
|
12 |
+
from model.conquer import CONQUER
|
13 |
+
from data_loader.second_stage_start_end_dataset import StartEndDataset as StartEndEvalDataset
|
14 |
+
from utils.inference_utils import \
|
15 |
+
get_submission_top_n, post_processing_vcmr_nms
|
16 |
+
from utils.basic_utils import save_json , load_config
|
17 |
+
from utils.tensor_utils import find_max_triples_from_upper_triangle_product
|
18 |
+
from standalone_eval.eval import eval_retrieval
|
19 |
+
from utils.model_utils import move_cuda , start_end_collate
|
20 |
+
from utils.model_utils import VERY_NEGATIVE_NUMBER
|
21 |
+
import logging
|
22 |
+
from time import time
|
23 |
+
from ndcg_iou_topk import calculate_ndcg_iou
|
24 |
+
|
25 |
+
logger = logging.getLogger(__name__)
|
26 |
+
logging.basicConfig(format="%(asctime)s.%(msecs)03d:%(levelname)s:%(name)s - %(message)s",
|
27 |
+
datefmt="%Y-%m-%d %H:%M:%S",
|
28 |
+
level=logging.INFO)
|
29 |
+
|
30 |
+
def generate_min_max_length_mask(array_shape, min_l, max_l):
|
31 |
+
""" The last two dimension denotes matrix of upper-triangle with upper-right corner masked,
|
32 |
+
below is the case for 4x4.
|
33 |
+
[[0, 1, 1, 0],
|
34 |
+
[0, 0, 1, 1],
|
35 |
+
[0, 0, 0, 1],
|
36 |
+
[0, 0, 0, 0]]
|
37 |
+
|
38 |
+
Args:
|
39 |
+
array_shape: np.shape??? The last two dimensions should be the same
|
40 |
+
min_l: int, minimum length of predicted span
|
41 |
+
max_l: int, maximum length of predicted span
|
42 |
+
|
43 |
+
Returns:
|
44 |
+
|
45 |
+
"""
|
46 |
+
single_dims = (1, ) * (len(array_shape) - 2)
|
47 |
+
mask_shape = single_dims + array_shape[-2:]
|
48 |
+
extra_length_mask_array = np.ones(mask_shape, dtype=np.float32) # (1, ..., 1, L, L)
|
49 |
+
mask_triu = np.triu(extra_length_mask_array, k=min_l)
|
50 |
+
mask_triu_reversed = 1 - np.triu(extra_length_mask_array, k=max_l)
|
51 |
+
final_prob_mask = mask_triu * mask_triu_reversed
|
52 |
+
return final_prob_mask # with valid bit to be 1
|
53 |
+
|
54 |
+
|
55 |
+
def get_svmr_res_from_st_ed_probs_disjoint(svmr_gt_st_probs, svmr_gt_ed_probs, query_metas, video2idx,
|
56 |
+
clip_length, min_pred_l, max_pred_l, max_before_nms):
|
57 |
+
"""
|
58 |
+
Args:
|
59 |
+
svmr_gt_st_probs: np.ndarray (N_queries, L, L), value range [0, 1]
|
60 |
+
svmr_gt_ed_probs:
|
61 |
+
query_metas:
|
62 |
+
video2idx:
|
63 |
+
clip_length: float, how long each clip is in seconds
|
64 |
+
min_pred_l: int, minimum number of clips
|
65 |
+
max_pred_l: int, maximum number of clips
|
66 |
+
max_before_nms: get top-max_before_nms predictions for each query
|
67 |
+
|
68 |
+
Returns:
|
69 |
+
|
70 |
+
"""
|
71 |
+
svmr_res = []
|
72 |
+
query_vid_names = [e["vid_name"] for e in query_metas]
|
73 |
+
|
74 |
+
# masking very long ones! Since most are relatively short.
|
75 |
+
# disjoint : b_i + e_i
|
76 |
+
_st_ed_scores = np.expand_dims(svmr_gt_st_probs,axis=2) + np.expand_dims(svmr_gt_ed_probs,axis=1)
|
77 |
+
|
78 |
+
_N_q = _st_ed_scores.shape[0]
|
79 |
+
|
80 |
+
_valid_prob_mask = np.logical_not(generate_min_max_length_mask(
|
81 |
+
_st_ed_scores.shape, min_l=min_pred_l, max_l=max_pred_l).astype(bool))
|
82 |
+
|
83 |
+
valid_prob_mask = np.tile(_valid_prob_mask,(_N_q, 1, 1))
|
84 |
+
|
85 |
+
# invalid location will become VERY_NEGATIVE_NUMBER!
|
86 |
+
_st_ed_scores[valid_prob_mask] = VERY_NEGATIVE_NUMBER
|
87 |
+
|
88 |
+
batched_sorted_triples = find_max_triples_from_upper_triangle_product(
|
89 |
+
_st_ed_scores, top_n=max_before_nms, prob_thd=None)
|
90 |
+
for i, q_vid_name in tqdm(enumerate(query_vid_names),
|
91 |
+
desc="[SVMR] Loop over queries to generate predictions",
|
92 |
+
total=len(query_vid_names)): # i is query_id
|
93 |
+
q_m = query_metas[i]
|
94 |
+
video_idx = video2idx[q_vid_name]
|
95 |
+
_sorted_triples = batched_sorted_triples[i]
|
96 |
+
_sorted_triples[:, 1] += 1 # as we redefined ed_idx, which is inside the moment.
|
97 |
+
_sorted_triples[:, :2] = _sorted_triples[:, :2] * clip_length
|
98 |
+
# [video_idx(int), st(float), ed(float), score(float)]
|
99 |
+
cur_ranked_predictions = [[video_idx, ] + row for row in _sorted_triples.tolist()]
|
100 |
+
cur_query_pred = dict(
|
101 |
+
query_id=q_m["query_id"],
|
102 |
+
desc=q_m["desc"],
|
103 |
+
predictions=cur_ranked_predictions
|
104 |
+
)
|
105 |
+
svmr_res.append(cur_query_pred)
|
106 |
+
return svmr_res
|
107 |
+
|
108 |
+
|
109 |
+
def get_svmr_res_from_st_ed_probs(svmr_gt_st_probs, svmr_gt_ed_probs, query_metas, video2idx,
|
110 |
+
clip_length, min_pred_l, max_pred_l, max_before_nms):
|
111 |
+
"""
|
112 |
+
Args:
|
113 |
+
svmr_gt_st_probs: np.ndarray (N_queries, L, L), value range [0, 1]
|
114 |
+
svmr_gt_ed_probs:
|
115 |
+
query_metas:
|
116 |
+
video2idx:
|
117 |
+
clip_length: float, how long each clip is in seconds
|
118 |
+
min_pred_l: int, minimum number of clips
|
119 |
+
max_pred_l: int, maximum number of clips
|
120 |
+
max_before_nms: get top-max_before_nms predictions for each query
|
121 |
+
|
122 |
+
Returns:
|
123 |
+
|
124 |
+
"""
|
125 |
+
svmr_res = []
|
126 |
+
query_vid_names = [e["vid_name"] for e in query_metas]
|
127 |
+
|
128 |
+
# masking very long ones! Since most are relatively short.
|
129 |
+
# general/exclusive : \hat{b_i} * \hat{e_i}
|
130 |
+
st_ed_prob_product = np.einsum("bm,bn->bmn", svmr_gt_st_probs, svmr_gt_ed_probs) # (N, L, L)
|
131 |
+
|
132 |
+
valid_prob_mask = generate_min_max_length_mask(st_ed_prob_product.shape, min_l=min_pred_l, max_l=max_pred_l)
|
133 |
+
st_ed_prob_product *= valid_prob_mask # invalid location will become zero!
|
134 |
+
|
135 |
+
batched_sorted_triples = find_max_triples_from_upper_triangle_product(
|
136 |
+
st_ed_prob_product, top_n=max_before_nms, prob_thd=None)
|
137 |
+
for i, q_vid_name in tqdm(enumerate(query_vid_names),
|
138 |
+
desc="[SVMR] Loop over queries to generate predictions",
|
139 |
+
total=len(query_vid_names)): # i is query_id
|
140 |
+
q_m = query_metas[i]
|
141 |
+
video_idx = video2idx[q_vid_name]
|
142 |
+
_sorted_triples = batched_sorted_triples[i]
|
143 |
+
_sorted_triples[:, 1] += 1 # as we redefined ed_idx, which is inside the moment.
|
144 |
+
_sorted_triples[:, :2] = _sorted_triples[:, :2] * clip_length
|
145 |
+
# [video_idx(int), st(float), ed(float), score(float)]
|
146 |
+
cur_ranked_predictions = [[video_idx, ] + row for row in _sorted_triples.tolist()]
|
147 |
+
cur_query_pred = dict(
|
148 |
+
query_id=q_m["query_id"],
|
149 |
+
desc=q_m["desc"],
|
150 |
+
predictions=cur_ranked_predictions
|
151 |
+
)
|
152 |
+
svmr_res.append(cur_query_pred)
|
153 |
+
return svmr_res
|
154 |
+
|
155 |
+
|
156 |
+
|
157 |
+
def compute_query2ctx_info(model, eval_dataset, opt,
|
158 |
+
max_before_nms=200, max_n_videos=100, tasks=("SVMR",)):
|
159 |
+
"""
|
160 |
+
Use val set to do evaluation, remember to run with torch.no_grad().
|
161 |
+
model : CONQUER
|
162 |
+
eval_dataset :
|
163 |
+
opt :
|
164 |
+
max_before_nms : max moment number before non-maximum suppression
|
165 |
+
tasks: evaluation tasks
|
166 |
+
|
167 |
+
general/exclusive function : r * \hat{b_i} + \hat{e_i}
|
168 |
+
"""
|
169 |
+
is_vr = "VR" in tasks
|
170 |
+
is_vcmr = "VCMR" in tasks
|
171 |
+
is_svmr = "SVMR" in tasks
|
172 |
+
|
173 |
+
video2idx = eval_dataset.video2idx
|
174 |
+
|
175 |
+
model.eval()
|
176 |
+
query_eval_loader = DataLoader(eval_dataset,
|
177 |
+
collate_fn= start_end_collate,
|
178 |
+
batch_size=opt.eval_query_bsz,
|
179 |
+
num_workers=opt.num_workers,
|
180 |
+
shuffle=False,
|
181 |
+
pin_memory=True)
|
182 |
+
|
183 |
+
n_total_query = len(eval_dataset)
|
184 |
+
bsz = opt.eval_query_bsz
|
185 |
+
|
186 |
+
if is_vcmr:
|
187 |
+
flat_st_ed_scores_sorted_indices = np.empty((n_total_query, max_before_nms), dtype=int)
|
188 |
+
flat_st_ed_sorted_scores = np.zeros((n_total_query, max_before_nms), dtype=np.float32)
|
189 |
+
|
190 |
+
if is_vr :
|
191 |
+
if opt.use_interal_vr_scores:
|
192 |
+
sorted_q2c_indices = np.tile(np.arange(max_n_videos, dtype=int),n_total_query).reshape(n_total_query,max_n_videos)
|
193 |
+
sorted_q2c_scores = np.empty((n_total_query, max_n_videos), dtype=np.float32)
|
194 |
+
else:
|
195 |
+
sorted_q2c_indices = np.empty((n_total_query, max_n_videos), dtype=int)
|
196 |
+
sorted_q2c_scores = np.empty((n_total_query, max_n_videos), dtype=np.float32)
|
197 |
+
|
198 |
+
if is_svmr:
|
199 |
+
svmr_gt_st_probs = np.zeros((n_total_query, opt.max_ctx_len), dtype=np.float32)
|
200 |
+
svmr_gt_ed_probs = np.zeros((n_total_query, opt.max_ctx_len), dtype=np.float32)
|
201 |
+
|
202 |
+
query_metas = []
|
203 |
+
for idx, batch in tqdm(
|
204 |
+
enumerate(query_eval_loader), desc="Computing q embedding", total=len(query_eval_loader)):
|
205 |
+
|
206 |
+
_query_metas = batch["meta"]
|
207 |
+
query_metas.extend(batch["meta"])
|
208 |
+
|
209 |
+
if opt.device.type == "cuda":
|
210 |
+
model_inputs = move_cuda(batch["model_inputs"], opt.device)
|
211 |
+
else:
|
212 |
+
model_inputs = batch["model_inputs"]
|
213 |
+
|
214 |
+
|
215 |
+
video_similarity_score, begin_score_distribution, end_score_distribution = \
|
216 |
+
model.get_pred_from_raw_query(model_inputs)
|
217 |
+
|
218 |
+
if is_svmr:
|
219 |
+
_svmr_st_probs = begin_score_distribution[:, 0]
|
220 |
+
_svmr_ed_probs = end_score_distribution[:, 0]
|
221 |
+
|
222 |
+
# normalize to get true probabilities!!!
|
223 |
+
# the probabilities here are already (pad) masked, so only need to do softmax
|
224 |
+
_svmr_st_probs = F.softmax(_svmr_st_probs, dim=-1) # (_N_q, L)
|
225 |
+
_svmr_ed_probs = F.softmax(_svmr_ed_probs, dim=-1)
|
226 |
+
if opt.debug:
|
227 |
+
print("svmr_st_probs: ", _svmr_st_probs)
|
228 |
+
|
229 |
+
svmr_gt_st_probs[idx * bsz:(idx + 1) * bsz] = \
|
230 |
+
_svmr_st_probs.cpu().numpy()
|
231 |
+
|
232 |
+
svmr_gt_ed_probs[idx * bsz:(idx + 1) * bsz] = \
|
233 |
+
_svmr_ed_probs.cpu().numpy()
|
234 |
+
|
235 |
+
_vcmr_st_prob = begin_score_distribution[:, 1:]
|
236 |
+
_vcmr_ed_prob = end_score_distribution[:, 1:]
|
237 |
+
|
238 |
+
if not (is_vr or is_vcmr):
|
239 |
+
continue
|
240 |
+
|
241 |
+
if opt.use_interal_vr_scores:
|
242 |
+
bs = begin_score_distribution.size()[0]
|
243 |
+
_sorted_q2c_indices = torch.arange(max_n_videos).to(begin_score_distribution.device).repeat(bs,1)
|
244 |
+
_sorted_q2c_scores = model_inputs["inference_vr_scores"]
|
245 |
+
if is_vr:
|
246 |
+
sorted_q2c_scores[idx * bsz:(idx + 1) * bsz] = model_inputs["inference_vr_scores"].cpu().numpy()
|
247 |
+
else:
|
248 |
+
video_similarity_score = video_similarity_score[:, 1:]
|
249 |
+
_query_context_scores = torch.softmax(video_similarity_score,dim=1)
|
250 |
+
|
251 |
+
# Get top-max_n_videos videos for each query
|
252 |
+
_sorted_q2c_scores, _sorted_q2c_indices = \
|
253 |
+
torch.topk(_query_context_scores, max_n_videos, dim=1, largest=True)
|
254 |
+
if is_vr:
|
255 |
+
sorted_q2c_indices[idx * bsz:(idx + 1) * bsz] = _sorted_q2c_indices.cpu().numpy()
|
256 |
+
sorted_q2c_scores[idx * bsz:(idx + 1) * bsz] = _sorted_q2c_scores.cpu().numpy()
|
257 |
+
|
258 |
+
|
259 |
+
if not is_vcmr:
|
260 |
+
continue
|
261 |
+
|
262 |
+
|
263 |
+
# normalize to get true probabilities!!!
|
264 |
+
# the probabilities here are already (pad) masked, so only need to do softmax
|
265 |
+
_st_probs = F.softmax(_vcmr_st_prob, dim=-1) # (_N_q, N_videos, L)
|
266 |
+
_ed_probs = F.softmax(_vcmr_ed_prob, dim=-1)
|
267 |
+
|
268 |
+
|
269 |
+
# Get VCMR results
|
270 |
+
# compute combined scores
|
271 |
+
row_indices = torch.arange(0, len(_st_probs), device=opt.device).unsqueeze(1)
|
272 |
+
_st_probs = _st_probs[row_indices, _sorted_q2c_indices] # (_N_q, max_n_videos, L)
|
273 |
+
_ed_probs = _ed_probs[row_indices, _sorted_q2c_indices]
|
274 |
+
|
275 |
+
# (_N_q, max_n_videos, L, L)
|
276 |
+
# general/exclusive : r * \hat{b_i} * \hat{e_i}
|
277 |
+
_st_ed_scores = torch.einsum("qvm,qv,qvn->qvmn", _st_probs, _sorted_q2c_scores, _ed_probs)
|
278 |
+
|
279 |
+
valid_prob_mask = generate_min_max_length_mask(
|
280 |
+
_st_ed_scores.shape, min_l=opt.min_pred_l, max_l=opt.max_pred_l)
|
281 |
+
|
282 |
+
_st_ed_scores *= torch.from_numpy(
|
283 |
+
valid_prob_mask).to(_st_ed_scores.device) # invalid location will become zero!
|
284 |
+
|
285 |
+
_n_q = _st_ed_scores.shape[0]
|
286 |
+
|
287 |
+
# sort across the total_n_videos videos (by flatten from the 2nd dim)
|
288 |
+
# the indices here are local indices, not global indices
|
289 |
+
|
290 |
+
_flat_st_ed_scores = _st_ed_scores.reshape(_n_q, -1) # (N_q, total_n_videos*L*L)
|
291 |
+
_flat_st_ed_sorted_scores, _flat_st_ed_scores_sorted_indices = \
|
292 |
+
torch.sort(_flat_st_ed_scores, dim=1, descending=True)
|
293 |
+
|
294 |
+
# collect data
|
295 |
+
flat_st_ed_sorted_scores[idx * bsz:(idx + 1) * bsz] = \
|
296 |
+
_flat_st_ed_sorted_scores[:, :max_before_nms].detach().cpu().numpy()
|
297 |
+
flat_st_ed_scores_sorted_indices[idx * bsz:(idx + 1) * bsz] = \
|
298 |
+
_flat_st_ed_scores_sorted_indices[:, :max_before_nms].detach().cpu().numpy()
|
299 |
+
|
300 |
+
if opt.debug:
|
301 |
+
break
|
302 |
+
|
303 |
+
# Numpy starts here!!!
|
304 |
+
vr_res = []
|
305 |
+
if is_vr:
|
306 |
+
for i, (_sorted_q2c_scores_row, _sorted_q2c_indices_row) in tqdm(
|
307 |
+
enumerate(zip(sorted_q2c_scores, sorted_q2c_indices)),
|
308 |
+
desc="[VR] Loop over queries to generate predictions", total=n_total_query):
|
309 |
+
cur_vr_redictions = []
|
310 |
+
query_specific_video_metas = query_metas[i]["sample_vid_name_list"]
|
311 |
+
for j, (v_score, v_meta_idx) in enumerate(zip(_sorted_q2c_scores_row, _sorted_q2c_indices_row)):
|
312 |
+
video_idx = video2idx[query_specific_video_metas[v_meta_idx]]
|
313 |
+
cur_vr_redictions.append([video_idx, 0, 0, float(v_score)])
|
314 |
+
cur_query_pred = dict(
|
315 |
+
query_id=query_metas[i]["query_id"],
|
316 |
+
desc=query_metas[i]["desc"],
|
317 |
+
predictions=cur_vr_redictions
|
318 |
+
)
|
319 |
+
vr_res.append(cur_query_pred)
|
320 |
+
|
321 |
+
svmr_res = []
|
322 |
+
if is_svmr:
|
323 |
+
svmr_res = get_svmr_res_from_st_ed_probs(svmr_gt_st_probs, svmr_gt_ed_probs,
|
324 |
+
query_metas, video2idx,
|
325 |
+
clip_length=opt.clip_length,
|
326 |
+
min_pred_l=opt.min_pred_l,
|
327 |
+
max_pred_l=opt.max_pred_l,
|
328 |
+
max_before_nms=max_before_nms)
|
329 |
+
|
330 |
+
|
331 |
+
vcmr_res = []
|
332 |
+
if is_vcmr:
|
333 |
+
for i, (_flat_st_ed_scores_sorted_indices, _flat_st_ed_sorted_scores) in tqdm(
|
334 |
+
enumerate(zip(flat_st_ed_scores_sorted_indices, flat_st_ed_sorted_scores)),
|
335 |
+
desc="[VCMR] Loop over queries to generate predictions", total=n_total_query): # i is query_idx
|
336 |
+
# list([video_idx(int), st(float), ed(float), score(float)])
|
337 |
+
video_meta_indices_local, pred_st_indices, pred_ed_indices = \
|
338 |
+
np.unravel_index(_flat_st_ed_scores_sorted_indices,
|
339 |
+
shape=(max_n_videos, opt.max_ctx_len, opt.max_ctx_len))
|
340 |
+
# video_meta_indices refers to the indices among the total_n_videos
|
341 |
+
# video_meta_indices_local refers to the indices among the top-max_n_videos
|
342 |
+
# video_meta_indices refers to the indices in all the videos, which is the True indices
|
343 |
+
video_meta_indices = sorted_q2c_indices[i, video_meta_indices_local]
|
344 |
+
|
345 |
+
pred_st_in_seconds = pred_st_indices.astype(np.float32) * opt.clip_length
|
346 |
+
pred_ed_in_seconds = pred_ed_indices.astype(np.float32) * opt.clip_length + opt.clip_length
|
347 |
+
cur_vcmr_redictions = []
|
348 |
+
query_specific_video_metas = query_metas[i]["sample_vid_name_list"]
|
349 |
+
for j, (v_meta_idx, v_score) in enumerate(zip(video_meta_indices, _flat_st_ed_sorted_scores)): # videos
|
350 |
+
video_idx = video2idx[query_specific_video_metas[v_meta_idx]]
|
351 |
+
cur_vcmr_redictions.append(
|
352 |
+
[video_idx, float(pred_st_in_seconds[j]), float(pred_ed_in_seconds[j]), float(v_score)])
|
353 |
+
|
354 |
+
cur_query_pred = dict(
|
355 |
+
query_id=query_metas[i]["query_id"],
|
356 |
+
desc=query_metas[i]["desc"],
|
357 |
+
predictions=cur_vcmr_redictions)
|
358 |
+
vcmr_res.append(cur_query_pred)
|
359 |
+
|
360 |
+
res = dict(VCMR=vcmr_res, SVMR=svmr_res, VR=vr_res)
|
361 |
+
return {k: v for k, v in res.items() if len(v) != 0}
|
362 |
+
|
363 |
+
|
364 |
+
def compute_query2ctx_info_disjoint(model, eval_dataset, opt,
|
365 |
+
max_before_nms=200, max_n_videos=100, maxtopk = 40):
|
366 |
+
"""Use val set to do evaluation, remember to run with torch.no_grad().
|
367 |
+
model : CONQUER
|
368 |
+
eval_dataset :
|
369 |
+
opt :
|
370 |
+
max_before_nms : max moment number before non-maximum suppression
|
371 |
+
tasks: evaluation tasks
|
372 |
+
|
373 |
+
disjoint function : b_i + e_i
|
374 |
+
|
375 |
+
"""
|
376 |
+
video2idx = eval_dataset.video2idx
|
377 |
+
|
378 |
+
model.eval()
|
379 |
+
query_eval_loader = DataLoader(eval_dataset, collate_fn= start_end_collate, batch_size=opt.eval_query_bsz,
|
380 |
+
num_workers=opt.num_workers, shuffle=False, pin_memory=True)
|
381 |
+
|
382 |
+
n_total_query = len(eval_dataset)
|
383 |
+
bsz = opt.eval_query_bsz
|
384 |
+
|
385 |
+
flat_st_ed_scores_sorted_indices = np.empty((n_total_query, max_before_nms), dtype=int)
|
386 |
+
flat_st_ed_sorted_scores = np.zeros((n_total_query, max_before_nms), dtype=np.float32)
|
387 |
+
|
388 |
+
|
389 |
+
query_metas = []
|
390 |
+
for idx, batch in tqdm(
|
391 |
+
enumerate(query_eval_loader), desc="Computing q embedding", total=len(query_eval_loader)):
|
392 |
+
|
393 |
+
query_metas.extend(batch["meta"])
|
394 |
+
if opt.device.type == "cuda":
|
395 |
+
model_inputs = move_cuda(batch["model_inputs"], opt.device)
|
396 |
+
|
397 |
+
else:
|
398 |
+
model_inputs = batch["model_inputs"]
|
399 |
+
|
400 |
+
_ , begin_score_distribution, end_score_distribution = model.get_pred_from_raw_query(model_inputs)
|
401 |
+
|
402 |
+
begin_score_distribution = begin_score_distribution[:,1:]
|
403 |
+
end_score_distribution= end_score_distribution[:,1:]
|
404 |
+
|
405 |
+
# Get VCMR results
|
406 |
+
# (_N_q, total_n_videos, L, L)
|
407 |
+
# b_i + e_i
|
408 |
+
_st_ed_scores = torch.unsqueeze(begin_score_distribution, 3) + torch.unsqueeze(end_score_distribution, 2)
|
409 |
+
|
410 |
+
_n_q, total_n_videos = _st_ed_scores.size()[:2]
|
411 |
+
|
412 |
+
|
413 |
+
## mask the invalid location out of moment length constrain
|
414 |
+
_valid_prob_mask = np.logical_not(generate_min_max_length_mask(
|
415 |
+
_st_ed_scores.shape, min_l=opt.min_pred_l, max_l=opt.max_pred_l).astype(bool))
|
416 |
+
|
417 |
+
_valid_prob_mask = torch.from_numpy(_valid_prob_mask).to(_st_ed_scores.device)
|
418 |
+
|
419 |
+
valid_prob_mask = _valid_prob_mask.repeat(_n_q,total_n_videos,1,1)
|
420 |
+
|
421 |
+
# invalid locations are set to VERY_NEGATIVE_NUMBER
|
422 |
+
_st_ed_scores[valid_prob_mask] = VERY_NEGATIVE_NUMBER
|
423 |
+
|
424 |
+
# sort across the total_n_videos videos (by flattening from the 2nd dim)
|
425 |
+
# the indices here are local indices, not global indices
|
426 |
+
_flat_st_ed_scores = _st_ed_scores.reshape(_n_q, -1) # (N_q, total_n_videos*L*L)
|
427 |
+
_flat_st_ed_sorted_scores, _flat_st_ed_scores_sorted_indices = \
|
428 |
+
torch.sort(_flat_st_ed_scores, dim=1, descending=True)
|
429 |
+
|
430 |
+
# collect data
|
431 |
+
flat_st_ed_sorted_scores[idx * bsz:(idx + 1) * bsz] = \
|
432 |
+
_flat_st_ed_sorted_scores[:, :max_before_nms].detach().cpu().numpy()
|
433 |
+
flat_st_ed_scores_sorted_indices[idx * bsz:(idx + 1) * bsz] = \
|
434 |
+
_flat_st_ed_scores_sorted_indices[:, :max_before_nms].detach().cpu().numpy()
|
435 |
+
|
436 |
+
|
437 |
+
|
438 |
+
vcmr_res = {}
|
439 |
+
for i, (_flat_st_ed_scores_sorted_indices, _flat_st_ed_sorted_scores) in tqdm(
|
440 |
+
enumerate(zip(flat_st_ed_scores_sorted_indices, flat_st_ed_sorted_scores)),
|
441 |
+
desc="[VCMR] Loop over queries to generate predictions", total=n_total_query): # i is query_idx
|
442 |
+
# list([video_idx(int), st(float), ed(float), score(float)])
|
443 |
+
video_meta_indices_local, pred_st_indices, pred_ed_indices = \
|
444 |
+
np.unravel_index(_flat_st_ed_scores_sorted_indices,
|
445 |
+
shape=(total_n_videos, opt.max_ctx_len, opt.max_ctx_len))
|
446 |
+
|
447 |
+
pred_st_in_seconds = pred_st_indices.astype(np.float32) * opt.clip_length
|
448 |
+
pred_ed_in_seconds = pred_ed_indices.astype(np.float32) * opt.clip_length + opt.clip_length
|
449 |
+
cur_vcmr_predictions = []
|
450 |
+
query_specific_video_metas = query_metas[i]["sample_vid_name_list"]
|
451 |
+
for j, (v_meta_idx, v_score) in enumerate(zip(video_meta_indices_local, _flat_st_ed_sorted_scores)): # videos
|
452 |
+
# video_idx = video2idx[query_specific_video_metas[v_meta_idx]]
|
453 |
+
cur_vcmr_predictions.append(
|
454 |
+
{
|
455 |
+
"video_name": query_specific_video_metas[v_meta_idx],
|
456 |
+
"timestamp": [float(pred_st_in_seconds[j]), float(pred_ed_in_seconds[j])],
|
457 |
+
"model_scores": float(v_score)
|
458 |
+
}
|
459 |
+
)
|
460 |
+
query_id = query_metas[i]["query_id"]
|
461 |
+
vcmr_res[query_id] = cur_vcmr_predictions[:maxtopk]
|
462 |
+
return vcmr_res
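As a small, self-contained sketch of the index decoding used above (toy sizes; the real values come from opt.max_ctx_len, opt.clip_length, and the candidate video list):

```python
import numpy as np

total_n_videos, max_ctx_len, clip_length = 3, 4, 1.5  # toy sizes

# Disjoint score b_i + e_j for every (video, start_clip, end_clip) triple.
rng = np.random.default_rng(0)
st_ed_scores = rng.normal(size=(total_n_videos, max_ctx_len, max_ctx_len))

# Flatten, sort descending, then recover (video, start, end) with unravel_index,
# mirroring the reshape / sort / unravel pattern in the function above.
flat_order = np.argsort(-st_ed_scores.reshape(-1))
video_idx, st_idx, ed_idx = np.unravel_index(
    flat_order[:5], shape=(total_n_videos, max_ctx_len, max_ctx_len))

# Clip indices to seconds, following the same convention as the code above.
st_sec = st_idx.astype(np.float32) * clip_length
ed_sec = ed_idx.astype(np.float32) * clip_length + clip_length
print(list(zip(video_idx.tolist(), st_sec.tolist(), ed_sec.tolist())))
```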
|
463 |
+
|
464 |
+
def get_eval_res(model, eval_dataset, opt):
|
465 |
+
"""compute and save query and video proposal embeddings"""
|
466 |
+
|
467 |
+
if opt.similarity_measure == "disjoint":  # disjoint: b_i + e_i
|
468 |
+
eval_res = compute_query2ctx_info_disjoint(model, eval_dataset, opt,
|
469 |
+
max_before_nms=opt.max_before_nms,
|
470 |
+
max_n_videos=opt.max_vcmr_video)
|
471 |
+
elif opt.similarity_measure in ["general" , "exclusive" ] : # r * \hat{b_i} * \hat{e_i}
|
472 |
+
eval_res = compute_query2ctx_info(model, eval_dataset, opt,
|
473 |
+
max_before_nms=opt.max_before_nms,
|
474 |
+
max_n_videos=opt.max_vcmr_video,
|
475 |
+
tasks=opt.tasks)
|
476 |
+
|
477 |
+
|
478 |
+
return eval_res
|
479 |
+
|
480 |
+
|
481 |
+
POST_PROCESSING_MMS_FUNC = {
|
482 |
+
"SVMR": post_processing_vcmr_nms,
|
483 |
+
"VCMR": post_processing_vcmr_nms
|
484 |
+
}
|
485 |
+
|
486 |
+
def get_prediction_top_n(list_dict_predictions, top_n):
|
487 |
+
top_n_res = []
|
488 |
+
for e in list_dict_predictions:
|
489 |
+
e["predictions"] = e["predictions"][:top_n]
|
490 |
+
top_n_res.append(e)
|
491 |
+
return top_n_res
|
492 |
+
|
493 |
+
|
494 |
+
def eval_epoch(model, eval_dataset, opt, max_after_nms, iou_thds, topks):
|
495 |
+
|
496 |
+
pred_data = get_eval_res(model, eval_dataset, opt)
|
497 |
+
# video2idx = eval_dataset.video2idx
|
498 |
+
# pred_data = get_prediction_top_n(eval_res, top_n=max_after_nms)
|
499 |
+
|
500 |
+
gt_data = eval_dataset.ground_truth
|
501 |
+
average_ndcg = calculate_ndcg_iou(gt_data, pred_data, iou_thds, topks)
|
502 |
+
return average_ndcg, pred_data
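To make the metric concrete, here is a generic sketch of NDCG@k where a prediction only earns the relevance of its matched ground-truth moment if their temporal IoU passes a threshold. It is an illustration of the idea only, not necessarily how calculate_ndcg_iou is implemented.

```python
import numpy as np

def temporal_iou(pred, gt):
    # IoU between two [start, end] spans in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def dcg(gains):
    gains = np.asarray(gains, dtype=np.float64)
    return float((gains / np.log2(np.arange(2, gains.size + 2))).sum())

def ndcg_iou_at_k(pred_moments, matched_gts, matched_rels, all_gt_rels,
                  iou_thd=0.5, k=10):
    # A prediction keeps the relevance of its matched GT moment only if IoU >= iou_thd.
    gains = [rel if temporal_iou(p, g) >= iou_thd else 0.0
             for p, g, rel in zip(pred_moments, matched_gts, matched_rels)]
    idcg = dcg(sorted(all_gt_rels, reverse=True)[:k])
    return dcg(gains[:k]) / idcg if idcg > 0 else 0.0
```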
|
503 |
+
|
504 |
+
|
505 |
+
|
506 |
+
def setup_model(opt):
|
507 |
+
"""Load model from checkpoint and move to specified device"""
|
508 |
+
checkpoint = torch.load(opt.ckpt_filepath)
|
509 |
+
loaded_model_cfg = checkpoint["model_cfg"]
|
510 |
+
|
511 |
+
model = CONQUER(loaded_model_cfg,
|
512 |
+
visual_dim=opt.visual_dim,
|
513 |
+
text_dim=opt.text_dim,
|
514 |
+
query_dim=opt.query_dim,
|
515 |
+
hidden_dim=opt.hidden_dim,
|
516 |
+
video_len=opt.max_ctx_len,
|
517 |
+
ctx_mode=opt.ctx_mode,
|
518 |
+
no_output_moe_weight=opt.no_output_moe_weight,
|
519 |
+
similarity_measure=opt.similarity_measure,
|
520 |
+
use_debug = opt.debug)
|
521 |
+
model.load_state_dict(checkpoint["model"])
|
522 |
+
|
523 |
+
logger.info("Loaded model saved at epoch {} from checkpoint: {}"
|
524 |
+
.format(checkpoint["epoch"], opt.ckpt_filepath))
|
525 |
+
|
526 |
+
if opt.device.type == "cuda":
|
527 |
+
logger.info("CUDA enabled.")
|
528 |
+
model.to(opt.device)
|
529 |
+
assert len(opt.device_ids) == 1
|
530 |
+
# if len(opt.device_ids) > 1:
|
531 |
+
# logger.info("Use multi GPU", opt.device_ids)
|
532 |
+
# model = torch.nn.DataParallel(model, device_ids=opt.device_ids) # use multi GPU
|
533 |
+
return model
|
534 |
+
|
535 |
+
|
536 |
+
def start_inference():
|
537 |
+
logger.info("Setup config, data and model...")
|
538 |
+
opt = TestOptions().parse()
|
539 |
+
cudnn.benchmark = False
|
540 |
+
cudnn.deterministic = True
|
541 |
+
|
542 |
+
data_config = load_config(opt.dataset_config)
|
543 |
+
|
544 |
+
eval_dataset = StartEndEvalDataset(
|
545 |
+
config = data_config,
|
546 |
+
max_ctx_len=opt.max_ctx_len,
|
547 |
+
max_desc_len= opt.max_desc_len,
|
548 |
+
clip_length = opt.clip_length,
|
549 |
+
ctx_mode = opt.ctx_mode,
|
550 |
+
mode = opt.eval_split_name,
|
551 |
+
data_ratio = opt.data_ratio,
|
552 |
+
is_eval = True,
|
553 |
+
inference_top_k = opt.max_vcmr_video)
|
554 |
+
|
555 |
+
postfix = "_hero"
|
556 |
+
model = setup_model(opt)
|
557 |
+
save_submission_filename = "inference_{}_{}_{}_predictions_{}{}.json".format(
|
558 |
+
opt.dset_name, opt.eval_split_name, opt.eval_id, "_".join(opt.tasks),postfix)
|
559 |
+
print(save_submission_filename)
|
560 |
+
logger.info("Starting inference...")
|
561 |
+
with torch.no_grad():
|
562 |
+
average_ndcg, pred_data = \
|
563 |
+
eval_epoch(model, eval_dataset, opt, max_after_nms=100,
|
564 |
+
iou_thds=(0.3, 0.5, 0.7), topks=(10, 20, 40))
|
565 |
+
logger.info("metrics_no_nms \n{}".format(pprint.pformat(metrics_no_nms, indent=4)))
|
566 |
+
logger.info("metrics_nms \n{}".format(pprint.pformat(metrics_nms, indent=4)))
|
567 |
+
|
568 |
+
|
569 |
+
if __name__ == '__main__':
|
570 |
+
start_inference()
|
model/__init__.py
ADDED
File without changes
|
model/backbone/__init__.py
ADDED
File without changes
|
model/backbone/encoder.py
ADDED
@@ -0,0 +1,235 @@
1 |
+
"""
|
2 |
+
Pytorch modules
|
3 |
+
some classes are modified from HuggingFace
|
4 |
+
(https://github.com/huggingface/transformers)
|
5 |
+
"""
|
6 |
+
|
7 |
+
import torch
|
8 |
+
import logging
|
9 |
+
from torch import nn
|
10 |
+
logger = logging.getLogger(__name__)
|
11 |
+
|
12 |
+
try:
|
13 |
+
from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
|
14 |
+
except (ImportError, AttributeError) as e:
|
15 |
+
BertLayerNorm = torch.nn.LayerNorm
|
16 |
+
|
17 |
+
from model.transformer.bert import BertEncoder
|
18 |
+
from model.layers import (NetVLAD, LinearLayer)
|
19 |
+
from model.transformer.bert_embed import (BertEmbeddings)
|
20 |
+
from utils.model_utils import mask_logits
|
21 |
+
import torch.nn.functional as F
|
22 |
+
|
23 |
+
|
24 |
+
|
25 |
+
class TransformerBaseModel(nn.Module):
|
26 |
+
"""
|
27 |
+
Base Transformer model
|
28 |
+
"""
|
29 |
+
def __init__(self, config):
|
30 |
+
super(TransformerBaseModel, self).__init__()
|
31 |
+
self.embeddings = BertEmbeddings(config)
|
32 |
+
self.encoder = BertEncoder(config)
|
33 |
+
|
34 |
+
|
35 |
+
def forward(self,features,position_ids,token_type_ids,attention_mask):
|
36 |
+
# embedding layer
|
37 |
+
embedding_output = self.embeddings(token_type_ids=token_type_ids,
|
38 |
+
inputs_embeds=features,
|
39 |
+
position_ids=position_ids)
|
40 |
+
|
41 |
+
encoder_outputs = self.encoder(embedding_output, attention_mask)
|
42 |
+
|
43 |
+
sequence_output = encoder_outputs[0]
|
44 |
+
|
45 |
+
return sequence_output
|
46 |
+
|
47 |
+
class TwoModalEncoder(nn.Module):
|
48 |
+
"""
|
49 |
+
Two modality Transformer Encoder model
|
50 |
+
"""
|
51 |
+
|
52 |
+
def __init__(self, config,img_dim,text_dim,hidden_dim,split_num,output_split=True):
|
53 |
+
super(TwoModalEncoder, self).__init__()
|
54 |
+
self.img_linear = LinearLayer(
|
55 |
+
in_hsz=img_dim, out_hsz=hidden_dim)
|
56 |
+
self.text_linear = LinearLayer(
|
57 |
+
in_hsz=text_dim, out_hsz=hidden_dim)
|
58 |
+
|
59 |
+
self.transformer = TransformerBaseModel(config)
|
60 |
+
self.output_split = output_split
|
61 |
+
if self.output_split:
|
62 |
+
self.split_num = split_num
|
63 |
+
|
64 |
+
|
65 |
+
def forward(self, visual_features, visual_position_ids, visual_token_type_ids, visual_attention_mask,
|
66 |
+
text_features,text_position_ids,text_token_type_ids,text_attention_mask):
|
67 |
+
|
68 |
+
transformed_im = self.img_linear(visual_features)
|
69 |
+
transformed_text = self.text_linear(text_features)
|
70 |
+
|
71 |
+
transformer_input_feat = torch.cat((transformed_im,transformed_text),dim=1)
|
72 |
+
transformer_input_feat_pos_id = torch.cat((visual_position_ids,text_position_ids),dim=1)
|
73 |
+
transformer_input_feat_token_id = torch.cat((visual_token_type_ids,text_token_type_ids),dim=1)
|
74 |
+
transformer_input_feat_mask = torch.cat((visual_attention_mask,text_attention_mask),dim=1)
|
75 |
+
|
76 |
+
output = self.transformer(features=transformer_input_feat,
|
77 |
+
position_ids=transformer_input_feat_pos_id,
|
78 |
+
token_type_ids=transformer_input_feat_token_id,
|
79 |
+
attention_mask=transformer_input_feat_mask)
|
80 |
+
|
81 |
+
if self.output_split:
|
82 |
+
return torch.split(output,self.split_num,dim=1)
|
83 |
+
else:
|
84 |
+
return output
|
85 |
+
|
86 |
+
|
87 |
+
class OneModalEncoder(nn.Module):
|
88 |
+
"""
|
89 |
+
One modality Transformer Encoder model
|
90 |
+
"""
|
91 |
+
|
92 |
+
def __init__(self, config,input_dim,hidden_dim):
|
93 |
+
super(OneModalEncoder, self).__init__()
|
94 |
+
self.linear = LinearLayer(
|
95 |
+
in_hsz=input_dim, out_hsz=hidden_dim)
|
96 |
+
self.transformer = TransformerBaseModel(config)
|
97 |
+
|
98 |
+
def forward(self, features, position_ids, token_type_ids, attention_mask):
|
99 |
+
|
100 |
+
transformed_features = self.linear(features)
|
101 |
+
|
102 |
+
output = self.transformer(features=transformed_features,
|
103 |
+
position_ids=position_ids,
|
104 |
+
token_type_ids=token_type_ids,
|
105 |
+
attention_mask=attention_mask)
|
106 |
+
return output
|
107 |
+
|
108 |
+
|
109 |
+
class VideoQueryEncoder(nn.Module):
|
110 |
+
def __init__(self, config, video_modality,
|
111 |
+
visual_dim=4352, text_dim= 768,
|
112 |
+
query_dim=768, hidden_dim = 768,split_num=100,):
|
113 |
+
super(VideoQueryEncoder, self).__init__()
|
114 |
+
self.use_sub = len(video_modality) > 1
|
115 |
+
if self.use_sub:
|
116 |
+
self.videoEncoder = TwoModalEncoder(config=config.bert_config,
|
117 |
+
img_dim = visual_dim,
|
118 |
+
text_dim = text_dim ,
|
119 |
+
hidden_dim = hidden_dim,
|
120 |
+
split_num = split_num
|
121 |
+
)
|
122 |
+
else:
|
123 |
+
self.videoEncoder = OneModalEncoder(config=config.bert_config,
|
124 |
+
input_dim = visual_dim,
|
125 |
+
hidden_dim = hidden_dim,
|
126 |
+
)
|
127 |
+
|
128 |
+
self.queryEncoder = OneModalEncoder(config=config.query_bert_config,
|
129 |
+
input_dim= query_dim,
|
130 |
+
hidden_dim=hidden_dim,
|
131 |
+
)
|
132 |
+
|
133 |
+
def forward_repr_query(self, batch):
|
134 |
+
|
135 |
+
query_output = self.queryEncoder(
|
136 |
+
features=batch["query"]["feat"],
|
137 |
+
position_ids=batch["query"]["feat_pos_id"],
|
138 |
+
token_type_ids=batch["query"]["feat_token_id"],
|
139 |
+
attention_mask=batch["query"]["feat_mask"]
|
140 |
+
)
|
141 |
+
|
142 |
+
return query_output
|
143 |
+
|
144 |
+
def forward_repr_video(self,batch):
|
145 |
+
video_output = dict()
|
146 |
+
|
147 |
+
if len(batch["visual"]["feat"].size()) == 4:
|
148 |
+
bsz, num_video = batch["visual"]["feat"].size()[:2]
|
149 |
+
for key in batch.keys():
|
150 |
+
if key in ["visual", "sub"]:
|
151 |
+
for key_2 in batch[key]:
|
152 |
+
if key_2 in ["feat", "feat_mask", "feat_pos_id", "feat_token_id"]:
|
153 |
+
shape_list = batch[key][key_2].size()[2:]
|
154 |
+
batch[key][key_2] = batch[key][key_2].view((bsz * num_video,) + shape_list)
|
155 |
+
|
156 |
+
|
157 |
+
if self.use_sub:
|
158 |
+
video_output["visual"], video_output["sub"] = self.videoEncoder(
|
159 |
+
visual_features=batch["visual"]["feat"],
|
160 |
+
visual_position_ids=batch["visual"]["feat_pos_id"],
|
161 |
+
visual_token_type_ids=batch["visual"]["feat_token_id"],
|
162 |
+
visual_attention_mask=batch["visual"]["feat_mask"],
|
163 |
+
text_features=batch["sub"]["feat"],
|
164 |
+
text_position_ids=batch["sub"]["feat_pos_id"],
|
165 |
+
text_token_type_ids=batch["sub"]["feat_token_id"],
|
166 |
+
text_attention_mask=batch["sub"]["feat_mask"]
|
167 |
+
)
|
168 |
+
else:
|
169 |
+
video_output["visual"] = self.videoEncoder(
|
170 |
+
features=batch["visual"]["feat"],
|
171 |
+
position_ids=batch["visual"]["feat_pos_id"],
|
172 |
+
token_type_ids=batch["visual"]["feat_token_id"],
|
173 |
+
attention_mask=batch["visual"]["feat_mask"]
|
174 |
+
)
|
175 |
+
|
176 |
+
return video_output
|
177 |
+
|
178 |
+
|
179 |
+
def forward_repr_both(self, batch):
|
180 |
+
video_output = self.forward_repr_video(batch)
|
181 |
+
query_output = self.forward_repr_query(batch)
|
182 |
+
|
183 |
+
return {"video_feat": video_output,
|
184 |
+
"query_feat": query_output}
|
185 |
+
|
186 |
+
def forward(self,batch,task="repr_both"):
|
187 |
+
|
188 |
+
if task == "repr_both":
|
189 |
+
return self.forward_repr_both(batch)
|
190 |
+
elif task == "repr_video":
|
191 |
+
return self.forward_repr_video(batch)
|
192 |
+
elif task == "repr_query":
|
193 |
+
return self.forward_repr_query(batch)
|
194 |
+
|
195 |
+
|
196 |
+
class QueryWeightEncoder(nn.Module):
|
197 |
+
"""
|
198 |
+
Query Weight Encoder
|
199 |
+
Using NetVLAD to aggregate contextual query features
|
200 |
+
Using FC + Softmax to get fusion weights for each modality
|
201 |
+
"""
|
202 |
+
def __init__(self, config, video_modality):
|
203 |
+
super(QueryWeightEncoder, self).__init__()
|
204 |
+
|
205 |
+
##NetVLAD
|
206 |
+
self.text_pooling = NetVLAD(feature_size=config.hidden_size,cluster_size=config.text_cluster)
|
207 |
+
self.moe_txt_dropout = nn.Dropout(config.moe_dropout_prob)
|
208 |
+
|
209 |
+
##FC
|
210 |
+
self.moe_fc_txt = nn.Linear(
|
211 |
+
in_features=self.text_pooling.out_dim,
|
212 |
+
out_features=len(video_modality),
|
213 |
+
bias=False)
|
214 |
+
|
215 |
+
self.video_modality = video_modality
|
216 |
+
|
217 |
+
def forward(self, query_feat):
|
218 |
+
##NetVLAD
|
219 |
+
pooled_text = self.text_pooling(query_feat)
|
220 |
+
pooled_text = self.moe_txt_dropout(pooled_text)
|
221 |
+
|
222 |
+
##FC + Softmax
|
223 |
+
moe_weights = self.moe_fc_txt(pooled_text)
|
224 |
+
softmax_moe_weights = F.softmax(moe_weights, dim=1)
|
225 |
+
|
226 |
+
|
227 |
+
moe_weights_dict = dict()
|
228 |
+
for modality, moe_weight in zip(self.video_modality, torch.split(softmax_moe_weights, 1, dim=1)):
|
229 |
+
moe_weights_dict[modality] = moe_weight.squeeze(1)
|
230 |
+
|
231 |
+
return moe_weights_dict
|
232 |
+
|
233 |
+
|
234 |
+
|
235 |
+
|
model/conquer.py
ADDED
@@ -0,0 +1,205 @@
1 |
+
import torch
|
2 |
+
import torch.nn as nn
|
3 |
+
from model.backbone.encoder import VideoQueryEncoder, QueryWeightEncoder
|
4 |
+
from model.qal.query_aware_learning_module import BiDirectionalAttention
|
5 |
+
from model.layers import FCPlusTransformer#,MomentLocalizationHead
|
6 |
+
from model.head.ml_head import MomentLocalizationHead
|
7 |
+
from model.head.vs_head import VideoScoringHead
|
8 |
+
|
9 |
+
import logging
|
10 |
+
logger = logging.getLogger(__name__)
|
11 |
+
|
12 |
+
|
13 |
+
class CONQUER(nn.Module):
|
14 |
+
def __init__(self, config,
|
15 |
+
visual_dim = 4352,
|
16 |
+
text_dim = 768,
|
17 |
+
query_dim = 768,
|
18 |
+
hidden_dim = 768,
|
19 |
+
video_len = 100,
|
20 |
+
ctx_mode = "visual_sub",
|
21 |
+
lw_st_ed = 0.01,
|
22 |
+
lw_video_ce = 0.05,
|
23 |
+
similarity_measure="general",
|
24 |
+
use_debug=False,
|
25 |
+
no_output_moe_weight=False):
|
26 |
+
|
27 |
+
super(CONQUER, self).__init__()
|
28 |
+
self.config = config
|
29 |
+
|
30 |
+
# related configs
|
31 |
+
self.lw_st_ed = lw_st_ed
|
32 |
+
self.lw_video_ce = lw_video_ce
|
33 |
+
self.similarity_measure = similarity_measure
|
34 |
+
|
35 |
+
self.video_modality = ctx_mode.split("_")
|
36 |
+
logger.info("video modality : %s" % self.video_modality)
|
37 |
+
self.output_moe_weight = not no_output_moe_weight
|
38 |
+
|
39 |
+
hidden_dim = hidden_dim
|
40 |
+
base_bert_layer_config = config.bert_config
|
41 |
+
|
42 |
+
## Backbone encoder
|
43 |
+
self.encoder = VideoQueryEncoder(config,video_modality=self.video_modality,
|
44 |
+
visual_dim=visual_dim,text_dim=text_dim,query_dim=query_dim,
|
45 |
+
hidden_dim=hidden_dim,split_num=video_len)
|
46 |
+
|
47 |
+
if self.output_moe_weight and len(self.video_modality) > 1:
|
48 |
+
self.query_weight = QueryWeightEncoder(config.netvlad_config,video_modality=self.video_modality)
|
49 |
+
|
50 |
+
## Query_aware_feature_learning Module
|
51 |
+
self.query_aware_feature_learning_layer = BiDirectionalAttention(hidden_dim)
|
52 |
+
|
53 |
+
## Shared transformer for both moment localization and video scoring heads
|
54 |
+
self.contextual_QAL_feature_learning = FCPlusTransformer(base_bert_layer_config,hidden_dim * 4)
|
55 |
+
|
56 |
+
## Moment_localization_head
|
57 |
+
self.moment_localization_head = MomentLocalizationHead(config.moment_localization_config,base_bert_layer_config,hidden_dim)
|
58 |
+
self.temporal_criterion = nn.CrossEntropyLoss(reduction="mean")
|
59 |
+
|
60 |
+
## Optional video_scoring_head
|
61 |
+
if self.similarity_measure == "exclusive":
|
62 |
+
self.video_scoring_head = VideoScoringHead(config.video_scoring_config,base_bert_layer_config,hidden_dim)
|
63 |
+
self.score_ce = nn.CrossEntropyLoss(reduction="mean")
|
64 |
+
|
65 |
+
self.debug_model = use_debug
|
66 |
+
if self.debug_model:
|
67 |
+
logger.setLevel(level=logging.DEBUG)
|
68 |
+
|
69 |
+
self.reset_parameters()
|
70 |
+
|
71 |
+
def reset_parameters(self):
|
72 |
+
""" Initialize the weights."""
|
73 |
+
|
74 |
+
def re_init(module):
|
75 |
+
if isinstance(module, (nn.Linear, nn.Embedding)):
|
76 |
+
# Slightly different from the TF version which uses truncated_normal for initialization
|
77 |
+
# cf https://github.com/pytorch/pytorch/pull/5617
|
78 |
+
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
79 |
+
#print("nn.Linear, nn.Embedding: ", module)
|
80 |
+
elif isinstance(module, nn.LayerNorm):
|
81 |
+
module.bias.data.zero_()
|
82 |
+
module.weight.data.fill_(1.0)
|
83 |
+
elif isinstance(module, nn.Conv1d):
|
84 |
+
module.reset_parameters()
|
85 |
+
|
86 |
+
if isinstance(module, nn.Linear) and module.bias is not None:
|
87 |
+
module.bias.data.zero_()
|
88 |
+
|
89 |
+
self.apply(re_init)
|
90 |
+
|
91 |
+
|
92 |
+
def compute_final_score(self,score_dict,moe_weights=None):
|
93 |
+
|
94 |
+
sample_key = list(score_dict.keys())[0]
|
95 |
+
final_query_context_scores = torch.zeros_like(score_dict[sample_key])
|
96 |
+
shape_size = len(score_dict[sample_key].shape)
|
97 |
+
if moe_weights is not None:
|
98 |
+
for mod in self.video_modality:
|
99 |
+
if shape_size == 2:
|
100 |
+
final_query_context_scores += torch.einsum("nm,n->nm", score_dict[mod], moe_weights[mod])
|
101 |
+
elif shape_size == 3:
|
102 |
+
final_query_context_scores += torch.einsum("nlm,n->nlm", score_dict[mod], moe_weights[mod])
|
103 |
+
else:
|
104 |
+
for mod in self.video_modality:
|
105 |
+
final_query_context_scores += torch.div(score_dict[mod], len(self.video_modality))
|
106 |
+
|
107 |
+
return final_query_context_scores
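A standalone sketch of the weighted fusion performed here, with two modalities and made-up tensor sizes (the real shapes depend on the batch layout):

```python
import torch

bsz, L, d = 2, 100, 768
score_dict = {"visual": torch.randn(bsz, L, d), "sub": torch.randn(bsz, L, d)}
moe_weights = {"visual": torch.full((bsz,), 0.7), "sub": torch.full((bsz,), 0.3)}

# Each modality's (L, d) feature map is scaled by its per-sample scalar weight,
# then the modalities are summed, matching the shape_size == 3 branch above.
fused = sum(torch.einsum("nlm,n->nlm", score_dict[mod], moe_weights[mod])
            for mod in score_dict)
assert fused.shape == (bsz, L, d)
```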
|
108 |
+
|
109 |
+
|
110 |
+
def get_pred_from_raw_query(self, batch):
|
111 |
+
|
112 |
+
## Extract query and video feature through MMT backbone
|
113 |
+
_query_feature = self.encoder(batch, task="repr_query") #Widehat_Q
|
114 |
+
|
115 |
+
_video_feature_dict = self.encoder(batch, task="repr_video") #Widehat_V and #Widehat_S
|
116 |
+
|
117 |
+
## Shared normalization technique
|
118 |
+
## Use the same query feature for shared_video_num times
|
119 |
+
sample_key = list(_video_feature_dict.keys())[0]
|
120 |
+
query_batch = _query_feature.size()[0]
|
121 |
+
video_batch, video_len = _video_feature_dict[sample_key].size()[:2]
|
122 |
+
shared_video_num = int(video_batch / query_batch)
|
123 |
+
|
124 |
+
query_feature = torch.repeat_interleave(_query_feature, shared_video_num, dim=0)
|
125 |
+
query_mask = torch.repeat_interleave(batch["query"]["feat_mask"], shared_video_num, dim=0)
|
126 |
+
|
127 |
+
|
128 |
+
## Compute Query Dependent Fusion video feature
|
129 |
+
if self.output_moe_weight and len(self.video_modality) > 1:
|
130 |
+
moe_weights_dict = self.query_weight(query_feature)
|
131 |
+
QDF_feature = self.compute_final_score(_video_feature_dict, moe_weights_dict)
|
132 |
+
else:
|
133 |
+
QDF_feature = self.compute_final_score(_video_feature_dict,None)
|
134 |
+
|
135 |
+
video_mask = batch["visual"]["feat_mask"]
|
136 |
+
|
137 |
+
|
138 |
+
## Compute Query Aware Learning video feature
|
139 |
+
QAL_feature = self.query_aware_feature_learning_layer(QDF_feature, query_feature,
|
140 |
+
video_mask,query_mask)
|
141 |
+
|
142 |
+
## Contextualize QAL features
|
143 |
+
Contextual_QAL = self.contextual_QAL_feature_learning(
|
144 |
+
features=QAL_feature,
|
145 |
+
feat_mask=video_mask)
|
146 |
+
|
147 |
+
G = torch.cat([QAL_feature,Contextual_QAL], dim=2)
|
148 |
+
|
149 |
+
## Moment localization head
|
150 |
+
begin_score_distribution , end_score_distribution = self.moment_localization_head(G,Contextual_QAL,video_mask)
|
151 |
+
begin_score_distribution = begin_score_distribution.view(query_batch, shared_video_num, video_len)
|
152 |
+
end_score_distribution = end_score_distribution.view(query_batch, shared_video_num, video_len)
|
153 |
+
|
154 |
+
## Optional video scoring head
|
155 |
+
video_similarity_score = None
|
156 |
+
if self.similarity_measure == "exclusive":
|
157 |
+
video_similarity_score = self.video_scoring_head(G,video_mask)
|
158 |
+
video_similarity_score = video_similarity_score.view(query_batch, shared_video_num)
|
159 |
+
|
160 |
+
return video_similarity_score, begin_score_distribution , end_score_distribution
|
161 |
+
|
162 |
+
|
163 |
+
def get_moment_loss_share_norm(self, begin_score_distribution, end_score_distribution ,st_ed_indices):
|
164 |
+
|
165 |
+
bs , shared_video_num , video_len = begin_score_distribution.size()
|
166 |
+
|
167 |
+
begin_score_distribution = begin_score_distribution.view(bs,-1)
|
168 |
+
end_score_distribution = end_score_distribution.view(bs,-1)
|
169 |
+
|
170 |
+
loss_st = self.temporal_criterion(begin_score_distribution, st_ed_indices[:, 0])
|
171 |
+
loss_ed = self.temporal_criterion(end_score_distribution, st_ed_indices[:, 1])
|
172 |
+
moment_ce_loss = loss_st + loss_ed
|
173 |
+
|
174 |
+
return moment_ce_loss
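A toy sketch of the shared-normalization objective computed here, assuming the ground-truth video occupies slot 0 so that its clip indices are unchanged after flattening (sizes are made up):

```python
import torch
import torch.nn as nn

bs, shared_video_num, video_len = 4, 3, 100   # 1 positive + 2 sampled negative videos
begin_scores = torch.randn(bs, shared_video_num, video_len)

# Ground-truth start clip index inside the positive video (slot 0); after
# flattening to (shared_video_num * video_len,) its index is unchanged.
gt_start = torch.randint(0, video_len, (bs,))

criterion = nn.CrossEntropyLoss(reduction="mean")
loss_st = criterion(begin_scores.view(bs, -1), gt_start)  # softmax spans all videos
print(float(loss_st))
```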
|
175 |
+
|
176 |
+
|
177 |
+
def forward(self,batch):
|
178 |
+
|
179 |
+
video_similarity_score, begin_score_distribution , end_score_distribution = \
|
180 |
+
self.get_pred_from_raw_query(batch)
|
181 |
+
|
182 |
+
moment_ce_loss, video_ce_loss = 0, 0
|
183 |
+
|
184 |
+
# moment cross-entropy loss
|
185 |
+
# if neg_video_num = 0, we do not sample negative videos
|
186 |
+
# the softmax operator is performed only for the ground-truth video
|
187 |
+
# which means the shared normalization training objective is not used
|
188 |
+
moment_ce_loss = self.get_moment_loss_share_norm(
|
189 |
+
begin_score_distribution, end_score_distribution, batch["st_ed_indices"])
|
190 |
+
moment_ce_loss = self.lw_st_ed * moment_ce_loss
|
191 |
+
|
192 |
+
if self.similarity_measure == "exclusive":
|
193 |
+
ce_label = batch["st_ed_indices"].new_zeros(video_similarity_score.size()[0])
|
194 |
+
video_ce_loss = self.score_ce(video_similarity_score, ce_label)
|
195 |
+
video_ce_loss = self.lw_video_ce*video_ce_loss
|
196 |
+
|
197 |
+
|
198 |
+
loss = moment_ce_loss + video_ce_loss
|
199 |
+
return loss, {"moment_ce_loss": float(moment_ce_loss),
|
200 |
+
"video_ce_loss": float(video_ce_loss),
|
201 |
+
"loss_overall": float(loss)}
|
202 |
+
|
203 |
+
|
204 |
+
|
205 |
+
|
model/head/__init__.py
ADDED
File without changes
|
model/head/ml_head.py
ADDED
@@ -0,0 +1,61 @@
1 |
+
import torch
|
2 |
+
from torch import nn
|
3 |
+
import logging
|
4 |
+
logger = logging.getLogger(__name__)
|
5 |
+
|
6 |
+
|
7 |
+
from model.layers import FCPlusTransformer, ConvSE
|
8 |
+
|
9 |
+
|
10 |
+
class MomentLocalizationHead(nn.Module):
|
11 |
+
"""
|
12 |
+
Moment localization head model
|
13 |
+
"""
|
14 |
+
|
15 |
+
def __init__(self, config,base_bert_layer_config,hidden_dim):
|
16 |
+
super(MomentLocalizationHead, self).__init__()
|
17 |
+
|
18 |
+
base_bert_layer_config = base_bert_layer_config
|
19 |
+
hidden_dim = hidden_dim
|
20 |
+
|
21 |
+
self.begin_feature_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 5)
|
22 |
+
|
23 |
+
self.end_feature_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 2)
|
24 |
+
|
25 |
+
self.begin_score_modeling = ConvSE(config)
|
26 |
+
self.end_score_modeling = ConvSE(config)
|
27 |
+
|
28 |
+
def forward(self, G, Contextual_QAL, video_mask):
|
29 |
+
"""
|
30 |
+
Inputs:
|
31 |
+
:param G: (batch, L_v, hidden_dim * 5) concatenation of QAL and contextual QAL features
|
32 |
+
:param video_mask: (batch, L_v)
|
33 |
+
Return:
|
34 |
+
begin and end score distributions, each (batch, L_v)
|
35 |
+
"""
|
36 |
+
## OUTPUT LAYER
|
37 |
+
begin_features = self.begin_feature_modeling(
|
38 |
+
features=G,
|
39 |
+
feat_mask=video_mask)
|
40 |
+
|
41 |
+
end_features = self.end_feature_modeling(
|
42 |
+
features=torch.cat([Contextual_QAL, begin_features], dim=2),
|
43 |
+
feat_mask=video_mask)
|
44 |
+
|
45 |
+
## Un-normalized
|
46 |
+
begin_input_feature = torch.transpose(begin_features, 1, 2)
|
47 |
+
end_input_feature = torch.transpose(end_features, 1, 2)
|
48 |
+
|
49 |
+
begin_score_distribution = self.begin_score_modeling(
|
50 |
+
contextual_qal_features=begin_input_feature,
|
51 |
+
video_mask=video_mask,
|
52 |
+
)
|
53 |
+
|
54 |
+
end_score_distribution = self.end_score_modeling(
|
55 |
+
contextual_qal_features=end_input_feature,
|
56 |
+
video_mask=video_mask,
|
57 |
+
)
|
58 |
+
|
59 |
+
return begin_score_distribution , end_score_distribution
|
60 |
+
|
61 |
+
|
model/head/vs_head.py
ADDED
@@ -0,0 +1,42 @@
1 |
+
import torch
|
2 |
+
from torch import nn
|
3 |
+
|
4 |
+
import logging
|
5 |
+
logger = logging.getLogger(__name__)
|
6 |
+
|
7 |
+
from model.layers import FCPlusTransformer
|
8 |
+
|
9 |
+
class VideoScoringHead(nn.Module):
|
10 |
+
"""
|
11 |
+
Video Scoring Head
|
12 |
+
"""
|
13 |
+
|
14 |
+
def __init__(self, config,base_bert_layer_config,hidden_dim):
|
15 |
+
super(VideoScoringHead, self).__init__()
|
16 |
+
|
17 |
+
base_bert_layer_config = base_bert_layer_config
|
18 |
+
hidden_dim = hidden_dim
|
19 |
+
|
20 |
+
|
21 |
+
self.video_feature_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 5)
|
22 |
+
|
23 |
+
self.video_score_predictor = nn.Sequential(
|
24 |
+
nn.Linear(**config.linear_1_cfg),
|
25 |
+
nn.ReLU(),
|
26 |
+
nn.Linear(**config.linear_2_cfg)
|
27 |
+
)
|
28 |
+
|
29 |
+
|
30 |
+
def forward(self, G, video_mask):
|
31 |
+
|
32 |
+
|
33 |
+
## Contextual QAL feature for video scoring
|
34 |
+
R = self.video_feature_modeling(
|
35 |
+
features=G,
|
36 |
+
feat_mask=video_mask)
|
37 |
+
|
38 |
+
holistic_video_feature, _ = torch.max(R, dim=1)
|
39 |
+
|
40 |
+
video_similarity_score = self.video_score_predictor(holistic_video_feature.squeeze(1)) # r
|
41 |
+
|
42 |
+
return video_similarity_score
|
model/layers.py
ADDED
@@ -0,0 +1,196 @@
1 |
+
import torch
|
2 |
+
import torch.nn as nn
|
3 |
+
import torch.nn.functional as F
|
4 |
+
import math
|
5 |
+
import logging
|
6 |
+
|
7 |
+
logger = logging.getLogger(__name__)
|
8 |
+
try:
|
9 |
+
from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
|
10 |
+
except (ImportError, AttributeError) as e:
|
11 |
+
BertLayerNorm = torch.nn.LayerNorm
|
12 |
+
|
13 |
+
from model.transformer.bert import BertEncoder
|
14 |
+
from model.modeling_utils import mask_logits
|
15 |
+
|
16 |
+
class LinearLayer(nn.Module):
|
17 |
+
"""linear layer configurable with layer normalization, dropout, ReLU."""
|
18 |
+
def __init__(self, in_hsz, out_hsz, layer_norm=True, dropout=0.1, relu=True,tanh=False):
|
19 |
+
super(LinearLayer, self).__init__()
|
20 |
+
self.relu = relu
|
21 |
+
self.tanh = tanh
|
22 |
+
self.layer_norm = layer_norm
|
23 |
+
if layer_norm:
|
24 |
+
self.LayerNorm = BertLayerNorm(in_hsz)
|
25 |
+
layers = [
|
26 |
+
nn.Dropout(dropout),
|
27 |
+
nn.Linear(in_hsz, out_hsz)
|
28 |
+
]
|
29 |
+
self.net = nn.Sequential(*layers)
|
30 |
+
|
31 |
+
def forward(self, x):
|
32 |
+
"""(N, L, D)"""
|
33 |
+
if self.layer_norm:
|
34 |
+
x = self.LayerNorm(x)
|
35 |
+
x = self.net(x)
|
36 |
+
if self.relu:
|
37 |
+
x = F.relu(x, inplace=True)
|
38 |
+
if self.tanh:
|
39 |
+
x = torch.tanh(x)
|
40 |
+
return x # (N, L, D)
|
41 |
+
|
42 |
+
|
43 |
+
class NetVLAD(nn.Module):
|
44 |
+
def __init__(self, cluster_size, feature_size, add_norm=True):
|
45 |
+
super(NetVLAD, self).__init__()
|
46 |
+
self.feature_size = feature_size
|
47 |
+
self.cluster_size = cluster_size
|
48 |
+
self.clusters = nn.Parameter((1 / math.sqrt(feature_size))
|
49 |
+
* torch.randn(feature_size, cluster_size))
|
50 |
+
self.clusters2 = nn.Parameter((1 / math.sqrt(feature_size))
|
51 |
+
* torch.randn(1, feature_size, cluster_size))
|
52 |
+
|
53 |
+
self.add_norm = add_norm
|
54 |
+
self.LayerNorm = BertLayerNorm(cluster_size)
|
55 |
+
self.out_dim = cluster_size * feature_size
|
56 |
+
|
57 |
+
def forward(self, x):
|
58 |
+
max_sample = x.size()[1]
|
59 |
+
x = x.view(-1, self.feature_size)
|
60 |
+
assignment = torch.matmul(x, self.clusters)
|
61 |
+
|
62 |
+
if self.add_norm:
|
63 |
+
assignment = self.LayerNorm(assignment)
|
64 |
+
|
65 |
+
assignment = F.softmax(assignment, dim=1)
|
66 |
+
assignment = assignment.view(-1, max_sample, self.cluster_size)
|
67 |
+
|
68 |
+
a_sum = torch.sum(assignment, -2, keepdim=True)
|
69 |
+
a = a_sum * self.clusters2
|
70 |
+
|
71 |
+
assignment = assignment.transpose(1, 2)
|
72 |
+
|
73 |
+
x = x.view(-1, max_sample, self.feature_size)
|
74 |
+
vlad = torch.matmul(assignment, x)
|
75 |
+
vlad = vlad.transpose(1, 2)
|
76 |
+
vlad = vlad - a
|
77 |
+
|
78 |
+
# L2 intra norm
|
79 |
+
vlad = F.normalize(vlad)
|
80 |
+
|
81 |
+
# flattening + L2 norm
|
82 |
+
vlad = vlad.reshape(-1, self.cluster_size * self.feature_size)
|
83 |
+
vlad = F.normalize(vlad)
|
84 |
+
|
85 |
+
return vlad
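A quick shape check for this NetVLAD block, reusing the class defined above with hypothetical sizes:

```python
import torch

vlad = NetVLAD(cluster_size=4, feature_size=768)   # hypothetical sizes
x = torch.randn(2, 30, 768)                        # 2 queries, 30 tokens each
out = vlad(x)
print(out.shape)   # torch.Size([2, 3072]) == cluster_size * feature_size
```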
|
86 |
+
|
87 |
+
|
88 |
+
class FCPlusTransformer(nn.Module):
|
89 |
+
"""
|
90 |
+
FC + Transformer
|
91 |
+
FC layer reduces input feature size into hidden size
|
92 |
+
Transformer contextualizes QAL feature
|
93 |
+
"""
|
94 |
+
|
95 |
+
def __init__(self, config,input_dim):
|
96 |
+
super(FCPlusTransformer, self).__init__()
|
97 |
+
self.trans_linear = LinearLayer(
|
98 |
+
in_hsz=input_dim, out_hsz=config.hidden_size)
|
99 |
+
self.encoder = BertEncoder(config)
|
100 |
+
|
101 |
+
def forward(self,features, feat_mask):
|
102 |
+
"""
|
103 |
+
Inputs:
|
104 |
+
:param features: (batch, L_v, input_dim)
|
105 |
+
:param feat_mask: (batch, L_v)
|
106 |
+
Return:
|
107 |
+
sequence_output: (batch, L_v, hidden_size)
|
108 |
+
"""
|
109 |
+
transformed_features = self.trans_linear(features)
|
110 |
+
|
111 |
+
encoder_outputs = self.encoder(transformed_features, feat_mask)
|
112 |
+
|
113 |
+
sequence_output = encoder_outputs[0]
|
114 |
+
|
115 |
+
return sequence_output
|
116 |
+
|
117 |
+
|
118 |
+
class ConvSE(nn.Module):
|
119 |
+
"""
|
120 |
+
ConvSE module
|
121 |
+
"""
|
122 |
+
def __init__(self, config):
|
123 |
+
super(ConvSE, self).__init__()
|
124 |
+
|
125 |
+
self.clip_score_predictor = nn.Sequential(
|
126 |
+
nn.Conv1d(**config.conv_cfg_1),
|
127 |
+
nn.ReLU(),
|
128 |
+
nn.Conv1d(**config.conv_cfg_2),
|
129 |
+
)
|
130 |
+
|
131 |
+
|
132 |
+
def forward(self, contextual_qal_features, video_mask):
|
133 |
+
"""
|
134 |
+
Inputs:
|
135 |
+
:param contextual_qal_features: (batch, feat_size, L_v)
|
136 |
+
:param video_mask: (batch, L_v)
|
137 |
+
Return:
|
138 |
+
score: (begin or end) score distribution
|
139 |
+
"""
|
140 |
+
score = self.clip_score_predictor(contextual_qal_features).squeeze(1) #(batch, L_v)
|
141 |
+
|
142 |
+
score = mask_logits(score, video_mask) #(batch, L_v)
|
143 |
+
|
144 |
+
return score
|
145 |
+
|
146 |
+
|
147 |
+
class MomentLocalizationHead(nn.Module):
|
148 |
+
"""
|
149 |
+
Moment localization head model
|
150 |
+
"""
|
151 |
+
|
152 |
+
def __init__(self, config,base_bert_layer_config,hidden_dim):
|
153 |
+
super(MomentLocalizationHead, self).__init__()
|
154 |
+
|
155 |
+
base_bert_layer_config = base_bert_layer_config
|
156 |
+
hidden_dim = hidden_dim
|
157 |
+
|
158 |
+
self.start_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 5)
|
159 |
+
|
160 |
+
self.end_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 2)
|
161 |
+
|
162 |
+
self.start_reader = ConvSE(config)
|
163 |
+
self.end_reader = ConvSE(config)
|
164 |
+
|
165 |
+
def forward(self, G, Contextual_QAL, video_mask):
|
166 |
+
"""
|
167 |
+
Inputs:
|
168 |
+
:param contextual_qal_features: (batch, feat_size, L_v)
|
169 |
+
:param video_mask: (batch, L_v)
|
170 |
+
Return:
|
171 |
+
score: (begin or end) score distribution
|
172 |
+
"""
|
173 |
+
## OUTPUT LAYER
|
174 |
+
start_features = self.start_modeling(
|
175 |
+
features=G,
|
176 |
+
feat_mask=video_mask)
|
177 |
+
|
178 |
+
end_features = self.end_modeling(
|
179 |
+
features=torch.cat([Contextual_QAL, start_features], dim=2),
|
180 |
+
feat_mask=video_mask)
|
181 |
+
|
182 |
+
## Un-normalized
|
183 |
+
start_reader_input_feature = torch.transpose(start_features, 1, 2)
|
184 |
+
end_reader_input_feature = torch.transpose(end_features, 1, 2)
|
185 |
+
|
186 |
+
reader_st_prob = self.start_reader(
|
187 |
+
contextual_qal_features=start_reader_input_feature,
|
188 |
+
video_mask=video_mask,
|
189 |
+
)
|
190 |
+
|
191 |
+
reader_ed_prob = self.end_reader(
|
192 |
+
contextual_qal_features=end_reader_input_feature,
|
193 |
+
video_mask=video_mask,
|
194 |
+
)
|
195 |
+
|
196 |
+
return reader_st_prob,reader_ed_prob
|
model/modeling_utils.py
ADDED
@@ -0,0 +1,135 @@
1 |
+
"""
|
2 |
+
Copyright (c) Microsoft Corporation.
|
3 |
+
Licensed under the MIT license.
|
4 |
+
|
5 |
+
some functions are modified from HuggingFace
|
6 |
+
(https://github.com/huggingface/transformers)
|
7 |
+
"""
|
8 |
+
import torch
|
9 |
+
from torch import nn
|
10 |
+
import logging
|
11 |
+
logger = logging.getLogger(__name__)
|
12 |
+
|
13 |
+
|
14 |
+
def prune_linear_layer(layer, index, dim=0):
|
15 |
+
""" Prune a linear layer (a model parameters)
|
16 |
+
to keep only entries in index.
|
17 |
+
Return the pruned layer as a new layer with requires_grad=True.
|
18 |
+
Used to remove heads.
|
19 |
+
"""
|
20 |
+
index = index.to(layer.weight.device)
|
21 |
+
W = layer.weight.index_select(dim, index).clone().detach()
|
22 |
+
if layer.bias is not None:
|
23 |
+
if dim == 1:
|
24 |
+
b = layer.bias.clone().detach()
|
25 |
+
else:
|
26 |
+
b = layer.bias[index].clone().detach()
|
27 |
+
new_size = list(layer.weight.size())
|
28 |
+
new_size[dim] = len(index)
|
29 |
+
new_layer = nn.Linear(
|
30 |
+
new_size[1], new_size[0], bias=layer.bias is not None).to(
|
31 |
+
layer.weight.device)
|
32 |
+
new_layer.weight.requires_grad = False
|
33 |
+
new_layer.weight.copy_(W.contiguous())
|
34 |
+
new_layer.weight.requires_grad = True
|
35 |
+
if layer.bias is not None:
|
36 |
+
new_layer.bias.requires_grad = False
|
37 |
+
new_layer.bias.copy_(b.contiguous())
|
38 |
+
new_layer.bias.requires_grad = True
|
39 |
+
return new_layer
|
40 |
+
|
41 |
+
|
42 |
+
def mask_logits(target, mask, eps=-1e4):
|
43 |
+
return target * mask + (1 - mask) * eps
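For intuition, a tiny self-contained example of mask_logits: masked positions receive a large negative value, so they effectively disappear after softmax.

```python
import torch
import torch.nn.functional as F

def mask_logits(target, mask, eps=-1e4):
    return target * mask + (1 - mask) * eps

scores = torch.tensor([[2.0, 1.0, 3.0]])
mask = torch.tensor([[1.0, 1.0, 0.0]])   # last position is padding
probs = F.softmax(mask_logits(scores, mask), dim=-1)
print(probs)   # probability of the masked position is ~0
```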
|
44 |
+
|
45 |
+
|
46 |
+
def load_partial_checkpoint(checkpoint, n_layers, skip_layers=True):
|
47 |
+
if skip_layers:
|
48 |
+
new_checkpoint = {}
|
49 |
+
gap = int(12/n_layers)
|
50 |
+
prefix = "roberta.encoder.layer."
|
51 |
+
layer_range = {str(l): str(i) for i, l in enumerate(
|
52 |
+
list(range(gap-1, 12, gap)))}
|
53 |
+
for k, v in checkpoint.items():
|
54 |
+
if prefix in k:
|
55 |
+
layer_name = k.split(".")
|
56 |
+
layer_num = layer_name[3]
|
57 |
+
if layer_num in layer_range:
|
58 |
+
layer_name[3] = layer_range[layer_num]
|
59 |
+
new_layer_name = ".".join(layer_name)
|
60 |
+
new_checkpoint[new_layer_name] = v
|
61 |
+
else:
|
62 |
+
new_checkpoint[k] = v
|
63 |
+
else:
|
64 |
+
new_checkpoint = checkpoint
|
65 |
+
return new_checkpoint
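For example, with a hypothetical n_layers=6 the index arithmetic above keeps every second layer of the 12-layer checkpoint and renumbers it:

```python
n_layers = 6              # hypothetical target depth
gap = int(12 / n_layers)  # 2
layer_range = {str(l): str(i) for i, l in enumerate(range(gap - 1, 12, gap))}
print(layer_range)        # {'1': '0', '3': '1', '5': '2', '7': '3', '9': '4', '11': '5'}
```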
|
66 |
+
|
67 |
+
|
68 |
+
def load_pretrained_weight(model, state_dict):
|
69 |
+
# Load from a PyTorch state_dict
|
70 |
+
old_keys = []
|
71 |
+
new_keys = []
|
72 |
+
for key in state_dict.keys():
|
73 |
+
new_key = None
|
74 |
+
if 'gamma' in key:
|
75 |
+
new_key = key.replace('gamma', 'weight')
|
76 |
+
if 'beta' in key:
|
77 |
+
new_key = key.replace('beta', 'bias')
|
78 |
+
if new_key:
|
79 |
+
old_keys.append(key)
|
80 |
+
new_keys.append(new_key)
|
81 |
+
for old_key, new_key in zip(old_keys, new_keys):
|
82 |
+
state_dict[new_key] = state_dict.pop(old_key)
|
83 |
+
|
84 |
+
missing_keys = []
|
85 |
+
unexpected_keys = []
|
86 |
+
error_msgs = []
|
87 |
+
# copy state_dict so _load_from_state_dict can modify it
|
88 |
+
metadata = getattr(state_dict, '_metadata', None)
|
89 |
+
state_dict = state_dict.copy()
|
90 |
+
if metadata is not None:
|
91 |
+
state_dict._metadata = metadata
|
92 |
+
|
93 |
+
def load(module, prefix=''):
|
94 |
+
local_metadata = ({} if metadata is None
|
95 |
+
else metadata.get(prefix[:-1], {}))
|
96 |
+
module._load_from_state_dict(
|
97 |
+
state_dict, prefix, local_metadata, True, missing_keys,
|
98 |
+
unexpected_keys, error_msgs)
|
99 |
+
for name, child in module._modules.items():
|
100 |
+
if child is not None:
|
101 |
+
load(child, prefix + name + '.')
|
102 |
+
start_prefix = ''
|
103 |
+
if not hasattr(model, 'roberta') and\
|
104 |
+
any(s.startswith('roberta.') for s in state_dict.keys()):
|
105 |
+
start_prefix = 'roberta.'
|
106 |
+
|
107 |
+
load(model, prefix=start_prefix)
|
108 |
+
if len(missing_keys) > 0:
|
109 |
+
logger.info("Weights of {} not initialized from "
|
110 |
+
"pretrained model: {}".format(
|
111 |
+
model.__class__.__name__, missing_keys))
|
112 |
+
if len(unexpected_keys) > 0:
|
113 |
+
logger.info("Weights from pretrained model not used in "
|
114 |
+
"{}: {}".format(
|
115 |
+
model.__class__.__name__, unexpected_keys))
|
116 |
+
if len(error_msgs) > 0:
|
117 |
+
raise RuntimeError('Error(s) in loading state_dict for '
|
118 |
+
'{}:\n\t{}'.format(
|
119 |
+
model.__class__.__name__,
|
120 |
+
"\n\t".join(error_msgs)))
|
121 |
+
return model
|
122 |
+
|
123 |
+
|
124 |
+
def pad_tensor_to_mul(tensor, dim=0, mul=8):
|
125 |
+
""" pad tensor to multiples (8 for tensor cores) """
|
126 |
+
t_size = list(tensor.size())
|
127 |
+
n_pad = mul - t_size[dim] % mul
|
128 |
+
if n_pad == mul:
|
129 |
+
n_pad = 0
|
130 |
+
padded_tensor = tensor
|
131 |
+
else:
|
132 |
+
t_size[dim] = n_pad
|
133 |
+
pad = torch.zeros(*t_size, dtype=tensor.dtype, device=tensor.device)
|
134 |
+
padded_tensor = torch.cat([tensor, pad], dim=dim)
|
135 |
+
return padded_tensor, n_pad
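A short usage sketch with a hypothetical tensor, showing the sequence length being padded up to the next multiple of 8 for tensor cores:

```python
import torch

t = torch.randn(13, 768)                      # 13 tokens, hypothetical size
padded, n_pad = pad_tensor_to_mul(t, dim=0)   # function defined above
print(padded.shape, n_pad)                    # torch.Size([16, 768]) 3
```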
|
model/qal/__init__.py
ADDED
File without changes
|
model/qal/query_aware_learning_module.py
ADDED
@@ -0,0 +1,92 @@
1 |
+
import torch
|
2 |
+
from torch import nn
|
3 |
+
|
4 |
+
import logging
|
5 |
+
logger = logging.getLogger(__name__)
|
6 |
+
|
7 |
+
try:
|
8 |
+
from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
|
9 |
+
except (ImportError, AttributeError) as e:
|
10 |
+
BertLayerNorm = torch.nn.LayerNorm
|
11 |
+
|
12 |
+
from utils.model_utils import mask_logits
|
13 |
+
import torch.nn.functional as F
|
14 |
+
|
15 |
+
|
16 |
+
class BiDirectionalAttention(nn.Module):
|
17 |
+
"""
|
18 |
+
Bi-directional attention flow
|
19 |
+
Perform query-to-video attention (Q2V) and video-to-query attention (V2Q)
|
20 |
+
Concatenate QDF features with a set of query-aware features to form the QAL feature
|
21 |
+
"""
|
22 |
+
|
23 |
+
def __init__(self, video_dim):
|
24 |
+
super(BiDirectionalAttention, self).__init__()
|
25 |
+
## Core attention for query-aware feature learning
|
26 |
+
self.similarity_weight = nn.Linear(video_dim * 3, 1, bias=False)
|
27 |
+
|
28 |
+
|
29 |
+
def forward(self, QDF_emb, query_emb,video_mask, query_mask):
|
30 |
+
"""
|
31 |
+
Inputs:
|
32 |
+
:param QDF_emb: (batch, L_v, feat_size)
|
33 |
+
:param query_emb: (batch, L_q, feat_size)
|
34 |
+
:param video_mask: (batch, L_v)
|
35 |
+
:param query_mask: (batch, L_q)
|
36 |
+
Return:
|
37 |
+
QAL: (batch, L_v, feat_size*4)
|
38 |
+
"""
|
39 |
+
|
40 |
+
## CREATE SIMILARITY MATRIX
|
41 |
+
video_len = QDF_emb.size()[1]
|
42 |
+
query_len = query_emb.size()[1]
|
43 |
+
|
44 |
+
_QDF_emb = QDF_emb.unsqueeze(2).repeat(1, 1, query_len, 1)
|
45 |
+
# [bs, video_len, 1, feat_size] => [bs, video_len, query_len, feat_size]
|
46 |
+
|
47 |
+
_query_emb = query_emb.unsqueeze(1).repeat(1, video_len, 1, 1)
|
48 |
+
# [bs, 1, query_len, feat_size] => [bs, video_len, query_len, feat_size]
|
49 |
+
|
50 |
+
elementwise_prod = torch.mul(_QDF_emb, _query_emb)
|
51 |
+
# [bs, video_len, query_len, feat_size]
|
52 |
+
|
53 |
+
alpha = torch.cat([_QDF_emb, _query_emb, elementwise_prod], dim=3)
|
54 |
+
# [bs, video_len, query_len, feat_size*3]
|
55 |
+
|
56 |
+
similarity_matrix = self.similarity_weight(alpha).view(-1, video_len, query_len)
|
57 |
+
|
58 |
+
similarity_matrix_mask = torch.einsum("bn,bm->bnm", video_mask, query_mask)
|
59 |
+
# [bs, video_len, query_len]
|
60 |
+
|
61 |
+
## CALCULATE Video2Query ATTENTION
|
62 |
+
|
63 |
+
a = F.softmax(mask_logits(similarity_matrix,
|
64 |
+
similarity_matrix_mask), dim=-1)
|
65 |
+
# [bs, video_len, query_len]
|
66 |
+
|
67 |
+
V2Q = torch.bmm(a, query_emb)
|
68 |
+
# [bs] ([video_len, query_len] X [query_len, feat_size]) => [bs, video_len, feat_size]
|
69 |
+
|
70 |
+
## CALCULATE Query2Video ATTENTION
|
71 |
+
|
72 |
+
b = F.softmax(torch.max(mask_logits(similarity_matrix, similarity_matrix_mask), 2)[0], dim=-1)
|
73 |
+
# [bs, video_len]
|
74 |
+
|
75 |
+
b = b.unsqueeze(1)
|
76 |
+
# [bs, 1, video_len]
|
77 |
+
|
78 |
+
Q2V = torch.bmm(b, QDF_emb)
|
79 |
+
# [bs] ([bs, 1, video_len] X [bs, video_len, feat_size]) => [bs, 1, feat_size]
|
80 |
+
|
81 |
+
Q2V = Q2V.repeat(1, video_len, 1)
|
82 |
+
# [bs, video_len, feat_size]
|
83 |
+
|
84 |
+
## Concatenate QDF_emb with three query-aware features
|
85 |
+
|
86 |
+
QAL = torch.cat([QDF_emb, V2Q,
|
87 |
+
torch.mul(QDF_emb, V2Q),
|
88 |
+
torch.mul(QDF_emb, Q2V)], dim=2)
|
89 |
+
|
90 |
+
# [bs, video_len, feat_size*4]
|
91 |
+
|
92 |
+
return QAL
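A shape-level sketch of this module's contract, reusing the class defined above with hypothetical sizes:

```python
import torch

bs, L_v, L_q, d = 2, 100, 30, 768
attn = BiDirectionalAttention(video_dim=d)
QDF = torch.randn(bs, L_v, d)
query = torch.randn(bs, L_q, d)
video_mask = torch.ones(bs, L_v)
query_mask = torch.ones(bs, L_q)

QAL = attn(QDF, query, video_mask, query_mask)
print(QAL.shape)   # torch.Size([2, 100, 3072]) == (bs, L_v, 4 * d)
```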
|
model/transformer/__init__.py
ADDED
File without changes
|
model/transformer/bert.py
ADDED
@@ -0,0 +1,275 @@
1 |
+
"""
|
2 |
+
BERT/RoBERTa layers from the huggingface implementation
|
3 |
+
(https://github.com/huggingface/transformers)
|
4 |
+
"""
|
5 |
+
|
6 |
+
import torch
|
7 |
+
import torch.nn as nn
|
8 |
+
import torch.nn.functional as F
|
9 |
+
from model.modeling_utils import prune_linear_layer
|
10 |
+
import math
|
11 |
+
import logging
|
12 |
+
logger = logging.getLogger(__name__)
|
13 |
+
try:
|
14 |
+
from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
|
15 |
+
except (ImportError, AttributeError) as e:
|
16 |
+
BertLayerNorm = torch.nn.LayerNorm
|
17 |
+
|
18 |
+
|
19 |
+
def gelu(x):
|
20 |
+
""" Original Implementation of the gelu activation function
|
21 |
+
in the Google BERT repo when initially created.
|
22 |
+
For information: OpenAI GPT's gelu is slightly different
|
23 |
+
(and gives slightly different results):
|
24 |
+
0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi)
|
25 |
+
* (x + 0.044715 * torch.pow(x, 3))))
|
26 |
+
Also see https://arxiv.org/abs/1606.08415
|
27 |
+
"""
|
28 |
+
return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
|
29 |
+
|
30 |
+
|
31 |
+
def gelu_new(x):
|
32 |
+
""" Implementation of the gelu activation function currently
|
33 |
+
in Google Bert repo (identical to OpenAI GPT).
|
34 |
+
Also see https://arxiv.org/abs/1606.08415
|
35 |
+
"""
|
36 |
+
return 0.5 * x * (
|
37 |
+
1 + torch.tanh(
|
38 |
+
math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
|
39 |
+
|
40 |
+
def swish(x):
|
41 |
+
return x * torch.sigmoid(x)
|
42 |
+
|
43 |
+
|
44 |
+
ACT2FN = {
|
45 |
+
"gelu": gelu,
|
46 |
+
"relu": torch.nn.functional.relu,
|
47 |
+
"swish": swish, "gelu_new": gelu_new}
|
48 |
+
|
49 |
+
class BertSelfAttention(nn.Module):
|
50 |
+
def __init__(self, config):
|
51 |
+
super(BertSelfAttention, self).__init__()
|
52 |
+
if config.hidden_size % config.num_attention_heads != 0:
|
53 |
+
raise ValueError(
|
54 |
+
"The hidden size (%d) is not a multiple of "
|
55 |
+
"the number of attention heads (%d)" % (
|
56 |
+
config.hidden_size, config.num_attention_heads))
|
57 |
+
self.output_attentions = config.output_attentions
|
58 |
+
|
59 |
+
self.num_attention_heads = config.num_attention_heads
|
60 |
+
self.attention_head_size = int(
|
61 |
+
config.hidden_size / config.num_attention_heads)
|
62 |
+
self.all_head_size = self.num_attention_heads *\
|
63 |
+
self.attention_head_size
|
64 |
+
|
65 |
+
self.query = nn.Linear(config.hidden_size, self.all_head_size)
|
66 |
+
self.key = nn.Linear(config.hidden_size, self.all_head_size)
|
67 |
+
self.value = nn.Linear(config.hidden_size, self.all_head_size)
|
68 |
+
|
69 |
+
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
|
70 |
+
|
71 |
+
def transpose_for_scores(self, x):
|
72 |
+
new_x_shape = x.size()[:-1] + (
|
73 |
+
self.num_attention_heads, self.attention_head_size)
|
74 |
+
x = x.view(*new_x_shape)
|
75 |
+
return x.permute(0, 2, 1, 3)
|
76 |
+
|
77 |
+
def forward(self, hidden_states, attention_mask=None, head_mask=None):
|
78 |
+
mixed_query_layer = self.query(hidden_states)
|
79 |
+
mixed_key_layer = self.key(hidden_states)
|
80 |
+
mixed_value_layer = self.value(hidden_states)
|
81 |
+
|
82 |
+
query_layer = self.transpose_for_scores(mixed_query_layer)
|
83 |
+
key_layer = self.transpose_for_scores(mixed_key_layer)
|
84 |
+
value_layer = self.transpose_for_scores(mixed_value_layer)
|
85 |
+
|
86 |
+
# Take the dot product between "query"
|
87 |
+
# and "key" to get the raw attention scores.
|
88 |
+
attention_scores = torch.matmul(
|
89 |
+
query_layer, key_layer.transpose(-1, -2))
|
90 |
+
attention_scores = attention_scores / math.sqrt(
|
91 |
+
self.attention_head_size)
|
92 |
+
if attention_mask is not None:
|
93 |
+
# Apply the attention mask
|
94 |
+
# (precomputed for all layers in BertModel forward() function)
|
95 |
+
attention_scores = attention_scores + attention_mask
|
96 |
+
|
97 |
+
# Normalize the attention scores to probabilities.
|
98 |
+
attention_probs = nn.Softmax(dim=-1)(attention_scores)
|
99 |
+
|
100 |
+
# This is actually dropping out entire tokens to attend to, which might
|
101 |
+
# seem a bit unusual, but is taken from the original Transformer paper.
|
102 |
+
attention_probs = self.dropout(attention_probs)
|
103 |
+
|
104 |
+
# Mask heads if we want to
|
105 |
+
if head_mask is not None:
|
106 |
+
attention_probs = attention_probs * head_mask
|
107 |
+
|
108 |
+
context_layer = torch.matmul(attention_probs, value_layer)
|
109 |
+
|
110 |
+
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
|
111 |
+
new_context_layer_shape = context_layer.size()[:-2] + (
|
112 |
+
self.all_head_size,)
|
113 |
+
context_layer = context_layer.view(*new_context_layer_shape)
|
114 |
+
|
115 |
+
outputs = (context_layer, attention_probs)\
|
116 |
+
if self.output_attentions else (context_layer,)
|
117 |
+
return outputs
|
118 |
+
|
119 |
+
|
120 |
+
class BertSelfOutput(nn.Module):
|
121 |
+
def __init__(self, config):
|
122 |
+
super(BertSelfOutput, self).__init__()
|
123 |
+
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
124 |
+
self.LayerNorm = BertLayerNorm(
|
125 |
+
config.hidden_size, eps=config.layer_norm_eps)
|
126 |
+
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
127 |
+
|
128 |
+
def forward(self, hidden_states, input_tensor):
|
129 |
+
hidden_states = self.dense(hidden_states)
|
130 |
+
hidden_states = self.dropout(hidden_states)
|
131 |
+
hidden_states = self.LayerNorm(hidden_states + input_tensor)
|
132 |
+
return hidden_states
|
133 |
+
|
134 |
+
|
135 |
+
class BertAttention(nn.Module):
|
136 |
+
def __init__(self, config):
|
137 |
+
super(BertAttention, self).__init__()
|
138 |
+
self.self = BertSelfAttention(config)
|
139 |
+
self.output = BertSelfOutput(config)
|
140 |
+
self.pruned_heads = set()
|
141 |
+
|
142 |
+
def prune_heads(self, heads):
|
143 |
+
if len(heads) == 0:
|
144 |
+
return
|
145 |
+
mask = torch.ones(
|
146 |
+
self.self.num_attention_heads, self.self.attention_head_size)
|
147 |
+
# Convert to set and remove already pruned heads
|
148 |
+
heads = set(heads) - self.pruned_heads
|
149 |
+
for head in heads:
|
150 |
+
# Compute how many pruned heads are
|
151 |
+
# before the head and move the index accordingly
|
152 |
+
head = head - sum(1 if h < head else 0 for h in self.pruned_heads)
|
153 |
+
mask[head] = 0
|
154 |
+
mask = mask.view(-1).contiguous().eq(1)
|
155 |
+
index = torch.arange(len(mask))[mask].long()
|
156 |
+
|
157 |
+
# Prune linear layers
|
158 |
+
self.self.query = prune_linear_layer(self.self.query, index)
|
159 |
+
self.self.key = prune_linear_layer(self.self.key, index)
|
160 |
+
self.self.value = prune_linear_layer(self.self.value, index)
|
161 |
+
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
|
162 |
+
|
163 |
+
# Update hyper params and store pruned heads
|
164 |
+
self.self.num_attention_heads = self.self.num_attention_heads - len(
|
165 |
+
heads)
|
166 |
+
self.self.all_head_size =\
|
167 |
+
self.self.attention_head_size * self.self.num_attention_heads
|
168 |
+
self.pruned_heads = self.pruned_heads.union(heads)
|
169 |
+
|
170 |
+
def forward(self, input_tensor, attention_mask=None, head_mask=None):
|
171 |
+
self_outputs = self.self(input_tensor, attention_mask, head_mask)
|
172 |
+
attention_output = self.output(self_outputs[0], input_tensor)
|
173 |
+
# add attentions if we output them
|
174 |
+
outputs = (attention_output,) + self_outputs[1:]
|
175 |
+
return outputs
|
176 |
+
|
177 |
+
|
178 |
+
class BertIntermediate(nn.Module):
|
179 |
+
def __init__(self, config):
|
180 |
+
super(BertIntermediate, self).__init__()
|
181 |
+
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
|
182 |
+
if isinstance(config.hidden_act, str):
|
183 |
+
self.intermediate_act_fn = ACT2FN[config.hidden_act]
|
184 |
+
else:
|
185 |
+
self.intermediate_act_fn = config.hidden_act
|
186 |
+
|
187 |
+
def forward(self, hidden_states):
|
188 |
+
hidden_states = self.dense(hidden_states)
|
189 |
+
hidden_states = self.intermediate_act_fn(hidden_states)
|
190 |
+
return hidden_states
|
191 |
+
|
192 |
+
|
193 |
+
class BertOutput(nn.Module):
|
194 |
+
def __init__(self, config):
|
195 |
+
super(BertOutput, self).__init__()
|
196 |
+
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
|
197 |
+
self.LayerNorm = BertLayerNorm(
|
198 |
+
config.hidden_size, eps=config.layer_norm_eps)
|
199 |
+
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
200 |
+
|
201 |
+
def forward(self, hidden_states, input_tensor):
|
202 |
+
hidden_states = self.dense(hidden_states)
|
203 |
+
hidden_states = self.dropout(hidden_states)
|
204 |
+
hidden_states = self.LayerNorm(hidden_states + input_tensor)
|
205 |
+
return hidden_states
|
206 |
+
|
207 |
+
|
208 |
+
class BertLayer(nn.Module):
|
209 |
+
def __init__(self, config):
|
210 |
+
super(BertLayer, self).__init__()
|
211 |
+
self.attention = BertAttention(config)
|
212 |
+
self.intermediate = BertIntermediate(config)
|
213 |
+
self.output = BertOutput(config)
|
214 |
+
|
215 |
+
def forward(self, hidden_states, attention_mask=None, head_mask=None):
|
216 |
+
attention_outputs = self.attention(
|
217 |
+
hidden_states, attention_mask, head_mask)
|
218 |
+
attention_output = attention_outputs[0]
|
219 |
+
intermediate_output = self.intermediate(attention_output)
|
220 |
+
layer_output = self.output(intermediate_output, attention_output)
|
221 |
+
# add attentions if we output them
|
222 |
+
outputs = (layer_output,) + attention_outputs[1:]
|
223 |
+
return outputs
|
224 |
+
|
225 |
+
|
226 |
+
class BertEncoder(nn.Module):
|
227 |
+
def __init__(self, config):
|
228 |
+
super(BertEncoder, self).__init__()
|
229 |
+
self.output_attentions = config.output_attentions
|
230 |
+
self.output_hidden_states = config.output_hidden_states
|
231 |
+
self.layer = nn.ModuleList([BertLayer(config) for _ in range(
|
232 |
+
config.num_hidden_layers)])
|
233 |
+
|
234 |
+
def forward(self, hidden_states, attention_mask=None, head_mask=None):
|
235 |
+
|
236 |
+
# We create a 3D attention mask from a 2D tensor mask.
|
237 |
+
# Sizes are [batch_size, 1, 1, to_seq_length]
|
238 |
+
# So we can broadcast to
|
239 |
+
# [batch_size, num_heads, from_seq_length, to_seq_length]
|
240 |
+
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
|
241 |
+
|
242 |
+
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
|
243 |
+
# masked positions, this operation will create a tensor which is 0.0 for
|
244 |
+
# positions we want to attend and -10000.0 for masked positions.
|
245 |
+
# Since we are adding it to the raw scores before the softmax, this is
|
246 |
+
# effectively the same as removing these entirely.
|
247 |
+
extended_attention_mask = extended_attention_mask.to(
|
248 |
+
dtype=next(self.parameters()).dtype) # fp16 compatibility
|
249 |
+
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
|
250 |
+
|
251 |
+
|
252 |
+
all_hidden_states = ()
|
253 |
+
all_attentions = ()
|
254 |
+
for i, layer_module in enumerate(self.layer):
|
255 |
+
if self.output_hidden_states:
|
256 |
+
all_hidden_states = all_hidden_states + (hidden_states,)
|
257 |
+
|
258 |
+
layer_outputs = layer_module(
|
259 |
+
hidden_states, extended_attention_mask, None)
|
260 |
+
hidden_states = layer_outputs[0]
|
261 |
+
|
262 |
+
if self.output_attentions:
|
263 |
+
all_attentions = all_attentions + (layer_outputs[1],)
|
264 |
+
|
265 |
+
# Add last layer
|
266 |
+
if self.output_hidden_states:
|
267 |
+
all_hidden_states = all_hidden_states + (hidden_states,)
|
268 |
+
|
269 |
+
outputs = (hidden_states,)
|
270 |
+
if self.output_hidden_states:
|
271 |
+
outputs = outputs + (all_hidden_states,)
|
272 |
+
if self.output_attentions:
|
273 |
+
outputs = outputs + (all_attentions,)
|
274 |
+
# last-layer hidden state, (all hidden states), (all attentions)
|
275 |
+
return outputs
|
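The encoder stack above is self-contained: it consumes pre-computed feature embeddings plus a 2D padding mask and returns the last-layer hidden states. A minimal usage sketch follows (not part of the commit; the SimpleNamespace config is a stand-in for whatever config object the repo passes in, using only the attribute names the classes above actually read):

from types import SimpleNamespace

import torch

from model.transformer.bert import BertEncoder

# Stand-in config: only the fields read by the BERT layers above (assumed values).
cfg = SimpleNamespace(
    hidden_size=768, num_hidden_layers=2, num_attention_heads=12,
    intermediate_size=3072, hidden_act="gelu",
    hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1,
    layer_norm_eps=1e-12, output_attentions=False, output_hidden_states=False)

encoder = BertEncoder(cfg)
feats = torch.randn(2, 20, cfg.hidden_size)  # (batch, seq_len, hidden) features
mask = torch.ones(2, 20)                     # 1 = attend, 0 = padding
last_hidden = encoder(feats, attention_mask=mask)[0]
print(last_hidden.shape)                     # torch.Size([2, 20, 768])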
model/transformer/bert_embed.py
ADDED
@@ -0,0 +1,64 @@
+"""
+Input Embedding Layers
+"""
+import torch
+import torch.nn as nn
+import logging
+
+
+logger = logging.getLogger(__name__)
+try:
+    import apex.normalization.fused_layer_norm.FusedLayerNorm as BertLayerNorm
+except (ImportError, AttributeError) as e:
+    logger.info(
+        "Better speed can be achieved with apex installed from "
+        "https://www.github.com/nvidia/apex ."
+    )
+    BertLayerNorm = torch.nn.LayerNorm
+
+
+class BertEmbeddings(nn.Module):
+    """Construct the embeddings from word, position and token_type embeddings."""
+
+    def __init__(self, config):
+        super(BertEmbeddings, self).__init__()
+        # self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
+        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
+        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
+
+        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
+        # any TensorFlow checkpoint file
+        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+
+        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
+        # self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        # self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
+
+    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):
+        if input_ids is not None:
+            input_shape = input_ids.size()
+        else:
+            input_shape = inputs_embeds.size()[:-1]
+
+        seq_length = input_shape[1]
+
+        if position_ids is None:
+            position_ids = self.position_ids[:, :seq_length]
+
+        if token_type_ids is None:
+            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)
+
+        if inputs_embeds is None:
+            inputs_embeds = self.word_embeddings(input_ids)
+        token_type_embeddings = self.token_type_embeddings(token_type_ids)
+
+        position_embeddings = self.position_embeddings(position_ids)
+
+        embeddings = inputs_embeds + token_type_embeddings + position_embeddings
+
+        embeddings = self.LayerNorm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings
+
+
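Note that the word-embedding table and the registered position_ids buffer are both commented out in bert_embed.py, so BertEmbeddings is meant to receive pre-extracted features: callers should pass inputs_embeds, token_type_ids and position_ids explicitly, otherwise forward falls through to the missing self.word_embeddings / self.position_ids attributes. A small hedged sketch (again with a SimpleNamespace stand-in for the config):

from types import SimpleNamespace

import torch

from model.transformer.bert_embed import BertEmbeddings

# Assumed config values for illustration only.
cfg = SimpleNamespace(hidden_size=768, max_position_embeddings=512,
                      type_vocab_size=2, layer_norm_eps=1e-12,
                      hidden_dropout_prob=0.1)
embed = BertEmbeddings(cfg)

feats = torch.randn(2, 20, cfg.hidden_size)                 # pre-extracted features
token_type_ids = torch.zeros(2, 20, dtype=torch.long)       # single segment
position_ids = torch.arange(20).unsqueeze(0).expand(2, -1)  # explicit positions
out = embed(inputs_embeds=feats, token_type_ids=token_type_ids,
            position_ids=position_ids)
print(out.shape)  # torch.Size([2, 20, 768])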
ndcg_iou_topk.py
ADDED
@@ -0,0 +1,66 @@
+from utils.basic_utils import load_jsonl, save_jsonl, load_json
+import pandas as pd
+from tqdm import tqdm
+import numpy as np
+from collections import defaultdict
+import copy
+
+def calculate_iou(pred_start: float, pred_end: float, gt_start: float, gt_end: float) -> float:
+    intersection_start = max(pred_start, gt_start)
+    intersection_end = min(pred_end, gt_end)
+    intersection = max(0, intersection_end - intersection_start)
+    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
+    return intersection / union if union > 0 else 0
+
+
+# Function to calculate DCG
+def calculate_dcg(scores):
+    return sum((2**score - 1) / np.log2(idx + 2) for idx, score in enumerate(scores))
+
+# Function to calculate NDCG
+def calculate_ndcg(pred_scores, true_scores):
+    dcg = calculate_dcg(pred_scores)
+    idcg = calculate_dcg(sorted(true_scores, reverse=True))
+    return dcg / idcg if idcg > 0 else 0
+
+
+
+def calculate_ndcg_iou(all_gt, all_pred, TS, KS):
+    performance = defaultdict(lambda: defaultdict(list))
+    performance_avg = defaultdict(lambda: defaultdict(float))
+    for k in tqdm(all_pred.keys(), desc="Calculate NDCG"):
+        one_pred = all_pred[k]
+        one_gt = all_gt[k]
+
+        one_gt.sort(key=lambda x: x["relevance"], reverse=True)
+        for T in TS:
+            one_gt_drop = copy.deepcopy(one_gt)
+            predictions_with_scores = []
+
+            for pred in one_pred:
+                pred_video_name, pred_time = pred["video_name"], pred["timestamp"]
+                matched_rows = [gt for gt in one_gt_drop if gt["video_name"] == pred_video_name]
+                if not matched_rows:
+                    pred["pred_relevance"] = 0
+                else:
+                    ious = [calculate_iou(pred_time[0], pred_time[1], gt["timestamp"][0], gt["timestamp"][1]) for gt in matched_rows]
+                    max_iou_idx = np.argmax(ious)
+                    max_iou_row = matched_rows[max_iou_idx]
+
+                    if ious[max_iou_idx] > T:
+                        pred["pred_relevance"] = max_iou_row["relevance"]
+                        # Remove the matched ground truth row
+                        original_idx = one_gt_drop.index(max_iou_row)
+                        one_gt_drop.pop(original_idx)
+                    else:
+                        pred["pred_relevance"] = 0
+                predictions_with_scores.append(pred)
+            for K in KS:
+                true_scores = [gt["relevance"] for gt in one_gt][:K]
+                pred_scores = [pred["pred_relevance"] for pred in predictions_with_scores][:K]
+                ndcg_score = calculate_ndcg(pred_scores, true_scores)
+                performance[K][T].append(ndcg_score)
+    for K, vs in performance.items():
+        for T, v in vs.items():
+            performance_avg[K][T] = np.mean(v)
+    return performance_avg
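ndcg_iou_topk.py implements the ranking metric: each predicted moment is matched to the still-unmatched ground-truth moment of the same video with the highest IoU, inherits that moment's relevance only if the IoU exceeds the threshold T, and NDCG@K is then computed with the (2^rel - 1) / log2(rank + 1) gain. A toy call with invented data is sketched below (the real inputs come from the prediction JSONs and the TVR-Ranking annotations; the key and video names here are made up):

from ndcg_iou_topk import calculate_ndcg_iou  # assumes the repo root is on PYTHONPATH

# Invented example: one query with two ground-truth moments and two predictions.
all_gt = {
    "query_1": [
        {"video_name": "video_a", "timestamp": [10.0, 20.0], "relevance": 3},
        {"video_name": "video_b", "timestamp": [5.0, 9.0], "relevance": 1},
    ]
}
all_pred = {
    "query_1": [
        {"video_name": "video_a", "timestamp": [11.0, 19.0]},  # IoU 0.8 -> earns relevance 3
        {"video_name": "video_b", "timestamp": [50.0, 60.0]},  # no overlap -> relevance 0
    ]
}

avg = calculate_ndcg_iou(all_gt, all_pred, TS=[0.3, 0.5, 0.7], KS=[10, 20, 40])
print(avg[10][0.5])  # NDCG@10 at IoU > 0.5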
optim/adamw.py
ADDED
@@ -0,0 +1,106 @@
+"""
+AdamW optimizer (weight decay fix)
+originally from hugginface (https://github.com/huggingface/transformers).
+
+Copied from UNITER
+(https://github.com/ChenRocks/UNITER)
+"""
+import math
+
+import torch
+from torch.optim import Optimizer
+
+
+class AdamW(Optimizer):
+    """ Implements Adam algorithm with weight decay fix.
+    Parameters:
+        lr (float): learning rate. Default 1e-3.
+        betas (tuple of 2 floats): Adams beta parameters (b1, b2).
+            Default: (0.9, 0.999)
+        eps (float): Adams epsilon. Default: 1e-6
+        weight_decay (float): Weight decay. Default: 0.0
+        correct_bias (bool): can be set to False to avoid correcting bias
+            in Adam (e.g. like in Bert TF repository). Default True.
+    """
+    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
+                 weight_decay=0.0, correct_bias=True):
+        if lr < 0.0:
+            raise ValueError(
+                "Invalid learning rate: {} - should be >= 0.0".format(lr))
+        if not 0.0 <= betas[0] < 1.0:
+            raise ValueError("Invalid beta parameter: {} - "
+                             "should be in [0.0, 1.0[".format(betas[0]))
+        if not 0.0 <= betas[1] < 1.0:
+            raise ValueError("Invalid beta parameter: {} - "
+                             "should be in [0.0, 1.0[".format(betas[1]))
+        if not 0.0 <= eps:
+            raise ValueError("Invalid epsilon value: {} - "
+                             "should be >= 0.0".format(eps))
+        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
+                        correct_bias=correct_bias)
+        super(AdamW, self).__init__(params, defaults)
+
+    def step(self, closure=None):
+        """Performs a single optimization step.
+        Arguments:
+            closure (callable, optional): A closure that reevaluates the model
+                and returns the loss.
+        """
+        loss = None
+        if closure is not None:
+            loss = closure()
+
+        for group in self.param_groups:
+            for p in group['params']:
+                if p.grad is None:
+                    continue
+                grad = p.grad.data
+                if grad.is_sparse:
+                    raise RuntimeError(
+                        'Adam does not support sparse '
+                        'gradients, please consider SparseAdam instead')
+
+                state = self.state[p]
+
+                # State initialization
+                if len(state) == 0:
+                    state['step'] = 0
+                    # Exponential moving average of gradient values
+                    state['exp_avg'] = torch.zeros_like(p.data)
+                    # Exponential moving average of squared gradient values
+                    state['exp_avg_sq'] = torch.zeros_like(p.data)
+
+                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
+                beta1, beta2 = group['betas']
+
+                state['step'] += 1
+
+                # Decay the first and second moment running average coefficient
+                # In-place operations to update the averages at the same time
+                exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
+                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
+                denom = exp_avg_sq.sqrt().add_(group['eps'])
+
+                step_size = group['lr']
+                if group['correct_bias']:  # No bias correction for Bert
+                    bias_correction1 = 1.0 - beta1 ** state['step']
+                    bias_correction2 = 1.0 - beta2 ** state['step']
+                    step_size = (step_size * math.sqrt(bias_correction2)
+                                 / bias_correction1)
+
+                p.data.addcdiv_(exp_avg, denom, value=-step_size)
+
+                # Just adding the square of the weights to the loss function is
+                # *not* the correct way of using L2 regularization/weight decay
+                # with Adam, since that will interact with the m and v
+                # parameters in strange ways.
+                #
+                # Instead we want to decay the weights in a manner that doesn't
+                # interact with the m/v parameters. This is equivalent to
+                # adding the square of the weights to the loss with plain
+                # (non-momentum) SGD.
+                # Add weight decay at the end (fixed version)
+                if group['weight_decay'] > 0.0:
+                    p.data.add_(p.data, alpha=-group['lr'] * group['weight_decay'])
+
+        return loss
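The AdamW above decouples weight decay from the moment estimates: the decay is applied directly to the weights after the Adam update, per parameter group. A minimal usage sketch follows; the parameter grouping is illustrative and not necessarily how the repo's train.py builds its optimizer:

import torch
import torch.nn as nn

from optim.adamw import AdamW

model = nn.Linear(768, 2)
param_groups = [
    {"params": [model.weight], "weight_decay": 0.01},  # decayed parameters
    {"params": [model.bias], "weight_decay": 0.0},     # biases usually excluded from decay
]
optimizer = AdamW(param_groups, lr=1e-4)

loss = model(torch.randn(4, 768)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()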
results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a4d870ccff8ab61b72571cd7c9f84eb916d84fd7f091b2e300dfb9d4be5ee518
+size 29628
results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01_back.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ef85a542568c80fab7d57d69041ebd898e30d4fc912082bd4d571aea3ec6424c
+size 29917
results/tvr-top01-2024_07_08_17_18_30/best_test_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0becb2747c635a0080149ccb3e92975f7bf4bf3a99d025fd41d29ae9287db438
+size 14263264
results/tvr-top01-2024_07_08_17_18_30/best_val_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:47ced0079b54bdbc05268645d80c6fa52b1ed44c6e04f6922d535be29aa3fd8c
+size 2560976
results/tvr-top01-2024_07_08_17_18_30/code.zip
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:88b0711364459d5340f2e887420295145188a9008d5b50b5ddde46b221645c23
+size 1141392
results/tvr-top01-2024_07_08_17_18_30/model.ckpt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:aa2b8044636fe7ce9ab4d36df179ec2358f10a579de4ee5a7e58f338553558d2
+size 190742082
results/tvr-top01-2024_07_08_17_18_30/opt.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c93c28739229f5e35afc1239e1f30e0cad28353909eed88b6d65732943a5ac61
+size 1370
results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ea621825b2f1d618daf456f872246d6d50bd3729a36606c7cdcf75dcddbec57a
+size 30298
results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20_back.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:03b9976e0b0049f434e91251cfcde27b9a2334e95216d995ada4699f83d889c9
+size 31752
results/tvr-top20-2024_07_08_21_19_47/best_test_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:12895f4d15d70eff1737745bda045cf6fb1bf6e85aa4e8c4cdd86633cb70274a
+size 14324579
results/tvr-top20-2024_07_08_21_19_47/best_val_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:103076d328e1b7efdc2773625c38fc73a29492a67bcb27e023af73f8b21c8732
+size 2571786
results/tvr-top20-2024_07_08_21_19_47/code.zip
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:88b0711364459d5340f2e887420295145188a9008d5b50b5ddde46b221645c23
+size 1141392
results/tvr-top20-2024_07_08_21_19_47/model.ckpt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:baff5eaebb7f211640af4e21f2876be344eaa95431ab32398ac7260e9803471f
+size 190742082
results/tvr-top20-2024_07_08_21_19_47/opt.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:90d02a58cbb9a5ea0f23e3fefedd3f8f7b8852332b4877cfe7ba2833ca699071
+size 1368
results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:895455a13565da5f3d44126722152288a3057649fef1daa94d7558d490d97d81
+size 24491
results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40_back.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6085e3055b53b0afc63799813027a70b1d1999beeecf22b0accda3b5a60fe8cc
+size 26137
results/tvr-top40-2024_07_11_10_58_46/best_test_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5deaab54d6eec95172c5877b38dc72712f76b0357f26e255938a55835627ed2c
+size 14329598
results/tvr-top40-2024_07_11_10_58_46/best_val_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e9d7b68cde82958c1a7039210d2ac4bb5cfb5083abee6bbb550083395061a8a8
+size 2572649
results/tvr-top40-2024_07_11_10_58_46/code.zip
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:88e51fa09336f4a4545dc2e281cfe8cea943daf17de87c12b6b75d226fdb61dd
+size 1141399
results/tvr-top40-2024_07_11_10_58_46/model.ckpt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5eba8e53656fed1ddcbb7d8129bd6c72862797c63684f11121a9a78c86b30c70
+size 190742082
results/tvr-top40-2024_07_11_10_58_46/opt.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0e03b5de0524d803c796aaef3fa4aaf1152cfae63644403e236262fe1a4663b3
+size 1368
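The results/** entries above are Git LFS pointer files (version / oid / size), not the artifacts themselves; the checkpoints, logs and prediction JSONs are fetched with git lfs pull after cloning. A small hedged helper (not part of the repo) to check whether a local copy has actually been materialised:

from pathlib import Path

def is_lfs_pointer(path: str) -> bool:
    """True if the file still holds the 3-line LFS pointer instead of the real artifact."""
    head = Path(path).read_bytes()[:80]
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")

print(is_lfs_pointer("results/tvr-top01-2024_07_08_17_18_30/model.ckpt"))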
run_disjoint_top01.sh
ADDED
@@ -0,0 +1,19 @@
+python train.py \
+    --model_name conquer \
+    --dataset_config config/tvr_ranking_data_config_top01.json \
+    --model_config config/model_config.json \
+    --eval_tasks_at_training VCMR \
+    --use_interal_vr_scores \
+    --use_extend_pool 500 \
+    --neg_video_num 0 \
+    --max_vcmr_video 10 \
+    --similarity_measure disjoint \
+    --bsz 196 \
+    --eval_query_bsz 8 \
+    --eval_num_per_epoch 0.05 \
+    --n_epoch 4000 \
+    --exp_id top01
+
+# qsub -I -l select=1:ngpus=1 -P gs_slab -q gpu8
+# cd 11_TVR-Ranking/CONQUER/; conda activate py11; sh run_disjoint_top01.sh
+
run_disjoint_top20.sh
ADDED
@@ -0,0 +1,19 @@
+python train.py \
+    --model_name conquer \
+    --dataset_config config/tvr_ranking_data_config_top20.json \
+    --model_config config/model_config.json \
+    --eval_tasks_at_training VCMR \
+    --use_interal_vr_scores \
+    --use_extend_pool 500 \
+    --neg_video_num 0 \
+    --max_vcmr_video 10 \
+    --similarity_measure disjoint \
+    --bsz 196 \
+    --eval_query_bsz 8 \
+    --eval_num_per_epoch 1 \
+    --n_epoch 200 \
+    --exp_id top20
+
+# qsub -I -l select=1:ngpus=1 -P gs_slab -q gpu8
+# cd 11_TVR-Ranking/CONQUER/; conda activate py11; sh run_disjoint_top20.sh
+
run_disjoint_top40.sh
ADDED
@@ -0,0 +1,19 @@
+python train.py \
+    --model_name conquer \
+    --dataset_config config/tvr_ranking_data_config_top40.json \
+    --model_config config/model_config.json \
+    --eval_tasks_at_training VCMR \
+    --use_interal_vr_scores \
+    --use_extend_pool 500 \
+    --neg_video_num 0 \
+    --max_vcmr_video 10 \
+    --similarity_measure disjoint \
+    --bsz 196 \
+    --eval_query_bsz 8 \
+    --eval_num_per_epoch 2 \
+    --n_epoch 100 \
+    --exp_id top40
+
+# qsub -I -l select=1:ngpus=1 -P gs_slab -q gpu8
+# cd 11_TVR-Ranking/CONQUER/; conda activate py11; sh run_disjoint_top40.sh
+
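The three run scripts share every flag except the dataset config, evaluation frequency, epoch budget and experiment id. For reference, a hedged Python equivalent that launches the same three configurations (it assumes train.py and the config files referenced above are present; it is not part of the commit):

import subprocess

# (dataset_config, eval_num_per_epoch, n_epoch) per experiment, copied from the scripts above.
RUNS = {
    "top01": ("config/tvr_ranking_data_config_top01.json", "0.05", "4000"),
    "top20": ("config/tvr_ranking_data_config_top20.json", "1", "200"),
    "top40": ("config/tvr_ranking_data_config_top40.json", "2", "100"),
}

for exp_id, (data_cfg, eval_per_epoch, n_epoch) in RUNS.items():
    subprocess.run([
        "python", "train.py",
        "--model_name", "conquer",
        "--dataset_config", data_cfg,
        "--model_config", "config/model_config.json",
        "--eval_tasks_at_training", "VCMR",
        "--use_interal_vr_scores",
        "--use_extend_pool", "500",
        "--neg_video_num", "0",
        "--max_vcmr_video", "10",
        "--similarity_measure", "disjoint",
        "--bsz", "196",
        "--eval_query_bsz", "8",
        "--eval_num_per_epoch", eval_per_epoch,
        "--n_epoch", n_epoch,
        "--exp_id", exp_id,
    ], check=True)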