Liangrj5 committed
Commit a638e43 · 1 Parent(s): f2d2d1a
This view is limited to 50 files because it contains too many changes. See the raw diff.
Files changed (50)
  1. .gitattributes +2 -0
  2. .gitignore +1 -0
  3. README.md +47 -3
  4. config/config.py +227 -0
  5. config/model_config.json +3 -0
  6. config/tvr_ranking_data_config_top01.json +3 -0
  7. config/tvr_ranking_data_config_top20.json +3 -0
  8. config/tvr_ranking_data_config_top40.json +3 -0
  9. data_loader/second_stage_start_end_dataset.py +349 -0
  10. inference.py +570 -0
  11. model/__init__.py +0 -0
  12. model/backbone/__init__.py +0 -0
  13. model/backbone/encoder.py +235 -0
  14. model/conquer.py +205 -0
  15. model/head/__init__.py +0 -0
  16. model/head/ml_head.py +61 -0
  17. model/head/vs_head.py +42 -0
  18. model/layers.py +196 -0
  19. model/modeling_utils.py +135 -0
  20. model/qal/__init__.py +0 -0
  21. model/qal/query_aware_learning_module.py +92 -0
  22. model/transformer/__init__.py +0 -0
  23. model/transformer/bert.py +275 -0
  24. model/transformer/bert_embed.py +64 -0
  25. ndcg_iou_topk.py +66 -0
  26. optim/adamw.py +106 -0
  27. results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01.log +3 -0
  28. results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01_back.log +3 -0
  29. results/tvr-top01-2024_07_08_17_18_30/best_test_predictions.json +3 -0
  30. results/tvr-top01-2024_07_08_17_18_30/best_val_predictions.json +3 -0
  31. results/tvr-top01-2024_07_08_17_18_30/code.zip +3 -0
  32. results/tvr-top01-2024_07_08_17_18_30/model.ckpt +3 -0
  33. results/tvr-top01-2024_07_08_17_18_30/opt.json +3 -0
  34. results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20.log +3 -0
  35. results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20_back.log +3 -0
  36. results/tvr-top20-2024_07_08_21_19_47/best_test_predictions.json +3 -0
  37. results/tvr-top20-2024_07_08_21_19_47/best_val_predictions.json +3 -0
  38. results/tvr-top20-2024_07_08_21_19_47/code.zip +3 -0
  39. results/tvr-top20-2024_07_08_21_19_47/model.ckpt +3 -0
  40. results/tvr-top20-2024_07_08_21_19_47/opt.json +3 -0
  41. results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40.log +3 -0
  42. results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40_back.log +3 -0
  43. results/tvr-top40-2024_07_11_10_58_46/best_test_predictions.json +3 -0
  44. results/tvr-top40-2024_07_11_10_58_46/best_val_predictions.json +3 -0
  45. results/tvr-top40-2024_07_11_10_58_46/code.zip +3 -0
  46. results/tvr-top40-2024_07_11_10_58_46/model.ckpt +3 -0
  47. results/tvr-top40-2024_07_11_10_58_46/opt.json +3 -0
  48. run_disjoint_top01.sh +19 -0
  49. run_disjoint_top20.sh +19 -0
  50. run_disjoint_top40.sh +19 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ *.json filter=lfs diff=lfs merge=lfs -text
37
+ *.log filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1 @@
1
+ *__pycache__
README.md CHANGED
@@ -1,3 +1,47 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ datasets:
4
+ - axgroup/Ranking_TVR
5
+ language:
6
+ - en
7
+ ---
8
+ # CONQUER_RVMR
9
+
10
+ This repository contains the CONQUER model, which serves as the baseline for the Ranked Video Moment Retrieval (RVMR) task. The associated paper is titled "Video Moment Retrieval in Practical Setting: A Dataset of Ranked Moments for Imprecise Queries."
11
+
12
+ The main repository of the paper is [TVR-Ranking](https://huggingface.co/axgroup/TVR-Ranking), and this model is adapted from [CONQUER](https://github.com/houzhijian/CONQUER.git). The environment setup is the same as for RelocNet_RVMR, as detailed in the [TVR-Ranking](https://huggingface.co/axgroup/TVR-Ranking) repository.
13
+
14
+
15
+ CONQUER leverages video retrieval results from [HERO](https://github.com/linjieli222/HERO.git). We continue to use these
16
+ results when training on our TVR-Ranking dataset. Note that, because the HERO results are obtained from the TVR dataset, there could be a data leakage issue in our task setting. However, this issue is negligible for two reasons: (i) the queries used in our setting are imprecise, re-written queries, and (ii) each query has multiple ground-truth moments in our task setting, which were not annotated in the original TVR dataset.
17
+
18
+
19
+ ## Performance
20
+
21
+
22
+ | **Model** | **Train Set Top N** | **Val IoU=0.3** | **Test IoU=0.3** | **Val IoU=0.5** | **Test IoU=0.5** | **Val IoU=0.7** | **Test IoU=0.7** |
23
+ |------------|---------------------|-----------------|------------------|-----------------|------------------|-----------------|------------------|
25
+ | **NDCG@10**| | | | | | | |
26
+ | CONQUER | 1 | 0.0999 | 0.0859 | 0.0844 | 0.0709 | 0.0530 | 0.0512 |
27
+ | CONQUER | 20 | 0.2406 | 0.2249 | 0.2222 | 0.2104 | 0.1672 | 0.1517 |
28
+ | CONQUER | 40 | 0.2450 | 0.2219 | 0.2262 | 0.2085 | 0.1670 | 0.1515 |
29
+ | **NDCG@20**| | | | | | | |
30
+ | CONQUER | 1 | 0.0952 | 0.0835 | 0.0808 | 0.0687 | 0.0526 | 0.0484 |
31
+ | CONQUER | 20 | 0.2130 | 0.1995 | 0.1976 | 0.1867 | 0.1527 | 0.1368 |
32
+ | CONQUER | 40 | 0.2183 | 0.1968 | 0.2022 | 0.1851 | 0.1524 | 0.1365 |
33
+ | **NDCG@40**| | | | | | | |
34
+ | CONQUER | 1 | 0.0974 | 0.0866 | 0.0832 | 0.0718 | 0.0557 | 0.0510 |
35
+ | CONQUER | 20 | 0.2029 | 0.1906 | 0.1891 | 0.1788 | 0.1476 | 0.1326 |
36
+ | CONQUER | 40 | 0.2080 | 0.1885 | 0.1934 | 0.1775 | 0.1473 | 0.1323 |
37
+
38
+
39
+ ## Quick Start
40
+
41
+ Modify the path in `run_disjoint_top20.sh` and then execute the script:
42
+
43
+ ```sh
44
+ sh run_disjoint_top20.sh
45
+ ```
46
+
47
+ Feel free to contribute or raise issues for any problems encountered.
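
For reference, each `best_val_predictions.json` / `best_test_predictions.json` file under `results/` maps a `query_id` to a ranked list of `{"video_name", "timestamp", "model_scores"}` entries, as built by `compute_query2ctx_info_disjoint` in `inference.py`. Below is a minimal sketch of scoring such a file offline; the paths are placeholders, and since the argument names of `calculate_ndcg_iou` are not shown in this commit, the thresholds are passed positionally as in `eval_epoch`:

```python
# Minimal sketch; paths are placeholders and the ground-truth layout follows
# StartEndDataset.get_relevant_moment_gt (query_id -> list of relevant moments).
from utils.basic_utils import load_json
from ndcg_iou_topk import calculate_ndcg_iou

pred_data = load_json("results/tvr-top20-2024_07_08_21_19_47/best_val_predictions.json")
pred_data = {int(k): v for k, v in pred_data.items()}  # JSON keys are strings; align with int query_ids
val_annotations = load_json("path/to/val.json")        # placeholder: TVR-Ranking val annotation file
gt_data = {q["query_id"]: q["relevant_moment"] for q in val_annotations}

# IoU thresholds and top-K values as reported in the performance table above.
average_ndcg = calculate_ndcg_iou(gt_data, pred_data, [0.3, 0.5, 0.7], [10, 20, 40])
print(average_ndcg)
```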
config/config.py ADDED
@@ -0,0 +1,227 @@
1
+ import os
2
+ import time
3
+ import torch
4
+ import argparse
5
+ import sys
6
+ import pprint
7
+
8
+ import json
9
+ from utils.basic_utils import mkdirp, load_json, save_json, make_zipfile
10
+
11
+
12
+ def parse_with_config(parser):
13
+ args = parser.parse_args()
14
+ if args.config is not None:
15
+ config_args = json.load(open(args.config))
16
+ override_keys = {arg[2:].split('=')[0] for arg in sys.argv[1:]
17
+ if arg.startswith('--')}
18
+ for k, v in config_args.items():
19
+ if k not in override_keys:
20
+ setattr(args, k, v)
21
+ del args.config
22
+ return args
23
+
24
+
25
+ class BaseOptions(object):
26
+ saved_option_filename = "opt.json"
27
+ ckpt_filename = "model.ckpt"
28
+ tensorboard_log_dir = "tensorboard_log"
29
+ train_log_filename = "train.log.txt"
30
+ eval_log_filename = "eval.log.txt"
31
+
32
+ def __init__(self):
33
+ self.parser = argparse.ArgumentParser()
34
+ self.initialized = False
35
+ self.opt = None
36
+
37
+ def initialize(self):
38
+ self.initialized = True
39
+ self.parser.add_argument("--dset_name", type=str, default="tvr", choices=["tvr", "didemo"])
40
+ self.parser.add_argument("--eval_split_name", type=str, default="val",
41
+ help="should match keys in video_duration_idx_path, must set for VCMR")
42
+ self.parser.add_argument("--data_ratio", type=float, default=1.0,
43
+ help="how many training and eval data to use. 1.0: use all, 0.1: use 10%."
44
+ "Use small portion for debug purposes. Note this is different from --debug, "
45
+ "which works by breaking the loops, typically they are not used together.")
46
+ self.parser.add_argument("--debug", action="store_true",
47
+ help="debug (fast) mode, break all loops, do not load all data into memory.")
48
+ self.parser.add_argument("--disable_eval", action="store_true",
49
+ help="disable eval")
50
+ self.parser.add_argument("--results_root", type=str, default="results")
51
+ self.parser.add_argument("--exp_id", type=str, default=None, help="id of this run, required at training")
52
+ self.parser.add_argument("--seed", type=int, default=2018, help="random seed")
53
+ self.parser.add_argument("--device", type=int, default=0, help="0 cuda, -1 cpu")
54
+ self.parser.add_argument("--device_ids", type=int, nargs="+", default=[0], help="GPU ids to run the job")
55
+ self.parser.add_argument("--num_workers", type=int, default=8,
56
+ help="num subprocesses used to load the data, 0: use main process")
57
+
58
+ # training config
59
+ self.parser.add_argument("--lr", type=float, default=1e-4, help="learning rate")
60
+ self.parser.add_argument("--lr_warmup_proportion", type=float, default=0.01,
61
+ help="Proportion of training to perform linear learning rate warmup for. "
62
+ "E.g., 0.1 = 10% of training.")
63
+ self.parser.add_argument("--wd", type=float, default=0.01, help="weight decay")
64
+ self.parser.add_argument("--n_epoch", type=int, default=50, help="number of epochs to run")
65
+ self.parser.add_argument("--max_es_cnt", type=int, default=3,
66
+ help="number of epochs to early stop, use -1 to disable early stop")
67
+ self.parser.add_argument("--eval_tasks_at_training", type=str, nargs="+",
68
+ default=["VCMR", "SVMR", "VR"], choices=["VCMR", "SVMR", "VR"],
69
+ help="evaluate and report numbers for tasks specified here.")
70
+ self.parser.add_argument("--bsz", type=int, default=128, help="mini-batch size")
71
+ self.parser.add_argument("--eval_query_bsz", type=int, default=8,
72
+ help="mini-batch size at inference, for query")
73
+ self.parser.add_argument("--no_eval_untrained", action="store_true", help="Evaluate on un-trained model")
74
+ self.parser.add_argument("--grad_clip", type=float, default=-1, help="perform gradient clip, -1: disable")
75
+ self.parser.add_argument("--eval_epoch_num", type=int, default=1, help="eval_epoch_num")
76
+
77
+ # Data config
78
+ self.parser.add_argument("--max_ctx_len", type=int, default=100,
79
+ help="max number of snippets, 100 for tvr clip_length=1.5, only 109/21825 > 100")
80
+ self.parser.add_argument("--max_desc_len", type=int, default=30, help="max number of query token")
81
+ self.parser.add_argument("--clip_length", type=float, default=1.5,
82
+ help="each video will be uniformly segmented into small clips")
83
+ self.parser.add_argument("--ctx_mode", type=str, default="visual_sub",
84
+ help="adopted modality list for each clip")
85
+ self.parser.add_argument("--dataset_config", type=str,help="data config")
86
+
87
+
88
+ # Model config
89
+
90
+ self.parser.add_argument("--visual_dim", type=int,default=4352,help="visual modality feature dimension")
91
+ self.parser.add_argument("--text_dim", type=int, default=768, help="textual modality feature dimension")
92
+ self.parser.add_argument("--query_dim", type=int, default=768, help="query feature dimension")
93
+ self.parser.add_argument("--hidden_dim", type=int, default=768, help="joint dimension")
94
+ self.parser.add_argument("--no_output_moe_weight",action="store_true",
95
+ help="whether NOT to use query dependent fusion")
96
+ self.parser.add_argument("--model_config", type=str, help="model config")
97
+
98
+
99
+ ## Train config
100
+ self.parser.add_argument("--lw_st_ed", type=float, default=0.01, help="weight for moment cross-entropy loss")
101
+ self.parser.add_argument("--lw_video_ce", type=float, default=0.05, help="weight for video cross-entropy loss")
102
+ self.parser.add_argument("--lr_mul", type=float, default=1, help="Learning rate multiplier for backbone module")
103
+ self.parser.add_argument("--use_extend_pool", type=int, default=1000,
104
+ help="use_extend_pool")
105
+ self.parser.add_argument("--neg_video_num",type=int,default=3,
106
+ help="sample the number of negative video, "
107
+ "if neg_video_num=0, then disable shared normalization training objective")
108
+ self.parser.add_argument("--encoder_pretrain_ckpt_filepath", type=str,
109
+ default="None",
110
+ help="first_stage_pretrain checkpoint")
111
+ self.parser.add_argument("--use_interal_vr_scores", action="store_true",
112
+ help="whether to use the internal VR scores; true only for the general similarity measure function")
113
+
114
+ ## Eval config
115
+ self.parser.add_argument("--similarity_measure",
116
+ type=str, choices=["general", "exclusive","disjoint"],
117
+ default="general",help="similarity_measure_function")
118
+ # post processing
119
+ self.parser.add_argument("--min_pred_l", type=int, default=0,
120
+ help="constrain the [st, ed] with ed - st >= 1"
121
+ "(1 clips with length 1.5 each, 1.5 secs in total"
122
+ "this is the min length for proposal-based method)")
123
+ self.parser.add_argument("--max_pred_l", type=int, default=24,
124
+ help="constrain the [st, ed] pairs with ed - st <= 24, 36 secs in total"
125
+ "(24 clips with length 1.5 each, "
126
+ "this is the max length for proposal-based method)")
127
+ self.parser.add_argument("--max_before_nms", type=int, default=200)
128
+ self.parser.add_argument("--max_vcmr_video", type=int, default=10,
129
+ help="ranking in top-max_vcmr_video")
130
+ self.parser.add_argument("--nms_thd", type=float, default=-1,
131
+ help="additionally use non-maximum suppression "
132
+ "(or non-minimum suppression for distance)"
133
+ "to post-processing the predictions. "
134
+ "-1: do not use nms. 0.7 for tvr")
135
+ self.parser.add_argument("--eval_num_per_epoch", type=float)
136
+
137
+ # can use config files
138
+ self.parser.add_argument('--config', help='JSON config files')
139
+ self.parser.add_argument('--model_name', type=str)
140
+
141
+
142
+ def display_save(self, opt):
143
+ args = vars(opt)
144
+ # Display settings
145
+ # print("------------ Options -------------\n{}\n-------------------"
146
+ # .format({str(k): str(v) for k, v in sorted(args.items())}))
147
+ print("------------ Options -------------\n{}\n-------------------"
148
+ .format(pprint.pformat({str(k): str(v) for k, v in sorted(args.items())}, indent=4)))
149
+
150
+
151
+ # Save settings
152
+ if not isinstance(self, TestOptions):
153
+ option_file_path = os.path.join(opt.results_dir, self.saved_option_filename) # not yaml file indeed
154
+ save_json(args, option_file_path, save_pretty=True)
155
+
156
+
157
+ def parse(self):
158
+ if not self.initialized:
159
+ self.initialize()
160
+ opt = parse_with_config(self.parser)
161
+
162
+ if opt.debug:
163
+ opt.results_root = os.path.sep.join(opt.results_root.split(os.path.sep)[:-1] + ["debug_results", ])
164
+ #opt.disable_eval = True
165
+
166
+ if isinstance(self, TestOptions):
167
+
168
+ # modify model_dir to absolute path
169
+ opt.model_dir = os.path.join("results", opt.model_dir)
170
+
171
+ saved_options = load_json(os.path.join(opt.model_dir, self.saved_option_filename))
172
+ for arg in saved_options: # use saved options to overwrite all BaseOptions args.
173
+ if arg not in ["results_root", "nms_thd", "debug", "dataset_config", "model_config","device",
174
+ "eval_split_name", "bsz", "eval_context_bsz", "device_ids",
175
+ "max_vcmr_video","max_pred_l", "min_pred_l", "external_inference_vr_res_path"]:
176
+ setattr(opt, arg, saved_options[arg])
177
+ else:
178
+ if opt.exp_id is None:
179
+ raise ValueError("--exp_id is required at training time!")
180
+
181
+ opt.results_dir = os.path.join(opt.results_root,
182
+ "-".join([opt.dset_name, opt.exp_id,
183
+ time.strftime("%Y_%m_%d_%H_%M_%S")]))
184
+ mkdirp(opt.results_dir)
185
+ # save a copy of current code
186
+ code_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
187
+ code_zip_filename = os.path.join(opt.results_dir, "code.zip")
188
+ make_zipfile(code_dir, code_zip_filename,
189
+ enclosing_dir="code",
190
+ exclude_dirs_substring="results",
191
+ exclude_dirs=["condor","data","results", "debug_results", "__pycache__"],
192
+ exclude_extensions=[".pyc", ".ipynb", ".swap"],)
193
+
194
+ self.display_save(opt)
195
+
196
+
197
+ # assert opt.stop_task in opt.eval_tasks_at_training
198
+ opt.ckpt_filepath = os.path.join(opt.results_dir, self.ckpt_filename)
199
+ opt.train_log_filepath = os.path.join(opt.results_dir, self.train_log_filename)
200
+ opt.eval_log_filepath = os.path.join(opt.results_dir, self.eval_log_filename)
201
+ opt.tensorboard_log_dir = os.path.join(opt.results_dir, self.tensorboard_log_dir)
202
+ opt.device = torch.device("cuda:%d" % opt.device_ids[0] if opt.device >= 0 else "cpu")
203
+
204
+ self.opt = opt
205
+ return opt
206
+
207
+
208
+ class TestOptions(BaseOptions):
209
+ """add additional options for evaluating"""
210
+ def initialize(self):
211
+ BaseOptions.initialize(self)
212
+ # also need to specify --eval_split_name
213
+ self.parser.add_argument("--eval_id", type=str, help="evaluation id")
214
+ self.parser.add_argument("--model_dir", type=str,
215
+ help="dir contains the model file, will be converted to absolute path afterwards")
216
+ self.parser.add_argument("--tasks", type=str, nargs="+",
217
+ choices=["VCMR", "SVMR", "VR"], default=["VCMR", "SVMR", "VR"],
218
+ help="Which tasks to run."
219
+ "VCMR: Video Corpus Moment Retrieval;"
220
+ "SVMR: Single Video Moment Retrieval;"
221
+ "VR: regular Video Retrieval. (will be performed automatically with VCMR)")
222
+
223
+ if __name__ == '__main__':
224
+ print(__file__)
225
+ print(os.path.realpath(__file__))
226
+ code_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
227
+ print(code_dir)
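
As a small illustration of the `parse_with_config` helper at the top of this file (the values and file names below are made up, not part of the commit): options found in the `--config` JSON fill in any argument that was not passed explicitly, while flags given on the command line always win.

```python
# Sketch only: demonstrates the override rule of parse_with_config with a throwaway parser.
import argparse, json, sys, tempfile
from config.config import parse_with_config

parser = argparse.ArgumentParser()
parser.add_argument("--config")
parser.add_argument("--lr", type=float, default=1e-4)
parser.add_argument("--bsz", type=int, default=128)

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"lr": 5e-5, "bsz": 64}, f)          # pretend config file

sys.argv = ["train.py", "--config", f.name, "--bsz", "32"]
args = parse_with_config(parser)
print(args.lr, args.bsz)                           # 5e-05 32: JSON fills lr, explicit --bsz wins
```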
config/model_config.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1458b56e285bd34b5db29a8e6babc61f9bf02d377a7ce594579baa833190f582
3
+ size 1637
config/tvr_ranking_data_config_top01.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:03ed22c7ab836800651a9ab882496e71d93266bb6dff35c13d308243d1a5c98e
3
+ size 926
config/tvr_ranking_data_config_top20.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:509c13907d08921dd59c41b040166b4e0fd6e49260fa79adca9d23f46a804f70
3
+ size 926
config/tvr_ranking_data_config_top40.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:75a6540a46a85534dcf79b5049cc47053cd48232f6983268a584565b4a55d48b
3
+ size 926
data_loader/second_stage_start_end_dataset.py ADDED
@@ -0,0 +1,349 @@
1
+ import torch
2
+ from torch.utils.data import Dataset
3
+ import math
4
+ import os
5
+ import random
6
+ import numpy as np
7
+ from utils.basic_utils import load_json, l2_normalize_np_array
8
+ import h5py
9
+
10
+
11
+ class StartEndDataset(Dataset):
12
+ """
13
+ Args:
14
+ dset_name, str, ["tvr"]
15
+ Return:
16
+ a dict: {
17
+ "model_inputs": {
18
+ "query"
19
+ "feat": torch.tensor, (max_desc_len, D_q)
20
+ "feat_mask": torch.tensor, (max_desc_len)
21
+ "feat_pos_id": torch.tensor, (max_desc_len)
22
+ "feat_token_id": torch.tensor, (max_desc_len)
23
+ "visual"
24
+ "feat": torch.tensor, (max_ctx_len, D_video)
25
+ "feat_mask": torch.tensor, (max_ctx_len)
26
+ "feat_pos_id": torch.tensor, (max_ctx_len)
27
+ "feat_token_id": torch.tensor, (max_ctx_len)
28
+ "sub" (optional)
29
+ "st_ed_indices": torch.LongTensor, (2, )
30
+ }
31
+ }
32
+ """
33
+ def __init__(self, config, data_path, vr_rank_path, max_ctx_len=100, max_desc_len=30, clip_length=1.5,ctx_mode="visual_sub",
34
+ is_eval = False, mode = "train",
35
+ neg_video_num=3, data_ratio=1,
36
+ use_extend_pool=500, inference_top_k=10):
37
+
38
+
39
+ self.dset_name = config.dset_name
40
+ self.root_path = config.root_path
41
+
42
+ self.desc_bert_path = os.path.join(self.root_path,config.desc_bert_path)
43
+ self.vid_feat_path = os.path.join(self.root_path,config.vid_feat_path)
44
+
45
+ self.ctx_mode = ctx_mode
46
+ self.use_sub = "sub" in self.ctx_mode
47
+
48
+ if self.use_sub:
49
+ self.sub_bert_path = os.path.join(self.root_path, config.sub_bert_path)
50
+
51
+ self.max_ctx_len = max_ctx_len
52
+ self.max_desc_len = max_desc_len
53
+ self.clip_length = clip_length
54
+
55
+ self.neg_video_num = neg_video_num
56
+ self.is_eval = is_eval
57
+
58
+ self.mode = mode
59
+ if mode in ["val", "test"]:
60
+ # = load_json(data_path)
61
+ self.annotations = load_json(data_path)
62
+ self.ground_truth = self.get_relevant_moment_gt()
63
+ self.annotations = self.expand_annotations( self.annotations)
64
+ if mode == "train":
65
+ self.annotations = self.expand_annotations(load_json(data_path))
66
+
67
+ self.first_VR_ranklist_pool_txn = h5py.File(vr_rank_path, "r")
68
+ self.query_bert_h5 = h5py.File(self.desc_bert_path, "r")
69
+ self.vid_feat_txn = h5py.File(self.vid_feat_path, "r")
70
+ if self.use_sub:
71
+ self.sub_bert_txn = h5py.File(self.sub_bert_path, "r")
72
+
73
+
74
+ self.inference_top_k = inference_top_k
75
+ video_data = load_json(os.path.join(self.root_path,config.video_duration_idx_path))
76
+
77
+ self.video_data = [{"vid_name": k, "duration": v[0]} for k, v in video_data.items()]
78
+ self.video2idx = {k: v[1] for k, v in video_data.items()}
79
+ self.idx2video = {v[1]:k for k, v in video_data.items()}
80
+ self.use_extend_pool = use_extend_pool
81
+
82
+ self.normalize_vfeat = True
83
+ self.normalize_tfeat = False
84
+
85
+ self.visual_token_id = 0
86
+ self.text_token_id = 1
87
+
88
+ def __len__(self):
89
+ return len(self.annotations)
90
+
91
+ def expand_annotations(self, annotations):
92
+ new_annotations = []
93
+ for i in annotations:
94
+ query = i["query"]
95
+ query_id = i["query_id"]
96
+ for moment in i["relevant_moment"]:
97
+ moment.update({'query': query, 'query_id': query_id})
98
+ new_annotations.append(moment)
99
+ return new_annotations
100
+
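
For reference, here is a hypothetical annotation entry and what `expand_annotations` turns it into (the query text and field values are made up; only `video_name`, `timestamp`, `query`, and `query_id` are read by `__getitem__` below):

```python
# Hypothetical TVR-Ranking style entry: one query with its list of relevant moments.
annotations = [{
    "query": "someone pours coffee",   # made-up query text
    "query_id": 42,
    "relevant_moment": [
        {"video_name": "vid_a", "timestamp": [3.2, 7.6]},
        {"video_name": "vid_b", "timestamp": [0.0, 5.0]},
    ],
}]

# expand_annotations flattens this into one sample per moment, copying query/query_id in:
# [{"video_name": "vid_a", "timestamp": [3.2, 7.6], "query": "someone pours coffee", "query_id": 42},
#  {"video_name": "vid_b", "timestamp": [0.0, 5.0], "query": "someone pours coffee", "query_id": 42}]
```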
101
+ def get_relevant_moment_gt(self):
102
+ gt_all = {}
103
+ for data in self.annotations:
104
+ gt_all[data["query_id"]] = data["relevant_moment"]
105
+ return gt_all
106
+
107
+
108
+ def pad_feature(self, feature, max_ctx_len):
109
+ """
110
+ Args:
111
+ feature: original feature without padding
112
+ max_ctx_len: the maximum length of video clips (or query token)
113
+
114
+ Returns:
115
+ feat_pad : padded feature
116
+ feat_mask : feature mask
117
+ """
118
+ N_clip, feat_dim = feature.shape
119
+
120
+ feat_pad = torch.zeros((max_ctx_len, feat_dim))
121
+ feat_mask = torch.zeros(max_ctx_len, dtype=torch.long)
122
+ feat_pad[:N_clip, :] = torch.from_numpy(feature)
123
+ feat_mask[:N_clip] = 1
124
+
125
+ return feat_pad , feat_mask
126
+
127
+ def get_query_feat_by_query_id(self, query_id, token_id=1):
128
+ """
129
+ Args:
130
+ query_id: unique query description id
131
+ token_id: specify modality embedding
132
+ Returns:
133
+ a dict for query: {
134
+ "feat": torch.tensor, (max_desc_len, D_q)
135
+ "feat_mask": torch.tensor, (max_desc_len)
136
+ "feat_pos_id": torch.tensor, (max_desc_len)
137
+ "feat_token_id": torch.tensor, (max_desc_len)
138
+ }
139
+ """
140
+
141
+ query_feat = self.query_bert_h5[str(query_id)][:self.max_desc_len]
142
+
143
+ if self.normalize_tfeat:
144
+ query_feat = l2_normalize_np_array(query_feat)
145
+
146
+ feat_pad, feat_mask = \
147
+ self.pad_feature(query_feat, self.max_desc_len)
148
+
149
+ temp_model_inputs = dict()
150
+ temp_model_inputs["feat"] = feat_pad
151
+ temp_model_inputs["feat_mask"] = feat_mask
152
+ temp_model_inputs["feat_pos_id"] = torch.arange(self.max_desc_len, dtype=torch.long)
153
+ temp_model_inputs["feat_token_id"] = torch.full((self.max_desc_len,), token_id, dtype=torch.long)
154
+
155
+ return temp_model_inputs
156
+
157
+ def get_visual_feat_from_storage(self,vid_name):
158
+ """
159
+ Args:
160
+ vid_name: unique video description id
161
+ Returns:
162
+ visual_feat: torch.tensor, (max_ctx_len, D_v)
163
+ Use ResNet + SlowFast , D_v = 2048 + 2304 = 4352
164
+ """
165
+
166
+ visual_feat = self.vid_feat_txn[vid_name][:][:self.max_ctx_len]
167
+
168
+ if self.normalize_vfeat:
169
+ visual_feat = l2_normalize_np_array(visual_feat)
170
+
171
+ return visual_feat
172
+
173
+ def get_sub_feat_from_storage(self,vid_name):
174
+ """
175
+ Args:
176
+ vid_name: unique video description id
177
+ Returns:
178
+ visual_feat: torch.tensor, (max_ctx_len, D_s)
179
+ Use RoBERTa, D_s =768
180
+ """
181
+
182
+ sub_feat = self.sub_bert_txn[vid_name][:][:self.max_ctx_len]
183
+
184
+ if self.normalize_tfeat:
185
+ sub_feat = l2_normalize_np_array(sub_feat)
186
+
187
+ return sub_feat
188
+
189
+ def __getitem__(self, index):
190
+
191
+ raw_data = self.annotations[index]
192
+ # if "video_name" not in raw_data.keys():
193
+ # initialize with basic data
194
+ meta = dict(
195
+ query_id=raw_data["query_id"],
196
+ desc=raw_data["query"],
197
+ vid_name=raw_data["video_name"],
198
+ ts=raw_data["timestamp"],
199
+ )
200
+
201
+ # If mode is test_public, no ground-truth video_id is provided. So use a fixed dummy ground-truth video_id
202
+ if self.mode =="test_public":
203
+ meta["vid_name"] = "placeholder"
204
+
205
+
206
+ model_inputs = dict()
207
+ ## query information
208
+ model_inputs["query"] = self.get_query_feat_by_query_id(meta["query_id"],
209
+ token_id=self.text_token_id)
210
+
211
+ query_id = meta["query_id"]
212
+ if query_id == 7806:
213
+ query_id += 1
214
+
215
+ _external_inference_vr_res = self.first_VR_ranklist_pool_txn[str(query_id)][:]
216
+ if not self.is_eval:
217
+ ##get the rank location of the ground-truth video for the first VR search engine
218
+ location = 100
219
+ for idx, item in enumerate(_external_inference_vr_res):
220
+ if meta["vid_name"] == self.idx2video[item[0]]:
221
+ location = idx
222
+ break
223
+
224
+ ##check all the location is below 100 when mode is train
225
+ # if self.mode =="train":
226
+ # assert 0<=location<100, meta["query_id"]
227
+
228
+ ##get the ranklist without the ground-truth video
229
+ negative_video_pool_list = [self.idx2video[item[0]] for item in _external_inference_vr_res if meta["vid_name"] != self.idx2video[item[0]] ]
230
+
231
+ ##sample neg_video_num negative videos for shared normalization
232
+ sampled_negative_video_pool = random.sample(negative_video_pool_list[:location+self.use_extend_pool],
233
+ k=self.neg_video_num)
234
+ ##the complete sampled video list , [pos, neg1, neg2, ...]
235
+ total_vid_name_list = [meta["vid_name"],] + sampled_negative_video_pool
236
+
237
+ self.shared_video_num = 1 + self.neg_video_num
238
+
239
+ else:
240
+ ##during eval, use top-k videos recommended by the first VR search engine
241
+ inference_video_list = [ self.idx2video[item[0]] for item in _external_inference_vr_res[:self.inference_top_k]]
242
+ inference_video_scores = [ item[1] for item in _external_inference_vr_res[:self.inference_top_k]]
243
+ model_inputs["inference_vr_scores"] = torch.FloatTensor(inference_video_scores)
244
+ total_vid_name_list = [meta["vid_name"],] + inference_video_list
245
+ self.shared_video_num = 1 + self.inference_top_k
246
+
247
+ # sampled neg_video_num negative videos or top-k videos
248
+ meta["sample_vid_name_list"] = total_vid_name_list[1:]
249
+
250
+ """
251
+ a dict for visual modality: {
252
+ "feat": torch.tensor, (shared_video_num, max_ctx_len, D_v)
253
+ "feat_mask": torch.tensor, (shared_video_num, max_ctx_len)
254
+ "feat_pos_id": torch.tensor, (shared_video_num, max_ctx_len)
255
+ "feat_token_id": torch.tensor, (shared_video_num, max_ctx_len)
256
+ }
257
+ """
258
+ groundtruth_visual_feat = self.get_visual_feat_from_storage(meta["vid_name"])
259
+ ctx_l, feat_dim = groundtruth_visual_feat.shape
260
+
261
+ visual_feat_pad = torch.zeros((self.shared_video_num, self.max_ctx_len, feat_dim))
262
+ visual_feat_mask = torch.zeros((self.shared_video_num, self.max_ctx_len), dtype=torch.long)
263
+ visual_feat_pos_id = \
264
+ torch.repeat_interleave(torch.arange(self.max_ctx_len, dtype=torch.long).unsqueeze(0),
265
+ self.shared_video_num, dim=0)
266
+ visual_feat_token_id = torch.full((self.shared_video_num, self.max_ctx_len), self.visual_token_id,
267
+ dtype=torch.long)
268
+
269
+ for index, video_name in enumerate(total_vid_name_list,start=0):
270
+ visual_feat = self.get_visual_feat_from_storage(video_name)
271
+
272
+ feat_pad, feat_mask = \
273
+ self.pad_feature(visual_feat, self.max_ctx_len)
274
+
275
+ visual_feat_pad[index] = feat_pad
276
+ visual_feat_mask[index] = feat_mask
277
+
278
+ temp_model_inputs = dict()
279
+ temp_model_inputs["feat"] = visual_feat_pad
280
+ temp_model_inputs["feat_mask"] = visual_feat_mask
281
+ temp_model_inputs["feat_pos_id"] = visual_feat_pos_id
282
+ temp_model_inputs["feat_token_id"] = visual_feat_token_id
283
+
284
+ model_inputs["visual"] = temp_model_inputs
285
+
286
+ """
287
+ a dict for sub modality: {
288
+ "feat": torch.tensor, (shared_video_num, max_ctx_len, D_t)
289
+ "feat_mask": torch.tensor, (shared_video_num, max_ctx_len)
290
+ "feat_pos_id": torch.tensor, (shared_video_num, max_ctx_len)
291
+ "feat_token_id": torch.tensor, (shared_video_num, max_ctx_len)
292
+ }
293
+ """
294
+ if self.use_sub:
295
+ groundtruth_sub_feat = self.get_sub_feat_from_storage(meta["vid_name"])
296
+
297
+ _ , feat_dim = groundtruth_sub_feat.shape
298
+
299
+ sub_feat_pad = torch.zeros((self.shared_video_num, self.max_ctx_len, feat_dim))
300
+ sub_feat_mask = torch.zeros((self.shared_video_num, self.max_ctx_len), dtype=torch.long)
301
+ sub_feat_pos_id = \
302
+ torch.repeat_interleave(torch.arange(self.max_ctx_len, dtype=torch.long).unsqueeze(0),
303
+ self.shared_video_num, dim=0)
304
+ sub_feat_token_id = torch.full((self.shared_video_num, self.max_ctx_len), self.text_token_id, dtype=torch.long)
305
+
306
+ for index, video_name in enumerate(total_vid_name_list, start=0):
307
+ sub_feat = self.get_sub_feat_from_storage(video_name)
308
+
309
+ feat_pad, feat_mask = \
310
+ self.pad_feature(sub_feat, self.max_ctx_len)
311
+
312
+ sub_feat_pad[index] = feat_pad
313
+ sub_feat_mask[index] = feat_mask
314
+
315
+ temp_model_inputs = dict()
316
+ temp_model_inputs["feat"] = sub_feat_pad
317
+ temp_model_inputs["feat_mask"] = sub_feat_mask
318
+ temp_model_inputs["feat_pos_id"] = sub_feat_pos_id
319
+ temp_model_inputs["feat_token_id"] = sub_feat_token_id
320
+
321
+ model_inputs["sub"] = temp_model_inputs
322
+
323
+ if not self.is_eval:
324
+ model_inputs["st_ed_indices"] = self.get_st_ed_label(meta["ts"],
325
+ max_idx=ctx_l - 1)
326
+
327
+ return dict(meta=meta, model_inputs=model_inputs)
328
+
329
+ def get_st_ed_label(self, ts, max_idx):
330
+ """
331
+ Args:
332
+ ts: [st (float), ed (float)] in seconds, ed > st
333
+ max_idx: length of the video
334
+
335
+ Returns:
336
+ [st_idx, ed_idx]: int,
337
+ ed_idx >= st_idx
338
+ st_idx, ed_idx both belong to [0, max_idx-1]
339
+
340
+ Given ts = [3.2, 7.6], st_idx = 2, ed_idx = 6,
341
+ clips should be indexed as [2: 6), the translated back ts should be [3:9].
342
+ # TODO which one is better, [2: 5] or [2: 6)
343
+ """
344
+ st_idx = min(math.floor(ts[0] / self.clip_length), max_idx)
345
+ ed_idx = min(math.ceil(ts[1] / self.clip_length) - 1, max_idx) # st_idx could be the same as ed_idx
346
+ assert 0 <= st_idx <= ed_idx <= max_idx, (ts, st_idx, ed_idx, max_idx)
347
+ return torch.LongTensor([st_idx, ed_idx])
348
+
349
+
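
A worked example of the timestamp-to-clip-index conversion in `get_st_ed_label` above, using the default `clip_length=1.5` and ignoring the `max_idx` clamp:

```python
# With clip_length = 1.5 and ts = [3.2, 7.6] (the docstring's example):
#   st_idx = floor(3.2 / 1.5) = 2
#   ed_idx = ceil(7.6 / 1.5) - 1 = 6 - 1 = 5   (inclusive end index)
# The labelled clips are therefore 2..5, which translate back to seconds [3.0, 9.0).
import math

clip_length, ts = 1.5, [3.2, 7.6]
st_idx = math.floor(ts[0] / clip_length)      # 2
ed_idx = math.ceil(ts[1] / clip_length) - 1   # 5
print(st_idx, ed_idx)                         # 2 5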
inference.py ADDED
@@ -0,0 +1,570 @@
1
+ import os
2
+ import pprint
3
+ from tqdm import tqdm
4
+ import numpy as np
5
+
6
+ import torch
7
+ import torch.nn.functional as F
8
+ import torch.backends.cudnn as cudnn
9
+ from torch.utils.data import DataLoader
10
+
11
+ from config.config import TestOptions
12
+ from model.conquer import CONQUER
13
+ from data_loader.second_stage_start_end_dataset import StartEndDataset as StartEndEvalDataset
14
+ from utils.inference_utils import \
15
+ get_submission_top_n, post_processing_vcmr_nms
16
+ from utils.basic_utils import save_json , load_config
17
+ from utils.tensor_utils import find_max_triples_from_upper_triangle_product
18
+ from standalone_eval.eval import eval_retrieval
19
+ from utils.model_utils import move_cuda , start_end_collate
20
+ from utils.model_utils import VERY_NEGATIVE_NUMBER
21
+ import logging
22
+ from time import time
23
+ from ndcg_iou_topk import calculate_ndcg_iou
24
+
25
+ logger = logging.getLogger(__name__)
26
+ logging.basicConfig(format="%(asctime)s.%(msecs)03d:%(levelname)s:%(name)s - %(message)s",
27
+ datefmt="%Y-%m-%d %H:%M:%S",
28
+ level=logging.INFO)
29
+
30
+ def generate_min_max_length_mask(array_shape, min_l, max_l):
31
+ """ The last two dimension denotes matrix of upper-triangle with upper-right corner masked,
32
+ below is the case for 4x4.
33
+ [[0, 1, 1, 0],
34
+ [0, 0, 1, 1],
35
+ [0, 0, 0, 1],
36
+ [0, 0, 0, 0]]
37
+
38
+ Args:
39
+ array_shape: shape tuple (e.g. np.ndarray.shape); the last two dimensions should be the same
40
+ min_l: int, minimum length of predicted span
41
+ max_l: int, maximum length of predicted span
42
+
43
+ Returns:
44
+
45
+ """
46
+ single_dims = (1, ) * (len(array_shape) - 2)
47
+ mask_shape = single_dims + array_shape[-2:]
48
+ extra_length_mask_array = np.ones(mask_shape, dtype=np.float32) # (1, ..., 1, L, L)
49
+ mask_triu = np.triu(extra_length_mask_array, k=min_l)
50
+ mask_triu_reversed = 1 - np.triu(extra_length_mask_array, k=max_l)
51
+ final_prob_mask = mask_triu * mask_triu_reversed
52
+ return final_prob_mask # with valid bit to be 1
53
+
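
As a quick sanity check (not part of the original file), the 4x4 matrix in the docstring above corresponds to `min_l=1, max_l=3`, i.e. only spans with `1 <= ed_idx - st_idx < 3` are kept:

```python
# Assumes it is run where generate_min_max_length_mask is importable/defined.
import numpy as np
from inference import generate_min_max_length_mask

mask = generate_min_max_length_mask((4, 4), min_l=1, max_l=3)
print(mask.astype(int))
# [[0 1 1 0]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
```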
54
+
55
+ def get_svmr_res_from_st_ed_probs_disjoint(svmr_gt_st_probs, svmr_gt_ed_probs, query_metas, video2idx,
56
+ clip_length, min_pred_l, max_pred_l, max_before_nms):
57
+ """
58
+ Args:
59
+ svmr_gt_st_probs: np.ndarray (N_queries, L), value range [0, 1]
60
+ svmr_gt_ed_probs:
61
+ query_metas:
62
+ video2idx:
63
+ clip_length: float, how long each clip is in seconds
64
+ min_pred_l: int, minimum number of clips
65
+ max_pred_l: int, maximum number of clips
66
+ max_before_nms: get top-max_before_nms predictions for each query
67
+
68
+ Returns:
69
+
70
+ """
71
+ svmr_res = []
72
+ query_vid_names = [e["vid_name"] for e in query_metas]
73
+
74
+ # masking very long ones! Since most are relatively short.
75
+ # disjoint : b_i + e_i
76
+ _st_ed_scores = np.expand_dims(svmr_gt_st_probs,axis=2) + np.expand_dims(svmr_gt_ed_probs,axis=1)
77
+
78
+ _N_q = _st_ed_scores.shape[0]
79
+
80
+ _valid_prob_mask = np.logical_not(generate_min_max_length_mask(
81
+ _st_ed_scores.shape, min_l=min_pred_l, max_l=max_pred_l).astype(bool))
82
+
83
+ valid_prob_mask = np.tile(_valid_prob_mask,(_N_q, 1, 1))
84
+
85
+ # invalid location will become VERY_NEGATIVE_NUMBER!
86
+ _st_ed_scores[valid_prob_mask] = VERY_NEGATIVE_NUMBER
87
+
88
+ batched_sorted_triples = find_max_triples_from_upper_triangle_product(
89
+ _st_ed_scores, top_n=max_before_nms, prob_thd=None)
90
+ for i, q_vid_name in tqdm(enumerate(query_vid_names),
91
+ desc="[SVMR] Loop over queries to generate predictions",
92
+ total=len(query_vid_names)): # i is query_id
93
+ q_m = query_metas[i]
94
+ video_idx = video2idx[q_vid_name]
95
+ _sorted_triples = batched_sorted_triples[i]
96
+ _sorted_triples[:, 1] += 1 # as we redefined ed_idx, which is inside the moment.
97
+ _sorted_triples[:, :2] = _sorted_triples[:, :2] * clip_length
98
+ # [video_idx(int), st(float), ed(float), score(float)]
99
+ cur_ranked_predictions = [[video_idx, ] + row for row in _sorted_triples.tolist()]
100
+ cur_query_pred = dict(
101
+ query_id=q_m["query_id"],
102
+ desc=q_m["desc"],
103
+ predictions=cur_ranked_predictions
104
+ )
105
+ svmr_res.append(cur_query_pred)
106
+ return svmr_res
107
+
108
+
109
+ def get_svmr_res_from_st_ed_probs(svmr_gt_st_probs, svmr_gt_ed_probs, query_metas, video2idx,
110
+ clip_length, min_pred_l, max_pred_l, max_before_nms):
111
+ """
112
+ Args:
113
+ svmr_gt_st_probs: np.ndarray (N_queries, L), value range [0, 1]
114
+ svmr_gt_ed_probs:
115
+ query_metas:
116
+ video2idx:
117
+ clip_length: float, how long each clip is in seconds
118
+ min_pred_l: int, minimum number of clips
119
+ max_pred_l: int, maximum number of clips
120
+ max_before_nms: get top-max_before_nms predictions for each query
121
+
122
+ Returns:
123
+
124
+ """
125
+ svmr_res = []
126
+ query_vid_names = [e["vid_name"] for e in query_metas]
127
+
128
+ # masking very long ones! Since most are relatively short.
129
+ # general/exclusive : \hat{b_i} * \hat{e_i}
130
+ st_ed_prob_product = np.einsum("bm,bn->bmn", svmr_gt_st_probs, svmr_gt_ed_probs) # (N, L, L)
131
+
132
+ valid_prob_mask = generate_min_max_length_mask(st_ed_prob_product.shape, min_l=min_pred_l, max_l=max_pred_l)
133
+ st_ed_prob_product *= valid_prob_mask # invalid location will become zero!
134
+
135
+ batched_sorted_triples = find_max_triples_from_upper_triangle_product(
136
+ st_ed_prob_product, top_n=max_before_nms, prob_thd=None)
137
+ for i, q_vid_name in tqdm(enumerate(query_vid_names),
138
+ desc="[SVMR] Loop over queries to generate predictions",
139
+ total=len(query_vid_names)): # i is query_id
140
+ q_m = query_metas[i]
141
+ video_idx = video2idx[q_vid_name]
142
+ _sorted_triples = batched_sorted_triples[i]
143
+ _sorted_triples[:, 1] += 1 # as we redefined ed_idx, which is inside the moment.
144
+ _sorted_triples[:, :2] = _sorted_triples[:, :2] * clip_length
145
+ # [video_idx(int), st(float), ed(float), score(float)]
146
+ cur_ranked_predictions = [[video_idx, ] + row for row in _sorted_triples.tolist()]
147
+ cur_query_pred = dict(
148
+ query_id=q_m["query_id"],
149
+ desc=q_m["desc"],
150
+ predictions=cur_ranked_predictions
151
+ )
152
+ svmr_res.append(cur_query_pred)
153
+ return svmr_res
154
+
155
+
156
+
157
+ def compute_query2ctx_info(model, eval_dataset, opt,
158
+ max_before_nms=200, max_n_videos=100, tasks=("SVMR",)):
159
+ """
160
+ Use val set to do evaluation, remember to run with torch.no_grad().
161
+ model : CONQUER
162
+ eval_dataset :
163
+ opt :
164
+ max_before_nms : max moment number before non-maximum suppression
165
+ tasks: evaluation tasks
166
+
167
+ general/exclusive function : r * \hat{b_i} + \hat{e_i}
168
+ """
169
+ is_vr = "VR" in tasks
170
+ is_vcmr = "VCMR" in tasks
171
+ is_svmr = "SVMR" in tasks
172
+
173
+ video2idx = eval_dataset.video2idx
174
+
175
+ model.eval()
176
+ query_eval_loader = DataLoader(eval_dataset,
177
+ collate_fn= start_end_collate,
178
+ batch_size=opt.eval_query_bsz,
179
+ num_workers=opt.num_workers,
180
+ shuffle=False,
181
+ pin_memory=True)
182
+
183
+ n_total_query = len(eval_dataset)
184
+ bsz = opt.eval_query_bsz
185
+
186
+ if is_vcmr:
187
+ flat_st_ed_scores_sorted_indices = np.empty((n_total_query, max_before_nms), dtype=int)
188
+ flat_st_ed_sorted_scores = np.zeros((n_total_query, max_before_nms), dtype=np.float32)
189
+
190
+ if is_vr :
191
+ if opt.use_interal_vr_scores:
192
+ sorted_q2c_indices = np.tile(np.arange(max_n_videos, dtype=int),n_total_query).reshape(n_total_query,max_n_videos)
193
+ sorted_q2c_scores = np.empty((n_total_query, max_n_videos), dtype=np.float32)
194
+ else:
195
+ sorted_q2c_indices = np.empty((n_total_query, max_n_videos), dtype=int)
196
+ sorted_q2c_scores = np.empty((n_total_query, max_n_videos), dtype=np.float32)
197
+
198
+ if is_svmr:
199
+ svmr_gt_st_probs = np.zeros((n_total_query, opt.max_ctx_len), dtype=np.float32)
200
+ svmr_gt_ed_probs = np.zeros((n_total_query, opt.max_ctx_len), dtype=np.float32)
201
+
202
+ query_metas = []
203
+ for idx, batch in tqdm(
204
+ enumerate(query_eval_loader), desc="Computing q embedding", total=len(query_eval_loader)):
205
+
206
+ _query_metas = batch["meta"]
207
+ query_metas.extend(batch["meta"])
208
+
209
+ if opt.device.type == "cuda":
210
+ model_inputs = move_cuda(batch["model_inputs"], opt.device)
211
+ else:
212
+ model_inputs = batch["model_inputs"]
213
+
214
+
215
+ video_similarity_score, begin_score_distribution, end_score_distribution = \
216
+ model.get_pred_from_raw_query(model_inputs)
217
+
218
+ if is_svmr:
219
+ _svmr_st_probs = begin_score_distribution[:, 0]
220
+ _svmr_ed_probs = end_score_distribution[:, 0]
221
+
222
+ # normalize to get true probabilities!!!
223
+ # the probabilities here are already (pad) masked, so only need to do softmax
224
+ _svmr_st_probs = F.softmax(_svmr_st_probs, dim=-1) # (_N_q, L)
225
+ _svmr_ed_probs = F.softmax(_svmr_ed_probs, dim=-1)
226
+ if opt.debug:
227
+ print("svmr_st_probs: ", _svmr_st_probs)
228
+
229
+ svmr_gt_st_probs[idx * bsz:(idx + 1) * bsz] = \
230
+ _svmr_st_probs.cpu().numpy()
231
+
232
+ svmr_gt_ed_probs[idx * bsz:(idx + 1) * bsz] = \
233
+ _svmr_ed_probs.cpu().numpy()
234
+
235
+ _vcmr_st_prob = begin_score_distribution[:, 1:]
236
+ _vcmr_ed_prob = end_score_distribution[:, 1:]
237
+
238
+ if not (is_vr or is_vcmr):
239
+ continue
240
+
241
+ if opt.use_interal_vr_scores:
242
+ bs = begin_score_distribution.size()[0]
243
+ _sorted_q2c_indices = torch.arange(max_n_videos).to(begin_score_distribution.device).repeat(bs,1)
244
+ _sorted_q2c_scores = model_inputs["inference_vr_scores"]
245
+ if is_vr:
246
+ sorted_q2c_scores[idx * bsz:(idx + 1) * bsz] = model_inputs["inference_vr_scores"].cpu().numpy()
247
+ else:
248
+ video_similarity_score = video_similarity_score[:, 1:]
249
+ _query_context_scores = torch.softmax(video_similarity_score,dim=1)
250
+
251
+ # Get top-max_n_videos videos for each query
252
+ _sorted_q2c_scores, _sorted_q2c_indices = \
253
+ torch.topk(_query_context_scores, max_n_videos, dim=1, largest=True)
254
+ if is_vr:
255
+ sorted_q2c_indices[idx * bsz:(idx + 1) * bsz] = _sorted_q2c_indices.cpu().numpy()
256
+ sorted_q2c_scores[idx * bsz:(idx + 1) * bsz] = _sorted_q2c_scores.cpu().numpy()
257
+
258
+
259
+ if not is_vcmr:
260
+ continue
261
+
262
+
263
+ # normalize to get true probabilities!!!
264
+ # the probabilities here are already (pad) masked, so only need to do softmax
265
+ _st_probs = F.softmax(_vcmr_st_prob, dim=-1) # (_N_q, N_videos, L)
266
+ _ed_probs = F.softmax(_vcmr_ed_prob, dim=-1)
267
+
268
+
269
+ # Get VCMR results
270
+ # compute combined scores
271
+ row_indices = torch.arange(0, len(_st_probs), device=opt.device).unsqueeze(1)
272
+ _st_probs = _st_probs[row_indices, _sorted_q2c_indices] # (_N_q, max_n_videos, L)
273
+ _ed_probs = _ed_probs[row_indices, _sorted_q2c_indices]
274
+
275
+ # (_N_q, max_n_videos, L, L)
276
+ # general/exclusive : r * \hat{b_i} * \hat{e_i}
277
+ _st_ed_scores = torch.einsum("qvm,qv,qvn->qvmn", _st_probs, _sorted_q2c_scores, _ed_probs)
278
+
279
+ valid_prob_mask = generate_min_max_length_mask(
280
+ _st_ed_scores.shape, min_l=opt.min_pred_l, max_l=opt.max_pred_l)
281
+
282
+ _st_ed_scores *= torch.from_numpy(
283
+ valid_prob_mask).to(_st_ed_scores.device) # invalid location will become zero!
284
+
285
+ _n_q = _st_ed_scores.shape[0]
286
+
287
+ # sort across the total_n_videos videos (by flatten from the 2nd dim)
288
+ # the indices here are local indices, not global indices
289
+
290
+ _flat_st_ed_scores = _st_ed_scores.reshape(_n_q, -1) # (N_q, total_n_videos*L*L)
291
+ _flat_st_ed_sorted_scores, _flat_st_ed_scores_sorted_indices = \
292
+ torch.sort(_flat_st_ed_scores, dim=1, descending=True)
293
+
294
+ # collect data
295
+ flat_st_ed_sorted_scores[idx * bsz:(idx + 1) * bsz] = \
296
+ _flat_st_ed_sorted_scores[:, :max_before_nms].detach().cpu().numpy()
297
+ flat_st_ed_scores_sorted_indices[idx * bsz:(idx + 1) * bsz] = \
298
+ _flat_st_ed_scores_sorted_indices[:, :max_before_nms].detach().cpu().numpy()
299
+
300
+ if opt.debug:
301
+ break
302
+
303
+ # Numpy starts here!!!
304
+ vr_res = []
305
+ if is_vr:
306
+ for i, (_sorted_q2c_scores_row, _sorted_q2c_indices_row) in tqdm(
307
+ enumerate(zip(sorted_q2c_scores, sorted_q2c_indices)),
308
+ desc="[VR] Loop over queries to generate predictions", total=n_total_query):
309
+ cur_vr_redictions = []
310
+ query_specific_video_metas = query_metas[i]["sample_vid_name_list"]
311
+ for j, (v_score, v_meta_idx) in enumerate(zip(_sorted_q2c_scores_row, _sorted_q2c_indices_row)):
312
+ video_idx = video2idx[query_specific_video_metas[v_meta_idx]]
313
+ cur_vr_redictions.append([video_idx, 0, 0, float(v_score)])
314
+ cur_query_pred = dict(
315
+ query_id=query_metas[i]["query_id"],
316
+ desc=query_metas[i]["desc"],
317
+ predictions=cur_vr_redictions
318
+ )
319
+ vr_res.append(cur_query_pred)
320
+
321
+ svmr_res = []
322
+ if is_svmr:
323
+ svmr_res = get_svmr_res_from_st_ed_probs(svmr_gt_st_probs, svmr_gt_ed_probs,
324
+ query_metas, video2idx,
325
+ clip_length=opt.clip_length,
326
+ min_pred_l=opt.min_pred_l,
327
+ max_pred_l=opt.max_pred_l,
328
+ max_before_nms=max_before_nms)
329
+
330
+
331
+ vcmr_res = []
332
+ if is_vcmr:
333
+ for i, (_flat_st_ed_scores_sorted_indices, _flat_st_ed_sorted_scores) in tqdm(
334
+ enumerate(zip(flat_st_ed_scores_sorted_indices, flat_st_ed_sorted_scores)),
335
+ desc="[VCMR] Loop over queries to generate predictions", total=n_total_query): # i is query_idx
336
+ # list([video_idx(int), st(float), ed(float), score(float)])
337
+ video_meta_indices_local, pred_st_indices, pred_ed_indices = \
338
+ np.unravel_index(_flat_st_ed_scores_sorted_indices,
339
+ shape=(max_n_videos, opt.max_ctx_len, opt.max_ctx_len))
340
+ # video_meta_indices refers to the indices among the total_n_videos
341
+ # video_meta_indices_local refers to the indices among the top-max_n_videos
342
+ # video_meta_indices refers to the indices in all the videos, which is the True indices
343
+ video_meta_indices = sorted_q2c_indices[i, video_meta_indices_local]
344
+
345
+ pred_st_in_seconds = pred_st_indices.astype(np.float32) * opt.clip_length
346
+ pred_ed_in_seconds = pred_ed_indices.astype(np.float32) * opt.clip_length + opt.clip_length
347
+ cur_vcmr_redictions = []
348
+ query_specific_video_metas = query_metas[i]["sample_vid_name_list"]
349
+ for j, (v_meta_idx, v_score) in enumerate(zip(video_meta_indices, _flat_st_ed_sorted_scores)): # videos
350
+ video_idx = video2idx[query_specific_video_metas[v_meta_idx]]
351
+ cur_vcmr_redictions.append(
352
+ [video_idx, float(pred_st_in_seconds[j]), float(pred_ed_in_seconds[j]), float(v_score)])
353
+
354
+ cur_query_pred = dict(
355
+ query_id=query_metas[i]["query_id"],
356
+ desc=query_metas[i]["desc"],
357
+ predictions=cur_vcmr_redictions)
358
+ vcmr_res.append(cur_query_pred)
359
+
360
+ res = dict(VCMR=vcmr_res, SVMR=svmr_res, VR=vr_res)
361
+ return {k: v for k, v in res.items() if len(v) != 0}
362
+
363
+
364
+ def compute_query2ctx_info_disjoint(model, eval_dataset, opt,
365
+ max_before_nms=200, max_n_videos=100, maxtopk = 40):
366
+ """Use val set to do evaluation, remember to run with torch.no_grad().
367
+ model : CONQUER
368
+ eval_dataset :
369
+ opt :
370
+ max_before_nms : max moment number before non-maximum suppression
371
+ tasks: evaluation tasks
372
+
373
+ disjoint function : b_i + e_i
374
+
375
+ """
376
+ video2idx = eval_dataset.video2idx
377
+
378
+ model.eval()
379
+ query_eval_loader = DataLoader(eval_dataset, collate_fn= start_end_collate, batch_size=opt.eval_query_bsz,
380
+ num_workers=opt.num_workers, shuffle=False, pin_memory=True)
381
+
382
+ n_total_query = len(eval_dataset)
383
+ bsz = opt.eval_query_bsz
384
+
385
+ flat_st_ed_scores_sorted_indices = np.empty((n_total_query, max_before_nms), dtype=int)
386
+ flat_st_ed_sorted_scores = np.zeros((n_total_query, max_before_nms), dtype=np.float32)
387
+
388
+
389
+ query_metas = []
390
+ for idx, batch in tqdm(
391
+ enumerate(query_eval_loader), desc="Computing q embedding", total=len(query_eval_loader)):
392
+
393
+ query_metas.extend(batch["meta"])
394
+ if opt.device.type == "cuda":
395
+ model_inputs = move_cuda(batch["model_inputs"], opt.device)
396
+
397
+ else:
398
+ model_inputs = batch["model_inputs"]
399
+
400
+ _ , begin_score_distribution, end_score_distribution = model.get_pred_from_raw_query(model_inputs)
401
+
402
+ begin_score_distribution = begin_score_distribution[:,1:]
403
+ end_score_distribution= end_score_distribution[:,1:]
404
+
405
+ # Get VCMR results
406
+ # (_N_q, total_n_videos, L, L)
407
+ # b_i + e_i
408
+ _st_ed_scores = torch.unsqueeze(begin_score_distribution, 3) + torch.unsqueeze(end_score_distribution, 2)
409
+
410
+ _n_q, total_n_videos = _st_ed_scores.size()[:2]
411
+
412
+
413
+ ## mask the invalid location out of moment length constrain
414
+ _valid_prob_mask = np.logical_not(generate_min_max_length_mask(
415
+ _st_ed_scores.shape, min_l=opt.min_pred_l, max_l=opt.max_pred_l).astype(bool))
416
+
417
+ _valid_prob_mask = torch.from_numpy(_valid_prob_mask).to(_st_ed_scores.device)
418
+
419
+ valid_prob_mask = _valid_prob_mask.repeat(_n_q,total_n_videos,1,1)
420
+
421
+ # invalid location will become VERY_NEGATIVE_NUMBER!
422
+ _st_ed_scores[valid_prob_mask] = VERY_NEGATIVE_NUMBER
423
+
424
+ # sort across the total_n_videos videos (by flatten from the 2nd dim)
425
+ # the indices here are local indices, not global indices
426
+ _flat_st_ed_scores = _st_ed_scores.reshape(_n_q, -1) # (N_q, total_n_videos*L*L)
427
+ _flat_st_ed_sorted_scores, _flat_st_ed_scores_sorted_indices = \
428
+ torch.sort(_flat_st_ed_scores, dim=1, descending=True)
429
+
430
+ # collect data
431
+ flat_st_ed_sorted_scores[idx * bsz:(idx + 1) * bsz] = \
432
+ _flat_st_ed_sorted_scores[:, :max_before_nms].detach().cpu().numpy()
433
+ flat_st_ed_scores_sorted_indices[idx * bsz:(idx + 1) * bsz] = \
434
+ _flat_st_ed_scores_sorted_indices[:, :max_before_nms].detach().cpu().numpy()
435
+
436
+
437
+
438
+ vcmr_res = {}
439
+ for i, (_flat_st_ed_scores_sorted_indices, _flat_st_ed_sorted_scores) in tqdm(
440
+ enumerate(zip(flat_st_ed_scores_sorted_indices, flat_st_ed_sorted_scores)),
441
+ desc="[VCMR] Loop over queries to generate predictions", total=n_total_query): # i is query_idx
442
+ # list([video_idx(int), st(float), ed(float), score(float)])
443
+ video_meta_indices_local, pred_st_indices, pred_ed_indices = \
444
+ np.unravel_index(_flat_st_ed_scores_sorted_indices,
445
+ shape=(total_n_videos, opt.max_ctx_len, opt.max_ctx_len))
446
+
447
+ pred_st_in_seconds = pred_st_indices.astype(np.float32) * opt.clip_length
448
+ pred_ed_in_seconds = pred_ed_indices.astype(np.float32) * opt.clip_length + opt.clip_length
449
+ cur_vcmr_redictions = []
450
+ query_specific_video_metas = query_metas[i]["sample_vid_name_list"]
451
+ for j, (v_meta_idx, v_score) in enumerate(zip(video_meta_indices_local, _flat_st_ed_sorted_scores)): # videos
452
+ # video_idx = video2idx[query_specific_video_metas[v_meta_idx]]
453
+ cur_vcmr_redictions.append(
454
+ {
455
+ "video_name": query_specific_video_metas[v_meta_idx],
456
+ "timestamp": [float(pred_st_in_seconds[j]), float(pred_ed_in_seconds[j])],
457
+ "model_scores": float(v_score)
458
+ }
459
+ )
460
+ query_id=query_metas[i]["query_id"]
461
+ vcmr_res[query_id] = cur_vcmr_redictions[:maxtopk]
462
+ return vcmr_res
463
+
464
+ def get_eval_res(model, eval_dataset, opt):
465
+ """compute and save query and video proposal embeddings"""
466
+
467
+ if opt.similarity_measure == "disjoint": #disjoint b_i+ e_i
468
+ eval_res = compute_query2ctx_info_disjoint(model, eval_dataset, opt,
469
+ max_before_nms=opt.max_before_nms,
470
+ max_n_videos=opt.max_vcmr_video)
471
+ elif opt.similarity_measure in ["general" , "exclusive" ] : # r * \hat{b_i} * \hat{e_i}
472
+ eval_res = compute_query2ctx_info(model, eval_dataset, opt,
473
+ max_before_nms=opt.max_before_nms,
474
+ max_n_videos=opt.max_vcmr_video,
475
+ tasks=tasks)
476
+
477
+
478
+ return eval_res
479
+
480
+
481
+ POST_PROCESSING_MMS_FUNC = {
482
+ "SVMR": post_processing_vcmr_nms,
483
+ "VCMR": post_processing_vcmr_nms
484
+ }
485
+
486
+ def get_prediction_top_n(list_dict_predictions, top_n):
487
+ top_n_res = []
488
+ for e in list_dict_predictions:
489
+ e["predictions"] = e["predictions"][:top_n]
490
+ top_n_res.append(e)
491
+ return top_n_res
492
+
493
+
494
+ def eval_epoch(model, eval_dataset, opt, max_after_nms, iou_thds, topks):
495
+
496
+ pred_data = get_eval_res(model, eval_dataset, opt)
497
+ # video2idx = eval_dataset.video2idx
498
+ # pred_data = get_prediction_top_n(eval_res, top_n=max_after_nms)
499
+ # pred_data = get_prediction_top_n(eval_res, top_n=max_after_nms)
500
+ gt_data = eval_dataset.ground_truth
501
+ average_ndcg = calculate_ndcg_iou(gt_data, pred_data, iou_thds, topks)
502
+ return average_ndcg, pred_data
503
+
504
+
505
+
506
+ def setup_model(opt):
507
+ """Load model from checkpoint and move to specified device"""
508
+ checkpoint = torch.load(opt.ckpt_filepath)
509
+ loaded_model_cfg = checkpoint["model_cfg"]
510
+
511
+ model = CONQUER(loaded_model_cfg,
512
+ visual_dim=opt.visual_dim,
513
+ text_dim=opt.text_dim,
514
+ query_dim=opt.query_dim,
515
+ hidden_dim=opt.hidden_dim,
516
+ video_len=opt.max_ctx_len,
517
+ ctx_mode=opt.ctx_mode,
518
+ no_output_moe_weight=opt.no_output_moe_weight,
519
+ similarity_measure=opt.similarity_measure,
520
+ use_debug = opt.debug)
521
+ model.load_state_dict(checkpoint["model"])
522
+
523
+ logger.info("Loaded model saved at epoch {} from checkpoint: {}"
524
+ .format(checkpoint["epoch"], opt.ckpt_filepath))
525
+
526
+ if opt.device.type == "cuda":
527
+ logger.info("CUDA enabled.")
528
+ model.to(opt.device)
529
+ assert len(opt.device_ids) == 1
530
+ # if len(opt.device_ids) > 1:
531
+ # logger.info("Use multi GPU", opt.device_ids)
532
+ # model = torch.nn.DataParallel(model, device_ids=opt.device_ids) # use multi GPU
533
+ return model
534
+
535
+
536
+ def start_inference():
537
+ logger.info("Setup config, data and model...")
538
+ opt = TestOptions().parse()
539
+ cudnn.benchmark = False
540
+ cudnn.deterministic = True
541
+
542
+ data_config = load_config(opt.dataset_config)
543
+
544
+ eval_dataset = StartEndEvalDataset(
545
+ config = data_config,
546
+ max_ctx_len=opt.max_ctx_len,
547
+ max_desc_len= opt.max_desc_len,
548
+ clip_length = opt.clip_length,
549
+ ctx_mode = opt.ctx_mode,
550
+ mode = opt.eval_split_name,
551
+ data_ratio = opt.data_ratio,
552
+ is_eval = True,
553
+ inference_top_k = opt.max_vcmr_video)
554
+
555
+ postfix = "_hero"
556
+ model = setup_model(opt)
557
+ save_submission_filename = "inference_{}_{}_{}_predictions_{}{}.json".format(
558
+ opt.dset_name, opt.eval_split_name, opt.eval_id, "_".join(opt.tasks),postfix)
559
+ print(save_submission_filename)
560
+ logger.info("Starting inference...")
561
+ with torch.no_grad():
562
+ average_ndcg, pred_data = \
563
+ eval_epoch(model, eval_dataset, opt, max_after_nms=100,
564
+ iou_thds=[0.3, 0.5, 0.7], topks=[10, 20, 40])  # IoU / top-K values as reported in the README
565
+ save_json(pred_data, save_submission_filename)
566
+ logger.info("average_ndcg \n{}".format(pprint.pformat(average_ndcg, indent=4)))
567
+
568
+
569
+ if __name__ == '__main__':
570
+ start_inference()
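
In isolation, the two score-combination rules used above look like this (toy tensor shapes, not tied to the real feature dimensions): the `disjoint` measure adds the raw start/end score distributions by broadcasting, while `general`/`exclusive` multiplies the softmax-normalized start/end probabilities and weights them by the video-retrieval score `r`, as the einsum in `compute_query2ctx_info` does.

```python
# Toy shapes only: N_q queries, V candidate videos, L clips per video.
import torch
import torch.nn.functional as F

N_q, V, L = 2, 3, 4
begin = torch.randn(N_q, V, L)   # begin_score_distribution[:, 1:]
end = torch.randn(N_q, V, L)     # end_score_distribution[:, 1:]
r = torch.rand(N_q, V)           # video retrieval scores

# disjoint: b_i + e_j, no normalization (compute_query2ctx_info_disjoint)
disjoint_scores = begin.unsqueeze(3) + end.unsqueeze(2)                   # (N_q, V, L, L)

# general/exclusive: r * softmax(b)_i * softmax(e)_j (compute_query2ctx_info)
st_probs, ed_probs = F.softmax(begin, dim=-1), F.softmax(end, dim=-1)
general_scores = torch.einsum("qvm,qv,qvn->qvmn", st_probs, r, ed_probs)  # (N_q, V, L, L)

print(disjoint_scores.shape, general_scores.shape)
```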
model/__init__.py ADDED
File without changes
model/backbone/__init__.py ADDED
File without changes
model/backbone/encoder.py ADDED
@@ -0,0 +1,235 @@
1
+ """
2
+ Pytorch modules
3
+ some classes are modified from HuggingFace
4
+ (https://github.com/huggingface/transformers)
5
+ """
6
+
7
+ import torch
8
+ import logging
9
+ from torch import nn
10
+ logger = logging.getLogger(__name__)
11
+
12
+ try:
13
+ from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
14
+ except (ImportError, AttributeError) as e:
15
+ BertLayerNorm = torch.nn.LayerNorm
16
+
17
+ from model.transformer.bert import BertEncoder
18
+ from model.layers import (NetVLAD, LinearLayer)
19
+ from model.transformer.bert_embed import (BertEmbeddings)
20
+ from utils.model_utils import mask_logits
21
+ import torch.nn.functional as F
22
+
23
+
24
+
25
+ class TransformerBaseModel(nn.Module):
26
+ """
27
+ Base Transformer model
28
+ """
29
+ def __init__(self, config):
30
+ super(TransformerBaseModel, self).__init__()
31
+ self.embeddings = BertEmbeddings(config)
32
+ self.encoder = BertEncoder(config)
33
+
34
+
35
+ def forward(self,features,position_ids,token_type_ids,attention_mask):
36
+ # embedding layer
37
+ embedding_output = self.embeddings(token_type_ids=token_type_ids,
38
+ inputs_embeds=features,
39
+ position_ids=position_ids)
40
+
41
+ encoder_outputs = self.encoder(embedding_output, attention_mask)
42
+
43
+ sequence_output = encoder_outputs[0]
44
+
45
+ return sequence_output
46
+
47
+ class TwoModalEncoder(nn.Module):
48
+ """
49
+ Two modality Transformer Encoder model
50
+ """
51
+
52
+ def __init__(self, config,img_dim,text_dim,hidden_dim,split_num,output_split=True):
53
+ super(TwoModalEncoder, self).__init__()
54
+ self.img_linear = LinearLayer(
55
+ in_hsz=img_dim, out_hsz=hidden_dim)
56
+ self.text_linear = LinearLayer(
57
+ in_hsz=text_dim, out_hsz=hidden_dim)
58
+
59
+ self.transformer = TransformerBaseModel(config)
60
+ self.output_split = output_split
61
+ if self.output_split:
62
+ self.split_num = split_num
63
+
64
+
65
+ def forward(self, visual_features, visual_position_ids, visual_token_type_ids, visual_attention_mask,
66
+ text_features,text_position_ids,text_token_type_ids,text_attention_mask):
67
+
68
+ transformed_im = self.img_linear(visual_features)
69
+ transformed_text = self.text_linear(text_features)
70
+
71
+ transformer_input_feat = torch.cat((transformed_im,transformed_text),dim=1)
72
+ transformer_input_feat_pos_id = torch.cat((visual_position_ids,text_position_ids),dim=1)
73
+ transformer_input_feat_token_id = torch.cat((visual_token_type_ids,text_token_type_ids),dim=1)
74
+ transformer_input_feat_mask = torch.cat((visual_attention_mask,text_attention_mask),dim=1)
75
+
76
+ output = self.transformer(features=transformer_input_feat,
77
+ position_ids=transformer_input_feat_pos_id,
78
+ token_type_ids=transformer_input_feat_token_id,
79
+ attention_mask=transformer_input_feat_mask)
80
+
81
+ if self.output_split:
82
+ return torch.split(output,self.split_num,dim=1)
83
+ else:
84
+ return output
85
+
86
+
87
+ class OneModalEncoder(nn.Module):
88
+ """
89
+ One modality Transformer Encoder model
90
+ """
91
+
92
+ def __init__(self, config,input_dim,hidden_dim):
93
+ super(OneModalEncoder, self).__init__()
94
+ self.linear = LinearLayer(
95
+ in_hsz=input_dim, out_hsz=hidden_dim)
96
+ self.transformer = TransformerBaseModel(config)
97
+
98
+ def forward(self, features, position_ids, token_type_ids, attention_mask):
99
+
100
+ transformed_features = self.linear(features)
101
+
102
+ output = self.transformer(features=transformed_features,
103
+ position_ids=position_ids,
104
+ token_type_ids=token_type_ids,
105
+ attention_mask=attention_mask)
106
+ return output
107
+
108
+
109
+ class VideoQueryEncoder(nn.Module):
110
+ def __init__(self, config, video_modality,
111
+ visual_dim=4352, text_dim= 768,
112
+ query_dim=768, hidden_dim = 768,split_num=100,):
113
+ super(VideoQueryEncoder, self).__init__()
114
+ self.use_sub = len(video_modality) > 1
115
+ if self.use_sub:
116
+ self.videoEncoder = TwoModalEncoder(config=config.bert_config,
117
+ img_dim = visual_dim,
118
+ text_dim = text_dim ,
119
+ hidden_dim = hidden_dim,
120
+ split_num = split_num
121
+ )
122
+ else:
123
+ self.videoEncoder = OneModalEncoder(config=config.bert_config,
124
+ input_dim = visual_dim,
125
+ hidden_dim = hidden_dim,
126
+ )
127
+
128
+ self.queryEncoder = OneModalEncoder(config=config.query_bert_config,
129
+ input_dim= query_dim,
130
+ hidden_dim=hidden_dim,
131
+ )
132
+
133
+ def forward_repr_query(self, batch):
134
+
135
+ query_output = self.queryEncoder(
136
+ features=batch["query"]["feat"],
137
+ position_ids=batch["query"]["feat_pos_id"],
138
+ token_type_ids=batch["query"]["feat_token_id"],
139
+ attention_mask=batch["query"]["feat_mask"]
140
+ )
141
+
142
+ return query_output
143
+
144
+ def forward_repr_video(self,batch):
145
+ video_output = dict()
146
+
147
+ if len(batch["visual"]["feat"].size()) == 4:
148
+ bsz, num_video = batch["visual"]["feat"].size()[:2]
149
+ for key in batch.keys():
150
+ if key in ["visual", "sub"]:
151
+ for key_2 in batch[key]:
152
+ if key_2 in ["feat", "feat_mask", "feat_pos_id", "feat_token_id"]:
153
+ shape_list = batch[key][key_2].size()[2:]
154
+ batch[key][key_2] = batch[key][key_2].view((bsz * num_video,) + shape_list)
155
+
156
+
157
+ if self.use_sub:
158
+ video_output["visual"], video_output["sub"] = self.videoEncoder(
159
+ visual_features=batch["visual"]["feat"],
160
+ visual_position_ids=batch["visual"]["feat_pos_id"],
161
+ visual_token_type_ids=batch["visual"]["feat_token_id"],
162
+ visual_attention_mask=batch["visual"]["feat_mask"],
163
+ text_features=batch["sub"]["feat"],
164
+ text_position_ids=batch["sub"]["feat_pos_id"],
165
+ text_token_type_ids=batch["sub"]["feat_token_id"],
166
+ text_attention_mask=batch["sub"]["feat_mask"]
167
+ )
168
+ else:
169
+ video_output["visual"] = self.videoEncoder(
170
+ features=batch["visual"]["feat"],
171
+ position_ids=batch["visual"]["feat_pos_id"],
172
+ token_type_ids=batch["visual"]["feat_token_id"],
173
+ attention_mask=batch["visual"]["feat_mask"]
174
+ )
175
+
176
+ return video_output
177
+
178
+
179
+ def forward_repr_both(self, batch):
180
+ video_output = self.forward_repr_video(batch)
181
+ query_output = self.forward_repr_query(batch)
182
+
183
+ return {"video_feat": video_output,
184
+ "query_feat": query_output}
185
+
186
+ def forward(self,batch,task="repr_both"):
187
+
188
+ if task == "repr_both":
189
+ return self.forward_repr_both(batch)
190
+ elif task == "repr_video":
191
+ return self.forward_repr_video(batch)
192
+ elif task == "repr_query":
193
+ return self.forward_repr_query(batch)
194
+
195
+
196
+ class QueryWeightEncoder(nn.Module):
197
+ """
198
+ Query Weight Encoder
199
+ Using NetVLAD to aggregate contextual query features
200
+ Using FC + Softmax to get fusion weights for each modality
201
+ """
202
+ def __init__(self, config, video_modality):
203
+ super(QueryWeightEncoder, self).__init__()
204
+
205
+ ##NetVLAD
206
+ self.text_pooling = NetVLAD(feature_size=config.hidden_size,cluster_size=config.text_cluster)
207
+ self.moe_txt_dropout = nn.Dropout(config.moe_dropout_prob)
208
+
209
+ ##FC
210
+ self.moe_fc_txt = nn.Linear(
211
+ in_features=self.text_pooling.out_dim,
212
+ out_features=len(video_modality),
213
+ bias=False)
214
+
215
+ self.video_modality = video_modality
216
+
217
+ def forward(self, query_feat):
218
+ ##NetVLAD
219
+ pooled_text = self.text_pooling(query_feat)
220
+ pooled_text = self.moe_txt_dropout(pooled_text)
221
+
222
+ ##FC + Softmax
223
+ moe_weights = self.moe_fc_txt(pooled_text)
224
+ softmax_moe_weights = F.softmax(moe_weights, dim=1)
225
+
226
+
227
+ moe_weights_dict = dict()
228
+ for modality, moe_weight in zip(self.video_modality, torch.split(softmax_moe_weights, 1, dim=1)):
229
+ moe_weights_dict[modality] = moe_weight.squeeze(1)
230
+
231
+ return moe_weights_dict
232
+
233
+
234
+
235
+
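For orientation, a shape-only sketch of `QueryWeightEncoder`: it pools the contextual query features with NetVLAD and emits one softmax fusion weight per video modality. The config object below is a hypothetical stand-in exposing only the attributes the class reads, not the shipped `model_config.json`.

```python
import torch
from types import SimpleNamespace
from model.backbone.encoder import QueryWeightEncoder

# hypothetical config stand-in with the fields QueryWeightEncoder reads
netvlad_cfg = SimpleNamespace(hidden_size=768, text_cluster=32, moe_dropout_prob=0.1)
enc = QueryWeightEncoder(netvlad_cfg, video_modality=["visual", "sub"])

query_feat = torch.randn(4, 30, 768)          # (batch, L_q, hidden)
weights = enc(query_feat)                     # dict: modality -> (batch,) fusion weight
assert torch.allclose(weights["visual"] + weights["sub"], torch.ones(4), atol=1e-5)
```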
model/conquer.py ADDED
@@ -0,0 +1,205 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ from model.backbone.encoder import VideoQueryEncoder, QueryWeightEncoder
4
+ from model.qal.query_aware_learning_module import BiDirectionalAttention
5
+ from model.layers import FCPlusTransformer#,MomentLocalizationHead
6
+ from model.head.ml_head import MomentLocalizationHead
7
+ from model.head.vs_head import VideoScoringHead
8
+
9
+ import logging
10
+ logger = logging.getLogger(__name__)
11
+
12
+
13
+ class CONQUER(nn.Module):
14
+ def __init__(self, config,
15
+ visual_dim = 4352,
16
+ text_dim = 768,
17
+ query_dim = 768,
18
+ hidden_dim = 768,
19
+ video_len = 100,
20
+ ctx_mode = "visual_sub",
21
+ lw_st_ed = 0.01,
22
+ lw_video_ce = 0.05,
23
+ similarity_measure="general",
24
+ use_debug=False,
25
+ no_output_moe_weight=False):
26
+
27
+ super(CONQUER, self).__init__()
28
+ self.config = config
29
+
30
+ # related configs
31
+ self.lw_st_ed = lw_st_ed
32
+ self.lw_video_ce = lw_video_ce
33
+ self.similarity_measure = similarity_measure
34
+
35
+ self.video_modality = ctx_mode.split("_")
36
+ logger.info("video modality : %s" % self.video_modality)
37
+ self.output_moe_weight = not no_output_moe_weight
38
+
39
+ hidden_dim = hidden_dim
40
+ base_bert_layer_config = config.bert_config
41
+
42
+ ## Backbone encoder
43
+ self.encoder = VideoQueryEncoder(config,video_modality=self.video_modality,
44
+ visual_dim=visual_dim,text_dim=text_dim,query_dim=query_dim,
45
+ hidden_dim=hidden_dim,split_num=video_len)
46
+
47
+ if self.output_moe_weight and len(self.video_modality) > 1:
48
+ self.query_weight = QueryWeightEncoder(config.netvlad_config,video_modality=self.video_modality)
49
+
50
+ ## Query_aware_feature_learning Module
51
+ self.query_aware_feature_learning_layer = BiDirectionalAttention(hidden_dim)
52
+
53
+ ## Shared transformer for both moment localization and video scoring heads
54
+ self.contextual_QAL_feature_learning = FCPlusTransformer(base_bert_layer_config,hidden_dim * 4)
55
+
56
+ ## Moment_localization_head
57
+ self.moment_localization_head = MomentLocalizationHead(config.moment_localization_config,base_bert_layer_config,hidden_dim)
58
+ self.temporal_criterion = nn.CrossEntropyLoss(reduction="mean")
59
+
60
+ ## Optional video_scoring_head
61
+ if self.similarity_measure == "exclusive":
62
+ self.video_scoring_head = VideoScoringHead(config.video_scoring_config,base_bert_layer_config,hidden_dim)
63
+ self.score_ce = nn.CrossEntropyLoss(reduction="mean")
64
+
65
+ self.debug_model = use_debug
66
+ if self.debug_model:
67
+ logger.setLevel(level=logging.DEBUG)
68
+
69
+ self.reset_parameters()
70
+
71
+ def reset_parameters(self):
72
+ """ Initialize the weights."""
73
+
74
+ def re_init(module):
75
+ if isinstance(module, (nn.Linear, nn.Embedding)):
76
+ # Slightly different from the TF version which uses truncated_normal for initialization
77
+ # cf https://github.com/pytorch/pytorch/pull/5617
78
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
79
+ #print("nn.Linear, nn.Embedding: ", module)
80
+ elif isinstance(module, nn.LayerNorm):
81
+ module.bias.data.zero_()
82
+ module.weight.data.fill_(1.0)
83
+ elif isinstance(module, nn.Conv1d):
84
+ module.reset_parameters()
85
+
86
+ if isinstance(module, nn.Linear) and module.bias is not None:
87
+ module.bias.data.zero_()
88
+
89
+ self.apply(re_init)
90
+
91
+
92
+ def compute_final_score(self,score_dict,moe_weights=None):
93
+
94
+ sample_key = list(score_dict.keys())[0]
95
+ final_query_context_scores = torch.zeros_like(score_dict[sample_key])
96
+ shape_size = len(score_dict[sample_key].shape)
97
+ if moe_weights is not None:
98
+ for mod in self.video_modality:
99
+ if shape_size == 2:
100
+ final_query_context_scores += torch.einsum("nm,n->nm", score_dict[mod], moe_weights[mod])
101
+ elif shape_size == 3:
102
+ final_query_context_scores += torch.einsum("nlm,n->nlm", score_dict[mod], moe_weights[mod])
103
+ else:
104
+ for mod in self.video_modality:
105
+ final_query_context_scores += torch.div(score_dict[mod], len(self.video_modality))
106
+
107
+ return final_query_context_scores
108
+
109
+
110
+ def get_pred_from_raw_query(self, batch):
111
+
112
+ ## Extract query and video feature through MMT backbone
113
+ _query_feature = self.encoder(batch, task="repr_query") #Widehat_Q
114
+
115
+ _video_feature_dict = self.encoder(batch, task="repr_video") #Widehat_V and #Widehat_S
116
+
117
+ ## Shared normalization technique
118
+ ## Use the same query feature for shared_video_num times
119
+ sample_key = list(_video_feature_dict.keys())[0]
120
+ query_batch = _query_feature.size()[0]
121
+ video_batch, video_len = _video_feature_dict[sample_key].size()[:2]
122
+ shared_video_num = int(video_batch / query_batch)
123
+
124
+ query_feature = torch.repeat_interleave(_query_feature, shared_video_num, dim=0)
125
+ query_mask = torch.repeat_interleave(batch["query"]["feat_mask"], shared_video_num, dim=0)
126
+
127
+
128
+ ## Compute Query Dependent Fusion video feature
129
+ if self.output_moe_weight and len(self.video_modality) > 1:
130
+ moe_weights_dict = self.query_weight(query_feature)
131
+ QDF_feature = self.compute_final_score(_video_feature_dict, moe_weights_dict)
132
+ else:
133
+ QDF_feature = self.compute_final_score(_video_feature_dict,None)
134
+
135
+ video_mask = batch["visual"]["feat_mask"]
136
+
137
+
138
+ ## Compute Query Aware Learning video feature
139
+ QAL_feature = self.query_aware_feature_learning_layer(QDF_feature, query_feature,
140
+ video_mask,query_mask)
141
+
142
+ ## Contextualize QAL features
143
+ Contextual_QAL = self.contextual_QAL_feature_learning(
144
+ features=QAL_feature,
145
+ feat_mask=video_mask)
146
+
147
+ G = torch.cat([QAL_feature,Contextual_QAL], dim=2)
148
+
149
+ ## Moment localization head
150
+ begin_score_distribution , end_score_distribution = self.moment_localization_head(G,Contextual_QAL,video_mask)
151
+ begin_score_distribution = begin_score_distribution.view(query_batch, shared_video_num, video_len)
152
+ end_score_distribution = end_score_distribution.view(query_batch, shared_video_num, video_len)
153
+
154
+ ## Optional video scoring head
155
+ video_similarity_score = None
156
+ if self.similarity_measure == "exclusive":
157
+ video_similarity_score = self.video_scoring_head(G,video_mask)
158
+ video_similarity_score = video_similarity_score.view(query_batch, shared_video_num)
159
+
160
+ return video_similarity_score, begin_score_distribution , end_score_distribution
161
+
162
+
163
+ def get_moment_loss_share_norm(self, begin_score_distribution, end_score_distribution ,st_ed_indices):
164
+
165
+ bs , shared_video_num , video_len = begin_score_distribution.size()
166
+
167
+ begin_score_distribution = begin_score_distribution.view(bs,-1)
168
+ end_score_distribution = end_score_distribution.view(bs,-1)
169
+
170
+ loss_st = self.temporal_criterion(begin_score_distribution, st_ed_indices[:, 0])
171
+ loss_ed = self.temporal_criterion(end_score_distribution, st_ed_indices[:, 1])
172
+ moment_ce_loss = loss_st + loss_ed
173
+
174
+ return moment_ce_loss
175
+
176
+
177
+ def forward(self,batch):
178
+
179
+ video_similarity_score, begin_score_distribution , end_score_distribution = \
180
+ self.get_pred_from_raw_query(batch)
181
+
182
+ moment_ce_loss, video_ce_loss = 0, 0
183
+
184
+ # moment cross-entropy loss
185
+        # if neg_video_num == 0, we do not sample negative videos:
+        # the softmax is then performed only over the ground-truth video,
+        # i.e. the shared-normalization training objective is not used
188
+ moment_ce_loss = self.get_moment_loss_share_norm(
189
+ begin_score_distribution, end_score_distribution, batch["st_ed_indices"])
190
+ moment_ce_loss = self.lw_st_ed * moment_ce_loss
191
+
192
+ if self.similarity_measure == "exclusive":
193
+ ce_label = batch["st_ed_indices"].new_zeros(video_similarity_score.size()[0])
194
+ video_ce_loss = self.score_ce(video_similarity_score, ce_label)
195
+ video_ce_loss = self.lw_video_ce*video_ce_loss
196
+
197
+
198
+ loss = moment_ce_loss + video_ce_loss
199
+ return loss, {"moment_ce_loss": float(moment_ce_loss),
200
+ "video_ce_loss": float(video_ce_loss),
201
+ "loss_overall": float(loss)}
202
+
203
+
204
+
205
+
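The fusion performed by `compute_final_score` is a per-sample convex combination of the per-modality features, weighted by the MoE weights; written out with plain tensors (shapes are illustrative):

```python
import torch

# per-modality features for a batch of 4 videos, 100 clips, hidden size 768
scores = {"visual": torch.randn(4, 100, 768), "sub": torch.randn(4, 100, 768)}
moe = torch.softmax(torch.randn(4, 2), dim=1)             # one weight per modality per sample
weights = {"visual": moe[:, 0], "sub": moe[:, 1]}

fused = torch.zeros_like(scores["visual"])
for mod in ("visual", "sub"):
    # same einsum pattern as CONQUER.compute_final_score uses for 3-D inputs
    fused += torch.einsum("nlm,n->nlm", scores[mod], weights[mod])
assert fused.shape == (4, 100, 768)
```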
model/head/__init__.py ADDED
File without changes
model/head/ml_head.py ADDED
@@ -0,0 +1,61 @@
1
+ import torch
2
+ from torch import nn
3
+ import logging
4
+ logger = logging.getLogger(__name__)
5
+
6
+
7
+ from model.layers import FCPlusTransformer, ConvSE
8
+
9
+
10
+ class MomentLocalizationHead(nn.Module):
11
+ """
12
+ Moment localization head model
13
+ """
14
+
15
+ def __init__(self, config,base_bert_layer_config,hidden_dim):
16
+ super(MomentLocalizationHead, self).__init__()
17
+
18
+ base_bert_layer_config = base_bert_layer_config
19
+ hidden_dim = hidden_dim
20
+
21
+ self.begin_feature_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 5)
22
+
23
+ self.end_feature_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 2)
24
+
25
+ self.begin_score_modeling = ConvSE(config)
26
+ self.end_score_modeling = ConvSE(config)
27
+
28
+ def forward(self, G, Contextual_QAL, video_mask):
29
+ """
30
+ Inputs:
31
+ :param contextual_qal_features: (batch, feat_size, L_v)
32
+ :param video_mask: (batch, L_v)
33
+ Return:
34
+ score: (begin or end) score distribution
35
+ """
36
+ ## OUTPUT LAYER
37
+ begin_features = self.begin_feature_modeling(
38
+ features=G,
39
+ feat_mask=video_mask)
40
+
41
+ end_features = self.end_feature_modeling(
42
+ features=torch.cat([Contextual_QAL, begin_features], dim=2),
43
+ feat_mask=video_mask)
44
+
45
+ ## Un-normalized
46
+ begin_input_feature = torch.transpose(begin_features, 1, 2)
47
+ end_input_feature = torch.transpose(end_features, 1, 2)
48
+
49
+ begin_score_distribution = self.begin_score_modeling(
50
+ contextual_qal_features=begin_input_feature,
51
+ video_mask=video_mask,
52
+ )
53
+
54
+ end_score_distribution = self.end_score_modeling(
55
+ contextual_qal_features=end_input_feature,
56
+ video_mask=video_mask,
57
+ )
58
+
59
+ return begin_score_distribution , end_score_distribution
60
+
61
+
model/head/vs_head.py ADDED
@@ -0,0 +1,42 @@
1
+ import torch
2
+ from torch import nn
3
+
4
+ import logging
5
+ logger = logging.getLogger(__name__)
6
+
7
+ from model.layers import FCPlusTransformer
8
+
9
+ class VideoScoringHead(nn.Module):
10
+ """
11
+ Video Scoring Head
12
+ """
13
+
14
+ def __init__(self, config,base_bert_layer_config,hidden_dim):
15
+ super(VideoScoringHead, self).__init__()
16
+
17
+ base_bert_layer_config = base_bert_layer_config
18
+ hidden_dim = hidden_dim
19
+
20
+
21
+ self.video_feature_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 5)
22
+
23
+ self.video_score_predictor = nn.Sequential(
24
+ nn.Linear(**config.linear_1_cfg),
25
+ nn.ReLU(),
26
+ nn.Linear(**config.linear_2_cfg)
27
+ )
28
+
29
+
30
+ def forward(self, G, video_mask):
31
+
32
+
33
+ ## Contextual_QAL_feature for video scoring
34
+ R = self.video_feature_modeling(
35
+ features=G,
36
+ feat_mask=video_mask)
37
+
38
+ holistic_video_feature, _ = torch.max(R, dim=1)
39
+
40
+ video_similarity_score = self.video_score_predictor(holistic_video_feature.squeeze(1)) # r
41
+
42
+ return video_similarity_score
model/layers.py ADDED
@@ -0,0 +1,196 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ import math
5
+ import logging
6
+
7
+ logger = logging.getLogger(__name__)
8
+ try:
9
+ from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
10
+ except (ImportError, AttributeError) as e:
11
+ BertLayerNorm = torch.nn.LayerNorm
12
+
13
+ from model.transformer.bert import BertEncoder
14
+ from model.modeling_utils import mask_logits
15
+
16
+ class LinearLayer(nn.Module):
17
+ """linear layer configurable with layer normalization, dropout, ReLU."""
18
+ def __init__(self, in_hsz, out_hsz, layer_norm=True, dropout=0.1, relu=True,tanh=False):
19
+ super(LinearLayer, self).__init__()
20
+ self.relu = relu
21
+ self.tanh = tanh
22
+ self.layer_norm = layer_norm
23
+ if layer_norm:
24
+ self.LayerNorm = BertLayerNorm(in_hsz)
25
+ layers = [
26
+ nn.Dropout(dropout),
27
+ nn.Linear(in_hsz, out_hsz)
28
+ ]
29
+ self.net = nn.Sequential(*layers)
30
+
31
+ def forward(self, x):
32
+ """(N, L, D)"""
33
+ if self.layer_norm:
34
+ x = self.LayerNorm(x)
35
+ x = self.net(x)
36
+ if self.relu:
37
+ x = F.relu(x, inplace=True)
38
+ if self.tanh:
39
+ x = torch.tanh(x)
40
+ return x # (N, L, D)
41
+
42
+
43
+ class NetVLAD(nn.Module):
44
+ def __init__(self, cluster_size, feature_size, add_norm=True):
45
+ super(NetVLAD, self).__init__()
46
+ self.feature_size = feature_size
47
+ self.cluster_size = cluster_size
48
+ self.clusters = nn.Parameter((1 / math.sqrt(feature_size))
49
+ * torch.randn(feature_size, cluster_size))
50
+ self.clusters2 = nn.Parameter((1 / math.sqrt(feature_size))
51
+ * torch.randn(1, feature_size, cluster_size))
52
+
53
+ self.add_norm = add_norm
54
+ self.LayerNorm = BertLayerNorm(cluster_size)
55
+ self.out_dim = cluster_size * feature_size
56
+
57
+ def forward(self, x):
58
+ max_sample = x.size()[1]
59
+ x = x.view(-1, self.feature_size)
60
+ assignment = torch.matmul(x, self.clusters)
61
+
62
+ if self.add_norm:
63
+ assignment = self.LayerNorm(assignment)
64
+
65
+ assignment = F.softmax(assignment, dim=1)
66
+ assignment = assignment.view(-1, max_sample, self.cluster_size)
67
+
68
+ a_sum = torch.sum(assignment, -2, keepdim=True)
69
+ a = a_sum * self.clusters2
70
+
71
+ assignment = assignment.transpose(1, 2)
72
+
73
+ x = x.view(-1, max_sample, self.feature_size)
74
+ vlad = torch.matmul(assignment, x)
75
+ vlad = vlad.transpose(1, 2)
76
+ vlad = vlad - a
77
+
78
+ # L2 intra norm
79
+ vlad = F.normalize(vlad)
80
+
81
+ # flattening + L2 norm
82
+ vlad = vlad.reshape(-1, self.cluster_size * self.feature_size)
83
+ vlad = F.normalize(vlad)
84
+
85
+ return vlad
86
+
87
+
88
+ class FCPlusTransformer(nn.Module):
89
+ """
90
+ FC + Transformer
91
+ FC layer reduces input feature size into hidden size
92
+ Transformer contextualizes QAL feature
93
+ """
94
+
95
+ def __init__(self, config,input_dim):
96
+ super(FCPlusTransformer, self).__init__()
97
+ self.trans_linear = LinearLayer(
98
+ in_hsz=input_dim, out_hsz=config.hidden_size)
99
+ self.encoder = BertEncoder(config)
100
+
101
+ def forward(self,features, feat_mask):
102
+ """
103
+ Inputs:
104
+ :param features: (batch, L_v, input_dim)
105
+ :param feat_mask: (batch, L_v)
106
+ Return:
107
+ sequence_output: (batch, L_v, hidden_size)
108
+ """
109
+ transformed_features = self.trans_linear(features)
110
+
111
+ encoder_outputs = self.encoder(transformed_features, feat_mask)
112
+
113
+ sequence_output = encoder_outputs[0]
114
+
115
+ return sequence_output
116
+
117
+
118
+ class ConvSE(nn.Module):
119
+ """
120
+ ConvSE module
121
+ """
122
+ def __init__(self, config):
123
+ super(ConvSE, self).__init__()
124
+
125
+ self.clip_score_predictor = nn.Sequential(
126
+ nn.Conv1d(**config.conv_cfg_1),
127
+ nn.ReLU(),
128
+ nn.Conv1d(**config.conv_cfg_2),
129
+ )
130
+
131
+
132
+ def forward(self, contextual_qal_features, video_mask):
133
+ """
134
+ Inputs:
135
+ :param contextual_qal_features: (batch, feat_size, L_v)
136
+ :param video_mask: (batch, L_v)
137
+ Return:
138
+ score: (begin or end) score distribution
139
+ """
140
+ score = self.clip_score_predictor(contextual_qal_features).squeeze(1) #(batch, L_v)
141
+
142
+ score = mask_logits(score, video_mask) #(batch, L_v)
143
+
144
+ return score
145
+
146
+
147
+ class MomentLocalizationHead(nn.Module):
148
+ """
149
+ Moment localization head model
150
+ """
151
+
152
+ def __init__(self, config,base_bert_layer_config,hidden_dim):
153
+ super(MomentLocalizationHead, self).__init__()
154
+
155
+ base_bert_layer_config = base_bert_layer_config
156
+ hidden_dim = hidden_dim
157
+
158
+ self.start_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 5)
159
+
160
+ self.end_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 2)
161
+
162
+ self.start_reader = ConvSE(config)
163
+ self.end_reader = ConvSE(config)
164
+
165
+ def forward(self, G, Contextual_QAL, video_mask):
166
+ """
167
+ Inputs:
168
+ :param contextual_qal_features: (batch, feat_size, L_v)
169
+ :param video_mask: (batch, L_v)
170
+ Return:
171
+ score: (begin or end) score distribution
172
+ """
173
+ ## OUTPUT LAYER
174
+ start_features = self.start_modeling(
175
+ features=G,
176
+ feat_mask=video_mask)
177
+
178
+ end_features = self.end_modeling(
179
+ features=torch.cat([Contextual_QAL, start_features], dim=2),
180
+ feat_mask=video_mask)
181
+
182
+ ## Un-normalized
183
+ start_reader_input_feature = torch.transpose(start_features, 1, 2)
184
+ end_reader_input_feature = torch.transpose(end_features, 1, 2)
185
+
186
+ reader_st_prob = self.start_reader(
187
+ contextual_qal_features=start_reader_input_feature,
188
+ video_mask=video_mask,
189
+ )
190
+
191
+ reader_ed_prob = self.end_reader(
192
+ contextual_qal_features=end_reader_input_feature,
193
+ video_mask=video_mask,
194
+ )
195
+
196
+ return reader_st_prob,reader_ed_prob
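A quick shape check for `NetVLAD` as defined above (sizes are illustrative):

```python
import torch
from model.layers import NetVLAD

vlad = NetVLAD(cluster_size=32, feature_size=768)
x = torch.randn(4, 30, 768)           # (batch, num_tokens, feature_size)
out = vlad(x)                         # flattened, L2-normalized VLAD descriptor
assert out.shape == (4, 32 * 768)     # (batch, cluster_size * feature_size)
```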
model/modeling_utils.py ADDED
@@ -0,0 +1,135 @@
1
+ """
2
+ Copyright (c) Microsoft Corporation.
3
+ Licensed under the MIT license.
4
+
5
+ some functions are modified from HuggingFace
6
+ (https://github.com/huggingface/transformers)
7
+ """
8
+ import torch
9
+ from torch import nn
10
+ import logging
11
+ logger = logging.getLogger(__name__)
12
+
13
+
14
+ def prune_linear_layer(layer, index, dim=0):
15
+    """ Prune a linear layer (a model parameter)
16
+ to keep only entries in index.
17
+ Return the pruned layer as a new layer with requires_grad=True.
18
+ Used to remove heads.
19
+ """
20
+ index = index.to(layer.weight.device)
21
+ W = layer.weight.index_select(dim, index).clone().detach()
22
+ if layer.bias is not None:
23
+ if dim == 1:
24
+ b = layer.bias.clone().detach()
25
+ else:
26
+ b = layer.bias[index].clone().detach()
27
+ new_size = list(layer.weight.size())
28
+ new_size[dim] = len(index)
29
+ new_layer = nn.Linear(
30
+ new_size[1], new_size[0], bias=layer.bias is not None).to(
31
+ layer.weight.device)
32
+ new_layer.weight.requires_grad = False
33
+ new_layer.weight.copy_(W.contiguous())
34
+ new_layer.weight.requires_grad = True
35
+ if layer.bias is not None:
36
+ new_layer.bias.requires_grad = False
37
+ new_layer.bias.copy_(b.contiguous())
38
+ new_layer.bias.requires_grad = True
39
+ return new_layer
40
+
41
+
42
+ def mask_logits(target, mask, eps=-1e4):
43
+ return target * mask + (1 - mask) * eps
44
+
45
+
46
+ def load_partial_checkpoint(checkpoint, n_layers, skip_layers=True):
47
+ if skip_layers:
48
+ new_checkpoint = {}
49
+ gap = int(12/n_layers)
50
+ prefix = "roberta.encoder.layer."
51
+ layer_range = {str(l): str(i) for i, l in enumerate(
52
+ list(range(gap-1, 12, gap)))}
53
+ for k, v in checkpoint.items():
54
+ if prefix in k:
55
+ layer_name = k.split(".")
56
+ layer_num = layer_name[3]
57
+ if layer_num in layer_range:
58
+ layer_name[3] = layer_range[layer_num]
59
+ new_layer_name = ".".join(layer_name)
60
+ new_checkpoint[new_layer_name] = v
61
+ else:
62
+ new_checkpoint[k] = v
63
+ else:
64
+ new_checkpoint = checkpoint
65
+ return new_checkpoint
66
+
67
+
68
+ def load_pretrained_weight(model, state_dict):
69
+ # Load from a PyTorch state_dict
70
+ old_keys = []
71
+ new_keys = []
72
+ for key in state_dict.keys():
73
+ new_key = None
74
+ if 'gamma' in key:
75
+ new_key = key.replace('gamma', 'weight')
76
+ if 'beta' in key:
77
+ new_key = key.replace('beta', 'bias')
78
+ if new_key:
79
+ old_keys.append(key)
80
+ new_keys.append(new_key)
81
+ for old_key, new_key in zip(old_keys, new_keys):
82
+ state_dict[new_key] = state_dict.pop(old_key)
83
+
84
+ missing_keys = []
85
+ unexpected_keys = []
86
+ error_msgs = []
87
+ # copy state_dict so _load_from_state_dict can modify it
88
+ metadata = getattr(state_dict, '_metadata', None)
89
+ state_dict = state_dict.copy()
90
+ if metadata is not None:
91
+ state_dict._metadata = metadata
92
+
93
+ def load(module, prefix=''):
94
+ local_metadata = ({} if metadata is None
95
+ else metadata.get(prefix[:-1], {}))
96
+ module._load_from_state_dict(
97
+ state_dict, prefix, local_metadata, True, missing_keys,
98
+ unexpected_keys, error_msgs)
99
+ for name, child in module._modules.items():
100
+ if child is not None:
101
+ load(child, prefix + name + '.')
102
+ start_prefix = ''
103
+ if not hasattr(model, 'roberta') and\
104
+ any(s.startswith('roberta.') for s in state_dict.keys()):
105
+ start_prefix = 'roberta.'
106
+
107
+ load(model, prefix=start_prefix)
108
+ if len(missing_keys) > 0:
109
+ logger.info("Weights of {} not initialized from "
110
+ "pretrained model: {}".format(
111
+ model.__class__.__name__, missing_keys))
112
+ if len(unexpected_keys) > 0:
113
+ logger.info("Weights from pretrained model not used in "
114
+ "{}: {}".format(
115
+ model.__class__.__name__, unexpected_keys))
116
+ if len(error_msgs) > 0:
117
+ raise RuntimeError('Error(s) in loading state_dict for '
118
+ '{}:\n\t{}'.format(
119
+ model.__class__.__name__,
120
+ "\n\t".join(error_msgs)))
121
+ return model
122
+
123
+
124
+ def pad_tensor_to_mul(tensor, dim=0, mul=8):
125
+ """ pad tensor to multiples (8 for tensor cores) """
126
+ t_size = list(tensor.size())
127
+ n_pad = mul - t_size[dim] % mul
128
+ if n_pad == mul:
129
+ n_pad = 0
130
+ padded_tensor = tensor
131
+ else:
132
+ t_size[dim] = n_pad
133
+ pad = torch.zeros(*t_size, dtype=tensor.dtype, device=tensor.device)
134
+ padded_tensor = torch.cat([tensor, pad], dim=dim)
135
+ return padded_tensor, n_pad
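Two of the helpers above are easy to misread, so a tiny standalone illustration with made-up tensors:

```python
import torch
from model.modeling_utils import mask_logits, pad_tensor_to_mul

scores = torch.tensor([[1.0, 2.0, 3.0]])
mask = torch.tensor([[1.0, 1.0, 0.0]])        # 1 = keep, 0 = padding
print(mask_logits(scores, mask))              # padded position is pushed to ~-1e4 before any softmax

feats = torch.randn(5, 768)
padded, n_pad = pad_tensor_to_mul(feats, dim=0, mul=8)
print(padded.shape, n_pad)                    # torch.Size([8, 768]) 3
```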
model/qal/__init__.py ADDED
File without changes
model/qal/query_aware_learning_module.py ADDED
@@ -0,0 +1,92 @@
1
+ import torch
2
+ from torch import nn
3
+
4
+ import logging
5
+ logger = logging.getLogger(__name__)
6
+
7
+ try:
8
+ from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
9
+ except (ImportError, AttributeError) as e:
10
+ BertLayerNorm = torch.nn.LayerNorm
11
+
12
+ from utils.model_utils import mask_logits
13
+ import torch.nn.functional as F
14
+
15
+
16
+ class BiDirectionalAttention(nn.Module):
17
+ """
18
+ Bi-directional attention flow
19
+ Perform query-to-video attention (Q2V) and video-to-query attention (V2Q)
20
+ Append QDF features with a set of query-aware features to form QAL feature
21
+ """
22
+
23
+ def __init__(self, video_dim):
24
+ super(BiDirectionalAttention, self).__init__()
25
+ ## Core Attention for query-aware feature learning
26
+ self.similarity_weight = nn.Linear(video_dim * 3, 1, bias=False)
27
+
28
+
29
+ def forward(self, QDF_emb, query_emb,video_mask, query_mask):
30
+ """
31
+ Inputs:
32
+ :param QDF_emb: (batch, L_v, feat_size)
33
+ :param query_emb: (batch, L_q, feat_size)
34
+ :param video_mask: (batch, L_v)
35
+ :param query_mask: (batch, L_q)
36
+ Return:
37
+ QAL: (batch, L_v, feat_size*4)
38
+ """
39
+
40
+ ## CREATE SIMILARITY MATRIX
41
+ video_len = QDF_emb.size()[1]
42
+ query_len = query_emb.size()[1]
43
+
44
+ _QDF_emb = QDF_emb.unsqueeze(2).repeat(1, 1, query_len, 1)
45
+ # [bs, video_len, 1, feat_size] => [bs, video_len, query_len, feat_size]
46
+
47
+ _query_emb = query_emb.unsqueeze(1).repeat(1, video_len, 1, 1)
48
+ # [bs, 1, query_len, feat_size] => [bs, video_len, query_len, feat_size]
49
+
50
+ elementwise_prod = torch.mul(_QDF_emb, _query_emb)
51
+ # [bs, video_len, query_len, feat_size]
52
+
53
+ alpha = torch.cat([_QDF_emb, _query_emb, elementwise_prod], dim=3)
54
+ # [bs, video_len, query_len, feat_size*3]
55
+
56
+ similarity_matrix = self.similarity_weight(alpha).view(-1, video_len, query_len)
57
+
58
+ similarity_matrix_mask = torch.einsum("bn,bm->bnm", video_mask, query_mask)
59
+ # [bs, video_len, query_len]
60
+
61
+ ## CALCULATE Video2Query ATTENTION
62
+
63
+ a = F.softmax(mask_logits(similarity_matrix,
64
+ similarity_matrix_mask), dim=-1)
65
+ # [bs, video_len, query_len]
66
+
67
+ V2Q = torch.bmm(a, query_emb)
68
+ # [bs] ([video_len, query_len] X [query_len, feat_size]) => [bs, video_len, feat_size]
69
+
70
+ ## CALCULATE Query2Video ATTENTION
71
+
72
+ b = F.softmax(torch.max(mask_logits(similarity_matrix, similarity_matrix_mask), 2)[0], dim=-1)
73
+ # [bs, video_len]
74
+
75
+ b = b.unsqueeze(1)
76
+ # [bs, 1, video_len]
77
+
78
+ Q2V = torch.bmm(b, QDF_emb)
79
+ # [bs] ([bs, 1, video_len] X [bs, video_len, feat_size]) => [bs, 1, feat_size]
80
+
81
+ Q2V = Q2V.repeat(1, video_len, 1)
82
+ # [bs, video_len, feat_size]
83
+
84
+ ## Append QDF_emb with three query-aware features
85
+
86
+ QAL = torch.cat([QDF_emb, V2Q,
87
+ torch.mul(QDF_emb, V2Q),
88
+ torch.mul(QDF_emb, Q2V)], dim=2)
89
+
90
+ # [bs, video_len, feat_size*4]
91
+
92
+ return QAL
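A shape-only sketch of `BiDirectionalAttention` with random tensors (the hidden size is an assumption):

```python
import torch
from model.qal.query_aware_learning_module import BiDirectionalAttention

attn = BiDirectionalAttention(video_dim=768)
qdf = torch.randn(2, 100, 768)                 # fused video (QDF) features: (batch, L_v, feat)
query = torch.randn(2, 30, 768)                # query features: (batch, L_q, feat)
video_mask = torch.ones(2, 100)
query_mask = torch.ones(2, 30)

qal = attn(qdf, query, video_mask, query_mask)
assert qal.shape == (2, 100, 768 * 4)          # [QDF; V2Q; QDF*V2Q; QDF*Q2V]
```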
model/transformer/__init__.py ADDED
File without changes
model/transformer/bert.py ADDED
@@ -0,0 +1,275 @@
1
+ """
2
+ BERT/RoBERTa layers from the huggingface implementation
3
+ (https://github.com/huggingface/transformers)
4
+ """
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+ import torch.nn.functional as F
9
+ from model.modeling_utils import prune_linear_layer
10
+ import math
11
+ import logging
12
+ logger = logging.getLogger(__name__)
13
+ try:
14
+ from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
15
+ except (ImportError, AttributeError) as e:
16
+ BertLayerNorm = torch.nn.LayerNorm
17
+
18
+
19
+ def gelu(x):
20
+ """ Original Implementation of the gelu activation function
21
+ in Google Bert repo when initially created.
22
+ For information: OpenAI GPT's gelu is slightly different
23
+ (and gives slightly different results):
24
+ 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi)
25
+ * (x + 0.044715 * torch.pow(x, 3))))
26
+ Also see https://arxiv.org/abs/1606.08415
27
+ """
28
+ return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
29
+
30
+
31
+ def gelu_new(x):
32
+ """ Implementation of the gelu activation function currently
33
+ in Google Bert repo (identical to OpenAI GPT).
34
+ Also see https://arxiv.org/abs/1606.08415
35
+ """
36
+ return 0.5 * x * (
37
+ 1 + torch.tanh(
38
+ math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
39
+
40
+ def swish(x):
41
+ return x * torch.sigmoid(x)
42
+
43
+
44
+ ACT2FN = {
45
+ "gelu": gelu,
46
+ "relu": torch.nn.functional.relu,
47
+ "swish": swish, "gelu_new": gelu_new}
48
+
49
+ class BertSelfAttention(nn.Module):
50
+ def __init__(self, config):
51
+ super(BertSelfAttention, self).__init__()
52
+ if config.hidden_size % config.num_attention_heads != 0:
53
+ raise ValueError(
54
+ "The hidden size (%d) is not a multiple of "
55
+ "the number of attention heads (%d)" % (
56
+ config.hidden_size, config.num_attention_heads))
57
+ self.output_attentions = config.output_attentions
58
+
59
+ self.num_attention_heads = config.num_attention_heads
60
+ self.attention_head_size = int(
61
+ config.hidden_size / config.num_attention_heads)
62
+ self.all_head_size = self.num_attention_heads *\
63
+ self.attention_head_size
64
+
65
+ self.query = nn.Linear(config.hidden_size, self.all_head_size)
66
+ self.key = nn.Linear(config.hidden_size, self.all_head_size)
67
+ self.value = nn.Linear(config.hidden_size, self.all_head_size)
68
+
69
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
70
+
71
+ def transpose_for_scores(self, x):
72
+ new_x_shape = x.size()[:-1] + (
73
+ self.num_attention_heads, self.attention_head_size)
74
+ x = x.view(*new_x_shape)
75
+ return x.permute(0, 2, 1, 3)
76
+
77
+ def forward(self, hidden_states, attention_mask=None, head_mask=None):
78
+ mixed_query_layer = self.query(hidden_states)
79
+ mixed_key_layer = self.key(hidden_states)
80
+ mixed_value_layer = self.value(hidden_states)
81
+
82
+ query_layer = self.transpose_for_scores(mixed_query_layer)
83
+ key_layer = self.transpose_for_scores(mixed_key_layer)
84
+ value_layer = self.transpose_for_scores(mixed_value_layer)
85
+
86
+ # Take the dot product between "query"
87
+ # and "key" to get the raw attention scores.
88
+ attention_scores = torch.matmul(
89
+ query_layer, key_layer.transpose(-1, -2))
90
+ attention_scores = attention_scores / math.sqrt(
91
+ self.attention_head_size)
92
+ if attention_mask is not None:
93
+ # Apply the attention mask is
94
+ # (precomputed for all layers in BertModel forward() function)
95
+ attention_scores = attention_scores + attention_mask
96
+
97
+ # Normalize the attention scores to probabilities.
98
+ attention_probs = nn.Softmax(dim=-1)(attention_scores)
99
+
100
+ # This is actually dropping out entire tokens to attend to, which might
101
+ # seem a bit unusual, but is taken from the original Transformer paper.
102
+ attention_probs = self.dropout(attention_probs)
103
+
104
+ # Mask heads if we want to
105
+ if head_mask is not None:
106
+ attention_probs = attention_probs * head_mask
107
+
108
+ context_layer = torch.matmul(attention_probs, value_layer)
109
+
110
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
111
+ new_context_layer_shape = context_layer.size()[:-2] + (
112
+ self.all_head_size,)
113
+ context_layer = context_layer.view(*new_context_layer_shape)
114
+
115
+ outputs = (context_layer, attention_probs)\
116
+ if self.output_attentions else (context_layer,)
117
+ return outputs
118
+
119
+
120
+ class BertSelfOutput(nn.Module):
121
+ def __init__(self, config):
122
+ super(BertSelfOutput, self).__init__()
123
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
124
+ self.LayerNorm = BertLayerNorm(
125
+ config.hidden_size, eps=config.layer_norm_eps)
126
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
127
+
128
+ def forward(self, hidden_states, input_tensor):
129
+ hidden_states = self.dense(hidden_states)
130
+ hidden_states = self.dropout(hidden_states)
131
+ hidden_states = self.LayerNorm(hidden_states + input_tensor)
132
+ return hidden_states
133
+
134
+
135
+ class BertAttention(nn.Module):
136
+ def __init__(self, config):
137
+ super(BertAttention, self).__init__()
138
+ self.self = BertSelfAttention(config)
139
+ self.output = BertSelfOutput(config)
140
+ self.pruned_heads = set()
141
+
142
+ def prune_heads(self, heads):
143
+ if len(heads) == 0:
144
+ return
145
+ mask = torch.ones(
146
+ self.self.num_attention_heads, self.self.attention_head_size)
147
+ # Convert to set and remove already pruned heads
148
+ heads = set(heads) - self.pruned_heads
149
+ for head in heads:
150
+ # Compute how many pruned heads are
151
+ # before the head and move the index accordingly
152
+ head = head - sum(1 if h < head else 0 for h in self.pruned_heads)
153
+ mask[head] = 0
154
+ mask = mask.view(-1).contiguous().eq(1)
155
+ index = torch.arange(len(mask))[mask].long()
156
+
157
+ # Prune linear layers
158
+ self.self.query = prune_linear_layer(self.self.query, index)
159
+ self.self.key = prune_linear_layer(self.self.key, index)
160
+ self.self.value = prune_linear_layer(self.self.value, index)
161
+ self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
162
+
163
+ # Update hyper params and store pruned heads
164
+ self.self.num_attention_heads = self.self.num_attention_heads - len(
165
+ heads)
166
+ self.self.all_head_size =\
167
+ self.self.attention_head_size * self.self.num_attention_heads
168
+ self.pruned_heads = self.pruned_heads.union(heads)
169
+
170
+ def forward(self, input_tensor, attention_mask=None, head_mask=None):
171
+ self_outputs = self.self(input_tensor, attention_mask, head_mask)
172
+ attention_output = self.output(self_outputs[0], input_tensor)
173
+ # add attentions if we output them
174
+ outputs = (attention_output,) + self_outputs[1:]
175
+ return outputs
176
+
177
+
178
+ class BertIntermediate(nn.Module):
179
+ def __init__(self, config):
180
+ super(BertIntermediate, self).__init__()
181
+ self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
182
+ if isinstance(config.hidden_act, str):
183
+ self.intermediate_act_fn = ACT2FN[config.hidden_act]
184
+ else:
185
+ self.intermediate_act_fn = config.hidden_act
186
+
187
+ def forward(self, hidden_states):
188
+ hidden_states = self.dense(hidden_states)
189
+ hidden_states = self.intermediate_act_fn(hidden_states)
190
+ return hidden_states
191
+
192
+
193
+ class BertOutput(nn.Module):
194
+ def __init__(self, config):
195
+ super(BertOutput, self).__init__()
196
+ self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
197
+ self.LayerNorm = BertLayerNorm(
198
+ config.hidden_size, eps=config.layer_norm_eps)
199
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
200
+
201
+ def forward(self, hidden_states, input_tensor):
202
+ hidden_states = self.dense(hidden_states)
203
+ hidden_states = self.dropout(hidden_states)
204
+ hidden_states = self.LayerNorm(hidden_states + input_tensor)
205
+ return hidden_states
206
+
207
+
208
+ class BertLayer(nn.Module):
209
+ def __init__(self, config):
210
+ super(BertLayer, self).__init__()
211
+ self.attention = BertAttention(config)
212
+ self.intermediate = BertIntermediate(config)
213
+ self.output = BertOutput(config)
214
+
215
+ def forward(self, hidden_states, attention_mask=None, head_mask=None):
216
+ attention_outputs = self.attention(
217
+ hidden_states, attention_mask, head_mask)
218
+ attention_output = attention_outputs[0]
219
+ intermediate_output = self.intermediate(attention_output)
220
+ layer_output = self.output(intermediate_output, attention_output)
221
+ # add attentions if we output them
222
+ outputs = (layer_output,) + attention_outputs[1:]
223
+ return outputs
224
+
225
+
226
+ class BertEncoder(nn.Module):
227
+ def __init__(self, config):
228
+ super(BertEncoder, self).__init__()
229
+ self.output_attentions = config.output_attentions
230
+ self.output_hidden_states = config.output_hidden_states
231
+ self.layer = nn.ModuleList([BertLayer(config) for _ in range(
232
+ config.num_hidden_layers)])
233
+
234
+ def forward(self, hidden_states, attention_mask=None, head_mask=None):
235
+
236
+ # We create a 3D attention mask from a 2D tensor mask.
237
+ # Sizes are [batch_size, 1, 1, to_seq_length]
238
+ # So we can broadcast to
239
+ # [batch_size, num_heads, from_seq_length, to_seq_length]
240
+ extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
241
+
242
+ # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
243
+ # masked positions, this operation will create a tensor which is 0.0 for
244
+ # positions we want to attend and -10000.0 for masked positions.
245
+ # Since we are adding it to the raw scores before the softmax, this is
246
+ # effectively the same as removing these entirely.
247
+ extended_attention_mask = extended_attention_mask.to(
248
+ dtype=next(self.parameters()).dtype) # fp16 compatibility
249
+ extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
250
+
251
+
252
+ all_hidden_states = ()
253
+ all_attentions = ()
254
+ for i, layer_module in enumerate(self.layer):
255
+ if self.output_hidden_states:
256
+ all_hidden_states = all_hidden_states + (hidden_states,)
257
+
258
+ layer_outputs = layer_module(
259
+ hidden_states, extended_attention_mask, None)
260
+ hidden_states = layer_outputs[0]
261
+
262
+ if self.output_attentions:
263
+ all_attentions = all_attentions + (layer_outputs[1],)
264
+
265
+ # Add last layer
266
+ if self.output_hidden_states:
267
+ all_hidden_states = all_hidden_states + (hidden_states,)
268
+
269
+ outputs = (hidden_states,)
270
+ if self.output_hidden_states:
271
+ outputs = outputs + (all_hidden_states,)
272
+ if self.output_attentions:
273
+ outputs = outputs + (all_attentions,)
274
+ # last-layer hidden state, (all hidden states), (all attentions)
275
+ return outputs
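`BertEncoder` takes a plain {0, 1} 2-D attention mask and extends it internally; a minimal sketch with a hypothetical small config carrying only the attributes these layers read:

```python
import torch
from types import SimpleNamespace
from model.transformer.bert import BertEncoder

cfg = SimpleNamespace(
    hidden_size=768, num_hidden_layers=2, num_attention_heads=12,
    intermediate_size=3072, hidden_act="gelu",
    hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1,
    layer_norm_eps=1e-12, output_attentions=False, output_hidden_states=False)

enc = BertEncoder(cfg)
hidden = torch.randn(2, 100, 768)
mask = torch.ones(2, 100)          # 1 = attend, 0 = masked (mapped to -10000 internally)
out = enc(hidden, mask)[0]         # last-layer hidden states
assert out.shape == (2, 100, 768)
```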
model/transformer/bert_embed.py ADDED
@@ -0,0 +1,64 @@
1
+ """
2
+ Input Embedding Layers
3
+ """
4
+ import torch
5
+ import torch.nn as nn
6
+ import logging
7
+
8
+
9
+ logger = logging.getLogger(__name__)
10
+ try:
11
+ from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
12
+ except (ImportError, AttributeError) as e:
13
+ logger.info(
14
+ "Better speed can be achieved with apex installed from "
15
+ "https://www.github.com/nvidia/apex ."
16
+ )
17
+ BertLayerNorm = torch.nn.LayerNorm
18
+
19
+
20
+ class BertEmbeddings(nn.Module):
21
+ """Construct the embeddings from word, position and token_type embeddings."""
22
+
23
+ def __init__(self, config):
24
+ super(BertEmbeddings, self).__init__()
25
+ #self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
26
+ self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
27
+ self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
28
+
29
+ # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
30
+ # any TensorFlow checkpoint file
31
+ self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
32
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
33
+
34
+ # position_ids (1, len position emb) is contiguous in memory and exported when serialized
35
+ # self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
36
+ # self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
37
+
38
+ def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):
39
+ if input_ids is not None:
40
+ input_shape = input_ids.size()
41
+ else:
42
+ input_shape = inputs_embeds.size()[:-1]
43
+
44
+ seq_length = input_shape[1]
45
+
46
+        if position_ids is None:
+            # no position_ids buffer is registered in this stripped-down module,
+            # so build absolute position ids on the fly
+            device = input_ids.device if input_ids is not None else inputs_embeds.device
+            position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0).expand(input_shape)
+
+        if token_type_ids is None:
+            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=position_ids.device)
+
+        if inputs_embeds is None:
+            # word_embeddings is disabled here: this embedding module is only used
+            # with precomputed feature inputs (inputs_embeds)
+            raise ValueError("BertEmbeddings expects precomputed `inputs_embeds`; raw input_ids are not supported.")
54
+ token_type_embeddings = self.token_type_embeddings(token_type_ids)
55
+
56
+ position_embeddings = self.position_embeddings(position_ids)
57
+
58
+ embeddings = inputs_embeds + token_type_embeddings + position_embeddings
59
+
60
+ embeddings = self.LayerNorm(embeddings)
61
+ embeddings = self.dropout(embeddings)
62
+ return embeddings
63
+
64
+
ndcg_iou_topk.py ADDED
@@ -0,0 +1,66 @@
1
+ from utils.basic_utils import load_jsonl, save_jsonl, load_json
2
+ import pandas as pd
3
+ from tqdm import tqdm
4
+ import numpy as np
5
+ from collections import defaultdict
6
+ import copy
7
+
8
+ def calculate_iou(pred_start: float, pred_end: float, gt_start: float, gt_end: float) -> float:
9
+ intersection_start = max(pred_start, gt_start)
10
+ intersection_end = min(pred_end, gt_end)
11
+ intersection = max(0, intersection_end - intersection_start)
12
+ union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
13
+ return intersection / union if union > 0 else 0
14
+
15
+
16
+ # Function to calculate DCG
17
+ def calculate_dcg(scores):
18
+ return sum((2**score - 1) / np.log2(idx + 2) for idx, score in enumerate(scores))
19
+
20
+ # Function to calculate NDCG
21
+ def calculate_ndcg(pred_scores, true_scores):
22
+ dcg = calculate_dcg(pred_scores)
23
+ idcg = calculate_dcg(sorted(true_scores, reverse=True))
24
+ return dcg / idcg if idcg > 0 else 0
25
+
26
+
27
+
28
+ def calculate_ndcg_iou(all_gt, all_pred, TS, KS):
29
+ performance = defaultdict(lambda: defaultdict(list))
30
+ performance_avg = defaultdict(lambda: defaultdict(float))
31
+ for k in tqdm(all_pred.keys(), desc="Calculate NDCG"):
32
+ one_pred = all_pred[k]
33
+ one_gt = all_gt[k]
34
+
35
+ one_gt.sort(key=lambda x: x["relevance"], reverse=True)
36
+ for T in TS:
37
+ one_gt_drop = copy.deepcopy(one_gt)
38
+ predictions_with_scores = []
39
+
40
+ for pred in one_pred:
41
+ pred_video_name, pred_time = pred["video_name"], pred["timestamp"]
42
+ matched_rows = [gt for gt in one_gt_drop if gt["video_name"] == pred_video_name]
43
+ if not matched_rows:
44
+ pred["pred_relevance"] = 0
45
+ else:
46
+ ious = [calculate_iou(pred_time[0], pred_time[1], gt["timestamp"][0], gt["timestamp"][1]) for gt in matched_rows]
47
+ max_iou_idx = np.argmax(ious)
48
+ max_iou_row = matched_rows[max_iou_idx]
49
+
50
+ if ious[max_iou_idx] > T:
51
+ pred["pred_relevance"] = max_iou_row["relevance"]
52
+ # Remove the matched ground truth row
53
+ original_idx = one_gt_drop.index(max_iou_row)
54
+ one_gt_drop.pop(original_idx)
55
+ else:
56
+ pred["pred_relevance"] = 0
57
+ predictions_with_scores.append(pred)
58
+ for K in KS:
59
+ true_scores = [gt["relevance"] for gt in one_gt][:K]
60
+ pred_scores = [pred["pred_relevance"] for pred in predictions_with_scores][:K]
61
+ ndcg_score = calculate_ndcg(pred_scores, true_scores)
62
+ performance[K][T].append(ndcg_score)
63
+ for K, vs in performance.items():
64
+ for T, v in vs.items():
65
+ performance_avg[K][T] = np.mean(v)
66
+ return performance_avg
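A minimal, self-contained example of the data layout `calculate_ndcg_iou` expects: query ids map to lists of moments carrying `video_name`, `timestamp`, and (for ground truth) `relevance`. All names and numbers below are made up.

```python
from ndcg_iou_topk import calculate_ndcg_iou

all_gt = {
    "query_1": [
        {"video_name": "video_a_seg02", "timestamp": [10.0, 25.0], "relevance": 3},
        {"video_name": "video_b_seg01", "timestamp": [40.0, 52.0], "relevance": 1},
    ]
}
all_pred = {
    "query_1": [
        {"video_name": "video_a_seg02", "timestamp": [12.0, 24.0]},
        {"video_name": "video_b_seg01", "timestamp": [41.0, 50.0]},
    ]
}

avg = calculate_ndcg_iou(all_gt, all_pred, TS=[0.3, 0.5], KS=[10, 20])
print(avg[10][0.3], avg[20][0.5])   # avg[K][T] = mean NDCG@K at IoU >= T
```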
optim/adamw.py ADDED
@@ -0,0 +1,106 @@
1
+ """
2
+ AdamW optimizer (weight decay fix)
3
+ originally from huggingface (https://github.com/huggingface/transformers).
4
+
5
+ Copied from UNITER
6
+ (https://github.com/ChenRocks/UNITER)
7
+ """
8
+ import math
9
+
10
+ import torch
11
+ from torch.optim import Optimizer
12
+
13
+
14
+ class AdamW(Optimizer):
15
+ """ Implements Adam algorithm with weight decay fix.
16
+ Parameters:
17
+ lr (float): learning rate. Default 1e-3.
18
+        betas (tuple of 2 floats): Adam's beta parameters (b1, b2).
+            Default: (0.9, 0.999)
+        eps (float): Adam's epsilon. Default: 1e-6
21
+ weight_decay (float): Weight decay. Default: 0.0
22
+ correct_bias (bool): can be set to False to avoid correcting bias
23
+ in Adam (e.g. like in Bert TF repository). Default True.
24
+ """
25
+ def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
26
+ weight_decay=0.0, correct_bias=True):
27
+ if lr < 0.0:
28
+ raise ValueError(
29
+ "Invalid learning rate: {} - should be >= 0.0".format(lr))
30
+ if not 0.0 <= betas[0] < 1.0:
31
+ raise ValueError("Invalid beta parameter: {} - "
32
+ "should be in [0.0, 1.0[".format(betas[0]))
33
+ if not 0.0 <= betas[1] < 1.0:
34
+ raise ValueError("Invalid beta parameter: {} - "
35
+ "should be in [0.0, 1.0[".format(betas[1]))
36
+ if not 0.0 <= eps:
37
+ raise ValueError("Invalid epsilon value: {} - "
38
+ "should be >= 0.0".format(eps))
39
+ defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
40
+ correct_bias=correct_bias)
41
+ super(AdamW, self).__init__(params, defaults)
42
+
43
+ def step(self, closure=None):
44
+ """Performs a single optimization step.
45
+ Arguments:
46
+ closure (callable, optional): A closure that reevaluates the model
47
+ and returns the loss.
48
+ """
49
+ loss = None
50
+ if closure is not None:
51
+ loss = closure()
52
+
53
+ for group in self.param_groups:
54
+ for p in group['params']:
55
+ if p.grad is None:
56
+ continue
57
+ grad = p.grad.data
58
+ if grad.is_sparse:
59
+ raise RuntimeError(
60
+ 'Adam does not support sparse '
61
+ 'gradients, please consider SparseAdam instead')
62
+
63
+ state = self.state[p]
64
+
65
+ # State initialization
66
+ if len(state) == 0:
67
+ state['step'] = 0
68
+ # Exponential moving average of gradient values
69
+ state['exp_avg'] = torch.zeros_like(p.data)
70
+ # Exponential moving average of squared gradient values
71
+ state['exp_avg_sq'] = torch.zeros_like(p.data)
72
+
73
+ exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
74
+ beta1, beta2 = group['betas']
75
+
76
+ state['step'] += 1
77
+
78
+ # Decay the first and second moment running average coefficient
79
+ # In-place operations to update the averages at the same time
80
+ exp_avg.mul_(beta1).add_(grad , alpha=1.0 - beta1)
81
+ exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
82
+ denom = exp_avg_sq.sqrt().add_(group['eps'])
83
+
84
+ step_size = group['lr']
85
+ if group['correct_bias']: # No bias correction for Bert
86
+ bias_correction1 = 1.0 - beta1 ** state['step']
87
+ bias_correction2 = 1.0 - beta2 ** state['step']
88
+ step_size = (step_size * math.sqrt(bias_correction2)
89
+ / bias_correction1)
90
+
91
+ p.data.addcdiv_(exp_avg, denom, value=-step_size)
92
+
93
+ # Just adding the square of the weights to the loss function is
94
+ # *not* the correct way of using L2 regularization/weight decay
95
+ # with Adam, since that will interact with the m and v
96
+ # parameters in strange ways.
97
+ #
98
+ # Instead we want to decay the weights in a manner that doesn't
99
+ # interact with the m/v parameters. This is equivalent to
100
+ # adding the square of the weights to the loss with plain
101
+ # (non-momentum) SGD.
102
+ # Add weight decay at the end (fixed version)
103
+ if group['weight_decay'] > 0.0:
104
+ p.data.add_(p.data, alpha=-group['lr'] * group['weight_decay'])
105
+
106
+ return loss
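Standard usage of this AdamW copy; the parameter grouping below is the usual BERT-style no-decay split, and the model and hyperparameters are illustrative stand-ins, not taken from the shipped training scripts.

```python
import torch
from optim.adamw import AdamW

model = torch.nn.Linear(10, 10)    # stand-in for the real model
no_decay = ["bias", "LayerNorm.weight"]
param_groups = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)], "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(param_groups, lr=1e-4, betas=(0.9, 0.999), eps=1e-6)

loss = model(torch.randn(2, 10)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```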
results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01.log ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a4d870ccff8ab61b72571cd7c9f84eb916d84fd7f091b2e300dfb9d4be5ee518
3
+ size 29628
results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01_back.log ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ef85a542568c80fab7d57d69041ebd898e30d4fc912082bd4d571aea3ec6424c
3
+ size 29917
results/tvr-top01-2024_07_08_17_18_30/best_test_predictions.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0becb2747c635a0080149ccb3e92975f7bf4bf3a99d025fd41d29ae9287db438
3
+ size 14263264
results/tvr-top01-2024_07_08_17_18_30/best_val_predictions.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:47ced0079b54bdbc05268645d80c6fa52b1ed44c6e04f6922d535be29aa3fd8c
3
+ size 2560976
results/tvr-top01-2024_07_08_17_18_30/code.zip ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:88b0711364459d5340f2e887420295145188a9008d5b50b5ddde46b221645c23
3
+ size 1141392
results/tvr-top01-2024_07_08_17_18_30/model.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aa2b8044636fe7ce9ab4d36df179ec2358f10a579de4ee5a7e58f338553558d2
+ size 190742082
results/tvr-top01-2024_07_08_17_18_30/opt.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c93c28739229f5e35afc1239e1f30e0cad28353909eed88b6d65732943a5ac61
+ size 1370
results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20.log ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ea621825b2f1d618daf456f872246d6d50bd3729a36606c7cdcf75dcddbec57a
+ size 30298
results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20_back.log ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:03b9976e0b0049f434e91251cfcde27b9a2334e95216d995ada4699f83d889c9
+ size 31752
results/tvr-top20-2024_07_08_21_19_47/best_test_predictions.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:12895f4d15d70eff1737745bda045cf6fb1bf6e85aa4e8c4cdd86633cb70274a
+ size 14324579
results/tvr-top20-2024_07_08_21_19_47/best_val_predictions.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:103076d328e1b7efdc2773625c38fc73a29492a67bcb27e023af73f8b21c8732
+ size 2571786
results/tvr-top20-2024_07_08_21_19_47/code.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:88b0711364459d5340f2e887420295145188a9008d5b50b5ddde46b221645c23
+ size 1141392
results/tvr-top20-2024_07_08_21_19_47/model.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:baff5eaebb7f211640af4e21f2876be344eaa95431ab32398ac7260e9803471f
+ size 190742082
results/tvr-top20-2024_07_08_21_19_47/opt.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:90d02a58cbb9a5ea0f23e3fefedd3f8f7b8852332b4877cfe7ba2833ca699071
+ size 1368
results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40.log ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:895455a13565da5f3d44126722152288a3057649fef1daa94d7558d490d97d81
+ size 24491
results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40_back.log ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6085e3055b53b0afc63799813027a70b1d1999beeecf22b0accda3b5a60fe8cc
+ size 26137
results/tvr-top40-2024_07_11_10_58_46/best_test_predictions.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5deaab54d6eec95172c5877b38dc72712f76b0357f26e255938a55835627ed2c
+ size 14329598
results/tvr-top40-2024_07_11_10_58_46/best_val_predictions.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9d7b68cde82958c1a7039210d2ac4bb5cfb5083abee6bbb550083395061a8a8
+ size 2572649
results/tvr-top40-2024_07_11_10_58_46/code.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:88e51fa09336f4a4545dc2e281cfe8cea943daf17de87c12b6b75d226fdb61dd
+ size 1141399
results/tvr-top40-2024_07_11_10_58_46/model.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5eba8e53656fed1ddcbb7d8129bd6c72862797c63684f11121a9a78c86b30c70
+ size 190742082
results/tvr-top40-2024_07_11_10_58_46/opt.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0e03b5de0524d803c796aaef3fa4aaf1152cfae63644403e236262fe1a4663b3
+ size 1368
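
All of the result artifacts above are committed as Git LFS pointers, so only the `oid` and `size` live in the repository; running `git lfs pull` replaces each pointer with the real file. Below is a minimal, schema-agnostic sketch for inspecting one of the prediction files after it has been fetched; the path comes from the listing above, and nothing about the JSON's internal structure is assumed.

```python
import json

# Requires `git lfs pull` first, so the pointer is replaced by the real JSON.
path = "results/tvr-top01-2024_07_08_17_18_30/best_val_predictions.json"

with open(path, "r") as f:
    predictions = json.load(f)

# Schema-agnostic peek: report the container type and its top-level contents.
if isinstance(predictions, dict):
    print("dict with keys:", list(predictions)[:5])
elif isinstance(predictions, list):
    print("list with", len(predictions), "entries")
else:
    print("loaded object of type", type(predictions).__name__)
```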
run_disjoint_top01.sh ADDED
@@ -0,0 +1,19 @@
+ python train.py \
+ --model_name conquer \
+ --dataset_config config/tvr_ranking_data_config_top01.json \
+ --model_config config/model_config.json \
+ --eval_tasks_at_training VCMR \
+ --use_interal_vr_scores \
+ --use_extend_pool 500 \
+ --neg_video_num 0 \
+ --max_vcmr_video 10 \
+ --similarity_measure disjoint \
+ --bsz 196 \
+ --eval_query_bsz 8 \
+ --eval_num_per_epoch 0.05 \
+ --n_epoch 4000 \
+ --exp_id top01
+
+ # qsub -I -l select=1:ngpus=1 -P gs_slab -q gpu8
+ # cd 11_TVR-Ranking/CONQUER/; conda activate py11; sh run_disjoint_top01.sh
+
run_disjoint_top20.sh ADDED
@@ -0,0 +1,19 @@
+ python train.py \
+ --model_name conquer \
+ --dataset_config config/tvr_ranking_data_config_top20.json \
+ --model_config config/model_config.json \
+ --eval_tasks_at_training VCMR \
+ --use_interal_vr_scores \
+ --use_extend_pool 500 \
+ --neg_video_num 0 \
+ --max_vcmr_video 10 \
+ --similarity_measure disjoint \
+ --bsz 196 \
+ --eval_query_bsz 8 \
+ --eval_num_per_epoch 1 \
+ --n_epoch 200 \
+ --exp_id top20
+
+ # qsub -I -l select=1:ngpus=1 -P gs_slab -q gpu8
+ # cd 11_TVR-Ranking/CONQUER/; conda activate py11; sh run_disjoint_top20.sh
+
run_disjoint_top40.sh ADDED
@@ -0,0 +1,19 @@
+ python train.py \
+ --model_name conquer \
+ --dataset_config config/tvr_ranking_data_config_top40.json \
+ --model_config config/model_config.json \
+ --eval_tasks_at_training VCMR \
+ --use_interal_vr_scores \
+ --use_extend_pool 500 \
+ --neg_video_num 0 \
+ --max_vcmr_video 10 \
+ --similarity_measure disjoint \
+ --bsz 196 \
+ --eval_query_bsz 8 \
+ --eval_num_per_epoch 2 \
+ --n_epoch 100 \
+ --exp_id top40
+
+ # qsub -I -l select=1:ngpus=1 -P gs_slab -q gpu8
+ # cd 11_TVR-Ranking/CONQUER/; conda activate py11; sh run_disjoint_top40.sh
+
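
The three scripts above differ only in the dataset config, the evaluation frequency, and the epoch budget; every other flag is shared. The sketch below simply makes that pattern explicit by reconstructing the three `train.py` command lines from one table of shared flags (the flag names and values are copied from the scripts; the helper itself is purely illustrative).

```python
# Illustrative only: prints the three training commands defined by the scripts above.
COMMON_FLAGS = {
    "--model_name": "conquer",
    "--model_config": "config/model_config.json",
    "--eval_tasks_at_training": "VCMR",
    "--use_extend_pool": "500",
    "--neg_video_num": "0",
    "--max_vcmr_video": "10",
    "--similarity_measure": "disjoint",
    "--bsz": "196",
    "--eval_query_bsz": "8",
}

PER_SETTING = {
    "top01": {"--dataset_config": "config/tvr_ranking_data_config_top01.json",
              "--eval_num_per_epoch": "0.05", "--n_epoch": "4000"},
    "top20": {"--dataset_config": "config/tvr_ranking_data_config_top20.json",
              "--eval_num_per_epoch": "1", "--n_epoch": "200"},
    "top40": {"--dataset_config": "config/tvr_ranking_data_config_top40.json",
              "--eval_num_per_epoch": "2", "--n_epoch": "100"},
}

for exp_id, overrides in PER_SETTING.items():
    flags = {**COMMON_FLAGS, **overrides, "--exp_id": exp_id}
    cmd = ["python", "train.py", "--use_interal_vr_scores"]  # boolean flag, no value
    for key, value in flags.items():
        cmd += [key, value]
    print(" ".join(cmd))
```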