Liangrj5 committed
Commit · a638e43
Parent(s): f2d2d1a
init
This view is limited to 50 files because it contains too many changes. See raw diff.
- .gitattributes +2 -0
- .gitignore +1 -0
- README.md +47 -3
- config/config.py +227 -0
- config/model_config.json +3 -0
- config/tvr_ranking_data_config_top01.json +3 -0
- config/tvr_ranking_data_config_top20.json +3 -0
- config/tvr_ranking_data_config_top40.json +3 -0
- data_loader/second_stage_start_end_dataset.py +349 -0
- inference.py +570 -0
- model/__init__.py +0 -0
- model/backbone/__init__.py +0 -0
- model/backbone/encoder.py +235 -0
- model/conquer.py +205 -0
- model/head/__init__.py +0 -0
- model/head/ml_head.py +61 -0
- model/head/vs_head.py +42 -0
- model/layers.py +196 -0
- model/modeling_utils.py +135 -0
- model/qal/__init__.py +0 -0
- model/qal/query_aware_learning_module.py +92 -0
- model/transformer/__init__.py +0 -0
- model/transformer/bert.py +275 -0
- model/transformer/bert_embed.py +64 -0
- ndcg_iou_topk.py +66 -0
- optim/adamw.py +106 -0
- results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01.log +3 -0
- results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01_back.log +3 -0
- results/tvr-top01-2024_07_08_17_18_30/best_test_predictions.json +3 -0
- results/tvr-top01-2024_07_08_17_18_30/best_val_predictions.json +3 -0
- results/tvr-top01-2024_07_08_17_18_30/code.zip +3 -0
- results/tvr-top01-2024_07_08_17_18_30/model.ckpt +3 -0
- results/tvr-top01-2024_07_08_17_18_30/opt.json +3 -0
- results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20.log +3 -0
- results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20_back.log +3 -0
- results/tvr-top20-2024_07_08_21_19_47/best_test_predictions.json +3 -0
- results/tvr-top20-2024_07_08_21_19_47/best_val_predictions.json +3 -0
- results/tvr-top20-2024_07_08_21_19_47/code.zip +3 -0
- results/tvr-top20-2024_07_08_21_19_47/model.ckpt +3 -0
- results/tvr-top20-2024_07_08_21_19_47/opt.json +3 -0
- results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40.log +3 -0
- results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40_back.log +3 -0
- results/tvr-top40-2024_07_11_10_58_46/best_test_predictions.json +3 -0
- results/tvr-top40-2024_07_11_10_58_46/best_val_predictions.json +3 -0
- results/tvr-top40-2024_07_11_10_58_46/code.zip +3 -0
- results/tvr-top40-2024_07_11_10_58_46/model.ckpt +3 -0
- results/tvr-top40-2024_07_11_10_58_46/opt.json +3 -0
- run_disjoint_top01.sh +19 -0
- run_disjoint_top20.sh +19 -0
- run_disjoint_top40.sh +19 -0
.gitattributes
CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
+*.log filter=lfs diff=lfs merge=lfs -text
.gitignore
ADDED
@@ -0,0 +1 @@
+*__pycache__
README.md
CHANGED
@@ -1,3 +1,47 @@
----
-license: mit
----
+---
+license: mit
+datasets:
+- axgroup/Ranking_TVR
+language:
+- en
+---
+# CONQUER_RVMR
+
+This repository contains the CONQUER model, a baseline for the Ranked Video Moment Retrieval (RVMR) task. The associated paper is titled "Video Moment Retrieval in Practical Setting: A Dataset of Ranked Moments for Imprecise Queries."
+
+The main repository of the paper is [TVR-Ranking](https://huggingface.co/axgroup/TVR-Ranking), and this model is adapted from [CONQUER](https://github.com/houzhijian/CONQUER.git). The environment setup is the same as for RelocNet_RVMR, as detailed in the [TVR-Ranking](https://huggingface.co/axgroup/TVR-Ranking) repository.
+
+
+CONQUER leverages video retrieval results from [HERO](https://github.com/linjieli222/HERO.git). We continue to use these
+results when training on our TVR-Ranking dataset. Note that, because the HERO results are obtained from the TVR dataset, there could be a data leak issue in our task setting. However, this issue is negligible for two reasons: (i) the queries used in our setting are imprecise, rewritten queries, and (ii) each query has multiple ground-truth moments in our task setting, which were not annotated in the original TVR dataset.
+
+
+## Performance
+
+
+| **Model**   | **Train Set Top N** | **IoU=0.3** |          | **IoU=0.5** |          | **IoU=0.7** |          |
+|-------------|---------------------|-------------|----------|-------------|----------|-------------|----------|
+|             |                     | **Val**     | **Test** | **Val**     | **Test** | **Val**     | **Test** |
+| **NDCG@10** |                     |             |          |             |          |             |          |
+| CONQUER     | 1                   | 0.0999      | 0.0859   | 0.0844      | 0.0709   | 0.0530      | 0.0512   |
+| CONQUER     | 20                  | 0.2406      | 0.2249   | 0.2222      | 0.2104   | 0.1672      | 0.1517   |
+| CONQUER     | 40                  | 0.2450      | 0.2219   | 0.2262      | 0.2085   | 0.1670      | 0.1515   |
+| **NDCG@20** |                     |             |          |             |          |             |          |
+| CONQUER     | 1                   | 0.0952      | 0.0835   | 0.0808      | 0.0687   | 0.0526      | 0.0484   |
+| CONQUER     | 20                  | 0.2130      | 0.1995   | 0.1976      | 0.1867   | 0.1527      | 0.1368   |
+| CONQUER     | 40                  | 0.2183      | 0.1968   | 0.2022      | 0.1851   | 0.1524      | 0.1365   |
+| **NDCG@40** |                     |             |          |             |          |             |          |
+| CONQUER     | 1                   | 0.0974      | 0.0866   | 0.0832      | 0.0718   | 0.0557      | 0.0510   |
+| CONQUER     | 20                  | 0.2029      | 0.1906   | 0.1891      | 0.1788   | 0.1476      | 0.1326   |
+| CONQUER     | 40                  | 0.2080      | 0.1885   | 0.1934      | 0.1775   | 0.1473      | 0.1323   |
+
+
+## Quick Start
+
+Modify the path in `run_disjoint_top20.sh` and then execute the script:
+
+```sh
+sh run_disjoint_top20.sh
+```
+
+Feel free to contribute or raise issues for any problems encountered.
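The NDCG@K numbers in the README table above count a predicted moment as relevant only when it lies in the right video and its temporal IoU with a ground-truth moment meets the threshold (0.3/0.5/0.7). Below is a minimal sketch of that metric for orientation only; the graded `relevance` field and the matching rule are assumptions of this sketch, and the shipped implementation is `calculate_ndcg_iou` in `ndcg_iou_topk.py`.

```python
# Sketch of NDCG@K with an IoU threshold (simplified gains; not the code in ndcg_iou_topk.py).
import math

def temporal_iou(pred_span, gt_span):
    """IoU between two [start, end] spans given in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = (pred_span[1] - pred_span[0]) + (gt_span[1] - gt_span[0]) - inter
    return inter / union if union > 0 else 0.0

def ndcg_at_k(predictions, ground_truths, k, iou_thd):
    """predictions: ranked list of {"video_name", "timestamp"} dicts.
    ground_truths: list of {"video_name", "timestamp", "relevance"} dicts (relevance is assumed graded).
    A prediction earns the gain of the best ground truth it matches with IoU >= iou_thd."""
    gains = []
    for pred in predictions[:k]:
        matched = [gt["relevance"] for gt in ground_truths
                   if gt["video_name"] == pred["video_name"]
                   and temporal_iou(pred["timestamp"], gt["timestamp"]) >= iou_thd]
        gains.append(max(matched) if matched else 0.0)
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal_gains = sorted((gt["relevance"] for gt in ground_truths), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0
```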
config/config.py
ADDED
@@ -0,0 +1,227 @@
import os
import time
import torch
import argparse
import sys
import pprint

import json
from utils.basic_utils import mkdirp, load_json, save_json, make_zipfile


def parse_with_config(parser):
    args = parser.parse_args()
    if args.config is not None:
        config_args = json.load(open(args.config))
        override_keys = {arg[2:].split('=')[0] for arg in sys.argv[1:]
                         if arg.startswith('--')}
        for k, v in config_args.items():
            if k not in override_keys:
                setattr(args, k, v)
    del args.config
    return args


class BaseOptions(object):
    saved_option_filename = "opt.json"
    ckpt_filename = "model.ckpt"
    tensorboard_log_dir = "tensorboard_log"
    train_log_filename = "train.log.txt"
    eval_log_filename = "eval.log.txt"

    def __init__(self):
        self.parser = argparse.ArgumentParser()
        self.initialized = False
        self.opt = None

    def initialize(self):
        self.initialized = True
        self.parser.add_argument("--dset_name", type=str, default="tvr", choices=["tvr", "didemo"])
        self.parser.add_argument("--eval_split_name", type=str, default="val",
                                 help="should match keys in video_duration_idx_path, must set for VCMR")
        self.parser.add_argument("--data_ratio", type=float, default=1.0,
                                 help="how many training and eval data to use. 1.0: use all, 0.1: use 10%."
                                      "Use small portion for debug purposes. Note this is different from --debug, "
                                      "which works by breaking the loops, typically they are not used together.")
        self.parser.add_argument("--debug", action="store_true",
                                 help="debug (fast) mode, break all loops, do not load all data into memory.")
        self.parser.add_argument("--disable_eval", action="store_true",
                                 help="disable eval")
        self.parser.add_argument("--results_root", type=str, default="results")
        self.parser.add_argument("--exp_id", type=str, default=None, help="id of this run, required at training")
        self.parser.add_argument("--seed", type=int, default=2018, help="random seed")
        self.parser.add_argument("--device", type=int, default=0, help="0 cuda, -1 cpu")
        self.parser.add_argument("--device_ids", type=int, nargs="+", default=[0], help="GPU ids to run the job")
        self.parser.add_argument("--num_workers", type=int, default=8,
                                 help="num subprocesses used to load the data, 0: use main process")

        # training config
        self.parser.add_argument("--lr", type=float, default=1e-4, help="learning rate")
        self.parser.add_argument("--lr_warmup_proportion", type=float, default=0.01,
                                 help="Proportion of training to perform linear learning rate warmup for. "
                                      "E.g., 0.1 = 10% of training.")
        self.parser.add_argument("--wd", type=float, default=0.01, help="weight decay")
        self.parser.add_argument("--n_epoch", type=int, default=50, help="number of epochs to run")
        self.parser.add_argument("--max_es_cnt", type=int, default=3,
                                 help="number of epochs to early stop, use -1 to disable early stop")
        self.parser.add_argument("--eval_tasks_at_training", type=str, nargs="+",
                                 default=["VCMR", "SVMR", "VR"], choices=["VCMR", "SVMR", "VR"],
                                 help="evaluate and report numbers for tasks specified here.")
        self.parser.add_argument("--bsz", type=int, default=128, help="mini-batch size")
        self.parser.add_argument("--eval_query_bsz", type=int, default=8,
                                 help="mini-batch size at inference, for query")
        self.parser.add_argument("--no_eval_untrained", action="store_true", help="Evaluate on un-trained model")
        self.parser.add_argument("--grad_clip", type=float, default=-1, help="perform gradient clip, -1: disable")
        self.parser.add_argument("--eval_epoch_num", type=int, default=1, help="eval_epoch_num")

        # Data config
        self.parser.add_argument("--max_ctx_len", type=int, default=100,
                                 help="max number of snippets, 100 for tvr clip_length=1.5, only 109/21825 > 100")
        self.parser.add_argument("--max_desc_len", type=int, default=30, help="max number of query token")
        self.parser.add_argument("--clip_length", type=float, default=1.5,
                                 help="each video will be uniformly segmented into small clips")
        self.parser.add_argument("--ctx_mode", type=str, default="visual_sub",
                                 help="adopted modality list for each clip")
        self.parser.add_argument("--dataset_config", type=str, help="data config")

        # Model config
        self.parser.add_argument("--visual_dim", type=int, default=4352, help="visual modality feature dimension")
        self.parser.add_argument("--text_dim", type=int, default=768, help="textual modality feature dimension")
        self.parser.add_argument("--query_dim", type=int, default=768, help="query feature dimension")
        self.parser.add_argument("--hidden_dim", type=int, default=768, help="joint dimension")
        self.parser.add_argument("--no_output_moe_weight", action="store_true",
                                 help="whether NOT to use query dependent fusion")
        self.parser.add_argument("--model_config", type=str, help="model config")

        ## Train config
        self.parser.add_argument("--lw_st_ed", type=float, default=0.01, help="weight for moment cross-entropy loss")
        self.parser.add_argument("--lw_video_ce", type=float, default=0.05, help="weight for video cross-entropy loss")
        self.parser.add_argument("--lr_mul", type=float, default=1, help="Learning rate multiplier for backbone module")
        self.parser.add_argument("--use_extend_pool", type=int, default=1000,
                                 help="use_extend_pool")
        self.parser.add_argument("--neg_video_num", type=int, default=3,
                                 help="sample the number of negative video, "
                                      "if neg_video_num=0, then disable shared normalization training objective")
        self.parser.add_argument("--encoder_pretrain_ckpt_filepath", type=str,
                                 default="None",
                                 help="first_stage_pretrain checkpoint")
        self.parser.add_argument("--use_interal_vr_scores", action="store_true",
                                 help="whether to interal_vr_scores, true only for general similarity measure function")

        ## Eval config
        self.parser.add_argument("--similarity_measure",
                                 type=str, choices=["general", "exclusive", "disjoint"],
                                 default="general", help="similarity_measure_function")
        # post processing
        self.parser.add_argument("--min_pred_l", type=int, default=0,
                                 help="constrain the [st, ed] with ed - st >= 1"
                                      "(1 clips with length 1.5 each, 1.5 secs in total"
                                      "this is the min length for proposal-based method)")
        self.parser.add_argument("--max_pred_l", type=int, default=24,
                                 help="constrain the [st, ed] pairs with ed - st <= 24, 36 secs in total"
                                      "(24 clips with length 1.5 each, "
                                      "this is the max length for proposal-based method)")
        self.parser.add_argument("--max_before_nms", type=int, default=200)
        self.parser.add_argument("--max_vcmr_video", type=int, default=10,
                                 help="ranking in top-max_vcmr_video")
        self.parser.add_argument("--nms_thd", type=float, default=-1,
                                 help="additionally use non-maximum suppression "
                                      "(or non-minimum suppression for distance)"
                                      "to post-processing the predictions. "
                                      "-1: do not use nms. 0.7 for tvr")
        self.parser.add_argument("--eval_num_per_epoch", type=float)

        # can use config files
        self.parser.add_argument('--config', help='JSON config files')
        self.parser.add_argument('--model_name', type=str)

    def display_save(self, opt):
        args = vars(opt)
        # Display settings
        # print("------------ Options -------------\n{}\n-------------------"
        #       .format({str(k): str(v) for k, v in sorted(args.items())}))
        print("------------ Options -------------\n{}\n-------------------"
              .format(pprint.pformat({str(k): str(v) for k, v in sorted(args.items())}, indent=4)))

        # Save settings
        if not isinstance(self, TestOptions):
            option_file_path = os.path.join(opt.results_dir, self.saved_option_filename)  # not yaml file indeed
            save_json(args, option_file_path, save_pretty=True)

    def parse(self):
        if not self.initialized:
            self.initialize()
        opt = parse_with_config(self.parser)

        if opt.debug:
            opt.results_root = os.path.sep.join(opt.results_root.split(os.path.sep)[:-1] + ["debug_results", ])
            # opt.disable_eval = True

        if isinstance(self, TestOptions):
            # modify model_dir to absolute path
            opt.model_dir = os.path.join("results", opt.model_dir)

            saved_options = load_json(os.path.join(opt.model_dir, self.saved_option_filename))
            for arg in saved_options:  # use saved options to overwrite all BaseOptions args.
                if arg not in ["results_root", "nms_thd", "debug", "dataset_config", "model_config", "device",
                               "eval_split_name", "bsz", "eval_context_bsz", "device_ids",
                               "max_vcmr_video", "max_pred_l", "min_pred_l", "external_inference_vr_res_path"]:
                    setattr(opt, arg, saved_options[arg])
        else:
            if opt.exp_id is None:
                raise ValueError("--exp_id is required for at a training option!")

            opt.results_dir = os.path.join(opt.results_root,
                                           "-".join([opt.dset_name, opt.exp_id,
                                                     time.strftime("%Y_%m_%d_%H_%M_%S")]))
            mkdirp(opt.results_dir)
            # save a copy of current code
            code_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
            code_zip_filename = os.path.join(opt.results_dir, "code.zip")
            make_zipfile(code_dir, code_zip_filename,
                         enclosing_dir="code",
                         exclude_dirs_substring="results",
                         exclude_dirs=["condor", "data", "results", "debug_results", "__pycache__"],
                         exclude_extensions=[".pyc", ".ipynb", ".swap"], )

        self.display_save(opt)

        # assert opt.stop_task in opt.eval_tasks_at_training
        opt.ckpt_filepath = os.path.join(opt.results_dir, self.ckpt_filename)
        opt.train_log_filepath = os.path.join(opt.results_dir, self.train_log_filename)
        opt.eval_log_filepath = os.path.join(opt.results_dir, self.eval_log_filename)
        opt.tensorboard_log_dir = os.path.join(opt.results_dir, self.tensorboard_log_dir)
        opt.device = torch.device("cuda:%d" % opt.device_ids[0] if opt.device >= 0 else "cpu")

        self.opt = opt
        return opt


class TestOptions(BaseOptions):
    """add additional options for evaluating"""
    def initialize(self):
        BaseOptions.initialize(self)
        # also need to specify --eval_split_name
        self.parser.add_argument("--eval_id", type=str, help="evaluation id")
        self.parser.add_argument("--model_dir", type=str,
                                 help="dir contains the model file, will be converted to absolute path afterwards")
        self.parser.add_argument("--tasks", type=str, nargs="+",
                                 choices=["VCMR", "SVMR", "VR"], default=["VCMR", "SVMR", "VR"],
                                 help="Which tasks to run."
                                      "VCMR: Video Corpus Moment Retrieval;"
                                      "SVMR: Single Video Moment Retrieval;"
                                      "VR: regular Video Retrieval. (will be performed automatically with VCMR)")


if __name__ == '__main__':
    print(__file__)
    print(os.path.realpath(__file__))
    code_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
    print(code_dir)
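A minimal usage sketch of the option classes defined above, for orientation: the driver shown here is hypothetical (the shipped entry points are the training script and `inference.py`, which imports `TestOptions` in the same way), but the attribute names all come from the parser above.

```python
# Hypothetical driver illustrating how BaseOptions / TestOptions are consumed.
from config.config import BaseOptions, TestOptions

if __name__ == "__main__":
    # Training-style parsing would use BaseOptions().parse(): it requires --exp_id
    # and creates <results_dir>, code.zip, and opt.json.

    # Evaluation-style parsing: --model_dir points at a results folder; most fields
    # are then overwritten from the opt.json saved there at training time.
    opt = TestOptions().parse()
    print(opt.similarity_measure)   # "general", "exclusive", or "disjoint"
    print(opt.ckpt_filepath)        # <results_dir>/model.ckpt
```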
config/model_config.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1458b56e285bd34b5db29a8e6babc61f9bf02d377a7ce594579baa833190f582
size 1637
config/tvr_ranking_data_config_top01.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:03ed22c7ab836800651a9ab882496e71d93266bb6dff35c13d308243d1a5c98e
size 926
config/tvr_ranking_data_config_top20.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:509c13907d08921dd59c41b040166b4e0fd6e49260fa79adca9d23f46a804f70
size 926
config/tvr_ranking_data_config_top40.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:75a6540a46a85534dcf79b5049cc47053cd48232f6983268a584565b4a55d48b
size 926
data_loader/second_stage_start_end_dataset.py
ADDED
@@ -0,0 +1,349 @@
import torch
from torch.utils.data import Dataset
import math
import os
import random
import numpy as np
from utils.basic_utils import load_json, l2_normalize_np_array
import h5py


class StartEndDataset(Dataset):
    """
    Args:
        dset_name, str, ["tvr"]
    Return:
        a dict: {
            "model_inputs": {
                "query"
                    "feat": torch.tensor, (max_desc_len, D_q)
                    "feat_mask": torch.tensor, (max_desc_len)
                    "feat_pos_id": torch.tensor, (max_desc_len)
                    "feat_token_id": torch.tensor, (max_desc_len)
                "visual"
                    "feat": torch.tensor, (max_ctx_len, D_video)
                    "feat_mask": torch.tensor, (max_ctx_len)
                    "feat_pos_id": torch.tensor, (max_ctx_len)
                    "feat_token_id": torch.tensor, (max_ctx_len)
                "sub" (optional)
                "st_ed_indices": torch.LongTensor, (2, )
            }
        }
    """
    def __init__(self, config, data_path, vr_rank_path, max_ctx_len=100, max_desc_len=30, clip_length=1.5, ctx_mode="visual_sub",
                 is_eval=False, mode="train",
                 neg_video_num=3, data_ratio=1,
                 use_extend_pool=500, inference_top_k=10):

        self.dset_name = config.dset_name
        self.root_path = config.root_path

        self.desc_bert_path = os.path.join(self.root_path, config.desc_bert_path)
        self.vid_feat_path = os.path.join(self.root_path, config.vid_feat_path)

        self.ctx_mode = ctx_mode
        self.use_sub = "sub" in self.ctx_mode

        if self.use_sub:
            self.sub_bert_path = os.path.join(self.root_path, config.sub_bert_path)

        self.max_ctx_len = max_ctx_len
        self.max_desc_len = max_desc_len
        self.clip_length = clip_length

        self.neg_video_num = neg_video_num
        self.is_eval = is_eval

        self.mode = mode
        if mode in ["val", "test"]:
            self.annotations = load_json(data_path)
            self.ground_truth = self.get_relevant_moment_gt()
            self.annotations = self.expand_annotations(self.annotations)
        if mode == "train":
            self.annotations = self.expand_annotations(load_json(data_path))

        self.first_VR_ranklist_pool_txn = h5py.File(vr_rank_path, "r")
        self.query_bert_h5 = h5py.File(self.desc_bert_path, "r")
        self.vid_feat_txn = h5py.File(self.vid_feat_path, "r")
        if self.use_sub:
            self.sub_bert_txn = h5py.File(self.sub_bert_path, "r")

        self.inference_top_k = inference_top_k
        video_data = load_json(os.path.join(self.root_path, config.video_duration_idx_path))

        self.video_data = [{"vid_name": k, "duration": v[0]} for k, v in video_data.items()]
        self.video2idx = {k: v[1] for k, v in video_data.items()}
        self.idx2video = {v[1]: k for k, v in video_data.items()}
        self.use_extend_pool = use_extend_pool

        self.normalize_vfeat = True
        self.normalize_tfeat = False

        self.visual_token_id = 0
        self.text_token_id = 1

    def __len__(self):
        return len(self.annotations)

    def expand_annotations(self, annotations):
        new_annotations = []
        for i in annotations:
            query = i["query"]
            query_id = i["query_id"]
            for moment in i["relevant_moment"]:
                moment.update({'query': query, 'query_id': query_id})
                new_annotations.append(moment)
        return new_annotations

    def get_relevant_moment_gt(self):
        gt_all = {}
        for data in self.annotations:
            gt_all[data["query_id"]] = data["relevant_moment"]
        return gt_all

    def pad_feature(self, feature, max_ctx_len):
        """
        Args:
            feature: original feature without padding
            max_ctx_len: the maximum length of video clips (or query token)

        Returns:
            feat_pad : padded feature
            feat_mask : feature mask
        """
        N_clip, feat_dim = feature.shape

        feat_pad = torch.zeros((max_ctx_len, feat_dim))
        feat_mask = torch.zeros(max_ctx_len, dtype=torch.long)
        feat_pad[:N_clip, :] = torch.from_numpy(feature)
        feat_mask[:N_clip] = 1

        return feat_pad, feat_mask

    def get_query_feat_by_query_id(self, query_id, token_id=1):
        """
        Args:
            query_id: unique query description id
            token_id: specify modality embedding
        Returns:
            a dict for query: {
                "feat": torch.tensor, (max_desc_len, D_q)
                "feat_mask": torch.tensor, (max_desc_len)
                "feat_pos_id": torch.tensor, (max_desc_len)
                "feat_token_id": torch.tensor, (max_desc_len)
            }
        """
        query_feat = self.query_bert_h5[str(query_id)][:self.max_desc_len]

        if self.normalize_tfeat:
            query_feat = l2_normalize_np_array(query_feat)

        feat_pad, feat_mask = \
            self.pad_feature(query_feat, self.max_desc_len)

        temp_model_inputs = dict()
        temp_model_inputs["feat"] = feat_pad
        temp_model_inputs["feat_mask"] = feat_mask
        temp_model_inputs["feat_pos_id"] = torch.arange(self.max_desc_len, dtype=torch.long)
        temp_model_inputs["feat_token_id"] = torch.full((self.max_desc_len,), token_id, dtype=torch.long)

        return temp_model_inputs

    def get_visual_feat_from_storage(self, vid_name):
        """
        Args:
            vid_name: unique video description id
        Returns:
            visual_feat: torch.tensor, (max_ctx_len, D_v)
            Use ResNet + SlowFast, D_v = 2048 + 2304 = 4352
        """
        visual_feat = self.vid_feat_txn[vid_name][:][:self.max_ctx_len]

        if self.normalize_vfeat:
            visual_feat = l2_normalize_np_array(visual_feat)

        return visual_feat

    def get_sub_feat_from_storage(self, vid_name):
        """
        Args:
            vid_name: unique video description id
        Returns:
            sub_feat: torch.tensor, (max_ctx_len, D_s)
            Use RoBERTa, D_s = 768
        """
        sub_feat = self.sub_bert_txn[vid_name][:][:self.max_ctx_len]

        if self.normalize_tfeat:
            sub_feat = l2_normalize_np_array(sub_feat)

        return sub_feat

    def __getitem__(self, index):

        raw_data = self.annotations[index]
        # initialize with basic data
        meta = dict(
            query_id=raw_data["query_id"],
            desc=raw_data["query"],
            vid_name=raw_data["video_name"],
            ts=raw_data["timestamp"],
        )

        # If mode is test_public, no ground-truth video_id is provided. So use a fixed dummy ground-truth video_id
        if self.mode == "test_public":
            meta["vid_name"] = "placeholder"

        model_inputs = dict()
        ## query information
        model_inputs["query"] = self.get_query_feat_by_query_id(meta["query_id"],
                                                                token_id=self.text_token_id)

        query_id = meta["query_id"]
        if query_id == 7806:
            query_id += 1

        _external_inference_vr_res = self.first_VR_ranklist_pool_txn[str(query_id)][:]
        if not self.is_eval:
            ## get the rank location of the ground-truth video for the first VR search engine
            location = 100
            for idx, item in enumerate(_external_inference_vr_res):
                if meta["vid_name"] == self.idx2video[item[0]]:
                    location = idx
                    break

            ## check all the location is below 100 when mode is train
            # if self.mode == "train":
            #     assert 0 <= location < 100, meta["query_id"]

            ## get the ranklist without the ground-truth video
            negative_video_pool_list = [self.idx2video[item[0]] for item in _external_inference_vr_res if meta["vid_name"] != self.idx2video[item[0]]]

            ## sample neg_video_num negative videos for shared normalization
            sampled_negative_video_pool = random.sample(negative_video_pool_list[:location + self.use_extend_pool],
                                                        k=self.neg_video_num)
            ## the complete sampled video list, [pos, neg1, neg2, ...]
            total_vid_name_list = [meta["vid_name"], ] + sampled_negative_video_pool

            self.shared_video_num = 1 + self.neg_video_num

        else:
            ## during eval, use top-k videos recommended by the first VR search engine
            inference_video_list = [self.idx2video[item[0]] for item in _external_inference_vr_res[:self.inference_top_k]]
            inference_video_scores = [item[1] for item in _external_inference_vr_res[:self.inference_top_k]]
            model_inputs["inference_vr_scores"] = torch.FloatTensor(inference_video_scores)
            total_vid_name_list = [meta["vid_name"], ] + inference_video_list
            self.shared_video_num = 1 + self.inference_top_k

        # sampled neg_video_num negative videos or top-k videos
        meta["sample_vid_name_list"] = total_vid_name_list[1:]

        """
        a dict for visual modality: {
            "feat": torch.tensor, (shared_video_num, max_ctx_len, D_v)
            "feat_mask": torch.tensor, (shared_video_num, max_ctx_len)
            "feat_pos_id": torch.tensor, (shared_video_num, max_ctx_len)
            "feat_token_id": torch.tensor, (shared_video_num, max_ctx_len)
        }
        """
        groundtruth_visual_feat = self.get_visual_feat_from_storage(meta["vid_name"])
        ctx_l, feat_dim = groundtruth_visual_feat.shape

        visual_feat_pad = torch.zeros((self.shared_video_num, self.max_ctx_len, feat_dim))
        visual_feat_mask = torch.zeros((self.shared_video_num, self.max_ctx_len), dtype=torch.long)
        visual_feat_pos_id = \
            torch.repeat_interleave(torch.arange(self.max_ctx_len, dtype=torch.long).unsqueeze(0),
                                    self.shared_video_num, dim=0)
        visual_feat_token_id = torch.full((self.shared_video_num, self.max_ctx_len), self.visual_token_id,
                                          dtype=torch.long)

        for index, video_name in enumerate(total_vid_name_list, start=0):
            visual_feat = self.get_visual_feat_from_storage(video_name)

            feat_pad, feat_mask = \
                self.pad_feature(visual_feat, self.max_ctx_len)

            visual_feat_pad[index] = feat_pad
            visual_feat_mask[index] = feat_mask

        temp_model_inputs = dict()
        temp_model_inputs["feat"] = visual_feat_pad
        temp_model_inputs["feat_mask"] = visual_feat_mask
        temp_model_inputs["feat_pos_id"] = visual_feat_pos_id
        temp_model_inputs["feat_token_id"] = visual_feat_token_id

        model_inputs["visual"] = temp_model_inputs

        """
        a dict for sub modality: {
            "feat": torch.tensor, (shared_video_num, max_ctx_len, D_t)
            "feat_mask": torch.tensor, (shared_video_num, max_ctx_len)
            "feat_pos_id": torch.tensor, (shared_video_num, max_ctx_len)
            "feat_token_id": torch.tensor, (shared_video_num, max_ctx_len)
        }
        """
        if self.use_sub:
            groundtruth_sub_feat = self.get_sub_feat_from_storage(meta["vid_name"])

            _, feat_dim = groundtruth_sub_feat.shape

            sub_feat_pad = torch.zeros((self.shared_video_num, self.max_ctx_len, feat_dim))
            sub_feat_mask = torch.zeros((self.shared_video_num, self.max_ctx_len), dtype=torch.long)
            sub_feat_pos_id = \
                torch.repeat_interleave(torch.arange(self.max_ctx_len, dtype=torch.long).unsqueeze(0),
                                        self.shared_video_num, dim=0)
            sub_feat_token_id = torch.full((self.shared_video_num, self.max_ctx_len), self.text_token_id, dtype=torch.long)

            for index, video_name in enumerate(total_vid_name_list, start=0):
                sub_feat = self.get_sub_feat_from_storage(video_name)

                feat_pad, feat_mask = \
                    self.pad_feature(sub_feat, self.max_ctx_len)

                sub_feat_pad[index] = feat_pad
                sub_feat_mask[index] = feat_mask

            temp_model_inputs = dict()
            temp_model_inputs["feat"] = sub_feat_pad
            temp_model_inputs["feat_mask"] = sub_feat_mask
            temp_model_inputs["feat_pos_id"] = sub_feat_pos_id
            temp_model_inputs["feat_token_id"] = sub_feat_token_id

            model_inputs["sub"] = temp_model_inputs

        if not self.is_eval:
            model_inputs["st_ed_indices"] = self.get_st_ed_label(meta["ts"],
                                                                 max_idx=ctx_l - 1)

        return dict(meta=meta, model_inputs=model_inputs)

    def get_st_ed_label(self, ts, max_idx):
        """
        Args:
            ts: [st (float), ed (float)] in seconds, ed > st
            max_idx: length of the video

        Returns:
            [st_idx, ed_idx]: int,
            ed_idx >= st_idx
            st_idx, ed_idx both belong to [0, max_idx-1]

            Given ts = [3.2, 7.6], st_idx = 2, ed_idx = 6,
            clips should be indexed as [2: 6), the translated back ts should be [3:9].
            # TODO which one is better, [2: 5] or [2: 6)
        """
        st_idx = min(math.floor(ts[0] / self.clip_length), max_idx)
        ed_idx = min(math.ceil(ts[1] / self.clip_length) - 1, max_idx)  # st_idx could be the same as ed_idx
        assert 0 <= st_idx <= ed_idx <= max_idx, (ts, st_idx, ed_idx, max_idx)
        return torch.LongTensor([st_idx, ed_idx])
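A minimal sketch of wiring this dataset for evaluation, mirroring the DataLoader setup used in `inference.py` below. `data_cfg`, `val_path`, `vr_rank_h5`, and `opt` are hypothetical placeholders for the loaded data config, the annotation JSON, the HERO rank-list HDF5, and the parsed options.

```python
# Hypothetical evaluation wiring; placeholder names are not defined by this repository.
from torch.utils.data import DataLoader
from utils.model_utils import start_end_collate, move_cuda
from data_loader.second_stage_start_end_dataset import StartEndDataset

eval_dataset = StartEndDataset(config=data_cfg, data_path=val_path, vr_rank_path=vr_rank_h5,
                               is_eval=True, mode="val", inference_top_k=opt.max_vcmr_video)
eval_loader = DataLoader(eval_dataset, collate_fn=start_end_collate,
                         batch_size=opt.eval_query_bsz, num_workers=opt.num_workers,
                         shuffle=False, pin_memory=True)
for batch in eval_loader:
    # batch["model_inputs"] holds the query/visual/sub feature dicts built in __getitem__
    model_inputs = move_cuda(batch["model_inputs"], opt.device)
    break
```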
inference.py
ADDED
@@ -0,0 +1,570 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import os
|
2 |
+
import pprint
|
3 |
+
from tqdm import tqdm
|
4 |
+
import numpy as np
|
5 |
+
|
6 |
+
import torch
|
7 |
+
import torch.nn.functional as F
|
8 |
+
import torch.backends.cudnn as cudnn
|
9 |
+
from torch.utils.data import DataLoader
|
10 |
+
|
11 |
+
from config.config import TestOptions
|
12 |
+
from model.conquer import CONQUER
|
13 |
+
from data_loader.second_stage_start_end_dataset import StartEndDataset as StartEndEvalDataset
|
14 |
+
from utils.inference_utils import \
|
15 |
+
get_submission_top_n, post_processing_vcmr_nms
|
16 |
+
from utils.basic_utils import save_json , load_config
|
17 |
+
from utils.tensor_utils import find_max_triples_from_upper_triangle_product
|
18 |
+
from standalone_eval.eval import eval_retrieval
|
19 |
+
from utils.model_utils import move_cuda , start_end_collate
|
20 |
+
from utils.model_utils import VERY_NEGATIVE_NUMBER
|
21 |
+
import logging
|
22 |
+
from time import time
|
23 |
+
from ndcg_iou_topk import calculate_ndcg_iou
|
24 |
+
|
25 |
+
logger = logging.getLogger(__name__)
|
26 |
+
logging.basicConfig(format="%(asctime)s.%(msecs)03d:%(levelname)s:%(name)s - %(message)s",
|
27 |
+
datefmt="%Y-%m-%d %H:%M:%S",
|
28 |
+
level=logging.INFO)
|
29 |
+
|
30 |
+
def generate_min_max_length_mask(array_shape, min_l, max_l):
|
31 |
+
""" The last two dimension denotes matrix of upper-triangle with upper-right corner masked,
|
32 |
+
below is the case for 4x4.
|
33 |
+
[[0, 1, 1, 0],
|
34 |
+
[0, 0, 1, 1],
|
35 |
+
[0, 0, 0, 1],
|
36 |
+
[0, 0, 0, 0]]
|
37 |
+
|
38 |
+
Args:
|
39 |
+
array_shape: np.shape??? The last two dimensions should be the same
|
40 |
+
min_l: int, minimum length of predicted span
|
41 |
+
max_l: int, maximum length of predicted span
|
42 |
+
|
43 |
+
Returns:
|
44 |
+
|
45 |
+
"""
|
46 |
+
single_dims = (1, ) * (len(array_shape) - 2)
|
47 |
+
mask_shape = single_dims + array_shape[-2:]
|
48 |
+
extra_length_mask_array = np.ones(mask_shape, dtype=np.float32) # (1, ..., 1, L, L)
|
49 |
+
mask_triu = np.triu(extra_length_mask_array, k=min_l)
|
50 |
+
mask_triu_reversed = 1 - np.triu(extra_length_mask_array, k=max_l)
|
51 |
+
final_prob_mask = mask_triu * mask_triu_reversed
|
52 |
+
return final_prob_mask # with valid bit to be 1
|
53 |
+
|
54 |
+
|
55 |
+
def get_svmr_res_from_st_ed_probs_disjoint(svmr_gt_st_probs, svmr_gt_ed_probs, query_metas, video2idx,
|
56 |
+
clip_length, min_pred_l, max_pred_l, max_before_nms):
|
57 |
+
"""
|
58 |
+
Args:
|
59 |
+
svmr_gt_st_probs: np.ndarray (N_queries, L, L), value range [0, 1]
|
60 |
+
svmr_gt_ed_probs:
|
61 |
+
query_metas:
|
62 |
+
video2idx:
|
63 |
+
clip_length: float, how long each clip is in seconds
|
64 |
+
min_pred_l: int, minimum number of clips
|
65 |
+
max_pred_l: int, maximum number of clips
|
66 |
+
max_before_nms: get top-max_before_nms predictions for each query
|
67 |
+
|
68 |
+
Returns:
|
69 |
+
|
70 |
+
"""
|
71 |
+
svmr_res = []
|
72 |
+
query_vid_names = [e["vid_name"] for e in query_metas]
|
73 |
+
|
74 |
+
# masking very long ones! Since most are relatively short.
|
75 |
+
# disjoint : b_i + e_i
|
76 |
+
_st_ed_scores = np.expand_dims(svmr_gt_st_probs,axis=2) + np.expand_dims(svmr_gt_ed_probs,axis=1)
|
77 |
+
|
78 |
+
_N_q = _st_ed_scores.shape[0]
|
79 |
+
|
80 |
+
_valid_prob_mask = np.logical_not(generate_min_max_length_mask(
|
81 |
+
_st_ed_scores.shape, min_l=min_pred_l, max_l=max_pred_l).astype(bool))
|
82 |
+
|
83 |
+
valid_prob_mask = np.tile(_valid_prob_mask,(_N_q, 1, 1))
|
84 |
+
|
85 |
+
# invalid location will become VERY_NEGATIVE_NUMBER!
|
86 |
+
_st_ed_scores[valid_prob_mask] = VERY_NEGATIVE_NUMBER
|
87 |
+
|
88 |
+
batched_sorted_triples = find_max_triples_from_upper_triangle_product(
|
89 |
+
_st_ed_scores, top_n=max_before_nms, prob_thd=None)
|
90 |
+
for i, q_vid_name in tqdm(enumerate(query_vid_names),
|
91 |
+
desc="[SVMR] Loop over queries to generate predictions",
|
92 |
+
total=len(query_vid_names)): # i is query_id
|
93 |
+
q_m = query_metas[i]
|
94 |
+
video_idx = video2idx[q_vid_name]
|
95 |
+
_sorted_triples = batched_sorted_triples[i]
|
96 |
+
_sorted_triples[:, 1] += 1 # as we redefined ed_idx, which is inside the moment.
|
97 |
+
_sorted_triples[:, :2] = _sorted_triples[:, :2] * clip_length
|
98 |
+
# [video_idx(int), st(float), ed(float), score(float)]
|
99 |
+
cur_ranked_predictions = [[video_idx, ] + row for row in _sorted_triples.tolist()]
|
100 |
+
cur_query_pred = dict(
|
101 |
+
query_id=q_m["query_id"],
|
102 |
+
desc=q_m["desc"],
|
103 |
+
predictions=cur_ranked_predictions
|
104 |
+
)
|
105 |
+
svmr_res.append(cur_query_pred)
|
106 |
+
return svmr_res
|
107 |
+
|
108 |
+
|
109 |
+
def get_svmr_res_from_st_ed_probs(svmr_gt_st_probs, svmr_gt_ed_probs, query_metas, video2idx,
|
110 |
+
clip_length, min_pred_l, max_pred_l, max_before_nms):
|
111 |
+
"""
|
112 |
+
Args:
|
113 |
+
svmr_gt_st_probs: np.ndarray (N_queries, L, L), value range [0, 1]
|
114 |
+
svmr_gt_ed_probs:
|
115 |
+
query_metas:
|
116 |
+
video2idx:
|
117 |
+
clip_length: float, how long each clip is in seconds
|
118 |
+
min_pred_l: int, minimum number of clips
|
119 |
+
max_pred_l: int, maximum number of clips
|
120 |
+
max_before_nms: get top-max_before_nms predictions for each query
|
121 |
+
|
122 |
+
Returns:
|
123 |
+
|
124 |
+
"""
|
125 |
+
svmr_res = []
|
126 |
+
query_vid_names = [e["vid_name"] for e in query_metas]
|
127 |
+
|
128 |
+
# masking very long ones! Since most are relatively short.
|
129 |
+
# general/exclusive : \hat{b_i} * \hat{e_i}
|
130 |
+
st_ed_prob_product = np.einsum("bm,bn->bmn", svmr_gt_st_probs, svmr_gt_ed_probs) # (N, L, L)
|
131 |
+
|
132 |
+
valid_prob_mask = generate_min_max_length_mask(st_ed_prob_product.shape, min_l=min_pred_l, max_l=max_pred_l)
|
133 |
+
st_ed_prob_product *= valid_prob_mask # invalid location will become zero!
|
134 |
+
|
135 |
+
batched_sorted_triples = find_max_triples_from_upper_triangle_product(
|
136 |
+
st_ed_prob_product, top_n=max_before_nms, prob_thd=None)
|
137 |
+
for i, q_vid_name in tqdm(enumerate(query_vid_names),
|
138 |
+
desc="[SVMR] Loop over queries to generate predictions",
|
139 |
+
total=len(query_vid_names)): # i is query_id
|
140 |
+
q_m = query_metas[i]
|
141 |
+
video_idx = video2idx[q_vid_name]
|
142 |
+
_sorted_triples = batched_sorted_triples[i]
|
143 |
+
_sorted_triples[:, 1] += 1 # as we redefined ed_idx, which is inside the moment.
|
144 |
+
_sorted_triples[:, :2] = _sorted_triples[:, :2] * clip_length
|
145 |
+
# [video_idx(int), st(float), ed(float), score(float)]
|
146 |
+
cur_ranked_predictions = [[video_idx, ] + row for row in _sorted_triples.tolist()]
|
147 |
+
cur_query_pred = dict(
|
148 |
+
query_id=q_m["query_id"],
|
149 |
+
desc=q_m["desc"],
|
150 |
+
predictions=cur_ranked_predictions
|
151 |
+
)
|
152 |
+
svmr_res.append(cur_query_pred)
|
153 |
+
return svmr_res
|
154 |
+
|
155 |
+
|
156 |
+
|
157 |
+
def compute_query2ctx_info(model, eval_dataset, opt,
|
158 |
+
max_before_nms=200, max_n_videos=100, tasks=("SVMR",)):
|
159 |
+
"""
|
160 |
+
Use val set to do evaluation, remember to run with torch.no_grad().
|
161 |
+
model : CONQUER
|
162 |
+
eval_dataset :
|
163 |
+
opt :
|
164 |
+
max_before_nms : max moment number before non-maximum suppression
|
165 |
+
tasks: evaluation tasks
|
166 |
+
|
167 |
+
general/exclusive function : r * \hat{b_i} + \hat{e_i}
|
168 |
+
"""
|
169 |
+
is_vr = "VR" in tasks
|
170 |
+
is_vcmr = "VCMR" in tasks
|
171 |
+
is_svmr = "SVMR" in tasks
|
172 |
+
|
173 |
+
video2idx = eval_dataset.video2idx
|
174 |
+
|
175 |
+
model.eval()
|
176 |
+
query_eval_loader = DataLoader(eval_dataset,
|
177 |
+
collate_fn= start_end_collate,
|
178 |
+
batch_size=opt.eval_query_bsz,
|
179 |
+
num_workers=opt.num_workers,
|
180 |
+
shuffle=False,
|
181 |
+
pin_memory=True)
|
182 |
+
|
183 |
+
n_total_query = len(eval_dataset)
|
184 |
+
bsz = opt.eval_query_bsz
|
185 |
+
|
186 |
+
if is_vcmr:
|
187 |
+
flat_st_ed_scores_sorted_indices = np.empty((n_total_query, max_before_nms), dtype=int)
|
188 |
+
flat_st_ed_sorted_scores = np.zeros((n_total_query, max_before_nms), dtype=np.float32)
|
189 |
+
|
190 |
+
if is_vr :
|
191 |
+
if opt.use_interal_vr_scores:
|
192 |
+
sorted_q2c_indices = np.tile(np.arange(max_n_videos, dtype=int),n_total_query).reshape(n_total_query,max_n_videos)
|
193 |
+
sorted_q2c_scores = np.empty((n_total_query, max_n_videos), dtype=np.float32)
|
194 |
+
else:
|
195 |
+
sorted_q2c_indices = np.empty((n_total_query, max_n_videos), dtype=int)
|
196 |
+
sorted_q2c_scores = np.empty((n_total_query, max_n_videos), dtype=np.float32)
|
197 |
+
|
198 |
+
if is_svmr:
|
199 |
+
svmr_gt_st_probs = np.zeros((n_total_query, opt.max_ctx_len), dtype=np.float32)
|
200 |
+
svmr_gt_ed_probs = np.zeros((n_total_query, opt.max_ctx_len), dtype=np.float32)
|
201 |
+
|
202 |
+
query_metas = []
|
203 |
+
for idx, batch in tqdm(
|
204 |
+
enumerate(query_eval_loader), desc="Computing q embedding", total=len(query_eval_loader)):
|
205 |
+
|
206 |
+
_query_metas = batch["meta"]
|
207 |
+
query_metas.extend(batch["meta"])
|
208 |
+
|
209 |
+
if opt.device.type == "cuda":
|
210 |
+
model_inputs = move_cuda(batch["model_inputs"], opt.device)
|
211 |
+
else:
|
212 |
+
model_inputs = batch["model_inputs"]
|
213 |
+
|
214 |
+
|
215 |
+
video_similarity_score, begin_score_distribution, end_score_distribution = \
|
216 |
+
model.get_pred_from_raw_query(model_inputs)
|
217 |
+
|
218 |
+
if is_svmr:
|
219 |
+
_svmr_st_probs = begin_score_distribution[:, 0]
|
220 |
+
_svmr_ed_probs = end_score_distribution[:, 0]
|
221 |
+
|
222 |
+
# normalize to get true probabilities!!!
|
223 |
+
# the probabilities here are already (pad) masked, so only need to do softmax
|
224 |
+
_svmr_st_probs = F.softmax(_svmr_st_probs, dim=-1) # (_N_q, L)
|
225 |
+
_svmr_ed_probs = F.softmax(_svmr_ed_probs, dim=-1)
|
226 |
+
if opt.debug:
|
227 |
+
print("svmr_st_probs: ", _svmr_st_probs)
|
228 |
+
|
229 |
+
svmr_gt_st_probs[idx * bsz:(idx + 1) * bsz] = \
|
230 |
+
_svmr_st_probs.cpu().numpy()
|
231 |
+
|
232 |
+
svmr_gt_ed_probs[idx * bsz:(idx + 1) * bsz] = \
|
233 |
+
_svmr_ed_probs.cpu().numpy()
|
234 |
+
|
235 |
+
_vcmr_st_prob = begin_score_distribution[:, 1:]
|
236 |
+
_vcmr_ed_prob = end_score_distribution[:, 1:]
|
237 |
+
|
238 |
+
if not (is_vr or is_vcmr):
|
239 |
+
continue
|
240 |
+
|
241 |
+
if opt.use_interal_vr_scores:
|
242 |
+
bs = begin_score_distribution.size()[0]
|
243 |
+
_sorted_q2c_indices = torch.arange(max_n_videos).to(begin_score_distribution.device).repeat(bs,1)
|
244 |
+
_sorted_q2c_scores = model_inputs["inference_vr_scores"]
|
245 |
+
if is_vr:
|
246 |
+
sorted_q2c_scores[idx * bsz:(idx + 1) * bsz] = model_inputs["inference_vr_scores"].cpu().numpy()
|
247 |
+
else:
|
248 |
+
video_similarity_score = video_similarity_score[:, 1:]
|
249 |
+
_query_context_scores = torch.softmax(video_similarity_score,dim=1)
|
250 |
+
|
251 |
+
# Get top-max_n_videos videos for each query
|
252 |
+
_sorted_q2c_scores, _sorted_q2c_indices = \
|
253 |
+
torch.topk(_query_context_scores, max_n_videos, dim=1, largest=True)
|
254 |
+
if is_vr:
|
255 |
+
sorted_q2c_indices[idx * bsz:(idx + 1) * bsz] = _sorted_q2c_indices.cpu().numpy()
|
256 |
+
sorted_q2c_scores[idx * bsz:(idx + 1) * bsz] = _sorted_q2c_scores.cpu().numpy()
|
257 |
+
|
258 |
+
|
259 |
+
if not is_vcmr:
|
260 |
+
continue
|
261 |
+
|
262 |
+
|
263 |
+
# normalize to get true probabilities!!!
|
264 |
+
# the probabilities here are already (pad) masked, so only need to do softmax
|
265 |
+
_st_probs = F.softmax(_vcmr_st_prob, dim=-1) # (_N_q, N_videos, L)
|
266 |
+
_ed_probs = F.softmax(_vcmr_ed_prob, dim=-1)
|
267 |
+
|
268 |
+
|
269 |
+
# Get VCMR results
|
270 |
+
# compute combined scores
|
271 |
+
row_indices = torch.arange(0, len(_st_probs), device=opt.device).unsqueeze(1)
|
272 |
+
_st_probs = _st_probs[row_indices, _sorted_q2c_indices] # (_N_q, max_n_videos, L)
|
273 |
+
_ed_probs = _ed_probs[row_indices, _sorted_q2c_indices]
|
274 |
+
|
275 |
+
# (_N_q, max_n_videos, L, L)
|
276 |
+
# general/exclusive : r * \hat{b_i} * \hat{e_i}
|
277 |
+
_st_ed_scores = torch.einsum("qvm,qv,qvn->qvmn", _st_probs, _sorted_q2c_scores, _ed_probs)
|
278 |
+
|
279 |
+
valid_prob_mask = generate_min_max_length_mask(
|
280 |
+
_st_ed_scores.shape, min_l=opt.min_pred_l, max_l=opt.max_pred_l)
|
281 |
+
|
282 |
+
_st_ed_scores *= torch.from_numpy(
|
283 |
+
valid_prob_mask).to(_st_ed_scores.device) # invalid location will become zero!
|
284 |
+
|
285 |
+
_n_q = _st_ed_scores.shape[0]
|
286 |
+
|
287 |
+
# sort across the total_n_videos videos (by flatten from the 2nd dim)
|
288 |
+
# the indices here are local indices, not global indices
|
289 |
+
|
290 |
+
_flat_st_ed_scores = _st_ed_scores.reshape(_n_q, -1) # (N_q, total_n_videos*L*L)
|
291 |
+
_flat_st_ed_sorted_scores, _flat_st_ed_scores_sorted_indices = \
|
292 |
+
torch.sort(_flat_st_ed_scores, dim=1, descending=True)
|
293 |
+
|
294 |
+
# collect data
|
295 |
+
flat_st_ed_sorted_scores[idx * bsz:(idx + 1) * bsz] = \
|
296 |
+
_flat_st_ed_sorted_scores[:, :max_before_nms].detach().cpu().numpy()
|
297 |
+
flat_st_ed_scores_sorted_indices[idx * bsz:(idx + 1) * bsz] = \
|
298 |
+
_flat_st_ed_scores_sorted_indices[:, :max_before_nms].detach().cpu().numpy()
|
299 |
+
|
300 |
+
if opt.debug:
|
301 |
+
break
|
302 |
+
|
303 |
+
# Numpy starts here!!!
|
304 |
+
vr_res = []
|
305 |
+
if is_vr:
|
306 |
+
for i, (_sorted_q2c_scores_row, _sorted_q2c_indices_row) in tqdm(
|
307 |
+
enumerate(zip(sorted_q2c_scores, sorted_q2c_indices)),
|
308 |
+
desc="[VR] Loop over queries to generate predictions", total=n_total_query):
|
309 |
+
cur_vr_redictions = []
|
310 |
+
query_specific_video_metas = query_metas[i]["sample_vid_name_list"]
|
311 |
+
for j, (v_score, v_meta_idx) in enumerate(zip(_sorted_q2c_scores_row, _sorted_q2c_indices_row)):
|
312 |
+
video_idx = video2idx[query_specific_video_metas[v_meta_idx]]
|
313 |
+
cur_vr_redictions.append([video_idx, 0, 0, float(v_score)])
|
314 |
+
cur_query_pred = dict(
|
315 |
+
query_id=query_metas[i]["query_id"],
|
316 |
+
desc=query_metas[i]["desc"],
|
317 |
+
predictions=cur_vr_redictions
|
318 |
+
)
|
319 |
+
vr_res.append(cur_query_pred)
|
320 |
+
|
321 |
+
svmr_res = []
|
322 |
+
if is_svmr:
|
323 |
+
svmr_res = get_svmr_res_from_st_ed_probs(svmr_gt_st_probs, svmr_gt_ed_probs,
|
324 |
+
query_metas, video2idx,
|
325 |
+
clip_length=opt.clip_length,
|
326 |
+
min_pred_l=opt.min_pred_l,
|
327 |
+
max_pred_l=opt.max_pred_l,
|
328 |
+
max_before_nms=max_before_nms)
|
329 |
+
|
330 |
+
|
331 |
+
vcmr_res = []
|
332 |
+
if is_vcmr:
|
333 |
+
for i, (_flat_st_ed_scores_sorted_indices, _flat_st_ed_sorted_scores) in tqdm(
|
334 |
+
enumerate(zip(flat_st_ed_scores_sorted_indices, flat_st_ed_sorted_scores)),
|
335 |
+
desc="[VCMR] Loop over queries to generate predictions", total=n_total_query): # i is query_idx
|
336 |
+
# list([video_idx(int), st(float), ed(float), score(float)])
|
337 |
+
video_meta_indices_local, pred_st_indices, pred_ed_indices = \
|
338 |
+
np.unravel_index(_flat_st_ed_scores_sorted_indices,
|
339 |
+
shape=(max_n_videos, opt.max_ctx_len, opt.max_ctx_len))
|
340 |
+
# video_meta_indices refers to the indices among the total_n_videos
|
341 |
+
# video_meta_indices_local refers to the indices among the top-max_n_videos
|
342 |
+
# video_meta_indices refers to the indices in all the videos, which is the True indices
|
343 |
+
video_meta_indices = sorted_q2c_indices[i, video_meta_indices_local]
|
344 |
+
|
345 |
+
pred_st_in_seconds = pred_st_indices.astype(np.float32) * opt.clip_length
|
346 |
+
pred_ed_in_seconds = pred_ed_indices.astype(np.float32) * opt.clip_length + opt.clip_length
|
347 |
+
cur_vcmr_redictions = []
|
348 |
+
query_specific_video_metas = query_metas[i]["sample_vid_name_list"]
|
349 |
+
for j, (v_meta_idx, v_score) in enumerate(zip(video_meta_indices, _flat_st_ed_sorted_scores)): # videos
|
350 |
+
video_idx = video2idx[query_specific_video_metas[v_meta_idx]]
|
351 |
+
cur_vcmr_redictions.append(
|
352 |
+
[video_idx, float(pred_st_in_seconds[j]), float(pred_ed_in_seconds[j]), float(v_score)])
|
353 |
+
|
354 |
+
cur_query_pred = dict(
|
355 |
+
query_id=query_metas[i]["query_id"],
|
356 |
+
desc=query_metas[i]["desc"],
|
357 |
+
predictions=cur_vcmr_redictions)
|
358 |
+
vcmr_res.append(cur_query_pred)
|
359 |
+
|
360 |
+
res = dict(VCMR=vcmr_res, SVMR=svmr_res, VR=vr_res)
|
361 |
+
return {k: v for k, v in res.items() if len(v) != 0}
|
362 |
+
|
363 |
+
|
364 |
+
def compute_query2ctx_info_disjoint(model, eval_dataset, opt,
|
365 |
+
max_before_nms=200, max_n_videos=100, maxtopk = 40):
|
366 |
+
"""Use val set to do evaluation, remember to run with torch.no_grad().
|
367 |
+
model : CONQUER
|
368 |
+
eval_dataset :
|
369 |
+
opt :
|
370 |
+
max_before_nms : max moment number before non-maximum suppression
|
371 |
+
tasks: evaluation tasks
|
372 |
+
|
373 |
+
disjoint function : b_i + e_i
|
374 |
+
|
375 |
+
"""
|
376 |
+
video2idx = eval_dataset.video2idx
|
377 |
+
|
378 |
+
model.eval()
|
379 |
+
query_eval_loader = DataLoader(eval_dataset, collate_fn= start_end_collate, batch_size=opt.eval_query_bsz,
|
380 |
+
num_workers=opt.num_workers, shuffle=False, pin_memory=True)
|
381 |
+
|
382 |
+
n_total_query = len(eval_dataset)
|
383 |
+
bsz = opt.eval_query_bsz
|
384 |
+
|
385 |
+
flat_st_ed_scores_sorted_indices = np.empty((n_total_query, max_before_nms), dtype=int)
|
386 |
+
flat_st_ed_sorted_scores = np.zeros((n_total_query, max_before_nms), dtype=np.float32)
|
387 |
+
|
388 |
+
|
389 |
+
query_metas = []
|
390 |
+
for idx, batch in tqdm(
|
391 |
+
enumerate(query_eval_loader), desc="Computing q embedding", total=len(query_eval_loader)):
|
392 |
+
|
393 |
+
query_metas.extend(batch["meta"])
|
394 |
+
if opt.device.type == "cuda":
|
395 |
+
model_inputs = move_cuda(batch["model_inputs"], opt.device)
|
396 |
+
|
397 |
+
else:
|
398 |
+
model_inputs = batch["model_inputs"]
|
399 |
+
|
400 |
+
_ , begin_score_distribution, end_score_distribution = model.get_pred_from_raw_query(model_inputs)
|
401 |
+
|
402 |
+
begin_score_distribution = begin_score_distribution[:,1:]
|
403 |
+
end_score_distribution= end_score_distribution[:,1:]
|
404 |
+
|
405 |
+
# Get VCMR results
|
406 |
+
# (_N_q, total_n_videos, L, L)
|
407 |
+
# b_i + e_i
|
408 |
+
_st_ed_scores = torch.unsqueeze(begin_score_distribution, 3) + torch.unsqueeze(end_score_distribution, 2)
|
409 |
+
|
410 |
+
_n_q, total_n_videos = _st_ed_scores.size()[:2]
|
411 |
+
|
412 |
+
|
413 |
+
## mask the invalid location out of moment length constrain
|
414 |
+
_valid_prob_mask = np.logical_not(generate_min_max_length_mask(
|
415 |
+
_st_ed_scores.shape, min_l=opt.min_pred_l, max_l=opt.max_pred_l).astype(bool))
|
416 |
+
|
417 |
+
_valid_prob_mask = torch.from_numpy(_valid_prob_mask).to(_st_ed_scores.device)
|
418 |
+
|
419 |
+
valid_prob_mask = _valid_prob_mask.repeat(_n_q,total_n_videos,1,1)
|
420 |
+
|
421 |
+
# invalid locations are set to VERY_NEGATIVE_NUMBER
|
422 |
+
_st_ed_scores[valid_prob_mask] = VERY_NEGATIVE_NUMBER
|
423 |
+
|
424 |
+
# sort across the total_n_videos videos (by flattening from the 2nd dim)
|
425 |
+
# the indices here are local indices, not global indices
|
426 |
+
_flat_st_ed_scores = _st_ed_scores.reshape(_n_q, -1) # (N_q, total_n_videos*L*L)
|
427 |
+
_flat_st_ed_sorted_scores, _flat_st_ed_scores_sorted_indices = \
|
428 |
+
torch.sort(_flat_st_ed_scores, dim=1, descending=True)
|
429 |
+
|
430 |
+
# collect data
|
431 |
+
flat_st_ed_sorted_scores[idx * bsz:(idx + 1) * bsz] = \
|
432 |
+
_flat_st_ed_sorted_scores[:, :max_before_nms].detach().cpu().numpy()
|
433 |
+
flat_st_ed_scores_sorted_indices[idx * bsz:(idx + 1) * bsz] = \
|
434 |
+
_flat_st_ed_scores_sorted_indices[:, :max_before_nms].detach().cpu().numpy()
|
435 |
+
|
436 |
+
|
437 |
+
|
438 |
+
vcmr_res = {}
|
439 |
+
for i, (_flat_st_ed_scores_sorted_indices, _flat_st_ed_sorted_scores) in tqdm(
|
440 |
+
enumerate(zip(flat_st_ed_scores_sorted_indices, flat_st_ed_sorted_scores)),
|
441 |
+
desc="[VCMR] Loop over queries to generate predictions", total=n_total_query): # i is query_idx
|
442 |
+
# list([video_idx(int), st(float), ed(float), score(float)])
|
443 |
+
video_meta_indices_local, pred_st_indices, pred_ed_indices = \
|
444 |
+
np.unravel_index(_flat_st_ed_scores_sorted_indices,
|
445 |
+
shape=(total_n_videos, opt.max_ctx_len, opt.max_ctx_len))
|
446 |
+
|
447 |
+
pred_st_in_seconds = pred_st_indices.astype(np.float32) * opt.clip_length
|
448 |
+
pred_ed_in_seconds = pred_ed_indices.astype(np.float32) * opt.clip_length + opt.clip_length
|
449 |
+
cur_vcmr_predictions = []
|
450 |
+
query_specific_video_metas = query_metas[i]["sample_vid_name_list"]
|
451 |
+
for j, (v_meta_idx, v_score) in enumerate(zip(video_meta_indices_local, _flat_st_ed_sorted_scores)): # videos
|
452 |
+
# video_idx = video2idx[query_specific_video_metas[v_meta_idx]]
|
453 |
+
cur_vcmr_predictions.append(
|
454 |
+
{
|
455 |
+
"video_name": query_specific_video_metas[v_meta_idx],
|
456 |
+
"timestamp": [float(pred_st_in_seconds[j]), float(pred_ed_in_seconds[j])],
|
457 |
+
"model_scores": float(v_score)
|
458 |
+
}
|
459 |
+
)
|
460 |
+
query_id = query_metas[i]["query_id"]
|
461 |
+
vcmr_res[query_id] = cur_vcmr_predictions[:maxtopk]
|
462 |
+
return vcmr_res
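As a small, self-contained sketch of the index decoding used above (toy sizes; the real values come from opt.max_ctx_len, opt.clip_length, and the candidate video list):

```python
import numpy as np

total_n_videos, max_ctx_len, clip_length = 3, 4, 1.5  # toy sizes

# Disjoint score b_i + e_j for every (video, start_clip, end_clip) triple.
rng = np.random.default_rng(0)
st_ed_scores = rng.normal(size=(total_n_videos, max_ctx_len, max_ctx_len))

# Flatten, sort descending, then recover (video, start, end) with unravel_index,
# mirroring the reshape / sort / unravel pattern in the function above.
flat_order = np.argsort(-st_ed_scores.reshape(-1))
video_idx, st_idx, ed_idx = np.unravel_index(
    flat_order[:5], shape=(total_n_videos, max_ctx_len, max_ctx_len))

# Clip indices to seconds, following the same convention as the code above.
st_sec = st_idx.astype(np.float32) * clip_length
ed_sec = ed_idx.astype(np.float32) * clip_length + clip_length
print(list(zip(video_idx.tolist(), st_sec.tolist(), ed_sec.tolist())))
```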
|
463 |
+
|
464 |
+
def get_eval_res(model, eval_dataset, opt):
|
465 |
+
"""compute and save query and video proposal embeddings"""
|
466 |
+
|
467 |
+
if opt.similarity_measure == "disjoint":  # disjoint: b_i + e_i
|
468 |
+
eval_res = compute_query2ctx_info_disjoint(model, eval_dataset, opt,
|
469 |
+
max_before_nms=opt.max_before_nms,
|
470 |
+
max_n_videos=opt.max_vcmr_video)
|
471 |
+
elif opt.similarity_measure in ["general" , "exclusive" ] : # r * \hat{b_i} * \hat{e_i}
|
472 |
+
eval_res = compute_query2ctx_info(model, eval_dataset, opt,
|
473 |
+
max_before_nms=opt.max_before_nms,
|
474 |
+
max_n_videos=opt.max_vcmr_video,
|
475 |
+
tasks=opt.tasks)
|
476 |
+
|
477 |
+
|
478 |
+
return eval_res
|
479 |
+
|
480 |
+
|
481 |
+
POST_PROCESSING_MMS_FUNC = {
|
482 |
+
"SVMR": post_processing_vcmr_nms,
|
483 |
+
"VCMR": post_processing_vcmr_nms
|
484 |
+
}
|
485 |
+
|
486 |
+
def get_prediction_top_n(list_dict_predictions, top_n):
|
487 |
+
top_n_res = []
|
488 |
+
for e in list_dict_predictions:
|
489 |
+
e["predictions"] = e["predictions"][:top_n]
|
490 |
+
top_n_res.append(e)
|
491 |
+
return top_n_res
|
492 |
+
|
493 |
+
|
494 |
+
def eval_epoch(model, eval_dataset, opt, max_after_nms, iou_thds, topks):
|
495 |
+
|
496 |
+
pred_data = get_eval_res(model, eval_dataset, opt)
|
497 |
+
# video2idx = eval_dataset.video2idx
|
498 |
+
# pred_data = get_prediction_top_n(eval_res, top_n=max_after_nms)
|
499 |
+
|
500 |
+
gt_data = eval_dataset.ground_truth
|
501 |
+
average_ndcg = calculate_ndcg_iou(gt_data, pred_data, iou_thds, topks)
|
502 |
+
return average_ndcg, pred_data
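To make the metric concrete, here is a generic sketch of NDCG@k where a prediction only earns the relevance of its matched ground-truth moment if their temporal IoU passes a threshold. It is an illustration of the idea only, not necessarily how calculate_ndcg_iou is implemented.

```python
import numpy as np

def temporal_iou(pred, gt):
    # IoU between two [start, end] spans in seconds.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def dcg(gains):
    gains = np.asarray(gains, dtype=np.float64)
    return float((gains / np.log2(np.arange(2, gains.size + 2))).sum())

def ndcg_iou_at_k(pred_moments, matched_gts, matched_rels, all_gt_rels,
                  iou_thd=0.5, k=10):
    # A prediction keeps the relevance of its matched GT moment only if IoU >= iou_thd.
    gains = [rel if temporal_iou(p, g) >= iou_thd else 0.0
             for p, g, rel in zip(pred_moments, matched_gts, matched_rels)]
    idcg = dcg(sorted(all_gt_rels, reverse=True)[:k])
    return dcg(gains[:k]) / idcg if idcg > 0 else 0.0
```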
|
503 |
+
|
504 |
+
|
505 |
+
|
506 |
+
def setup_model(opt):
|
507 |
+
"""Load model from checkpoint and move to specified device"""
|
508 |
+
checkpoint = torch.load(opt.ckpt_filepath)
|
509 |
+
loaded_model_cfg = checkpoint["model_cfg"]
|
510 |
+
|
511 |
+
model = CONQUER(loaded_model_cfg,
|
512 |
+
visual_dim=opt.visual_dim,
|
513 |
+
text_dim=opt.text_dim,
|
514 |
+
query_dim=opt.query_dim,
|
515 |
+
hidden_dim=opt.hidden_dim,
|
516 |
+
video_len=opt.max_ctx_len,
|
517 |
+
ctx_mode=opt.ctx_mode,
|
518 |
+
no_output_moe_weight=opt.no_output_moe_weight,
|
519 |
+
similarity_measure=opt.similarity_measure,
|
520 |
+
use_debug = opt.debug)
|
521 |
+
model.load_state_dict(checkpoint["model"])
|
522 |
+
|
523 |
+
logger.info("Loaded model saved at epoch {} from checkpoint: {}"
|
524 |
+
.format(checkpoint["epoch"], opt.ckpt_filepath))
|
525 |
+
|
526 |
+
if opt.device.type == "cuda":
|
527 |
+
logger.info("CUDA enabled.")
|
528 |
+
model.to(opt.device)
|
529 |
+
assert len(opt.device_ids) == 1
|
530 |
+
# if len(opt.device_ids) > 1:
|
531 |
+
# logger.info("Use multi GPU", opt.device_ids)
|
532 |
+
# model = torch.nn.DataParallel(model, device_ids=opt.device_ids) # use multi GPU
|
533 |
+
return model
|
534 |
+
|
535 |
+
|
536 |
+
def start_inference():
|
537 |
+
logger.info("Setup config, data and model...")
|
538 |
+
opt = TestOptions().parse()
|
539 |
+
cudnn.benchmark = False
|
540 |
+
cudnn.deterministic = True
|
541 |
+
|
542 |
+
data_config = load_config(opt.dataset_config)
|
543 |
+
|
544 |
+
eval_dataset = StartEndEvalDataset(
|
545 |
+
config = data_config,
|
546 |
+
max_ctx_len=opt.max_ctx_len,
|
547 |
+
max_desc_len= opt.max_desc_len,
|
548 |
+
clip_length = opt.clip_length,
|
549 |
+
ctx_mode = opt.ctx_mode,
|
550 |
+
mode = opt.eval_split_name,
|
551 |
+
data_ratio = opt.data_ratio,
|
552 |
+
is_eval = True,
|
553 |
+
inference_top_k = opt.max_vcmr_video)
|
554 |
+
|
555 |
+
postfix = "_hero"
|
556 |
+
model = setup_model(opt)
|
557 |
+
save_submission_filename = "inference_{}_{}_{}_predictions_{}{}.json".format(
|
558 |
+
opt.dset_name, opt.eval_split_name, opt.eval_id, "_".join(opt.tasks),postfix)
|
559 |
+
print(save_submission_filename)
|
560 |
+
logger.info("Starting inference...")
|
561 |
+
with torch.no_grad():
|
562 |
+
average_ndcg, pred_data = \
|
563 |
+
eval_epoch(model, eval_dataset, opt, max_after_nms=100,
|
564 |
+
iou_thds=(0.3, 0.5, 0.7), topks=(10, 20, 40))
|
565 |
+
logger.info("metrics_no_nms \n{}".format(pprint.pformat(metrics_no_nms, indent=4)))
|
566 |
+
logger.info("metrics_nms \n{}".format(pprint.pformat(metrics_nms, indent=4)))
|
567 |
+
|
568 |
+
|
569 |
+
if __name__ == '__main__':
|
570 |
+
start_inference()
|
model/__init__.py
ADDED
File without changes
|
model/backbone/__init__.py
ADDED
File without changes
|
model/backbone/encoder.py
ADDED
@@ -0,0 +1,235 @@
1 |
+
"""
|
2 |
+
Pytorch modules
|
3 |
+
some classes are modified from HuggingFace
|
4 |
+
(https://github.com/huggingface/transformers)
|
5 |
+
"""
|
6 |
+
|
7 |
+
import torch
|
8 |
+
import logging
|
9 |
+
from torch import nn
|
10 |
+
logger = logging.getLogger(__name__)
|
11 |
+
|
12 |
+
try:
|
13 |
+
from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
|
14 |
+
except (ImportError, AttributeError) as e:
|
15 |
+
BertLayerNorm = torch.nn.LayerNorm
|
16 |
+
|
17 |
+
from model.transformer.bert import BertEncoder
|
18 |
+
from model.layers import (NetVLAD, LinearLayer)
|
19 |
+
from model.transformer.bert_embed import (BertEmbeddings)
|
20 |
+
from utils.model_utils import mask_logits
|
21 |
+
import torch.nn.functional as F
|
22 |
+
|
23 |
+
|
24 |
+
|
25 |
+
class TransformerBaseModel(nn.Module):
|
26 |
+
"""
|
27 |
+
Base Transformer model
|
28 |
+
"""
|
29 |
+
def __init__(self, config):
|
30 |
+
super(TransformerBaseModel, self).__init__()
|
31 |
+
self.embeddings = BertEmbeddings(config)
|
32 |
+
self.encoder = BertEncoder(config)
|
33 |
+
|
34 |
+
|
35 |
+
def forward(self,features,position_ids,token_type_ids,attention_mask):
|
36 |
+
# embedding layer
|
37 |
+
embedding_output = self.embeddings(token_type_ids=token_type_ids,
|
38 |
+
inputs_embeds=features,
|
39 |
+
position_ids=position_ids)
|
40 |
+
|
41 |
+
encoder_outputs = self.encoder(embedding_output, attention_mask)
|
42 |
+
|
43 |
+
sequence_output = encoder_outputs[0]
|
44 |
+
|
45 |
+
return sequence_output
|
46 |
+
|
47 |
+
class TwoModalEncoder(nn.Module):
|
48 |
+
"""
|
49 |
+
Two modality Transformer Encoder model
|
50 |
+
"""
|
51 |
+
|
52 |
+
def __init__(self, config,img_dim,text_dim,hidden_dim,split_num,output_split=True):
|
53 |
+
super(TwoModalEncoder, self).__init__()
|
54 |
+
self.img_linear = LinearLayer(
|
55 |
+
in_hsz=img_dim, out_hsz=hidden_dim)
|
56 |
+
self.text_linear = LinearLayer(
|
57 |
+
in_hsz=text_dim, out_hsz=hidden_dim)
|
58 |
+
|
59 |
+
self.transformer = TransformerBaseModel(config)
|
60 |
+
self.output_split = output_split
|
61 |
+
if self.output_split:
|
62 |
+
self.split_num = split_num
|
63 |
+
|
64 |
+
|
65 |
+
def forward(self, visual_features, visual_position_ids, visual_token_type_ids, visual_attention_mask,
|
66 |
+
text_features,text_position_ids,text_token_type_ids,text_attention_mask):
|
67 |
+
|
68 |
+
transformed_im = self.img_linear(visual_features)
|
69 |
+
transformed_text = self.text_linear(text_features)
|
70 |
+
|
71 |
+
transformer_input_feat = torch.cat((transformed_im,transformed_text),dim=1)
|
72 |
+
transformer_input_feat_pos_id = torch.cat((visual_position_ids,text_position_ids),dim=1)
|
73 |
+
transformer_input_feat_token_id = torch.cat((visual_token_type_ids,text_token_type_ids),dim=1)
|
74 |
+
transformer_input_feat_mask = torch.cat((visual_attention_mask,text_attention_mask),dim=1)
|
75 |
+
|
76 |
+
output = self.transformer(features=transformer_input_feat,
|
77 |
+
position_ids=transformer_input_feat_pos_id,
|
78 |
+
token_type_ids=transformer_input_feat_token_id,
|
79 |
+
attention_mask=transformer_input_feat_mask)
|
80 |
+
|
81 |
+
if self.output_split:
|
82 |
+
return torch.split(output,self.split_num,dim=1)
|
83 |
+
else:
|
84 |
+
return output
|
85 |
+
|
86 |
+
|
87 |
+
class OneModalEncoder(nn.Module):
|
88 |
+
"""
|
89 |
+
One modality Transformer Encoder model
|
90 |
+
"""
|
91 |
+
|
92 |
+
def __init__(self, config,input_dim,hidden_dim):
|
93 |
+
super(OneModalEncoder, self).__init__()
|
94 |
+
self.linear = LinearLayer(
|
95 |
+
in_hsz=input_dim, out_hsz=hidden_dim)
|
96 |
+
self.transformer = TransformerBaseModel(config)
|
97 |
+
|
98 |
+
def forward(self, features, position_ids, token_type_ids, attention_mask):
|
99 |
+
|
100 |
+
transformed_features = self.linear(features)
|
101 |
+
|
102 |
+
output = self.transformer(features=transformed_features,
|
103 |
+
position_ids=position_ids,
|
104 |
+
token_type_ids=token_type_ids,
|
105 |
+
attention_mask=attention_mask)
|
106 |
+
return output
|
107 |
+
|
108 |
+
|
109 |
+
class VideoQueryEncoder(nn.Module):
|
110 |
+
def __init__(self, config, video_modality,
|
111 |
+
visual_dim=4352, text_dim= 768,
|
112 |
+
query_dim=768, hidden_dim = 768,split_num=100,):
|
113 |
+
super(VideoQueryEncoder, self).__init__()
|
114 |
+
self.use_sub = len(video_modality) > 1
|
115 |
+
if self.use_sub:
|
116 |
+
self.videoEncoder = TwoModalEncoder(config=config.bert_config,
|
117 |
+
img_dim = visual_dim,
|
118 |
+
text_dim = text_dim ,
|
119 |
+
hidden_dim = hidden_dim,
|
120 |
+
split_num = split_num
|
121 |
+
)
|
122 |
+
else:
|
123 |
+
self.videoEncoder = OneModalEncoder(config=config.bert_config,
|
124 |
+
input_dim = visual_dim,
|
125 |
+
hidden_dim = hidden_dim,
|
126 |
+
)
|
127 |
+
|
128 |
+
self.queryEncoder = OneModalEncoder(config=config.query_bert_config,
|
129 |
+
input_dim= query_dim,
|
130 |
+
hidden_dim=hidden_dim,
|
131 |
+
)
|
132 |
+
|
133 |
+
def forward_repr_query(self, batch):
|
134 |
+
|
135 |
+
query_output = self.queryEncoder(
|
136 |
+
features=batch["query"]["feat"],
|
137 |
+
position_ids=batch["query"]["feat_pos_id"],
|
138 |
+
token_type_ids=batch["query"]["feat_token_id"],
|
139 |
+
attention_mask=batch["query"]["feat_mask"]
|
140 |
+
)
|
141 |
+
|
142 |
+
return query_output
|
143 |
+
|
144 |
+
def forward_repr_video(self,batch):
|
145 |
+
video_output = dict()
|
146 |
+
|
147 |
+
if len(batch["visual"]["feat"].size()) == 4:
|
148 |
+
bsz, num_video = batch["visual"]["feat"].size()[:2]
|
149 |
+
for key in batch.keys():
|
150 |
+
if key in ["visual", "sub"]:
|
151 |
+
for key_2 in batch[key]:
|
152 |
+
if key_2 in ["feat", "feat_mask", "feat_pos_id", "feat_token_id"]:
|
153 |
+
shape_list = batch[key][key_2].size()[2:]
|
154 |
+
batch[key][key_2] = batch[key][key_2].view((bsz * num_video,) + shape_list)
|
155 |
+
|
156 |
+
|
157 |
+
if self.use_sub:
|
158 |
+
video_output["visual"], video_output["sub"] = self.videoEncoder(
|
159 |
+
visual_features=batch["visual"]["feat"],
|
160 |
+
visual_position_ids=batch["visual"]["feat_pos_id"],
|
161 |
+
visual_token_type_ids=batch["visual"]["feat_token_id"],
|
162 |
+
visual_attention_mask=batch["visual"]["feat_mask"],
|
163 |
+
text_features=batch["sub"]["feat"],
|
164 |
+
text_position_ids=batch["sub"]["feat_pos_id"],
|
165 |
+
text_token_type_ids=batch["sub"]["feat_token_id"],
|
166 |
+
text_attention_mask=batch["sub"]["feat_mask"]
|
167 |
+
)
|
168 |
+
else:
|
169 |
+
video_output["visual"] = self.videoEncoder(
|
170 |
+
features=batch["visual"]["feat"],
|
171 |
+
position_ids=batch["visual"]["feat_pos_id"],
|
172 |
+
token_type_ids=batch["visual"]["feat_token_id"],
|
173 |
+
attention_mask=batch["visual"]["feat_mask"]
|
174 |
+
)
|
175 |
+
|
176 |
+
return video_output
|
177 |
+
|
178 |
+
|
179 |
+
def forward_repr_both(self, batch):
|
180 |
+
video_output = self.forward_repr_video(batch)
|
181 |
+
query_output = self.forward_repr_query(batch)
|
182 |
+
|
183 |
+
return {"video_feat": video_output,
|
184 |
+
"query_feat": query_output}
|
185 |
+
|
186 |
+
def forward(self,batch,task="repr_both"):
|
187 |
+
|
188 |
+
if task == "repr_both":
|
189 |
+
return self.forward_repr_both(batch)
|
190 |
+
elif task == "repr_video":
|
191 |
+
return self.forward_repr_video(batch)
|
192 |
+
elif task == "repr_query":
|
193 |
+
return self.forward_repr_query(batch)
|
194 |
+
|
195 |
+
|
196 |
+
class QueryWeightEncoder(nn.Module):
|
197 |
+
"""
|
198 |
+
Query Weight Encoder
|
199 |
+
Using NetVLAD to aggregate contextual query features
|
200 |
+
Using FC + Softmax to get fusion weights for each modality
|
201 |
+
"""
|
202 |
+
def __init__(self, config, video_modality):
|
203 |
+
super(QueryWeightEncoder, self).__init__()
|
204 |
+
|
205 |
+
##NetVLAD
|
206 |
+
self.text_pooling = NetVLAD(feature_size=config.hidden_size,cluster_size=config.text_cluster)
|
207 |
+
self.moe_txt_dropout = nn.Dropout(config.moe_dropout_prob)
|
208 |
+
|
209 |
+
##FC
|
210 |
+
self.moe_fc_txt = nn.Linear(
|
211 |
+
in_features=self.text_pooling.out_dim,
|
212 |
+
out_features=len(video_modality),
|
213 |
+
bias=False)
|
214 |
+
|
215 |
+
self.video_modality = video_modality
|
216 |
+
|
217 |
+
def forward(self, query_feat):
|
218 |
+
##NetVLAD
|
219 |
+
pooled_text = self.text_pooling(query_feat)
|
220 |
+
pooled_text = self.moe_txt_dropout(pooled_text)
|
221 |
+
|
222 |
+
##FC + Softmax
|
223 |
+
moe_weights = self.moe_fc_txt(pooled_text)
|
224 |
+
softmax_moe_weights = F.softmax(moe_weights, dim=1)
|
225 |
+
|
226 |
+
|
227 |
+
moe_weights_dict = dict()
|
228 |
+
for modality, moe_weight in zip(self.video_modality, torch.split(softmax_moe_weights, 1, dim=1)):
|
229 |
+
moe_weights_dict[modality] = moe_weight.squeeze(1)
|
230 |
+
|
231 |
+
return moe_weights_dict
|
232 |
+
|
233 |
+
|
234 |
+
|
235 |
+
|
model/conquer.py
ADDED
@@ -0,0 +1,205 @@
1 |
+
import torch
|
2 |
+
import torch.nn as nn
|
3 |
+
from model.backbone.encoder import VideoQueryEncoder, QueryWeightEncoder
|
4 |
+
from model.qal.query_aware_learning_module import BiDirectionalAttention
|
5 |
+
from model.layers import FCPlusTransformer#,MomentLocalizationHead
|
6 |
+
from model.head.ml_head import MomentLocalizationHead
|
7 |
+
from model.head.vs_head import VideoScoringHead
|
8 |
+
|
9 |
+
import logging
|
10 |
+
logger = logging.getLogger(__name__)
|
11 |
+
|
12 |
+
|
13 |
+
class CONQUER(nn.Module):
|
14 |
+
def __init__(self, config,
|
15 |
+
visual_dim = 4352,
|
16 |
+
text_dim = 768,
|
17 |
+
query_dim = 768,
|
18 |
+
hidden_dim = 768,
|
19 |
+
video_len = 100,
|
20 |
+
ctx_mode = "visual_sub",
|
21 |
+
lw_st_ed = 0.01,
|
22 |
+
lw_video_ce = 0.05,
|
23 |
+
similarity_measure="general",
|
24 |
+
use_debug=False,
|
25 |
+
no_output_moe_weight=False):
|
26 |
+
|
27 |
+
super(CONQUER, self).__init__()
|
28 |
+
self.config = config
|
29 |
+
|
30 |
+
# related configs
|
31 |
+
self.lw_st_ed = lw_st_ed
|
32 |
+
self.lw_video_ce = lw_video_ce
|
33 |
+
self.similarity_measure = similarity_measure
|
34 |
+
|
35 |
+
self.video_modality = ctx_mode.split("_")
|
36 |
+
logger.info("video modality : %s" % self.video_modality)
|
37 |
+
self.output_moe_weight = not no_output_moe_weight
|
38 |
+
|
39 |
+
hidden_dim = hidden_dim
|
40 |
+
base_bert_layer_config = config.bert_config
|
41 |
+
|
42 |
+
## Backbone encoder
|
43 |
+
self.encoder = VideoQueryEncoder(config,video_modality=self.video_modality,
|
44 |
+
visual_dim=visual_dim,text_dim=text_dim,query_dim=query_dim,
|
45 |
+
hidden_dim=hidden_dim,split_num=video_len)
|
46 |
+
|
47 |
+
if self.output_moe_weight and len(self.video_modality) > 1:
|
48 |
+
self.query_weight = QueryWeightEncoder(config.netvlad_config,video_modality=self.video_modality)
|
49 |
+
|
50 |
+
## Query_aware_feature_learning Module
|
51 |
+
self.query_aware_feature_learning_layer = BiDirectionalAttention(hidden_dim)
|
52 |
+
|
53 |
+
## Shared transformer for both moment localization and video scoring heads
|
54 |
+
self.contextual_QAL_feature_learning = FCPlusTransformer(base_bert_layer_config,hidden_dim * 4)
|
55 |
+
|
56 |
+
## Moment_localization_head
|
57 |
+
self.moment_localization_head = MomentLocalizationHead(config.moment_localization_config,base_bert_layer_config,hidden_dim)
|
58 |
+
self.temporal_criterion = nn.CrossEntropyLoss(reduction="mean")
|
59 |
+
|
60 |
+
## Optional video_scoring_head
|
61 |
+
if self.similarity_measure == "exclusive":
|
62 |
+
self.video_scoring_head = VideoScoringHead(config.video_scoring_config,base_bert_layer_config,hidden_dim)
|
63 |
+
self.score_ce = nn.CrossEntropyLoss(reduction="mean")
|
64 |
+
|
65 |
+
self.debug_model = use_debug
|
66 |
+
if self.debug_model:
|
67 |
+
logger.setLevel(level=logging.DEBUG)
|
68 |
+
|
69 |
+
self.reset_parameters()
|
70 |
+
|
71 |
+
def reset_parameters(self):
|
72 |
+
""" Initialize the weights."""
|
73 |
+
|
74 |
+
def re_init(module):
|
75 |
+
if isinstance(module, (nn.Linear, nn.Embedding)):
|
76 |
+
# Slightly different from the TF version which uses truncated_normal for initialization
|
77 |
+
# cf https://github.com/pytorch/pytorch/pull/5617
|
78 |
+
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
79 |
+
#print("nn.Linear, nn.Embedding: ", module)
|
80 |
+
elif isinstance(module, nn.LayerNorm):
|
81 |
+
module.bias.data.zero_()
|
82 |
+
module.weight.data.fill_(1.0)
|
83 |
+
elif isinstance(module, nn.Conv1d):
|
84 |
+
module.reset_parameters()
|
85 |
+
|
86 |
+
if isinstance(module, nn.Linear) and module.bias is not None:
|
87 |
+
module.bias.data.zero_()
|
88 |
+
|
89 |
+
self.apply(re_init)
|
90 |
+
|
91 |
+
|
92 |
+
def compute_final_score(self,score_dict,moe_weights=None):
|
93 |
+
|
94 |
+
sample_key = list(score_dict.keys())[0]
|
95 |
+
final_query_context_scores = torch.zeros_like(score_dict[sample_key])
|
96 |
+
shape_size = len(score_dict[sample_key].shape)
|
97 |
+
if moe_weights is not None:
|
98 |
+
for mod in self.video_modality:
|
99 |
+
if shape_size == 2:
|
100 |
+
final_query_context_scores += torch.einsum("nm,n->nm", score_dict[mod], moe_weights[mod])
|
101 |
+
elif shape_size == 3:
|
102 |
+
final_query_context_scores += torch.einsum("nlm,n->nlm", score_dict[mod], moe_weights[mod])
|
103 |
+
else:
|
104 |
+
for mod in self.video_modality:
|
105 |
+
final_query_context_scores += torch.div(score_dict[mod], len(self.video_modality))
|
106 |
+
|
107 |
+
return final_query_context_scores
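A standalone sketch of the weighted fusion performed here, with two modalities and made-up tensor sizes (the real shapes depend on the batch layout):

```python
import torch

bsz, L, d = 2, 100, 768
score_dict = {"visual": torch.randn(bsz, L, d), "sub": torch.randn(bsz, L, d)}
moe_weights = {"visual": torch.full((bsz,), 0.7), "sub": torch.full((bsz,), 0.3)}

# Each modality's (L, d) feature map is scaled by its per-sample scalar weight,
# then the modalities are summed, matching the shape_size == 3 branch above.
fused = sum(torch.einsum("nlm,n->nlm", score_dict[mod], moe_weights[mod])
            for mod in score_dict)
assert fused.shape == (bsz, L, d)
```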
|
108 |
+
|
109 |
+
|
110 |
+
def get_pred_from_raw_query(self, batch):
|
111 |
+
|
112 |
+
## Extract query and video feature through MMT backbone
|
113 |
+
_query_feature = self.encoder(batch, task="repr_query") #Widehat_Q
|
114 |
+
|
115 |
+
_video_feature_dict = self.encoder(batch, task="repr_video") #Widehat_V and #Widehat_S
|
116 |
+
|
117 |
+
## Shared normalization technique
|
118 |
+
## Use the same query feature for shared_video_num times
|
119 |
+
sample_key = list(_video_feature_dict.keys())[0]
|
120 |
+
query_batch = _query_feature.size()[0]
|
121 |
+
video_batch, video_len = _video_feature_dict[sample_key].size()[:2]
|
122 |
+
shared_video_num = int(video_batch / query_batch)
|
123 |
+
|
124 |
+
query_feature = torch.repeat_interleave(_query_feature, shared_video_num, dim=0)
|
125 |
+
query_mask = torch.repeat_interleave(batch["query"]["feat_mask"], shared_video_num, dim=0)
|
126 |
+
|
127 |
+
|
128 |
+
## Compute Query Dependent Fusion video feature
|
129 |
+
if self.output_moe_weight and len(self.video_modality) > 1:
|
130 |
+
moe_weights_dict = self.query_weight(query_feature)
|
131 |
+
QDF_feature = self.compute_final_score(_video_feature_dict, moe_weights_dict)
|
132 |
+
else:
|
133 |
+
QDF_feature = self.compute_final_score(_video_feature_dict,None)
|
134 |
+
|
135 |
+
video_mask = batch["visual"]["feat_mask"]
|
136 |
+
|
137 |
+
|
138 |
+
## Compute Query Aware Learning video feature
|
139 |
+
QAL_feature = self.query_aware_feature_learning_layer(QDF_feature, query_feature,
|
140 |
+
video_mask,query_mask)
|
141 |
+
|
142 |
+
## Contextualize QAL features
|
143 |
+
Contextual_QAL = self.contextual_QAL_feature_learning(
|
144 |
+
features=QAL_feature,
|
145 |
+
feat_mask=video_mask)
|
146 |
+
|
147 |
+
G = torch.cat([QAL_feature,Contextual_QAL], dim=2)
|
148 |
+
|
149 |
+
## Moment localization head
|
150 |
+
begin_score_distribution , end_score_distribution = self.moment_localization_head(G,Contextual_QAL,video_mask)
|
151 |
+
begin_score_distribution = begin_score_distribution.view(query_batch, shared_video_num, video_len)
|
152 |
+
end_score_distribution = end_score_distribution.view(query_batch, shared_video_num, video_len)
|
153 |
+
|
154 |
+
## Optional video scoring head
|
155 |
+
video_similarity_score = None
|
156 |
+
if self.similarity_measure == "exclusive":
|
157 |
+
video_similarity_score = self.video_scoring_head(G,video_mask)
|
158 |
+
video_similarity_score = video_similarity_score.view(query_batch, shared_video_num)
|
159 |
+
|
160 |
+
return video_similarity_score, begin_score_distribution , end_score_distribution
|
161 |
+
|
162 |
+
|
163 |
+
def get_moment_loss_share_norm(self, begin_score_distribution, end_score_distribution ,st_ed_indices):
|
164 |
+
|
165 |
+
bs , shared_video_num , video_len = begin_score_distribution.size()
|
166 |
+
|
167 |
+
begin_score_distribution = begin_score_distribution.view(bs,-1)
|
168 |
+
end_score_distribution = end_score_distribution.view(bs,-1)
|
169 |
+
|
170 |
+
loss_st = self.temporal_criterion(begin_score_distribution, st_ed_indices[:, 0])
|
171 |
+
loss_ed = self.temporal_criterion(end_score_distribution, st_ed_indices[:, 1])
|
172 |
+
moment_ce_loss = loss_st + loss_ed
|
173 |
+
|
174 |
+
return moment_ce_loss
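A toy sketch of the shared-normalization objective computed here, assuming the ground-truth video occupies slot 0 so that its clip indices are unchanged after flattening (sizes are made up):

```python
import torch
import torch.nn as nn

bs, shared_video_num, video_len = 4, 3, 100   # 1 positive + 2 sampled negative videos
begin_scores = torch.randn(bs, shared_video_num, video_len)

# Ground-truth start clip index inside the positive video (slot 0); after
# flattening to (shared_video_num * video_len,) its index is unchanged.
gt_start = torch.randint(0, video_len, (bs,))

criterion = nn.CrossEntropyLoss(reduction="mean")
loss_st = criterion(begin_scores.view(bs, -1), gt_start)  # softmax spans all videos
print(float(loss_st))
```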
|
175 |
+
|
176 |
+
|
177 |
+
def forward(self,batch):
|
178 |
+
|
179 |
+
video_similarity_score, begin_score_distribution , end_score_distribution = \
|
180 |
+
self.get_pred_from_raw_query(batch)
|
181 |
+
|
182 |
+
moment_ce_loss, video_ce_loss = 0, 0
|
183 |
+
|
184 |
+
# moment cross-entropy loss
|
185 |
+
# if neg_video_num = 0, we do not sample negative videos
|
186 |
+
# the softmax operator is performed only for the ground-truth video
|
187 |
+
# which means the shared normalization training objective is not used
|
188 |
+
moment_ce_loss = self.get_moment_loss_share_norm(
|
189 |
+
begin_score_distribution, end_score_distribution, batch["st_ed_indices"])
|
190 |
+
moment_ce_loss = self.lw_st_ed * moment_ce_loss
|
191 |
+
|
192 |
+
if self.similarity_measure == "exclusive":
|
193 |
+
ce_label = batch["st_ed_indices"].new_zeros(video_similarity_score.size()[0])
|
194 |
+
video_ce_loss = self.score_ce(video_similarity_score, ce_label)
|
195 |
+
video_ce_loss = self.lw_video_ce*video_ce_loss
|
196 |
+
|
197 |
+
|
198 |
+
loss = moment_ce_loss + video_ce_loss
|
199 |
+
return loss, {"moment_ce_loss": float(moment_ce_loss),
|
200 |
+
"video_ce_loss": float(video_ce_loss),
|
201 |
+
"loss_overall": float(loss)}
|
202 |
+
|
203 |
+
|
204 |
+
|
205 |
+
|
model/head/__init__.py
ADDED
File without changes
|
model/head/ml_head.py
ADDED
@@ -0,0 +1,61 @@
1 |
+
import torch
|
2 |
+
from torch import nn
|
3 |
+
import logging
|
4 |
+
logger = logging.getLogger(__name__)
|
5 |
+
|
6 |
+
|
7 |
+
from model.layers import FCPlusTransformer, ConvSE
|
8 |
+
|
9 |
+
|
10 |
+
class MomentLocalizationHead(nn.Module):
|
11 |
+
"""
|
12 |
+
Moment localization head model
|
13 |
+
"""
|
14 |
+
|
15 |
+
def __init__(self, config,base_bert_layer_config,hidden_dim):
|
16 |
+
super(MomentLocalizationHead, self).__init__()
|
17 |
+
|
18 |
+
base_bert_layer_config = base_bert_layer_config
|
19 |
+
hidden_dim = hidden_dim
|
20 |
+
|
21 |
+
self.begin_feature_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 5)
|
22 |
+
|
23 |
+
self.end_feature_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 2)
|
24 |
+
|
25 |
+
self.begin_score_modeling = ConvSE(config)
|
26 |
+
self.end_score_modeling = ConvSE(config)
|
27 |
+
|
28 |
+
def forward(self, G, Contextual_QAL, video_mask):
|
29 |
+
"""
|
30 |
+
Inputs:
|
31 |
+
:param G: (batch, L_v, hidden_dim * 5) concatenation of QAL and contextual QAL features
|
32 |
+
:param video_mask: (batch, L_v)
|
33 |
+
Return:
|
34 |
+
begin and end score distributions, each (batch, L_v)
|
35 |
+
"""
|
36 |
+
## OUTPUT LAYER
|
37 |
+
begin_features = self.begin_feature_modeling(
|
38 |
+
features=G,
|
39 |
+
feat_mask=video_mask)
|
40 |
+
|
41 |
+
end_features = self.end_feature_modeling(
|
42 |
+
features=torch.cat([Contextual_QAL, begin_features], dim=2),
|
43 |
+
feat_mask=video_mask)
|
44 |
+
|
45 |
+
## Un-normalized
|
46 |
+
begin_input_feature = torch.transpose(begin_features, 1, 2)
|
47 |
+
end_input_feature = torch.transpose(end_features, 1, 2)
|
48 |
+
|
49 |
+
begin_score_distribution = self.begin_score_modeling(
|
50 |
+
contextual_qal_features=begin_input_feature,
|
51 |
+
video_mask=video_mask,
|
52 |
+
)
|
53 |
+
|
54 |
+
end_score_distribution = self.end_score_modeling(
|
55 |
+
contextual_qal_features=end_input_feature,
|
56 |
+
video_mask=video_mask,
|
57 |
+
)
|
58 |
+
|
59 |
+
return begin_score_distribution , end_score_distribution
|
60 |
+
|
61 |
+
|
model/head/vs_head.py
ADDED
@@ -0,0 +1,42 @@
1 |
+
import torch
|
2 |
+
from torch import nn
|
3 |
+
|
4 |
+
import logging
|
5 |
+
logger = logging.getLogger(__name__)
|
6 |
+
|
7 |
+
from model.layers import FCPlusTransformer
|
8 |
+
|
9 |
+
class VideoScoringHead(nn.Module):
|
10 |
+
"""
|
11 |
+
Video Scoring Head
|
12 |
+
"""
|
13 |
+
|
14 |
+
def __init__(self, config,base_bert_layer_config,hidden_dim):
|
15 |
+
super(VideoScoringHead, self).__init__()
|
16 |
+
|
17 |
+
base_bert_layer_config = base_bert_layer_config
|
18 |
+
hidden_dim = hidden_dim
|
19 |
+
|
20 |
+
|
21 |
+
self.video_feature_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 5)
|
22 |
+
|
23 |
+
self.video_score_predictor = nn.Sequential(
|
24 |
+
nn.Linear(**config.linear_1_cfg),
|
25 |
+
nn.ReLU(),
|
26 |
+
nn.Linear(**config.linear_2_cfg)
|
27 |
+
)
|
28 |
+
|
29 |
+
|
30 |
+
def forward(self, G, video_mask):
|
31 |
+
|
32 |
+
|
33 |
+
## Contextual QAL feature for video scoring
|
34 |
+
R = self.video_feature_modeling(
|
35 |
+
features=G,
|
36 |
+
feat_mask=video_mask)
|
37 |
+
|
38 |
+
holistic_video_feature, _ = torch.max(R, dim=1)
|
39 |
+
|
40 |
+
video_similarity_score = self.video_score_predictor(holistic_video_feature.squeeze(1)) # r
|
41 |
+
|
42 |
+
return video_similarity_score
|
model/layers.py
ADDED
@@ -0,0 +1,196 @@
1 |
+
import torch
|
2 |
+
import torch.nn as nn
|
3 |
+
import torch.nn.functional as F
|
4 |
+
import math
|
5 |
+
import logging
|
6 |
+
|
7 |
+
logger = logging.getLogger(__name__)
|
8 |
+
try:
|
9 |
+
from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
|
10 |
+
except (ImportError, AttributeError) as e:
|
11 |
+
BertLayerNorm = torch.nn.LayerNorm
|
12 |
+
|
13 |
+
from model.transformer.bert import BertEncoder
|
14 |
+
from model.modeling_utils import mask_logits
|
15 |
+
|
16 |
+
class LinearLayer(nn.Module):
|
17 |
+
"""linear layer configurable with layer normalization, dropout, ReLU."""
|
18 |
+
def __init__(self, in_hsz, out_hsz, layer_norm=True, dropout=0.1, relu=True,tanh=False):
|
19 |
+
super(LinearLayer, self).__init__()
|
20 |
+
self.relu = relu
|
21 |
+
self.tanh = tanh
|
22 |
+
self.layer_norm = layer_norm
|
23 |
+
if layer_norm:
|
24 |
+
self.LayerNorm = BertLayerNorm(in_hsz)
|
25 |
+
layers = [
|
26 |
+
nn.Dropout(dropout),
|
27 |
+
nn.Linear(in_hsz, out_hsz)
|
28 |
+
]
|
29 |
+
self.net = nn.Sequential(*layers)
|
30 |
+
|
31 |
+
def forward(self, x):
|
32 |
+
"""(N, L, D)"""
|
33 |
+
if self.layer_norm:
|
34 |
+
x = self.LayerNorm(x)
|
35 |
+
x = self.net(x)
|
36 |
+
if self.relu:
|
37 |
+
x = F.relu(x, inplace=True)
|
38 |
+
if self.tanh:
|
39 |
+
x = torch.tanh(x)
|
40 |
+
return x # (N, L, D)
|
41 |
+
|
42 |
+
|
43 |
+
class NetVLAD(nn.Module):
|
44 |
+
def __init__(self, cluster_size, feature_size, add_norm=True):
|
45 |
+
super(NetVLAD, self).__init__()
|
46 |
+
self.feature_size = feature_size
|
47 |
+
self.cluster_size = cluster_size
|
48 |
+
self.clusters = nn.Parameter((1 / math.sqrt(feature_size))
|
49 |
+
* torch.randn(feature_size, cluster_size))
|
50 |
+
self.clusters2 = nn.Parameter((1 / math.sqrt(feature_size))
|
51 |
+
* torch.randn(1, feature_size, cluster_size))
|
52 |
+
|
53 |
+
self.add_norm = add_norm
|
54 |
+
self.LayerNorm = BertLayerNorm(cluster_size)
|
55 |
+
self.out_dim = cluster_size * feature_size
|
56 |
+
|
57 |
+
def forward(self, x):
|
58 |
+
max_sample = x.size()[1]
|
59 |
+
x = x.view(-1, self.feature_size)
|
60 |
+
assignment = torch.matmul(x, self.clusters)
|
61 |
+
|
62 |
+
if self.add_norm:
|
63 |
+
assignment = self.LayerNorm(assignment)
|
64 |
+
|
65 |
+
assignment = F.softmax(assignment, dim=1)
|
66 |
+
assignment = assignment.view(-1, max_sample, self.cluster_size)
|
67 |
+
|
68 |
+
a_sum = torch.sum(assignment, -2, keepdim=True)
|
69 |
+
a = a_sum * self.clusters2
|
70 |
+
|
71 |
+
assignment = assignment.transpose(1, 2)
|
72 |
+
|
73 |
+
x = x.view(-1, max_sample, self.feature_size)
|
74 |
+
vlad = torch.matmul(assignment, x)
|
75 |
+
vlad = vlad.transpose(1, 2)
|
76 |
+
vlad = vlad - a
|
77 |
+
|
78 |
+
# L2 intra norm
|
79 |
+
vlad = F.normalize(vlad)
|
80 |
+
|
81 |
+
# flattening + L2 norm
|
82 |
+
vlad = vlad.reshape(-1, self.cluster_size * self.feature_size)
|
83 |
+
vlad = F.normalize(vlad)
|
84 |
+
|
85 |
+
return vlad
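A quick shape check for this NetVLAD block, reusing the class defined above with hypothetical sizes:

```python
import torch

vlad = NetVLAD(cluster_size=4, feature_size=768)   # hypothetical sizes
x = torch.randn(2, 30, 768)                        # 2 queries, 30 tokens each
out = vlad(x)
print(out.shape)   # torch.Size([2, 3072]) == cluster_size * feature_size
```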
|
86 |
+
|
87 |
+
|
88 |
+
class FCPlusTransformer(nn.Module):
|
89 |
+
"""
|
90 |
+
FC + Transformer
|
91 |
+
FC layer reduces input feature size into hidden size
|
92 |
+
Transformer contextualizes QAL feature
|
93 |
+
"""
|
94 |
+
|
95 |
+
def __init__(self, config,input_dim):
|
96 |
+
super(FCPlusTransformer, self).__init__()
|
97 |
+
self.trans_linear = LinearLayer(
|
98 |
+
in_hsz=input_dim, out_hsz=config.hidden_size)
|
99 |
+
self.encoder = BertEncoder(config)
|
100 |
+
|
101 |
+
def forward(self,features, feat_mask):
|
102 |
+
"""
|
103 |
+
Inputs:
|
104 |
+
:param features: (batch, L_v, input_dim)
|
105 |
+
:param feat_mask: (batch, L_v)
|
106 |
+
Return:
|
107 |
+
sequence_output: (batch, L_v, hidden_size)
|
108 |
+
"""
|
109 |
+
transformed_features = self.trans_linear(features)
|
110 |
+
|
111 |
+
encoder_outputs = self.encoder(transformed_features, feat_mask)
|
112 |
+
|
113 |
+
sequence_output = encoder_outputs[0]
|
114 |
+
|
115 |
+
return sequence_output
|
116 |
+
|
117 |
+
|
118 |
+
class ConvSE(nn.Module):
|
119 |
+
"""
|
120 |
+
ConvSE module
|
121 |
+
"""
|
122 |
+
def __init__(self, config):
|
123 |
+
super(ConvSE, self).__init__()
|
124 |
+
|
125 |
+
self.clip_score_predictor = nn.Sequential(
|
126 |
+
nn.Conv1d(**config.conv_cfg_1),
|
127 |
+
nn.ReLU(),
|
128 |
+
nn.Conv1d(**config.conv_cfg_2),
|
129 |
+
)
|
130 |
+
|
131 |
+
|
132 |
+
def forward(self, contextual_qal_features, video_mask):
|
133 |
+
"""
|
134 |
+
Inputs:
|
135 |
+
:param contextual_qal_features: (batch, feat_size, L_v)
|
136 |
+
:param video_mask: (batch, L_v)
|
137 |
+
Return:
|
138 |
+
score: (begin or end) score distribution
|
139 |
+
"""
|
140 |
+
score = self.clip_score_predictor(contextual_qal_features).squeeze(1) #(batch, L_v)
|
141 |
+
|
142 |
+
score = mask_logits(score, video_mask) #(batch, L_v)
|
143 |
+
|
144 |
+
return score
|
145 |
+
|
146 |
+
|
147 |
+
class MomentLocalizationHead(nn.Module):
|
148 |
+
"""
|
149 |
+
Moment localization head model
|
150 |
+
"""
|
151 |
+
|
152 |
+
def __init__(self, config,base_bert_layer_config,hidden_dim):
|
153 |
+
super(MomentLocalizationHead, self).__init__()
|
154 |
+
|
155 |
+
base_bert_layer_config = base_bert_layer_config
|
156 |
+
hidden_dim = hidden_dim
|
157 |
+
|
158 |
+
self.start_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 5)
|
159 |
+
|
160 |
+
self.end_modeling = FCPlusTransformer(base_bert_layer_config, hidden_dim * 2)
|
161 |
+
|
162 |
+
self.start_reader = ConvSE(config)
|
163 |
+
self.end_reader = ConvSE(config)
|
164 |
+
|
165 |
+
def forward(self, G, Contextual_QAL, video_mask):
|
166 |
+
"""
|
167 |
+
Inputs:
|
168 |
+
:param contextual_qal_features: (batch, feat_size, L_v)
|
169 |
+
:param video_mask: (batch, L_v)
|
170 |
+
Return:
|
171 |
+
score: (begin or end) score distribution
|
172 |
+
"""
|
173 |
+
## OUTPUT LAYER
|
174 |
+
start_features = self.start_modeling(
|
175 |
+
features=G,
|
176 |
+
feat_mask=video_mask)
|
177 |
+
|
178 |
+
end_features = self.end_modeling(
|
179 |
+
features=torch.cat([Contextual_QAL, start_features], dim=2),
|
180 |
+
feat_mask=video_mask)
|
181 |
+
|
182 |
+
## Un-normalized
|
183 |
+
start_reader_input_feature = torch.transpose(start_features, 1, 2)
|
184 |
+
end_reader_input_feature = torch.transpose(end_features, 1, 2)
|
185 |
+
|
186 |
+
reader_st_prob = self.start_reader(
|
187 |
+
contextual_qal_features=start_reader_input_feature,
|
188 |
+
video_mask=video_mask,
|
189 |
+
)
|
190 |
+
|
191 |
+
reader_ed_prob = self.end_reader(
|
192 |
+
contextual_qal_features=end_reader_input_feature,
|
193 |
+
video_mask=video_mask,
|
194 |
+
)
|
195 |
+
|
196 |
+
return reader_st_prob,reader_ed_prob
|
model/modeling_utils.py
ADDED
@@ -0,0 +1,135 @@
1 |
+
"""
|
2 |
+
Copyright (c) Microsoft Corporation.
|
3 |
+
Licensed under the MIT license.
|
4 |
+
|
5 |
+
some functions are modified from HuggingFace
|
6 |
+
(https://github.com/huggingface/transformers)
|
7 |
+
"""
|
8 |
+
import torch
|
9 |
+
from torch import nn
|
10 |
+
import logging
|
11 |
+
logger = logging.getLogger(__name__)
|
12 |
+
|
13 |
+
|
14 |
+
def prune_linear_layer(layer, index, dim=0):
|
15 |
+
""" Prune a linear layer (a model parameters)
|
16 |
+
to keep only entries in index.
|
17 |
+
Return the pruned layer as a new layer with requires_grad=True.
|
18 |
+
Used to remove heads.
|
19 |
+
"""
|
20 |
+
index = index.to(layer.weight.device)
|
21 |
+
W = layer.weight.index_select(dim, index).clone().detach()
|
22 |
+
if layer.bias is not None:
|
23 |
+
if dim == 1:
|
24 |
+
b = layer.bias.clone().detach()
|
25 |
+
else:
|
26 |
+
b = layer.bias[index].clone().detach()
|
27 |
+
new_size = list(layer.weight.size())
|
28 |
+
new_size[dim] = len(index)
|
29 |
+
new_layer = nn.Linear(
|
30 |
+
new_size[1], new_size[0], bias=layer.bias is not None).to(
|
31 |
+
layer.weight.device)
|
32 |
+
new_layer.weight.requires_grad = False
|
33 |
+
new_layer.weight.copy_(W.contiguous())
|
34 |
+
new_layer.weight.requires_grad = True
|
35 |
+
if layer.bias is not None:
|
36 |
+
new_layer.bias.requires_grad = False
|
37 |
+
new_layer.bias.copy_(b.contiguous())
|
38 |
+
new_layer.bias.requires_grad = True
|
39 |
+
return new_layer
|
40 |
+
|
41 |
+
|
42 |
+
def mask_logits(target, mask, eps=-1e4):
|
43 |
+
return target * mask + (1 - mask) * eps
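For intuition, a tiny self-contained example of mask_logits: masked positions receive a large negative value, so they effectively disappear after softmax.

```python
import torch
import torch.nn.functional as F

def mask_logits(target, mask, eps=-1e4):
    return target * mask + (1 - mask) * eps

scores = torch.tensor([[2.0, 1.0, 3.0]])
mask = torch.tensor([[1.0, 1.0, 0.0]])   # last position is padding
probs = F.softmax(mask_logits(scores, mask), dim=-1)
print(probs)   # probability of the masked position is ~0
```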
|
44 |
+
|
45 |
+
|
46 |
+
def load_partial_checkpoint(checkpoint, n_layers, skip_layers=True):
|
47 |
+
if skip_layers:
|
48 |
+
new_checkpoint = {}
|
49 |
+
gap = int(12/n_layers)
|
50 |
+
prefix = "roberta.encoder.layer."
|
51 |
+
layer_range = {str(l): str(i) for i, l in enumerate(
|
52 |
+
list(range(gap-1, 12, gap)))}
|
53 |
+
for k, v in checkpoint.items():
|
54 |
+
if prefix in k:
|
55 |
+
layer_name = k.split(".")
|
56 |
+
layer_num = layer_name[3]
|
57 |
+
if layer_num in layer_range:
|
58 |
+
layer_name[3] = layer_range[layer_num]
|
59 |
+
new_layer_name = ".".join(layer_name)
|
60 |
+
new_checkpoint[new_layer_name] = v
|
61 |
+
else:
|
62 |
+
new_checkpoint[k] = v
|
63 |
+
else:
|
64 |
+
new_checkpoint = checkpoint
|
65 |
+
return new_checkpoint
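For example, with a hypothetical n_layers=6 the index arithmetic above keeps every second layer of the 12-layer checkpoint and renumbers it:

```python
n_layers = 6              # hypothetical target depth
gap = int(12 / n_layers)  # 2
layer_range = {str(l): str(i) for i, l in enumerate(range(gap - 1, 12, gap))}
print(layer_range)        # {'1': '0', '3': '1', '5': '2', '7': '3', '9': '4', '11': '5'}
```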
|
66 |
+
|
67 |
+
|
68 |
+
def load_pretrained_weight(model, state_dict):
|
69 |
+
# Load from a PyTorch state_dict
|
70 |
+
old_keys = []
|
71 |
+
new_keys = []
|
72 |
+
for key in state_dict.keys():
|
73 |
+
new_key = None
|
74 |
+
if 'gamma' in key:
|
75 |
+
new_key = key.replace('gamma', 'weight')
|
76 |
+
if 'beta' in key:
|
77 |
+
new_key = key.replace('beta', 'bias')
|
78 |
+
if new_key:
|
79 |
+
old_keys.append(key)
|
80 |
+
new_keys.append(new_key)
|
81 |
+
for old_key, new_key in zip(old_keys, new_keys):
|
82 |
+
state_dict[new_key] = state_dict.pop(old_key)
|
83 |
+
|
84 |
+
missing_keys = []
|
85 |
+
unexpected_keys = []
|
86 |
+
error_msgs = []
|
87 |
+
# copy state_dict so _load_from_state_dict can modify it
|
88 |
+
metadata = getattr(state_dict, '_metadata', None)
|
89 |
+
state_dict = state_dict.copy()
|
90 |
+
if metadata is not None:
|
91 |
+
state_dict._metadata = metadata
|
92 |
+
|
93 |
+
def load(module, prefix=''):
|
94 |
+
local_metadata = ({} if metadata is None
|
95 |
+
else metadata.get(prefix[:-1], {}))
|
96 |
+
module._load_from_state_dict(
|
97 |
+
state_dict, prefix, local_metadata, True, missing_keys,
|
98 |
+
unexpected_keys, error_msgs)
|
99 |
+
for name, child in module._modules.items():
|
100 |
+
if child is not None:
|
101 |
+
load(child, prefix + name + '.')
|
102 |
+
start_prefix = ''
|
103 |
+
if not hasattr(model, 'roberta') and\
|
104 |
+
any(s.startswith('roberta.') for s in state_dict.keys()):
|
105 |
+
start_prefix = 'roberta.'
|
106 |
+
|
107 |
+
load(model, prefix=start_prefix)
|
108 |
+
if len(missing_keys) > 0:
|
109 |
+
logger.info("Weights of {} not initialized from "
|
110 |
+
"pretrained model: {}".format(
|
111 |
+
model.__class__.__name__, missing_keys))
|
112 |
+
if len(unexpected_keys) > 0:
|
113 |
+
logger.info("Weights from pretrained model not used in "
|
114 |
+
"{}: {}".format(
|
115 |
+
model.__class__.__name__, unexpected_keys))
|
116 |
+
if len(error_msgs) > 0:
|
117 |
+
raise RuntimeError('Error(s) in loading state_dict for '
|
118 |
+
'{}:\n\t{}'.format(
|
119 |
+
model.__class__.__name__,
|
120 |
+
"\n\t".join(error_msgs)))
|
121 |
+
return model
|
122 |
+
|
123 |
+
|
124 |
+
def pad_tensor_to_mul(tensor, dim=0, mul=8):
|
125 |
+
""" pad tensor to multiples (8 for tensor cores) """
|
126 |
+
t_size = list(tensor.size())
|
127 |
+
n_pad = mul - t_size[dim] % mul
|
128 |
+
if n_pad == mul:
|
129 |
+
n_pad = 0
|
130 |
+
padded_tensor = tensor
|
131 |
+
else:
|
132 |
+
t_size[dim] = n_pad
|
133 |
+
pad = torch.zeros(*t_size, dtype=tensor.dtype, device=tensor.device)
|
134 |
+
padded_tensor = torch.cat([tensor, pad], dim=dim)
|
135 |
+
return padded_tensor, n_pad
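A short usage sketch with a hypothetical tensor, showing the sequence length being padded up to the next multiple of 8 for tensor cores:

```python
import torch

t = torch.randn(13, 768)                      # 13 tokens, hypothetical size
padded, n_pad = pad_tensor_to_mul(t, dim=0)   # function defined above
print(padded.shape, n_pad)                    # torch.Size([16, 768]) 3
```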
|
model/qal/__init__.py
ADDED
File without changes
|
model/qal/query_aware_learning_module.py
ADDED
@@ -0,0 +1,92 @@
1 |
+
import torch
|
2 |
+
from torch import nn
|
3 |
+
|
4 |
+
import logging
|
5 |
+
logger = logging.getLogger(__name__)
|
6 |
+
|
7 |
+
try:
|
8 |
+
from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
|
9 |
+
except (ImportError, AttributeError) as e:
|
10 |
+
BertLayerNorm = torch.nn.LayerNorm
|
11 |
+
|
12 |
+
from utils.model_utils import mask_logits
|
13 |
+
import torch.nn.functional as F
|
14 |
+
|
15 |
+
|
16 |
+
class BiDirectionalAttention(nn.Module):
|
17 |
+
"""
|
18 |
+
Bi-directional attention flow
|
19 |
+
Perform query-to-video attention (Q2V) and video-to-query attention (V2Q)
|
20 |
+
Concatenate QDF features with a set of query-aware features to form the QAL feature
|
21 |
+
"""
|
22 |
+
|
23 |
+
def __init__(self, video_dim):
|
24 |
+
super(BiDirectionalAttention, self).__init__()
|
25 |
+
## Core attention for query-aware feature learning
|
26 |
+
self.similarity_weight = nn.Linear(video_dim * 3, 1, bias=False)
|
27 |
+
|
28 |
+
|
29 |
+
def forward(self, QDF_emb, query_emb,video_mask, query_mask):
|
30 |
+
"""
|
31 |
+
Inputs:
|
32 |
+
:param QDF_emb: (batch, L_v, feat_size)
|
33 |
+
:param query_emb: (batch, L_q, feat_size)
|
34 |
+
:param video_mask: (batch, L_v)
|
35 |
+
:param query_mask: (batch, L_q)
|
36 |
+
Return:
|
37 |
+
QAL: (batch, L_v, feat_size*4)
|
38 |
+
"""
|
39 |
+
|
40 |
+
## CREATE SIMILARITY MATRIX
|
41 |
+
video_len = QDF_emb.size()[1]
|
42 |
+
query_len = query_emb.size()[1]
|
43 |
+
|
44 |
+
_QDF_emb = QDF_emb.unsqueeze(2).repeat(1, 1, query_len, 1)
|
45 |
+
# [bs, video_len, 1, feat_size] => [bs, video_len, query_len, feat_size]
|
46 |
+
|
47 |
+
_query_emb = query_emb.unsqueeze(1).repeat(1, video_len, 1, 1)
|
48 |
+
# [bs, 1, query_len, feat_size] => [bs, video_len, query_len, feat_size]
|
49 |
+
|
50 |
+
elementwise_prod = torch.mul(_QDF_emb, _query_emb)
|
51 |
+
# [bs, video_len, query_len, feat_size]
|
52 |
+
|
53 |
+
alpha = torch.cat([_QDF_emb, _query_emb, elementwise_prod], dim=3)
|
54 |
+
# [bs, video_len, query_len, feat_size*3]
|
55 |
+
|
56 |
+
similarity_matrix = self.similarity_weight(alpha).view(-1, video_len, query_len)
|
57 |
+
|
58 |
+
similarity_matrix_mask = torch.einsum("bn,bm->bnm", video_mask, query_mask)
|
59 |
+
# [bs, video_len, query_len]
|
60 |
+
|
61 |
+
## CALCULATE Video2Query ATTENTION
|
62 |
+
|
63 |
+
a = F.softmax(mask_logits(similarity_matrix,
|
64 |
+
similarity_matrix_mask), dim=-1)
|
65 |
+
# [bs, video_len, query_len]
|
66 |
+
|
67 |
+
V2Q = torch.bmm(a, query_emb)
|
68 |
+
# [bs] ([video_len, query_len] X [query_len, feat_size]) => [bs, video_len, feat_size]
|
69 |
+
|
70 |
+
## CALCULATE Query2Video ATTENTION
|
71 |
+
|
72 |
+
b = F.softmax(torch.max(mask_logits(similarity_matrix, similarity_matrix_mask), 2)[0], dim=-1)
|
73 |
+
# [bs, video_len]
|
74 |
+
|
75 |
+
b = b.unsqueeze(1)
|
76 |
+
# [bs, 1, video_len]
|
77 |
+
|
78 |
+
Q2V = torch.bmm(b, QDF_emb)
|
79 |
+
# [bs] ([bs, 1, video_len] X [bs, video_len, feat_size]) => [bs, 1, feat_size]
|
80 |
+
|
81 |
+
Q2V = Q2V.repeat(1, video_len, 1)
|
82 |
+
# [bs, video_len, feat_size]
|
83 |
+
|
84 |
+
## Concatenate QDF_emb with three query-aware features
|
85 |
+
|
86 |
+
QAL = torch.cat([QDF_emb, V2Q,
|
87 |
+
torch.mul(QDF_emb, V2Q),
|
88 |
+
torch.mul(QDF_emb, Q2V)], dim=2)
|
89 |
+
|
90 |
+
# [bs, video_len, feat_size*4]
|
91 |
+
|
92 |
+
return QAL
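A shape-level sketch of this module's contract, reusing the class defined above with hypothetical sizes:

```python
import torch

bs, L_v, L_q, d = 2, 100, 30, 768
attn = BiDirectionalAttention(video_dim=d)
QDF = torch.randn(bs, L_v, d)
query = torch.randn(bs, L_q, d)
video_mask = torch.ones(bs, L_v)
query_mask = torch.ones(bs, L_q)

QAL = attn(QDF, query, video_mask, query_mask)
print(QAL.shape)   # torch.Size([2, 100, 3072]) == (bs, L_v, 4 * d)
```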
|
model/transformer/__init__.py
ADDED
File without changes
|
model/transformer/bert.py
ADDED
@@ -0,0 +1,275 @@
1 |
+
"""
|
2 |
+
BERT/RoBERTa layers from the huggingface implementation
|
3 |
+
(https://github.com/huggingface/transformers)
|
4 |
+
"""
|
5 |
+
|
6 |
+
import torch
|
7 |
+
import torch.nn as nn
|
8 |
+
import torch.nn.functional as F
|
9 |
+
from model.modeling_utils import prune_linear_layer
|
10 |
+
import math
|
11 |
+
import logging
|
12 |
+
logger = logging.getLogger(__name__)
|
13 |
+
try:
|
14 |
+
from apex.normalization.fused_layer_norm import FusedLayerNorm as BertLayerNorm
|
15 |
+
except (ImportError, AttributeError) as e:
|
16 |
+
BertLayerNorm = torch.nn.LayerNorm
|
17 |
+
|
18 |
+
|
19 |
+
def gelu(x):
|
20 |
+
""" Original Implementation of the gelu activation function
|
21 |
+
in the Google BERT repo when initially created.
|
22 |
+
For information: OpenAI GPT's gelu is slightly different
|
23 |
+
(and gives slightly different results):
|
24 |
+
0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi)
|
25 |
+
* (x + 0.044715 * torch.pow(x, 3))))
|
26 |
+
Also see https://arxiv.org/abs/1606.08415
|
27 |
+
"""
|
28 |
+
return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
|
29 |
+
|
30 |
+
|
31 |
+
def gelu_new(x):
|
32 |
+
""" Implementation of the gelu activation function currently
|
33 |
+
in Google Bert repo (identical to OpenAI GPT).
|
34 |
+
Also see https://arxiv.org/abs/1606.08415
|
35 |
+
"""
|
36 |
+
return 0.5 * x * (
|
37 |
+
1 + torch.tanh(
|
38 |
+
math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
|
39 |
+
|
40 |
+
def swish(x):
|
41 |
+
return x * torch.sigmoid(x)
|
42 |
+
|
43 |
+
|
44 |
+
ACT2FN = {
|
45 |
+
"gelu": gelu,
|
46 |
+
"relu": torch.nn.functional.relu,
|
47 |
+
"swish": swish, "gelu_new": gelu_new}
|
48 |
+
|
49 |
+
class BertSelfAttention(nn.Module):
|
50 |
+
def __init__(self, config):
|
51 |
+
super(BertSelfAttention, self).__init__()
|
52 |
+
if config.hidden_size % config.num_attention_heads != 0:
|
53 |
+
raise ValueError(
|
54 |
+
"The hidden size (%d) is not a multiple of "
|
55 |
+
"the number of attention heads (%d)" % (
|
56 |
+
config.hidden_size, config.num_attention_heads))
|
57 |
+
self.output_attentions = config.output_attentions
|
58 |
+
|
59 |
+
self.num_attention_heads = config.num_attention_heads
|
60 |
+
self.attention_head_size = int(
|
61 |
+
config.hidden_size / config.num_attention_heads)
|
62 |
+
self.all_head_size = self.num_attention_heads *\
|
63 |
+
self.attention_head_size
|
64 |
+
|
65 |
+
self.query = nn.Linear(config.hidden_size, self.all_head_size)
|
66 |
+
self.key = nn.Linear(config.hidden_size, self.all_head_size)
|
67 |
+
self.value = nn.Linear(config.hidden_size, self.all_head_size)
|
68 |
+
|
69 |
+
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
|
70 |
+
|
71 |
+
def transpose_for_scores(self, x):
|
72 |
+
new_x_shape = x.size()[:-1] + (
|
73 |
+
self.num_attention_heads, self.attention_head_size)
|
74 |
+
x = x.view(*new_x_shape)
|
75 |
+
return x.permute(0, 2, 1, 3)
|
76 |
+
|
77 |
+
def forward(self, hidden_states, attention_mask=None, head_mask=None):
|
78 |
+
mixed_query_layer = self.query(hidden_states)
|
79 |
+
mixed_key_layer = self.key(hidden_states)
|
80 |
+
mixed_value_layer = self.value(hidden_states)
|
81 |
+
|
82 |
+
query_layer = self.transpose_for_scores(mixed_query_layer)
|
83 |
+
key_layer = self.transpose_for_scores(mixed_key_layer)
|
84 |
+
value_layer = self.transpose_for_scores(mixed_value_layer)
|
85 |
+
|
86 |
+
# Take the dot product between "query"
|
87 |
+
# and "key" to get the raw attention scores.
|
88 |
+
attention_scores = torch.matmul(
|
89 |
+
query_layer, key_layer.transpose(-1, -2))
|
90 |
+
attention_scores = attention_scores / math.sqrt(
|
91 |
+
self.attention_head_size)
|
92 |
+
if attention_mask is not None:
|
93 |
+
# Apply the attention mask
|
94 |
+
# (precomputed for all layers in BertModel forward() function)
|
95 |
+
attention_scores = attention_scores + attention_mask
|
96 |
+
|
97 |
+
# Normalize the attention scores to probabilities.
|
98 |
+
attention_probs = nn.Softmax(dim=-1)(attention_scores)
|
99 |
+
|
100 |
+
# This is actually dropping out entire tokens to attend to, which might
|
101 |
+
# seem a bit unusual, but is taken from the original Transformer paper.
|
102 |
+
attention_probs = self.dropout(attention_probs)
|
103 |
+
|
104 |
+
# Mask heads if we want to
|
105 |
+
if head_mask is not None:
|
106 |
+
attention_probs = attention_probs * head_mask
|
107 |
+
|
108 |
+
context_layer = torch.matmul(attention_probs, value_layer)
|
109 |
+
|
110 |
+
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
|
111 |
+
new_context_layer_shape = context_layer.size()[:-2] + (
|
112 |
+
self.all_head_size,)
|
113 |
+
context_layer = context_layer.view(*new_context_layer_shape)
|
114 |
+
|
115 |
+
outputs = (context_layer, attention_probs)\
|
116 |
+
if self.output_attentions else (context_layer,)
|
117 |
+
return outputs
|
118 |
+
|
119 |
+
|
120 |
+
class BertSelfOutput(nn.Module):
|
121 |
+
def __init__(self, config):
|
122 |
+
super(BertSelfOutput, self).__init__()
|
123 |
+
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
124 |
+
self.LayerNorm = BertLayerNorm(
|
125 |
+
config.hidden_size, eps=config.layer_norm_eps)
|
126 |
+
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
127 |
+
|
128 |
+
def forward(self, hidden_states, input_tensor):
|
129 |
+
hidden_states = self.dense(hidden_states)
|
130 |
+
hidden_states = self.dropout(hidden_states)
|
131 |
+
hidden_states = self.LayerNorm(hidden_states + input_tensor)
|
132 |
+
return hidden_states
|
133 |
+
|
134 |
+
|
135 |
+
class BertAttention(nn.Module):
|
136 |
+
def __init__(self, config):
|
137 |
+
super(BertAttention, self).__init__()
|
138 |
+
self.self = BertSelfAttention(config)
|
139 |
+
self.output = BertSelfOutput(config)
|
140 |
+
self.pruned_heads = set()
|
141 |
+
|
142 |
+
def prune_heads(self, heads):
|
143 |
+
if len(heads) == 0:
|
144 |
+
return
|
145 |
+
mask = torch.ones(
|
146 |
+
self.self.num_attention_heads, self.self.attention_head_size)
|
147 |
+
# Convert to set and remove already pruned heads
|
148 |
+
heads = set(heads) - self.pruned_heads
|
149 |
+
for head in heads:
|
150 |
+
# Compute how many pruned heads are
|
151 |
+
# before the head and move the index accordingly
|
152 |
+
head = head - sum(1 if h < head else 0 for h in self.pruned_heads)
|
153 |
+
mask[head] = 0
|
154 |
+
mask = mask.view(-1).contiguous().eq(1)
|
155 |
+
index = torch.arange(len(mask))[mask].long()
|
156 |
+
|
157 |
+
# Prune linear layers
|
158 |
+
self.self.query = prune_linear_layer(self.self.query, index)
|
159 |
+
self.self.key = prune_linear_layer(self.self.key, index)
|
160 |
+
self.self.value = prune_linear_layer(self.self.value, index)
|
161 |
+
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
|
162 |
+
|
163 |
+
# Update hyper params and store pruned heads
|
164 |
+
self.self.num_attention_heads = self.self.num_attention_heads - len(
|
165 |
+
heads)
|
166 |
+
self.self.all_head_size =\
|
167 |
+
self.self.attention_head_size * self.self.num_attention_heads
|
168 |
+
self.pruned_heads = self.pruned_heads.union(heads)
|
169 |
+
|
170 |
+
def forward(self, input_tensor, attention_mask=None, head_mask=None):
|
171 |
+
self_outputs = self.self(input_tensor, attention_mask, head_mask)
|
172 |
+
attention_output = self.output(self_outputs[0], input_tensor)
|
173 |
+
# add attentions if we output them
|
174 |
+
outputs = (attention_output,) + self_outputs[1:]
|
175 |
+
return outputs
|
176 |
+
|
177 |
+
|
178 |
+
class BertIntermediate(nn.Module):
|
179 |
+
def __init__(self, config):
|
180 |
+
super(BertIntermediate, self).__init__()
|
181 |
+
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
|
182 |
+
if isinstance(config.hidden_act, str):
|
183 |
+
self.intermediate_act_fn = ACT2FN[config.hidden_act]
|
184 |
+
else:
|
185 |
+
self.intermediate_act_fn = config.hidden_act
|
186 |
+
|
187 |
+
def forward(self, hidden_states):
|
188 |
+
hidden_states = self.dense(hidden_states)
|
189 |
+
hidden_states = self.intermediate_act_fn(hidden_states)
|
190 |
+
return hidden_states
|
191 |
+
|
192 |
+
|
193 |
+
class BertOutput(nn.Module):
|
194 |
+
def __init__(self, config):
|
195 |
+
super(BertOutput, self).__init__()
|
196 |
+
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
|
197 |
+
self.LayerNorm = BertLayerNorm(
|
198 |
+
config.hidden_size, eps=config.layer_norm_eps)
|
199 |
+
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
200 |
+
|
201 |
+
def forward(self, hidden_states, input_tensor):
|
202 |
+
hidden_states = self.dense(hidden_states)
|
203 |
+
hidden_states = self.dropout(hidden_states)
|
204 |
+
hidden_states = self.LayerNorm(hidden_states + input_tensor)
|
205 |
+
return hidden_states
|
206 |
+
|
207 |
+
|
208 |
+
class BertLayer(nn.Module):
|
209 |
+
def __init__(self, config):
|
210 |
+
super(BertLayer, self).__init__()
|
211 |
+
self.attention = BertAttention(config)
|
212 |
+
self.intermediate = BertIntermediate(config)
|
213 |
+
self.output = BertOutput(config)
|
214 |
+
|
215 |
+
def forward(self, hidden_states, attention_mask=None, head_mask=None):
|
216 |
+
attention_outputs = self.attention(
|
217 |
+
hidden_states, attention_mask, head_mask)
|
218 |
+
attention_output = attention_outputs[0]
|
219 |
+
intermediate_output = self.intermediate(attention_output)
|
220 |
+
layer_output = self.output(intermediate_output, attention_output)
|
221 |
+
# add attentions if we output them
|
222 |
+
outputs = (layer_output,) + attention_outputs[1:]
|
223 |
+
return outputs
|
224 |
+
|
225 |
+
|
226 |
+
class BertEncoder(nn.Module):
|
227 |
+
def __init__(self, config):
|
228 |
+
super(BertEncoder, self).__init__()
|
229 |
+
self.output_attentions = config.output_attentions
|
230 |
+
self.output_hidden_states = config.output_hidden_states
|
231 |
+
self.layer = nn.ModuleList([BertLayer(config) for _ in range(
|
232 |
+
config.num_hidden_layers)])
|
233 |
+
|
234 |
+
def forward(self, hidden_states, attention_mask=None, head_mask=None):
|
235 |
+
|
236 |
+
# We create a 3D attention mask from a 2D tensor mask.
|
237 |
+
# Sizes are [batch_size, 1, 1, to_seq_length]
|
238 |
+
# So we can broadcast to
|
239 |
+
# [batch_size, num_heads, from_seq_length, to_seq_length]
|
240 |
+
extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
|
241 |
+
|
242 |
+
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
|
243 |
+
# masked positions, this operation will create a tensor which is 0.0 for
|
244 |
+
# positions we want to attend and -10000.0 for masked positions.
|
245 |
+
# Since we are adding it to the raw scores before the softmax, this is
|
246 |
+
# effectively the same as removing these entirely.
|
247 |
+
extended_attention_mask = extended_attention_mask.to(
|
248 |
+
dtype=next(self.parameters()).dtype) # fp16 compatibility
|
249 |
+
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
|
250 |
+
|
251 |
+
|
252 |
+
all_hidden_states = ()
|
253 |
+
all_attentions = ()
|
254 |
+
for i, layer_module in enumerate(self.layer):
|
255 |
+
if self.output_hidden_states:
|
256 |
+
all_hidden_states = all_hidden_states + (hidden_states,)
|
257 |
+
|
258 |
+
layer_outputs = layer_module(
|
259 |
+
hidden_states, extended_attention_mask, None)
|
260 |
+
hidden_states = layer_outputs[0]
|
261 |
+
|
262 |
+
if self.output_attentions:
|
263 |
+
all_attentions = all_attentions + (layer_outputs[1],)
|
264 |
+
|
265 |
+
# Add last layer
|
266 |
+
if self.output_hidden_states:
|
267 |
+
all_hidden_states = all_hidden_states + (hidden_states,)
|
268 |
+
|
269 |
+
outputs = (hidden_states,)
|
270 |
+
if self.output_hidden_states:
|
271 |
+
outputs = outputs + (all_hidden_states,)
|
272 |
+
if self.output_attentions:
|
273 |
+
outputs = outputs + (all_attentions,)
|
274 |
+
# last-layer hidden state, (all hidden states), (all attentions)
|
275 |
+
return outputs
|
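The encoder stack above is self-contained: it consumes pre-computed feature embeddings plus a 2D padding mask and returns the last-layer hidden states. A minimal usage sketch follows (not part of the commit; the SimpleNamespace config is a stand-in for whatever config object the repo passes in, using only the attribute names the classes above actually read):

from types import SimpleNamespace

import torch

from model.transformer.bert import BertEncoder

# Stand-in config: only the fields read by the BERT layers above (assumed values).
cfg = SimpleNamespace(
    hidden_size=768, num_hidden_layers=2, num_attention_heads=12,
    intermediate_size=3072, hidden_act="gelu",
    hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1,
    layer_norm_eps=1e-12, output_attentions=False, output_hidden_states=False)

encoder = BertEncoder(cfg)
feats = torch.randn(2, 20, cfg.hidden_size)  # (batch, seq_len, hidden) features
mask = torch.ones(2, 20)                     # 1 = attend, 0 = padding
last_hidden = encoder(feats, attention_mask=mask)[0]
print(last_hidden.shape)                     # torch.Size([2, 20, 768])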
model/transformer/bert_embed.py
ADDED
@@ -0,0 +1,64 @@
+"""
+Input Embedding Layers
+"""
+import torch
+import torch.nn as nn
+import logging
+
+
+logger = logging.getLogger(__name__)
+try:
+    import apex.normalization.fused_layer_norm.FusedLayerNorm as BertLayerNorm
+except (ImportError, AttributeError) as e:
+    logger.info(
+        "Better speed can be achieved with apex installed from "
+        "https://www.github.com/nvidia/apex ."
+    )
+    BertLayerNorm = torch.nn.LayerNorm
+
+
+class BertEmbeddings(nn.Module):
+    """Construct the embeddings from word, position and token_type embeddings."""
+
+    def __init__(self, config):
+        super(BertEmbeddings, self).__init__()
+        # self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
+        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
+        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
+
+        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
+        # any TensorFlow checkpoint file
+        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.dropout = nn.Dropout(config.hidden_dropout_prob)
+
+        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
+        # self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
+        # self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
+
+    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None):
+        if input_ids is not None:
+            input_shape = input_ids.size()
+        else:
+            input_shape = inputs_embeds.size()[:-1]
+
+        seq_length = input_shape[1]
+
+        if position_ids is None:
+            position_ids = self.position_ids[:, :seq_length]
+
+        if token_type_ids is None:
+            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)
+
+        if inputs_embeds is None:
+            inputs_embeds = self.word_embeddings(input_ids)
+        token_type_embeddings = self.token_type_embeddings(token_type_ids)
+
+        position_embeddings = self.position_embeddings(position_ids)
+
+        embeddings = inputs_embeds + token_type_embeddings + position_embeddings
+
+        embeddings = self.LayerNorm(embeddings)
+        embeddings = self.dropout(embeddings)
+        return embeddings
+
+
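Note that the word-embedding table and the registered position_ids buffer are both commented out in bert_embed.py, so BertEmbeddings is meant to receive pre-extracted features: callers should pass inputs_embeds, token_type_ids and position_ids explicitly, otherwise forward falls through to the missing self.word_embeddings / self.position_ids attributes. A small hedged sketch (again with a SimpleNamespace stand-in for the config):

from types import SimpleNamespace

import torch

from model.transformer.bert_embed import BertEmbeddings

# Assumed config values for illustration only.
cfg = SimpleNamespace(hidden_size=768, max_position_embeddings=512,
                      type_vocab_size=2, layer_norm_eps=1e-12,
                      hidden_dropout_prob=0.1)
embed = BertEmbeddings(cfg)

feats = torch.randn(2, 20, cfg.hidden_size)                 # pre-extracted features
token_type_ids = torch.zeros(2, 20, dtype=torch.long)       # single segment
position_ids = torch.arange(20).unsqueeze(0).expand(2, -1)  # explicit positions
out = embed(inputs_embeds=feats, token_type_ids=token_type_ids,
            position_ids=position_ids)
print(out.shape)  # torch.Size([2, 20, 768])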
ndcg_iou_topk.py
ADDED
@@ -0,0 +1,66 @@
+from utils.basic_utils import load_jsonl, save_jsonl, load_json
+import pandas as pd
+from tqdm import tqdm
+import numpy as np
+from collections import defaultdict
+import copy
+
+def calculate_iou(pred_start: float, pred_end: float, gt_start: float, gt_end: float) -> float:
+    intersection_start = max(pred_start, gt_start)
+    intersection_end = min(pred_end, gt_end)
+    intersection = max(0, intersection_end - intersection_start)
+    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
+    return intersection / union if union > 0 else 0
+
+
+# Function to calculate DCG
+def calculate_dcg(scores):
+    return sum((2**score - 1) / np.log2(idx + 2) for idx, score in enumerate(scores))
+
+# Function to calculate NDCG
+def calculate_ndcg(pred_scores, true_scores):
+    dcg = calculate_dcg(pred_scores)
+    idcg = calculate_dcg(sorted(true_scores, reverse=True))
+    return dcg / idcg if idcg > 0 else 0
+
+
+
+def calculate_ndcg_iou(all_gt, all_pred, TS, KS):
+    performance = defaultdict(lambda: defaultdict(list))
+    performance_avg = defaultdict(lambda: defaultdict(float))
+    for k in tqdm(all_pred.keys(), desc="Calculate NDCG"):
+        one_pred = all_pred[k]
+        one_gt = all_gt[k]
+
+        one_gt.sort(key=lambda x: x["relevance"], reverse=True)
+        for T in TS:
+            one_gt_drop = copy.deepcopy(one_gt)
+            predictions_with_scores = []
+
+            for pred in one_pred:
+                pred_video_name, pred_time = pred["video_name"], pred["timestamp"]
+                matched_rows = [gt for gt in one_gt_drop if gt["video_name"] == pred_video_name]
+                if not matched_rows:
+                    pred["pred_relevance"] = 0
+                else:
+                    ious = [calculate_iou(pred_time[0], pred_time[1], gt["timestamp"][0], gt["timestamp"][1]) for gt in matched_rows]
+                    max_iou_idx = np.argmax(ious)
+                    max_iou_row = matched_rows[max_iou_idx]
+
+                    if ious[max_iou_idx] > T:
+                        pred["pred_relevance"] = max_iou_row["relevance"]
+                        # Remove the matched ground truth row
+                        original_idx = one_gt_drop.index(max_iou_row)
+                        one_gt_drop.pop(original_idx)
+                    else:
+                        pred["pred_relevance"] = 0
+                predictions_with_scores.append(pred)
+            for K in KS:
+                true_scores = [gt["relevance"] for gt in one_gt][:K]
+                pred_scores = [pred["pred_relevance"] for pred in predictions_with_scores][:K]
+                ndcg_score = calculate_ndcg(pred_scores, true_scores)
+                performance[K][T].append(ndcg_score)
+    for K, vs in performance.items():
+        for T, v in vs.items():
+            performance_avg[K][T] = np.mean(v)
+    return performance_avg
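ndcg_iou_topk.py implements the ranking metric: each predicted moment is matched to the still-unmatched ground-truth moment of the same video with the highest IoU, inherits that moment's relevance only if the IoU exceeds the threshold T, and NDCG@K is then computed with the (2^rel - 1) / log2(rank + 1) gain. A toy call with invented data is sketched below (the real inputs come from the prediction JSONs and the TVR-Ranking annotations; the key and video names here are made up):

from ndcg_iou_topk import calculate_ndcg_iou  # assumes the repo root is on PYTHONPATH

# Invented example: one query with two ground-truth moments and two predictions.
all_gt = {
    "query_1": [
        {"video_name": "video_a", "timestamp": [10.0, 20.0], "relevance": 3},
        {"video_name": "video_b", "timestamp": [5.0, 9.0], "relevance": 1},
    ]
}
all_pred = {
    "query_1": [
        {"video_name": "video_a", "timestamp": [11.0, 19.0]},  # IoU 0.8 -> earns relevance 3
        {"video_name": "video_b", "timestamp": [50.0, 60.0]},  # no overlap -> relevance 0
    ]
}

avg = calculate_ndcg_iou(all_gt, all_pred, TS=[0.3, 0.5, 0.7], KS=[10, 20, 40])
print(avg[10][0.5])  # NDCG@10 at IoU > 0.5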
optim/adamw.py
ADDED
@@ -0,0 +1,106 @@
+"""
+AdamW optimizer (weight decay fix)
+originally from hugginface (https://github.com/huggingface/transformers).
+
+Copied from UNITER
+(https://github.com/ChenRocks/UNITER)
+"""
+import math
+
+import torch
+from torch.optim import Optimizer
+
+
+class AdamW(Optimizer):
+    """ Implements Adam algorithm with weight decay fix.
+    Parameters:
+        lr (float): learning rate. Default 1e-3.
+        betas (tuple of 2 floats): Adams beta parameters (b1, b2).
+            Default: (0.9, 0.999)
+        eps (float): Adams epsilon. Default: 1e-6
+        weight_decay (float): Weight decay. Default: 0.0
+        correct_bias (bool): can be set to False to avoid correcting bias
+            in Adam (e.g. like in Bert TF repository). Default True.
+    """
+    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6,
+                 weight_decay=0.0, correct_bias=True):
+        if lr < 0.0:
+            raise ValueError(
+                "Invalid learning rate: {} - should be >= 0.0".format(lr))
+        if not 0.0 <= betas[0] < 1.0:
+            raise ValueError("Invalid beta parameter: {} - "
+                             "should be in [0.0, 1.0[".format(betas[0]))
+        if not 0.0 <= betas[1] < 1.0:
+            raise ValueError("Invalid beta parameter: {} - "
+                             "should be in [0.0, 1.0[".format(betas[1]))
+        if not 0.0 <= eps:
+            raise ValueError("Invalid epsilon value: {} - "
+                             "should be >= 0.0".format(eps))
+        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay,
+                        correct_bias=correct_bias)
+        super(AdamW, self).__init__(params, defaults)
+
+    def step(self, closure=None):
+        """Performs a single optimization step.
+        Arguments:
+            closure (callable, optional): A closure that reevaluates the model
+                and returns the loss.
+        """
+        loss = None
+        if closure is not None:
+            loss = closure()
+
+        for group in self.param_groups:
+            for p in group['params']:
+                if p.grad is None:
+                    continue
+                grad = p.grad.data
+                if grad.is_sparse:
+                    raise RuntimeError(
+                        'Adam does not support sparse '
+                        'gradients, please consider SparseAdam instead')
+
+                state = self.state[p]
+
+                # State initialization
+                if len(state) == 0:
+                    state['step'] = 0
+                    # Exponential moving average of gradient values
+                    state['exp_avg'] = torch.zeros_like(p.data)
+                    # Exponential moving average of squared gradient values
+                    state['exp_avg_sq'] = torch.zeros_like(p.data)
+
+                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
+                beta1, beta2 = group['betas']
+
+                state['step'] += 1
+
+                # Decay the first and second moment running average coefficient
+                # In-place operations to update the averages at the same time
+                exp_avg.mul_(beta1).add_(grad, alpha=1.0 - beta1)
+                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)
+                denom = exp_avg_sq.sqrt().add_(group['eps'])
+
+                step_size = group['lr']
+                if group['correct_bias']:  # No bias correction for Bert
+                    bias_correction1 = 1.0 - beta1 ** state['step']
+                    bias_correction2 = 1.0 - beta2 ** state['step']
+                    step_size = (step_size * math.sqrt(bias_correction2)
+                                 / bias_correction1)
+
+                p.data.addcdiv_(exp_avg, denom, value=-step_size)
+
+                # Just adding the square of the weights to the loss function is
+                # *not* the correct way of using L2 regularization/weight decay
+                # with Adam, since that will interact with the m and v
+                # parameters in strange ways.
+                #
+                # Instead we want to decay the weights in a manner that doesn't
+                # interact with the m/v parameters. This is equivalent to
+                # adding the square of the weights to the loss with plain
+                # (non-momentum) SGD.
+                # Add weight decay at the end (fixed version)
+                if group['weight_decay'] > 0.0:
+                    p.data.add_(p.data, alpha=-group['lr'] * group['weight_decay'])
+
+        return loss
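The AdamW above decouples weight decay from the moment estimates: the decay is applied directly to the weights after the Adam update, per parameter group. A minimal usage sketch follows; the parameter grouping is illustrative and not necessarily how the repo's train.py builds its optimizer:

import torch
import torch.nn as nn

from optim.adamw import AdamW

model = nn.Linear(768, 2)
param_groups = [
    {"params": [model.weight], "weight_decay": 0.01},  # decayed parameters
    {"params": [model.bias], "weight_decay": 0.0},     # biases usually excluded from decay
]
optimizer = AdamW(param_groups, lr=1e-4)

loss = model(torch.randn(4, 768)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()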
results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a4d870ccff8ab61b72571cd7c9f84eb916d84fd7f091b2e300dfb9d4be5ee518
+size 29628
results/tvr-top01-2024_07_08_17_18_30/20240708_171830_conquer_top01_back.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ef85a542568c80fab7d57d69041ebd898e30d4fc912082bd4d571aea3ec6424c
+size 29917
results/tvr-top01-2024_07_08_17_18_30/best_test_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0becb2747c635a0080149ccb3e92975f7bf4bf3a99d025fd41d29ae9287db438
+size 14263264
results/tvr-top01-2024_07_08_17_18_30/best_val_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:47ced0079b54bdbc05268645d80c6fa52b1ed44c6e04f6922d535be29aa3fd8c
+size 2560976
results/tvr-top01-2024_07_08_17_18_30/code.zip
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:88b0711364459d5340f2e887420295145188a9008d5b50b5ddde46b221645c23
+size 1141392
results/tvr-top01-2024_07_08_17_18_30/model.ckpt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:aa2b8044636fe7ce9ab4d36df179ec2358f10a579de4ee5a7e58f338553558d2
+size 190742082
results/tvr-top01-2024_07_08_17_18_30/opt.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c93c28739229f5e35afc1239e1f30e0cad28353909eed88b6d65732943a5ac61
+size 1370
results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ea621825b2f1d618daf456f872246d6d50bd3729a36606c7cdcf75dcddbec57a
+size 30298
results/tvr-top20-2024_07_08_21_19_47/20240708_211947_conquer_top20_back.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:03b9976e0b0049f434e91251cfcde27b9a2334e95216d995ada4699f83d889c9
+size 31752
results/tvr-top20-2024_07_08_21_19_47/best_test_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:12895f4d15d70eff1737745bda045cf6fb1bf6e85aa4e8c4cdd86633cb70274a
+size 14324579
results/tvr-top20-2024_07_08_21_19_47/best_val_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:103076d328e1b7efdc2773625c38fc73a29492a67bcb27e023af73f8b21c8732
+size 2571786
results/tvr-top20-2024_07_08_21_19_47/code.zip
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:88b0711364459d5340f2e887420295145188a9008d5b50b5ddde46b221645c23
+size 1141392
results/tvr-top20-2024_07_08_21_19_47/model.ckpt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:baff5eaebb7f211640af4e21f2876be344eaa95431ab32398ac7260e9803471f
+size 190742082
results/tvr-top20-2024_07_08_21_19_47/opt.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:90d02a58cbb9a5ea0f23e3fefedd3f8f7b8852332b4877cfe7ba2833ca699071
+size 1368
results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:895455a13565da5f3d44126722152288a3057649fef1daa94d7558d490d97d81
+size 24491
results/tvr-top40-2024_07_11_10_58_46/20240711_105847_conquer_top40_back.log
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6085e3055b53b0afc63799813027a70b1d1999beeecf22b0accda3b5a60fe8cc
+size 26137
results/tvr-top40-2024_07_11_10_58_46/best_test_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5deaab54d6eec95172c5877b38dc72712f76b0357f26e255938a55835627ed2c
+size 14329598
results/tvr-top40-2024_07_11_10_58_46/best_val_predictions.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e9d7b68cde82958c1a7039210d2ac4bb5cfb5083abee6bbb550083395061a8a8
+size 2572649
results/tvr-top40-2024_07_11_10_58_46/code.zip
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:88e51fa09336f4a4545dc2e281cfe8cea943daf17de87c12b6b75d226fdb61dd
+size 1141399
results/tvr-top40-2024_07_11_10_58_46/model.ckpt
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5eba8e53656fed1ddcbb7d8129bd6c72862797c63684f11121a9a78c86b30c70
+size 190742082
results/tvr-top40-2024_07_11_10_58_46/opt.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0e03b5de0524d803c796aaef3fa4aaf1152cfae63644403e236262fe1a4663b3
+size 1368
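The results/** entries above are Git LFS pointer files (version / oid / size), not the artifacts themselves; the checkpoints, logs and prediction JSONs are fetched with git lfs pull after cloning. A small hedged helper (not part of the repo) to check whether a local copy has actually been materialised:

from pathlib import Path

def is_lfs_pointer(path: str) -> bool:
    """True if the file still holds the 3-line LFS pointer instead of the real artifact."""
    head = Path(path).read_bytes()[:80]
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")

print(is_lfs_pointer("results/tvr-top01-2024_07_08_17_18_30/model.ckpt"))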
run_disjoint_top01.sh
ADDED
@@ -0,0 +1,19 @@
+python train.py \
+    --model_name conquer \
+    --dataset_config config/tvr_ranking_data_config_top01.json \
+    --model_config config/model_config.json \
+    --eval_tasks_at_training VCMR \
+    --use_interal_vr_scores \
+    --use_extend_pool 500 \
+    --neg_video_num 0 \
+    --max_vcmr_video 10 \
+    --similarity_measure disjoint \
+    --bsz 196 \
+    --eval_query_bsz 8 \
+    --eval_num_per_epoch 0.05 \
+    --n_epoch 4000 \
+    --exp_id top01
+
+# qsub -I -l select=1:ngpus=1 -P gs_slab -q gpu8
+# cd 11_TVR-Ranking/CONQUER/; conda activate py11; sh run_disjoint_top01.sh
+
run_disjoint_top20.sh
ADDED
@@ -0,0 +1,19 @@
+python train.py \
+    --model_name conquer \
+    --dataset_config config/tvr_ranking_data_config_top20.json \
+    --model_config config/model_config.json \
+    --eval_tasks_at_training VCMR \
+    --use_interal_vr_scores \
+    --use_extend_pool 500 \
+    --neg_video_num 0 \
+    --max_vcmr_video 10 \
+    --similarity_measure disjoint \
+    --bsz 196 \
+    --eval_query_bsz 8 \
+    --eval_num_per_epoch 1 \
+    --n_epoch 200 \
+    --exp_id top20
+
+# qsub -I -l select=1:ngpus=1 -P gs_slab -q gpu8
+# cd 11_TVR-Ranking/CONQUER/; conda activate py11; sh run_disjoint_top20.sh
+
run_disjoint_top40.sh
ADDED
@@ -0,0 +1,19 @@
+python train.py \
+    --model_name conquer \
+    --dataset_config config/tvr_ranking_data_config_top40.json \
+    --model_config config/model_config.json \
+    --eval_tasks_at_training VCMR \
+    --use_interal_vr_scores \
+    --use_extend_pool 500 \
+    --neg_video_num 0 \
+    --max_vcmr_video 10 \
+    --similarity_measure disjoint \
+    --bsz 196 \
+    --eval_query_bsz 8 \
+    --eval_num_per_epoch 2 \
+    --n_epoch 100 \
+    --exp_id top40
+
+# qsub -I -l select=1:ngpus=1 -P gs_slab -q gpu8
+# cd 11_TVR-Ranking/CONQUER/; conda activate py11; sh run_disjoint_top40.sh
+
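The three run scripts share every flag except the dataset config, evaluation frequency, epoch budget and experiment id. For reference, a hedged Python equivalent that launches the same three configurations (it assumes train.py and the config files referenced above are present; it is not part of the commit):

import subprocess

# (dataset_config, eval_num_per_epoch, n_epoch) per experiment, copied from the scripts above.
RUNS = {
    "top01": ("config/tvr_ranking_data_config_top01.json", "0.05", "4000"),
    "top20": ("config/tvr_ranking_data_config_top20.json", "1", "200"),
    "top40": ("config/tvr_ranking_data_config_top40.json", "2", "100"),
}

for exp_id, (data_cfg, eval_per_epoch, n_epoch) in RUNS.items():
    subprocess.run([
        "python", "train.py",
        "--model_name", "conquer",
        "--dataset_config", data_cfg,
        "--model_config", "config/model_config.json",
        "--eval_tasks_at_training", "VCMR",
        "--use_interal_vr_scores",
        "--use_extend_pool", "500",
        "--neg_video_num", "0",
        "--max_vcmr_video", "10",
        "--similarity_measure", "disjoint",
        "--bsz", "196",
        "--eval_query_bsz", "8",
        "--eval_num_per_epoch", eval_per_epoch,
        "--n_epoch", n_epoch,
        "--exp_id", exp_id,
    ], check=True)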