fschlatt committed
Commit bb86129
Parent: 1b207b2

update readme

Files changed (4)
  1. README.md +28 -0
  2. configs/fine-tune.yaml +62 -0
  3. configs/index.yaml +25 -0
  4. configs/search.yaml +25 -0
README.md CHANGED
@@ -1,3 +1,31 @@
 ---
 license: apache-2.0
 ---
+
+# Lightning IR ColBERT
+
+This is a ColBERT[^1] model fine-tuned using [Lightning IR](https://github.com/webis-de/lightning-ir).
+
+See the [Lightning IR Model Zoo](https://webis-de.github.io/lightning-ir/models.html) for a comparison with other models.
+
+## Reproduction
+
+To reproduce the model training, install Lightning IR and run the following command with the [fine-tune.yaml](./configs/fine-tune.yaml) configuration file:
+
+```bash
+lightning-ir fit --config fine-tune.yaml
+```
+
+To index MS MARCO passages, use the following command with the [index.yaml](./configs/index.yaml) configuration file:
+
+```bash
+lightning-ir index --config index.yaml
+```
+
+After indexing, evaluate the model on TREC Deep Learning 2019 and 2020 using the following command with the [search.yaml](./configs/search.yaml) configuration file:
+
+```bash
+lightning-ir search --config search.yaml
+```
+
+[^1]: Khattab and Zaharia, [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://dl.acm.org/doi/abs/10.1145/3397271.3401075)
configs/fine-tune.yaml ADDED
@@ -0,0 +1,62 @@
+# lightning.pytorch==2.3.3
+seed_everything: 0
+trainer:
+  precision: bf16-mixed
+  max_steps: 50000
+data:
+  class_path: lightning_ir.LightningIRDataModule
+  init_args:
+    num_workers: 1
+    train_batch_size: 64
+    shuffle_train: true
+    train_dataset:
+      class_path: lightning_ir.RunDataset
+      init_args:
+        run_path_or_id: msmarco-passage/train/rank-distillm/set-encoder
+        depth: 100
+        sample_size: 8
+        sampling_strategy: log_random
+        targets: score
+        normalize_targets: false
+model:
+  class_path: lightning_ir.BiEncoderModule
+  init_args:
+    model_name_or_path: bert-base-uncased
+    config:
+      class_path: lightning_ir.ColConfig
+      init_args:
+        similarity_function: dot
+        query_expansion: true
+        attend_to_query_expanded_tokens: true
+        query_mask_scoring_tokens: null
+        doc_mask_scoring_tokens: punctuation
+        query_aggregation_function: mean
+        normalize: false
+        add_marker_tokens: false
+        embedding_dim: 128
+        projection: linear
+        query_pooling_strategy: mean
+        doc_expansion: false
+        attend_to_doc_expanded_tokens: false
+        doc_pooling_strategy: mean
+        sparsification: null
+        query_length: 32
+        doc_length: 256
+    loss_functions:
+      - class_path: lightning_ir.SupervisedMarginMSE
+      - class_path: lightning_ir.KLDivergence
+      - class_path: lightning_ir.InBatchCrossEntropy
+        init_args:
+          pos_sampling_technique: first
+          neg_sampling_technique: first
+          max_num_neg_samples: 8
+optimizer:
+  class_path: torch.optim.AdamW
+  init_args:
+    lr: 2.0e-05
+lr_scheduler:
+  class_path: lightning_ir.LinearLRSchedulerWithLinearWarmup
+  init_args:
+    num_warmup_steps: 5000
+    final_value: 0.02
+    num_delay_steps: 0
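
Before committing to the full 50,000-step run, it can help to verify that data loading and the model wiring work. A minimal sketch, assuming the standard Lightning CLI behaviour of merging multiple `--config` files in order; the file name and the override values are illustrative and not part of this repository:

```yaml
# Hypothetical smoke-test.yaml, merged on top of fine-tune.yaml via
#   lightning-ir fit --config fine-tune.yaml --config smoke-test.yaml
# Lightning's CLI merges --config files in order; later values override earlier ones.
trainer:
  max_steps: 100          # short sanity-check run instead of the full 50,000 steps
data:
  init_args:
    train_batch_size: 8   # assumption: partial init_args overrides merge into the base config
    num_workers: 4
```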
configs/index.yaml ADDED
@@ -0,0 +1,25 @@
+trainer:
+  logger: false
+  callbacks:
+    - class_path: lightning_ir.IndexCallback
+      init_args:
+        index_dir: ./index
+        index_config:
+          class_path: FaissIVFPQIndexConfig
+          init_args:
+            num_centroids: 262144
+            num_subquantizers: 16
+            n_bits: 8
+model:
+  class_path: lightning_ir.BiEncoderModule
+  init_args:
+    model_name_or_path: webis/bert-bi-encoder
+data:
+  class_path: lightning_ir.LightningIRDataModule
+  init_args:
+    num_workers: 1
+    inference_batch_size: 256
+    inference_datasets:
+      - class_path: DocDataset
+        init_args:
+          doc_dataset: msmarco-passage
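
The FaissIVFPQIndexConfig above is sized for the full MS MARCO passage collection: 262,144 IVF centroids and 16 subquantizers at 8 bits each, i.e. roughly 16 bytes per stored token embedding. For a quick functional test on a much smaller document set, the cluster count should shrink with the corpus, since training that many centroids requires a correspondingly large number of embeddings. A sketch with illustrative values that are not part of this repository:

```yaml
# Illustrative scaled-down index settings for a small test collection
# (index.yaml targets the full MS MARCO passage corpus).
trainer:
  callbacks:
    - class_path: lightning_ir.IndexCallback
      init_args:
        index_dir: ./index-test
        index_config:
          class_path: FaissIVFPQIndexConfig
          init_args:
            num_centroids: 4096    # far fewer IVF clusters for a small corpus
            num_subquantizers: 16  # 16 subquantizers x 8 bits ≈ 16 bytes per embedding
            n_bits: 8
```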
configs/search.yaml ADDED
@@ -0,0 +1,25 @@
+
+trainer:
+  logger: false
+  callbacks:
+    - class_path: SearchCallback
+      init_args:
+        index_dir: ./index
+        use_gpu: false
+        search_config:
+          class_path: FaissSearchConfig
+          init_args:
+            k: 10
+model:
+  class_path: lightning_ir.BiEncoderModule
+  init_args:
+    model_name_or_path: webis/bert-bi-encoder
+    evaluation_metrics:
+      - nDCG@10
+data:
+  class_path: lightning_ir.LightningIRDataModule
+  init_args:
+    inference_datasets:
+      - class_path: QueryDataset
+        init_args:
+          query_dataset: msmarco-passage/trec-dl-2019/judged
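
The README evaluates on TREC Deep Learning 2019 and 2020, while search.yaml lists only the 2019 query set. A minimal sketch of the data section with both query sets (the `msmarco-passage/trec-dl-2020/judged` identifier follows the ir_datasets naming used above; treat the exact layout as an assumption and check it against the Lightning IR documentation):

```yaml
# Illustrative: run search and evaluation over both TREC DL query sets.
data:
  init_args:
    inference_datasets:
      - class_path: QueryDataset
        init_args:
          query_dataset: msmarco-passage/trec-dl-2019/judged
      - class_path: QueryDataset
        init_args:
          query_dataset: msmarco-passage/trec-dl-2020/judged
```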