Sampling Negative Examples for training bge-m3
Hello!
First of all, thank you for your incredible and outstanding work!
My question is pretty simple, but might be tricky depending on the dataset: which approach should I use for sampling negative examples in order to fine-tune the bge-m3 model (as provided in the project repo) for multilingual semantic search?
Currently I have pairs of (query, document) and need to generate negative examples for each record.
There are plenty of approaches, such as:
- Explicit Negatives - not really applicable because the data is static.
- Random Negatives - taking a random document from another record (see the sketch after this list).
- BM25 Negatives - quite popular, but has some biases when it comes to retrieving documents.
- Gold Negatives - similar to random negatives, but takes the gold document of another specific record.
- In-batch Negatives - using the other documents in the same batch as negatives.
- Cross-batch Negatives - quite complicated, but shares negatives across GPUs/batches.
- Approximate Nearest Neighbors - quite costly, not sure if it is reasonable to use.
- Hybrid - uses a combination of BM25 and another approach.
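For concreteness, here is a minimal sketch of the random-negatives option above, assuming the (query, document) pairs are held in a plain Python list; all names are hypothetical, and the output dict only loosely follows the query/pos/neg jsonl layout used for fine-tuning data in the repo:

```python
import random

def add_random_negatives(pairs, num_negs=1, seed=42):
    """For each (query, document) pair, sample negatives from other records' documents."""
    rng = random.Random(seed)
    docs = [doc for _, doc in pairs]
    examples = []
    for i, (query, pos_doc) in enumerate(pairs):
        # Candidate pool: every document except the one paired with this query.
        candidates = docs[:i] + docs[i + 1:]
        negs = rng.sample(candidates, k=min(num_negs, len(candidates)))
        examples.append({"query": query, "pos": [pos_doc], "neg": negs})
    return examples

# Usage (toy data):
# pairs = [("what is bge-m3?", "BGE-M3 is a multilingual embedding model."), ...]
# train_data = add_random_negatives(pairs, num_negs=7)
```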
What can you suggest from your model's point of view? I could try out all of them, but I would prefer to save some time and follow the previously used approach.
Thank you in advance.
Thanks for your interest in our work!
We recommend using In-batch Negatives and BM25 Negatives/Approximate Nearest Neighbors.
We provide a script to mine hard negatives (Approximate Nearest Neighbors). And for the fine-tuning script, you can set use_inbatch_neg
(default value is True) to use in-batch negatives.
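As a rough illustration of what ANN-based hard negative mining does conceptually (this is not the repo's actual mining script; the model name, parameters, and the use of sentence-transformers/faiss here are assumptions for the sketch):

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def mine_hard_negatives(pairs, model_name="BAAI/bge-m3", top_k=30, num_negs=7):
    """Encode all documents, then for each query keep its nearest neighbors
    (excluding the positive) as hard negatives."""
    model = SentenceTransformer(model_name)
    queries = [q for q, _ in pairs]
    docs = [d for _, d in pairs]

    doc_emb = model.encode(docs, normalize_embeddings=True, convert_to_numpy=True)
    query_emb = model.encode(queries, normalize_embeddings=True, convert_to_numpy=True)

    # Inner product equals cosine similarity after normalization.
    index = faiss.IndexFlatIP(doc_emb.shape[1])
    index.add(doc_emb.astype(np.float32))
    _, neighbors = index.search(query_emb.astype(np.float32), top_k)

    data = []
    for i, (query, pos_doc) in enumerate(pairs):
        # Skip the positive itself; keep the closest "wrong" documents as hard negatives.
        negs = [docs[j] for j in neighbors[i] if docs[j] != pos_doc][:num_negs]
        data.append({"query": query, "pos": [pos_doc], "neg": negs})
    return data
```

In the fine-tuning script itself, leaving use_inbatch_neg at its default (True) then combines such mined hard negatives with in-batch negatives.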
Hello!
Thanks for your quick response!
I am trying to reproduce (in a Jupyter notebook) the script you mentioned in your comment and am getting the following error:
File ~/anaconda3/envs/ai/lib/python3.11/site-packages/transformers/trainer.py:349, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
346 self._memory_tracker.start()
348 # set the correct log level depending on the node
--> 349 log_level = args.get_process_log_level()
350 logging.set_verbosity(log_level)
352 # force device and distributed setup init explicitly
TypeError: TrainingArguments.get_process_log_level() missing 1 required positional argument: 'self'
The transformers version is 4.33.0 (as specified in the setup file).
If I upgrade transformers to the latest version (4.39.2), the following error arises:
File ~/anaconda3/envs/ai/lib/python3.11/site-packages/transformers/trainer.py:4255, in Trainer.create_accelerator_and_postprocess(self)
4249 gradient_accumulation_plugin = GradientAccumulationPlugin(**grad_acc_kwargs)
4251 # create accelerator object
4252 self.accelerator = Accelerator(
4253 deepspeed_plugin=self.args.deepspeed_plugin,
4254 gradient_accumulation_plugin=gradient_accumulation_plugin,
-> 4255 **self.args.accelerator_config.to_dict(),
4256 )
4257 # some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
4258 self.gather_function = self.accelerator.gather_for_metrics
AttributeError: 'NoneType' object has no attribute 'to_dict'
I have tried to set the additional accelerator_config argument in the RetrieverTrainingArguments class, but without success. Even passing an instance of the AcceleratorConfig class from transformers.trainer_pt_utils doesn't help, despite the fact that this is suggested in the transformers documentation code.
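Roughly, the workaround I tried looked like this (a simplified sketch; RetrieverTrainingArguments is assumed to be imported from the repo's fine-tuning code, and the field values are placeholders):

```python
from transformers.trainer_pt_utils import AcceleratorConfig

# RetrieverTrainingArguments is the TrainingArguments subclass from the repo's
# fine-tuning code; the field values below are placeholders.
training_args = RetrieverTrainingArguments(
    output_dir="./bge-m3-finetune",
    per_device_train_batch_size=4,
)

# Explicitly attach an AcceleratorConfig so that
# self.args.accelerator_config.to_dict() no longer dereferences None.
training_args.accelerator_config = AcceleratorConfig()
```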
Any ideas or thoughts on how I can get rid of either of those errors?
FYI: I opened an issue on the transformers GitHub repo.
Closing this, as it was fixed (see the link to GitHub).