Sampling Negative Examples for training bge-m3

#32
by mehti - opened

Hello!

First of all, thank you for your incredible and outstanding work!

My question is pretty simple, but might be tricky depending on the dataset: which approach should I use for sampling negative examples in order to fine-tune the bge-m3 model, as described in the project repo, for multilingual semantic search?
Currently I have pairs of (query, document) and need to generate negative examples per record.

There are plenty of approaches, such as:

  1. Explicit Negatives - not really applicable, because the data is static.
  2. Random Negatives - taking a random document from another record.
  3. BM25 Negatives - quite popular, but introduces some biases when retrieving documents.
  4. Gold Negatives - similar to random negatives, but taking a specific document.
  5. In-batch Negatives - taking negatives from the other examples in the same batch.
  6. Cross-batch Negatives - quite complicated, but shares negatives across GPUs during sampling.
  7. Approximate Nearest Neighbors - quite costly; not sure if it is reasonable to use.
  8. Hybrid - combines BM25 with another approach.
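To make sure I understand option 5 correctly, here is a minimal numpy sketch of in-batch negatives (illustrative only, not the actual bge-m3 training code): for a batch of (query, document) pairs, each query's own document is the positive and every other document in the batch acts as a negative.

```python
import numpy as np

def in_batch_negative_loss(q_emb, d_emb, temperature=0.05):
    """InfoNCE-style loss over a batch of (query, document) pairs.
    Diagonal entries of the similarity matrix are the positives;
    off-diagonal entries are the in-batch negatives."""
    # L2-normalize so the dot product is cosine similarity
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    scores = q @ d.T / temperature            # shape (batch, batch)
    # row-wise log-softmax; the target index for row i is column i
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
documents = queries + 0.01 * rng.normal(size=(4, 8))  # near-duplicates as positives
loss = in_batch_negative_loss(queries, documents)
```

The appeal is that the negatives come for free: a batch of N pairs yields N - 1 negatives per query without any extra mining step.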

What would you suggest from your model's point of view? I could try all of them, but I would prefer to save some time and follow the approach you used.

Thank you in advance.

Beijing Academy of Artificial Intelligence org
edited Mar 26

Thanks for your interest in our work!
We recommend using In-batch Negatives and BM25 Negatives/Approximate Nearest Neighbors.
We provide a script to mine hard negatives (Approximate Nearest Neighbors). For the fine-tuning script, you can set use_inbatch_neg (default value is True) to use in-batch negatives.
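The mining step can be sketched roughly as follows (the function name and parameters here are illustrative, not the repo's actual script, and brute-force cosine similarity stands in for a real ANN index such as faiss): retrieve the top-ranked documents for each query, drop the gold positive, and sample negatives from a rank window so the very top hits, which may be false negatives, are skipped.

```python
import numpy as np

def mine_hard_negatives(q_emb, corpus_emb, positive_ids,
                        window=(2, 10), num_negs=3, seed=0):
    """For each query, sample hard negatives from documents ranked
    inside `window` by similarity, excluding the gold positive."""
    rng = np.random.default_rng(seed)
    # normalize so the dot product is cosine similarity
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ c.T                                  # (num_queries, corpus_size)
    negatives = []
    for i, pos in enumerate(positive_ids):
        ranked = np.argsort(-sims[i])               # best match first
        # skip the very top ranks (likely duplicates / false negatives)
        candidates = [j for j in ranked[window[0]:window[1]] if j != pos]
        picked = rng.choice(candidates, size=num_negs, replace=False)
        negatives.append(picked.tolist())
    return negatives
```

Swapping the brute-force similarity for an approximate index keeps the same logic while scaling to large corpora.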

Hello!

Thanks for your quick response!

I am trying to reproduce the script you mentioned in your comment (in a Jupyter notebook) and am getting the following error:

File ~/anaconda3/envs/ai/lib/python3.11/site-packages/transformers/trainer.py:349, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    346 self._memory_tracker.start()
    348 # set the correct log level depending on the node
--> 349 log_level = args.get_process_log_level()
    350 logging.set_verbosity(log_level)
    352 # force device and distributed setup init explicitly

TypeError: TrainingArguments.get_process_log_level() missing 1 required positional argument: 'self'

The transformers version is 4.33.0 (as specified in the setup file).
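For what it's worth, that exact "missing 1 required positional argument: 'self'" message is what Python raises when an instance method is called on the class object itself instead of on an instance, which can happen in a notebook if the arguments class is passed along without being instantiated. A minimal reproduction with a hypothetical stand-in class (not the real TrainingArguments):

```python
class DemoArguments:                 # stand-in, not the real TrainingArguments
    def get_process_log_level(self):
        return "info"

args = DemoArguments                 # note: the class, not DemoArguments()
try:
    args.get_process_log_level()     # raises TypeError
except TypeError as e:
    print(e)                         # same "missing 1 required positional argument" message
```

Whether that is actually what happens inside the script here, I cannot say for certain.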

If I upgrade transformers to the latest version (4.39.2), the following error arises:

File ~/anaconda3/envs/ai/lib/python3.11/site-packages/transformers/trainer.py:4255, in Trainer.create_accelerator_and_postprocess(self)
   4249 gradient_accumulation_plugin = GradientAccumulationPlugin(**grad_acc_kwargs)
   4251 # create accelerator object
   4252 self.accelerator = Accelerator(
   4253     deepspeed_plugin=self.args.deepspeed_plugin,
   4254     gradient_accumulation_plugin=gradient_accumulation_plugin,
-> 4255     **self.args.accelerator_config.to_dict(),
   4256 )
   4257 # some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
   4258 self.gather_function = self.accelerator.gather_for_metrics

AttributeError: 'NoneType' object has no attribute 'to_dict'

I have tried setting the additional accelerator_config argument on the RetrieverTrainingArguments class, but without success. Even instantiating the AcceleratorConfig class from transformers.trainer_pt_utils doesn't help, despite it being suggested in the transformers documentation code.

Any ideas or thoughts on how I can get rid of either of these errors?

FYI: I opened an issue on the transformers GitHub repo.

Closing this, as it was fixed (see the GitHub link above).

mehti changed discussion status to closed
