Sampling Negative Examples for training bge-m3

#32
by mehti - opened

Hello!

First of all, thank you for your incredible and outstanding work!

My question is pretty simple, but might be tricky depending on the dataset: which approach should I use for sampling negative examples in order to fine-tune the bge-m3 model, as described in the project repo, for multilingual semantic search?
Currently I have pairs of (query, document) and need to generate negative examples per record.

There are plenty of approaches, such as:

  1. Explicit Negatives - not really applicable, because the data is static.
  2. Random Negatives - taking a random document from another record.
  3. BM25 Negatives - quite popular, but introduces some biases when retrieving documents.
  4. Gold Negatives - similar to random negatives, but taking a specific document.
  5. In-batch Negatives - taking negatives from the other examples in the same batch.
  6. Cross-batch Negatives - quite complicated, but shares negatives across GPUs during sampling.
  7. Approximate Nearest Neighbors - quite costly; not sure if it is reasonable to use.
  8. Hybrid - combines BM25 with another approach.
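To make sure I understand option 5 correctly, here is a minimal numpy sketch of in-batch negatives (illustrative only, not the actual bge-m3 training code): for a batch of (query, document) pairs, each query's own document is the positive and every other document in the batch acts as a negative.

```python
import numpy as np

def in_batch_negative_loss(q_emb, d_emb, temperature=0.05):
    """InfoNCE-style loss over a batch of (query, document) pairs.
    Diagonal entries of the similarity matrix are the positives;
    off-diagonal entries are the in-batch negatives."""
    # L2-normalize so the dot product is cosine similarity
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    d = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)
    scores = q @ d.T / temperature            # shape (batch, batch)
    # row-wise log-softmax; the target index for row i is column i
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
documents = queries + 0.01 * rng.normal(size=(4, 8))  # near-duplicates as positives
loss = in_batch_negative_loss(queries, documents)
```

The appeal is that the negatives come for free: a batch of N pairs yields N - 1 negatives per query without any extra mining step.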

What would you suggest from your model's point of view? I could try all of them, but I would prefer to save some time and follow the approach you used.

Thank you in advance.

Beijing Academy of Artificial Intelligence org
edited Mar 26

Thanks for your interest in our work!
We recommend using In-batch Negatives and BM25 Negatives/Approximate Nearest Neighbors.
We provide a script to mine hard negatives (Approximate Nearest Neighbors). For the fine-tuning script, you can set use_inbatch_neg (default value is True) to use in-batch negatives.
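The mining step can be sketched roughly as follows (the function name and parameters here are illustrative, not the repo's actual script, and brute-force cosine similarity stands in for a real ANN index such as faiss): retrieve the top-ranked documents for each query, drop the gold positive, and sample negatives from a rank window so the very top hits, which may be false negatives, are skipped.

```python
import numpy as np

def mine_hard_negatives(q_emb, corpus_emb, positive_ids,
                        window=(2, 10), num_negs=3, seed=0):
    """For each query, sample hard negatives from documents ranked
    inside `window` by similarity, excluding the gold positive."""
    rng = np.random.default_rng(seed)
    # normalize so the dot product is cosine similarity
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ c.T                                  # (num_queries, corpus_size)
    negatives = []
    for i, pos in enumerate(positive_ids):
        ranked = np.argsort(-sims[i])               # best match first
        # skip the very top ranks (likely duplicates / false negatives)
        candidates = [j for j in ranked[window[0]:window[1]] if j != pos]
        picked = rng.choice(candidates, size=num_negs, replace=False)
        negatives.append(picked.tolist())
    return negatives
```

Swapping the brute-force similarity for an approximate index keeps the same logic while scaling to large corpora.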

Hello!

Thanks for your quick response!

I am trying to reproduce the script you mentioned in your comment (in a Jupyter notebook) and am getting the following error:

File ~/anaconda3/envs/ai/lib/python3.11/site-packages/transformers/trainer.py:349, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    346 self._memory_tracker.start()
    348 # set the correct log level depending on the node
--> 349 log_level = args.get_process_log_level()
    350 logging.set_verbosity(log_level)
    352 # force device and distributed setup init explicitly

TypeError: TrainingArguments.get_process_log_level() missing 1 required positional argument: 'self'

The transformers version is 4.33.0 (as specified in the setup file).
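For what it's worth, that exact "missing 1 required positional argument: 'self'" message is what Python raises when an instance method is called on the class object itself instead of on an instance, which can happen in a notebook if the arguments class is passed along without being instantiated. A minimal reproduction with a hypothetical stand-in class (not the real TrainingArguments):

```python
class DemoArguments:                 # stand-in, not the real TrainingArguments
    def get_process_log_level(self):
        return "info"

args = DemoArguments                 # note: the class, not DemoArguments()
try:
    args.get_process_log_level()     # raises TypeError
except TypeError as e:
    print(e)                         # same "missing 1 required positional argument" message
```

Whether that is actually what happens inside the script here, I cannot say for certain.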

If I upgrade transformers to the latest version (4.39.2), the following error arises:

File ~/anaconda3/envs/ai/lib/python3.11/site-packages/transformers/trainer.py:4255, in Trainer.create_accelerator_and_postprocess(self)
   4249 gradient_accumulation_plugin = GradientAccumulationPlugin(**grad_acc_kwargs)
   4251 # create accelerator object
   4252 self.accelerator = Accelerator(
   4253     deepspeed_plugin=self.args.deepspeed_plugin,
   4254     gradient_accumulation_plugin=gradient_accumulation_plugin,
-> 4255     **self.args.accelerator_config.to_dict(),
   4256 )
   4257 # some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
   4258 self.gather_function = self.accelerator.gather_for_metrics

AttributeError: 'NoneType' object has no attribute 'to_dict'

I have tried setting the additional accelerator_config argument on the RetrieverTrainingArguments class, but without success. Even instantiating the AcceleratorConfig class from transformers.trainer_pt_utils doesn't help, despite it being suggested in the transformers documentation code.

Any ideas or thoughts on how I can get rid of either of these errors?

FYI: I opened an issue on the transformers GitHub repo.

Closing this, as it was fixed (see the GitHub link above).

mehti changed discussion status to closed
