How to use smokxy/bge_pairs with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("smokxy/bge_pairs")
sentences = [
"What is packaged drinking water?",
"'26.6 Packaged Drinking Water (other than mineral water) It can be defined as water derived from the surface water or underground water or sea water which is subjected to herein-under specified treatments, namely decantation, filtration, combination of filtration, aerations, filtration with membrane filter depth filter, cartridge filter, activated carbon filtration, de-mineralization, remineralization, reverse osmosis and packed after disinfecting the water to a level that shall not lead any harmful contamination in the drinking water by means of chemical agents or physical methods to reduce the number of micro-organisms to level beyond scientifically accepted level for foods safety or its susceptibility. The standards, packaging and labelling requirements have also been specified under FSSAI rules.'",
"'Some fruit or vegetable powders are produced from juices, concentrates, or pulps by using a spray drying technique. Dry powders can be directly used as important constituents of dry soups, yogurt, etc. The drying is achieved by spraying of the slurry into an airstream at a temperature of 138°C to 150°C and introducing cold dry air either into the outlet end of the dryer or to the dryer walls to cool them to 38°C– 50°C. The most commonly used atomizers are rotary wheel and single-fluid pressure nozzle. A wide range of fruit and vegetable powders can be dried, agglomerated, and instantized in spray drying units, specially equipped with an internal static fluidized bed, integral filter, or external vibrofluidizer. Bananas, peaches, apricots, and to a lesser extent citrus powders are examples of products dried by such techniques.'",
"'LEAF FEEDER 7. Leaf webber , Eucosma critica, Eucosmidae, Lepidoptera Symptom of damage: During vegetative stage of the crop, the caterpillar damages leaves by webbing, while at the floral stages of the crop they enter the buds, flowers and pods and feed on the immature seeds. Nature of damage: Young larva gets itself concealed into the frass produced during the course of scratching. The grown-up larva then draws the two leaves together and spins a thread between them, in which it passes later instar and also pupates. Egg: Oval, creamy white in colour, laid singly in leaves, petioles or stem. Larva: Young larvae are pale-yellow in colour, moderately stout, smooth, except for a few short scattered hairs. It hibernates in larval form. Pupa: Yellowish in colour, gradually turn to light-brown and finally to dark brown. Pupates in thin papery white silken cocoon. Adult: Dusky brown with forewings having four black dots and a silvery transparent mark'"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model fine-tuned from BAAI/bge-small-en-v1.5. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
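The architecture printout above lists a Pooling layer with `pooling_mode_cls_token` set to True, followed by a Normalize layer. As a rough sketch of what those two steps amount to, the pure-Python snippet below takes the first token's vector and scales it to unit length. The function name `cls_pool_and_normalize` and the 4-dimensional toy vectors are illustrative assumptions, not part of the library; the real model works on 384-dimensional tensors.

```python
import math


def cls_pool_and_normalize(token_embeddings):
    """Return the L2-normalized vector of the first ([CLS]) token."""
    cls_vector = token_embeddings[0]
    norm = math.sqrt(sum(x * x for x in cls_vector))
    return [x / norm for x in cls_vector]


# Toy per-token vectors for one sentence (4 dimensions instead of 384).
toy_tokens = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 2.0, 0.5, 0.1],
    [0.2, 0.3, 0.9, 1.5],
]

sentence_embedding = cls_pool_and_normalize(toy_tokens)
print(sentence_embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Only the first token contributes to the sentence embedding here; the other token vectors are ignored, which is what distinguishes CLS pooling from mean pooling.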
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("smokxy/bge_pairs")
# Run inference
sentences = [
'What does an Industry Analysis entail?',
"'a. Executive summary b. Business Description c. Industry/Sector analysis d. Marketing plan e. Operations plan f. Financial plan 7.8 What is included in an executive summary? The executive summary is an abstract containing the important points of the business plan. Its purpose is to communicate the plan in a convincing way to important audiences, such as potential investors, so they will read further. It may be the only chapter of the business plan a reader uses to make a quick decision on the proposal. As such, it should fulfill the reader's (financier's) expectations. It is prepared after the total plan has been written. The executive summary should describe the following: a. The industry and market environment in which the opportunity will develop and flourish b. The special and unique business opportunity—the problem the product or service will be solving c. The strategies for success—what differentiates the product or service from the competitors' products d. The financial potential—the anticipated risk and reward of the business e. The management team—the people who will achieve the results f. The resources or capital being requested—a clear statement to your readers about what you hope to gain from them, whether it is capital or other resources 7.9 What is included in a Business Description? The business description explains the business concept by giving a brief yet informative picture of the history, the basic nature, and the purpose of the business, including business objectives and why the business will be successful. The purposes of the business description are to: a. Express clearly understanding of the business concept b. Share enthusiasm for the venture c. Meet the expectations of the reader by providing a realistic picture of the business venture 7.10 What is Industry Analysis?'",
"'-Black cloth, -Khada cloth -Saw dust -0.025 % Sodium hypochlorite -Chick pea / groundnut seedlings -Bleaching powder -Coffee powder -Multivitamin syrup -10 % sucrose -Beaker 500 ml -Measuring cylinder -Egg laying chamber Procedure : 1. Release 10 males and 5 females at 2: 1 ratio in plastic containers and cover with thin black cloth . ( Female require multiple mating to lay fertile eggs ) . 2. To induce the moths to lay more eggs multivitamin syrup 2 drops + 10 % sucrose is given through cotton swabs 3. Daily collect the egg cloth after 3 rd day of copulation . Provide 25- 28 o C , 80- 90 % R.H during egg laying. A female lays 300 –700 eggs 4. Sterilize the egg cloth in 0.025 % sodium hypochlorite for ten seconds and immediately dip the egg cloth in distilled water in 3 different buckets having distilled water one by one and then dry it in shade. 5. Raise chickpea or groundnut seedlings in a week interval and provide for feeding 6. Place newly hatched larvae on chickpea/groundnut seedlings along with egg cloth for one day or place 3-4 eggs in vials containing artificial diet 7. Pick young larvae and rear on bhendi vegetable individually in penicillin vials to avoid cannibalism. 8. Daily change diet till pre pupal stage 9. Collect pre –pupae and allow for pupation in plastic container having saw dust 10. Pupae sterilization is done with the help of coffee filter by dip method 11. Transfer the pupae inside the egg lying chamber by keeping them on a separate petri dish without lid.'",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
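To make the `[3, 3]` shape concrete, here is a minimal pure-Python sketch of a pairwise similarity matrix. It assumes the similarity function is cosine, which I believe is the sentence-transformers default, and it uses 2-dimensional toy vectors in place of real model outputs. Because the model ends in a Normalize layer, its embeddings have unit length, and in that case cosine similarity and the dot product coincide.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


# Toy unit-length embeddings standing in for model.encode(...) output.
toy_embeddings = [[0.6, 0.8], [0.8, 0.6], [1.0, 0.0]]

# One row and one column per input sentence.
similarity_matrix = [
    [cosine(u, v) for v in toy_embeddings] for u in toy_embeddings
]

print(len(similarity_matrix), len(similarity_matrix[0]))  # 3 3
```

Entry `[i][j]` compares sentence `i` with sentence `j`, so the diagonal is (up to rounding) 1.0 and the matrix is symmetric.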
Dataset: val_evaluator, evaluated with InformationRetrievalEvaluator

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.5128 |
| cosine_accuracy@5 | 0.9361 |
| cosine_accuracy@10 | 0.9578 |
| cosine_precision@1 | 0.5128 |
| cosine_precision@5 | 0.1872 |
| cosine_precision@10 | 0.0958 |
| cosine_recall@1 | 0.5128 |
| cosine_recall@5 | 0.9361 |
| cosine_recall@10 | 0.9578 |
| cosine_ndcg@5 | 0.7468 |
| cosine_ndcg@10 | 0.7541 |
| cosine_ndcg@100 | 0.7627 |
| cosine_mrr@5 | 0.6829 |
| cosine_mrr@10 | 0.6861 |
| cosine_mrr@100 | 0.6881 |
| cosine_map@100 | 0.6881 |
| dot_accuracy@1 | 0.5128 |
| dot_accuracy@5 | 0.9361 |
| dot_accuracy@10 | 0.9578 |
| dot_precision@1 | 0.5128 |
| dot_precision@5 | 0.1872 |
| dot_precision@10 | 0.0958 |
| dot_recall@1 | 0.5128 |
| dot_recall@5 | 0.9361 |
| dot_recall@10 | 0.9578 |
| dot_ndcg@5 | 0.7468 |
| dot_ndcg@10 | 0.7541 |
| dot_ndcg@100 | 0.7627 |
| dot_mrr@5 | 0.6829 |
| dot_mrr@10 | 0.6861 |
| dot_mrr@100 | 0.6881 |
| dot_map@100 | 0.6881 |
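For readers unfamiliar with the metric names, the sketch below shows one common way to compute accuracy@k, precision@k, recall@k and the reciprocal rank for a single query. The function `ir_metrics` and the document ids are hypothetical and this is not the evaluator's own code; the reported values are averages of such per-query numbers over all queries.

```python
def ir_metrics(ranked_ids, relevant_ids, k):
    """Per-query retrieval metrics over the top-k ranked document ids."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]

    # Reciprocal rank of the first relevant document, 0.0 if none in top-k.
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / rank
            break

    return {
        "accuracy": 1.0 if hits else 0.0,
        "precision": len(hits) / k,
        "recall": len(hits) / len(relevant_ids),
        "mrr": reciprocal_rank,
    }


# Toy query: the single relevant document "d9" is ranked third.
metrics = ir_metrics(["d7", "d2", "d9", "d4", "d1"], {"d9"}, k=5)
print(metrics)
```

Two patterns in the table are consistent with this reading. Recall@k equals accuracy@k and precision@k is accuracy@k divided by k (0.9361 / 5 ≈ 0.1872), which is what one would expect if each query has exactly one relevant passage. The cosine and dot rows are identical, which fits the unit-length embeddings produced by the Normalize layer.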
Columns: anchor and positive

| | anchor | positive |
|---|---|---|
| type | string | string |

Samples:

| anchor | positive |
|---|---|
| What role do emulsifying and stabilizing agents play in carbonated water? | 'The consumption of carbonated water has increased rapidly. As per FSSAI definitions carbonated water conforming to the standards prescribed for packaged drinking water under Food Safety and Standard act, 2006 impregnated with carbon dioxide under pressure and may contain any of the listed additives singly or in combination. Permitted additives include sweeteners (sugar, liquid glucose, dextrose monohydrate, invert sugar, fructose, Honey) fruits & vegetables extractive, permitted flavouring, colouring matter, preservatives, emulsifying and stabilizing agents, acidulants (citric acid, fumaric acid and sorbitol, tartaric acid, phosphoric acid, lactic acid, ascorbic acid, malic acid), edible gums, salts of sodium, calcium and magnesium, vitamins, caffeine not exceeding 145 ppm, ester gum not exceeding 100 ppm and quinine salts not exceeding 100 ppm. It may contain Sodium saccharin not exceeding 100 ppm or Acesulfame-k 300 ppm or Aspartame not exceeding 700 ppm or sucralose not exceeding 300 ppm.' |
| What is the purpose of the Agri Clinic and Agri Business Centres scheme? | ' |
| What can be considered as outliers in terms of yield? | 'Identification of Outliers: All these above analyses can be used to check whether there was any reason for yield deviation as presented in the CCE data. Then a yield proxy map may be prepared. The Yield proxy map can be derived from remote sensing vegetation indices (single or combination of indices), crop simulation model output, or an integration of various parameters, which are related to crop yield, such as soil, weather (gridded), satellite based products, etc. Whatever, yield proxies to be used, it is the responsibility of the organization to record documentary evidence (from their or other's published work) that the yield proxy is related to the particular crop's yield. Then the IU level yields need to be overlaid on the yield proxy map. Both yield proxy and CCE yield can be divided into 4-5 categories (e.g. Very good, Good, Medium, Poor, Very poor). Wherever there is large mismatch between yield proxy and the CCE yield (more than 2 levels), the CCE yield for that IU can be considered, as outliers.' |
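Question and passage pairs like the samples above can be turned into retrieval evaluation inputs fairly directly. The sketch below builds the three dictionaries that, as far as I know, `InformationRetrievalEvaluator` accepts (`queries`, `corpus`, `relevant_docs`). The ids `q0`, `d0` and so on are made up for illustration, the passages are shortened stand-ins rather than real dataset rows, and how the author actually built `val_evaluator` is not stated in this card.

```python
# Abbreviated stand-ins for (anchor, positive) rows; not real dataset rows.
pairs = [
    (
        "What do hypertonic drinks have high levels of?",
        "Hypertonic drinks have high levels of carbohydrates.",
    ),
    (
        "When should sowing be done?",
        "Sowing should be done in the first fortnight of June.",
    ),
]

queries = {}        # query id -> question text
corpus = {}         # document id -> passage text
relevant_docs = {}  # query id -> set of relevant document ids

for index, (anchor, positive) in enumerate(pairs):
    query_id = f"q{index}"
    doc_id = f"d{index}"
    queries[query_id] = anchor
    corpus[doc_id] = positive
    relevant_docs[query_id] = {doc_id}  # one relevant passage per question

print(relevant_docs)
```

With this layout every passage in `corpus` acts as a distractor for all questions other than its own, so retrieval difficulty grows with the number of pairs.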
Loss: GISTEmbedLoss with these parameters: {'guide': SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
), 'temperature': 0.01}
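The idea behind a guided loss of this kind can be sketched without any deep learning library. In the simplified version below, a guide model's similarities decide which in-batch negatives to drop: any negative that the guide scores higher than the true anchor-positive pair is treated as a likely false negative and masked out before a temperature-scaled cross-entropy. This is a reduced illustration under my own assumptions: it only uses anchor-to-positive similarities, the function name `gist_style_loss` and the toy matrices are hypothetical, and the library's GISTEmbedLoss may differ in its details.

```python
import math


def gist_style_loss(model_sim, guide_sim, temperature=0.01):
    """Mean contrastive loss with guide-based masking of in-batch negatives.

    model_sim[i][j] and guide_sim[i][j] hold the similarity between
    anchor i and positive j under the trained model and the guide model.
    """
    batch_size = len(model_sim)
    total = 0.0
    for i in range(batch_size):
        threshold = guide_sim[i][i]  # guide score of the true pair
        logits = []
        for j in range(batch_size):
            if j != i and guide_sim[i][j] > threshold:
                continue  # likely false negative: leave it out
            logits.append(model_sim[i][j] / temperature)
        # Numerically stable log-sum-exp over the kept candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_denominator - model_sim[i][i] / temperature
    return total / batch_size


model_sim = [[0.8, 0.7], [0.2, 0.9]]

# The guide rates positive 1 above the true pair for anchor 0, so it is masked.
guide_sim = [[0.85, 0.90], [0.10, 0.95]]
guided_loss = gist_style_loss(model_sim, guide_sim)

# A guide that never exceeds the diagonal masks nothing.
no_mask_guide = [[1.0, 0.0], [0.0, 1.0]]
unguided_loss = gist_style_loss(model_sim, no_mask_guide)

print(guided_loss < unguided_loss)  # True
```

A small temperature such as the 0.01 reported above sharpens the softmax, so even modest similarity gaps between the positive and the remaining negatives translate into large logit gaps.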
Columns: anchor and positive

| | anchor | positive |
|---|---|---|
| type | string | string |

Samples:

| anchor | positive |
|---|---|
| What diseases do the mentioned pulses have resistance to? | '..................................................... pulses. 20. Ta 2IPM - 409-4. _ 2020 (Heera) Meha 2005 (I. P.M. - 99-25) Pusa Vishal 200] H. UM-6 2006 (Malaviya Janakalyani). Malaviya Jyothi 999 (H. UM-) TMV-37 2005T. The BM-37 (t. M. - 99.37) Malaviya 2003 Jan Chetna (H. UM-42) IPM-2-3 2009I. P.M. 2 - 4 20 |
| What do hypertonic drinks have high levels of? | 'There are three types of sports drinks all of which contain various levels of fluid, electrolytes, and carbohydrate. • Isotonic drinks have fluid, electrolytes and 6-8% carbohydrate. Isotonic drinks quickly replace fluids lost by sweating and supply a boost of carbohydrate. This kind of drink is the choice for most athletes especially middle and long distance running or team sports. • Hypotonic drinks have fluids, electrolytes and a low level of carbohydrates. Hypotonic drinks quickly replace flids lost by sweating. This kind of drink is suitable for athletes who need fluid without the boost of carbohydrates such as gymnasts. • Hypertonic drinks have high levels of carbohydrates. Hypertonic drinks can be used to supplement daily carbohydrate intake normally after exercise to top up muscle glycogen stores. In long distance events high levels of energy are required and hypertonic drinks' |
| When should sowing be done? | 'y Sowing should be done in the first fortnight of June and PR 126,PR 114, PR 121, PR 122, PR 127 are suitable varieties. Divide the field into kiyaras (plot) of desirable size after laser land levelling and apply pre-sowing (rauni) irrigation and prepare field when it comes to tar-wattar (good soil moisture) condition and immediately sow the crop with rice seed drill fitted with inclinedplate metering system or Lucky seed drill (for simultaneously sowing and spray of herbicide) by using 20 to 25 kg seed/ha in 20 cm spaced rows. The seed should be placed at 2-3 cm depth. Before sowing, treat rice seed with 3 g Sprint 75 WS (mencozeb + carbendazim) by dissolving in 10-12 ml water per kg seed; make paste of fungicide solution and rub on the seed.' |
Loss: GISTEmbedLoss with these parameters: {'guide': SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
), 'temperature': 0.01}
Non-default hyperparameters:

- eval_strategy: steps
- gradient_accumulation_steps: 4
- learning_rate: 1e-05
- weight_decay: 0.01
- num_train_epochs: 40
- warmup_ratio: 0.1
- load_best_model_at_end: True

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 4
- eval_accumulation_steps: None
- learning_rate: 1e-05
- weight_decay: 0.01
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 40
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional

Training logs:

| Epoch | Step | Training Loss | loss | val_evaluator_cosine_map@100 |
|---|---|---|---|---|
| 2.2727 | 500 | 0.2767 | 0.0931 | 0.6449 |
| 4.5455 | 1000 | 0.067 | 0.0777 | 0.6501 |
| 6.8182 | 1500 | 0.0485 | 0.0621 | 0.6678 |
| 9.0909 | 2000 | 0.0361 | 0.0615 | 0.6707 |
| 11.3636 | 2500 | 0.0301 | 0.0687 | 0.6765 |
| 13.6364 | 3000 | 0.0274 | 0.0661 | 0.6733 |
| 15.9091 | 3500 | 0.0223 | 0.0606 | 0.6822 |
| 18.1818 | 4000 | 0.021 | 0.0563 | 0.6834 |
| 20.4545 | 4500 | 0.0203 | 0.0573 | 0.6681 |
| 22.7273 | 5000 | 0.0212 | 0.0637 | 0.6770 |
| 25.0 | 5500 | 0.018 | 0.0580 | 0.6781 |
| 27.2727 | 6000 | 0.0166 | 0.0567 | 0.6781 |
| 29.5455 | 6500 | 0.0194 | 0.0542 | 0.6835 |
| 31.8182 | 7000 | 0.0182 | 0.0547 | 0.6897 |
| 34.0909 | 7500 | 0.0157 | 0.0549 | 0.6899 |
| 36.3636 | 8000 | 0.016 | 0.053 | 0.686 |
| 38.6364 | 8500 | 0.0142 | 0.0541 | 0.6881 |
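The logged epochs and steps can be combined with the reported hyperparameters to reconstruct the rough shape of the training schedule. The arithmetic below is a back-of-the-envelope sketch: it assumes a single training device, infers steps per epoch from the first log row (step 500 at epoch 2.2727), and models the `linear` scheduler as linear warmup followed by linear decay to zero. The helper `linear_lr` is hypothetical and only approximates what the trainer does.

```python
# Values copied from the hyperparameter list.
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_epochs = 40
warmup_ratio = 0.1
learning_rate = 1e-05

# Samples consumed per optimizer step, assuming one device.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# First log row: step 500 corresponds to epoch 2.2727.
steps_per_epoch = round(500 / 2.2727)
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = round(total_steps * warmup_ratio)


def linear_lr(step):
    """Learning rate under linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return learning_rate * max(0.0, remaining)


print(effective_batch_size, steps_per_epoch, total_steps, warmup_steps)
# 32 220 8800 880
```

Under these assumptions the last logged row (step 8500, epoch 38.6364) sits about 300 steps before the end of training, and 8500 / 220 ≈ 38.636 agrees with the logged epoch value.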
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{solatorio2024gistembed,
title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
author={Aivin V. Solatorio},
year={2024},
eprint={2402.16829},
archivePrefix={arXiv},
primaryClass={cs.LG}
}