How to use smokxy/embedding_finetuned with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("smokxy/embedding_finetuned")
sentences = [
"How can I contact the National Co-operative Development Corporation?",
"'1.1 Chhattisgarh is among the few states in India that have recorded impressive growth in agriculture in recent years. Development of farmers own institutions catering to their various needs, has kept pace with the agricultural growth. As on 30 September 2014, the state had 3,679 farmers clubs (FCs). There were eight federations of farmer clubs in the state, five in Mahasamund, two in Bilaspur and one in Mungeli district. In Bilaspur and Mungeli districts (the study area), 300 FCs were formed, of which 201 were active. Majority of the farmer clubs (129 clubs) were formed by the Regional Rural Bank (Gramin Bank). Other promoting institutions include Chhattisgarh Agricon Samiti (30), CARMDAKSH (12), SBI (12), ARDB (8) and IFFDC (5). While all the clubs were active in the initial three years, many slipped into dormancy through inaction and non-availability of hand-holding support. These clubs did not have any vision or roadmap for the future. 1.2 The Chhattisgarh RO and DDM Bilaspur were keen to make the farmer clubs a sustainable entity and felt the need to federate the clubs to a higher tier so as to make the entire farmer clubs programme sustainable and the organization a viable model. With this in view, the farmer clubs were federated into four farmer club federations and were registered under 'Chhattisgarh Society Registrikaran Adhiniyam, 1973' in the year 2012.'",
"'10.1 Under the scheme, financial support to Farmer Producer Organization (FPO) @ up to maximum of Rs. 18 lakh / FPO or actual, whichever is lesser is to be provided during three years from the year of formation. The financial support is not meant for reimbursing the entire administrative and management cost of FPO but it is to provide the financial support to the FPOs to the extent provided to make them sustainable and economically viable. Hence, the fourth year onwards of formation, the FPO has to manage their financial support from their own business activities. The indicative financial support broadly covers (i) the support for salary of its CEO/Manager (maximum up to Rs.25000/month) and Accountant (maximum up to Rs. 10000/month); (ii) one time registration cost(one time up to maximum Rs. 40000 or actual whichever is lower); (iii) office rent (maximum up to Rs. 48,000/year); (iv) utility charges (electricity and telephone charges of office of FPO maximum up to Rs. 12000/year); (v) one-time cost for minor equipment (including furniture and fixture maximum up to Rs. 20,000); (vi) travel and meeting cost (maximum up to Rs.18,000/year); and (vii) misc. (cleaning, stationery etc. maximum up to Rs. 12,000/year). Any expenditure of operations, management, working capital requirement and infrastructure development etc., over and above this, will be met by the FPOs from their financial resources. 10.2 FPO being organization of farmers, it does not become feasible for FPO itself to professionally administer its activities and day to day business, therefore, FPO requires some professionally equipped Manager/CEO to administer its activities and day to day business with a sole objective to make FPO economically sustainable and farmers' benefiting agri-enterprise. Not only for business development but the value of professional is immense in democratizing the FPOs and strengthening its governing system.'",
"'Risk Analysis For further information, please contact: Chief General Manager, Managing Director Small Farmers' Agri- Business Consortium, National Bank for Agriculture & Rural Head office, NCUI Auditorium Development, **NABARD**, Building C-24, 'G' Block, 5th floor, 3, Siri Institutional Area Bandra-Kurla Complex, August Kranti Marg, Hauz Bandra East, Khas, Mumbai - 400051 New Delhi-110016 Tel: 022- Tel: 011-41060075, 26966017 26539530,26539500 e-mail: sfac@nic.in Website: www.sfacindia.com csr.murthy@nabard.org, fsdd@nabard.org Website: www.nabard.org Agriculture Marketing Adviser Directorate of Marketing & Inspection DAC&FW, New CGO Complex, NH-IV, Faridabad - 121001 Tel: 0129- 2412518 e-mail: mdrc-dac@gov.in Website: www.dmi.gov.in Managing Director National Co-operative Development Corporation, **NCDC**, 4-Siri Institutional Area, Hauz Khas, New Delhi - 110016 Tel: 011- 26960796, 26567140 e-mail: e-mail: mail@ncdc.in Website: www.ncdc.in Agricultural Marketing Division Department of Agriculture, Co-operation & Farmers' Welfare Ministry of Agriculture & Farmers' Welfare Krishi Bhawan, New Delhi-110001 Tel: 011-23386235, 23388579 Website: www.agricoop.nic.in'"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from BAAI/bge-small-en-v1.5. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
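The architecture printout above enables only `pooling_mode_cls_token` and ends with a `Normalize()` module. As a minimal sketch of what those two steps amount to (not the library's implementation), the pure-Python snippet below takes per-token vectors, keeps the first token's vector as the sentence embedding, and rescales it to unit length. The helper names and the 4-dimensional toy vectors are hypothetical; the real model works with 384 dimensions.

```python
import math


def cls_pool(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Toy transformer output: 3 tokens, 4 dimensions each (illustrative values only).
token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.0, 2.0, 0.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
embedding_norm = math.sqrt(sum(x * x for x in sentence_embedding))
print(sentence_embedding, embedding_norm)
```

Because every embedding leaves the model with unit norm, the dot product of two embeddings equals their cosine similarity.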
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("smokxy/embedding_finetuned")
# Run inference
sentences = [
'What is the requirement of Aadhaar for crop loan or Kisan Credit Card (KCC) under the Interest Subvention Scheme?',
"'6.3.1 Aadhaar has been made mandatory for availing Crop insurance from Kharif 2017 season onwards. Therefore, all banks are advised to mandatorily obtain Aadhaar number of their farmers and the same applies for non-loanee farmers enrolled through banks/Insurance companies/insurance intermediaries. 6.3.2 Farmers not having Aadhaar ID may also enrol under PMFBY subject to their enrolment for Aadhaar and submission of proof of such enrolment as per notification No. 334.dated 8th February, 2017 issued by GOI under Section 7 of Aadhaar Act 2016(Targeted Delivery of Financial and other Subsidies, Benefits and Services). Copy of the notification may be perused on www.pmfby.gov.in. This may be subject to further directions issued by Govt. from time to time. 6.3.3 All banks have to compulsorily take Aadhaar/Aadhaar enrolment number as per notification under Aadhaar Act before sanction of crop loan/KCC under Interest Subvention Scheme. Hence the coverage of loanee farmers without Aadhaar does not arise and such accounts need to be reviewed by the concerned bank branch regularly.'",
"' Date……………………………… ……………………………… Signature of Branch Manager with branch seal Name…………………………………… … Designation …………………………………… ……………………………… ……………………………… Signature of Authorized Person in zonal office Name………………………………… Designation …………………………………… 5. Promoter's request letter List of Enclosures 1. Recommendation 9. List of shareholders addressed to the Bank Manager on original letter head of FPO confirmed by promoter and bank with amount of CGC sought on Bank's Original letterhead with date and dispatch number duly signed by the Branch Manager on each page. 2. Sanction letter of 6. Implementation Schedule 10. Affidavit of promoters that confirmed by the bank. they have not availed CGC from any other institution for sanctioned Credit Facility. sanctioning authority addressed to recommending branch. 3. Bank's approved 7. Up-to-date statement of account of 11. Field inspection report of Term loan and Cash Credit (if Sanctioned). Bank official as on recent date. Appraisal/Process note bearing signature of sanctioning authority. 4. Potential Impact on 8. a).Equity Certificate, C.A/CS * Pin Code at Column No. 1. a), certificate/RCS certificate 2. b), 2. c), 4. a) and 9. a) is Mandatory b). FORM-2, FORM-5 and FORM-23 filed with ROC for Company/RCS. small farmer producers 1. Social Impact, 2. Environmental Impact 3.'",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
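In the example above, the first entry of `sentences` is a question and the remaining entries are candidate passages, so the first row of the similarity matrix holds the question-to-passage scores. The following sketch shows how such scores can be turned into a ranking. It uses small hand-written vectors in place of real model output so that it runs without downloading the model; `cosine_similarity` and `rank_passages` are hypothetical helpers, not part of sentence-transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for model.encode(...) output (illustrative values only).
query_vec = [1.0, 0.0, 1.0]
passage_vecs = [
    [0.9, 0.1, 1.1],  # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
]

ranking = rank_passages(query_vec, passage_vecs)
best_index, best_score = ranking[0]
print(ranking)
```

With the real model, `query_vec` and `passage_vecs` would be rows of the `embeddings` array returned by `model.encode`.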
Evaluation on val_evaluator with InformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.51 |
| cosine_accuracy@5 | 0.89 |
| cosine_accuracy@10 | 0.93 |
| cosine_precision@1 | 0.51 |
| cosine_precision@5 | 0.178 |
| cosine_precision@10 | 0.093 |
| cosine_recall@1 | 0.51 |
| cosine_recall@5 | 0.89 |
| cosine_recall@10 | 0.93 |
| cosine_ndcg@5 | 0.7199 |
| cosine_ndcg@10 | 0.7332 |
| cosine_ndcg@100 | 0.7507 |
| cosine_mrr@5 | 0.6627 |
| cosine_mrr@10 | 0.6684 |
| cosine_mrr@100 | 0.6731 |
| cosine_map@100 | 0.6731 |
| dot_accuracy@1 | 0.51 |
| dot_accuracy@5 | 0.89 |
| dot_accuracy@10 | 0.93 |
| dot_precision@1 | 0.51 |
| dot_precision@5 | 0.178 |
| dot_precision@10 | 0.093 |
| dot_recall@1 | 0.51 |
| dot_recall@5 | 0.89 |
| dot_recall@10 | 0.93 |
| dot_ndcg@5 | 0.7199 |
| dot_ndcg@10 | 0.7332 |
| dot_ndcg@100 | 0.7507 |
| dot_mrr@5 | 0.6627 |
| dot_mrr@10 | 0.6684 |
| dot_mrr@100 | 0.6731 |
| dot_map@100 | 0.6731 |
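Two patterns in the table are worth spelling out. The cosine and dot rows are identical, which is what one would expect from unit-length embeddings (the model ends with `Normalize()`). Also, precision@5 equals accuracy@5 divided by 5 (0.89 / 5 = 0.178), precision@10 equals accuracy@10 divided by 10 (0.93 / 10 = 0.093), and recall@k equals accuracy@k; this is consistent with each query having exactly one relevant passage, although the card does not state that explicitly. The sketch below computes these metrics with their usual definitions on toy data; the function names and document ids are hypothetical.

```python
def ir_metrics_at_k(ranked_ids, relevant_ids, k):
    """Accuracy, precision, recall and reciprocal rank for one query at cutoff k."""
    top_k = ranked_ids[:k]
    num_hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / rank  # only the first hit counts
            break
    return {
        "accuracy": 1.0 if num_hits > 0 else 0.0,
        "precision": num_hits / k,
        "recall": num_hits / len(relevant_ids),
        "mrr": reciprocal_rank,
    }


def mean_metrics(queries, k):
    """Average the per-query metrics over all (ranked_ids, relevant_ids) pairs."""
    per_query = [ir_metrics_at_k(ranked, relevant, k) for ranked, relevant in queries]
    return {name: sum(m[name] for m in per_query) / len(per_query) for name in per_query[0]}


# Toy evaluation set: each query has exactly one relevant document.
queries = [
    (["d1", "d2", "d3"], {"d1"}),  # relevant document at rank 1
    (["d4", "d5", "d6"], {"d5"}),  # relevant document at rank 2
    (["d7", "d8", "d9"], {"d0"}),  # relevant document not retrieved
]

metrics = mean_metrics(queries, k=3)
print(metrics)
```

With one relevant document per query, `precision` is always `accuracy / k` and `recall` equals `accuracy`, mirroring the relationship seen in the reported values.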
Non-default hyperparameters:

- eval_strategy: steps
- gradient_accumulation_steps: 4
- learning_rate: 1e-05
- weight_decay: 0.01
- num_train_epochs: 1.0
- warmup_ratio: 0.1
- load_best_model_at_end: True

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 4
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 1e-05
- weight_decay: 0.01
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 1.0
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional

Training logs:

| Epoch | Step | Training Loss | loss | val_evaluator_cosine_map@100 |
|---|---|---|---|---|
| 0.531 | 15 | 0.5565 | 0.0661 | 0.6731 |
| 0.9912 | 28 | - | 0.0661 | 0.6731 |
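The hyperparameters listed for this run combine `per_device_train_batch_size: 8` with `gradient_accumulation_steps: 4`, and use a `linear` scheduler with `warmup_ratio: 0.1` and `learning_rate: 1e-05`. The sketch below works out what that implies under two assumptions that the card does not state: training ran on a single device, and the run had 28 optimizer steps in total (the last step shown in the log). The schedule function follows the usual shape of a linear warmup followed by linear decay; it is an illustration, not the trainer's code.

```python
import math

# Values taken from the hyperparameter list.
PER_DEVICE_TRAIN_BATCH_SIZE = 8
GRADIENT_ACCUMULATION_STEPS = 4
LEARNING_RATE = 1e-5
WARMUP_RATIO = 0.1

# Assumptions (not stated in the card).
NUM_DEVICES = 1
TOTAL_STEPS = 28

# Examples contributing to each optimizer update.
effective_batch_size = PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES

# Number of warmup steps implied by the ratio, rounded up.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def linear_schedule_lr(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / max(1, TOTAL_STEPS - warmup_steps)


for step in (0, 1, warmup_steps, 15, TOTAL_STEPS):
    print(step, linear_schedule_lr(step))
```

Under these assumptions each update sees 32 examples, the learning rate peaks after 3 warmup steps, and it has decayed to roughly half of its peak by step 15, where the only training loss in the log was recorded.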
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{solatorio2024gistembed,
title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
author={Aivin V. Solatorio},
year={2024},
eprint={2402.16829},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Base model
BAAI/bge-small-en-v1.5