---
base_model: TaylorAI/bge-micro
datasets: []
language:
  - en
library_name: sentence-transformers
license: apache-2.0
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:3210255
  - loss:CachedMultipleNegativesRankingLoss
widget:
  - source_sentence: donepezil hydrochloride monohydrate
    sentences:
      - Cn1nccc1[C@H]1CC[C@H](O[Si](C)(C)C(C)(C)C)C[C@@H]1OC(=O)c1ccccc1
      - COc1cc2c(cc1OC)C(=O)C(CC1CCN(Cc3ccccc3)CC1)C2.Cl.O
      - C(=O)(OC)C1=CC=C(C=C1)CC(C)=O
  - source_sentence: >-
      6-Cyclopropylmethoxy-5-(3,3-difluoro-azetidin-1-yl)-pyridine-2-carboxylic
      acid tert-butyl-(5-methyl-[1,3,4]oxadiazol-2-ylmethyl)-amide
    sentences:
      - Cc1nnc(CN(C(=O)c2ccc(N3CC(F)(F)C3)c(OCC3CC3)n2)C(C)(C)C)o1
      - COc1cccc(CCCC=C(Br)Br)c1
      - CN(C)CCNC(=O)c1ccc2oc(=O)n(Cc3ccc4[nH]c(=O)[nH]c4c3)c2c1
  - source_sentence: >-
      N-(2-chlorophenyl)-6,8-difluoro-N-methyl-4H-thieno[3,2-c]chromene-2-carboxamide
    sentences:
      - CN(C(=O)c1cc2c(s1)-c1cc(F)cc(F)c1OC2)c1ccccc1Cl
      - ClC(C(=O)OCCOCC1=CC=C(C=C1)F)C
      - C(C)OC(\C=C(/C)\OC1=C(C(=CC=C1F)OC(C)C)F)=O
  - source_sentence: >-
      6-[2-[(3-chlorophenyl)methyl]-1,3,3a,4,6,6a-hexahydropyrrolo[3,4-c]pyrrol-5-yl]-3-(trifluoromethyl)-[1,2,4]triazolo[4,3-b]pyridazine
    sentences:
      - CC(=O)OCCOCn1cc(C)c(=O)[nH]c1=O
      - NC1=C(C(=NN1C1=C(C=C(C=C1Cl)C(F)(F)F)Cl)C#N)S(=O)(=O)C
      - ClC=1C=C(C=CC1)CN1CC2CN(CC2C1)C=1C=CC=2N(N1)C(=NN2)C(F)(F)F
  - source_sentence: >-
      (±)-cis-2-(4-methoxyphenyl)-3-acetoxy-5-[2-(dimethylamino)ethyl]-8-chloro-2,3-dihydro-1,5-benzothiazepin-4(5H)-one
      hydrochloride
    sentences:
      - N(=[N+]=[N-])C(C(=O)C1=NC(=C(C(=N1)C(C)(C)C)O)C(C)(C)C)C
      - O[C@@H]1[C@H](O)[C@@H](Oc2nc(N3CCNCC3)nc3ccccc23)C[C@H]1O
      - Cl.COC1=CC=C(C=C1)[C@@H]1SC2=C(N(C([C@@H]1OC(C)=O)=O)CCN(C)C)C=CC(=C2)Cl
model-index:
  - name: bge-micro-smiles
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: bge micro test
          type: bge-micro-test
        metrics:
          - type: pearson_cosine
            value: .nan
            name: Pearson Cosine
          - type: spearman_cosine
            value: .nan
            name: Spearman Cosine
          - type: pearson_manhattan
            value: .nan
            name: Pearson Manhattan
          - type: spearman_manhattan
            value: .nan
            name: Spearman Manhattan
          - type: pearson_euclidean
            value: .nan
            name: Pearson Euclidean
          - type: spearman_euclidean
            value: .nan
            name: Spearman Euclidean
          - type: pearson_dot
            value: .nan
            name: Pearson Dot
          - type: spearman_dot
            value: .nan
            name: Spearman Dot
          - type: pearson_max
            value: .nan
            name: Pearson Max
          - type: spearman_max
            value: .nan
            name: Spearman Max
---

# bge-micro-smiles

This is a sentence-transformers model fine-tuned from TaylorAI/bge-micro. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. The training data pairs chemical names (anchors) with SMILES strings (positives), so the model is particularly suited to matching chemical nomenclature against SMILES representations.

## Model Details

### Model Description

  • Model Type: Sentence Transformer
  • Base model: TaylorAI/bge-micro
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

### Model Sources

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
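
The `Pooling` module above mean-pools the `BertModel` token embeddings into a single 384-dimensional vector. For readers who prefer plain `transformers`, the following is a rough sketch of that pipeline; it assumes the `fpc/bge-micro-smiles` checkpoint used in the Usage section below, and the `SentenceTransformer` API remains the recommended path:

```python
# Sketch: reproducing the Transformer -> mean-pooling pipeline shown above
# with plain transformers. Assumes the "fpc/bge-micro-smiles" checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("fpc/bge-micro-smiles")
model = AutoModel.from_pretrained("fpc/bge-micro-smiles")

sentences = ["Phthalimide", "C1(C=2C(C(N1)=O)=CC=CC2)=O"]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 384)

# Mean pooling over non-padding tokens (pooling_mode_mean_tokens=True above)
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(embeddings.shape)  # torch.Size([2, 384])
```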

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("fpc/bge-micro-smiles")
# Run inference
sentences = [
    '(±)-cis-2-(4-methoxyphenyl)-3-acetoxy-5-[2-(dimethylamino)ethyl]-8-chloro-2,3-dihydro-1,5-benzothiazepin-4(5H)-one hydrochloride',
    'Cl.COC1=CC=C(C=C1)[C@@H]1SC2=C(N(C([C@@H]1OC(C)=O)=O)CCN(C)C)C=CC(=C2)Cl',
    'O[C@@H]1[C@H](O)[C@@H](Oc2nc(N3CCNCC3)nc3ccccc23)C[C@H]1O',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```
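
Because the training pairs are chemical names (anchor) and SMILES strings (positive), a typical use is ranking candidate SMILES against a name query. A minimal sketch using `sentence_transformers.util.semantic_search` (the example strings are taken from the widget samples above):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("fpc/bge-micro-smiles")

# One chemical name as the query, a few SMILES strings as the corpus
query = "donepezil hydrochloride monohydrate"
candidates = [
    "COc1cc2c(cc1OC)C(=O)C(CC1CCN(Cc3ccccc3)CC1)C2.Cl.O",
    "C(=O)(OC)C1=CC=C(C=C1)CC(C)=O",
    "CC(=O)OCCOCn1cc(C)c(=O)[nH]c1=O",
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# Rank the candidates by cosine similarity to the query
hits = util.semantic_search(query_emb, cand_embs, top_k=len(candidates))[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {candidates[hit['corpus_id']]}")
```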

## Training Details

### Training Dataset

#### Unnamed Dataset

  • Size: 3,210,255 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:

  |         | anchor                                             | positive                                           |
  |:--------|:---------------------------------------------------|:---------------------------------------------------|
  | type    | string                                             | string                                             |
  | details | min: 5 tokens, mean: 42.57 tokens, max: 153 tokens | min: 4 tokens, mean: 40.02 tokens, max: 325 tokens |
  • Samples:

  | anchor | positive |
  |:-------|:---------|
  | 4-t-butylbromobenzene | C(C)(C)(C)C1=CC=C(C=C1)Br |
  | 1-methyl-4-(morpholine-4-carbonyl)-N-(2-phenyl-[1,2,4]triazolo[1,5-a]pyridin-7-yl)-1H-pyrazole-5-carboxamide | CN1N=CC(=C1C(=O)NC1=CC=2N(C=C1)N=C(N2)C2=CC=CC=C2)C(=O)N2CCOCC2 |
  | Phthalimide | C1(C=2C(C(N1)=O)=CC=CC2)=O |
  • Loss: CachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
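
For context, a sketch of how an anchor/positive dataset and this loss are typically wired together in sentence-transformers 3.x (the example pairs are copied from the samples table above; the parameters listed above are the library defaults):

```python
# Sketch: pairing the anchor/positive columns above with
# CachedMultipleNegativesRankingLoss (sentence-transformers 3.x).
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("TaylorAI/bge-micro")  # the base model listed above

# Two example pairs copied from the samples table above
train_dataset = Dataset.from_dict({
    "anchor": ["4-t-butylbromobenzene", "Phthalimide"],
    "positive": ["C(C)(C)(C)C1=CC=C(C=C1)Br", "C1(C=2C(C(N1)=O)=CC=CC2)=O"],
})

# scale=20.0 and cosine similarity match the parameters listed above;
# embedding caching is what lets a large per-device batch (512 here) fit in memory.
loss = CachedMultipleNegativesRankingLoss(model, scale=20.0)
```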
    

### Training Hyperparameters

#### Non-Default Hyperparameters

  • per_device_train_batch_size: 512
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • bf16: True
  • batch_sampler: no_duplicates
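
For reference, a rough sketch of how these non-default values map onto `SentenceTransformerTrainingArguments` in sentence-transformers 3.x (`output_dir` is a placeholder):

```python
# Sketch: the non-default hyperparameters above as training arguments
# (sentence-transformers 3.x; output_dir is a placeholder).
from sentence_transformers.training_args import (
    BatchSamplers,
    SentenceTransformerTrainingArguments,
)

args = SentenceTransformerTrainingArguments(
    output_dir="bge-micro-smiles",               # placeholder output path
    per_device_train_batch_size=512,
    learning_rate=2e-5,
    num_train_epochs=4,
    warmup_ratio=0.1,
    bf16=True,                                   # requires bf16-capable hardware
    batch_sampler=BatchSamplers.NO_DUPLICATES,   # avoid duplicate entries within a batch
)
```

These arguments, together with the dataset and loss from the sketch above, would be passed to a `SentenceTransformerTrainer` to run a comparable training job.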

#### All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 512
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

### Training Logs

```text
Epoch Step Training Loss bge-micro-test_spearman_cosine
0.0159 100 6.1861 -
0.0319 200 6.0547 -
0.0478 300 5.6041 -
0.0638 400 4.9367 -
0.0797 500 4.3412 -
0.0957 600 3.8245 -
0.1116 700 3.3188 -
0.1276 800 2.869 -
0.1435 900 2.5149 -
0.1595 1000 2.2282 -
0.1754 1100 2.0046 -
0.1914 1200 1.8032 -
0.2073 1300 1.6289 -
0.2232 1400 1.4567 -
0.2392 1500 1.3326 -
0.2551 1600 1.2127 -
0.2711 1700 1.0909 -
0.2870 1800 1.0021 -
0.3030 1900 0.9135 -
0.3189 2000 0.8378 -
0.3349 2100 0.7758 -
0.3508 2200 0.7031 -
0.3668 2300 0.6418 -
0.3827 2400 0.5965 -
0.3987 2500 0.5461 -
0.4146 2600 0.5039 -
0.4306 2700 0.4674 -
0.4465 2800 0.4339 -
0.4624 2900 0.4045 -
0.4784 3000 0.373 -
0.4943 3100 0.3566 -
0.5103 3200 0.3348 -
0.5262 3300 0.3215 -
0.5422 3400 0.302 -
0.5581 3500 0.2826 -
0.5741 3600 0.2803 -
0.5900 3700 0.2616 -
0.6060 3800 0.2554 -
0.6219 3900 0.234 -
0.6379 4000 0.2306 -
0.6538 4100 0.2224 -
0.6697 4200 0.2141 -
0.6857 4300 0.2117 -
0.7016 4400 0.204 -
0.7176 4500 0.198 -
0.7335 4600 0.1986 -
0.7495 4700 0.1821 -
0.7654 4800 0.1813 -
0.7814 4900 0.1741 -
0.7973 5000 0.1697 -
0.8133 5100 0.1655 -
0.8292 5200 0.1623 -
0.8452 5300 0.1593 -
0.8611 5400 0.1566 -
0.8771 5500 0.151 -
0.8930 5600 0.1526 -
0.9089 5700 0.1453 -
0.9249 5800 0.1448 -
0.9408 5900 0.1369 -
0.9568 6000 0.1409 -
0.9727 6100 0.1373 -
0.9887 6200 0.133 -
1.0046 6300 0.1269 -
1.0206 6400 0.1274 -
1.0365 6500 0.1271 -
1.0525 6600 0.1216 -
1.0684 6700 0.1176 -
1.0844 6800 0.1208 -
1.1003 6900 0.1177 -
1.1162 7000 0.1175 -
1.1322 7100 0.1109 -
1.1481 7200 0.1118 -
1.1641 7300 0.1085 -
1.1800 7400 0.1155 -
1.1960 7500 0.1079 -
1.2119 7600 0.1087 -
1.2279 7700 0.1004 -
1.2438 7800 0.1084 -
1.2598 7900 0.1089 -
1.2757 8000 0.1012 -
1.2917 8100 0.1037 -
1.3076 8200 0.1004 -
1.3236 8300 0.0979 -
1.3395 8400 0.1007 -
1.3554 8500 0.0956 -
1.3714 8600 0.0972 -
1.3873 8700 0.0947 -
1.4033 8800 0.0931 -
1.4192 8900 0.0948 -
1.4352 9000 0.0925 -
1.4511 9100 0.0933 -
1.4671 9200 0.0888 -
1.4830 9300 0.0877 -
1.4990 9400 0.0889 -
1.5149 9500 0.0895 -
1.5309 9600 0.0892 -
1.5468 9700 0.089 -
1.5627 9800 0.0828 -
1.5787 9900 0.0906 -
1.5946 10000 0.0893 -
1.6106 10100 0.0849 -
1.6265 10200 0.0811 -
1.6425 10300 0.0823 -
1.6584 10400 0.0806 -
1.6744 10500 0.0815 -
1.6903 10600 0.0832 -
1.7063 10700 0.0856 -
1.7222 10800 0.081 -
1.7382 10900 0.0831 -
1.7541 11000 0.0767 -
1.7701 11100 0.0779 -
1.7860 11200 0.0792 -
1.8019 11300 0.0771 -
1.8179 11400 0.0783 -
1.8338 11500 0.0749 -
1.8498 11600 0.0755 -
1.8657 11700 0.0778 -
1.8817 11800 0.0753 -
1.8976 11900 0.0767 -
1.9136 12000 0.0725 -
1.9295 12100 0.0744 -
1.9455 12200 0.0743 -
1.9614 12300 0.0722 -
1.9774 12400 0.0712 -
1.9933 12500 0.0709 -
2.0092 12600 0.0694 -
2.0252 12700 0.0705 -
2.0411 12800 0.0715 -
2.0571 12900 0.0705 -
2.0730 13000 0.0653 -
2.0890 13100 0.0698 -
2.1049 13200 0.0676 -
2.1209 13300 0.0684 -
2.1368 13400 0.0644 -
2.1528 13500 0.0652 -
2.1687 13600 0.0673 -
2.1847 13700 0.067 -
2.2006 13800 0.0645 -
2.2166 13900 0.0633 -
2.2325 14000 0.0645 -
2.2484 14100 0.0698 -
2.2644 14200 0.0655 -
2.2803 14300 0.0654 -
2.2963 14400 0.0656 -
2.3122 14500 0.0631 -
2.3282 14600 0.0628 -
2.3441 14700 0.0671 -
2.3601 14800 0.0659 -
2.3760 14900 0.0619 -
2.3920 15000 0.0618 -
2.4079 15100 0.0624 -
2.4239 15200 0.0616 -
2.4398 15300 0.0631 -
2.4557 15400 0.0639 -
2.4717 15500 0.0585 -
2.4876 15600 0.0607 -
2.5036 15700 0.0615 -
2.5195 15800 0.062 -
2.5355 15900 0.0621 -
2.5514 16000 0.0608 -
2.5674 16100 0.0594 -
2.5833 16200 0.0631 -
2.5993 16300 0.0635 -
2.6152 16400 0.06 -
2.6312 16500 0.0581 -
2.6471 16600 0.0607 -
2.6631 16700 0.0577 -
2.6790 16800 0.0592 -
2.6949 16900 0.0625 -
2.7109 17000 0.0622 -
2.7268 17100 0.0573 -
2.7428 17200 0.0613 -
2.7587 17300 0.0587 -
2.7747 17400 0.0587 -
2.7906 17500 0.0588 -
2.8066 17600 0.0568 -
2.8225 17700 0.0573 -
2.8385 17800 0.0575 -
2.8544 17900 0.0575 -
2.8704 18000 0.0582 -
2.8863 18100 0.0577 -
2.9022 18200 0.057 -
2.9182 18300 0.0572 -
2.9341 18400 0.0558 -
2.9501 18500 0.0578 -
2.9660 18600 0.0567 -
2.9820 18700 0.0569 -
2.9979 18800 0.0547 -
3.0139 18900 0.0542 -
3.0298 19000 0.0563 -
3.0458 19100 0.0549 -
3.0617 19200 0.0531 -
3.0777 19300 0.053 -
3.0936 19400 0.0557 -
3.1096 19500 0.0546 -
3.1255 19600 0.0518 -
3.1414 19700 0.0517 -
3.1574 19800 0.0528 -
3.1733 19900 0.0551 -
3.1893 20000 0.0544 -
3.2052 20100 0.0526 -
3.2212 20200 0.0494 -
3.2371 20300 0.0537 -
3.2531 20400 0.0568 -
3.2690 20500 0.0525 -
3.2850 20600 0.0566 -
3.3009 20700 0.0539 -
3.3169 20800 0.0531 -
3.3328 20900 0.0524 -
3.3487 21000 0.0543 -
3.3647 21100 0.0537 -
3.3806 21200 0.0524 -
3.3966 21300 0.0516 -
3.4125 21400 0.0537 -
3.4285 21500 0.0515 -
3.4444 21600 0.0537 -
3.4604 21700 0.0526 -
3.4763 21800 0.0508 -
3.4923 21900 0.0526 -
3.5082 22000 0.0521 -
3.5242 22100 0.054 -
3.5401 22200 0.053 -
3.5561 22300 0.0509 -
3.5720 22400 0.0526 -
3.5879 22500 0.0551 -
3.6039 22600 0.0556 -
3.6198 22700 0.0497 -
3.6358 22800 0.0515 -
3.6517 22900 0.0514 -
3.6677 23000 0.0503 -
3.6836 23100 0.0515 -
3.6996 23200 0.0553 -
3.7155 23300 0.0519 -
3.7315 23400 0.0549 -
3.7474 23500 0.0522 -
3.7634 23600 0.0526 -
3.7793 23700 0.0525 -
3.7952 23800 0.051 -
3.8112 23900 0.0509 -
3.8271 24000 0.0503 -
3.8431 24100 0.0524 -
3.8590 24200 0.0526 -
3.8750 24300 0.0512 -
3.8909 24400 0.0518 -
3.9069 24500 0.0521 -
3.9228 24600 0.0524 -
3.9388 24700 0.051 -
3.9547 24800 0.0535 -
3.9707 24900 0.0508 -
3.9866 25000 0.0514 -
4.0 25084 - nan
```

### Framework Versions

  • Python: 3.10.9
  • Sentence Transformers: 3.0.1
  • Transformers: 4.41.2
  • PyTorch: 2.4.1+cu124
  • Accelerate: 0.33.0
  • Datasets: 2.18.0
  • Tokenizers: 0.19.1

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### CachedMultipleNegativesRankingLoss

```bibtex
@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```