
SentenceTransformer based on flax-sentence-embeddings/all_datasets_v4_MiniLM-L6

This is a sentence-transformers model finetuned from flax-sentence-embeddings/all_datasets_v4_MiniLM-L6 on the json dataset. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
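The Pooling and Normalize stages listed above (mean over tokens, then unit-length scaling) can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the function name `mean_pool_and_normalize` and the toy 4-dimensional vectors are assumptions made for readability, whereas the real model works on 384-dimensional token vectors.

```python
import math

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Masked mean over token vectors, then L2 normalization.

    token_embeddings: list of token vectors (lists of floats) for one sentence.
    attention_mask:   list of 1 (real token) or 0 (padding), one per token.
    """
    dim = len(token_embeddings[0])
    n_real = max(sum(attention_mask), 1)  # guard against an all-padding input
    pooled = [
        sum(vec[d] for vec, m in zip(token_embeddings, attention_mask) if m) / n_real
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

# Toy sentence: three token positions, the last of which is padding.
tokens = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool_and_normalize(tokens, mask)
print(sentence_vector)
```

Because the padded position is masked out, its values do not influence the result, and the output always has unit length.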

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("FareedKhan/flax-sentence-embeddings_all_datasets_v4_MiniLM-L6_FareedKhan_prime_synthetic_data_2k_4_16")
# Run inference
sentences = [
    '\nEpstein-Barr virus-associated mesenchymal tumor is a disease designated as a type of leiomyosarcoma within the disease nomenclature system MONDO. This specific condition is uniquely characterized by its association with the Epstein-Barr virus and exhibits symptoms commonly related to an underlying malignancy, such as fatigue, fever, and muscle pain. Identified as a subgroup of leiomyosarcoma, it also encompasses related diseases including Epstein-Barr virus-related tumor, follicular dendritic cell sarcoma, and myopericytoma, all of which share the hallmark of being influenced by the Epstein-Barr virus. This classification emphasizes the role of viral infection in the development and manifestation of these tumor types, offering insights into potential pathways of disease progression and suggesting avenues for targeted therapeutic interventions.',
    'What type of leiomyosarcoma commonly manifests with fatigue, fever, and muscle pain?',
    "Could you provide me with a list of medications that display synergistic effects when combined with Lemborexant, are prescribed for the same indications, and possess an elimination half-life close to 37 hours? I am interested in exploring alternative treatments compatible with Lemborexant's therapeutic indications that may offer extended efficacy through the prolonged half-life of the secondary drug when co-administered.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
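Since the model ends with a Normalize() module, its embeddings are unit-length, so a plain dot product gives the cosine similarity. The sketch below ranks passages for a query on that basis. The helper `rank_by_dot_product` and the toy 2-dimensional vectors are hypothetical; in practice the vectors would come from `model.encode(...)`.

```python
def rank_by_dot_product(query_vector, passage_vectors, top_k=3):
    """Return (index, score) pairs for the top_k passages, best first.

    Assumes all vectors are already unit-length, so the dot product
    equals the cosine similarity.
    """
    scores = [
        (i, sum(q * p for q, p in zip(query_vector, vec)))
        for i, vec in enumerate(passage_vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy unit vectors standing in for real embeddings.
query = [1.0, 0.0]
passages = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
ranking = rank_by_dot_product(query, passages, top_k=2)
print(ranking)  # [(2, 1.0), (1, 0.6)]
```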

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.3663
cosine_accuracy@3 0.4406
cosine_accuracy@5 0.4653
cosine_accuracy@10 0.5149
cosine_precision@1 0.3663
cosine_precision@3 0.1469
cosine_precision@5 0.0931
cosine_precision@10 0.0515
cosine_recall@1 0.3663
cosine_recall@3 0.4406
cosine_recall@5 0.4653
cosine_recall@10 0.5149
cosine_ndcg@10 0.437
cosine_mrr@10 0.4127
cosine_map@100 0.4187
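The figures above are consistent with each query having exactly one relevant passage: recall@k equals accuracy@k, and precision@k equals accuracy@k divided by k. That reading is an inference from the numbers, not something the card states. A minimal sketch of the metrics under that assumption, using a hypothetical function name and toy ranks:

```python
def retrieval_metrics(ranks, k):
    """Compute accuracy@k, precision@k, recall@k and MRR@k.

    ranks: 1-based rank of the single relevant passage for each query,
           or None if it was not retrieved at all.
    """
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    accuracy = sum(hits) / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,  # one relevant item among k results
        "recall": accuracy,         # one relevant item in total
        "mrr": sum(1.0 / r for r, hit in zip(ranks, hits) if hit) / n,
    }

# Toy ranks for four queries (hypothetical data).
toy_ranks = [1, 3, None, 2]
toy_metrics = retrieval_metrics(toy_ranks, k=3)
print(toy_metrics)

# Consistency check against the reported table: k -> (accuracy@k, precision@k).
reported = {1: (0.3663, 0.3663), 3: (0.4406, 0.1469), 5: (0.4653, 0.0931), 10: (0.5149, 0.0515)}
for k, (acc, prec) in reported.items():
    print(k, round(acc / k, 4), prec)
```

For every k in the table, accuracy@k / k rounds to the reported precision@k.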

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 1,814 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    • positive (string): min 2 tokens, mean 117.23 tokens, max 128 tokens
    • anchor (string): min 13 tokens, mean 35.72 tokens, max 128 tokens
  • Samples:
    Each sample is a positive passage (displayed truncated) followed by its anchor query:

    The list you've provided appears to include a wide range of substances and compounds from various fields such as chemistry, medicine, and pharmaceuticals. Here's a brief categorization and description of a few categories:

    ### Chemical Compounds and Their Uses

    1. Calcin: An anticoagulant used to prevent blood clots by inhibiting the formation of calcium deposits in the blood.
    2. Colistimethate: An antibiotic used to treat serious bacterial infections that are not responsive to other antibiotics.

    ### Medication and Drug Discovery

    3. Benznidazole: An antiparasitic drug used to treat infections like American trypanosomiasis (Chagas' disease).
    4. Amediplase: A promising stem cell derivative molecule developed as a cancer drug. It appears to have the potential to selectively target and kill cancer cells.

    ### Medical Imaging Agents

    5. Gadodiamide, Gadoteridol, Iothalamic acid, Ioversol, and Technetium Tc-99m exametazime: These are contrast agents used in medical imaging to enhance the visibility of organs or tissues in MRI, CT, and other imaging scans.

    ### Drug Delivery and Coagulation Management

    6. Kebuzone: Although not widely recognized, specific uses or pharmaceutical names like this might refer to a less common medication or generic name.
    7. Robenacoxib: A selective COX-2 inhibitor used to manage pain and inflammation.
    8. Melagatran: An anticoagulant, likely used in managing blood clotting.

    ### Immune Modulation and Cancer Therapy

    9. **Idaruciz
    Could you recommend any medications that effectively treat bacterial arthritis and are compatible with Alprostadil? Ideally, the medication should have a short half-life, being metabolized within an hour or so, to accommodate my active lifestyle.

    MYT1, also known by various aliases such as 'C20orf36', 'MTF1', 'MYTI', 'NZF2', 'PLPB1', 'ZC2H2C1', and 'ZC2HC4A', is a gene/protein that codes for a zinc finger-containing DNA-binding protein. This gene is part of a family of neural-specific proteins that play a crucial role in the developing nervous system by binding to the promoter regions of proteolipid proteins in the central nervous system. The protein encoded by MYT1 has been associated with certain diseases, including oculo-auriculo-vertebral spectrum and hemifacial microsomia. It is involved in various cellular components such as the nucleus, chromatin, cytosol, and nucleoplasm, and its biological processes include cell differentiation, regulation of transcription by RNA polymerase II, and nervous system development. MYT1 is expressed in a wide variety of tissues, such as the pituitary gland, pituitary gland
    Which gene or protein is not expressed in the stomach fundus and nasal cavity epithelial tissue?


    ### Key Information on Long QT Syndrome

    #### What is Long QT Syndrome?
    - Definition: A genetic condition that affects the heart's electrical system, causing abnormal heart rhythms.
    - Signs & Symptoms: Syncope (fainting), seizures, abnormal heart rhythms, and sudden cardiac death.

    #### Forms of Long QT Syndrome
    - Congenital (Romano-Ward Syndrome): Occurs when a single gene variant is inherited from one parent, often with normal hearing.
    - Acquired: Resulting from medications, medical conditions, or electrolyte imbalances.

    #### Risk Factors
    - Hereditary: A family history increases risk.
    - Certain Medications: Common antibiotics, antifungals, and some antidepressants can cause it.
    - Medical Conditions: Diarrhea, vomiting, eating disorders, or acute kidney injury.

    #### Complications
    - Torsades de Pointes: Chaotic heart rhythm that can be fatal.
    - Ventricular Fibrillation: Rapid, ineffective heartbeats; can lead to sudden death without immediate treatment.
    - Sudden Death: Occurs in an otherwise healthy individual.

    #### Prevention
    - Medication Review: Regularly check medications for
    Which cardiac arrhythmia contraindicates the use of medications prescribed for bladder infections?
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            384
        ],
        "matryoshka_weights": [
            1
        ],
        "n_dims_per_step": -1
    }
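With a single Matryoshka dimension of 384 (the full embedding size) and a weight of 1, the wrapper should reduce to the inner MultipleNegativesRankingLoss on the full embeddings. That inner loss treats every other positive in the batch as a negative for a given anchor. The sketch below illustrates the idea in plain Python. It is not the library's code: the function names are hypothetical, and the `scale=20.0` value with cosine similarity is assumed to match the library defaults.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and all other positives in the batch act as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two (anchor, positive) pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor
loss_aligned = in_batch_negatives_loss(anchors, aligned)
loss_swapped = in_batch_negatives_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each anchor is most similar to its own positive and large when the pairs are mismatched, which is why larger batches (more in-batch negatives) make the task harder.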
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 16
  • learning_rate: 1e-05
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: False
  • load_best_model_at_end: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_384_cosine_map@100
0 0 - 0.3971
0.0877 10 1.5497 -
0.1754 20 1.334 -
0.2632 30 1.2332 -
0.3509 40 1.1818 -
0.4386 50 1.087 -
0.5263 60 1.2103 -
0.6140 70 1.1323 -
0.7018 80 1.0869 -
0.7895 90 0.9275 -
0.8772 100 1.0684 -
0.9649 110 0.9702 -
1.0 114 - 0.4142
1.0526 120 1.0792 -
1.1404 130 1.1194 -
1.2281 140 0.9212 -
1.3158 150 1.0393 -
1.4035 160 1.099 -
1.4912 170 0.8902 -
1.5789 180 0.854 -
1.6667 190 0.6828 -
1.7544 200 0.9187 -
1.8421 210 0.8597 -
1.9298 220 1.0286 -
2.0 228 - 0.4179
2.0175 230 0.6874 -
2.1053 240 0.7523 -
2.1930 250 0.7594 -
2.2807 260 0.6929 -
2.3684 270 0.7718 -
2.4561 280 0.7803 -
2.5439 290 0.7324 -
2.6316 300 0.7252 -
2.7193 310 0.7532 -
2.8070 320 0.8368 -
2.8947 330 0.9413 -
2.9825 340 0.7401 -
3.0 342 - 0.4185
3.0702 350 0.6514 -
3.1579 360 0.6765 -
3.2456 370 0.8422 -
3.3333 380 0.6532 -
3.4211 390 0.7121 -
3.5088 400 0.5739 -
3.5965 410 0.7838 -
3.6842 420 0.7554 -
3.7719 430 0.743 -
3.8596 440 0.5219 -
3.9474 450 0.8437 -
4.0 456 - 0.4187
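  • The saved checkpoint is the row at epoch 4.0, step 456, whose dim_384_cosine_map@100 of 0.4187 matches the cosine_map@100 reported under Evaluation.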

Framework Versions

  • Python: 3.10.10
  • Sentence Transformers: 3.1.1
  • Transformers: 4.45.1
  • PyTorch: 2.2.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.1
  • Tokenizers: 0.20.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 22.7M params (Safetensors, tensor type F32)