MPNet base trained on sxc_med_llm_chemical_gen
This is a sentence-transformers model finetuned from microsoft/mpnet-base on the sxc_med_llm_chemical_gen dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: microsoft/mpnet-base
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
- Training Dataset: sxc_med_llm_chemical_gen
- Language: en
- License: apache-2.0
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
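The Pooling module above has only `pooling_mode_mean_tokens` enabled, so each sentence embedding is the average of its token vectors. The following is a minimal NumPy sketch of masked mean pooling, not the library's own code; the function name `mean_pool` and the toy inputs are illustrative assumptions.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden dimension.
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # guard against division by zero
    return summed / counts

# Toy input (assumption): 2 sentences, 4 token positions, 768-dimensional token vectors.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(2, 4, 768))
attention_mask = np.array([[1, 1, 1, 0],    # last position is padding
                           [1, 1, 0, 0]])   # last two positions are padding
sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings.shape)
```

Padded positions are excluded from both the sum and the count, so sentences of different lengths in the same batch are averaged over their real tokens only.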
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Saideepthi55/sentencetransformer_ftmodel_on_chemical_dataset")
# Run inference
sentences = [
'With a molecule represented by the SMILES string CNNNCC(=O)N[C@H](C)C[C@@H](C)NCc1ccc2c(c1)CCC2, propose adjustments that can increase its logP value while keeping the output molecule structurally related to the input molecule.',
'Given a molecule expressed in SMILES string, help me optimize it according to my requirements.',
'In line with your criteria, I\'ve optimized the molecule and present it as "C[C@H](C[C@@H](C)NC(=O)COC(C)(C)C)NCc1ccc2c(c1)CCC2".',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
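Since this card lists cosine similarity as the similarity function, the `[3, 3]` matrix above can be read as pairwise cosine scores. Here is a small NumPy sketch of that computation, assuming random vectors as a stand-in for `model.encode(...)`; the helper name `cosine_similarity_matrix` is hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Stand-in for real embeddings (assumption): 3 random 768-dimensional vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 768))
scores = cosine_similarity_matrix(embeddings, embeddings)
print(scores.shape)
```

Each diagonal entry compares a vector with itself and is therefore 1, and the matrix is symmetric.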
Evaluation
Metrics
Triplet
- Dataset: sxc_med_llm_chemical_gen
- Evaluated with TripletEvaluator
Metric | Value |
---|---|
cosine_accuracy | 0.9838 |
dot_accuracy | 0.0162 |
manhattan_accuracy | 0.9827 |
euclidean_accuracy | 0.9836 |
max_accuracy | 0.9838 |
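A triplet accuracy is the fraction of (anchor, positive, negative) triplets in which the anchor is closer to the positive than to the negative under a given measure. The sketch below uses the conventional definitions (higher is closer for cosine and dot product, lower is closer for distances). It is an illustration on synthetic vectors, not the `TripletEvaluator` source, and the library's exact comparison rules may differ, so it will not necessarily reproduce the values reported above.

```python
import numpy as np

def triplet_accuracies(anchors, positives, negatives):
    """Fraction of triplets ranked correctly under four similarity/distance measures."""
    def cos(a, b):
        return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

    def dot(a, b):
        return np.sum(a * b, axis=1)

    def manhattan(a, b):
        return np.abs(a - b).sum(axis=1)

    def euclidean(a, b):
        return np.linalg.norm(a - b, axis=1)

    results = {
        # Similarities: the positive should score higher than the negative.
        "cosine_accuracy": np.mean(cos(anchors, positives) > cos(anchors, negatives)),
        "dot_accuracy": np.mean(dot(anchors, positives) > dot(anchors, negatives)),
        # Distances: the positive should be nearer than the negative.
        "manhattan_accuracy": np.mean(manhattan(anchors, positives) < manhattan(anchors, negatives)),
        "euclidean_accuracy": np.mean(euclidean(anchors, positives) < euclidean(anchors, negatives)),
    }
    results = {name: float(value) for name, value in results.items()}
    results["max_accuracy"] = max(results.values())
    return results

# Synthetic triplets (assumption): positives are slightly perturbed anchors,
# negatives point in the opposite direction.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(100, 16))
positives = anchors + 0.01 * rng.normal(size=(100, 16))
negatives = -anchors
accuracies = triplet_accuracies(anchors, positives, negatives)
print(accuracies)
```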
Training Details
Training Dataset
sxc_med_llm_chemical_gen
- Dataset: sxc_med_llm_chemical_gen at ae28874
- Size: 117,502 training samples
- Columns: input, instruction, and output
- Approximate statistics based on the first 1000 samples:
Statistic | input | instruction | output |
---|---|---|---|
type | string | string | string |
min | 2 tokens | 6 tokens | 3 tokens |
mean | 50.03 tokens | 16.19 tokens | 67.94 tokens |
max | 156 tokens | 69 tokens | 321 tokens |
- Samples:
input | instruction | output |
---|---|---|
(empty) | List the countries of the Middle East | Egypt, Israel, Iran, Iraq, Jordan, Kuwait, Lebanon, Qatar, Saudi Arabia, Syria, Turkey, The United Arab Emirates, Oman, and Yemen. |
Support me in refining the molecule CCC(C)=C(CC)c1ccc(Nc2nc3nonc3nc2N2CCC(Oc3ccc(Cl)cc3)CC2)cc1 to have a more drug-like nature, ensuring it remains close to the input molecule. Please inform me of the QED value of the optimized molecule. | Help me adapt a molecular structure denoted in SMILES string based on my preferences. | I've optimized the molecule according to your requirements, yielding the structure "COC(=O)c1ccc(Nc2nc3nonc3nc2N2CCC(Oc3ccc(Cl)cc3)CC2)cc1" and an estimated QED value of 0.4. |
(empty) | Create a social media strategy to attract followers on Twitter. | The social media strategy for Twitter should involve Tweeting on a regular basis, with content that is engaging and relevant to the target audience. This can include retweetable quotes, links to videos and articles, sharing photos/images, polls, and so on. It is also beneficial for the information to include hashtags relevant to the content and any relevant topics of the industry. It is also important to leverage Twitter influencers and accounts, as well as engaging with potential followers regularly and DM them when necessary. Additionally, running campaigns, contests and giveaways to gain more followers should also be part of the strategy. Finally, hashtag research should be conducted to increase the reach of the tweets. |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
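MultipleNegativesRankingLoss treats every other positive in a batch as a negative for a given anchor and applies cross-entropy to the scaled similarity matrix. The NumPy sketch below illustrates that idea with the `scale` of 20.0 and cosine similarity listed above. It covers only anchor-positive pairs (no explicit negative column), uses toy vectors, and is not the library's implementation.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                              # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for anchor i is positive i, i.e. the diagonal.
    return float(-np.mean(np.diag(log_probs)))

# Toy batch (assumption): four mutually orthogonal anchors.
anchors = np.eye(4)
aligned_loss = multiple_negatives_ranking_loss(anchors, anchors)
# Rolling the positives by one row pairs every anchor with the wrong positive.
shuffled_loss = multiple_negatives_ranking_loss(anchors, np.roll(anchors, 1, axis=0))
print(aligned_loss, shuffled_loss)
```

The loss is near zero when each anchor is most similar to its own positive and grows when another item in the batch scores higher, which is why larger batches supply more negatives per step.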
Evaluation Dataset
sxc_med_llm_chemical_gen
- Dataset: sxc_med_llm_chemical_gen at ae28874
- Size: 4,500 evaluation samples
- Columns: input, instruction, and output
- Approximate statistics based on the first 1000 samples:
Statistic | input | instruction | output |
---|---|---|---|
type | string | string | string |
min | 2 tokens | 6 tokens | 3 tokens |
mean | 68.49 tokens | 16.89 tokens | 69.16 tokens |
max | 165 tokens | 36 tokens | 205 tokens |
- Samples:
input | instruction | output |
---|---|---|
With a molecule represented by the SMILES string COc1ccc(CN(C(=O)Nc2ccc(C@Hc3ccccc3)c(Cl)c2)C2CC2)cc1, propose adjustments that can decrease its blood-brain barrier permeability (BBBP) value by at least 0.1 compared to the pre-optimized value while keeping the output molecule structurally related to the input molecule. Please inform me of the BBBP value of the optimized molecule. | Help me adapt a molecular structure denoted in SMILES string based on my preferences. | I've optimized the molecule according to your requirements, yielding the structure "CC(C)c1cccc(Oc2ccc(CN(C(=O)Nc3ccc(C@Hc4ccccc4)c(Cl)c3)C3CC3)cc2)c1" and an estimated BBBP value of 0.34. |
Which modifications can be applied to the molecule Cc1nn(-c2cc3cc(F)ccc3s2)c2ncc(C(=O)N3CCCC3c3nnc4n3CCCCC4)cc12 to increase its water solubility value while ensuring it remains similar to the original molecule? | Aid me in refining a molecular structure written in SMILES notation based on my criteria. | I've optimized the molecule according to your requirements, resulting in the following structure: "Cc1nn(C)c2ncc(C(=O)N3CCCC3c3nnc4n3CCCCC4)cc12". |
(empty) | Describe the benefits of having a mentor. | Having a mentor can provide tremendous benefits, including guidance, advice, support and connection to a network of contacts. Mentors can offer valuable feedback, help to create goals and strategies, and provide the accountability necessary to carry out the desired goals. They can also provide a fresh perspective which can help to create new ideas and solutions. |
- Loss: MultipleNegativesRankingLoss with these parameters: { "scale": 20.0, "similarity_fct": "cos_sim" }
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 2e-05
- num_train_epochs: 1
- warmup_ratio: 0.1
- fp16: True
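These settings imply a learning-rate curve that rises linearly to 2e-05 over the first 10% of steps and then decays linearly to zero (the scheduler type is `linear`). The plain-Python sketch below works that out from the 117,502 training samples and a batch size of 16. It assumes a single device with no gradient accumulation and rounds step counts up; the function name `learning_rate` is illustrative.

```python
import math

# Values taken from this card.
train_samples = 117_502
batch_size = 16
epochs = 1
peak_lr = 2e-5
warmup_ratio = 0.1

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

def learning_rate(step: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

print(total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, learning_rate(step))
```

Under these assumptions the run has 7,344 optimizer steps, of which the first 735 are warmup.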
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | Training Loss | Validation Loss | sxc_med_llm_chemical_gen_max_accuracy |
---|---|---|---|---|
0 | 0 | - | - | 0.7076 |
0.0136 | 100 | 4.1705 | 3.6314 | 0.7469 |
0.0272 | 200 | 3.0088 | 2.4771 | 0.8191 |
0.0408 | 300 | 2.3803 | 2.3765 | 0.8716 |
0.0545 | 400 | 2.2281 | 2.3122 | 0.9316 |
0.0681 | 500 | 2.1647 | 2.2997 | 0.9436 |
0.0817 | 600 | 2.1289 | 2.2663 | 0.9511 |
0.0953 | 700 | 2.0663 | 2.2601 | 0.9629 |
0.1089 | 800 | 2.065 | 2.2500 | 0.9687 |
0.1225 | 900 | 2.0399 | 2.2595 | 0.9693 |
0.1362 | 1000 | 1.9939 | 2.2375 | 0.9707 |
0.1498 | 1100 | 1.9858 | 2.2220 | 0.9684 |
0.1634 | 1200 | 2.0069 | 2.2265 | 0.9758 |
0.1770 | 1300 | 1.9591 | 2.2073 | 0.9702 |
0.1906 | 1400 | 1.9288 | 2.2078 | 0.976 |
0.2042 | 1500 | 1.9466 | 2.2036 | 0.9758 |
0.2179 | 1600 | 1.9175 | 2.2086 | 0.9764 |
0.2315 | 1700 | 1.8835 | 2.2329 | 0.9796 |
0.2451 | 1800 | 1.9134 | 2.2003 | 0.9796 |
0.2587 | 1900 | 1.8809 | 2.2003 | 0.9811 |
0.2723 | 2000 | 1.9263 | 2.2039 | 0.9824 |
0.2859 | 2100 | 1.9101 | 2.2084 | 0.9804 |
0.2996 | 2200 | 1.8846 | 2.2052 | 0.9831 |
0.3132 | 2300 | 1.8842 | 2.1903 | 0.9818 |
0.3268 | 2400 | 1.8945 | 2.1984 | 0.9807 |
0.3404 | 2500 | 1.9217 | 2.1859 | 0.9829 |
0.3540 | 2600 | 1.8704 | 2.1995 | 0.982 |
0.3676 | 2700 | 1.889 | 2.2038 | 0.9822 |
0.3813 | 2800 | 1.875 | 2.2079 | 0.9829 |
0.3949 | 2900 | 1.8792 | 2.1975 | 0.9833 |
0.4085 | 3000 | 1.882 | 2.1895 | 0.9796 |
0.4221 | 3100 | 1.8886 | 2.2115 | 0.9831 |
0.4357 | 3200 | 1.8629 | 2.2040 | 0.9838 |
0.4493 | 3300 | 1.8647 | 2.1973 | 0.9836 |
0.4630 | 3400 | 1.8888 | 2.1961 | 0.9838 |
0.4766 | 3500 | 1.8692 | 2.2027 | 0.9829 |
0.4902 | 3600 | 1.8846 | 2.1954 | 0.9838 |
0.5038 | 3700 | 1.8482 | 2.1888 | 0.9822 |
0.5174 | 3800 | 1.8527 | 2.1873 | 0.9824 |
0.5310 | 3900 | 1.8378 | 2.1940 | 0.9811 |
0.5447 | 4000 | 1.8679 | 2.2008 | 0.9833 |
0.5583 | 4100 | 1.8421 | 2.1845 | 0.9842 |
0.5719 | 4200 | 1.8325 | 2.1948 | 0.9847 |
0.5855 | 4300 | 1.8675 | 2.1750 | 0.9836 |
0.5991 | 4400 | 1.8483 | 2.1828 | 0.9831 |
0.6127 | 4500 | 1.854 | 2.1886 | 0.9831 |
0.6264 | 4600 | 1.827 | 2.1876 | 0.9824 |
0.6400 | 4700 | 1.8863 | 2.1849 | 0.9836 |
0.6536 | 4800 | 1.8919 | 2.1816 | 0.984 |
0.6672 | 4900 | 1.8211 | 2.1830 | 0.9847 |
0.6808 | 5000 | 1.8345 | 2.1847 | 0.9842 |
0.6944 | 5100 | 1.8685 | 2.1855 | 0.9853 |
0.7081 | 5200 | 1.85 | 2.1864 | 0.9844 |
0.7217 | 5300 | 1.8222 | 2.1875 | 0.9842 |
0.7353 | 5400 | 1.8179 | 2.1923 | 0.9844 |
0.7489 | 5500 | 1.7992 | 2.1909 | 0.9851 |
0.7625 | 5600 | 1.8495 | 2.1811 | 0.9847 |
0.7761 | 5700 | 1.808 | 2.1763 | 0.9842 |
0.7898 | 5800 | 1.8293 | 2.1861 | 0.9849 |
0.8034 | 5900 | 1.8184 | 2.1845 | 0.9851 |
0.8170 | 6000 | 1.8256 | 2.1956 | 0.9849 |
0.8306 | 6100 | 1.7904 | 2.1916 | 0.9842 |
0.8442 | 6200 | 1.8028 | 2.1918 | 0.9847 |
0.8578 | 6300 | 1.8316 | 2.1917 | 0.9856 |
0.8715 | 6400 | 1.7951 | 2.1929 | 0.9851 |
0.8851 | 6500 | 1.8175 | 2.1866 | 0.9847 |
0.8987 | 6600 | 1.8071 | 2.1899 | 0.9853 |
0.9123 | 6700 | 1.8632 | 2.1905 | 0.9844 |
0.9259 | 6800 | 1.8441 | 2.1885 | 0.984 |
0.9395 | 6900 | 1.8243 | 2.1865 | 0.9836 |
0.9532 | 7000 | 1.8055 | 2.1852 | 0.9842 |
0.9668 | 7100 | 1.8227 | 2.1843 | 0.984 |
0.9804 | 7200 | 1.8287 | 2.1831 | 0.984 |
0.9940 | 7300 | 1.8379 | 2.1838 | 0.9838 |
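The Epoch column in the log above is the step count divided by the number of steps in one epoch. As a quick cross-check, the sketch below recomputes it from the 117,502 training samples and a batch size of 16, assuming a single device, no gradient accumulation, and a partial final batch that is kept; `epoch_at` is an illustrative helper.

```python
import math

# Assumption: one optimizer step per batch of 16 on a single device.
steps_per_epoch = math.ceil(117_502 / 16)

def epoch_at(step: int) -> float:
    """Fraction of the epoch completed at a given step, rounded as in the log."""
    return round(step / steps_per_epoch, 4)

for step in (100, 3700, 7300):
    print(step, epoch_at(step))
```

The computed fractions agree with the logged rows for steps 100, 3700, and 7300, which suggests evaluation ran every 100 steps across a single epoch of about 7,344 steps.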
Framework Versions
- Python: 3.10.12
- Sentence Transformers: 3.2.0
- Transformers: 4.44.2
- PyTorch: 2.4.1+cu121
- Accelerate: 0.34.2
- Datasets: 3.0.1
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}