Chemlactica-125m is a galactica-125m model continually pretrained on organic molecules. It was trained on 40B tokens covering 110M+ molecules from PubChem, together with their chemical properties (molecular weight, synthetic accessibility score, drug-likeness, etc.) and pairwise similarities (Tanimoto distance between ECFP fingerprints).
Example prompts:
</s>[START_SMILES]CC(=O)OC1=CC=CC=C1C(=O)O[END_SMILES][SAS]
will attempt to predict the synthetic accessibility score of the given molecule.
</s>[SAS]2.25[/SAS][SIMILAR]0.62 CC(=O)OC1=CC=CC=C1C(=O)O[/SIMILAR][START_SMILES]
will attempt to generate a molecule with an SAS score of 2.25 and a similarity score of 0.62 to the given molecule.
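The two prompt shapes above can be assembled programmatically. The following is a minimal sketch that only builds the prompt strings; the function names `property_prompt` and `generation_prompt` are hypothetical helpers introduced for illustration and are not part of the released code.

```python
BOS = "</s>"  # every query starts with this symbol


def property_prompt(smiles: str, tag: str) -> str:
    """Prompt that ends with an open property tag, so the model
    continues with its predicted value for the given molecule."""
    return f"{BOS}[START_SMILES]{smiles}[END_SMILES][{tag}]"


def generation_prompt(sas: float, similarity: float, reference_smiles: str) -> str:
    """Prompt that states a target SAS and a similarity to a reference
    molecule, then opens [START_SMILES] for the model to complete."""
    return (
        f"{BOS}[SAS]{sas:.2f}[/SAS]"
        f"[SIMILAR]{similarity:.2f} {reference_smiles}[/SIMILAR]"
        "[START_SMILES]"
    )


aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(property_prompt(aspirin, "SAS"))
print(generation_prompt(2.25, 0.62, aspirin))
```

The two printed strings match the example prompts shown above; the resulting text would then be passed to the model as an ordinary generation prompt.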
The model can be wrapped into an optimization loop to traverse the chemical space with evolving prompts. See the code on GitHub.
A preprint describing the model, along with an optimization algorithm built on top of it that sets the state of the art on Practical Molecular Optimization and other benchmarks, is available on arXiv.
A few notes:
- All queries should start with the </s> symbol.
- All numbers are rounded to two decimal points.
- All SMILES are canonicalized using rdkit.
- Available tags: [CLOGP], [WEIGHT], [QED], [SAS], [TPSA], [RINGCOUNT], [SIMILAR] ...
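The notes above suggest normalizing inputs before they go into a prompt. Here is a minimal sketch under stated assumptions: `canonicalize_smiles` and `property_clause` are hypothetical helper names, the canonicalization uses RDKit's default `MolToSmiles` settings (the exact options used during training are not stated here), and trailing zeros are kept when formatting numbers, which is also an assumption.

```python
def canonicalize_smiles(smiles: str) -> str:
    """Return RDKit's canonical SMILES for `smiles` (default settings)."""
    # Imported here so the formatting helper below works without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for input it cannot parse
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


def property_clause(tag: str, value: float) -> str:
    """Format one property as [TAG]value[/TAG] with two decimal places."""
    return f"[{tag}]{value:.2f}[/{tag}]"


print(property_clause("QED", 0.5))
print(property_clause("SAS", 2.25))
```

Keeping the RDKit import inside the function is a convenience for this sketch only; in a real pipeline it would normally sit at the top of the module.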
The model is part of a three-model family: Chemlactica-125M, Chemlactica-1.3B and Chemma-2B.
We look forward to seeing the community use the model in new applications and contexts.