---
license: agpl-3.0
language:
- en
- zh
tags:
- AI4S
- MoE
---
# SciDFM: Dialogue Foundation Model for Science
SciDFM is a pioneering open-source dialogue foundation model tailored for science. It integrates a Mixture-of-Experts architecture into a transformer-based framework to enhance sophisticated scientific reasoning and understanding. SciDFM achieves strong performance on general scientific benchmarks such as SciEval and SciQ, and reaches state-of-the-art performance on domain-specific benchmarks among models of similar size.
## News
* **2024-06-28** The weights of SciDFM-MoE-A5.6B-v1.0 are open-sourced! A technical report is coming soon.
## Model Details
SciDFM is based on a transformer architecture and follows the modifications introduced by Llama, i.e., RMSNorm, RoPE, and SwiGLU. SciDFM uses the same hyper-parameters as OpenLLaMa-3B. To better model knowledge from different disciplines, we replace the feed-forward blocks with Mixture-of-Experts (MoE) layers.
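For intuition, below is a minimal PyTorch sketch of a top-k routed MoE feed-forward layer. It is an illustration rather than the SciDFM implementation: the expert networks use a plain SiLU MLP instead of SwiGLU, and the dimensions, expert count, and routing details are placeholder assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Illustrative top-k routed MoE feed-forward layer (not the SciDFM code)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router that scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent feed-forward network
        # (a plain SiLU MLP here for brevity; SciDFM uses SwiGLU).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        scores = self.gate(x)                               # (batch, seq_len, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # route each token to its top_k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e               # tokens assigned to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Quick shape check on random input.
layer = MoEFeedForward(d_model=64, d_ff=128, num_experts=4, top_k=2)
print(layer(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```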
## Training Details
SciDFM is pre-trained on a large corpus containing ~300B science tokens and ~270B general tokens for two epochs, consuming about 1.1T tokens in total. We further fine-tune SciDFM on ~9.3M instruction-following samples for 5 epochs to improve its performance on downstream benchmarks.
## Usage Details
### Local Inference
To load and run SciDFM locally, here is an example:
```python
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM, GenerationConfig

model_name_or_id = "OpenDFM/SciDFM-MoE-A5.6B-v1.0"
tokenizer = LlamaTokenizer.from_pretrained(model_name_or_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

# Wrap the instruction in SciDFM's chat template.
chat_template = "<|user|>:{instruction}<|assistant|>:"
input_text = "What is Mixture-of-Experts (MoE) in computer science?"
input_text = chat_template.format(instruction=input_text)

inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
generation_config = GenerationConfig(
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.9,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id
)
outputs = model.generate(**inputs, generation_config=generation_config)
# Decode the generated sequence and strip the echoed prompt.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):]
print(generated_text.strip())
```
### SMILES Preprocessing
When your input contains SMILES notation, we recommend preprocessing the SMILES with the `rdkit` package to canonicalize it. Here is an example:
```python
from rdkit import Chem

def canonicalize_smiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True, kekuleSmiles=False)
```
or directly:
```python
from rdkit import Chem
def canonicalize_smiles(smiles):
    return Chem.CanonSmiles(smiles, useChiral=True)
```
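Both helpers should yield the same canonical form for a given molecule (the exact string may vary slightly across RDKit versions), for example:
```python
from rdkit import Chem

smi = "C(C(=O)O)N"  # glycine, the same example molecule used below
print(canonicalize_smiles(smi))               # e.g. "NCC(=O)O"
print(Chem.CanonSmiles(smi, useChiral=True))  # same canonical form
```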
### Special Token Preprocessing
If there is a SMILES expression in your input, please first process it with the following function:
```python
import sentencepiece as spm

smiles_model = spm.SentencePieceProcessor(model_file="smiles.model")

def convert_smiles(smiles_str):
    # Encode the SMILES with the dedicated SentencePiece model (skipping the first piece)
    # and wrap each piece in the ChemDFM special tokens.
    pieces = smiles_model.encode_as_pieces(smiles_str)[1:]
    smiles = "".join([f"[ChemDFM_Start_SMILES_Unit]{piece}[ChemDFM_End_SMILES_Unit]" for piece in pieces])
    return smiles

convert_smiles("C(C(=O)O)N")
```
And if there is a protein sequence in your input, please first process it with the following function:
```python
def convert_protein(p_str):
    res = [f"<<protein>>{s}" for s in p_str]
    return "".join(res)

convert_protein("MIRLGAPQTL")
```
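As a usage sketch, the converted strings can then be placed into the chat template from the local-inference example before tokenization (how exactly molecule or protein inputs should be phrased in the prompt is our assumption):
```python
# Reuses convert_smiles() from above; the prompt phrasing is an illustrative assumption.
instruction = f"Describe the molecule {convert_smiles('C(C(=O)O)N')} briefly."
prompt = "<|user|>:{instruction}<|assistant|>:".format(instruction=instruction)
# `prompt` can now be tokenized and passed to model.generate() as shown in the
# local-inference example above.
```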
## Evaluation
We briefly compare SciDFM-MoE-A5.6B-v1.0 with similar-sized instruction-tuned LLMs on scientific evaluation benchmarks. The results are shown below:
| Model | SciEval | SciQ | ARC\_c | ARC\_e | GSM8K | MATH | MedQA | MMCQA | PMQA | Avg |
|--------------------|---------|-------|--------|--------|-------|-------|-------|-------|-------|-------|
| LLaMa2-7B | 27.06 | 57.00 | 36.43 | 46.59 | 3.94 | 3.96 | 26.32 | 29.84 | 66.80 | 32.95 |
| Galactica-6.7B | 46.28 | 74.20 | 44.28 | 61.83 | 2.80 | 6.32 | 30.48 | 36.46 | 48.80 | 38.91 |
| LLaMa2-13B | 33.88 | 78.10 | 56.66 | 72.35 | 22.82 | 3.90 | 32.68 | 34.28 | 77.80 | 45.45 |
| ChatGLM2-6B | 54.25 | 75.80 | 57.08 | 73.57 | 25.09 | 7.18 | 27.42 | 34.21 | 60.40 | 45.94 |
| Galactica-30B | 54.24 | 83.10 | 57.85 | 75.04 | 13.65 | 8.66 | 37.71 | 48.43 | 58.80 | 48.35 |
| LLaMa3-8B | 59.70 | 90.00 | 71.16 | 84.05 | 5.91 | 7.00 | 48.78 | 52.74 | 26.60 | 49.59 |
| ChatGLM3-6B | 51.13 | 77.60 | 60.84 | 75.97 | 60.27 | 23.52 | 24.59 | 31.39 | 51.80 | 50.53 |
| SciGLM-6B | 61.22 | 88.70 | 77.47 | 86.57 | 42.23 | 16.40 | 42.81 | 44.94 | 73.60 | 59.12 |
| SciDFM | 62.48 | 88.00 | 64.76 | 81.48 | 59.14 | 27.28 | 44.54 | 53.10 | 78.00 | 61.56 |
| ChatGLM3-6B-base | 60.34 | 89.00 | 78.58 | 87.37 | 59.82 | 22.64 | 42.73 | 45.14 | 74.40 | 61.96 |
| Llama3-8B-Instruct | 64.91 | 91.60 | 76.45 | 87.33 | 76.57 | 26.26 | 56.48 | 59.31 | 72.00 | 67.44 |
## Citation
```
@article{sun2024scidfm,
title={SciDFM: A Large Language Model with Mixture-of-Experts for Science},
author={Sun, Liangtai and Luo, Danyu and Ma, Da and Zhao, Zihan and Chen, Baocai and Shen, Zhennan and Zhu, Su and Chen, Lu and Chen, Xin and Yu, Kai},
journal={arXiv preprint arXiv:2409.18412},
year={2024}
}
```