Update README.md
README.md
CHANGED
@@ -3,23 +3,23 @@ tags:

- biology
- medical
---

# ProteinGLM

## Introduction to ProteinGLM Family Models

ProteinGLM is the open-source version of the latest protein language models for protein understanding tasks (masked protein language models) and protein design (causal protein language models). The ProteinGLM family models are developed by Tsinghua University. Alongside these, we have released the int4-quantized ProteinGLM-100B weights and a set of smaller models: 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.

### Out-of-Distribution Perplexity Evaluation

We evaluated the ProteinGLM (MLM or CLM) and ProteinGLM-INT4 (100B) models on two OOD test sets: one with sequence identity lower than 0.9 to the training set (< 0.9 ID) and the other with sequence identity lower than 0.5 to the training set (< 0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The MLM perplexity results, compared against ESM2-3B and ESM2-15B, and the CLM perplexity, compared against ProGen2-xlarge (6.4B), are as follows (lower is better):

| Model | ESM2 (3B) | ESM2 (15B) | PGLM (1B) | PGLM (3B) | PGLM (10B) | PGLM-INT4 (100B) |
|:---------|:----:|:----:|:----:|:----:|:----:|:----:|
| < 0.9 ID | 7.7  | 7.3  | 9.3  | 7.8  | 7.6  | **6.8**  |
| < 0.5 ID | 11.5 | 11.0 | 13.5 | 11.9 | 11.6 | **10.8** |

| Model | ProGen2-xlarge (6.4B) | PGLM (1B) | PGLM (3B) | PGLM (7B) | PGLM-INT4 (100B) |
|:---------|:----:|:----:|:----:|:----:|:----:|
| < 0.9 ID | 9.7  | 9.8  | 9.3  | 8.9  | **8.9**  |
| < 0.5 ID | 14.3 | 14.0 | 13.7 | 13.5 | **13.5** |
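To make the reported numbers concrete, here is a minimal sketch (not the authors' evaluation script) of how per-token perplexity can be computed for one of the causal checkpoints listed further down; the model id and sequence are examples, and it assumes the remote modeling code returns standard `.logits`.

```python
# Illustrative only: per-token perplexity of a causal ProteinGLM checkpoint on one sequence.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

repo = "bo1015/protein-3b-clm"  # any CLM checkpoint from the download table below
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained(repo, trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = True
model = AutoModelForCausalLM.from_pretrained(repo, config=config, torch_dtype=torch.half, trust_remote_code=True)
model.eval()

seq = "MGSSHHHHHHSSGLVPRGSHM"  # example protein sequence
input_ids = tokenizer(seq, return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = model(input_ids=input_ids).logits  # [1, seq_len, vocab_size]

# Each position predicts the next token; perplexity is exp(mean cross-entropy).
shift_logits = logits[:, :-1, :].float()
shift_labels = input_ids[:, 1:]
loss = torch.nn.functional.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)), shift_labels.reshape(-1)
)
print(f"perplexity: {torch.exp(loss).item():.2f}")
```

For the MLM checkpoints the analogous quantity is typically the exponential of the masked-token cross-entropy rather than the next-token cross-entropy.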

@@ -34,27 +34,27 @@ You can choose to manually download the necessary weights.

| Model | Download |
|------------------|------------------------------------------------------------------------|
| ProteinGLM-1B-MLM | [🤗 Huggingface](https://huggingface.co/bo1015/proteinglm-1b-mlm) |
| ProteinGLM-3B-MLM | [🤗 Huggingface](https://huggingface.co/bo1015/proteinglm-3b-mlm) |
| ProteinGLM-10B-MLM | [🤗 Huggingface](https://huggingface.co/bo1015/proteinglm-10b-mlm) |
| ProteinGLM-1B-CLM | [🤗 Huggingface](https://huggingface.co/bo1015/protein-1b-clm) |
| ProteinGLM-3B-CLM | [🤗 Huggingface](https://huggingface.co/bo1015/protein-3b-clm) |
| ProteinGLM-7B-CLM | [🤗 Huggingface](https://huggingface.co/bo1015/protein-7b-clm) |
| ProteinGLM-INT4 (100B) (MLM or CLM) | [🤗 Huggingface](https://huggingface.co/Bo1015/proteinglm-100b-int4) |
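If you prefer to fetch the weights yourself rather than letting `from_pretrained` cache them on first use, a minimal sketch with `huggingface_hub` (repo id and target directory are only examples) looks like this:

```python
# Illustrative only: download one checkpoint from the table above to a local directory.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="bo1015/proteinglm-1b-mlm",  # any repo id from the table above
    local_dir="./proteinglm-1b-mlm",     # example target directory
)
print(f"weights downloaded to {local_path}")
```

The resulting local directory can then be passed to `from_pretrained` in place of the hub id.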
## How to use

### ProteinGLM-MLM: Masked Language Models for Protein Understanding Tasks

(The ProteinGLM-100B INT4 quantization requires approximately 50 GB of GPU memory. Inference can run on a single 80 GB A100/A800 GPU or across multiple GPUs with a combined 60 GB of memory.)
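For the multi-GPU option, one possible approach (our assumption, not shown in the original README, and dependent on the checkpoint's remote code supporting Accelerate dispatch) is to let `device_map="auto"` shard the weights instead of calling `.cuda()`; the README's own single-GPU example follows below.

```python
# Illustrative only: sharding the INT4 checkpoint across available GPUs with Accelerate.
# Requires `pip install accelerate`; support depends on the model's remote modeling code.
import torch
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = False
model = AutoModelForMaskedLM.from_pretrained(
    "Bo1015/proteinglm-100b-int4",
    config=config,
    torch_dtype=torch.half,
    trust_remote_code=True,
    device_map="auto",  # place layers on the available GPUs automatically
)
```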

```python
# Obtain residue embeddings
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoTokenizer, AutoConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = False
model = AutoModelForMaskedLM.from_pretrained("Bo1015/proteinglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
if torch.cuda.is_available():
    model = model.cuda()

@@ -84,15 +84,15 @@ model = AutoModelForTokenClassification.from_pretrained(config, trust_remote_cod
```
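The lines that actually extract the embeddings (README lines 61-83) are unchanged and therefore elided from this diff. As a rough, unofficial sketch of that step, continuing from the `tokenizer` and `model` set up above and assuming the remote code accepts `output_hidden_states`:

```python
# Illustrative only: pull per-residue embeddings from the masked LM loaded above.
seq = "MGSSHHHHHHSSGLVPRGSHM"  # example protein sequence
input_ids = tokenizer(seq, return_tensors="pt")["input_ids"]
if torch.cuda.is_available():
    input_ids = input_ids.cuda()

model.eval()
with torch.no_grad():
    outputs = model(input_ids=input_ids, output_hidden_states=True)

# Last hidden layer: one vector per residue token.
residue_embeddings = outputs.hidden_states[-1].squeeze(0)
print(residue_embeddings.shape)
```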

### ProteinGLM-CLM: Causal Language Models for Protein Design
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import torch

tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = True
model = AutoModelForCausalLM.from_pretrained("Bo1015/proteinglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
if torch.cuda.is_available():
    model = model.cuda()

@@ -115,11 +115,10 @@ for idx, each in enumerate(prompt):

    print(f"\nEnd generation with length: {len(output.split())} - seqs: {output}\n")
```
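The prompt and sampling setup that feed the `for idx, each in enumerate(prompt):` loop above (README lines 99-114) are likewise elided by this diff. A minimal, unofficial sketch of such a loop, continuing from the `tokenizer` and `model` above and with arbitrary sampling settings:

```python
# Illustrative only: generate continuations for a few example prompt fragments with the
# causal LM loaded above. Sampling settings here are arbitrary, not the authors' values.
prompt = ["MGH", "MEVK"]  # example prompt fragments to extend

for idx, each in enumerate(prompt):
    input_ids = tokenizer(each, return_tensors="pt")["input_ids"]
    if torch.cuda.is_available():
        input_ids = input_ids.cuda()
    with torch.no_grad():
        generated = model.generate(input_ids, max_new_tokens=128, do_sample=True, top_p=0.9)
    output = tokenizer.decode(generated[0], skip_special_tokens=True)
    print(f"\nEnd generation with length: {len(output.split())} - seqs: {output}\n")
```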

## LICENSE

The model in this repository is released under the [Creative Commons Attribution-NonCommercial 4.0 International License](./LICENSE).

## Citations