Bo1015 committed
Commit 969bdd1 • Parent: 4ca06d9

Update README.md
Files changed (1): README.md (+23, -24)
README.md CHANGED
@@ -3,23 +3,23 @@ tags:
  - biology
  - medical
  ---
- # xTrimoPGLM
+ # ProteinGLM

- ## Introduction to xTrimoPGLM Family Models
+ ## Introduction to ProteinGLM Family Models

- xTrimoPGLM is the open-source version of the latest protein language models for protein understanding tasks (Masked Protein Language Models) and protein design (Causal Protein Language Models). The xTrimoPGLM family models are developed by BioMap and Tsinghua University. Along with this, we have released the int4-quantized xTrimoPGLM-100B weights and other xTrimo-series small models: 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.
+ ProteinGLM is the open-source version of the latest protein language models for protein understanding tasks (Masked Protein Language Models) and protein design (Causal Protein Language Models). The ProteinGLM family models are developed by Tsinghua University. Along with this, we have released the int4-quantized ProteinGLM-100B weights and other small models: 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.

  ### Out-of-Distribution Perplexity Evaluation

- We evaluated the xTrimoPGLM (xTMLM or xTCLM) and xTrimoPGLM (100B) models on two OOD test sets, one with sequence identity lower than 0.9 with the training set (<0.9 ID) and the other with sequence identity lower than 0.5 with the training set (<0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The MLM perplexity results, compared against ESM2-3B and ESM2-15B, and the CLM perplexity results, compared against ProGen2-xlarge (6.4B), are as follows (lower is better):
+ We evaluated the ProteinGLM (MLM or CLM) and ProteinGLM-INT4 (100B) models on two OOD test sets, one with sequence identity lower than 0.9 with the training set (<0.9 ID) and the other with sequence identity lower than 0.5 with the training set (<0.5 ID). Each OOD dataset comprises approximately 10,000 protein sequences. The MLM perplexity results, compared against ESM2-3B and ESM2-15B, and the CLM perplexity results, compared against ProGen2-xlarge (6.4B), are as follows (lower is better):

- | Model | ESM2 (3B) | ESM2 (15B) | xTMLM (1B) | xTMLM (3B) | xTMLM (10B) | xT (100B)-INT4 |
+ | Model | ESM2 (3B) | ESM2 (15B) | PGLM (1B) | PGLM (3B) | PGLM (10B) | PGLM-INT4 (100B) |
  |:--------------------|:----------:|:----------:|:----------:|:----------:|:--------------------:|:--------------------:|
  | < 0.9 ID | 7.7 | 7.3 | 9.3 | 7.8 | 7.6 | **6.8** |
  | < 0.5 ID | 11.5 | 11.0 | 13.5 | 11.9 | 11.6 | **10.8** |

- | Model | ProGen2-xlarge (6.4B) | xTCLM (1B) | xTCLM (3B) | xTCLM (7B) | xT (100B)-INT4 |
+ | Model | ProGen2-xlarge (6.4B) | PGLM (1B) | PGLM (3B) | PGLM (7B) | PGLM-INT4 (100B) |
  |:--------------------|:----------:|:----------:|:----------:|:--------------------:|:--------------------:|
  | < 0.9 ID | 9.7 | 9.8 | 9.3 | 8.9 | **8.9** |
  | < 0.5 ID | 14.3 | 14.0 | 13.7 | 13.5 | **13.5** |
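
To make the evaluation above easier to follow, here is a minimal sketch of how CLM perplexity on a held-out set can be estimated with the released checkpoints. It is illustrative only and not part of the diff: the held-out sequences are placeholders, and it assumes the remote code returns a standard language-modeling loss when `labels` are passed.

```python
# Illustrative sketch: CLM perplexity as exp(mean negative log-likelihood).
# Assumptions: the remote code accepts `labels` and returns a standard LM loss;
# the sequences below are placeholders, not the actual OOD evaluation sets.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

repo = "Bo1015/proteinglm-100b-int4"  # repo id used in this README; a smaller CLM checkpoint also works
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained(repo, trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = True
model = AutoModelForCausalLM.from_pretrained(repo, config=config, torch_dtype=torch.half, trust_remote_code=True)
model.eval()
if torch.cuda.is_available():
    model = model.cuda()

held_out = ["MILKQVE", "MSDTAVK"]  # placeholder sequences
total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for seq in held_out:
        inputs = tokenizer(seq, return_tensors="pt").to(model.device)
        out = model(**inputs, labels=inputs["input_ids"])
        n = inputs["input_ids"].numel()  # approximate token count per sequence
        total_nll += out.loss.item() * n
        total_tokens += n
print(f"perplexity ~ {math.exp(total_nll / total_tokens):.2f}")
```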
@@ -34,27 +34,27 @@ You can choose to manually download the necessary weights.

  | Model | Download |
  |------------------|--------------------------------------------------------------------------------|
- | xTrimoPGLM-1B-MLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-1b-mlm) |
- | xTrimoPGLM-3B-MLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-3b-mlm) |
- | xTrimoPGLM-10B-MLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-10b-mlm) |
- | xTrimoPGLM-1B-CLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-1b-clm) |
- | xTrimoPGLM-3B-CLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-3b-clm) |
- | xTrimoPGLM-7B-CLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-7b-clm) |
- | xTrimoPGLM-100B-Int4 (MLM or CLM) | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-100b-int4) |
+ | ProteinGLM-1B-MLM | [🤗 Huggingface](https://huggingface.co/bo1015/proteinglm-1b-mlm) |
+ | ProteinGLM-3B-MLM | [🤗 Huggingface](https://huggingface.co/bo1015/proteinglm-3b-mlm) |
+ | ProteinGLM-10B-MLM | [🤗 Huggingface](https://huggingface.co/bo1015/proteinglm-10b-mlm) |
+ | ProteinGLM-1B-CLM | [🤗 Huggingface](https://huggingface.co/bo1015/protein-1b-clm) |
+ | ProteinGLM-3B-CLM | [🤗 Huggingface](https://huggingface.co/bo1015/protein-3b-clm) |
+ | ProteinGLM-7B-CLM | [🤗 Huggingface](https://huggingface.co/bo1015/protein-7b-clm) |
+ | ProteinGLM-INT4 (100B) (MLM or CLM) | [🤗 Huggingface](https://huggingface.co/Bo1015/proteinglm-100b-int4) |
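
If you prefer to fetch a checkpoint up front rather than letting `from_pretrained` download it on first use, the `huggingface_hub` client can mirror any of the repos listed above. A short sketch, with an arbitrary example target directory:

```python
# Illustrative: pre-download one of the repos from the table above.
from huggingface_hub import snapshot_download

# local_dir is an arbitrary example path; substitute any repo id from the table.
snapshot_download(repo_id="Bo1015/proteinglm-100b-int4", local_dir="./proteinglm-100b-int4")
```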

  ## How to use
- ### xTrimoPGLM-MLM: Masked Language Models for Protein Understanding Tasks
- (Note that the xTrimoPGLM-100B INT4 quantization can be inferred on a single A100/A800 GPU with 80 GB of memory.)
+ ### ProteinGLM-MLM: Masked Language Models for Protein Understanding Tasks
+ (The ProteinGLM-100B INT4 quantization requires approximately 50 GB of GPU memory. It can be inferred on a single A100/A800 GPU with 80 GB of memory or across multiple GPUs totaling 60 GB.)
  ```python
  # Obtain residue embeddings
  from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoTokenizer, AutoConfig
  import torch

- tokenizer = AutoTokenizer.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, use_fast=True)
- config = AutoConfig.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
+ tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, use_fast=True)
+ config = AutoConfig.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
  config.is_causal=False
- model = AutoModelForMaskedLM.from_pretrained("biomap-research/xtrimopglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
+ model = AutoModelForMaskedLM.from_pretrained("Bo1015/proteinglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
  if torch.cuda.is_available():
      model = model.cuda()
@@ -84,15 +84,15 @@ model = AutoModelForTokenClassification.from_pretrained(config, trust_remote_cod

  ```
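
The hunks above only show how the masked model is loaded; the embedding-extraction code itself falls outside the diff. As a rough illustration, with `tokenizer` and `model` as in the masked-LM snippet earlier, residue embeddings could be pulled out as follows, assuming the remote model follows the usual `transformers` convention for `output_hidden_states`:

```python
# Illustrative continuation of the masked-LM example (not part of the diff).
# Assumes the remote model returns hidden states when output_hidden_states=True.
seq = "MILVNLLIYLIFLAIVLPLR"  # placeholder protein sequence
inputs = tokenizer(seq, return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.cuda() for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
residue_embeddings = outputs.hidden_states[-1]  # shape: (batch, seq_len, hidden_dim)
print(residue_embeddings.shape)
```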

- ### xTrimoPGLM-CLM: Causal Language Models for Protein Design
+ ### ProteinGLM-CLM: Causal Language Models for Protein Design
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
  import torch

- tokenizer = AutoTokenizer.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, use_fast=True)
- config = AutoConfig.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
+ tokenizer = AutoTokenizer.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, use_fast=True)
+ config = AutoConfig.from_pretrained("Bo1015/proteinglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
  config.is_causal=True
- model = AutoModelForCausalLM.from_pretrained("biomap-research/xtrimopglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("Bo1015/proteinglm-100b-int4", config=config, torch_dtype=torch.half, trust_remote_code=True)
  if torch.cuda.is_available():
      model = model.cuda()
@@ -115,11 +115,10 @@ for idx, each in enumerate(prompt):
      print(f"\nEnd generation with length: {len(output.split())} - seqs: {output}\n")
  ```
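
The diff shows only the tail of the generation loop. For orientation, a minimal sketch of sampling from the causal model loaded above might look like the following; it assumes the remote code supports the standard `generate` API, and the prompt and sampling parameters are placeholders:

```python
# Illustrative sketch of generation with the CLM loaded above (not part of the diff).
# Assumes standard transformers generate() support; prompt and sampling settings are placeholders.
prompt = "MLFVVL"  # placeholder prompt fragment
inputs = tokenizer(prompt, return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.cuda() for k, v in inputs.items()}
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=1.0)
output = tokenizer.decode(generated[0].tolist())
print(f"End generation with length: {len(output.split())} - seqs: {output}")
```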

- For more inference or fine-tuning code, datasets, and requirements, please visit our [GitHub page](https://github.com/biomap-research/xTrimoPGLM).

  ## LICENSE

- The code in this repository is open source under the [Creative Commons Attribution-NonCommercial 4.0 International License](./LICENSE).
+ The model in this repository is open source under the [Creative Commons Attribution-NonCommercial 4.0 International License](./LICENSE).

  ## Citations