Bo1015 committed
Commit 83d5354 • 1 Parent(s): 62e9855

Update README.md

Files changed (1): README.md (+136, -3)

README.md CHANGED

---
license: apache-2.0
tags:
- biology
- medical
---
# xTrimoPGLM

## Introduction to the xTrimoPGLM Family Models

xTrimoPGLM is the open-source release of our latest protein language models for protein understanding (masked protein language models) and protein design (causal protein language models). The xTrimoPGLM family models are developed by BioMap and Tsinghua University. Alongside them, we have released the INT4-quantized xTrimoPGLM-100B weights and other smaller xTrimo-series models: 1B, 3B, and 10B models trained with masked language modeling for protein understanding, and 1B, 3B, and 7B causal language models aimed at protein design.

### Out-of-Distribution Perplexity Evaluation

We evaluated the xTrimoPGLM models (xTMLM or xTCLM) and xTrimoPGLM (100B) on two out-of-distribution (OOD) test sets: one with sequence identity to the training set below 0.9 (<0.9 ID) and one with sequence identity below 0.5 (<0.5 ID). Each OOD set contains approximately 10,000 protein sequences. The MLM perplexity results, compared against ESM2-3B and ESM2-15B, and the CLM perplexity results, compared against ProGen2-xlarge (6.4B), are as follows (lower is better):

| Model    | ESM2 (3B) | ESM2 (15B) | xTMLM (1B) | xTMLM (3B) | xTMLM (10B) | xT (100B) |
|:---------|:---------:|:----------:|:----------:|:----------:|:-----------:|:---------:|
| < 0.9 ID | 7.7       | 7.3        | 9.3        | 7.8        | 7.6         | **6.7**   |
| < 0.5 ID | 11.5      | 11.0       | 13.5       | 11.9       | 11.6        | **10.8**  |

| Model    | ProGen2-xlarge (6.4B) | xTCLM (1B) | xTCLM (3B) | xTCLM (7B) | xT (100B) |
|:---------|:---------------------:|:----------:|:----------:|:----------:|:---------:|
| < 0.9 ID | 9.7                   | 9.8        | 9.3        | 8.9        | **8.7**   |
| < 0.5 ID | 14.3                  | 14.0       | 13.7       | 13.5       | **13.3**  |

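The exact evaluation pipeline is not reproduced here, but as a rough illustration of the metric itself: causal-LM perplexity on a held-out sequence is the exponential of the mean next-token negative log-likelihood. The sketch below is not the official evaluation script; it assumes the smaller xTrimoPGLM-1B-CLM checkpoint and that its remote modeling code follows the standard Hugging Face causal-LM interface (a mean cross-entropy loss when `labels` are provided).

```python
# Hedged sketch (not the official evaluation script): per-sequence CLM perplexity
# computed as exp(mean next-token negative log-likelihood).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "biomap-research/xtrimopglm-1b-clm"  # smaller checkpoint chosen for illustration
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, torch_dtype=torch.half)
if torch.cuda.is_available():
    model = model.cuda()
model.eval()

def clm_perplexity(seq: str) -> float:
    enc = tokenizer(seq, add_special_tokens=True, return_tensors="pt")
    input_ids = enc["input_ids"].to(model.device)
    with torch.inference_mode():
        # Assumption: labels=input_ids yields the mean shifted cross-entropy loss,
        # as in standard Hugging Face causal-LM implementations.
        loss = model(input_ids=input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

print(clm_perplexity("MILMCQHFSGQFSKYFLAVSSDFCHFVFPIILVSHVNFKQMKRKGFALWNDRAVPFTQGIFTTVMILLQYLHGTG"))
```
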
## Downstream Protein Understanding Tasks Evaluation
(TODO)

## Get Started
### Model List
You can manually download the necessary weights from the links below, or fetch a full repository programmatically (see the sketch after the table).

| Model | Download |
|-------|----------|
| xTrimoPGLM-1B-MLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-1b-mlm) [🔨 SwissArmyTransformer]() |
| xTrimoPGLM-3B-MLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-3b-mlm) [🔨 SwissArmyTransformer]() |
| xTrimoPGLM-10B-MLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-10b-mlm) [🔨 SwissArmyTransformer]() |
| xTrimoPGLM-1B-CLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-1b-clm) [🔨 SwissArmyTransformer]() |
| xTrimoPGLM-3B-CLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-3b-clm) [🔨 SwissArmyTransformer]() |
| xTrimoPGLM-7B-CLM | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-7b-clm) [🔨 SwissArmyTransformer]() |
| xTrimoPGLM-100B-Int4 (MLM or CLM) | [🤗 Huggingface](https://huggingface.co/biomap-research/xtrimopglm-100b-int4) [🔨 SwissArmyTransformer]() |

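As one option (not part of the official instructions), an entire repository can be fetched with the `huggingface_hub` client; the repository ID and local directory below are illustrative choices, not requirements.

```python
# Hypothetical sketch: download one of the repositories above with huggingface_hub.
from huggingface_hub import snapshot_download

# "local_dir" is an illustrative choice; omit it to use the default HF cache.
local_path = snapshot_download(
    repo_id="biomap-research/xtrimopglm-1b-mlm",
    local_dir="./xtrimopglm-1b-mlm",
)
print(f"Weights downloaded to: {local_path}")
```
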
## How to use
### xTrimoPGLM-MLM: Masked Language Models for Protein Understanding Tasks
(Note that the INT4-quantized xTrimoPGLM-100B can run inference on a single A100/A800 GPU with 80 GB of memory.)
```python
# Obtain residue embeddings
import torch
from transformers import AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = False
model = AutoModelForMaskedLM.from_config(config, trust_remote_code=True, torch_dtype=torch.half)
# # If you don't have a single GPU with 80 GB of memory, try the dispatch load instead:
# from accelerate import load_checkpoint_and_dispatch
# model = load_checkpoint_and_dispatch(
#     model, "biomap-research/xtrimopglm-100b-int4", device_map="auto", no_split_module_classes=["xTrimoPGLMBlock"], strict=True, dtype=torch.half
# )
if torch.cuda.is_available():
    model = model.cuda()
model.eval()

seq = 'MILMCQHFSGQFSKYFLAVSSDFCHFVFPIILVSHVNFKQMKRKGFALWNDRAVPFTQGIFTTVMILLQYLHGTG'
output = tokenizer(seq, add_special_tokens=True, return_tensors='pt')
with torch.inference_mode():
    # Move the tokenized inputs to the same device as the model.
    inputs = {"input_ids": output["input_ids"].to(model.device), "attention_mask": output["attention_mask"].to(model.device)}
    output_embeddings = model(**inputs, output_hidden_states=True, return_last_hidden_state=True).hidden_states[:-1, 0]  # get rid of the <eos> token

# Model for sequence-level tasks
model = AutoModelForSequenceClassification.from_config(config, trust_remote_code=True, torch_dtype=torch.bfloat16)

# Model for token-level tasks
model = AutoModelForTokenClassification.from_config(config, trust_remote_code=True, torch_dtype=torch.bfloat16)
```
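
Beyond raw residue embeddings, a single vector per protein is often convenient (for example, for clustering or similarity search). The snippet below is a small add-on sketch, not part of the original example; it assumes `output_embeddings` from the block above has shape `[num_residues, hidden_size]`, as the indexing there suggests.

```python
# Hedged sketch (not from the original README): derive a single sequence-level
# embedding by mean-pooling the residue embeddings obtained above.
# Assumes `output_embeddings` has shape [num_residues, hidden_size].
import torch

sequence_embedding = output_embeddings.float().mean(dim=0)  # shape: [hidden_size]
print(sequence_embedding.shape)

# Example use: cosine similarity between two pooled sequence embeddings
# (`other_embedding` is a hypothetical second sequence embedded the same way).
# similarity = torch.nn.functional.cosine_similarity(sequence_embedding, other_embedding, dim=0)
```
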

Refer to the *finetune* folder for more fine-tuning examples, such as LoRA and linear probing; a minimal LoRA sketch is shown below.

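For orientation only (the authoritative recipes are in the *finetune* folder), the sketch below shows how LoRA adapters might be attached with the `peft` library. The `target_modules` names are placeholder assumptions and must be replaced with the actual module names used in the xTrimoPGLM implementation.

```python
# Hedged sketch: attach LoRA adapters to a sequence-level classification model with
# the peft library. target_modules is an assumption; inspect model.named_modules()
# to find the real attention/projection module names in the xTrimoPGLM code.
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

config = AutoConfig.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, torch_dtype=torch.bfloat16)
config.is_causal = False
model = AutoModelForSequenceClassification.from_config(config, trust_remote_code=True, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,          # sequence-level classification
    r=8,                                 # low-rank dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # assumption: replace with the model's real module names
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights and the classification head are trainable
```
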
### xTrimoPGLM-CLM: Causal Language Models for Protein Design
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, use_fast=True)
config = AutoConfig.from_pretrained("biomap-research/xtrimopglm-100b-int4", trust_remote_code=True, torch_dtype=torch.half)
config.is_causal = True
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True, torch_dtype=torch.half)
# # If you don't have a single GPU with 80 GB of memory, try the dispatch load instead:
# from accelerate import load_checkpoint_and_dispatch
# model = load_checkpoint_and_dispatch(
#     model, "biomap-research/xtrimopglm-100b-int4", device_map="auto", no_split_module_classes=["xTrimoPGLMBlock"], strict=True, dtype=torch.half
# )
if torch.cuda.is_available():
    model = model.cuda()
model.eval()

gen_kwargs = {'max_length': 256, 'top_p': 0.8, 'temperature': 0.9, "num_beams": 1}
prompt = ['', 'MLFVVL', 'LDL', 'VTQA']

for idx, each in enumerate(prompt):
    print(f"Begin generating idx: {idx} with prompt {each}")
    output = model.chat(tokenizer, each)
    print(f"\nEnd generation with length: {len(output.split())} - seqs: {output}\n")
```
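
As a small, hypothetical post-processing step that is not part of the original example, generated sequences can be filtered to the 20 standard amino-acid letters before downstream analysis:

```python
# Hypothetical post-processing helper: keep only generated sequences composed of
# the 20 standard amino-acid letters and report their lengths.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_protein(seq: str) -> bool:
    """Return True if the sequence is non-empty and uses only standard residues."""
    seq = seq.strip().upper()
    return len(seq) > 0 and set(seq) <= STANDARD_AA

generated = ["MLFVVLACDEF", "LDLX??", "VTQAGHIKLM"]  # placeholder outputs for illustration
valid = [s for s in generated if is_valid_protein(s)]
print(f"kept {len(valid)} / {len(generated)} sequences; lengths: {[len(s) for s in valid]}")
```
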
For more inference scripts for the other models, please visit their model cards on the Hugging Face Hub.


## LICENSE

The code in this repository is open-sourced under the [Apache-2.0 license](./LICENSE).

## Citations

If you find our work useful, please consider citing the following papers:
```
@article{chen2024xtrimopglm,
  title={xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein},
  author={Chen, Bo and Cheng, Xingyi and Li, Pan and Geng, Yangli-ao and Gong, Jing and Li, Shen and Bei, Zhilei and Tan, Xu and Wang, Boyan and Zeng, Xin and others},
  journal={arXiv preprint arXiv:2401.06199},
  year={2024}
}

@article{cheng2024training,
  title={Training Compute-Optimal Protein Language Models},
  author={Cheng, Xingyi and Chen, Bo and Li, Pan and Gong, Jing and Tang, Jie and Song, Le},
  journal={bioRxiv},
  pages={2024--06},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}
```