---
license: bsd
tags:
- chemistry
- biology
- protein
- antibodies
- antibody
- heavy chain
- AbLang
- CDR
- OAS
pipeline_tag: sentence-similarity
---
### AbLang model for heavy chains
This is a 🤗 version of AbLang: A language model for antibodies. It was introduced in
[this paper](https://doi.org/10.1101/2022.01.20.477061) and first released in
[this repository](https://github.com/oxpig/AbLang). The model was trained on uppercase amino acid letters, so input sequences must be uppercase.
### Intended uses & limitations
The model can be used for feature extraction from antibody sequences or fine-tuned on downstream tasks (TBA).
### How to use
Here is how to use this model to get the features of a given antibody sequence in PyTorch:
```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('qilowoq/AbLang_heavy')
model = AutoModel.from_pretrained('qilowoq/AbLang_heavy', trust_remote_code=True)
sequence_Example = ' '.join("EVQLQESGPGLVKPSETLSLTCTVSGGPINNAYWTWIRQPPGKGLEYLGYVYHTGVTNYNPSLKSRLTITIDTSRKQLSLSLKFVTAADSAVYYCAREWAEDGDFGNAFHVWGQGTMVAVSSASTKGPSVFPLAPSSKSTSGGTAALGCL")
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
model_output = model(**encoded_input)
```
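Because the model only accepts uppercase residues, and the example above passes them separated by spaces, it can help to normalize raw sequences before tokenizing. The helper below is a minimal sketch. `prepare_sequence` and `VALID_RESIDUES` are hypothetical names, not part of AbLang or 🤗 Transformers, and limiting the input to the 20 standard amino acids is an assumption; the tokenizer vocabulary may accept additional symbols.

```python
# Assumption: only the 20 standard amino acid letters are allowed here.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def prepare_sequence(raw: str) -> str:
    """Uppercase a raw sequence and separate its residues with spaces."""
    cleaned = "".join(raw.split()).upper()  # drop whitespace, force uppercase
    unknown = sorted(set(cleaned) - VALID_RESIDUES)
    if unknown:
        raise ValueError(f"unexpected residue letters: {unknown}")
    return " ".join(cleaned)


print(prepare_sequence("evqlqesgpg"))  # E V Q L Q E S G P G
```

The returned string can be passed to `tokenizer(...)` in place of the manual `' '.join(...)` call.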
Sequence embeddings can be produced as follows:
```python
import torch


def get_sequence_embeddings(encoded_input, model_output):
    mask = encoded_input['attention_mask'].float()
    # map each row to the index of its last attended position,
    # i.e. the sep token (assumes right padding)
    d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}
    # make sep token invisible
    for i in d:
        mask[i, d[i]] = 0
    mask[:, 0] = 0.0  # make cls token invisible
    mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
    # masked mean over the remaining residue positions
    sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
    sum_mask = torch.clamp(mask.sum(1), min=1e-9)
    return sum_embeddings / sum_mask


seq_embeds = get_sequence_embeddings(encoded_input, model_output)
```
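One common way to compare two sequence embeddings is cosine similarity. The sketch below uses plain Python lists and toy 4-dimensional vectors in place of real embeddings; the `cosine_similarity` helper is written here for illustration and is not part of AbLang. A real embedding can be converted with `seq_embeds[0].tolist()`, and `torch.nn.functional.cosine_similarity` performs the same computation directly on tensors. Whether cosine similarity is a meaningful measure for a particular antibody task is an assumption that should be validated.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real sequence embeddings.
emb_a = [0.5, -1.0, 2.0, 0.0]
emb_b = [1.0, -2.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [2.0, 1.0, 0.0, 0.0]   # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1
print(cosine_similarity(emb_a, emb_c))  # 0.0
```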
### Fine-tune
To save memory, we recommend using [LoRA](https://doi.org/10.48550/arXiv.2106.09685):
```bash
pip install git+https://github.com/huggingface/peft.git
pip install loralib
```
LoRA greatly reduces the number of trainable parameters and performs on par with or better than full fine-tuning.
```python
import torch
from peft import LoraConfig, get_peft_model


def apply_lora_bert(model):
    config = LoraConfig(
        r=8, lora_alpha=32,
        lora_dropout=0.3,
        target_modules=['query', 'value']
    )
    for param in model.parameters():
        param.requires_grad = False  # freeze the model - train adapters later
        if param.ndim == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)
    model.gradient_checkpointing_enable()  # reduce number of stored activations
    model.enable_input_require_grads()
    model = get_peft_model(model, config)
    return model


model = apply_lora_bert(model)
model.print_trainable_parameters()
# trainable params: 294912 || all params: 85493760 || trainable%: 0.3449514911965505
```
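The trainable-parameter count printed above can be checked by hand. For each adapted `d × d` linear layer, LoRA adds two low-rank matrices of shapes `r × d` and `d × r`, i.e. `2 * r * d` parameters. The sketch below assumes a hidden size of 768 and 12 transformer layers. These values are not stated in this card; they are chosen because they reproduce the printed count. `lora_trainable_params` is a hypothetical helper, not a PEFT function.

```python
def lora_trainable_params(hidden_size, num_layers, rank, modules_per_layer):
    """Count LoRA parameters added to square (hidden x hidden) projections."""
    per_module = 2 * rank * hidden_size  # A: rank x hidden, B: hidden x rank
    return per_module * modules_per_layer * num_layers


trainable = lora_trainable_params(
    hidden_size=768,      # assumption, see the note above
    num_layers=12,        # assumption, see the note above
    rank=8,               # r=8 from the LoraConfig
    modules_per_layer=2,  # 'query' and 'value'
)
total = 85493760  # "all params" as printed by print_trainable_parameters()

print(trainable)  # 294912
print(100 * trainable / total)  # trainable share in percent
```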
### Citation
```
@article{Olsen2022,
  title={AbLang: An antibody language model for completing antibody sequences},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2022.01.20.477061},
  year={2022}
}
```