---
license: bsd
tags:
- chemistry
- biology
- protein
- antibodies
- antibody
- heavy chain
- AbLang
- CDR
- OAS
---

### AbLang model for heavy chains
This is a 🤗 version of AbLang, a language model for antibodies. It was introduced in
[this paper](https://doi.org/10.1101/2022.01.20.477061) and first released in
[this repository](https://github.com/oxpig/AbLang). The model was trained on uppercase amino acid sequences, so inputs must use capital single-letter amino acid codes.
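
If your sequences are stored in lowercase, uppercase them before tokenization; the examples below also join residues with spaces. A minimal sketch with a hypothetical lowercase fragment:

```python
raw_seq = "evqlqesgpglvkpsetlsltctvsggpinnay"  # hypothetical lowercase heavy-chain fragment
prepared = ' '.join(raw_seq.upper())           # uppercase, space-separated residues
```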
### Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks (TBA).
### How to use

Here is how to use this model to get the features of a given antibody sequence in PyTorch:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('qilowoq/AbLang_heavy')
model = AutoModel.from_pretrained('qilowoq/AbLang_heavy', trust_remote_code=True)

# uppercase residues joined with spaces, as expected by the tokenizer
sequence_Example = ' '.join("EVQLQESGPGLVKPSETLSLTCTVSGGPINNAYWTWIRQPPGKGLEYLGYVYHTGVTNYNPSLKSRLTITIDTSRKQLSLSLKFVTAADSAVYYCAREWAEDGDFGNAFHVWGQGTMVAVSSASTKGPSVFPLAPSSKSTSGGTAALGCL")
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
model_output = model(**encoded_input)
```
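For pure feature extraction, gradients are not needed; wrapping the forward pass in `torch.no_grad()` with the model in eval mode reduces memory use. A minimal sketch:

```python
import torch

model.eval()  # disable dropout for deterministic features
with torch.no_grad():  # no gradients needed for feature extraction
    model_output = model(**encoded_input)

print(model_output.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```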
Sequence embeddings can be produced as follows:
```python
import torch

def get_sequence_embeddings(encoded_input, model_output):
    mask = encoded_input['attention_mask'].float()
    # batch index -> position of the last non-padding token (the sep token)
    d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}
    # make sep token invisible
    for i in d:
        mask[i, d[i]] = 0
    mask[:, 0] = 0.0  # make cls token invisible
    mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
    sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
    sum_mask = torch.clamp(mask.sum(1), min=1e-9)
    return sum_embeddings / sum_mask  # mean-pool over residue positions

seq_embeds = get_sequence_embeddings(encoded_input, model_output)
```
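Each sequence is pooled into a single fixed-length vector, so the embeddings can be compared across sequences or exported as features for downstream models. A minimal, purely illustrative sketch, where the second input is simply a truncated copy of the example sequence rather than a real antibody:

```python
import torch

# Arbitrary second input: a truncated copy of the example sequence, for illustration only.
sequence_truncated = sequence_Example[:200]

encoded_2 = tokenizer(sequence_truncated, return_tensors='pt')
with torch.no_grad():
    output_2 = model(**encoded_2)
seq_embeds_2 = get_sequence_embeddings(encoded_2, output_2)

# Fixed-length embeddings can be compared directly or used as plain feature vectors.
similarity = torch.nn.functional.cosine_similarity(seq_embeds, seq_embeds_2)
features = seq_embeds.detach().cpu().numpy()  # e.g. as input features for a downstream classifier
```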
### Fine-tune

To save memory we recommend using [LoRA](https://doi.org/10.48550/arXiv.2106.09685):
```bash
pip install git+https://github.com/huggingface/peft.git
pip install loralib
```
LoRA greatly reduces the number of trainable parameters and performs on par with or better than fine-tuning the full model.
```python
import torch
from peft import LoraConfig, get_peft_model

def apply_lora_bert(model):
    config = LoraConfig(
        r=8, lora_alpha=32,
        lora_dropout=0.3,
        target_modules=['query', 'value']
    )
    for param in model.parameters():
        param.requires_grad = False  # freeze the model - train adapters later
        if param.ndim == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)
    model.gradient_checkpointing_enable()  # reduce number of stored activations
    model.enable_input_require_grads()
    model = get_peft_model(model, config)
    return model

model = apply_lora_bert(model)

model.print_trainable_parameters()
# trainable params: 294912 || all params: 85493760 || trainable%: 0.3449514911965505
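The wrapped model then trains like any other PyTorch module, with only the LoRA adapters (plus whatever task head you add) receiving gradient updates. Below is a minimal, purely illustrative sketch of a single training step; the regression head, label, and learning rate are hypothetical and not part of the original AbLang setup:

```python
import torch

# Hypothetical task head on top of the pooled sequence embedding
hidden_size = model_output.last_hidden_state.size(-1)
head = torch.nn.Linear(hidden_size, 1)

trainable = [p for p in model.parameters() if p.requires_grad] + list(head.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

model.train()
output = model(**encoded_input)
embedding = get_sequence_embeddings(encoded_input, output)  # (1, hidden_size)

target = torch.tensor([[0.5]])  # hypothetical regression label
loss = torch.nn.functional.mse_loss(head(embedding), target)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```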
### Citation
```
@article{Olsen2022,
  title={AbLang: An antibody language model for completing antibody sequences},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2022.01.20.477061},
  year={2022}
}
```