---
license: bsd
tags:
- chemistry
- biology
- protein
- antibodies
- antibody
- heavy chain
- AbLang
- CDR
- OAS
pipeline_tag: sentence-similarity
---

### AbLang model for heavy chains

This is a 🤗 version of AbLang: A language model for antibodies. It was introduced in
[this paper](https://doi.org/10.1101/2022.01.20.477061) and first released in
[this repository](https://github.com/oxpig/AbLang). The model was trained on uppercase amino acid sequences, so it only accepts capital-letter amino acid codes.
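
For example, a raw lowercase sequence can be normalized before tokenization (the variable names here are purely illustrative; the space-separated format matches the usage example further down):

```python
raw_seq = "evqlqesgpglvkpsetlsltctvsg"   # hypothetical lowercase input
prepared = " ".join(raw_seq.upper())     # uppercase, space-separated residues
```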


### Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks (TBA).

### How to use

Here is how to use this model to get the features of a given antibody sequence in PyTorch:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('qilowoq/AbLang_heavy')
model = AutoModel.from_pretrained('qilowoq/AbLang_heavy', trust_remote_code=True)

sequence_Example = ' '.join("EVQLQESGPGLVKPSETLSLTCTVSGGPINNAYWTWIRQPPGKGLEYLGYVYHTGVTNYNPSLKSRLTITIDTSRKQLSLSLKFVTAADSAVYYCAREWAEDGDFGNAFHVWGQGTMVAVSSASTKGPSVFPLAPSSKSTSGGTAALGCL")
encoded_input = tokenizer(sequence_Example, return_tensors='pt')
model_output = model(**encoded_input)
```
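
The call above returns per-residue features in `model_output.last_hidden_state`; a quick way to inspect them (the exact hidden size depends on the model configuration):

```python
# one embedding per token: [batch_size, sequence_length, hidden_size]
print(model_output.last_hidden_state.shape)
```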

Sequence embeddings can be produced as follows:

```python
import torch

def get_sequence_embeddings(encoded_input, model_output):
    mask = encoded_input['attention_mask'].float()
    # position of the last attended token (the SEP token) in each sequence
    d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}
    # exclude the SEP token from the average
    for i in d:
        mask[i, d[i]] = 0
    mask[:, 0] = 0.0  # exclude the CLS token as well
    mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
    sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
    sum_mask = torch.clamp(mask.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

seq_embeds = get_sequence_embeddings(encoded_input, model_output)
```
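
Given the `sentence-similarity` pipeline tag, a natural use of these embeddings is comparing two antibody sequences. A minimal sketch, where the second sequence is made up purely for illustration:

```python
import torch
import torch.nn.functional as F

# hypothetical second heavy-chain sequence, prepared the same way as above
sequence_other = ' '.join("QVQLVESGGGLVQPGGSLRLSCAASGFTFS")
encoded_other = tokenizer(sequence_other, return_tensors='pt')
with torch.no_grad():
    output_other = model(**encoded_other)
other_embeds = get_sequence_embeddings(encoded_other, output_other)

# cosine similarity between the two sequence embeddings
print(F.cosine_similarity(seq_embeds, other_embeds).item())
```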

### Fine-tune

To save memory we recommend using [LoRA](https://doi.org/10.48550/arXiv.2106.09685):

```bash
pip install git+https://github.com/huggingface/peft.git
pip install loralib
```

LoRA greatly reduces the number of trainable parameters and performs on par with, or better than, fine-tuning the full model.

```python
import torch
from peft import LoraConfig, get_peft_model

def apply_lora_bert(model):
    config = LoraConfig(
        r=8, lora_alpha=32, 
        lora_dropout=0.3,
        target_modules=['query', 'value']
    )
    for param in model.parameters():
        param.requires_grad = False  # freeze the model - train adapters later
        if param.ndim == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)
    model.gradient_checkpointing_enable()  # reduce number of stored activations
    model.enable_input_require_grads()
    model = get_peft_model(model, config)
    return model

model = apply_lora_bert(model)

model.print_trainable_parameters()
# trainable params: 294912 || all params: 85493760 || trainable%: 0.3449514911965505
```
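
Once wrapped, only the LoRA adapters (and whatever task head you attach) require gradients. As a hypothetical illustration of a single training step, with a made-up regression head, dummy labels, and an arbitrary learning rate (downstream tasks are TBA, so none of this comes from the original AbLang setup):

```python
import torch
import torch.nn.functional as F

head = torch.nn.Linear(model.config.hidden_size, 1)   # illustrative task head
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad] + list(head.parameters()),
    lr=1e-4,
)

model.train()
outputs = model(**encoded_input)
pooled = outputs.last_hidden_state.mean(dim=1)         # simple mean pooling
loss = F.mse_loss(head(pooled).squeeze(-1), torch.zeros(pooled.size(0)))  # dummy labels
loss.backward()
optimizer.step()
optimizer.zero_grad()
```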

### Citation
```
@article{Olsen2022,
  title={AbLang: An antibody language model for completing antibody sequences},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2022.01.20.477061},
  year={2022}
}
```