---
license: bsd
tags:
- chemistry
- biology
- protein
- antibodies
- antibody
- heavy chain
- AbLang
- CDR
- OAS
---

### AbLang model for heavy chains

This is a 🤗 Transformers version of AbLang, a language model for antibodies. It was introduced in [this paper](https://doi.org/10.1101/2022.01.20.477061) and first released in [this repository](https://github.com/oxpig/AbLang).

The model was trained on uppercase amino acid sequences and only works with capital-letter amino acids.

### Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks (TBA).

### How to use

Here is how to use this model to get the features of a given antibody sequence in PyTorch:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('qilowoq/AbLang_heavy')
model = AutoModel.from_pretrained('qilowoq/AbLang_heavy', trust_remote_code=True)

# residues are passed to the tokenizer as a space-separated string
sequence_Example = ' '.join("EVQLQESGPGLVKPSETLSLTCTVSGGPINNAYWTWIRQPPGKGLEYLGYVYHTGVTNYNPSLKSRLTITIDTSRKQLSLSLKFVTAADSAVYYCAREWAEDGDFGNAFHVWGQGTMVAVSSASTKGPSVFPLAPSSKSTSGGTAALGCL")

encoded_input = tokenizer(sequence_Example, return_tensors='pt')
model_output = model(**encoded_input)
```

Sequence embeddings can be produced by mean-pooling the residue embeddings, excluding the [CLS] and [SEP] tokens:

```python
import torch

def get_sequence_embeddings(encoded_input, model_output):
    mask = encoded_input['attention_mask'].float()
    d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}  # last non-padding position ([SEP]) per sequence
    # mask out the [SEP] token of each sequence
    for i in d:
        mask[i, d[i]] = 0
    mask[:, 0] = 0.0  # mask out the [CLS] token
    mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
    # mean-pool the remaining residue embeddings
    sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
    sum_mask = torch.clamp(mask.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

seq_embeds = get_sequence_embeddings(encoded_input, model_output)
```

### Fine-tune

To save memory we recommend using [LoRA](https://doi.org/10.48550/arXiv.2106.09685):

```bash
pip install git+https://github.com/huggingface/peft.git
pip install loralib
```

LoRA greatly reduces the number of trainable parameters and performs on par with or better than fine-tuning the full model.

```python
import torch
from peft import LoraConfig, get_peft_model

def apply_lora_bert(model):
    config = LoraConfig(
        r=8, lora_alpha=32, lora_dropout=0.3,
        target_modules=['query', 'value']
    )
    for param in model.parameters():
        param.requires_grad = False  # freeze the model - train adapters later
        if param.ndim == 1:
            # cast the small parameters (e.g. layernorm) to fp32 for stability
            param.data = param.data.to(torch.float32)
    model.gradient_checkpointing_enable()  # reduce number of stored activations
    model.enable_input_require_grads()
    model = get_peft_model(model, config)
    return model

model = apply_lora_bert(model)
model.print_trainable_parameters()
# trainable params: 294912 || all params: 85493760 || trainable%: 0.3449514911965505
```

A hypothetical training sketch using the adapted model is given at the end of this card.

### Citation

```
@article{Olsen2022,
  title={AbLang: An antibody language model for completing antibody sequences},
  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
  journal={bioRxiv},
  doi={10.1101/2022.01.20.477061},
  year={2022}
}
```
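
### Fine-tuning sketch

The fine-tune section above only wires up the LoRA adapters; it does not show a training step. Below is a minimal, hypothetical sketch of fine-tuning the adapted encoder on a downstream binary property prediction task, reusing `get_sequence_embeddings` from earlier on this card. The linear head, learning rate, number of epochs, toy sequence, and label are illustrative assumptions, not part of AbLang.

```python
import torch
from torch import nn

# Hypothetical downstream task: predict a binary property from pooled sequence embeddings.
# The toy sequence and label below are placeholders; replace them with real data.
head = nn.Linear(model.config.hidden_size, 1)  # assumes the config exposes hidden_size
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad] + list(head.parameters()),
    lr=1e-4,
)

toy_sequences = [' '.join("EVQLQESGPGLVKPSETLSLTCTVS")]  # space-separated residues
toy_labels = torch.tensor([[1.0]])

model.train()
for epoch in range(3):
    encoded = tokenizer(toy_sequences, return_tensors='pt', padding=True)
    output = model(**encoded)
    embeddings = get_sequence_embeddings(encoded, output)  # defined above
    loss = nn.functional.binary_cross_entropy_with_logits(head(embeddings), toy_labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice you would iterate over a `DataLoader` of labeled antibody sequences; the trained LoRA adapters can then be saved separately with `model.save_pretrained(output_dir)`.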