	---
	license: bsd
	tags:
	- chemistry
	- biology
	- protein
	- antibodies
	- antibody
	- heavy chain
	- AbLang
	- CDR
	- OAS
	---

	### AbLang model for heavy chains

	This is a 🤗 version of AbLang: A language model for antibodies. It was introduced in
	[this paper](https://doi.org/10.1101/2022.01.20.477061) and first released in
	[this repository](https://github.com/oxpig/AbLang). The model was trained on uppercase amino acid sequences, so it only works with capital-letter input.
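
Raw sequences often arrive in lowercase or wrapped across lines, so it can help to normalize them before tokenizing. The following is a minimal sketch: `prepare_sequence` is a hypothetical helper (not part of AbLang or 🤗 Transformers), and restricting input to the 20 standard amino acids is an assumption about what the vocabulary accepts. The space-separated output matches the format used in the usage example in this README.

```python
# Assumption: only the 20 standard amino acids are valid residues.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def prepare_sequence(raw: str) -> str:
    """Uppercase a raw sequence and separate its residues with spaces."""
    # Drop all whitespace (e.g. line wraps from FASTA files), then uppercase.
    residues = "".join(raw.split()).upper()
    unknown = sorted(set(residues) - STANDARD_AMINO_ACIDS)
    if unknown:
        raise ValueError(f"Unexpected residue characters: {unknown}")
    return " ".join(residues)


print(prepare_sequence("evqlqesgpg"))  # E V Q L Q E S G P G
```

The returned string can be passed to the tokenizer directly.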


	### Intended uses & limitations

	The model can be used for protein feature extraction or fine-tuned on downstream tasks (TBA).

	### How to use

	Here is how to use this model to get the features of a given antibody sequence in PyTorch:

	```python
	from transformers import AutoModel, AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained('qilowoq/AbLang_heavy')
	model = AutoModel.from_pretrained('qilowoq/AbLang_heavy', trust_remote_code=True)

	sequence_Example = ' '.join("EVQLQESGPGLVKPSETLSLTCTVSGGPINNAYWTWIRQPPGKGLEYLGYVYHTGVTNYNPSLKSRLTITIDTSRKQLSLSLKFVTAADSAVYYCAREWAEDGDFGNAFHVWGQGTMVAVSSASTKGPSVFPLAPSSKSTSGGTAALGCL")
	encoded_input = tokenizer(sequence_Example, return_tensors='pt')
	model_output = model(**encoded_input)
	```

	Sequence embeddings can be produced as follows:

	```python
	import torch

	def get_sequence_embeddings(encoded_input, model_output):
	    """Mean-pool token embeddings, ignoring the CLS, SEP, and padding tokens."""
	    mask = encoded_input['attention_mask'].float()
	    # Map each row to its last attended position, i.e. the sep token.
	    d = {k: v for k, v in torch.nonzero(mask).cpu().numpy()}
	    # make sep token invisible
	    for i in d:
	        mask[i, d[i]] = 0
	    mask[:, 0] = 0.0  # make cls token invisible
	    mask = mask.unsqueeze(-1).expand(model_output.last_hidden_state.size())
	    sum_embeddings = torch.sum(model_output.last_hidden_state * mask, 1)
	    sum_mask = torch.clamp(mask.sum(1), min=1e-9)  # avoid division by zero
	    return sum_embeddings / sum_mask

	seq_embeds = get_sequence_embeddings(encoded_input, model_output)
	```
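
To make the pooling arithmetic explicit, here is a dependency-free sketch of the same masked mean on a toy input, with plain Python lists standing in for tensors. The name `masked_mean` and the toy numbers are illustrative only. Like the function above, it assumes the first token is the CLS token and the last attended token is the SEP token.

```python
def masked_mean(hidden_states, attention_mask):
    """Average token vectors, skipping CLS, SEP, and padding positions.

    hidden_states: list of token vectors for one sequence (not a batch).
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    mask = list(attention_mask)
    attended = [i for i, flag in enumerate(mask) if flag]
    if attended:
        mask[attended[-1]] = 0  # hide the last attended token (assumed SEP)
    mask[0] = 0                 # hide the first token (assumed CLS)

    dim = len(hidden_states[0])
    # Same role as torch.clamp(..., min=1e-9): avoid dividing by zero.
    count = max(sum(mask), 1e-9)
    return [
        sum(vec[d] * m for vec, m in zip(hidden_states, mask)) / count
        for d in range(dim)
    ]


# Toy sequence: CLS, two residues, SEP, one padding position.
hidden = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [0.0, 0.0]]
mask = [1, 1, 1, 1, 0]
print(masked_mean(hidden, mask))  # [2.0, 3.0]
```

Only the two residue vectors contribute, so the result is their element-wise average.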

	### Fine-tune

	To save memory, we recommend using [LoRA](https://doi.org/10.48550/arXiv.2106.09685):

	```bash
	pip install git+https://github.com/huggingface/peft.git
	pip install loralib
	```

	LoRA greatly reduces the number of trainable parameters and performs on par with, or better than, fine-tuning the full model.

	```python
	import torch
	from peft import LoraConfig, get_peft_model

	def apply_lora_bert(model):
	    config = LoraConfig(
	        r=8, lora_alpha=32,
	        lora_dropout=0.3,
	        target_modules=['query', 'value']
	    )
	    for param in model.parameters():
	        param.requires_grad = False  # freeze the model - train adapters later
	        if param.ndim == 1:
	            # cast the small parameters (e.g. layernorm) to fp32 for stability
	            param.data = param.data.to(torch.float32)
	    model.gradient_checkpointing_enable()  # reduce number of stored activations
	    model.enable_input_require_grads()
	    model = get_peft_model(model, config)
	    return model

	model = apply_lora_bert(model)

	model.print_trainable_parameters()
	# trainable params: 294912 || all params: 85493760 || trainable%: 0.3449514911965505
	```
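
The trainable-parameter count printed above can be checked by hand. For each adapted weight matrix of shape `d_out × d_in`, LoRA adds two low-rank factors with `r * (d_in + d_out)` parameters in total. The sketch below assumes a BERT-style encoder with 12 layers and hidden size 768, with adapters on `query` and `value` in every layer. These architecture values are assumptions rather than values read from the model config, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(num_layers, hidden_size, rank, modules_per_layer):
    """Count LoRA parameters for square hidden_size x hidden_size projections."""
    # Each adapter adds A (rank x hidden_size) and B (hidden_size x rank).
    per_module = rank * (hidden_size + hidden_size)
    return num_layers * modules_per_layer * per_module


# Assumed architecture: 12 layers, hidden size 768; r=8 on 'query' and 'value'.
trainable = lora_trainable_params(
    num_layers=12, hidden_size=768, rank=8, modules_per_layer=2
)
total = 85493760  # "all params" as reported by print_trainable_parameters()

print(trainable)                # 294912
print(100 * trainable / total)  # roughly 0.345 (percent)
```

Under these assumptions the arithmetic reproduces the reported figure, which suggests the adapters are the only trainable weights.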

	### Citation
	```
	@article{Olsen2022,
	  title={AbLang: An antibody language model for completing antibody sequences},
	  author={Tobias H. Olsen and Iain H. Moal and Charlotte M. Deane},
	  journal={bioRxiv},
	  doi={10.1101/2022.01.20.477061},
	  year={2022}
	}
	```