	---
	license: apache-2.0
	base_model: nferruz/ProtGPT2
	tags:
	- generated_from_trainer
	metrics:
	- accuracy
	model-index:
	- name: AmpGPT2
	  results: []
	---

	# AmpGPT2

	AmpGPT2 is a language model that generates de novo antimicrobial peptides (AMPs). In a test of 1000 generated sequences, 95.86% were predicted to have antimicrobial activity.

	## Model description

	AmpGPT2 is a fine-tuned version of [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2) based on the GPT2 Transformer architecture.

	| Model | Sequences generated | AMP percentage (AMP%) | Average length |
	|:------------:|:-----------:|:-----------:|:-----------:|
	| AmpGPT2 | 1000 | 95.86 | 64.08 |
	| ProtGPT2 | 1000 | 51.85 | 222.59 |

	AmpGPT2 outperforms ProtGPT2 in AMP% (95.86 vs. 51.85), suggesting that the model learned from the AMP-specific data.
	The predictions were made with the Antimicrobial Peptide Scanner vr.2 (https://www.dveltri.com/ascan/v2/ascan.html), a deep learning tool designed specifically for AMP recognition.
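
Both table metrics can be recomputed for any set of generated sequences once each one has a classifier label. The following is a minimal sketch, assuming the predictions have already been collected as booleans and that length is counted in residues. The `summarize` helper and its input format are hypothetical and are not part of the scanner or of this repository.

```python
def summarize(sequences, is_amp):
    """Return (AMP%, average length) for a list of generated sequences.

    sequences: list of amino-acid strings.
    is_amp:    list of booleans, one per sequence. Assumed to come from an
               external AMP classifier; this function does not call one.
    """
    if len(sequences) != len(is_amp):
        raise ValueError("sequences and is_amp must have the same length")
    if not sequences:
        raise ValueError("at least one sequence is required")
    # Share of sequences labelled as AMP, expressed as a percentage.
    amp_percentage = 100.0 * sum(is_amp) / len(sequences)
    # Mean sequence length in residues (characters).
    average_length = sum(len(s) for s in sequences) / len(sequences)
    return amp_percentage, average_length


# Toy input for illustration only; these are not model outputs.
toy_sequences = ["GLFDIVKKVV", "MKT", "KWKLFKKIGAVLKVL"]
toy_labels = [True, False, True]
amp_pct, avg_len = summarize(toy_sequences, toy_labels)
print(f"AMP%: {amp_pct:.2f}, average length: {avg_len:.2f}")
```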

	## Training and evaluation data

	AmpGPT2 was trained using 32014 AMP sequences from the Compass (https://compass.mathematik.uni-marburg.de/) database.

	## How to use AmpGPT2

	The example code below uses the generation settings that performed best in testing.
	The `num_return_sequences` parameter sets the number of sequences generated. When generating more than 100 sequences, I recommend doing it in batches.
	The results can then be checked with the Antimicrobial Peptide Scanner.
	```
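	from transformers import pipeline
	from transformers import GPT2LMHeadModel, GPT2Tokenizer

	# Text-generation pipeline; this loads the model and tokenizer from the Hub.
	ampgpt2 = pipeline('text-generation', model="wabu/AmpGPT2")

	# Optional: direct handles to the model and tokenizer (not used below).
	model_amp = GPT2LMHeadModel.from_pretrained('wabu/AmpGPT2')
	tokenizer_amp = GPT2Tokenizer.from_pretrained('wabu/AmpGPT2')

	# Sample 10 sequences, starting from the '<|endoftext|>' delimiter inherited from ProtGPT2.
	# eos_token_id=0 stops generation at the end-of-sequence token.
	amp_sequences = ampgpt2("<|endoftext|>", do_sample=True, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)

	# Print each sequence in FASTA format.
	for i, seq in enumerate(amp_sequences):
	    sequence_identifier = f"Sequence_{i + 1}"
	    sequence = seq['generated_text'].replace('<|endoftext|>', '').strip()
	    print(f">{sequence_identifier}\n{sequence}")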
	```
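
The batching advice above can be sketched as a small loop that requests at most 100 sequences per call and then writes the result in FASTA format for the scanner. This is an illustrative sketch, not the author's code: `generate_in_batches`, `to_fasta`, and the stand-in `fake_generate` are hypothetical names, and the demo uses a dummy generator so that it runs without downloading the model.

```python
def generate_in_batches(generate_fn, total, batch_size=100):
    """Collect `total` sequences by calling `generate_fn(n)` repeatedly.

    generate_fn: callable that takes a batch size n and returns a list of
                 up to n sequence strings (hypothetical interface).
    """
    if total < 0 or batch_size <= 0:
        raise ValueError("total must be >= 0 and batch_size must be > 0")
    sequences = []
    while len(sequences) < total:
        # Request only as many sequences as are still missing.
        n = min(batch_size, total - len(sequences))
        batch = generate_fn(n)
        if not batch:
            raise RuntimeError("generate_fn returned no sequences")
        sequences.extend(batch[:n])
    return sequences


def to_fasta(sequences):
    """Format sequences as FASTA records named Sequence_1, Sequence_2, ..."""
    return "\n".join(
        f">Sequence_{i}\n{seq}" for i, seq in enumerate(sequences, start=1)
    )


# Stand-in generator for the demo; it records the requested batch sizes.
requested = []

def fake_generate(n):
    requested.append(n)
    return ["KWKLFKKI"] * n


seqs = generate_in_batches(fake_generate, total=250, batch_size=100)
print(len(seqs), requested)
print(to_fasta(seqs[:2]))
```

With the pipeline from the example above, `generate_fn` could be a small wrapper that calls `ampgpt2` with `num_return_sequences=n` and returns the cleaned `generated_text` fields.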

	## Training hyperparameters and results

	The following hyperparameters were used during training:
	- learning_rate: 1e-05
	- train_batch_size: 32
	- eval_batch_size: 32
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 50.0
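
For reference, the hyperparameters listed above map onto `transformers.TrainingArguments` keywords roughly as follows. This is a sketch under assumptions rather than the original training script, which the card does not include: it assumes the batch size of 32 is per device (the card does not say whether it is per device or the total across GPUs), it leaves the optimizer choice at the library default, and `build_training_arguments` with its `output_dir` argument is a hypothetical helper.

```python
# Keyword arguments mirroring the hyperparameter list in this card.
training_kwargs = {
    "learning_rate": 1e-05,
    # Assumption: the card's train/eval batch sizes are per-device values.
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 50.0,
}


def build_training_arguments(output_dir):
    """Build TrainingArguments from the values above (hypothetical helper)."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs)


for name, value in training_kwargs.items():
    print(f"{name}: {value}")
```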

	| Training Loss | Epoch | Validation Loss | Accuracy |
	|:-------------:|:-----:|:---------------:|:--------:|
	| 3.7948 | 50.0 | 3.9890 | 0.4213 |

	### Framework versions

	- Transformers 4.38.0.dev0
	- Pytorch 2.2.0+cu121
	- Datasets 2.16.1
	- Tokenizers 0.15.0

	The model was trained on four NVIDIA A100 GPUs.