	---
	license: apache-2.0
	base_model: nferruz/ProtGPT2
	tags:
	- generated_from_trainer
	metrics:
	- accuracy
	model-index:
	- name: AmpGPT2
	  results: []
	---

	# AmpGPT2

	AmpGPT2 is a language model that generates de novo antimicrobial peptides (AMPs). Of the generated sequences, 95.83% are predicted to be AMPs.

	## Model description

	AmpGPT2 is a fine-tuned version of [nferruz/ProtGPT2](https://huggingface.co/nferruz/ProtGPT2) based on the GPT2 Transformer architecture.

	| Training Loss | Epoch | Validation Loss | Accuracy |
	|:-------------:|:-----:|:---------------:|:--------:|
	| 3.7948        | 50.0  | 3.9890          | 0.4213   |

	The results were validated with the Antimicrobial Peptide Scanner vr.2 (https://www.dveltri.com/ascan/v2/ascan.html), a deep learning tool designed specifically for AMP recognition.

	## Training and evaluation data

	AmpGPT2 was trained on 32,014 AMP sequences from the Compass (https://compass.mathematik.uni-marburg.de/) database.

	## How to use AmpGPT2

	The example code below uses the generation settings that worked best in testing.
	The `num_return_sequences` parameter sets the number of sequences to generate. To generate more than 100 sequences, I recommend doing so in batches.
	The results can then be checked with the peptide scanner.
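	```python
	from transformers import pipeline
	from transformers import GPT2LMHeadModel, GPT2Tokenizer

	# Text-generation pipeline used for sampling below.
	ampgpt2 = pipeline('text-generation', model="wabu/AmpGPT2")

	# Optional: load the model and tokenizer directly for lower-level access.
	model_amp = GPT2LMHeadModel.from_pretrained('wabu/AmpGPT2')
	tokenizer_amp = GPT2Tokenizer.from_pretrained('wabu/AmpGPT2')

	# '<|endoftext|>' is the separator token inherited from ProtGPT2 (token id 0),
	# so generation starts at a separator and stops at the next one.
	amp_sequences = ampgpt2("<|endoftext|>", do_sample=True, repetition_penalty=1.2, num_return_sequences=10, eos_token_id=0)

	# Print each generated sequence in FASTA format.
	for i, seq in enumerate(amp_sequences):
	    sequence_identifier = f"Sequence_{i + 1}"
	    sequence = seq['generated_text'].replace('<|endoftext|>', '').strip()

	    print(f">{sequence_identifier}\n{sequence}")
	```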
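
The batching advice above can be sketched as follows. This is a minimal sketch, not part of the model card's tested code: `generate_in_batches`, `to_fasta`, and `fake_generate` are hypothetical helper names, the separator token is assumed to be `<|endoftext|>` as in ProtGPT2, and stripping all internal whitespace from each sequence is a choice made here for illustration. The stand-in generator lets the snippet run without downloading the model.

```python
from typing import Callable, Dict, List


def generate_in_batches(
    generate: Callable[[int], List[Dict[str, str]]],
    total: int,
    batch_size: int = 100,
) -> List[str]:
    """Collect `total` sequences by calling `generate(n)` with n <= batch_size."""
    sequences: List[str] = []
    while len(sequences) < total:
        n = min(batch_size, total - len(sequences))
        batch = generate(n)
        if not batch:
            break  # avoid looping forever if the generator returns nothing
        for item in batch:
            # Drop the assumed separator token, then remove all whitespace and newlines.
            text = item["generated_text"].replace("<|endoftext|>", "")
            sequences.append("".join(text.split()))
    return sequences[:total]


def to_fasta(sequences: List[str], prefix: str = "Sequence") -> str:
    """Format sequences as FASTA text, one record per sequence."""
    return "\n".join(f">{prefix}_{i + 1}\n{s}" for i, s in enumerate(sequences))


def fake_generate(n: int) -> List[Dict[str, str]]:
    """Stand-in for the pipeline call; returns a placeholder peptide, not model output."""
    return [{"generated_text": "<|endoftext|>\nGLFDIVKK\nVVGA"} for _ in range(n)]


fasta = to_fasta(generate_in_batches(fake_generate, total=250, batch_size=100))
print(fasta.splitlines()[0])
```

With the real model, `fake_generate` would be replaced by something like `lambda n: ampgpt2("<|endoftext|>", do_sample=True, repetition_penalty=1.2, num_return_sequences=n, eos_token_id=0)`, and the resulting FASTA text could be saved to a file for the peptide scanner.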

	### Training hyperparameters and results

	The following hyperparameters were used during training:
	- learning_rate: 1e-05
	- train_batch_size: 32
	- eval_batch_size: 32
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 50.0
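
To make `lr_scheduler_type: linear` concrete, the sketch below computes the learning rate at a given optimizer step for a linear schedule: optional linear warmup, then linear decay to zero. It assumes no warmup, since no warmup setting is listed above, and the total step count is a hypothetical value because the card does not state it. `linear_lr` is an illustrative helper, not a library function.

```python
def linear_lr(
    step: int,
    total_steps: int,
    base_lr: float = 1e-5,
    warmup_steps: int = 0,
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1000  # hypothetical; the number of optimizer steps is not given in the card
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate starts at 1e-05, reaches half that value midway through training, and ends at zero.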

	AMP yield comparison between AmpGPT2 and ProtGPT2:

	| Model | Total Sequences | AMP Classified | AMP Percentage (AMP%) |
	|:-----:|:---------------:|:--------------:|:---------------------:|
	| AmpGPT2 | 10000 | 9541 | 95.41% |
	| ProtGPT2 | 10000 | 5530 | 55.30% |

	| Model | AMP% | Mean length |
	|:-------:|:-----:|:-------:|
	| AmpGPT2 | 95.86 | 64.08 |
	| ProtGPT2 | 51.85 | 222.59 |
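
The AMP percentage is the share of generated sequences the classifier labels as AMPs, for example 9541 / 10000 = 95.41%. The sketch below shows one way such summary figures could be computed from classifier output. It assumes the length column reports the mean number of residues per sequence, `summarize` is a hypothetical helper, and the input is toy data rather than real scanner results.

```python
from typing import List, Tuple


def summarize(records: List[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (AMP percentage, mean sequence length) for (sequence, is_amp) pairs."""
    if not records:
        raise ValueError("no records to summarize")
    amp_count = sum(1 for _, is_amp in records if is_amp)
    mean_length = sum(len(seq) for seq, _ in records) / len(records)
    return 100.0 * amp_count / len(records), mean_length


# Toy input: placeholder peptides and labels, not real classifier output.
toy = [("GLFDIVKK", True), ("KWKLFKKI", True), ("MSTNPKPQ", False), ("AAAA", True)]
amp_pct, mean_len = summarize(toy)
print(f"AMP%: {amp_pct:.2f}, mean length: {mean_len:.2f}")
```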

	### Framework versions

	- Transformers 4.38.0.dev0
	- PyTorch 2.2.0+cu121
	- Datasets 2.16.1
	- Tokenizers 0.15.0

	The model was trained on four NVIDIA A100 GPUs.