Update README.md

8a66607 verified 7 months ago

5.05 kB

	---
	library_name: transformers
	tags:
	- medical
	license: llama3
	language:
	- en
	---
	# BioMed LLaMa-3 8B
	Meta AI released the Llama-3 family of LLMs, composed of two models of the next generation of Llama, Meta Llama 3, available for broad use. This release features pretrained and instruction-fine-tuned language models with 8B and 70B parameters that can support a broad range of use cases.

	Llama-3 is a decoder-only transformer architecture with a 128K-token vocabulary and grouped query attention to improve inference efficiency. It has been trained on sequences of 8192 tokens.

	Llama-3 achieved state-of-the-art performance, enhancing capabilities in reasoning, code generation, and instruction following. It is expected to outperform Claude Sonnet, Mistral Medium, and GPT-3.5 on a number of benchmarks.

	## Model Details
	Powerful LLMs are trained on large amounts of unstructured data and are great at general text generation. BioMed-LLaMa-3-8B based on [Llama-3-8b](https://huggingface.co/meta-llama/Meta-Llama-3-8B) addresses some constraints related to using off-the-shelf pre-trained LLMs, especially in the biomedical domain:

	* Efficiently fine-tuned LLaMa-3-8B on medical instruction Alpaca data, encompassing over 54K instruction-focused examples.
	* Fine-tuned using QLoRa to further reduce memory usage while maintaining model performance and enhancing its capabilities in the biomedical domain.

	![finetuning](assets/finetuning.png "LLaMa-3 Fine-Tuning")


	## ⚙️ Config

	\| Parameter \| Value \|
	\|-------------------\|-------------\|
	\| learning rate \| 1e-8 \|
	\| Optimizer \| Adam \|
	\| Betas \| (0.9, 0.99) \|
	\| adam_epsilon \| 1e-8 \|
	\| Lora Alpha \| 16 \|
	\| R \| 8 \|
	\| Lora Dropout \| 0.05 \|
	\| Load in 4 bits \| True \|
	\| Flash Attention 2 \| True \|
	\| Train Batch Size \| 8 \|
	\| Valid Batch Size \| 8 \|
	\| Max Seq Length \| 512 \|


	## 💻 Usage

	```python
	# Installations
	!pip install peft --quiet
	!pip install bitsandbytes --quiet
	!pip install transformers --quiet
	!pip install flash-attn --no-build-isolation --quiet

	# Imports
	import torch
	from peft import LoraConfig, PeftModel
	from transformers import (
	AutoTokenizer,
	BitsAndBytesConfig,
	AutoModelForCausalLM)

	# generate_prompt function
	def generate_prompt(instruction, input=None):
	if input:
	return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. # noqa: E501

	### Instruction:
	{instruction}

	### Input:
	{input}

	### Response:
	"""
	else:
	return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request. # noqa: E501

	### Instruction:
	{instruction}

	### Response:
	"""

	# Model Loading Configuration
	based_model_path = "meta-llama/Meta-Llama-3-8B"
	lora_weights = "NouRed/BioMed-Tuned-Llama-3-8b"

	load_in_4bit=True
	bnb_4bit_use_double_quant=True
	bnb_4bit_quant_type="nf4"
	bnb_4bit_compute_dtype=torch.bfloat16
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# Load Tokenizer
	tokenizer = AutoTokenizer.from_pretrained(
	based_model_path,
	)

	tokenizer.padding_side = 'right'
	tokenizer.pad_token = tokenizer.eos_token
	tokenizer.add_eos_token = True

	# Load Base Model in 4 Bits
	quantization_config = BitsAndBytesConfig(
	load_in_4bit=load_in_4bit,
	bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
	bnb_4bit_quant_type=bnb_4bit_quant_type,
	bnb_4bit_compute_dtype=bnb_4bit_compute_dtype
	)

	base_model = AutoModelForCausalLM.from_pretrained(
	based_model_path,
	device_map="auto",
	attn_implementation="flash_attention_2", # I have an A100 GPU with 40GB of RAM 😎
	quantization_config=quantization_config,
	)

	# Load Peft Model
	model = PeftModel.from_pretrained(
	base_model,
	lora_weights,
	torch_dtype=torch.float16,
	)

	# Prepare Input
	instruction = "I have a sore throat, slight cough, tiredness. should i get tested fro covid 19?"

	prompt = generate_prompt(instruction)
	inputs = tokenizer(prompt, return_tensors="pt").to(device)

	# Generate Text
	with torch.no_grad():
	generation_output = model.generate(
	**inputs,
	max_new_tokens=128
	)

	# Decode Output
	output = tokenizer.decode(
	generation_output[0],
	skip_special_tokens=True,
	clean_up_tokenization_spaces=True)

	print(output)
	```

	## 📋 Cite Us

	```
	@misc{biomedllama32024zekaoui,
	author = {Nour Eddine Zekaoui},
	title = {BioMed-LLaMa-3: Efficient Instruction Fine-Tuning in Biomedical Language},
	year = {2024},
	howpublished = {In Hugging Face Model Hub},
	url = {https://huggingface.co/NouRed/BioMed-Tuned-Llama-3-8b}
	}
	```

	```
	@article{llama3modelcard,
	title={Llama 3 Model Card},
	author={AI@Meta},
	year={2024},
	url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
	}
	```

	Created with ❤️ by [@NZekaoui](https://twitter.com/NZekaoui)