---
library_name: transformers
tags:
- medical
license: llama3
language:
- en
---

# BioMed LLaMa-3 8B

Meta AI released the Llama-3 family of LLMs, the next generation of Llama, for broad use. The release features pretrained and instruction-fine-tuned language models with 8B and 70B parameters that support a wide range of use cases.

Llama-3 uses a decoder-only transformer architecture with a 128K-token vocabulary and grouped-query attention for more efficient inference, and it was trained on sequences of 8,192 tokens. Meta reports state-of-the-art performance with improved reasoning, code generation, and instruction following, and the model is reported to outperform Claude Sonnet, Mistral Medium, and GPT-3.5 on a number of benchmarks.

## Model Details

Powerful LLMs are trained on large amounts of unstructured data and excel at general text generation. BioMed-LLaMa-3-8B, based on [Llama-3-8b](https://huggingface.co/meta-llama/Meta-Llama-3-8B), addresses some of the constraints of using off-the-shelf pre-trained LLMs in the biomedical domain:

* Efficiently fine-tuned LLaMa-3-8B on medical instruction Alpaca data, encompassing over 54K instruction-focused examples.
* Fine-tuned with QLoRA to further reduce memory usage while maintaining model performance and strengthening its capabilities in the biomedical domain.

![finetuning](assets/finetuning.png "LLaMa-3 Fine-Tuning")

## ⚙️ Config

| Parameter         | Value       |
|-------------------|-------------|
| Learning Rate     | 1e-8        |
| Optimizer         | Adam        |
| Betas             | (0.9, 0.99) |
| Adam Epsilon      | 1e-8        |
| LoRA Alpha        | 16          |
| LoRA R            | 8           |
| LoRA Dropout      | 0.05        |
| Load in 4-bit     | True        |
| Flash Attention 2 | True        |
| Train Batch Size  | 8           |
| Valid Batch Size  | 8           |
| Max Seq Length    | 512         |
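The training script itself is not part of this card, but as a rough illustration the settings above map onto standard `transformers` and `peft` objects as in the sketch below. This is a hedged reconstruction, not the exact code used for fine-tuning: dataset loading, prompt formatting, and the trainer call are omitted, and names such as `output_dir` are placeholders.

```python
import torch
from transformers import (
    AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model_id = "meta-llama/Meta-Llama-3-8B"

# "Load in 4-bit": NF4 double quantization, matching the inference setup below
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # "Flash Attention 2" in the table
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# "LoRA Alpha", "LoRA R", "LoRA Dropout"; target modules are not listed in the
# card, so peft's defaults for Llama are assumed here
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Optimizer and batch-size rows; the Trainer's AdamW stands in for "Adam".
# "Max Seq Length" (512) would be applied when tokenizing the Alpaca prompts.
training_args = TrainingArguments(
    output_dir="biomed-llama-3-8b",  # placeholder
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-8,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-8,
    bf16=True,
)
```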
## 💻 Usage

```python
# Installations
!pip install peft --quiet
!pip install bitsandbytes --quiet
!pip install transformers --quiet
!pip install flash-attn --no-build-isolation --quiet


# Imports
import torch
from peft import LoraConfig, PeftModel
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    AutoModelForCausalLM
)


# generate_prompt function
def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.  # noqa: E501

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.  # noqa: E501

### Instruction:
{instruction}

### Response:
"""


# Model Loading Configuration
based_model_path = "meta-llama/Meta-Llama-3-8B"
lora_weights = "NouRed/BioMed-Tuned-Llama-3-8b"

load_in_4bit = True
bnb_4bit_use_double_quant = True
bnb_4bit_quant_type = "nf4"
bnb_4bit_compute_dtype = torch.bfloat16

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    based_model_path,
)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True


# Load Base Model in 4 Bits
quantization_config = BitsAndBytesConfig(
    load_in_4bit=load_in_4bit,
    bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=bnb_4bit_compute_dtype
)

base_model = AutoModelForCausalLM.from_pretrained(
    based_model_path,
    device_map="auto",
    attn_implementation="flash_attention_2",  # I have an A100 GPU with 40GB of RAM 😎
    quantization_config=quantization_config,
)


# Load Peft Model
model = PeftModel.from_pretrained(
    base_model,
    lora_weights,
    torch_dtype=torch.float16,
)


# Prepare Input
instruction = "I have a sore throat, slight cough, tiredness. should i get tested for covid 19?"
prompt = generate_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt").to(device)


# Generate Text
with torch.no_grad():
    generation_output = model.generate(
        **inputs,
        max_new_tokens=128
    )


# Decode Output
output = tokenizer.decode(
    generation_output[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)

print(output)
```

## 📋 Cite Us

```
@misc{biomedllama32024zekaoui,
  author = {Nour Eddine Zekaoui},
  title = {BioMed-LLaMa-3: Efficient Instruction Fine-Tuning in Biomedical Language},
  year = {2024},
  howpublished = {Hugging Face Model Hub},
  url = {https://huggingface.co/NouRed/BioMed-Tuned-Llama-3-8b}
}
```

```
@article{llama3modelcard,
  title = {Llama 3 Model Card},
  author = {AI@Meta},
  year = {2024},
  url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
}
```

Created with ❤️ by [@NZekaoui](https://twitter.com/NZekaoui)