---
base_model: Writer/Palmyra-Med-70B
tags:
- fp8
- vllm
- medical
- med
license: other
license_name: writer-open-model-license
license_link: https://writer.com/legal/open-model-license/
language:
- en
---

# Palmyra-Medical-70B-FP8

This is a quantized version of [Palmyra-Med-70B](https://huggingface.co/Writer/Palmyra-Med-70B), which was developed by Writer. The original model averages 85.87% on the biomedical benchmarks listed below. **This quantized version achieves an average score of 85.62%.**

## Model Overview:
- **Model:** Llama-based model fine-tuned to form Palmyra-X-004 and then fine-tuned again to form Palmyra-Med-70B.
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Palmyra-Medical-70B-FP8 is intended for non-commercial and research use in English. Instruction-tuned models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **License(s):** [writer-open-model-license](https://writer.com/legal/open-model-license/)

### Writer Resources and Technical Documentation:
+ [Writer Blog](https://writer.com/blog/palmyra-med-fin-models/)
+ [Writer Developer Website](https://dev.writer.com/home/models)
+ [Writer AI Studio](https://writer.com/product/ai-studio/)
+ [Palmyra Model API](https://dev.writer.com/api-guides/chat-completion)

### Model Optimizations

This model was quantized with the [LLM Compressor](https://github.com/vllm-project/llm-compressor) library. With this optimization, the original FP16 weights and linear activations within the transformer blocks are converted to FP8, which decreases the model size and VRAM requirements by roughly 50%.

## Deployment with vLLM

This model can be deployed using the [vLLM](https://docs.vllm.ai/en/latest/) library, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "bprice9/Palmyra-Medical-70B-FP8"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.0, top_p=0.9, max_tokens=512, stop_token_ids=[128001, 128009])

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Give a differential for an intrahepatic lesion with early arterial phase enhancement and rapid washout."},
]

# Format the conversation with the model's chat template before generation.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
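vLLM can also serve the model behind an OpenAI-compatible HTTP endpoint, which applies the chat template automatically on the server side. The sketch below is illustrative rather than part of the original card: it assumes the server was started with `vllm serve bprice9/Palmyra-Medical-70B-FP8 --tensor-parallel-size 2` on the default port (8000) and uses the `openai` Python client.

```python
# Illustrative sketch, assuming a vLLM OpenAI-compatible server was launched with:
#   vllm serve bprice9/Palmyra-Medical-70B-FP8 --tensor-parallel-size 2
# and is listening on the default port 8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM does not require a real API key by default
)

response = client.chat.completions.create(
    model="bprice9/Palmyra-Medical-70B-FP8",
    messages=[
        {"role": "user", "content": "Give a differential for an intrahepatic lesion "
                                    "with early arterial phase enhancement and rapid washout."}
    ],
    temperature=0.0,
    max_tokens=512,
)

print(response.choices[0].message.content)
```

Unlike the offline example above, no manual chat-template formatting is needed, since the server formats the messages before generation.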
## Creation

This model was created by applying [LLM Compressor with calibration samples from UltraChat](https://github.com/vllm-project/llm-compressor/blob/sa/big_model_support/examples/big_model_offloading/big_model_w8a8_calibrate.py), as presented in the code below.

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import (
    calculate_offload_device_map,
    custom_offload_device_map,
)

# Static, symmetric, per-tensor FP8 quantization of Linear weights and input
# activations; the lm_head is left unquantized.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""

model_stub = "Writer/Palmyra-Med-70B"
model_name = model_stub.split("/")[-1]

# Spread the 70B model across 2 GPUs, offloading the remainder to CPU.
device_map = calculate_offload_device_map(
    model_stub, reserve_for_hessians=False, num_gpus=2, torch_dtype=torch.float16
)

model = SparseAutoModelForCausalLM.from_pretrained(
    model_stub, torch_dtype=torch.float16, device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_stub)

output_dir = f"./{model_name}-FP8"

DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 4096

# Build the calibration set from UltraChat.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Apply the FP8 recipe in one shot and save the compressed checkpoint.
oneshot(
    model=model,
    output_dir=output_dir,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    save_compressed=True,
)
```

## Evaluation

<table>
  <tr>
    <td style="width: 20%;"><strong>Biomedical Benchmark</strong></td>
    <td style="width: 20%;"><strong>Med-PaLM-2 (5-shot)</strong></td>
    <td style="width: 20%;"><strong>GPT-4</strong></td>
    <td style="width: 20%;"><strong>Palmyra-Med-70B (Original FP16)</strong></td>
    <td style="width: 20%;"><strong>Palmyra-Medical-70B-FP8 (This Model)</strong></td>
  </tr>
  <tr>
    <td>MMLU Clinical Knowledge</td>
    <td>88.3</td>
    <td>86.0</td>
    <td>90.9</td>
    <td>90.2</td>
  </tr>
  <tr>
    <td>MMLU Medical Genetics</td>
    <td>90.0</td>
    <td>91.0</td>
    <td>94.0</td>
    <td>93.0</td>
  </tr>
  <tr>
    <td>MMLU Anatomy</td>
    <td>77.8</td>
    <td>80.0</td>
    <td>83.7</td>
    <td>83.7</td>
  </tr>
  <tr>
    <td>MMLU Professional Medicine</td>
    <td>95.2</td>
    <td>93.0</td>
    <td>92.7</td>
    <td>92.3</td>
  </tr>
  <tr>
    <td>MMLU College Biology</td>
    <td>94.4</td>
    <td>95.1</td>
    <td>94.4</td>
    <td>93.8</td>
  </tr>
  <tr>
    <td>MMLU College Medicine</td>
    <td>80.9</td>
    <td>76.9</td>
    <td>84.4</td>
    <td>84.4</td>
  </tr>
  <tr>
    <td>MedQA 4-options</td>
    <td>79.9</td>
    <td>78.9</td>
    <td>78.6</td>
    <td>79.5</td>
  </tr>
  <tr>
    <td>PubMed QA</td>
    <td>79.2</td>
    <td>75.2</td>
    <td>79.6</td>
    <td>78.0</td>
  </tr>
  <tr>
    <td>MedMCQA</td>
    <td>71.3</td>
    <td>69.5</td>
    <td>74.4</td>
    <td>75.7</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>84.1</strong></td>
    <td><strong>82.8</strong></td>
    <td><strong>85.9</strong></td>
    <td><strong>85.6</strong></td>
  </tr>
</table>

### Citation and Related Information Provided by Writer

To cite this model:

```
@misc{Palmyra-Med-70B,
  author       = {Writer Engineering team},
  title        = {{Palmyra-Med-70b: A powerful LLM designed for healthcare}},
  howpublished = {\url{https://dev.writer.com}},
  year         = 2024,
  month        = June
}
```