FlorianJc committed
Commit 69e3ae5
1 Parent(s): 749e4ee

Update README.md

Files changed (1)
  1. README.md +31 -0
README.md CHANGED
@@ -22,6 +22,37 @@ FP8 (F8_E4M3) quantized version of Mistral-Nemo-Instruct-2407 with 512 epochs.
Should work with transformers, but you need this patch (shown below) to use it with vLLM: https://github.com/vllm-project/vllm/pull/6548
Or simply wait for vLLM 0.5.3...

+ ```diff
+ --- vllm/model_executor/models/llama.py 2024-07-19 02:01:59.192831673 +0200
+ +++ vllm/model_executor/models/llama.py 2024-07-19 02:01:36.752721235 +0200
+ @@ -89,6 +89,7 @@
+ 
+      def __init__(
+          self,
+ +        config: LlamaConfig,
+          hidden_size: int,
+          num_heads: int,
+          num_kv_heads: int,
+ @@ -115,7 +116,8 @@
+          # the KV heads across multiple tensor parallel GPUs.
+          assert tp_size % self.total_num_kv_heads == 0
+          self.num_kv_heads = max(1, self.total_num_kv_heads // tp_size)
+ -        self.head_dim = hidden_size // self.total_num_heads
+ +        # MistralConfig has an optional head_dim introduced by Mistral-Nemo
+ +        self.head_dim = getattr(config, "head_dim", self.hidden_size // self.total_num_heads)
+          self.q_size = self.num_heads * self.head_dim
+          self.kv_size = self.num_kv_heads * self.head_dim
+          self.scaling = self.head_dim**-0.5
+ @@ -189,6 +191,7 @@
+          attention_bias = getattr(config, "attention_bias", False) or getattr(
+              config, "bias", False)
+          self.self_attn = LlamaAttention(
+ +            config=config,
+              hidden_size=self.hidden_size,
+              num_heads=config.num_attention_heads,
+              num_kv_heads=getattr(config, "num_key_value_heads",
+ ```
+
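To make the patch concrete: Mistral-Nemo sets an explicit `head_dim` (128) that differs from `hidden_size // num_attention_heads` (5120 // 32 = 160), so deriving the head dimension from the hidden size loads the checkpoint with the wrong shapes. Here is a rough sketch of the fallback the patched line adds; the config values are my reading of Mistral-Nemo's config.json, so treat them as assumptions.

```python
# Sketch only: mimics the patched getattr fallback with assumed Mistral-Nemo config values.
from types import SimpleNamespace

nemo_config = SimpleNamespace(hidden_size=5120, num_attention_heads=32, head_dim=128)

# Old behaviour: derive head_dim from hidden_size -> 160, which does not match the weights.
derived = nemo_config.hidden_size // nemo_config.num_attention_heads
print(derived)  # 160

# Patched behaviour: prefer the explicit head_dim when the config provides one.
head_dim = getattr(nemo_config, "head_dim", derived)
print(head_dim)  # 128
```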
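Once a vLLM build that includes the patch (or vLLM 0.5.3+) is installed, offline inference could look roughly like the sketch below; the repo id and the generation settings are placeholders, not tested values.

```python
# Illustrative sketch, not a verified recipe: serve this FP8 checkpoint with vLLM >= 0.5.3.
from vllm import LLM, SamplingParams

llm = LLM(
    model="FlorianJc/Mistral-Nemo-Instruct-2407-fp8",  # placeholder repo id, adjust to the actual one
    max_model_len=8192,  # example value only
)
params = SamplingParams(temperature=0.3, max_tokens=256)
outputs = llm.generate(["Explain FP8 (E4M3) quantization in two sentences."], params)
print(outputs[0].outputs[0].text)
```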
 
# Original model README.md file: