add Hardware section
README.md CHANGED
@@ -125,12 +125,6 @@ output = pipe(messages, **generation_args)
 print(output[0]['generated_text'])
 ```
 
-Note that by default the model use flash attention which requires certain types of GPU to run. If you want to run the model on:
-
-+ V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
-+ CPU: use the **GGUF** quantized models [4K](https://aka.ms/Phi3-mini-4k-instruct-gguf)
-+ Optimized inference on GPU, CPU, and Mobile: use the **ONNX** models [4K](https://aka.ms/Phi3-mini-4k-instruct-onnx)
-
 ## Responsible AI Considerations
 
 Like other language models, the Phi series models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
@@ -216,6 +210,18 @@ The number of k–shot examples is listed per-benchmark.
 * [Transformers](https://github.com/huggingface/transformers)
 * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
 
+## Hardware
+
+Note that by default, the Phi-3-mini model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
+* NVIDIA A100
+* NVIDIA A6000
+* NVIDIA H100
+
+If you want to run the model on:
+* NVIDIA V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
+* CPU: use the **GGUF** quantized models [4K](https://aka.ms/Phi3-mini-4k-instruct-gguf)
+* Optimized inference on GPU, CPU, and Mobile: use the **ONNX** models [4K](https://aka.ms/Phi3-mini-4k-instruct-onnx)
+
+
 ## Cross Platform Support
 
 ONNX runtime ecosystem now supports Phi-3 Mini models across platforms and hardware. You can find the optimized ONNX models [here](https://aka.ms/Phi3-ONNX-HF).
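
For GPUs without flash-attention support (e.g. V100 or earlier), the new Hardware section says to pass `attn_implementation="eager"` to `AutoModelForCausalLM.from_pretrained()`. Below is a minimal sketch of what that looks like; the checkpoint id `microsoft/Phi-3-mini-4k-instruct` and the generation settings are illustrative assumptions, not part of the diff.

```python
# Minimal sketch: load Phi-3-mini with eager attention instead of the default
# flash attention, e.g. for V100 or earlier GPUs. Checkpoint id and generation
# settings are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="eager",  # fall back from flash attention
    trust_remote_code=True,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Summarize flash attention in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```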
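For the CPU route, the section points at the GGUF quantized weights. One common way to run those is the third-party `llama-cpp-python` package; the sketch below assumes that package and a locally downloaded GGUF file (the file name is a placeholder), since the model card itself only links the weights.

```python
# Hedged sketch: CPU inference with a downloaded Phi-3-mini GGUF file via the
# third-party llama-cpp-python package (an assumption, not prescribed by the
# model card). The model_path file name is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])
```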