add Hardware section
README.md CHANGED
@@ -125,12 +125,6 @@ output = pipe(messages, **generation_args)
 print(output[0]['generated_text'])
 ```
 
-Note that by default the model use flash attention which requires certain types of GPU to run. If you want to run the model on:
-
-+ V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
-+ CPU: use the **GGUF** quantized models [4K](https://aka.ms/Phi3-mini-4k-instruct-gguf)
-+ Optimized inference on GPU, CPU, and Mobile: use the **ONNX** models [4K](https://aka.ms/Phi3-mini-4k-instruct-onnx)
-
 ## Responsible AI Considerations
 
 Like other language models, the Phi series models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
@@ -216,6 +210,18 @@ The number of k–shot examples is listed per-benchmark.
 * [Transformers](https://github.com/huggingface/transformers)
 * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
 
+## Hardware
+
+Note that by default, the Phi-3-mini model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
+* NVIDIA A100
+* NVIDIA A6000
+* NVIDIA H100
+
+If you want to run the model on:
+* NVIDIA V100 or earlier generation GPUs: call `AutoModelForCausalLM.from_pretrained()` with `attn_implementation="eager"`
+* CPU: use the **GGUF** quantized models [4K](https://aka.ms/Phi3-mini-4k-instruct-gguf)
+* Optimized inference on GPU, CPU, and Mobile: use the **ONNX** models [4K](https://aka.ms/Phi3-mini-4k-instruct-onnx)
+
+
 ## Cross Platform Support
 
 ONNX runtime ecosystem now supports Phi-3 Mini models across platforms and hardware. You can find the optimized ONNX models [here](https://aka.ms/Phi3-ONNX-HF).
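
For GPUs without flash-attention support (e.g. V100 or earlier), the new Hardware section says to pass `attn_implementation="eager"` to `AutoModelForCausalLM.from_pretrained()`. Below is a minimal sketch of what that looks like; the checkpoint id `microsoft/Phi-3-mini-4k-instruct` and the generation settings are illustrative assumptions, not part of the diff.

```python
# Minimal sketch: load Phi-3-mini with eager attention instead of the default
# flash attention, e.g. for V100 or earlier GPUs. Checkpoint id and generation
# settings are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="eager",  # fall back from flash attention
    trust_remote_code=True,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Summarize flash attention in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```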
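For the CPU route, the section points at the GGUF quantized weights. One common way to run those is the third-party `llama-cpp-python` package; the sketch below assumes that package and a locally downloaded GGUF file (the file name is a placeholder), since the model card itself only links the weights.

```python
# Hedged sketch: CPU inference with a downloaded Phi-3-mini GGUF file via the
# third-party llama-cpp-python package (an assumption, not prescribed by the
# model card). The model_path file name is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=64,
)
print(response["choices"][0]["message"]["content"])
```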