---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.1
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
---
# Meta-Llama-3.1-70B-Instruct-FP8-128K
## Model Overview
- Model Architecture: Meta-Llama-3.1
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
  - KV cache quantization: FP8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.1-70B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 8/27/2024
- Version: 1.0
- License(s): llama3.1
- Quantized version of Meta-Llama-3.1-70B-Instruct.
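FP8 KV-cache quantization is what makes the full 128K context practical in memory terms. The back-of-the-envelope sketch below estimates per-sequence KV-cache size; the architecture constants (layer count, KV heads, head dimension) are assumptions taken from the publicly documented Llama-3.1-70B configuration, not from this card.

```python
# Rough KV-cache memory estimate per sequence at full 128K context.
# Architecture numbers below are assumptions for Llama-3.1-70B
# (80 layers, grouped-query attention with 8 KV heads, head dim 128).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 131072  # matches --max-model-len in the serve command

def kv_cache_bytes(seq_len: int, bytes_per_elem: int) -> int:
    # Factor of 2 accounts for storing both keys and values per layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_elem * seq_len

fp16 = kv_cache_bytes(SEQ_LEN, 2)  # 2 bytes per element
fp8 = kv_cache_bytes(SEQ_LEN, 1)   # 1 byte per element

print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB per sequence")
print(f"FP8  KV cache: {fp8 / 2**30:.1f} GiB per sequence")
```

Under these assumptions, FP8 halves the KV-cache footprint relative to FP16, which is why `--kv-cache-dtype fp8` is passed when serving at 131072-token context.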
## Serve with vLLM engine
|
|
```bash
|
|
python3 -m vllm.entrypoints.openai.api_server \
|
|
--port <port> --model yejingfu/Meta-Llama-3.1-70B-Instruct-FP8-128K \
|
|
--tensor-parallel-size 4 --swap-space 16 --gpu-memory-utilization 0.96 --dtype auto \
|
|
--max-num-seqs 32 --max-model-len 131072 --kv-cache-dtype fp8 --enable-chunked-prefill
|
|
```
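Once running, the server exposes an OpenAI-compatible chat completions API. As a minimal sketch, the snippet below builds a request body for that endpoint; the message contents and sampling parameters are illustrative assumptions, and the request is only constructed (not sent), since sending it requires a live server.

```python
import json

# OpenAI-style chat completion request body for the served model.
# "model" must match the --model argument passed to vLLM above.
payload = {
    "model": "yejingfu/Meta-Llama-3.1-70B-Instruct-FP8-128K",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize FP8 quantization in one sentence."},
    ],
    "max_tokens": 128,      # illustrative sampling parameters
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```

POST this body to `http://localhost:<port>/v1/chat/completions` (with `Content-Type: application/json`) using any HTTP client or the OpenAI Python SDK pointed at that base URL.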