---

language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.1
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
---


# Meta-Llama-3.1-70B-Instruct-FP8-128K

## Model Overview
- Model Architecture: Meta-Llama-3.1
  - Input: Text
  - Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
  - KV cache quantization: FP8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Like Meta-Llama-3.1-70B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than those listed as supported above.
- Release Date: 8/27/2024
- Version: 1.0
- License(s): llama3.1
- Quantized version of Meta-Llama-3.1-70B-Instruct.
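
The FP8 optimizations above all rely on the same idea: rescale each tensor so its dynamic range fits the FP8 (E4M3) format before casting. A minimal sketch of per-tensor symmetric scaling, for illustration only (this is not the exact recipe used to produce this checkpoint):

```python
# Illustrative per-tensor FP8 (E4M3) dynamic-range scaling.
# A real quantizer would cast the scaled values to an FP8 dtype;
# here we only show the scale computation and range clamping.

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def compute_scale(values):
    """Per-tensor scale that maps the largest magnitude onto the FP8 range."""
    return max(abs(v) for v in values) / FP8_E4M3_MAX

def scale_to_fp8_range(values):
    """Scale values into [-448, 448]; a real kernel would then cast to FP8."""
    scale = compute_scale(values)
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return scaled, scale
```

At inference time the stored scale is multiplied back in, so the model computes with FP8 tensors while approximately preserving the original value range.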


## Serve with vLLM engine
```bash
python3 -m vllm.entrypoints.openai.api_server \
    --port <port> --model yejingfu/Meta-Llama-3.1-70B-Instruct-FP8-128K \
    --tensor-parallel-size 4 --swap-space 16 --gpu-memory-utilization 0.96 --dtype auto \
    --max-num-seqs 32 --max-model-len 131072 --kv-cache-dtype fp8 --enable-chunked-prefill
```
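
Once started, the server exposes an OpenAI-compatible REST API. A minimal sketch of a chat-completion call using only the Python standard library (the base URL, port, and prompt below are placeholders, not values prescribed by this model card):

```python
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=128):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(base_url, body):
    """POST the request to the vLLM OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example, assuming the server above was started with --port 8000:
# body = build_chat_request("yejingfu/Meta-Llama-3.1-70B-Instruct-FP8-128K", "Hello!")
# print(send_chat_request("http://localhost:8000", body))
```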
