---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.1
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama
- llama-3
---
# Meta-Llama-3.1-70B-Instruct-FP8-128K
## Model Overview
- Model Architecture: Meta-Llama-3.1
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
  - KV cache quantization: FP8
- Intended Use Cases: Intended for commercial and research use in multiple languages. Similarly to Meta-Llama-3.1-70B-Instruct, this model is intended for assistant-like chat.
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 8/27/2024
- Version: 1.0
- License(s): llama3.1
- Quantized version of Meta-Llama-3.1-70B-Instruct.
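FP8 KV-cache quantization is what makes the full 128K context practical in memory terms. The back-of-the-envelope sketch below estimates per-sequence KV-cache size; the architecture constants (layer count, KV heads, head dimension) are assumptions taken from the publicly documented Llama-3.1-70B configuration, not from this card.

```python
# Rough KV-cache memory estimate per sequence at full 128K context.
# Architecture numbers below are assumptions for Llama-3.1-70B
# (80 layers, grouped-query attention with 8 KV heads, head dim 128).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
SEQ_LEN = 131072  # matches --max-model-len in the serve command

def kv_cache_bytes(seq_len: int, bytes_per_elem: int) -> int:
    # Factor of 2 accounts for storing both keys and values per layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_elem * seq_len

fp16 = kv_cache_bytes(SEQ_LEN, 2)  # 2 bytes per element
fp8 = kv_cache_bytes(SEQ_LEN, 1)   # 1 byte per element

print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB per sequence")
print(f"FP8  KV cache: {fp8 / 2**30:.1f} GiB per sequence")
```

Under these assumptions, FP8 halves the KV-cache footprint relative to FP16, which is why `--kv-cache-dtype fp8` is passed when serving at 131072-token context.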
## Serve with vLLM engine
|
|
```bash
|
|
python3 -m vllm.entrypoints.openai.api_server \
|
|
--port <port> --model yejingfu/Meta-Llama-3.1-70B-Instruct-FP8-128K \
|
|
--tensor-parallel-size 4 --swap-space 16 --gpu-memory-utilization 0.96 --dtype auto \
|
|
--max-num-seqs 32 --max-model-len 131072 --kv-cache-dtype fp8 --enable-chunked-prefill
|
|
```
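Once running, the server exposes an OpenAI-compatible chat completions API. As a minimal sketch, the snippet below builds a request body for that endpoint; the message contents and sampling parameters are illustrative assumptions, and the request is only constructed (not sent), since sending it requires a live server.

```python
import json

# OpenAI-style chat completion request body for the served model.
# "model" must match the --model argument passed to vLLM above.
payload = {
    "model": "yejingfu/Meta-Llama-3.1-70B-Instruct-FP8-128K",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize FP8 quantization in one sentence."},
    ],
    "max_tokens": 128,      # illustrative sampling parameters
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```

POST this body to `http://localhost:<port>/v1/chat/completions` (with `Content-Type: application/json`) using any HTTP client or the OpenAI Python SDK pointed at that base URL.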