
metadata

datasets: wikitext
license: apache-2.0
license_link: https://llama.meta.com/llama3/license/

This is a GPTQ-quantized version of Llama-3 70B Instruct. GPTQ is a quantization method developed at IST Austria. The model was quantized with the following configuration:

Bits: 4 (an 8-bit version will follow)
Act order: True
Group size: 128
Seq. length: 4096
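
To give a feel for what this configuration means for memory, here is a back-of-the-envelope sketch of weight storage. It is an illustration, not a measurement. The nominal parameter count of 70e9, the fp16 scale and 4-bit zero-point stored per group, and the premise that every weight is quantized are all assumptions that do not come from this model card. The estimate also ignores the activation-order index, any layers left unquantized, and the KV cache.

```python
# Rough weight-memory estimate for group-wise quantization.
# ASSUMPTIONS (not from the model card): 70e9 parameters, all of them
# quantized, one fp16 scale and one zero-point per group of weights.

def quantized_weight_gb(n_params, bits, group_size, scale_bits=16):
    """Approximate weight storage in GB (10**9 bytes)."""
    zero_bits = bits  # zero-point assumed to use the same width as the weights
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9  # nominal parameter count (assumed)

fp16_gb = N_PARAMS * 16 / 8 / 1e9
gptq_gb = quantized_weight_gb(N_PARAMS, bits=4, group_size=128)

print(f"fp16 weights:          ~{fp16_gb:.0f} GB")
print(f"4-bit, group size 128: ~{gptq_gb:.1f} GB")
```

Under these assumptions the weights shrink from roughly 140 GB to roughly 36 GB, with the per-group metadata adding about 0.16 bits per weight on top of the 4 bits.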

Usage

Install vLLM and run the server:

python -m vllm.entrypoints.openai.api_server --model cortecs/Meta-Llama-3-70B-Instruct-GPTQ

Access the model:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/Meta-Llama-3-70B-Instruct-GPTQ",
        "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTell me a joke<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    }'
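
The same request can be sent from Python. The following is a minimal sketch using only the standard library. The helper names `llama3_prompt`, `build_request`, and `complete` are hypothetical, and the `max_tokens` value is an assumed setting that is not part of the command above. The double newline after the header follows Meta's published Llama 3 prompt format. The code assumes the server from the previous step is listening on localhost:8000.

```python
import json
import urllib.request

MODEL = "cortecs/Meta-Llama-3-70B-Instruct-GPTQ"
URL = "http://localhost:8000/v1/completions"  # assumes the local vLLM server

def llama3_prompt(user_message):
    """Wrap a single user turn in the Llama 3 chat markers."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>"
    )

def build_request(user_message, max_tokens=128):
    """Build the POST request; max_tokens=128 is an assumed value."""
    payload = {
        "model": MODEL,
        "prompt": llama3_prompt(user_message),
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(user_message):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("Tell me a joke"))` should print the completion text.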

Evaluations

English	Llama-3 70B Instruct	Llama 3 70B GPTQ	Llama-3 8B Instruct
Avg.	76.19	75.14	66.97
ARC	71.6	70.7	62.5
Hellaswag	77.3	76.4	70.3
MMLU	79.66	78.33	68.11

French	Llama-3 70B Instruct	Llama 3 70B GPTQ	Llama-3 8B Instruct
Avg.	70.97	70.27	57.73
ARC_fr	65.0	64.7	53.3
Hellaswag_fr	72.4	71.4	61.7
MMLU_fr	75.5	74.7	58.2

German	Llama-3 70B Instruct	Llama 3 70B GPTQ	Llama-3 8B Instruct
Avg.	68.43	66.93	53.47
ARC_de	64.2	62.6	49.1
Hellaswag_de	67.8	66.7	55.0
MMLU_de	73.3	71.5	56.3

Italian	Llama-3 70B Instruct	Llama 3 70B GPTQ	Llama-3 8B Instruct
Avg.	70.17	68.63	56.73
ARC_it	64.0	62.1	51.6
Hellaswag_it	72.6	71.0	61.3
MMLU_it	73.9	72.8	57.3

Safety	Llama-3 70B Instruct	Llama 3 70B GPTQ	Llama-3 8B Instruct
Avg.	64.28	63.64	61.42
RealToxicityPrompts	97.9	98.1	97.2
TruthfulQA	61.91	59.91	51.65
CrowS	33.04	32.92	35.42

Spanish	Llama-3 70B Instruct	Llama 3 70B GPTQ	Llama-3 8B Instruct
Avg.	72.50	71.30	59.00
ARC_es	66.7	65.7	54.1
Hellaswag_es	75.8	74.0	63.8
MMLU_es	75.0	74.2	59.1

Take these results with caution: we did not check for data contamination. Evaluations were run with Eval. Harness, using limit=1000 for large datasets.
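
The "Avg." rows appear to be the plain mean of the three task scores in each table. The short sketch below checks this for the English table and expresses the GPTQ average as a fraction of the full-precision average. The scores are copied from the table. The "retained" ratio is an illustrative metric introduced here, not one reported by the authors.

```python
from statistics import mean

# English scores copied from the table above: (Llama-3 70B Instruct, GPTQ).
english = {
    "ARC": (71.6, 70.7),
    "Hellaswag": (77.3, 76.4),
    "MMLU": (79.66, 78.33),
}

full_avg = mean(full for full, _ in english.values())
gptq_avg = mean(gptq for _, gptq in english.values())

# Illustrative metric (not from the model card): share of the
# full-precision average that the quantized model keeps.
retained = gptq_avg / full_avg

print(f"full: {full_avg:.2f}  GPTQ: {gptq_avg:.2f}  retained: {retained:.1%}")
```

For the English table this reproduces the reported averages of 76.19 and 75.14, so the quantized model keeps a little under 99% of the full-precision average.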

Performance

Llama-3 70B Instruct	requests/s	tokens/s
NVIDIA L40S x4	2.38	1135.41

Llama 3 70B GPTQ	requests/s	tokens/s
NVIDIA L40S x2	2.0	951.28

Llama-3 8B Instruct	requests/s	tokens/s
NVIDIA L40S x1	11.64	5548.63
NVIDIA L4 x1	2.76	1315.25
NVIDIA L4 x2	4.79	2283.53
Performance was measured on cortecs.ai.
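
Because the setups use different numbers of GPUs, it can help to normalize throughput per GPU. The sketch below does that with the figures from the tables above. The per-GPU and tokens-per-request values are derived here for illustration and were not reported by the authors.

```python
# (setup, GPU count, requests/s, tokens/s) copied from the tables above.
runs = [
    ("Llama-3 70B Instruct, L40S", 4, 2.38, 1135.41),
    ("Llama 3 70B GPTQ, L40S", 2, 2.0, 951.28),
    ("Llama-3 8B Instruct, L40S", 1, 11.64, 5548.63),
    ("Llama-3 8B Instruct, L4", 1, 2.76, 1315.25),
    ("Llama-3 8B Instruct, L4", 2, 4.79, 2283.53),
]

def per_gpu(tokens_per_s, n_gpus):
    """Throughput divided evenly across the GPUs of a setup."""
    return tokens_per_s / n_gpus

for name, n_gpus, req_s, tok_s in runs:
    print(
        f"{name} x{n_gpus}: "
        f"{per_gpu(tok_s, n_gpus):.1f} tokens/s per GPU, "
        f"{tok_s / req_s:.0f} tokens per request"
    )
```

By this measure the GPTQ model on two L40S GPUs delivers roughly 1.7 times the per-GPU token throughput of the unquantized 70B model on four.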