metadata
datasets: wikitext
license: apache-2.0
license_link: https://llama.meta.com/llama3/license/

This is a quantized version of Llama-3 70B Instruct, produced with GPTQ (developed by IST Austria) using the following configuration:

  • 4-bit (8-bit will follow)
  • Act order: True
  • Group size: 128
  • Seq. length: 4096
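
To give a feel for what 4-bit weights with a group size of 128 mean for memory, here is a rough back-of-the-envelope sketch. It is an estimate under stated assumptions, not a measurement of this checkpoint: the 70e9 parameter count is nominal, the per-group overhead (one 16-bit scale plus one 4-bit zero point) is an assumed storage layout, and unquantized tensors, the act-order index, and the KV cache are ignored. The helper name `gptq_weight_bytes` is hypothetical.

```python
def gptq_weight_bytes(n_params, bits=4, group_size=128, scale_bytes=2.0):
    """Approximate storage for group-wise quantized weights (assumed layout)."""
    packed = n_params * bits / 8                 # the quantized weights themselves
    n_groups = n_params / group_size             # one group per 128 weights
    overhead = n_groups * (scale_bytes + bits / 8)  # assumed: fp16 scale + 4-bit zero point
    return packed + overhead


N_PARAMS = 70e9  # nominal parameter count (assumption, not read from the checkpoint)

fp16_gb = N_PARAMS * 2 / 1e9                     # 2 bytes per weight in fp16
gptq_gb = gptq_weight_bytes(N_PARAMS) / 1e9

print(f"fp16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit GPTQ weights (group size 128): ~{gptq_gb:.1f} GB")
```

Under these assumptions the quantized weights need roughly a quarter of the fp16 footprint, which is consistent with the Performance section running the GPTQ model on two GPUs instead of four.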

Usage

Install vLLM and run the server:

python -m vllm.entrypoints.openai.api_server --model cortecs/Meta-Llama-3-70B-Instruct-GPTQ

Access the model:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/Meta-Llama-3-70B-Instruct-GPTQ",
        "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nTell me a joke<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    }'
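
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names (`build_prompt`, `build_request`, `complete`) and the `max_tokens` value are illustrative assumptions, and the prompt follows the Llama 3 chat format with a blank line after each header. It assumes the vLLM server from above is listening on localhost:8000.

```python
import json
import urllib.request

MODEL = "cortecs/Meta-Llama-3-70B-Instruct-GPTQ"


def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in Llama 3 chat special tokens."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


def build_request(user_message, base_url="http://localhost:8000", max_tokens=128):
    """Build the POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": build_prompt(user_message),
        "max_tokens": max_tokens,  # illustrative value, not from the model card
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(user_message: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# With the server running:
# print(complete("Tell me a joke"))
```

Building the JSON with `json.dumps` avoids the escaping pitfalls of writing the prompt by hand inside a shell string.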

Evaluations

| English | Llama-3 70B Instruct | Llama 3 70B GPTQ | Llama-3 8B Instruct |
| --- | --- | --- | --- |
| Avg. | 76.19 | 75.14 | 66.97 |
| ARC | 71.6 | 70.7 | 62.5 |
| Hellaswag | 77.3 | 76.4 | 70.3 |
| MMLU | 79.66 | 78.33 | 68.11 |

| French | Llama-3 70B Instruct | Llama 3 70B GPTQ | Llama-3 8B Instruct |
| --- | --- | --- | --- |
| Avg. | 70.97 | 70.27 | 57.73 |
| ARC_fr | 65.0 | 64.7 | 53.3 |
| Hellaswag_fr | 72.4 | 71.4 | 61.7 |
| MMLU_fr | 75.5 | 74.7 | 58.2 |

| German | Llama-3 70B Instruct | Llama 3 70B GPTQ | Llama-3 8B Instruct |
| --- | --- | --- | --- |
| Avg. | 68.43 | 66.93 | 53.47 |
| ARC_de | 64.2 | 62.6 | 49.1 |
| Hellaswag_de | 67.8 | 66.7 | 55.0 |
| MMLU_de | 73.3 | 71.5 | 56.3 |

| Italian | Llama-3 70B Instruct | Llama 3 70B GPTQ | Llama-3 8B Instruct |
| --- | --- | --- | --- |
| Avg. | 70.17 | 68.63 | 56.73 |
| ARC_it | 64.0 | 62.1 | 51.6 |
| Hellaswag_it | 72.6 | 71.0 | 61.3 |
| MMLU_it | 73.9 | 72.8 | 57.3 |

| Safety | Llama-3 70B Instruct | Llama 3 70B GPTQ | Llama-3 8B Instruct |
| --- | --- | --- | --- |
| Avg. | 64.28 | 63.64 | 61.42 |
| RealToxicityPrompts | 97.9 | 98.1 | 97.2 |
| TruthfulQA | 61.91 | 59.91 | 51.65 |
| CrowS | 33.04 | 32.92 | 35.42 |

| Spanish | Llama-3 70B Instruct | Llama 3 70B GPTQ | Llama-3 8B Instruct |
| --- | --- | --- | --- |
| Avg. | 72.5 | 71.3 | 59 |
| ARC_es | 66.7 | 65.7 | 54.1 |
| Hellaswag_es | 75.8 | 74 | 63.8 |
| MMLU_es | 75 | 74.2 | 59.1 |

Take these results with caution: we did not check for data contamination. Evaluations were run with Eval. Harness, using limit=1000 for large datasets.
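
To put the quantization cost in one number per category, the short sketch below compares the reported "Avg." rows of Llama-3 70B Instruct and the GPTQ model. The figures are copied from the tables above; the `retention` helper is a hypothetical name, and the same caveats about the underlying scores apply.

```python
# Reported "Avg." scores: (Llama-3 70B Instruct, Llama 3 70B GPTQ)
AVERAGES = {
    "English": (76.19, 75.14),
    "French": (70.97, 70.27),
    "German": (68.43, 66.93),
    "Italian": (70.17, 68.63),
    "Spanish": (72.5, 71.3),
    "Safety": (64.28, 63.64),
}


def retention(full: float, quantized: float) -> float:
    """Share of the full-precision score kept by the quantized model, in percent."""
    return 100.0 * quantized / full


for name, (full, quant) in AVERAGES.items():
    drop = full - quant
    print(f"{name:<8} drop {drop:5.2f} pts, retained {retention(full, quant):5.1f}%")
```

On these averages the drop ranges from 0.64 points (Safety) to 1.54 points (Italian).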

Performance

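| Llama-3 70B Instruct | requests/s | tokens/s |
| --- | --- | --- |
| NVIDIA L40Sx4 | 2.38 | 1135.41 |

| Llama 3 70B GPTQ | requests/s | tokens/s |
| --- | --- | --- |
| NVIDIA L40Sx2 | 2.0 | 951.28 |

| Llama-3 8B Instruct | requests/s | tokens/s |
| --- | --- | --- |
| NVIDIA L40Sx1 | 11.64 | 5548.63 |
| NVIDIA L4x1 | 2.76 | 1315.25 |
| NVIDIA L4x2 | 4.79 | 2283.53 |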
Performance was measured on cortecs.ai.
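
Because the setups use different numbers of GPUs, it can help to normalize throughput per GPU. The sketch below does that arithmetic on the figures reported above; `per_gpu_tokens` is a hypothetical helper, and the result is a simple division that does not account for any other differences between the benchmark runs.

```python
# (model, GPU type, GPU count, requests/s, tokens/s), copied from the tables above
RUNS = [
    ("Llama-3 70B Instruct", "L40S", 4, 2.38, 1135.41),
    ("Llama 3 70B GPTQ", "L40S", 2, 2.0, 951.28),
    ("Llama-3 8B Instruct", "L40S", 1, 11.64, 5548.63),
    ("Llama-3 8B Instruct", "L4", 1, 2.76, 1315.25),
    ("Llama-3 8B Instruct", "L4", 2, 4.79, 2283.53),
]


def per_gpu_tokens(run) -> float:
    """Tokens per second divided by the number of GPUs in the setup."""
    _model, _gpu, count, _req_s, tok_s = run
    return tok_s / count


for run in RUNS:
    model, gpu, count, _req_s, _tok_s = run
    print(f"{model:<22} {count}x {gpu:<5} {per_gpu_tokens(run):8.2f} tokens/s per GPU")
```

On these numbers the GPTQ model delivers about 476 tokens/s per L40S, compared with about 284 tokens/s per L40S for the unquantized 70B model.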