metadata
datasets: wikitext
license: other
license_link: https://llama.meta.com/llama3/license/

This is a quantized version of Meta-Llama-3-70B-Instruct, produced with GPTQ (developed by IST Austria) using the following configuration:

  • 8-bit
  • Act order: True
  • Group size: 128
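
As a rough illustration of what these settings mean for memory, the sketch below estimates the size of the quantized weights. It assumes a nominal 70e9 quantized parameters and one fp16 scale plus one 8-bit zero point per group of 128 weights; none of these storage details are stated in this card, and unquantized layers, activations, and the KV cache are ignored.

```python
# Back-of-the-envelope weight-memory estimate for the GPTQ settings above.
# ASSUMPTIONS (not from the model card): ~70e9 quantized parameters,
# one 16-bit scale and one 8-bit zero point stored per group.

N_PARAMS = 70e9      # nominal parameter count (assumption)
BITS = 8             # from the configuration list
GROUP_SIZE = 128     # from the configuration list


def effective_bits(bits: int, group_size: int,
                   scale_bits: int = 16, zero_bits: int = 8) -> float:
    """Bits per weight, including per-group scale and zero-point overhead."""
    return bits + (scale_bits + zero_bits) / group_size


def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Convert a parameter count and bits-per-weight into GiB."""
    return n_params * bits_per_weight / 8 / 2**30


quantized_gib = weight_gib(N_PARAMS, effective_bits(BITS, GROUP_SIZE))
fp16_gib = weight_gib(N_PARAMS, 16)

print(f"8-bit GPTQ weights: ~{quantized_gib:.1f} GiB")
print(f"fp16 weights:       ~{fp16_gib:.1f} GiB")
```

Under these assumptions the per-group metadata adds well under one bit per weight, so the quantized weights take roughly half the space of fp16 weights.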

Usage

Install vLLM and run the server:

python -m vllm.entrypoints.openai.api_server --model cortecs/Meta-Llama-3-70B-Instruct-GPTQ-8b

Access the model:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/Meta-Llama-3-70B-Instruct-GPTQ-8b",
        "prompt": "San Francisco is a"
    }'
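
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is reachable at localhost:8000 and that the response follows the OpenAI completions format (`choices[0].text`). The helper names `build_request` and `complete` and the `max_tokens` value are illustrative choices, not part of this model card.

```python
import json
import urllib.request

MODEL = "cortecs/Meta-Llama-3-70B-Instruct-GPTQ-8b"
BASE_URL = "http://localhost:8000/v1"  # assumes the default host and port


def build_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build the POST request that mirrors the curl call above."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        BASE_URL + "/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style completions response.
    return body["choices"][0]["text"]


# Example (requires the vLLM server to be running):
# print(complete("San Francisco is a"))
```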

Evaluations

| English | Meta-Llama-3-70B-Instruct | Meta-Llama-3-70B-Instruct-GPTQ-8b | Meta-Llama-3-70B-Instruct-GPTQ |
|---|---|---|---|
| Avg. | 76.19 | 76.16 | 75.14 |
| ARC | 71.6 | 71.4 | 70.7 |
| Hellaswag | 77.3 | 77.1 | 76.4 |
| MMLU | 79.66 | 79.98 | 78.33 |

| French | Meta-Llama-3-70B-Instruct | Meta-Llama-3-70B-Instruct-GPTQ-8b | Meta-Llama-3-70B-Instruct-GPTQ |
|---|---|---|---|
| Avg. | 70.97 | 71.03 | 70.27 |
| ARC_fr | 65.0 | 65.3 | 64.7 |
| Hellaswag_fr | 72.4 | 72.4 | 71.4 |
| MMLU_fr | 75.5 | 75.4 | 74.7 |

| German | Meta-Llama-3-70B-Instruct | Meta-Llama-3-70B-Instruct-GPTQ-8b | Meta-Llama-3-70B-Instruct-GPTQ |
|---|---|---|---|
| Avg. | 68.43 | 68.37 | 66.93 |
| ARC_de | 64.2 | 64.3 | 62.6 |
| Hellaswag_de | 67.8 | 67.7 | 66.7 |
| MMLU_de | 73.3 | 73.1 | 71.5 |

| Italian | Meta-Llama-3-70B-Instruct | Meta-Llama-3-70B-Instruct-GPTQ-8b | Meta-Llama-3-70B-Instruct-GPTQ |
|---|---|---|---|
| Avg. | 70.17 | 70.43 | 68.63 |
| ARC_it | 64.0 | 64.3 | 62.1 |
| Hellaswag_it | 72.6 | 72.4 | 71.0 |
| MMLU_it | 73.9 | 74.6 | 72.8 |

| Safety | Meta-Llama-3-70B-Instruct | Meta-Llama-3-70B-Instruct-GPTQ-8b | Meta-Llama-3-70B-Instruct-GPTQ |
|---|---|---|---|
| Avg. | 64.28 | 64.17 | 63.64 |
| RealToxicityPrompts | 97.9 | 97.8 | 98.1 |
| TruthfulQA | 61.91 | 61.67 | 59.91 |
| CrowS | 33.04 | 33.04 | 32.92 |

| Spanish | Meta-Llama-3-70B-Instruct | Meta-Llama-3-70B-Instruct-GPTQ-8b | Meta-Llama-3-70B-Instruct-GPTQ |
|---|---|---|---|
| Avg. | 72.5 | 72.7 | 71.3 |
| ARC_es | 66.7 | 66.9 | 65.7 |
| Hellaswag_es | 75.8 | 75.9 | 74 |
| MMLU_es | 75 | 75.3 | 74.2 |

We did not check for data contamination. All evaluations were run with the Eval. Harness using limit=1000.
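
The Avg. rows appear to be the unweighted mean of the three task scores in each table. The sketch below recomputes them for the English table and also reports how much of the base model's average each quantized variant retains; the `average` and `retained` helpers are illustrative names, and the retention ratio is a derived figure rather than one reported in this card.

```python
from statistics import mean

# English task scores copied from the evaluation table.
english = {
    "Meta-Llama-3-70B-Instruct": {"ARC": 71.6, "Hellaswag": 77.3, "MMLU": 79.66},
    "Meta-Llama-3-70B-Instruct-GPTQ-8b": {"ARC": 71.4, "Hellaswag": 77.1, "MMLU": 79.98},
    "Meta-Llama-3-70B-Instruct-GPTQ": {"ARC": 70.7, "Hellaswag": 76.4, "MMLU": 78.33},
}


def average(scores: dict) -> float:
    """Unweighted mean of the task scores, rounded like the table."""
    return round(mean(scores.values()), 2)


def retained(quantized: dict, base: dict) -> float:
    """Quantized average as a fraction of the base model's average."""
    return average(quantized) / average(base)


base = english["Meta-Llama-3-70B-Instruct"]
for name, scores in english.items():
    print(f"{name}: avg={average(scores)}, retained={retained(scores, base):.4f}")
```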

Performance

| Hardware | requests/s | tokens/s |
|---|---|---|
| NVIDIA L4x4 | 0.27 | 128.81 |
| NVIDIA L4x8 | 1.31 | 624.61 |
Performance measured on cortecs inference.
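
Two derived figures can help when reading these numbers: the throughput ratio between the two setups and the implied tokens per request. The sketch below computes both from the table; it assumes requests/s and tokens/s were measured over the same run, which this card does not state explicitly.

```python
# (requests/s, tokens/s) copied from the performance table.
runs = {
    "NVIDIA L4x4": (0.27, 128.81),
    "NVIDIA L4x8": (1.31, 624.61),
}

# Implied tokens per request, assuming both rates come from the same run.
tokens_per_request = {
    name: tokens_s / requests_s for name, (requests_s, tokens_s) in runs.items()
}

# Token throughput of the 8-GPU setup relative to the 4-GPU setup.
speedup = runs["NVIDIA L4x8"][1] / runs["NVIDIA L4x4"][1]

for name, value in tokens_per_request.items():
    print(f"{name}: ~{value:.0f} tokens per request")
print(f"L4x8 vs L4x4 token throughput: {speedup:.2f}x")
```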