base_model: utter-project/EuroLLM-9B-Instruct

This is an FP8-Dynamic quantization of EuroLLM-9B-Instruct.

The EuroLLM project aims to create a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages. EuroLLM-9B is a 9B parameter model trained on 4 trillion tokens divided across the considered languages and several data sources: web data, parallel data (en-xx and xx-en), and high-quality datasets. EuroLLM-9B-Instruct was further instruction tuned on EuroBlocks, an instruction tuning dataset with a focus on general instruction-following and machine translation.

Evaluations

This quantized model recovers 99.61% of the accuracy of the original EuroLLM-9B-Instruct on the benchmarks below.

English	EuroLLM-9B-Instruct	EuroLLM-9B-Instruct-FP8-Dynamic (this)
Avg.	66.35	65.35
Arc	63.3	61.7
Hellaswag	69.4	69.0

French	EuroLLM-9B-Instruct	EuroLLM-9B-Instruct-FP8-Dynamic (this)
Avg.	61.67	61.3
Arc	58.1	57.3
Hellaswag	70.2	70.3
MMLU	56.7	56.3

German	EuroLLM-9B-Instruct	EuroLLM-9B-Instruct-FP8-Dynamic (this)
Avg.	60.0	60.37
Arc	57.2	56.7
Hellaswag	66.3	67.1
MMLU	56.5	57.3

Italian	EuroLLM-9B-Instruct	EuroLLM-9B-Instruct-FP8-Dynamic (this)
Avg.	61.8	61.7
Arc	58.3	58.2
Hellaswag	69.9	69.4
MMLU	57.2	57.5

Portuguese	EuroLLM-9B-Instruct	EuroLLM-9B-Instruct-FP8-Dynamic (this)
Avg.	61.47	61.37
Arc	59.1	59.3
Hellaswag	70.3	70.2
MMLU	55.0	54.6

Spanish	EuroLLM-9B-Instruct	EuroLLM-9B-Instruct-FP8-Dynamic (this)
Avg.	62.03	61.53
Arc	59.7	59.3
Hellaswag	71.4	71.0
MMLU	55.0	54.3

We did not check for data contamination. Evaluation was done using the LM Evaluation Harness with limit=1000, i.e. on at most 1000 examples per task.
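
The card does not state how the accuracy recovery figure is derived. As a minimal sketch, assuming it is the sum of all quantized task scores divided by the sum of all baseline task scores, the numbers in the tables above reproduce it. The names `SCORES`, `accuracy_recovery` and `language_average` are illustrative and not part of any library.

```python
# Scores copied from the tables above as (baseline, quantized) pairs.
# Assumption: recovery = sum(quantized) / sum(baseline) over every task score.
SCORES = {
    "English": {"Arc": (63.3, 61.7), "Hellaswag": (69.4, 69.0)},
    "French": {"Arc": (58.1, 57.3), "Hellaswag": (70.2, 70.3), "MMLU": (56.7, 56.3)},
    "German": {"Arc": (57.2, 56.7), "Hellaswag": (66.3, 67.1), "MMLU": (56.5, 57.3)},
    "Italian": {"Arc": (58.3, 58.2), "Hellaswag": (69.9, 69.4), "MMLU": (57.2, 57.5)},
    "Portuguese": {"Arc": (59.1, 59.3), "Hellaswag": (70.3, 70.2), "MMLU": (55.0, 54.6)},
    "Spanish": {"Arc": (59.7, 59.3), "Hellaswag": (71.4, 71.0), "MMLU": (55.0, 54.3)},
}


def accuracy_recovery(scores):
    """Return the quantized total as a percentage of the baseline total."""
    baseline = sum(b for tasks in scores.values() for b, _ in tasks.values())
    quantized = sum(q for tasks in scores.values() for _, q in tasks.values())
    return 100.0 * quantized / baseline


def language_average(scores, language):
    """Return (baseline, quantized) mean scores for one language (the Avg. row)."""
    pairs = list(scores[language].values())
    baseline = sum(b for b, _ in pairs) / len(pairs)
    quantized = sum(q for _, q in pairs) / len(pairs)
    return baseline, quantized


if __name__ == "__main__":
    print(f"{accuracy_recovery(SCORES):.2f}%")  # prints 99.61%
```

Under this assumption the totals are 1049.5 (quantized) against 1053.6 (baseline), which gives 99.61%. Averaging the per-language Avg. rows instead gives a slightly lower value, so the per-task sum seems the more likely definition.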

Usage

Install vLLM and run the server:

python -m vllm.entrypoints.openai.api_server --model cortecs/EuroLLM-9B-Instruct-FP8-Dynamic --gpu-memory-utilization 0.95

Access the model:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/EuroLLM-9B-Instruct-FP8-Dynamic",
        "prompt": "San Francisco is a"
    }'
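
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from the previous step is listening on `localhost:8000`. The helper names `build_request` and `complete` are illustrative, and `max_tokens` is an optional OpenAI-style completion parameter that does not appear in the curl example.

```python
import json
import urllib.request

# Server address and model name taken from the curl example above.
BASE_URL = "http://localhost:8000/v1"
MODEL = "cortecs/EuroLLM-9B-Instruct-FP8-Dynamic"


def build_request(prompt, max_tokens=64):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    # max_tokens is an assumption added here; the curl example omits it.
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64):
    """Send the prompt to the running server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as response:
        body = json.load(response)
    # OpenAI-style completion responses carry the text in choices[0].text.
    return body["choices"][0]["text"]
```

With the server running, `print(complete("San Francisco is a"))` should print the model's continuation of the prompt.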

⚡ This model is optimized to handle heavy workloads, providing a total throughput of 1976 tokens per second on a single NVIDIA L4 ⚡