Spaces:
Running
A newer version of the Gradio SDK is available:
5.6.0
LLM Text Generation (Chat)
This benchmark suite benchmarks vLLM and TGI with the chat completion task with various models.
Setup
Docker images
You can pull vLLM and TGI Docker images with:
docker pull mlenergy/vllm:v0.4.2-openai
docker pull mlenergy/tgi:v2.0.2
Installing Benchmark Script Dependencies
pip install -r requirements.txt
Starting the NVML container
Changing the power limit requires the SYS_ADMIN
Linux security capability, which we delegate to a daemon Docker container running a base CUDA image.
bash ../../common/start_nvml_container.sh
With the nvml
container running, you can change power limit with something like docker exec nvml nvidia-smi -i 0 -pl 200
.
HuggingFace cache directory
The scripts assume the HuggingFace cache directory will be under /data/leaderboard/hfcache
on the node that runs this benchmark.
Benchmarking
Obtaining one datapoint
Export your HuggingFace hub token as environment variable $HF_TOKEN
.
The script scripts/benchmark_one_datapoint.py
assumes that it was run from the directory where scripts
is, like this:
python scripts/benchmark_one_datapoint.py --help
Obtaining all datapoints for a single model
Run scripts/benchmark_one_model.py
.
Running the entire suite with Pegasus
You can use pegasus
to run the entire benchmark suite.
Queue and host files are in ./pegasus
.