LLM Text Generation (Chat)

This suite benchmarks vLLM and TGI on the chat completion task across various models.

Setup

Docker images

You can pull vLLM and TGI Docker images with:

docker pull mlenergy/vllm:v0.4.2-openai
docker pull mlenergy/tgi:v2.0.2

Installing benchmark script dependencies

pip install -r requirements.txt

Starting the NVML container

Changing the power limit requires the SYS_ADMIN Linux security capability, which we delegate to a daemon Docker container running a base CUDA image.

bash ../../common/start_nvml_container.sh

With the nvml container running, you can change a GPU's power limit through it. For example, docker exec nvml nvidia-smi -i 0 -pl 200 sets the power limit of GPU 0 to 200 W.
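To apply several power limits from a script, the docker exec command above can be wrapped in a small helper. The following is a minimal sketch, assuming the container is named nvml as above; the function names are hypothetical and not part of the benchmark scripts.

```python
from __future__ import annotations

import subprocess


def power_limit_command(gpu_index: int, watts: int, container: str = "nvml") -> list[str]:
    """Build the `docker exec` command that sets one GPU's power limit."""
    if watts <= 0:
        raise ValueError("power limit must be a positive number of watts")
    return [
        "docker", "exec", container,
        "nvidia-smi", "-i", str(gpu_index), "-pl", str(watts),
    ]


def set_power_limit(gpu_index: int, watts: int, container: str = "nvml") -> None:
    """Run the command; raises CalledProcessError if the command fails."""
    subprocess.run(power_limit_command(gpu_index, watts, container), check=True)


if __name__ == "__main__":
    # Only prints the command; call set_power_limit() to actually apply it.
    print(" ".join(power_limit_command(0, 200)))
```

Keeping command construction separate from execution makes it possible to inspect the exact command before touching the GPU.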

HuggingFace cache directory

The scripts assume the HuggingFace cache directory will be under /data/leaderboard/hfcache on the node that runs this benchmark.
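Before starting a long run, it may help to confirm that this directory is in place. The check below is a rough sketch, not part of the benchmark scripts: the function name is hypothetical, and it assumes the current user needs read and write access to the cache, which may not hold if the containers run as a different user.

```python
from __future__ import annotations

import os
from pathlib import Path

# Path taken from the text above.
HF_CACHE_DIR = Path("/data/leaderboard/hfcache")


def cache_dir_problem(path: Path = HF_CACHE_DIR) -> str | None:
    """Return a description of what is wrong with the cache directory, or None."""
    if not path.is_dir():
        return f"{path} does not exist or is not a directory"
    # Assumption: the current user must be able to read and write the cache.
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        return f"{path} is not readable and writable by the current user"
    return None


if __name__ == "__main__":
    problem = cache_dir_problem()
    print(problem if problem else f"{HF_CACHE_DIR} looks usable")
```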

Benchmarking

Obtaining one datapoint

Export your HuggingFace hub token as the environment variable HF_TOKEN.
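A missing token is easier to diagnose before a benchmark starts than partway through a model download. As an illustration only, a wrapper script could fail fast like this; the helper name is hypothetical and the benchmark scripts may handle the token differently.

```python
from __future__ import annotations

import os
from typing import Mapping


def hf_token_from_env(env: Mapping[str, str] | None = None) -> str:
    """Return the token stored in HF_TOKEN, or raise if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export your HuggingFace hub token first")
    return token
```

Passing the environment as an argument keeps the check easy to exercise without modifying the real process environment.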

The script scripts/benchmark_one_datapoint.py assumes that it is run from the directory that contains scripts, like this:

python scripts/benchmark_one_datapoint.py --help
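If the script is launched from another program, the working-directory assumption is easy to violate. The sketch below guards against that by checking for the script before building the --help invocation shown above; the helper names are hypothetical.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path

# Relative path of the script, as given in the text above.
SCRIPT = Path("scripts") / "benchmark_one_datapoint.py"


def help_command(workdir: Path) -> list[str]:
    """Return the --help invocation, or raise if workdir lacks the script."""
    if not (workdir / SCRIPT).is_file():
        raise FileNotFoundError(
            f"{SCRIPT} not found under {workdir}; "
            "change to the directory that contains scripts"
        )
    return [sys.executable, str(SCRIPT), "--help"]


def show_help(workdir: Path) -> None:
    """Run the --help invocation with workdir as the working directory."""
    subprocess.run(help_command(workdir), cwd=workdir, check=True)
```

Calling show_help(Path.cwd()) from the directory that contains scripts is equivalent to the command above.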

Obtaining all datapoints for a single model

Run scripts/benchmark_one_model.py.

Running the entire suite with Pegasus

You can use Pegasus to run the entire benchmark suite. The queue and host files are in ./pegasus.