docs/about.md · ml-energy/leaderboard at a57f7ecf6e9383d9520603a6da327084aac135f1

The goal of the ML.ENERGY Leaderboard is to give people a sense of how much energy LLMs would consume, and the complex tradeoffs between energy, system performance, and user experience.

The code for the leaderboard, backing data, and scripts for benchmarking are all open-source in our repository. We'll see you at the Discussion board, where you can ask questions, suggest improvement ideas, or just discuss leaderboard results!

LLM Text Generation Benchmark

This category includes LLM Chat, LLM Code, and VLM Visual Chat.

Software

CUDA 12.4
vLLM 0.5.4 -- For inference serving
Zeus -- For GPU time and energy measurement

Hardware

NVIDIA A100-SXM4-40GB GPU (AWS p4d.24xlarge)
NVIDIA H100 80GB HBM3 GPU (AWS p5.48xlarge)

Data

Task	Dataset
LLM Chat	500 human prompts from ShareGPT
LLM Code	HumanEval+ from EvalPlus
VLM Visual Chat	500 image and prompt pairs from the LLaVA instruction dataset

Benchmarking process

We are interested in measuring the steady-state of the online serving system, excluding ramp up periods (when the server is gradually loaded with requests) and cooldown (when the server is draining the queue) periods. Therefore, we submit all the requests at the beginning of our benchmark while limiting the serving system's maximum batch size to create a steady-state serving system. The steady-state finishes when the serving system's queue length reaches zero, and we collect the timing and energy consumption of each batch during the steady-state to derive our metrics. The maximum batch size configuration is varied in order to change the system's utilization. Beyond a certain maximum batch size, the actual batch size does not increase due to memory constraints, which means the system will be overloaded, and we stop increasing the maximum batch size.

Diffusion Benchmark

This category includes Diffusion text to image, text to video, and image to video.

Software

CUDA 12.4
HuggingFace Diffusers 0.29.2 -- For inference
Zeus -- For GPU time and energy measurement

Hardware

NVIDIA A100-SXM4-40GB GPU (AWS p4d.24xlarge)
NVIDIA H100 80GB HBM3 GPU (AWS p5.48xlarge)

Data

Task	Dataset
Text to image	Prompts from PartiPrompts
Text to video	Captions from ShareGPT4Video
Image to video	Caption and first frame pairs from ShareGPT4Video

Benchmarking process

Since Diffusion model computations are more or less the same regardless of the input, we sample batches from each dataset and run them back-to-back to obtain stable measurements. The batch size is increased (in powers of two) until the GPU runs out of memory.

The ML.ENERGY Initiative

Are you interested in learning more about our ML energy measurement & optimization works? Meet us at the ML.ENERGY Initiative homepage!

Acknowledgements

Any opinions, findings, and conclusions of our works are those of the author(s) and do not necessarily represent the official policy of any of these funding organizations.

We thank Mozilla for funding us via the 2024 Mozilla Technology Fund.

License

This leaderboard is a research preview intended for non-commercial use only. Model weights were taken as is from the Hugging Face Hub if available and are subject to their licenses. Please direct inquiries/reports of potential violation to Jae-Won Chung.

Contact

Please direct general questions and issues related to the leaderboard to our GitHub repository's discussion board. You can find the ML.ENERGY initiative members in our homepage. If you need direct communication, please email admins@ml.energy.