FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).
https://github.com/FMInference/FlexGen
Installation
No additional installation steps are necessary. FlexGen is already listed in the requirements.txt file for this project.
Converting a model
FlexGen only works with OPT models, and the model must be converted to numpy format before starting the web UI:
python convert-to-flexgen.py models/opt-1.3b/
The output will be saved to models/opt-1.3b-np/.
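To check that the converted folder exists before launching the web UI, the naming pattern above can be sketched in Python. This is only an illustration: `flexgen_output_dir` is a hypothetical helper, not part of the project, and it assumes the `-np` suffix rule shown for opt-1.3b applies to other model folders as well.

```python
from pathlib import Path


def flexgen_output_dir(model_dir: str) -> Path:
    """Return the folder name expected after conversion.

    Assumption: the converter appends "-np" to the model folder name,
    as in the models/opt-1.3b/ -> models/opt-1.3b-np/ example.
    """
    src = Path(model_dir)  # Path drops the trailing slash
    return src.with_name(src.name + "-np")


if __name__ == "__main__":
    out = flexgen_output_dir("models/opt-1.3b/")
    status = "found" if out.is_dir() else "missing, run convert-to-flexgen.py first"
    print(f"{out.as_posix()}: {status}")
```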
Usage
The basic command is the following:
python server.py --model opt-1.3b --loader flexgen
For large models, the RAM usage may be too high and your computer may freeze. If that happens, you can try this:
python server.py --model opt-1.3b --loader flexgen --compress-weight
With this second command, I was able to run both OPT-6.7B and OPT-13B with 2GB of VRAM, and the speed was good in both cases.
You can also manually set the offload strategy with
python server.py --model opt-1.3b --loader flexgen --percent 0 100 100 0 100 0
where the six numbers after --percent are, in order:
- the percentage of weights on GPU
- the percentage of weights on CPU
- the percentage of attention cache on GPU
- the percentage of attention cache on CPU
- the percentage of activations on GPU
- the percentage of activations on CPU
You should typically only change the first two numbers. If their sum is less than 100, the remaining layers will be offloaded to the disk, by default into the text-generation-webui/cache folder.
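The arithmetic behind the six numbers can be sketched as follows. This is a minimal illustration, not the code used by FlexGen or the web UI: `OffloadPolicy` and `parse_percent` are hypothetical names, and the validation rules (integers between 0 and 100, each GPU/CPU pair summing to at most 100) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffloadPolicy:
    """Hypothetical container for the six --percent values, in order."""
    weights_gpu: int
    weights_cpu: int
    cache_gpu: int
    cache_cpu: int
    activations_gpu: int
    activations_cpu: int

    @property
    def weights_disk(self) -> int:
        # Whatever is not placed on GPU or CPU is offloaded to disk.
        return 100 - self.weights_gpu - self.weights_cpu


def parse_percent(values):
    """Parse six percentage strings into an OffloadPolicy (illustrative only)."""
    if len(values) != 6:
        raise ValueError("--percent expects exactly six numbers")
    nums = [int(v) for v in values]
    if any(n < 0 or n > 100 for n in nums):
        raise ValueError("each percentage must be between 0 and 100")
    # Values come in (GPU, CPU) pairs; a pair cannot exceed 100 in total.
    for gpu, cpu in zip(nums[0::2], nums[1::2]):
        if gpu + cpu > 100:
            raise ValueError("a GPU/CPU pair cannot sum to more than 100")
    return OffloadPolicy(*nums)


policy = parse_percent("20 50 100 0 100 0".split())
print(policy.weights_disk)  # 30: the share of weights that would go to disk
```

With the command shown earlier (`--percent 0 100 100 0 100 0`), the same calculation gives 0, so nothing is written to disk.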
Performance
In my experiments with OPT-30B using an RTX 3090 on Linux, I obtained these results:
- --loader flexgen --compress-weight --percent 0 100 100 0 100 0: 0.99 seconds per token.
- --loader flexgen --compress-weight --percent 100 0 100 0 100 0: 0.765 seconds per token.
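For comparison with tools that report throughput, the two timings can be converted to tokens per second. The snippet below is a small sketch of that arithmetic; `tokens_per_second` is a helper defined here for illustration, and the only inputs are the two measurements listed above.

```python
def tokens_per_second(seconds_per_token: float) -> float:
    """Convert a per-token latency into throughput."""
    if seconds_per_token <= 0:
        raise ValueError("seconds_per_token must be positive")
    return 1.0 / seconds_per_token


# Measurements from the list above, keyed by their --percent setting.
results = {
    "--percent 0 100 100 0 100 0": 0.99,
    "--percent 100 0 100 0 100 0": 0.765,
}

for flags, spt in results.items():
    print(f"{flags}: {tokens_per_second(spt):.2f} tokens/s")
# --percent 0 100 100 0 100 0: 1.01 tokens/s
# --percent 100 0 100 0 100 0: 1.31 tokens/s
```

In other words, keeping all weights on the GPU was roughly 29% faster per token in this test (0.99 / 0.765 ≈ 1.29).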
Limitations
- Only works with the OPT models.
- Only two generation parameters are available: temperature and do_sample.