docs/FlexGen.md · miron258/Elite-text-gen-web-duplicate at main

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).

https://github.com/FMInference/FlexGen

Installation

No additional installation steps are necessary. FlexGen is in the requirements.txt file for this project.

Converting a model

FlexGen only works with the OPT model, and it needs to be converted to numpy format before starting the web UI:

python convert-to-flexgen.py models/opt-1.3b/

The output will be saved to models/opt-1.3b-np/.

Usage

The basic command is the following:

python server.py --model opt-1.3b  --loader flexgen

For large models, the RAM usage may be too high and your computer may freeze. If that happens, you can try this:

python server.py --model opt-1.3b  --loader flexgen --compress-weight

With this second command, I was able to run both OPT-6.7b and OPT-13B with 2GB VRAM, and the speed was good in both cases.

You can also manually set the offload strategy with

python server.py --model opt-1.3b  --loader flexgen --percent 0 100 100 0 100 0

where the six numbers after --percent are:

the percentage of weight on GPU
the percentage of weight on CPU
the percentage of attention cache on GPU
the percentage of attention cache on CPU
the percentage of activations on GPU
the percentage of activations on CPU

You should typically only change the first two numbers. If their sum is less than 100, the remaining layers will be offloaded to the disk, by default into the text-generation-webui/cache folder.

Performance

In my experiments with OPT-30B using a RTX 3090 on Linux, I have obtained these results:

--loader flexgen --compress-weight --percent 0 100 100 0 100 0: 0.99 seconds per token.
--loader flexgen --compress-weight --percent 100 0 100 0 100 0: 0.765 seconds per token.

Limitations

Only works with the OPT models.
Only two generation parameters are available: temperature and do_sample.