This is a custom INT8 version of the original BLOOM weights, prepared so that it loads quickly with the DeepSpeed-Inference engine, which uses Tensor Parallelism. In this repo the tensors are split into 8 shards to target 8 GPUs.
The full BLOOM documentation is here.
To use the weights in this repo, you can adapt the scripts found here to your needs (XXX: they are going to migrate soon to the HF Transformers code base, so the link will need to be updated once they have moved).
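To give a rough intuition for what splitting tensors into 8 shards means under Tensor Parallelism, here is a minimal pure-Python sketch of a column-wise split. It is an illustration only: the helper names (`column_shards`, `matvec`, `sharded_matvec`) and the toy matrix are hypothetical, and it does not reflect the actual checkpoint layout of this repo or how DeepSpeed-Inference partitions each layer.

```python
NUM_SHARDS = 8  # matches the 8-GPU target described above


def column_shards(weight, num_shards):
    """Split a matrix (list of rows) into num_shards equal column blocks."""
    cols = len(weight[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    width = cols // num_shards
    return [
        [row[i * width:(i + 1) * width] for row in weight]
        for i in range(num_shards)
    ]


def matvec(x, weight):
    """Compute x @ weight for a vector x and a matrix given as a list of rows."""
    return [
        sum(x[r] * weight[r][c] for r in range(len(weight)))
        for c in range(len(weight[0]))
    ]


def sharded_matvec(x, shards):
    """Multiply x by each shard separately and concatenate the partial outputs."""
    out = []
    for shard in shards:  # each iteration stands in for the work of one GPU
        out.extend(matvec(x, shard))
    return out


# Toy 4x16 integer matrix standing in for one layer's weights (hypothetical data).
weight = [[r * 16 + c for c in range(16)] for r in range(4)]
x = [1, -2, 3, 0]

shards = column_shards(weight, NUM_SHARDS)
full = matvec(x, weight)
sharded = sharded_matvec(x, shards)
print(sharded == full)
```

Each shard holds only one eighth of the columns, so in this simplified picture each device stores and multiplies a fraction of the layer, and the partial outputs are concatenated to recover the full result.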