How does sharding with 1 GPU work?
Previously I've seen the term 'sharding' used for model parallelism: your model weights won't fit on a single GPU, so you split (or shard) the model across multiple GPUs.
But in this context it looks like the point of sharding is to let you run a model whose weights won't fit into GPU memory on a single GPU. How does that work? Is each shard of model weights offloaded and reloaded for every forward/backward pass? I would have thought that would make execution time prohibitively slow.
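For reference, the kind of sharding I had in mind is a naive split of the layers across devices, something like this toy PyTorch sketch (a hypothetical model, nothing transformers-specific):

```python
import torch
import torch.nn as nn

# Toy sketch of sharding-as-model-parallelism: the weights are split across
# two GPUs and stay resident there; only the activations move between devices
# during the forward pass.
class TwoGpuModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations hop to the second GPU

model = TwoGpuModel()
out = model(torch.randn(8, 1024))
```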
This docs page only vaguely mentions this use case: https://huggingface.co/transformers/v4.10.1/parallelism.html
> In the modern machine learning the various approaches to parallelism are used to:
> - fit very large models onto limited hardware - e.g. t5-11b is 45GB in just model params
but doesn't go into any detail on this.
From the docs here: https://huggingface.co/docs/transformers/v4.19.2/en/main_classes/model#transformers.modeling_utils.load_sharded_checkpoint
it looks like I'd confused sharding of checkpoints (a host-memory optimization) with sharding of a model over GPUs.
> each checkpoint shard is loaded one by one in RAM and deleted after being loaded in the model.
The point of checkpoint sharding is to cap peak host (CPU) memory usage while loading: you may have less free host memory than GPU memory, and loading shard by shard means you never need to hold the whole checkpoint in host RAM at once. It doesn't help if the total size of all shards together is greater than GPU memory, because the fully assembled model still has to fit on the GPU.
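For concreteness, here's a minimal sketch of checkpoint sharding (the model name and shard size are just examples; the comments reflect my reading of the docs linked above):

```python
from transformers import AutoModel
from transformers.modeling_utils import load_sharded_checkpoint

# Save a model as several shard files instead of one big pytorch_model.bin.
model = AutoModel.from_pretrained("bert-base-cased")
model.save_pretrained("sharded-bert", max_shard_size="200MB")

# Load the sharded checkpoint back into a model instance: each shard is read
# into host RAM, copied into the model, and freed before the next shard is
# read, so peak host RAM is roughly (model size + one shard), not
# (model size + full checkpoint).
model = AutoModel.from_pretrained("bert-base-cased")  # fresh instance to load into
load_sharded_checkpoint(model, "sharded-bert")

# The assembled model still has to fit on the GPU in one piece.
model.to("cuda")
```

As far as I can tell, `AutoModel.from_pretrained("sharded-bert")` would do the same shard-by-shard loading automatically; `load_sharded_checkpoint` is just the explicit form documented at the link above.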