How does sharding with 1 GPU work?
Previously I've seen the term 'sharding' used for model parallelism: your model weights won't fit on a single GPU, so you split (or shard) the model across multiple GPUs.
But in this context it looks like the point of sharding is to let you run a model whose weights won't fit into GPU memory on a single GPU. How does that work? Is each shard of model weights offloaded and reloaded for every forward/backward pass? I would have thought that would make execution time prohibitively slow.
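For reference, the kind of sharding I had in mind is a naive split of the layers across devices, something like this toy PyTorch sketch (a hypothetical model, nothing transformers-specific):

```python
import torch
import torch.nn as nn

# Toy sketch of sharding-as-model-parallelism: the weights are split across
# two GPUs and stay resident there; only the activations move between devices
# during the forward pass.
class TwoGpuModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations hop to the second GPU

model = TwoGpuModel()
out = model(torch.randn(8, 1024))
```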
This docs page only vaguely mentions this use case: https://huggingface.co/transformers/v4.10.1/parallelism.html
> In the modern machine learning the various approaches to parallelism are used to:
> - fit very large models onto limited hardware - e.g. t5-11b is 45GB in just model params
but doesn't go into any detail on this.
From the docs here: https://huggingface.co/docs/transformers/v4.19.2/en/main_classes/model#transformers.modeling_utils.load_sharded_checkpoint
it looks like I'd confused sharding of checkpoints (a host-memory optimization) with sharding of a model over GPUs.
> each checkpoint shard is loaded one by one in RAM and deleted after being loaded in the model.
The point of checkpoint sharding is to cap peak host (CPU) memory usage while loading: you may have less free host memory than GPU memory, and loading shard by shard means you never need to hold the whole checkpoint in host RAM at once. It doesn't help if the total size of all shards together is greater than GPU memory, because the fully assembled model still has to fit on the GPU.
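For concreteness, here's a minimal sketch of checkpoint sharding (the model name and shard size are just examples; the comments reflect my reading of the docs linked above):

```python
from transformers import AutoModel
from transformers.modeling_utils import load_sharded_checkpoint

# Save a model as several shard files instead of one big pytorch_model.bin.
model = AutoModel.from_pretrained("bert-base-cased")
model.save_pretrained("sharded-bert", max_shard_size="200MB")

# Load the sharded checkpoint back into a model instance: each shard is read
# into host RAM, copied into the model, and freed before the next shard is
# read, so peak host RAM is roughly (model size + one shard), not
# (model size + full checkpoint).
model = AutoModel.from_pretrained("bert-base-cased")  # fresh instance to load into
load_sharded_checkpoint(model, "sharded-bert")

# The assembled model still has to fit on the GPU in one piece.
model.to("cuda")
```

As far as I can tell, `AutoModel.from_pretrained("sharded-bert")` would do the same shard-by-shard loading automatically; `load_sharded_checkpoint` is just the explicit form documented at the link above.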