add DeepSpeed as another solution for this huge model.
README.md (changed):

````diff
@@ -20,7 +20,8 @@ t5 = transformers.T5ForConditionalGeneration.from_pretrained('t5-11b', use_cdn =
 ```

 Secondly, a single GPU will most likely not have enough memory to even load the model into memory as the weights alone amount to over 40 GB.
-Model parallelism has to be used here to overcome this problem as is explained in this [PR](https://github.com/huggingface/transformers/pull/3578).
+- Model parallelism has to be used here to overcome this problem as is explained in this [PR](https://github.com/huggingface/transformers/pull/3578).
+- DeepSpeed's ZeRO-Offload is another approach as explained in this [post](https://github.com/huggingface/transformers/issues/9996).

 ## [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)

````
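For readers who want to see what the model-parallel route in the diff above can look like in practice, here is a minimal sketch using the naive `parallelize()` API that `transformers` ships for T5 (deprecated in recent releases). The four-GPU device map is an assumption for illustration only and is not taken from the linked PR:

```python
# Sketch: split t5-11b's transformer blocks across several GPUs with
# transformers' (now deprecated) naive model parallelism for T5.
# Assumes 4 GPUs and enough CPU RAM to materialize the ~45 GB of weights
# before they are moved to the devices.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-11b")
model = T5ForConditionalGeneration.from_pretrained("t5-11b")

# t5-11b has 24 blocks per stack; assign 6 blocks to each of the 4 GPUs.
device_map = {
    0: list(range(0, 6)),
    1: list(range(6, 12)),
    2: list(range(12, 18)),
    3: list(range(18, 24)),
}
model.parallelize(device_map)

# Inputs have to live on the first device in the map.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```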
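The DeepSpeed route can be sketched as a ZeRO configuration that offloads optimizer state to CPU RAM, handed to the `transformers` Trainer. The exact key names vary between DeepSpeed and `transformers` versions, and the script name in the comment is a placeholder, so treat this as an illustration rather than a drop-in config:

```python
# Sketch: a minimal DeepSpeed ZeRO-Offload configuration for use with the
# transformers Trainer. Older DeepSpeed releases used "cpu_offload": true
# instead of the "offload_optimizer" block shown here.
import json

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # keep optimizer state in CPU RAM
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    # "auto" placeholders are filled in from the Trainer's own arguments
    # in recent transformers versions.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Any Trainer-based script can then be launched through the DeepSpeed launcher
# and pointed at this file, roughly:
#   deepspeed your_trainer_script.py --deepspeed ds_config.json ...
```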