Cannot run SFT full finetuning.
So far I have tried SFT full finetuning with DeepSpeed ZeRO-3 on A100 80G GPUs, which hangs until the NCCL socket timeout after 30 minutes. When I tried LoRA with DeepSpeed ZeRO-2 it fails with OOM, and LoRA with DeepSpeed ZeRO-3 hangs too!
Only LoRA with 4-bit quantization trains successfully.
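For reference, the LoRA + 4-bit combination I use looks roughly like the sketch below (standard peft/bitsandbytes usage; the model name and LoRA hyperparameters are placeholders, not my exact settings):

```python
# Minimal QLoRA-style setup: 4-bit quantized base model plus LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",   # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```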
Same issue, any workaround?
Any update?
Try:

```python
from deepspeed.utils import set_z3_leaf_modules
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
```
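For context, a rough sketch of where that call sits in a training script (the model name is a placeholder, and this is just how I understand the ZeRO-3 leaf-module workaround, not an official recipe):

```python
from deepspeed.utils import set_z3_leaf_modules
from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Treat the whole sparse MoE block as a single ZeRO-3 "leaf", so its expert
# parameters are gathered together instead of expert-by-expert; as I understand
# it, that avoids ranks waiting on experts that were never routed to.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

# ...then hand `model` to the Trainer / deepspeed.initialize as usual.
```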
@A-Cepheus thanks for the reply. Inference now succeeds and a loss is returned, but training still hangs in the backward pass. Any clues?
Any update?
I believe DeepSpeed needs all expert weights to be involved during inference so that ZeRO-3 can correctly sync data between GPUs. If all 8 experts are enabled inside `config.json`, the problem goes away.
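If it helps, a sketch of what I mean (assuming "enabled" means routing every token to all 8 experts via `num_experts_per_tok`; note this changes routing behaviour and compute cost, so treat it as a debugging workaround rather than a proper fix):

```python
# Make every token use all 8 experts so each expert's weights participate in
# every forward/backward pass (equivalent to setting num_experts_per_tok: 8
# in config.json).
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
config.num_experts_per_tok = config.num_local_experts  # 2 -> 8

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    config=config,
)
```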
When I train a Mixtral model, it hangs after 270 steps with GPU utilization at 100% until the NCCL timeout.
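In case it helps with debugging, a sketch of how to get more signal before the 30-minute timeout kills the job (standard NCCL / torch.distributed knobs; only applicable if you initialize the process group yourself, e.g. when launching with torchrun):

```python
# Raise the collective timeout past the default 30 minutes and enable NCCL
# logging so the collective that hangs around step 270 shows up in the logs.
import datetime
import os

import torch.distributed as dist

os.environ["NCCL_DEBUG"] = "INFO"  # per-rank NCCL init/collective logging

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # default 30 min matches the observed hang
)
```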