Cannot run SFT full finetuning.
So far I have tried SFT full finetuning with DeepSpeed ZeRO-3 on A100 80G GPUs, which hangs until the NCCL socket timeout after 30 minutes. When I tried LoRA with DeepSpeed ZeRO-2 it fails with OOM, and LoRA with DeepSpeed ZeRO-3 hangs too!
Only LoRA with 4-bit quantization trains successfully.
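For reference, the LoRA + 4-bit combination I use looks roughly like the sketch below (standard peft/bitsandbytes usage; the model name and LoRA hyperparameters are placeholders, not my exact settings):

```python
# Minimal QLoRA-style setup: 4-bit quantized base model plus LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",   # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```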
Same issue, any workaround?
Any update?
Try:

```python
from deepspeed.utils import set_z3_leaf_modules
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
```
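For context, a rough sketch of where that call sits in a training script (the model name is a placeholder, and this is just how I understand the ZeRO-3 leaf-module workaround, not an official recipe):

```python
from deepspeed.utils import set_z3_leaf_modules
from transformers import AutoModelForCausalLM
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

# Treat the whole sparse MoE block as a single ZeRO-3 "leaf", so its expert
# parameters are gathered together instead of expert-by-expert; as I understand
# it, that avoids ranks waiting on experts that were never routed to.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])

# ...then hand `model` to the Trainer / deepspeed.initialize as usual.
```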
@A-Cepheus thanks for the reply. Inference now succeeds and a loss is returned, but training still hangs in the backward pass. Any clues?
Any update?
I believe DeepSpeed needs all expert weights to be involved during inference so that ZeRO-3 can correctly sync data between GPUs. If all 8 experts are enabled inside `config.json`, the problem goes away.
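If it helps, a sketch of what I mean (assuming "enabled" means routing every token to all 8 experts via `num_experts_per_tok`; note this changes routing behaviour and compute cost, so treat it as a debugging workaround rather than a proper fix):

```python
# Make every token use all 8 experts so each expert's weights participate in
# every forward/backward pass (equivalent to setting num_experts_per_tok: 8
# in config.json).
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
config.num_experts_per_tok = config.num_local_experts  # 2 -> 8

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    config=config,
)
```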
When I train a Mixtral model, it hangs after 270 steps with GPU utilization at 100% until the NCCL timeout.
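In case it helps with debugging, a sketch of how to get more signal before the 30-minute timeout kills the job (standard NCCL / torch.distributed knobs; only applicable if you initialize the process group yourself, e.g. when launching with torchrun):

```python
# Raise the collective timeout past the default 30 minutes and enable NCCL
# logging so the collective that hangs around step 270 shows up in the logs.
import datetime
import os

import torch.distributed as dist

os.environ["NCCL_DEBUG"] = "INFO"  # per-rank NCCL init/collective logging

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # default 30 min matches the observed hang
)
```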