Finetuned medium on single GPU but can't train on 2 GPUs. (RTX 3060)

#40
by andromeda01111 - opened

I fine-tuned medium on a single NVIDIA GeForce RTX 3060 GPU on Windows 11, using PyCharm. Training took almost 5 days. Dataset: 8000 audio files with 8000 transcripts; each audio clip is only 4 seconds long. Now I have 2 RTX 3060s, but I am getting a CUDA OOM error. I switched to Ubuntu 22.04 hoping multi-GPU training would be smoother, but I still get the CUDA OOM error.

Why is this happening?
How can I fix it?
I originally purchased the GPU to train large, but now not even medium works.
Please help.

pytorch==2.0.1
torchvision==0.15.2
torchaudio==2.0.2
pytorch-cuda=11.7

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 11.66 GiB total capacity; 11.16 GiB already allocated; 7.50 MiB free; 11.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
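
The hint about max_split_size_mb in the message only applies when reserved memory is much larger than allocated memory. Below is a minimal sketch that checks that gap using only the numbers quoted in the error above. The helper name `parse_oom` is hypothetical, and the regular expressions are written for this exact message wording, not for every PyTorch version.

```python
import re

# The error text quoted above, trimmed to the part containing the sizes.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 26.00 MiB "
    "(GPU 0; 11.66 GiB total capacity; 11.16 GiB already allocated; "
    "7.50 MiB free; 11.43 GiB reserved in total by PyTorch)"
)

_UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_oom(message):
    """Return the sizes quoted in the OOM message, converted to GiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        value, unit = match.groups()
        sizes[name] = float(value) * _UNIT_TO_GIB[unit]
    return sizes


sizes = parse_oom(OOM_MESSAGE)
# Memory held by the caching allocator but not used by live tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]
print(f"reserved - allocated = {cached_unused:.2f} GiB")
print(f"allocated / total    = {sizes['allocated'] / sizes['total']:.0%}")
```

For this message the gap is about 0.27 GiB, and roughly 96% of the card is held by live tensors. That suggests fragmentation tuning through PYTORCH_CUDA_ALLOC_CONF is unlikely to recover enough memory here; the working set itself probably has to shrink.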

training_args = Seq2SeqTrainingArguments(
    output_dir="dir_medium",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    # max_steps=4000,
    per_device_eval_batch_size=8,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=100,
    save_total_limit=2,
    eval_steps=100,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=False,
    optim="adamw_bnb_8bit",
)
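
One possible explanation for why these settings fit on one GPU but not on two: data-parallel training does not split the model, so each GPU still holds a full copy of the weights, gradients, and optimizer state, and PyTorch's DistributedDataParallel by default keeps separate communication buckets roughly the size of the gradients. The sketch below is a rough back-of-the-envelope estimate, not a measurement. It assumes that "medium" is Whisper medium with about 769M parameters, that fp16=True keeps fp32 weights and gradients, that 8-bit AdamW stores two 1-byte states per parameter, and that the run actually uses DistributedDataParallel (for example, launched with torchrun). Activations, the CUDA context, and evaluation with per_device_eval_batch_size=8 are not counted, so real usage is higher.

```python
GIB = 1024 ** 3
PARAMS_MEDIUM = 769_000_000  # assumed parameter count for Whisper medium


def static_memory_gib(num_params, ddp=False):
    """Rough per-GPU memory for weights, gradients, and optimizer state.

    Assumptions (not measured): fp32 weights and gradients (4 bytes each),
    two 8-bit optimizer states (1 byte each), and, under DDP with default
    settings, extra all-reduce buckets roughly equal to the gradient size.
    """
    weights = 4 * num_params
    grads = 4 * num_params
    optimizer = 2 * num_params
    ddp_buckets = 4 * num_params if ddp else 0
    return (weights + grads + optimizer + ddp_buckets) / GIB


single = static_memory_gib(PARAMS_MEDIUM)
multi = static_memory_gib(PARAMS_MEDIUM, ddp=True)
print(f"single GPU:   ~{single:.1f} GiB before activations")
print(f"DDP, per GPU: ~{multi:.1f} GiB before activations")
```

Under these assumptions the static footprint grows from roughly 7 GiB to roughly 10 GiB per GPU, which leaves little of an 11.66 GiB card for activations. If this is the cause, sharded approaches such as DeepSpeed ZeRO or FSDP, which split gradients and optimizer state across GPUs instead of replicating them, may be worth trying, as may a smaller per_device_eval_batch_size.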

Update:

I used PEFT to train both large v3 and medium. The training works, but in both cases the accuracy is actually worse than that of the medium model I trained on custom data.

Share your GitHub repo.

I have not uploaded the files to git yet. But you can ask me for specific details, and I will try to share whatever I can.
