[2023-07-25 19:38:06,582] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:09,856] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-07-25 19:38:09,908] [INFO] [runner.py:555:main] cmd = /usr/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=35109 --module --enable_each_rank_log=None safe_rlhf.finetune --train_datasets bt --model_name_or_path cerebras/btlm-3b-8k-base --max_length 8092 --trust_remote_code True --epochs 16 --per_device_train_batch_size 8 --per_device_eval_batch_size 2 --gradient_accumulation_steps 1 --gradient_checkpointing --learning_rate 4.7e-6 --lr_scheduler_type cosine --num_warmup_steps 20 --weight_decay 0.0 --seed 42 --output_dir /home/paperspace/safe-rlhf/output/sft --log_type wandb --log_project BT-Training --zero_stage 2 --bf16 True --tf32 True
[2023-07-25 19:38:11,623] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:14,670] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-07-25 19:38:14,670] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-07-25 19:38:14,670] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-07-25 19:38:14,670] [INFO] [launch.py:163:main] dist_world_size=8
[2023-07-25 19:38:14,670] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-07-25 19:38:16,490] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:16,534] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:16,565] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:16,576] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:16,717] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:16,760] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:16,822] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:16,918] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-25 19:38:20,027] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-25 19:38:20,027] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-25 19:38:20,034] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-25 19:38:20,035] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-25 19:38:20,137] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-25 19:38:20,138] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-25 19:38:21,946] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-25 19:38:21,946] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-25 19:38:21,956] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-25 19:38:21,956] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-25 19:38:21,957] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-25 19:38:21,957] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-25 19:38:21,957] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-25 19:38:21,958] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-07-25 19:38:21,958] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-07-25 19:38:21,958] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-07-25 19:38:21,958] [INFO] [comm.py:616:init_distributed] cdb=None
Set logger level to WARNING.
ninja: no work to do.
Time to load fused_adam op: 0.6772902011871338 seconds
Time to load fused_adam op: 0.6026678085327148 seconds
Time to load fused_adam op: 0.6027846336364746 seconds
Time to load fused_adam op: 0.7029099464416504 seconds
Time to load fused_adam op: 0.6028053760528564 seconds
Time to load fused_adam op: 0.5027179718017578 seconds
Time to load fused_adam op: 0.6026568412780762 seconds
Time to load fused_adam op: 0.4024209976196289 seconds
Rank: 1 partition count [8, 8] and sizes[(330655680, False), (126608, False)]
Rank: 7 partition count [8, 8] and sizes[(330655680, False), (126608, False)]
Rank: 2 partition count [8, 8] and sizes[(330655680, False), (126608, False)]
Rank: 6 partition count [8, 8] and sizes[(330655680, False), (126608, False)]
Rank: 0 partition count [8, 8] and sizes[(330655680, False), (126608, False)]
Rank: 3 partition count [8, 8] and sizes[(330655680, False), (126608, False)]
Rank: 4 partition count [8, 8] and sizes[(330655680, False), (126608, False)]
Rank: 5 partition count [8, 8] and sizes[(330655680, False), (126608, False)]
***** Running training *****
Saving model to "/home/paperspace/safe-rlhf/output/sft" ...
Saving DeepSpeed Checkpoints...
Converting DeepSpeed Checkpoints to Hugging Face format...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
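
The warning above comes from the Hugging Face tokenizers library: the Rust-backed fast tokenizer had already spawned its internal thread pool before the checkpoint-conversion step forked the process, so parallelism is disabled defensively. A minimal sketch of the second suggested fix, assuming the variable is set before the fast tokenizer is first instantiated (exporting TOKENIZERS_PARALLELISM=false in the shell before launching works equally well):

# Sketch: disable tokenizers' internal thread pool before any fork can happen.
# The model name and trust_remote_code flag are taken from the logged command;
# the key assumption is that this runs before the tokenizer is first used.
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "cerebras/btlm-3b-8k-base",
    trust_remote_code=True,
)
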
[2023-07-25 21:07:50,901] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Processing zero checkpoint './global_step880'
Detected checkpoint of type zero stage 2, world_size: 8
Parsing checkpoint created by deepspeed==0.10.0
Reconstructed Frozen fp32 state dict with 1 params 32 elements
Reconstructed fp32 state dict with 451 params 2646258304 elements
Saving fp32 state dict to pytorch_model.bin
Model saved!
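
The conversion messages above ("Processing zero checkpoint", "Detected checkpoint of type zero stage 2", "Saving fp32 state dict to pytorch_model.bin") are the output of DeepSpeed's zero_to_fp32 utility, which merges the eight per-rank ZeRO stage 2 shards under ./global_step880 into a single fp32 pytorch_model.bin. A minimal sketch of running that step by hand, assuming the checkpoint lives under the logged --output_dir and that the function signature matches deepspeed==0.10.0 as reported in the log:

# Sketch: consolidate the ZeRO stage 2 checkpoint into a single fp32 state dict.
# Paths and the global_step880 tag are taken from the log; treat them as assumptions
# about where this particular run actually wrote its checkpoint.
from deepspeed.utils.zero_to_fp32 import convert_zero_checkpoint_to_fp32_state_dict

convert_zero_checkpoint_to_fp32_state_dict(
    checkpoint_dir="/home/paperspace/safe-rlhf/output/sft",
    output_file="/home/paperspace/safe-rlhf/output/sft/pytorch_model.bin",
    tag="global_step880",
)
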
[2023-07-25 21:08:25,310] [INFO] [launch.py:347:main] Process 41397 exits successfully.
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41399 exits successfully.
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41401 exits successfully.
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41398 exits successfully.
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41400 exits successfully.
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41396 exits successfully.
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41402 exits successfully.
[2023-07-25 21:08:26,313] [INFO] [launch.py:347:main] Process 41395 exits successfully.