|
[2023-07-25 19:38:06,582] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:09,856] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. |
|
[2023-07-25 19:38:09,908] [INFO] [runner.py:555:main] cmd = /usr/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=35109 --module --enable_each_rank_log=None safe_rlhf.finetune --train_datasets bt --model_name_or_path cerebras/btlm-3b-8k-base --max_length 8092 --trust_remote_code True --epochs 16 --per_device_train_batch_size 8 --per_device_eval_batch_size 2 --gradient_accumulation_steps 1 --gradient_checkpointing --learning_rate 4.7e-6 --lr_scheduler_type cosine --num_warmup_steps 20 --weight_decay 0.0 --seed 42 --output_dir /home/paperspace/safe-rlhf/output/sft --log_type wandb --log_project BT-Training --zero_stage 2 --bf16 True --tf32 True |
|
[2023-07-25 19:38:11,623] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:14,670] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} |
|
[2023-07-25 19:38:14,670] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0 |
|
[2023-07-25 19:38:14,670] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) |
|
[2023-07-25 19:38:14,670] [INFO] [launch.py:163:main] dist_world_size=8 |
|
[2023-07-25 19:38:14,670] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
|
[2023-07-25 19:38:16,490] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:16,534] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:16,565] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:16,576] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:16,717] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:16,760] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:16,822] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:16,918] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
[2023-07-25 19:38:20,027] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented |
|
[2023-07-25 19:38:20,027] [INFO] [comm.py:616:init_distributed] cdb=None |
|
[2023-07-25 19:38:20,034] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented |
|
[2023-07-25 19:38:20,035] [INFO] [comm.py:616:init_distributed] cdb=None |
|
[2023-07-25 19:38:20,137] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented |
|
[2023-07-25 19:38:20,138] [INFO] [comm.py:616:init_distributed] cdb=None |
|
[2023-07-25 19:38:21,946] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented |
|
[2023-07-25 19:38:21,946] [INFO] [comm.py:616:init_distributed] cdb=None |
|
[2023-07-25 19:38:21,956] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented |
|
[2023-07-25 19:38:21,956] [INFO] [comm.py:616:init_distributed] cdb=None |
|
[2023-07-25 19:38:21,957] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented |
|
[2023-07-25 19:38:21,957] [INFO] [comm.py:616:init_distributed] cdb=None |
|
[2023-07-25 19:38:21,957] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented |
|
[2023-07-25 19:38:21,958] [INFO] [comm.py:616:init_distributed] cdb=None |
|
[2023-07-25 19:38:21,958] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl |
|
[2023-07-25 19:38:21,958] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented |
|
[2023-07-25 19:38:21,958] [INFO] [comm.py:616:init_distributed] cdb=None |
|
Set logger level to WARNING. |
|
ninja: no work to do. |
|
Time to load fused_adam op: 0.6772902011871338 seconds |
|
Time to load fused_adam op: 0.6026678085327148 seconds |
|
Time to load fused_adam op: 0.6027846336364746 seconds |
|
Time to load fused_adam op: 0.7029099464416504 seconds |
|
Time to load fused_adam op: 0.6028053760528564 seconds |
|
Time to load fused_adam op: 0.5027179718017578 seconds |
|
Time to load fused_adam op: 0.6026568412780762 seconds |
|
Time to load fused_adam op: 0.4024209976196289 seconds |
|
Rank: 1 partition count [8, 8] and sizes[(330655680, False), (126608, False)] |
|
Rank: 7 partition count [8, 8] and sizes[(330655680, False), (126608, False)] |
|
Rank: 2 partition count [8, 8] and sizes[(330655680, False), (126608, False)] |
|
Rank: 6 partition count [8, 8] and sizes[(330655680, False), (126608, False)] |
|
Rank: 0 partition count [8, 8] and sizes[(330655680, False), (126608, False)] |
|
Rank: 3 partition count [8, 8] and sizes[(330655680, False), (126608, False)] |
|
Rank: 4 partition count [8, 8] and sizes[(330655680, False), (126608, False)] |
|
Rank: 5 partition count [8, 8] and sizes[(330655680, False), (126608, False)] |
|
***** Running training ***** |
|
Saving model to "/home/paperspace/safe-rlhf/output/sft" ... |
|
Saving DeepSpeed Checkpoints... |
|
Converting DeepSpeed Checkpoints to Hugging Face format... |
|
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... |
|
To disable this warning, you can either: |
|
- Avoid using `tokenizers` before the fork if possible |
|
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) |
|
[2023-07-25 21:07:50,901] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
Processing zero checkpoint './global_step880' |
|
Detected checkpoint of type zero stage 2, world_size: 8 |
|
Parsing checkpoint created by deepspeed==0.10.0 |
|
Reconstructed Frozen fp32 state dict with 1 params 32 elements |
|
Reconstructed fp32 state dict with 451 params 2646258304 elements |
|
Saving fp32 state dict to pytorch_model.bin |
|
Model saved! |
|
[2023-07-25 21:08:25,310] [INFO] [launch.py:347:main] Process 41397 exits successfully. |
|
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41399 exits successfully. |
|
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41401 exits successfully. |
|
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41398 exits successfully. |
|
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41400 exits successfully. |
|
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41396 exits successfully. |
|
[2023-07-25 21:08:25,311] [INFO] [launch.py:347:main] Process 41402 exits successfully. |
|
[2023-07-25 21:08:26,313] [INFO] [launch.py:347:main] Process 41395 exits successfully. |
|
|