nohup: ignoring input
[2022-12-19 10:45:52,072] [WARNING] [runner.py:179:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-12-19 10:45:52,087] [INFO] [runner.py:508:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 tune_gpt.py --deepspeed deepspeed.json --limit=10 --local_rank=-1
[2022-12-19 10:45:53,665] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2022-12-19 10:45:53,665] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2022-12-19 10:45:53,665] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2022-12-19 10:45:53,665] [INFO] [launch.py:162:main] dist_world_size=2
[2022-12-19 10:45:53,665] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
No config specified, defaulting to: apps/all
Found cached dataset apps (/home/user/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5)
Max length: 2048
PyTorch: setting up devices
[2022-12-19 10:46:03,625] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
No config specified, defaulting to: apps/all
Found cached dataset apps (/home/user/.cache/huggingface/datasets/codeparrot___apps/all/0.0.0/04ac807715d07d6e5cc580f59cdc8213cd7dc4529d0bb819cca72c9f8e8c1aa5)
Max length: 2048
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
GPU memory occupied: 3108 MB.
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
GPU memory occupied: 3108 MB.
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/user/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6374411582946777 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.604843854904175 seconds
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Emitting ninja build file /home/user/.cache/torch_extensions/py38_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.2902805805206299 seconds
Loading extension module utils...
Time to load utils op: 0.2025001049041748 seconds
Rank: 0 partition count [2] and sizes[(62599296, False)] 
Rank: 1 partition count [2] and sizes[(62599296, False)] 
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0009868144989013672 seconds
Using /home/user/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003800392150878906 seconds

  0%|          | 0/55 [00:00<?, ?it/s]
  2%|▏         | 1/55 [00:01<01:13,  1.37s/it]
                                              
{'loss': 4.65, 'learning_rate': 0.0, 'epoch': 0.09}

  2%|▏         | 1/55 [00:01<01:13,  1.37s/it]
  4%|β–Ž         | 2/55 [00:02<01:13,  1.39s/it]
  5%|β–Œ         | 3/55 [00:04<01:13,  1.42s/it]
  7%|β–‹         | 4/55 [00:05<01:12,  1.42s/it]
  9%|β–‰         | 5/55 [00:07<01:10,  1.41s/it]
                                              
{'loss': 3.3673, 'learning_rate': 1.294882868674145e-05, 'epoch': 0.45}

  9%|β–‰         | 5/55 [00:07<01:10,  1.41s/it]
 11%|β–ˆ         | 6/55 [00:08<01:09,  1.41s/it]
 13%|β–ˆβ–Ž        | 7/55 [00:09<01:07,  1.41s/it]
 15%|β–ˆβ–        | 8/55 [00:11<01:06,  1.41s/it]
 16%|β–ˆβ–‹        | 9/55 [00:12<01:04,  1.41s/it]
 18%|β–ˆβ–Š        | 10/55 [00:14<01:03,  1.41s/it]
                                               
{'loss': 1.5333, 'learning_rate': 1.852558565662928e-05, 'epoch': 0.91}

 18%|β–ˆβ–Š        | 10/55 [00:14<01:03,  1.41s/it]
 20%|β–ˆβ–ˆ        | 11/55 [00:15<01:02,  1.41s/it]
 22%|β–ˆβ–ˆβ–       | 12/55 [00:16<01:00,  1.41s/it]
 24%|β–ˆβ–ˆβ–Ž       | 13/55 [00:18<00:59,  1.41s/it]
 25%|β–ˆβ–ˆβ–Œ       | 14/55 [00:19<00:57,  1.41s/it]
 27%|β–ˆβ–ˆβ–‹       | 15/55 [00:21<00:56,  1.41s/it]
                                               
{'loss': 0.7749, 'learning_rate': 2.1787779359648994e-05, 'epoch': 1.36}

 27%|β–ˆβ–ˆβ–‹       | 15/55 [00:21<00:56,  1.41s/it]
 29%|β–ˆβ–ˆβ–‰       | 16/55 [00:22<00:55,  1.41s/it]
 31%|β–ˆβ–ˆβ–ˆ       | 17/55 [00:23<00:53,  1.41s/it]
 33%|β–ˆβ–ˆβ–ˆβ–Ž      | 18/55 [00:25<00:52,  1.41s/it]
 35%|β–ˆβ–ˆβ–ˆβ–      | 19/55 [00:26<00:50,  1.41s/it]
 36%|β–ˆβ–ˆβ–ˆβ–‹      | 20/55 [00:28<00:49,  1.41s/it]
                                               
{'loss': 0.7098, 'learning_rate': 2.41023426265171e-05, 'epoch': 1.82}

 36%|β–ˆβ–ˆβ–ˆβ–‹      | 20/55 [00:28<00:49,  1.41s/it]
 38%|β–ˆβ–ˆβ–ˆβ–Š      | 21/55 [00:29<00:48,  1.41s/it]
 40%|β–ˆβ–ˆβ–ˆβ–ˆ      | 22/55 [00:31<00:46,  1.41s/it]
 42%|β–ˆβ–ˆβ–ˆβ–ˆβ–     | 23/55 [00:32<00:45,  1.41s/it]
 44%|β–ˆβ–ˆβ–ˆβ–ˆβ–Ž     | 24/55 [00:33<00:43,  1.41s/it]
 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 25/55 [00:35<00:42,  1.41s/it]
                                               
{'loss': 0.6322, 'learning_rate': 2.58976573734829e-05, 'epoch': 2.27}

 45%|β–ˆβ–ˆβ–ˆβ–ˆβ–Œ     | 25/55 [00:35<00:42,  1.41s/it]
 47%|β–ˆβ–ˆβ–ˆβ–ˆβ–‹     | 26/55 [00:36<00:40,  1.40s/it]
 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–‰     | 27/55 [00:38<00:39,  1.40s/it]
 51%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆ     | 28/55 [00:39<00:37,  1.40s/it]
 53%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž    | 29/55 [00:40<00:36,  1.40s/it]
 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 30/55 [00:42<00:35,  1.41s/it]
                                               
{'loss': 0.591, 'learning_rate': 2.7364536329536817e-05, 'epoch': 2.73}

 55%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–    | 30/55 [00:42<00:35,  1.41s/it]
 56%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹    | 31/55 [00:43<00:33,  1.41s/it]
 58%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š    | 32/55 [00:45<00:32,  1.41s/it]
 60%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ    | 33/55 [00:46<00:30,  1.41s/it]
 62%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–   | 34/55 [00:47<00:29,  1.41s/it]
 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 35/55 [00:49<00:28,  1.41s/it]
                                               
{'loss': 0.5036, 'learning_rate': 2.8604764815275082e-05, 'epoch': 3.18}

 64%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž   | 35/55 [00:49<00:28,  1.41s/it]
 65%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ   | 36/55 [00:50<00:26,  1.41s/it]
 67%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹   | 37/55 [00:52<00:25,  1.42s/it]
 69%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰   | 38/55 [00:53<00:24,  1.42s/it]
 71%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ   | 39/55 [00:55<00:22,  1.42s/it]
 73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž  | 40/55 [00:56<00:21,  1.42s/it]
                                               
{'loss': 0.5058, 'learning_rate': 2.9679099596404923e-05, 'epoch': 3.64}

 73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž  | 40/55 [00:56<00:21,  1.42s/it]
 75%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 41/55 [00:57<00:19,  1.42s/it]
 76%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹  | 42/55 [00:59<00:18,  1.42s/it]
 78%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š  | 43/55 [01:00<00:17,  1.42s/it]
 80%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ  | 44/55 [01:02<00:15,  1.42s/it]
 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 45/55 [01:03<00:14,  1.42s/it]
                                               
{'loss': 0.404, 'learning_rate': 3.0626730032556536e-05, 'epoch': 4.09}

 82%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ– | 45/55 [01:03<00:14,  1.42s/it]
 84%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž | 46/55 [01:04<00:12,  1.41s/it]
 85%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ | 47/55 [01:06<00:11,  1.41s/it]
 87%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 48/55 [01:07<00:09,  1.41s/it]
 89%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‰ | 49/55 [01:09<00:08,  1.41s/it]
 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 50/55 [01:10<00:07,  1.41s/it]
                                               
{'loss': 0.3794, 'learning_rate': 3.147441434337073e-05, 'epoch': 4.55}

 91%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ | 50/55 [01:10<00:07,  1.41s/it]
 93%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž| 51/55 [01:11<00:05,  1.42s/it]
 95%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–| 52/55 [01:13<00:04,  1.42s/it]
 96%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹| 53/55 [01:14<00:02,  1.42s/it]
 98%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Š| 54/55 [01:16<00:01,  1.42s/it]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 55/55 [01:17<00:00,  1.42s/it]
                                               
{'loss': 0.3672, 'learning_rate': 3.224123807782732e-05, 'epoch': 5.0}

100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 55/55 [01:17<00:00,  1.42s/it]
Time: 77.86
Samples/second: 8.86
                                               

{'train_runtime': 77.6739, 'train_samples_per_second': 8.883, 'train_steps_per_second': 0.708, 'train_loss': 0.9113625873218884, 'epoch': 5.0}

100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 55/55 [01:17<00:00,  1.42s/it]
GPU memory occupied: 44704 MB.

100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 55/55 [01:17<00:00,  1.41s/it]
Time: 77.67
Samples/second: 8.88
GPU memory occupied: 44704 MB.
Traceback (most recent call last):
  File "tune_gpt.py", line 223, in <module>
    shutil.move(os.path.join(pwd_path, "output.log"), os.path.join(final_save_dir))
  File "/usr/lib/python3.8/shutil.py", line 789, in move
    raise Error("Destination path '%s' already exists" % real_dst)
shutil.Error: Destination path 'experiments/2022-12-19-ab8f3a39c84fea7f66bf71860384bbce5df5fb3523e7dabd22b35c3ecfefb154/output.log' already exists
[2022-12-19 10:47:33,785] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 144520
[2022-12-19 10:47:33,786] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 144521
[2022-12-19 10:47:33,793] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'tune_gpt.py', '--local_rank=1', '--deepspeed', 'deepspeed.json', '--limit=10', '--local_rank=-1'] exits with return code = 1