runtime error

Exit code: 1. Reason:
me/user/app/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/user/app/venv/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2012, in _allgather_params
[rank0]:     dist.all_gather_into_tensor(flat_tensor,
[rank0]:   File "/home/user/app/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 118, in log_wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/user/app/venv/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 318, in all_gather_into_tensor
[rank0]:     return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
[rank0]:   File "/home/user/app/venv/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 236, in all_gather_into_tensor
[rank0]:     return self.all_gather_function(output_tensor=output_tensor,
[rank0]:   File "/home/user/app/venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 79, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/user/app/venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3200, in all_gather_into_tensor
[rank0]:     work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]: RuntimeError: [Rank 0]: Ranks 1, 2, 3 failed to pass monitoredBarrier in 7200000 ms
[2025-09-24 06:22:21,660] [INFO] [launch.py:351:main] Process 1908 exits successfully.
[2025-09-24 06:22:22,661] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1905
[2025-09-24 06:22:22,662] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1906
[2025-09-24 06:22:22,664] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1907
[2025-09-24 06:22:22,666] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1908
[2025-09-24 06:22:22,667] [ERROR] [launch.py:325:sigkill_handler] ['/home/user/app/venv/bin/python3', '-u', 'train.py', '--local_rank=3'] exits with return code = 1
