Multi-GPU training raises error

Hi all,
I'm using TAO (3.22.05) on an Ubuntu 18.04 machine with 8 GPUs (RTX 3090). I'm trying to train yolo_v4_tiny on a custom dataset. My setup meets the prerequisites in the TAO tutorial, and training works fine when I use a single GPU:

!tao yolo_v4_tiny train -e $SPECS_DIR/yolo_v4.conf \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 1

But when I set --gpus 4, it shows the error below. How can I fix it?

Please let me know if any extra information is needed.

Thanks in advance!
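For reference, the failing run is the same command as above with only the GPU count changed:

!tao yolo_v4_tiny train -e $SPECS_DIR/yolo_v4.conf \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 4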

87c0b2e277d4:145:526 [3] NCCL INFO init.cc:941 → 2
87c0b2e277d4:140:529 [1] NCCL INFO init.cc:941 → 2
87c0b2e277d4:140:529 [1] NCCL INFO init.cc:977 → 2
87c0b2e277d4:145:526 [3] NCCL INFO init.cc:977 → 2
87c0b2e277d4:145:526 [3] NCCL INFO init.cc:990 → 2
87c0b2e277d4:140:529 [1] NCCL INFO init.cc:990 → 2

87c0b2e277d4:141:523 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
87c0b2e277d4:141:523 [2] NCCL INFO include/shm.h:41 → 2

87c0b2e277d4:141:523 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-32f18a3f2c989b15-0-1-2 (size 9637888)
87c0b2e277d4:141:523 [2] NCCL INFO transport/shm.cc:100 → 2
87c0b2e277d4:141:523 [2] NCCL INFO transport.cc:34 → 2
87c0b2e277d4:141:523 [2] NCCL INFO transport.cc:87 → 2
87c0b2e277d4:141:523 [2] NCCL INFO init.cc:804 → 2
87c0b2e277d4:141:523 [2] NCCL INFO init.cc:941 → 2
87c0b2e277d4:141:523 [2] NCCL INFO init.cc:977 → 2
87c0b2e277d4:141:523 [2] NCCL INFO init.cc:990 → 2
87c0b2e277d4:139:532 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 3(=1e000)
INFO: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]
[[loss_1/add_63/_9205]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]
0 successful operations.
0 derived errors ignored.
INFO: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]
INFO: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]
87c0b2e277d4:139:532 [0] NCCL INFO Could not enable P2P between dev 0(=1b000) and dev 3(=1e000)

87c0b2e277d4:139:532 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
87c0b2e277d4:139:532 [0] NCCL INFO include/shm.h:41 → 2

87c0b2e277d4:139:532 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-f15f64ddf4afb234-1-3-0 (size 9637888)
87c0b2e277d4:139:532 [0] NCCL INFO transport/shm.cc:100 → 2
87c0b2e277d4:139:532 [0] NCCL INFO transport.cc:34 → 2
87c0b2e277d4:139:532 [0] NCCL INFO transport.cc:87 → 2
87c0b2e277d4:139:532 [0] NCCL INFO init.cc:804 → 2
87c0b2e277d4:139:532 [0] NCCL INFO init.cc:941 → 2
87c0b2e277d4:139:532 [0] NCCL INFO init.cc:977 → 2
87c0b2e277d4:139:532 [0] NCCL INFO init.cc:990 → 2

87c0b2e277d4:140:529 [1] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
87c0b2e277d4:140:529 [1] NCCL INFO init.cc:1102 → 4

87c0b2e277d4:145:526 [3] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
87c0b2e277d4:145:526 [3] NCCL INFO init.cc:1102 → 4

87c0b2e277d4:139:532 [0] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
87c0b2e277d4:139:532 [0] NCCL INFO init.cc:1102 → 4

87c0b2e277d4:141:523 [2] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
87c0b2e277d4:141:523 [2] NCCL INFO init.cc:1102 → 4
INFO: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 145, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 707, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 695, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 141, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 126, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 77, in run_experiment
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/yolov4_model.py”, line 692, in train
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1039, in fit
validation_steps=validation_steps)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py”, line 154, in fit_loop
outs = f(ins)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2715, in call
return self._call(inputs)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2675, in _call
fetched = self._callable_fn(*array_vals)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1472, in call
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]
[[loss_1/add_63/_9205]]
(1) Unknown: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 145, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 707, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 695, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 141, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 126, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 77, in run_experiment
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/yolov4_model.py”, line 692, in train
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1039, in fit
validation_steps=validation_steps)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py”, line 154, in fit_loop
outs = f(ins)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2715, in call
return self._call(inputs)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2675, in _call
fetched = self._callable_fn(*array_vals)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1472, in call
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 145, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 707, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 695, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 141, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 126, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 77, in run_experiment
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/yolov4_model.py”, line 692, in train
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1039, in fit
validation_steps=validation_steps)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py”, line 154, in fit_loop
outs = f(ins)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2715, in call
return self._call(inputs)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2675, in _call
fetched = self._callable_fn(*array_vals)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1472, in call
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 145, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 707, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 695, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 141, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 126, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 77, in run_experiment
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/models/yolov4_model.py”, line 692, in train
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1039, in fit
validation_steps=validation_steps)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py”, line 154, in fit_loop
outs = f(ins)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2715, in call
return self._call(inputs)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2675, in _call
fetched = self._callable_fn(*array_vals)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1472, in call
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[{{node training_1/Adam/DistributedAdam_Allreduce/cond_184/HorovodAllreduce_training_1_Adam_gradients_conv_big_object_1_BiasAdd_grad_BiasAddGrad_0}}]]

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[56429,1],3]
Exit code: 1

2022-10-19 07:44:58,131 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Is there any error when you run with

--gpus 2

or

--gpus 3

Here is the error when running with --gpus 2:

6c9e5f23ad15:138:330 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.19<0>
6c9e5f23ad15:138:330 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
6c9e5f23ad15:138:330 [0] NCCL INFO P2P plugin IBext
6c9e5f23ad15:138:330 [0] NCCL INFO NET/IB : No device found.
6c9e5f23ad15:138:330 [0] NCCL INFO NET/IB : No device found.
6c9e5f23ad15:138:330 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.19<0>
6c9e5f23ad15:138:330 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
6c9e5f23ad15:141:333 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.19<0>
6c9e5f23ad15:141:333 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
6c9e5f23ad15:141:333 [1] NCCL INFO P2P plugin IBext
6c9e5f23ad15:141:333 [1] NCCL INFO NET/IB : No device found.
6c9e5f23ad15:141:333 [1] NCCL INFO NET/IB : No device found.
6c9e5f23ad15:141:333 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.19<0>
6c9e5f23ad15:141:333 [1] NCCL INFO Using network Socket
6c9e5f23ad15:141:333 [1] NCCL INFO Could not enable P2P between dev 1(=1d000) and dev 0(=1c000)
6c9e5f23ad15:141:333 [1] NCCL INFO Could not enable P2P between dev 0(=1c000) and dev 1(=1d000)
6c9e5f23ad15:141:333 [1] NCCL INFO Could not enable P2P between dev 1(=1d000) and dev 0(=1c000)
6c9e5f23ad15:141:333 [1] NCCL INFO Could not enable P2P between dev 0(=1c000) and dev 1(=1d000)
6c9e5f23ad15:138:330 [0] NCCL INFO Could not enable P2P between dev 1(=1d000) and dev 0(=1c000)
6c9e5f23ad15:138:330 [0] NCCL INFO Could not enable P2P between dev 0(=1c000) and dev 1(=1d000)
6c9e5f23ad15:138:330 [0] NCCL INFO Could not enable P2P between dev 1(=1d000) and dev 0(=1c000)
6c9e5f23ad15:138:330 [0] NCCL INFO Could not enable P2P between dev 0(=1c000) and dev 1(=1d000)
6c9e5f23ad15:138:330 [0] NCCL INFO Channel 00/02 : 0 1
6c9e5f23ad15:138:330 [0] NCCL INFO Channel 01/02 : 0 1
6c9e5f23ad15:138:330 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
6c9e5f23ad15:138:330 [0] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
6c9e5f23ad15:141:333 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
6c9e5f23ad15:141:333 [1] NCCL INFO Setting affinity for GPU 2 to 3ff003ff
6c9e5f23ad15:138:330 [0] NCCL INFO Could not enable P2P between dev 0(=1c000) and dev 1(=1d000)
6c9e5f23ad15:141:333 [1] NCCL INFO Could not enable P2P between dev 1(=1d000) and dev 0(=1c000)
6c9e5f23ad15:138:330 [0] NCCL INFO Could not enable P2P between dev 0(=1c000) and dev 1(=1d000)
6c9e5f23ad15:141:333 [1] NCCL INFO Could not enable P2P between dev 1(=1d000) and dev 0(=1c000)
6c9e5f23ad15:138:330 [0] NCCL INFO Could not enable P2P between dev 0(=1c000) and dev 1(=1d000)
6c9e5f23ad15:138:330 [0] NCCL INFO Channel 00 : 0[1c000] → 1[1d000] via direct shared memory
6c9e5f23ad15:138:330 [0] NCCL INFO Could not enable P2P between dev 0(=1c000) and dev 1(=1d000)
6c9e5f23ad15:138:330 [0] NCCL INFO Channel 01 : 0[1c000] → 1[1d000] via direct shared memory
6c9e5f23ad15:141:333 [1] NCCL INFO Could not enable P2P between dev 1(=1d000) and dev 0(=1c000)
6c9e5f23ad15:141:333 [1] NCCL INFO Channel 00 : 1[1d000] → 0[1c000] via direct shared memory
6c9e5f23ad15:141:333 [1] NCCL INFO Could not enable P2P between dev 1(=1d000) and dev 0(=1c000)
6c9e5f23ad15:141:333 [1] NCCL INFO Channel 01 : 1[1d000] → 0[1c000] via direct shared memory
6c9e5f23ad15:141:333 [1] NCCL INFO Connected all rings
6c9e5f23ad15:141:333 [1] NCCL INFO Connected all trees
6c9e5f23ad15:141:333 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
6c9e5f23ad15:138:330 [0] NCCL INFO Connected all rings
6c9e5f23ad15:141:333 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
6c9e5f23ad15:138:330 [0] NCCL INFO Connected all trees
6c9e5f23ad15:138:330 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
6c9e5f23ad15:138:330 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
6c9e5f23ad15:138:330 [0] NCCL INFO comm 0x7fcf587ec550 rank 0 nranks 2 cudaDev 0 busId 1c000 - Init COMPLETE
6c9e5f23ad15:141:333 [1] NCCL INFO comm 0x7f21c47eb4e0 rank 1 nranks 2 cudaDev 1 busId 1d000 - Init COMPLETE
6c9e5f23ad15:138:330 [0] NCCL INFO Launch mode Parallel
3/931 […] - ETA: 5:05:56 - loss: nan Batch 2: Invalid loss, terminating training

The 2-GPU training is already running; it just hit a NaN loss. Please try setting a larger max_lr or a smaller bs.
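For anyone mapping those knobs back to the spec: max_lr and bs correspond to fields like these in $SPECS_DIR/yolo_v4.conf (a hypothetical excerpt assuming the TAO 3.x YOLOv4 spec layout; values are purely illustrative):

training_config {
  batch_size_per_gpu: 8          # the "bs" to adjust
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4    # the "max_lr" to adjust
      soft_start: 0.3
    }
  }
}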


Makes sense. I’ll give it a try.

Do you still need support for this topic? Or should we close it? Thanks.

BTW, you can run a CUDA sample to test P2P between GPU 0 and GPU 3.
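A rough sketch of that check (the repo URL is NVIDIA's public cuda-samples; the sample's directory path varies between releases, so treat it as an assumption), plus nvidia-smi topo -m to see how the GPU pairs are wired:

# Show the interconnect between each GPU pair (PIX/PXB/NODE/SYS)
nvidia-smi topo -m

# Build and run a P2P sample from NVIDIA's CUDA samples
# (the path below matches recent repo layouts; adjust for your release)
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/0_Introduction/simpleP2P
make
./simpleP2P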

I haven't tested it yet; I'll report back after I try it.

Meanwhile, you can close the issue.

Yes, it works. Though I still can't run 4 GPUs at the same time, I can use 2 GPUs.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.