TAO fails when using more than 2 GPUs - NCCL WARN Call to posix_fallocate failed : No space left on device

I’m trying to train on a custom dataset using TAO with multiple GPUs.

Training works fine with up to 2 GPUs, but when I start the process with 3 or more GPUs, the error below is raised.

tao info

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.02
published_date: 02/28/2022

Env Info

ubuntu@aws-xxx:~$ free -g
              total        used        free      shared  buff/cache   available
Mem:            186          40          70           2          76         142
Swap:             0           0           0

ubuntu@aws-xxx:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             94G     0   94G   0% /dev
tmpfs            19G  2.0M   19G   1% /run
/dev/nvme0n1p1  582G  392G  190G  68% /
tmpfs           200G     0  200G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            94G     0   94G   0% /sys/fs/cgroup
/dev/loop2       25M   25M     0 100% /snap/amazon-ssm-agent/4046
/dev/loop4       56M   56M     0 100% /snap/core18/2284
tmpfs            19G   16K   19G   1% /run/user/127
tmpfs            19G   32K   19G   1% /run/user/1000
/dev/loop6       44M   44M     0 100% /snap/snapd/14978
/dev/loop1       27M   27M     0 100% /snap/amazon-ssm-agent/5163
/dev/loop3       44M   44M     0 100% /snap/snapd/15177
/dev/loop5       56M   56M     0 100% /snap/core18/2344
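
One thing that may be worth checking: the `df -h` output above was taken on the host, while training runs inside the TAO Docker container, which can have its own, much smaller /dev/shm (Docker's default is 64 MB unless a larger size is requested; whether the TAO launcher uses that default here is an assumption). Below is a minimal Python sketch, meant to be run inside the container, that reports the space on /dev/shm and estimates how many 9637888-byte segments (the size NCCL reports in the log further down) would fit. The helper names `shm_report` and `segments_that_fit` are illustrative and not part of TAO or NCCL.

```python
import os
import shutil

# Size in bytes of the shared memory segment NCCL failed to create in the log.
NCCL_SEGMENT_BYTES = 9637888


def segments_that_fit(free_bytes: int, segment_bytes: int = NCCL_SEGMENT_BYTES) -> int:
    """Return how many segments of segment_bytes fit into free_bytes."""
    if segment_bytes <= 0:
        raise ValueError("segment_bytes must be positive")
    return max(free_bytes, 0) // segment_bytes


def shm_report(path: str = "/dev/shm") -> dict:
    """Return total, used and free space (in MiB) of the filesystem at path."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {
        "total_mib": usage.total // mib,
        "used_mib": usage.used // mib,
        "free_mib": usage.free // mib,
        # Rough estimate only: NCCL also creates smaller segments.
        "segments_that_fit": segments_that_fit(usage.free),
    }


if __name__ == "__main__":
    # Run this inside the container, not on the host.
    if os.path.isdir("/dev/shm"):
        print(shm_report())
    else:
        print("/dev/shm not found on this system")
```

If the container reports a /dev/shm of about 64 MiB, only 6 segments of this size fit, so a run that needs more shared memory segments than a 2-GPU run could plausibly hit "No space left on device" even though the host shows 200G free.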


Sun Mar 27 18:03:59 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         On   | 00000000:00:1B.0 Off |                    0 |
|  0%   39C    P0   132W / 300W |  18304MiB / 22731MiB |     51%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G         On   | 00000000:00:1C.0 Off |                    0 |
|  0%   40C    P0   115W / 300W |  18063MiB / 22731MiB |     54%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G         On   | 00000000:00:1D.0 Off |                    0 |
|  0%   27C    P8    24W / 300W |      2MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
|  0%   27C    P8    22W / 300W |      2MiB / 22731MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Any tips? The log from the failing run follows.

To run with multigpu, please change --gpus based on the number of available GPUs in your machine.
2022-03-27 17:52:56,613 [INFO] root: Registry: ['nvcr.io']
2022-03-27 17:52:56,709 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-03-27 17:52:56,758 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
.
.
.

2022-03-27 17:54:00,255 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v3/utils/tensor_utils.py:9: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

Epoch 21/80
dfd5d902fdd1:132:427 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
dfd5d902fdd1:132:427 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dfd5d902fdd1:132:427 [0] NCCL INFO NET/IB : No device found.
dfd5d902fdd1:132:427 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.5<0>
dfd5d902fdd1:132:427 [0] NCCL INFO Using network Socket
NCCL version 2.9.9+cuda11.3
dfd5d902fdd1:133:426 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
dfd5d902fdd1:133:426 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dfd5d902fdd1:133:426 [1] NCCL INFO NET/IB : No device found.
dfd5d902fdd1:133:426 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.5<0>
dfd5d902fdd1:133:426 [1] NCCL INFO Using network Socket
dfd5d902fdd1:134:432 [2] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
dfd5d902fdd1:134:432 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dfd5d902fdd1:134:432 [2] NCCL INFO NET/IB : No device found.
dfd5d902fdd1:134:432 [2] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.5<0>
dfd5d902fdd1:134:432 [2] NCCL INFO Using network Socket
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00 : 0[1b0] -> 1[1c0] via direct shared memory
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01 : 0[1b0] -> 1[1c0] via direct shared memory
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Channel 00 : 1[1c0] -> 2[1d0] via direct shared memory
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
dfd5d902fdd1:133:426 [1] NCCL INFO Channel 01 : 1[1c0] -> 2[1d0] via direct shared memory
dfd5d902fdd1:134:432 [2] NCCL INFO Channel 00 : 2[1d0] -> 0[1b0] via direct shared memory
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Channel 01 : 2[1d0] -> 0[1b0] via direct shared memory
dfd5d902fdd1:132:427 [0] NCCL INFO Connected all rings
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 1(=1c0)
dfd5d902fdd1:134:432 [2] NCCL INFO Connected all rings
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:134:432 [2] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] via direct shared memory
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-send-c7b72b31a9e17c47-1-2-1 (size 4104)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:75 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:90 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:753 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-22179506c91c6ec1-0-1-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:753 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO Connected all rings
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 2(=1d0)

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-f185cfe1f5f1edc0-0-2-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:753 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-38787eb0df8af94b-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-624723612b9b6ac4-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-92d8e885fec5ebc5-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-8fb76c5feeba4735-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-ea17d6350df539af-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b98611103acab8ae-0-0-1 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-880ceaf8a318e41a-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b1db8fa8ef295593-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-e26d54cdc253d694-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-fc4af2d41d7fb652-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-56ab5ca93cbaa8cc-0-2-0 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-26199784699027cb-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-841ba93426ac98ce-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-29bb3f5f0771a654-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-5389e40f538217cd-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-9c0e5762d02a2b45-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-41aded8db0ef38cb-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-6b7c923dfcffaa44-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-d9906507e3c11201-0-0-1 (size 9637888)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-afc1c05797b0a088-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-a222a2cb6eb9302-0-2-0 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-ce4126d9bae41d76-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-f80fcb8a06f48eef-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-28a190aeda1f0ff0-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-e69bd103a364b6cd-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-8c3b672e8429c453-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b60a0bded03a35cc-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-3343aef8bb44440d-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-63d5741d8e6ec50e-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-9750a486f33d294-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-278ffacd8344e9e1-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-515e9f7dcf555b5a-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-81f064a2a27fdc5b-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-282b8b49b1481d42-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-fe5ce6996537abc9-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-58bd506e84729e43-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-6528474142558c97-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-bf88b11661907f11-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-8ef6ebf18e65fe10-0-0-1 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-366685e858d679b0-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-5d4c0c385abf8af-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-dc061c13399b8736-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b1623488996d8bac-0-2-0 (size 9637888)

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-80d06f63c6430aab-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-5701cab37a329932-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-8e6805127db39b46-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-5dd63fedaa891a45-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-34079b3d5e78a8cc-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-dc29b4c805391f78-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-368a1e9d247411f2-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-5f85978514990f1-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-4e8c2ba316cc59f2-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-f42bc1cdf7916778-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-1dfa667e43a1d8f1-0-0-1 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-44d1734f392b038c-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-1b02ce9eed1a9213-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-756338740c55848d-0-2-0 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-75ff748eedb276af-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-456daf6a1a87f5ae-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-1b9f0ab9ce778435-0-1-2 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-aa1e9195c6004f1f-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-47efb6ae53b4199-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-d3ed36461210c098-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-25ec15c2ffab58ff-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-4fbaba734bbbca78-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-804c7f981ee64b79-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-6f29817aed8203fb-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-9fbb469fc0ac84fc-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-455adccaa1719282-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-5bcbd7658946accc-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-8c5d9c8a5c712dcd-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-31fd32b53d363b53-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 00/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Channel 01/02 :    0   1   2
dfd5d902fdd1:132:427 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dfd5d902fdd1:134:432 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1
dfd5d902fdd1:132:427 [0] NCCL INFO Could not enable P2P between dev 0(=1b0) and dev 2(=1d0)
dfd5d902fdd1:133:426 [1] NCCL INFO Could not enable P2P between dev 1(=1c0) and dev 0(=1b0)
dfd5d902fdd1:134:432 [2] NCCL INFO Could not enable P2P between dev 2(=1d0) and dev 1(=1c0)

dfd5d902fdd1:133:426 [1] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:133:426 [1] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:133:426 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-db7e10ed85ae9ce4-0-0-1 (size 9637888)
dfd5d902fdd1:133:426 [1] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:132:427 [0] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:132:427 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-c0fd61258d91de5-0-2-0 (size 9637888)
dfd5d902fdd1:132:427 [0] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:867 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
dfd5d902fdd1:134:432 [2] NCCL INFO include/shm.h:41 -> 2

dfd5d902fdd1:134:432 [2] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-b1af6c3d399e2b6b-0-1-2 (size 9637888)
dfd5d902fdd1:134:432 [2] NCCL INFO transport/shm.cc:100 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:34 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO transport.cc:84 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:742 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:867 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:132:427 [0] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:133:426 [1] NCCL INFO init.cc:916 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:903 -> 2
dfd5d902fdd1:134:432 [2] NCCL INFO init.cc:916 -> 2
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 110, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 528, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 516, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 106, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 63, in run_experiment
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[23953,1],1]
  Exit code:    1
--------------------------------------------------------------------------

Can you check whether your disk has enough space? BTW, are you using WSL?

Yes, there is enough space, as shown in the details in the original post.
I think this issue is related to the docker flag --ipc=host.
The OS is Ubuntu 18.04.6 LTS (kernel Linux 5.4.0-1063-aws), running on an AWS g5.12xlarge instance.

According to your description and the logs, you are running the yolov4 network on an AWS g5.12xlarge instance.
Could you try running on another cluster or host machine as well?

No, I don't have another instance to test on.

After analyzing the problem, I believe the AWS instance does not support P2P between the devices (NVLink), so the other option is to use shared memory (/dev/shm). However, /dev/shm inside the Docker container is too small to allocate the shared memory segments for 3 or more GPUs. The docker parameter that solves this is --ipc=host, but we have no control over the docker parameters when the container starts, because TAO controls the docker container.
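One way to check this diagnosis from inside the container is to compare the size of /dev/shm with the segment size NCCL reports in the log (9637888 bytes). The following is a minimal sketch, not part of TAO or NCCL. The helper names `max_segments` and `shm_has_room` are made up for illustration, and the logs do not state how many segments NCCL creates per GPU.

```python
import os
import shutil

# Size of one NCCL shared memory segment, taken from the log above.
NCCL_SEGMENT_BYTES = 9637888
# Default shared memory size of a Docker container (64 MB).
DOCKER_DEFAULT_SHM_BYTES = 64 * 1024 * 1024


def max_segments(total_bytes, segment_bytes=NCCL_SEGMENT_BYTES):
    """Return how many whole segments fit into total_bytes."""
    return total_bytes // segment_bytes


def shm_has_room(required_bytes, path="/dev/shm"):
    """Return True if the filesystem at path has required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    print("segments that fit in 64 MB:", max_segments(DOCKER_DEFAULT_SHM_BYTES))
    if os.path.isdir("/dev/shm"):
        print("room for one more segment:", shm_has_room(NCCL_SEGMENT_BYTES))
```

With the 64 MB default, only 6 segments of this size fit, which would be consistent with the allocation failing once a third GPU is added.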

You can control the docker parameters.

You can also start the TAO docker directly with the method below. For example,
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 /bin/bash
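When starting the container directly like this, the shared memory settings can be passed as ordinary `docker run` flags (`--ipc=host` or `--shm-size`). The sketch below only assembles such a command line. The function name `build_docker_cmd` and the "16g" value are illustrative assumptions, and the image tag is the one from the example command.

```python
import shlex

# Image tag taken from the example command above.
TAO_IMAGE = "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3"


def build_docker_cmd(image=TAO_IMAGE, shm_size=None, ipc_host=False):
    """Build a `docker run` argument list with optional shared memory settings."""
    cmd = ["docker", "run", "--runtime=nvidia", "-it", "--rm"]
    if ipc_host:
        # Share the host IPC namespace, including the host's /dev/shm.
        cmd.append("--ipc=host")
    elif shm_size:
        # Otherwise give the container its own, larger /dev/shm.
        cmd.append(f"--shm-size={shm_size}")
    cmd += [image, "/bin/bash"]
    return cmd


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it.
    print(shlex.join(build_docker_cmd(shm_size="16g")))
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without Docker.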

I found a solution.

We can configure the docker options in the .tao_mounts.json file. After adding the config below, everything works fine.

"DockerOptions": {
        "shm_size": "32G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
}

As the docs mention:

shm_size: Defines the shared memory size of the docker. If this parameter isn’t set, then the TAO Toolkit instance allocates 64MB by default. We recommend setting this as “16G”, thereby allocating 16GB of shared memory.

https://docs.nvidia.com/tao/tao-toolkit/text/tao_launcher.html
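For anyone who prefers to script this change, here is a minimal sketch that merges the same options into an existing config dictionary. The function name `with_docker_options` and the placeholder mount paths are assumptions for illustration. Writing the result back to .tao_mounts.json is left out on purpose.

```python
import copy
import json


def with_docker_options(config, shm_size="16G"):
    """Return a copy of config with shared memory settings under DockerOptions."""
    updated = copy.deepcopy(config)
    options = updated.setdefault("DockerOptions", {})
    options["shm_size"] = shm_size
    # setdefault keeps any ulimits that are already configured.
    ulimits = options.setdefault("ulimits", {})
    ulimits.setdefault("memlock", -1)
    ulimits.setdefault("stack", 67108864)
    return updated


if __name__ == "__main__":
    # Hypothetical existing config; the mount paths are placeholders.
    existing = {
        "Mounts": [
            {"source": "/path/on/host", "destination": "/path/in/container"}
        ]
    }
    print(json.dumps(with_docker_options(existing, shm_size="32G"), indent=4))
```

The function works on a copy, so the original dictionary stays unchanged and the output can be reviewed before it replaces the real file.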
