NCCL WARN Call to posix_fallocate failed : No space left on device

Please provide the following information when requesting support.

• Hardware (A30)
• Network Type (MaskRCNN)
• TAO Docker Version (http://nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3)
• Training config file: maskrcnn_train_resnet18.txt (2.2 KB)

I am training MaskRCNN using the TAO docker container (tao-toolkit-tf:v3.22.05-tf1.15.5-py3).

When training with multiple GPUs using the following command,

mask_rcnn train -e /workspace/Nyan/cv_samples_v1.3.0/mask_rcnn/specs/maskrcnn_train_resnet18.txt -d /workspace/Nyan/cv_samples_v1.3.0/mask_rcnn/experiment_dir_unpruned -k nvidia_tlt --gpus 4

I get the following error and training hangs.
The error is

"include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device"

How can I fix this? The full NCCL log is below.

3a85e60a5dd3:176:725 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
3a85e60a5dd3:176:725 [0] NCCL INFO include/shm.h:41 -> 2

3a85e60a5dd3:176:725 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-7592b768ba254c27-0-3-2 (size 9637888)
3a85e60a5dd3:176:725 [0] NCCL INFO transport/shm.cc:100 -> 2
3a85e60a5dd3:176:725 [0] NCCL INFO transport.cc:34 -> 2
3a85e60a5dd3:176:725 [0] NCCL INFO transport.cc:87 -> 2
3a85e60a5dd3:176:725 [0] NCCL INFO init.cc:815 -> 2
3a85e60a5dd3:176:725 [0] NCCL INFO init.cc:941 -> 2
3a85e60a5dd3:176:725 [0] NCCL INFO init.cc:977 -> 2
3a85e60a5dd3:176:725 [0] NCCL INFO init.cc:990 -> 2

3a85e60a5dd3:174:726 [0] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
3a85e60a5dd3:174:726 [0] NCCL INFO include/shm.h:41 -> 2

3a85e60a5dd3:174:726 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-684fb0133cfb0d25-0-3-0 (size 9637888)
3a85e60a5dd3:174:726 [0] NCCL INFO transport/shm.cc:100 -> 2
3a85e60a5dd3:174:726 [0] NCCL INFO transport.cc:34 -> 2
3a85e60a5dd3:174:726 [0] NCCL INFO transport.cc:87 -> 2
3a85e60a5dd3:174:726 [0] NCCL INFO init.cc:804 -> 2
3a85e60a5dd3:174:726 [0] NCCL INFO init.cc:941 -> 2
3a85e60a5dd3:174:726 [0] NCCL INFO init.cc:977 -> 2
3a85e60a5dd3:174:726 [0] NCCL INFO init.cc:990 -> 2

Please refer to Training doesn't converge for Mapillary Vistas Dataset training with MaskRCNN - #32 by edit_or
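
The "No space left on device" in that warning refers to the container's shared-memory filesystem (/dev/shm), not the disk. Docker starts containers with only 64 MB of /dev/shm by default, which is not enough for NCCL to create its inter-GPU shared-memory segments (about 9.6 MB per peer in your log), so the training processes hang. As a quick check (a minimal sketch; <container-id> is a placeholder for your running container), you can confirm how much shared memory the container was started with:

# Show the size and usage of the container's shared-memory mount.
# The Docker default of 64M is too small for multi-GPU NCCL transport.
docker exec -it <container-id> df -h /dev/shm

Starting the container with a larger --shm-size (and raised memlock/stack ulimits) avoids the failure.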

Thanks, I used this docker command and it works:

docker run --runtime=nvidia -it --shm-size=16g --ulimit memlock=-1:-1 --ulimit stack=67108864:67108864 --rm --entrypoint "" -v $PWD:/workspace -p 8888:8888 nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash
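
For anyone who uses the tao launcher (tao mask_rcnn train ...) rather than calling docker run directly, the same shared-memory and ulimit settings can be supplied through the launcher's mounts file. The sketch below assumes the standard ~/.tao_mounts.json location and the DockerOptions keys from the TAO Toolkit launcher documentation; the mount source path is a placeholder you would replace with your own directory:

# Hypothetical launcher config: gives the training container a 16 GB /dev/shm
# and unlimited locked memory, matching the docker run flags above.
cat > ~/.tao_mounts.json <<'EOF'
{
    "Mounts": [
        {
            "source": "/home/<user>/cv_samples_v1.3.0",
            "destination": "/workspace/Nyan/cv_samples_v1.3.0"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        }
    }
}
EOF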

