…
6abdae4a2479:147:608 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-59a0636e208113ea-1-3-0 (size 9637888)
6abdae4a2479:147:608 [0] NCCL INFO transport/shm.cc:100 -> 2
6abdae4a2479:147:608 [0] NCCL INFO transport.cc:34 -> 2
6abdae4a2479:147:608 [0] NCCL INFO transport.cc:84 -> 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:742 -> 2
6abdae4a2479:148:603 [1] NCCL INFO init.cc:903 -> 2
6abdae4a2479:148:603 [1] NCCL INFO init.cc:916 -> 2
6abdae4a2479:149:607 [2] NCCL INFO init.cc:903 -> 2
6abdae4a2479:149:607 [2] NCCL INFO init.cc:916 -> 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:867 -> 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:903 -> 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:916 -> 2
6abdae4a2479:150:606 [3] NCCL INFO Channel 00 : 3[83000] -> 0[2000] via direct shared memory
6abdae4a2479:150:606 [3] NCCL INFO Channel 01 : 3[83000] -> 0[2000] via direct shared memory
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
…
6abdae4a2479:148:603 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-bc75787ce8703849-0-0-1 (size 9637888)
6abdae4a2479:148:603 [1] NCCL INFO transport/shm.cc:100 -> 2
…
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
…
#########################################################
The project path is “/cv_samples_v1.3.0/bpnet”
The network I am using is:
# Download the pretrained model from NGC
!ngc registry model download-version nvidia/tao/bodyposenet:trainable_v1.0 \
    --dest $LOCAL_EXPERIMENT_DIR/pretrained_model
Training:
!tao bpnet train -e $SPECS_DIR/bpnet_train_m1_coco.yaml \
    -r $USER_EXPERIMENT_DIR/models/exp_m1_unpruned \
    -k nvidia_tlt \
    --gpus 4 \
    --gpu_index 0 1 2 3
When I train with a single GPU using the following options, there is no problem:
    -r $USER_EXPERIMENT_DIR/models/exp_m1_unpruned \
    -k nvidia_tlt \
    --gpus 1
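Since the warning in the log concerns creating a shared memory segment of 9637888 bytes, one thing that might be worth checking inside the training container is how much shared memory is free. The following is only a minimal sketch, not a confirmed diagnosis. It assumes shared memory is mounted at /dev/shm (the log does not state this), and the helper names `shm_has_room` and `check_shm` and the `segments` parameter are hypothetical and used only for illustration.

```python
import os
import shutil

# Segment size in bytes, copied from the NCCL WARN line in the log above.
SEGMENT_BYTES = 9637888


def shm_has_room(free_bytes, needed_bytes=SEGMENT_BYTES, segments=1):
    """Return True if free_bytes can hold `segments` segments of needed_bytes.

    `segments` is a hypothetical knob: the log does not say how many
    segments NCCL tries to create in total.
    """
    return free_bytes >= needed_bytes * segments


def check_shm(path="/dev/shm"):
    """Print free space at `path` and return whether one segment would fit.

    Returns None if `path` does not exist (for example, on a non-Linux host).
    The /dev/shm default is an assumption, not something the log confirms.
    """
    if not os.path.isdir(path):
        print(f"{path} not found; cannot check shared memory here")
        return None
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    fits = shm_has_room(usage.free)
    print(f"{path}: total={usage.total} free={usage.free} "
          f"needed={SEGMENT_BYTES} fits={fits}")
    return fits


if __name__ == "__main__":
    check_shm()
```

If this reports that the segment does not fit, the container's shared memory size is a plausible place to look next. If it does fit, the cause is likely elsewhere.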
I know it may be difficult, but please help me.