6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying


6abdae4a2479:147:608 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-59a0636e208113ea-1-3-0 (size 9637888)
6abdae4a2479:147:608 [0] NCCL INFO transport/shm.cc:100 → 2
6abdae4a2479:147:608 [0] NCCL INFO transport.cc:34 → 2
6abdae4a2479:147:608 [0] NCCL INFO transport.cc:84 → 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:742 → 2
6abdae4a2479:148:603 [1] NCCL INFO init.cc:903 → 2
6abdae4a2479:148:603 [1] NCCL INFO init.cc:916 → 2
6abdae4a2479:149:607 [2] NCCL INFO init.cc:903 → 2
6abdae4a2479:149:607 [2] NCCL INFO init.cc:916 → 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:867 → 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:903 → 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:916 → 2
6abdae4a2479:150:606 [3] NCCL INFO Channel 00 : 3[83000] → 0[2000] via direct shared memory
6abdae4a2479:150:606 [3] NCCL INFO Channel 01 : 3[83000] → 0[2000] via direct shared memory
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying

6abdae4a2479:148:603 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-bc75787ce8703849-0-0-1 (size 9637888)
6abdae4a2479:148:603 [1] NCCL INFO transport/shm.cc:100 → 2

tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error

#########################################################
The project path is “/cv_samples_v1.3.0/bpnet”

The network I am using is:
# Download the pretrained model from NGC
!ngc registry model download-version nvidia/tao/bodyposenet:trainable_v1.0 \
--dest $LOCAL_EXPERIMENT_DIR/pretrained_model

Training:
!tao bpnet train -e $SPECS_DIR/bpnet_train_m1_coco.yaml \
-r $USER_EXPERIMENT_DIR/models/exp_m1_unpruned \
-k nvidia_tlt \
--gpus 4 \
--gpu_index 0 1 2 3
When I train with the following options instead, there is no problem:
-r $USER_EXPERIMENT_DIR/models/exp_m1_unpruned \
-k nvidia_tlt \
--gpus 1

I know it may be difficult, but please help me.

May I know how you logged in to the docker and launched the notebook?
Was it with the command below?
$ tao bpnet

I followed this tutorial. No extra steps were performed.

“$ tao bpnet” is the command line for training.

With one GPU there is no problem. With more than one GPU, an error is reported.

You are running in the terminal of an Ubuntu 18 machine, right?
Please share the full log from the terminal. You can upload it as a .txt file.

yes.
error.txt (87.7 KB)

I suggest you check the free space on the hard disk.

8b401678ac29:142:612 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device

I’ve checked my disk; free disk space is > 100G.
Could it be related to shared memory? TAO runs in docker, but we cannot set ‘shm-size’ when it starts.
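
NCCL's shared-memory transport creates its segments under /dev/shm, so a "No space left on device" from posix_fallocate can refer to a small /dev/shm inside the container even when the main disk has plenty of room. Below is a minimal sketch for checking this from inside the container. The helper name `shm_has_room` is hypothetical, and 9637888 is simply the segment size printed in the log above; NCCL may create several such segments per GPU pair, so the real requirement is likely a multiple of it.

```python
import shutil


def shm_has_room(required_bytes, path="/dev/shm"):
    """Return True if the filesystem at `path` has at least `required_bytes` free.

    `path` defaults to /dev/shm, where NCCL places its shared-memory segments.
    """
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes


# Example call (run inside the training container):
#   shm_has_room(9637888)        # one segment of the size reported in the log
#   shm_has_room(16 * 9637888)   # a rough guess for several segments
```

If this returns False inside the container while the host disk is nearly empty, the shared-memory size of the container is the more likely culprit than the disk.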

How did you start the docker and run training? I cannot find the command you used to start the docker and run training.

I don’t know, but it’s in the tutorial.
# Mapping up the local directories to the TAO docker.
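
As far as I can tell, the notebook cell under that comment writes the launcher's mounts file (~/.tao_mounts.json). The sketch below shows how such a file could also carry a shared-memory size for the container. It assumes the installed launcher version honours a "DockerOptions" block with "shm_size" and "ulimits", as described in the TAO launcher documentation; the function names, paths, and the 16G value are placeholders, not settings taken from this thread.

```python
import json


def build_tao_mounts(source_dir, dest_dir, shm_size="16G"):
    """Build a mounts config dict with one mount and Docker options.

    Assumption: the launcher reads "DockerOptions" from the mounts file.
    """
    return {
        "Mounts": [
            {"source": source_dir, "destination": dest_dir},
        ],
        "DockerOptions": {
            "shm_size": shm_size,
            "ulimits": {"memlock": -1, "stack": 67108864},
        },
    }


def write_tao_mounts(config, path):
    """Write the config as JSON to `path` (e.g. the expanded ~/.tao_mounts.json)."""
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
```

After rewriting the file, the next `tao bpnet train` launch would start a new container, so the change can be checked with `df -h /dev/shm` from inside it.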

You can look at this
bpnet.ipynb (1.2 MB)

To narrow this down, please try running in a terminal instead of the notebook.

OK, I will try.

+1
Have you solved the issue?

No, I don’t have time to test right now.

Why are 18 key points inferred?

There are 17 key points in the developer document

See BodyPoseNet | NVIDIA NGC

I have two questions.
A:
The COCO 2017 data I prepared doesn’t include the neck at all. Why is no error reported?
B:
I know, but how should I prepare the data?
Like this?

category['keypoints'] = ["nose",
                         "neck",
                         "right_shoulder",
                         "right_elbow",
                         "right_wrist",
                         "left_shoulder",
                         "left_elbow",
                         "left_wrist",
                         "right_hip",
                         "right_knee",
                         "right_ankle",
                         "left_hip",
                         "left_knee",
                         "left_ankle",
                         "right_eye",
                         "left_eye",
                         "right_ear",
                         "left_ear"]

category['skeleton'] = [
[16, 14],
[16, 2],
[14, 0],
[2, 1],
[2, 3],
[3, 4],
[1, 8],
[8, 9],
[9, 10],
[17, 15],
[17, 5],
[15, 0],
[5, 1],
[5, 6],
[6, 7],
[1, 11],
[11, 12],
[12, 13],
[0, 1]]
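
A quick sanity check for a hand-written category like the one above: every skeleton pair should point at an existing keypoint. This sketch assumes the indices are 0-based, which the posted list suggests because it contains index 0 (COCO's own skeleton uses 1-based indices). The function name `invalid_skeleton_pairs` is hypothetical.

```python
KEYPOINTS = [
    "nose", "neck",
    "right_shoulder", "right_elbow", "right_wrist",
    "left_shoulder", "left_elbow", "left_wrist",
    "right_hip", "right_knee", "right_ankle",
    "left_hip", "left_knee", "left_ankle",
    "right_eye", "left_eye", "right_ear", "left_ear",
]

SKELETON = [
    [16, 14], [16, 2], [14, 0], [2, 1], [2, 3], [3, 4], [1, 8],
    [8, 9], [9, 10], [17, 15], [17, 5], [15, 0], [5, 1], [5, 6],
    [6, 7], [1, 11], [11, 12], [12, 13], [0, 1],
]


def invalid_skeleton_pairs(keypoints, skeleton):
    """Return the skeleton pairs that do not index two valid keypoints (0-based)."""
    n = len(keypoints)
    return [
        pair for pair in skeleton
        if len(pair) != 2 or not all(0 <= i < n for i in pair)
    ]
```

With the 18 names listed above, no pair is out of range. If the neck-less 17-name list were used with the same skeleton, the pairs involving index 17 would be flagged.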

Just prepare the dataset following the guide at https://docs.nvidia.com/tao/tao-toolkit/text/data_annotation_format.html#bodyposenet-coco-format

OH MY GOD.
Why does the model have 18 key points when the data only needs 17 key points?
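
One plausible explanation, common in OpenPose-style pipelines, is that the neck is never annotated and is instead derived as the midpoint of the two shoulders during data preparation. Whether TAO's BodyPoseNet pipeline does exactly this is an assumption here and is not confirmed in this thread. A minimal sketch of that derivation, using the standard COCO keypoint order as input and the 18-point order posted earlier as output (the function name `coco17_to_18` is hypothetical):

```python
COCO_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

TARGET_NAMES = [
    "nose", "neck",
    "right_shoulder", "right_elbow", "right_wrist",
    "left_shoulder", "left_elbow", "left_wrist",
    "right_hip", "right_knee", "right_ankle",
    "left_hip", "left_knee", "left_ankle",
    "right_eye", "left_eye", "right_ear", "left_ear",
]


def coco17_to_18(kpts):
    """Map 17 COCO (x, y, visibility) triples to 18 points with a derived neck.

    The neck is the shoulder midpoint when both shoulders are labeled
    (visibility > 0); otherwise it is marked as unlabeled (0, 0, 0).
    """
    if len(kpts) != 17:
        raise ValueError("expected 17 COCO keypoints")
    by_name = dict(zip(COCO_NAMES, kpts))
    lx, ly, lv = by_name["left_shoulder"]
    rx, ry, rv = by_name["right_shoulder"]
    if lv > 0 and rv > 0:
        by_name["neck"] = ((lx + rx) / 2, (ly + ry) / 2, min(lv, rv))
    else:
        by_name["neck"] = (0.0, 0.0, 0)
    return [by_name[name] for name in TARGET_NAMES]
```

If the pipeline works this way, it would also explain why a COCO 2017 dataset without a neck annotation trains without errors.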