6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying

oomg · January 14, 2022, 9:42am

…
6abdae4a2479:147:608 [0] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-59a0636e208113ea-1-3-0 (size 9637888)
6abdae4a2479:147:608 [0] NCCL INFO transport/shm.cc:100 -> 2
6abdae4a2479:147:608 [0] NCCL INFO transport.cc:34 -> 2
6abdae4a2479:147:608 [0] NCCL INFO transport.cc:84 -> 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:742 -> 2
6abdae4a2479:148:603 [1] NCCL INFO init.cc:903 -> 2
6abdae4a2479:148:603 [1] NCCL INFO init.cc:916 -> 2
6abdae4a2479:149:607 [2] NCCL INFO init.cc:903 -> 2
6abdae4a2479:149:607 [2] NCCL INFO init.cc:916 -> 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:867 -> 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:903 -> 2
6abdae4a2479:147:608 [0] NCCL INFO init.cc:916 -> 2
6abdae4a2479:150:606 [3] NCCL INFO Channel 00 : 3[83000] -> 0[2000] via direct shared memory
6abdae4a2479:150:606 [3] NCCL INFO Channel 01 : 3[83000] -> 0[2000] via direct shared memory
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying
…
6abdae4a2479:148:603 [1] include/shm.h:48 NCCL WARN Error while creating shared memory segment nccl-shm-recv-bc75787ce8703849-0-0-1 (size 9637888)
6abdae4a2479:148:603 [1] NCCL INFO transport/shm.cc:100 -> 2
…
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
…
#########################################################
The project path is “/cv_samples_v1.3.0/bpnet”

The network I am using:
# Download the pretrained model from NGC
!ngc registry model download-version nvidia/tao/bodyposenet:trainable_v1.0 \
--dest $LOCAL_EXPERIMENT_DIR/pretrained_model

Training:
!tao bpnet train -e $SPECS_DIR/bpnet_train_m1_coco.yaml \
-r $USER_EXPERIMENT_DIR/models/exp_m1_unpruned \
-k nvidia_tlt \
--gpus 4 \
--gpu_index 0 1 2 3
When I train with the following instead, there is no problem:
-r $USER_EXPERIMENT_DIR/models/exp_m1_unpruned \
-k nvidia_tlt \
--gpus 1

I know it may be difficult, but please help me.

Morganh · January 14, 2022, 9:44am

May I know how you logged in to docker and triggered the notebook?
With the command below?
$ tao bpnet

oomg · January 14, 2022, 9:50am

I followed this tutorial. No extra steps were performed.

oomg · January 14, 2022, 9:52am

“$ tao bpnet” is the command line for training.

oomg · January 14, 2022, 9:55am

With 1 GPU there is no problem. With more than 1 GPU, the error is reported.

Morganh · January 14, 2022, 9:55am

You are running in a terminal on an Ubuntu 18 machine, right?
Please share the full log from the terminal. You can upload it as a .txt file.

oomg · January 14, 2022, 10:02am

yes.
error.txt (87.7 KB)

Morganh · January 14, 2022, 10:05am

I suggest you check the free space on the hard disk. The log contains:

8b401678ac29:142:612 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device
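
A WARN line like the one above is easy to miss among the repeated INFO retries. As a minimal sketch, the distinct warnings can be pulled out of a saved log. The helper name `nccl_warnings` is hypothetical, and the regular expression only assumes the `NCCL WARN <message>` layout seen in the lines quoted in this thread.

```python
import re

# Assumes the "NCCL WARN <message>" layout seen in the log lines in this thread.
WARN_RE = re.compile(r"NCCL WARN\s+(.*)")

def nccl_warnings(lines):
    """Return the distinct NCCL WARN messages, in order of first appearance."""
    seen = []
    for line in lines:
        match = WARN_RE.search(line)
        if match:
            message = match.group(1).strip()
            if message not in seen:
                seen.append(message)
    return seen

if __name__ == "__main__":
    # Two lines copied from this thread, used as in-memory sample input.
    sample = [
        "6abdae4a2479:150:606 [3] NCCL INFO Call to connect returned Connection refused, retrying",
        "8b401678ac29:142:612 [3] include/shm.h:28 NCCL WARN Call to posix_fallocate failed : No space left on device",
    ]
    for message in nccl_warnings(sample):
        print(message)
```

For a real log, pass an open file object (for example the error.txt attachment) instead of the sample list, since the function accepts any iterable of lines.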

oomg · January 14, 2022, 10:10am

I’ve checked my disk; free space is > 100G.
Could it be related to shared memory? TAO runs in docker, but we cannot set the ‘shm-size’ when it starts.
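
One way to test the shared-memory hypothesis is to look at the free space of the shared-memory filesystem from inside the container. The sketch below assumes a Linux container where POSIX shared memory is backed by /dev/shm (the usual setup, but an assumption here). The segment size is the one printed in the NCCL WARN lines in this thread, and the helper names `free_bytes` and `has_room` are hypothetical.

```python
import os
import shutil

# Size of one NCCL shared-memory segment, taken from the WARN lines in this thread.
SEGMENT_BYTES = 9637888

def free_bytes(path="/dev/shm"):
    """Return the free bytes on the filesystem holding `path`, or None if it is missing."""
    if not os.path.exists(path):
        return None
    return shutil.disk_usage(path).free

def has_room(needed_bytes, path="/dev/shm"):
    """Return True if `path` exists and has at least `needed_bytes` free."""
    free = free_bytes(path)
    return free is not None and free >= needed_bytes

if __name__ == "__main__":
    print("free bytes in /dev/shm:", free_bytes())
    print("room for one NCCL segment:", has_room(SEGMENT_BYTES))
```

Run it inside the container rather than on the host, because the two can have differently sized /dev/shm mounts. Docker gives a container 64 MB of /dev/shm by default unless a larger `--shm-size` is passed, and each GPU rank creates several segments of this size.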

Morganh · January 14, 2022, 1:52pm

How did you launch docker and run training? I cannot find the command you used to launch docker and run training.

oomg · January 15, 2022, 1:47am

I don’t know, but it’s in the tutorial.
# Mapping up the local directories to the TAO docker.

You can look at this
bpnet.ipynb (1.2 MB)

Morganh · January 16, 2022, 6:46am

To narrow this down, please try running in a terminal instead of the notebook.

oomg · January 16, 2022, 6:50am

OK, I will try.

Bob_DL · January 17, 2022, 9:34am

+1
Have you solved the issue?

oomg · January 17, 2022, 9:38am

No, I don’t have time to test right now.

oomg · January 18, 2022, 6:32am

Why are 18 key points inferred?

There are 17 key points in the developer documentation.

Morganh · January 18, 2022, 7:08am

See BodyPoseNet | NVIDIA NGC

oomg · January 18, 2022, 7:19am

I have two questions.
A:
The coco2017 data I prepared didn’t include the neck at all. Why doesn’t it report errors?
B:
I know, but how should I prepare the data?
Like this?

category['keypoints'] = ["nose",
                         "neck",
                         "right_shoulder",
                         "right_elbow",
                         "right_wrist",
                         "left_shoulder",
                         "left_elbow",
                         "left_wrist",
                         "right_hip",
                         "right_knee",
                         "right_ankle",
                         "left_hip",
                         "left_knee",
                         "left_ankle",
                         "right_eye",
                         "left_eye",
                         "right_ear",
                         "left_ear"]

category['skeleton'] = [
    [16, 14],
    [16, 2],
    [14, 0],
    [2, 1],
    [2, 3],
    [3, 4],
    [1, 8],
    [8, 9],
    [9, 10],
    [17, 15],
    [17, 5],
    [15, 0],
    [5, 1],
    [5, 6],
    [6, 7],
    [1, 11],
    [11, 12],
    [12, 13],
    [0, 1]]

Morganh · January 18, 2022, 7:49am

Just prepare the dataset following the guide at https://docs.nvidia.com/tao/tao-toolkit/text/data_annotation_format.html#bodyposenet-coco-format

oomg · January 18, 2022, 7:53am

OH MY GOD.
Why does the model have 18 key points when the data only needs 17?
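
A possible explanation, offered as an assumption rather than something confirmed in this thread: OpenPose-style pipelines commonly derive the neck as the midpoint of the two shoulders, so a 17-keypoint COCO annotation can feed an 18-keypoint model. The sketch below illustrates that convention only. `add_neck` is a hypothetical helper, not a TAO function, and how BPNet actually handles the neck should be checked against the TAO documentation.

```python
# Standard COCO person keypoint order (17 entries, no neck).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def add_neck(keypoints):
    """Append a derived neck to 17 COCO keypoints given as (x, y, visibility) tuples.

    The neck is the midpoint of the shoulders when both are labeled
    (visibility > 0); otherwise it is marked as unlabeled.
    """
    if len(keypoints) != len(COCO_KEYPOINTS):
        raise ValueError("expected 17 COCO keypoints")
    left = keypoints[COCO_KEYPOINTS.index("left_shoulder")]
    right = keypoints[COCO_KEYPOINTS.index("right_shoulder")]
    if left[2] > 0 and right[2] > 0:
        neck = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2, min(left[2], right[2]))
    else:
        neck = (0.0, 0.0, 0)
    return list(keypoints) + [neck]
```

Under this convention the annotation files stay in plain 17-keypoint COCO format, and the extra point is computed during preprocessing rather than labeled by hand.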
