BodyPoseNet trained with custom dataset not detecting

Please provide the following information when requesting support.

• Hardware: Tesla V100-SXM3
• Network Type: BodyPoseNet
• TLT Version: tao-toolkit-tf:v3.21.11-tf1.15.5-py3
• Training spec file:
__class_name__: BpNetTrainer
checkpoint_dir: /workspace/bpnet/train/
log_every_n_secs: 30
checkpoint_n_epoch: 10
num_epoch: 10
summary_every_n_steps: 20
infrequent_summary_every_n_steps: 0
validation_every_n_epoch: 5
max_ckpt_to_keep: 100
random_seed: 42
pretrained_weights: /workspace/bpnet/pretrained_model/model.tlt
load_graph: False
finetuning_config:
  is_finetune_exp: False
  checkpoint_path: null
  ckpt_epoch_num: 0
use_stagewise_lr_multipliers: True
dataloader:
  __class_name__: BpNetDataloader
  batch_size: 8
  pose_config:
    __class_name__: BpNetPoseConfig
    target_shape: [68, 120]
    pose_config_path: /workspace/models/bpnet/model_pose_config/bpnet_18joints.json
  image_config:
    image_dims:
      height: 544
      width: 960
      channels: 3
    image_encoding: png
  dataset_config:
    root_data_path: /workspace/dataset_pose/
    train_records_folder_path: /workspace/dataset_pose/
    train_records_path: [train-fold-000-of-001]
    val_records_folder_path: /workspace/dataset_pose/
    val_records_path: [test-fold-000-of-001]
    dataset_specs:
      coco: /workspace/specs/coco_spec.json
  normalization_params:
    image_scale: [256.0, 256.0, 256.0]
    image_offset: [0.5, 0.5, 0.5]
    mask_scale: [255.0]
    mask_offset: [0.0]
  augmentation_config:
    __class_name__: AugmentationConfig
    spatial_augmentation_mode: person_centric
    spatial_aug_params:
      flip_lr_prob: 0.5
      flip_tb_prob: 0.0
      rotate_deg_max: 40.0
      rotate_deg_min: -40.0
      zoom_prob: 0.0
      zoom_ratio_min: 1.0
      zoom_ratio_max: 1.0
      translate_max_x: 40.0
      translate_min_x: -40.0
      translate_max_y: 40.0
      translate_min_y: -40.0
      use_translate_ratio: False
      translate_ratio_max: 0.2
      translate_ratio_min: -0.2
      target_person_scale: 0.6
    identity_spatial_aug_params: null
  label_processor_config:
    paf_gaussian_sigma: 0.03
    heatmap_gaussian_sigma: 7.0
    paf_ortho_dist_thresh: 1.0
  shuffle_buffer_size: 20000
model:
  __class_name__: BpNetLiteModel
  backbone_attributes:
    architecture: vgg
    mtype: default
    use_bias: False
  stages: 3
  heat_channels: 19
  paf_channels: 38
  use_self_attention: False
  data_format: channels_last
  use_bias: True
  regularization_type: l1
  kernel_regularization_factor: 5.0e-4
  bias_regularization_factor: 0.0
  kernel_initializer: random_normal
optimizer:
  __class_name__: WeightedMomentumOptimizer
  learning_rate_schedule:
    __class_name__: SoftstartAnnealingLearningRateSchedule
    soft_start: 0.05
    annealing: 0.5
    base_learning_rate: 2.e-5
    min_learning_rate: 8.e-08
    last_step: null
  grad_weights_dict: null
  weight_default_value: 1.0
  momentum: 0.9
  use_nesterov: False
loss:
  __class_name__: BpNetLoss

• How to reproduce the issue?

Hello, I’m having a problem running inference for the BodyPoseNet model with a custom dataset that I created. Training runs fine without any errors, but when I run the inference command to test the model, I get no detections at all. I tested the same images directly with the pre-trained model and it detects just fine. I have already checked the annotation format and it is identical to the one provided in the tutorial. The only difference from the tutorial is that I had to create the masks manually from the bounding box of the keypoints, because my dataset does not provide segmentation.
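For reference, a minimal sketch of one way such a bbox-based mask could be built (a hypothetical helper, not the actual script used; it assumes keypoints are stored as (x, y, visibility) triplets and that the mask is a single-channel 0/255 image, as implied by the mask_scale: [255.0] in the spec):

import numpy as np

def mask_from_keypoints(keypoints, img_h, img_w, pad=10):
    """Hypothetical helper: mask from the bounding box of labeled keypoints."""
    mask = np.zeros((img_h, img_w), dtype=np.uint8)
    pts = np.asarray(keypoints, dtype=float).reshape(-1, 3)
    labeled = pts[pts[:, 2] > 0]          # keep only labeled joints (v > 0)
    if labeled.size == 0:
        return mask
    x0 = max(int(labeled[:, 0].min()) - pad, 0)
    y0 = max(int(labeled[:, 1].min()) - pad, 0)
    x1 = min(int(labeled[:, 0].max()) + pad, img_w - 1)
    y1 = min(int(labeled[:, 1].max()) + pad, img_h - 1)
    mask[y0:y1 + 1, x0:x1 + 1] = 255      # mark the keypoint bbox region
    return mask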

How about running with the default notebook? Is it fine?

Yes

Hi @Morganh, I just checked training with the COCO dataset used in the notebook and it worked fine, so the problem seems to be with my custom dataset. Do you have any suggestions as to what the cause might be?

Could you share the training log? Is the loss decreasing?

Okay, so today I was going to run the training again to save the logs, but it gave me the following error:

Traceback (most recent call last):

File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
ret = func(*args)

File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/dataset_config.py", line 103, in transform_labels

File "<__array_function__ internals>", line 6, in reshape

File "/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py", line 299, in reshape
return _wrapfunc(a, 'reshape', newshape, order=order)

File "/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py", line 58, in _wrapfunc
return bound(*args, **kwds)

ValueError: cannot reshape array of size 51 into shape (18,3)

I haven’t changed anything since the last training run.

Please set the following in coco_spec.json:
"num_joints": 17,
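For context, that traceback comes from reshaping the flat COCO keypoint list into (num_joints, 3): 17 joints give 17 × 3 = 51 values, which cannot fill an (18, 3) array. A minimal reproduction:

import numpy as np

# COCO-style "keypoints" is a flat list: num_joints x (x, y, visibility).
keypoints = np.zeros(51)               # 17 joints -> 51 values

try:
    keypoints.reshape(18, 3)           # what happens with "num_joints": 18
except ValueError as e:
    print(e)                           # cannot reshape array of size 51 into shape (18,3)

print(keypoints.reshape(17, 3).shape)  # (17, 3) with "num_joints": 17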

Also, I suggest referring to the config files inside the notebook, for example:
cv_samples_v1.3.0/bpnet/data_pose_config/coco_spec.json
cv_samples_v1.3.0/bpnet/model_pose_config/bpnet_18joints.json

I trained the model for 5 epochs using only 17 joints and there are still no detections. Here’s the log:

INFO 2022-05-18 18:25:53,660| tensorflow: Saving checkpoints for step-0.
INFO 2022-05-18 18:26:23,509| tensorflow: epoch = 0.0, loss = 260.48328, step = 0
INFO 2022-05-18 18:26:41,403| tensorflow: global_step/sec: 1.11782
INFO 2022-05-18 18:26:56,617| tensorflow: epoch = 0.2215909090909091, loss = 262.648, step = 39 (33.108 sec)
INFO 2022-05-18 18:26:57,321| tensorflow: global_step/sec: 1.25641
INFO 2022-05-18 18:27:11,411| tensorflow: global_step/sec: 1.41945
INFO 2022-05-18 18:27:25,505| tensorflow: global_step/sec: 1.41912
INFO 2022-05-18 18:27:27,619| tensorflow: epoch = 0.4715909090909091, loss = 207.93134, step = 83 (31.002 sec)
INFO 2022-05-18 18:27:39,607| tensorflow: global_step/sec: 1.41822
INFO 2022-05-18 18:27:53,682| tensorflow: global_step/sec: 1.42092
INFO 2022-05-18 18:27:58,613| tensorflow: epoch = 0.7215909090909091, loss = 306.23758, step = 127 (30.994 sec)
INFO 2022-05-18 18:28:07,800| tensorflow: global_step/sec: 1.4166
INFO 2022-05-18 18:28:21,908| tensorflow: global_step/sec: 1.41769
INFO 2022-05-18 18:28:29,668| tensorflow: epoch = 0.9715909090909092, loss = 185.13324, step = 171 (31.055 sec)
INFO 2022-05-18 18:28:36,024| tensorflow: global_step/sec: 1.41679
INFO 2022-05-18 18:28:50,082| tensorflow: global_step/sec: 1.4227
INFO 2022-05-18 18:29:00,668| tensorflow: epoch = 1.2215909090909092, loss = 244.98184, step = 215 (31.000 sec)
INFO 2022-05-18 18:29:04,199| tensorflow: global_step/sec: 1.41667
INFO 2022-05-18 18:29:18,272| tensorflow: global_step/sec: 1.4212
INFO 2022-05-18 18:29:31,626| tensorflow: epoch = 1.4715909090909092, loss = 201.70868, step = 259 (30.958 sec)
INFO 2022-05-18 18:29:32,332| tensorflow: global_step/sec: 1.42245
INFO 2022-05-18 18:29:46,433| tensorflow: global_step/sec: 1.41841
INFO 2022-05-18 18:30:00,528| tensorflow: global_step/sec: 1.41885
INFO 2022-05-18 18:30:02,637| tensorflow: epoch = 1.7215909090909092, loss = 213.14804, step = 303 (31.012 sec)
INFO 2022-05-18 18:30:14,634| tensorflow: global_step/sec: 1.41791
INFO 2022-05-18 18:30:28,717| tensorflow: global_step/sec: 1.42007
INFO 2022-05-18 18:30:33,664| tensorflow: epoch = 1.9715909090909092, loss = 169.17082, step = 347 (31.027 sec)
INFO 2022-05-18 18:30:42,796| tensorflow: global_step/sec: 1.42066
INFO 2022-05-18 18:30:56,879| tensorflow: global_step/sec: 1.42015
INFO 2022-05-18 18:31:04,609| tensorflow: epoch = 2.221590909090909, loss = 265.66968, step = 391 (30.945 sec)
INFO 2022-05-18 18:31:10,924| tensorflow: global_step/sec: 1.42396
INFO 2022-05-18 18:31:25,004| tensorflow: global_step/sec: 1.42039
INFO 2022-05-18 18:31:35,567| tensorflow: epoch = 2.471590909090909, loss = 129.05527, step = 435 (30.958 sec)
INFO 2022-05-18 18:31:39,085| tensorflow: global_step/sec: 1.42039
INFO 2022-05-18 18:31:53,157| tensorflow: global_step/sec: 1.42132
INFO 2022-05-18 18:32:06,511| tensorflow: epoch = 2.721590909090909, loss = 203.45369, step = 479 (30.944 sec)
INFO 2022-05-18 18:32:07,216| tensorflow: global_step/sec: 1.42256
INFO 2022-05-18 18:32:21,257| tensorflow: global_step/sec: 1.42434
INFO 2022-05-18 18:32:35,307| tensorflow: global_step/sec: 1.42348
INFO 2022-05-18 18:32:37,408| tensorflow: epoch = 2.971590909090909, loss = 219.72559, step = 523 (30.897 sec)
INFO 2022-05-18 18:32:49,348| tensorflow: global_step/sec: 1.42447
INFO 2022-05-18 18:33:03,427| tensorflow: global_step/sec: 1.42047
INFO 2022-05-18 18:33:08,377| tensorflow: epoch = 3.221590909090909, loss = 156.98914, step = 567 (30.969 sec)
INFO 2022-05-18 18:33:17,561| tensorflow: global_step/sec: 1.41512
INFO 2022-05-18 18:33:31,632| tensorflow: global_step/sec: 1.4213
INFO 2022-05-18 18:33:39,372| tensorflow: epoch = 3.471590909090909, loss = 152.15149, step = 611 (30.995 sec)
INFO 2022-05-18 18:33:45,695| tensorflow: global_step/sec: 1.42219
INFO 2022-05-18 18:33:59,788| tensorflow: global_step/sec: 1.41918
INFO 2022-05-18 18:34:10,359| tensorflow: epoch = 3.721590909090909, loss = 162.58096, step = 655 (30.986 sec)
INFO 2022-05-18 18:34:13,867| tensorflow: global_step/sec: 1.42047
INFO 2022-05-18 18:34:27,932| tensorflow: global_step/sec: 1.422
INFO 2022-05-18 18:34:41,276| tensorflow: epoch = 3.971590909090909, loss = 140.42368, step = 699 (30.918 sec)
INFO 2022-05-18 18:34:41,982| tensorflow: global_step/sec: 1.42346
INFO 2022-05-18 18:34:56,091| tensorflow: global_step/sec: 1.41753
INFO 2022-05-18 18:35:10,137| tensorflow: global_step/sec: 1.42398
INFO 2022-05-18 18:35:12,243| tensorflow: epoch = 4.221590909090909, loss = 187.89554, step = 743 (30.966 sec)
INFO 2022-05-18 18:35:24,164| tensorflow: global_step/sec: 1.42582
INFO 2022-05-18 18:35:38,262| tensorflow: global_step/sec: 1.41856
INFO 2022-05-18 18:35:43,187| tensorflow: epoch = 4.471590909090909, loss = 210.05615, step = 787 (30.944 sec)
INFO 2022-05-18 18:35:52,344| tensorflow: global_step/sec: 1.42028
INFO 2022-05-18 18:36:06,380| tensorflow: global_step/sec: 1.42496
INFO 2022-05-18 18:36:14,110| tensorflow: epoch = 4.721590909090909, loss = 146.06561, step = 831 (30.924 sec)
INFO 2022-05-18 18:36:20,459| tensorflow: global_step/sec: 1.42059
INFO 2022-05-18 18:36:34,557| tensorflow: global_step/sec: 1.41858
INFO 2022-05-18 18:36:45,118| tensorflow: epoch = 4.971590909090909, loss = 192.05815, step = 875 (31.007 sec)

The loss is not decreasing.
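(As an aside, a quick way to pull the loss trend out of a saved console log — a minimal sketch, assuming the output above was redirected to a file named train.log:)

import re

# Extract (step, loss) pairs from the TAO training log to eyeball the trend.
pattern = re.compile(r"loss = ([\d.]+), step = (\d+)")

with open("train.log") as f:                      # hypothetical log file
    points = [(int(m.group(2)), float(m.group(1)))
              for m in map(pattern.search, f) if m]

for step, loss in points:
    print(step, loss)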

Can you attach the files below?

  • bpnet_18joints.json
  • coco_spec.json
  • bpnet_train_m1_coco.yaml

Also, if possible, please share your dataset with us so we can reproduce the issue.

Unfortunately, I can’t share the dataset because the images belong to a client, but if it helps I can share the annotations.

bpnet_18joints.json (1.4 KB)

bpnet_train_m1_coco.yaml (2.7 KB)

coco_spec.json (2.4 KB)

Several comments here.

  1. Are your training images 960x544? How many training images?
  2. Can you set a larger “num_epoch: xxx” and try again?
  3. Could you try to run inference against the training dataset to check if there are detections?
  4. Yes, please share the annotations as well. Thanks.
  5. Can you attach the inference yaml file?
  1. Are your training images 960x544? How many training images?
    Yes, they are. 1659 images.
  2. Can you set a larger “num_epoch: xxx” and try again?
    Yes, but I will have to run the training with smaller image dimensions because of memory constraints; when it finishes I will share the results.
  3. Could you try to run inference against the training dataset to check if there are detections?
    Still no detections.
  4. Yes, please share the annotations as well. Thanks.
  5. Can you attach the inference yaml file?
    infer_spec.yaml (398 Bytes)

keypoints.json (730.5 KB)
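(For anyone validating their own annotations: a minimal sanity-check sketch over a COCO-style keypoints file — the filename and the 17-joint count here match the attachment and the num_joints fix above:)

import json

# Check that every annotation carries num_joints * 3 keypoint values
# (x, y, visibility per joint), matching "num_joints": 17 in coco_spec.json.
NUM_JOINTS = 17

with open("keypoints.json") as f:   # the attached annotation file
    coco = json.load(f)

bad = [ann.get("id") for ann in coco.get("annotations", [])
       if len(ann.get("keypoints", [])) != NUM_JOINTS * 3]
print(len(bad), "annotations with unexpected keypoint counts:", bad[:10])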

I think you did not set the correct input_shape.
See Body Pose Estimation - NVIDIA Docs

Please modify

input_shape: [960, 544]

to

input_shape: [544, 960]

I changed the input_shape but still get no detections. I’m running the training with more epochs now.

Okay, the training finished. I changed the image size to 256x256 because of GPU memory usage and trained for 100 epochs. The final loss was:

INFO 2022-05-23 19:02:06,767| tensorflow: epoch = 99.11931818181819, loss = 70.093704, step = 17445 (30.258 sec)

So the loss actually decreased this time, but there are still no detections.

Could you try to deploy the model in DeepStream to check whether it works?
Refer to deepstream_tao_apps/configs/bodypose2d_tao at master · NVIDIA-AI-IOT/deepstream_tao_apps · GitHub
and deepstream_tao_apps/configs/bodypose2d_tao at master · NVIDIA-AI-IOT/deepstream_tao_apps · GitHub

I’ve been trying to run it in DeepStream, but I get the following error: “invalid input pafmap dimension.”

Request sink_0 pad from streammux
joint Edges 1 , 8
joint Edges 8 , 9
joint Edges 9 , 10
joint Edges 1 , 11
joint Edges 11 , 12
joint Edges 12 , 13
joint Edges 1 , 2
joint Edges 2 , 3
joint Edges 3 , 4
joint Edges 2 , 16
joint Edges 1 , 5
joint Edges 5 , 6
joint Edges 6 , 7
joint Edges 5 , 17
joint Edges 1 , 0
joint Edges 0 , 14
joint Edges 0 , 15
joint Edges 14 , 16
joint Edges 15 , 17
connections 0 , 1
connections 1 , 2
connections 1 , 5
connections 2 , 3
connections 3 , 4
connections 5 , 6
connections 6 , 7
connections 2 , 8
connections 8 , 9
connections 9 , 10
connections 5 , 11
connections 11 , 12
connections 12 , 13
connections 0 , 14
connections 14 , 16
connections 8 , 11
connections 15 , 17
connections 0 , 15
Now playing: file:///samples/cam173-20220224000945.mp4
0:00:02.420511734 69 0x5629721fcc90 INFO nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger: NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1914> [UID = 1]: Trying to create engine from model files
WARNING: [TRT]: Detected invalid timing cache, setup a local cache instead
0:00:29.127518732 69 0x5629721fcc90 INFO nvinfer gstnvinfer.cpp:638:gst_nvinfer_logger: NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::buildModel() <nvdsinfer_context_impl.cpp:1947> [UID = 1]: serialize cuda engine to file: /app/models/bpnet_model.etlt_b16_gpu0_fp16.engine successfully
INFO: ../nvdsinfer/nvdsinfer_model_builder.cpp:610 [FullDims Engine Info]: layers num: 3
0 INPUT kFLOAT input_1:0 288x384x3 min: 1x288x384x3 opt: 16x288x384x3 Max: 16x288x384x3
1 OUTPUT kFLOAT heatmap_out/BiasAdd:0 36x48x19 min: 0 opt: 0 Max: 0
2 OUTPUT kFLOAT paf_out/BiasAdd:0 36x48x38 min: 0 opt: 0 Max: 0

ERROR: [TRT]: Cannot find binding of given name: conv2d_transpose_1/BiasAdd:0
0:00:29.158933826 69 0x5629721fcc90 WARN nvinfer gstnvinfer.cpp:635:gst_nvinfer_logger: NvDsInferContext[UID 1]: Warning from NvDsInferContextImpl::checkBackendParams() <nvdsinfer_context_impl.cpp:1868> [UID = 1]: Could not find output layer 'conv2d_transpose_1/BiasAdd:0' in engine
0:00:29.207837196 69 0x5629721fcc90 INFO nvinfer gstnvinfer_impl.cpp:313:notifyLoadModelStatus: [UID 1]: Load new model:../../../configs/bodypose2d_tao/bodypose2d_pgie_config.txt sucessfully
Decodebin child added: source
Decodebin child added: decodebin0
Running…
Decodebin child added: qtdemux0
Decodebin child added: multiqueue0
Decodebin child added: h265parse0
Decodebin child added: capsfilter0
Decodebin child added: nvv4l2decoder0
In cb_newpad
###Decodebin pick nvidia decoder plugin.
terminate called after throwing an instance of 'std::runtime_error'
what(): invalid input pafmap dimension.
Aborted (core dumped)

I tested with dimensions 256x256 and trained again with the DeepStream default of 384x288; both gave me the same error.
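(In case it helps with debugging: the engine info above reports outputs heatmap_out/BiasAdd:0 and paf_out/BiasAdd:0, while the pipeline apparently looks for conv2d_transpose_1/BiasAdd:0. A minimal sketch for listing an engine’s bindings so they can be compared against the names in the DeepStream config — assuming the TensorRT Python API is available inside the container, and using the engine path from the log:)

import tensorrt as trt

# List the engine's I/O bindings to compare against the output blob names
# referenced by the DeepStream config (sketch; pre-TensorRT-8.5 binding API).
logger = trt.Logger(trt.Logger.WARNING)
with open("/app/models/bpnet_model.etlt_b16_gpu0_fp16.engine", "rb") as f, \
        trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = "INPUT " if engine.binding_is_input(i) else "OUTPUT"
    print(kind, engine.get_binding_name(i), engine.get_binding_shape(i))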