BodyPoseNet TAO training error

andremayer2000 · April 20, 2022, 3:13pm

Hello, I'm currently trying to train a BodyPoseNet model on my custom dataset in COCO format. I'm able to create the tfrecords, but when I run the train command `bpnet train -e /workspace/specs/bpnet_train_m1_coco_1.yaml -r /workspace/bpnet/ -k key --gpus 1`, I get the following error:
`Traceback (most recent call last):
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py", line 146, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py", line 132, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/modulusobject/modulusobject.py", line 158, in deserialize_maglev_object
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/modulusobject/modulusobject.py", line 145, in _deserialize_recursively
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/modulusobject/modulusobject.py", line 167, in deserialize_maglev_object
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/modulusobject/modulusobject.py", line 432, in wrapper
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py", line 150, in __init__
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/processors/label_processor.py", line 57, in __init__
AssertionError`

I have tried both my custom dataset and the dataset provided in the tutorial, and I get the same error in both cases.

• Hardware: Tesla V100
• Network Type: BodyPoseNet
• TLT Version: nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3
• Training spec file
`
__class_name__: BpNetTrainer
checkpoint_dir: /workspace/bpnet/
log_every_n_secs: 30
checkpoint_n_epoch: 10
num_epoch: 200
summary_every_n_steps: 20
infrequent_summary_every_n_steps: 0
validation_every_n_epoch: 10
max_ckpt_to_keep: 100
random_seed: 42
pretrained_weights: /workspace/bpnet/pretrained_model/model.tlt
load_graph: False
finetuning_config:
  is_finetune_exp: False
  checkpoint_path: null
  ckpt_epoch_num: 0
use_stagewise_lr_multipliers: True
dataloader:
  __class_name__: BpNetDataloader
  batch_size: 24
  pose_config:
    __class_name__: BpNetPoseConfig
    target_shape: [32, 32]
    pose_config_path: /workspace/models/bpnet/model_pose_config/bpnet_18joints.json
  image_config:
    image_dims:
      height: 544
      width: 960
      channels: 3
    image_encoding: png
  dataset_config:
    root_data_path: /workspace/dataset_pose
    train_records_folder_path: /workspace/dataset_pose/
    train_records_path: [train-fold-000-of-001]
    val_records_folder_path: /workspace/dataset_pose/
    val_records_path: [test-fold-000-of-001]
    dataset_specs:
      coco: /workspace/specs/coco_spec.json
  normalization_params:
    image_scale: [256.0, 256.0, 256.0]
    image_offset: [0.5, 0.5, 0.5]
    mask_scale: [255.0]
    mask_offset: [0.0]
  augmentation_config:
    __class_name__: AugmentationConfig
    spatial_augmentation_mode: person_centric
    spatial_aug_params:
      flip_lr_prob: 0.5
      flip_tb_prob: 0.0
      rotate_deg_max: 40.0
      rotate_deg_min: -40.0
      zoom_prob: 0.0
      zoom_ratio_min: 1.0
      zoom_ratio_max: 1.0
      translate_max_x: 40.0
      translate_min_x: -40.0
      translate_max_y: 40.0
      translate_min_y: -40.0
      use_translate_ratio: False
      translate_ratio_max: 0.2
      translate_ratio_min: -0.2
      target_person_scale: 0.6
    identity_spatial_aug_params:
      null
  label_processor_config:
    paf_gaussian_sigma: 0.03
    heatmap_gaussian_sigma: 7.0
    paf_ortho_dist_thresh: 1.0
  shuffle_buffer_size: 20000
model:
  __class_name__: BpNetLiteModel
  backbone_attributes:
    architecture: vgg
    mtype: default
    use_bias: False
  stages: 3
  heat_channels: 19
  paf_channels: 38
  use_self_attention: False
  data_format: channels_last
  use_bias: True
  regularization_type: l1
  kernel_regularization_factor: 5.0e-4
  bias_regularization_factor: 0.0
  kernel_initializer: random_normal
optimizer:
  __class_name__: WeightedMomentumOptimizer
  learning_rate_schedule:
    __class_name__: SoftstartAnnealingLearningRateSchedule
    soft_start: 0.05
    annealing: 0.5
    base_learning_rate: 2.e-5
    min_learning_rate: 8.e-08
    last_step: null
  grad_weights_dict: null
  weight_default_value: 1.0
  momentum: 0.9
  use_nesterov: False
loss:
  __class_name__: BpNetLoss`

Morganh · April 21, 2022, 2:29am

Can you run the default Jupyter notebook successfully?

Morganh · April 21, 2022, 2:44am

The `target_shape` depends on the input shape and can be computed from the model stride. In the default setting, the model has a stride of 8.

The assertion error comes from the following check, which requires the same downsampling ratio along both image dimensions:

assert (image_shape[0] // target_shape[0]) == (image_shape[1] // target_shape[1])
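
To make this concrete, here is a minimal Python sketch of the check applied to the spec posted above. It is not the TAO source: the helper names `expected_target_shape` and `passes_ratio_check` are hypothetical, and it assumes that both `image_shape` and `target_shape` are ordered `[height, width]` and that the stride is 8, as in the default setting.

```python
def expected_target_shape(height, width, stride=8):
    """Return [height // stride, width // stride] (hypothetical helper).

    Assumes the network downsamples both dimensions by `stride`.
    """
    if height % stride or width % stride:
        raise ValueError(f"{height}x{width} is not divisible by stride {stride}")
    return [height // stride, width // stride]


def passes_ratio_check(image_shape, target_shape):
    """Re-implementation of the quoted assertion, returning a bool instead of raising."""
    return (image_shape[0] // target_shape[0]) == (image_shape[1] // target_shape[1])


# Assumed ordering: [height, width], taken from image_dims in the posted spec.
image_shape = [544, 960]

# target_shape from the posted spec: 544 // 32 = 17 but 960 // 32 = 30.
print(passes_ratio_check(image_shape, [32, 32]))   # False

# target_shape derived from a stride of 8.
target = expected_target_shape(*image_shape)
print(target)                                      # [68, 120]
print(passes_ratio_check(image_shape, target))     # True
```

Under these assumptions, `target_shape: [68, 120]` in `pose_config` would satisfy the check for a 544x960 input, while the `[32, 32]` value only matches the 256x256 case.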

andremayer2000 · April 25, 2022, 1:43pm

Thank you very much for your help
