BodyPoseNet TAO training error

Hello, I'm currently trying to train a BodyPoseNet model on my custom dataset in COCO format. I can create the tfrecords, but when I run the train command `bpnet train -e /workspace/specs/bpnet_train_m1_coco_1.yaml -r /workspace/bpnet/ -k key --gpus 1`, I get the following error:
`Traceback (most recent call last):
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py", line 146, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/scripts/train.py", line 132, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/modulusobject/modulusobject.py", line 158, in deserialize_maglev_object
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/modulusobject/modulusobject.py", line 145, in _deserialize_recursively
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/modulusobject/modulusobject.py", line 167, in deserialize_maglev_object
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/modulusobject/modulusobject.py", line 432, in wrapper
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/bpnet_dataloader.py", line 150, in __init__
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/driveix/build_wheel.runfiles/ai_infra/driveix/bpnet/dataloaders/processors/label_processor.py", line 57, in __init__
AssertionError`

I have tried both my custom dataset and the dataset provided in the tutorial, and I get the same error with each.

• Hardware: Tesla V100
• Network Type: BodyPoseNet
• TLT Version: nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3
• Training spec file
`
__class_name__: BpNetTrainer
checkpoint_dir: /workspace/bpnet/
log_every_n_secs: 30
checkpoint_n_epoch: 10
num_epoch: 200
summary_every_n_steps: 20
infrequent_summary_every_n_steps: 0
validation_every_n_epoch: 10
max_ckpt_to_keep: 100
random_seed: 42
pretrained_weights: /workspace/bpnet/pretrained_model/model.tlt
load_graph: False
finetuning_config:
  is_finetune_exp: False
  checkpoint_path: null
  ckpt_epoch_num: 0
use_stagewise_lr_multipliers: True
dataloader:
  __class_name__: BpNetDataloader
  batch_size: 24
  pose_config:
    __class_name__: BpNetPoseConfig
    target_shape: [32, 32]
    pose_config_path: /workspace/models/bpnet/model_pose_config/bpnet_18joints.json
  image_config:
    image_dims:
      height: 544
      width: 960
      channels: 3
    image_encoding: png
  dataset_config:
    root_data_path: /workspace/dataset_pose
    train_records_folder_path: /workspace/dataset_pose/
    train_records_path: [train-fold-000-of-001]
    val_records_folder_path: /workspace/dataset_pose/
    val_records_path: [test-fold-000-of-001]
    dataset_specs:
      coco: /workspace/specs/coco_spec.json
  normalization_params:
    image_scale: [256.0, 256.0, 256.0]
    image_offset: [0.5, 0.5, 0.5]
    mask_scale: [255.0]
    mask_offset: [0.0]
  augmentation_config:
    __class_name__: AugmentationConfig
    spatial_augmentation_mode: person_centric
    spatial_aug_params:
      flip_lr_prob: 0.5
      flip_tb_prob: 0.0
      rotate_deg_max: 40.0
      rotate_deg_min: -40.0
      zoom_prob: 0.0
      zoom_ratio_min: 1.0
      zoom_ratio_max: 1.0
      translate_max_x: 40.0
      translate_min_x: -40.0
      translate_max_y: 40.0
      translate_min_y: -40.0
      use_translate_ratio: False
      translate_ratio_max: 0.2
      translate_ratio_min: -0.2
      target_person_scale: 0.6
    identity_spatial_aug_params:
      null
  label_processor_config:
    paf_gaussian_sigma: 0.03
    heatmap_gaussian_sigma: 7.0
    paf_ortho_dist_thresh: 1.0
  shuffle_buffer_size: 20000
model:
  __class_name__: BpNetLiteModel
  backbone_attributes:
    architecture: vgg
    mtype: default
    use_bias: False
  stages: 3
  heat_channels: 19
  paf_channels: 38
  use_self_attention: False
  data_format: channels_last
  use_bias: True
  regularization_type: l1
  kernel_regularization_factor: 5.0e-4
  bias_regularization_factor: 0.0
  kernel_initializer: random_normal
optimizer:
  __class_name__: WeightedMomentumOptimizer
  learning_rate_schedule:
    __class_name__: SoftstartAnnealingLearningRateSchedule
    soft_start: 0.05
    annealing: 0.5
    base_learning_rate: 2.e-5
    min_learning_rate: 8.e-08
    last_step: null
  grad_weights_dict: null
  weight_default_value: 1.0
  momentum: 0.9
  use_nesterov: False
loss:
  __class_name__: BpNetLoss`

Can you run the default Jupyter notebook successfully?

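  • The target_shape depends on the input shape. It can be computed from the model stride: each target dimension is the corresponding input dimension divided by the stride. In the default setting, the model has a stride of 8, so a 544 x 960 input (height x width) needs target_shape: [68, 120] instead of [32, 32], assuming target_shape is ordered [height, width].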
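The stride rule can be sketched in a few lines of Python. This is only an illustration: `compute_target_shape` is a hypothetical helper, not part of the TLT/TAO API, and it assumes that both the image dimensions and `target_shape` are ordered [height, width].

```python
def compute_target_shape(height, width, stride=8):
    """Return [height // stride, width // stride] for the given model stride.

    Hypothetical helper for illustration; assumes [height, width] ordering.
    """
    if height % stride != 0 or width % stride != 0:
        raise ValueError("image dimensions should be divisible by the stride")
    return [height // stride, width // stride]


# Values from the spec file in this thread (height: 544, width: 960).
print(compute_target_shape(544, 960))  # [68, 120]
# A 256 x 256 input gives the [32, 32] used in the posted spec.
print(compute_target_shape(256, 256))  # [32, 32]
```

The result would go into `pose_config.target_shape` in the training spec.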

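The assertion error comes from this check in label_processor.py, which requires the same downsampling ratio along both axes:

assert (image_shape[0] // target_shape[0]) == (image_shape[1] // target_shape[1])

With image dimensions 544 and 960 and target_shape [32, 32], the two sides are 544 // 32 = 17 and 960 // 32 = 30, so the assertion fails.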
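To try the check outside the container, here is a minimal sketch that mirrors the assert quoted above. `strides_match` is a hypothetical name used for illustration, and the [height, width] ordering of `image_shape` is an assumption.

```python
def strides_match(image_shape, target_shape):
    """Mirror the quoted assert: both axes must share one integer ratio.

    Returns (ok, (ratio_0, ratio_1)) so a mismatch is easy to inspect.
    """
    ratio_0 = image_shape[0] // target_shape[0]
    ratio_1 = image_shape[1] // target_shape[1]
    return ratio_0 == ratio_1, (ratio_0, ratio_1)


image_shape = (544, 960)  # assumed (height, width), from the posted spec

# target_shape from the posted spec: the ratios differ, so the assert fails.
print(strides_match(image_shape, [32, 32]))   # (False, (17, 30))
# target_shape derived from a stride of 8: the ratios agree.
print(strides_match(image_shape, [68, 120]))  # (True, (8, 8))
```

Running a check like this against a spec before launching `bpnet train` catches the mismatch without waiting for the dataloader to be built.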

Thank you very much for your help.
