BpNet model - Error while training

Description

An error occurs during BpNet training with NVIDIA TAO Toolkit 4.0.

Environment

TensorRT Version:
GPU Type:
Nvidia Driver Version:
CUDA Version:
CUDNN Version:
Operating System + Version: Ubuntu 20.04 LTS
Python Version (if applicable): 3.6
TensorFlow Version (if applicable): 1.15.5
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
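
The GPU type and driver version fields above are blank, and those are usually the first details needed to diagnose a CUDA launch failure. A minimal sketch for collecting them, assuming `nvidia-smi` is available on the PATH of the host or container; the helper names `query_gpu_environment` and `parse_gpu_rows` are illustrative and not part of TAO:

```python
import shutil
import subprocess


def parse_gpu_rows(text):
    """Parse 'name, driver_version' CSV rows as printed by nvidia-smi."""
    rows = []
    for line in text.splitlines():
        # Split on the last comma so a comma inside a GPU name is tolerated.
        parts = [p.strip() for p in line.rsplit(",", 1)]
        if len(parts) == 2 and all(parts):
            rows.append((parts[0], parts[1]))
    return rows


def query_gpu_environment():
    """Return [(gpu_name, driver_version), ...], or [] if nothing can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # Python 3.6 compatible spelling of text=True
        )
    except OSError:
        return []
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    gpus = query_gpu_environment()
    if not gpus:
        print("No GPU information available (is nvidia-smi installed?)")
    for name, driver in gpus:
        print("GPU Type: {}".format(name))
        print("Nvidia Driver Version: {}".format(driver))
```

Running it once on the host and once inside the TAO container shows whether both see the same driver.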

Relevant Files

Error message:

2023-01-30 16:37:49.760021: F ./tensorflow/core/kernels/random_op_gpu.h:225] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: the provided PTX was compiled with an unsupported toolchain.
[7d32dadab26b:00152] *** Process received signal ***
[7d32dadab26b:00152] Signal: Aborted (6)
[7d32dadab26b:00152] Signal code: (-6)
[7d32dadab26b:00152] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fc8dd8d4090]
[7d32dadab26b:00152] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc8dd8d400b]
[7d32dadab26b:00152] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc8dd8b3859]
[7d32dadab26b:00152] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x20baf4)[0x7fc8d9dbcaf4]
[7d32dadab26b:00152] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(ZN10tensorflow7functor16FillPhiloxRandomIN5Eigen9GpuDeviceENS_6random18NormalDistributionINS4_12PhiloxRandomEfEEEclEPNS_15OpKernelContextERKS3_S6_PfxS7+0x1d5)[0x7fc85ee9f535]
[7d32dadab26b:00152] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(+0x8ed76da)[0x7fc85ee9b6da]
[7d32dadab26b:00152] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3cb)[0x7fc8d8caf3db]
[7d32dadab26b:00152] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113caa7)[0x7fc8d8d0caa7]
[7d32dadab26b:00152] [ 8] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113d10f)[0x7fc8d8d0d10f]
[7d32dadab26b:00152] [ 9] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x285)[0x7fc8d8dc1725]
[7d32dadab26b:00152] [10] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7fc8d8dbe268]
[7d32dadab26b:00152] [11] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x18d69a0)[0x7fc8d94a69a0]
[7d32dadab26b:00152] [12] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fc8dd876609]
[7d32dadab26b:00152] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fc8dd9b0133]
[7d32dadab26b:00152] *** End of error message ***
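
The key information in the log above is the TensorFlow fatal (`F`) line and the signal that followed it; the stack frames only confirm that the crash happened while launching a random-number kernel on the GPU. A small sketch that pulls those two facts out of a saved log, assuming the log format shown above; `summarize_crash` is a hypothetical helper, and the sample text is abbreviated from this post:

```python
import re

# TensorFlow fatal log line: "<timestamp>: F <file>:<line>] <message>"
FATAL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): F "
    r"(?P<src>\S+?):(?P<line>\d+)\] (?P<msg>.*)$"
)
# Signal report line: "[host:pid] Signal: Aborted (6)"
SIGNAL_RE = re.compile(r"Signal: (?P<name>\w+) \((?P<num>\d+)\)")


def summarize_crash(log_text):
    """Extract the failing source location, status text, and signal from a log."""
    summary = {"source": None, "line": None, "status": None, "signal": None}
    for raw in log_text.splitlines():
        fatal = FATAL_RE.match(raw)
        if fatal:
            summary["source"] = fatal.group("src")
            summary["line"] = int(fatal.group("line"))
            msg = fatal.group("msg")
            # Keep only the text after the last " status: " marker, if present.
            _, sep, status = msg.rpartition(" status: ")
            summary["status"] = status if sep else msg
            continue
        sig = SIGNAL_RE.search(raw)
        if sig:
            summary["signal"] = (sig.group("name"), int(sig.group("num")))
    return summary


SAMPLE_LOG = """\
2023-01-30 16:37:49.760021: F ./tensorflow/core/kernels/random_op_gpu.h:225] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: the provided PTX was compiled with an unsupported toolchain.
[7d32dadab26b:00152] *** Process received signal ***
[7d32dadab26b:00152] Signal: Aborted (6)
[7d32dadab26b:00152] Signal code: (-6)
"""

if __name__ == "__main__":
    for key, value in summarize_crash(SAMPLE_LOG).items():
        print("{}: {}".format(key, value))
```

The "unsupported toolchain" status is commonly reported when the installed NVIDIA driver is older than the CUDA toolkit used to build the framework, so the driver version is worth checking first; that reading is an assumption here, since the post does not state the driver version.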

Steps To Reproduce

YAML file:
__class_name__: BpNetTrainer
checkpoint_dir: /workspace/tao-experiments/bpnet/models/exp_m1_unpruned
log_every_n_secs: 30
checkpoint_n_epoch: 5
num_epoch: 20
summary_every_n_steps: 20
infrequent_summary_every_n_steps: 0
validation_every_n_epoch: 5
max_ckpt_to_keep: 100
random_seed: 42
pretrained_weights: /workspace/tao-experiments/bpnet/pretrained_model/bodyposenet_vtrainable_v1.0/model.tlt
load_graph: False
finetuning_config:
  is_finetune_exp: False
  checkpoint_path: null
  ckpt_epoch_num: 0
use_stagewise_lr_multipliers: True
dataloader:
  __class_name__: BpNetDataloader
  batch_size: 10
  pose_config:
    __class_name__: BpNetPoseConfig
    target_shape: [32, 32]
    pose_config_path: /workspace/examples/bpnet/model_pose_config/bpnet_18joints.json
  image_config:
    image_dims:
      height: 256
      width: 256
      channels: 3
    image_encoding: jpg
  dataset_config:
    root_data_path: /workspace/tao-experiments/bpnet/data/
    train_records_folder_path: /workspace/tao-experiments/bpnet/data
    train_records_path: [train-fold-000-of-001]
    val_records_folder_path: /workspace/tao-experiments/bpnet/data
    val_records_path: [val-fold-000-of-001]
    dataset_specs:
      coco: /workspace/examples/bpnet/data_pose_config/coco_spec.json
  normalization_params:
    image_scale: [256.0, 256.0, 256.0]
    image_offset: [0.5, 0.5, 0.5]
    mask_scale: [255.0]
    mask_offset: [0.0]
  augmentation_config:
    __class_name__: AugmentationConfig
    spatial_augmentation_mode: person_centric
    spatial_aug_params:
      flip_lr_prob: 0.5
      flip_tb_prob: 0.0
      rotate_deg_max: 40.0
      rotate_deg_min: -40.0
      zoom_prob: 0.0
      zoom_ratio_min: 1.0
      zoom_ratio_max: 1.0
      translate_max_x: 40.0
      translate_min_x: -40.0
      translate_max_y: 40.0
      translate_min_y: -40.0
      use_translate_ratio: False
      translate_ratio_max: 0.2
      translate_ratio_min: -0.2
      target_person_scale: 0.6
    identity_spatial_aug_params:
      null
  label_processor_config:
    paf_gaussian_sigma: 0.03
    heatmap_gaussian_sigma: 7.0
    paf_ortho_dist_thresh: 1.0
  shuffle_buffer_size: 20000
model:
  __class_name__: BpNetLiteModel
  backbone_attributes:
    architecture: vgg
    mtype: default
    use_bias: False
  stages: 3
  heat_channels: 19
  paf_channels: 38
  use_self_attention: False
  data_format: channels_last
  use_bias: True
  regularization_type: l1
  kernel_regularization_factor: 5.0e-4
  bias_regularization_factor: 0.0
  kernel_initializer: random_normal
optimizer:
  __class_name__: WeightedMomentumOptimizer
  learning_rate_schedule:
    __class_name__: SoftstartAnnealingLearningRateSchedule
    soft_start: 0.05
    annealing: 0.5
    base_learning_rate: 2.e-5
    min_learning_rate: 8.e-08
    last_step: null
  grad_weights_dict: null
  weight_default_value: 1.0
  momentum: 0.9
  use_nesterov: False
loss:
  __class_name__: BpNetLoss
inference_spec: /workspace/examples/bpnet/specs/infer_spec.yaml
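
Before looking for a driver or container problem, it can help to rule out an inconsistent spec. The sketch below checks three relationships among the values posted above. It rests on assumptions that are not stated in this post: a model stride of 8 for the VGG backbone, 18 joints plus one background heatmap channel, and 19 limb connections with two PAF channels each. `check_spec` is a hypothetical helper, and the dictionary copies only the relevant fields, using the nesting of the TAO BodyPoseNet sample spec.

```python
MODEL_STRIDE = 8   # assumption: output stride of the VGG-based BpNet backbone
NUM_JOINTS = 18    # from the pose config name bpnet_18joints.json
NUM_LIMBS = 19     # assumption: limb connections in the 18-joint skeleton

# Relevant values copied from the spec in this post.
SPEC = {
    "dataloader": {
        "pose_config": {"target_shape": [32, 32]},
        "image_config": {"image_dims": {"height": 256, "width": 256, "channels": 3}},
    },
    "model": {"heat_channels": 19, "paf_channels": 38},
}


def check_spec(spec, stride=MODEL_STRIDE, num_joints=NUM_JOINTS, num_limbs=NUM_LIMBS):
    """Return a list of human-readable inconsistencies (empty if none found)."""
    problems = []

    dims = spec["dataloader"]["image_config"]["image_dims"]
    target = spec["dataloader"]["pose_config"]["target_shape"]
    expected_target = [dims["height"] // stride, dims["width"] // stride]
    if list(target) != expected_target:
        problems.append(
            "target_shape {} does not match image_dims / stride = {}".format(
                target, expected_target))

    # One heatmap per joint plus one background channel.
    expected_heat = num_joints + 1
    if spec["model"]["heat_channels"] != expected_heat:
        problems.append("heat_channels should be {}".format(expected_heat))

    # Two part-affinity-field channels (x and y) per limb connection.
    expected_paf = 2 * num_limbs
    if spec["model"]["paf_channels"] != expected_paf:
        problems.append("paf_channels should be {}".format(expected_paf))

    return problems


if __name__ == "__main__":
    issues = check_spec(SPEC)
    print("spec looks consistent" if not issues else "\n".join(issues))
```

Under those assumptions the posted values are mutually consistent, which points away from the spec and toward the runtime environment as the source of the crash.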

Hi,

We are moving this post to the TAO Toolkit forum to get better help.

Thank you.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Refer to CLI update - #12 by Morganh
