Error while re-training on a custom dataset using a .tlt file - FasterRCNN

Please provide the following information when requesting support.

• Hardware - RTX 3060
• Network Type - Faster_rcnn with ResNet 101 backbone
• TAO version - format_version: 2.0
toolkit_version: 4.0.1
• Training spec file
```
random_seed: 42
enc_key: 'nvidia-tao'
verbose: True
model_config {
  input_image_config {
    image_type: RGB
    image_channel_order: 'bgr'
    size_height_width {
      height: 1080
      width: 1920
    }
    image_channel_mean {
      key: 'b'
      value: 103.939
    }
    image_channel_mean {
      key: 'g'
      value: 116.779
    }
    image_channel_mean {
      key: 'r'
      value: 123.68
    }
    image_scaling_factor: 1.0
    max_objects_num_per_image: 100
  }
  arch: "resnet:101"
  anchor_box_config {
    scale: 64.0
    scale: 128.0
    scale: 256.0
    ratio: 1.0
    ratio: 0.5
    ratio: 2.0
  }
  freeze_bn: False
  roi_mini_batch: 256
  rpn_stride: 16
  use_bias: False
  roi_pooling_config {
    pool_size: 7
    pool_size_2x: False
  }
  all_projections: False
  use_pooling: False
}
dataset_config {
  data_sources: {
    tfrecords_path: "/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/data/version_1/tfrecords/tfrecords*"
    image_directory_path: "/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection"
  }
  image_extension: 'jpg'
  target_class_mapping {
    key: 'drone'
    value: 'drone'
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 1920
    output_image_height: 1080
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    enable_auto_resize: True
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 0
    translate_max_y: 0
  }
  color_augmentation {
    hue_rotation_max: 0.0
    saturation_shift_max: 0.0
    contrast_scale_max: 0.0
    contrast_center: 0.5
  }
}
training_config {
  enable_augmentation: True
  enable_qat: False
  batch_size_per_gpu: 2
  num_epochs: 20
  pretrained_weights: "/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/models/model_v0/frcnn_resnet101.tlt"
  output_model: "/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/models/model_v1/unpruned/frcnn_resnet101.tlt"
  rpn_min_overlap: 0.3
  rpn_max_overlap: 0.7
  classifier_min_overlap: 0.0
  classifier_max_overlap: 0.5
  gt_as_roi: False
  std_scaling: 1.0
  classifier_regr_std {
    key: 'x'
    value: 10.0
  }
  classifier_regr_std {
    key: 'y'
    value: 10.0
  }
  classifier_regr_std {
    key: 'w'
    value: 5.0
  }
  classifier_regr_std {
    key: 'h'
    value: 5.0
  }

  rpn_mini_batch: 256
  rpn_pre_nms_top_N: 12000
  rpn_nms_max_boxes: 2000
  rpn_nms_overlap_threshold: 0.7

  regularizer {
    type: L2
    weight: 1e-4
  }

  optimizer {
    sgd {
      lr: 0.02
      momentum: 0.9
      decay: 0.0
      nesterov: False
    }
  }

  learning_rate {
    soft_start {
      base_lr: 0.01
      start_lr: 0.001
      soft_start: 0.1
      annealing_points: 0.8
      annealing_points: 0.9
      annealing_divider: 10.0
    }
  }

  lambda_rpn_regr: 1.0
  lambda_rpn_class: 1.0
  lambda_cls_regr: 1.0
  lambda_cls_class: 1.0
  visualizer {
    enabled: true
    num_images: 5
  }
}
inference_config {
  images_dir: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/data/test'
  model: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/models/model_v1/unpruned/frcnn_resnet101.epoch20.tlt'
  batch_size: 4
  detection_image_output_dir: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/data/version_1/inference_results_imgs'
  labels_dump_dir: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/data/version_1/inference_dump_labels'
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_overlap_threshold: 0.1
  object_confidence_thres: 0.0001
  bbox_visualize_threshold: 0.6
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.1
}
evaluation_config {
  model: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/models/model_v1/unpruned/frcnn_resnet101.epoch20.tlt'
  batch_size: 2
  validation_period_during_training: 3
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_overlap_threshold: 0.7
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.3
  object_confidence_thres: 0.0001
  use_voc07_11point_metric: False
  gt_matching_iou_threshold: 0.5
}
```
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
Step 1: Trained on a custom dataset with the ResNet 101 backbone for 20 epochs.
Step 2: Got the .tlt weights file after training finished.
Step 3: Pruned the model.
Step 4: Re-trained the model, replacing the pretrained weights with the pruned weights file.
Step 5: Added more data to fine-tune the model. Using the unpruned weights file, I started training with the spec file above, but I get the error below.

```
2023-06-07 10:42:36.862192: F ./tensorflow/core/kernels/random_op_gpu.h:225] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: the provided PTX was compiled with an unsupported toolchain.
[c1fda20e1e39:00251] *** Process received signal ***
[c1fda20e1e39:00251] Signal: Aborted (6)
[c1fda20e1e39:00251] Signal code: (-6)
[c1fda20e1e39:00251] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f7316dd4090]
[c1fda20e1e39:00251] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f7316dd400b]
[c1fda20e1e39:00251] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7316db3859]
[c1fda20e1e39:00251] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x20baf4)[0x7f7272c06af4]
[c1fda20e1e39:00251] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(_ZN10tensorflow7functor16FillPhiloxRandomIN5Eigen9GpuDeviceENS_6random19UniformDistributionINS4_12PhiloxRandomEfEEEclEPNS_15OpKernelContextERKS3_S6_PfxS7_+0x1d5)[0x7f72235e17e5]
[c1fda20e1e39:00251] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(+0x8ed75ea)[0x7f72235de5ea]
[c1fda20e1e39:00251] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3cb)[0x7f72198053db]
[c1fda20e1e39:00251] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113caa7)[0x7f7219862aa7]
[c1fda20e1e39:00251] [ 8] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113d10f)[0x7f721986310f]
[c1fda20e1e39:00251] [ 9] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x285)[0x7f7219917725]
[c1fda20e1e39:00251] [10] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f7219914268]
[c1fda20e1e39:00251] [11] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x18d69a0)[0x7f7219ffc9a0]
[c1fda20e1e39:00251] [12] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f7316d76609]
[c1fda20e1e39:00251] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f7316eb0133]
[c1fda20e1e39:00251] *** End of error message ***
```

Can you share the output of `nvidia-smi`?

Please update to the 525 driver.

Uninstall:

    sudo apt purge nvidia-driver-*
    sudo apt autoremove
    sudo apt autoclean

Install:

    sudo apt install nvidia-driver-525
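
After reinstalling (and rebooting, if the old kernel module is still loaded), it may be worth confirming that the running driver really is on the 525 branch or newer before retrying training. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH of the host. The helper names `driver_major`, `is_driver_new_enough` and `installed_driver_version` are made up for illustration, and the threshold of 525 simply mirrors the advice above.

```python
import subprocess

# Assumption: 525 is the minimum driver branch, taken from the advice in this thread.
MIN_DRIVER_MAJOR = 525


def driver_major(version: str) -> int:
    """Return the major branch of a driver version string such as '525.105.17'."""
    return int(version.strip().split(".")[0])


def is_driver_new_enough(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True if the driver's major branch is at least `minimum`."""
    return driver_major(version) >= minimum


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version reported for the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return result.stdout.splitlines()[0].strip()
```

On the host, `is_driver_new_enough(installed_driver_version())` should return `True` once the update has taken effect; if it returns `False`, the old driver is probably still the one in use.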

Yes, below is the output of nvidia-smi:
[image: screenshot of nvidia-smi output]

Please update to the 525 driver and retry.

Thank you. Now I am able to re-train using an unpruned model.
