Error while re-training with custom dataset using tlt file - FasterRCNN

bharath.goolla · June 7, 2023, 11:18am

Please provide the following information when requesting support.

• Hardware: RTX 3060
• Network type: Faster_rcnn with a ResNet-101 backbone
• TAO version: format_version: 2.0, toolkit_version: 4.0.1
• Training spec file:
```
random_seed: 42
enc_key: 'nvidia-tao'
verbose: True
model_config {
  input_image_config {
    image_type: RGB
    image_channel_order: 'bgr'
    size_height_width {
      height: 1080
      width: 1920
    }
    image_channel_mean {
      key: 'b'
      value: 103.939
    }
    image_channel_mean {
      key: 'g'
      value: 116.779
    }
    image_channel_mean {
      key: 'r'
      value: 123.68
    }
    image_scaling_factor: 1.0
    max_objects_num_per_image: 100
  }
  arch: "resnet:101"
  anchor_box_config {
    scale: 64.0
    scale: 128.0
    scale: 256.0
    ratio: 1.0
    ratio: 0.5
    ratio: 2.0
  }
  freeze_bn: False
  roi_mini_batch: 256
  rpn_stride: 16
  use_bias: False
  roi_pooling_config {
    pool_size: 7
    pool_size_2x: False
  }
  all_projections: False
  use_pooling: False
}
dataset_config {
  data_sources: {
    tfrecords_path: "/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/data/version_1/tfrecords/tfrecords*"
    image_directory_path: "/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection"
  }
  image_extension: 'jpg'
  target_class_mapping {
    key: 'drone'
    value: 'drone'
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 1920
    output_image_height: 1080
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    enable_auto_resize: True
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 0
    translate_max_y: 0
  }
  color_augmentation {
    hue_rotation_max: 0.0
    saturation_shift_max: 0.0
    contrast_scale_max: 0.0
    contrast_center: 0.5
  }
}
training_config {
  enable_augmentation: True
  enable_qat: False
  batch_size_per_gpu: 2
  num_epochs: 20
  pretrained_weights: "/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/models/model_v0/frcnn_resnet101.tlt"
  output_model: "/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/models/model_v1/unpruned/frcnn_resnet101.tlt"
  rpn_min_overlap: 0.3
  rpn_max_overlap: 0.7
  classifier_min_overlap: 0.0
  classifier_max_overlap: 0.5
  gt_as_roi: False
  std_scaling: 1.0
  classifier_regr_std {
    key: 'x'
    value: 10.0
  }
  classifier_regr_std {
    key: 'y'
    value: 10.0
  }
  classifier_regr_std {
    key: 'w'
    value: 5.0
  }
  classifier_regr_std {
    key: 'h'
    value: 5.0
  }

  rpn_mini_batch: 256
  rpn_pre_nms_top_N: 12000
  rpn_nms_max_boxes: 2000
  rpn_nms_overlap_threshold: 0.7

  regularizer {
    type: L2
    weight: 1e-4
  }

  optimizer {
    sgd {
      lr: 0.02
      momentum: 0.9
      decay: 0.0
      nesterov: False
    }
  }

  learning_rate {
    soft_start {
      base_lr: 0.01
      start_lr: 0.001
      soft_start: 0.1
      annealing_points: 0.8
      annealing_points: 0.9
      annealing_divider: 10.0
    }
  }

  lambda_rpn_regr: 1.0
  lambda_rpn_class: 1.0
  lambda_cls_regr: 1.0
  lambda_cls_class: 1.0
  visualizer {
    enabled: true
    num_images: 5
  }
}
inference_config {
  images_dir: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/data/test'
  model: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/models/model_v1/unpruned/frcnn_resnet101.epoch20.tlt'
  batch_size: 4
  detection_image_output_dir: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/data/version_1/inference_results_imgs'
  labels_dump_dir: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/data/version_1/inference_dump_labels'
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_overlap_threshold: 0.1
  object_confidence_thres: 0.0001
  bbox_visualize_threshold: 0.6
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.1
}
evaluation_config {
  model: '/home/robotics/Desktop/AI/tao-getting-started_v4.0.1/workspace/drone_detection/models/model_v1/unpruned/frcnn_resnet101.epoch20.tlt'
  batch_size: 2
  validation_period_during_training: 3
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_overlap_threshold: 0.7
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.3
  object_confidence_thres: 0.0001
  use_voc07_11point_metric: False
  gt_matching_iou_threshold: 0.5
}
```
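
For readers unfamiliar with the `soft_start` block in `learning_rate`, here is a rough Python sketch of the schedule those values describe. It assumes a linear warm-up from `start_lr` to `base_lr` over the first `soft_start` fraction of training, then a division by `annealing_divider` at each annealing point. The function name `soft_start_lr` is made up for illustration, and TAO's own scheduler may interpolate differently.

```python
def soft_start_lr(progress, base_lr=0.01, start_lr=0.001, soft_start=0.1,
                  annealing_points=(0.8, 0.9), annealing_divider=10.0):
    """Illustrative learning rate at a training progress in [0, 1].

    Assumption: linear warm-up followed by step decay. This is a sketch,
    not TAO's actual implementation.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if progress < soft_start:
        # Warm-up: ramp linearly from start_lr to base_lr.
        return start_lr + (base_lr - start_lr) * progress / soft_start
    # After warm-up: divide once for every annealing point already reached.
    passed = sum(1 for point in annealing_points if progress >= point)
    return base_lr / (annealing_divider ** passed)


if __name__ == "__main__":
    for p in (0.0, 0.05, 0.1, 0.5, 0.8, 0.9, 1.0):
        print(f"progress={p:.2f}  lr={soft_start_lr(p):.6f}")
```

The defaults are the values from the spec above, so the printed table shows how the rate would evolve over the 20 epochs under these assumptions.
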
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
Step 1: Trained on a custom dataset with a ResNet-101 backbone for 20 epochs.
Step 2: Got the .tlt weights file after training finished.
Step 3: Pruned the model.
Step 4: Re-trained the model, replacing the pretrained weights with the pruned weights file.
Step 5: Added more data to fine-tune the model. Using the unpruned weights file, I started training with the spec file above, but I get the error below.

```
2023-06-07 10:42:36.862192: F ./tensorflow/core/kernels/random_op_gpu.h:225] Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: the provided PTX was compiled with an unsupported toolchain.
[c1fda20e1e39:00251] *** Process received signal ***
[c1fda20e1e39:00251] Signal: Aborted (6)
[c1fda20e1e39:00251] Signal code: (-6)
[c1fda20e1e39:00251] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f7316dd4090]
[c1fda20e1e39:00251] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f7316dd400b]
[c1fda20e1e39:00251] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7316db3859]
[c1fda20e1e39:00251] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x20baf4)[0x7f7272c06af4]
[c1fda20e1e39:00251] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(ZN10tensorflow7functor16FillPhiloxRandomIN5Eigen9GpuDeviceENS_6random19UniformDistributionINS4_12PhiloxRandomEfEEEclEPNS_15OpKernelContextERKS3_S6_PfxS7+0x1d5)[0x7f72235e17e5]
[c1fda20e1e39:00251] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_cc.so.1(+0x8ed75ea)[0x7f72235de5ea]
[c1fda20e1e39:00251] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3cb)[0x7f72198053db]
[c1fda20e1e39:00251] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113caa7)[0x7f7219862aa7]
[c1fda20e1e39:00251] [ 8] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x113d10f)[0x7f721986310f]
[c1fda20e1e39:00251] [ 9] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x285)[0x7f7219917725]
[c1fda20e1e39:00251] [10] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f7219914268]
[c1fda20e1e39:00251] [11] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/…/libtensorflow_framework.so.1(+0x18d69a0)[0x7f7219ffc9a0]
[c1fda20e1e39:00251] [12] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f7316d76609]
[c1fda20e1e39:00251] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f7316eb0133]
[c1fda20e1e39:00251] *** End of error message ***
```
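
The line that matters in this log is the first one: "the provided PTX was compiled with an unsupported toolchain". That message generally means the installed GPU driver is older than the CUDA toolkit used to build the kernels. The rest is just the abort backtrace. A small sketch for picking that line out of a long training log follows; the helper name `find_ptx_toolchain_errors` is hypothetical, and the pattern is copied from the message above.

```python
import re

# Message text copied from the fatal log line in this thread.
PTX_ERROR = re.compile(r"provided PTX was compiled with an unsupported toolchain")


def find_ptx_toolchain_errors(log_text):
    """Return the log lines that contain the PTX toolchain error."""
    return [line for line in log_text.splitlines() if PTX_ERROR.search(line)]


if __name__ == "__main__":
    sample = (
        "epoch 1 starting\n"
        "F random_op_gpu.h:225] status: Internal: the provided PTX was "
        "compiled with an unsupported toolchain.\n"
        "*** Process received signal ***\n"
    )
    for hit in find_ptx_toolchain_errors(sample):
        print("driver/toolkit mismatch suspected:", hit)
```
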

Morganh · June 8, 2023, 7:45am

Can you share the output of `nvidia-smi`?

Please update to the 525 driver.

Uninstall:

    sudo apt purge nvidia-driver-*
    sudo apt autoremove
    sudo apt autoclean

Install:

    sudo apt install nvidia-driver-525
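
After reinstalling (and rebooting), it may help to confirm which driver is actually loaded before retrying the training. The sketch below reads the version through `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and compares the major number against 525, the version suggested here. The function names and the `MIN_DRIVER_MAJOR` constant are illustrative, not part of any NVIDIA tool.

```python
import shutil
import subprocess

MIN_DRIVER_MAJOR = 525  # threshold taken from the advice in this thread


def driver_major(version):
    """Extract the major number from a version string such as '525.105.17'."""
    return int(version.strip().split(".")[0])


def driver_is_new_enough(version, minimum=MIN_DRIVER_MAJOR):
    """True if the driver's major version is at least `minimum`."""
    return driver_major(version) >= minimum


def installed_driver_versions():
    """Return one driver version string per GPU, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
    except subprocess.CalledProcessError:
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    versions = installed_driver_versions()
    if not versions:
        print("nvidia-smi not found or returned an error")
    for v in versions:
        status = "OK" if driver_is_new_enough(v) else "too old, update the driver"
        print(f"driver {v}: {status}")
```

Run it on the host, since the container uses the host's driver.
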

bharath.goolla · June 8, 2023, 9:26am

Yes, below is the output of nvidia-smi.

Morganh · June 12, 2023, 1:38am

Please update to the 525 driver and retry.

bharath.goolla · June 12, 2023, 5:22am

Thank you. Now I am able to re-train using an unpruned model.
