FasterRCNN TLT V3 error while training

I am getting an error while training FasterRCNN in TLT V3 with an EfficientNet B0 backbone.

train_spec.txt

# Copyright (c) 2017-2020, NVIDIA CORPORATION.  All rights reserved.
random_seed: 42
enc_key: '$KEY'
verbose: True
model_config {
  input_image_config {
    image_type: RGB
    image_channel_order: 'bgr'
    size_height_width {
      height: 960
      width: 1760
    }
    image_channel_mean {
      key: 'b'
      value: 103.939
    }
    image_channel_mean {
      key: 'g'
      value: 116.779
    }
    image_channel_mean {
      key: 'r'
      value: 123.68
    }
    image_scaling_factor: 1.0
    max_objects_num_per_image: 1000
  }
  arch: "efficientnet:b0"
  anchor_box_config {
    scale: 6.0
    scale: 16.0
    scale: 28.0
    #scale: 48.0
    #scale: 128.0
    ratio: 1.0
    ratio: 0.5
    ratio: 2.0
  }
  freeze_bn: True
  roi_mini_batch: 256
  rpn_stride: 16
  use_bias: False
  roi_pooling_config {
    pool_size: 7
    pool_size_2x: False
  }
  activation {
    activation_type: "relu"
  }
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/TLT-V3/tfrecords/*"
    image_directory_path: "/workspace/dataset"
  }
  image_extension: 'jpg'
  target_class_mapping {
    key: 'person'
    value: 'person'
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 1760
    output_image_height: 960
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 0
    translate_max_y: 0
  }
  color_augmentation {
    hue_rotation_max: 0.0
    saturation_shift_max: 0.0
    contrast_scale_max: 0.0
    contrast_center: 0.5
  }
}
training_config {
  enable_augmentation: True
  enable_qat: False
  batch_size_per_gpu: 4
  num_epochs: 120
  pretrained_weights: "/workspace/TLT-V3/tlt_pretrained_object_detection_vefficientnet_b0_relu/efficientnet_b0_relu.hdf5"
  #resume_from_model: "/workspace/tlt-experiments/data/faster_rcnn/efficientnet_b0.epoch2.tlt"
  output_model: "/workspace/TLT-V3/results/frcnn_kitti_efficientnet_b0.tlt"
  rpn_min_overlap: 0.3
  rpn_max_overlap: 0.7
  classifier_min_overlap: 0.0
  classifier_max_overlap: 0.5
  gt_as_roi: False
  std_scaling: 1.0
  classifier_regr_std {
    key: 'x'
    value: 10.0
  }
  classifier_regr_std {
    key: 'y'
    value: 10.0
  }
  classifier_regr_std {
    key: 'w'
    value: 5.0
  }
  classifier_regr_std {
    key: 'h'
    value: 5.0
  }

  rpn_mini_batch: 256
  rpn_pre_nms_top_N: 12000
  rpn_nms_max_boxes: 2000
  rpn_nms_overlap_threshold: 0.7

  regularizer {
    type: L2
    weight: 1e-4
  }

  optimizer {
    sgd {
      lr: 0.02
      momentum: 0.9
      decay: 0.0
      nesterov: False
    }
  }

  learning_rate {
    soft_start {
      base_lr: 0.02
      start_lr: 0.002
      soft_start: 0.1
      annealing_points: 0.8
      annealing_points: 0.9
      annealing_divider: 10.0
    }
  }

  lambda_rpn_regr: 1.0
  lambda_rpn_class: 1.0
  lambda_cls_regr: 1.0
  lambda_cls_class: 1.0
}
inference_config {
  images_dir: '/workspace/TLT-V3/ss/images'
  model: '/workspace/TLT-V3/ss/frcnn_kitti_efficientnet_b0.epoch12.tlt'
  batch_size: 1
  detection_image_output_dir: '/workspace/TLT-V3/ss'
  labels_dump_dir: '/workspace/TLT-V3/ss'
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_overlap_threshold: 0.7
  object_confidence_thres: 0.0001
  bbox_visualize_threshold: 0.6
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.3
}
evaluation_config {
  model: '/workspace/TLT-V3/results'
  batch_size: 8
  validation_period_during_training: 1
  labels_dump_dir: '/workspace/TLT-V3/ss'
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_overlap_threshold: 0.7
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.3
  object_confidence_thres: 0.0001
  use_voc07_11point_metric: False
  gt_matching_iou_threshold: 0.5
}

Command:

tlt faster_rcnn train -e /workspace/TLT-V3/specs/faster_rcnn/default_spec_efficientnet_b0.txt -k xyz --gpus 4
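To help isolate whether the crash only appears with multi-GPU training, a minimal variation of the same command (same spec file and key as above, only --gpus changed) can be run first; if a single-GPU run gets past epoch 1, that would point toward the multi-GPU launch path rather than the spec itself:

tlt faster_rcnn train -e /workspace/TLT-V3/specs/faster_rcnn/default_spec_efficientnet_b0.txt -k xyz --gpus 1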

Error:

Epoch 1/120
[2c41a90d6ca7:00098] *** Process received signal ***
[2c41a90d6ca7:00098] Signal: Segmentation fault (11)
[2c41a90d6ca7:00098] Signal code: Address not mapped (1)
[2c41a90d6ca7:00098] Failing at address: 0x10
[2c41a90d6ca7:00098] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f1ecdb3f040]
[2c41a90d6ca7:00098] [ 1] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8BinaryOpIN5Eigen9GpuDeviceENS_7functor3mulIfEEE7ComputeEPNS_15OpKernelContextE+0x100)[0x7f1dc47f8b30]
[2c41a90d6ca7:00098] [ 2] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x522)[0x7f1dbded6382]
[2c41a90d6ca7:00098] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf978ab)[0x7f1dbdf378ab]
[2c41a90d6ca7:00098] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf97c6f)[0x7f1dbdf37c6f]
[2c41a90d6ca7:00098] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f1dbdfe7791]
[2c41a90d6ca7:00098] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f1dbdfe4df8]
[2c41a90d6ca7:00098] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f1ecbac36df]
[2c41a90d6ca7:00098] [ 8] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f1ecd8e86db]
[2c41a90d6ca7:00098] [ 9] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f1ecdc2171f]
[2c41a90d6ca7:00098] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real noticed that process rank 2 with PID 0 on node 2c41a90d6ca7 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/bin/faster_rcnn", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wh
eel.runfiles/ai_infra/iva/faster_rcnn/entrypoint/faster_rcnn.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wh
eel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-04-05 05:32:24,068 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

There has been no update from you for a period, so we are assuming this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Hi @samjith888,
May I know the below?

  1. Is the training successful while running with 1 GPU?
  2. How about other backbones while running with 4 GPUs? (One way to try this is sketched below.)
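For point 2, a minimal sketch of trying a different backbone with the otherwise unchanged spec: only the arch line inside model_config changes (resnet:18 below is just an illustrative value taken from the FasterRCNN sample specs), and pretrained_weights in training_config must either point to a pretrained model that matches the new backbone or be removed, since it is optional. The rest of the spec and the 4-GPU command stay exactly as posted above.

model_config {
  # ... all other fields unchanged from the spec above ...
  arch: "resnet:18"
}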