Error when training with TLT toolkit

When I train with TLT, it keeps printing "target/truncation is not updated to match the crop area if the dataset contains target/truncation." and the CPU usage is at 100%.


My spec is as follows:

Copyright © 2017-2019, NVIDIA CORPORATION. All rights reserved.

random_seed: 42
enc_key: '******************************************************************'
verbose: True
network_config {
input_image_config {
image_type: RGB
image_channel_order: 'bgr'
size_height_width {
height: 375
width: 1242
}
image_channel_mean {
key: 'b'
value: 103.939
}
image_channel_mean {
key: 'g'
value: 116.779
}
image_channel_mean {
key: 'r'
value: 123.68
}
image_scaling_factor: 1.0
max_objects_num_per_image: 100
}
feature_extractor: "resnet:50"
anchor_box_config {
scale: 64.0
scale: 128.0
scale: 256.0
ratio: 1.0
ratio: 0.5
ratio: 2.0
}
freeze_bn: False
freeze_blocks: 0
freeze_blocks: 1
roi_mini_batch: 256
rpn_stride: 16
conv_bn_share_bias: True
roi_pooling_config {
pool_size: 7
pool_size_2x: False
}
all_projections: True
use_pooling: False
}
training_config {
kitti_data_config {
data_sources: {
tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/kitti_trainval"
image_directory_path: "/workspace/tlt-experiments/data/training"
}
image_extension: 'png'
target_class_mapping {
key: 'Car'
value: 'Car'
}
validation_fold: 0
}
data_augmentation {
preprocessing {
output_image_width: 1242
output_image_height: 375
output_image_channel: 3
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 0
translate_max_y: 0
}
color_augmentation {
hue_rotation_max: 0.0
saturation_shift_max: 0.0
contrast_scale_max: 0.0
contrast_center: 0.5
}
}
enable_augmentation: True
batch_size_per_gpu: 32
num_epochs: 200
pretrained_weights: "/workspace/tlt-experiments/data/faster_rcnn/resnet50.hdf5"
output_model: "/workspace/tlt-experiments/data/faster_rcnn/frcnn_kitti_resnet50.tlt"
rpn_min_overlap: 0.3
rpn_max_overlap: 0.7
classifier_min_overlap: 0.0
classifier_max_overlap: 0.5
gt_as_roi: False
std_scaling: 1.0
classifier_regr_std {
key: 'x'
value: 10.0
}
classifier_regr_std {
key: 'y'
value: 10.0
}
classifier_regr_std {
key: 'w'
value: 5.0
}
classifier_regr_std {
key: 'h'
value: 5.0
}

rpn_mini_batch: 256
rpn_pre_nms_top_N: 12000
rpn_nms_max_boxes: 2000
rpn_nms_overlap_threshold: 0.7

reg_config {
reg_type: 'L2'
weight_decay: 1e-4
}

optimizer {
adam {
lr: 0.00001
beta_1: 0.9
beta_2: 0.999
decay: 0.0
}
}

lr_scheduler {
step {
base_lr: 0.00001
gamma: 1.0
step_size: 30
}
}

lambda_rpn_regr: 1.0
lambda_rpn_class: 1.0
lambda_cls_regr: 1.0
lambda_cls_class: 1.0

inference_config {
images_dir: '/workspace/tlt-experiments/data/testing/image_2'
model: '/workspace/tlt-experiments/data/faster_rcnn/frcnn_kitti_resnet50.epoch200.tlt'
detection_image_output_dir: '/workspace/tlt-experiments/data/faster_rcnn/inference_results_imgs'
labels_dump_dir: '/workspace/tlt-experiments/data/faster_rcnn/inference_dump_labels'
rpn_pre_nms_top_N: 6000
rpn_nms_max_boxes: 300
rpn_nms_overlap_threshold: 0.7
bbox_visualize_threshold: 0.6
classifier_nms_max_boxes: 300
classifier_nms_overlap_threshold: 0.3
}

evaluation_config {
model: '/workspace/tlt-experiments/data/faster_rcnn/frcnn_kitti_resnet50.epoch200.tlt'
labels_dump_dir: '/workspace/tlt-experiments/data/faster_rcnn/test_dump_labels'
rpn_pre_nms_top_N: 6000
rpn_nms_max_boxes: 300
rpn_nms_overlap_threshold: 0.7
classifier_nms_max_boxes: 300
classifier_nms_overlap_threshold: 0.3
object_confidence_thres: 0.0001
use_voc07_11point_metric: False
}

}
Any idea? Thanks.


It is not harmful. Please ignore it.

OK, thank you.

Hi Morganh,
After continuing training, I got the following error:
Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "./common/magnet_train.py", line 33, in main
File "./faster_rcnn/scripts/train.py", line 60, in main
File "./faster_rcnn/models/model_builder.py", line 518, in train
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1039, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
outs = f(ins)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2671, in _call
session)
File "/usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py", line 2623, in _make_callable
callable_fn = session._make_callable_from_options(callable_opts)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1471, in _make_callable_from_options
return BaseSession._Callable(self, callable_options)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1425, in __init__
session._session, options_ptr, status)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [32,256,24,78] and type float
[[{{node training/Adam/gradients/zeros_79}}]]
Exception tensorflow.python.framework.errors_impl.InvalidArgumentError: InvalidArgumentError() in <bound method _Callable.__del__ of <tensorflow.python.client.session._Callable object at 0x7efd28ee6310>> ignored

How can this be resolved?

This is an OOM (out-of-memory) issue. Please try a lower batch size (batch_size_per_gpu).
Or refer to this related topic: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at pack_op.cc:88 : Resource exhausted: OOM when allocating tensor with shape[32,3,2160,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
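To see why lowering the batch size helps: the tensor named in the traceback, shape [32, 256, 24, 78], has the batch dimension first, so its memory footprint shrinks linearly with batch_size_per_gpu. A rough back-of-the-envelope sketch (assuming 4 bytes per element, i.e. float32, which is what "type float" in the traceback means):

```python
def tensor_bytes(shape, bytes_per_elem=4):
    """Memory footprint of a dense float32 tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem

# Tensor from the traceback: [batch, channels, height, width]
print(tensor_bytes([32, 256, 24, 78]) / 2**20)  # 58.5 MiB at batch size 32
print(tensor_bytes([16, 256, 24, 78]) / 2**20)  # 29.25 MiB at batch size 16
```

One such tensor is small, but during training dozens of intermediate activations plus their Adam gradient buffers are live at once, so with full KITTI-resolution inputs (1242x375) a batch size of 32 adds up quickly. Dropping batch_size_per_gpu to 8 or 4 in the spec is usually the quickest fix.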