YOLO v4 with MobileNet V2

Hi, I trained YOLO v4 with ResNet and MobileNet V1 successfully, but when I change to MobileNet V2, it gives me this error:
"
Traceback (most recent call last):
File “/home/obaba/.cache/dazel/_dazel_obaba/e56ee0dba0ec09ac4333617b53ded644/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 209, in <module>
File “/home/obaba/.cache/dazel/_dazel_obaba/e56ee0dba0ec09ac4333617b53ded644/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 205, in main
File “/home/obaba/.cache/dazel/_dazel_obaba/e56ee0dba0ec09ac4333617b53ded644/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 162, in run_experiment
File “/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py”, line 91, in wrapper
return func(*args, **kwargs)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1418, in fit_generator
initial_epoch=initial_epoch)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py”, line 217, in fit_generator
class_weight=class_weight)
File “/usr/local/lib/python3.6/dist-packages/keras/engine/training.py”, line 1217, in train_on_batch
outputs = self.train_function(ins)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2715, in __call__
return self._call(inputs)
File “/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py”, line 2675, in _call
fetched = self._callable_fn(*array_vals)
File “/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py”, line 1472, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[10,32,144,180] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node yolo_conv5_6_bn_1/FusedBatchNormV3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[loss_1/add_52/_6167]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[10,32,144,180] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node yolo_conv5_6_bn_1/FusedBatchNormV3}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
File “/usr/local/bin/yolo_v4”, line 8, in <module>
sys.exit(main())
File “/home/obaba/.cache/dazel/_dazel_obaba/e56ee0dba0ec09ac4333617b53ded644/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py”, line 12, in main
File “/home/obaba/.cache/dazel/_dazel_obaba/e56ee0dba0ec09ac4333617b53ded644/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py”, line 286, in launch_job
AssertionError: Process run failed.
"
How should I fix this? Thanks!
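
For context, the "Hint" lines in the log refer to TensorFlow's RunOptions; I do not see a way to pass that flag through the yolo_v4 entrypoint, but as far as I understand, in plain TF 1.x it would look roughly like the standalone sketch below (the toy graph is just a placeholder, not TLT code):

import tensorflow as tf

# Standalone sketch of the hint in the OOM message (not part of the TLT CLI):
# setting report_tensor_allocations_upon_oom in RunOptions makes TensorFlow
# print the live tensor allocations if an OOM happens during session.run().
run_opts = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)

# Toy graph, only to show where the options argument goes.
x = tf.compat.v1.placeholder(tf.float32, shape=[None, 3], name="x")
y = tf.reduce_sum(tf.square(x), axis=1)

with tf.compat.v1.Session() as sess:
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}, options=run_opts))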

And here is the config file:

random_seed: 40
yolov4_config {
  big_anchor_shape: "[(319.86, 318.67), (247.80, 236.47), (221.25, 183.36)]"
  mid_anchor_shape: "[(184.58, 214.97), (158.03, 171.98), (177.00, 142.89)]"
  small_anchor_shape: "[(134.01, 130.25), (112.52, 98.63), (63.21, 45.52)]"
  box_matching_iou: 0.25
  arch: "mobilenet_v2"
  #nlayers: 10
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 10
  num_epochs: 120
  enable_qat: false
  checkpoint_interval: 8
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tlt-experiments/yolo_v4/pretrained_mobilenet_v2/tlt_pretrained_object_detection_vmobilenet_v2/mobilenet_v2.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 16
  matching_iou_threshold: 0.35
}
nms_config {
  confidence_threshold: 0.7
  clustering_iou_threshold: 0.35
  top_k: 20
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1440
  output_height: 1152
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/training/label_2"
    image_directory_path: "/workspace/tlt-experiments/data/training/image_2"
  }
  include_difficult_in_training: true
  target_class_mapping {
    key: "bud"
    value: "bud"
  }
  target_class_mapping {
    key: "ignore"
    value: "ignore"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/val/label"
    image_directory_path: "/workspace/tlt-experiments/data/val/image"
  }
}
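
One more observation: in the OOM shape [10, 32, 144, 180], the leading 10 is my batch_size_per_gpu and 144x180 is 1152x1440 at stride 8, so I assume lowering the batch size or the output resolution would at least work around the OOM, for example (values only illustrative; I believe YOLO v4 expects output_width/output_height to stay multiples of 32):

training_config {
  # other fields as above
  batch_size_per_gpu: 4   # illustrative, reduced from 10
}
augmentation_config {
  # other fields as above
  output_width: 960       # illustrative, reduced from 1440
  output_height: 576      # illustrative, reduced from 1152
}

Still, ResNet and MobileNet V1 trained without OOM, so I would also like to understand why only MobileNet V2 runs out of memory.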

The current TLT (2.0_py3) version does not support the yolo_v4 network.
May I know which TLT docker you ran?

Hello. I am using the latest TLT version, yuw-v2. It was released last week.

The yuw-v2 version is not an official TLT release. Please ignore it.

Thanks! When will the next official version be released?

TLT 3.0-dp was released in February 2021.