YOLO v4 failing during evaluation phase after first epoch

For YOLO v4 on TLT 3.0, I have the model configured to checkpoint and run evaluation on every epoch. After the first epoch, the run errors out once evaluation completes but before the resulting metrics are displayed. Here is the error encountered:

5744/5744 [==============================] - 9360s 2s/step - loss: 14.4160

Epoch 00001: saving model to /workspace/TLT/T_3/models/weights/yolov4_resnet34_epoch_001.tlt
Producing predictions: 100%|████████████████| 1436/1436 [14:02<00:00, 1.71it/s]
Killed
Traceback (most recent call last):
File "/usr/local/bin/yolo_v4", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-06-15 00:34:50,257 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Below is the spec file used for this experiment:

random_seed: 42
yolov4_config {
big_anchor_shape: "[(255.00, 234.73), (225.00, 110.81), (114.00, 181.33)]"
mid_anchor_shape: "[(130.00, 77.57), (67.00, 126.93), (82.00, 56.41)]"
small_anchor_shape: "[(47.00, 72.53), (54.00, 41.30), (34.00, 34.25)]"
box_matching_iou: 0.25
arch: "resnet"
nlayers: 34
arch_conv_blocks: 2
loss_loc_weight: 0.8
loss_neg_obj_weights: 100.0
loss_class_weights: 0.5
label_smoothing: 0.0
big_grid_xy_extend: 0.05
mid_grid_xy_extend: 0.1
small_grid_xy_extend: 0.2
freeze_bn: false
#freeze_blocks: 0
force_relu: false
}
training_config {
batch_size_per_gpu: 4
num_epochs: 20
enable_qat: true
checkpoint_interval: 1
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-5
max_learning_rate: 2e-2
soft_start: 0.15
annealing: 0.8
}
}
regularizer {
type: L1
weight: 3e-5
}
optimizer {
adam {
epsilon: 1e-7
beta1: 0.9
beta2: 0.999
amsgrad: false
}
}
pretrain_model_path: "/workspace/DAB/D_2/resnet_34.hdf5"
}
eval_config {
average_precision_mode: SAMPLE
batch_size: 4
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.5
top_k: 200
}
augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0
horizontal_flip: 0.5
jitter: 0.3
output_width: 1920
output_height: 1088
randomize_input_shape_period: 0
mosaic_prob: 0.5
mosaic_min_ratio: 0.2
}
dataset_config {
data_sources: {
label_directory_path: "/workspace/DAB/D_2/train/labels"
image_directory_path: "/workspace/DAB/D_2/train/images"
}
include_difficult_in_training: true
target_class_mapping {
key: "p_1"
value: "P"
}
target_class_mapping {
key: "p_2"
value: "P"
}
target_class_mapping {
key: "p_3"
value: "P"
}
target_class_mapping {
key: "p_4"
value: "P"
}
target_class_mapping {
key: "p_5"
value: "P"
}
target_class_mapping {
key: "p_6"
value: "P"
}
target_class_mapping {
key: "p_7"
value: "P"
}
target_class_mapping {
key: "p_8"
value: "P"
}
target_class_mapping {
key: "r_1"
value: "R"
}
target_class_mapping {
key: "r_2"
value: "R"
}
target_class_mapping {
key: "r_3"
value: "R"
}
target_class_mapping {
key: "r_4"
value: "R"
}
target_class_mapping {
key: "r_5"
value: "R"
}
target_class_mapping {
key: "r_6"
value: "R"
}
target_class_mapping {
key: "r_7"
value: "R"
}
target_class_mapping {
key: "r_8"
value: "R"
}
validation_data_sources: {
label_directory_path: "/workspace/DAB/D_2/val/labels"
image_directory_path: "/workspace/DAB/D_2/val/images"
}
}
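
For completeness, the job was launched through the TLT 3.0 launcher with a command along these lines (the spec path, key, and GPU count are placeholders, not my exact values; the results directory is inferred from the checkpoint path in the log above):

tlt yolo_v4 train --gpus 1 -e <path_to_spec_above>.txt -r /workspace/TLT/T_3/models -k <encryption_key>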

As stated, the error does not occur during training or during the evaluation pass itself, but immediately after evaluation finishes, before the per-class AP and validation loss are printed.

It seems the evaluation process is being killed due to OOM.
Please try decreasing the evaluation batch size and re-run the evaluation.
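
For example, something along these lines in the spec (batch_size of 1 is just an illustrative value; the other fields are copied from your eval_config) should lower the peak memory of the prediction pass:

eval_config {
average_precision_mode: SAMPLE
batch_size: 1
matching_iou_threshold: 0.5
}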

This was correct, though it was my DDR4 system RAM rather than my GPU VRAM that hit the limit. There is a significant jump in RAM usage at the very end of the validation pass (I assume in order to write out the model). Increasing the dedicated RAM solved the issue.
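
For anyone who hits the same bare "Killed" line: one way to confirm it is the host OOM killer rather than a GPU memory error is to watch system RAM during evaluation and check the kernel log after the process dies. A rough sketch (exact kernel log wording varies by kernel/distro):

# monitor system RAM on the host while evaluation is running
watch -n 5 free -h
# after the process is killed, look for the kernel OOM killer entry
dmesg -T | grep -i -E 'killed process|out of memory'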