High RAM usage with TLT ResNet

I am trying to train on my custom dataset with the Transfer Learning Toolkit (ResNet backbone), but RAM fills up completely and the training gets killed. I was initially training with 19k images, but even after reducing the dataset to around 3k images the problem persists. Training is also very slow. I am using an RTX A5000 with 24 GB of GPU memory and 32 GB of system RAM.

Could you please share the training spec file?

random_seed: 42
dataset_config {
data_sources {
tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/"
image_directory_path: "/workspace/tlt-experiments/dataset"
}
image_extension: "jpg"
target_class_mapping {
key: "person"
value: "person"
}
target_class_mapping {
key: "head"
value: "head"
}
validation_fold: 0
#validation_data_source: {
#tfrecords_path: "/home/data/tfrecords/kitti_val/"
#image_directory_path: "/home/data/test"
#}
}

augmentation_config {
preprocessing {
output_image_width: 640
output_image_height: 640
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.0
vflip_probability: 0.0
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 0
translate_max_y: 0
}
color_augmentation {
hue_rotation_max: 0.0
saturation_shift_max: 0.0
contrast_scale_max: 0
contrast_center: 0.5
}
}

postprocessing_config {
target_class_config {
key: "person"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
target_class_config {
key: "head"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.15000000596
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 20
}
}
}
}

model_config {
pretrained_model_file: "/workspace/tlt-experiments/resnet_10.hdf5"
num_layers: 10
freeze_blocks: 0
freeze_blocks: 1
all_projections: True
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
training_precision {
backend_floatx: FLOAT32
}
arch: "resnet"
}

evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 1
minimum_detection_ground_truth_overlap {
key: "person"
value: 0.5
}
minimum_detection_ground_truth_overlap {
key: "head"
value: 0.5
}
evaluation_box_config {
key: "person"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
evaluation_box_config {
key: "head"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
average_precision_mode: INTEGRATE
}

cost_function_config {
target_classes {
name: "person"
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: “cov”
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: “bbox”
initial_weight: 10.0
weight_target: 10.0
}
}
target_classes {
name: "head"
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: “cov”
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: “bbox”
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
}

training_config {
batch_size_per_gpu: 1
num_epochs: 80

learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-06
max_learning_rate: 5e-04
soft_start: 0.10000000149
annealing: 0.699999988079
}
}
regularizer {
type: L1
weight: 3.00000002618e-09
}
optimizer {
adam {
epsilon: 9.99999993923e-09
beta1: 0.899999976158
beta2: 0.999000012875
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
checkpoint_interval: 10
}

bbox_rasterizer_config {
target_class_config {
key: "person"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
}
}
target_class_config {
key: "head"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.400000154972
}

The same problem occurs with yolo_v4_tiny. It consumes 32 GB of RAM and around 6-10 GB of GPU memory for batch size 1. Training is also very slow because the GPU is not fully utilised: utilisation sometimes touches 50-60% but stays at around 20% most of the time. Sharing the spec file for that as well.
random_seed: 42
yolov4_config {

big_anchor_shape: "[(36.00, 113.01),(57.60, 220.69),(113.80, 378.58)]"
mid_anchor_shape: "[(14.40, 32.91),(27.00, 48.00),(18.22, 74.67)]"
box_matching_iou: 0.25
matching_neutral_box_iou: 0.5
arch: "cspdarknet_tiny"
loss_loc_weight: 1.0
loss_neg_obj_weights: 1.0
loss_class_weights: 1.0
label_smoothing: 0.0
big_grid_xy_extend: 0.05
mid_grid_xy_extend: 0.05
freeze_bn: false
#freeze_blocks: 0
force_relu: false
}
training_config {
batch_size_per_gpu: 1
num_epochs: 80
enable_qat: true
checkpoint_interval: 10
use_multiprocessing: true
n_workers: 12
max_queue_size: 4
learning_rate {
soft_start_cosine_annealing_schedule {
min_learning_rate: 1e-7
max_learning_rate: 1e-4
soft_start: 0.3
}
}
regularizer {
type: L1
weight: 3e-5
}
optimizer {
adam {
epsilon: 1e-7
beta1: 0.9
beta2: 0.999
amsgrad: false
}
}
pretrain_model_path: "/workspace/tao-experiments/yolo_v4_tiny/pretrained_cspdarknet_tiny/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
}
eval_config {
average_precision_mode: SAMPLE
batch_size: 8
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.5
force_on_cpu: true
top_k: 200
}
augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0
horizontal_flip: 0.5
jitter: 0.3
output_width: 640
output_height: 640
output_channel: 3
randomize_input_shape_period: 10
mosaic_prob: 0.5
mosaic_min_ratio: 0.2
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/train*"
image_directory_path: "/workspace/tao-experiments/data/train"
}
include_difficult_in_training: true
image_extension: "jpg"
target_class_mapping {
key: "person"
value: "person"
}
target_class_mapping {
key: "head"
value: "head"
}

validation_data_sources: {
tfrecords_path: "/workspace/tao-experiments/data/val/tfrecords/val*"
image_directory_path: "/workspace/tao-experiments/data/val"
}
}

It does not make sense that the training gets killed while running against 3k 640x640 images. Can you share the output of "$ nvidia-smi" before training and during training?
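For example, something like this is enough (a minimal sketch; the -l flag makes nvidia-smi repeat the query every N seconds):

$ nvidia-smi
$ nvidia-smi -l 5 > nvidia-smi-during-training.log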

That RAM usage was solved by removing force_on_cpu: true, but model training is still slow on the A5000. For 15k training images with batch size 32 it shows 1.40 hr to complete one epoch. GPU utilisation is very low.
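For reference, that leaves the nms_config as roughly:

nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.5
top_k: 200
}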

Which network did you use to train?

yolo_v4_tiny with cspdarknet_tiny as the backbone.

Also, last night I left the model training, but when I checked back it was stuck at the 7th epoch. GPU memory and RAM were still available.

You can check the timing from how long the Jupyter notebook has been running. Please help me quickly.

Can you share the latest spec file?

random_seed: 42
yolov4_config {
big_anchor_shape: "[(36.00, 113.01),(57.60, 220.69),(113.80, 378.58)]"
mid_anchor_shape: "[(14.40, 32.91),(27.00, 48.00),(18.22, 74.67)]"
box_matching_iou: 0.25
matching_neutral_box_iou: 0.5
arch: "cspdarknet_tiny"
loss_loc_weight: 1.0
loss_neg_obj_weights: 1.0
loss_class_weights: 1.0
label_smoothing: 0.0
big_grid_xy_extend: 0.05
mid_grid_xy_extend: 0.05
freeze_bn: false
#freeze_blocks: 0
force_relu: false
}
training_config {
batch_size_per_gpu: 16
num_epochs: 80
enable_qat: true
checkpoint_interval: 1
use_multiprocessing: true
n_workers: 20
max_queue_size: 4
learning_rate {
soft_start_cosine_annealing_schedule {
min_learning_rate: 1e-7
max_learning_rate: 1e-4
soft_start: 0.3
}
}
regularizer {
type: L1
weight: 3e-5
}
optimizer {
adam {
epsilon: 1e-7
beta1: 0.9
beta2: 0.999
amsgrad: false
}
}
resume_model_path: "/workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned/weights/yolov4_cspdarknet_tiny_epoch_006.tlt"
}
eval_config {
average_precision_mode: SAMPLE
batch_size: 8
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.5
top_k: 200
force_on_cpu: true
}
augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0
horizontal_flip: 0.5
jitter: 0.3
output_width: 1248
output_height: 384
output_channel: 3
randomize_input_shape_period: 10
mosaic_prob: 0.5
mosaic_min_ratio: 0.2
}
dataset_config {
data_sources: {
label_directory_path: "/workspace/tao-experiments/data/train/labels"
image_directory_path: "/workspace/tao-experiments/data/train/images"
}
include_difficult_in_training: true
target_class_mapping {
key: "person"
value: "person"
}
target_class_mapping {
key: "head"
value: "head"
}

validation_data_sources: {
label_directory_path: "/workspace/tao-experiments/data/val/labels"
image_directory_path: "/workspace/tao-experiments/data/val/images"
}
}

Is the above expected? You want to train a model at 1248x384, right?

Yes, I can go with that too. But the thing is, the image size is not what is affecting the training time or causing the epoch to get stuck, since GPU memory and RAM were still available.

Did you ever run the default jupyter notebook with public KITTI dataset? Any similar behavior?

No, I did not try that. But I think just changing the data would not solve the problem.

Please try below.

  1. Use AMP, since your GPU (RTX A5000) supports it. See more in Optimizing the Training Pipeline — TAO Toolkit 3.21.11 documentation

  2. Use the tfrecord data loader. In that case, please disable mosaic. See YOLOv4-tiny - NVIDIA Docs

  3. Change

output_width: 1248
output_height: 384

to match the average resolution of the training images.

  4. Change
    checkpoint_interval: 1
    to
    checkpoint_interval: 5

  5. Run a baseline first. So, please delete

use_multiprocessing: true
n_workers: 20
max_queue_size: 4

And please run in the terminal first instead of the Jupyter notebook.
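For reference, a rough sketch of how those changes fit together (only an illustration: the tfrecords paths are taken from your earlier spec and may differ on your side, and the output resolution is left for you to fill in):

training_config {
batch_size_per_gpu: 16
num_epochs: 80
enable_qat: true
checkpoint_interval: 5
# use_multiprocessing, n_workers and max_queue_size removed for the baseline
...
}
augmentation_config {
# set output_width / output_height to roughly the average resolution of your images (multiples of 32)
mosaic_prob: 0.0
...
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/train*"
image_directory_path: "/workspace/tao-experiments/data/train"
}
...
}

Then launch from the terminal with AMP enabled, for example:

$ tao yolo_v4_tiny train -e <your_spec_file> -r <results_dir> -k <your_key> --gpus 1 --use_amp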

Okay, I will start from the baseline first and report back with the results.

Can you share the commands to run in the terminal? In the notebook the docker session mapping is handled for me; how would I do that in the terminal?

I think you have already set up ~/.tao_mounts.json correctly. Then you can run a similar command in the terminal, as below.
$ tao yolo_v4_tiny train xxx

Or you can also log in to the TAO docker directly and then run the commands there.
$ tao yolo_v4_tiny run /bin/bash
then
# yolo_v4_tiny train xxx
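For reference, a minimal ~/.tao_mounts.json typically looks roughly like this (the host-side path below is only an example; point it at wherever your data and specs actually live on the host):

{
    "Mounts": [
        {
            "source": "/home/<user>/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ]
}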

OK, so you are saying: download the KITTI dataset and put it in the desired folder, create the tfrecords with

!tao yolo_v4_tiny dataset_convert -d $SPECS_DIR/yolo_v4_tiny_tfrecords_kitti_train.txt -o $DATA_DOWNLOAD_DIR/training/tfrecords/train

and then run the training with

!tao yolo_v4_tiny train -e $SPECS_DIR/yolo_v4_tiny_train_kitti_seq.txt -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned -k $KEY --gpus 1 --use_amp