TAO yolov4_tiny training sub-task crashes after a number of epochs


• Hardware: 3090
• Network Type: Yolo_v4_tiny

I’m trying to train a model for inference based on Yolo_v4_tiny. Training stops after a small number of epochs (please see the attached TAO output from the training run).

One major change I made in the spec file: because the training is based on HD-resolution images, I changed the output width and height in the spec's ‘augmentation’ section to 1920x1024.

What could be the reason it crashes every time?

TAOoutput.txt (70.8 KB)

Please share the training spec file.


yolo_v4_tiny_train.txt (2.4 KB)

Please check whether the anchor shapes are correct for 1920x1024. If not, you need to use kmeans to generate new ones.

Also, you can set a lower output_width and output_height, but please make sure they are multiples of 32. You will also need to set the correct anchor shapes for the new size.

Finally, try setting randomize_input_shape_period: 0
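To make the multiple-of-32 check concrete, here is a minimal Python sketch. The helper names `is_valid_dims` and `snap_down` are made up for illustration and are not part of TAO; the sizes are the ones mentioned in this thread.

```python
def is_valid_dims(width: int, height: int, base: int = 32) -> bool:
    """Return True when both dimensions are multiples of `base`."""
    return width % base == 0 and height % base == 0


def snap_down(value: int, base: int = 32) -> int:
    """Round `value` down to the nearest multiple of `base` (at least `base`)."""
    return max(base, (value // base) * base)


if __name__ == "__main__":
    # Sizes discussed in this thread: original images, current spec, smaller trial.
    for w, h in [(1920, 1080), (1920, 1024), (640, 480)]:
        status = "ok" if is_valid_dims(w, h) else "not a multiple of 32"
        print(f"{w}x{h}: {status}; snapped: {snap_down(w)}x{snap_down(h)}")
```

Under this check 1920x1024 and 640x480 both pass, while the raw 1920x1080 image size does not, so it cannot be used directly as output_width/output_height.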

Regarding the anchors: I created them at first, as suggested by the script, using the original image size (1920x1080). But let’s say I change the output size in the augmentation settings, for example to 640x480. Could this create a problem during the training phase?

I am afraid it is due to OOM (out of memory).
So, you can try training with a lower width/height to check whether it works.
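On the anchor question: anchors generated at one resolution will not match box sizes at another, which is why regenerating them was suggested earlier in the thread. As a rough illustration only, the sketch below rescales anchor (width, height) pairs from one size to another. It assumes images are resized to the output size without preserving aspect ratio, and the anchor values are invented for the example. Re-running kmeans for the new size remains the safer route.

```python
def rescale_anchors(anchors, src_size, dst_size):
    """Scale (width, height) anchors from src_size to dst_size.

    Assumption: the whole image is stretched to dst_size, so widths scale by
    dst_w / src_w and heights by dst_h / src_h.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return [
        (round(w * dst_w / src_w, 2), round(h * dst_h / src_h, 2))
        for w, h in anchors
    ]


if __name__ == "__main__":
    # Hypothetical anchors computed on 1920x1080 images (not from the real spec).
    anchors_hd = [(60, 90), (120, 180), (300, 270)]
    print(rescale_anchors(anchors_hd, (1920, 1080), (640, 480)))
```

If the rescaled shapes differ a lot from the ones in the spec, the anchors in the spec no longer describe the objects at the training resolution.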

In the augmentation settings, can I control the number of new images it creates?

You can change the augmentation settings. See YOLOv4-tiny — TAO Toolkit 3.22.05 documentation

In your error log, I see “batch size per gpu: 20”.

Please set a lower batch size as well.
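As a back-of-the-envelope way to compare settings, the sketch below assumes that activation memory grows roughly in proportion to batch size times input pixels. That is only an approximation and not a measurement of actual GPU usage; `relative_load` is a made-up helper name.

```python
def relative_load(batch_size: int, width: int, height: int) -> int:
    """Pixels processed per step on one GPU; a crude proxy for activation memory."""
    return batch_size * width * height


if __name__ == "__main__":
    # Baseline: the values seen in the log (batch 20) at the 1920x1024 spec size.
    baseline = relative_load(20, 1920, 1024)
    candidates = [(20, 1920, 1024), (8, 1920, 1024), (8, 640, 480)]
    for batch, w, h in candidates:
        ratio = relative_load(batch, w, h) / baseline
        print(f"batch={batch} {w}x{h}: {ratio:.4f} of baseline")
```

Under this assumption, dropping to batch size 8 at the same resolution cuts the per-step load to about 40% of the logged configuration, and 640x480 at batch size 8 to about 6%.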

I'm not sure why that happened, because in the spec file I specified batch_size_per_gpu: 8

By the way, in the tfrecord spec file, I didn’t understand what the preferred values are if my dataset contains 1000 images:
num_partitions: 2
val_split: 10
num_shards: 4
Is that OK?

There has been no update from you for a while, so we are assuming this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.

It is OK. For the meaning of these fields, please refer to DetectNet_v2 — TAO Toolkit 3.22.05 documentation and Data Annotation Format — TAO Toolkit 3.22.05 documentation
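For a rough feel of what those values mean for 1000 images, the sketch below assumes val_split is a percentage of the dataset held out for validation and that each split is written into num_shards files of roughly equal size. The exact rounding and shard assignment done by the converter may differ; `split_counts` is a made-up helper.

```python
def split_counts(num_images: int, val_split_percent: int, num_shards: int) -> dict:
    """Estimate train/val sizes and images per shard under the stated assumptions."""
    val = num_images * val_split_percent // 100
    train = num_images - val
    return {
        "val": val,
        "train": train,
        "val_per_shard": val / num_shards,
        "train_per_shard": train / num_shards,
    }


if __name__ == "__main__":
    # Values from the question: 1000 images, val_split: 10, num_shards: 4.
    print(split_counts(1000, 10, 4))
```

With the values from the question this gives about 100 validation and 900 training images, that is around 25 and 225 images per shard.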
