TAO yolov4_tiny training sub-task crashes after a number of epochs

user14171 · September 1, 2022, 5:39am

• Hardware: 3090
• Network Type: Yolo_v4_tiny

Hey,
I’m trying to train a model for inference based on Yolo_v4_tiny. Training stops after a small number of epochs (see TAO’s training output attached below).

One major change I made to the spec file: because the training images are HD resolution, I changed the output width and height in the ‘augmentation’ section to 1920x1024.

What could be the reason it crashes every time?

TAOoutput.txt (70.8 KB)

Morganh · September 1, 2022, 6:00am

Please share the training spec file.

user14171 · September 1, 2022, 6:19am

attached

yolo_v4_tiny_train.txt (2.4 KB)

Morganh · September 1, 2022, 6:29am

Please check whether the anchor shapes are correct for 1920x1024. If they are not, use kmeans to generate new ones.

Also, you can set a lower output_width and output_height, but please make sure they are multiples of 32. The anchor shapes need to match the new size as well.

Lastly, try setting randomize_input_shape_period: 0
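
As a rough illustration of the two checks above, here is a minimal Python sketch. It is not part of TAO: the helper names and the example anchors are hypothetical, and it assumes anchor shapes are (width, height) pairs in pixels at the network input size. Linear rescaling only approximates what a fresh kmeans run at the new size would give, so treat it as a sanity check rather than a replacement.

```python
def is_multiple_of_32(value: int) -> bool:
    """True when a network input dimension is a positive multiple of 32."""
    return value > 0 and value % 32 == 0


def scale_anchors(anchors, src_size, dst_size):
    """Linearly rescale (width, height) anchors from src_size to dst_size.

    Both sizes are (width, height) tuples in pixels.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    return [
        (round(w * dst_w / src_w, 2), round(h * dst_h / src_h, 2))
        for w, h in anchors
    ]


# Hypothetical anchors for illustration only (not taken from the spec file).
example_anchors = [(30.0, 54.0), (96.0, 108.0), (300.0, 270.0)]

# Check a few candidate output sizes.
for width, height in [(1920, 1024), (1920, 1080), (640, 480)]:
    ok = is_multiple_of_32(width) and is_multiple_of_32(height)
    print(f"{width}x{height}: multiples of 32 = {ok}")

# Anchors computed at 1920x1080, rescaled for a 640x480 network input.
scaled_anchors = scale_anchors(example_anchors, (1920, 1080), (640, 480))
print(scaled_anchors)
```

Note that 1080 is not a multiple of 32, which is why an output height such as 1024 is used instead.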

user14171 · September 1, 2022, 7:25am

Regarding the anchors: I generated them initially, as the script suggests, using the original image size (1920x1080). But suppose I change the output size in the augmentation section, for example to 640x480. Could this cause a problem during training?

Morganh · September 1, 2022, 7:32am

I am afraid the crash is due to OOM (out of memory).
So, try training with a lower width/height to check whether it works.

user14171 · September 1, 2022, 7:52am

In the augmentation settings, can I control the number of new images it creates?

Morganh · September 1, 2022, 7:56am

You can change the setting of augmentation. See YOLOv4-tiny — TAO Toolkit 3.22.05 documentation

In your error log, I see “batch size per gpu: 20”.

Please set a lower batch size as well.
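
To see why both the batch size and the resolution matter for memory, here is a back-of-the-envelope Python sketch. It only counts the float32 input tensor for one batch, assuming 3-channel images; real GPU usage is much larger because of activations, gradients, and optimizer state, but those grow roughly in proportion to the same product. The function names are hypothetical, and the smaller configuration (batch size 8 at 640x480) is chosen purely for illustration.

```python
def input_elements(batch_size: int, width: int, height: int, channels: int = 3) -> int:
    """Number of values in one input batch of shape (batch, channels, height, width)."""
    return batch_size * channels * height * width


def float32_mib(elements: int) -> float:
    """Size in MiB of that many float32 values (4 bytes each)."""
    return elements * 4 / (1024 ** 2)


# Configuration reported in the error log: batch size 20 at 1920x1024.
logged = input_elements(batch_size=20, width=1920, height=1024)

# A smaller illustrative configuration: batch size 8 at 640x480.
reduced = input_elements(batch_size=8, width=640, height=480)

print(f"logged input batch:  {float32_mib(logged):.1f} MiB")
print(f"reduced input batch: {float32_mib(reduced):.1f} MiB")
print(f"ratio: {logged / reduced:.1f}x")
```

Under these assumptions the logged configuration pushes 16 times as many input values per step as the reduced one, which is consistent with lowering both settings when chasing an OOM crash.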

user14171 · September 1, 2022, 8:14am

I’m not sure why that happened, because in the spec file I specified batch_size_per_gpu: 8

BTW, in the tfrecord file, I didn’t understand what the preferred values are if my dataset contains 1000 images:
num_partitions: 2
val_split: 10
num_shards: 4
Is that OK?

Morganh · September 1, 2022, 8:18am

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Those values are OK. For the meaning of each field, please refer to DetectNet_v2 — TAO Toolkit 3.22.05 documentation and Data Annotation Format — TAO Toolkit 3.22.05 documentation
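
For a rough feel of what those values mean for a 1000-image dataset, here is a small Python sketch. It rests on assumptions that should be checked against the linked documentation: that val_split is the percentage of images held out for validation, and that num_shards is the number of shard files each split is divided into. The function name is hypothetical, and num_partitions is left out because its effect depends on the partition mode.

```python
def split_counts(num_images: int, val_split_percent: int, num_shards: int) -> dict:
    """Estimate train/validation sizes and average images per shard.

    Assumes val_split_percent is a percentage of the whole dataset and that
    each split is spread evenly across num_shards files.
    """
    val = num_images * val_split_percent // 100
    train = num_images - val
    return {
        "train": train,
        "val": val,
        "train_per_shard": train / num_shards,
        "val_per_shard": val / num_shards,
    }


# Values from the question: 1000 images, val_split: 10, num_shards: 4.
counts = split_counts(num_images=1000, val_split_percent=10, num_shards=4)
print(counts)
```

Under these assumptions, about 100 images go to validation and 900 to training, which is a reasonable split for a dataset of this size.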

system · September 27, 2022, 1:41am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.