Very high loss while training TAO yolov4

• Hardware dGPU
• Network Type Yolo_v4
• TLT Version 3.22.05
• Training spec file
d17-jul24-yolov4-960-544-config.txt (3.3 KB)

My training dataset contains around 5200 images.
The training is still running as I ask this question. The training loss was 1286807.1 after the 1st epoch and dropped to 693339.94 after the 2nd epoch. At the 60th epoch the training loss is 633260.7 and the mAP is 0.25.

My question is: why is the loss value so high, and why does it stay high? I am using the openimages pretrained weights file from NGC, which should enable transfer learning, so I expected the mAP to increase and the loss to decrease within the first few epochs. Why is this not the case? For reference, I had previously trained a TAO yolov3 model for 80 epochs using the same dataset and the same pretrained weights. There the training loss was under 20.0 after the first epoch, the mAP was above 0.25 after the 10th epoch, and the run ended with a training loss under 1.0 and a mAP of ~0.6.

Does this have something to do with the architecture difference between the two models? Does yolov4 require a larger batch size and a larger dataset in order to achieve these results?

Please use 4.0.1 version along with its spec file.

Can you kindly elaborate on the difference between running 4.0.1 and 3.22.05?

In short, in versions 4.0.0 and 4.0.1 we fixed issues in the yolov4 structure, the loss function, etc. to improve the mAP.

You can find some differences in the spec file. For example, in the 4.0.x version:
loss_loc_weight: 1
loss_neg_obj_weights: 1
loss_class_weights: 1


I can see that TAO Toolkit 5.0 has been released. Do you recommend this version? Also, the latest release in version 4 is 4.0.2. If I have to use version 4, does it make sense to use 4.0.2, or should I go with 4.0.1?

Thanks

Yes, you can use 5.0. For yolov4, there is not much difference between 5.0, 4.0.1 and 4.0.2.

I have set up TAO Toolkit 4.0.1 to train the yolo_v4 model. The spec files in the documentation for 3.22.05 and 4.0.1 look identical.

Are you saying that the following are the recommended values for training on 4.0.1?

loss_loc_weight: 1
loss_neg_obj_weights: 1
loss_class_weights: 1

Yes, you can refer to the specs in TAO Toolkit Getting Started | NVIDIA NGC
wget --content-disposition 'https://api.ngc.nvidia.com/v2/resources/nvidia/tao/tao-getting-started/versions/4.0.2/files/notebooks/tao_launcher_starter_kit/yolo_v4/specs/yolo_v4_train_resnet18_kitti.txt'
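
To compare your own spec against the three loss weights quoted in this thread, a small check along these lines might help. It is only a sketch: the helper names read_loss_weights and fields_not_equal_to are made up for illustration, the sample text is invented rather than copied from any released spec, and the script only scans the text for "name: value" lines instead of parsing the spec format properly.

```python
import re

# Field names as quoted in this thread. The expected value 1 is the
# 4.0.x setting mentioned above; treat it as an assumption for your setup.
LOSS_FIELDS = ("loss_loc_weight", "loss_neg_obj_weights", "loss_class_weights")

NUMBER = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"


def read_loss_weights(spec_text):
    """Return {field: float} for each loss weight field found in the text."""
    found = {}
    for name in LOSS_FIELDS:
        match = re.search(rf"^\s*{name}\s*:\s*({NUMBER})", spec_text, re.MULTILINE)
        if match:
            found[name] = float(match.group(1))
    return found


def fields_not_equal_to(spec_text, expected=1.0):
    """Return the fields that are missing (None) or differ from `expected`."""
    weights = read_loss_weights(spec_text)
    return {
        name: weights.get(name)
        for name in LOSS_FIELDS
        if weights.get(name) != expected
    }


# Made-up sample text, for illustration only.
sample = """
yolov4_config {
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 100.0
  loss_class_weights: 1
}
"""

print(fields_not_equal_to(sample))  # {'loss_neg_obj_weights': 100.0}
```

To run it on a real file, read the spec with open(path).read() and pass the text to fields_not_equal_to. An empty result means all three weights are present and set to 1.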
