Faster-RCNN Loss increases after annealing/decreasing learning rate

I’m doing object detection training with Faster RCNN, and the loss increases at the end of training before the job shuts down by itself. I’m running the training on a single RTX 3060 Ti.
Here is the training log:
logs-resnet18bis.txt (552.0 KB)
and here is my spec file:
default_spec_resnet18_custom.txt (4.4 KB)
I’m confused, since this should only happen if the learning rate is too high, but the rate is already decreasing because we have passed the annealing points at 0.8/0.9.
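For reference, my understanding of how the schedule behaves, as a minimal sketch (parameter names like `start_lr`, `base_lr`, and `annealing_points` follow the spec file; the linear shape of the warm-up is an assumption on my part):

```python
def soft_start_annealing_lr(progress, start_lr=1e-4, base_lr=1e-3,
                            soft_start=0.3, annealing_points=(0.8, 0.9),
                            annealing_divider=10.0):
    """Return the learning rate at `progress` in [0, 1] of training.

    - Ramp from start_lr up to base_lr during the soft-start phase.
    - Hold base_lr, then divide it by `annealing_divider` at each
      annealing point (0.8 and 0.9 of total training here).
    """
    if progress < soft_start:
        # Linear warm-up (the real schedule may ramp exponentially;
        # linear keeps the sketch simple).
        return start_lr + (base_lr - start_lr) * progress / soft_start
    lr = base_lr
    for point in annealing_points:
        if progress >= point:
            lr /= annealing_divider
    return lr
```

So after 90% of training the rate should already be base_lr / 100, which is why a rising loss at that point surprises me.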

Thanks in advance for your help.

Could you try fine-tuning the learning rate and running more experiments?

Should I increase or decrease it first?

Please try increasing it first.

It crashes immediately if I multiply the base and start learning rates by 10. Now I’m trying to divide them by 10 instead.
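For clarity, this is the part of the spec I’m changing (a sketch in the TAO Faster RCNN spec format; the values here are illustrative, not my exact file):

```
learning_rate {
  soft_start {
    base_lr: 0.0001        # base learning rate, here divided by 10
    start_lr: 0.00001      # warm-up starting rate, also divided by 10
    soft_start: 0.3
    annealing_points: 0.8
    annealing_points: 0.9
    annealing_divider: 10.0
  }
}
```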

Any update? Does it still crash?

If I tune the learning rate correctly (increasing or decreasing it depending on the backbone), the training no longer crashes. I guess that when the loss increases too much, the training shuts down by itself?

By the way, another error sometimes occurs during validation, here with ResNet-10:
default_spec_resnet10_custom.txt (4.2 KB)
output.txt (47.2 KB) (a grep filter prints the loss every 100 images)
Reducing the batch size doesn’t help.

This random issue during evaluation is caused by an illegal memory access in TensorFlow’s tf.image.combined_non_max_suppression op.
The internal team will work on it.