Low GPU usage with TensorFlow (RTX 3090)

Hi!

Spec:
Driver Version: 470.57.02, CUDA Version: 11.4
NVIDIA GeForce RTX 3090
TensorFlow 2.5.0, cuDNN 8202, using mixed_float16 training
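
For reference, mixed precision is enabled roughly like this (the tiny model here is just a placeholder to show the setup, not my actual network):

```python
import tensorflow as tf

# Set the policy before building the model: variables stay in float32,
# while compute (matmuls, convolutions) runs in float16.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Placeholder model just to illustrate the setup.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])

# With the mixed_float16 policy, compile() wraps the optimizer
# in a LossScaleOptimizer automatically (TF 2.4+).
model.compile(optimizer="adam", loss="mse")
```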

I upgraded from a 2080 Ti to a 3090 and noticed that my model's training speed barely increased.
I also noticed that GPU usage is lower than it was on the 2080 Ti: nvidia-smi reports between 40% and 80% utilization, and with very large batches the same code swings between 11% and 100%.

This is not due to data loading or CPU limitations: running the network repeatedly on a tensor already resident on the GPU gives the same low usage.
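
To rule that out, I benchmark roughly like this (the Keras ResNet50 and the input shape are just placeholders standing in for my actual network):

```python
import time
import tensorflow as tf

# Placeholder model; my real network is the modified TTFNet described below.
model = tf.keras.applications.ResNet50(weights=None)

# The input tensor is created once and stays resident on the GPU.
x = tf.random.uniform((16, 224, 224, 3))

@tf.function
def forward(inp):
    return model(inp, training=False)

forward(x)  # warm-up so tracing/compilation doesn't skew the timing

start = time.perf_counter()
for _ in range(100):
    out = forward(x)
out.numpy()  # force a sync before stopping the clock
print("avg step time:", (time.perf_counter() - start) / 100, "s")
```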

Seeing this, I suspected some kernels were only using a small portion of the GPU's cores.
Checking with the TensorFlow profiling tools seems to confirm this:

As can be seen in the profiling image, the costliest kernel has a grid dimension of only 9,1,5.
Many other kernels in the network seem to have very small grids as well.
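
In case it's useful, this is roughly how I capture the trace that the image comes from (the log directory and the toy model are placeholders):

```python
import tensorflow as tf

# Tiny placeholder model standing in for my network.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
])
x = tf.random.uniform((16, 224, 224, 3))

@tf.function
def step(inp):
    return model(inp, training=False)

step(x)  # warm up so graph tracing doesn't end up in the profile

# "logs/profile" is just a placeholder directory; I open it in TensorBoard's
# Profile tab, which is where the per-kernel grid/block dimensions are shown.
tf.profiler.experimental.start("logs/profile")
for _ in range(20):
    step(x)
tf.profiler.experimental.stop()
```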

Since the 3090 has far more CUDA cores than my previous 2080 Ti, this underutilization would explain why I'm hardly seeing any performance gain from the switch.

My network is a modified TTFNet without deformable convolutions, with a ResNet-18 backbone, training on COCO with a batch size of 16.
Increasing the batch size doesn't improve throughput (the step time just scales linearly with it).
I can provide the TensorFlow profiler logs, or anything else that might help.

In the end, there must have been an issue with the graphics card itself: it died after a week of use, and with the replacement card GPU usage is higher and training is faster.