0% GPU utilisation as Mesh size increases

I have been working on parallel computations in CFD. I have developed a CUDA code for flow over a square cylinder in 3D. My code is stable up to a certain mesh size. If I increase the mesh size further, the code starts reporting zero error at each iteration, even though I have not exceeded the threads-per-block limit, which was my first suspicion. I checked the GPU utilisation and it shows 0%. Previously, when I ran it on a mesh size where the solution was converging, GPU utilisation went to almost 100%. Can someone help with this?

It is hard to provide feedback without a reproducible example.

Does the code perform proper CUDA Driver API or CUDA Runtime error checking on the grid launch? As the number of threads per block increases, so does the number of registers required per thread block. It is very easy to hit the registers-per-SM limit before reaching the maximum threads per thread block. If a grid launch requires more resources than the hardware allows, the launch will return CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES/cudaErrorLaunchOutOfResources.
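A minimal sketch of that kind of launch error checking, using the CUDA Runtime API (the kernel name, mesh dimensions, and launch configuration here are illustrative, not from your code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Check every CUDA Runtime call and abort with file/line on failure.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Placeholder solver kernel standing in for one iteration of your scheme.
__global__ void solverStep(float* field, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] *= 0.5f;  // dummy update
}

int main()
{
    // Hypothetical mesh-dependent launch configuration.
    const int n = 64 * 64 * 32;
    dim3 block(16, 16, 4);   // 1024 threads/block: at the hardware maximum
    dim3 grid(4, 4, 32);

    float* d_field = nullptr;
    CUDA_CHECK(cudaMalloc(&d_field, n * sizeof(float)));

    solverStep<<<grid, block>>>(d_field, n);
    // Catches launch-configuration failures such as
    // cudaErrorLaunchOutOfResources (too many registers per block).
    CUDA_CHECK(cudaGetLastError());
    // Catches asynchronous errors raised during kernel execution.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_field));
    return 0;
}
```

If the launch is failing silently, the kernel never runs, which would also explain the 0% GPU utilisation and an unchanged (zero) residual each iteration. You can see the per-thread register usage of your actual kernels by compiling with `nvcc --ptxas-options=-v`.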