NVIDIA Nsight Compute report card suggestions


I have written a GPU code for my meshfree CFD solver using CUDA Fortran. I am attaching a report card that I obtained from the NVIDIA Nsight Compute application for kernel performance analysis. I use a Quadro M5000 for my computations. The report shows the performance of one of my kernels, which accounts for 97 percent of the computation. GPU experts here, kindly point out where exactly I could improve my kernel to make better use of the SM resources. I understand most of the data shown in the report, but I really need a direction in which to start improving my kernel.




Hi Srikanth,

Looking at the report card, the biggest thing that jumps out is the high register usage (255 per thread), which is causing low occupancy (12.5%).

Not knowing your code, I can’t offer any specific advice, but high register usage is often due to the code having many local variables, intermediate computations, and/or many address computations. If you have one very large kernel, try splitting it up into multiple smaller kernels. It may mean using more memory to store intermediate values, but it should hopefully reduce the register count (ideally to less than 64 registers per thread) and give you better occupancy on the device.
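
To make the link between register count and occupancy concrete, here is a rough back-of-the-envelope sketch. It is not taken from the report; it assumes the usual limits for a compute capability 5.2 device such as the Quadro M5000 (65536 registers and 64 resident warps per SM) and assumes registers per thread are rounded up to a multiple of 8 when allocated. It only looks at the register limit and ignores block size, shared memory, and the maximum number of blocks per SM, so treat it as an estimate rather than what the profiler computes. The function name is made up for illustration.

```python
# Rough register-limited occupancy estimate for a single SM.
# All limits below are assumptions for compute capability 5.2 (Maxwell).
REGS_PER_SM = 65536          # 32-bit registers available per SM
MAX_WARPS_PER_SM = 64        # 2048 resident threads / 32 threads per warp
WARP_SIZE = 32
REG_ALLOC_GRANULARITY = 8    # assumed rounding of registers per thread


def register_limited_occupancy(regs_per_thread):
    """Fraction of the SM's warp slots usable if registers are the only limit."""
    # Round the per-thread count up to the allocation granularity.
    rounded = -(-regs_per_thread // REG_ALLOC_GRANULARITY) * REG_ALLOC_GRANULARITY
    regs_per_warp = rounded * WARP_SIZE
    # Number of whole warps whose registers fit on the SM, capped by the warp limit.
    warps = min(MAX_WARPS_PER_SM, REGS_PER_SM // regs_per_warp)
    return warps / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for regs in (255, 128, 64, 32):
        print(regs, register_limited_occupancy(regs))
```

Under these assumptions, 255 registers per thread leaves room for only 8 of 64 warps, which is the 12.5% shown in the report. Getting down to 64 registers would allow about 50%, and 32 registers would remove the register limit entirely.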


Thanks for the reply, Mat. I have been thinking about it from that angle. I am trying to optimize the code that way; the kernel is too damn big.

Thanks for the suggestion.