I have written GPU code for my meshfree CFD solver using CUDA Fortran, and I run it on a Quadro M5000. I am attaching a report from NVIDIA Nsight Compute with the kernel performance analysis. It shows the performance of one of my kernels, which accounts for 97 percent of the computation. GPU experts here, could you kindly help me see where exactly I could improve this kernel to make better use of the SM resources? I understand most of the data shown in the report, but I really need some direction on where to start improving the kernel.