I originally developed a few kernels under CUDA 8.0 for the Pascal architecture, and I made extensive use of NVVP to find memory access inefficiencies and similar problems. The time has now come to move on to devices with the Volta and Turing architectures, so I am now using Nsight Compute to improve the kernels. I have read the NVVP to Nsight Compute transition guide. I am a little disappointed that this tool does not point to any specific bottlenecks in my code. When I apply the Python tools, they return no results. Could it be that I have already tuned out the bottlenecks, or could it be that I haven't set things up correctly? I do link to my code location before applying any of the Python tools.
Rule-based detection of memory access inefficiencies is not yet as complete in Nsight Compute as it was in NVVP. We are working to close this gap. The next release will have support for detecting uncoalesced accesses in global and shared memory.
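For anyone unsure what an uncoalesced global access looks like, here is a rough sketch of the idea in Python. It is only a toy model, not anything Nsight Compute runs: it assumes 32 threads per warp, 32-byte memory sectors, and 4-byte elements, and the function name `sectors_touched` is made up for illustration. It counts how many distinct sectors one warp-wide load would touch for a given stride between neighboring threads.

```python
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
SECTOR_BYTES = 32   # assumed sector size for global memory traffic


def sectors_touched(stride_elems, elem_bytes=4, base_addr=0):
    """Toy model: count distinct sectors touched by one warp-wide load.

    Thread t is assumed to read the element at index t * stride_elems.
    """
    addrs = [base_addr + t * stride_elems * elem_bytes for t in range(WARP_SIZE)]
    return len({addr // SECTOR_BYTES for addr in addrs})


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(f"stride {stride}: {sectors_touched(stride)} sectors")
```

In this model a unit stride needs the minimum number of sectors for the warp, while a large stride makes every thread pull in its own sector, which is the kind of pattern a coalescing check would be expected to flag.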
Thank you for clearing up my confusion.