How can I profile how many times an instruction is executed, or how long it takes?

I am building a CUDA program in which the kernel busy-waits on an atomic condition variable in a while loop. I am optimizing the code to reduce the overhead of this busy waiting, so I want to know how many cycles are wasted in the loop. I profiled the program with Nsight Compute, but I cannot find any metric that gives me this information. Any suggestions on how to get it from the profiler?

The line table information for optimized code can sometimes make exact resolution to source lines tricky. There isn't a cycle counter metric, but the instructions executed metric will probably give you a good idea of how much time is wasted here. I would recommend clicking on one of the statements in the loop to open the SASS view, where you can see at the assembly level how many instructions are executed in this busy-wait loop.
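
If a rough cycle count is still wanted, one option outside the profiler is to instrument the loop by hand with `clock64()`. This is a different technique from the instructions executed metric, and the sketch below is only an illustration under stated assumptions: the kernel name `spin_demo`, the `SpinStats` struct, the helper `run_spin_demo`, the delay of 100000 cycles, and the iteration cap are all made up for the example and are not taken from the original program. It also assumes both blocks are resident at the same time, and the cap keeps the kernel from hanging if they are not.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical container for the measurements (not from the original code).
struct SpinStats {
    long long cycles;      // SM clock cycles elapsed across the wait loop
    long long iterations;  // number of polls that found the flag unset
};

// Block 0 busy-waits on the flag; block 1 sets it after a short delay.
__global__ void spin_demo(int *flag, SpinStats *stats, long long max_iters)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        long long iters = 0;
        long long start = clock64();
        // atomicAdd(flag, 0) acts as an atomic read of the flag.
        // The iteration cap is a safety net against an endless spin.
        while (atomicAdd(flag, 0) == 0 && iters < max_iters) {
            ++iters;
        }
        long long stop = clock64();
        stats->cycles = stop - start;
        stats->iterations = iters;
    } else if (blockIdx.x == 1 && threadIdx.x == 0) {
        // Stand-in for the producer: wait roughly 100000 cycles, then signal.
        long long t0 = clock64();
        while (clock64() - t0 < 100000) { }
        atomicExch(flag, 1);
    }
}

// Host helper (hypothetical): runs the demo once and returns the counters.
SpinStats run_spin_demo()
{
    int *d_flag = nullptr;
    SpinStats *d_stats = nullptr;
    SpinStats h_stats = {0, 0};

    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_stats, sizeof(SpinStats));
    cudaMemset(d_flag, 0, sizeof(int));
    cudaMemset(d_stats, 0, sizeof(SpinStats));

    spin_demo<<<2, 1>>>(d_flag, d_stats, 1LL << 24);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_stats, d_stats, sizeof(SpinStats), cudaMemcpyDeviceToHost);

    cudaFree(d_flag);
    cudaFree(d_stats);
    return h_stats;
}

int main()
{
    SpinStats s = run_spin_demo();
    printf("spin loop: %lld iterations, %lld cycles\n", s.iterations, s.cycles);
    return 0;
}
```

Note that `clock64()` reads a per-SM counter, so the difference is elapsed cycles for that thread, including any time its warp was not scheduled, and the extra instructions perturb the loop slightly. When going back to Nsight Compute, building with `nvcc -lineinfo` keeps the source-to-SASS correlation available.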