Instruction-Level Profiling of Graphics Shaders

I asked this question about a year ago, and I am wondering whether there have been any updates or progress on it.

I’ve tried profiling my executable using Nsight, but the information it gives me is fairly high-level: counters, and which shaders or draw calls cost the most. I’m sure these basic metrics help people who have many simple shaders and lots of draw calls. However, my application uses a single pass of one particularly large shader, and I need to know which parts of it are the slowest so I know where to optimize.

Is there any way to get low-level instruction timing/profiling for vertex, geometry, and fragment shaders?

NVIDIA has it available for the CUDA compute platform (but not for graphics shaders):
https://devblogs.nvidia.com/cuda-7-5-pinpoint-performance-problems-instruction-level-profiling/

AMD Radeon supports it with the Radeon GPU Profiler.

Why is this still not available for graphics shaders on NVIDIA hardware?

Hello,

Thank you for your feedback on the Nsight Graphics tool. We are sorry you are having issues with instruction-level profiling of graphics shaders on NVIDIA hardware. Please let us know which graphics API you are using.

Regards,

Darrell