I asked this question about a year ago, and I am wondering whether there have been any updates or progress on it.
I’ve tried profiling my executable using Nsight, but the information it gives me is fairly high-level: counters, and which shaders or draw calls cost the most. I’m sure these basic metrics help people whose applications have many simple shaders and many draw calls. However, my application uses a single pass of one particularly large shader, and I need to know which parts of it are the slowest so that I know where to optimize.
Is there any way to get low-level instruction timing/profiling for vertex, geometry, and fragment shaders?
NVIDIA offers it for the CUDA compute platform (but not for graphics shaders):
https://devblogs.nvidia.com/cuda-7-5-pinpoint-performance-problems-instruction-level-profiling/
AMD Radeon supports it with the Radeon GPU Profiler:
Why is this still not available for graphics shaders on NVIDIA hardware?