Instruction-Level Profiling of Graphics Shaders (answered)

I’ve tried profiling my executable with Nsight for Visual Studio, but the information I get is fairly high-level: counters, and which shaders or draw calls cost the most. I’m sure these metrics help people whose applications have many simple shaders and many draw calls. However, my application runs a single pass of one particularly large shader, and I need to know which parts of it are the slowest.

Is there any way to get instruction-level profiling for vertex, geometry, and fragment shaders? This seems to be available for the CUDA compute platform, but not for graphics shaders.

Sorry. We don’t currently offer the ability to instrument within a graphics shader.