Profiling DXR Shaders with Timer Instrumentation

Originally published at: Profiling DXR Shaders with Timer Instrumentation | NVIDIA Technical Blog

Optimizing real-time graphics applications for maximum performance can be a challenging endeavor, and ray tracing is no exception. Whether you want to make your graphics engine more efficient overall or find a specific performance bottleneck, profiling is the most important tool to achieve your goals. Despite constantly improving support for ray tracing APIs in profilers…

Hi there,

I’ve been trying to implement this in our codebase. On my RTX 3080 GPU, I find that instead of getting sensible per-pixel execution times, I seem to be getting blocks of the same value. And the value itself looks wrong!

Could anyone advise me on (a) whether this feature is known to work on the latest GPUs, and (b) whether there are any additional steps I can take to debug/validate what I have done beyond this document? Sample code, perhaps? Some way to check the generated asm?


Seeing blocks of pixels with the same timing value is expected, because that is how the GPU divides the workload: neighboring pixels are traced together in groups of threads, so they report the same measured interval. You won’t see individual per-pixel timings.

Answers to your questions:
a) I just ran a test on an RTX 3090 and it worked correctly.
b) There isn’t any asm you can inspect, but if you followed the article’s sample code and you see a colored heatmap similar to the article’s, the feature is most likely working and you just need to tweak your heatmap scaling to get a visualization that is appropriate for your workload. Implementing a dynamic scale value that you can adjust at runtime can help you find a more useful heatmap when the workload changes from scene to scene.

If the values (or colors) look wrong, try doubling or halving the time scale value, since the right value depends on the workload of your scene. The article above was written for an RTX 2080, and an RTX 3080 will certainly finish the same work faster. Try setting your heatmapScale to 5000, 10000, and so on to see if that is reflected in your heatmap colors.
Here’s the sample code from the article:

// Scale the time delta value to [0,1]
static float heatmapScale = 65000.0f; // somewhat arbitrary scaling factor, experiment to find a value that works well in your app
float deltaTimeScaled = clamp( (float)deltaTime / heatmapScale, 0.0f, 1.0f );

Let us know if that helps.

Hi there,

On closer inspection, what I’m getting does seem to match what you describe.

I have managed to get per-pixel timing detail on another ray tracing hardware platform I’m working with, which is probably why I expected the same here. I do see timing data, just at a threadgroup level rather than per pixel, which is still useful.

I guess when I looked at the images in the post they looked more pixel-like than threadgroup-like.

Thanks for the help!