OptiX and Performance Counter reports in Nsight Compute

Hello,

I’m trying to collect performance counter metrics from an RT Core-accelerated OptiX application, specifically memory operations and L2 metrics.
However, I’m unsure whether the perf counter reports I get from Nsight Compute reflect the memory accesses made by the RT Core hardware, or only those from the CUDA kernels running on the SMs.
Could anyone clarify this, or point me to any relevant documentation?

So far, I’ve profiled the optixPathTracer example app from the SDK with Nsight Compute and saw two active kernels being profiled: the ray generation kernel (__raygen__rg_...) and something called NVIDIA Internal. According to the Nsight VSE documentation, the latter seems to be one of the OptiX kernels that are invisible to the user.
What I’m looking for specifically is the part where the RT Core hardware performs the acceleration structure traversal, namely the optixTrace call. Can I assume that the NVIDIA Internal part is where that happens? The NVIDIA Internal kernel runs for a much shorter time than the ray generation kernel, which leaves me confused about where the actual traversal happens.

Thanks in advance for any help.

Please have a look into this thread:
https://forums.developer.nvidia.com/t/is-there-a-way-to-measure-rt-core-util/168089

I’m not aware that memory traffic in Nsight Compute reports would be partitioned into SM and RT core usage.

The __raygen__ program is just one of the functions of the whole kernel, and you should be able to see the other OptiX device program domains inside Nsight as well, such as the __closesthit__ functions inside your raytracing kernel (see the sketch below).

Anything reported as “internal” is either the explicit acceleration structure build, which is a completely different kernel, or internal functions inside the raytracing kernel which are not exposed.
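
For illustration, here is a minimal sketch of the device side of such a kernel (my own example, not the SDK’s; it assumes the usual OptiX 7 device API, and the Params struct, the hard-coded ray, and the program names __raygen__rg, __closesthit__ch, and __miss__ms are placeholders):

```cpp
#include <optix.h>

// Hypothetical launch parameters; a real app defines its own.
struct Params
{
    OptixTraversableHandle handle;
};
extern "C" __constant__ Params params;

// All of these are compiled into one kernel. __raygen__rg is the entry
// point, so its (mangled) name is what Nsight shows as the kernel name.
extern "C" __global__ void __raygen__rg()
{
    const uint3 idx = optixGetLaunchIndex();

    // Placeholder camera: one ray per launch index.
    float3 origin        = make_float3( (float) idx.x, (float) idx.y, 0.0f );
    float3 direction     = make_float3( 0.0f, 0.0f, 1.0f );
    unsigned int payload = 0;

    // Traversal is requested here; the hit/miss programs below run as
    // part of this same kernel execution.
    optixTrace( params.handle, origin, direction,
                0.0f,   // tmin
                1e16f,  // tmax
                0.0f,   // ray time
                OptixVisibilityMask( 255 ),
                OPTIX_RAY_FLAG_NONE,
                0, 1, 0,  // SBT offset, SBT stride, miss SBT index
                payload );
}

// Not separate kernels: just other functions inside the same kernel.
extern "C" __global__ void __closesthit__ch() { optixSetPayload_0( 1u ); }
extern "C" __global__ void __miss__ms()       { optixSetPayload_0( 0u ); }
```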

Hello Detlef,

Thanks for your help! I had already looked at the thread you linked and made sure that I enabled all the line info and profiled with a RelWithDebInfo configuration in CMake.
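
For reference, the host-side setting boils down to roughly this (my sketch; makeProfilingCompileOptions is just an illustrative helper, and the exact debug-level enum depends on the OptiX version):

```cpp
#include <optix.h>

// Module compile options that preserve source line info for profilers
// such as Nsight Compute.
OptixModuleCompileOptions makeProfilingCompileOptions()
{
    OptixModuleCompileOptions options = {};
    options.optLevel   = OPTIX_COMPILE_OPTIMIZATION_DEFAULT;
    // OptiX 7.4+; earlier versions used OPTIX_COMPILE_DEBUG_LEVEL_LINEINFO.
    options.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_MINIMAL;
    // The input PTX/OptiX-IR must itself carry line info, e.g. built
    // with nvcc -lineinfo (or the NVRTC equivalent).
    return options;
}
```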

Now that I look at the thread counts, the __raygen__ kernel’s count closely matches the total number of pixels (1600x900), whereas the internal one has only 40, so your explanation makes sense: the internal kernel deals with acceleration structure builds or other such work. It’s weird that I don’t see other kernels like __closesthit__ and __miss__ as separate reports, but they are visible in the source line view of the __raygen__ report. I guess that’s because __raygen__ is at the top of the call stack.

I’m not aware that memory traffic in Nsight Compute reports would be partitioned into SM and RT core usage.

Does that mean that the memory traffic report at least somehow reflects the RT core usage, albeit mixed with the SMs’?

Thanks!

Does that mean that the memory traffic report at least somehow reflects the RT core usage, albeit mixed with the SMs’?

That’s right. The memory stats report the memory system usage, regardless of which part of the processor is requesting memory I/O.

It’s weird that I don’t see other kernels like __closesthit__ and __miss__ as separate reports, but they are visible in the source line view of the __raygen__ report. I guess that’s because __raygen__ is at the top of the call stack.

Yes, that’s more or less right. For what it’s worth, __closesthit__ and __miss__ are not kernels per se, they are just functions called as part of a kernel execution. For that matter, __raygen__ is not a kernel either. For an OptiX launch, the raygen program is the entry point for the kernel, and raygen is where you can request traversal and calls to the closesthit and miss programs via the optixTrace() function. Because raygen is always present in an OptiX kernel launch, we decided to use the name of the compiled raygen program as the kernel name for profiling purposes. In some older versions, the name contained megakernel (which is a reference to the fact that all your OptiX programs are compiled into a single kernel).
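
Schematically, on the host side it looks like this (a simplified sketch assuming the standard OptiX 7 host API; launchFrame is just an illustrative wrapper):

```cpp
#include <cuda.h>
#include <optix.h>
#include <optix_stubs.h>

// One optixLaunch == one CUDA kernel launch. Raygen, closesthit, miss,
// etc. all execute inside this single kernel, which the profiler labels
// with the mangled name of the raygen program.
void launchFrame( OptixPipeline pipeline, CUstream stream,
                  CUdeviceptr d_params, size_t params_size,
                  const OptixShaderBindingTable& sbt,
                  unsigned int width, unsigned int height )
{
    optixLaunch( pipeline, stream, d_params, params_size,
                 &sbt, width, height, /*depth=*/ 1 );
}
```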


David.

For what it’s worth, __closesthit__ and __miss__ are not kernels per se, they are just functions called as part of a kernel execution.

I appreciate the clarification; that was an important detail to correct in my understanding.

Because raygen is always present in an OptiX kernel launch, we decided to use the name of the compiled raygen program as the kernel name for profiling purposes.

Ah, so that’s why the kernel name in the reports had weird prefixes like _0x..._ss_0 and the like! That also explains the name “megakernel” being mentioned in some of the performance study papers I read that targeted older OptiX versions.

Thanks for the helpful explanations, Detlef and David!
