OptiX and Performance Counter reports in Nsight Compute

hansung_kim · June 13, 2021, 3:32pm

Hello,

I’m trying to see if I could get some performance counter metrics running an RTcore-accelerated OptiX application, specifically memory operations and L2 metrics.
But I’m unsure whether the perf counter reports I get from the Nsight Compute tool reflect the memory accesses made by the RTcore hardware, or just those of the CUDA kernels running on the SMs.
Could anyone clarify this, or point me to any documentation?

So far, I have profiled the optixPathTracer example app from the SDK using Nsight Compute and could see two kernels being profiled: the ray generation kernel (__raygen__rg_...) and something called NVIDIA Internal. According to the Nsight VSE documentation, the latter seems to be one of the OptiX kernels that are invisible to the user.
What I’m looking for specifically is the part where the RTcore hardware does the acceleration structure traversal - namely the optixTrace call. Can I assume that the NVIDIA Internal part is where that happens? The NVIDIA Internal kernel seems to be running for a much shorter time than the ray generation kernel, and that makes me confused about where the actual traversal is happening.

Thanks in advance for any help.

droettger · June 14, 2021, 7:19am

Please have a look into this thread:
https://forums.developer.nvidia.com/t/is-there-a-way-to-measure-rt-core-util/168089

I’m not aware that memory traffic in Nsight Compute reports would be partitioned into SM and RT core usage.

The __raygen__ is just one of the functions of the whole kernel and you should be able to see other OptiX device program domains inside Nsight like the __closesthit__ functions inside your raytracing kernel.

Anything reported as “internal” is either the explicit acceleration structure build, which is a completely different kernel, or internal functions inside the raytracing kernel which are not exposed.

hansung_kim · June 14, 2021, 9:27am

Hello Detlef,

Thanks for your help! I had already looked at the thread you linked, and I made sure to enable line info and to profile with a RelWithDebInfo configuration in CMake.

Now that I look at the thread count, the __raygen__ kernel closely matches the total number of pixels (1600x900) whereas the internal one only has 40 threads, so your explanation that the internal kernel is concerned with acceleration structure builds or other internal work makes more sense. It’s weird that I don’t see other kernels like __closesthit__ and __miss__ as separate reports, but they are visible in the source line view of the __raygen__ report. I guess that’s because __raygen__ is at the top of the call stack.
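
The thread-count comparison above can be reproduced with a few lines of arithmetic. This is only a minimal sketch: the 1600x900 resolution and the 40-thread figure come from this post, while the helper names and the block sizes are assumptions for illustration. Padding to whole thread blocks is one possible reason a reported count would only closely match the pixel count; the real block size should be read from the profiler report.

```python
def launch_size(width, height, depth=1):
    """Number of launch indices for a width x height x depth launch."""
    return width * height * depth


def padded_threads(launch, block_size):
    """Threads scheduled if the launch is rounded up to whole blocks.

    block_size is an assumed value, not something taken from the report.
    """
    blocks = -(-launch // block_size)  # ceiling division
    return blocks * block_size


pixels = launch_size(1600, 900)
print("launch indices:", pixels)
print("padded to 64-thread blocks:", padded_threads(pixels, 64))
print("40 threads padded to 32-thread blocks:", padded_threads(40, 32))
```

A kernel whose thread count is near the pixel count is plausibly the ray tracing launch, while a 40-thread kernel is far too small to be shading one ray per pixel.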

droettger wrote: “I’m not aware that memory traffic in Nsight Compute reports would be partitioned into SM and RT core usage.”

Does that mean that the memory traffic report at least somehow reflects the RT core usage, albeit mixed with SM’s?

Thanks!

dhart · June 14, 2021, 7:18pm

hansung_kim wrote: “Does that mean that the memory traffic report at least somehow reflects the RT core usage, albeit mixed with SM’s?”

That’s right. The memory stats report the memory system usage, regardless of which part of the processor is requesting memory I/O.

hansung_kim wrote: “It’s weird that I don’t see other kernels like __closesthit__ and __miss__ as separate reports, but they are visible in the source line view of the __raygen__ report. I guess that’s because __raygen__ is at the top of the call stack.”

Yes, that’s more or less right. For what it’s worth, __closesthit__ and __miss__ are not kernels per se, they are just functions called as part of a kernel execution. For that matter, __raygen__ is not a kernel either. For an OptiX launch, the raygen program is the entry point for the kernel, and raygen is where you can request traversal and calls to the closesthit and miss programs via the optixTrace() function. Because raygen is always present in an OptiX kernel launch, we decided to use the name of the compiled raygen program as the kernel name for profiling purposes. In some older versions, the name contained megakernel (which is a reference to the fact that all your OptiX programs are compiled into a single kernel).
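
To make the single-kernel structure concrete, here is a toy model in plain Python. It is a conceptual sketch only, not OptiX device code: every name in it (`launch`, `trace`, `toy_scene`, and so on) is hypothetical, and `trace` merely stands in for the role optixTrace() plays. It shows why a profiler sees one kernel named after raygen, with closesthit and miss appearing only as functions called beneath it.

```python
from typing import Callable, List, Optional, Tuple

LaunchIndex = Tuple[int, int]
# Hypothetical stand-in for traversal: returns a hit distance, or None on a miss.
Intersect = Callable[[LaunchIndex], Optional[float]]


def closesthit(distance: float) -> str:
    # Plays the role of a __closesthit__ program: an ordinary function call.
    return f"hit at t={distance}"


def miss() -> str:
    # Plays the role of a __miss__ program.
    return "miss"


def trace(index: LaunchIndex, intersect: Intersect) -> str:
    # Plays the role of optixTrace(): traverse, then call closesthit or miss.
    distance = intersect(index)
    return miss() if distance is None else closesthit(distance)


def raygen(index: LaunchIndex, intersect: Intersect) -> str:
    # Entry point of the single kernel; everything else runs beneath it.
    return trace(index, intersect)


def launch(width: int, height: int, intersect: Intersect) -> List[str]:
    # One raygen invocation per launch index, all inside one "kernel".
    return [raygen((x, y), intersect) for y in range(height) for x in range(width)]


def toy_scene(index: LaunchIndex) -> Optional[float]:
    # Toy scene: even columns hit something at distance 1.0, odd columns miss.
    return 1.0 if index[0] % 2 == 0 else None


print(launch(4, 1, toy_scene))
```

In this model, timing `launch` would attribute all the work to the call tree under `raygen`, which mirrors how the closesthit and miss programs show up in the source view of the raygen-named kernel instead of as separate reports.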

–
David.

hansung_kim · June 15, 2021, 7:27am

dhart wrote: “For what it’s worth, __closesthit__ and __miss__ are not kernels per se, they are just functions called as part of a kernel execution.”

I appreciate the clarification; this was an important correction to my understanding.

dhart wrote: “Because raygen is always present in an OptiX kernel launch, we decided to use the name of the compiled raygen program as the kernel name for profiling purposes.”

Ah, that was why the kernel name in the reports had weird prefixes like _0x..._ss_0 and the like! That also explains the name “mega kernel” being mentioned in some of the performance study papers I looked at that targeted older OptiX versions.

Thanks for the helpful explanations, Detlef and David!
