Does Nsight capture traversal statistics?

I am trying to see control flow divergence statistics during traversal, and was curious whether they are captured by Nsight. I was using smsp__sass_average_branch_targets_threads_uniform.pct, but that seems to capture statistics from the SMs only.

I have a better ray reordering scheme and see much better performance, but wanted to validate that by examining the low-level perf counters. I know the ray to thread/warp mapping is transparent to us and can be dynamically changed at run time. What sort of perf counters would you suggest to look at?

Also I am interested in getting memory coalescing statistics during traversal. My intuition is that in traversal if we order rays so that they are more coherent, they will exercise similar traversal paths and thus access similar tree nodes, which improves not only the cache hit rate but also memory coalescing. I know there are ways to get cache/dram statistics related to SMs, but how about those statistics for traversals?

I do see a whole bunch of “Nvidia internals” calls in nvprof and Nsight. Are they basically traversals?

I took a look at this thread: OptiX and Performance Counter reports in Nsight Compute - #3 by hansung_kim, and have a better understanding now.

Could I understand that all the statistics under those rayGen kernels in Nsight include both RT cores and SMs behaviors, and we can’t tell the difference?

I am using Avg. Active Threads Per Warp in the WarpStateStats section to analyze the warp behavior in the raygen kernel (megakernel). Pretty interesting that even if my entire IS program is empty, there are no AH/CH/Miss programs, and the only thing the RG program does is generate rays without divergence, the Avg. Active Threads Per Warp is still as low as 23.33. I’d think in this case it would be 32, since there is absolutely no divergence in the kernel?

Hi @boringboringarsenal,

To your first question about divergence, in the latest version of Nsight Compute (2021.2) there are some new stats columns in the “Source” page view titled “Avg Thread Executed” and “Divergent Branches”. The “Avg Thread Executed” column shows how many threads per warp are active for a given line of code, so if it shows a value of “16”, then half of your 32 possible threads are active, and the remaining threads have somehow diverged: waiting, stalled, or exited. The “Divergent Branches” column shows a count of how many branch (“BRA”) instructions sent threads in different directions, so you can find out where divergence is being triggered. You can also see divergence happening by watching for changes in the “Avg Thread Executed” column; when that number goes down, it means that something has caused some of the threads to diverge.

In older versions of Nsight Compute that don’t have these two columns, you can still infer the average number of threads that are active per warp by looking at the “Instructions Executed” column and comparing it to the “Predicated-On Thread Instructions Executed” column. For example, if “Instructions Executed” is “1,831,711” and “Predicated-On Thread Instructions Executed” is “22,985,542”, then you know the average number of active threads for that line of code is 22,985,542 (thread instructions) / 1,831,711 (warp instructions) ≈ 12.549 (threads / warp). In this case, the “Avg Thread Executed” column is rounded down to the nearest integer (12), so an advantage of looking at the instruction statistics is that you get slightly more granular numbers.
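As a quick sanity check of that arithmetic, here is a minimal Python sketch. The function name `avg_active_threads` is just made up for illustration; the two inputs are the per-line counter values you would read off the “Source” page by hand.

```python
WARP_SIZE = 32

def avg_active_threads(thread_instructions: int, warp_instructions: int) -> float:
    """Average number of active threads per warp-level instruction.

    thread_instructions: "Predicated-On Thread Instructions Executed"
    warp_instructions:   "Instructions Executed"
    """
    if warp_instructions <= 0:
        raise ValueError("warp_instructions must be positive")
    return thread_instructions / warp_instructions

# Counter values from the example above
avg = avg_active_threads(22_985_542, 1_831_711)
print(f"{avg:.3f} threads/warp ({avg / WARP_SIZE:.1%} of a full warp)")
print("rounded down, as the Avg Thread Executed column would show it:", int(avg))
```

Doing the division yourself keeps the fractional part, which matters when you are comparing two ray orderings whose averages differ by less than one thread.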

For your second question about memory, I’m not sure there is any way to get memory coalescing statistics per se in Nsight Compute, but it might be worth reading through the Nsight Compute Profiling Guide & CUDA docs on memory access patterns and optimization:

https://docs.nvidia.com/nsight-compute/2021.2/ProfilingGuide/index.html

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses

Searching “coalesc” in both of these pages will quickly take you through most of what is offered. My understanding of Nsight’s memory metrics is that there is no difference between RT core memory access and SM memory access, they’re both global memory I/O. As far as I know there isn’t a way to see when coalescing is happening, but Nsight Compute will warn you when coalescing is not happening. Those warnings will appear at the bottom of the “Details” page, and they provide the address of the load instruction that is uncoalesced.
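To make “coalesced” a bit more concrete, here is a rough Python sketch that counts how many 32-byte sectors a single warp-wide load would touch for a given set of per-thread addresses. It is only a toy model of SM global loads (the helper name `sectors_touched` and the address patterns are invented for illustration), and it says nothing about what the RT cores fetch during traversal.

```python
SECTOR_BYTES = 32  # global memory traffic on NVIDIA GPUs is counted in 32-byte sectors

def sectors_touched(addresses, access_bytes=4):
    """Count the distinct 32-byte sectors covered by one warp-wide load."""
    sectors = set()
    for addr in addresses:
        first = addr // SECTOR_BYTES
        last = (addr + access_bytes - 1) // SECTOR_BYTES
        sectors.update(range(first, last + 1))
    return len(sectors)

# 32 threads reading consecutive 4-byte words: fully coalesced
coalesced = [4 * lane for lane in range(32)]
# 32 threads reading words 256 bytes apart: every thread needs its own sector
strided = [256 * lane for lane in range(32)]

print(sectors_touched(coalesced))  # 4
print(sectors_touched(strided))    # 32
```

The fewer sectors per request, the better the coalescing; the uncoalesced warnings on the “Details” page are pointing at loads that look like the second pattern.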


David.

For average active threads per warp, there are lots and lots of things that can cause divergence inside and outside your OptiX program. Any divergence in your code will count against average active threads. Any differences in traversal and/or shader execution will also cause divergence. Are all of your rays hits, and are all hits the same material & calling exactly the same OptiX programs? Are you casting shadow or reflection rays in this profile? Any mix of hits & misses in your launch will count against average active threads. Any use of multiple materials will reduce average active threads. It’s not common for rendering a normal scene to be able to achieve 32 active threads on average, since it means that all rays in the launch must do the same thing, have the same hit or miss status, evaluate the same material, and do the same amount of work.
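Here is a toy Python model of why a mix of hits and misses pulls the average down. It assumes, purely for illustration, that a warp runs each distinct code path one after another with only that path's threads active; the path labels and instruction costs are made up, and this is not how OptiX or the hardware scheduler actually works internally.

```python
from collections import Counter

def modeled_active_threads(paths, cost):
    """paths: one path label per thread in a warp.
    cost:  number of instructions each path executes (hypothetical values).
    Returns thread-level instructions divided by warp-level instructions."""
    counts = Counter(paths)
    thread_instr = sum(n * cost[p] for p, n in counts.items())
    warp_instr = sum(cost[p] for p in counts)
    return thread_instr / warp_instr

equal_cost = {"hit": 100, "miss": 100}
print(modeled_active_threads(["hit"] * 32, equal_cost))                  # 32.0
print(modeled_active_threads(["hit"] * 16 + ["miss"] * 16, equal_cost))  # 16.0

# A cheaper miss path hurts less, but the average still drops well below 32
cheap_miss = {"hit": 100, "miss": 20}
print(round(modeled_active_threads(["hit"] * 24 + ["miss"] * 8, cheap_miss), 2))
```

Even in this simplified model, a single extra path in the warp is enough to move the average far from 32.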


David.

This makes sense. So the warp statistics do include both the SM work and the RT core work. Since none of my shaders has any divergence (RG just generates rays, IS is empty, no other shaders), all of the divergence must come from the RT cores. What I find interesting is that when comparing two RG programs, one that orders rays in raster-scan order and one that randomizes the rays, there is virtually no difference in the active threads per warp count. I guess this probably means the run-time system does some sort of intelligent ray-to-thread remapping? But the overall performance difference is significant (1.4s vs. 4.6s for 8m rays).

Sorry, I realized I asked some questions you already answered. To clarify, even with empty programs, your divergence can stem from traversal and whether OptiX would have called those programs. Try aiming your camera at a closeup of a single polygon of your scene, so that 100% of the pixels do the same thing, and see how the average threads stats change.

Yes, randomizing rays can have a significant negative impact on performance because it will cause traversal divergence. OptiX has a default thread-to-warp mapping in 2D launches that maps raygen threads into 8x4 pixel blocks (we discussed this already, didn’t we?). As long as you’re using raster-scan order, OptiX is reordering transparently to give you better ray coherence for primary rays. You will notice the same effect for AO rays, for example, because they tend to scatter randomly, so they are usually measurably slower than primary rays purely due to added divergence.
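To picture what an 8x4 block mapping buys you, here is a small Python sketch comparing the pixel footprint of one warp under a plain raster-order assignment and under an 8x4 tiling. The tiling formula is an assumption written only to illustrate the idea; the real mapping is internal to OptiX and may differ.

```python
WARP_SIZE = 32
TILE_W, TILE_H = 8, 4  # block shape mentioned above; the exact scheme is an assumption

def warp_linear(x, y, width):
    """Warp index if threads were assigned in plain raster order."""
    return (y * width + x) // WARP_SIZE

def warp_tiled(x, y, width):
    """Warp index if each warp covers one 8x4 pixel tile (assumes width % 8 == 0)."""
    tiles_per_row = width // TILE_W
    return (y // TILE_H) * tiles_per_row + (x // TILE_W)

def footprint(warp_fn, warp_id, width, height):
    """Bounding box (w, h) of the pixels that land in one warp."""
    pixels = [(x, y) for y in range(height) for x in range(width)
              if warp_fn(x, y, width) == warp_id]
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return (max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)

print(footprint(warp_linear, 0, 64, 8))  # (32, 1)
print(footprint(warp_tiled, 0, 64, 8))   # (8, 4)
```

A compact 8x4 footprint keeps the primary rays in a warp closer together on screen than a 32x1 strip does, which is why raster-ordered launches tend to traverse more coherently than randomized ones.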


David.

Yes, we did. Just to be clear, OptiX might still remap rays to threads/warps at run time beyond this initial mapping, right?

What throws me off is why I see significant performance difference between raster order and random order but virtually no difference in terms of active threads per warp (23.3 vs. 22.7).

Just to be clear, OptiX might still remap rays to threads/warps at run time beyond this initial mapping, right?

Correct, the locality of a thread is not guaranteed. This is covered in the Programming Guide’s “Program Input” section here: https://raytracing-docs.nvidia.com/optix7/guide/index.html#program_pipeline_creation#7014

“The NVIDIA OptiX 7 programming model supports the multiple instruction, multiple data (MIMD) subset of CUDA. Execution must be independent of other threads. For this reason, shared memory usage and warp-wide or block-wide synchronization—such as barriers—are not allowed in the input PTX code. All other GPU instructions are allowed, including math, texture, atomic operations, control flow, and loading data to memory. Special warp-wide instructions like vote and ballot are allowed, but can yield unexpected results as the locality of threads is not guaranteed and neighboring threads can change during execution, unlike in the full CUDA programming model. Still, warp-wide instructions can be used safely when the algorithm in question is independent of locality by, for example, implementing warp-aggregated atomic adds.”

What throws me off is why I see significant performance difference between raster order and random order but virtually no difference in terms of active threads per warp

Your ordering change could be affecting hardware traversal & OptiX divergence without affecting SM divergence of your programs. The active threads per warp statistics are showing you data on the code you wrote that runs on the SM.


David.

Hmm, I think I am a little confused about what exactly the active threads per warp statistics are capturing. Do they capture only the divergence of the code running on the SMs, i.e., the code in the shader programs, or do they capture divergence of the traversal as well? From your previous replies I thought they included traversal, but you also said “The active threads per warp statistics are showing you data on the code you wrote that runs on the SM.”

Now if they cover only the SM code, then I don’t get why that number isn’t 32 when my entire IS is empty, there are no AH/CH/Miss programs, and the only thing the RG does is cast rays without any divergence.

Sorry to keep bugging you on this.

Do you have a mix of hits and misses in your launch?


David.

I think the best thing to do next would be to dive deeper and investigate your per-line or per-instruction active threads. So instead of looking at the “Avg. Active Threads Per Warp” stats on the “Details” page, dive into the “Source” page, find your raygen program, and then walk through it to find where the divergence is occurring. Raygen should start with 32 active threads.

You could add IS, AH, CH, MS programs with a line of code and then you’ll be able to see the active thread stats for those programs as well. If the first instruction of your IS program has fewer than 32 active threads, you know that you have misses and/or traversal divergence. Likewise, if your AH/CH programs have fewer than 32 threads on entry, it could be for the same reason, or additionally due to mixed materials and/or recursion divergence.

Your question here is a good one, and I don’t know all possible reasons for divergence, nor whether Nsight limits the average active threads statistics strictly to the instructions you’re responsible for. It is possible that the average numbers include some of the OptiX traversal. I guess that will become clearer if you look at the per-line thread stats with stub intersect and hit programs.


David.