Does Nsight capture traversal statistics?

boringboringarsenal · July 28, 2021, 8:50pm

I am trying to see control flow divergence statistics during traversal, and was curious whether Nsight captures them. I was using smsp__sass_average_branch_targets_threads_uniform.pct, but that seems to be capturing statistics from SMs only.

I have a better ray reordering scheme and see much better performance, but wanted to validate that by examining the low-level perf counters. I know the ray to thread/warp mapping is transparent to us and can be dynamically changed at run time. What sort of perf counters would you suggest to look at?

boringboringarsenal · July 29, 2021, 12:16am

Also I am interested in getting memory coalescing statistics during traversal. My intuition is that in traversal if we order rays so that they are more coherent, they will exercise similar traversal paths and thus access similar tree nodes, which improves not only the cache hit rate but also memory coalescing. I know there are ways to get cache/dram statistics related to SMs, but how about those statistics for traversals?

I do see a whole bunch of “Nvidia internals” calls in nvprof and Nsight. Are they basically traversals?

boringboringarsenal · July 29, 2021, 3:34am

I took a look at this thread: OptiX and Performance Counter reports in Nsight Compute - #3 by hansung_kim, and have a better understanding now.

Am I right that all the statistics under those rayGen kernels in Nsight include the behavior of both the RT cores and the SMs, and that we can’t tell the two apart?

boringboringarsenal · July 29, 2021, 4:08pm

I am using Avg. Active Threads Per Warp in the WarpStateStats section to analyze the warp behavior in the raygen kernel (megakernel). Pretty interesting that even if my entire IS program is empty, there are no AH/CH/Miss programs, and the only thing the RG program does is generate rays without divergence, the Avg. Active Threads Per Warp is still as low as 23.33. I’d think in this case it would be 32, since there is absolutely no divergence in the kernel?

dhart · July 29, 2021, 4:11pm

Hi @boringboringarsenal,

To your first question about divergence, in the latest version of Nsight Compute (2021.2) there are some new stats columns in the “Source” page view titled “Avg Thread Executed” and “Divergent Branches”. The “Avg Thread Executed” column shows how many threads per warp are active for a given line of code, so if it shows a value of “16”, then half of your total possible 32 threads are active, and the remaining threads are somehow diverged - waiting, stalled, or exited. The “Divergent Branches” column shows a count of how many branch (“BRA”) instructions sent threads in different directions, so you can find out where divergence is being triggered. You can see divergence happening by looking at changes in the “Avg Thread Executed” column; when that number goes down, it means that something has caused some of the threads to diverge.

In older versions of Nsight Compute that don’t have these two columns, you can still infer the average number of threads that are active per warp by looking at the “Instructions Executed” column and comparing it to the “Predicated-On Thread Instructions Executed” column. For example, if “Instructions Executed” is “1,831,711” and “Predicated-On Thread Instructions Executed” is “22,985,542”, then you know the average number of active threads for that line of code is 22985542 (thread instructions) / 1831711 (warp instructions) =~ 12.549 (threads / warp). In this case, the “Avg Thread Executed” column is rounded down to the nearest integer (12), so an advantage of looking at the instruction statistics is that you can get slightly more granular numbers.
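The division described above is easy to script when comparing many source lines. The following is a minimal sketch in Python; the function name and the `WARP_SIZE` constant are made up for illustration, and the two counts are simply the example values quoted above, typed in by hand rather than read from Nsight Compute.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def avg_active_threads_per_warp(warp_instructions: int, thread_instructions: int) -> float:
    """Average number of active threads per executed warp-level instruction.

    warp_instructions   -- value of the "Instructions Executed" column
    thread_instructions -- value of the "Predicated-On Thread Instructions Executed" column
    """
    if warp_instructions <= 0:
        raise ValueError("warp_instructions must be positive")
    return thread_instructions / warp_instructions


# Example values quoted in the post above.
avg = avg_active_threads_per_warp(1_831_711, 22_985_542)
utilization = avg / WARP_SIZE

print(f"{avg:.3f} active threads per warp")
print(f"{utilization:.1%} of the {WARP_SIZE} lanes doing useful work")
```

Keeping the fractional value, rather than the rounded-down integer, makes small differences between two ray orderings easier to spot.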

For your second question about memory, I’m not sure there is any way to get memory coalescing statistics per se in Nsight Compute, but it might be worth reading through the Nsight Profiling Guide & CUDA docs on memory access patterns and optimization:

Searching “coalesc” in both of these pages will quickly take you through most of what is offered. My understanding of Nsight’s memory metrics is that there is no difference between RT core memory access and SM memory access, they’re both global memory I/O. As far as I know there isn’t a way to see when coalescing is happening, but Nsight Compute will warn you when coalescing is not happening. Those warnings will appear at the bottom of the “Details” page, and they provide the address of the load instruction that is uncoalesced.

–
David.

dhart · July 29, 2021, 4:20pm

For average active threads per warp, there are lots and lots of things that can cause divergence inside and outside your OptiX program. Any divergence in your code will count against average active threads. Any differences in traversal and/or shader execution will also cause divergence. Are all of your rays hits, and are all hits the same material & calling exactly the same OptiX programs? Are you casting shadow or reflection rays in this profile? Any mix of hits & misses in your launch will count against average active threads. Any use of multiple materials will reduce average active threads. It’s not common for rendering a normal scene to be able to achieve 32 active threads on average, since it means that all rays in the launch must do the same thing, have the same hit or miss status, evaluate the same material, and do the same amount of work.

–
David.

boringboringarsenal · July 29, 2021, 4:27pm

This makes sense. So the warp statistics do include both the SM work and the RT core work. Since none of my shaders has any divergence (RG just generates rays, IS is empty, no other shaders), all the divergence must come from the RT cores. What I found interesting is that when comparing two RG programs, one that sorts rays in raster-scan order and one that randomizes the rays, there is virtually no difference in the active threads per warp count. I guess this probably means the run-time system does some sort of intelligent ray to thread remapping? But the overall performance difference is significant (1.4s vs. 4.6s for 8m rays).

dhart · July 29, 2021, 4:33pm

Sorry, I realized I asked some questions you already answered. To clarify, even with empty programs, your divergence can stem from traversal and whether OptiX would have called those programs. Try aiming your camera at a closeup of a single polygon of your scene, so that 100% of the pixels do the same thing, and see how the average threads stats change.

Yes, randomizing rays can have a significant negative impact on performance because it will cause traversal divergence. OptiX has a default thread-to-warp mapping in 2D launches that maps raygen threads into 8x4 pixel blocks (we discussed this already, didn’t we?). As long as you’re using raster-scan order, OptiX is reordering transparently to give you better ray coherence for primary rays. You will notice the same effect for AO rays, for example: because they tend to scatter randomly, they are usually measurably slower than primary rays purely due to added divergence.
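To make the coherence argument concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the real OptiX thread-to-warp mapping is internal and can change at run time, so `tiled_warps`, `chunked_warps` and `mean_spread` are hypothetical helpers, and the 8x4 tile shape is taken from the description above rather than from any API. The model measures how far apart on screen the 32 pixels of one warp are for three orderings.

```python
import random

WARP_SIZE = 32
TILE_W, TILE_H = 8, 4  # assumed tile shape (8x4 pixel blocks, as described above)


def tiled_warps(width, height):
    """One warp per 8x4 pixel tile (hypothetical layout; dimensions must divide evenly)."""
    return [
        [(tx + dx, ty + dy) for dy in range(TILE_H) for dx in range(TILE_W)]
        for ty in range(0, height, TILE_H)
        for tx in range(0, width, TILE_W)
    ]


def chunked_warps(pixels):
    """Assign every 32 consecutive pixels of an arbitrary ordering to one warp."""
    return [pixels[i:i + WARP_SIZE] for i in range(0, len(pixels), WARP_SIZE)]


def mean_spread(warps):
    """Average longest side of the screen-space bounding box covered by one warp."""
    total = 0
    for warp in warps:
        xs = [x for x, _ in warp]
        ys = [y for _, y in warp]
        total += max(max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
    return total / len(warps)


width, height = 64, 32
raster = [(x, y) for y in range(height) for x in range(width)]
shuffled = raster[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

tiled_spread = mean_spread(tiled_warps(width, height))
row_spread = mean_spread(chunked_warps(raster))
random_spread = mean_spread(chunked_warps(shuffled))

print(f"8x4 tiles:     {tiled_spread:.1f} px")
print(f"row chunks:    {row_spread:.1f} px")
print(f"random order:  {random_spread:.1f} px")
```

In this model a tiled warp spans 8 pixels, a row-major chunk spans 32, and a shuffled warp spans most of the image, which is one way to picture why randomized rays within a warp tend to visit different parts of the acceleration structure.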

–
David.

boringboringarsenal · July 29, 2021, 4:38pm

Yes we did. Just to be clear, optix might still remap ray to threads/warps at run-time beyond this initial mapping right?

What throws me off is why I see significant performance difference between raster order and random order but virtually no difference in terms of active threads per warp (23.3 vs. 22.7).

dhart · July 29, 2021, 5:12pm

“Just to be clear, optix might still remap ray to threads/warps at run-time beyond this initial mapping right?”

Correct, the locality of a thread is not guaranteed. This is covered in the Programming Guide’s “Program Input” section here: https://raytracing-docs.nvidia.com/optix7/guide/index.html#program_pipeline_creation#7014

“The NVIDIA OptiX 7 programming model supports the multiple instruction, multiple data (MIMD) subset of CUDA. Execution must be independent of other threads. For this reason, shared memory usage and warp-wide or block-wide synchronization—such as barriers—are not allowed in the input PTX code. All other GPU instructions are allowed, including math, texture, atomic operations, control flow, and loading data to memory. Special warp-wide instructions like vote and ballot are allowed, but can yield unexpected results as the locality of threads is not guaranteed and neighboring threads can change during execution, unlike in the full CUDA programming model. Still, warp-wide instructions can be used safely when the algorithm in question is independent of locality by, for example, implementing warp-aggregated atomic adds.”

“What throws me off is why I see significant performance difference between raster order and random order but virtually no difference in terms of active threads per warp”

Your ordering change could be affecting hardware traversal & OptiX divergence without affecting SM divergence of your programs. The active threads per warp statistics are showing you data on the code you wrote that runs on the SM.

–
David.

boringboringarsenal · July 29, 2021, 5:23pm

Hmm, I think I am a little confused about what exactly the active threads per warp statistics are capturing. Do they capture only the divergence of the code running on the SMs, i.e., the code in the shader programs, or do they capture divergence of the traversal as well? From your previous replies I thought it would be the latter, but you also said “The active threads per warp statistics are showing you data on the code you wrote that runs on the SM.”

Now if it’s the former, then I don’t get why that number isn’t 32 when my entire IS is empty, there are no AH/CH/Miss programs, and the only thing the RG does is cast rays without any divergence.

Sorry to keep bugging you on this.

dhart · July 29, 2021, 5:58pm

Do you have a mix of hits and misses in your launch?

–
David.

dhart · July 29, 2021, 6:59pm

I think the best thing to do next would be to dive deeper and investigate your per-line or per-instruction active threads. So instead of looking at the “Avg. Active Threads Per Warp” stats on the “Details” page, dive into the “Source” page, find your raygen program, and then walk through it to find where the divergence is occurring. Raygen should start with 32 active threads.

You could add IS, AH, CH, MS programs with a line of code and then you’ll be able to see the active thread stats for those programs as well. If the first instruction of your IS program has fewer than 32 active threads, you know that you have misses and/or traversal divergence. Likewise, if your AH/CH programs have fewer than 32 threads on entry, it could be for the same reason, or additionally due to mixed materials and/or recursion divergence.
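As a rough illustration of why a mix of hits and misses lowers the thread count on entry to a stub program, here is a toy model in Python. It is a sketch, not a description of how OptiX or Nsight actually schedule work: it assumes rays stay in fixed warps of 32 consecutive launch indices (which OptiX does not guarantee), and `avg_active_on_entry` is a hypothetical helper that applies the same thread-instructions over warp-instructions ratio discussed earlier in the thread.

```python
WARP_SIZE = 32


def avg_active_on_entry(hit_flags):
    """Average active lanes at the first instruction of a program only hitting rays enter.

    hit_flags -- one bool per ray, in launch order; rays are grouped into
                 warps of 32 consecutive entries (a simplifying assumption).
    """
    warp_instructions = 0    # warps in which at least one lane enters the program
    thread_instructions = 0  # total lanes that enter the program
    for start in range(0, len(hit_flags), WARP_SIZE):
        active = sum(hit_flags[start:start + WARP_SIZE])
        if active:
            warp_instructions += 1
            thread_instructions += active
    return thread_instructions / warp_instructions if warp_instructions else 0.0


all_hits = [True] * 64                         # every ray hits
interleaved = [i % 2 == 0 for i in range(64)]  # 50% hits, alternating lanes
grouped = [True] * 32 + [False] * 32           # 50% hits, sorted into one warp

print(avg_active_on_entry(all_hits))
print(avg_active_on_entry(interleaved))
print(avg_active_on_entry(grouped))
```

The interleaved and grouped cases have the same hit rate, yet only the interleaved one halves the active lanes in this model, which is why the per-line numbers on the “Source” page say more about coherence than a launch-wide hit ratio does.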

Your question here is a good one, and I don’t know all possible reasons for divergence nor whether Nsight is limiting the average active threads statistics tightly to only the instructions you’re responsible for. It is possible that the average numbers could include some of the OptiX traversal. I guess that will become more clear if you look at the per-line thread stats with stub intersect and hit programs.

–
David.
