Dear NVIDIA team,
I’m currently working on optimizing an OptiX program with Nsight Compute.
Is there a way to get performance metrics for the time span that optixTrace spends outside user-defined programs?
For example, I would like to find out how many cycles are required to traverse the BVH (in comparison to cycles required to run my programs), or the number of cache misses / pipeline stalls that occur during that time.
Thanks in advance!
Nsight Compute currently doesn’t automatically break down kernel metrics at the OptiX boundary for you, and traversal is generally excluded from the profiling metrics, both to protect proprietary internals and to isolate the user program metrics so they are easier to understand and optimize. May I ask what kinds of optimizations you are hoping to make with this information?
Thanks for the quick response. Obtaining metrics for the BVH traversal would be useful in many ways.
For example, I was hoping to observe the impact of OPTIX_RAY_FLAG_DISABLE_ANYHIT and OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT beyond just measuring end-to-end run time. In my understanding, optixTrace moves parts of the traversal workload from the SM to the on-chip RTX hardware, but then invokes the user programs on the SM. I was wondering about the relative cost of calling a user-defined program, and if there is a way to perform some computation on the SM while the RTX hardware is busy with traversal (to improve GPU utilization).
I respect NVIDIA’s desire to protect their IP, but the OptiX documentation is very vague regarding the points mentioned above, so I was hoping to obtain some more information from profiling. Are you maybe able to comment on these things?
There are some talks in the archive (by yours truly) that discuss the RTX - SM round trips and how they relate to anyhit invocations. See OptiX Tools & Tricks, for example: OptiX talks from Siggraph 2019. There might be some more recent ones as well. It’s not super detailed, but I talk about the existence of the round-trip overheads, how to think about and evaluate them, and a little about how to optimize (reduce) their use. The relative cost is unique to every application and every GPU model. It depends on choices the application makes: how much shading compute there is, how large the payload is, what the caching behavior is, scene size & complexity, and so on. So you really have to profile your code to find out.
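One rough way to evaluate those overheads is to collect the same per-launch measurement (kernel duration, or any counter Nsight Compute reports) for two builds that differ in a single choice, such as one ray flag, and look at the relative change. Here is a minimal Python sketch of that comparison. The function name and the sample values are hypothetical and only for illustration; they are not measurements from a real GPU.

```python
from statistics import median


def relative_change(baseline_samples, variant_samples):
    """Relative change of the variant's median against the baseline's median.

    Both arguments are lists of the same per-launch measurement, e.g. kernel
    duration in milliseconds from repeated launches. A negative result means
    the variant measured lower than the baseline.
    """
    base = median(baseline_samples)
    var = median(variant_samples)
    if base == 0:
        raise ValueError("baseline median is zero; relative change is undefined")
    return (var - base) / base


# Hypothetical numbers: the same launch with anyhit enabled (baseline)
# and with anyhit disabled through a ray flag (variant).
with_anyhit_ms = [4.0, 4.2, 4.1]
without_anyhit_ms = [3.0, 3.1, 3.2]

change = relative_change(with_anyhit_ms, without_anyhit_ms)
print(f"relative change: {change:+.1%}")
```

Using the median of several launches keeps a single noisy run from dominating the comparison. The result only describes this one application on this one GPU, for the reasons given above.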
As far as doing computation on the SM while the RT cores are busy, that does happen already, and it happens automatically as long as there is work to do. The SMs are pretty good about switching to work on other threads as needed anytime a thread has to wait for resources like the RT cores or VRAM or PCI transfers or anything else.
Thank you for pointing me to those talks. I’ll be on the lookout for optimization advice when watching them. In case you are interested, the greater context of my question is this paper on arXiv, which evaluates OptiX in a non-CG setting. It also includes comparisons to traditional GPU data structures.
One last question: You mentioned that “traversal is generally excluded from the profiling metrics” - does this also apply to memory counters, i.e., cache misses and transfer sizes between caches?
Yes, that’s my understanding and what I meant, though to be honest I’ve never tested that super carefully, and there might be some fuzzy boundaries. OptiX device functions probably are included in profiling metrics, and they can include cached loads from parts of your BVH (e.g., for “random vertex access”). If I were to test it, I would probably start by creating hit/miss shader variants that don’t load anything from memory, try to account for all memory touched in raygen, and then use Nsight Compute to see how much total memory was requested from VRAM, ignoring the caches. You should be able to spot a large discrepancy if the traversal memory access is not being reported. If that works, you could re-enable your complex hit shader features and observe the deltas.
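The bookkeeping for that experiment could be sketched like this in Python. Everything here is an assumption for illustration: the class, the function names, and the byte counts are hypothetical. The "reported" values stand in for whatever total VRAM request figure you read off Nsight Compute for the launch.

```python
from dataclasses import dataclass


@dataclass
class LaunchMeasurement:
    name: str
    accounted_bytes: int  # bytes you expect your own programs to request (hand-computed)
    reported_bytes: int   # total bytes requested from VRAM, as read from the profiler


def unaccounted_bytes(m):
    """Reported traffic that your own accounting does not explain."""
    return m.reported_bytes - m.accounted_bytes


def unaccounted_ratio(m):
    """Unexplained share of the reported traffic (0.0 if nothing was reported)."""
    if m.reported_bytes == 0:
        return 0.0
    return unaccounted_bytes(m) / m.reported_bytes


def feature_delta(baseline, full):
    """Extra reported traffic after re-enabling the complex hit shader features."""
    return full.reported_bytes - baseline.reported_bytes


MiB = 1024 * 1024

# Hypothetical numbers. The stripped variant has hit/miss programs that load
# nothing, so only the raygen traffic is accounted for.
stripped = LaunchMeasurement("stripped", accounted_bytes=64 * MiB, reported_bytes=70 * MiB)
full = LaunchMeasurement("full", accounted_bytes=180 * MiB, reported_bytes=190 * MiB)

print(f"stripped: {unaccounted_ratio(stripped):.1%} of reported traffic is unexplained")
print(f"re-enabling features adds {feature_delta(stripped, full) / MiB:.0f} MiB")
```

My reading of the result: a small unexplained share in the stripped variant would suggest that traversal traffic is not part of the reported total, while a large one would suggest that it is. Treat that as a hint only, given the fuzzy boundaries mentioned above.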
The paper looks very cool! We really like seeing non-CG uses of OptiX. This reminds me of some work I’ve seen on using ray tracing for fast function inversion.