I’m currently looking into occupancy for an OptiX kernel, and based on some previous posts ( Help reduce the high register count of an Optix raytracer code ), Nsight Compute seems to be the recommended tool for this. However, when I look at Nsight Compute, the SASS view shows a maximum of 43 live registers, but the registers-per-thread number is 107. According to the occupancy calculator for sm_86 (the compute architecture I am currently using), this puts a hard upper bound of less than 1/3 on theoretical occupancy, which the kernel roughly achieves.
It makes sense that the registers per thread would be somewhat larger, since registers are allocated at a granularity greater than 1, but I’m still puzzled as to where an extra 64 registers could be coming from. I assume it is from some internal ABI / bookkeeping that OptiX is doing. If so, is there any way to decrease this? The application I am running is incredibly simple, with only a few words of live state across optixTrace calls and no ray payloads at all.
I was also thinking about the implications of this register count for the architecture of a path tracer I am writing with OptiX. My thought was that, since the user gives up control of scheduling to OptiX, occupancy might be increased by giving different register budgets to the different closest-hit programs depending on how complex the material / BSDF evaluation code is, which would suggest that the megakernel style I’m currently using is not too suboptimal. However, if the kernel actually needs the maximum over all possible evaluation paths, does this mean wavefront is still a better option?
Try using OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING.
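For reference, a minimal sketch of where that flag goes, assuming the scene really is a single IAS over GASes (field names follow the OptiX 7 SDK headers):

```cpp
// Restrict the traversable graph to one level of instancing (IAS -> GAS).
// This lets OptiX use a specialized, cheaper traversal path, which can
// also reduce register pressure.
OptixPipelineCompileOptions pipelineCompileOptions = {};
pipelineCompileOptions.traversableGraphFlags =
    OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING;
// ... set usesMotionBlur, numPayloadValues, etc. as before, and pass the
// same options to both module creation and optixPipelineCreate.
```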
For your question about using different register counts for different CH programs: yes, your hunch is correct, it is possible to increase occupancy by tuning the registers used by your different shader programs. This is why we added the Payload Semantics API. By carefully specifying the scope of each payload value, you let the compiler minimize the number of registers used: two or more payload values that don’t overlap in time can share a single register.
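A sketch of what specifying payload semantics looks like (OptiX 7.4+; the constant names follow the SDK headers, but the particular semantics chosen here are just illustrative):

```cpp
// Two payload slots with explicit lifetimes:
//   slot 0: written by the trace caller, read by closest-hit
//   slot 1: written by closest-hit / miss, read back by the caller
unsigned int semantics[2] = {
    OPTIX_PAYLOAD_SEMANTICS_TRACE_CALLER_WRITE |
        OPTIX_PAYLOAD_SEMANTICS_CH_READ,
    OPTIX_PAYLOAD_SEMANTICS_TRACE_CALLER_READ |
        OPTIX_PAYLOAD_SEMANTICS_CH_WRITE |
        OPTIX_PAYLOAD_SEMANTICS_MS_WRITE,
};

OptixPayloadType payloadType = {};
payloadType.numPayloadValues = 2;
payloadType.payloadSemantics = semantics;

OptixModuleCompileOptions moduleCompileOptions = {};
moduleCompileOptions.numPayloadTypes = 1;
moduleCompileOptions.payloadTypes    = &payloadType;
```

Note that when payload types are specified this way, numPayloadValues in the OptixPipelineCompileOptions must be set to 0.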
From a performance perspective, with all else being equal (including occupancy), megakernel is almost guaranteed to be better, because we’re comparing megakernel’s register usage to wavefront’s global memory usage, and global memory is much slower than register access. Wavefront can of course be much simpler, and sometimes very advanced shading systems are either difficult or impossible to implement as a megakernel. But for extremely simple pipelines where a megakernel is easy, I would generally expect the megakernel approach to provide the highest performance.
Note that higher occupancy does not necessarily imply higher performance. Always measure and don’t assume that higher occupancy is better. In a given pipeline, yes, OptiX will generally need to use the maximum hit program’s register count for all hit programs. You could create multiple pipelines, each with its own register requirements, but the register count is a launch-wide attribute, not a per-thread attribute. So it’s important to know that you can’t really get away from requiring the maximum register count across your different material shaders by moving to wavefront, unless you evaluate each material in a separate kernel launch, and that’s likely going to be much slower than taking the occupancy hit of combining the materials into a single launch.
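If you do want to experiment with register/occupancy trade-offs per pipeline, OptixModuleCompileOptions exposes a register cap. A sketch (the cap of 80 is an arbitrary example, not a recommendation; forcing it too low causes spills, so always measure):

```cpp
// Per-module register cap; 0 means OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT,
// i.e. no explicit limit. Lowering it trades register spills for occupancy.
OptixModuleCompileOptions moduleCompileOptions = {};
moduleCompileOptions.optLevel         = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;
moduleCompileOptions.debugLevel       = OPTIX_COMPILE_DEBUG_LEVEL_NONE;
moduleCompileOptions.maxRegisterCount = 80;  // illustrative value
```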
Since we’re talking about perf of a shading system, keep in mind that for high performance in simple pipelines, it might be worth considering some kind of uber-shader material. Having a single material and avoiding divergence at the shader level might provide much better performance than having separate materials, even if there are unused terms in the uber-shader, as long as there is some overlap.
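The uber-shader idea above can be sketched as a single closest-hit program that branches on a per-instance material tag. This is only an illustration: MaterialData, the MATERIAL_* tags, and the evaluate* helpers are hypothetical names, while optixGetSbtDataPointer is the real SBT accessor:

```cpp
// One closest-hit program handles every material; the SBT record carries
// a material descriptor with a type tag.
extern "C" __global__ void __closesthit__uber()
{
    const MaterialData* mat =
        reinterpret_cast<const MaterialData*>( optixGetSbtDataPointer() );

    float3 result;
    switch( mat->type )  // usually warp-uniform on coherent hits, so
    {                    // divergence stays low in practice
        case MATERIAL_DIFFUSE: result = evaluateDiffuse( *mat ); break;
        case MATERIAL_GGX:     result = evaluateGgx( *mat );     break;
        default:               result = make_float3( 0.f, 0.f, 0.f ); break;
    }
    // ... write result into the payload registers via optixSetPayload_*()
}
```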