Greetings! As the title suggests, I am using OptiX 6.0, GeForce RTX 2060, 436.30 driver.
I’ve been tasked with optimizing our OptiX kernels. Nsight says that our GPU utilization is fairly low, so I wanted to approach that from a few angles. In particular, I’ve noticed that our kernels declare a host of local variables, so I figured that the low GPU utilization may partially be due to a shortage of registers. Disclaimer: I wasn’t able to verify that with Nsight, because I couldn’t find our OptiX kernel in the list of all kernels. I remember that with OptiX 5.0 it was labeled “MegaKernelN”, but I am not sure about OptiX 6.0.
p.16 at http://on-demand.gputechconf.com/gtc/2013/presentations/S3475-Ray-Tracking-With-OptiX.pdf states that “when working set of registers is too large, registers are stored to local memory”. Does that mean that OptiX performs this optimization automatically when it compiles our kernels? Or do you think that manually moving the aforementioned local variables to some local memory could alleviate the register shortage? (I assume not, if I understood correctly that OptiX already moves them to local memory on its own.)
I would greatly appreciate pointers in the right direction. Thank you for your time!
What the documentation means is that code is compiled to spill registers into memory around function calls, or any time the register usage overflows the number of available registers. This is done at compile time, and the same is true of a CUDA program or a host-side CPU program: registers are frequently stored to and retrieved from memory.
It can be difficult to reduce register usage by moving local variables around, since the compiler is deciding the register usage for you. Here are a few strategies for reducing register usage:
Keep the scope of your variable declarations and references as tight/small as possible.
Remove variables from your code, if you can see ways to do so.
Reduce the size of your data/variables, if you can. Using floats instead of doubles, or half floats instead of full floats, frees up one register for every 32 bits you save.
Reduce the size of your payload & attributes in your OptiX programs.
Look for places that compute a local variable before optixTrace() (rtTrace() in the OptiX 6 API) or a callable program call, and also refer to that variable after the call. This will often cause the variable to be saved to memory before the call and restored afterwards. Trace calls and callable programs are not inlined, so each one needs a large set of registers of its own. Sometimes it’s better to re-compute simple expressions after a trace/callable call rather than hold onto a variable.
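To make two of the strategies above concrete, here is a rough host-side C++ sketch. It is not OptiX code: trace_stub is a hypothetical stand-in for a non-inlined trace or callable program call, and the payload structs and their field names are made up for illustration. The point is only to show the shape of the change. Whether it actually lowers register pressure depends on the compiler, which is free to reorder these computations itself, so measure before and after.

```cpp
#include <cstddef>

// Hypothetical per-ray payloads (field names are illustrative only).
// The narrow version stores the same information in fewer 32-bit words.
struct PayloadWide {
    double radiance[3];
    double distance;
    int    depth;
    int    done;
};

struct PayloadNarrow {
    float         radiance[3];
    float         distance;
    unsigned char depth;
    unsigned char done;
};

// Number of 32-bit words needed to hold `bytes` bytes.
constexpr std::size_t words32(std::size_t bytes) { return (bytes + 3) / 4; }

static_assert(words32(sizeof(PayloadNarrow)) < words32(sizeof(PayloadWide)),
              "the narrow payload should occupy fewer 32-bit words");

#if defined(__GNUC__) || defined(__clang__)
#define NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE
#endif

// Hypothetical stand-in for a call that is not inlined (a trace call or a
// callable program). The arithmetic is arbitrary.
NOINLINE float trace_stub(float origin_x, float dir_x) {
    return origin_x + 2.0f * dir_x;
}

// Version A: t2 and t3 are computed before the call and used after it, so
// t, t2 and t3 are all written as live across the call.
float shade_hold(float t, float origin_x, float dir_x) {
    float t2  = t * t;
    float t3  = t2 * t;
    float hit = trace_stub(origin_x, dir_x);
    return hit * (t + t2 + t3);
}

// Version B: only t is written as live across the call; the cheap derived
// values are computed after the call returns.
float shade_recompute(float t, float origin_x, float dir_x) {
    float hit = trace_stub(origin_x, dir_x);
    float t2  = t * t;
    float t3  = t2 * t;
    return hit * (t + t2 + t3);
}
```

Both functions return the same value for the same inputs; calling shade_hold(2.0f, 1.0f, 0.5f) and shade_recompute(2.0f, 1.0f, 0.5f) is a quick way to check that a rewrite like this did not change the result.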
I do recommend trying to get Nsight Compute to work. It will really help you understand register usage, and it may also uncover other reasons for the low utilization that are not related to register usage. I’m not at all sure why it’s not working for you right now, but I can recommend two things to try: upgrade to the latest driver, and use the OptiX 6.5 SDK.
Also, I think it’s probably more common for memory usage to be a bottleneck than register usage, so even without Nsight Compute, I would suggest looking for ways to reduce memory bandwidth as the first angle of attack. (And note especially that if this is your problem, then trying to move registers into memory could make it worse.) There are other common reasons for low utilization, including low scene/BVH coherence and shader divergence, so think about whether those might be the cause too.