Excellent improvement!
The memory bandwidth looks fine. No host memory accesses, very few device memory accesses, lots of L1 and L2 cache hits.
All the other instruction reductions are awesome. (I assume the IADD reduction comes from a lot less pointer chasing inside the code?)
I can’t say what else could be improved to reduce MIO stalls. Maybe you’re at the peak of what your GPU can handle, which I can’t judge because you still haven’t provided information about your system configuration.
The overall memory throughput of 2.54 GB/sec doesn’t sound high enough to saturate any current GPU.
Your average active threads per warp of 27.6 (out of a maximum of 32) in the original report also looks OK.
You wouldn’t reach that value with a path tracer, for example, where ray divergence drags it down. That’s why Shader Execution Reordering is a thing.
Maybe the memory access pattern isn’t good enough, like gathering data from too many different addresses instead of hitting cache lines nicely.
Memory layout changes to Structure of Arrays (SoA) can help there, but only if the kernels walk along these arrays linearly instead of jumping around.
On the other hand, it also makes sense to keep things interleaved in arrays of small structures if all data of the structure is needed anyway (e.g. vertex attributes). Either way, read vectorized data types in order of increasing memory addresses.
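To make that concrete, here’s a minimal sketch of the two layouts and a vectorized read in increasing address order (the attribute names and the 2x float4 packing are just assumptions for illustration):

```cpp
// AoS: keep small structures interleaved when every field is needed together.
// The fields are packed into float4 so one vertex is fetched with two 16-byte loads.
struct VertexAttrib            // 32 bytes, 16-byte aligned
{
  float4 positionAndU;         // xyz = position, w = texcoord u
  float4 normalAndV;           // xyz = normal,   w = texcoord v
};

// SoA: separate arrays are preferable when a kernel only touches some of the
// fields and walks them linearly (coalesced accesses across the warp).
struct VertexAttribSoA
{
  float4* positionAndU;
  float4* normalAndV;
};

// Vectorized reads in order of increasing memory addresses:
__device__ void loadVertex(const VertexAttrib* attribs, const unsigned int index,
                           float3& position, float3& normal, float2& uv)
{
  const float4 a = attribs[index].positionAndU; // lower 16 bytes first
  const float4 b = attribs[index].normalAndV;   // then the upper 16 bytes
  position = make_float3(a.x, a.y, a.z);
  normal   = make_float3(b.x, b.y, b.z);
  uv       = make_float2(a.w, b.w);
}
```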
Make sure structure fields are ordered by their CUDA alignment requirements from largest to smallest, so the compiler doesn’t insert unnecessary padding between the fields.
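For example, with some hypothetical fields:

```cpp
// Bad: the compiler inserts padding to satisfy the float4/float2 alignments.
struct HitRecordPadded
{
  float  t;         //  4 bytes + 12 bytes padding (float4 needs 16-byte alignment)
  float4 color;     // 16 bytes
  int    primIndex; //  4 bytes +  4 bytes padding (float2 needs 8-byte alignment)
  float2 uv;        //  8 bytes
};                  // 48 bytes total

// Good: fields ordered by alignment requirement, big to small, no padding.
struct HitRecordPacked
{
  float4 color;     // 16-byte alignment
  float2 uv;        //  8-byte alignment
  float  t;         //  4-byte alignment
  int    primIndex; //  4-byte alignment
};                  // 32 bytes total
```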
Live variables are simply variables whose data you need both before and after an OptiX device call like optixTrace or a callable. Sometimes it’s faster to recalculate a value after the call instead of keeping it live across it.
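A sketch of what that means in a ray generation program (params, computeRay, and the angle values are made-up placeholders; only the structure matters):

```cpp
extern "C" __global__ void __raygen__example()
{
  float3 origin;
  float3 direction;
  computeRay(optixGetLaunchIndex(), origin, direction); // assumed helper

  unsigned int p0 = 0, p1 = 0, p2 = 0; // payload registers

  optixTrace(params.handle,            // assumed launch parameter struct
             origin, direction,
             0.0f, 1.e16f, 0.0f,       // tmin, tmax, rayTime
             OptixVisibilityMask(255),
             OPTIX_RAY_FLAG_NONE,
             0, 1, 0,                  // SBT offset, SBT stride, miss index
             p0, p1, p2);

  // These values are only needed after the trace and are cheap to derive from
  // 'direction', so recalculating them here avoids keeping two more values
  // live across the optixTrace call:
  const float theta = acosf(direction.y);
  const float phi   = atan2f(direction.z, direction.x);
  // ... combine theta, phi with the payload results ...
}
```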
The number of registers used inside the OptiX kernel can be controlled with the OptixModuleCompileOptions maxRegisterCount value, where OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT usually results in 128 of the 255 available registers (one is reserved).
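For reference, that’s a host-side module compile option; a minimal sketch (the exact module creation call depends on your SDK version):

```cpp
OptixModuleCompileOptions moduleCompileOptions = {};

// Default: OptiX usually limits the kernel to 128 registers.
moduleCompileOptions.maxRegisterCount = OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT;

// Candidates to benchmark, depending on what the SASS and Nsight Compute show:
// moduleCompileOptions.maxRegisterCount = 64;  // if the kernel uses far fewer than 128
// moduleCompileOptions.maxRegisterCount = 255; // if there is register spilling (STL/LDL)

moduleCompileOptions.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;
moduleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;

// The options are then passed to the module creation, e.g.:
// optixModuleCreate(context, &moduleCompileOptions, &pipelineCompileOptions,
//                   input, inputSize, log, &logSize, &module);
```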
The number of registers affects how many blocks can run concurrently. There is a graph inside Nsight Compute which shows that relation; it looks like a stair step, and which values make sense can be read from that graph.
So if you see that you’re using a lot fewer registers than 128, like 64 or 32, it makes sense to try setting maxRegisterCount to a smaller value and benchmark that.
On the other hand, if you see register spilling (lots of STL/LDL instructions around your OptiX calls inside the SASS code), which can come from live variables, then it might make sense to raise maxRegisterCount to 255 and benchmark that. Of course, reducing the live variables should still be tried. There is a cliff at 128, so once you go over that, there isn’t much difference between 192 and the maximum of 255.
If the performance doesn’t change and the SASS code doesn’t show reduced memory accesses, then just keep maxRegisterCount = OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT.