hi everyone, I would like to get some help on optimizing an OptiX code that we have been working on.
first, some background, my group does research on Monte Carlo (MC) photon transport simulations and we have published open-source codes (portal: http://mcx.space, CUDA-code: GitHub - fangq/mcx: Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator, OpenCL code: GitHub - fangq/mcxcl: Monte Carlo eXtreme for OpenCL (MCXCL)) and research papers (Monte Carlo eXtreme - The Photon Player). The primary users of our simulators are biomedical optics researchers for modeling light-tissue interactions.
Although under the hood our MC simulators are just ray-tracers, there are a few major differences compared to typical ray-tracing, which I summarized in this previous post (in short - we deal with lots of short rays due to high scattering in diffusive light, and we focus on volumetric optical properties instead of surfaces)
One of my students recently wrote a prototype OptiX code to implement our algorithm. The good news is that it works and gives correct answer. However, we were a bit disappointed that it was 2-3x slower than our CUDA code - despite the OptiX code only has a barebone implementation.
We are trying to figure out why this OptiX code is slow. We got some profiling results from nsight-compute, the profiling result for the OptiX code and CUDA codes are attached at the end.
Overall, all key performance metrics, including compute efficiency, memory efficiency, occupancy, the OptiX code is about half of those for the CUDA code (same hardware, same driver, same thread/block size, same type of workload, but CUDA code runs more photons). Also, CUDA code has 64 registers but the OptiX code has 117 registers.
We only have 3 optix shaders -
intersection, each is quite short, about 15-20 lines. We used 17 payload registers and 4 attribute registers. The registers used in each shader is definitely much less than what we used in the CUDA code main kernel.
I am wondering: why OptiX uses so many registers? is it common to have such a high overhead?
It is interesting to note that in the CUDA code, when I comment out the global-memory reading and writing part, I can see significant speed increase, but I am not able to see such change in Optix based implementation, showing that something else is the bottleneck that is much higher than global memory cost (which is supposed to be the highest of this type of algorithm).
any suggestion is appreciated!
Optix code profiling result:
in comparison, the report for the CUDA version is listed below: