It’s important to know that Nsight Compute is still geared primarily toward CUDA kernels. It is a very good tool for identifying compute bottlenecks in your shaders, but there are a few things to be aware of when using it to profile your OptiX kernels.
The main thing to know is that Nsight Compute’s “SM utilization” calculation doesn’t include RTX ops. Time spent on the RT Cores can leave the SM waiting for rays to finish, so especially when your shading workload is light, the overall utilization can seem low even though your GPU is loaded with ray tracing work. In other words, your OptiX utilization might be much higher than what is reported by Nsight Compute.
You may want to carefully calculate your rays traced per second; that will give you a better idea of whether you’re near the performance limits of your GPU.
I would also recommend using Nsight Systems to check the overall picture of memory allocation, data transfers, and kernel timings. You’ll be able to see how much transfer times matter, and you may be able to much more quickly spot major bottlenecks that aren’t visible in Nsight Compute.
A few other assorted tips:
Your workload (650k threads) is decent if you’re tracing multiple rays. It’s a bit small for achieving maximum utilization if your threads are tracing a single ray and exiting. If you need multiple samples or multiple rays per pixel, tracing them together in a single launch is faster than separate launches. When I’m profiling my custom intersectors, for example, I usually time things with 100 samples per pixel.
Yes, you can batch together multiple scenes, and the GPU will be able to fill up more of the free time. It’s much easier to use multiple CUDA streams, with one launch per stream at a time, than it would be to batch multiple scenes into a single launch. Separate streams will overlap as much as possible on the GPU to fill the gaps.
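Here’s a minimal sketch of that one-launch-per-stream pattern. The names (`numScenes`, `d_params`, `width`, `height`, and the existing `pipeline`/`sbt`) are placeholders for whatever your application already has; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <optix.h>
#include <vector>

// Assumes pipeline, sbt, and one device-side Params buffer per scene
// (d_params[i]) have already been set up elsewhere.
void launchScenes(OptixPipeline pipeline, const OptixShaderBindingTable& sbt,
                  const std::vector<CUdeviceptr>& d_params,
                  unsigned int width, unsigned int height)
{
    const size_t numScenes = d_params.size();
    std::vector<cudaStream_t> streams(numScenes);

    for (size_t i = 0; i < numScenes; ++i)
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);

    // Queue each scene's launch on its own stream; the GPU is free to
    // overlap them wherever resources allow.
    for (size_t i = 0; i < numScenes; ++i)
        optixLaunch(pipeline, streams[i],
                    d_params[i], sizeof(Params), &sbt,
                    width, height, /*depth=*/1);

    for (size_t i = 0; i < numScenes; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}
```

Note that each stream needs its own launch params buffer, since the launches may be in flight simultaneously.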
You mentioned frequent data exchange, so it’s worth checking whether your pipeline can move more of the work (and its data) to the GPU. For example, if you’re doing your simulated annealing on the CPU and transferring the results, you could perhaps get higher performance by moving the simulated annealing step into a CUDA kernel, which might let you remove some of your host-to-device transfers and speed up the simulation.
If you’re tracing exactly 1 ray per thread, then it appears from the screen capture that in your profile you’re getting ~2 giga-rays per second (665,600 rays / 320.93 µs). If you’re using OptiX triangles, that seems a bit low for a mesh with 400 triangles. But the kernel is very short, and this was measured in Nsight Compute, so we can’t be sure that’s the speed you’re getting when not profiling. Try to calculate your rays/sec on a larger batch while running against a release build. If you’re using a custom intersector, or multiple samples per pixel/thread, it’s pretty good.