OK, the thing about my OptiX program is that it's a non-real-time path tracer, and when rendering my scenes a significant portion of the pixels hit the background.
So I was wondering about the possibility of putting the threads that would normally just recalculate the background color over and over to work on calculating new samples for the pixels that actually render geometry.
Of course, that would require some thread synchronization to keep track of how many samples have been calculated, and synchronization across devices is currently not possible. But still, perhaps some of you can tell me whether it is possible to force certain threads to process pixels that do not correspond to that thread's launch index, and to do so in a way that doesn't affect the resulting image.
First, you don't have that fine-grained access to CUDA resources within OptiX. It uses a single-ray programming model, and everything about blocks, warps, and threads is internal and must not be touched. But there are other ways.
One of my path tracers does adaptive tiled rendering with a convergence-threshold calculation that only shoots more rays in regions where there is more work to do. It quickly finds primary rays hitting the miss shader (and reflections of the background in specular regions) and stops wasting rays on those.
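The post doesn't spell out the convergence metric, but one plausible per-tile test could compare successive accumulation states against a threshold. A minimal host-side sketch, with all names and the luminance metric being my own assumptions:

```cpp
#include <math.h>
#include <vector_types.h>  // float4 from the CUDA headers

// Hypothetical per-tile convergence test: stop scheduling a tile once the
// maximum per-pixel luminance change between two accumulation states drops
// below a threshold. The exact metric used in the post is not specified.
bool tileConverged(const float4* curr, const float4* prev,
                   unsigned int numPixels, float threshold)
{
  float maxDelta = 0.0f;
  for (unsigned int i = 0; i < numPixels; ++i)
  {
    const float delta = fabsf(0.299f * (curr[i].x - prev[i].x) +
                              0.587f * (curr[i].y - prev[i].y) +
                              0.114f * (curr[i].z - prev[i].z));
    maxDelta = fmaxf(maxDelta, delta);
  }
  return maxDelta < threshold;  // converged tiles receive no more rays
}
```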
If you're able to determine which regions of your rendering need more work, you could simply keep a list of regions or individual pixels for your next launch. The number of pixels in that list is your new launch dimension, and the information inside that list is used to calculate the pixel coordinate in the full-sized output image which receives the result.
It's like scattered writes, but without the need for synchronization with atomics, because every launch index writes to a separate result pixel. (Using atomics in OptiX wouldn't work on multi-GPU!)
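A minimal device-side sketch of that idea in the old rtBuffer/rtContextLaunch-era API; the names pixel_list, output_buffer, and num_work_pixels are mine for illustration, not from the original post:

```cpp
#include <optix.h>
#include <optixu/optixu_math_namespace.h>

using namespace optix;

rtBuffer<uint2, 1>  pixel_list;     // coordinates of pixels that need more samples
rtBuffer<float4, 2> output_buffer;  // full-sized result image

rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );
rtDeclareVariable(uint2, launch_dim,   rtLaunchDim, );
rtDeclareVariable(unsigned int, num_work_pixels, , );  // actual list length

RT_PROGRAM void raygen()
{
  // Flatten the 2D launch index into a position in the work list.
  const unsigned int linear = launch_index.y * launch_dim.x + launch_index.x;

  // The launch was padded up to a full 2D grid; skip the dummy entries.
  if (linear >= num_work_pixels)
    return;

  // This launch index is responsible for one specific pixel of the full image.
  const uint2 pixel = pixel_list[linear];

  // ... generate the primary ray for 'pixel', trace, accumulate ...
  const float4 result = make_float4(0.0f);  // placeholder for the traced radiance

  // Every launch index writes to a distinct pixel, so no atomics are required.
  output_buffer[pixel] = result;
}
```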
I'm not actually sure whether OptiX distributes 1D launches across GPUs, so I'm using 2D launches that fit around the number of rays and fill up the last row with dummy rays, which are skipped inside the ray generation program if required. That worked nicely.
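For what it's worth, the host-side fitting could look roughly like this; the fixed width of 1024 and the function name are my assumptions:

```cpp
#include <optix.h>

// Launch a padded 2D grid over 'num_work_pixels' work items. The ray
// generation program skips the dummy indices beyond the real count.
void launchWorkList(RTcontext context, unsigned int entry_point,
                    unsigned int num_work_pixels)
{
  const RTsize width  = 1024;                                   // arbitrary fixed width
  const RTsize height = (num_work_pixels + width - 1) / width;  // round up

  rtContextLaunch2D(context, entry_point, width, height);
}
```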
Automatically distributing rtContextLaunch1D() to multiple GPUs.
OptiX distributes the work with some tiling mechanism, and I have never gotten around to testing whether that works as expected with rtContextLaunch1D(), because I normally do image synthesis, and that is at least rtContextLaunch2D(). Someone from the OptiX core team might chime in here.
I just wanted to note that (unless I'm mistaken) there is a way around this. Atomics do work if buffers are created with RT_BUFFER_GPU_LOCAL, where each GPU has its own copy. After the rtLaunch you can post-process the multiple buffers (one per GPU) with CUDA to get the result you want.
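If I understand that correctly, a sketch in the old C API might look like the following; the buffer name, format, NUM_BINS, and the rtBufferGetDevicePointer() post-processing path are my assumptions for illustration:

```cpp
#include <optix.h>

enum { NUM_BINS = 256 };  // illustrative buffer size

void setupAndReduce(RTcontext context)
{
  // Each GPU gets its own copy of a GPU_LOCAL buffer, so device-side
  // atomics on it never cross GPUs.
  RTbuffer local_counts;
  rtBufferCreate(context, RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL,
                 &local_counts);
  rtBufferSetFormat(local_counts, RT_FORMAT_UNSIGNED_INT);
  rtBufferSetSize1D(local_counts, NUM_BINS);

  // Device side (inside the OptiX program), for reference:
  //   rtBuffer<unsigned int, 1> local_counts;
  //   atomicAdd(&local_counts[bin], 1u);  // safe: this copy is GPU-local

  // ... rtContextLaunch2D(...) happens here ...

  // Afterwards, fetch each GPU's copy and combine the results with CUDA.
  unsigned int num_devices = 0;
  rtContextGetDeviceCount(context, &num_devices);
  for (unsigned int i = 0; i < num_devices; ++i)
  {
    void* d_ptr = 0;
    rtBufferGetDevicePointer(local_counts, (int)i, &d_ptr);
    // ... launch a CUDA kernel on d_ptr to accumulate this GPU's results
    //     into the final combined buffer ...
  }
}
```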