Progressive photon mapping sample with multiple GPUs

Hi,

Running the progressive photon mapping (PPM) sample from the OptiX Advanced Samples on two GTX 1080 Ti cards is much slower than on one. I would like to know how to benefit from multiple GPUs in this use case and, more generally, in use cases that have multiple passes and are bandwidth hungry.

One GPU:
Starting photon pass … finished. 0.00374957
Starting kd_tree build … finished. 0.007299
Starting gather pass … finished. 0.00669374

Two GPUs:
Starting photon pass … finished. 0.00798878
Starting kd_tree build … finished. 0.00505487
Starting gather pass … finished. 0.0188994
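
To put a number on "much slower", the three stage timings can be summed per configuration. This is only a minimal C++ sketch: `total_time` is a made-up helper, not part of the sample, and the log does not state units, so the values are used as printed.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <numeric>

// Sum the photon pass, kd-tree build and gather pass timings of one frame.
// total_time is a hypothetical helper for illustration only.
double total_time(const std::array<double, 3>& stages) {
    return std::accumulate(stages.begin(), stages.end(), 0.0);
}

// Timings copied from the log above, in the order:
// photon pass, kd_tree build, gather pass.
const std::array<double, 3> one_gpu  = {{0.00374957, 0.007299,   0.00669374}};
const std::array<double, 3> two_gpus = {{0.00798878, 0.00505487, 0.0188994}};
```

With these values, `total_time(two_gpus) / total_time(one_gpu)` comes out at roughly 1.8, and the gather pass alone is roughly 2.8 times slower on two GPUs, while the kd-tree build is slightly faster.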

It seems a major bottleneck is the output buffers that multiple GPUs write into. What if the photon maps could stay entirely local to each GPU? In the gather pass, each GPU would read only its own photon map. With the kd-tree built independently on each GPU, the whole photon map construction could be duplicated per GPU to avoid writing over PCIe. I am not sure how OptiX 5.0 could do this right now; it could be achieved with two features:

  1. An OptiX launch that lets GPUs write to their local buffers instead of only working in cooperation mode (same output).
  2. An OptiX launch that lets GPUs read from their corresponding local buffers, for example through a variable declared as “rtBufferLocal<> photon_map”. When writing the final output, the GPUs could still work in cooperation mode, but they would read their own photon_map from local memory. Of course, rtBufferLocal buffers would not be automatically synced between GPUs.

Any suggestions are welcome,

Thanks,

Yashiz

One more question:

Is it safe to change “gather_buffer” to RT_BUFFER_GPU_LOCAL?

Changing it to RT_BUFFER_GPU_LOCAL improves the gather pass from 0.018s to 0.008s on my two-GPU setup. It seems to work, but I am not sure whether that is just luck. “gather_buffer” is used by different passes, so it should only work if OptiX assigns the same area of “gather_buffer” to each GPU in every launch.

In fact, any accumulation-like buffer could benefit in such a use case. Even though “writes from multiple devices are not coherent, as a separate copy of the buffer resides on each device”, as long as OptiX assigns the same region of a local buffer to the same GPU across different OptiX launches, we can avoid copying back to the host.

This could be quite a useful feature.

I don’t see a variable named “gather_buffer” in the OptiX Advanced Examples. Please be more specific.

Citing some information from the OptiX API Reference as listed in another thread on this forum before:

[i]3.4.2.2 enum RTbufferflag
RT_BUFFER_GPU_LOCAL An RT_BUFFER_INPUT_OUTPUT has separate copies on each device that are not synchronized.

3.8.3.17 RTresult RTAPI rtBufferCreate
The flag RT_BUFFER_GPU_LOCAL can only be used in combination with RT_BUFFER_INPUT_OUTPUT. RT_BUFFER_INPUT_OUTPUT and RT_BUFFER_GPU_LOCAL used together specify a buffer that allows the host to only write, and the device to read and write data.
The written data will never be visible on the host side and will generally not be visible on other devices.
[/i]

The bold passages together imply that no safe accumulation is possible over multiple launches in a multi-GPU context with RT_BUFFER_GPU_LOCAL buffers, because you do not control the scheduling of launch indices per GPU.
(EDIT: I was wrong about that. Actually the OptiX multi-GPU load balancer is static! See post further down.)
Final accumulation needs to happen in real input_output buffers.

Dear Detlef,

Thank you for your answers and confirmation.

It is a bit sad to know we have this limitation on local buffers. I understand the pros of dynamic GPU scheduling, but if it means we have to sync over PCIe, that is a bit of a pain in the ass. Maybe there could be an option to hint the scheduling; for example, users could assign a fixed 40% of the indices to GPU1 and a fixed 60% to GPU2 …

Sorry about the “gather_buffer”; I had forgotten that I moved radius2, photon_count and flux out of HitRecord into a separate accumulation buffer.

Thanks again,

Yashiz

Yes, I know. We’ll keep this in mind.
I had some painful experiences with a supposedly homogeneous multi-GPU setup (dual Quadro K6000) in an older system. While the boards were identical, one board was connected to a PCI-E 16x Gen2 slot and the other to a PCI-E 4x Gen1 slot. No chance to get good scaling on that. The slow PCI-E slot choked it: only about 10-15% improvement over using just one of the boards with standard full-image path tracing.

After discussing this internally, using an RT_BUFFER_INPUT_OUTPUT with RT_BUFFER_GPU_LOCAL for accumulation on multiple GPUs actually works!
While the work distribution of the load balancer is still abstracted internally to be able to implement various schemes, it’s static over multiple GPUs. This means launches with identical launch dimensions assign identical launch indices to each GPU, so gathering algorithms will work.
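
Why the static distribution matters can be seen in a small CPU-only simulation. This is a sketch under stated assumptions, not OptiX code: `accumulate`, `static_owner` and `shifting_owner` are hypothetical names, the "devices" are plain vectors standing in for per-GPU copies of a GPU-local buffer, and the index-to-device mappings are invented for illustration (the real load balancer's scheme is internal).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Returns which simulated device handles a launch index in a given launch.
using Owner = std::function<int(int launch, std::size_t index)>;

// Each device keeps its own unsynchronized copy of the accumulation buffer.
// Every accumulation launch adds 1 at each index, on the owning device only.
// A final launch lets the owner of each index copy its local value into a
// shared output buffer that the host could read.
std::vector<float> accumulate(std::size_t size, int devices, int launches,
                              const Owner& owner) {
    assert(devices > 0);
    std::vector<std::vector<float>> local(
        static_cast<std::size_t>(devices), std::vector<float>(size, 0.0f));
    for (int launch = 0; launch < launches; ++launch)
        for (std::size_t i = 0; i < size; ++i)
            local[static_cast<std::size_t>(owner(launch, i))][i] += 1.0f;

    std::vector<float> output(size);
    for (std::size_t i = 0; i < size; ++i)
        output[i] = local[static_cast<std::size_t>(owner(launches, i))][i];
    return output;
}

// Static distribution: an index maps to the same device in every launch.
int static_owner(int /*launch*/, std::size_t i) {
    return static_cast<int>(i % 2);
}

// Hypothetical non-static distribution: the mapping shifts every launch.
int shifting_owner(int launch, std::size_t i) {
    return static_cast<int>((i + static_cast<std::size_t>(launch)) % 2);
}
```

With two devices and four accumulation launches, `accumulate(8, 2, 4, static_owner)` yields 4 at every index, the same as a single device would. With `shifting_owner` each index ends up with only 2, because its contributions are split across two copies that are never merged.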

You would just need to output the final accumulated result to another output buffer to make it accessible on the host. That step could also do the tonemapping and conversion from float to unsigned byte formats (recommended RGBA32F or RGBA16F to BGRA8) to reduce the PCI-E load even more.
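
As a rough illustration of that final step, here is a CPU-side C++ sketch of converting one float RGBA pixel to BGRA8. In an OptiX renderer this logic would live in a device program; the function names are hypothetical, and the tonemapping operator (Reinhard followed by a 1/2.2 gamma) is only an assumption, since no specific operator is prescribed here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Map one linear HDR channel to an 8-bit display value.
// Assumed operator: Reinhard x / (1 + x), then gamma 1/2.2.
std::uint8_t to_byte(float linear) {
    const float x = std::max(linear, 0.0f);            // drop negative values
    const float mapped = x / (1.0f + x);               // compress to [0, 1]
    const float display = std::pow(mapped, 1.0f / 2.2f);
    const float clamped = std::min(std::max(display, 0.0f), 1.0f);
    return static_cast<std::uint8_t>(std::lround(clamped * 255.0f));
}

// Convert a float RGBA pixel to BGRA8: swap red and blue, tonemap the color
// channels, and store alpha linearly.
std::array<std::uint8_t, 4> to_bgra8(const std::array<float, 4>& rgba) {
    const float alpha = std::min(std::max(rgba[3], 0.0f), 1.0f);
    const std::array<std::uint8_t, 4> out = {{
        to_byte(rgba[2]),                                       // blue
        to_byte(rgba[1]),                                       // green
        to_byte(rgba[0]),                                       // red
        static_cast<std::uint8_t>(std::lround(alpha * 255.0f))  // alpha
    }};
    return out;
}
```

An RGBA32F pixel takes 16 bytes and an RGBA16F pixel 8 bytes, while BGRA8 takes 4, so the buffer that crosses PCI-E shrinks by a factor of 4 or 2 respectively.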

The OptiX Advanced Example optixVox on https://github.com/nvpro-samples/optix_advanced_samples is doing accumulation this way.

I’m really sorry for the misleading recommendations. I need to work on making all my renderers faster for multi-GPU now…

Dear Detlef,

Thank you very much for the update. This is really good news!
So now, we could safely use local buffers for accumulations :)

Cheers,

Yashiz