Multi-GPU with several float buffers


we have a global illumination renderer (path tracing) running with OptiX on Linux (Ubuntu 12.04) and a set of post-processing operations using CUDA with OptiX's CUDA interop (we currently use CUDA 5.0.35). For the latter we rely not only on lighting information, but also on geometry and variance data. We also filter indirect illumination separately (so direct and indirect illumination live in separate buffers).

This means we have several buffers that are filled by the renderer, all consisting of float4 elements. Everything worked like a charm with one GPU (GTX Titan), but when I added a second Titan, rendering actually became quite a lot slower: it dropped from around 50 million rays per second in a typical scene to 10 to 20 million. Stepping through the code, this seems strongly related to our extensive use of OptiX buffers; successively commenting out buffer writes increases performance vastly.

I am aware that the buffers are stored on the host in a multi-GPU setup, but is it really prohibitive to use multiple buffers if you still want to get good performance from your average multi-GPU setup?

I also tried adding a loop inside the kernel launch to take multiple samples per launch, which more or less solves the problem. But since our application also relies on interactive navigation, drawing N samples per pixel for each "frame" is not a real solution, as it divides the presented frame rate by N.

Just to be sure you get it right: the problem does not appear to be related to the post-processing. Of course that also takes longer, since only one device is employed there and the data has to be copied, but the rays/second figure mentioned above accounts only for the pure rendering time.

My only idea at the moment to get more control over what happens here is to create two processes, each with its own OptiX context, and combine the results manually as part of the post-processing we apply anyway. Communication would be done via IPC mechanisms such as shared memory. This should work; at least, launching the application on both devices in parallel caused no significant performance loss in either of them.

Just creating two OptiX contexts in separate threads didn't work: it crashed with an error referring to something like a context being bound from a different thread (I can't recall it exactly and I don't have my machine available to check, sorry).

Thanks for any ideas and hints on this :)

Hi Nahum,

Unfortunately, OptiX is not thread safe at the moment; it might work or it might not. It seems it did not work in your test with multiple independent threads…

A hint for the multi-GPU setup: to reduce the buffer copies between host memory and the GPUs, if you know some buffers have not changed between frames, you can try OR'ing the RT_BUFFER_COPY_ON_DIRTY flag into the buffer descriptor during rtBufferCreate. This will cause the data copies to happen only when OptiX has an explicit reason to believe the data is dirty.
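A minimal sketch of what that looks like with the OptiX 3.x C API (assuming an existing RTcontext named `context` and illustrative dimensions; error checking omitted for brevity):

```c
#include <optix.h>

/* Create an input buffer that is only re-copied to the devices when it is
 * explicitly marked dirty, instead of on every launch. */
RTbuffer buf;
rtBufferCreate(context, RT_BUFFER_INPUT | RT_BUFFER_COPY_ON_DIRTY, &buf);
rtBufferSetFormat(buf, RT_FORMAT_FLOAT4);
rtBufferSetSize1D(buf, width * height);

/* Whenever the contents actually change, tell OptiX so it synchronizes the
 * buffer to the devices before the next launch: */
rtBufferMarkDirty(buf);
```

As noted, this only helps for buffers whose contents survive unchanged across frames; a buffer that is marked dirty every frame copies just as often as before.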



Actually, all the buffers I use change every frame, so I believe there really was no other way than doing it with multiple processes and IPC. Luckily I found that CUDA also supports IPC, so I could easily hand the device pointer I got from OptiX in the child process over to the parent process, which then combines everything in the already-existing post-processing chain, relying only on device-to-device copies.
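For reference, a hedged sketch of that handoff with the CUDA runtime IPC API (function and variable names are illustrative, not from my actual code; error checking omitted):

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* -- child (renderer) process ------------------------------------------ */
/* Wrap the device pointer obtained from OptiX (rtBufferGetDevicePointer)
 * in an IPC handle. The handle is a small plain struct that can be sent
 * to the parent over any IPC channel (pipe, shared memory, ...). */
void export_buffer(void* optix_dev_ptr, cudaIpcMemHandle_t* out_handle)
{
    cudaIpcGetMemHandle(out_handle, optix_dev_ptr);
}

/* -- parent (post-processing) process ---------------------------------- */
/* Map the child's allocation into this process and copy it straight into
 * the post-processing chain's buffer, device-to-device. */
void import_and_copy(cudaIpcMemHandle_t handle, void* dst, size_t nbytes)
{
    void* mapped = NULL;
    cudaIpcOpenMemHandle(&mapped, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(dst, mapped, nbytes, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(mapped);
}
```

Note that CUDA IPC handles can only be opened in a different process than the one that created them, and both processes must see the same device.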

Still, it would have been nice not having to do this :) …I understand that buffers are stored on the host in multi-GPU setups, but will there be a way around this in the future? I believe good support for multi-GPU setups is a must-have, and in the current state it's a little annoying that simply using multiple buffers that are updated every frame completely crashes your performance.

Also, when I copied the buffers to the host manually in the multi-process setup, everything was still quite a lot faster than when OptiX took care of it. Why is this? Does it have to do with OptiX's scheduling approach?

You cannot compare multi-processing with multi-GPU, because those are two different approaches, each with its own pros and cons. OptiX takes care of many things under the hood (besides enforcing a strong exception safety policy).

There are some multi-GPU issues being worked on; many of them have been fixed, and the fixes will ship in the next release. Perhaps that will solve your problem as well.

Thanks Mark, I am waiting eagerly :)

I will give my feedback here as soon as I have tested it then!