Multi-GPU with several float buffers

Nahum · May 30, 2013, 1:22pm

Hi,

we have a global illumination renderer (path tracing) running with OptiX on Linux (Ubuntu 12.04) and a set of post-processing operations that use CUDA through OptiX’s CUDA interop (we currently use CUDA 5.0.35). For the post-processing we rely not only on lighting information, but also on geometry and variance data. We also filter indirect illumination separately, so direct and indirect illumination are in separate buffers.

This means that we have several buffers which are filled by the renderer, all consisting of float4 elements. Everything worked like a charm with one GPU (GTX Titan), but when I added a second Titan, rendering actually became quite a lot slower: throughput dropped from about 50 million rays per second in a typical scene to 10 to 20 million. Checking the code step by step, this seems to be strongly related to our extensive use of OptiX buffers. Successively commenting out buffer writes increases performance considerably.

I am aware that the buffers are stored on the host in a multi-GPU setup, but is it really prohibitive to use multiple buffers if you still want to get good performance from an average multi-GPU setup?

I also tried putting a loop that takes multiple samples into the kernel launch, which more or less solves the problem. But as our application also relies on interactive navigation, drawing N samples per pixel for each “frame” is not really a solution, because it divides the “presented” frame rate by N.
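To make the trade-off above concrete, here is a minimal sketch of a cost model. It assumes (this is not stated in the thread) that each launch pays a fixed overhead, for example for buffer synchronisation, plus a cost proportional to the number of samples. The struct `LaunchModel`, its fields and any numbers fed into it are purely illustrative, not measurements.

```cpp
// Hypothetical cost model, not measured data: every launch pays a fixed
// overhead plus a cost that grows linearly with the samples per launch.
struct LaunchModel {
    double overhead_ms;    // assumed fixed cost per launch (e.g. buffer sync)
    double per_sample_ms;  // assumed cost of one sample per pixel, full frame
};

// Time until one "frame" can be presented when N samples are taken per launch.
double frame_time_ms(const LaunchModel& m, int samples_per_launch) {
    return m.overhead_ms + m.per_sample_ms * samples_per_launch;
}

// Frame rate seen by the user during interactive navigation.
double presented_fps(const LaunchModel& m, int samples_per_launch) {
    return 1000.0 / frame_time_ms(m, samples_per_launch);
}

// Sample throughput: full-frame samples completed per second.
double samples_per_second(const LaunchModel& m, int samples_per_launch) {
    return samples_per_launch * presented_fps(m, samples_per_launch);
}
```

Under this assumption, raising N amortises the fixed overhead, so sample throughput goes up, while the presented frame rate goes down. When the per-sample cost dominates, the frame rate drops by roughly a factor of N, which is the effect described above.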

To be clear: the problem does not seem to be related to the post-processing. Of course the post-processing also takes longer, as only one device is used there and the data has to be copied, but the rays-per-second figure mentioned above only accounts for the pure rendering time.

My only idea at the moment for getting more control over what happens here is to create two processes, each with its own OptiX context, and to combine the results manually during the post-processing that is applied anyway. Communication would go through IPC mechanisms such as shared memory. This should work; at least, launching the application on both devices in parallel caused no significant performance loss on either of them.
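A minimal host-side sketch of that two-process idea on Linux might look as follows. It uses an anonymous shared mapping and `fork()`. The names `Float4`, `render_placeholder` and `render_with_two_processes` are hypothetical, the rendering is replaced by a trivial fill (a real worker would create its own OptiX context for one device at that point), and averaging the two results is only an assumption about how the images would be combined.

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Stand-in for one float4 element of a renderer output buffer.
struct Float4 { float x, y, z, w; };

// Placeholder for "render one frame on device `device`". A real worker would
// set up its own OptiX context for that device and launch here instead.
static void render_placeholder(Float4* out, std::size_t count, int device) {
    const float v = static_cast<float>(device + 1);
    for (std::size_t i = 0; i < count; ++i) out[i] = Float4{v, v, v, 1.0f};
}

// Forks two workers that each fill their own slice of a shared mapping, then
// averages both slices in the parent. Returns an empty vector on failure.
std::vector<Float4> render_with_two_processes(std::size_t pixels) {
    const std::size_t bytes = 2 * pixels * sizeof(Float4);
    void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return {};
    Float4* shared = static_cast<Float4*>(mem);

    pid_t children[2] = {-1, -1};
    for (int device = 0; device < 2; ++device) {
        const pid_t pid = fork();
        if (pid == 0) {  // child: write only into its own slice, then exit
            render_placeholder(shared + device * pixels, pixels, device);
            _exit(0);
        }
        children[device] = pid;  // stays -1 if fork() failed
    }

    bool ok = true;
    for (pid_t pid : children) {
        int status = 0;
        if (pid < 0 || waitpid(pid, &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
            ok = false;
        }
    }

    std::vector<Float4> combined;
    if (ok) {  // both workers finished: average the two images
        combined.resize(pixels);
        for (std::size_t i = 0; i < pixels; ++i) {
            const Float4& a = shared[i];
            const Float4& b = shared[pixels + i];
            combined[i] = Float4{0.5f * (a.x + b.x), 0.5f * (a.y + b.y),
                                 0.5f * (a.z + b.z), 0.5f * (a.w + b.w)};
        }
    }
    munmap(mem, bytes);
    return combined;
}
```

Calling `render_with_two_processes(width * height)` returns the combined image. Giving each worker its own slice means the only synchronisation needed is the parent waiting for both children to exit. A per-frame loop would more likely keep the workers alive and signal them, for instance with semaphores, which this sketch leaves out.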

Just creating two OptiX contexts in separate threads didn’t seem to work: it crashed with an error referring to something like a context bind from a different thread (I can’t recall it exactly and I don’t have my machine available to check, sorry).

Thanks for any ideas and hints on this :)

GuillermoMarcus · May 30, 2013, 1:46pm

Hi Nahum,

Unfortunately, OptiX is not thread safe at the moment, so it might work or it might not. It seems it is not working for you when testing multiple independent threads…

A hint for the multi-GPU setup: to reduce the buffer copies between host memory and the GPUs, if you know some buffers have not changed between frames, you can try OR-ing the RT_BUFFER_COPY_ON_DIRTY flag into the buffer type passed to rtBufferCreate. This will cause the data copies to happen only when OptiX has an explicit reason to believe the data are dirty.

Best,
GM

Nahum · June 10, 2013, 2:10pm

Hi,

actually all the buffers I use change every frame, so I believe there really was no other way than doing it with multiple processes and IPC. Luckily I found that CUDA also supports IPC, so I could easily hand the device pointer I got from OptiX in the child process over to the parent process, which then combines everything in the already existing post-processing chain, relying only on device-to-device copies.

Still, it would have been nice not to have to do this :) … I understand that buffers are stored on the host in multi-GPU setups, but will there be a way around this in the future? I believe good support for multi-GPU setups is a must-have, and in the current state it’s a little annoying that just using multiple buffers which are updated every frame completely ruins your performance.

Also, when I just copied the buffers manually to the host in the multi-process setup, everything was still quite a lot faster than when OptiX took care of this. Why is that? Does it have to do with OptiX’s scheduling approach?

marknv · June 11, 2013, 10:12pm

You cannot compare multi-processing with multi-GPU, because those are two totally different approaches, each with its own pros and cons. OptiX takes care of many things under the hood (besides enforcing a strong exception safety policy).

There are some multi-GPU issues being worked on; many of them have been fixed and the fixes will ship in the next release. Perhaps that will solve your problem as well.

Nahum · June 12, 2013, 10:12am

Thanks mark, I am waiting eagerly :)

I will give my feedback here as soon as I have tested it then!
