Multi-GPU with several float buffers


we have a global illumination renderer (path tracing) running with OptiX on Linux (Ubuntu 12.04) and a set of post-processing operations using CUDA with OptiX's CUDA interop (we currently use CUDA 5.0.35). For the latter we rely not only on lighting information, but also on geometry and variance data. We also filter indirect illumination separately (so direct and indirect illumination live in separate buffers).

This means we have several buffers that are filled by the renderer, all consisting of float4 elements. Everything worked like a charm with one GPU (GTX Titan), but when I added a second Titan, rendering actually became quite a lot slower: it dropped from around 50 million rays per second in a typical scene to 10 to 20 million. Stepping through the code, this seems strongly related to our extensive use of OptiX buffers; successively commenting out buffer writes increases performance vastly.

I am aware that the buffers are stored on the host in a multi-GPU setup, but is it really prohibitive to use multiple buffers if you still want to get good performance from your average multi-GPU setup?

I also tried adding a loop inside the kernel launch to take multiple samples per launch, which more or less solves the problem. But since our application also relies on interactive navigation, drawing N samples per pixel for each "frame" is not a real solution, as it divides the presented frame rate by N.

Just to be sure you get it right: the problem does not appear to be related to the post-processing. Of course that also takes longer, since only one device is employed there and the data has to be copied, but the rays/second figure mentioned above accounts only for the pure rendering time.

My only idea at the moment to get more control over what happens here is to create two processes, each with its own OptiX context, and combine the results manually as part of the post-processing we apply anyway. Communication would be done via IPC mechanisms such as shared memory. This should work; at least, launching the application on both devices in parallel caused no significant performance loss in either of them.

Just creating two OptiX contexts in separate threads didn't work: it crashed with an error referring to something like a context being bound from a different thread (I can't recall it exactly and I don't have my machine available to check, sorry).

Thanks for any ideas and hints on this :)

Hi Nahum,

Unfortunately, OptiX is not thread safe at the moment; it might work or it might not. It seems it did not work in your test with multiple independent threads…

A hint for the multi-GPU setup: to reduce the buffer copies between host memory and the GPUs, if you know some buffers have not changed between frames, you can try OR'ing the RT_BUFFER_COPY_ON_DIRTY flag into the buffer descriptor during rtBufferCreate. This will cause the data copies to happen only when OptiX has an explicit reason to believe the data is dirty.
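A minimal sketch of what that looks like with the OptiX 3.x C API (assuming an existing RTcontext named `context` and illustrative dimensions; error checking omitted for brevity):

```c
#include <optix.h>

/* Create an input buffer that is only re-copied to the devices when it is
 * explicitly marked dirty, instead of on every launch. */
RTbuffer buf;
rtBufferCreate(context, RT_BUFFER_INPUT | RT_BUFFER_COPY_ON_DIRTY, &buf);
rtBufferSetFormat(buf, RT_FORMAT_FLOAT4);
rtBufferSetSize1D(buf, width * height);

/* Whenever the contents actually change, tell OptiX so it synchronizes the
 * buffer to the devices before the next launch: */
rtBufferMarkDirty(buf);
```

As noted, this only helps for buffers whose contents survive unchanged across frames; a buffer that is marked dirty every frame copies just as often as before.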



Actually, all the buffers I use change every frame, so I believe there really was no other way than doing it with multiple processes and IPC. Luckily I found that CUDA also supports IPC, so I could easily hand the device pointer I got from OptiX in the child process over to the parent process, which then combines everything in the already-existing post-processing chain, relying only on device-to-device copies.
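For reference, a hedged sketch of that handoff with the CUDA runtime IPC API (function and variable names are illustrative, not from my actual code; error checking omitted):

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* -- child (renderer) process ------------------------------------------ */
/* Wrap the device pointer obtained from OptiX (rtBufferGetDevicePointer)
 * in an IPC handle. The handle is a small plain struct that can be sent
 * to the parent over any IPC channel (pipe, shared memory, ...). */
void export_buffer(void* optix_dev_ptr, cudaIpcMemHandle_t* out_handle)
{
    cudaIpcGetMemHandle(out_handle, optix_dev_ptr);
}

/* -- parent (post-processing) process ---------------------------------- */
/* Map the child's allocation into this process and copy it straight into
 * the post-processing chain's buffer, device-to-device. */
void import_and_copy(cudaIpcMemHandle_t handle, void* dst, size_t nbytes)
{
    void* mapped = NULL;
    cudaIpcOpenMemHandle(&mapped, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaMemcpy(dst, mapped, nbytes, cudaMemcpyDeviceToDevice);
    cudaIpcCloseMemHandle(mapped);
}
```

Note that CUDA IPC handles can only be opened in a different process than the one that created them, and both processes must see the same device.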

Still, it would have been nice not having to do this :) …I understand that buffers are stored on the host in multi-GPU setups, but will there be a way around this in the future? I believe good support for multi-GPU setups is a must-have, and in the current state it's a little annoying that simply using multiple buffers that are updated every frame completely crashes your performance.

Also, when I copied the buffers to the host manually in the multi-process setup, everything was still quite a lot faster than when OptiX took care of it. Why is this? Does it have to do with OptiX's scheduling approach?

You cannot compare multi-processing with multi-GPU, because those are two different approaches, each with its own pros and cons. OptiX takes care of many things under the hood (besides enforcing a strong exception safety policy).

There are some multi-GPU issues being worked on; many of them have been fixed, and the fixes will ship in the next release. Perhaps that will solve your problem as well.

Thanks Mark, I am waiting eagerly :)

I will give my feedback here as soon as I have tested it then!