Multi-GPU with OptiX

xtc · April 10, 2013, 11:44am

Hi guys
I’m implementing Progressive Photon Mapping in OptiX, using the new approach by Knaus and Zwicker (http://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Knaus11.pdf). This approach makes each iteration/frame of PPM independent and therefore more suitable for multi-GPU.

I understand that OptiX can support multiple GPUs, but I am a bit unclear on how it does it. It seems either to render half the screen on one GPU and half on the other, or to partition a single frame across the GPUs in some more sophisticated way.

What I do is trace a number of photons and store them in a buffer. The photons are then sorted into a spatial hash map using CUDA and thrust, never leaving the GPU. If I use one GPU then of course all photons are stored on one device. If I use multiple GPUs then I imagine that the buffer is split across several GPUs, and that on a map to host these buffers are combined into a single CPU buffer. I want to do the spatial hash map creation on the GPU since it is the bottleneck of my renderer. Any more insight into how multi-GPU works in OptiX would be appreciated; I find that the programming guide and API don’t touch upon this issue.
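For readers unfamiliar with the technique, here is a minimal CPU sketch of a sort-based spatial hash build of the kind described above. It is not the poster's code: std::stable_sort stands in for a thrust sort by key, and the Photon struct, the hash function with its prime constants, the cell size and the table size are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

struct Photon { float x, y, z; };

// Hypothetical hash: quantize the position to a grid cell, then mix the
// cell coordinates with three large primes (a common choice, assumed here).
inline std::uint32_t hashCell(const Photon& p, float cellSize, std::uint32_t tableSize)
{
    assert(cellSize > 0.0f && tableSize > 0);
    const auto ix = static_cast<std::int32_t>(std::floor(p.x / cellSize));
    const auto iy = static_cast<std::int32_t>(std::floor(p.y / cellSize));
    const auto iz = static_cast<std::int32_t>(std::floor(p.z / cellSize));
    const std::uint32_t h = (static_cast<std::uint32_t>(ix) * 73856093u)
                          ^ (static_cast<std::uint32_t>(iy) * 19349663u)
                          ^ (static_cast<std::uint32_t>(iz) * 83492791u);
    return h % tableSize;
}

struct PhotonHashGrid {
    std::vector<Photon> photons;             // photons sorted by bucket
    std::vector<std::uint32_t> bucketStart;  // tableSize + 1 offsets into photons
};

// Bucket b holds photons[bucketStart[b]] .. photons[bucketStart[b + 1] - 1].
PhotonHashGrid buildHashGrid(const std::vector<Photon>& photons,
                             float cellSize, std::uint32_t tableSize)
{
    using Keyed = std::pair<std::uint32_t, Photon>;
    std::vector<Keyed> keyed;
    keyed.reserve(photons.size());
    for (const Photon& p : photons)
        keyed.emplace_back(hashCell(p, cellSize, tableSize), p);

    // CPU stand-in for a GPU sort by key.
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const Keyed& a, const Keyed& b) { return a.first < b.first; });

    PhotonHashGrid grid;
    grid.bucketStart.assign(tableSize + 1, 0);
    grid.photons.reserve(keyed.size());
    for (const Keyed& kp : keyed) {
        ++grid.bucketStart[kp.first + 1];  // count photons per bucket
        grid.photons.push_back(kp.second);
    }
    // Running sum turns per-bucket counts into start offsets.
    for (std::uint32_t b = 0; b < tableSize; ++b)
        grid.bucketStart[b + 1] += grid.bucketStart[b];
    return grid;
}
```

The point of the sketch is that the build needs every photon in one address space before sorting, which is why a photon buffer split across devices gets in the way.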

What I would actually like to do is let one GPU do one frame while the next GPU does the next frame. I can then combine the results, for instance on the CPU or on one of the GPUs in a combine pass. Is this somehow possible? For instance, could I create two OptiX contexts, each mapped to one device, on two different host threads? This would allow me to do the CUDA/thrust spatial hash map generation as before, since all photons of a frame would be on one device, and merge the two generated images at the end of the pipeline.
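The combine pass mentioned above could look roughly like the sketch below. It assumes the final image is the plain average of the independent per-frame images (as I understand the Knaus and Zwicker formulation) and that each device is handed distinct iteration indices so the radius schedule stays valid. The Accumulated struct and the combine function are hypothetical names, not part of OptiX.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Running average of the frames rendered so far by one device.
struct Accumulated {
    std::vector<float> mean;  // one value per pixel channel
    unsigned frames;          // number of frames averaged into `mean`
};

// Merge two running averages, weighting each by its frame count.
Accumulated combine(const Accumulated& a, const Accumulated& b)
{
    assert(a.mean.size() == b.mean.size());
    Accumulated out{std::vector<float>(a.mean.size(), 0.0f), a.frames + b.frames};
    if (out.frames == 0)
        return out;  // nothing rendered yet

    const float wa = static_cast<float>(a.frames) / static_cast<float>(out.frames);
    const float wb = static_cast<float>(b.frames) / static_cast<float>(out.frames);
    for (std::size_t i = 0; i < out.mean.size(); ++i)
        out.mean[i] = wa * a.mean[i] + wb * b.mean[i];
    return out;
}
```

Because the merge only needs the per-device mean and frame count, it does not matter whether the two devices have rendered the same number of frames.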

xtc · April 16, 2013, 2:04pm

Hi

Not much happening here so I’ll stir the pot a bit. Hoping to get some feedback.

Can anyone explain to me what happens when you write to a buffer in an OptiX system with multiple GPUs enabled? Does it have to synchronize to get the data into the buffers on both devices? The index I write to is a function of launchIndex, though not simply buffer[launchIndex], so there should be no collisions. However, it seems there is no way to merge these buffers correctly. If I use RT_BUFFER_GPU_LOCAL the slowdown is gone but the results are not correct.

Working with OptiX doing more advanced stuff on multiple GPUs is not optimal for a couple of reasons. I’m working on a Master’s thesis with an emphasis on speedup using GPUs, and ideally several GPUs. I would like to be able to launch kernels asynchronously and also to specify which device to use. This would allow me to get close to a 2x speedup using two GPUs in my implementation.

xtc · April 16, 2013, 4:34pm

Is there any way to find out which CUDA device a program is running on?
Also, can I find the number of rays that are generated on this GPU (for instance, by looking at BlockDim and GridDim)?

Using this information, I should be able to make sure that each GPU writes to a different area of the buffer. I can then use CUDA to merge these buffers from the different devices.
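As a rough host-side illustration of the idea, assuming the number of results per device were known: an exclusive prefix sum gives each device its own start offset, and the per-device buffers can then be concatenated. The function names deviceOffsets and mergeDeviceBuffers are hypothetical, and plain std::vector copies stand in for the CUDA device-to-device or device-to-host copies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: offsets[d] is where device d's results start.
std::vector<std::size_t> deviceOffsets(const std::vector<std::size_t>& countPerDevice)
{
    std::vector<std::size_t> offsets(countPerDevice.size(), 0);
    for (std::size_t d = 1; d < countPerDevice.size(); ++d)
        offsets[d] = offsets[d - 1] + countPerDevice[d - 1];
    return offsets;
}

// Concatenate per-device results into one buffer, device 0 first.
template <typename T>
std::vector<T> mergeDeviceBuffers(const std::vector<std::vector<T>>& perDevice)
{
    std::vector<std::size_t> counts;
    counts.reserve(perDevice.size());
    for (const std::vector<T>& buf : perDevice)
        counts.push_back(buf.size());

    const std::vector<std::size_t> offsets = deviceOffsets(counts);
    assert(offsets.size() == counts.size());

    const std::size_t total = counts.empty() ? 0 : offsets.back() + counts.back();
    std::vector<T> merged(total);
    for (std::size_t d = 0; d < perDevice.size(); ++d)
        std::copy(perDevice[d].begin(), perDevice[d].end(),
                  merged.begin() + static_cast<std::ptrdiff_t>(offsets[d]));
    return merged;
}
```

The hard part in practice is obtaining the per-device counts, which is exactly what the question above is asking for.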

nasmaj · April 18, 2013, 7:40pm

So one way you can see which devices OptiX is recognizing is by using the sample3 program in the SDK. If you run this program it will list all the GPUs being recognized. If you want to use only one of the GPUs you can do something like
CUDA_VISIBLE_DEVICES=0 ./sample3
This will only use the first device recognized and the numbering should stay the same, at least until reboot.

I am not sure if there is a way to set which devices are used from inside the program, but the OptiX team is usually pretty good at responding to these posts within a week or so.

over0219 · April 22, 2013, 6:14pm

From personal experience I’d advise against this. But, our difficulties came from excessive data transfer between host and device in our application. You should check out Chapter 10 in the OptiX Programming Guide:

From the Programming Guide chapter 9:

But with the RT_BUFFER_GPU_LOCAL flag this is not the case, which is why you saw a significant speedup using it.

A simple solution is to just create a variable and set it right before the launch.
I.e. in your kernel:

rtDeclareVariable( uint3, launch_dimension, , );

and before your launch:

context["launch_dimension"]->setUint( dimx, dimy, dimz );
context->launch( 0, dimx, dimy, dimz ); // first argument is the entry point index

You can do it with rtContextSetDevices. Check out the first section of chapter 3 in the OptiX Programming Guide. So if you only want to use one device, it would look like:

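optix::Context context = optix::Context::create();
int devices[] = { 0 }; // OptiX device ordinals to use
rtContextSetDevices( context->get(), 1, devices ); // takes a count and a pointer to the ordinals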

Although, xtc, I don’t think this was your question? If I’m not mistaken you want to query which device a kernel is running on from inside the kernel, that way you know where to put your results in the buffer, correct?

xtc · April 26, 2013, 10:08am

I had no luck doing this inside a single process, so I am actually doing multi-process distributed rendering using sockets, which will give me a cloud-like feature. OptiX is not flexible enough to let me do multi-GPU inside a single process with the algorithm I’m working on, but I believe my approach has potential.

xtc · July 18, 2013, 5:03pm

I have made some descriptions of my implementation, and I plan to release source code as soon as possible. Please read more about it at http://apartridge.github.io/OppositeRenderer/

leno · July 18, 2013, 5:33pm

Looks cool! Looking forward to the release.

JBigler · July 18, 2013, 11:40pm

Some additional notes.

GPU_LOCAL does not copy data back to the host when you map it there.
OUTPUT buffers are stored on the host as zero-copy buffers in multiple GPU environments, as over0219 said. This also means that any read or write to the buffer will go across the PCIe bus, and that atomic operations don’t work properly.
OptiX currently (and this could change in future versions) stripes blocks of ~64K rays per device. If you don’t have enough rays for a given launch you are not going to get good scaling across multiple GPUs.
You could use CUDA interop, and supply your own output buffer for each GPU then copy the data to the host (or use peer-to-peer) to aggregate the data, though you will have to keep track of which GPU ended up writing to which parts of your buffer (as we are free to change how work is distributed between devices).
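To get a feel for the scaling note above, here is a back-of-the-envelope sketch. It assumes blocks of exactly 64K rays handed out whole and evenly to the devices; the post only says "~64K" and that the scheme may change, so treat the numbers as illustrative. The names kRaysPerBlock, blocksForLaunch and bestCaseSpeedup are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: a block is exactly 64K rays (the post only says "~64K").
constexpr std::size_t kRaysPerBlock = 64 * 1024;

constexpr std::size_t ceilDiv(std::size_t a, std::size_t b)
{
    return (a + b - 1) / b;
}

// Number of blocks a 2D launch would be split into under that assumption.
std::size_t blocksForLaunch(std::size_t width, std::size_t height)
{
    return ceilDiv(width * height, kRaysPerBlock);
}

// Upper bound on speedup when whole blocks are spread over the devices:
// total blocks divided by the blocks on the busiest device.
double bestCaseSpeedup(std::size_t blocks, std::size_t devices)
{
    assert(devices > 0);
    if (blocks == 0)
        return 1.0;
    return static_cast<double>(blocks) / static_cast<double>(ceilDiv(blocks, devices));
}
```

Under these assumptions a 256x256 launch is a single block, so a second GPU would have nothing to do, while a 1024x768 launch yields 12 blocks and could in principle split evenly across two devices.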

xtc · November 5, 2013, 7:12pm

leno: Source code is released on the same URL. :)