Question about handling buffers when using multiple GPUs?


As far as I can tell, I don’t really have to change anything to get an Optix context to utilize more than one GPU as it is done automatically. But how do I handle the buffers? Right now on a single GPU the context is used to create one set of buffers and they are accessed on the CPU by mapping and on the GPU we read/write by indexing/atomicAdd. How does this procedure change?

Does each GPU have to have its own set of buffers? Because the programming guide talks about combining the outputs of each device using “CUDA zero-copy memory”. And how do the reading/writing operations change?


Multi-GPU support with OptiX is more or less automatic.

Though if you’re using “Interoperability with CUDA” as described in Chapter 7 of the OptiX Programming Guide, that requires that you know exactly what you’re doing with CUDA device pointers. That’s not automatic and not part of the answers below.

You can select which GPU devices to use with the rtContextSetDevices() function.
The default OptiX behavior is to use all available devices.
In versions OptiX 3 and lower it used the compatible devices of the highest streaming multi-processor versions.
In OptiX 4 that limitation has been lifted and it can be a heterogeneous setup of different GPU architectures (Kepler and newer) as well. Though that can result in longer kernel compilation (per different architecture).

Input buffers are uploaded to all GPUs.
Input_output and output buffers are put into host memory and all GPUs access that via the PCI-E directly (see Chapter 7 and 11), which adds a limit on the possible scaling progression. I would not recommend to use more than four GPUs per context.
The workload balancing does not take the number of PCI-E lanes into account. Best scaling happens with symmetrical setups, e.g. all boards in 16x lanes slots.

Indexing is unchanged. (Always use operator to access buffer elements inside device code. Pointer arithmetic is illegal.)

Atomics do not work across GPUs. Don’t use them to output your final results on multi-GPU contexts.

But input_output buffers can be flagged to be local on each GPU with this flag (see OptiX API Reference): “RT_BUFFER_GPU_LOCAL - An RT_BUFFER_INPUT_OUTPUT has separate copies on each device that are not synchronized”. Good for temporary scratch space.

OpenGL interop will not be active in a multi-GPU setup because output buffers are on the host.

Other than that, read the OptiX Programming Guide Chapter 11 Performance Guidelines which contains some more notes about multi-GPU behavior.

Then there is remote multi-GPU rendering via OptiX’ progressive API available on NVIDIA VCA clusters, which is a separate topic. You shouldn’t care about that unless you have access to a VCA.

Thanks for the reply.

So in a scenario where atomicAdd has to be used since different threads write to the same index, I should create a temporary RT_BUFFER_GPU_LOCAL buffer so that every GPU would get a local copy of it, and I would store intermediate calculations there? But then how would I combine these local multi-GPU buffers, into a single output buffer to be accessed on the host?

I guess another option would be to use a multi-threaded system with one context per GPU per thread. Then I can just sum up the buffers on the CPU when I’m done. Although I’m confused whether this would work, since in Chapter 3.1 it says “… multiple contexts can be active at one time in limited cases” but in Chapter 11 it says “It is possible to obtain better performance by managing multiple GPUs manually. By allocating a context for each device”, and in Chapter 9.7 it says that OptiX is thread safe.

Perhaps the part in Chapter 3.1 is talking about running multiple context in the same thread?

Either way, I’d still like to know the answer to the first question about reading back multi-gpu results on the cpu.

I think you’re mixing the two different APIs inside the OptiX SDK when citing the documentation.

There is the OptiX API and there is the OptiX Prime API.

OptiX is the high-level programmable ray casting API with single ray programming model where you provide PTX programs for up to seven different program domains (ray generation, closest hit, any hit, etc.).
OptiX Prime is the low-level fixed function ray casting API which uses a ray wavefront programming model and where everything except the acceleration structure build and the ray triangle intersection is your responsibility.

Make sure you check which part of the programming guide you cited talks about which of the two APIs.
Chapters 9 and 10 and the second half of 11 are only about OptiX Prime!

The OptiX API is not multi-threading safe! OptiX Prime is.

Using different OptiX contexts per GPU in one process might work in some cases in OptiX 4.1.1. No guarantees for that so far though.

As for the atomics on multi-GPU that’s a hardware thing and I don’t know of a generally working solution for a ray tracing algorithm in which you need them on arbitrary multi-GPU configurations at this time.
With newer GPU architectures supporting Unified Memory (Pascal, Volta), there is a possibility that this works, but I currently can’t say if that applies to how the output buffers are allocated.

I see. Sorry for the mix-up!

But my first question still stands I think. If I use RT_BUFFER_GPU_LOCAL so each GPU gets its own copy, then I can use AtomicAdd. Is there then any way to read those results back on the host? Since obviously I can’t read these local gpu buffers?

Maybe instead I can do something like having an array of output buffers instead of local buffers. Each GPU would write to a specific buffer based on its ID (which I think could be obtained on a device kernel using cudaGetDevice()). I think that should work with atomicAdd?

I kind of feel like I’m missing something. Surely there is a standard way to doing this. That is using multiple gpus to do forward rendering where any thread can write to any buffer location. And then the result has to be read on the CPU.

Or is using OptiX prime my only option?

OptiX abstracts multiple GPUs in one context so that the user facing side of the API is unaware of that.
OptiX does the scheduling and provides the results in an output buffer.
You do not know which GPU wrote what inside the output buffer that way.
This works nicely with gather algorithms where each launch index writes into unique memory locations.
With scatter algorithms which write into the output buffers using atomicAdd this won’t work because only accesses from threads on one GPU are serialized, not across multiple GPUs.

That’s why I said it’s more or less automatic. “Less” was explained with the caveats I listed.

The main problem here is that the scheduling is done by OptiX. When running on multiple-GPUs you have no control about which GPU handles what launch index.
Means you cannot simply gather the GPU local buffers from multiple GPUs into a composited output buffer, because you do not know which GPU is going to work on what data! For example, you might fetch from the scratch buffer of one GPU which wasn’t actually written to on that but another GPU and you missed to read the proper result.

Actually yes, if writes from multiple GPUs are to completely disjunct pinned memory areas that might work.
There is a discussion about atomics on pinnned memory here:

cudaGetDevice() is a CUDA runtime host function. You can’t call that in an OptiX kernel.

Remains the crucial question how you would be able to identify the individual GPUs inside the OptiX device code, and the only solution I can see inside the OptiX documentation (highlighted below) is to use a CUDA interop input buffer where the application must provide device pointers for all of the devices.

Gathering all relevant documentation: enum RTbufferflag
RT_BUFFER_GPU_LOCAL An RT_BUFFER_INPUT_OUTPUT has separate copies on each device that are not synchronized. RTresult RTAPI rtBufferCreate
The flag RT_BUFFER_GPU_LOCAL can only be used in combination with RT_BUFFER_INPUT_OUTPUT. RT_BUFFER_INPUT_OUTPUT and RT_BUFFER_GPU_LOCAL used together specify a buffer that allows the host to only write, and the device to read and write data. The written data will never be visible on the host side and will generally not be visible on other devices.

Means there are input_output buffers possible per GPU, but you can’t read them back to the host by mapping them directly or via a compositing step reliably because the scheduling is not under your control.

7.2.2. Restrictions
An application must retrieve or provide device pointers for either one or all of the devices used by a buffer’s OptiX context. Getting or setting pointers for any other number of devices is an error. Getting pointers for some devices and setting them for others on the same buffer is not allowed. Calling rtBufferMap or rtBufferMarkDirty on a buffer with pointers retrieved/set on all of multiple devices is not allowed. Calling rtBufferSetDevicePointer on output or input/output buffers is not allowed.

Means it’s possible to have different input buffers per GPU. That’s what you need!

7.2.1. Buffer Synchronization
Multi-Pointer Synchronization
If OptiX is using multiple devices it performs no synchronization when an application retrieves/provides buffer pointers for all the devices. OptiX assumes that the application will manage the synchronization of the contents of a buffer’s device pointers.

7.2.3. Zero-copy pointers
With a multi-GPU OptiX context and output or input/output buffers, it is necessary to combine the outputs of each used device. Currently one way OptiX accomplishes this is by using CUDA zero-copy memory. Therefore rtBufferGetDevicePointer may return a pointer to zero-copy memory. Data written to the pointer will automatically be visible to other devices. Zero-copy memory may incur a performance penalty because accesses take place over the PCIe bus.

Means it’s not actually possible to have different output buffers per GPU.

Again, to be able to distinguish the individual GPUs, the only way according to the above documentation is to use CUDA interop and use an input buffer with a different value for rtBufferSetDevicePointer on all GPUs in the context.
That means automatic synchronization with OptiX won’t happen and you could write something GPU specific into these. In your case that buffer would need to hold just a single value (unsigned integer) with a zero based GPU ID.

Now your algorithm would need to write to, for example, a number-of-GPUs times bigger buffer with a consistent addressing based on that GPU ID and accumulate at disjunct memory locations that way.

I’d be interested if that mechanism works.

I just thought about this, and I think for my case it would actually be much simpler to run two separate contexts with MPI, since my application is highly parallelizable. I would create a context on each core and assign to it a specific GPU to work with using rtContextSetDevices, and just gather the results at the end. This should work, right? I mean the fact that OptiX isn’t thread safe wouldn’t matter since I’m running two different processes with MPI, right?

If it can’t work (for some reason), then I will give your solution a shot. Worst case scenario is I would have to convert my code to use OptiX Prime.

Are you sure? According to
cudaGetDevice “Always returns current device ID as would be seen from host”

This “CUDA Dynamic Parallelism” however only works with compute capability 3.5 or higher, and I’m using 2.0.

Yes, running multiple processes with OptiX in parallel on different devices is perfectly fine. I’ve done that a lot on a dual GPU system.

I am reopening this post because I have a similar problem.
First my system info:

CUDA 10.0.130, g++ (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0
OptiX Version:[6.1.1] Branch:[r421_00] Build Number:[26287189] CUDA Version:[cuda100] 64-bit
Display driver: 430.14
Devices available:
CUDA device: 0
    GeForce RTX 2080 Ti
    SM count: 68
    SM arch: 75
    SM clock: 1650 KHz
    GPU memory: 11019 MB
    TCC driver: 0
    Compatible devices: 0, 1
CUDA device: 1
    GeForce RTX 2080 Ti
    SM count: 68
    SM arch: 75
    SM clock: 1650 KHz
    GPU memory: 11016 MB
    TCC driver: 0
    Compatible devices: 0, 1

I have a global buffer to store results because I do not know how many elements are going to be used and which threads are going to write. I use a buffer with an atomicAdd to get the index to write on, something like this:

uint hitIndex=atomicAdd(&atomicIndex[0u],1u);
              //Store hit in global buffer

This has worked correctly. Now, I have multiple GPUs. So I have more or less the problem described above. I have tried something similar to the solution proposed by Detlef.
What I do is as follows:
Both the global buffer and the atomic index buffer are created as RT_BUFFER_INPUT_OUTPUT with RT_BUFFER_GPU_LOCAL, so every GPU gets its own buffer and the atomic operation seems to work.

Then, after the launch, I use directly thrust and CUDA to filter and get the results, a summary of the code for 2 GPUs is below:

thrust::host_vector<HitInfo> filterHits(optix::Buffer globalBuffer, optix::Buffer atomicIndex) {
                //Get device pointer to index buffer
                uint* ai_ptr =static_cast<uint*>( atomicIndex->getDevicePointer(0));
                uint* ai_ptr2 =static_cast<uint*>( atomicIndex->getDevicePointer(1));
//Get device pointer to global buffer
                HitInfo* raw_ptr =static_cast<HitInfo*>( globalBuffer->getDevicePointer(0));
                HitInfo* raw_ptr2 =static_cast<HitInfo*>( globalBuffer->getDevicePointer(1));

                thrust::device_ptr<HitInfo> dev_ptr=thrust::device_pointer_cast(raw_ptr);
                thrust::device_ptr<HitInfo> dev_ptr2=thrust::device_pointer_cast(raw_ptr2);

                thrust::device_ptr<uint> dev_ai=thrust::device_pointer_cast(ai_ptr);
                thrust::device_ptr<uint> dev_ai2=thrust::device_pointer_cast(ai_ptr2);
                uint s=dev_ai[0];
                uint s2=dev_ai2[0];
                thrust::device_vector<HitInfo> vt(s+s2);
//Copy contents
                cudaMemcpy(thrust::raw_pointer_cast(, raw_ptr,s*sizeof(HitInfo),cudaMemcpyDeviceToDevice);
                cudaMemcpy(thrust::raw_pointer_cast(, raw_ptr2,s2*sizeof(HitInfo),cudaMemcpyDeviceToDevice);
thrust::sort(vt.begin(), vt.end());

thrust::device_vector<HitInfo>::iterator new_end=thrust::unique(vt.begin(), vt.end());

thrust::host_vector<HitInfo> hv(vt.begin(),new_end);

//Init index for next launch
return hi;

So far it seems to work correctly. But I am not sure if I am messing with something internally or if I can trust this solution. This is an old post, so maybe there is now a better way to do this.
Any advice is appreciated.
Thanks a lot.