Recommendations for splitting work between GPUs

Some history first: the multi-GPU support inside OptiX versions before 7.0.0 only split work across compatible GPUs (think same SM version). It distributed the work per launch and could share texture resources via NVLINK once the VRAM usage reached a threshold. Used naively, it didn't scale well: with the legacy OptiX API, output buffers were allocated in pinned host memory when OptiX used multiple devices, which congested the PCIE bus the more GPUs were used. Applications needed to be aware of that and allocate temporary output buffers with the RT_BUFFER_GPU_LOCAL flag to keep those temporary buffers on the device.
This post contains links to previous discussions about that: https://forums.developer.nvidia.com/t/optix-6-5-multi-gpu/118375/2

With the OptiX SDK 7 and 8 versions, all of that changed for the better, because the OptiX 7/8 API knows nothing about GPU devices or textures! Instead you control all of that via native CUDA host API calls, which give you complete control over how and where you allocate each resource (device buffers, texture arrays).

My OptiX Advanced Samples have a number of multi-GPU capable examples. The README.md describes what they do.

rtigo3 demonstrates all the different multi-GPU methods; search the code for m_strategy. Later examples implement only the fastest solution, the local-copy strategy. The system_rtigo3_*.txt configuration files inside the data folder explain the different options. They also show the different CUDA-OpenGL interop options.

They split the work per launch (a sub-frame of a progressive path tracer) evenly among multiple GPUs in a checkerboard pattern whose tile size can be adjusted. 8x8 pixels is my minimum because that covers both 8x4 and 4x8 2D warp tiles, and the sizes need to be powers of two because of the integer math which calculates which launch index maps to which screen pixel.
It should also be possible to extend that to a weighted workload distribution, but I haven't implemented that inside these examples.
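To illustrate, here is a minimal host-side sketch of such a launch-index-to-pixel mapping. The tile assignment scheme (rotating the tile column per tile row) and the function name are my own illustration, not the exact code from the examples:

```cpp
#include <cstdint>

struct Pixel { uint32_t x; uint32_t y; };

// Hypothetical checkerboard mapping: the full-resolution image is split into
// tileSize x tileSize tiles (tileSize a power of two, e.g. 8). Each of the N
// devices launches with width = fullWidth / N and the full height; the owned
// tile column is rotated per tile row to form the checkerboard pattern.
Pixel launchIndexToPixel(uint32_t lx, uint32_t ly,   // per-device launch index
                         uint32_t device,            // device index in [0, N)
                         uint32_t numDevices,        // N
                         uint32_t tileSize)          // power of two, e.g. 8
{
  const uint32_t tileY      = ly / tileSize;  // tile row, same as in full resolution
  const uint32_t tileXlocal = lx / tileSize;  // tile column inside the local launch
  // Rotate the owned tile column by the tile row to get the checkerboard.
  const uint32_t tileXfull  = tileXlocal * numDevices + (device + tileY) % numDevices;
  return { tileXfull * tileSize + (lx % tileSize), ly };
}
```

Because the tile size is a power of two, the divisions and modulo operations compile down to shifts and masks inside the ray generation program.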

In the local-copy multi-GPU strategy, the individual 2D launches per GPU use (1/Nth plus padding) of the width times the full height of the output resolution.
The ray generation program calculates which per-GPU launch index maps to which full-resolution pixel to shoot primary rays.
The compositing of the individual 2D blocks into the final full-resolution output buffer happens with a native CUDA kernel, after the partial buffers have been copied from the other GPU devices to the device holding the full-resolution output buffer. That is usually also the device running the OpenGL implementation, so the image can be displayed with an OpenGL texture blit.
That CUDA peer-to-peer (P2P) copy is automatically routed through the fastest path by CUDA, i.e. via NVLINK when possible, otherwise via PCIE.
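A host-side sketch of what that compositing step amounts to. For brevity this interleaves single pixel columns instead of the tile-based checkerboard and ignores the padding; in the real renderer the partial buffers are first copied to the display device (e.g. with cudaMemcpyPeerAsync) and a CUDA kernel performs the equivalent scatter on the device:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical gather of per-device partial buffers into the full-resolution
// buffer. Device d's partial buffer holds every Nth pixel column here; the
// actual examples scatter whole tiles according to their checkerboard layout.
void composite(std::vector<float>& full, uint32_t fullWidth, uint32_t height,
               const std::vector<std::vector<float>>& partials) // one buffer per device
{
  const uint32_t n          = static_cast<uint32_t>(partials.size());
  const uint32_t localWidth = fullWidth / n;
  for (uint32_t d = 0; d < n; ++d)
    for (uint32_t y = 0; y < height; ++y)
      for (uint32_t x = 0; x < localWidth; ++x)
        full[y * fullWidth + (x * n + d)] = partials[d][y * localWidth + x];
}
```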

Since this happens in a progressive renderer, I'm not doing that compositing for each sub-frame, because that would be a waste of bandwidth. Instead I display all sub-frames during the first half second, to get reasonable quality during interaction, and then only composite and display progressive updates once per second.
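That display cadence boils down to a small policy like this (a sketch with hypothetical names, not the exact code from the examples):

```cpp
// Hypothetical display cadence for a progressive renderer: show every
// sub-frame during the first half second after a camera move or restart,
// afterwards composite and display only once per second.
bool shouldCompositeAndDisplay(double secondsSinceRestart,
                               double secondsSinceLastDisplay)
{
  if (secondsSinceRestart < 0.5)
    return true;                           // interactive phase: every sub-frame
  return secondsSinceLastDisplay >= 1.0;   // progressive phase: 1 Hz updates
}
```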

Additionally, various resources can be shared among devices to save memory, which is controlled via a bitfield inside the other renderers (nvlink_shared and newer). Look for "peerToPeer" inside these examples.
The renderers use the NVIDIA Management Library (NVML) to determine the NVLINK topology of the visible devices, or rather of the active ones selected by the renderer. Later examples can also share resources via PCIE, but that isn't really worth it due to the large performance impact.
Sharing textures via NVLINK came at little cost, while accessing acceleration structures via NVLINK was 2x to 5x slower than holding a copy on each device, depending on the ray divergence in the scene.
The shared resources are allocated on the device which received the fewest resources so far. That assumes boards with homogeneous VRAM configurations, but the source code is prepared to use the actual free memory as a heuristic as well.
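A sketch of that fewest-resources placement heuristic (hypothetical function name; a variant could instead query the actual free bytes per device with cudaMemGetInfo and pick the device with the most free memory):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical placement heuristic: put the next shared resource on the
// device that has received the fewest bytes so far. This assumes boards
// with homogeneous VRAM configurations.
int pickDeviceForSharedAllocation(std::vector<size_t>& allocatedBytes, // per device
                                  size_t resourceSize)
{
  int best = 0;
  for (int d = 1; d < static_cast<int>(allocatedBytes.size()); ++d)
    if (allocatedBytes[d] < allocatedBytes[best])
      best = d;
  allocatedBytes[best] += resourceSize; // book-keeping for the next decision
  return best;
}
```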

The performance of the individual GPUs is influenced by many factors, and on top of that the transfer of the data adds overhead. Usually the primary GPU, which runs all the OS rasterization work, can handle less additional workload.
Just distributing work among GPUs based on their specifications alone wouldn't be enough, and that wouldn't easily work across different GPU generations without measuring how they really perform. Depending on the data flow, the system performance also plays a role.
Mind that most consumer desktop CPUs don't support enough PCIE lanes to drive even two GPUs with 16 PCIE lanes each; that requires workstation CPUs and motherboards. So the more data you need to transfer between GPUs, the more PCIE overhead you get.

I would recommend implementing a benchmark which determines the performance of your specific multi-GPU work distribution on each system configuration once and then use the resulting weights for the actual work.
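One simple way to turn such a benchmark into weights: measure the time of one identical sub-frame per GPU, use the inverse time as the performance weight, and hand out tile counts proportionally. A sketch under those assumptions (names are mine, not from the examples):

```cpp
#include <cmath>
#include <vector>

// Hypothetical conversion of benchmark results into a work distribution:
// perf = 1 / measured sub-frame time per GPU, tiles handed out proportionally,
// with the rounding remainder going to the fastest GPU.
std::vector<int> tilesPerDevice(const std::vector<double>& subFrameSeconds,
                                int totalTiles)
{
  const size_t n = subFrameSeconds.size();
  std::vector<double> perf(n);
  double sum = 0.0;
  for (size_t d = 0; d < n; ++d)
  {
    perf[d] = 1.0 / subFrameSeconds[d];
    sum += perf[d];
  }
  std::vector<int> tiles(n);
  int    assigned = 0;
  size_t fastest  = 0;
  for (size_t d = 0; d < n; ++d)
  {
    tiles[d] = static_cast<int>(std::floor(perf[d] / sum * totalTiles));
    assigned += tiles[d];
    if (perf[d] > perf[fastest])
      fastest = d;
  }
  tiles[fastest] += totalTiles - assigned; // distribute the rounding remainder
  return tiles;
}
```

Running the benchmark once per system configuration and caching the resulting weights avoids paying the measurement cost on every application start.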

Distributing work dynamically would be quite involved, because you would need to track which partial results reside on which GPU while making sure that the correct number of samples is maintained.

When it's not about interactive performance, where individual sub-frames should be finished as quickly as possible, but about final-frame rendering, where the overall rendering time should be minimized, there are other solutions to distribute the work. In a tiled renderer which shoots all samples per pixel at once, it would be rather simple to distribute the work across heterogeneous GPUs by launching tiles from a queue onto whichever GPU finishes its current tile first.
You could also launch full sub-frames per GPU in that case and weight the accumulation into the final result by the number of samples each GPU processed.
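The tile-queue idea can be sketched with a shared atomic counter, one worker thread per GPU, where faster GPUs simply pop more tiles. This is my own illustration; renderTile stands in for the per-tile OptiX launch on that device:

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical final-frame tile queue: every worker (one per GPU) pops the
// next tile index from a shared atomic counter until the queue is drained,
// so heterogeneous GPUs load-balance themselves automatically.
void renderAllTiles(int numTiles, int numDevices,
                    const std::function<void(int device, int tile)>& renderTile)
{
  std::atomic<int> nextTile{0};
  std::vector<std::thread> workers;
  for (int d = 0; d < numDevices; ++d)
    workers.emplace_back([&, d]() {
      // fetch_add hands each tile index to exactly one worker.
      for (int t = nextTile.fetch_add(1); t < numTiles; t = nextTile.fetch_add(1))
        renderTile(d, t);
    });
  for (std::thread& w : workers)
    w.join();
}
```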

There is a slightly different weighted workload distribution mechanism described in Chapter 10 of the first Ray Tracing Gems book. You would just need to determine the proper weights for a given system configuration.
Since that approach uses 1D scanlines instead of 2D tiles, it is better used with 1D launches, so it doesn't clash with the internal warp-sized blocks used for 2D launches.