Recommendations for splitting work between GPUs

I’m looking for any recommendations or algorithms for splitting an OptiX workload between multiple GPUs. My understanding is that it’s best to have multiple instances of the same GPU, in which case splitting the work evenly between them is close enough to optimal for me (I’m not interested in accounting for other workloads on each GPU). But if running with GPUs of different sizes and/or architectures, I’d like some reasonable heuristic for splitting a workload so that they finish about the same time. Is there some documentation or example related to this?
Legacy OptiX could internally split work between GPUs. Can you provide any information on what it was doing?
I do not have a specific GPU combination in mind here. I’m just looking for general rules of thumb.
Thanks

Hi @bdr,

The first thing to clarify is whether you’re using multiple GPUs primarily to increase parallel performance without regard to memory usage, or whether your primary goal is to pool GPU VRAM into a larger working set.

OptiX 6 and earlier could do both of these things with relatively naive and simple strategies that work best on a set of homogeneous devices. If you had a 2D launch, OptiX would divide the launch up into warp-sized tiles and round-robin deal the tiles out to each GPU. For 2 GPUs, for example, this looks like a checkerboard. This naive strategy relies on the assumption that the multiple GPUs are all the same type. OptiX would also replicate all memory across all GPUs until they were somewhat full, and then would start migrating texture memory to one GPU at a time and allow cross-device texture queries over NVLINK. Only texture memory was ever migrated, and the geometry & BVH memory remained replicated at all times in order to keep traversal performance acceptable.

There are a couple different ways you might plan for a heterogeneous set of GPUs. One would be to estimate the speed of each device, and divvy up the workload in proportion to each device’s speed estimate. Another would be to try to divide your workload into multiple independent chunks with separate launches, tiles perhaps, and build a work queue manager that sends chunks to GPUs on the fly. The chunks need to be large enough to saturate the devices, while also being small enough to have many more chunks than GPUs, so you don’t leave everyone waiting on the last device running. Personally I like this idea since it makes the system adaptive and you don’t need any fancy heuristics, but I realize that depending on what you need, it might not be realistic for you.
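If the queue route is feasible, the manager itself can be quite small. Here is a minimal sketch of the idea (not OptiX-specific; the per-tile render callback stands in for whatever optixLaunch wrapper you already have): one host thread per GPU pulls tile indices from a shared atomic counter until the tiles run out, so faster devices naturally end up taking more tiles.

```cpp
// Minimal work-queue sketch for dispatching independent tile chunks to GPUs.
// Assumptions: tiles are independent, and renderTile(gpuIndex, tileIndex)
// wraps your existing per-tile launch. Names are illustrative.
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

void renderAllTiles(int numGpus, int numTiles,
                    const std::function<void(int, int)>& renderTile)
{
  std::atomic<int> nextTile{0};
  std::vector<std::thread> workers;

  for (int gpu = 0; gpu < numGpus; ++gpu)
  {
    workers.emplace_back([&, gpu]()
    {
      // Each GPU keeps grabbing the next unclaimed tile until none are left,
      // so faster devices process more tiles without any explicit weighting.
      for (int tile = nextTile.fetch_add(1); tile < numTiles; tile = nextTile.fetch_add(1))
      {
        renderTile(gpu, tile);
      }
    });
  }

  for (auto& worker : workers)
  {
    worker.join();
  }
}
```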

If tiling the work and using a queue manager won’t work for you, then estimating the speed of a device could be done by taking into account, at a minimum, the clock speed and number of SMs, or by measuring device speeds on some kind of sentinel kernel before launching a large workload. I don’t expect either to be perfect, especially if the GPUs are very different in size & speed & capabilities, but maybe you can get close enough.
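For the quick estimate, something along these lines is about as simple as it gets. It is only a first-order guess, since it ignores RT cores, memory bandwidth, and other architectural differences:

```cpp
// Rough per-device speed weights from SM count * clock rate, normalized to
// workload fractions. This is only a first-order estimate; architectural
// differences still call for a fudge factor or a measured benchmark.
#include <cuda_runtime.h>
#include <vector>

std::vector<double> estimateDeviceWeights()
{
  int deviceCount = 0;
  cudaGetDeviceCount(&deviceCount);

  std::vector<double> weights(deviceCount);
  double sum = 0.0;

  for (int d = 0; d < deviceCount; ++d)
  {
    int smCount  = 0;
    int clockKHz = 0;
    cudaDeviceGetAttribute(&smCount,  cudaDevAttrMultiProcessorCount, d);
    cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, d);

    weights[d] = static_cast<double>(smCount) * static_cast<double>(clockKHz);
    sum += weights[d];
  }

  // Normalize so the weights sum to 1 and can be used directly as fractions
  // of the workload to assign to each device.
  for (double& w : weights)
  {
    w /= sum;
  }
  return weights;
}
```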

If memory pooling is your main goal, then today I would suggest starting with something like OptiX Toolkit’s Demand Loading rather than trying to do any device-to-device memory migration. This strategy has the potential to keep memory requirements down, and you can manage each device’s texture cache separately, and then focus on how to divvy up the work using one of the above strategies to achieve parallel scaling.


David.


Thanks for all the information. Parallel performance is the goal, and duplicating everything to each GPU is acceptable. An adaptive work queue does sound best, but I don’t consistently have enough work to subdivide that much. I was mostly wondering if there’s a known better quick speed estimate than clock times SMs (or cores?) that might have some accounting for different architectures. But perhaps those differences would disappear into the noise in practice anyways.

I’m not aware of a better heuristic or a good example for you, but Detlef has more multi-GPU experience than I do and might have some notes for us on Monday. I’ve tried using cores & clock speeds to normalize profiling metrics across different GPUs with some success, but every time I’ve asked around about how to do it better, I’ve received various XY problem responses along the lines of “why?”, and “oh that’s going to be hard”. ;) Instead of trying to profile GPUs on the fly, you could consider setting up a separate timing profiler for your known GPUs to measure how fast they go on a specific type of workload similar to what you need, in order to derive constants to hard-code into your application. I bet starting with the (clock * SM_count * manual_fudge_factor) normalizing constant won’t be all that bad, unless you’re mixing RTX and non-RTX GPUs. I’d speculate that it will at least be a little better than sending equal workloads to the different GPUs. Playing with it for a while, you might develop a little intuition for how much accuracy or inaccuracy there is.
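Once you have weights, whether hard-coded or measured, the split itself is just proportional allocation. A sketch that divides image rows by weight (names are made up):

```cpp
// Split the image rows among GPUs in proportion to per-device weights
// (from clock * SM_count * fudge_factor, or from a benchmark). Returns row
// boundaries: device d renders rows [start[d], start[d + 1]).
#include <vector>

std::vector<int> splitRowsByWeight(int imageHeight, const std::vector<double>& weights)
{
  double total = 0.0;
  for (double w : weights)
  {
    total += w;
  }

  std::vector<int> start(weights.size() + 1, 0);
  double accumulated = 0.0;

  for (size_t d = 0; d < weights.size(); ++d)
  {
    accumulated += weights[d];
    start[d + 1] = static_cast<int>(imageHeight * (accumulated / total) + 0.5);
  }
  start.back() = imageHeight; // make sure rounding doesn't drop the last rows
  return start;
}
```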


David.

Some history first. The multi-GPU support inside OptiX versions before 7.0.0 only split work across compatible GPUs (think same SM version). It split the work per launch and could share texture resources via NVLINK when the VRAM usage reached a threshold. It didn't scale too well when used naively: with the legacy OptiX API, output buffers were allocated in pinned host memory when OptiX used multiple devices, which resulted in PCIE bus congestion the more GPUs were used. Applications needed to be aware of that and allocate temporary output buffers with the RT_BUFFER_GPU_LOCAL flag to keep those temporary buffers on the device.
This post contains links to previous discussions about that: https://forums.developer.nvidia.com/t/optix-6-5-multi-gpu/118375/2

Now with the OptiX SDK 7 and 8 versions, all of that changed for the better, because the OptiX 7/8 API knows nothing about GPU devices or textures! Instead you control all of that via native CUDA host API calls, which gives you complete control over how and where you allocate each resource (device buffers, texture arrays).
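In other words, the per-device setup is just the usual CUDA pattern: select the device, then create its streams and allocate its buffers. Roughly like this (error checking omitted; names are illustrative):

```cpp
// Per-device resource setup with the native CUDA runtime API, roughly what a
// multi-GPU OptiX 7/8 application does for each device it wants to use.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

struct DeviceState
{
  int          index        = 0;
  cudaStream_t stream       = nullptr;
  void*        outputBuffer = nullptr; // per-device partial output
};

std::vector<DeviceState> initDevices(size_t outputBufferBytes)
{
  int deviceCount = 0;
  cudaGetDeviceCount(&deviceCount);

  std::vector<DeviceState> devices(deviceCount);
  for (int d = 0; d < deviceCount; ++d)
  {
    cudaSetDevice(d); // all following runtime calls target this device

    devices[d].index = d;
    cudaStreamCreate(&devices[d].stream);
    cudaMalloc(&devices[d].outputBuffer, outputBufferBytes);
    // The OptiX device context (optixDeviceContextCreate) would be created
    // on the CUDA context current to this device as well.
  }
  return devices;
}
```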

My OptiX Advanced Samples have a number of multi-GPU capable examples. The README.md describes what they do.

rtigo3 shows all the different multi-GPU methods. Search the code for m_strategy. Later examples only kept the fastest solution, which is the local copy strategy. The system_rtigo3_*.txt configuration files inside the data folder explain the different options. Those also show the different CUDA-OpenGL interop options.

They split the work per launch (sub-frame of a progressive path tracer) among multiple GPUs evenly in a checkerboard pattern whose tile size can be adjusted. 8x8 pixels is my minimum because that covers both 8x4 and 4x8 2D warp tiles, and it needs power-of-two sizes because of the integer math calculating which launch index maps to which screen pixel.
It should also be possible to enhance that to a weighted workload distribution, but I haven’t implemented that inside these examples.
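The mapping itself is only a bit of integer math per launch index. A simplified sketch of the idea, not the exact rtigo3 code:

```cpp
// Simplified sketch of mapping a per-device 2D launch index to a
// full-resolution pixel for a checkerboard tile distribution.
// Assumptions: tileSize and deviceCount are powers of two; names are illustrative.
#include <cuda_runtime.h>

__device__ uint2 launchIndexToPixel(const uint2        launchIndex,
                                    const unsigned int deviceIndex,
                                    const unsigned int deviceCount,
                                    const unsigned int tileSize) // e.g. 8
{
  const unsigned int localTileX = launchIndex.x / tileSize;
  const unsigned int tileY      = launchIndex.y / tileSize;

  // Deal tiles round-robin along each tile row and shift the assignment by one
  // device per tile row, which produces a checkerboard pattern for two GPUs.
  const unsigned int globalTileX = localTileX * deviceCount + (deviceIndex + tileY) % deviceCount;

  return make_uint2(globalTileX * tileSize + launchIndex.x % tileSize,
                    launchIndex.y);
}
```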

In the local copy multi-GPU strategy, the individual 2D launches per GPU use (1/Nth + padding) of the output width times the full output height.
The ray generation program calculates which launch index on each GPU maps to which full resolution pixel to shoot primary rays.
The compositing of the individual 2D blocks into the final full resolution output buffer happens with a native CUDA kernel, after the partial buffers from the other GPU devices have been copied to the device which holds the full resolution output buffer. That is usually the device which also runs the OpenGL implementation, so that the image can be displayed with an OpenGL texture blit.
That CUDA peer-to-peer (P2P) copy will automatically be routed through the fastest path by CUDA, that is via NVLINK when possible, otherwise via PCIE.
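The gather step is then a single copy per peer device, roughly like this (error checking omitted; names are illustrative):

```cpp
// Sketch of gathering the partial per-device output buffers onto the display
// device before compositing. cudaMemcpyPeerAsync is routed over NVLINK when
// available, otherwise over PCIE.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

void gatherPartialBuffers(int                       displayDevice,
                          const std::vector<int>&   devices,        // all active devices
                          const std::vector<void*>& partialBuffers, // one per device
                          const std::vector<void*>& stagingBuffers, // allocated on the display device
                          size_t                    partialBytes,
                          cudaStream_t              stream)         // stream on the display device
{
  cudaSetDevice(displayDevice);

  for (size_t i = 0; i < devices.size(); ++i)
  {
    if (devices[i] == displayDevice)
    {
      continue; // that partial buffer is already local to the display device
    }
    cudaMemcpyPeerAsync(stagingBuffers[i], displayDevice,
                        partialBuffers[i], devices[i],
                        partialBytes, stream);
  }
  // The compositing kernel on the display device then runs in the same stream
  // after these copies have been enqueued.
}
```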

Since that is happening in a progressive renderer, I'm not doing that compositing for each sub-frame because that would be a waste of bandwidth. Instead I composite and display every sub-frame only during the first half second, to get reasonable quality during interaction, and after that I composite and display progressive updates only once per second.

Additionally, various resources can be shared among devices to save memory, which is controlled via a bitfield inside the other renderers (nvlink_shared and newer). Look for "peerToPeer" inside these examples.
The renderers use the NVIDIA Management Library (NVML) to determine the NVLINK topology of the visible devices, or rather of the active ones selected by the renderer. Later examples can also share resources via PCIE, but that isn't really worth it due to the big performance impact.
Sharing textures via NVLINK came at little cost; sharing acceleration structures via NVLINK access was 2x to 5x slower than holding them on each device, depending on the ray divergence in a scene.
The shared resources are allocated on the device which received the fewest resources. That assumes boards with homogeneous VRAM configurations, but the source code is prepared to use the actual free memory as a heuristic as well.
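Independent of how the NVLINK topology is determined, enabling the actual device-to-device access follows the standard CUDA peer access pattern. A runtime API sketch of the idea (error checking omitted):

```cpp
// Sketch of checking and enabling peer-to-peer access between two devices
// before sharing resources across them.
#include <cuda_runtime.h>

bool enablePeerAccess(int deviceA, int deviceB)
{
  int aCanAccessB = 0;
  int bCanAccessA = 0;
  cudaDeviceCanAccessPeer(&aCanAccessB, deviceA, deviceB);
  cudaDeviceCanAccessPeer(&bCanAccessA, deviceB, deviceA);

  if (!aCanAccessB || !bCanAccessA)
  {
    return false; // no peer-to-peer path between these two devices
  }

  // Peer access is unidirectional, so enable it in both directions.
  cudaSetDevice(deviceA);
  cudaDeviceEnablePeerAccess(deviceB, 0);

  cudaSetDevice(deviceB);
  cudaDeviceEnablePeerAccess(deviceA, 0);

  return true;
}
```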

The performance of the individual GPUs is influenced by a lot of factors, and then there is also the transfer of the data, which adds overhead. Usually the primary GPU, which runs all the OS rasterization work, can handle less additional workload.
Just distributing work among GPUs based on their specifications alone wouldn't be enough, and that wouldn't easily work across different GPU generations without measuring how they really perform. Depending on the data flow, the system performance also plays a role in that.
Mind that most consumer desktop CPUs don't support enough PCIE lanes to handle even two GPUs with x16 PCIE lanes each. That requires workstation CPUs and motherboards. So the more data you need to transfer between GPUs, the more PCIE overhead you get.

I would recommend implementing a benchmark which determines the performance of your specific multi-GPU work distribution on each system configuration once and then use the resulting weights for the actual work.
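Such a benchmark can be as simple as timing one representative launch per device and normalizing the inverse times into weights. A sketch (the representative launch callback stands in for whatever workload you want to measure):

```cpp
// Sketch of deriving per-device weights from a one-time benchmark: time a
// representative launch on each device and use the inverse of the measured
// time as that device's weight. Error checking omitted; names are illustrative.
#include <cuda_runtime.h>
#include <chrono>
#include <functional>
#include <vector>

std::vector<double> benchmarkDeviceWeights(
    int numDevices,
    const std::function<void(int)>& representativeLaunch) // runs your workload on device d
{
  std::vector<double> weights(numDevices);
  double sum = 0.0;

  for (int d = 0; d < numDevices; ++d)
  {
    cudaSetDevice(d);

    representativeLaunch(d); // warm-up (module loading, BVH builds, etc.)
    cudaDeviceSynchronize();

    const auto t0 = std::chrono::steady_clock::now();
    representativeLaunch(d);
    cudaDeviceSynchronize(); // wait for all work on this device to finish
    const auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    weights[d] = 1.0 / seconds; // faster device -> bigger weight
    sum += weights[d];
  }

  for (double& w : weights)
  {
    w /= sum; // normalize to workload fractions
  }
  return weights;
}
```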

Distributing work dynamically would be quite involved because you would need to track what partial results reside on what GPU while making sure that the correct number of samples is maintained.

When it's not about the interactive performance, where individual sub-frames should be finished as quickly as possible, but about final frame rendering, where the overall rendering time should be minimized, there would be other solutions to distribute the work. In a tiled renderer which shoots all samples per pixel at once, it would be rather simple to distribute the work across heterogeneous GPUs by launching tiles from a queue onto whichever GPU finishes first.
You could also launch full sub-frames per GPU in that case and accumulate them into the final result weighted by the number of samples each GPU processed.
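Combining those sub-frames is then just a sample-count weighted average, for example (assuming each per-GPU buffer already holds the average radiance of its own samples):

```cpp
// Sketch of combining per-GPU accumulation results by the number of samples
// each GPU contributed. Buffers are flat float arrays (e.g. 3 floats per
// pixel) of identical size; names are illustrative.
#include <cstddef>
#include <vector>

void combineSubframes(const std::vector<std::vector<float>>& perGpuAverages,
                      const std::vector<unsigned int>&       perGpuSampleCounts,
                      std::vector<float>&                    finalImage)
{
  const size_t numValues = perGpuAverages[0].size();
  finalImage.assign(numValues, 0.0f);

  unsigned int totalSamples = 0;
  for (unsigned int samples : perGpuSampleCounts)
  {
    totalSamples += samples;
  }

  for (size_t g = 0; g < perGpuAverages.size(); ++g)
  {
    // Weight each GPU's averaged buffer by its share of the total samples.
    const float weight = static_cast<float>(perGpuSampleCounts[g]) / static_cast<float>(totalSamples);
    for (size_t i = 0; i < numValues; ++i)
    {
      finalImage[i] += weight * perGpuAverages[g][i];
    }
  }
}
```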

There is a slightly different weighted workload distribution mechanism described in Chapter 10 of the first Ray Tracing Gems book. You would just need to determine the proper weights for a given system configuration.
Since that uses 1D scanlines instead of 2D tiles, it is better used with 1D launches so that it doesn't clash with the internal warp-sized blocks used with 2D launches.