Very poor multi-GPU scaling on DGX-1


We have tested OptiX on all eight Tesla V100 GPUs in a DGX-1 machine and, to our surprise, observed very poor scaling.

On 1 GPU we saw an occupancy of about 50% and a frame rate of 80 fps.
On 8 GPUs, occupancy dropped to about 20% per GPU and the frame rate was only 180 fps.

Are there any settings or flags that need to be enabled to get better scaling?

When using multiple devices in OptiX, the output and input_output buffers reside in pinned host memory, and there is congestion when many GPUs write over the PCI-E bus to the same target.

If your renderer accumulates images, that expensive read-modify-write operation can be done in GPU-local buffers, and only the final result needs to be written to the output buffer residing in pinned memory. That should improve the multi-GPU scaling drastically.
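In OptiX 6.x and earlier, this pattern maps to the RT_BUFFER_GPU_LOCAL flag, which keeps a separate copy of the buffer in each device's local memory. A minimal host-side sketch, assuming the OptiX 6 C++ wrapper API; the buffer and variable names are illustrative, not from this thread:

```cpp
#include <optixu/optixpp_namespace.h>

// Sketch: keep the per-sample accumulation in a GPU-local buffer so the
// read-modify-write never crosses the PCI-E bus; only the final result
// is written to the pinned-memory output buffer once per launch.
void createRenderBuffers(optix::Context context,
                         unsigned width, unsigned height)
{
    // RT_BUFFER_GPU_LOCAL gives each device its own copy in device
    // memory; it is only valid combined with RT_BUFFER_INPUT_OUTPUT.
    optix::Buffer accum = context->createBuffer(
        RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL,
        RT_FORMAT_FLOAT4, width, height);
    context["accum_buffer"]->set(accum);

    // The displayable result lives in a regular output buffer, which
    // resides in pinned host memory when multiple devices are active.
    optix::Buffer output = context->createBuffer(
        RT_BUFFER_OUTPUT, RT_FORMAT_UNSIGNED_BYTE4, width, height);
    context["output_buffer"]->set(output);
}
```

The ray generation program then accumulates into accum_buffer every sample and converts the running average into output_buffer only once at the end of each launch.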

You can find more information by digging through the links in this thread and the threads it references:

Hi, we are currently considering purchasing an NVIDIA VCA machine with 8 Quadro P6000 GPUs.

How well does OptiX scale on the VCA? And are there any benchmarks of OptiX running on such a machine?

You should consider using Turing boards for GPU ray tracing nowadays.

The Pascal architecture is two generations older, while Turing contains dedicated ray tracing cores and tensor cores, so both pure ray tracing performance and AI denoising are much faster. Newer rasterization features are also available.

Then there could be this option:

We’re looking into running OptiX on a VCA to drive a very large display wall (8 m x 3 m) illuminated by eight 4K projectors.

From our tests on a 4-GPU system, we think OptiX’s load balancing feature may hurt the scaling performance across multiple GPUs. Since we want to render directly to the projectors, without copying the buffer to host memory first, is it possible to manually turn off load balancing in OptiX?

Also, can OptiX render on a distributed cluster of nodes (each outfitted with multiple GPUs)?