I discovered (and also found here: https://devtalk.nvidia.com/default/topic/539422/?comment=3792994), that OptiX stripes blocks of ~64K rays per device.
Is there an option to disable or change this approach?
The information you cite is from 2013 and said this could change in future versions.
OptiX should divide the workload depending on the actual performance of the devices. I’m actually not sure if there is still some 64k block size involved.
The work distribution is OptiX internal and not programmable at this time.
What was the GPU setup and method you used to determine possible workload distributions in OptiX?
What problem do you need to solve that would require controlling this from the OptiX API?
Multi-GPU scaling with OptiX normally works best with a homogeneous GPU setup. In my experience with multi-GPU scaling on a heterogeneous setup, scaling is better for more expensive workloads.
For example, I have a Quadro K6000 and Tesla K20C installed which use the same GPU but the K6000 is faster and has more VRAM. For interactive cases with higher frame rates it was fast enough to just use the K6000, also due to the OpenGL interoperability functionality available then. For expensive final frame rendering it was beneficial to use both devices.
The problem is that OptiX does not let the programmer manage the multi-GPU distribution.
We have a path tracer based on OptiX, and splitting the work across more GPUs is beneficial for final renders even at resolutions lower than 256x256, but OptiX doesn’t support that. We use these resolutions for tiling (we have our own formula to determine the number of pixels in a tile from the number of cards and their cores, something like the sum of all cores * some constant).
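In outline, such a heuristic might look like this (a minimal sketch; the function name and the constant are made up for illustration, this is not the poster’s exact formula):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical tile-size heuristic in the spirit described above: the total
// number of CUDA cores across the selected cards times a tuning constant
// gives a pixel budget per launch; the tile edge is the square root of that
// budget, clamped to the frame edge.
int tileEdge(const std::vector<int>& coresPerCard, double pixelsPerCore, int frameEdge)
{
    long long totalCores = 0;
    for (int c : coresPerCard)
        totalCores += c;                                  // sum of all cores
    int edge = static_cast<int>(std::sqrt(totalCores * pixelsPerCore));
    return std::min(edge, frameEdge);                     // never exceed the frame
}
```

With two GTX 690 halves (1536 cores each) and an assumed constant of 16 pixels per core, this gives a tile edge of 221, close to the ~220x220 tiles mentioned later in the thread.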
It would be better to have an input-output target per device, with a launch index per device.
For example: we use tiles for rendering, and one launch index corresponds to one pixel index. Say we have 350x350 pixels, where one card stores results into a 250x250 region and the other card into the remaining pixels (always the same pixels, but as the programmer I don’t know which pixels go to which card). After that, OptiX must copy these targets to the CPU, merge them, and copy the result back to both GPUs, even though the second GPU never uses those pixels.
I think you’re making some incorrect assumptions about the OptiX multi-GPU implementation.
Using multiple devices in one OptiX context will place output or input_output buffers into host memory (see Chapter 11 in the OptiX Programming Guide), which can be written (or read and written) by each GPU independently. After all GPUs have finished their work, the final result is in that host buffer and no further copying happens, unless you implemented a display routine.
Accesses to these buffers go over the PCI-E bus. That’s also the reason to avoid float3 output or input_output buffers as rendering targets: they are slower than float4 buffers due to their worse alignment.
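To make the alignment point concrete, here are plain C++ stand-ins for CUDA’s vector types (illustration only, not the CUDA headers):

```cpp
#include <cstddef>

// float4 is 16 bytes and 16-byte aligned, so every array element sits on a
// 16-byte boundary and can move as one vectorized load/store. float3 is
// 12 bytes with 4-byte alignment, so its elements straddle those boundaries.
struct Float3 { float x, y, z; };
struct alignas(16) Float4 { float x, y, z, w; };

// Element i of an array starts at byte i * elemSize; for 12-byte elements
// that offset is 16-byte aligned only for every fourth element.
bool elementIs16Aligned(std::size_t elemSize, std::size_t i)
{
    return (i * elemSize) % 16 == 0;
}
```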
That also means that the performance of multi-GPU can simply be slower than that of a single GPU if the workload is too small to saturate both GPUs because the buffer accesses are not on the GPU.
I can see that in my path tracer as well with launch sizes smaller than 256x256 and both GPUs work on it, so there is no hard 64k limit.
To save on PCI-E bandwidth I’m only displaying the first frame or the first 0.25 seconds of a progressive rendering during user interaction and then display the following sub-frames progression only once a second. That helps multi-GPU scaling simply because of the reduced PCI-E load, esp. for big resolutions.
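That throttling could be sketched like this (hypothetical helper; the 0.25 s and 1 s thresholds are the ones from the text):

```cpp
// Show every sub-frame during the first moments of user interaction, then
// only one sub-frame per second, to keep PCI-E readback traffic low.
bool shouldDisplay(double now, double interactionStart, double& lastShown)
{
    if (now - interactionStart < 0.25 || now - lastShown >= 1.0) {
        lastShown = now;     // remember when we last read back and displayed
        return true;
    }
    return false;            // skip the PCI-E readback for this sub-frame
}
```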
The granularity of individual rendering blocks is warp-sized, meaning each GPU works on 32 pixels at a time. Which pixels those are is always OptiX’ choice; the single-ray programming model doesn’t allow you to influence this.
Just for fun, you might see some of that if you color pixels depending on the device clock() results and compare single vs. multi-GPU renderings of identical scenes. Should be best seen with very coarse geometry, otherwise the acceleration structure traversal times will interfere. Search for “TIME_VIEW” in the OptiX SDK example code. Besides that, it’s a nice way to identify expensive regions in a rendering.
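The TIME_VIEW idea boils down to normalizing a clock delta into a displayable intensity, roughly like this (hypothetical helper, not the SDK’s actual code):

```cpp
// Take device clock() readings around the per-pixel work and map the elapsed
// ticks into a [0,1] intensity; maxTicks is a user-chosen normalization scale.
float timeToIntensity(long long t0, long long t1, long long maxTicks)
{
    long long dt = t1 - t0;
    if (dt < 0) dt = 0;      // guard against counter wrap between the reads
    float v = static_cast<float>(dt) / static_cast<float>(maxTicks);
    return v > 1.0f ? 1.0f : v;   // clamp so hot pixels don't overflow white
}
```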
Are you rendering into buffers of this small tile size (e.g. 256x256) or are you rendering into the actual final frame resolution (e.g. 1920x1080) and pick the tile destination indices in the ray generation program?
That is, launch size 256x256 but buffer size 1920x1080, where the ray generation program has offsets and sizes available to calculate the destination index?
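That scheme, sketched as a hypothetical helper (the names are not OptiX API identifiers):

```cpp
// The launch covers one tile; the ray generation program adds the tile
// origin to the launch index to get the destination pixel in the big buffer.
unsigned destIndex(unsigned launchX, unsigned launchY,         // index within the tile
                   unsigned tileOriginX, unsigned tileOriginY, // tile origin in the frame
                   unsigned frameWidth)
{
    unsigned px = tileOriginX + launchX;
    unsigned py = tileOriginY + launchY;
    return py * frameWidth + px;   // linear index into the frame-sized buffer
}
```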
In either case, if you find that 256x256 or smaller sizes do not show multi-GPU scaling in your implementation, why aren’t you adapting your heuristic in the multi-GPU case to get better scaling?
If you’re adapting your heuristic, note that scaling is not quite linear with the number of GPUs installed in a single system, because the buffer accesses over the PCI-E bus increase the overhead with each GPU added.
You’re right, but it doesn’t solve my problem ;)
As I wrote, it would be better to have an input-output target per device (resident on the device, something like GPU_LOCAL but with CPU read access), with a launch index per device. That would solve all these problems (provided you can split the job across different input-output buffers).
I did that right after I found out about this behavior, which is also when I created this topic ;)
I ran my scene on a GTX 690 with tiles of roughly 220x220 pixels, and one card worked at 100% while the second sat at 0%, so I tried changing the tile resolutions.
However, as you wrote, it works in your path tracer at this resolution, so with other cards I would probably get different results. My heuristic therefore depends on the machine configuration, because only OptiX has control over the distribution.
Also, in one of the previous OptiX versions (3.0 or 3.5? I don’t remember exactly), some changes to a buffer weren’t copied to both graphics cards, which was also fun ;)
We also use tiles to save some GPU memory, so we use a small buffer.
Our users can choose which graphics cards to use, so it looks weird if they choose two cards and one works at 100% while the other sits at 0% ;(
Anyway, thank you for your explanation, and have a nice day.
I’ve run some more benchmarks with my path tracer on a multi-GPU system, which reproduced the issues you experienced, and I filed a task to change the OptiX multi-GPU work distribution in a way that scales better at small launch sizes.
Unfortunately this won’t make it into the next imminent OptiX release, but it’s on the list for the next one after that.
Thank you, I’m looking forward to it ;)