I’m a little confused about how compute device capabilities are used with the OptiX denoiser. For instance, if the device has Tensor cores, it uses them. If not, does it fall back to the stream processors on the GPU, then to the CPU?
The OptiX Denoiser will only run on the GPU, never on the CPU.
It will use the fastest available method for each underlying GPU architecture automatically.
The denoiser kernels are compute kernels which run on the streaming multiprocessors anyway, but if Tensor cores are available, they will be used to speed up the inferencing.
The implementation ships with the display drivers, which means newer drivers can improve both speed and quality.
Thanks. This is very helpful.
How does this work with multiple GPUs of different architectures? For example, let’s say I have a Pascal Titan and a Turing 2060. Will the OptiX denoiser only use the display device, or can it use both, in which case it would use the stream processors of the Pascal card and the Tensor cores of the Turing card?
OptiX 7 has no knowledge of multiple devices!
That’s finally completely under the developer’s control and happens all inside the CUDA host code of your application.
This means you normally create one CUDA context per device and one OptiX context per CUDA context, and these are completely independent from OptiX’s point of view. Everything which should happen between boards is pure CUDA code.
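A minimal sketch of that pattern might look like the following (a hypothetical helper of my own, not code from the SDK samples; it needs the OptiX 7 SDK headers and an NVIDIA driver to run, and all error checking is omitted for brevity):

```cpp
#include <cuda_runtime.h>
#include <optix.h>
#include <optix_function_table_definition.h>
#include <optix_stubs.h>
#include <vector>

// One OptixDeviceContext per CUDA device. OptiX 7 never iterates
// devices itself -- the application owns this loop entirely.
std::vector<OptixDeviceContext> createContextsPerDevice()
{
    optixInit();  // load the OptiX entry points from the display driver

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    std::vector<OptixDeviceContext> contexts;
    for (int i = 0; i < deviceCount; ++i)
    {
        cudaSetDevice(i);
        cudaFree(nullptr);  // force creation of the primary CUDA context

        OptixDeviceContextOptions options = {};
        OptixDeviceContext context = nullptr;
        // Passing 0 as the CUcontext argument means "use the current context".
        optixDeviceContextCreate(0, &options, &context);
        contexts.push_back(context);
    }
    return contexts;
}
```

Any copies between the boards (for compositing, peer-to-peer transfers, etc.) would then be plain CUDA calls issued against the matching device.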
If you do exactly the same OptiX API calls on both contexts, there will be different kernels for the heterogeneous devices, the acceleration structures will be different (incompatible; they cannot be relocated from one device to the other in this case), and probably some more things.
One of my OptiX 7 examples does that, though that has not been tested with a heterogeneous GPU setup.
The preferred setup for multi-GPU should be identical board types, ideally with an NVLINK connection.
I would expect that the only rendering distribution strategies which work with that are obviously the single-GPU one and possibly the multi-GPU zero-copy (pinned memory) strategy.
All other rendering distribution strategies implemented there so far require copies between the two devices for final display and I do not know if that works with a heterogeneous GPU setup the way I implemented it.
Link here: https://forums.developer.nvidia.com/t/optix-advanced-samples-on-github/48410/4
That example also contains two methods (one Windows-only, one for both Windows and Linux) to figure out which device is the primary OpenGL device, to make CUDA–OpenGL interop work.
Anyway, when handling these two devices separately, you can distribute the work as you like, which, especially in a heterogeneous setup, would require some load balancing to make sure the slower board doesn’t bottleneck the rendering speed.
Also, the denoiser will run differently: as said, the RTX board will use its Tensor cores, while the Pascal board obviously cannot.
Here I would try to run the denoiser only on the faster board on the full image.
That would be simpler than denoising two tiles of the image, one on either board, which requires an overlap area between the tiles (query it with optixDenoiserComputeMemoryResources), and then you still need to get the results into the final full image anyway.
If you’re not actually rendering with OptiX but only want to apply the denoiser, then I would recommend not using two devices at all; just pick the faster one.
(OptiX 6 would not allow multi-GPU on your configuration. It will pick all GPUs with the highest compatible SM versions, which would be the RTX board in your setup. It’s also either all boards with RT cores or none.)
Thank you for this detailed reply. I didn’t realize this was made possible in OptiX 7. That’s great!
I’ll try your OptiX 7 example to see how that works in a heterogeneous GPU setup. The ideal use case would be boards of the same architecture and model communicating over NVLINK, but I would like the flexibility to not require that, even if it means more complexity in the code and some potential loss of performance.
Choosing the right rendering distribution approach will require some exploration and experimentation, particularly with respect to memory and copies. For load balancing I was thinking of keeping it simple by subdividing tiles for the slower GPUs. The subdivided tiles may be smaller than optimal, but it would avoid idle time while waiting for the slower GPUs to finish.
For the denoiser, I’m not too concerned. It’s fast enough that I would probably just have the fastest card do the entire image. An alternative would be to have the fastest GPU denoise each tile as it is completed, before composition.
My multi-GPU OptiX 7 example is currently implemented to work with all GPUs on a single frame. It evenly distributes the work in a checkerboard pattern to the devices.
You could also distribute full alternate frames. Internally I have a strategy which renders all samples per pixel per tile (final-frame bucket rendering). Both of these would be better implemented with a multi-threaded work queue to automatically load-balance the work distribution on a heterogeneous GPU setup.
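The checkerboard assignment described above can be illustrated with a tiny sketch (the function name and exact mapping are my own illustration, not necessarily what the example uses):

```cpp
// Checkerboard assignment: tile (tx, ty) goes to device (tx + ty) % numDevices.
// With two devices this produces the classic checkerboard pattern, so
// horizontally or vertically adjacent tiles always land on different devices.
int deviceForTile(int tx, int ty, int numDevices)
{
    return (tx + ty) % numDevices;
}
```

With a static mapping like this the distribution is even in tile count but not in cost, which is why a work queue helps on a heterogeneous setup: devices pull the next tile when they finish, instead of being assigned a fixed share up front.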
So far my examples are all single-threaded, also because my development system is a dual Quadro RTX 6000 NVLINK setup.
Have a look into Ray Tracing Gems: https://www.realtimerendering.com/raytracinggems/
Chapter 2.10 contains a load balancing scheme with weighting of GPU performance. I haven’t used that so far.
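The core idea of performance weighting can be sketched like this (a simplified illustration of my own, not the book’s exact scheme): measure each device’s last frame time, convert it to a speed weight, and split the scanlines proportionally.

```cpp
#include <cmath>
#include <vector>

// Split `height` scanlines across devices proportionally to measured speed.
// Speed is taken as the reciprocal of the last frame time; the last device
// receives the rounding remainder so the rows always sum to `height`.
std::vector<int> splitRowsByFrameTime(int height, const std::vector<double>& frameTimeMs)
{
    double totalSpeed = 0.0;
    for (double t : frameTimeMs) totalSpeed += 1.0 / t;

    std::vector<int> rows(frameTimeMs.size());
    int assigned = 0;
    for (size_t i = 0; i + 1 < frameTimeMs.size(); ++i)
    {
        rows[i] = static_cast<int>(std::lround(height * (1.0 / frameTimeMs[i]) / totalSpeed));
        assigned += rows[i];
    }
    rows.back() = height - assigned;  // rounding remainder goes to the last device
    return rows;
}
```

Re-measuring every few frames lets the split adapt if one board is also driving the display or is otherwise loaded.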
The OptiX 7 denoiser will only take a few milliseconds on a FullHD image on a board with Tensor cores. Tiling will always be more complicated.
It got much faster since the initial versions. If you’re on Windows, use at least the 442.50 drivers for better quality.
You cannot denoise tiles individually without having the adjacent border pixels around them available, or you’ll get seams. That border size needs to be queried with optixDenoiserComputeMemoryResources(), which currently reports that you need a 64-pixel overlap.
So let’s say you have 64x64 tiles; then you need up to the eight tiles around the center tile you want to denoise. Trust me, use it on full frames.
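To make the overlap arithmetic concrete, here is a small sketch (my own illustration, not SDK code) computing the input region the denoiser would need for one tile, clamped to the image borders:

```cpp
#include <algorithm>

struct Region { int x0, y0, x1, y1; };  // half-open pixel bounds

// Input region required to denoise tile (tx, ty) of size `tile` pixels
// with `overlap` border pixels, clamped to a width x height image.
Region tileInputRegion(int tx, int ty, int tile, int overlap, int width, int height)
{
    Region r;
    r.x0 = std::max(0, tx * tile - overlap);
    r.y0 = std::max(0, ty * tile - overlap);
    r.x1 = std::min(width,  (tx + 1) * tile + overlap);
    r.y1 = std::min(height, (ty + 1) * tile + overlap);
    return r;
}
```

With 64x64 tiles and a 64-pixel overlap, an interior tile needs a 192x192 input region, i.e. the center tile plus all eight surrounding tiles, which matches the point above: at that ratio, tiling buys you very little over denoising the full frame.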