In my application I need to perform fast denoising of a large number of low-res images (e.g. 32x32 or 64x64 pixels). These images are all independent of each other, so denoising each image individually gives a theoretically perfect result.
I am aware that my use case is not what the OptiX denoiser was primarily designed for.
Unsurprisingly, performing 1024 optixDenoiserInvoke() calls on 32x32 images takes one to two orders of magnitude longer than performing a single optixDenoiserInvoke() on one 1024x1024 image (i.e. the same total number of pixels). Note that I measure the total time after issuing all the optixDenoiserInvoke() calls and waiting for CUDA_SYNC_CHECK(), and I don’t include any host/GPU memory transfers (only the execution time of the optixDenoiserInvoke() calls).
This is caused by the fact that optixDenoiserInvoke() takes about the same time for all resolutions up to about 256x256, so there is some per-invocation overhead that is independent of the resolution, and this overhead dominates when denoising low-res images like 32x32.
I tried to mitigate this by running multiple optixDenoiserInvoke() calls on multiple preallocated denoisers, each using its own stream. This helps to speed up the denoising, but it doesn’t scale much above ~10 denoisers & streams (and even then the speedup is not 10x but more like 5x).
Curiously, while denoising a single image takes approximately 0.5 ms, running 100 parallel optixDenoiserInvoke() calls on 100 images takes closer to 6 ms instead of 0.5 ms. In other words, calling the asynchronous optixDenoiserInvoke() 100x with 100 different streams doesn’t scale anywhere close to a 100x speedup.
I also tried the multiple-denoisers approach with each optixDenoiserInvoke() issued from its own thread (to test whether it could be related to a CPU bottleneck), but this didn’t help.
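A minimal sketch of what I mean by one preallocated denoiser + one stream per image (all setup via optixDenoiserCreate()/optixDenoiserSetup() is omitted and the names are placeholders):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <optix.h>
#include <optix_stubs.h>
#include <vector>

// Sketch only: one preallocated denoiser and one CUDA stream per 32x32 image.
// All handles and device buffers are assumed to exist already; names are placeholders.
struct PerImage
{
    OptixDenoiser denoiser;
    CUstream      stream;
    OptixImage2D  input, output;      // 32x32 noisy input / denoised output
    CUdeviceptr   state, scratch;
    size_t        stateSize, scratchSize;
};

void invokeAll(const std::vector<PerImage>& items)
{
    OptixDenoiserGuideLayer guideLayer = {};   // no albedo/normal/flow guides

    for (const PerImage& it : items)
    {
        OptixDenoiserLayer layer = {};
        layer.input  = it.input;
        layer.output = it.output;

        OptixDenoiserParams params = {};

        optixDenoiserInvoke(it.denoiser, it.stream, &params,
                            it.state, it.stateSize,
                            &guideLayer,
                            &layer, 1,          // one layer per invocation
                            0, 0,               // inputOffsetX / inputOffsetY
                            it.scratch, it.scratchSize);
    }
    cudaDeviceSynchronize();                    // all invokes are async; sync before timing
}
```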
When I tried combining all the small 32x32 images into one large image (consisting of multiple tiles, each separated by a configurable “overlap” of black pixels between the actual content), the single large optixDenoiserInvoke() is again much faster than multiple small parallel optixDenoiserInvoke() calls. However, as I feared, this also resulted in the black background leaking into the final denoised result, and I was unable to get that within the precision margin of my application (while applying the denoiser separately on 32x32 images works fine).
This was all tested on RTX 3070 + Ryzen 7900 + Ubuntu 22.04.2, driver 535.86.05, CUDA 12.2, OptiX 7.6.
I was also using modelKind OPTIX_DENOISER_MODEL_KIND_HDR and pixel format OPTIX_PIXEL_FORMAT_FLOAT3
So here are my questions:
1.) Is there anything I could try that could improve the performance when using OptiX 7.6?
2.) Could upgrading to OptiX 8.0 improve the performance?
3.) If not, is there any chance that some future version could bring the denoising speed for this scenario closer to the theoretical limit (instead of the current ~10x slower than the theoretical limit, even when using multiple denoisers + streams)?
I was also using modelKind OPTIX_DENOISER_MODEL_KIND_HDR and pixel format OPTIX_PIXEL_FORMAT_FLOAT3
Is that “also” in addition to the other system configuration options, or do you mean you tried other denoiser input formats as well?
I’m asking because the OptiX denoisers use half internally, so using float formats as input might be slower than it needs to be.
There can also be a difference between 3- and 4-component inputs due to hardware vectorized loads for 4-components, so maybe try OPTIX_PIXEL_FORMAT_HALF4 instead.
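For reference, a 32x32 input described with OPTIX_PIXEL_FORMAT_HALF4 would look roughly like this (a sketch; the device pointer is a placeholder):

```cpp
#include <cuda.h>
#include <optix.h>

// Sketch: describing an input image with 4 half components per pixel.
// d_pixelsHalf4 is a placeholder device pointer to the packed half4 data.
OptixImage2D makeHalf4Image(CUdeviceptr d_pixelsHalf4, unsigned int width, unsigned int height)
{
    OptixImage2D img = {};
    img.data               = d_pixelsHalf4;
    img.width              = width;                        // e.g. 32
    img.height             = height;                       // e.g. 32
    img.pixelStrideInBytes = 4 * sizeof(unsigned short);   // 4 x 16-bit half = 8 bytes
    img.rowStrideInBytes   = width * img.pixelStrideInBytes;
    img.format             = OPTIX_PIXEL_FORMAT_HALF4;
    return img;
}
```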
The AOV denoiser models have received continuous quality and performance improvements across driver versions and should definitely be tested instead of the LDR and HDR models, which didn’t.
The denoiser implementation is part of the display driver, like the OptiX 7/8 implementation itself.
That means improvements in denoiser performance can be expected from changing display drivers and denoiser models rather than from OptiX SDK versions.
That said, I would always recommend using the newest available OptiX SDK version if possible.
The OptiX Denoiser API changed among SDK versions and some small application code adjustments might be necessary. Always read the OptiX Release Notes when switching OptiX SDK versions.
The optixDenoiser example inside the OptiX SDK releases shows the usage of the different denoiser models on loaded image data.
In principle the optixDenoiserInvoke calls should run fully asynchronously to the CPU since they take a CUDA stream argument.
Your observation that sizes below 256x256 aren’t scaling well is mainly due to not saturating the GPU with such small workloads, which depends on the respective underlying hardware resources.
For such cases running multiple denoiser invocations in separate CUDA streams can actually scale, but don’t overdo it. Using 100 CUDA streams is unlikely to help. Switching between them isn’t free either. I would have used a maximum of 8 or 16 maybe. Benchmark that.
Also, I would recommend not using the CUDA default stream for that, since it might have different synchronization behavior. (When using the CUDA Driver API you have full control over that.)
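A minimal sketch with the CUDA Runtime API (the Driver API equivalent would be cuStreamCreate() with CU_STREAM_NON_BLOCKING); the pool size of 8 is just an example:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Sketch: create a small pool of non-default, non-blocking CUDA streams up front,
// then issue one optixDenoiserInvoke() per stream, round-robin. Benchmark the count.
std::vector<cudaStream_t> createStreamPool(int count = 8)
{
    std::vector<cudaStream_t> streams(count);
    for (cudaStream_t& s : streams)
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    return streams;
}
```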
I wouldn’t be surprised if there is still some fixed overhead in the denoising invocation which would become visible with many small inputs. Whether that is still the case with the AOV denoiser on workloads which saturate the GPU would need to be investigated.
I meant that this was the only configuration I had tested. I have now followed your advice and tested with OPTIX_PIXEL_FORMAT_HALF4 as well (it didn’t seem to have any significant effect on speed).
I also tried the AOV denoiser model. Unfortunately I did not see any speedup with it. It was actually a bit slower than HDR (tested on Linux & Windows, both with a single stream and with multiple streams), although it appeared to require slightly fewer samples for comparable quality. But since I am in a situation where the denoising is slower than the actual path tracing, a “dumber” but faster denoiser is preferable for my use case.
The AOV denoiser also doesn’t seem to scale with multiple threads (only with multiple streams).
I also tried passing all the layers as one long vector and setting the number of input layers accordingly (this was for the AOV model). As expected, it executed significantly faster than using multiple streams with a single denoiser invoke per layer. However, also as expected, the denoised outputs were all mixed together, since in a multi-layer AOV invocation the layers are considered to be correlated and treated as such (that is my understanding at least).
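A minimal sketch of what I mean by passing all layers in one invocation (the image arrays are placeholders; denoiser setup is omitted):

```cpp
#include <cuda.h>
#include <optix.h>
#include <optix_stubs.h>
#include <vector>

// Sketch: a single optixDenoiserInvoke() with N layers instead of N invocations.
// inputs/outputs hold one 32x32 OptixImage2D per small image; state/scratch
// come from optixDenoiserSetup() on the combined dimensions.
void invokeAllLayersAtOnce(OptixDenoiser denoiser, CUstream stream,
                           const std::vector<OptixImage2D>& inputs,
                           const std::vector<OptixImage2D>& outputs,
                           CUdeviceptr state, size_t stateSize,
                           CUdeviceptr scratch, size_t scratchSize)
{
    std::vector<OptixDenoiserLayer> layers(inputs.size());
    for (size_t i = 0; i < inputs.size(); ++i)
    {
        layers[i].input  = inputs[i];
        layers[i].output = outputs[i];
    }

    OptixDenoiserGuideLayer guideLayer = {};   // no guide images in this test
    OptixDenoiserParams     params     = {};

    optixDenoiserInvoke(denoiser, stream, &params,
                        state, stateSize,
                        &guideLayer,
                        layers.data(), static_cast<unsigned int>(layers.size()),
                        0, 0,
                        scratch, scratchSize);
}
```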
It would be nice to have something like a multi-layer “batch” mode for the HDR denoiser model (the interface already supports passing the layers that way, it just ignores all the additional ones).
Otherwise I didn’t find a way to overcome the significant overhead of running denoising on many low-res images.
I also tried putting all the images into a single big texture and denoising that with a single denoiser invoke (which, as already mentioned, is much faster). But even when I set the albedo guide layer (and enabled albedo in the denoiser options) so that the albedo for the “black” overlap areas between tiles was 0 and the albedo for areas with actual content was 1, this still didn’t seem to help the denoiser realize it shouldn’t mix color pixels from the black “overlap” areas into areas that had albedo 1. Is this supposed to work like that? Or maybe I had a bug in the code?
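A minimal sketch of that setup (buffer allocation, optixDenoiserSetup() and the tile packing are omitted; all names are placeholders):

```cpp
#include <cuda.h>
#include <optix.h>
#include <optix_stubs.h>

// Sketch: HDR denoiser created with the albedo guide enabled, and an albedo image
// that is 0 in the black gutters and 1 inside the 32x32 tiles.
OptixDenoiser createHdrDenoiserWithAlbedo(OptixDeviceContext context)
{
    OptixDenoiserOptions options = {};
    options.guideAlbedo = 1;

    OptixDenoiser denoiser = nullptr;
    optixDenoiserCreate(context, OPTIX_DENOISER_MODEL_KIND_HDR, &options, &denoiser);
    return denoiser;
}

// Sketch: single invoke on the big packed image with the 0/1 albedo guide.
void denoiseBigImage(OptixDenoiser denoiser, CUstream stream,
                     const OptixImage2D& bigInput, const OptixImage2D& bigOutput,
                     const OptixImage2D& albedo,   // 0 in gutters, 1 in tiles
                     CUdeviceptr state, size_t stateSize,
                     CUdeviceptr scratch, size_t scratchSize)
{
    OptixDenoiserGuideLayer guideLayer = {};
    guideLayer.albedo = albedo;

    OptixDenoiserLayer layer = {};
    layer.input  = bigInput;
    layer.output = bigOutput;

    OptixDenoiserParams params = {};
    optixDenoiserInvoke(denoiser, stream, &params,
                        state, stateSize,
                        &guideLayer, &layer, 1,
                        0, 0,
                        scratch, scratchSize);
}
```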
I suspect it looks almost like some sort of CPU bottleneck (but one which doesn’t scale with multiple threads). That’s because my Linux machine has an RTX 3070 and denoising there is actually considerably faster than on my Windows machine with an RTX 3080 (and a comparable CPU), and this is what tends to happen with other, more complicated code of mine when compiled with gcc vs. MSVC.
Yes, I create all the streams before using them and I do not use the default one. It also appears that the performance doesn’t scale beyond 8 parallel streams.
The AOV denoiser also doesn’t seem to scale with multiple threads (only with multiple streams).
Yes this is expected; CPU threads don’t affect GPU serialization, only CUDA streams do. However, your note on Linux vs Windows, and 3070 vs 3080 is pretty interesting. I would recommend using Nsight Systems to profile on both machines and see if you can spot where the speed difference is. I would expect the actual kernel execution times on the 3080 to be faster than the 3070, all else being equal. If there is a difference in CPU speed that is bottlenecking your program, you might be able to see it in the gaps between kernels, or in the host-side profile directly. Nsight Systems will include CPU-side functions as well, if you let it, so you get a nice picture of both CPU and GPU activity, and how they intermingle.
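For example, something along these lines on both machines (the executable and report names are placeholders):

```
nsys profile --trace=cuda,nvtx,osrt -o denoiser_report ./your_denoiser_app
```

Opening the resulting report in the Nsight Systems GUI lets you compare the kernel durations and the gaps between them on the two machines.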
I also tried putting all the images into single big texture
I think this is probably the only hope for batching many small images with the current API. Maybe consider playing with a scheme where you extend the boundary pixels of your image tiles into the gutter between images, so it’s not just black. That way, if the denoiser does a little filtering that brings in pixels from outside the image boundary, you won’t notice. There are multiple ways you might do that, e.g. copying pixels along columns & rows, or by diffusing the color. I would assume wrapping (where you start using pixels from the left side of the image once you run off the right side) might have the same problem as using a black background, but I’m not sure. There might be ways to put a different color that’s not black against the boundary and get the denoiser to believe it’s an edge or high frequency detail.
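A minimal sketch of the edge-extension idea on the host, assuming the tiles are packed into one big RGB float canvas with a gutter of a few pixels around each tile (names and layout are assumptions):

```cpp
#include <algorithm>
#include <vector>

// Sketch: copy a tileW x tileH tile to (x0, y0) in the big canvas and fill a
// gutter of 'gutter' pixels around it by clamping to the nearest tile pixel,
// so the denoiser never sees black right next to the tile content.
// canvas and tile are tightly packed RGB float buffers; canvasW is the canvas
// width in pixels. Assumes (x0 - gutter, y0 - gutter) stays inside the canvas.
void placeTileWithClampedGutter(std::vector<float>& canvas, int canvasW,
                                const std::vector<float>& tile, int tileW, int tileH,
                                int x0, int y0, int gutter)
{
    for (int y = -gutter; y < tileH + gutter; ++y)
    {
        const int srcY = std::clamp(y, 0, tileH - 1);   // clamp row to tile interior
        for (int x = -gutter; x < tileW + gutter; ++x)
        {
            const int srcX = std::clamp(x, 0, tileW - 1);
            const size_t dst = (size_t(y0 + y) * canvasW + (x0 + x)) * 3;
            const size_t src = (size_t(srcY) * tileW + srcX) * 3;
            canvas[dst + 0] = tile[src + 0];
            canvas[dst + 1] = tile[src + 1];
            canvas[dst + 2] = tile[src + 2];
        }
    }
}
```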
It’s a good idea, but I don’t think the albedo or alpha buffers can be used to mask what pixels get used for denoising. But maybe playing with the background more will yield a viable approach.