I was also using modelKind OPTIX_DENOISER_MODEL_KIND_HDR and pixel format OPTIX_PIXEL_FORMAT_FLOAT3
Is that “also” in addition to the other system configuration options, or do you mean you tried other denoiser input formats as well?
Asking because the OptiX denoisers use half-precision formats internally, so using float formats as input might be slower than it needs to be.
There can also be a difference between 3- and 4-component inputs due to hardware-vectorized loads for 4-component data, so maybe try OPTIX_PIXEL_FORMAT_HALF4 instead.
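For reference, here is a minimal sketch of how the noisy input layer could be described as 4-component half data. The field names follow the OptixImage2D struct from the OptiX 7/8 headers; d_input, width and height are placeholders for your own buffer and resolution:

```cpp
// Sketch: describe the noisy beauty input as RGBA half data (alpha can be unused).
// Assumes d_input is a CUdeviceptr to width * height * 4 half values filled by the renderer.
OptixImage2D inputLayer = {};
inputLayer.data               = d_input;
inputLayer.width              = width;
inputLayer.height             = height;
inputLayer.pixelStrideInBytes = 4 * sizeof(unsigned short); // 4 x 16-bit half = 8 bytes
inputLayer.rowStrideInBytes   = width * inputLayer.pixelStrideInBytes;
inputLayer.format             = OPTIX_PIXEL_FORMAT_HALF4;
```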
Please read this thread about which denoiser models are recommended today:
https://forums.developer.nvidia.com/t/optix-8-0-denoiser-camera-space-vs-world-space/262875/4
The AOV denoiser models have received continuous quality and performance improvements over driver versions and should definitely be tested instead of the LDR and HDR models, which didn’t.
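Switching the creation call to the AOV model kind would look roughly like this (OptiX 7.3+ style API; the exact OptixDenoiserOptions fields depend on the SDK version, so check your headers, and OPTIX_CHECK stands for the usual error checking as in the SDK samples):

```cpp
// Sketch: create an AOV denoiser instead of the HDR one.
OptixDenoiserOptions options = {};
options.guideAlbedo = 1; // assumption: an albedo guide layer is provided
options.guideNormal = 1; // assumption: a normal guide layer is provided

OptixDenoiser denoiser = nullptr;
OPTIX_CHECK(optixDenoiserCreate(context,                       // your OptixDeviceContext
                                OPTIX_DENOISER_MODEL_KIND_AOV, // instead of _HDR
                                &options,
                                &denoiser));
```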
The denoiser implementation is part of the display driver, like the OptiX 7/8 implementation itself.
That means improvements in denoiser performance can be expected from changing display drivers and denoiser models rather than OptiX SDK versions.
If possible, I would still recommend always using the newest available OptiX SDK version though.
The OptiX Denoiser API has changed between SDK versions and some small application code adjustments might be necessary. Always read the OptiX Release Notes when switching OptiX SDK versions.
The optixDenoiser example inside the OptiX SDK releases shows the usage of the different denoiser models on loaded image data.
In principle the optixDenoiserInvoke calls should run fully asynchronously to the CPU since they take a CUDA stream argument.
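To illustrate (a sketch only, assuming the denoiser state, scratch, guide layer and layer setup from the optixDenoiser SDK sample; OPTIX_CHECK/CUDA_CHECK are the usual error-check macros):

```cpp
// The call only enqueues work on 'stream' and returns to the CPU immediately.
OPTIX_CHECK(optixDenoiserInvoke(denoiser, stream, &params,
                                d_state, stateSizeInBytes,
                                &guideLayer,
                                &layer, 1,  // one input/output layer
                                0, 0,       // inputOffsetX/Y (no tiling)
                                d_scratch, scratchSizeInBytes));

// The CPU is free to do other work here.
// Only synchronize when the denoised result is actually needed.
CUDA_CHECK(cudaStreamSynchronize(stream));
```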
Your observation that sizes below 256x256 aren’t scaling well is mainly due to not saturating the GPU with such small workloads, which depends on the underlying installed hardware resources.
For such cases, running multiple denoiser invocations in separate CUDA streams can actually scale, but don’t overdo it. Using 100 CUDA streams is unlikely to help, and switching between them isn’t free either. I would use a maximum of maybe 8 or 16. Benchmark that.
I would also recommend not using the CUDA default stream for that, since it might have different synchronization behavior. (When using the CUDA Driver API you have full control over that.)
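A hedged sketch of such a small stream pool, assuming one denoiser plus state/scratch buffers per stream so concurrent invocations don’t share mutable memory (denoisers, states, scratches and tiles are placeholders for your own per-stream resources and per-image inputs; with the CUDA Driver API the equivalent stream creation would be cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING)):

```cpp
constexpr int kNumStreams = 8; // benchmark 8 vs. 16 on your hardware

std::vector<cudaStream_t> streams(kNumStreams);
for (auto& s : streams)
    CUDA_CHECK(cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking)); // avoid the default stream

for (size_t i = 0; i < tiles.size(); ++i)
{
    const int s = static_cast<int>(i % kNumStreams); // round-robin over the stream pool
    OPTIX_CHECK(optixDenoiserInvoke(denoisers[s], streams[s], &params,
                                    states[s].ptr, states[s].size,
                                    &tiles[i].guideLayer,
                                    &tiles[i].layer, 1,
                                    0, 0,
                                    scratches[s].ptr, scratches[s].size));
}

for (auto& s : streams)
    CUDA_CHECK(cudaStreamSynchronize(s));
```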
I wouldn’t be surprised if there is still some fixed overhead in the denoising invocation which would become visible with many small inputs. Whether that is still the case with the AOV denoisers on workloads which saturate the GPU would need to be investigated.