First, I would recommend using OptiX 7.1.0 and the R450 drivers.
The denoiser interface has been changed slightly and the actual implementation has been improved.
I adjusted my OptiX 7 advanced examples last week to compile with either OptiX 7.0.0 or 7.1.0. The default is still 7.0.0, and switching to 7.1.0 only requires changing the respective find_package(OptiX7 REQUIRED) to find_package(OptiX71 REQUIRED) inside the CMakeLists.txt files.
Look for the added OPTIX_VERSION compile-time checks in the intro_denoiser example to see where the API has been changed between the two OptiX 7 versions.
- Do the results from the tiled and non-tiled denoiser match?
I cannot say whether the result is pixel-identical; I’ve never tried it myself. It’s running the same denoiser, and the overlap size in pixels is meant to cover the internal kernel size so that there are no seams at the inner tile borders.
- Is it correct to pass tile width/height (without overlapping) to
Yes, as documented: https://raytracing-docs.nvidia.com/optix7/api/html/group__optix__host__api__denoiser.html#ga2020ac0b7346bc7f1b256f8fea1cf140
Note that OptiX 7.1.0 changed the OptixDenoiserSizes structure so that the scratch space sizes and the overlap in pixels can be queried in a single call, where OptiX 7.0.0 required two calls: one to get the overlap size first and then one for the required scratch spaces.
- Is the value overlapWindowSizeInPixels in OptixDenoiserSizes the sum of the left and right (top and bottom) overlap widths? So if I pass 32x32 to optixDenoiserComputeMemoryResources as the tile size and get overlapWindowSizeInPixels == 64, will the width/height of the input layers be 96x96 (with the actual output size being 32x32 for most of the region)?
It’s the single overlap toward adjacent data, i.e. a 64-pixel-wide border around the center tile.
That means in your case the input for an inner tile would be 64 + 32 + 64 pixels wide, and for tiles at the image edges 32 + 64.
That isn’t going to work, because you wouldn’t have enough data for the second and second-to-last row or column of tiles in your full image.
I would not recommend doing this at all; see below.
- If Q.2 is true, I’m confused by the description:
'outputWidth' and 'outputHeight' must be greater than or equal to the dimensions passed to optixDenoiserSetup.
at the comments for
Is this description correct? Because the description at
'inputWidth' and 'inputHeight' must include overlap on both sides of the image if tiling is being used.
and since I receive overlapWindowSizeInPixels == 64 with a tile size of 32x32, this seems impossible.
The overlap size in pixels is the overlap between two adjacent tiles at each edge.
That means with 32x32 tiles you would need at least two rows of these tiles around your center tile to fulfill the required 64-pixel overlap around it.
Tiles at the borders of the full image have no overlap outside the full image; the overlap only applies to the inner edges.
I don’t see these quotes inside the OptiX 7.1.0 online documentation.
EDIT: Found them inside the API reference. That’s one of the things that changed between OptiX 7.0.0 and 7.1.0.
That does sound confusing, but optixDenoiserComputeMemoryResources is used to calculate the required memory, which needs to be at least as large as the input image given to optixDenoiserSetup. That’s why the docs say the output sizes in optixDenoiserComputeMemoryResources need to be the same as or bigger than the input sizes in optixDenoiserSetup; otherwise you would not allocate enough memory.
If you normally set up your denoising on full images, you simply give the full input image size to the setup; you don’t have more data after all. You can then do tiled denoising on that full input image by setting up the proper OptixImage2D for each tile with overlap, calculating the start address and size as required. The row stride takes care of accessing the proper input pixels of the full image. In that case each denoiser invocation is on a smaller size than that input size.
Please read this OptiX 7.1.0 chapter and look carefully at the diagrams and code listing there:
Now if you are planning to allocate only as much data as a single tile requires, the documentation says that should include the overlap, because in that case each individual denoiser invocation will happen on at least that input size for an inner tile. The tiles at the edges would require less memory. The issue with that is that you need to be able to fill the data of this single tile with all its overlap pixels, and you still need to set up the OptixImage2D differently for tiles at the edge of the image versus inner tiles with full border pixels.
This would be an even bigger hassle than doing tiled denoising on a full image with the proper offsets and strides to pick out the individual tiles. You would also need to copy each denoised tile to its final location.
In addition to the questions, I noticed that the OptiX 7.1.0 SDK actually contains a (non-tiled) denoiser sample, but the project was not added to the root “OptiX SDK 7.1.0\SDK\CMakeLists.txt”.
Right, that’s known. https://forums.developer.nvidia.com/t/optixdenoiserinvoke-pixel-format/139854/4
Generally it doesn’t make sense to use tiled denoising unless you have memory consumption issues.
There was a time in the past when the denoiser didn’t work on huge images, e.g. 8192 x 8192 was over the limit of 2 Gig elements inside cuDNN. That no longer applies to the denoiser in the R450 drivers.
It’s faster to denoise a whole image, so even with a tiled renderer I would render the full image first and denoise once at the end.
For that, it’s recommended to use the minimum number of tiles in the denoiser. E.g. back when 8192x8192 didn’t work, two tiles covering the top and bottom halves of the image, with 8192 * (4096 + 64) and 8192 * (64 + 4096) denoiser tile inputs, would have been sufficient to overcome the previous limitation.
If it’s for memory limitations, still use the biggest denoiser tile size you can afford.
I also wouldn’t use 32x32 tiles for a GPU ray tracer unless each tile renders all samples per pixel at once. The number of threads should be well above 64k; otherwise there would be far too few threads to saturate a recent GPU.