OptiX Denoiser high CPU usage

I’m running the OptiX denoiser alongside a Vulkan project. I’m using the color, albedo, normal, and intensity buffers as inputs.

Without denoising, my project has about 1% CPU usage. When invoking the denoiser with optixDenoiserInvoke for each frame, I get about 15% CPU usage.

Why is that? Is there a way to reduce CPU usage?

Thanks

Hi @xilefmai,

I didn’t follow the details of the Vulkan-OptiX integration very carefully – so stupid question: what are the high-level steps involved in connecting Vulkan to OptiX? Without knowing anything, my wild speculation would be that the CPU usage is caused by buffer copies between the CPU and GPU. Is that possible, given what you had to do?

I’d recommend profiling the app to see what the CPU is doing. If it’s cudaMemcpy traffic, for example, you might be able to get some visibility with Nsight Systems; if it’s something unexpected on the app side, a regular CPU profiler might show it.

Side note, I believe the normal buffer is still being ignored by the denoiser. You might try not including the normal buffer in your pipeline and see if the CPU usage is affected.


David.

Hi, thanks for your reply,

I’m using the VK_NV_ray_tracing extension with Vulkan. I export the final ray-traced image buffer with VkExportMemoryWin32HandleInfoKHR and import it on the CUDA side with cudaImportExternalMemory. There are similar methods for sharing semaphores.
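
The CUDA-side import looks roughly like the sketch below. This is simplified, the names (importVulkanBuffer, win32Handle, allocationSize) are just placeholders, and it assumes the backing VkDeviceMemory was allocated with VkExportMemoryAllocateInfo and its HANDLE retrieved via vkGetMemoryWin32HandleKHR:

```cpp
#include <cuda_runtime.h>
#include <windows.h>

// Rough sketch: import an exported VkDeviceMemory allocation into CUDA
// and map it to a device pointer. No copy is involved.
void* importVulkanBuffer(HANDLE win32Handle, size_t allocationSize)
{
    cudaExternalMemoryHandleDesc memDesc = {};
    memDesc.type                = cudaExternalMemoryHandleTypeOpaqueWin32;
    memDesc.handle.win32.handle = win32Handle;
    memDesc.size                = allocationSize;

    cudaExternalMemory_t extMem = nullptr;
    cudaImportExternalMemory(&extMem, &memDesc);

    cudaExternalMemoryBufferDesc bufDesc = {};
    bufDesc.offset = 0;
    bufDesc.size   = allocationSize;

    void* devPtr = nullptr;
    cudaExternalMemoryGetMappedBuffer(&devPtr, extMem, &bufDesc);
    return devPtr;
}
```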

My synchronization is (a rough code sketch follows the list):

  • CUDA waits for a Vulkan semaphore, indicating when the ray-traced image is ready
  • Denoiser gets invoked
  • Vulkan waits for CUDA until the image is denoised (and ready to be copied to the swapchain images)
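
In code, the per-frame ordering is roughly this (a simplified sketch; the function and semaphore names are placeholders, and the semaphores are assumed to have been imported with cudaImportExternalSemaphore):

```cpp
#include <cuda_runtime.h>

// Rough per-frame sketch of the synchronization above (placeholder names).
void denoiseFrame(cudaExternalSemaphore_t renderDoneSem,   // signaled by Vulkan
                  cudaExternalSemaphore_t denoiseDoneSem,  // waited on by Vulkan
                  cudaStream_t stream)
{
    // 1. CUDA waits until Vulkan signals that the ray-traced image is ready.
    cudaExternalSemaphoreWaitParams waitParams = {};
    cudaWaitExternalSemaphoresAsync(&renderDoneSem, &waitParams, 1, stream);

    // 2. optixDenoiserInvoke(...) is launched here on the same stream.

    // 3. Signal Vulkan so it can copy the denoised image to the swapchain.
    cudaExternalSemaphoreSignalParams signalParams = {};
    cudaSignalExternalSemaphoresAsync(&denoiseDoneSem, &signalParams, 1, stream);
}
```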

I use VkBuffers for all denoiser inputs, shared directly between Vulkan and OptiX, device-local and without any copying. I found previous posts mentioning that sharing VkImages with OptiX is tricky (it actually involves copying), so I tried sharing plain VkBuffers instead, and it works great so far (apart from the CPU usage).
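
Each shared buffer is wrapped as a denoiser input roughly like this (a simplified sketch rather than my exact code; the FLOAT4 format and the strides are assumptions about the pixel layout):

```cpp
#include <cuda.h>
#include <optix.h>

// Wrap a device pointer obtained from a shared VkBuffer (via
// cudaExternalMemoryGetMappedBuffer) as a denoiser input layer; no copy involved.
OptixImage2D makeDenoiserLayer(CUdeviceptr sharedBuffer, unsigned int width, unsigned int height)
{
    OptixImage2D img = {};
    img.data               = sharedBuffer;                   // points into the VkBuffer memory
    img.width              = width;
    img.height             = height;
    img.pixelStrideInBytes = sizeof(float) * 4;              // assuming RGBA32F pixels
    img.rowStrideInBytes   = img.pixelStrideInBytes * width;
    img.format             = OPTIX_PIXEL_FORMAT_FLOAT4;
    return img;
}
```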

Regarding the normal buffer: if I enable it and put gibberish into it, then the denoised image is affected by that.

What are your image size, frame rate, and ray tracing workload? Are you able to CPU-profile the application?

I’ve asked a Vulkan expert who just tried a similar setup, and he sees high CPU usage when the ray tracing workload is small, the image size is small, and the frame rate is high. That would be expected, since the application is looping quickly. When he traces more rays at a larger image size, the CPU usage goes down, because the application has more time between launches. Would this explain your situation at all?


David.

I’ve tried different combinations now: I increased the SPP from 4 to 16, 24, and 32, and resized the window to small (256x256), medium (512x512), and large (1280x1280); nothing really had an impact on the CPU usage. My frame rate is about 35-40 FPS with the denoiser and 60+ FPS without it at 1280x720.

I use an external EVGA RTX 2070 in a Razer Core X connected over Thunderbolt 3, if that helps.

I profiled the project with nvprof; you can find the results here: https://gist.github.com/maierfelix/dbee9abe2fa77520228ad2a7b596904c. It seems that a call to cudaStreamSynchronize has a very large impact, and I don’t know the reason for this.

Edit: I’ve removed all cudaStream-related stuff from my denoiser setup; it didn’t have any impact. The source of the denoiser can be found here: https://github.com/maierfelix/nvk-optix-denoiser/blob/master/src/index.cpp#L244
Also, I’ve made sure that CUDA and Vulkan use the same GPU device.

Have you tried a CPU profile of your application? Nsight Systems might work, or you can use any normal non-NVIDIA CPU profiler. The nvprof output won’t really help us understand a large CPU usage. I don’t know a lot about issues relating to eGPUs or Vulkan-CUDA interop; it certainly could be related to one of those things, but I think the first step is to profile the CPU side of the app. If it’s hard to get a profiler to work, you can always add manual timing code in your render loop to time the launches, and also to time everything else except the launches. The first thing we need to know is whether the CPU usage is happening during launches (and which launches specifically) or outside of them.
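
The manual timing can be as simple as the sketch below (just a rough example, not a requirement; the function and variable names are placeholders for wherever your render loop lives):

```cpp
#include <chrono>
#include <cstdio>

using Clock = std::chrono::high_resolution_clock;

// Rough sketch: split one render-loop iteration into "launch + sync" vs. "everything else".
void renderLoopIteration()
{
    auto launchStart = Clock::now();
    // ... submit Vulkan work, optixDenoiserInvoke, cudaStreamSynchronize ...
    auto launchEnd = Clock::now();

    // ... present, handle input, everything else in the frame ...
    auto frameEnd = Clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    std::printf("launch+sync: %.2f ms, rest of frame: %.2f ms\n",
                ms(launchStart, launchEnd), ms(launchEnd, frameEnd));
}
```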

Just so you know, cudaStreamSynchronize is just waiting for your launch to complete, and it shows up in the profile in place of the (hidden) OptiX launch kernel, so you can safely assume that the long cudaStreamSynchronize calls in your profile roughly represent the GPU workload of your launches. The synchronize calls have almost no cost by themselves, which is why nothing changes if you try to avoid streams.
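
To illustrate what I mean (a rough sketch with placeholder variables; the exact optixDenoiserInvoke parameter list here follows the OptiX 7.0-era API and may differ in your SDK version):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <optix.h>

// The invoke only enqueues work; the CPU-visible time lands on the synchronize.
void invokeAndWait(OptixDenoiser denoiser, CUstream stream,
                   const OptixDenoiserParams* params,
                   CUdeviceptr state, size_t stateSize,
                   const OptixImage2D* inputs, unsigned int numInputs,
                   const OptixImage2D* output,
                   CUdeviceptr scratch, size_t scratchSize)
{
    // Returns almost immediately on the CPU; the denoise runs asynchronously on the GPU.
    optixDenoiserInvoke(denoiser, stream, params, state, stateSize,
                        inputs, numInputs, 0, 0, output, scratch, scratchSize);

    // Blocks until the GPU finishes, so a profiler attributes the whole
    // launch duration to this call rather than to a visible kernel.
    cudaStreamSynchronize(stream);
}
```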


David.