OptiX Denoiser high CPU usage

I’m running the OptiX denoiser alongside a Vulkan project. I’m using the color, albedo, normal, and intensity buffers as inputs.

Without denoising, my project has about 1% CPU usage. When invoking the denoiser with optixDenoiserInvoke for each frame, I get about 15% CPU usage.

Why is that? Is there a way to reduce CPU usage?

Thanks

Hi @xilefmai,

I didn’t follow the details of the Vulkan-OptiX integration very carefully – so stupid question: what are the high-level steps involved in connecting Vulkan to OptiX? Without knowing anything, my wild speculation would be that the CPU usage is caused by buffer copies between the CPU and GPU. Is that possible, given what you had to do?

I’d recommend profiling the app to see what the CPU is doing. If it’s cudaMemcpy traffic, for example, you might be able to get some visibility with Nsight Systems; if it’s something unexpected on the app side, a regular CPU profiler might show it.

Side note, I believe the normal buffer is still being ignored by the denoiser. You might try not including the normal buffer in your pipeline and see if the CPU usage is affected.


David.

Hi, thanks for your reply,

I’m using the VK_NV_ray_tracing extension with Vulkan. I export the final ray-traced image buffer with VkExportMemoryWin32HandleInfoKHR and import it on the CUDA side with cudaImportExternalMemory. There are similar methods for sharing semaphores.
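
The CUDA-side import looks roughly like the sketch below. This is simplified, the names (importVulkanBuffer, win32Handle, allocationSize) are just placeholders, and it assumes the backing VkDeviceMemory was allocated with VkExportMemoryAllocateInfo and its HANDLE retrieved via vkGetMemoryWin32HandleKHR:

```cpp
#include <cuda_runtime.h>
#include <windows.h>

// Rough sketch: import an exported VkDeviceMemory allocation into CUDA
// and map it to a device pointer. No copy is involved.
void* importVulkanBuffer(HANDLE win32Handle, size_t allocationSize)
{
    cudaExternalMemoryHandleDesc memDesc = {};
    memDesc.type                = cudaExternalMemoryHandleTypeOpaqueWin32;
    memDesc.handle.win32.handle = win32Handle;
    memDesc.size                = allocationSize;

    cudaExternalMemory_t extMem = nullptr;
    cudaImportExternalMemory(&extMem, &memDesc);

    cudaExternalMemoryBufferDesc bufDesc = {};
    bufDesc.offset = 0;
    bufDesc.size   = allocationSize;

    void* devPtr = nullptr;
    cudaExternalMemoryGetMappedBuffer(&devPtr, extMem, &bufDesc);
    return devPtr;
}
```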

My synchronization is (a rough code sketch follows the list):

  • CUDA waits for a Vulkan semaphore, indicating when the ray-traced image is ready
  • Denoiser gets invoked
  • Vulkan waits for CUDA until the image is denoised (and ready to be copied to the swapchain images)
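
In code, the per-frame ordering is roughly this (a simplified sketch; the function and semaphore names are placeholders, and the semaphores are assumed to have been imported with cudaImportExternalSemaphore):

```cpp
#include <cuda_runtime.h>

// Rough per-frame sketch of the synchronization above (placeholder names).
void denoiseFrame(cudaExternalSemaphore_t renderDoneSem,   // signaled by Vulkan
                  cudaExternalSemaphore_t denoiseDoneSem,  // waited on by Vulkan
                  cudaStream_t stream)
{
    // 1. CUDA waits until Vulkan signals that the ray-traced image is ready.
    cudaExternalSemaphoreWaitParams waitParams = {};
    cudaWaitExternalSemaphoresAsync(&renderDoneSem, &waitParams, 1, stream);

    // 2. optixDenoiserInvoke(...) is launched here on the same stream.

    // 3. Signal Vulkan so it can copy the denoised image to the swapchain.
    cudaExternalSemaphoreSignalParams signalParams = {};
    cudaSignalExternalSemaphoresAsync(&denoiseDoneSem, &signalParams, 1, stream);
}
```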

I use VkBuffers for all denoiser inputs, shared directly between Vulkan and OptiX, device-local and without any copying. I found previous posts mentioning that sharing VkImages with OptiX is tricky (it actually involves copying), so I tried sharing plain VkBuffers instead, and it works great so far (apart from the CPU usage).
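
Each shared buffer is wrapped as a denoiser input roughly like this (a simplified sketch rather than my exact code; the FLOAT4 format and the strides are assumptions about the pixel layout):

```cpp
#include <cuda.h>
#include <optix.h>

// Wrap a device pointer obtained from a shared VkBuffer (via
// cudaExternalMemoryGetMappedBuffer) as a denoiser input layer; no copy involved.
OptixImage2D makeDenoiserLayer(CUdeviceptr sharedBuffer, unsigned int width, unsigned int height)
{
    OptixImage2D img = {};
    img.data               = sharedBuffer;                   // points into the VkBuffer memory
    img.width              = width;
    img.height             = height;
    img.pixelStrideInBytes = sizeof(float) * 4;              // assuming RGBA32F pixels
    img.rowStrideInBytes   = img.pixelStrideInBytes * width;
    img.format             = OPTIX_PIXEL_FORMAT_FLOAT4;
    return img;
}
```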

Regarding the normal buffer: if I enable it and put gibberish into it, then the denoised image is affected by that.

What are your image size, frame rate, and ray tracing workload? Are you able to CPU-profile the application?

I’ve asked a Vulkan expert who just tried a similar setup, and he sees high CPU usage when the ray tracing workload is small, the image size is small, and the frame rate is high. That would be expected, since the application is looping quickly. When he traces more rays at a larger image size, the CPU usage goes down, because the application has more time between launches. Would this explain your situation at all?


David.

I’ve tried different combinations now: I increased the SPP from 4 to 16, 24, and 32, and resized the window to small (256x256), medium (512x512), and large (1280x1280); nothing really had an impact on the CPU usage. My frame rate is about 35-40 FPS with the denoiser and 60+ FPS without it at 1280x720.

I use an external EVGA RTX 2070 in a Razer Core X connected over Thunderbolt 3, if that helps.

I profiled the project with nvprof; you can find the results here: https://gist.github.com/maierfelix/dbee9abe2fa77520228ad2a7b596904c. It seems that a call to cudaStreamSynchronize has a very large impact, and I don’t know the reason for this.

Edit: I’ve removed all cudaStream-related stuff from my denoiser setup; it didn’t have any impact. The source of the denoiser can be found here: https://github.com/maierfelix/nvk-optix-denoiser/blob/master/src/index.cpp#L244
Also, I’ve made sure that CUDA and Vulkan use the same GPU device.

Have you tried a CPU profile of your application? Nsight Systems might work, or you can use any normal non-NVIDIA CPU profiler. The nvprof output won’t really help us understand a large CPU usage. I don’t know a lot about issues relating to eGPUs or Vulkan-CUDA interop; it certainly could be related to one of those things, but I think the first step is to profile the CPU side of the app. If it’s hard to get a profiler to work, you can always add manual timing code in your render loop to time the launches, and also to time everything else except the launches. The first thing we need to know is whether the CPU usage is happening during launches (and which launches specifically) or outside of them.
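
The manual timing can be as simple as the sketch below (just a rough example, not a requirement; the function and variable names are placeholders for wherever your render loop lives):

```cpp
#include <chrono>
#include <cstdio>

using Clock = std::chrono::high_resolution_clock;

// Rough sketch: split one render-loop iteration into "launch + sync" vs. "everything else".
void renderLoopIteration()
{
    auto launchStart = Clock::now();
    // ... submit Vulkan work, optixDenoiserInvoke, cudaStreamSynchronize ...
    auto launchEnd = Clock::now();

    // ... present, handle input, everything else in the frame ...
    auto frameEnd = Clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    std::printf("launch+sync: %.2f ms, rest of frame: %.2f ms\n",
                ms(launchStart, launchEnd), ms(launchEnd, frameEnd));
}
```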

Just so you know, cudaStreamSynchronize is just waiting for your launch to complete, and it shows up in the profile in place of the (hidden) OptiX launch kernel, so you can safely assume that the long cudaStreamSynchronize calls in your profile roughly represent the GPU workload of your launches. The synchronize calls have almost no cost by themselves, which is why nothing changes if you try to avoid streams.
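
To illustrate what I mean (a rough sketch with placeholder variables; the exact optixDenoiserInvoke parameter list here follows the OptiX 7.0-era API and may differ in your SDK version):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <optix.h>

// The invoke only enqueues work; the CPU-visible time lands on the synchronize.
void invokeAndWait(OptixDenoiser denoiser, CUstream stream,
                   const OptixDenoiserParams* params,
                   CUdeviceptr state, size_t stateSize,
                   const OptixImage2D* inputs, unsigned int numInputs,
                   const OptixImage2D* output,
                   CUdeviceptr scratch, size_t scratchSize)
{
    // Returns almost immediately on the CPU; the denoise runs asynchronously on the GPU.
    optixDenoiserInvoke(denoiser, stream, params, state, stateSize,
                        inputs, numInputs, 0, 0, output, scratch, scratchSize);

    // Blocks until the GPU finishes, so a profiler attributes the whole
    // launch duration to this call rather than to a visible kernel.
    cudaStreamSynchronize(stream);
}
```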


David.