nvEncDestroyEncoder call hangs when nvencoder and Optix denoiser use the same CUDA context

Hi there,

I’m using Optix denoiser to denoise rendered images and then nvencoder to encode them. I create Optix denoiser and nvencoder so they use the same CUDA primary context. First, I create the Optix denoiser, denoise the very first image, then I create nvencoder and encode the image. I keep denoising and encoding images with created denoiser and encoder as long as I need.

It works fine till I’m done with denoising and encoding images and start deleting Optix denoiser and nvencoder.

Optix denoiser deletes fine, but the nvEncDestroyEncoder call “hangs”, utilizing 100% of CPU.

I got it fixed by calling cuCtxSynchronize() after the very first image is denoised by Optix denoiser and before creating the nvencoder.

I haven’t seen in the Optix denoiser documentation that cuCtxSynchronize() call is required at some point after Optix denoiser is created or is called.

I would like to know if calling cuCtxSynchronize() after the optixDenoiserInvoke() and before creating an instance of the nvencoder makes sense.

Is this the right thing (and place) to do or am I missing something?
Does anyone have an idea what is happening in this scenario?

Couple more observations:

  • When optixDenoiserInvoke is not called but the rest of the flow is the same (I just encode noisy images), the nvEncDestroyEncoder call does not hang.
  • If Optix denoiser and nvencoder use different contexts, the nvEncDestroyEncoder call does not hang.

Let me know if more info or source code is needed.

Hi @petr.mpp, welcome!

I have no idea what might be causing this interaction. I’m really glad you found a workaround already, and it makes sense that some kind of synchronizing would fix it. The call to optixDenoiserInvoke(), like many OptiX API calls, is asynchronous, and so creating the nvencoder at the same time that a denoiser kernel is running might indeed cause problems that could later trip things up.

The denoiser team is on holiday this week, but we will investigate after the New Year. It might be interesting to check a few alternatives - does synchronizing the stream or the device, instead of the context, also fix the issue? Does it make a difference if you create your nvencoder before you call optixDenoiserInvoke()?


Hi @dhart!
Thank you for the reply!

The call to optixDenoiserInvoke(), like many OptiX API calls, is asynchronous

That’s interesting. I was not aware that optixDenoiserInvoke is asynchronous.
I’m looking at
And do not see that description of the optixDenoiserInvoke API says that it’s asynchronous.
My understanding was that when optixDenoiserInvoke returns the denoising process is complete and image is denoised.
Am I missing something?
Programming guide for nvencoder explicitly says:
“Upon completion of the encoding process for an input picture, the client is required to call NvEncLockBitstream to get a CPU pointer to the encoded bit stream.”

Is there similar Optix denoiser API I need to call before using denoised image to make sure that denoising is complete?
In other words, how do I know when denoising is complete?

Yes, the problem does not reproduce when nvencoder is created before the optixDenoiserInvoke() call.

All OptiX API calls which take a CUstream argument are asynchronous, like most of the optixAccel* calls and optixLaunch.

If you want to wait on it to be complete, you need to add a CUDA stream synchronization call (cudaStreamSynchronize or cuStreamSynchronize) afterwards.

If you look into the OptiX SDK example optixDenoiser code, the denoiser.exec() function inside OptiXDenoiser.h does that with the CUDA_SYNC_CHECK() macro at the end.

Thank you for the response!

Couple things I would like to clarify.

  1. I access the denoised image data after optixDenoiserInvoke either from my CUDA kernel or after copying the denoised image from device to host memory with cuMemcpyDtoH. As far as I understand, since I denoise and access the denoised image memory within the same (default) CUDA stream, all operations (optixDenoiserInvoke, my kernel, and cuMemcpyDtoH) are implicitly synchronized and executed in sequence. If so, an explicit call to cuStreamSynchronize is not required. Is that right? My motivation: to get the best possible performance, I want to eliminate all unnecessary sync calls.
  2. The NVENCODER API does not take a CUDA stream, and as far as I remember the documentation does not say anything about CUDA streams. Does NVENCODER run in the default stream, or should I make no assumptions about which stream NVENCODER runs on?
  3. Assuming NVENCODER may run on any CUDA stream, does synchronizing Optix denoiser and NVENCODER creation with cuCtxSynchronize make sense? Calling cuCtxSynchronize only once before NVENCODER creation seems preferable to me to calling cuStreamSynchronize after each optixDenoiserInvoke call. Is there a reason why, in my case, I should go with cuStreamSynchronize after each optixDenoiserInvoke call instead of a single cuCtxSynchronize just before NVENCODER creation?

You’re right about point 1; tasks scheduled on the same stream are implicitly synchronized. Note that you should count both implicit syncs and explicit syncs when trying to minimize sync calls - the implicit syncs can have an impact on whether you should be putting work on another stream and/or reordering kernels to prevent overlap. Also note that avoiding sync calls themselves isn’t necessarily the goal; the goal is to avoid waiting when synchronizing. Ideally your sync calls find that the work is already complete and return immediately.

I have to admit I don’t know anything about the NVENCODER API, and I don’t want to speculate. I would recommend asking in their forum channel for clarification on stream usage. Glancing at the docs very briefly, it does look to me like the NVENCODER Programming Guide is suggesting that you should create a separate context for NVENCODER usage, so I wonder if they would consider having the OptiX denoiser share the same context as not the best practice? Either way, until you know how NVENCODER uses streams, I would think the only safe assumption when synchronizing the creation of the encoder and the OptiX denoiser is to use either the context-sync (if both are in the same context) or potentially even the device-sync call (if using multiple contexts). While you might have separate reasons to do so, synchronizing the stream after each optixDenoiserInvoke() call doesn’t seem appropriate if the goal is to fence your NVENCODER initialization. I would recommend wrapping the NVENCODER setup with context or device syncs for now, or otherwise simply ensuring that you’re not denoising during encoder setup, and vice versa, not encoding during denoiser setup.


Asked here: Sharing the same cuda context between renderer, Optix denoiser and NVENCODER