cuvidCreateDecoder returns error CUDA_ERROR_OUT_OF_MEMORY

#include <cstdio>

#include <cuda.h>
#include <nvcuvid.h>

int TestDecoder() {
  CUVIDDECODECREATEINFO decparam = {};
  decparam.CodecType            = cudaVideoCodec_HEVC;
  decparam.ulWidth              = 1920;
  decparam.ulHeight             = 1080;
  decparam.ulNumDecodeSurfaces  = 16;
  decparam.ulTargetWidth        = 1920;
  decparam.ulTargetHeight       = 1080;
  decparam.ulNumOutputSurfaces  = 1;
  decparam.ChromaFormat         = cudaVideoChromaFormat_420;    // cudaVideoChromaFormat_XXX (only 4:2:0 is currently supported)
  decparam.ulCreationFlags      = cudaVideoCreate_PreferCUVID;  // Decoder creation flags (cudaVideoCreateFlags_XXX)
  decparam.display_area.left    = 0;
  decparam.display_area.top     = 0;
  decparam.display_area.right   = 1920;
  decparam.display_area.bottom  = 1080;
  decparam.target_rect.left     = 0;
  decparam.target_rect.top      = 0;
  decparam.target_rect.right    = 1920;
  decparam.target_rect.bottom   = 1080;
  decparam.bitDepthMinus8       = 0;
  decparam.OutputFormat         = cudaVideoSurfaceFormat_NV12;
  decparam.DeinterlaceMode      = cudaVideoDeinterlaceMode_Weave;
  decparam.vidLock              = nullptr;

  CUresult cures;
  CUvideodecoder hDecoder;

  if(CUDA_SUCCESS != (cures = cuvidCreateDecoder(&hDecoder, &decparam)))
    return fprintf(stderr, "cuvidCreateDecoder error %d\n", cures), cures;

  if(hDecoder && CUDA_SUCCESS != (cures = cuvidDestroyDecoder(hDecoder)))
    return fprintf(stderr, "cuvidDestroyDecoder error %d\n", cures), cures;

  return 0;
}

int main()
{
  if(auto cures = cuInit(0))
    return fprintf(stderr, "cuInit error %d", cures), cures;

  CUcontext ctx;
  if(auto cures = cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, 0))
    return fprintf(stderr, "cuCtxCreate error %d", cures), cures;

  // Create and destroy a decoder indefinitely, printing a spinner each iteration.
  for(unsigned i = 0; ; i++)
  {
    if(auto res = TestDecoder())
      return res;

    fprintf(stderr, "\r%c", "\\-/|"[i & 3]);
  }

  return 0;
}

(1) call cuvidCreateDecoder
(2) call cuvidDestroyDecoder
If you repeat steps (1) and (2) in a loop, you can watch the process “Virtual Size” grow (see Process Explorer). After some time, cuvidCreateDecoder returns the error CUDA_ERROR_OUT_OF_MEMORY.

OS: Windows 10 x64 (22H2 Build 19045.2486)
GPU: NVIDIA RTX 3070 (driver version: 528.24)

Is this a bug in the NVIDIA driver?


Hi there @Vitaly_Shemet and welcome to the NVIDIA developer forums!

This is not really a realistic or typical use case, so I wouldn’t exclude the possibility that it is a simple case of de-allocation being done asynchronously and thus causing what looks like an unintended memory leak.

Also statically allocating the descriptor structs inside the functions can cause trouble in how cuVid handles the decoder reference.

Of course it could be an oversight in CUDA memory handling, but I would rather check first whether this happens in a normal use case.

This example is just a minimal demonstration. In a real application, video files (H.264/H.265/others) are opened, decoded and closed, and after some time a CUDA_ERROR_OUT_OF_MEMORY error occurs. I kept the code as simple as possible to show the problem.

We are hitting a similar issue.

We are able to run and decode multiple sequential clips for hours.
Then cuvidCreateDecoder starts returning CUDA_ERROR_OUT_OF_MEMORY.

We’ve tried both a single context for the scope of the app, as well as individual contexts per clip.

We have verified that cuvidDestroyDecoder is returning a valid result.

Once the initial CUDA_ERROR_OUT_OF_MEMORY is returned, even destroying and recreating the context, with a cudaDeviceReset() in between, is no help. Every subsequent call to cuvidCreateDecoder will fail.
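Roughly, the recovery sequence we attempted looks like the sketch below (simplified, error checking trimmed; the helper name TryRecover and the fixed parameter struct are illustrative, not our actual code):

// Sketch of the recovery attempt after the first failure. In the real app
// the decoder parameters come from the parsed stream, not a fixed struct.
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <nvcuvid.h>

bool TryRecover(CUcontext &ctx, const CUVIDDECODECREATEINFO &params)
{
  // Tear everything down...
  cuCtxDestroy(ctx);
  cudaDeviceReset();                           // runtime-API device reset

  // ...and bring it back up.
  cuCtxCreate(&ctx, CU_CTX_SCHED_BLOCKING_SYNC, 0);

  CUVIDDECODECREATEINFO createInfo = params;   // cuvidCreateDecoder wants a non-const pointer
  CUvideodecoder decoder = nullptr;
  CUresult res = cuvidCreateDecoder(&decoder, &createInfo);

  // After the first CUDA_ERROR_OUT_OF_MEMORY, this call keeps failing
  // even though the context was recreated and the device was reset.
  if (res != CUDA_SUCCESS)
    return false;

  cuvidDestroyDecoder(decoder);
  return true;
}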

OS: Windows 11 Home - 22000.2176
GPU: GeForce RTX 3070 Laptop - 536.67


Welcome back @matthew.collins2 and thanks for the additional verification.

We are already tracking this internally as a potential bug. If and when there is a resolution, I will update it here.

Thanks!


Thanks for the update. Hopefully there will be a fix; for the moment we’ve found a workaround.

The repeated creating and destroying of decoders is leaking something; however, the API also provides the ability to reconfigure an existing decoder (cuvidReconfigureDecoder).

Keeping a pool of previously used decoders that are compatible with a reconfigure, then reusing them lets us run for an unlimited amount of time.

The reconfigure has some rules about the codec, bit_depth_luma_minus8, chroma_format, and the maximum width and height that must be respected; as long as those are observed it seems to work well.
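For anyone hitting the same issue, here is a minimal sketch of that decoder-reuse approach, simplified from our code. The pool structure and the names PooledDecoder, AcquireDecoder and g_pool are illustrative, as is the 3840x2160 upper bound; the API calls are cuvidCreateDecoder and cuvidReconfigureDecoder from nvcuvid.h. The key point is that ulMaxWidth/ulMaxHeight must be set at creation time to the largest resolution the decoder may later be reconfigured to:

#include <vector>
#include <nvcuvid.h>

struct PooledDecoder {
  CUvideodecoder        handle;
  cudaVideoCodec        codec;
  cudaVideoChromaFormat chroma;
  unsigned              bitDepthMinus8;
  unsigned              maxWidth, maxHeight;   // ulMaxWidth/ulMaxHeight used at creation
};

static std::vector<PooledDecoder> g_pool;

// Reuse a compatible pooled decoder via cuvidReconfigureDecoder, or create a new one.
CUvideodecoder AcquireDecoder(const CUVIDEOFORMAT &fmt)
{
  for (auto &d : g_pool) {
    bool compatible = d.codec == fmt.codec &&
                      d.chroma == fmt.chroma_format &&
                      d.bitDepthMinus8 == fmt.bit_depth_luma_minus8 &&
                      fmt.coded_width  <= d.maxWidth &&
                      fmt.coded_height <= d.maxHeight;
    if (!compatible)
      continue;

    CUVIDRECONFIGUREDECODERINFO re = {};
    re.ulWidth             = fmt.coded_width;
    re.ulHeight            = fmt.coded_height;
    re.ulTargetWidth       = fmt.coded_width;
    re.ulTargetHeight      = fmt.coded_height;
    re.ulNumDecodeSurfaces = fmt.min_num_decode_surfaces;
    re.display_area.right  = (short)fmt.coded_width;
    re.display_area.bottom = (short)fmt.coded_height;
    re.target_rect.right   = (short)fmt.coded_width;
    re.target_rect.bottom  = (short)fmt.coded_height;

    if (cuvidReconfigureDecoder(d.handle, &re) == CUDA_SUCCESS)
      return d.handle;               // reused: no create/destroy cycle, no leak
  }

  // No compatible decoder in the pool: create one with ulMaxWidth/ulMaxHeight
  // sized for the largest clip we expect, so it stays reconfigurable later.
  CUVIDDECODECREATEINFO ci = {};
  ci.CodecType           = fmt.codec;
  ci.ChromaFormat        = fmt.chroma_format;
  ci.bitDepthMinus8      = fmt.bit_depth_luma_minus8;
  ci.OutputFormat        = cudaVideoSurfaceFormat_NV12;
  ci.DeinterlaceMode     = cudaVideoDeinterlaceMode_Weave;
  ci.ulWidth             = fmt.coded_width;
  ci.ulHeight            = fmt.coded_height;
  ci.ulTargetWidth       = fmt.coded_width;
  ci.ulTargetHeight      = fmt.coded_height;
  ci.ulNumDecodeSurfaces = fmt.min_num_decode_surfaces;
  ci.ulNumOutputSurfaces = 1;
  ci.ulMaxWidth          = 3840;     // illustrative upper bound for later reconfigures
  ci.ulMaxHeight         = 2160;
  ci.ulCreationFlags     = cudaVideoCreate_PreferCUVID;

  CUvideodecoder dec = nullptr;
  if (cuvidCreateDecoder(&dec, &ci) != CUDA_SUCCESS)
    return nullptr;

  g_pool.push_back({dec, fmt.codec, fmt.chroma_format,
                    fmt.bit_depth_luma_minus8, 3840, 2160});
  return dec;
}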