CUDA memory leak since Video Codec SDK 9.1 Windows drivers

After installing the 436.15 / 436.48 Windows drivers (both tested) on Maxwell workstations (GTX 970, GTX 950), “cuvidDestroyDecoder” started leaking. With previous driver releases, no such leak occurs. The same code does not leak on either Pascal- or Turing-based workstations.

The leak is approximately the size of one surface and comes out of the shared memory pool. It appears to be deterministic, occurring 100% of the time. The leaked memory is released on application termination.

If triggered repeatedly, bluescreens occur with a reproducible TDR watchdog violation during memory allocation. (Even though that appears to be perfectly normal behavior for this driver when exhausting the shared memory pool…)

NVDEC is instantiated with the following parameters:

CUVIDDECODECREATEINFO = {
	ulWidth = 3840
	ulHeight = 2160
	ulNumDecodeSurfaces = 8
	CodecType = cudaVideoCodec_H264 (4)
	ChromaFormat = cudaVideoChromaFormat_420 (1)
	ulCreationFlags = 4
	bitDepthMinus8 = 0
	ulIntraDecodeOnly = 0
	ulMaxWidth = 0
	ulMaxHeight = 0
	Reserved1 = 0
	display_area = {left=0 top=0 right=0 bottom=0}
	OutputFormat = cudaVideoSurfaceFormat_NV12 (0)
	DeinterlaceMode = cudaVideoDeinterlaceMode_Weave (0)
	ulTargetWidth = 3840
	ulTargetHeight = 2160
	ulNumOutputSurfaces = 1
	vidLock = 0x0000019663072ba0
	target_rect = {left=0 top=0 right=0 bottom=0}
	Reserved2 = {0, 0, 0, 0, 0}
}
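
For reference, the decoder is created and torn down with a plain cuvidCreateDecoder / cuvidDestroyDecoder pair. A minimal sketch of that lifecycle with the parameters above (not our actual code; it assumes an existing CUDA context and a CUvideoctxlock “lock” created via cuvidCtxLockCreate, error handling omitted):

	CUVIDDECODECREATEINFO info = {};
	info.ulWidth = 3840;
	info.ulHeight = 2160;
	info.ulNumDecodeSurfaces = 8;
	info.CodecType = cudaVideoCodec_H264;
	info.ChromaFormat = cudaVideoChromaFormat_420;
	info.ulCreationFlags = cudaVideoCreate_PreferCUVID; // == 4, as dumped above
	info.OutputFormat = cudaVideoSurfaceFormat_NV12;
	info.DeinterlaceMode = cudaVideoDeinterlaceMode_Weave;
	info.ulTargetWidth = 3840;
	info.ulTargetHeight = 2160;
	info.ulNumOutputSurfaces = 1;
	info.vidLock = lock; // CUvideoctxlock from cuvidCtxLockCreate (assumed)

	CUvideodecoder decoder = nullptr;
	cuvidCreateDecoder(&decoder, &info);
	// ... decode a few frames ...
	cuvidDestroyDecoder(decoder); // the leak is observed after this call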

Hi.
Can you provide exact details on how to reproduce this issue? Were you able to reproduce it using the sample application from the SDK?

Thanks.

Unclear what exactly is going on. We are using NVDEC with the old multithreaded pattern, where frames are throttled on parser input rather than by backpressure in pfnDisplayPicture, and the transfer to host is deferred outside of pfnDisplayPicture. Add to that heavy multithreading, multiple GPUs, and multiple concurrent 4K streams per GPU.
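
Roughly, the display callback looks like the sketch below (not our actual code; “FrameQueue” is a hypothetical thread-safe queue): it only queues the picture index, and cuvidMapVideoFrame plus the device-to-host copy happen later on a worker thread.

	// Display callback registered as pfnDisplayPicture in CUVIDPARSERPARAMS (nvcuvid.h types).
	static int CUDAAPI HandlePictureDisplay(void* userData, CUVIDPARSERDISPINFO* dispInfo)
	{
		auto* queue = static_cast<FrameQueue*>(userData); // hypothetical thread-safe queue
		queue->push(*dispInfo);  // no cuvidMapVideoFrame / copy here; a worker thread does that
		return 1;                // never blocks; throttling happens on the parser input side
	}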

Anyway, it doesn’t look as if the memory is leaking in any part belonging to NVDEC itself. Rather, allocations made with “cuMemHostAlloc(xxx, xxx, CU_MEMHOSTALLOC_DEVICEMAP)” started leaking despite being freed with “cuMemFreeHost(xxx)”, with proper synchronization of all in-flight streams.
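
The allocation pattern itself is unspectacular; roughly the following, with the size being one NV12 surface (sketch only, error handling omitted):

	void* host = nullptr;
	// pinned, device-mapped staging buffer for the device-to-host copy of one surface
	cuMemHostAlloc(&host, 3840 * 2160 * 3 / 2, CU_MEMHOSTALLOC_DEVICEMAP);
	// ... cuMemcpyDtoHAsync into "host", then synchronize the stream ...
	cuMemFreeHost(host); // freed here, yet the shared usage keeps growing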

I’m at a loss as to why this started happening now; it only started with that specific driver release.
I have already double-checked for possible leaks on our end, but found none. I have yet to completely verify that all of the post-processing CUDA kernels are properly safeguarded against overflows for unexpected “CUVIDPROCPARAMS” values.

A potential suspect: “cuMemHostAlloc(xxx, xxx, CU_MEMHOSTALLOC_DEVICEMAP)” / “cuMemFreeHost(xxx)” are being called concurrently from multiple threads on the same CUDA context, and also for multiple device contexts on different GPUs in parallel, while decoding and device-to-host copies are running at the same time. We did encounter at least one bluescreen due to a TDR watchdog violation in a low-level GDI allocation, so it’s quite possible that there is a regression in memory management.

PS: The issue doesn’t appear to be limited to Maxwell; it also occurs on Pascal and Turing GPUs, including Quadro cards. It’s not reliably reproducible on those systems though.

So, cuda-memcheck claims there is no leak, and no buffer overflow either.
Yet there is an obvious leak, and the shared memory is actually released when releasing the primary device context.

Switching “cuMemHostAlloc” / “cuMemFreeHost” for “malloc + cuMemHostRegister” / “cuMemHostUnregister + free” did not make any difference.
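
That is, roughly the following substitution at the same call sites (the register flag chosen to mirror the device-mapped behavior); the leak behaves exactly the same either way:

	const size_t size = 3840 * 2160 * 3 / 2; // same surface size as before
	void* host = malloc(size);
	cuMemHostRegister(host, size, CU_MEMHOSTREGISTER_DEVICEMAP);
	// ... same usage as before ...
	cuMemHostUnregister(host);
	free(host);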

“cuMemHostRegister + cuMemHostUnregister” in isolation, on an idle GPU, works reliably. It appears as if “cuMemHostUnregister” only starts failing (silently) when the GPUs are not idling.

How did you understand that there is a memory leak if cuda-memcheck did not show it?
Please give us a minimal working and leaking copy of your code… Really!

Minimal example: https://gist.github.com/Ext3h/b037506884826f5a50e96e6f82647576

It doesn’t even involve nvcodec. Just calling cuMemHostRegister / cuMemHostUnregister concurrently is enough to trigger the leak with that driver.
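
The shape of the repro, for anyone who doesn’t want to open the gist (this is a sketch along the same lines, not the gist itself; thread count and buffer size are arbitrary):

	#include <cuda.h>
	#include <cstdlib>
	#include <thread>
	#include <vector>

	int main()
	{
		cuInit(0);
		CUdevice dev = 0;
		cuDeviceGet(&dev, 0);
		CUcontext ctx = nullptr;
		cuDevicePrimaryCtxRetain(&ctx, dev);

		std::vector<std::thread> threads;
		for (int t = 0; t < 8; ++t)
		{
			threads.emplace_back([ctx] {
				cuCtxSetCurrent(ctx);
				const size_t size = 8u << 20; // 8 MiB per registration, arbitrary
				void* buf = std::malloc(size);
				for (int i = 0; i < 1000; ++i)
				{
					cuMemHostRegister(buf, size, CU_MEMHOSTREGISTER_DEVICEMAP);
					cuMemHostUnregister(buf); // "Shared Usage" keeps climbing regardless
				}
				std::free(buf);
			});
		}
		for (auto& th : threads)
			th.join();

		cuDevicePrimaryCtxRelease(dev); // only here is the shared memory given back
		return 0;
	}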

The leak can be tracked via the “\GPU Adapter Memory(hwid)\Shared Usage” winperf performance counter.
Likewise, “\GPU Process Memory(pid_hwid)\Shared Usage” correctly attributes the leaked shared memory to the test runner.
At 8 GB of combined usage over all NVIDIA GPUs in the system, the driver API starts bailing out with CUDA_ERROR_OUT_OF_MEMORY. Or it bluescreens right away, when unlucky.
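
For what it’s worth, the same counter can also be polled programmatically. A rough sketch via the PDH API (this assumes the counter reports bytes and simply sums all adapter instances, so filter by instance name if non-NVIDIA adapters are present; link against pdh.lib):

	#include <windows.h>
	#include <pdh.h>
	#include <cstdio>
	#include <vector>
	#pragma comment(lib, "pdh.lib")

	int main()
	{
		PDH_HQUERY query = nullptr;
		PDH_HCOUNTER counter = nullptr;
		PdhOpenQueryW(nullptr, 0, &query);
		// Wildcard over all adapter instances; instance names contain the adapter LUID.
		PdhAddEnglishCounterW(query, L"\\GPU Adapter Memory(*)\\Shared Usage", 0, &counter);

		for (;;)
		{
			PdhCollectQueryData(query);
			DWORD bytes = 0, count = 0;
			PdhGetFormattedCounterArrayW(counter, PDH_FMT_LARGE, &bytes, &count, nullptr);
			std::vector<unsigned char> buffer(bytes);
			auto* items = reinterpret_cast<PDH_FMT_COUNTERVALUE_ITEM_W*>(buffer.data());
			if (PdhGetFormattedCounterArrayW(counter, PDH_FMT_LARGE, &bytes, &count, items) == ERROR_SUCCESS)
			{
				long long total = 0;
				for (DWORD i = 0; i < count; ++i)
					total += items[i].FmtValue.largeValue;
				printf("Shared Usage over all adapters: %lld MiB\n", total >> 20);
			}
			Sleep(1000);
		}
	}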

On a multi-GPU system with Maxwell GPUs and these driver versions, the 8 GB limit is reached within a second.
A Pascal GPU, for reference, stays within the expected 512 MB (plus overhead) peak shared memory usage. (Even though there is a rare heap corruption happening somewhere.)

Is that still leaking? If so, we need to escalate this.

Yes, it’s still leaking as far as I know, at least unless I missed some hotfix last week.

The issue is “in progress” as ticket #2762823, and it turned out to be deterministic. A simple case of “shared memory is registered with all present GPUs, regardless of flags, but only unregistered for exactly one GPU, also regardless of flags”. Guaranteed to be missed by QA when they only test single-GPU systems.