Multiple decoder instances cause a CUDA API error.

Board: GTX 1060 6 GiB
Driver: 398.36
OS: Windows 10 Pro x64, April 2018 Update

Repeatedly creating and releasing multiple HEVC decoder instances can cause subsequent CUDA API calls to terminate abnormally.
There was no problem with driver 391.35 and earlier.
Please tell me how to handle multiple decoder instances properly with the latest driver.

Hi Masa-Tam,

Could you provide us a sample code reproducing this issue?

You can send the sample code to this email address:

Ryan Park

Hi Masa-Tam,

We received the sample code. Thanks for sending it to us.

It seems that, for some reason, you are running out of video memory.

Could you provide us some additional info:

  1. Would it be possible to explain your use case? We would be interested to know why you need to create and destroy multiple decoder instances.
  2. Do you mean that exactly the same application, which creates and destroys multiple decoder instances, did not fail on 391.35 but fails on 398.36? Please confirm this, as it may be a valuable data point to aid our investigation.
  3. Please ensure that the decoder instances being created are released when they are no longer needed.
  4. Please refer to section 4.8 of “..\Video_Codec_SDK_8.2.15\doc\NVDEC_VideoDecoder_API_ProgGuide” in the SDK package. We would like to draw your attention to the following tips listed there, which can be used to reduce the memory footprint when dealing with multiple-instance decoding.

While decoding multiple streams it is recommended to create a single CUDA context and share it across multiple decode sessions. This saves the memory overhead associated with the CUDA context creation.

In the use cases where there is frequent change of decode resolution and/or post processing parameters, it is recommended to use cuvidReconfigureDecoder() instead of destroying the existing decoder instance and recreating a new one.

Ryan Park

Hi Ryan san,

  1. Since my application is video editing software, it needs to obtain video from multiple decoder instances decoding multiple HEVC streams.
  2. My application generates errors with nothing changed but the driver: 391.35 works normally, 398.36 fails.
  3. Each decoder instance is managed by an implementation of an IUnknown-derived interface, with its lifetime managed by CComPtr. Release() is called at the end of decoding, and I confirmed in the debugger that the destructor runs and the instance is actually released.
  4. Before feeding the decoder, the elementary stream is parsed so that input is submitted in access-unit (AU) units, and the decoded video data is transferred to system memory as soon as possible. Is anything more necessary than that?

In my implementation, I create a thread dedicated to running the CUDA API / NVDEC calls. A CUDA context is bound to this thread, and other threads request execution on the dedicated thread through lambda expressions.

Sincerely yours,

Just wanted to say I am experiencing problems very similar to this, unique to HEVC.

In stress testing, the same application can stably decode 100+ (low-resolution) H.264 videos simultaneously, and can dynamically create and destroy videos for many days at a time (and presumably longer). However, when I point it at HEVC videos, the same app will generally crash after destroying 1-2 videos, though it can happily play many HEVC streams simultaneously (and create them dynamically), also for days at a time.

I’ve spent a good bit of time attempting architectural changes to work around this problem, presuming it was my bug, but I’m increasingly suspicious that it’s not. My architecture is similar to how Masa-Tam describes theirs, minus the COM. Specifically, there is a thread per video for FFmpeg demuxing and mapping/unmapping frames. All other interactions with CUDA and CUVID are funneled through a single thread and CUDA context, and all of those interactions are protected with a CUvideoctxlock.

Incidentally, I get the same behavior even if I disable all frame mapping and CUDA interactions: just allocating and deallocating the HEVC videos triggers the crash. Most commonly the crash occurs in a call to cuEventQuery(), resulting in CUDA_ERROR_UNKNOWN. I’d be grateful for any insights, as I’m basically out of things to try now.


This problem was fixed in driver 399.07.

Many thanks,

I can confirm the same - 399.07 or later appears to fix this problem for me too.

Is this fix only for Windows?
For Ubuntu, the latest driver is 396.54, not 399.07!