cuvidDecodePicture returning CUDA_ERROR_LAUNCH_FAILED

I have a 1080x1920 MPEG2 video that is less than 20 seconds in length (573 frames at 30fps).

I am getting CUDA_ERROR_LAUNCH_FAILED on some frames, but not all of them. The description of the error says it is fatal and cannot be recovered from within the “context”, which I take to mean the CUDA compute context.

Does anyone have insight into why this error would occur?


Some additional information about this.

The error code seems to indicate that there is an exception occurring somewhere in the cuvidDecodePicture processing on the GPU.

I have NVCUVID decoding and correctly presenting frames from an H.264 video, and that works fine. When I give the same program an MPEG2 video, I get mostly black frames and these exceptions. I am fairly sure the problem is in my code, but I have no idea what to look for.

I am not using the cuvidVideoSource for presenting blocks of data to the parser, so that may be a significant difference.

Nobody is replying… interesting

OK, I believe I have the answer to the CUDA_ERROR_LAUNCH_FAILED error. There would appear to be two pretty critical numbers that are fed to the decoder in the CUVIDDECODECREATEINFO structure:

ulNumDecodeSurfaces : this relates directly to the size of the PicIndex circular buffer. I am not sure what happens if it is too small, but I suspect nothing good. There appears to be an upper limit of 32.

ulNumOutputSurfaces : No idea what this controls, but when I changed it from 2 to 10 the errors stopped.

I am running what is considered to be the decoder post-processing (mapping the frame, color conversion to RGB, and unmapping the frame) in a separate thread. I suspect that if the queue for this thread grows beyond the number of decode surfaces, a PicIndex slot in the circular buffer gets reused while it is still waiting to be processed, and things fail. Evidently, the number of output surfaces plays a part as well.
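To make the suspicion about the circular buffer concrete, here is a minimal sketch of one way to guard against it. It assumes the failure comes from a picture index being handed back to the decoder while the post-processing thread still owns it. SurfaceTracker and its methods are hypothetical names for illustration and are not part of NVCUVID.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical helper (not an NVCUVID type): records which decode surfaces
// (PicIndex slots) are still queued for, or in, post-processing.
class SurfaceTracker {
public:
    explicit SurfaceTracker(std::size_t numDecodeSurfaces)
        : inUse_(numDecodeSurfaces, false), inFlight_(0) {
        assert(numDecodeSurfaces > 0);
    }

    // Call before decoding into picIndex. Returns false if the index is out
    // of range or the slot has not been released yet, in which case the
    // caller should wait instead of decoding over a pending frame.
    bool tryAcquire(int picIndex) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (picIndex < 0 || static_cast<std::size_t>(picIndex) >= inUse_.size())
            return false;
        if (inUse_[picIndex])
            return false;
        inUse_[picIndex] = true;
        ++inFlight_;
        return true;
    }

    // Call from the post-processing thread once the frame has been unmapped.
    void release(int picIndex) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (picIndex < 0 || static_cast<std::size_t>(picIndex) >= inUse_.size())
            return;
        if (inUse_[picIndex]) {
            inUse_[picIndex] = false;
            --inFlight_;
        }
    }

    // Number of surfaces currently held by post-processing.
    std::size_t inFlight() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return inFlight_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<bool> inUse_;
    std::size_t inFlight_;
};
```

The idea would be to size the tracker with the same value as ulNumDecodeSurfaces, call tryAcquire from the decode callback, and call release after the frame is unmapped. Whether this addresses the actual cause of the launch failure is an assumption on my part.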

What impact does the number of output surfaces have on memory? I have a 2GB GTX 670 card, but I would like my code to run on smaller cards with significantly less memory.

There is a statement in the cudaDecodeD3D9 sample that says they are limiting the decode memory to 24MB by limiting the frame pixels to 16,777,216 (at 1.5 bytes per pixel, that comes to 24MB). OK, sounds reasonable - except with most cards having at least 256MB available, isn’t 24MB a rather aggressive limit?
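For reference, a rough sketch of that arithmetic. It assumes the decode surfaces are 4:2:0 (NV12) at 1.5 bytes per pixel and ignores any pitch or height padding the driver may add, so real allocations are likely somewhat larger. The function names are made up for illustration and are not taken from the sample.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for one 4:2:0 (NV12) surface: a full-resolution luma plane plus a
// half-size interleaved chroma plane, i.e. 1.5 bytes per pixel.
// Assumption: no pitch or alignment padding.
std::uint64_t nv12SurfaceBytes(std::uint32_t width, std::uint32_t height) {
    return static_cast<std::uint64_t>(width) * height * 3 / 2;
}

// Reduce the requested surface count until the total pixel count across all
// surfaces fits the budget. The default budget of 16,777,216 pixels
// corresponds to 24MB at 1.5 bytes per pixel. Never goes below one surface.
std::uint32_t clampDecodeSurfaces(std::uint32_t requested,
                                  std::uint32_t width,
                                  std::uint32_t height,
                                  std::uint64_t maxPixels = 16777216ULL) {
    assert(width > 0 && height > 0);
    const std::uint64_t framePixels =
        static_cast<std::uint64_t>(width) * height;
    std::uint32_t count = requested;
    while (count > 1 && count * framePixels > maxPixels) {
        --count;
    }
    return count;
}
```

With these assumptions, one 1920x1080 surface is 3,110,400 bytes, and clampDecodeSurfaces(20, 1920, 1080) comes out at 8 surfaces, so a 24MB cap leaves very little headroom at HD resolutions.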

And why is there no comment about the number of output surfaces and what effect that has on memory?

I would suggest filing requests for documentation enhancements via the bug reporting form linked from the registered developer website.