Unspecified launch failure on GM107/GTX 750

I’m seeing a couple of ‘unspecified launch failures’, but only on GM107-based cards (a 750 and a 750 Ti). I’ve not had any reports of this error occurring on any other GPUs.

One of the failures occurs on a cuvidMapVideoFrame call, and the other is on a cuArray3DCreate call. I’m guessing this is some sort of memory allocation issue, but I’m not sure how to get to the bottom of it.

The odd thing about the cuArray3DCreate failure is that the previous two cuArray3DCreate calls succeeded. The only difference with these calls is using CU_EGL_COLOR_FORMAT_YVU420_SEMIPLANAR instead of CU_EGL_COLOR_FORMAT_YUV420_SEMIPLANAR to work around a driver bug.

These issues were produced on recent driver versions, 510 and 515.48.07, as well as the older 470 version, on Linux.

Does anyone have any idea what the issue is, or how to go about debugging it?

Does the code check the return status of all memory allocation functions and take appropriate action if the status indicates failure? Obviously, proceeding with program execution after a failed allocation could lead to “unspecified launch failure” later on, when an invalid pointer is passed to a kernel, which then blows up once it tries to dereference that bad pointer.

The fact that “unspecified launch failure” is reported on unrelated API calls would seem to indicate that not all kernel invocations are properly checked for error status. You would want to fix that, so ULF is reported at the time / location of failure and not further downstream.
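To illustrate the point about checking every call, here is a minimal sketch of a checking wrapper that logs each failing call and remembers only the earliest one. It is not taken from the poster's code. So that it builds without cuda.h, status_t stands in for CUresult (0 is CUDA_SUCCESS and 719 is CUDA_ERROR_LAUNCH_FAILED in the driver API), and fake_alloc, fake_sync and fake_free are hypothetical calls that only mimic how a launch failure keeps being returned by later, unrelated calls.

```c
#include <stdio.h>

/* Stand-in for CUresult so the sketch builds without cuda.h.
 * 0 is CUDA_SUCCESS and 719 is CUDA_ERROR_LAUNCH_FAILED in the driver API. */
typedef int status_t;
enum { STATUS_OK = 0, STATUS_LAUNCH_FAILED = 719 };

/* Call site of the first failure seen by CHECK(). */
typedef struct {
    status_t    code;
    const char *expr;
    const char *file;
    int         line;
} first_failure_t;

static first_failure_t g_first_failure = { STATUS_OK, NULL, NULL, 0 };

/* Log every failing call, but remember only the earliest one. */
static status_t check_status(status_t code, const char *expr,
                             const char *file, int line)
{
    if (code != STATUS_OK) {
        fprintf(stderr, "%s:%d: %s returned %d\n", file, line, expr, code);
        if (g_first_failure.code == STATUS_OK) {
            g_first_failure.code = code;
            g_first_failure.expr = expr;
            g_first_failure.file = file;
            g_first_failure.line = line;
        }
    }
    return code;
}

/* Wrap every API call so the failing expression and location are captured. */
#define CHECK(call) check_status((call), #call, __FILE__, __LINE__)

/* Hypothetical calls that mimic a sticky error: once the device has
 * faulted, every later call reports the same status. */
static int g_device_faulted = 0;

static status_t fake_alloc(void)
{
    return g_device_faulted ? STATUS_LAUNCH_FAILED : STATUS_OK;
}

static status_t fake_sync(void)
{
    g_device_faulted = 1;          /* the fault surfaces here first */
    return STATUS_LAUNCH_FAILED;
}

static status_t fake_free(void)
{
    return g_device_faulted ? STATUS_LAUNCH_FAILED : STATUS_OK;
}
```

If CHECK(fake_alloc()), CHECK(fake_sync()) and CHECK(fake_free()) are called in that order, both of the last two calls are logged as failing, but g_first_failure points at fake_sync(), which is the place to start looking. In real code the same macro would wrap the actual driver calls, with CUresult in place of status_t.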

Without further knowledge of the code, my recommendation would be to spend some quality time with the CUDA debugger to find out which kernel gives rise to the failure and where in the kernel the failure occurs.

I have checks on most CUDA functions, but I’ve just done a search and noticed I’ve missed a couple that might be the actual cause. I’ll add them in and see if the issue moves.

I don’t have any kernels of my own; I’m just using basic CUDA functions and NVDECODE, so I doubt it’s a bad kernel that’s the problem.

Thanks & Regards
elFarto

In that case you may simply want to log all arguments passed to these API functions, and then look at the arguments of the first call that triggers a ULF. The problem can originate not just with bad pointers being passed in, but also with wrong sizes, strides, etc. that cause an out-of-bounds memory access inside the API function. A typical scenario is an integer overflow in an intermediate computation causing one of the arguments to be wide of the mark.
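As a rough sketch of the kind of logging and overflow check meant here: the struct surface_args_t and all function names below are made up for illustration and do not correspond to any CUDA or NVDECODE type. The example assumes a size derived from pitch and height, and shows how computing it in 32 bits can wrap silently while the widened computation does not.

```c
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical argument set for a pitched 2D surface; names are illustrative. */
typedef struct {
    uint32_t width_bytes;
    uint32_t height;
    uint32_t pitch;
} surface_args_t;

/* Size computed in 32 bits: wraps silently modulo 2^32. */
static uint32_t surface_bytes_32(const surface_args_t *a)
{
    return a->pitch * a->height;
}

/* Same size with the operands widened first, so the product cannot wrap. */
static uint64_t surface_bytes_64(const surface_args_t *a)
{
    return (uint64_t)a->pitch * (uint64_t)a->height;
}

/* Returns 1 if the pitch covers a full row and the 32-bit size did not wrap. */
static int surface_args_ok(const surface_args_t *a)
{
    return a->pitch >= a->width_bytes
        && surface_bytes_64(a) == (uint64_t)surface_bytes_32(a);
}

/* Write one log line per API call so the first suspicious call stands out.
 * Returns the fprintf result (negative on output error). */
static int log_surface_args(FILE *out, const char *api, const surface_args_t *a)
{
    return fprintf(out,
                   "%s: width_bytes=%" PRIu32 " height=%" PRIu32
                   " pitch=%" PRIu32 " bytes32=%" PRIu32
                   " bytes64=%" PRIu64 " %s\n",
                   api, a->width_bytes, a->height, a->pitch,
                   surface_bytes_32(a), surface_bytes_64(a),
                   surface_args_ok(a) ? "ok" : "SUSPECT");
}
```

With pitch and height both set to 65536, the 32-bit product wraps to 0 while the 64-bit product is 4294967296, so the log line is marked SUSPECT. Calling log_surface_args just before each real API call gives a trace in which the arguments of the first failing call can be compared against the ones that succeeded.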

One of the users having the issue was able to run it under cuda-memcheck and got this:

========= CUDA-MEMCHECK
========= This tool is deprecated and will be removed in a future release of the CUDA toolkit
========= Please use the compute-sanitizer tool as a drop-in replacement
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuStreamSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so.1 [0x2e8aeb]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x30567]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x2d6b8]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 (cuvidGetDecoderCaps + 0xf4) [0x1bf54]
=========     Host Frame:/usr/lib/dri/nvidia_drv_video.so [0xe57f]
=========     Host Frame:/usr/lib/libva.so.2 (vaQuerySurfaceAttributes + 0x69) [0xa529]
=========     Host Frame:/usr/lib/libavutil.so.57 [0x35826]
=========     Host Frame:/usr/lib/libavutil.so.57 (av_hwdevice_get_hwframe_constraints + 0x68) [0x24298]
=========     Host Frame:mpv [0x1225f5]
=========     Host Frame:mpv [0x11d5ce]
=========     Host Frame:mpv [0x134be4]
=========     Host Frame:mpv [0x12e904]
=========     Host Frame:mpv [0x90ac8]
=========     Host Frame:mpv [0x1339ed]
=========     Host Frame:/usr/lib/libc.so.6 [0x8c54d]
=========     Host Frame:/usr/lib/libc.so.6 (clone + 0x44) [0x111874]
=========
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuMemFree_v2.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so.1 [0x29265f]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x1b65a]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x2ee89]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x17604]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x2e5c1]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x17553]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x2d6d9]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 (cuvidGetDecoderCaps + 0xf4) [0x1bf54]
=========     Host Frame:/usr/lib/dri/nvidia_drv_video.so [0xe57f]
=========     Host Frame:/usr/lib/libva.so.2 (vaQuerySurfaceAttributes + 0x69) [0xa529]
=========     Host Frame:/usr/lib/libavutil.so.57 [0x35826]
=========     Host Frame:/usr/lib/libavutil.so.57 (av_hwdevice_get_hwframe_constraints + 0x68) [0x24298]
=========     Host Frame:mpv [0x1225f5]
=========     Host Frame:mpv [0x11d5ce]
=========     Host Frame:mpv [0x134be4]
=========     Host Frame:mpv [0x12e904]
=========     Host Frame:mpv [0x90ac8]
=========     Host Frame:mpv [0x1339ed]
=========     Host Frame:/usr/lib/libc.so.6 [0x8c54d]
=========     Host Frame:/usr/lib/libc.so.6 (clone + 0x44) [0x111874]
=========
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuStreamDestroy_v2.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so.1 [0x2aafd2]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x2ca4b]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x2ca69]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x1755d]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 [0x2d6d9]
=========     Host Frame:/usr/lib/libnvcuvid.so.1 (cuvidGetDecoderCaps + 0xf4) [0x1bf54]
=========     Host Frame:/usr/lib/dri/nvidia_drv_video.so [0xe57f]
=========     Host Frame:/usr/lib/libva.so.2 (vaQuerySurfaceAttributes + 0x69) [0xa529]
=========     Host Frame:/usr/lib/libavutil.so.57 [0x35826]
=========     Host Frame:/usr/lib/libavutil.so.57 (av_hwdevice_get_hwframe_constraints + 0x68) [0x24298]
=========     Host Frame:mpv [0x1225f5]
=========     Host Frame:mpv [0x11d5ce]
=========     Host Frame:mpv [0x134be4]
=========     Host Frame:mpv [0x12e904]
=========     Host Frame:mpv [0x90ac8]
=========     Host Frame:mpv [0x1339ed]
=========     Host Frame:/usr/lib/libc.so.6 [0x8c54d]
=========     Host Frame:/usr/lib/libc.so.6 (clone + 0x44) [0x111874]
=========
========= ERROR SUMMARY: 3 errors