Hi, in my application that uses CUDA graphs, the call to cudaGraphExecKernelNodeSetParams to update a node parameter returned error code 719:
Check failed: check_cuda14 == cudaSuccess (719 vs. 0) ;-Cuda error enum val: 719; Cuda error name: cudaErrorLaunchFailure; Cuda error string: unspecified launch failure;
This happens when I set CUDA_LAUNCH_BLOCKING=1. The call succeeds without it. I understand that we should not call cudaDeviceSynchronize() during graph capture. Does this apply to CUDA_LAUNCH_BLOCKING too?
Setting CUDA_LAUNCH_BLOCKING is basically the same as inserting a cudaDeviceSynchronize() after every kernel launch. Outside of very specific and unlikely debugging scenarios, one should never set CUDA_LAUNCH_BLOCKING=1.
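To make that equivalence concrete, here is a minimal sketch of what a synchronize-after-launch looks like when written out by hand for a single launch. The kernel `scale` and the helper `launch_and_wait` are hypothetical names invented for this illustration; they do not come from the original application. As noted in the question, a device-wide synchronize like this belongs outside any stream capture region.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for illustration.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: launch one kernel, then block the host until it
// has finished. This is roughly the behavior CUDA_LAUNCH_BLOCKING=1
// imposes on every kernel launch in the process.
cudaError_t launch_and_wait(float *d_data, int n)
{
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // Errors detected at launch time (e.g. an invalid configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Blocks the host; errors raised while the kernel ran surface here.
    return cudaDeviceSynchronize();
}

int main()
{
    const int n = 1 << 20;
    float *d_data = nullptr;

    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaError_t err = launch_and_wait(d_data, n);
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Synchronizing after one suspect launch in this way localizes an error to that launch without forcing every launch in the process to block.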
When CUDA first came into existence, there was no debugger available, nor device-side printf(), so all debugging had to be done by the most primitive means imaginable, which included the use of CUDA_LAUNCH_BLOCKING. I seem to recall a debate within the CUDA team to remove it after debugger support became available (around 2008 or so), but I do not recall why that did not happen.
At this point the availability of CUDA_LAUNCH_BLOCKING probably causes more harm than good, and NVIDIA might want to think about deprecating it.
I am hunting down a bug in which there is an occasional cudaErrorLaunchFailure, probably coming from CUDA graphs in my application. The application is multi-threaded; each thread has its own TensorRT engine that is captured and relaunched using a CUDA graph. This error code is too generic to know for sure that it even has to do with CUDA graphs, but it seems to go away when we disable CUDA graphs.
I am wondering whether it can be caused by one thread doing some blocking operation while another thread is capturing or launching a CUDA graph. There is no explicit cudaDeviceSynchronize() anywhere in the code. Would other API calls such as cudaMalloc/cudaFree be considered blocking?
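For reference, a minimal sketch of the scenario being asked about. As far as I read the runtime API documentation for cudaStreamBeginCapture and cudaThreadExchangeStreamCaptureMode, how calls such as cudaMalloc in other threads interact with an ongoing capture depends on the capture mode: under the default cudaStreamCaptureModeGlobal, potentially unsafe calls are prohibited in any thread while a capture is in progress, whereas cudaStreamCaptureModeThreadLocal applies that restriction only to the capturing thread. The sketch assumes CUDA 11.4 or later (for cudaGraphInstantiateWithFlags); the kernel `add_one` and the helper `capture_and_launch` are hypothetical. It is not a claim that capture mode explains error 719 here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for illustration.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: capture one kernel launch into a graph using
// thread-local capture mode, then instantiate and launch the graph once.
cudaError_t capture_and_launch(cudaStream_t stream, int *d_data, int n)
{
    cudaGraph_t graph = nullptr;
    cudaGraphExec_t exec = nullptr;

    cudaError_t err =
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeThreadLocal);
    if (err != cudaSuccess) return err;

    // Recorded into the graph, not executed at this point.
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);

    err = cudaStreamEndCapture(stream, &graph);
    if (err != cudaSuccess) return err;

    err = cudaGraphInstantiateWithFlags(&exec, graph, 0);
    if (err == cudaSuccess) err = cudaGraphLaunch(exec, stream);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    if (exec) cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    return err;
}

int main()
{
    const int n = 1024;
    int *d_data = nullptr;
    cudaStream_t stream;

    // Allocate and initialize before any capture begins.
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return 1;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;
    cudaMemset(d_data, 0, n * sizeof(int));

    cudaError_t err = capture_and_launch(stream, d_data, n);
    printf("status: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Doing all allocation before the capture starts, as main() does here, sidesteps the question of whether an allocation call is permitted during capture.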
I have not used CUDA Graph and I am not familiar with your application, so I am afraid I won’t be able to assist.
An error that occurs only occasionally and unpredictably could certainly be caused by some sort of race condition. It could also have other root causes, such as the use of uninitialized data or out-of-bounds accesses. Have you checked for these potential sources of error?
Generally speaking, when trying to track down a hard-to-catch error, it is essential to reduce one’s code to the bare minimum that still reproduces the failure without an unduly long wait (say, no more than a few minutes).
Also generally speaking, application testing should start with carefully designed unit tests and integration testing, to catch errors at the earliest possible stage. Are you doing that? Root causing seemingly random errors in a largish application can be a nightmare taking weeks to resolve; been there, done that, got the T-shirt.