"unspecified launch failure" but "No CUDA-MEMCHECK"


I have a distributed CUDA program. The program has multiple threads, each thread doing kernel launches and cudaMemcpyAsync() in separate streams. The program usually runs fine, but sometimes I see “unspecified launch failure” errors. Since the error does not occur every time, I guess it’s caused by some concurrency issues.

I googled my problem, and almost all such “unspecified launch failure” reports turn out to be caused by out-of-bounds array accesses, and people suggest using cuda-memcheck. I tried running my program with cuda-memcheck, and the problem happens less often (probably because the program runs much slower under cuda-memcheck). I finally got a run in which the problem occurred. However, cuda-memcheck says “No CUDA-MEMCHECK results found”. I’m completely confused. Does “No CUDA-MEMCHECK results found” mean there are no memory problems in my program? If so, why do I get “unspecified launch failure”?

Also, I saw someone saying that “unspecified launch failure” could only be caused by kernel launches, and that cudaMemcpyAsync() will never cause “unspecified launch failure”. Is it true?

Thank you so much!

My output looks something like this:

h5: ========= CUDA-MEMCHECK
h5: ========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaStreamSynchronize.
h5: ========= Saved host backtrace up to driver entry point at error
h5: ========= Host Frame:/usr/lib/libcuda.so.1 [0x2ef613]
h5: ========= Host Frame:/usr/local/cuda/lib64/libcudart.so.6.5 (cudaStreamSynchronize + 0x15e) [0x3773e]
h5: ========= Host Frame:libmyproj.so (_ZN19MyClassl25send_updatesEjij + 0x225) [0x3f355]
h5: ========= Host Frame:/usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0 [0xba4a]
h5: ========= Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x8182]
h5: ========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0xfb38d]
h5: =========
h5: ========= Error: process didn’t terminate successfully
h5: ========= Internal error (20)
h5: ========= No CUDA-MEMCHECK results found

  1. I would suggest updating to CUDA 7.5. Newer CUDA versions have fixed bugs and incorporated various enhancements.

  2. Your cuda-memcheck output seems to indicate an internal error. This is possibly something that would be fixed in a newer release.

  3. You should also make sure that your application process terminates normally, or cuda-memcheck will not be able to produce any results. This means that your process/app should reach a normal conclusion with a 0 return code.

  4. The unspecified launch failure normally means that one of your kernels hit some sort of critical error while it was executing. You might be able to get more focus on the problem by rigorously checking each kernel invocation for errors. Rigorously here includes the idea of synchronization immediately after the kernel launch. That may affect the behavior of your application, or the reproducibility of the error.

  5. You may also want to read the cuda-memcheck documentation, including section D, Known Issues.
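
As an illustration of point 4, here is a minimal sketch of checking a kernel launch rigorously. The `CUDA_CHECK` macro, the `scale` kernel and `launch_scale_checked` are hypothetical names invented for this example and are not taken from the original program. The synchronize after the launch is meant for debugging only, since it serializes the host with the stream and may change the timing of the failure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failing CUDA runtime call with file and line.
// It returns the error to the caller instead of calling exit(), so that a
// multi-threaded host program can still shut down in an orderly way.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,  \
                         __LINE__, #call, cudaGetErrorString(err_));  \
            return err_;                                              \
        }                                                             \
    } while (0)

// Placeholder kernel, standing in for one of the application's kernels.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launch the kernel in `stream` and check both kinds of failure.
// Assumes n > 0 so that the grid size is valid.
cudaError_t launch_scale_checked(float *d_data, int n, cudaStream_t stream)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, stream>>>(d_data, n, 2.0f);

    // Catches launch-time problems such as an invalid configuration.
    CUDA_CHECK(cudaGetLastError());
    // Debug only: wait for the kernel so that an execution-time failure is
    // reported here rather than at some later, unrelated API call.
    CUDA_CHECK(cudaStreamSynchronize(stream));
    return cudaSuccess;
}
```

With this pattern, the first call that reports the error is the synchronize right after the offending kernel, which narrows down which launch is at fault.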

Hi @txtbob,

Thank you for your timely reply.

That’s interesting, because I usually assert that the return value of each CUDA call is cudaSuccess, so my code exits with an error whenever a CUDA call fails. That seems like a very weird design for cuda-memcheck. So should I do something like “if (cudaStreamSynchronize(cudaStream) != cudaSuccess) {exit(0);}” in order to let the messages show up?

Is cudaMemcpyAsync() considered a kernel launch? While trying to track down the problem I removed all kernel launches, leaving only the cudaMemcpyAsync() calls, but I still see “unspecified launch failure”. Does that mean it is caused by cudaMemcpyAsync(), or must there be some other kernel launch that I didn’t find?


Regarding the return code of your application, I believe I am in error there. Please disregard that. Your application does need to terminate normally, however (it cannot terminate with a seg fault or other OS-triggered event).

If the cudaMemcpyAsync has a type of cudaMemcpyDeviceToDevice then it’s possible that an underlying kernel gets launched by the CUDA driver in order to complete the operation.

I updated my code to exit(0) on CUDA failures and did another run, and I got a little bit more output:

h7: =========
h7: ========= Error: process didn’t terminate successfully
h7: ========= The application may have hit an error when dereferencing Unified Memory from the host. Please rerun the application under cuda-gdb or Nsight Eclipse Edition to catch host side errors.
h7: ========= Internal error (20)
h7: ========= No CUDA-MEMCHECK results found

Still “No CUDA-MEMCHECK results found”. Shall I run it with cuda-gdb?


I just noticed that my program still doesn’t terminate “normally”, because I see:

h7: ========= Program hit cudaErrorCudartUnloading (error 29) due to “driver shutting down” on CUDA API call to cudaEventDestroy.

I guess the other threads are still trying to make CUDA calls after one of the threads calls “exit(0)”. So just calling “exit(0)” won’t work. Can someone tell me what the correct way of using cuda-memcheck is in my situation?


Don’t call exit until all threads have stopped doing CUDA calls. You might want to investigate the best method to do a graceful shutdown that is appropriate for whatever threading model you are using.

Note that presumably your application already has a shutdown model (and if this shutdown model allows threads to make CUDA calls while the application is exiting, that is a problem.)

If your shutdown model is working, it may not be necessary to use exit() at the point of detecting a cuda runtime error. This was primarily a suggestion for localization but it’s admittedly more complex in a multi-threaded scenario. Instead, you could just let your application shut down normally. After a CUDA runtime error is detected (referring to sticky errors, such as unspecified launch failure), then all subsequent CUDA runtime API calls should also return that error.

This shouldn’t interfere with proper cuda-memcheck usage, although it may tend to generate a lot of output from cuda-memcheck. Save the cuda-memcheck output to a file, and begin inspecting the output at the top of the file. When I use cuda-memcheck, I find it helpful to compile my programs with nvcc’s -lineinfo option, so that its reports can point to source lines.
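
The shutdown advice above can be sketched roughly as follows. This is only an illustration under assumptions: it uses `std::thread` rather than whatever threading library the application really uses, and `worker`, `run_workers` and `g_stop` are hypothetical names. The idea is that a thread that sees a CUDA error sets a shared flag and returns, the other threads notice the flag and stop issuing CUDA calls, and the process ends only after every thread has been joined.

```cuda
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Shared flag, set by whichever thread first sees a CUDA error.
static std::atomic<bool> g_stop{false};

// Hypothetical worker: stands in for a thread that issues async copies
// in its own stream.
void worker(int id, int iterations)
{
    const std::size_t bytes = 1 << 20;
    cudaStream_t stream = nullptr;
    void *d_buf = nullptr;
    void *h_buf = nullptr;

    cudaError_t err = cudaStreamCreate(&stream);
    if (err == cudaSuccess) err = cudaMalloc(&d_buf, bytes);
    if (err == cudaSuccess) err = cudaMallocHost(&h_buf, bytes);

    // Keep working until done, until this thread hits an error, or until
    // another thread has asked everyone to stop.
    for (int i = 0; err == cudaSuccess && i < iterations && !g_stop.load(); ++i) {
        err = cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        if (err == cudaSuccess) err = cudaStreamSynchronize(stream);
    }

    if (err != cudaSuccess) {
        std::fprintf(stderr, "thread %d: %s\n", id, cudaGetErrorString(err));
        g_stop.store(true);  // ask the other threads to wind down; no exit() here
    }

    // Best-effort cleanup. After a sticky error these calls fail as well,
    // so their return values are deliberately ignored.
    if (h_buf) cudaFreeHost(h_buf);
    if (d_buf) cudaFree(d_buf);
    if (stream) cudaStreamDestroy(stream);
}

// Start the workers, wait for all of them, and report whether any failed.
int run_workers(int num_threads, int iterations)
{
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker, t, iterations);
    for (auto &th : threads)
        th.join();  // no thread is still making CUDA calls after this loop
    return g_stop.load() ? 1 : 0;
}
```

A `main` would simply return the value of `run_workers(...)`, so the process ends through the normal path rather than through exit() from inside a worker thread.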

I think my problem was solved. It turns out that it was caused by my incorrect use of cudaMemset(). cudaMemset() is actually asynchronous with respect to the host, meaning that it returns control to the host code before the memset kernel finishes. Since I use this same piece of memory in some other streams, there could be write conflicts.

The problem was fixed by changing cudaMemset() to cudaMemsetAsync(…,stream) followed by a cudaStreamSynchronize(stream).
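
A minimal sketch of that change, assuming a buffer `d_buf` of `bytes` bytes that is later used from other streams. The function name `clear_buffer_in_stream` is hypothetical and only wraps the two calls named above:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: clear a device buffer in an explicit stream and wait
// until the fill has finished before returning to the caller.
cudaError_t clear_buffer_in_stream(void *d_buf, std::size_t bytes, cudaStream_t stream)
{
    // Previously: cudaMemset(d_buf, 0, bytes);
    // As described above, that call can return to the host before the fill
    // has completed on the device.
    cudaError_t err = cudaMemsetAsync(d_buf, 0, bytes, stream);
    if (err != cudaSuccess) return err;

    // Block the host until the memset is done, so that work submitted to
    // other streams afterwards cannot overlap with it.
    return cudaStreamSynchronize(stream);
}
```

The host-side synchronize is the simplest way to order the memset before later work. An event recorded in `stream` and waited on with cudaStreamWaitEvent() in the other streams would be an alternative that does not block the host.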

Please check my other thread for more details:

Thank you!