CUDA Bug: "CUDA error: unspecified launch failure"

I have a CUDA kernel that works fine the first time it’s called, but will fail to launch if called enough. I’ve attached a testcase. I’m compiling with

/opt/cuda32/cuda/bin/nvcc -arch sm_20 -o test32

I’m running on a GTX580,

LD_LIBRARY_PATH=/opt/cuda32/cuda/lib/ ./test32 -blockSize 512 -nThreads 8192

which will output something like:

512 x 16 = 8192
CUDA error: unspecified launch failure

It will often die within the first 10 kernel calls and almost always within the first 100. The testcase will run fine if:

  1. I compile with v3.1.9 of the compiler. v3.2.9 and 3.2.16 fail. Compiling with 3.1.9 and using the 3.2.16 runtime libraries also runs fine.
  2. I run with 7168 threads instead of 8192.
  3. I compile with -arch sm_13 (and run blockSize=64 and nThreads=7680).
  4. I change almost anything in the code. Most of the code is useless but apparently is necessary to trigger the bug.
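For reference, a hedged reconstruction of the host-side pattern the testcase presumably uses: the kernel name doCalc and the 16 × 512 launch configuration come from the cuda-gdb output below, but the loop bound and the kernel's (empty) argument list are assumptions.

```cuda
// Sketch of the failing launch pattern (CUDA 3.2-era host API).
// doCalc's real arguments and body are unknown; this is a placeholder.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doCalc(/* actual arguments unknown */) { /* ... */ }

int main() {
    const int blockSize = 512, nThreads = 8192;
    const int nBlocks = nThreads / blockSize;      // 16 blocks
    printf("%d x %d = %d\n", blockSize, nBlocks, nThreads);

    for (int i = 0; i < 100; ++i) {                // "almost always dies within 100"
        doCalc<<<nBlocks, blockSize>>>();
        // cudaThreadSynchronize is the pre-4.0 name for cudaDeviceSynchronize
        cudaError_t err = cudaThreadSynchronize();
        if (err == cudaSuccess) err = cudaGetLastError();
        if (err != cudaSuccess) {
            printf("CUDA error: %s\n", cudaGetErrorString(err));
            return 1;
        }
    }
    return 0;
}
```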

If I run under cuda-gdb, I get

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching to CUDA Kernel 11 (<<<(9,0),(352,0,0)>>>)]
0x095e95c8 in doCalc<<<(16,1),(512,1,1)>>> ()

If I do “set cuda memcheck on”, I get

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching to CUDA Kernel 0 (<<<(10,0),(91,0,0)>>>)]
0x0a06a938 in doCalc ()
(cuda-gdb) bt
#0 0x0a06a938 in doCalc ()
#1 0x0a06a938 in doCalc<<<(16,1),(512,1,1)>>> ()

Compiling with -G makes the problem go away.


It seems to be working well on my GTX 480 - no crashes or exceptions of any kind.

I also ran it through the Parallel Nsight memory bounds checker, and it passed.

Maybe it's some kind of driver issue? Have you installed the latest GTX 580 driver from the NVIDIA website?


Hmm, you ran with 8192 threads (or 7680, which is what would fully utilize your GTX480)? If you try to over-utilize your GPU (8740 or even 9216 threads), does it still work?

I was running the latest (260.19.36), but tried the latest beta driver (270.26) and I also hit the exception with that.


It seems to work well even with nthreads = 18432.

The only change I made to the code (and I don’t believe it’s relevant, so I didn’t mention it earlier) is that I made nthreads and block_size defines instead of command-line arguments (i.e. #define nthreads 18432).

The only other advice I have is to try printing out every index you use to access arrays inside your device code. Maybe you will notice an overflow.
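That index-printing advice might look like the following sketch. The array name, size, and indexing pattern are hypothetical; device-side printf is available on sm_2x devices such as the GTX 580.

```cuda
// Hypothetical bounds check inside the device code: print any index that
// falls outside the array before using it. data and n are placeholders.
#include <cstdio>

__global__ void doCalc(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // common indexing pattern
    if (idx < 0 || idx >= n)
        printf("out-of-range index %d (n = %d)\n", idx, n);
    else
        data[idx] *= 2.0f;                            // placeholder for real work
}
```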


One more suggestion: copy your code and paste it into an SDK example with just one .cu file (e.g. the transpose SDK example). This may help to eliminate any errors in your project/build rule config.

I have a very similar problem here. I have a CUDA kernel which seems to work perfectly but yields an unspecified launch failure when compiled with -arch=sm_20. I am running on Ubuntu x64, GTX 470. Other observations:

  1. Compiled without options or with -arch=sm_13, the code runs as expected (and gives the same result as a reference C function).

  2. Compiled with -arch=sm_20, it fails at runtime.

  3. With nvcc 3.1 instead of nvcc 3.2 (from CUDA Toolkit 3.1 instead of 3.2), with -arch=sm_20, the code runs as expected.

  4. If I add some debug code, like saving a local variable to global memory, it is very easy to make the problem go away.

I have compared the PTX generated with -arch=sm_13 and -arch=sm_20; there are almost no differences between the two:

  1. A constant array is placed in global memory (for sm_20) instead of constant memory (for sm_13).

  2. The built-in variables tid.x, ctaid.x, etc. are 32-bit (for sm_20) instead of 16-bit (for sm_13).

  3. Some function parameters are reloaded into the same registers they were first loaded into, even though the registers have not been spilled (for sm_13).

I have not reduced the function to a test case that I can upload yet.


I have exactly the same problems on a GTX 480 using the 260.19.36 or 270.30 beta driver with the 3.2 toolkit. Debug works fine. Using printf’s in release code, it looks like some __syncthreads() calls are ignored, but this might only be a manifestation of a driver error.

I had a problem not too long ago where my code worked just fine compiled with 1.3 but crashed with an unspecified launch failure when I compiled with 2.0 (I have a GTX 480 card). I didn’t really find the problem there, but when I updated the drivers and CUDA SDK version the problem disappeared, so it might be related to that.

There could possibly be some problem with out-of-bounds shared memory accesses, and that is why it works with -G. I quote “avidday”, who has helped me many times, from another thread on this forum:

“Compiling for debugging spills everything to local memory, which can hide out of bounds shared memory errors. In compute 2.x devices, shared memory and the l1 cache share the same physical memory, which is why out of bounds shared memory access causes aborts not seen on older architectures. If it didn’t the result could be global memory corruption.”
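The failure mode described in that quote can be illustrated with a hedged sketch (this is not the poster's code; the kernel, array size, and launch assumptions are invented for illustration):

```cuda
// Illustration of an out-of-bounds shared-memory write: with blockDim.x
// larger than 256, threads past index 255 write beyond buf. On sm_2x,
// where shared memory and L1 cache share the same physical memory, such an
// overrun can abort the kernel; compiling with -G spills variables to local
// memory, which can hide the error.
__global__ void oobShared(float *out) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    buf[tid] = (float)tid;       // out of bounds when blockDim.x > 256
    __syncthreads();
    out[tid] = buf[tid % 256];
}
```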