CUDA Bug: "CUDA error: unspecified launch failure"

Andrew_Schultz · February 18, 2011, 5:32pm

I have a CUDA kernel that works fine the first time it’s called, but fails to launch if it is called enough times. I’ve attached a testcase. I’m compiling with:

/opt/cuda32/cuda/bin/nvcc -arch sm_20 -o test32 test.cu

I’m running on a GTX 580 with:

LD_LIBRARY_PATH=/opt/cuda32/cuda/lib/ ./test32 -blockSize 512 -nThreads 8192

which will output something like:

512 x 16 = 8192
0
1
2
3
4
CUDA error: unspecified launch failure

It will often die before 10 and almost always dies before 100. The testcase will run fine if:

I compile with v3.1.9 of the compiler. v3.2.9 and 3.2.16 fail. Compiling with 3.1.9 and using the 3.2.16 runtime libraries also runs fine.
I run with 7168 threads instead of 8192.
I compile with -arch sm_13 (and run with blockSize=64 and nThreads=7680).
I change almost anything in the code. Most of the code is useless but apparently is necessary to trigger the bug.

If I run under cuda-gdb, I get

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching to CUDA Kernel 11 (<<<(9,0),(352,0,0)>>>)]
0x095e95c8 in doCalc<<<(16,1),(512,1,1)>>> ()

If I do “set cuda memcheck on”, I get

Program received signal CUDA_EXCEPTION_1, Lane Illegal Address.
[Switching to CUDA Kernel 0 (<<<(10,0),(91,0,0)>>>)]
0x0a06a938 in doCalc ()
(cuda-gdb) bt
#0 0x0a06a938 in doCalc ()
#1 0x0a06a938 in doCalc<<<(16,1),(512,1,1)>>> ()

Compiling with -G makes the problem go away.
test.cu (3.57 KB)
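
For readers without the attachment, here is a minimal sketch of the kind of repeated-launch loop described above. It is not the contents of test.cu: doCalcStub and firstFailingIteration are hypothetical names, and the kernel body is a placeholder. It checks for errors after every launch so that the failing iteration is reported. cudaDeviceSynchronize is the name used from CUDA 4.0 onwards; the 3.2 toolkit discussed in this thread used cudaThreadSynchronize for the same purpose.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the thread's doCalc kernel (the real body is in test.cu).
__global__ void doCalcStub(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Launches the kernel `iterations` times. Returns the index of the first
// iteration that fails, -1 if every launch succeeds, or -2 if setup fails.
int firstFailingIteration(int blockSize, int nThreads, int iterations)
{
    float *dOut = NULL;
    if (cudaMalloc((void **)&dOut, nThreads * sizeof(float)) != cudaSuccess)
        return -2;

    int numBlocks = (nThreads + blockSize - 1) / blockSize;
    printf("%d x %d = %d\n", blockSize, numBlocks, blockSize * numBlocks);

    int failed = -1;
    for (int iter = 0; iter < iterations; ++iter) {
        doCalcStub<<<numBlocks, blockSize>>>(dOut, nThreads);

        // Errors in the launch itself (e.g. a bad configuration) show up here.
        cudaError_t err = cudaGetLastError();
        // Errors raised while the kernel runs only show up after a synchronize.
        if (err == cudaSuccess) err = cudaDeviceSynchronize();

        if (err != cudaSuccess) {
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
            failed = iter;
            break;
        }
        printf("%d\n", iter);
    }

    cudaFree(dOut);
    return failed;
}
```

Calling firstFailingIteration(512, 8192, 100) from main mirrors the command line above. With this placeholder kernel every iteration is expected to succeed, so the function should return -1.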

eldadk · February 21, 2011, 3:05pm

Hey,

It seems to be working well on my GTX 480 - no crashes or exceptions of any kind.

I also ran it through the Parallel Nsight memory bounds checker, and it passed.

Maybe it is some kind of driver issue? Have you installed the latest GTX 580 driver from the NVIDIA website?

eldad

Andrew_Schultz · February 21, 2011, 3:22pm

Hmm, you ran with 8192 threads (or 7680, which is what would fully utilize your GTX480)? If you try to over-utilize your GPU (8740 or even 9216 threads), does it still work?

I was running the latest (260.19.36), but tried the latest beta driver (270.26) and I also hit the exception with that.

eldadk · February 21, 2011, 3:47pm

Hi,

It seems to work well even with nthreads = 18432.

The only change I made to the code (and I don’t believe it’s relevant, so I didn’t mention it earlier) is that I made nthreads and block_size defines instead of command line arguments (i.e. #define nthreads 18432).

The only other advice I have is to print out every index you use to access arrays inside your device code. Maybe you will notice an overflow.
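
A minimal sketch of that index-printing advice, assuming a kernel that uses the usual blockIdx.x * blockDim.x + threadIdx.x global index. The attached test.cu is not reproduced here, so checkedCopy, countOutOfRange and badCount are hypothetical names. Device-side printf needs -arch sm_20 or later.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: copies in[] to out[], reporting any thread whose
// index falls outside [0, n) instead of letting it touch memory.
__global__ void checkedCopy(const float *in, float *out, int n, int *badCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 0 || i >= n) {
        printf("block %d thread %d: index %d outside [0, %d)\n",
               (int)blockIdx.x, (int)threadIdx.x, i, n);
        atomicAdd(badCount, 1);
        return;
    }
    out[i] = in[i];
}

// Launches checkedCopy and returns how many threads computed an
// out-of-range index, or -1 if any CUDA call failed. Assumes n > 0.
int countOutOfRange(int blockSize, int numBlocks, int n)
{
    float *dIn = NULL, *dOut = NULL;
    int *dBad = NULL;
    int bad = -1;
    size_t bytes = n * sizeof(float);

    bool ok = cudaMalloc((void **)&dIn, bytes) == cudaSuccess
           && cudaMalloc((void **)&dOut, bytes) == cudaSuccess
           && cudaMalloc((void **)&dBad, sizeof(int)) == cudaSuccess
           && cudaMemset(dIn, 0, bytes) == cudaSuccess
           && cudaMemset(dBad, 0, sizeof(int)) == cudaSuccess;

    if (ok) {
        checkedCopy<<<numBlocks, blockSize>>>(dIn, dOut, n, dBad);
        // The synchronize also flushes the device-side printf buffer.
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(&bad, dBad, sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
        if (!ok) bad = -1;
    }

    cudaFree(dIn);   // cudaFree(NULL) is a harmless no-op
    cudaFree(dOut);
    cudaFree(dBad);
    return bad;
}
```

With 16 blocks of 512 threads and an 8192-element array, countOutOfRange(512, 16, 8192) returns 0. Shrinking the array to 8000 elements makes the last 192 threads report their indices, so it returns 192.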

eldad.

eldadk · February 21, 2011, 3:55pm

One more suggestion: copy your code and paste it into an SDK example with just one .cu file (e.g. the transpose SDK example). This may help to eliminate any errors in your project/build rule config.

Svart_Riddare · February 22, 2011, 7:28am

I have a very similar problem here. I have a CUDA kernel which seems to work perfectly but yields an unspecified launch failure when compiled with -arch=sm_20. I am running on Ubuntu x64 with a GTX 470. Other observations:

Compiled without options or with -arch=sm_13, the code runs as expected (and gives the same result as a reference C function).
Compiled with -arch=sm_20, it fails at runtime.
With nvcc 3.1 instead of nvcc 3.2 (from CUDA Toolkit 3.1 instead of 3.2), with -arch=sm_20, the code runs as expected.
If I add some debug code, like saving a local variable to global memory, it is very easy to make the problem go away.

I have compared the PTX generated with -arch=sm_13 and -arch=sm_20; there are almost no differences between the two:

A constant array is placed in global memory (for sm_20) instead of constant memory (for sm_13).
The builtin variables tid.x, ctaid.x, etc. are 32 bits (for sm_20) instead of 16 bits (for sm_13).
Some function parameters are reloaded into the same registers they were first loaded into, even though those registers have not been spilled (for sm_13).

I have not reduced the function to a test case that I can upload yet.

smokyboy · March 9, 2011, 5:58pm

Hi,

I have exactly the same problem on a GTX 480 using the 260.19.36 driver or the 270.30 beta driver with the 3.2 toolkit. Debug builds work fine. Using printfs in release code, it looks like some __syncthreads() calls are ignored, but this might only be a manifestation of some driver error.

Mange · March 11, 2011, 11:29am

I had a problem not too long ago where my code worked just fine when compiled for compute capability 1.3 but crashed with an unspecified launch failure when compiled for 2.0 (I have a GTX 480 card). I didn’t really find the cause, but when I updated the drivers and the CUDA SDK version the problem disappeared, so it might be related to that.

There could possibly be an out-of-bounds shared memory access, which would explain why it works with -G. To quote “avidday”, who has helped me many times, from another thread on this forum:

“Compiling for debugging spills everything to local memory, which can hide out of bounds shared memory errors. In compute 2.x devices, shared memory and the L1 cache share the same physical memory, which is why out of bounds shared memory access causes aborts not seen on older architectures. If it didn’t, the result could be global memory corruption.”
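
To illustrate the kind of bug the quote describes, here is a minimal sketch. It is not taken from anyone's code in this thread: adjacentDiff, sumAdjacentDiff and TILE are hypothetical names. Each thread reads its right-hand neighbour from a shared array, and without the guard the last thread of each block would read one element past the end of that array.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 256  // assumed block size and shared array length

// Each thread writes the difference between its right-hand neighbour and
// itself, reading both values from shared memory.
__global__ void adjacentDiff(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // reached by every thread in the block

    if (i >= n) return;
    // Guard: without it, thread TILE-1 would read tile[TILE], which is
    // past the end of the shared array.
    bool hasNeighbour = ((int)threadIdx.x + 1 < TILE) && (i + 1 < n);
    out[i] = hasNeighbour ? tile[threadIdx.x + 1] - tile[threadIdx.x] : 0.0f;
}

// Runs adjacentDiff on the ramp 0, 1, 2, ... and returns the sum of the
// output, or -1 if any CUDA call failed. Assumes n > 0.
float sumAdjacentDiff(int n)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *dIn = NULL, *dOut = NULL;
    float sum = -1.0f;
    size_t bytes = n * sizeof(float);

    bool ok = cudaMalloc((void **)&dIn, bytes) == cudaSuccess
           && cudaMalloc((void **)&dOut, bytes) == cudaSuccess
           && cudaMemcpy(dIn, &h[0], bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int numBlocks = (n + TILE - 1) / TILE;
        adjacentDiff<<<numBlocks, TILE>>>(dIn, dOut, n);
        ok = cudaDeviceSynchronize() == cudaSuccess
          && cudaMemcpy(&h[0], dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        if (ok) {
            sum = 0.0f;
            for (int i = 0; i < n; ++i) sum += h[i];
        }
    }

    cudaFree(dIn);
    cudaFree(dOut);
    return sum;
}
```

For n = 512 this launches two blocks; each produces 255 differences of 1 and one guarded 0, so sumAdjacentDiff(512) returns 510. Removing the threadIdx.x guard is the sort of change that may appear to work in a -G build or on older hardware and then abort on a compute 2.x device.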
