Unspecified launch failure on all 4 kernel calls

I have a function that calls my kernel 4 times. For all my kernel calls I get the error “Unspecified Launch Failure”. Maybe someone can help me on this one. I took a look at the forum and searched for the error but couldn’t find anything useful about it.

I hope someone can help me with this error…

Also, when I run the program in device emulation mode I get a segmentation fault. If I write my output to a file in device emulation mode, nothing shows up in the file, but when I don’t write the output to a file I see the output on the screen. Can someone please explain what is happening here?

segmentation fault = you are writing to memory that you did not allocate. So you are writing past the end of an array. That is also consistent with an unspecified launch failure (I checked this morning :D )

I believe you can run your emulation code under valgrind to find out where it goes wrong.

If you are running Linux, valgrind is an amazing tool: run your device emulation binary through it and it can show you where the out-of-bounds memory write happens.

Edit: I must be tired this morning, I missed that DenisR had already mentioned valgrind.

Although I have never needed to use it myself, what I have heard about Valgrind warrants it being mentioned twice :D

Thank you both so much… I have Valgrind installed on my computer but completely forgot about it… But the first thing Denis mentioned got me thinking. I did indeed have an out-of-bounds memory problem… Now I have fixed it and the program is working again…

One step closer to completion :D:D:D

I’m glad you fixed your problem.

For the benefit of anyone else who reads this thread: accessing past the end of allocated memory doesn’t always cause an “unspecified launch failure” right away. There are times I’ve made thousands of kernel calls (all writing outside their memory) and then the next call causes an unspecified launch failure.
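Because the failure can show up long after the bad write, it can help to check for errors after every launch while debugging. This is only a minimal sketch, assuming a current CUDA runtime (cudaDeviceSynchronize; older toolkits used cudaThreadSynchronize). The names checkLaunch, fill and runFourLaunches are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the preceding kernel launch succeeded.
// Kernel launches are asynchronous, so we synchronize to make execution
// errors surface here instead of at some later, unrelated call.
static bool checkLaunch(const char *label)
{
    cudaError_t err = cudaGetLastError();   // launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // errors raised while the kernel ran
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical kernel used only to demonstrate the helper.
__global__ void fill(int *data, int n, int value)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < n)
        data[index] = value;
}

// Launches the kernel four times and checks after each launch.
// Returns the number of launches that completed without error.
// Assumes n > 0.
int runFourLaunches(int n)
{
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess)
        return 0;

    const int blockSize = 64;
    const int numBlocks = (n + blockSize - 1) / blockSize;
    int ok = 0;
    for (int i = 0; i < 4; ++i) {
        fill<<<numBlocks, blockSize>>>(d_data, n, i);
        if (checkLaunch("fill"))
            ++ok;
    }
    cudaFree(d_data);
    return ok;
}
```

Synchronizing after every launch slows things down, so this is something to switch on while hunting a bug rather than leave in production code. Even with it, a silent out-of-bounds write is not guaranteed to be reported at the launch that caused it.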

Oh, and I’ve also seen out of bounds memory accesses trigger the 5s launch timeout with the error message “the kernel launch has timed out and been terminated”.

Same thing for me. I also went out of bounds without any error, and just scaling up my problem size made the errors appear (as 5 s timeouts and unspecified launch failures).

I find it costs almost nothing to check the bounds inside a CUDA kernel, so I check boundary conditions whenever I can.

Maybe this is a very stupid question but how do you check this?

Thanks again, Jordy

I would guess something like this:

__global__ int func_with_check ( int *input, int inputsize, float * output, int outputsize)
{
  int index = ......;
  int in;

  if (index < inputsize)
  {
    in = input[index];
  }
  else
  {
    return index; // Or something else
  }

  if (index_out < outputsize)
  {
    output[index_out] = fsafdsfds;
  }
  else
  {
    return -index_out; // Or something else
  }

  return 0;
}

So when your return value is non-zero you can find out (through the return value) where you went out of bounds.

How can you return a value from a global function? ;)

Anyway, I find that I often need constructs like this in my code:

if (index < inputsize)
  in = input[index];
else
  in = 0;

if (index < inputsize)
  out[index] = computed_value;

Or, if there are no __syncthreads() calls in the kernel, one can just do if (index >= inputsize) return; at the top.

Usually, index = blockDim.x * blockIdx.x + threadIdx.x. To keep things simple, let’s say my inputSize is 65 and my block size is 64. Covering 65 elements takes 2 blocks, so the calculated index will go up to 127, and I must not access past the end of my array, hence the need for the check (just without the return value).
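To make the 65-element case concrete, here is a minimal sketch of a guarded kernel and the rounded-up block count. The names doubleElements, blocksFor and doubleOnDevice are hypothetical, and error checking on the CUDA calls is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: doubles each element. Threads in the tail block whose
// index falls outside the array skip the memory access entirely.
__global__ void doubleElements(const float *in, float *out, int n)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < n)
        out[index] = 2.0f * in[index];
}

// Number of blocks needed to cover n elements (integer division, rounded up).
int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Host wrapper: copies the input to the device, runs the kernel, and
// returns the result. Error checking omitted for brevity.
std::vector<float> doubleOnDevice(const std::vector<float> &host)
{
    const int n = static_cast<int>(host.size());
    std::vector<float> result(n, 0.0f);
    if (n == 0)
        return result;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int blockSize = 64;
    // For n = 65 this launches 2 blocks, i.e. 128 threads with indices 0..127.
    doubleElements<<<blocksFor(n, blockSize), blockSize>>>(d_in, d_out, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(result.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

With 65 input elements, threads 65 through 127 fail the index < n test and do nothing, which is exactly the tail-block case described above.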

I have a LOT of kernels like this, and sometimes forget the check: hence my extensive experience with kernels writing past the end of arrays :) Although, now I have my build system set up to compile the whole thing in emulation and then run all the unit tests through valgrind to catch these kinds of errors before they drive me insane.

However, while I add checks where they are needed because of the “tail” block, I do not add a check to every single memory write. That would be a little tedious, and sometimes bad writes are due to a bug in the index location. I’d rather have the system crash because of a bad write (that can be found with valgrind) than get absolutely no output written to an array and be left wondering why.

Aargghh, that is only possible from device functions of course… Just started to use device functions myself, had no need earlier.
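Since a __global__ function has to return void, one possible way to get the same information out is to have the kernel write to an error flag in device memory and read it back on the host. The following is only a sketch of that idea, not something posted in this thread: copyWithCheck, launchAndReport, count, capacity and errorFlag are all made-up names. It separates the number of elements the kernel is asked to process (count) from the size of the buffers (capacity), so ordinary tail-block threads are not reported as errors.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: copies input to output. An out-of-bounds index is
// reported by storing index + 1 in *errorFlag, so that 0 means "no error".
__global__ void copyWithCheck(const int *input, int *output,
                              int count, int capacity, int *errorFlag)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index >= count)
        return;                           // tail block: nothing to do
    if (index >= capacity) {
        atomicMax(errorFlag, index + 1);  // remember the largest offending index
        return;
    }
    output[index] = input[index];
}

// Asks the kernel to process count elements of buffers that hold capacity
// elements. Returns 0 if every access stayed in bounds, the largest bad
// index + 1 otherwise, or -1 if the flag could not be read back.
// Assumes count > 0 and capacity > 0.
int launchAndReport(int count, int capacity)
{
    int *d_in = nullptr, *d_out = nullptr, *d_flag = nullptr;
    cudaMalloc(&d_in, capacity * sizeof(int));
    cudaMalloc(&d_out, capacity * sizeof(int));
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemset(d_in, 0, capacity * sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));

    const int blockSize = 64;
    const int numBlocks = (count + blockSize - 1) / blockSize;
    copyWithCheck<<<numBlocks, blockSize>>>(d_in, d_out, count, capacity, d_flag);

    int flag = -1;
    // cudaMemcpy waits for the kernel to finish before copying the flag back.
    if (cudaMemcpy(&flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost) != cudaSuccess)
        flag = -1;

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_flag);
    return flag;
}
```

For example, launchAndReport(100, 65) asks for 100 elements from 65-element buffers, so the largest offending index is 99 and the function returns 100. This only catches the indices you explicitly test, so it complements running under valgrind rather than replacing it.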