cudaMalloc() return "cudaErrorLaunchFailure"

my kernal code as below:

__global__ static void GRAY2RGB(char *data, char *pSurfacePtr)


	const int tid = threadIdx.x;

	char *pcSrc = data + (320 * tid);

	char *pcDst = pSurfacePtr + (320 * tid * 4);	

	int value;

	for(int w=0;w<320;++w)


		value = *pcSrc++;		//   <---- the problem line *************

		*pcDst++ = value;		// B

		*pcDst++ = value;		// G

		*pcDst++ = value;		// R

		*pcDst++ = 0;



and call function as below:

__host__ void Process(char *data, char *pSurfacePtr)


	cudaError_t err = cudaSuccess;	 

	char *dataD = NULL, *resultD = NULL;	

	err = cudaMalloc((void**)&dataD,  320*240);

	if(err)	throw err;

	err = cudaMalloc((void**)&resultD,  320*240*4);

	if(err)	throw err;


	err = cudaMemcpy(dataD, data, 320*240,  cudaMemcpyHostToDevice);		// <---- return "cudaErrorLaunchFailure" here

	if(err)	throw err;

	GRAY2RGB<<<1, 240, 0>>>(dataD, resultD);


	err = cudaMemcpy(pSurfacePtr, resultD, 320*240*4, cudaMemcpyDeviceToHost);

	if(err)	throw err;




it’s just a sample program.

my program will call Process() to convert every picture.

and cudaMemcpy() will return “cudaErrorLaunchFailure” at 2nd or 3rd time i called the function.

but, if i change the line “value = *pcSrc++;” into “value = {any constant integer};” in kernal,

then no errors will be returned. it’s very weird … :wacko:

any suggestions?

[Environment] WinXP 32bit + CUDA 2.0 + VS 8.0


Most likely that pointer inputs to process() are NOT being sufficiently allocated…

thank you for your reply.

i remove both the cudaMemcpy() lines and run,

and it causes pc reboot automatically …

btw, i put this codes in DLL.

more and more confused. :blink:

Your kernel is segfaulting, hence cudaMemcpy is returning the error code from the kernel launch.

how could it be happened?

after i moved two cudaMemcpy() lines, all operations are on device memory.

and i did allocate them enough memory spaces.

are there any special coding rules in kernal function?

I am not sure what that “const” is doing there… Try removing that “const” and see if that works good.

Also, try changing the pointer variables to be “volatile” and see if that helps. (if thats the one then one needs to look into PTX to see whats happening)

Also, I dont see the point behind declaring your kernel function as “Static”.

That is so the function is not visible outside the current file so you have no collissions when you have a second function with the same name.

And using static never caused any issues for me.

Obviously there should be an error check right after the ThreadSynchronize to catch any errors in the kernel right then and not later.

The only other things I notice are no bugs but only severe performance issues, namely 240 threads/block is a lot and also not divisible by 32, all your memory accesses are uncoalesced, and your overall number of threads (one block of 240 threads = 240 threads overall) is at least a factor 10 too small.

i re-checked my code,

and found some problems at the input pointer “char *pSurfacePtr” to process().

just like Sarnath’s reply #2.

sorry for the bother, and thank all of you guys’ help. :rolleyes:

But a kernel function can be linked outside?? I doubt so. Kernel functions can be made visible only via “includes”. Isn’t it?,

Good to know you found the bug and thnks for acknowledging my replies.


There are performance issues as Reimar pointed out – most important one is “lack of coalescing”… you probably are running 20x slower…

No, includes do not make anything visible they just provide the function prototypes so the compiler knows what and how many its arguments are.

A true C compiler does not require any includes to use a function, nvcc does because it is actually a (very incomplete) C++ compiler.

A function is always visible to the outside unless you use static (or one of gcc’s visibility attributes).

Same applies to global variables, too.

And in both cases, if you have two with the same name your program will fail to link because there are two different things with the same name (uninitialized

global variables are an exception of that).

In many ways, “static” is to C what “private” is to C++.

GPU code is compiled and stored as a “data object” in the OBJECT file. Thats my understanding. So GPU functions will NOT be and canNOT be visible to a linker.

How can linker possibly understand a GPU function? Also, note that there are NO function calls in CUDA. Its all inlined…

Of course they are in some way visible to the linker, how else could they be linked together into the final binary? How else would the CUDA runtime find them again?

The actual GPU code of course is not in the .text section but in the the .rodata section as a string. I am not sure if you can get a name collission on those, they do not

seem to have a name in the symbol table.

But this case was about a global function, and the host part of that is of corse an normal function, which is added to the symbol table and otherwise behaves just like any other function

(though it has a device_stub prefix).

Why should that matter, a linker does not understand x86 or any other functions either, they don’t even know what a function is.

They have sections, symbols and relocations and that’s basically it.

You have asked the question and you have answered it as well!

The fact that GPU code and data are in the form of strings and NOT finding a place in symbol table is sufficient proof that they are NOT exported outside. So, whats the fun in marking them static?

Relocations can be architecture specific. Do you expect your x86 linker to work for every type of object file in the world??

Yes, I said that it does not matter for the GPU code itself, which is interesting but irrelevant.

You seem to not have read the part about the host stub though.

I think we both are on same page. In my original post, I was talking about “Static” for the GPU code. I can understand “static” for host code makes sense. Dats y, I was asking to remove that for the GPU code.

But a global function is both (and there was no device function in the example code)! It has a host and a device part, the device part ends up as a string and the host part is a normal x86 function with a device_stub prefix (which means it will not collide with other pure host functions, but with other global functions).

global functions are executed only in the device. And callable only by the host!

But you mean to say that kernel-calling stubs are separately created and find a space in the symbol table. This is news to me. WHen I work on Linux, I can check that out. Thanks for the info.

Ah, that is where the misunderstanding was. It never occured to me that you might think it could be done any different.

They probably could inline it, but at least in case of Linux they do create real x86 functions, as running nm on the object files/binary will show.

I actually thought it weird that the strings for the actual GPU code were not named, since that means you can’t (easily) access them when using the low-level CUDA device API.

Thanks for this info… This may open up new possibilities in my project.

However, Lack of explicit documentation on this and the fact that NVIDIA SDK “#includes” the kernels make me think twice on this topic.

Anyway, Thanks for the input!