cudaMalloc() returns "cudaErrorLaunchFailure"

my kernel code is as below:

__global__ static void GRAY2RGB(char *data, char *pSurfacePtr)
{
	const int tid = threadIdx.x;
	char *pcSrc = data + (320 * tid);
	char *pcDst = pSurfacePtr + (320 * tid * 4);
	int value;
	for(int w = 0; w < 320; ++w)
	{
		value = *pcSrc++;	//   <---- the problem line *************
		*pcDst++ = value;	// B
		*pcDst++ = value;	// G
		*pcDst++ = value;	// R
		*pcDst++ = 0;
	}
}

and the calling function is as below:

__host__ void Process(char *data, char *pSurfacePtr)
{
	cudaError_t err = cudaSuccess;
	char *dataD = NULL, *resultD = NULL;

	err = cudaMalloc((void**)&dataD, 320*240);
	if(err)	throw err;
	err = cudaMalloc((void**)&resultD, 320*240*4);
	if(err)	throw err;

	err = cudaMemcpy(dataD, data, 320*240, cudaMemcpyHostToDevice);		// <---- returns "cudaErrorLaunchFailure" here
	if(err)	throw err;

	GRAY2RGB<<<1, 240, 0>>>(dataD, resultD);
	cudaThreadSynchronize();

	err = cudaMemcpy(pSurfacePtr, resultD, 320*240*4, cudaMemcpyDeviceToHost);
	if(err)	throw err;

	cudaFree(dataD);
	cudaFree(resultD);
}

It’s just a sample program. My program calls Process() to convert every picture, and cudaMemcpy() returns “cudaErrorLaunchFailure” the 2nd or 3rd time I call the function.

But if I change the line “value = *pcSrc++;” to “value = {any constant integer};” in the kernel, then no errors are returned. It’s very weird … :wacko:

Any suggestions?

[Environment] WinXP 32bit + CUDA 2.0 + VS 8.0

Thanks.

Most likely the pointer inputs to Process() are NOT being sufficiently allocated…
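For illustration, a minimal sketch of what Process() expects on the host side, assuming a 320x240 8-bit grayscale input (the malloc-based setup and names here are hypothetical; if pSurfacePtr actually points into a locked surface, that surface’s allocation and pitch are what matter):

	// hypothetical host-side setup for one 320x240 grayscale frame
	char *data        = (char*)malloc(320 * 240);      // 1 byte per pixel in
	char *pSurfacePtr = (char*)malloc(320 * 240 * 4);  // 4 bytes (BGRA) per pixel out
	Process(data, pSurfacePtr);
	free(pSurfacePtr);
	free(data);

If either buffer is smaller than that, the two cudaMemcpy() calls will read or write past its end.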

Thank you for your reply.

I removed both cudaMemcpy() lines and ran it, and it caused the PC to reboot automatically …

By the way, I put this code in a DLL.

More and more confused. :blink:

Your kernel is segfaulting, hence cudaMemcpy is returning the error code from the kernel launch.

How could that happen?

After I removed the two cudaMemcpy() lines, all operations were on device memory, and I did allocate enough memory space for them.

Are there any special coding rules for kernel functions?

I am not sure what that “const” is doing there… Try removing the “const” and see if that works.

Also, try changing the pointer variables to “volatile” and see if that helps. (If that’s the one, then one needs to look into the PTX to see what’s happening.)

Also, I don’t see the point in declaring your kernel function as “static”.

That is so the function is not visible outside the current file, so you have no collisions when a second function has the same name.

And using static never caused any issues for me.

Obviously there should be an error check right after the cudaThreadSynchronize() call to catch any errors in the kernel right then and not later.
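For example, a sketch against the CUDA 2.0-era runtime API:

	GRAY2RGB<<<1, 240, 0>>>(dataD, resultD);
	err = cudaGetLastError();        // did the launch itself fail?
	if(err)	throw err;
	err = cudaThreadSynchronize();   // wait for the kernel; reports crashes inside it
	if(err)	throw err;

That way a failure is attributed to the kernel instead of to whichever cudaMemcpy() happens to run next.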

The only other things I notice are not bugs but severe performance issues: 240 threads/block is a lot and also not divisible by 32, all your memory accesses are uncoalesced, and your overall number of threads (one block of 240 threads = 240 threads total) is at least a factor of 10 too small.
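To illustrate the last two points, here is an untested sketch of the same conversion with one thread per pixel and enough blocks to cover the whole 320x240 image. Writing each output pixel as a single 32-bit word lets the stores coalesce on compute 1.x hardware; the byte-wide loads are still uncoalesced, and fixing those would take wider loads or shared-memory staging, which is beyond a sketch:

	__global__ static void GRAY2RGB(unsigned char *data, unsigned int *pSurfacePtr)
	{
		const int i = blockIdx.x * blockDim.x + threadIdx.x;
		if(i < 320 * 240)
		{
			unsigned int v = data[i];
			// pack B, G, R and a zero alpha byte into one 32-bit store
			pSurfacePtr[i] = v | (v << 8) | (v << 16);
		}
	}

	// 320*240 = 76800 pixels / 256 threads per block = 300 blocks
	GRAY2RGB<<<300, 256>>>(dataD, (unsigned int*)resultD);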

I re-checked my code and found some problems with the input pointer “char *pSurfacePtr” to Process(), just like Sarnath’s reply #2.

Sorry for the bother, and thanks for all of your help. :rolleyes:

But can a kernel function be linked from outside? I doubt it. Kernel functions can be made visible only via “includes”, can’t they?

Anders.an,

Good to know you found the bug, and thanks for acknowledging my replies.

By the way, there are performance issues, as Reimar pointed out – the most important one is the lack of coalescing… you are probably running 20x slower than you could be…

No, includes do not make anything visible; they just provide the function prototypes so the compiler knows what the arguments are and how many there are.

A true C compiler does not require any includes to use a function; nvcc does because it is actually a (very incomplete) C++ compiler.

A function is always visible to the outside unless you use static (or one of gcc’s visibility attributes). The same applies to global variables, too.

And in both cases, if you have two with the same name, your program will fail to link because there are two different things with the same name (uninitialized global variables are an exception to that).

In many ways, “static” is to C what “private” is to C++.
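A two-file C sketch of the difference (hypothetical file names):

	/* a.c */
	static int helper(void) { return 1; }	/* visible only inside a.c */

	/* b.c */
	static int helper(void) { return 2; }	/* no clash with a.c's helper */

Drop the “static” from both definitions and linking a.o with b.o fails with a multiple-definition error, because each helper would then be exported with external linkage.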

GPU code is compiled and stored as a “data object” in the OBJECT file. That’s my understanding, so GPU functions will NOT be and canNOT be visible to the linker.

How can a linker possibly understand a GPU function? Also, note that there are NO function calls in CUDA. It’s all inlined…

Of course they are in some way visible to the linker; how else could they be linked together into the final binary? How else would the CUDA runtime find them again?

The actual GPU code of course is not in the .text section but in the .rodata section as a string. I am not sure if you can get a name collision on those; they do not seem to have a name in the symbol table.

But this case was about a __global__ function, and the host part of that is of course a normal function, which is added to the symbol table and otherwise behaves just like any other function (though it has a device_stub prefix).

Why should that matter? A linker does not understand x86 functions, or any other kind; linkers don’t even know what a function is. They have sections, symbols and relocations, and that’s basically it.

You have asked the question, and you have answered it as well!

The fact that GPU code and data are in the form of strings and do NOT find a place in the symbol table is sufficient proof that they are NOT exported outside. So, what’s the fun in marking them static?

Relocations can be architecture-specific. Do you expect your x86 linker to work for every type of object file in the world?

Yes, I said that it does not matter for the GPU code itself, which is interesting but irrelevant.

You seem not to have read the part about the host stub, though.

I think we are both on the same page. In my original post, I was talking about “static” for the GPU code. I can understand that “static” for host code makes sense. That’s why I was asking to remove it for the GPU code.

But a __global__ function is both (and there was no __device__ function in the example code)! It has a host and a device part: the device part ends up as a string, and the host part is a normal x86 function with a device_stub prefix (which means it will not collide with pure host functions, but will collide with other __global__ functions).

__global__ functions are executed only on the device, and callable only by the host!

But you mean to say that kernel-calling stubs are created separately and find a place in the symbol table. This is news to me. When I next work on Linux, I can check that out. Thanks for the info.

Ah, that is where the misunderstanding was. It never occurred to me that you might think it could be done any differently.

They probably could inline it, but at least in the case of Linux they do create real x86 functions, as running nm on the object files/binary will show.

I actually thought it weird that the strings for the actual GPU code are not named, since that means you can’t (easily) access them when using the low-level CUDA driver API.

Thanks for this info… This may open up new possibilities in my project.

However, the lack of explicit documentation on this, and the fact that the NVIDIA SDK “#includes” its kernels, make me think twice on this topic.

Anyway, thanks for the input!