Intermittent killer kernel: a kernel which causes CUDA to die, followed by launch failures

Cudesnick,

Please do post your updates. It will be very helpful!

Thanks!

Best Regards,

Sarnath

It was suggested to me to use cudaThreadExit() to “clean up” after the ULF. Sadly, this does not
seem to work. The CUDA environment is still trashed and all subsequent kernel calls fail.
If anybody has an alternative method to recover from a ULF caused
by “intermittent killer kernel” type failures that doesn’t require exiting the calling program,
I would love to hear about it. It would be great to be able to at least recover in an orderly manner.
Best regards

cellophane man

Hi cellophane man,

Are you also using “constant” memory? And is your ULF related to constant memory?

No, I am not using constant memory.

Please see the code fragments at the beginning of this thread.

cellophane man

So, here’s my update.

I still don’t know what the primary cause of the ULF is. As I mentioned earlier, sometimes the ULF changes into some kind of incorrect execution of the kernel. In that latter case the kernel code returns prematurely due to the extensive error checking that I have scattered all over the kernel. I know how to switch between the two (by adding/removing very innocent code, such as dumping some registers to device RAM), but I cannot interpret this change of behavior in any reasonable way. Let me reiterate that I observe both the ULF and the premature return very rarely; they occur once in tens of thousands of kernel launches, each launch involving several thousand blocks.
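To make it concrete, the error checking I have in mind follows roughly this pattern (a simplified sketch; the checked invariant and the errorFlag buffer are made-up placeholders, not my real kernel): each thread verifies an invariant and, if it fails, records a code in device memory and returns early.

/* Simplified sketch of in-kernel error checking; the invariant and the
   errorFlag buffer are placeholders, not the real kernel. */
__global__ void checkedKernel(const float *data, float *out, int n, int *errorFlag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    /* Example invariant: the input value must be finite. */
    if (!isfinite(data[i]))
    {
        *errorFlag = 1;   /* remember that something went wrong */
        return;           /* this is the premature return mentioned above */
    }

    out[i] = data[i] * 2.0f;
}

On the host I copy errorFlag back after cudaThreadSynchronize() and treat a non-zero value as the premature-return case.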

So, when the crash is a ULF, the use of cudaThreadExit() didn’t help me resolve the issue. This is in line with what cellophane man reported in this thread a couple of posts ago.

However, when I observe the premature return from the kernel, invoking cudaThreadExit() and then fully restarting the same CUDA task (including the startup initialization), with exactly the same parameters that caused the premature return, resolves my issue: the restarted task completes successfully. If I remember correctly (NOT sure), relaunching the kernel without invoking cudaThreadExit() and fully reinitializing the device memory didn’t help in this case.
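In outline, my recovery path looks roughly like the sketch below; initTask() and runTask() are placeholders for my own initialization and launch code, not real functions:

#include <stdio.h>
#include <cuda_runtime.h>

/* Placeholders for my real code, shown only to illustrate the retry structure. */
cudaError_t initTask(void);   /* allocations, uploads, constants, etc.              */
cudaError_t runTask(void);    /* kernel launches plus the device-side error checks  */

int main()
{
    if (initTask() != cudaSuccess)
        return 1;

    if (runTask() != cudaSuccess)
    {
        /* Premature-return case: tear the context down and redo everything
           with exactly the same parameters. */
        cudaThreadExit();
        if (initTask() != cudaSuccess)
            return 1;
        if (runTask() != cudaSuccess)
            printf("retry failed as well\n");
    }

    cudaThreadExit();
    return 0;
}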

Hi,

One of my kernels takes more than 5 seconds to complete, and I get the CUDA error “the launch timed out and was terminated”. After that, subsequent calls to cudaMalloc() fail even for small memory requests.

I tried to use cudaThreadExit() to clean up the device before proceeding further, but it doesn’t work and cudaMalloc() still fails.
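As an aside, I believe one can confirm that the display watchdog is what terminates the kernel by querying the device properties (a small sketch; I have not verified whether the kernelExecTimeoutEnabled field exists in this toolkit version, so please double-check):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    /* kernelExecTimeoutEnabled is non-zero when the display watchdog
       applies to this GPU (long-running kernels get killed). */
    printf("%s: watchdog %s\n", prop.name,
           prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    return 0;
}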

The following is the sample kernel with which I reproduced the “launch timed out” problem, and the main function where I use cudaThreadExit():

#include <stdio.h>
#include <cuda_runtime.h>

/* Kernel that deliberately never returns, to trigger the display watchdog
   ("the launch timed out and was terminated"). */
__global__ void myKernel()
{
    float result = 2.0f;
    float A = 565.55f;

    while (1)
    {
        result = result * A;
    }
}

int main()
{
    dim3 grid, blk;
    cudaError_t err;
    float *gpuInput = NULL;

    grid.x = 1;
    blk.x = 32;

    myKernel<<<grid, blk>>>();

    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
    {
        printf("cudaThreadSynchronize() failed\n");
    }
    //checkCUDAError("In Kernel:");

    cudaThreadExit();

    /* First attempt to allocate after the timed-out launch. */
    err = cudaMalloc((void **)&gpuInput, 10 * sizeof(float));
    if (err != cudaSuccess)
    {
        printf("main 1: cudaMalloc failed\n");
    }
    if (gpuInput)
    {
        cudaFree(gpuInput);
    }
    cudaThreadExit();

    /* Second attempt, after another cudaThreadExit(). */
    err = cudaMalloc((void **)&gpuInput, 10 * sizeof(float));
    if (err != cudaSuccess)
    {
        printf("main 2: cudaMalloc failed\n");
    }
    if (gpuInput)
    {
        cudaFree(gpuInput);
    }
    cudaThreadExit();

    return 0;
}

Any suggestions on how I can clean up the device?

I use a GeForce 8800 GTX and CUDA 2.1.

Thanks!

Sadhana

Just upgrading to CUDA 2.2 solved my problem with the subsequent calls to cudaMalloc().
cudaThreadExit() is now working fine, and it is necessary for the clean-up.

Thanks,
Sadhana

I tried out the CUDA 2.2 toolkit and SDK with the new driver 185.85 and found that cudaThreadExit() still does not permit recovery
from the subsequent ULFs in my example given at the beginning of this thread. There is still something that gets trashed that cudaThreadExit() does not clean up. Does anybody have any other ideas?

cellophane man

My observation in CUDA 2.2 is that if we don’t free the previously allocated memory (allocated before the timed-out kernel) before calling cudaThreadExit(), then cudaThreadExit() doesn’t clean up and subsequent cudaMalloc()s fail, no matter how small the memory request.

In the following modified sample (please look at the marked comment), if I comment out the cudaFree() I have marked, cudaThreadExit() doesn’t seem to work, as the subsequent cudaMalloc()s fail.

If I don’t comment it out, cudaThreadExit() works fine and the subsequent cudaMalloc()s succeed.

If anybody has any idea why this is so, please share.

/* Same includes and myKernel() as in the earlier sample; checkCUDAError()
   is the small error-reporting helper used earlier in this thread. */
int main()
{
    dim3 grid, blk;
    cudaError_t err;
    float *input = NULL;
    float *gpuInput = NULL;

    grid.x = 1;
    blk.x = 32;

    /* Allocation made BEFORE the timed-out kernel. */
    err = cudaMalloc((void **)&input, 10 * sizeof(float));
    if (input == NULL || err != cudaSuccess)
    {
        printf("first cudaMalloc failed\n");
    }
    else
    {
        printf("First cudaMalloc succeeded\n");
    }

    myKernel<<<grid, blk>>>();

    err = cudaThreadSynchronize();
    printf("cudaThreadSync err = %d\n", err);
    if (err != cudaSuccess)
    {
        printf("cudaThreadSynchronize() failed\n");
    }
    checkCUDAError("In Kernel:");

    if (input)    /* If I comment out this "cudaFree", the subsequent calls to "cudaMalloc" after "cudaThreadExit" fail */
    {
        cudaFree(input);
    }

    cudaThreadExit();

    err = cudaMalloc((void **)&gpuInput, 10 * sizeof(float));
    printf("cudaMalloc err = %d\n", err);
    if (err != cudaSuccess)
    {
        printf("2nd cudaMalloc failed\n");
    }
    else
    {
        printf("2nd cudaMalloc succeeded\n");
    }
    if (gpuInput)
    {
        cudaFree(gpuInput);
    }

    err = cudaMalloc((void **)&input, 100000 * sizeof(float));
    if (err != cudaSuccess)
    {
        printf("3rd cudaMalloc failed\n");
    }
    else
    {
        printf("3rd cudaMalloc succeeded\n");
    }
    if (input)
    {
        cudaFree(input);
    }

    cudaThreadExit();

    return 0;
}

Thanks,

Sadhana

Just adding information to my above post…

Whether I call cudaFree() before or after launching the kernel doesn’t matter, but I do have to call cudaFree() before calling cudaThreadExit() to make it work.
Please look at my sample code in the post above.
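Condensed from the sample in my previous post, the order that works for me is roughly:

cudaFree(input);       /* 1. free everything that was allocated before the timed-out kernel */
cudaThreadExit();      /* 2. only then tear down the context */
err = cudaMalloc((void **)&input, 10 * sizeof(float));   /* 3. later allocations now succeed */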

Thanks
Sadhana

Are there any updates / advice / workarounds on this issue?

It seems I am having a similar problem with a Tesla C1060 / CUDA 2.2 / 185.85 / XP64:

MisterAnderson42’s timeout_test3 exits after only 7k to 15k iterations.
Cellophane man’s test: ULF after 2k to 10k iterations (with the default “bad” 448 xSize).

I am a little worried because:

1. MisterAnderson42 seems to indicate this was only occurring on older hardware (9800GX2) … and working fine on the S1070 (is that right, or did I misread?).
2. I only recently ran into this ULF problem on this particular C1060 (via a failure from one of my kernels that I sometimes use for testing) … and I can’t recall it failing before (I have been using it for almost a year).
Could this be somehow hardware related? (I first thought the GPU was getting too hot and began monitoring the temperature with GPU-Z … but it seems OK, with a maximum of 74 Celsius.)

I have another C1060 (CUDA 2.2 / 182.50 / XP64) and both tests run fine to the end (the GPU did hit 80 degrees C, though).

I’ll check with older (resp. newer) drivers when the machine(s) are accessible again … but I’d definitely appreciate any feedback / hint / diagnosis tool (?) in the meantime.

Thanks.

You did not misread. I have been running rock-solid stable on Tesla S1070 and GTX 285. I don’t have a GTX 275 to test, but I suspect that it also does not have the problem.

Interesting that the problem shows up on one C1060 but not the other… Sorry, I don’t have any real suggestions for you. I have definitely seen temperature issues cause random ULFs, but you already checked that.

If he’s had it for almost a year, it’s a preproduction C1060 and therefore is not a reliable indicator of anything. (preproduction stuff is preproduction for a reason)

I had ULF trouble similar to that described by MisterAnderson42; my problem is described here.

Is there any way to check if a card is pre-production (from the hardware or from a software-accessible property)?

Did you get it from somebody at NVIDIA before GT200 officially came out? If you bought it, it’s a production card.

Thanks for replying, tmurray. I got them from a former lecturer who got them from NVIDIA… which is why I am not sure. I will try running my program on some Teslas that I am sure were bought and see if they show the same issues.

Ironically, my preproduction GT200 from last year finally failed today. :(