Random ULF for simple kernel call in loop

Stephen_Guzik · April 24, 2008, 6:26am

Hi,

I’ve attached a small piece of code where a simple kernel is executed inside a loop. Running it gives me an “unspecified launch failure” (ULF) about one run in four. The failure is random and usually occurs only when the X server isn’t running. When it fails, the ULF always occurs on the 2nd iteration of the loop. I must be missing something, but, for the life of me, I can’t figure out what it is. Similar SDK examples that I’ve tried, such as the reduction example, seem to work okay.

Can anyone have a quick look at this and let me know if I’m missing something obvious?

Thanks,
Stephen

simpleFailure.cu.gz (816 Bytes)

Stephen_Guzik · April 24, 2008, 11:15pm

The problem seems to be the print statement inside the following loop (from the new attached code):

    for ( int iLoop = 0; iLoop != 2; ++iLoop ) {
        cudaThreadSynchronize();
        const unsigned value = 0;
        simple_func<<< dimGrid, dimBlock, 0 >>>(value, result_d);
        CUT_CHECK_ERROR("Kernel execution failed");
        cudaThreadSynchronize();
        CUDA_SAFE_CALL( cudaMemcpy(result_h, result_d, sizeResult*sizeof(float),
                                   cudaMemcpyDeviceToHost) );
        cudaThreadSynchronize();
        // Comment this out and it works
        printf("Loop: %d Result[0]: %f\n", iLoop, result_h[0]);
    }

If I remove the print statement, I have no trouble. If I leave it in, I get the ULF about 25% of the time. I am able to reproduce these results reliably on my system: x86 Fedora 6 with the Fedora 7 CUDA 1.1 toolkit and an “older” 8800 GTS (640 MB). I get exactly the same behaviour using the RHEL5 CUDA 2.0 beta toolkit.

In this simplified example, the ULF only presents itself outside of the X-environment. However, I do have a much larger program which does fail inside X. Attempting to debug that code led me down this path. Can anyone else reproduce this error?

The attached code is the same as the previous except it is built in the manner of the SDK examples and using ‘cutil’ (compile with dbg=1 to trap the error).

Stephen

printFailure.tar.gz (1.74 KB)

bcain · April 25, 2008, 3:58am

I couldn’t reproduce the problem in more than 60 trials on each of these machines:

2.6.18-8.el5, x86_64, CUDA 1.1, NVIDIA driver 169.12, C870, Xorg running

2.6.23.1-42.fc8, x86_64, CUDA 1.1, NVIDIA driver 169.12, FX 5600, Xorg not running

For some reason, I’ve had trouble with the CUT_CHECK_ERROR() macro from cutil.h. I find that if I write something similar or identical to the macro defined in cutil.h, I’ll get better results. Probably unrelated to your issues, though.
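
For readers without cutil, a hand-rolled check along the lines described above might look like the sketch below. It is not the macro from cutil.h: `cudaErrorMessage`, `CHECK_CUDA`, and `CHECK_KERNEL` are hypothetical names introduced for illustration. It also assumes a toolkit newer than the ones in this thread, since it uses `cudaDeviceSynchronize()`, the current replacement for `cudaThreadSynchronize()`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper (not part of cutil.h): builds "file:line: message"
// for a CUDA runtime status code.
static std::string cudaErrorMessage(cudaError_t err, const char *file, int line)
{
    return std::string(file) + ":" + std::to_string(line) + ": " +
           cudaGetErrorString(err);
}

// Hypothetical macro: print a located message and exit if a runtime call fails.
#define CHECK_CUDA(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s\n",                                       \
                    cudaErrorMessage(err_, __FILE__, __LINE__).c_str());  \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Hypothetical macro: check a kernel in two steps. cudaGetLastError() reports
// errors from the launch itself; the synchronize reports errors that occur
// while the kernel runs, which is where a launch failure would surface.
#define CHECK_KERNEL()                                                    \
    do {                                                                  \
        CHECK_CUDA(cudaGetLastError());                                   \
        CHECK_CUDA(cudaDeviceSynchronize());                              \
    } while (0)
```

Placing `CHECK_KERNEL();` directly after a kernel launch, and wrapping calls such as `cudaMemcpy` in `CHECK_CUDA(...)`, ties each error to the file and line where it was detected.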

I don’t see any reason you should get an unspecified launch failure with this code.

Mu-Chi_Sung · April 26, 2008, 6:00pm

Several days ago I saw ULFs when running my program on the GPU that drives the main display: about 1 run in 4 failed with a ULF. If I run it on a different GPU that is not attached to a display, it works fine. Another possibility is that… your GPU is getting HOT!! Time to get a better cooling system!! :P

Austin · April 27, 2008, 2:50am

You would get this because I/O is very slow compared with the CPU.

Stephen_Guzik · May 1, 2008, 6:49pm

Thanks, I really appreciate having someone else test this.

Here’s some follow-up information. We installed RHEL5 on the system and the sample code no longer fails - both with and without the X server running. So the problem seems to be related to using a non-supported system.

However, even on RHEL5, I still get the ULF 25% of the time with a larger research code - but only when executing with the X server running. This is similar to what Mu-Chi Sung is reporting. I suspect that there must have been two issues. One was fixed by switching to RHEL5. I think there is another issue that is still outstanding.

I also seriously doubt that this is a hardware/cooling issue. If my code can get through the first iteration of the main loop, everything always works fine after that.

At least now I can run outside of X. Thanks again for everyone’s comments.

Stephen

redpill · May 3, 2008, 1:02pm

It sounds like you’re hitting a timeout – the operating system or X Windows decides the graphics card is taking too long and kills the CUDA kernel.

This would explain why:

- it only happens under X Windows
- Mu-Chi (above) got it to go away by running on a card NOT attached to a display
- once you get through the first iteration, you’re fine (because they all take the same amount of time, if the first isn’t taking too long, the rest aren’t, either)
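
One way to probe the timeout theory is to ask the runtime whether a kernel run-time limit applies to each GPU. The sketch below assumes a toolkit that exposes the `kernelExecTimeoutEnabled` field of `cudaDeviceProp` (older toolkits such as the ones in this thread may not). `pickDeviceWithoutTimeout` is a hypothetical helper name, and whether the flag is set depends on the driver and display configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the run-time-limit flag for every device and
// returns the index of the first device without a limit, or -1 if every
// device has one (or no device could be queried).
static int pickDeviceWithoutTimeout()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return -1;

    int chosen = -1;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            continue;  // skip devices that cannot be queried
        printf("Device %d (%s): run-time limit on kernels: %s\n",
               dev, prop.name, prop.kernelExecTimeoutEnabled ? "yes" : "no");
        if (chosen < 0 && !prop.kernelExecTimeoutEnabled)
            chosen = dev;
    }
    return chosen;
}
```

If the helper returns a non-negative index, passing it to `cudaSetDevice()` before any other CUDA work runs the kernels on a GPU that is not subject to the limit, which mirrors the workaround of using a card that is not attached to a display.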
