Random ULF for simple kernel call in loop

Hi,

I’ve attached a small piece of code where a simple kernel is executed inside a loop. Running it gives me an “unspecified launch failure” (ULF) about 1 time in 4, so it fails randomly, and usually only when the X server isn’t running. When it fails, the ULF always occurs on the 2nd iteration of the loop. I must be missing something, but, for the life of me, I can’t figure out what it is. Similar SDK examples that I’ve tried, such as the reduction example, seem to work okay.

Can anyone have a quick look at this and let me know if I’m missing something obvious?

Thanks,
Stephen

simpleFailure.cu.gz (816 Bytes)

The problem seems to be the print statement inside the following loop (from the newly attached code):

    for ( int iLoop = 0; iLoop != 2; ++iLoop ) {
        cudaThreadSynchronize();
        const unsigned value = 0;
        simple_func<<< dimGrid, dimBlock, 0 >>>(value, result_d);
        CUT_CHECK_ERROR("Kernel execution failed");
        cudaThreadSynchronize();
        CUDA_SAFE_CALL( cudaMemcpy(result_h, result_d, sizeResult*sizeof(float),
                                   cudaMemcpyDeviceToHost) );
        cudaThreadSynchronize();

        // Comment this out and it works
        printf("Loop: %d Result[0]: %f\n", iLoop, result_h[0]);
    }

If I remove the print statement, I have no trouble. If I leave it in, I get the ULF about 25% of the time. I am able to reproduce these results reliably on my system: x86 Fedora 6 with the Fedora 7 CUDA 1.1 toolkit and an “older” 8800 GTS (640 MB). I get exactly the same behaviour using the RHEL5 CUDA 2.0 beta toolkit.

In this simplified example, the ULF only presents itself outside of the X-environment. However, I do have a much larger program which does fail inside X. Attempting to debug that code led me down this path. Can anyone else reproduce this error?

The attached code is the same as the previous except it is built in the manner of the SDK examples and using ‘cutil’ (compile with dbg=1 to trap the error).

Stephen

printFailure.tar.gz (1.74 KB)

Couldn’t duplicate the problem in more than 60 trials per machine:

2.6.18-8.el5, x86_64, CUDA 1.1, NVIDIA driver 169.12, C870, Xorg running

2.6.23.1-42.fc8, x86_64, CUDA 1.1, NVIDIA driver 169.12, FX 5600, Xorg not running

For some reason, I’ve had trouble with the CUT_CHECK_ERROR() macro from cutil.h. I find that if I write something similar or identical to the macro defined in cutil.h, I’ll get better results. Probably unrelated to your issues, though.
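For anyone who wants to try that, here is a minimal sketch of a hand-written check. The macro name CHECK_CUDA and the stand-in kernel demo_kernel are my own inventions, not part of cutil or the attached code, and cudaDeviceSynchronize() is the newer replacement for cudaThreadSynchronize(), so this assumes a more recent toolkit than the ones in this thread. The point is only to check both the launch and the completed execution explicitly, in every build mode.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from cutil.h): print the error and abort.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in kernel (assumed; the real simple_func is in the attachment).
__global__ void demo_kernel(unsigned value, float *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    result[i] = (float)value + (float)i;
}

int main()
{
    const int n = 256;          // number of result elements
    float result_h[n];
    float *result_d = NULL;

    CHECK_CUDA(cudaMalloc((void **)&result_d, n * sizeof(float)));

    for (int iLoop = 0; iLoop != 2; ++iLoop) {
        demo_kernel<<<n / 64, 64>>>(0u, result_d);
        CHECK_CUDA(cudaGetLastError());       // errors raised by the launch itself
        CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

        CHECK_CUDA(cudaMemcpy(result_h, result_d, n * sizeof(float),
                              cudaMemcpyDeviceToHost));
        printf("Loop: %d Result[0]: %f\n", iLoop, result_h[0]);
    }

    CHECK_CUDA(cudaFree(result_d));
    return 0;
}
```

Checking right after the synchronize means a failure is reported at the kernel that caused it, rather than at whichever later call happens to notice it.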

I don’t see any reason you should get an unspecified launch failure with this code.

Several days ago, I experienced ULFs when I ran my program on the GPU driving the main display: about 1 run in 4 would fail with a ULF. But if I run it on a different GPU that is not attached to a display, it works fine. Another possibility is that your GPU is getting hot. Time to get a better cooling system! :P
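If you have a second card and want to repeat that experiment, something like the sketch below should do. It takes the device index from the command line (that convention is just my assumption for the example) and selects it with cudaSetDevice() before any other CUDA work is done.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }

    // Device index from the command line; defaults to device 0.
    int dev = (argc > 1) ? atoi(argv[1]) : 0;
    if (dev < 0 || dev >= count) {
        fprintf(stderr, "Device index %d out of range (0..%d)\n", dev, count - 1);
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "Could not query device %d\n", dev);
        return 1;
    }

    // Select the device before allocating memory or launching kernels.
    cudaError_t err = cudaSetDevice(dev);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Using device %d: %s\n", dev, prop.name);
    // ... allocate, launch kernels and copy results as usual from here ...
    return 0;
}
```

Which index corresponds to the card without a display depends on the machine, so it is worth printing the names of all devices first.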

You do get this, because I/O is very slow compared with the CPU.

Thanks, I really appreciate having someone else test this.

Here’s some follow-up information. We installed RHEL5 on the system and the sample code no longer fails - both with and without the X server running. So the problem seems to be related to using a non-supported system.

However, even on RHEL5, I still get the ULF 25% of the time with a larger research code - but only when executing with the X server running. This is similar to what Mu-Chi Sung is reporting. I suspect that there must have been two issues. One was fixed by switching to RHEL5. I think there is another issue that is still outstanding.

I also seriously doubt that this is a hardware/cooling issue. If my code can get through the first iteration of the main loop, everything always works fine after that.

At least now I can run outside of X. Thanks again for everyone’s comments.

Stephen

It sounds like you’re hitting a timeout – the operating system or X Windows decides the graphics card is taking too long and kills the CUDA kernel.

This would explain why:

  1. it only happens under X Windows

  2. Mu-Chi (above) got it to go away by running on a card NOT attached to a display

  3. once you get through the first iteration, you’re fine (because they all take the same amount of time, if the first isn’t taking too long, the rest aren’t, either)
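One way to test the timeout theory is to ask the runtime whether the device has a run time limit on kernels and to time the kernel with events. The sketch below makes some assumptions: demo_kernel is a made-up stand-in for the real kernel, and the kernelExecTimeoutEnabled field of cudaDeviceProp is only available in toolkits newer than the ones mentioned in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel (assumed; substitute the kernel you want to time).
__global__ void demo_kernel(float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = 2.0f * (float)i;
}

int main()
{
    int dev = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&dev) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
        fprintf(stderr, "No usable CUDA device\n");
        return 1;
    }
    // Nonzero means kernels on this device are subject to a run time limit.
    printf("Device %d (%s): run time limit on kernels: %s\n",
           dev, prop.name, prop.kernelExecTimeoutEnabled ? "yes" : "no");

    const int n = 1 << 20;
    float *result_d = NULL;
    if (cudaMalloc((void **)&result_d, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iLoop = 0; iLoop != 2; ++iLoop) {
        cudaEventRecord(start, 0);
        demo_kernel<<<(n + 255) / 256, 256>>>(result_d, n);
        cudaEventRecord(stop, 0);

        // Wait for the kernel; a kernel killed at run time is reported here.
        cudaError_t err = cudaEventSynchronize(stop);
        if (err != cudaSuccess) {
            fprintf(stderr, "Loop %d failed: %s\n", iLoop, cudaGetErrorString(err));
            return 1;
        }

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("Loop %d: kernel took %.3f ms\n", iLoop, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(result_d);
    return 0;
}
```

If the device reports a run time limit and the measured times are anywhere near it, the timeout is a plausible cause. If the kernel finishes in a few milliseconds, as I would expect for a kernel as small as the one in the sample, the timeout is unlikely to be the whole story.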