code that crashes unpredictably

I have a little program that reliably crashes - see below. The code is a heavily whittled-down version of my real code, but it illustrates the problem succinctly. It always crashes eventually, but after a different amount of run time on each run.
I have tried it with drivers 197.13, 197.45 and 196.21.
I have compiled it with version 3.0 of the CUDA toolkit, as well as version 2.3.
I have tried it on two different machines with a 9800 GT, as well as a third machine with some older NVIDIA (integrated) GPU (don’t remember what it was anymore).

I have tried all sorts of ways to play around with it - I can ultimately get it not to crash by allocating WAY more memory for the in and out variables than should be needed. Otherwise, I keep getting “unspecified launch failure”.
I have tried all sorts of other things as well - enough to make me suspect this is some driver/windows/cuda bug. Of course, I’d love for someone to tell me how very wrong I am! :)

The file is named “crash.cu” and is compiled using the CUDA Build Rule v3.0.14. I set it up to compile for sm_10 and sm_20 (these are the defaults for the Build Rule). The card I mainly use is a 9800 GT, which is sm_11.

#include <cuda.h>
#include <iostream>

using std::cerr;
using std::endl;

#define NUMSAMPLES 100000

void checkCUDAError(cudaError_t err,const char *msg = "")
{
if( cudaSuccess != err)
{
cerr << "Cuda error: " << msg << ": " << cudaGetErrorString(err) << endl;
exit(EXIT_FAILURE);
}
}

__global__ void kernel_r2c_chirp(float *in,float2 *out,int L)
{

int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < L)
{
	out[tid].x = in[tid];
	out[tid].y = tid;
}

}

int main(void)
{
float *in;
float2 *out;

// allocate arrays on device
checkCUDAError(cudaMalloc((void **) &in, sizeof(float) *NUMSAMPLES),"allocating in");
checkCUDAError(cudaMalloc((void **) &out, sizeof(float2)*NUMSAMPLES),"allocating out");

for (int THREADS_PER_BLOCK=16;THREADS_PER_BLOCK<=512;THREADS_PER_BLOCK=THREADS_PER_BLOCK+16)
{
	int NBLOCKS = int(ceil(double(NUMSAMPLES)/double(THREADS_PER_BLOCK)));
	for (int J=0;J<10000;J++)
	{
		kernel_r2c_chirp <<< NBLOCKS, THREADS_PER_BLOCK >>> (in,out,NUMSAMPLES);

		cudaError_t err = cudaGetLastError();
		if( cudaSuccess != err) 
		{
			cerr << "J = " << J << endl;
			cerr << "Cuda error: " << cudaGetErrorString(err) << endl;
			exit(EXIT_FAILURE);
		}
	}

	cerr << THREADS_PER_BLOCK << endl;
}

// cleanup
cudaFree(in); 
cudaFree(out);

}

Is it possible? Have you found the simplest trigger yet of the ULF bug? I’m impressed.

It has all the hallmarks.

  • A kernel that ULFs randomly but is guaranteed not to have any out-of-bounds memory writes.

  • It ULFs randomly on compute 1.1 hardware but not on a Tesla S1070.

Here is a run of your code on 9800 GTX (compute 1.1 part):

$ ./crash
16
32
48
64
J = 2168
Cuda error: unspecified launch failure

$ cuda-memcheck ./crash
========= CUDA-MEMCHECK
16
32
48
64
80
96
112
128
144
160
176
192
208
224
240
256
272
...

Note that it gives the ULF on the first run. Under cuda-memcheck no memory errors are reported, but the ULF also doesn’t happen. Interesting. I’m not patient enough to let the slow cuda-memcheck run get all the way to the 512 block size.

The same, on a Tesla S1070

$ ./crash
16
32
48
64
80
96
112
128
144
160
176
192
208
224
240
256
272
288
304
320
336
352
368
384
400
416
432
448
464
480
496
512

Search the forums for the old threads I’ve posted to on this topic if you want to see more. After 3+ years dealing with it I can only conclude that it is a hardware bug. Either tweak the kernel slightly and it goes away or use hardware that doesn’t have the bug (GTX 285, Tesla C1060 or newer).

Or, maybe we’ll get lucky this time and with your amazingly simple reproduction case NVIDIA will find the compiler bug that triggers the hardware problem? I’m not crossing my fingers.

In your code’s case, it seems to be particularly sensitive to block sizes 48 and 64. I modified the code to sweep only through block sizes that are multiples of 32, skipping 64, and upped the J count by two orders of magnitude. There were no ULFs.
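
For reference, here is a hedged sketch of that modified sweep (not my exact test code), reusing kernel_r2c_chirp, checkCUDAError, in, out and NUMSAMPLES from the crash.cu posted above:

// Sweep only multiples of 32, skip 64, and raise the inner J count
// from 10,000 to 1,000,000 ("two orders of magnitude").
for (int THREADS_PER_BLOCK=32;THREADS_PER_BLOCK<=512;THREADS_PER_BLOCK=THREADS_PER_BLOCK+32)
{
	if (THREADS_PER_BLOCK == 64)
		continue;	// skip the block size that seemed most sensitive

	int NBLOCKS = int(ceil(double(NUMSAMPLES)/double(THREADS_PER_BLOCK)));
	for (int J=0;J<1000000;J++)
	{
		kernel_r2c_chirp <<< NBLOCKS, THREADS_PER_BLOCK >>> (in,out,NUMSAMPLES);
		checkCUDAError(cudaGetLastError(),"kernel launch");
	}
	cerr << THREADS_PER_BLOCK << endl;
}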

Of course, with this bug you don’t really know whether you have effectively worked around it unless you can run for 48+ hours continuously, in my experience.

MisterAnderson -

Thanks for your comments - I didn’t realize this ULF issue was a “thing”! I will definitely look at your old posts. I suppose it is a little vindicating to hear that this may not be my fault, but on the other hand I’d love this to be properly addressed by NVIDIA.

As far as the code’s sensitivity - my original code (which is more complicated) would tend to crash at very different points than this abbreviated code does. The spirit of the code was for me to experiment with empirically choosing the number of threads per block. Of course, if code as simple as this is so problematic, it makes me quite nervous about my actual code where genuine work is getting done…

Glad to know that later pieces of hardware apparently don’t have this bug…

Eddie

When I had this on a DDR3 GT9500 I could only trigger it with code that used particular block sizes (two-dimensional blocks with x=32 rings a bell) and that was hitting close to peak global memory bandwidth with fully coalesced reads. If you did anything to make the code slower (like debugging, profiling, spilling variables to local memory, or changing the block size to add a few redundant threads), the problem went away. Ocelot or valgrind analysis on the host never found anything wrong with the memory accesses. I concluded that it was a memory controller bug of some sort (or that the hardware I had was bad). The exact same code must have run for many hundreds or thousands of hours on GT200s since then without ever recording a launch failure.

It’s a “thing” to me, at least, as my production code has flirted with this problem since it was written on an 8800 GTX with CUDA 0.8. I’ve had a lot of time to test and characterize the problem. Rest assured that this problem is exceedingly rare, even on compute 1.1 cards, and only a very small number of people have run into it with their kernels. Or at least a very small number of people have posted about it on the forums.

With such a simple kernel as yours it is hard to imagine what is going wrong, though… My own repro case is pretty complicated, involving a convoluted tex1Dfetch memory access pattern. I have found that small tweaks to the code can make the problem go away. For example, when I change:

n = d_n[thread_idx];

for (unsigned int i = 0; i < n; i++)
   {
   ... do stuff ...
   }

To

n = d_n[thread_idx];

for (unsigned int i = 0; i < MAXIMUM_POSSIBLE_N; i++)
   {
   if (i < n)
	   {
	   ... do stuff ...
	   }
   }

Then the problem goes away (for that kernel, at least).

Here is the thread I was referring to: http://forums.nvidia.com/index.php?showtopic=87803

Thanks for the thread reference - very good stuff! I think in the future I will need to avoid this compute capability…

Thank you very much eddie p and MisterAnderson42 for this thread.

I’ve been struggling with random CUDA crashes for some time now. My software, which is part of a commercial product, is more complex than the code from eddie p, and I haven’t been able to find out how to get rid of the random crash. I did make a separate test program that continuously calls my component to force the crash to occur sooner (after only a few hours instead of after many hours).

After reading the thread pointed out by MisterAnderson42, I tried to run my test on a GTX260 last Friday, and when I came back this Monday morning it was still running fine (after more than 60 hours). Before, I had only tested on several different older cards (mainly an FX580 and a 9800 GT), and it crashed after a few hours.

So now I’m reasonably confident that the crash will not occur on a 200-series card and that I’m experiencing the same problem as eddie p and MisterAnderson42.

For now I can fall back to an OpenGL implementation for our commercial product, which uses the FX580. It’s slower than my CUDA implementation, but it doesn’t crash randomly.

Could someone from Nvidia please react to this thread to confirm that this problem has been fixed in the 200-series GPUs and to indicate whether we can expect a fix for older cards anytime soon?

Just to add more confusion to the mix, SPWorley found that a failing UPS (theoretically resulting in insufficient power to the GPU) can also cause random ULFs: http://forums.nvidia.com/index.php?showtopic=166857

Playing with block size like that is not good. Try a block size of 128 threads.

What information do you have to back that up?

In my experience, tuning block size is essential to obtaining maximum performance. For example, when recently testing hoomd on a GTX 480 the very first “plug and play” run yielded a performance of 535 TPS on a standard benchmark. After benchmarking every kernel in the application at every possible block size and choosing the fastest, performance increased to 616 TPS (faster is better). That is an additional 15% performance with very little effort! I’ve seen cases where the difference between the worst and best performing block sizes was much higher: search the forums for some old threads I’ve posted to on the subject.

Interesting. Did you have to increase or decrease your block sizes to get more performance?

And you haven’t put the new cover sheet on your TPS report. Didn’t you get the memo? ;)

(ok, you’ve got to know that 1999 movie Office Space to make sense of that)

“What information do you have to back that up?”

I mean using sizes that are not multiples of 64 - some exotic numbers that, it seems, the old hardware does not handle properly. And there is really no need for them.

“(int THREADS_PER_BLOCK=16;THREADS_PER_BLOCK<=512;THREADS_PER_BLOCK=THREADS_PER_BLOCK+16”

You just need to keep a proper block size, and the bug will most likely go away.

16 threads per block is senseless; it is only a source of various bugs.
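
Something like this small sketch keeps the launch at a “proper” block size (round_up_to_64 is just an illustrative helper, not from the code above): round any requested size up to the next multiple of 64 before launching.

// Round a requested block size up to the next multiple of 64,
// e.g. 16 -> 64, 100 -> 128.
int round_up_to_64(int requested)
{
	return ((requested + 63) / 64) * 64;
}

// Used with the kernel from the first post:
// int tpb = round_up_to_64(16);			// becomes 64
// int nblocks = (NUMSAMPLES + tpb - 1) / tpb;
// kernel_r2c_chirp <<< nblocks, tpb >>> (in, out, NUMSAMPLES);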

This is an interesting problem, so I tried a number of different changes to see what might help:

  • Refactoring the float2 *out into float *outx and *outy because I thought there might be a compiler bug with float2;

  • Adding cudaMemset of the “in” and “out” arrays because I thought referencing uninitialized memory might be a problem;

  • Adding a “tid >= 0” bounds check in the kernel code because I thought tid might be negative;

  • Adding cudaThreadSynchronize() and cudaStreamSynchronize() calls after the kernel call and changing the execution configuration <<<>>> to contain a stream because according to the Programming Guide, “Kernel launches are asynchronous, so to check for asynchronous errors, the application must synchronize in-between the kernel launch and the call to cudaGetLastError().”
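
For the last item, this is roughly what the synchronize-then-check pattern looks like (a sketch reusing the kernel and checkCUDAError helper from crash.cu above, not my exact edit):

kernel_r2c_chirp <<< NBLOCKS, THREADS_PER_BLOCK >>> (in,out,NUMSAMPLES);

// cudaGetLastError() catches launch/configuration errors immediately...
checkCUDAError(cudaGetLastError(),"kernel launch");

// ...while errors during kernel execution (such as a ULF) only surface after
// synchronizing. cudaThreadSynchronize() is the CUDA 3.0-era call; later
// toolkits rename it cudaDeviceSynchronize().
checkCUDAError(cudaThreadSynchronize(),"kernel execution");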

Well, … commenting out the kernel body seems to work! Sorry!

What I did find out is that the return value is always 30, which is an undocumented error code. In this case, I would complain very loudly to Nvidia to fix the CUDA API so that this error code is explained, and so you can understand what to do differently.

BTW, I congratulate you on checking the return code and calling cudaGetLastError() after the kernel launch. Almost every example in the forums, and in the NVIDIA CUDA™ Programming Guide Version 3.0, fails to check error codes and return values. This, unfortunately, is really bad programming.

Some kernels were faster with larger blocks, some with smaller. Performance vs block size is a very bumpy function depending on the compiler, register usage, shared memory usage, memory access pattern, …

Since you are so curious, here are the tuned values on Tesla and Fermi:

_default_block_size_db['1.3'] = {'improper.harmonic': 64, 'pair.lj': 352, 'dihedral.harmonic': 256, 'angle.cgcmm': 320,
                                 'pair.cgcmm': 352, 'pair.table': 192, 'pair.slj': 352, 'pair.morse': 96, 'nlist': 192,
                                 'bond.harmonic': 352, 'pair.yukawa': 96, 'bond.fene': 224, 'angle.harmonic': 192,
                                 'pair.gauss': 96}

_default_block_size_db['2.0'] = {'improper.harmonic': 96, 'pair.lj': 352, 'dihedral.harmonic': 64, 'angle.cgcmm': 96,
                                 'pair.cgcmm': 128, 'pair.table': 160, 'pair.slj': 128, 'nlist': 128,
                                 'bond.harmonic': 416, 'pair.gauss': 320, 'bond.fene': 160, 'angle.harmonic': 96,
                                 'pair.yukawa': 256, 'pair.morse': 160}

It’s a copy of the Python dict used in the code, ’cause I’m too lazy to format it better for a forum post. It should be easy enough to read.

Very cool how flexible you kept your application regarding the block sizes.