Kernel runs perfectly when compiled for debugging, randomly crashes otherwise (debugging suggestions?)

I have a pretty straightforward array processing kernel for 3D finite difference calculations which randomly fails with an unspecified launch failure, but only when the kernel is not compiled for debugging.

In a marathon test run a couple of days ago, the test app I have been using to exercise it ran 20 times on the same input data without fault when built for debugging (8 straight hours and over 300,000 kernel launches). After recompiling the application without debugging symbols, it produced unspecified launch failures in 10 runs out of 20, never failing in the same place twice. There doesn’t seem to be any correlation between kernel execution parameters and problem size either: it will crash just as happily on runs with a single block and a few thousand array elements as on runs with hundreds of blocks and millions of elements, and it will crash at random while processing the same input data sets in a tight loop. When it doesn’t crash, the final results are correct.

And failure is pretty catastrophic. If it runs on a card with an active display, the display is usually hosed, and the driver reports hard errors in the kernel ring buffer like this:

[46389.480119] NVRM: Xid (0001:00): 13, 0003 00000000 000050c0 00000368 00000000 00000100

The behaviour is the same under both CUDA 2.2 and 2.3 on Linux (right now 64 bit Ubuntu 9.04 with 190.18 drivers). I have built the test application against ocelot and in emulation mode and run it with valgrind, and neither has ever detected any buffer overruns or memory errors in the kernel that fails. I am fast running out of ideas about where to look to solve this, given that none of the debugging tools at my disposal (cuda-gdb, ocelot and valgrind) can detect anything erroneous. The kernel itself contains almost no conditional code paths, all data lives in device memory for the entire duration of the test program, and it does absolutely nothing exotic with memory or device pointers. The test app is single threaded and just uses the runtime API. I can post code if someone wants to take a look at it, but I am open to any suggestions about where to look or what to do to try and pin down where things are going wrong.
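For completeness: the unspecified launch failure is reported by a file-and-line error check wrapped around the runtime calls and kernel launches in the test app, roughly along these lines (this is only a sketch of the pattern, the macro name is made up and it is not the actual harness):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal sketch of the error checking pattern used around every runtime
// call and kernel launch in the test app (illustrative, not the real macro).
#define CHECK_CUDA(call)                                                 \
    do {                                                                 \
        cudaError_t status = (call);                                     \
        if (status != cudaSuccess) {                                     \
            fprintf(stderr, "Cuda Error : %s %s %d\n",                   \
                    cudaGetErrorString(status), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Kernel launch errors are asynchronous, so the check happens after a
// synchronize, e.g.
//   mykernel<<<grid, block, shmem>>>(...);
//   CHECK_CUDA(cudaThreadSynchronize());
//   CHECK_CUDA(cudaGetLastError());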

Ooh, Xid 13. I would definitely like to see some code. What GPU are you running this with?

So far I have just been testing it on a 9500GT, which is all I have at hand where I am working at present. I might slightly revise my statement about it failing for any execution size: it might need something close to 2 blocks per multiprocessor to trigger failures, but that is still speculation. The kernel itself looks like this (no laughter, please):

#define MUL(a,b) __mul24(a,b)
#define GRIDOFFSETD(xpos,ypos,zpos,xdim,xydim) ((xpos) + MUL((ypos),(xdim)) + MUL((zpos),(xydim)))
#define GRIDOFFSETZINCD(gidx,zinc,xydim) ((gidx) + MUL((zinc),(xydim)))

__global__ void d2ucentral3dkernel(float *u, float *d2u, float *alpha, float h,
                                   const int NnodesX, const int NnodesY, const int NnodesZ, const int Nghosts,
                                   const int Sx, const int Sxy,
                                   const int offsetx, const int offsety)
{
    extern __shared__ float itile[];

    // Global memory grid
    int originx = MUL(blockIdx.x,blockDim.x-offsetx);
    int originy = MUL(blockIdx.y,blockDim.y-offsety);
    int gidx = originx + threadIdx.x;
    int gidy = originy + threadIdx.y;

    // Shared memory grid
    int tidxmin = ((originx > 0) ? offsetx-Nghosts-Nghosts : 0) + Nghosts;
    int tidymin = Nghosts;
    int tidxlim = blockDim.x-Nghosts-Nghosts;
    int tidylim = blockDim.y-Nghosts-Nghosts;
    int tidxmax = ((NnodesX-originx) < tidxlim) ? (NnodesX-originx) : tidxlim;
    int tidymax = ((NnodesY-originy) < tidylim) ? (NnodesY-originy) : tidylim;
    int blockDimxy = MUL(blockDim.x,blockDim.y);
    bool docalc = (threadIdx.x>=tidxmin)&&(threadIdx.x<=tidxmax)&&(threadIdx.y>=tidymin)&&(threadIdx.y<=tidymax);

    int goffs0  = GRIDOFFSETD(gidx,gidy,0,Sx,Sxy);
    int itoffs0 = GRIDOFFSETD(threadIdx.x,threadIdx.y,Nghosts,blockDim.x,blockDimxy);

    itile[itoffs0] = u[goffs0];
    itile[itoffs0+blockDimxy] = u[goffs0+Sxy];

    for(int gidz=Nghosts; gidz<(NnodesZ+Nghosts); gidz++) {
        // Here we read a z stencil width of "x-y planes" of u[] into shared memory
        int goffs = GRIDOFFSETZINCD(goffs0,gidz,Sxy);
        itile[itoffs0-blockDimxy] = itile[itoffs0];
        itile[itoffs0] = itile[itoffs0+blockDimxy];
        itile[itoffs0+blockDimxy] = u[goffs+Sxy];
        __syncthreads();

        if (docalc) {
            // 3D, second order accurate central Laplacian
            float d = alpha[goffs] / (h*h);
            d2u[goffs] = d * (  itile[itoffs0-1]          + itile[itoffs0+1]
                              + itile[itoffs0-blockDim.x] + itile[itoffs0+blockDim.x]
                              + itile[itoffs0-blockDimxy] + itile[itoffs0+blockDimxy]
                              - 6.0f*itile[itoffs0]);
        }
        __syncthreads();
    }
}

It is working on column major ordered data stored in a 1D array, and I usually launch it using 32x8x1 thread blocks. For the simplest case, Nghosts is 1, offsetx and offsety are both 0 for the first block in each direction, and then offsetx is 16 and offsety is 2 for subsequent blocks in the x and y directions. The test app is way too large and complex to post, so I will try and come up with a short, sanitized repro case, but you should be able to run it on just about any single precision data as long as the input data is padded by the equivalent of 2 elements in each direction.
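In the meantime, the launch itself looks roughly like this (a sketch only; the function name, variable names and the grid-size arithmetic here are illustrative, the real host code will be in the repro case):

// Rough sketch of how the kernel gets launched; names and grid-size
// arithmetic are illustrative only.
void launch_d2u(float *u, float *d2u, float *alpha, float h,
                int NnodesX, int NnodesY, int NnodesZ, int Nghosts,
                int Sx, int Sxy)
{
    const int offsetx = 16, offsety = 2;   // overlap used by interior blocks
    dim3 block(32, 8, 1);
    dim3 grid((NnodesX + (block.x - offsetx) - 1) / (block.x - offsetx),
              (NnodesY + (block.y - offsety) - 1) / (block.y - offsety), 1);

    // itile[] holds (2*Nghosts + 1) x-y planes of the thread block
    size_t shmem = (2 * Nghosts + 1) * block.x * block.y * sizeof(float);

    d2ucentral3dkernel<<<grid, block, shmem>>>(u, d2u, alpha, h,
                                               NnodesX, NnodesY, NnodesZ, Nghosts,
                                               Sx, Sxy, offsetx, offsety);
    cudaThreadSynchronize();               // the launch failure surfaces here
}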

Can you give me some host code as well?

Just finished the minimum case (attached with .txt appended for the forum software) and ran it until it failed (it took two tries):

avid@quadro:~/build/heat$ /opt/cuda/bin/nvcc src/fdtest.cu -o fdtest
avid@quadro:~/build/heat$ LD_LIBRARY_PATH=/opt/cuda/lib64 ./fdtest
Starting 20000 iterations
    0 iterations completed
 1000 iterations completed
 2000 iterations completed
 3000 iterations completed
 4000 iterations completed
 5000 iterations completed
 6000 iterations completed
 7000 iterations completed
 8000 iterations completed
 9000 iterations completed
10000 iterations completed
11000 iterations completed
12000 iterations completed
13000 iterations completed
14000 iterations completed
15000 iterations completed
16000 iterations completed
17000 iterations completed
18000 iterations completed
19000 iterations completed
Cuda Error : unspecified launch failure src/fdtest.cu 105
avid@quadro:~/build/heat$ dmesg | tail -n 1
[260867.457874] NVRM: Xid (0001:00): 13, 0003 00000000 000050c0 00000368 00000000 00000100

Now I am off to reboot because the display is totally hosed and I am seeing spots…

fdtest.cu.txt (4.12 KB)

Edit: updated the attachment with a few crufty bits tidied up

I managed to do some further testing, and it seems that the kernel will run successfully when compiled without debugging settings if there are fewer than about 1536 threads in flight and each thread processes fewer than about 32 elements in the z direction. Go above that, and the kernel begins aborting with unspecified launch failure errors. The more threads and the more work per thread, the higher the probability that the kernel aborts. By the time there are 16384 threads in flight, each processing 128 elements (the dimensions of the repro case above), fully 50% of test runs fail before completion.

I am struggling with this one. Any help would be greatly appreciated.

Having played around a bit more, it seems that doing this:

for(int gidz=Nghosts; gidz<(NnodesZ+Nghosts); gidz++) {
    // Here we read a z stencil width of "x-y planes" of u[] into shared memory
    int goffs = GRIDOFFSETZINCD(goffs0,gidz,Sxy);
    itile[itoffs0-blockDimxy] = u[goffs-Sxy];
    itile[itoffs0] = u[goffs];
    itile[itoffs0+blockDimxy] = u[goffs+Sxy];
    __syncthreads();

i.e. using three reads from global memory per iteration, is mostly reliable for any block size (but not all that efficient), whereas doing this:

itile[itoffs0] = u[goffs0];
itile[itoffs0+blockDimxy] = u[goffs0+Sxy];

for(int gidz=Nghosts; gidz<(NnodesZ+Nghosts); gidz++) {
    // Here we read a z stencil width of "x-y planes" of u[] into shared memory
    int goffs = GRIDOFFSETZINCD(goffs0,gidz,Sxy);
    itile[itoffs0-blockDimxy] = itile[itoffs0];
    itile[itoffs0] = itile[itoffs0+blockDimxy];
    itile[itoffs0+blockDimxy] = u[goffs+Sxy];
    __syncthreads();

i.e. using one global memory read per iteration and updating the other locations from the pre-existing data in shared memory, fails, but only when compiled with optimization turned on. Could this be some sort of compiler bug?

EDIT: OK, so I got this running on my mini-ITX Ion based home theatre PC and every version runs flawlessly… So it is specific to the 9500GT box I have. A driver or hardware problem of some kind?

It could be a heat problem, where the unoptimized version survives only because it is blocked by IO requests more of the time.

Try underclocking slightly?

Thanks for the suggestion.

I did think about that, but both nvclock and nvidia-settings in Linux say that the GPU core temperature never gets over 65C. The strange part is that my application contains kernels which are much more memory and compute intensive than this one (and at least one other which uses the same basic shared memory scheme as this one does), and it only ever fails on this one.

Well, your card never gets over 65°C where the temperature is taken…

I used to have my card slightly overclocked. No problems with that, until one day I ran some synthetic benchmark, giving the execution units an unusual load. That resulted in side effects like the screen corruption you describe. The GPU temperature as measured was still pretty normal and very much on the safe side, although the card obviously wasn’t functioning correctly anymore. Because of this, I have now stopped believing that the GPU temperature as measured tells the whole story about each and every possible “hotspot” on the GPU.

It will only take you a minute to try it out, and then we can stop speculating :)

It doesn’t seem that I can underclock the core, but reducing the memory clock by 200MHz (this is a stock frequency 800MHz DDR3 card) yielded some improvement: I got six consecutive passing runs of the repro case before it began failing again. So it seems your thermal theory might be right. I was already on the brink of tossing it, and the PC it sits in, onto the scrap heap anyway (not that I ever needed much encouragement to buy new hardware :) ). One problem is that GTX200 series cards are as rare as hen’s teeth here in Finland at the moment. Literally out of stock everywhere…

No?

With Option “Coolbits” “1”, this G98 card can be pushed both up and down between 400 and 660 (default 567), but the increments have to be smallish or they’ll get rejected by nvidia-settings.
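For reference, the clock controls only show up in nvidia-settings once Coolbits is enabled in the Device section of xorg.conf and X has been restarted, something like this (the Identifier and Driver lines are just placeholders from my own config):

Section "Device"
    Identifier  "Device0"
    Driver      "nvidia"
    Option      "Coolbits" "1"    # exposes the clock frequency controls
EndSection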