kernel fails over many iterations

I’ve been having a problem with a kernel that’s been puzzling me. Here’s a simplified version that reproduces it:

#define vBS 16

__global__ static void test_kernel(double *X, double *H, int Ns, int w, int p) {
	int i, j;

	/* tx and ty are the thread coordinates within the sub blocks */
	int tx = threadIdx.x;
	int ty = threadIdx.y;

	/* get the x and y coordinate of X that this thread works on */
	int x = __mul24(blockIdx.x, blockDim.x) + tx;
	int y = __mul24(blockIdx.y, blockDim.y) + ty;

	double s = 0;

	__shared__ float3 L1[vBS][vBS];
	__shared__ float3 L2[vBS][vBS];

	float4 l1, l2;

	for(i = 0; i < 1024; i++) {
		L1[tx][ty] = make_float3(0.1, 0.2, 0.3);
		L2[tx][ty] = make_float3(0.1, 0.2, 0.3);
		__syncthreads();

		double t = 0;

		/* now perform the multiplication */
		for(j = 0; j < vBS; j++) {
			t += (double)L1[j][tx].x*(double)L2[ty][j].x;
			t += (double)L1[j][tx].y*(double)L2[ty][j].y;
			t += (double)L1[j][tx].z*(double)L2[ty][j].z;
		}

		s += t;
	}

	X[x + __mul24(y, p)] = s;
}

If I change the loop to run over a small number of iterations, e.g. 1024, it works. With 4096 it fails: the NVIDIA driver crashes and the screen goes blank, and I don’t get a useful error message. However, it will run with a larger number of iterations if I comment out ‘s += t;’. I can’t understand what could be wrong here; is there such a thing as double overflow?!

Hope someone can help me!

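You are probably triggering the watchdog timer that terminates kernels after 2 to 5 seconds to keep the GUI responsive. That would also explain why commenting out ‘s += t;’ helps: the result of the loop is then never used, so the compiler can remove the whole computation and the kernel finishes almost instantly. It is not overflow; a double that overflows just becomes infinity and does not crash anything. Either run CUDA on a dedicated GPU, or do less work per kernel invocation. The latter might also require a cudaStreamSynchronize(0) between kernels so that the watchdog is restarted after each kernel.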
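Here is a rough sketch of the "less work per kernel invocation" approach, in case it helps. It is not a drop-in fix: test_kernel_chunk, run_in_chunks and watchdog_enabled are names I made up, the unused H, Ns and w parameters are dropped, and I am assuming 16x16 thread blocks, a grid that exactly covers X, and X zeroed (e.g. with cudaMemset) before the first launch. Each launch does only part of the outer loop and accumulates into X, and the host synchronizes and checks the error code after every launch, so a timeout should show up as an error string rather than just a blank screen.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define vBS 16

/* Hypothetical chunked variant of test_kernel: performs only `iters` passes
   of the outer loop and ADDS the partial sum to X (X must start zeroed).
   Assumes blockDim is (vBS, vBS) and the grid exactly covers X. */
__global__ void test_kernel_chunk(double *X, int p, int iters) {
	int tx = threadIdx.x;
	int ty = threadIdx.y;
	int x = blockIdx.x * blockDim.x + tx;
	int y = blockIdx.y * blockDim.y + ty;

	__shared__ float3 L1[vBS][vBS];
	__shared__ float3 L2[vBS][vBS];

	double s = 0.0;
	for (int i = 0; i < iters; i++) {
		L1[tx][ty] = make_float3(0.1f, 0.2f, 0.3f);
		L2[tx][ty] = make_float3(0.1f, 0.2f, 0.3f);
		__syncthreads();

		double t = 0.0;
		for (int j = 0; j < vBS; j++) {
			t += (double)L1[j][tx].x * (double)L2[ty][j].x;
			t += (double)L1[j][tx].y * (double)L2[ty][j].y;
			t += (double)L1[j][tx].z * (double)L2[ty][j].z;
		}
		s += t;

		/* Not in the original kernel: wait until every thread has finished
		   reading before the next pass overwrites the tiles. Harmless here
		   (the tiles are constant) but needed once real data is loaded. */
		__syncthreads();
	}
	X[x + y * p] += s;
}

/* Hypothetical helper: true if the device reports a run-time limit on kernels. */
bool watchdog_enabled(int device) {
	cudaDeviceProp prop;
	if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
	return prop.kernelExecTimeoutEnabled != 0;
}

/* Hypothetical helper: runs total_iters passes as several short launches,
   synchronizing and checking for errors after each one. */
cudaError_t run_in_chunks(double *d_X, int p, dim3 grid, dim3 block,
                          int total_iters, int iters_per_launch) {
	for (int done = 0; done < total_iters; done += iters_per_launch) {
		int n = total_iters - done;
		if (n > iters_per_launch) n = iters_per_launch;

		test_kernel_chunk<<<grid, block>>>(d_X, p, n);

		cudaError_t err = cudaGetLastError();      /* bad launch configuration */
		if (err == cudaSuccess)
			err = cudaStreamSynchronize(0);        /* wait; reports execution errors */
		if (err != cudaSuccess) {
			fprintf(stderr, "launch starting at iteration %d failed: %s\n",
			        done, cudaGetErrorString(err));
			return err;
		}
	}
	return cudaSuccess;
}
```

From the host you would call something like run_in_chunks(d_X, p, grid, dim3(vBS, vBS), 4096, 256). The chunk size of 256 is only a guess and has to be tuned so that one launch stays well under the time limit on your card. If watchdog_enabled(0) returns false, the watchdog is probably not what is killing the kernel and the cause lies elsewhere.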