I’ve been puzzling over a problem with a kernel. Here’s a simplified version that reproduces it:

```
#define vBS 16

__global__ static void test_kernel(double *X, double *H, int Ns, int w, int p) {
    int i, j;

    /* tx and ty are the thread coordinates within the sub blocks */
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    /* get the x and y coordinate of X that this thread works on */
    int x = __mul24(blockIdx.x, blockDim.x) + tx;
    int y = __mul24(blockIdx.y, blockDim.y) + ty;

    double s = 0;

    __shared__ float3 L1[vBS][vBS];
    __shared__ float3 L2[vBS][vBS];
    float4 l1, l2;

    for (i = 0; i < 1024; i++) {
        L1[tx][ty] = make_float3(0.1, 0.2, 0.3);
        L2[tx][ty] = make_float3(0.1, 0.2, 0.3);
        __syncthreads();

        double t = 0;

        /* now perform the multiplication */
        for (j = 0; j < vBS; j++) {
            t += (double)L1[j][tx].x * (double)L2[ty][j].x;
            t += (double)L1[j][tx].y * (double)L2[ty][j].y;
            t += (double)L1[j][tx].z * (double)L2[ty][j].z;
        }

        s += t;
    }

    X[x + __mul24(y, p)] = s;
}
```

If I run the loop for a small number of iterations, e.g. 1024, it works. With 4096 it fails: the NVIDIA driver crashes and the screen goes blank, and I don’t get a useful error message. However, it does run with a larger number of iterations if I comment out ‘s += t;’. I can’t see what could be wrong here; is there such a thing as double overflow?!
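For anyone who wants to reproduce this, here is roughly how the kernel can be launched with error checking on the host side. It is only a sketch, not my exact host code: the `iters` parameter, the `test_kernel_iters` and `run_test_kernel` names, and the single 16x16 block are assumptions made for illustration, and the unused kernel arguments are dropped.

```cuda
#include <cuda_runtime.h>

#define vBS 16

// Same arithmetic as the kernel in the question, but the outer trip count is
// passed in as `iters` (hypothetical parameter) so 1024 and 4096 can be compared.
__global__ static void test_kernel_iters(double *X, int p, int iters) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int x = __mul24(blockIdx.x, blockDim.x) + tx;
    int y = __mul24(blockIdx.y, blockDim.y) + ty;

    __shared__ float3 L1[vBS][vBS];
    __shared__ float3 L2[vBS][vBS];

    double s = 0;
    for (int i = 0; i < iters; i++) {
        L1[tx][ty] = make_float3(0.1, 0.2, 0.3);
        L2[tx][ty] = make_float3(0.1, 0.2, 0.3);
        __syncthreads();

        double t = 0;
        for (int j = 0; j < vBS; j++) {
            t += (double)L1[j][tx].x * (double)L2[ty][j].x;
            t += (double)L1[j][tx].y * (double)L2[ty][j].y;
            t += (double)L1[j][tx].z * (double)L2[ty][j].z;
        }
        // As in the question, there is no second barrier here; every thread
        // writes the same constants, so the result does not depend on it.
        s += t;
    }
    X[x + __mul24(y, p)] = s;
}

// Hypothetical helper: launches one vBS x vBS block and returns the first
// CUDA error encountered. On success, *out receives element 0 of the result.
static cudaError_t run_test_kernel(int iters, double *out) {
    double *dX = NULL;
    cudaError_t err = cudaMalloc((void **)&dX, vBS * vBS * sizeof(double));
    if (err != cudaSuccess) return err;

    dim3 block(vBS, vBS);
    test_kernel_iters<<<1, block>>>(dX, vBS, iters);

    err = cudaGetLastError();            // errors detected at launch time
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();   // errors raised while the kernel ran
    if (err == cudaSuccess)
        err = cudaMemcpy(out, dX, sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(dX);
    return err;
}
```

Calling `run_test_kernel(1024, &v)` and `run_test_kernel(4096, &v)` and passing the returned code to `cudaGetErrorString` should at least give an error string to go on, provided the driver survives long enough to report one. Each outer iteration adds about 16 × (0.01 + 0.04 + 0.09) = 2.24 to `s`, so a successful run should return roughly 2.24 × `iters`.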

Hope someone can help me!