Hi all,
We recently got the new GTX 280 devices. The performance is quite impressive (almost a factor of 2 speedup across all our programs), but most of our CUDA programs show severe instability.
We make extensive use of shared memory, as you can see from the sample kernel below. The programs run stably for millions of iterations on the 8600 GT, 8800 GTX, 9500M, and 9800 GX2, but we have serious problems when executing on the GTX 280.
After a certain number of iterations (the number varies from run to run), the X server freezes and we have to reboot the machine.
We use the most recent 64-bit Ubuntu release, the 177.13 driver, and CUDA 2.0 (the same happens with older CUDA versions). The same problem also appears under Windows XP. We have tested different GTX 280 boards and different motherboards, and the error persists.
What are we doing wrong?
__global__ void solve_xxx_kernel(float* f_global, float* u_global, float* p1_global, float* p2_global, float theta, float lambda, int pitch)
{
    // Global pixel coordinates and linear index
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int c = y*pitch + x;

    // Thread index within the block, shifted by one for the halo row/column
    int tx = threadIdx.x+1;
    int ty = threadIdx.y+1;

    // Define arrays in shared memory, with one extra row/column for the halo
    __shared__ float p1_shared[BLOCK_SIZE+1][BLOCK_SIZE+1];
    __shared__ float p2_shared[BLOCK_SIZE+1][BLOCK_SIZE+1];

    float f, u, divergence;

    // Load data into shared memory
    f = f_global[c];
    u = u_global[c];
    p1_shared[ty][tx] = p1_global[c];
    p2_shared[ty][tx] = p2_global[c];

    // Fill the halo: zero at the image border, otherwise the neighbouring pixel
    if (x == 0)
        p1_shared[ty][tx-1] = 0.0f;
    else if (tx == 1)
        p1_shared[ty][tx-1] = p1_global[c-1];
    if (y == 0)
        p2_shared[ty-1][tx] = 0.0f;
    else if (ty == 1)
        p2_shared[ty-1][tx] = p2_global[c-pitch];

    __syncthreads();

    // Compute update
    divergence = p1_shared[ty][tx]-p1_shared[ty][tx-1] +
                 p2_shared[ty][tx]-p2_shared[ty-1][tx];
    u = (1.0f-theta)*u + theta*(divergence/lambda + f);

    // Write back to global memory
    u_global[c] = u;
}
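
For completeness, here is a minimal sketch of how we drive the kernel from the host. The image size, BLOCK_SIZE value, theta/lambda values, and iteration count below are placeholders, not our real settings; input data initialization is omitted:

#include <cuda_runtime.h>
#include <cstdio>

#define BLOCK_SIZE 16   // assumed block size; the kernel above must be in the same translation unit

int main()
{
    const int width = 512, height = 512;   // assumed image size, divisible by BLOCK_SIZE
    size_t pitch_bytes;
    float *f, *u, *p1, *p2;

    // Pitched allocations so each image row starts on an aligned boundary
    cudaMallocPitch((void**)&f,  &pitch_bytes, width * sizeof(float), height);
    cudaMallocPitch((void**)&u,  &pitch_bytes, width * sizeof(float), height);
    cudaMallocPitch((void**)&p1, &pitch_bytes, width * sizeof(float), height);
    cudaMallocPitch((void**)&p2, &pitch_bytes, width * sizeof(float), height);
    int pitch = (int)(pitch_bytes / sizeof(float));   // pitch in elements, as the kernel expects

    // Zero the dual variables; f and u would be filled via cudaMemcpy2D in practice
    cudaMemset2D(p1, pitch_bytes, 0, width * sizeof(float), height);
    cudaMemset2D(p2, pitch_bytes, 0, width * sizeof(float), height);

    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(width / BLOCK_SIZE, height / BLOCK_SIZE);

    // The kernel is launched once per iteration, for millions of iterations
    for (long i = 0; i < 1000000; ++i)
        solve_xxx_kernel<<<grid, block>>>(f, u, p1, p2, 0.25f, 0.1f, pitch);

    // CUDA 2.0-era synchronization; also surfaces any launch error
    cudaError_t err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 0;
}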